9  Professional Workflows: Testing and Git

9.1 Introduction

I hope you are starting to see the pieces of our reproducible workflow coming together. We now have reproducible environments and pipelines with Nix, and reproducible logic with functional programming. In the previous chapter, we explored monads as a way to make our functions even more robust by handling logging and errors in a principled way.

This brings us to two final, crucial questions: how do we prove that our functions actually do what we claim they do? And how do we manage the complexity of collaborating with others (and with AI) without breaking everything?

This chapter addresses both questions by introducing unit testing and advanced Git workflows. You will learn what unit tests are and why they are essential for reliable data analysis, how to write and run them in both R (with {testthat}) and Python (with pytest), and how to use LLMs to accelerate test writing while embracing your role as a code reviewer. We will also cover professional Git techniques like Trunk-Based Development and interactive staging, which are especially valuable when managing code generated by AI assistants.

9.2 Part 1: Unit Testing

The answer to “how do we prove it works?” is unit testing. A unit test is a piece of code whose sole job is to check that another piece of code, a “unit”, works correctly. In our functional world, the “unit” is almost always a single function. This is why we spent so much time on FP in the previous chapter. Small, pure functions are not just easy to reason about; they are incredibly easy to test.

Writing tests is your contract with your collaborators and your future self. It’s a formal promise that your function, calculate_mean_mpg(), given a specific input, will always produce a specific, correct output. It’s the safety net that catches bugs before they make it into your final analysis and the tool that gives you the confidence to refactor and improve your code without breaking it.

9.2.1 The Philosophy of a Good Unit Test

So, what should we test? Writing good tests is a skill, but it revolves around answering a few key questions about your function. For any function you write, you should have tests that cover:

  • The “Happy Path”: does the function return the expected, correct value for a typical, valid input?
  • Bad Inputs: does the function fail gracefully or throw an informative error when given garbage input (e.g., a string instead of a number, a data frame with the wrong columns)?
  • Edge Cases: how does the function handle tricky but valid inputs? For example, what happens if it receives an empty data frame, a vector with NA values, or a vector where all the numbers are the same?

Writing tests forces you to think through these scenarios, and in doing so, almost always leads you to write more robust and well-designed functions.

9.2.2 Unit Testing in Practice

Let’s imagine we’ve written a simple helper function to normalise a numeric vector (i.e., scale it to have a mean of 0 and a standard deviation of 1). We’ll save this in a file named utils.R or utils.py.

R version (utils.R):

normalize_vector <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Python version (utils.py):

import numpy as np

def normalize_vector(x):
  return (x - np.nanmean(x)) / np.nanstd(x)

Now, let’s write tests for it.

9.2.2.1 Testing in R with {testthat}

In R, the standard for unit testing is the {testthat} package. The convention is to create a tests/testthat/ directory in your project, and for a script utils.R, you would create a test file named test-utils.R.

Inside test-utils.R, we use the test_that() function to group related expectations.

# In file: tests/testthat/test-utils.R

# First, we need to load the function we want to test
source("../../utils.R")

library(testthat)

test_that("Normalization works on a simple vector (the happy path)", {
  # 1. Setup: Create input and expected output
  input_vector <- c(10, 20, 30)
  expected_output <- c(-1, 0, 1)

  # 2. Action: Run the function
  actual_output <- normalize_vector(input_vector)

  # 3. Expectation: Check if the actual output matches the expected output
  expect_equal(actual_output, expected_output)
})

test_that("Normalization handles NA values correctly", {
  input_with_na <- c(10, 20, 30, NA)
  expected_output <- c(-1, 0, 1, NA)

  actual_output <- normalize_vector(input_with_na)

  # We need to use expect_equal because it knows how to compare NAs
  expect_equal(actual_output, expected_output)
})

The expect_equal() function checks for near-exact equality. {testthat} has many other expect_*() functions, like expect_error() to check that a function fails correctly, or expect_warning() to check for warnings.

9.2.2.2 Testing in Python with pytest

In Python, the de facto standard is pytest. It’s incredibly simple and powerful. The convention is to create a tests/ directory, and your test files should be named test_*.py. Inside, you just write functions whose names start with test_ and use Python’s standard assert keyword.

# In file: tests/test_utils.py

import numpy as np
from utils import normalize_vector # Import our function

def test_normalize_vector_happy_path():
    # 1. Setup
    input_vector = np.array([10, 20, 30])
    expected_output = np.array([-1.0, 0.0, 1.0])

    # 2. Action
    actual_output = normalize_vector(input_vector)

    # 3. Expectation
    # For floating point numbers, it's better to check for "close enough"
    assert np.allclose(actual_output, expected_output)

def test_normalize_vector_with_nas():
    input_with_na = np.array([10, 20, 30, np.nan])
    expected_output = np.array([-1.0, 0.0, 1.0, np.nan])

    actual_output = normalize_vector(input_with_na)

    # `np.allclose` doesn't handle NaNs, but `np.testing.assert_allclose` does!
    np.testing.assert_allclose(actual_output, expected_output)

Because this isn’t a package (yet), but a simple project with scripts, you also need to create another file called pytest.ini at the project root, which will tell pytest where to find the tests:

[pytest]
# Discover tests in the tests/ directory
testpaths = tests/

# Default discovery patterns (test_*.py and *_test.py), plus a hyphenated
# pattern in case you mirror R's test-utils.R naming convention
python_files = test_*.py *_test.py test-*.py

# Adds the root directory to the pythonpath, without this
# it'll be impossible to import normalize_vector() from
# utils.py
pythonpath = .

To run your tests, you simply navigate to your project’s root directory in the terminal and run the command pytest. It will automatically discover and run all your tests for you. That being said, both tests will fail:

>       assert np.allclose(actual_output, expected_output)
E       assert False
E        +  where False = <function allclose at 0x7f0e81c959f0>(array([-1.22474487,  0.        ,  1.22474487]), array([-1.,  0.,  1.]))

    def test_normalize_vector_with_nas():
        input_with_na = np.array([10, 20, 30, np.nan])
        expected_output = np.array([-1.0, 0.0, 1.0, np.nan])

        actual_output = normalize_vector(input_with_na)

        # `np.allclose` doesn't handle NaNs, but `np.testing.assert_allclose` does!
>       np.testing.assert_allclose(actual_output, expected_output)
E       AssertionError:
E       Not equal to tolerance rtol=1e-07, atol=0
E
E       Mismatched elements: 2 / 4 (50%)
E       Max absolute difference among violations: 0.22474487
E       Max relative difference among violations: 0.22474487
E        ACTUAL: array([-1.224745,  0.      ,  1.224745,       nan])
E        DESIRED: array([-1.,  0.,  1., nan])

tests/test_utils.py:25: AssertionError
====================================================== short test summary info =======================================================
FAILED tests/test_utils.py::test_normalize_vector_happy_path - assert False
FAILED tests/test_utils.py::test_normalize_vector_with_nas - AssertionError:
========================================================= 2 failed in 0.15s ==========================================================

You don’t encounter this issue in R, and understanding why reveals how valuable unit tests can be. (Hint: how does the implementation of the sd() function, which computes the standard deviation, differ between R and NumPy?)
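If you want to chase the hint down, a quick comparison of NumPy’s two conventions makes the discrepancy visible (this snippet is only an illustration, not part of the test suite):

import numpy as np

x = np.array([10.0, 20.0, 30.0])

# np.std() and np.nanstd() default to ddof=0: the population standard
# deviation, which divides the sum of squared deviations by n.
print(np.std(x))          # 8.1649...

# R's sd() computes the sample standard deviation, dividing by n - 1;
# NumPy reproduces this behaviour with ddof=1.
print(np.std(x, ddof=1))  # 10.0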

9.2.3 Testing as a Design Tool

Testing can also help you design your functions, by forcing you to think about edge cases. For example, what happens if we try to normalise a vector where all the elements are the same?

Let’s write a test for this edge case first.

pytest version:

# tests/test_utils.py
def test_normalize_vector_with_zero_std():
    input_vector = np.array([5, 5, 5, 5])
    actual_output = normalize_vector(input_vector)
    # The current function will return `[nan, nan, nan, nan]`
    # Let's assert that we expect a vector of zeros instead.
    assert np.allclose(actual_output, np.array([0, 0, 0, 0]))

If we run pytest now, this test will fail. Our test has just revealed a flaw in our function’s design. This process is a core part of Test-Driven Development (TDD): write a failing, but correct, test, then write the code to make it pass.

Let’s improve our function:


import numpy as np

def normalize_vector(x):
  std_dev = np.nanstd(x)
  if std_dev == 0:
    # If std is 0, all elements are the mean. Return a vector of zeros.
    return np.zeros_like(x, dtype=float)
  return (x - np.nanmean(x)) / std_dev

Now, if we run pytest again, our new test will pass. We used testing not just to verify our code, but to actively make it more robust and thoughtful.
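Our checklist from earlier also mentioned bad inputs, something we have not tested yet. Here is a sketch of what such a test could look like; the exact exception type is an assumption, since normalize_vector() currently does no input validation at all, and a failing run here would simply tell us to add some:

# tests/test_utils.py
import numpy as np
import pytest

from utils import normalize_vector

def test_normalize_vector_rejects_non_numeric_input():
    # We expect the function to fail loudly when given strings instead of
    # numbers. NumPy itself will typically raise a TypeError when asked to
    # average strings; if it doesn't, that's a signal to add explicit
    # validation to normalize_vector().
    with pytest.raises((TypeError, ValueError)):
        normalize_vector(np.array(["a", "b", "c"]))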

9.2.4 The Modern Data Scientist’s Role: Reviewer and AI Collaborator

In the past, writing tests was often seen as a chore. Today, LLMs make this process much faster and far less tedious.

9.2.4.1 Using LLMs to Write Tests

LLMs are fantastic at writing unit tests. They are good at handling boilerplate code and thinking of edge cases. You can provide your function to an LLM and give it a prompt like this:

“Here is my Python function normalize_vector. Please write three pytest unit tests for it. Include a test for the happy path with a simple array, a test for an array containing np.nan, and a test for the edge case where all elements in the array are identical.”

The LLM will likely generate high-quality test code that is very similar to what we wrote above.

All you need is context

When using an LLM to generate tests, context is everything. If you are writing tests for functions that heavily use specific packages (like {dplyr} or pandas), providing the pkgctx output for those packages helps the LLM write idiomatic and correct tests.

For example, if you are testing a function that uses {rix}, feed the rix.pkgctx.yaml file to the LLM so it knows exactly how to mock or assert the outputs of rix() functions. This is particularly useful for packages that are not well-known (or internal packages at your company that aren’t even public), or for packages that have been updated very recently, as the LLM’s training data cutoff might not include the latest versions of the package.

This is a massive productivity boost. However, this introduces a new, critical role for the data scientist: you are the reviewer.

An LLM does not write your tests; it generates a draft. It is your professional responsibility to:

  1. Read and understand every line of the test code.
  2. Verify that the expected_output is actually correct.
  3. Confirm that the tests cover the cases you care about.
  4. Commit that code under your name, taking full ownership of it.

“A COMPUTER CAN NEVER BE HELD ACCOUNTABLE THEREFORE A COMPUTER MUST NEVER MAKE A MANAGEMENT DECISION” - IBM Training Manual, 1979.

This principle is, in my opinion, even more important for tests than for production code. Even before LLMs, we relied on code we didn’t write ourselves. At my first job, my boss insisted that I avoid external packages entirely and rewrite everything from scratch. I ignored her and continued using the packages I trusted. At some point, you have to trust strangers. That was true then, and it is true now, except that “strangers” now includes LLMs and developers who themselves rely heavily on LLMs.

But here is the key difference: tests are small. Unlike a sprawling codebase, a unit test is short enough to read and understand completely. When you review an LLM-generated test, you can verify that the expected output is correct and that the test covers the right cases. This makes tests uniquely suited to the “trust but verify” workflow that AI-assisted development demands.

In the next chapter, we will learn the very basics of packaging, which makes testing even easier.

9.3 Part 2: Advanced Version Control

I assume you already know the basics of Git: clone, add, commit, and push. If you don’t, there are countless tutorials available, and I highly recommend you check one out and then come back to this.

In this section, we will focus on the techniques that separate the “I submit code via Dropbox” novice from the professional data scientist: Trunk-Based Development and managing AI-generated code.

9.3.1 A Better Way to Collaborate: Trunk-Based Development

A common mistake for new teams is to use branches to add new features, and then let these feature branches live for a very long time. A data scientist might create a branch called feature-big-analysis, work on it for three weeks, and then try to merge it back into main. The result is often what’s called “merge hell”: main has changed so much in three weeks that merging the branch back in creates dozens of conflicts and is a painful, stressful process.

To avoid this, many professional teams use a workflow called Trunk-Based Development (TBD). The philosophy is simple but powerful:

All developers integrate their work back into the main branch (the “trunk”) as frequently as possible, at least once a day.

This means that feature branches are incredibly short-lived. Instead of a single, massive feature branch that takes weeks, you create many tiny branches that each take a few hours or a day at most.

9.3.1.1 How to Work with Short-Lived Branches

But how can you merge something back into main if the feature isn’t finished? The main branch must always be stable and runnable. You can’t merge broken code.

The first way to solve this issue is to use feature flags.

A feature flag is just a simple variable (like a TRUE/FALSE switch) that lets you turn a new, unfinished part of the code on or off. This allows you to merge the code into main while keeping it “off” until it’s ready.

Imagine you are adding a new, complex plot to analysis.R, but it will take a few days to get right.

# At the top of your analysis.R script
# --- Configuration ---
use_new_scatterplot <- FALSE # Set to FALSE while in development

# ... lots of existing, working code ...

# --- New Feature Code ---
if (use_new_scatterplot) {
  # All your new, unfinished, possibly-buggy plotting code goes here.
  # It won't run as long as the flag is FALSE.
  library(scatterplot3d)
  scatterplot3d(mtcars$mpg, mtcars$hp, mtcars$wt)
}

With this if block, you can safely merge your changes into main. The new code is there, but it won’t execute and won’t break the existing analysis. Other developers can pull your changes and won’t even notice. Once you’ve finished the feature in subsequent small commits, the final change is just to flip the switch: use_new_scatterplot <- TRUE.
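The same trick works in Python. Here is a minimal sketch of the equivalent flag; the flag name and the plotting code are purely illustrative:

# At the top of your analysis.py script
# --- Configuration ---
use_new_scatterplot = False  # Keep False while the feature is in development

# ... lots of existing, working code ...

# --- New Feature Code ---
if use_new_scatterplot:
    # All the new, unfinished, possibly-buggy plotting code goes here.
    # It won't run as long as the flag is False.
    import matplotlib.pyplot as plt

    x = [21.0, 22.8, 18.7, 14.3]  # placeholder data for illustration
    y = [110, 93, 175, 245]
    plt.scatter(x, y)
    plt.show()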

9.3.2 Collaborative Hygiene: Rebase vs. Merge

When you and a colleague both work on main, your histories can diverge. When you git pull, Git can reconcile the two histories in one of two ways:

  1. Merge (Default): creates a “merge commit” that ties the histories together. This results in a messy, non-linear history graph if done frequently.
  2. Rebase (Recommended): temporarily lifts your commits, pulls your colleague’s changes, and then replays your commits on top.

In data science projects, where identifying when a plot or number changed is critical, a linear history is valuable. I strongly recommend configuring Git to use rebase by default:

git config --global pull.rebase true

Now, when you git pull, your local changes will stay at the “tip” of the history, keeping the log clean.

9.3.3 Working with LLMs and Git: Managing AI-Generated Changes

When working with LLMs like GitHub Copilot, it’s crucial to review changes carefully before committing them. Git provides a powerful tool for this: interactive staging.

9.3.3.1 Interactive Staging: Accepting changes chunk by chunk

Git’s interactive staging feature (git add -p) is perfect for reviewing LLM changes. Instead of blindly adding all changes with git add ., git add -p lets you review each “hunk” (chunk of changes) individually.

Suppose an LLM refactored your Python script. You run:

git add -p analysis.py

Git will show you a chunk of changes and prompt you:

@@ -1,2 +1,4 @@
 # Load required libraries
 import pandas as pd
+import matplotlib.pyplot as plt
+import seaborn as sns
Stage this hunk [y,n,q,a,d,s,e,?]?

You can reply:

  • y: Yes, stage this (I approve).
  • n: No, do not stage this (I reject or want to edit later).
  • s: Split this hunk into smaller pieces (if the LLM did two unrelated things in one block).
  • e: Edit the hunk manually before staging.

This forces you to be the reviewer. It prevents “accidental” AI hallucinations (like deleting a critical import) from slipping into your repository.

9.3.3.2 Example LLM Workflow

  1. Save state: git commit -m "Working state before LLM"
  2. Prompt LLM: “Please refactor this function to be more efficient.”
  3. Review: Run git diff to see what it did.
  4. Select: Run git add -p and say y only to the parts that look correct.
  5. Clean: Run git checkout . (or git restore . on modern Git versions) to discard the unstaged changes you rejected; the hunks you staged in the previous step stay staged.
  6. Verify: Run your unit tests!

This workflow ensures you maintain full control over your codebase while benefiting from LLM assistance.

9.4 Summary

In this chapter, we covered the safety protocols of professional software engineering:

  1. Unit tests prove your code works and protect you from regressions.
  2. Git workflows (Trunk-Based Development, Rebasing) keep your collaboration clean and history linear.
  3. Code review (via Interactive Staging) is your primary defence against AI-generated bugs.