6 Building Reproducible Pipelines with rixpress
6.1 Introduction: From Scripts and Notebooks to Pipelines
So far, we have learned about reproducible development environments with Nix and {rix}. We can now create project-specific environments with precise versions of R, Python, and all dependencies. But there’s one more piece to the puzzle: orchestration.
How do we take our collection of functions and data files and run them in the correct order to produce our final data product? This problem of managing computational workflows is not new, and a whole category of build automation tools has been created to solve it.
6.1.1 The Evolution of Build Automation
The original solution, dating back to the 1970s, is make. Created by Stuart Feldman at Bell Labs in 1976, make reads a Makefile that describes the dependency graph of a project. If you change the code that generates plot.png, make is smart enough to only re-run the steps needed to rebuild the plot and the final report.
The strength of these tools is their language-agnosticism, but their weaknesses are twofold:
- File-centric: You must manually handle all I/O. Your first script saves data.csv, your second loads it. This adds boilerplate and surface area for error.
- Environment-agnostic: They track files but know nothing about the software environment needed to create those files.
This is where R’s {targets} package shines. It tracks dependencies between R objects directly, automatically handling serialisation. But {targets} operates within a single R session; for polyglot pipelines, you must manually coordinate via {reticulate}.
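For comparison, a {targets} pipeline is just a list of targets whose dependencies are inferred from the R objects each command references. A minimal sketch (the file name and its columns are hypothetical):

```r
# _targets.R: minimal sketch; data/raw.csv and its x/y columns
# are hypothetical
library(targets)

list(
  tar_target(raw_file, "data/raw.csv", format = "file"),
  tar_target(raw_data, read.csv(raw_file)),
  tar_target(model, lm(y ~ x, data = raw_data))
)
```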
6.1.2 The Separation Problem
All these tools (from make to {targets} to Airflow) separate workflow management from environment management. You use one tool to run the pipeline and another (Docker, {renv}) to set up the software.
This separation creates friction. Running {targets} inside Docker ensures reproducibility, but forces the entire pipeline into one monolithic environment. What if your Python step requires TensorFlow 2.15 but your R step needs reticulate with Python 3.9? You’re stuck.
6.1.3 The Imperative Approach: Make + Docker
To illustrate this, consider the traditional setup for a polyglot pipeline. You’d need:
- A `Dockerfile` to set up the environment
- A `Makefile` to orchestrate the workflow
- Wrapper scripts for each step
Here’s what a Makefile might look like:
```makefile
# Makefile for a Python → R pipeline
DATA_DIR = data
OUTPUT_DIR = output

.PHONY: all clean

all: $(OUTPUT_DIR)/report.html

$(OUTPUT_DIR)/predictions.csv: $(DATA_DIR)/raw.csv scripts/train_model.py
	python scripts/train_model.py $(DATA_DIR)/raw.csv $@

$(OUTPUT_DIR)/plot.png: $(OUTPUT_DIR)/predictions.csv scripts/visualise.R
	Rscript scripts/visualise.R $< $@

$(OUTPUT_DIR)/report.html: $(OUTPUT_DIR)/plot.png report.qmd
	quarto render report.qmd -o $@

clean:
	rm -rf $(OUTPUT_DIR)/*
```

This looks clean, but notice the procedural boilerplate required in each script. Your train_model.py must parse command-line arguments and handle file I/O:
```python
# scripts/train_model.py
import sys

import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def main():
    input_path = sys.argv[1]
    output_path = sys.argv[2]

    # Load data
    df = pd.read_csv(input_path)

    # ... actual ML logic ...

    # Save results
    predictions.to_csv(output_path, index=False)


if __name__ == "__main__":
    main()
```

And your visualise.R script needs the same boilerplate:
```r
# scripts/visualise.R
args <- commandArgs(trailingOnly = TRUE)
input_path <- args[1]
output_path <- args[2]

# Load data
predictions <- read.csv(input_path)

# ... actual visualisation logic ...

# Save plot
ggsave(output_path, plot)
```

The scientific logic is buried under file I/O scaffolding. And the environment? That's a separate 100+ line Dockerfile you must maintain.
6.1.4 rixpress: Unified Orchestration
This brings us to a key insight: a reproducible pipeline should be nothing more than a composition of pure functions, each with explicit inputs and outputs, no hidden state, and no reliance on execution order beyond what the data dependencies require.
{rixpress} solves this by using Nix not just as a package manager, but as the build automation engine itself. Each pipeline step is a Nix derivation: a hermetically sealed build unit.
Compare the same pipeline in {rixpress}:

```r
library(rixpress)

list(
  rxp_py_file(
    name = raw_data,
    path = "data/raw.csv",
    read_function = "lambda x: pandas.read_csv(x)"
  ),
  rxp_py(
    name = predictions,
    expr = "train_model(raw_data)",
    user_functions = "functions.py"
  ),
  rxp_py2r(
    name = predictions_r,
    expr = predictions
  ),
  rxp_r(
    name = plot,
    expr = visualise(predictions_r),
    user_functions = "functions.R"
  ),
  rxp_qmd(
    name = report,
    qmd_file = "report.qmd"
  )
) |>
  rxp_populate()
```
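The visualise() helper referenced by the plot step lives in functions.R. A minimal sketch, where the column names are illustrative assumptions:

```r
# functions.R: minimal sketch of the plotting helper; the column
# names (truth, estimate) are illustrative assumptions
library(ggplot2)

visualise <- function(predictions) {
  ggplot(predictions, aes(x = truth, y = estimate)) +
    geom_point()
}
```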
And your functions.py contains only the scientific logic:
```python
# functions.py
def train_model(df):
    # ... pure ML logic, no file I/O ...
    return predictions
```

The difference is stark:
| Aspect | Make + Docker | rixpress |
|---|---|---|
| Files needed | Dockerfile, Makefile, wrapper scripts | gen-env.R, gen-pipeline.R, function files |
| I/O handling | Manual in every script | Automatic via encoders/decoders |
| Dependencies | Explicit file rules | Inferred from object references |
| Environment | Separate Docker setup | Unified via {rix} |
| Expertise needed | Linux admin, Make syntax | R programming |
This provides two key benefits:
True Polyglot Pipelines: Each step can have its own Nix environment. A Python step runs in a pure Python environment, an R step in an R environment, a Quarto step in yet another, all within the same pipeline.
Deep Reproducibility: Each step is cached based on the cryptographic hash of all its inputs: the code, the data, and the environment. Any change in dependencies triggers a rebuild. This is reproducibility at the build level, not just the environment level.
The interface is heavily inspired by {targets}, so you get the ergonomic, object-passing feel you’re used to, combined with the bit-for-bit reproducibility of the Nix build system.
{rixpress} and ryxpress
If the {rixpress} syntax is new to you, remember that you can use pkgctx to generate LLM-ready context (as mentioned in the introduction). Both the {rixpress} (R) and ryxpress (Python) repositories include .pkgctx.yaml files you can feed to your LLM to help it understand the package’s API. You can also generate your own context files:
```bash
# For the R package
nix run github:b-rodrigues/pkgctx -- r github:b-rodrigues/rixpress > rixpress.pkgctx.yaml

# For the Python package
nix run github:b-rodrigues/pkgctx -- python ryxpress > ryxpress.pkgctx.yaml
```

With this context, your LLM can help you write correct pipeline definitions, even if the syntax is completely new to you. You can generate such context files for any package hosted on CRAN or GitHub, or from a local .tar.gz file.
6.2 What is rixpress?
{rixpress} streamlines the creation of micropipelines (small-to-medium, single-machine analytic pipelines): you express the pipeline in idiomatic R while delegating build orchestration to the Nix build system.
Key features:
- Define pipeline derivations with concise `rxp_*()` helper functions
- Seamlessly mix R, Python, Julia, and Quarto steps
- Reuse hermetic environments defined via {rix} and a `default.nix`
- Visualise and inspect the DAG; selectively read, load, or copy outputs
- Automatic caching: only rebuild what changed
Here is what a basic pipeline looks like:
```r
library(rixpress)

list(
  rxp_r_file(
    mtcars,
    'mtcars.csv',
    \(x) read.csv(file = x, sep = "|")
  ),
  rxp_r(
    mtcars_am,
    filter(mtcars, am == 1)
  ),
  rxp_r(
    mtcars_head,
    head(mtcars_am)
  ),
  rxp_qmd(
    page,
    "page.qmd"
  )
) |>
  rxp_populate()
```

6.3 Getting Started
6.3.1 Initialising a project
If you’re starting fresh, you can bootstrap a project using a temporary shell:
```bash
nix-shell --expr "$(curl -sl https://raw.githubusercontent.com/ropensci/rix/main/inst/extdata/default.nix)"
```

Once inside, start R and run:
```r
rixpress::rxp_init()
```

This creates two essential files:

- `gen-env.R`: where you define your environment with {rix}
- `gen-pipeline.R`: where you define your pipeline with {rixpress}
6.3.2 Defining the environment
Open gen-env.R and define the tools your pipeline needs:
```r
library(rix)

rix(
  date = "2025-10-14",
  r_pkgs = c("dplyr", "ggplot2", "quarto", "rixpress"),
  ide = "none",
  project_path = ".",
  overwrite = TRUE
)
```

Run this script to generate default.nix, then build and enter your environment:
```bash
nix-build
nix-shell
```

6.3.3 Defining the pipeline
Open gen-pipeline.R and define your pipeline:
```r
library(rixpress)

list(
  rxp_r_file(
    name = mtcars,
    path = "data/mtcars.csv",
    read_function = \(x) read.csv(x, sep = "|")
  ),
  rxp_r(
    name = mtcars_am,
    expr = dplyr::filter(mtcars, am == 1)
  ),
  rxp_r(
    name = mtcars_head,
    expr = head(mtcars_am)
  )
) |>
  rxp_populate()
```

Running rxp_populate() generates a pipeline.nix file and builds the entire pipeline.
6.4 Core Functions
6.4.1 Defining derivations
{rixpress} provides several functions to define pipeline steps:
| Function | Purpose |
|---|---|
| `rxp_r()` | Run R code |
| `rxp_r_file()` | Read a file using R |
| `rxp_py()` | Run Python code |
| `rxp_py_file()` | Read a file using Python |
| `rxp_qmd()` | Render a Quarto document |
| `rxp_py2r()` | Convert Python object to R |
| `rxp_r2py()` | Convert R object to Python |
6.4.2 Building the pipeline
```r
# Generate pipeline.nix only (don't build)
rxp_populate(build = FALSE)

# Build the pipeline
rxp_make()
```

6.4.3 Inspecting outputs
Because outputs live in /nix/store/, {rixpress} provides helpers:
```r
# List all built artifacts
rxp_inspect()

# Read an artifact into R
result <- rxp_read("mtcars_head")

# Load an artifact into the global environment
rxp_load("mtcars_head")

# Copy an output file to the current directory
rxp_copy("page")
```

6.4.4 Visualising the pipeline
```r
# Static DAG plot
rxp_ggdag()

# Interactive network
rxp_visnetwork()

# Text-based trace
rxp_trace()
```

6.5 Polyglot Pipelines
One of {rixpress}’s strengths is seamlessly mixing languages. Here’s a pipeline that reads data with Python’s polars, processes it with R’s dplyr, and renders a Quarto report.
Historically, using multiple languages in one project meant significant setup overhead: installing interpreters, managing conflicting dependencies, writing glue code. With Nix, that cost drops to near zero. You declare your R and Python dependencies in one file, and Nix handles the rest.
LLMs lower the barrier further. Even if you are primarily an R programmer, you can ask an LLM to generate the Python code for a specific step, or vice versa. You don’t need to master both languages; you just need to know enough to recognise when each shines. Use R for statistics, Bayesian modelling, and visualisation with {ggplot2}. Use Python for deep learning, web scraping, or leveraging a library that only exists in the Python ecosystem. With Nix handling environments and LLMs helping with syntax, the “cost” of crossing language boundaries becomes negligible.
```r
library(rixpress)

list(
  rxp_py_file(
    name = mtcars_pl,
    path = "data/mtcars.csv",
    read_function = "lambda x: polars.read_csv(x, separator='|')"
  ),
  rxp_py(
    name = mtcars_filtered,
    expr = "mtcars_pl.filter(polars.col('am') == 1).to_pandas()"
  ),
  rxp_py2r(
    name = mtcars_r,
    expr = mtcars_filtered
  ),
  rxp_r(
    name = mtcars_head,
    expr = head(mtcars_r)
  ),
  rxp_qmd(
    name = report,
    qmd_file = "report.qmd"
  )
) |>
  rxp_populate()
```

6.5.1 Method 1: Using language converters
The rxp_py2r() and rxp_r2py() functions use {reticulate} to convert objects between languages:
```r
rxp_py2r(
  name = mtcars_r,
  expr = mtcars_py
)
```

6.5.2 Method 2: Using universal data formats
For more control, use the encoder and decoder arguments to serialise to a universal format such as JSON:
```r
# Python step: serialise to JSON
rxp_py(
  name = mtcars_json,
  expr = "mtcars_pl.filter(polars.col('am') == 1)",
  user_functions = "functions.py",
  encoder = "serialize_to_json"
),

# R step: deserialise from JSON
rxp_r(
  name = mtcars_head,
  expr = my_head(mtcars_json),
  user_functions = "functions.R",
  decoder = "jsonlite::fromJSON"
)
```

This approach makes your pipeline more modular: any language that can read JSON could be added in the future.
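The serialize_to_json encoder is a user-supplied function in functions.py. A plausible sketch, assuming the encoder is called with the object to save and the path it must write to:

```python
# functions.py: plausible sketch; assumes the encoder receives the
# object and the output path to write to
def serialize_to_json(df, path):
    # polars DataFrames can write themselves out as JSON
    df.write_json(path)
```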
6.6 Caching and Incremental Builds
One of the most powerful features of using Nix for pipelines is automatic caching. Because Nix tracks all inputs to each derivation, it knows exactly what needs to be rebuilt when something changes.
Try this:
- Build your pipeline with `rxp_make()`
- Change one step in your pipeline
- Run `rxp_make()` again
Nix will detect that unchanged steps are already cached and instantly reuse them. It only rebuilds the steps affected by your change.
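For instance, with the pipeline from Section 6.3.3, editing only the final step leaves everything upstream untouched:

```r
# In gen-pipeline.R, change only the last derivation...
rxp_r(
  name = mtcars_head,
  expr = head(mtcars_am, n = 10)  # was: head(mtcars_am)
)

# ...and rebuild: mtcars and mtcars_am come straight from the cache,
# and only mtcars_head is rebuilt
rxp_make()
```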
6.7 Build Logs and Debugging
Every time you run rxp_populate(), a timestamped log is saved in the _rixpress/ directory. This is like having a Git history for your pipeline’s outputs.
```r
# List all past builds
rxp_list_logs()

# Load an artifact from the current build
new_result <- rxp_read("mtcars_head")

# Load the same artifact from a previous build
old_result <- rxp_read("mtcars_head", which_log = "z9y8x")

# Compare them
identical(new_result, old_result)
```

This is incredibly powerful for debugging and validation. You can go back in time to inspect any output from any previous pipeline run.
6.8 ryxpress: The Python Interface
If you prefer working in Python, ryxpress provides the same functionality. You still define your pipeline in R (since that’s where {rixpress} runs), but you can build and inspect artifacts from Python.
To set up an environment with ryxpress:
```r
library(rix)

rix(
  date = "2025-10-14",
  r_pkgs = c("rixpress"),
  py_conf = list(
    py_version = "3.13",
    py_pkgs = c("ryxpress", "rds2py", "biocframe", "pandas")
  ),
  ide = "none",
  project_path = ".",
  overwrite = TRUE
)
```

Then from Python:
```python
from ryxpress import rxp_make, rxp_inspect, rxp_load

# Build the pipeline
rxp_make()

# Inspect artifacts
rxp_inspect()

# Load an artifact
rxp_load("mtcars_head")
```

ryxpress handles the conversion automatically:
- Tries `pickle.load` first
- Falls back to `rds2py` for R objects
- Returns file paths for complex outputs
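The fallback chain is easy to picture. The sketch below is illustrative rather than ryxpress's actual source, and the rds2py import path is an assumption:

```python
# Illustrative sketch of the loading fallback; not ryxpress's
# actual implementation
import pickle


def load_artifact(path):
    try:
        with open(path, "rb") as f:
            return pickle.load(f)  # Python-native artifacts
    except Exception:
        try:
            from rds2py import read_rds  # assumed import path
            return read_rds(path)  # R .rds artifacts
        except Exception:
            return path  # complex outputs: hand back the file path
```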
6.9 Running Someone Else’s Pipeline
The ultimate test of reproducibility: can someone else run your pipeline?
With a Nix-based workflow, they need only:
- `git clone` your repository
- Run `nix-build && nix-shell`
- Run `source("gen-pipeline.R")` or `rxp_make()`
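In a single shell session (the repository URL is hypothetical), that looks like:

```bash
# Clone, build the environment, and run the pipeline
git clone https://github.com/alice/analysis.git && cd analysis
nix-build
nix-shell --run "Rscript -e 'rixpress::rxp_make()'"
```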
That’s it. Nix reads your default.nix and pipeline.nix files and builds the exact same environment and data product, bit-for-bit.
6.10 Exporting and Importing Artifacts
For CI/CD or sharing between machines:
```r
# Export build products to a tarball
rxp_export_artifacts()

# Import on another machine before building
rxp_import_artifacts()
```

This speeds up continuous integration by avoiding unnecessary rebuilds.
6.11 Real-World Examples
The rixpress_demos repository contains many complete examples. Here are a few patterns worth studying.
6.11.1 Example 1: Machine Learning with XGBoost
This pipeline trains an XGBoost classifier in Python, then passes predictions to R for evaluation with {yardstick}:
```r
library(rixpress)

list(
  # Load data as a NumPy array
  rxp_py_file(
    name = dataset_np,
    path = "data/pima-indians-diabetes.csv",
    read_function = "lambda x: loadtxt(x, delimiter=',')"
  ),

  # Split features and target
  rxp_py(name = X, expr = "dataset_np[:,0:8]"),
  rxp_py(name = Y, expr = "dataset_np[:,8]"),

  # Train/test split
  rxp_py(
    name = splits,
    expr = "train_test_split(X, Y, test_size=0.33, random_state=7)"
  ),

  # Extract splits
  rxp_py(name = X_train, expr = "splits[0]"),
  rxp_py(name = X_test, expr = "splits[1]"),
  rxp_py(name = y_train, expr = "splits[2]"),
  rxp_py(name = y_test, expr = "splits[3]"),

  # Train the XGBoost model
  rxp_py(
    name = model,
    expr = "XGBClassifier(use_label_encoder=False, eval_metric='logloss').fit(X_train, y_train)"
  ),

  # Make predictions
  rxp_py(name = y_pred, expr = "model.predict(X_test)"),

  # Export predictions to CSV for R
  rxp_py(
    name = combined_df,
    expr = "DataFrame({'truth': y_test, 'estimate': y_pred})"
  ),
  rxp_py(
    name = combined_csv,
    expr = "combined_df",
    user_functions = "functions.py",
    encoder = "write_to_csv"
  ),

  # Compute the confusion matrix in R
  rxp_r(
    name = combined_factor,
    expr = mutate(combined_csv, across(everything(), factor)),
    decoder = "read.csv"
  ),
  rxp_r(
    name = confusion_matrix,
    expr = yardstick::conf_mat(combined_factor, truth, estimate)
  )
) |>
  rxp_populate(build = FALSE)

# Adjust Python imports
adjust_import("import numpy", "from numpy import array, loadtxt")
adjust_import("import xgboost", "from xgboost import XGBClassifier")
adjust_import("import sklearn", "from sklearn.model_selection import train_test_split")
add_import("from pandas import DataFrame", "default.nix")

rxp_make()
```

This demonstrates:
- Python-heavy computation with XGBoost
- Custom serialisation via `encoder`/`decoder` (the encoder is sketched below)
- Adjusting Python imports with `adjust_import()` and `add_import()`
- Passing results to R for evaluation
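The write_to_csv encoder referenced in the pipeline lives in functions.py. A plausible definition, again assuming the encoder receives the object and an output path:

```python
# functions.py: plausible sketch; assumes encoders are called with
# the object and the path to write to
def write_to_csv(df, path):
    df.to_csv(path, index=False)
```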
6.11.2 Example 2: Reading Many Input Files
When you have multiple CSV files in a directory:
```r
library(rixpress)

list(
  # R approach: read all files at once
  rxp_r_file(
    name = mtcars_r,
    path = "data",
    read_function = \(x) {
      readr::read_delim(list.files(x, full.names = TRUE), delim = "|")
    }
  ),

  # Python approach: custom function
  rxp_py_file(
    name = mtcars_py,
    path = "data",
    read_function = "read_many_csvs",
    user_functions = "functions.py"
  ),

  rxp_py(
    name = head_mtcars,
    expr = "mtcars_py.head()"
  )
) |>
  rxp_populate()
```

The key insight: rxp_r_file() and rxp_py_file() can point to a directory, and your read_function handles the logic.
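The read_many_csvs helper would live in functions.py. A plausible sketch, assuming pipe-delimited files to mirror the R branch above:

```python
# functions.py: plausible sketch of read_many_csvs; assumes
# pipe-delimited CSVs, mirroring the R branch
import glob

import pandas as pd


def read_many_csvs(path):
    # Concatenate every CSV in the directory into one DataFrame
    files = sorted(glob.glob(f"{path}/*.csv"))
    return pd.concat(
        (pd.read_csv(f, sep="|") for f in files),
        ignore_index=True
    )
```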
6.11.3 Example 3: Full Python→R→Quarto Workflow
A complete pipeline that bounces data between languages and renders a report:
```r
library(rixpress)

list(
  # Read with Python polars
  rxp_py_file(
    name = mtcars_pl,
    path = "data/mtcars.csv",
    read_function = "lambda x: polars.read_csv(x, separator='|')"
  ),

  # Filter in Python, convert to pandas for reticulate
  rxp_py(
    name = mtcars_pl_am,
    expr = "mtcars_pl.filter(polars.col('am') == 1).to_pandas()"
  ),

  # Convert to R
  rxp_py2r(name = mtcars_am, expr = mtcars_pl_am),

  # Process in R
  rxp_r(
    name = mtcars_head,
    expr = my_head(mtcars_am),
    user_functions = "functions.R"
  ),

  # Back to Python
  rxp_r2py(name = mtcars_head_py, expr = mtcars_head),

  # More Python processing
  rxp_py(name = mtcars_tail_py, expr = "mtcars_head_py.tail()"),

  # Back to R
  rxp_py2r(name = mtcars_tail, expr = mtcars_tail_py),

  # Final R step
  rxp_r(name = mtcars_mpg, expr = dplyr::select(mtcars_tail, mpg)),

  # Render the Quarto document
  rxp_qmd(
    name = page,
    qmd_file = "my_doc/page.qmd",
    additional_files = c("my_doc/content.qmd", "my_doc/images")
  )
) |>
  rxp_populate()
```

Note the additional_files argument for rxp_qmd(): this includes child documents and images that the main Quarto file needs.
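Inside the document itself, pipeline artifacts can be pulled in with rxp_read(). A minimal page.qmd sketch (the title and chunk contents are illustrative):

````markdown
---
title: "My document"
---

{{< include content.qmd >}}

```{r}
# Pull the artifact built by the pipeline into the document;
# illustrative usage
rixpress::rxp_read("mtcars_mpg")
```
````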
6.11.4 More Examples
The rixpress_demos repository includes:
- jl_example: Using Julia in pipelines
- r_qs: Using {qs} for faster serialisation
- python_r_typst: Compiling to Typst documents
- r_multi_envs: Different Nix environments for different derivations
- yanai_lercher_2020: Reproducing a published paper’s analysis
6.12 Summary
{rixpress} unifies environment management and workflow orchestration:
- Define pipelines with `rxp_*()` functions in familiar R syntax
- Mix languages freely: R, Python, Julia, Quarto
- Build with Nix for deterministic, cached execution
- Inspect outputs with `rxp_read()`, `rxp_load()`, `rxp_copy()`
- Debug with timestamped build logs
- Share reproducible pipelines via Git
The Python port ryxpress provides the same experience for Python-first workflows.
By embracing structured, plain-text pipelines over notebooks for production work, your analysis becomes more reliable, more scalable, and fundamentally more reproducible.