Preface

Three years ago, I wrote a book with a straightforward premise: by borrowing a few key ideas from software engineering, people who analyse data could save themselves a great deal of frustration. The response to that book was more positive than I could have ever hoped for, and it confirmed a suspicion I had: we, as a community, are hungry for better ways to work.

That book, however, focused exclusively on the R ecosystem. A recurring question I received was, “This is great, but what about Python?” It was a fair question. The world of data science is not a monologue; it’s a conversation between languages. So, I began what felt like the logical next step: writing a Python edition.

I mapped out the chapters, identified the equivalent tools (pipenv for dependency management, ploomber for pipelines), and started writing. But as I went deeper, a nagging feeling grew. I was solving the same problems all over again, just with a different set of tools. This feeling was compounded by the rapid churn within the Python ecosystem itself. How many package managers have been created to solve virtual environment management? At the time of writing, uv is all the rage, and while it may be here to stay, history suggests a new contender is always just around the corner.

This pointed to a larger issue. I am convinced that the future of data science is polyglot. An R user and a Python user, both following my original advice, would end up with reproducible projects, but their workflows would be fundamentally incompatible. They couldn’t easily share an environment or build a single pipeline that leverages the strengths of both languages. While companies like Posit have made excellent progress in making it easier to call Python from R, setting up a truly integrated development environment remains a challenge. And what if you wish to bring Julia, the other language of data analysis, into the fold? It is not as popular as Python or R, but it has its own distinct appeal and advantages.

I realised I was treating the symptoms, not the disease. The root problem wasn’t “How do I make R reproducible?” or “How do I make Python reproducible?” The real challenge was the lack of a universal foundation that could handle any language, any tool, and any system dependency with absolute, bit-for-bit precision.

That’s when I stopped writing the Python book.

The solution wasn’t to create another language-specific guide but to find a tool that operated at a more fundamental level. That tool, I am now convinced, is Nix.

Nix is not just another package manager; it is a powerful, declarative system for building and managing software environments. It allows us to define the entire computational environment—from the operating system libraries up to the specific versions of our R and Python packages—in a single, simple text file. When you use Nix, the phrase “it works on my machine” becomes obsolete. It is replaced by the guarantee: “it builds identically, everywhere, every time”—with a few caveats that we will explore, of course.

This book is the result of that realisation. It is a complete reimagining of the original. We are moving away from language-specific patchworks and toward a unified, polyglot workflow. We will use Nix as our bedrock, with the {rix} and {rixpress} R packages (or ryxpress for Python) serving as our friendly interface to its power. You will learn to build pipelines where R, Python, and Julia aren’t just neighbours; they are collaborators, working together in a single, perfectly reproducible environment.

A note for Python-first users: do not be deterred by the fact that {rix} and {rixpress} are R packages. There is a Python version called ryxpress that will allow you to run your pipelines from an interactive Python session. You will be able to use these packages to define your environments (even integrating tools like uv) and orchestrate your pipelines, while doing all of your analytical work exclusively in Python. In this workflow, R simply becomes a convenient configuration language.

The core message from three years ago remains unchanged. You, as someone who writes code to analyse data, are a developer. Your work is important, and it deserves to be reliable. This book aims to give you the tools and the mindset to achieve that. The journey is more ambitious this time, but the payoff is far greater.

I hope you’ll join me.

You can read this book for free online at https://b-rodrigues.github.io/reproducible-data-science/.

You can submit issues, suggest improvements, and ask questions on the book’s GitHub repository.