Reproducible Polyglot Data Science
Welcome!
A modern, unified, and language-agnostic workflow for data science using Nix.
This book is a complete reimagining of my previous work, “Building Reproducible Analytical Pipelines with R.” If you’re looking for that book, you can find it here. But if you’re ready for the next step, you’re in the right place.

Data scientists, statisticians, analysts, researchers, and many other professionals write a lot of code.
Not only do they write a lot of code, they must also read and review a lot of it. They either work in teams and need to review each other’s code, or need to be able to reproduce results from past projects, be it for peer review or auditing purposes. And yet, they never, or very rarely, get taught the tools and techniques that would make the process of writing, collaborating, reviewing and reproducing projects possible.
This is truly unfortunate, because software engineers face the same challenges and solved them decades ago.
The aim of this book is to teach you how to use some of the best practices from software engineering and DevOps to make your projects robust, reliable and reproducible. It doesn’t matter if you work alone, in a small team or in a big one. It doesn’t matter if your work gets (peer-)reviewed or audited: the techniques presented in this book will make your projects more reliable and save you a lot of frustration!
As someone whose primary job is analysing data, you might think that you are not a developer. It seems as if developers are these genius types that write extremely high-quality code and create these super useful packages. The truth is that you are a developer as well. It’s just that your focus is on writing code for your purposes to get your analyses going instead of writing code for others. Or at least, that’s what you think. Because in others, your team-mates are included. Reviewers and auditors are included. Any people that will read your code are included, and there will be people that will read your code. At the very least future you will read your code. By learning how to set up projects and write code in a way that future you will understand and not want to murder you, you will actually work towards improving the quality of your work, naturally.
The book can be read for free at https://b-rodrigues.github.io/reproducible-data-science/ and you’ll be able to buy a DRM-free Epub or PDF on Leanpub once there’s more content.
This book is the culmination of my previous works. I started by writing a book focused on R, and then began working on a Python edition. During that process, I had a realization: tackling reproducibility one language at a time was solving the symptoms, not the root cause. The real solution needed to be universal, powerful, and capable of handling any language or tool we might need.
That universal solution is Nix.
This book moves beyond language-specific tooling. It presents a holistic workflow where R, Python, and Julia are not competitors, but collaborators in a single, cohesive, and perfectly reproducible environment. We will cover:
- The Nix Philosophy: Why Nix is the ultimate tool for solving the “it works on my machine” problem, once and for all.
- Declarative Environments with {rix}: How to use a simple R interface to define exact, bit-for-bit reproducible software environments that include specific versions of R, Python, Julia, their packages, and any system-level dependencies.
- Polyglot Pipelines with {rixpress} (R) or ryxpress (Python): How to orchestrate complex analytical pipelines that seamlessly pass data between different languages, all managed by the Nix build system.
- Unit Testing and Functional Programming: Core principles for writing robust, testable, and maintainable code, no matter the language.
- Distribution and Automation: How to package your entire reproducible pipeline into a Docker container for easy sharing and automate your workflow with GitHub Actions.
While this is not a book for beginners (you should be familiar with at least one data-centric programming language before reading this), I will not assume that you have any knowledge of the tools discussed. But be warned, this book will require you to take the time to read it, and then type on your computer. Type a lot.
I hope that you will enjoy reading this book and applying its ideas in your day-to-day work, where they should improve the reliability, traceability and reproducibility of your code.
If you find this book useful, don’t hesitate to let me know! You can submit issues, suggest improvements, and ask questions on the book’s GitHub repository.
If you want to get to know me better, read my bio.
You’ll also be able to buy a physical copy of the book on Amazon once it’s done. In the meantime, you could buy the R edition.
If you find this book useful, don’t hesitate to let me know or leave a rating on Amazon!