At this point, we have covered quite a bit: open as a principle of software development, open as a principle of human and machine interpretability, and open as a facet of reproducible workflows.
If we think back to the content covered in Open Workflows and the discussion of reproducibility and replicability, it’s worth considering that reproducibility is really about internal validation, while replicability is about external validation. Reproducibility confirms that the same data and processing methods will produce the same result. Replicability contributes to the evidence base by testing whether a new study, modeled on a previous one, produces consistent findings.
In the spirit of open as it relates to the digital environment and reproducibility, one of the gold standards when we’re looking at a single study is computational reproducibility; that is to say, if I pass off all of my inputs (data, scripts, etc.) to someone else, can they, on their computer, reproduce exactly what I did? Unfortunately, the answer is frequently no, because computers are complex environments, and no two machines will have exactly the same environment; hardware and software differences will exist, and these will affect how a program processes data. In complex food production, like brewing beer, it’s often said that making a great beer once is easy, but making it a second time is much harder. Small variations, such as precise temperatures, ingredient sources, and even the weather, can subtly change the flavour. The same challenge exists in computational reproducibility, unless we use a container and apply the concept of containerization.
The full details of how containerization is deployed are beyond an introductory module on open research. But the principles it addresses are critical to navigating a digital environment, particularly when we think about how work embedded within a piece of software can be validated by others.
A Recipe for Understanding Containers
The thing about any piece of software or any script is that it is never fully self-contained. We always rely on dependencies or pre-existing bundles of code, usually called libraries. Think of it like baking muffins.

When you write your R or Python script, you’re writing out your recipe: a set of instructions with particular steps that need to be followed. To be fully executed, though, your recipe requires certain things. It requires:
- an environment in which to run; let’s call this your kitchen
- something to process your ingredients into the end product; let’s call this your oven
- something to validate all the ingredients; let’s call this your mixing bowl
- a list of ingredients; let’s call these your dependencies or libraries.
Finally, your favourite muffin recipe is dependent on you having eggs, butter, white flour, and cow’s milk. Let’s now imagine that when you built your working environment — your kitchen — you made sure to include a lifetime supply of all of these dependencies — your ingredients. All is good until you come home one day, and realize that your partner has done some upgrades. One of these upgrades is to replace all of your white flour with rye flour and your cow’s milk with goat’s milk.
This upgrade was ostensibly made to reflect the need for a healthier lifestyle. Beyond potentially being annoyed about the lack of consultation, maybe you see the problem? Next time you try to make your muffins, your validator — your mixing bowl — will be expecting white flour and cow’s milk. Unable to find these ingredients, your mixing bowl will fail to pass all the ingredients off to your oven. No more muffins, in spite of the most well-documented script — your recipe — being in hand.
Even if you never had any upgrades done, what if your friend wanted your recipe? Sure, you could give them the script, and they could source all the ingredients. But if you wanted to ensure that the recipe was a perfect match to your own, you’d gift-wrap all the ingredients with the recipe attached, ensuring success.
This gift wrapping, or bundling, is exactly what software that containerizes a piece of code does — it ensures that the code is accompanied by the appropriate environment and dependencies so that it will run into the future. This is a critical aspect of reproducibility.
Docker is a popular open-source tool for containerizing software and code. For those working in the realm of High-Performance Computing, Apptainer is another popular option.
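To make this concrete, a container is usually described by a short recipe file of its own. The sketch below is a minimal, hypothetical Dockerfile for a Python analysis script; the file names, base image, and package versions are placeholders standing in for whatever your own project actually needs.

```dockerfile
# The base image is your kitchen: a fixed operating system and Python version
FROM python:3.11-slim

# The ingredients: a list of pinned dependencies, installed exactly as specified
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# The recipe itself: your script and the data it expects
COPY analysis.py data.csv ./

# How to bake the muffins when the container is run
CMD ["python", "analysis.py"]
```

Someone who receives this file (or the image built from it) can run something like `docker build -t my-analysis .` followed by `docker run my-analysis` and get the same kitchen, oven, and ingredients you had, regardless of what is installed on their own machine.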

Dig Deeper
To learn more about containers and the way they can improve reproducibility, review the following videos:
- A conference presentation on the basic elements and implementation of Docker: Using Docker Containers to Improve Reproducibility in PL/SE Research (42:08)
- An introduction to containers using Singularity and some of the differences between Docker and Singularity (48:23)
- An introduction to Apptainer (8-part series)
Using GenAI for Data Analysis: Why Reproducibility Can Be Challenging
Large Language Models (LLMs) are powerful tools for data analysis tasks like sentiment analysis and text summarization. They can very quickly make sense of large amounts of text and generate useful insights in a natural, human-like way.
But while they’re impressive, using LLMs comes with a reproducibility challenge. If you run the same data analysis twice using an LLM, you might not get the exact same result. That’s because LLMs rely on built-in randomness when generating their text output. This randomness is intentional and helps make the output more natural, but it also makes it harder to get consistent, repeatable results.
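As a loose illustration of where that randomness comes from, the toy sketch below mimics the way a model samples its next word from a probability distribution instead of always picking the single most likely option; the words and probabilities are invented for the example and are not taken from any real model.

```python
import random

# Invented next-word probabilities for the sentence "The sentiment is ..."
next_word_probs = {"positive": 0.6, "neutral": 0.3, "negative": 0.1}

# Sampling: different runs can legitimately pick different words
sampled = random.choices(
    list(next_word_probs), weights=list(next_word_probs.values()), k=1
)[0]

# Greedy decoding: always pick the most likely word, so every run agrees
greedy = max(next_word_probs, key=next_word_probs.get)

print("sampled:", sampled, "| greedy:", greedy)
```

Hosted LLMs typically sample by default, which is why two identical runs can drift apart.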
The problem is even more complicated when you’re using proprietary Software as a Service models like GPT-4 or Claude. These models are hosted by companies and act like “black boxes”. You don’t have access to their inner workings such as their training data or model weights. That means if the provider makes a change to the model behind the scenes, your results may change too, even if your inputs stay the same.
To work around this, it’s important to:
- Save all your prompts and settings
- Use fixed versions of the model when possible
- If available, use model settings that minimize randomness, such as setting the temperature to 0, using greedy decoding, and fixing a random seed (see the sketch after this list)
- Try repeating your analysis a few times to observe the scale of the random effects
- Consider using a locally hosted model such as Llama, Gemma, or Mistral
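As a sketch of how the last few points can be put into practice, the example below uses the openai Python client; the model name, prompt, and seed value are placeholders, and not every provider honours every setting (the seed, in particular, is usually best-effort), so treat this as a pattern rather than a guarantee of identical output.

```python
import json
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Settings chosen to minimize randomness; keep them alongside the prompt
settings = {
    "model": "gpt-4o",   # placeholder: pin an exact model version where possible
    "temperature": 0,    # prefer the most likely token at each step
    "seed": 42,          # fixed seed, honoured on a best-effort basis by the provider
}
prompt = "Classify the sentiment of this review as positive, negative, or neutral: ..."

# Save the prompt and settings so the run can be documented and repeated later
with open("llm_run_settings.json", "w") as f:
    json.dump({"prompt": prompt, **settings}, f, indent=2)

# Repeat the call a few times to observe how much the output still varies
for i in range(3):
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        **settings,
    )
    print(i, response.choices[0].message.content)
```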

Dig Deeper
To learn more about data analysis with LLMs:
- Download and install ollama, a user-friendly tool for running LLMs locally.
- Explore AI-Assisted Data Extraction (AIDE): Demo Video, GitHub