Reproducible dependencies and environments
Objectives
Very few codes have no dependencies at all. How should we deal with them?
Instructor note
xx min teaching/discussion
How to avoid: “It works on my machine 🤷”
Use a standard way to list dependencies in your project:
Python: requirements.txt or environment.yml
R: DESCRIPTION or renv.lock
Rust: Cargo.lock
Julia: Project.toml
C/C++/Fortran: CMakeLists.txt or Makefile or spack.yaml or the module system on clusters or containers
Other languages: …
Install dependencies into isolated environments:
For each project, create a new environment.
Don’t install dependencies globally for all projects.
Install them from a file which documents them at the same time.
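The three points above can be sketched with the Python standard library alone. This is a minimal illustration, not the lesson's demonstration: the function names `env_python`, `install_command`, and `create_env` are hypothetical, and it assumes a pip-style requirements.txt rather than the environment.yml used in the example project.

```python
import os
import subprocess
import venv
from pathlib import Path


def env_python(env_dir: Path) -> Path:
    """Return the path of the Python interpreter inside the environment."""
    # Virtual environments use "Scripts" on Windows and "bin" elsewhere.
    subdir = "Scripts" if os.name == "nt" else "bin"
    exe = "python.exe" if os.name == "nt" else "python"
    return env_dir / subdir / exe


def install_command(env_dir: Path, requirements: Path) -> list[str]:
    """Build the pip command that installs the documented dependencies."""
    return [
        str(env_python(env_dir)), "-m", "pip", "install",
        "-r", str(requirements),
    ]


def create_env(env_dir: Path, requirements: Path) -> None:
    """Create one isolated environment for a project and populate it."""
    # A fresh environment per project; nothing is installed globally.
    venv.create(env_dir, with_pip=True)
    # The file that documents the dependencies is also what installs them.
    subprocess.run(install_command(env_dir, requirements), check=True)
```

Calling `create_env(Path(".venv"), Path("requirements.txt"))` from the project directory would create the environment and install into it; the directory name `.venv` is only a common convention.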
Demonstration
The dependencies in our example project are listed in an environment.yml file.
Discussion
Shouldn’t the dependencies in the environment.yml file be pinned to specific versions?
When is a good time to pin them?
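To make the idea of pinning concrete, here is a small sketch that turns unpinned package names into `name==version` lines based on what is installed in the current environment. It is one possible approach offered as an assumption, not the lesson's answer to the questions above; the function names `installed_versions` and `pin` are hypothetical. Existing tools such as `pip freeze` and `conda env export` record installed versions in a similar spirit.

```python
from importlib import metadata


def installed_versions() -> dict[str, str]:
    """Map each installed distribution name (lower-cased) to its version."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installations that lack a name
            versions[name.lower()] = dist.version
    return versions


def pin(names: list[str], versions: dict[str, str]) -> list[str]:
    """Turn unpinned package names into 'name==version' lines."""
    pinned = []
    for name in names:
        version = versions.get(name.lower())
        if version is None:
            raise KeyError(f"{name} is not installed in this environment")
        pinned.append(f"{name}=={version}")
    return pinned
```

For example, `pin(["numpy", "click"], installed_versions())` would produce lines suitable for a requirements.txt, provided both packages are installed in the active environment.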
We also have a container definition file:
This can be used with Apptainer/SingularityCE.
A container is like an operating system inside a file.
You can test some container exercises, if interested.
Where to explore more
Exercises
Exercise Reproducibility-1: Time-capsule of dependencies
Imagine the following situation: five students (A, B, C, D, E) each wrote code that depends on a couple of libraries and uploaded their project to GitHub. We now travel 3 years into the future, find their GitHub repositories, and try to re-run their code before adapting it.
Which version do you expect to be easiest to re-run? Why?
What problems do you anticipate in each solution?
Python (conda):
A: You find a couple of library imports across the code but that’s it.
B: The README file lists which libraries were used but does not mention any versions.
C: You find an environment.yml file with:
name: student-project
channels:
  - conda-forge
dependencies:
  - scipy
  - numpy
  - sympy
  - click
  - python
  - pip
  - pip:
    - git+https://github.com/someuser/someproject.git@master
    - git+https://github.com/anotheruser/anotherproject.git@master
D: You find an environment.yml file with:
name: student-project
channels:
  - conda-forge
dependencies:
  - scipy=1.3.1
  - numpy=1.16.4
  - sympy=1.4
  - click=7.0
  - python=3.8
  - pip
  - pip:
    - git+https://github.com/someuser/someproject.git@d7b2c7e
    - git+https://github.com/anotheruser/anotherproject.git@sometag
E: You find an environment.yml file with:
name: student-project
channels:
  - conda-forge
dependencies:
  - scipy=1.3.1
  - numpy=1.16.4
  - sympy=1.4
  - click=7.0
  - python=3.8
  - someproject=1.2.3
  - anotherproject=2.3.4
Python (pip):
A: You find a couple of library imports across the code but that’s it.
B: The README file lists which libraries were used but does not mention any versions.
C: You find a requirements.txt file with:
scipy
numpy
sympy
click
python
git+https://github.com/someuser/someproject.git@master
git+https://github.com/anotheruser/anotherproject.git@master
D: You find a requirements.txt file with:
scipy==1.3.1
numpy==1.16.4
sympy==1.4
click==7.0
python==3.8
git+https://github.com/someuser/someproject.git@d7b2c7e
git+https://github.com/anotheruser/anotherproject.git@sometag
E: You find a requirements.txt file with:
scipy==1.3.1
numpy==1.16.4
sympy==1.4
click==7.0
python==3.8
someproject==1.2.3
anotherproject==2.3.4
R:
A: You find a couple of library() or require() calls across the code but that’s it.
B: The README file lists which libraries were used but does not mention any versions.
C: You find a DESCRIPTION file which contains:
Imports:
dplyr,
tidyr
In addition you find these:
remotes::install_github("someuser/someproject@master")
remotes::install_github("anotheruser/anotherproject@master")
D: You find a DESCRIPTION file which contains:
Imports:
dplyr (== 1.0.0),
tidyr (== 1.1.0)
In addition you find these:
remotes::install_github("someuser/someproject@d7b2c7e")
remotes::install_github("anotheruser/anotherproject@sometag")
E: You find a DESCRIPTION file which contains:
Imports:
dplyr (== 1.0.0),
tidyr (== 1.1.0),
someproject (== 1.2.3),
anotherproject (== 2.3.4)
Solution
A: It will be tedious to collect the dependencies one by one, and even after doing so you will still not know which versions were used.
B: Without a standard file to look for, it might become very difficult to create the software environment required to run the software. At least we know the list of libraries, but we don’t know the versions.
C: Having a standard file listing dependencies is definitely better than nothing. However, if the versions are not specified, you or someone else might run into problems with dependencies, deprecated features, changes in package APIs, etc.
D and E: In both these cases exact versions of all dependencies are specified and one can recreate the software environment required for the project. One problem with the dependencies that come from GitHub is that they might have disappeared (what if their authors deleted these repositories?).
E is slightly preferable because version numbers are easier to understand than Git commit hashes or Git tags.
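The difference between solutions C, D, and E can also be checked mechanically. The sketch below classifies a single requirements.txt line by how it refers to a dependency. The function name `classify` and the category labels are hypothetical, and the commit-hash test is only a heuristic: a branch or tag whose name happens to consist of 7 to 40 hexadecimal characters would be misclassified.

```python
import re

# A full or abbreviated Git commit hash: 7 to 40 hexadecimal characters.
COMMIT_HASH = re.compile(r"[0-9a-f]{7,40}")


def classify(line: str) -> str:
    """Classify one requirements.txt line by how it refers to a dependency."""
    line = line.strip()
    if line.startswith("git+"):
        # Everything after the last "@" is treated as the Git reference.
        _, sep, ref = line.rpartition("@")
        if sep and COMMIT_HASH.fullmatch(ref):
            return "git commit"
        # No reference, or a name such as "master" or "sometag".
        return "git branch or tag"
    if "==" in line:
        return "exact version"
    return "unpinned"


for example in [
    "numpy",
    "numpy==1.16.4",
    "git+https://github.com/someuser/someproject.git@d7b2c7e",
    "git+https://github.com/anotheruser/anotherproject.git@master",
]:
    print(f"{classify(example):18} {example}")
```

A branch such as master points to different code as the repository evolves, which is why the lines in solution C give no guarantee about what gets installed 3 years later.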
Keypoints
If somebody asks you what dependencies you have in your project, you should be able to answer this question with a file.