Collaborations Workshop 2014 (CW14) took place in Oxford at the end of March. I was lucky to attend the last day of the meeting, dedicated to ad hoc hacking. I would characterize the main topic of the workshop as the introduction of quality software development practices into the scientific environment. Here are my thoughts on why reproducibility is a dream that easily becomes a nightmare.

I'll start by comparing scientists to professional programmers and operations people, and argue that for an experiment to be reproducible its software has to be reusable. Then I'll give some suggestions on how to build a reproducible setup.

Scientists are developers

Mark Basham at PyData London summarized what I consider the pain of researchers in one sentence. Here is my free recollection:

Nowadays, scientists have to process so much data that they have to become programmers.

I completely agree. It's no longer possible to run an experiment successfully without using highly optimized libraries for computation, IO and result representation.

I would go even further and explain the growing popularity of Python and other high-level languages by the fact that they hide complex implementations behind simple interfaces (see scikit-learn's cross-validation code for an example).
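As a rough sketch, cross-validating a classifier takes just a few lines (the exact module layout may differ between scikit-learn versions):

    from sklearn import datasets
    from sklearn.cross_validation import cross_val_score
    from sklearn.svm import SVC

    # One call hides the splitting of the data, the repeated
    # model training and the collection of the scores.
    iris = datasets.load_iris()
    scores = cross_val_score(SVC(), iris.data, iris.target, cv=5)
    print(scores.mean())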

Therefore, the readability and understandability of code play an important role. So much so that it's very tempting to use the development version of a library while the code is being written. For example, the user interface of IPython 2, which is still under development at the moment of writing, is much improved compared to the current stable version. Quite often development versions fix encountered bugs that are too small to deserve a dedicated release.

In this mode a scientist behaves like a developer: the more recent the software, the better.

Scientists are not operations

In a software company, the eagerness of developers is usually balanced by the wisdom of operations, the people responsible for running and supporting the software.

operations:

/ɒpəˈreɪʃ(ə)ns/

noun

  1. people who are responsible for deploying and monitoring services in a company.

In academia there is usually much less interest in code. Not everyone is interested in studying how an experiment is implemented. The main deliverable of a scientist-developer is a result that hopefully beats the current state of the art. In addition, scientist-developers have strong opinions on which tools to use and prefer to reinvent the wheel rather than reuse code written by other scientists.

The problem is that it's not easy to reuse code. At the very least it has to be documented and well written, and both require a lot of effort that might not be appreciated.

Scientists should be operations

Why should scientists care about the quality of their code? The main reason is that good code will eventually be reused by someone else. That someone else could be another person in the group, or a group at another research center. Also, having a widely adopted tool minimizes the number of surprises. For example, NLTK's BNC reader takes care of some corner cases. (A note for the careful reader: the amount of regexp-based code written to process XML is at least 10 times larger than you think.)
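As a hedged illustration, this is roughly what using the dedicated reader looks like (the corpus location below is hypothetical, and the fileids pattern simply matches the standard BNC file layout):

    from nltk.corpus.reader.bnc import BNCCorpusReader

    # The reader handles the BNC's XML format, so no hand-written
    # regexps are needed to get at the words.
    bnc = BNCCorpusReader(root='data/BNC/Texts',
                          fileids=r'[A-K]/\w*/\w*\.xml')
    print(bnc.words()[:10])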

The bad news is that even developers are still working on easy code reuse, aka packaging, to make operations' lives easier!

Since scientists are already developers, they should learn from operations. This means that the environment the code runs in should stay similar from the moment it's being developed to the moment it's deployed. This prevents bugs and minimizes works-for-me situations.

A good project

The documentation of a good project should include testing and deployment instructions.

The process should be automated as much as possible. Tox is a tool that provides a unified interface for running tests and hides the differences between test runners.
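As a minimal sketch, a tox.ini for such a project could look like this (assuming the test suite lives in tests/ and is run with pytest):

    # tox.ini: run the same tests in a fresh, isolated environment
    [tox]
    envlist = py27

    [testenv]
    deps =
        pytest
        -rrequirements.txt
    commands = py.test tests/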

In the Python world, using eggs and listing the requirements in setup.py and their pinned versions in requirements.txt makes deployment easier, but solves only half of the problem: it takes care only of Python dependencies and ignores system dependencies, such as libraries written in C.
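For example, a project might declare its abstract dependencies in setup.py and pin exact versions in requirements.txt (the package names and versions below are made up for illustration):

    # setup.py: abstract Python dependencies
    from setuptools import setup

    setup(
        name='project',
        version='0.1',
        install_requires=['numpy', 'scikit-learn'],
    )

And the matching requirements.txt:

    # requirements.txt: exact versions for a reproducible install
    numpy==1.8.1
    scikit-learn==0.14.1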

Vagrant is a great tool for virtual machine management. It lets you use a completely isolated environment as if it were local. For example, on my Mac I can start a Vagrant box with CentOS 6.5 and have the same environment as the computing server I'll use.
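A minimal Vagrantfile for such a setup might look like this (the box name and the provisioning script are assumptions):

    # Vagrantfile: the virtual machine described in code
    Vagrant.configure("2") do |config|
      # The same CentOS 6.5 base image as on the computing server
      config.vm.box = "chef/centos-6.5"
      # Install the system dependencies on first boot
      config.vm.provision "shell", path: "provision.sh"
    end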

Taking this into account, these should be the necessary steps to run and develop scientific software:

# Get the Vagrant config
git/hg clone http://project-site.org/project
cd project

# Start the virtual environment
vagrant up

# Connect to it
vagrant ssh

# All system dependencies are installed
cat README
Dialogue act tagging

The environment to run the experiments described in Joe Doe. 2015. The Ultimate Dialogue Act Tagging.

Refer to http://project-site.org/ for more information.

The experiment data is stored in data/. To test the setup, run:

    tox

To run the experiment, type:

    bin/tagger doe2015

# Now I know what to do and happily run the experiment
bin/tagger doe2015
Tagging accuracy is 100%.

All this fancy and clear setup requires loads of love and care. A virtual machine image has to be built and hosted somewhere, and updated from time to time. This is why reproducibility can easily become a nightmare.

On the other hand, once you have a virtual machine image, it can be deployed in the cloud on a powerful machine in a matter of minutes.