
Portable, Repeatable (, and Reproducible) Notebooks


Notebooks represent an important improvement for interactive and iterative research. However, they tend to lack portability, reproducibility, and repeatability. In this post, I want to walk through the iterations of combining Org mode notebooks with other tools to develop a (hopefully) portable, reproducible, and repeatable notebook for software engineering experiments. While the focus here is software engineering, the tools and techniques described should apply to any study requiring these properties.

With a handful of techniques, it is easy to create notebooks which can be passed along to collaborators, referees, and our future selves. Importantly, this post is NOT an introduction to Org Mode, Org Babel, or notebooks in general. Its main contribution is a series of techniques for making Org mode documents easily portable and repeatable with minimal fuss.

Background

There exist numerous "notebook" implementations, such as Jupyter Notebooks, SageMath Worksheets, R Markdown, and Elixir Livebook, as well as some less traditional ones, such as Org Mode and tools like Pweave.

Notebooks

Jupyter Notebooks is arguably the most prevalent notebook system out there. Among its many features, it supports different "kernels" to execute different languages, integrated documentation and code, and inline plotting. However, the underlying format is cumbersome and makes collaboration with other developers through traditional tools such as Git difficult. For a larger list of issues with notebooks in general, see the Caveats section below.

Org-Mode (+ Emacs)

As an alternative, Org Mode (with Emacs) offers all the desired features of notebooks while avoiding some of the limitations typical notebook implementations impose. For example, the notebook is manipulated directly in your editor of choice(™); all creature comforts included. Moreover, Org Babel offers the expected seamless integration between documentation and code blocks and more. For example, the code blocks within a notebook can be "tangled" out and executed in a more typical command-line-driven style. And while multi-kernel notebooks are not common outside of a few extensions, such a feature is built directly into Org Mode's operation: computations from one snippet can be passed as input to another, written in a different language. For example, see the Meta Programming Language documentation.

Caveats

Notebooks are susceptible to a number of issues, and Org mode notebooks are no exception. These issues are not specifically addressed in this post, but they absolutely require attention and acknowledgment. Notably, state management and "execution flow" are not straightforward concepts within notebooks, since developers can execute snippets in any order.

While not the standard way of interacting with notebooks, Org mode provides a mechanism for specifying code snippet/block dependency. Its usage, however, is restricted to when one block explicitly depends on the computational result of another. Specifically, using named variables to pass data between source blocks imposes a computation dependency. In this way, out of order execution problems are alleviated, making the notebook easier to comprehend, even if the format is not top-down.
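
As a minimal sketch of this pattern (the block names and values here are purely illustrative, not from the notebook): the second block references the first by name, so evaluating it forces the first block to run and supplies its result.

#+NAME: EXPERIMENT-COUNT
#+begin_src bash
echo -n 42
#+end_src

#+begin_src bash :var COUNT=EXPERIMENT-COUNT()
# Referencing the named block above imposes the dependency: it is
# evaluated first and its result is bound to COUNT for this block.
echo "running ${COUNT} experiments"
#+end_src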

Candidate Solution

The following is a candidate solution to this problem. It can and should be refined, but it represents a decent step in the right direction. First, we discuss the "reproducible" component. Then, using an example, we discuss portability and repeatability together.

Reproducible

GNU Guix provides the necessary reproducibility. Specifically, using envrc and direnv, a manifest is loaded and shimmed into the current environment within Emacs. In this way, Emacs and other shell processes are using a specific environment which includes the necessary tools and libraries for the project. The same manifest is used to create a "really reproducible" package/archive which is deployed to an HPC cluster. This is all done using a channels file to version pin the entire dependency tree (down to the C compiler and GLibC).
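
As a rough sketch of what this looks like from the shell (the file names follow Guix conventions, and the exact pack invocation is an assumption rather than necessarily what was used here):

#+begin_src bash
# channels.scm pins every dependency (down to the C compiler and glibc);
# manifest.scm lists the project's tools and libraries.  The same pinned
# revision then builds a relocatable archive for the HPC cluster.
guix time-machine --channels=channels.scm -- \
     pack -RR --manifest=manifest.scm
#+end_src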

Portable and Repeatable

For a recent conference paper, I needed to repeatedly run a few thousand different experiments and then aggregate and analyze their results. This was accomplished with an Org mode notebook. The major techniques discovered in the process are accessible properties, named example blocks, and liberal use of header arguments.

The following is a series of excerpts taken directly from the notebook, highlighting each of these features.

The file is structured into several headings, one for each part of the process. The top-level heading of the notebook defines a few properties which are used throughout the notebook:

* notebook
:PROPERTIES:
:REMOTE: borah.boisestate.edu
:REMOTE-DIR: /bsuhome/kennyballou/scratch
:END:

Everything that interacts with the HPC cluster is then expressed relative to these two properties. Another user can change these two values and easily execute the notebook. For example, the following is a batch script which is "tangled" to the HPC cluster. Notice that the :tangle header argument uses these properties:

#+begin_src bash :tangle (concat "/ssh:" (org-entry-get nil "REMOTE" t) ":" (org-entry-get nil "REMOTE-DIR" t) "/run-intervals-analysis.sh")
#!/usr/bin/env bash

#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 4
#SBATCH --mem=64G
#SBATCH -t 0-03:00:00
#SBATCH -p bsudfq

module purge

RESULTS_PATH=${1}
CLASS_NAME=${2}
METHOD_ID=${3}

exec ~/DFA/bin/dfa interval-numerical \
     --classpath ~/DFA/artifacts.jar \
     --output "${RESULTS_PATH}" \
     "${CLASS_NAME}" \
     "${METHOD_ID}"
#+end_src

Relative to the current notebook, we also have data files. Many code blocks need to refer to these data files throughout the experimental process. To accomplish this, their paths are named using example blocks:

#+NAME: METHODS
#+begin_example
./in/methods.txt
#+end_example

#+NAME: DOMAINS
#+begin_example
./in/domains.txt
#+end_example

#+NAME: COMPARISONS
#+begin_example
./in/comparisons.txt
#+end_example

#+NAME: REPORTS
#+begin_example
./in/reports.txt
#+end_example

To take this a step further, the contents of the files themselves could be included in the notebook directly and tangled out to the named paths.
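
For instance, the methods list might itself live in the notebook and be tangled to the expected path. A sketch, where the rows are placeholders (assuming the tab-separated class/method format the batch script above consumes):

#+begin_src text :tangle ./in/methods.txt
com.example.Foo	1
com.example.Bar	2
#+end_src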

Finally, to uniquely identify different experiment runs, the following variable is generated for each invocation:

#+begin_src bash :eval query
echo "#+NAME: UUID"
echo -n $(uuidgen --time)
#+end_src

#+RESULTS:
:results:
#+NAME: UUID
4cf8581e-e2c2-11ed-aab2-8cf8c5ed93dd
:end:

Using this variable, we can create a relative path variable which is referenced for the experiment and analysis output:

#+NAME: OUTPUT_PREFIX
#+begin_src bash :var UUID=(org-sbe UUID) :results silent
echo -n "./out/${UUID}"
#+end_src

Using a relative path for OUTPUT_PREFIX allows for the prefix to be used on the remote servers and locally when processed data files are copied to the local machine for exploratory analysis.

The (org-sbe ...) pattern is used frequently because it inherently removes any newlines that may be introduced either by the literal text or by computing the variable from a previous source block.

Before jobs are submitted, the remote directory tree needs to be populated:

#+begin_src bash :session *on-borah* :dir (concat "/ssh:" (org-entry-get nil "REMOTE" t) ":" (org-entry-get nil "REMOTE-DIR" t)) :results silent
mkdir -p ${OUTPUT_PREFIX}/{joblogs,intervals}
#+end_src
#+begin_src bash :results silent
scp -q "${METHODS}" ${remote}:${remote_dir}/${OUTPUT_PREFIX}/methods.txt
scp -q "${DOMAINS}" ${remote}:${remote_dir}/${OUTPUT_PREFIX}/domains.txt
scp -q "${COMPARISONS}" ${remote}:${remote_dir}/${OUTPUT_PREFIX}/comparisons.txt
scp -q "${REPORTS}" ${remote}:${remote_dir}/${OUTPUT_PREFIX}/reports.txt
rsync --archive ExperimentData/domains ${remote}:${remote_dir}/${OUTPUT_PREFIX}/.
#+end_src

To keep the source blocks simple and reduce copying, we use header-args to apply certain variables to all code blocks of a particular section. For example, "Job Initialization" has the following header arguments:

**** Job Initialization
:PROPERTIES:
:ID:       7e8302d4-38b7-4a3b-aed4-b329c81b43ce
:header-args:bash: :var OUTPUT_PREFIX=(org-sbe OUTPUT_PREFIX)
:header-args:bash+: :var METHODS=(org-sbe METHODS)
:header-args:bash+: :var DOMAINS=(org-sbe DOMAINS)
:header-args:bash+: :var COMPARISONS=(org-sbe COMPARISONS)
:header-args:bash+: :var REPORTS=(org-sbe REPORTS)
:header-args:bash+: :var remote=(org-entry-get nil "REMOTE" t)
:header-args:bash+: :var remote_dir=(org-entry-get nil "REMOTE-DIR" t)
:END:

Finally, to execute a series of analyses, we use GNU Parallel to produce a cross-product of our input parameters and submit the jobs against the remote machine:

#+name: intervals
#+begin_src bash
parallel --colsep '\t' \
         --shuf \
         --jobs=25% \
         --delay 1s \
         ssh -q ${remote} \
         sbatch --chdir="${remote_dir}" \
         --job-name="intervals-{1}_{2}" \
         --output="${OUTPUT_PREFIX}/joblogs/%x.out" \
         --error="${OUTPUT_PREFIX}/joblogs/%x.err" \
         run-intervals-analysis.sh "${OUTPUT_PREFIX}/intervals" "{1}" "{2}" \
         :::: "${METHODS}"
#+end_src

Once the jobs are complete, we can download the results and begin the analysis process. However, that is essentially the same set of ideas repeated.
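
For illustration, that retrieval step might look something like the following sketch (assuming the same section-level header arguments shown above):

#+begin_src bash :results silent
# Pull the processed interval results back for local, exploratory analysis.
mkdir -p "${OUTPUT_PREFIX}"
rsync --archive ${remote}:${remote_dir}/${OUTPUT_PREFIX}/intervals ${OUTPUT_PREFIX}/.
#+end_src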

Discussion

Org Mode is a huge tool and must be learned piecemeal to master. As such, many Org Babel examples do not show the full power of passing different arguments or of using Elisp to directly manipulate and pass variables between source blocks. Hopefully, this post helps fill in those gaps and shows some of what is possible with a meta-notebook tool like Org Mode.

There are some obvious points of improvement. First, Guix and Org Mode could be better integrated such that a single notebook can be entirely self-contained. However, projects tend to be more than one file, so this is not a major limitation. More importantly, the process of submitting jobs poses several limitations and problems. SLURM is not built for large numbers of individual submissions; hence the delays and limited resources given to GNU Parallel, which ultimately tie up Emacs for several minutes. Furthermore, having thousands of jobs waiting in SLURM's queue can itself be slightly problematic. A better approach would be to create Job Arrays for each set of experiments. This would alleviate the pressure on SLURM and keep Emacs from locking up during the submission process. Similarly, it would allow the submission process to be tangled out and sent to the cluster independently of Emacs.
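
A sketch of such a job-array variant of the batch script follows; the script name, array bounds, concurrency limit, and the idea of indexing into the methods file by task id are all assumptions rather than excerpts from the notebook:

#+begin_src bash :tangle (concat "/ssh:" (org-entry-get nil "REMOTE" t) ":" (org-entry-get nil "REMOTE-DIR" t) "/run-intervals-array.sh")
#!/usr/bin/env bash

#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 4
#SBATCH --mem=64G
#SBATCH -t 0-03:00:00
#SBATCH -p bsudfq
#SBATCH --array=1-1000%50

module purge

RESULTS_PATH=${1}
METHODS_FILE=${2}

# Each array task handles one line of the methods file, selected by its
# task id, so a single sbatch call replaces thousands of submissions.
read -r CLASS_NAME METHOD_ID \
     <<< "$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${METHODS_FILE}")"

exec ~/DFA/bin/dfa interval-numerical \
     --classpath ~/DFA/artifacts.jar \
     --output "${RESULTS_PATH}" \
     "${CLASS_NAME}" \
     "${METHOD_ID}"
#+end_src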