Version Control

Continuing on the journey of increasing the openness of our research with software solutions, we move next into version control.

Version control was likely born in the context of records management, where it has historically relied on naming conventions and on multiple copies, or versions, of documents being created and managed. But software developers working in distributed environments pioneered how we track changes within a document and across a project, especially in a digital environment.

In research, version control comes into play in a variety of ways, sometimes in the context of software, sometimes data, as well as in our methods, protocols, and manuscripts. Before delving into each of these aspects, let's start with the following video, LEARN GIT version control in 10 minutes!, which discusses why we might want to use a version control system instead of naming conventions across multiple documents. There is no need to watch beyond the first 4 minutes and 10 seconds. Indeed, we highly recommend a hard stop at 4:10.

Learn Git version control in 10 minutes! | Python Programmer

Version Control in Open Research and Open Education

In the context of open research, version control provides a window into the evolution of our project, allowing changes in one’s plans to be identified and reconciled with the original plan. This is transparency in the research life cycle. It is as relevant for our protocols or methods as it is for our scripts and our manuscripts. In the context of open education, version control can track the evolution of educational content co-developed in a class across cohorts; version control and time stamps can help other users judge the currency and relevance of the resource.

Undo or Version Control

We should differentiate here, though, between version control for transparency and version histories for recovery or tracked changes. Tracking every change made is more about recovery than versioning. Versioning is instead about capturing moments of significant change or phases of implementation.

If our methods change, our protocols need to be updated to reflect this, and the nature of these changes should be documented. If our script is processing our data in one way, and we discover the need to adjust this, this update should be captured and documented. The same goes for our manuscript; when a revision is made, it should be both noted and documented, so that the tacit processes of the project become embedded in the project workflow.

Recovery (Undo)

When we think recovery, it’s more like having an undo function: the whoops moment, I made a mistake — I didn’t mean to delete that paragraph or I didn’t mean to put that file in the trash — let me undo that.

So, for example, macOS Time Machine allows you to trace your documents back through time and recover an older version. Google Docs allows you to see a history of edits over a specified time period. Your R script can document each change you make to your data to get from A to B. OpenRefine, a graphical data cleaning tool, similarly keeps a record of every change made to your data set.

Common to all these platforms is the ability to retrace your steps and go back one step at a time. The individual steps or changes are also not commented: we don’t know why something happened, only that it happened.

Version Control

When we think of version control, we should be thinking of a timestamp to capture significant decisions made in the transition or evolution of a project, ideally accompanied by a description of what these changes were. The ability to add comments to a timestamped version of your code is one of the core features of version control. These comments are searchable and you can easily revert to previous versions of your code.
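The idea of a timestamped, commented, searchable version can be illustrated with a minimal Python sketch. This is a toy model only, not how Git or any other tool stores its data; the names Version, History, commit, search, and checkout are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Version:
    """One recorded snapshot: the content, why it changed, and when."""
    content: str
    message: str
    timestamp: datetime


@dataclass
class History:
    """A toy, linear version history (hypothetical; not how Git works internally)."""
    versions: list[Version] = field(default_factory=list)

    def commit(self, content: str, message: str) -> int:
        """Record a snapshot with a descriptive message; return its index."""
        self.versions.append(Version(content, message, datetime.now(timezone.utc)))
        return len(self.versions) - 1

    def search(self, term: str) -> list[int]:
        """Return the indices of versions whose message mentions the term."""
        return [
            i for i, v in enumerate(self.versions)
            if term.lower() in v.message.lower()
        ]

    def checkout(self, index: int) -> str:
        """Return the content exactly as it was at the given version."""
        return self.versions[index].content


history = History()
history.commit("feet = read_data()", "Import and tidy the data")
history.commit("metres = read_data() * 0.3048", "Fix imperial to metric conversion")

matches = history.search("metric")   # find versions by their comment
restored = history.checkout(0)       # go back to the first recorded version
```

Each version here carries a message explaining why it exists, which is what separates it from an uncommented undo history.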

In the context of a manuscript this might be recording two draft versions: the first, ready for review by my supervisor or journal editor, the second reflecting edits made after receiving feedback. Or for our script, a first version that imports and tidies our data, and then an updated version that fixes an issue in our calculations to convert imperial to metric.

Another key feature of version control is branching. Recovery/Undo is an inherently linear process: you make some progress, and you can go back to a prior state. Branching allows you to create multiple parallel versions of your work. Branches can also be merged, so the number of parallel versions of your code can change over time as it develops. Branching and merging are very useful when multiple authors or developers are collaborating on the same software project. Branches are also useful as playgrounds for experimental development: if an idea seems fruitful, it can be merged into your project; if it turns out to be a dead end, the branch can be deleted.
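Branching, merging, and deleting can also be sketched as a toy model in Python. This is an assumption-laden illustration rather than a description of any real tool: the Repo class and its methods are hypothetical, and real version control systems track far more, such as file contents and conflicting edits.

```python
class Repo:
    """A toy model of branches (hypothetical; real tools track far more)."""

    def __init__(self) -> None:
        # Each branch name maps to its own list of commit messages.
        self.branches: dict[str, list[str]] = {"main": []}

    def commit(self, branch: str, message: str) -> None:
        """Add a commit message to the named branch."""
        self.branches[branch].append(message)

    def branch(self, new: str, source: str = "main") -> None:
        """Start a new branch as a copy of the source branch's history."""
        self.branches[new] = list(self.branches[source])

    def merge(self, source: str, target: str = "main") -> None:
        """Bring over the commits that the target does not yet have."""
        for message in self.branches[source]:
            if message not in self.branches[target]:
                self.branches[target].append(message)

    def delete(self, name: str) -> None:
        """Discard a branch, for example after a dead-end experiment."""
        del self.branches[name]


repo = Repo()
repo.commit("main", "Import and tidy data")

# Two parallel lines of work start from the same point.
repo.branch("new-model")
repo.branch("risky-idea")
repo.commit("new-model", "Try alternative model")
repo.commit("risky-idea", "Experiment that did not work")

repo.merge("new-model")      # the fruitful idea joins the main project
repo.delete("risky-idea")    # the dead end is discarded
```

After these steps the main branch holds the original work plus the merged idea, and nothing from the discarded experiment.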

The difference between recovery and version control is more nuanced than this, but this dichotomy helps to highlight the role of version control in transparency versus recovering from a mistake.

Version control is made possible with both software and platforms. Most people have heard of GitHub. Git is the software that handles the version control. GitHub is a web platform, owned by Microsoft, that hosts Git repositories. Git is designed with text-based documents in mind, which is why it is so popular with people who write code.

OSF provides a versioning service similar to GitHub’s, but is designed primarily for people working with binary documents – files that are not plain text, such as PDFs, Word documents and the like.

Common to both of these platforms is that major revisions can be stored both on your local computer and in a remote repository for potential sharing and collaboration. They also encourage you to document your changes.

Dig Deeper

To learn more about version control and the different platforms that enable this, review the following: