Organizing Research
A small tangent into the basics: how to organize research papers and related artifacts. I have a few ideas for how best to handle this, but nothing concrete that I'm satisfied with.
Is this a useless divergence into tool sharpening and small optimizations for little gain? Or is this a worthwhile venture to get right early on that pays off in the long term?
As I'm starting to dive into the "literature", as it might occasionally be referred to, there is a certain amassing of papers to read. Worse, there tends to be a mixed bag of advice for handling the fire hose of information and how to not drown in the deluge of papers that tends to accompany research.
There are certainly a myriad of different approaches to solving this problem. I have been debating a few approaches and would like to describe them and, ideally, exposit my way towards a final solution.
Goals, Requirements, and Motivation
Any organization system has goals and basic requirements, and this one is no exception. These goals are personal, but knowing the motivation behind them should make the decisions that follow easier to understand. With a different set of goals and requirements, the approaches below may not make sense.
Meaningful Filenames
The underlying structure and naming of files should be agnostic to any specific application that "manages" them. That is, the filenames and overall directory structure should mirror the organization used within the application. I want to be able to simply browse the files in a regular file explorer on occasion. Furthermore, copying files to, say, an e-reader with limited metadata capabilities similarly benefits from the foundation of the organization being in the filesystem itself.
DOI or Referencing Material for Cross Referencing and Linking
The organization system will need access to the digital object identifier (DOI) of each paper, since it may be needed for later queries such as citations and derivative work. Identifier formats vary between journals and conferences (ACM and CiteSeerX, for example, use different formats), so I'm not especially concerned with the DOI being part of the filename. Regardless, this information should be available when needed.
Similarly, the BibTeX should be available immediately as well. To some degree, the BibTeX is more important than the raw DOI information.
Simply put, enough information should be stored to support a workflow of querying for a paper's citations and derivative work.
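One way to keep the BibTeX within reach of the DOI is to ask the DOI resolver for it directly. The sketch below is a minimal illustration, not a settled part of the workflow: it assumes the DOI is registered with an agency that supports content negotiation for the `application/x-bibtex` media type through doi.org. The function names are hypothetical, and `10.1000/182` is only a placeholder DOI.

```python
import urllib.parse
import urllib.request

# Common ways a DOI shows up when copied from a web page or a citation.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip URL or 'doi:' prefixes and do a very loose sanity check."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    # Loose check only: a DOI starts with "10." and has a "/" before the suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return doi


def bibtex_request(raw_doi: str) -> urllib.request.Request:
    """Build (but do not send) a request asking doi.org for BibTeX."""
    doi = normalize_doi(raw_doi)
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": "application/x-bibtex"})


def fetch_bibtex(raw_doi: str, timeout: float = 10.0) -> str:
    """Send the request; needs network access and a DOI that supports BibTeX."""
    with urllib.request.urlopen(bibtex_request(raw_doi), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network.
print(bibtex_request("doi:10.1000/182").full_url)
```

The returned text could then be appended to a references.bib file. Whether the result is more complete than what other tools generate would need to be checked per publisher.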
Proposals
There are several possibilities worth investigating.
Regardless of the approach, however, something does seem immediately obvious: no system will work without front-loading the work of organization when a paper or artifact is under consideration. Handling the organization tasks in batches or letting the organization slip for even a moment may kill the entire system.
If there's already a large collection of papers, it will be very difficult to retroactively organize everything.
Zotero
Zotero is a free and open source platform for managing research. It's available as a desktop application and an online web application. I've only considered the former and know very little about the latter.
Zotero is an intuitive tool with a lot of good features. It can automatically pull metadata for articles, it can generate citations, the choice of organization is left to the user, file syncing can be achieved without having to use their hosted service, it can archive/snapshot a web page, and, an implicit requirement, it's free and open source software!
However, I have several issues that dissuade me from using it for my purposes. Chiefly, the file organization on disk is opaque, so it fails to satisfy the first requirement above. Furthermore, the generated BibTeX references are not as complete as I would like. Moreover, the process of pulling references for each new manuscript does not seem pleasant.
Org Mode with Org Attach
Another approach would be to use Org mode. Specifically, use Org mode as an "index" into the files, while maintaining the filesystem structure desired as per the first requirement. This can be further augmented by using "attachments" from Org mode.
For example, the workflow for this may be something like the following:
- Download a paper and rename it with the full title, the last name of the first author, and the year of publication, all separated by hyphens, all lowercase.
- Create an entry for the paper under the appropriate Org heading.
- Download and add the BibTeX citation to a references.bib file.
- Using org-ref, add a "citation" so that the article can be easily cited or its reference material retrieved later.
- Use org-attach to copy the file into the appropriate directory tree.
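The renaming in the first step is easy to get subtly inconsistent by hand, so it may be worth scripting. The following is a small sketch of that convention. It assumes the parts appear in the order listed (title, author, year), that punctuation and accents are simply dropped, and that the file is a PDF. The function names are my own.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Lowercase, strip accents, and join the remaining words with hyphens."""
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Any run of characters that is not a letter or digit becomes one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def paper_filename(title: str, first_author_last_name: str, year: int,
                   ext: str = ".pdf") -> str:
    """Build '<title>-<author>-<year><ext>' following the convention above."""
    parts = [slugify(title), slugify(first_author_last_name), str(year)]
    return "-".join(parts) + ext


print(paper_filename("A Mathematical Theory of Communication", "Shannon", 1948))
# a-mathematical-theory-of-communication-shannon-1948.pdf
```

Very long titles may run into filename length limits on some filesystems or e-readers, so truncating the title part might be necessary in practice.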
This process can likely be augmented and idealized via Capture Templates. However, I have not yet explored this in depth since I have a few concerns that still need to be addressed regarding this solution.
I can override the directory used for attachments on specific entries. However, this may become an overwhelming exercise that needs to be repeated many times and maintained manually. Again, Org Capture and templates may help realize the solution, but further exploration is certainly necessary.
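To get a feel for what a capture template would have to fill in for each paper, here is a rough sketch that generates the text of such an entry outside of Emacs. It assumes Org 9.3 or later, where the DIR property sets an entry's attachment directory (older versions used ATTACH_DIR). The DOI property, the function name, the placeholder title and DOI, and the example path are all hypothetical choices of mine, not something Org or Org-Ref prescribes.

```python
def org_entry(title: str, doi: str, attach_dir: str, level: int = 2) -> str:
    """Return the text of an Org heading with a property drawer."""
    stars = "*" * level
    lines = [
        f"{stars} {title}",
        ":PROPERTIES:",
        f":DOI: {doi}",         # hypothetical custom property for later queries
        f":DIR: {attach_dir}",  # assumption: attachment directory in Org >= 9.3
        ":END:",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values; the path mirrors the desired on-disk organization.
print(org_entry("Example Paper Title", "10.1000/182", "~/papers/example-topic/"))
```

A real capture template would do the same substitution inside Emacs. The point here is only that the per-entry override reduces to a couple of generated lines rather than manual upkeep.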
Org-ref
John Kitchin of CMU created Org-Ref, a reference system for Org Mode that better integrates BibTeX citations than Org Links. Really, Org-Ref and the linking system of Org are not even in the same ballpark.
There are a lot of nice features in Org-Ref, including a suite of utilities for downloading and automatically creating BibTeX entries from the DOI or even the PDF itself. It can automatically download and save the PDF into a specified directory. Org-Ref also nicely handles the process of adding citations to the manuscript during the writing phase.
I'm not sure exactly how Org-Ref handles saving the PDF files: how it names them or otherwise organizes them on disk. Given the more automatic nature of Org-Ref, it certainly warrants further investigation.
Discussion
I'm still debating these different solutions. I don't want to spend a ton of time optimizing this as there are a lot of other tasks that are more important than tool sharpening.
There likely exists an idealized solution in the hybrid space between Org Capture and Org-Ref, but it all requires some more investigation.
One reason I wanted to write about this is that I feel these kinds of topics aren't given the attention they probably should receive. Perhaps it's an uninteresting problem. Perhaps it's too dependent on personal preferences. Regardless though, there seems to be very little discussion on the subject. If there are discussions on the topic, they seem to exist in an ephemeral space of the internet or the air on which they were verbally transmitted.
The solutions proposed here are certainly not universally appropriate and I by no means make any such claim. But the solutions proposed seem appropriate and helpful to me. Perhaps they may be helpful to someone else. But further than that, it's hopefully a small step in a larger pattern of sharing tips and personal workflows to help others along similar journeys.