Open science with figshare and object-orientated programming

Update: I’m pleased to say that I was awarded Imperial’s Bradley-Mason Prize for Open Chemistry — see Professor Rzepa’s blog post for more info.

From 1st May 2015, the EPSRC requires that all publications include a statement saying how the underlying research data can be accessed. Technically, you can simply include an email address to contact for the data, but I think that’s hardly in the spirit of open science. In this post, I want to describe how I used object-orientated programming (OOP) and figshare to meet this requirement for my latest paper in Lab on a Chip. You can download the data and MATLAB code to reproduce the graphs at figshare.

Graphical abstract for my paper, Microscale extraction and phase separation using a porous capillary.

In OOP, you create classes that define objects and their properties. For example, if you had a class Animal, instances of this class could be cat and dog. For the Animal class the properties might be legs (an integer) or dateofbirth (a date). The class also defines methods, which are functions that operate on instances of a class. For example, Animal.age() might use the dateofbirth property to return the age of the animal.
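As a minimal sketch of those ideas (written here in Python, with made-up example values), the Animal class might look like this:

```python
from datetime import date

class Animal:
    """A toy class: two properties (legs, dateofbirth) and one method (age)."""

    def __init__(self, legs, dateofbirth):
        self.legs = legs                # an integer
        self.dateofbirth = dateofbirth  # a date

    def age(self):
        """Return the animal's age in whole years, using the dateofbirth property."""
        today = date.today()
        years = today.year - self.dateofbirth.year
        # Knock a year off if the birthday hasn't happened yet this year
        if (today.month, today.day) < (self.dateofbirth.month, self.dateofbirth.day):
            years -= 1
        return years

# cat and dog are instances of the Animal class
cat = Animal(legs=4, dateofbirth=date(2010, 3, 1))
dog = Animal(legs=4, dateofbirth=date(2012, 7, 15))
print(cat.age(), dog.age())
```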

For my paper I defined a class called sepexp (short for separation experiment, the subject of the paper) with properties corresponding to the independent and dependent variables. My class definition also included a method runall to run the experiments (which were, thankfully, automated—one of the joys of flow chemistry) and plot to plot the data.

To start an experiment, I would create an instance of my sepexp class. For example, let’s call it exp1, and during its creation I specify the independent variables. Executing exp1.runall() runs all the experiments defined by my properties. The details aren’t relevant here—see the paper if you’re interested—but the key thing is that it saves the results in the properties mass_initial and mass_final.
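The real class is a MATLAB classdef (it’s in the figshare archive), but a rough Python sketch of its shape, with a hypothetical property name standing in for the real independent variables, looks like this:

```python
class sepexp:
    """Sketch of the separation-experiment class; see the figshare archive for the real one."""

    def __init__(self, flow_rates):
        # Independent variables ('flow_rates' is a hypothetical name for illustration)
        self.flow_rates = flow_rates
        # Dependent variables, filled in by runall()
        self.mass_initial = None
        self.mass_final = None

    def runall(self):
        """Run the automated experiments and record the collected masses."""
        # The real method drives the flow experiments; this stub just stores empty results
        self.mass_initial = []
        self.mass_final = []

    def plot(self):
        """Calculate volumetric collection rates from the masses and plot them."""
        pass

exp1 = sepexp(flow_rates=[1, 2, 5, 10])  # independent variables specified at creation
exp1.runall()                            # runs the experiments and stores the results
```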

Now that I’ve got an object that defines the experiment and contains the results, I can save it, e.g. using save in MATLAB or pickle in Python.
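In Python that’s a couple of lines with pickle (the MATLAB equivalent is save and load on a .mat file):

```python
import pickle

# Save the experiment object: parameters and results together in one file
with open('exp1.pkl', 'wb') as f:
    pickle.dump(exp1, f)

# Load it back later, ready for analysis or plotting
with open('exp1.pkl', 'rb') as f:
    exp1 = pickle.load(f)
```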

The next step is to plot the data, so I execute exp1.plot(), which does a straightforward calculation on the collected data to get the volumetric collection rate at the outlet and plots it. I then repeat this for each experiment.

What does this approach give you? You end up with a class definition and a series of objects that contain the parameters of each experiment, how it was carried out, the results, and a means to reproduce the analyses. You can zip this up, upload it to figshare, and you’ve got a publicly accessible link to your data with a DOI.

An OOP approach saves time when analysing data, because you define how the data is analysed once, in the class definition, and apply it repeatedly to every object/experiment. It’s easy to iterate through all your objects (see the scripts in the /plotting_scripts folder, and the sketch below). Distributing the class definition and the objects together means others can reproduce your analysis. I think that’s pretty cool. If you’ve got MATLAB, download my archive and give it a go.
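In the Python sketch above, that iteration is just a loop over the saved objects (the folder and file names here are hypothetical; the real scripts in /plotting_scripts do the MATLAB equivalent):

```python
import glob
import pickle

# Load every saved experiment and run the same analysis/plot on each one
for filename in sorted(glob.glob('experiments/*.pkl')):
    with open(filename, 'rb') as f:
        experiment = pickle.load(f)
    experiment.plot()
```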

Or even better, try it out with your next project. There are lots of online resources for learning OOP in your language of choice. The MATLAB OOP documentation is good (although I think MATLAB’s OOP syntax is horrible). I personally like books, and I first learnt about OOP from the excellent Learning Python by Mark Lutz.

Christmas wishes for nanoparticle synthesis

Back in August The Baran Laboratory blog posted some thoughts on yields and the need for qualitative assessment of reactions. I agree with their points, particularly about only believing in “0%, 25%, 50%, 75%, or quantitative” yields.

Unfortunately I can’t recall ever reading a paper on nanoparticle synthesis where the authors report a yield. It is difficult to define the yield of a nanoparticle synthesis because, unlike molecules, nanoparticles have a distribution of sizes and shapes, which makes the concept of molecular weight, and consequently yield, somewhat hazy. Still, it’d be useful if chemists reported the mass of dry nanoparticles obtained per batch. Practically, it’s quite straightforward, and I think it would be a handy metric for assessing reactions.

Their post inspired me to write some Christmas wishes for the field of nanoparticle synthesis. Here we go:

Reporting centrifugation speeds in relative centrifugal force rather than revolutions per minute

Centrifugation is the nanoparticle equivalent of column chromatography. Particles of different sizes and shapes sediment at different rates, so by centrifuging them you can achieve separation. Papers typically report revolutions per minute (rpm), but the relative centrifugal force (RCF) is a more useful number, as it’s this force which causes the particles to sediment at different velocities and separate. RCF depends not just on the angular velocity ω (i.e. the rpm) but also on the radius r of the centrifuge rotor:

RCF = r ω^2 / g

where g is the acceleration due to gravity. Our centrifuge has the option to set RCF instead of rpm. As an example, if some authors used a centrifuge at 2000 rpm and your centrifuge has a rotor with a radius n times larger, the RCF will be n times higher, so you’ll need to reduce either the rpm or the centrifugation time to match their conditions. But you probably won’t know what rotor they had, so you’ll have to guess and waste time working it out for yourself…
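To make the point concrete, here’s a small Python helper (the function name is mine) that converts rpm and rotor radius to RCF:

```python
import math

def rcf(rpm, radius_m, g=9.81):
    """Relative centrifugal force for a rotor of radius radius_m (in metres)."""
    omega = rpm * 2 * math.pi / 60   # angular velocity in rad/s
    return omega**2 * radius_m / g

# The same 2000 rpm spin means very different forces on different rotors:
print(rcf(2000, radius_m=0.10))  # ~447 g with a 10 cm rotor
print(rcf(2000, radius_m=0.15))  # ~671 g with a 15 cm rotor (1.5x the radius, 1.5x the RCF)
```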

Characterisation of what’s been washed away during washing/purification procedures

This is obvious to me but no one does it. If you have to centrifuge, decant the supernatant off the sediment, then wash the product multiple times, what are you getting rid of? I want to know.

Representative electron microscopy

I want big, high-resolution images with good contrast. Close-ups of particles of interest are fine, but I also want to see lower magnification shots showing representative samples of the product. A paper claiming to make nanoparticle X, but with only one image showing just a few of X? Then I won’t bother trying to reproduce it. Electron microscopy leads me on to…

Histograms!

I love histograms and I want to see them showing size distributions of your product. At least 50 particle measurements, preferably a hundred or more. Rather than simply stating that your product “is monodisperse”, actually give statistical data like the standard deviation to back up your claims. I also like seeing ratios of shape X to shape Y. If you used ImageJ to automatically measure the particles, then include what algorithm and parameters you used so that others can reproduce it.
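Once the measurements are exported from ImageJ, the histogram and statistics only take a handful of lines; a minimal Python sketch (the file name is hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

# Particle diameters in nm, one per measured particle, exported from ImageJ
diameters = np.loadtxt('diameters_nm.txt')

print(f'n = {diameters.size}, mean = {diameters.mean():.1f} nm, '
      f'std = {diameters.std(ddof=1):.1f} nm')

plt.hist(diameters, bins=20)
plt.xlabel('Particle diameter (nm)')
plt.ylabel('Count')
plt.savefig('size_distribution.png', dpi=300)
```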

Papers that do what they claim

If you claim to make nanoparticle X, you should be making at least 75% X by mass. I don’t have a problem with other crud as long as it’s a minority product and you acknowledge that it’s there.

I’d be a very happy boy if I got all of these granted…

Correcting the literature

Mathias Brust in Chemistry World:

Ideally, science ought to be self-correcting. … In general, once a new phenomenon has been described in print, it is almost never challenged unless contradicting direct experimental evidence is produced. Thus, it is almost certain that a substantial body of less topical but equally false material remains archived in the scientific literature, some of it perhaps forever.

Philip Moriarty expresses similar concern in a post at Physics Focus. Openly criticising other scientists’ work is generally frowned upon—flaws in the literature are “someone else’s problem”. Erroneous papers sit in the scientific record, accumulating a few citations. Moriarty thinks this is a problem because bibliometrics are (unfortunately) used to assess the performance of scientists.

I think this is a problem too, although for a different reason. During my MRes I wasted a lot of time trying to replicate a nanoparticle synthesis that I’m now convinced is totally wrong. Published in June 2011, it now has five citations according to Web of Knowledge. I blogged about it and asked what I should do. The overall response was to email the authors but in the end I didn’t bother. I wanted to cut my losses and move on. But it still really bugs me that other people could be wasting their limited time and money trying to repeat it when all along it’s (probably) total crap.

I did take my commenters’ advice and email an author about another reaction that has turned out to be a “bit of an art”. (Pro tip: if someone tells you a procedure is a bit of an art, find a different procedure.) I asked some questions about a particular procedure and quoted a couple of contradictions in their papers, asking for clarification/correction. His responses were unhelpful and after a couple of exchanges he stopped replying. Unlike the first case, I don’t believe the results are flat out wrong. Instead I suspect a few experimental details are missing or they don’t really know what happens. I think I’ll get to the bottom of it eventually, but it’s frustrating.

What are your options if you can’t replicate something or think it’s wrong? I can think of four (excluding doing nothing):

  1. Email the corresponding author. They don’t have an incentive to take it seriously. You are ignored.

  2. Email the journal editor. Again, unless they’re receiving a lot of emails, what incentive does the journal have to take it seriously? I suspect you’d be referred to the authors.

  3. Try and publish a rebuttal. Can you imagine the amount of work this would entail? Last time I checked, research proposals don’t get funded to disprove papers. This is only really a viable option if it’s something huge, e.g. arsenic life.

  4. Take to the Internet. Scientists, being irritatingly conservative, think you’re crazy. Potentially career damaging.

With these options, science is hardly self-correcting. I’d like to see a fifth: a proper mechanism for post-publication review. Somewhere it’s academically acceptable to ask questions and present counter results. I think discussion should be public (otherwise authors have little incentive to be involved) and comments signed (to discourage people from writing total nonsense). Publishers could easily integrate such a system into their web sites.

Do you think this would work? Would you use it? This does raise another question: should science try and be self-correcting at all?

Thanks to Adrian for bringing Mathias Brust’s article to my attention.

Details matter

Blog Syn is a new chemistry blog where chemists post their attempts to reproduce reactions from the literature. Each post starts with the following disclaimer:

The following experiments do not constitute rigorous peer review, but rather illustrate typical yields obtained and observations gleaned by trained synthetic chemists attempting to reproduce literature procedures…

I disagree completely. What could be more rigorous than actually trying a reaction?

So far there are three posts. The first gave a lower yield than reported. The second was “moderately reproducible”. The paper omitted details essential to the reaction’s success. The third was “difficult to reproduce” and is well worth reading—there’s a great response from one of the authors, Prof. Phil Baran.

It’s unacceptable for anyone to publish a paper without all the information necessary to replicate the results. It wastes researchers’ time and money. I’ve written before about my difficulties trying to replicate results. It’s infuriating. How do papers like this slip through peer review?

I suspect some authors don’t really know why a reaction gives a particular product, especially in nanoparticle synthesis. They manage to pull something off a few times and publish their findings, but (unknowingly) neglect parameters crucial for other researchers to be able to reproduce it. It could be something seemingly trivial, like the method used to wash the glassware. The next researcher does it differently because it’s not mentioned in the paper and gets a different result.

The only way to deal with this is for reviewers to demand thorough experimental sections. (But to do so they must have a good understanding of typical experimental procedures. This is a problem if your reviewer hasn’t been in the lab for years.)

An alternative scenario could be that the researchers, in the early stages of the work, find that doing X doesn’t work. Later they find doing Y does work. Y gets published. X stays in the laboratory notebook.

X is a negative result. On its own, it’s not very useful. Loads of attempted reactions don’t work. But in the context of the positive result (i.e. the paper), the negative result is actually very valuable to anyone who wants to repeat the paper. Serious consideration should be given to including negative results in the supplementary information.

Experimental methods are grossly oversimplified. We like things to be elegant and simple, but chemistry is complicated. There’s no excuse not to include more information because everything is published online and space constraints aren’t a problem.

Blog Syn shows that subtleties in chemistry are important. We should all acknowledge that in our own papers and demand that others do the same.

The death of my paper lab book?

Nature recently had a feature on the “paperless” lab which mostly focused on electronic laboratory notebooks (ELNs). As a computer nerd, I’ve been thinking about using one for a while.

ELNs have lots of advantages over paper notebooks. They’re searchable, easily backed up and can automatically incorporate data from instruments—no more cutting and pasting. Businesses like them as it’s easier to find out what an ex-employee did in an ELN than in loads of paper notebooks.

I’ve always used my department’s standard synthetic chemistry lab book, which has a risk assessment and reaction scheme on every left page and lines on every right. It works quite well. I number every reaction TWP001, TWP002, etc., and samples are labelled TWP001-A, TWP001-B, etc. Spectra follow a similar convention, e.g. TWP001-A_em_spec.txt or TWP001-A_abs_spec.txt, and all data and code used for data analysis is kept in a folder called TWP001_brief_description.

But there are a few things that I really hate about paper lab books. Going back through my notes when writing up work is a real chore, especially with seemingly never ending notes along the lines of “same as TWP050 except…”. Reaction TWP050 says: “same as TWP049 except…”. With an ELN you can just copy and paste.

The inherent linearity of a paper lab book is a pain. Entries run in chronological order, as if reactions were performed sequentially, one at a time, but I usually work on two or three reactions at a time. Leaving blank pages looks sloppy, but cramming notes into small gaps is messy.

The biggest problem is that paper notebooks have become incomplete records of research in the modern laboratory. A lab book should be a complete record of your thoughts, observations, measurements and results. However with modern lab instrumentation it’s impractical or impossible to include all the data by printing, cutting and sticking it in. For example, a search on my computer (not a look in my lab book) reveals 510 UV-vis absorption, fluorescence and excitation spectra recorded since August 2010. There’s no way I could print that out (and even if I did, the data is useless in that format). Furthermore, a paper lab book can’t capture any of the data analysis on the computer. My MATLAB (and now Python) code is riddled with comments. With paper lab books, this information is highly fragmented.

Considering these problems, I’ve been looking at electronic alternatives for some time, but what I’ve disliked about them boils down to two things: inflexibility and how they handle data. They seem to try to fit everything into a particular template or form. With a paper lab book, I can write and draw whatever I want, which is important to me as I’m not a “normal” synthetic chemist—I work with flow reactors and I’m more interested in my residence time than my yield.

I want to be able to access my plain text data files as plain text files and not have them converted into horrible proprietary binary formats subject to the whims of the ELN vendor. Think of the hassle caused when Microsoft switched from .doc to .docx—I don’t want this happening with my data. Plain text files from 30 years ago can still be read today and will be readable for longer than I’ll be alive. It also worries me that a web based ELN could disappear and leave me with a load of horribly formatted files to wade through.

Researching online, I found advocates of open notebook science—the (left field) practice of making your entire lab book and data available online as it is recorded—using blogs and wikis as ELNs. Cameron Neylon’s blog-like open lab book used the University of Southampton’s free LabTrove software. Lab book entries are like blog posts, with attachments for data, and you can organise posts using tags, e.g. “NMR”, or categories, perhaps to group posts related to a single reaction. Jean-Claude Bradley’s group notebook, the UsefulChem Project, used a wiki. I really like Bradley’s wiki and there are lots of nice examples if you click about on the list of reactions. His group upload and link to spectra and photographs—a complete research record.

I did a bit more research into using a wiki for an ELN and they seem to be the perfect match. They’re flexible in terms of organising data however I want and pages are versioned so you can see what was written when. There are loads of different wiki applications available, so I narrowed the possibilities down with the following criteria:

  • active development
  • proven large scale deployment for stability and reliability
  • open source and free
  • page access control
  • supports attachments
  • self-hosted because I don’t trust anyone
  • written in a nice programming language
  • stores data nicely, i.e. not binary formats

This boiled down to MediaWiki (which runs Wikipedia), FosWiki (used for loads of corporate intranets) and MoinMoin (whose large-scale deployments include the Apache Software Foundation, Python and Ubuntu wikis).

MediaWiki doesn’t handle attachments very well for ELNs since attachments are available globally, i.e. across the whole wiki at the top level, rather than being linked to individual pages. The latter makes more sense to me, as spectra or photos (the attachments) are related to the experiment (the page) rather than the whole notebook (the wiki). MediaWiki is designed for open content, so it doesn’t do access control without dodgy extensions. It’s also written in PHP, which I have no intention of learning. So that’s MediaWiki struck off.

FosWiki is aimed at corporations, which I think you can tell from its look and feature list. It’s also written in Perl, which I really don’t want to learn. So that’s FosWiki gone.

Last is MoinMoin. Unlike MediaWiki, attachments are linked to pages. MoinMoin is written in Python, a really nice language I’ve started to use instead of MATLAB, so there’s the possibility of writing my own extensions. It’s currently at version 1.9.4, so it should be very stable, and version 2.0 is under active development. It’s very clean and tidy.

I spoke to my supervisor about an ELN and he was extremely keen, so I’ve decided to give MoinMoin a go. I’ve installed it on a Linode virtual server running Ubuntu Linux.[^VPS] It took about 6 hours to install the whole server from scratch—not bad having never administered a server before! Initially I was a little worried about security, with the data being on an internet server, but I’ve locked down the server pretty tight and am going to make off-site backups to my office machine. If anyone is interested, I’ll write up how to set it up.

It would be cool to make MoinMoin chemically savvy—perhaps by pulling in data from ChemSpider or Wolfram Alpha, or COSHH info from Sigma-Aldrich? I think this could be done with a little Python scripting. I’ll open source anything good for others to use. I’m also planning on setting up an old scanner in the lab to upload paper drawings.
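I haven’t written any of this yet, but roughly what I have in mind is something like the sketch below, where the URL is a placeholder (it is not the real ChemSpider API) standing in for whichever service ends up being used:

```python
import requests

def compound_summary(name):
    """Placeholder sketch: look up basic data for a compound by name.

    The endpoint below is a stand-in, not a real API; the idea is that a
    MoinMoin macro could call something like this and drop the result into
    a lab book page.
    """
    response = requests.get('https://example.org/api/compounds', params={'name': name})
    response.raise_for_status()
    data = response.json()
    return {'formula': data.get('formula'), 'molar_mass': data.get('molar_mass')}
```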

This could all prove to be an embarrassing experiment, or even a complete nightmare ending with me dusting off my most recent lab book and finding a pen. On the other hand, it could be great. We’ll have to wait and see!

[^VPS]: I could have installed it on a dedicated machine in the office, but we’re a bit short on machines and I didn’t want to have to deal with hardware.