## Open science with figshare and object-orientated programming

Update: I’m pleased to say that I was awarded Imperial’s Bradley-Mason Prize for Open Chemistry — see Professor Rzepa’s blog post for more info.

From 1st May 2015, the EPSRC requires that all publications include a statement saying how the underlying research data can be accessed. Technically, you can simply include an email address to contact for the data, but I think that’s hardly in the spirit of open science. In this post, I want to describe how I used object-orientated programming (OOP) and figshare to meet this requirement for my latest paper in Lab on a Chip. You can download the data and MATLAB code to reproduce the graphs at figshare.

In OOP, you create classes that define objects and their properties. For example, if you had a class Animal, instances of this class could be cat and dog. For the Animal class the properties might be legs (an integer) or dateofbirth (a date). The class also defines methods, which are functions that operate on instances of a class. For example, Animal.age() might use the dateofbirth property to return the age of the animal.
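As a rough illustration, the Animal example might look like this in Python. This is only a sketch: the `today` argument and the specific dates are made up so that the result doesn't depend on when you run it.

```python
from datetime import date


class Animal:
    """A class with two properties and one method, mirroring the example above."""

    def __init__(self, legs, dateofbirth):
        self.legs = legs                # an integer
        self.dateofbirth = dateofbirth  # a datetime.date

    def age(self, today=None):
        """Return the animal's age in whole years."""
        today = today or date.today()
        # Subtract one year if this year's birthday has not happened yet.
        had_birthday = (today.month, today.day) >= (
            self.dateofbirth.month,
            self.dateofbirth.day,
        )
        return today.year - self.dateofbirth.year - (0 if had_birthday else 1)


# cat and dog are two instances of the same class (dates are illustrative).
cat = Animal(legs=4, dateofbirth=date(2010, 6, 1))
dog = Animal(legs=4, dateofbirth=date(2012, 3, 15))
print(cat.age(today=date(2015, 5, 1)))
print(dog.age(today=date(2015, 5, 1)))
```

The method is written once in the class and then works on every instance, which is the property the rest of this post relies on.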

For my paper I defined a class called sepexp (short for separation experiment, the subject of the paper) with properties corresponding to the independent and dependent variables. My class definition also included a method runall to run the experiments (which were, thankfully, automated—one of the joys of flow chemistry) and plot[^overloading] to plot the data.

To start an experiment, I would create an instance of my sepexp class. For example, let’s call it exp1, and during its creation I specify the independent variables. Executing exp1.runall() runs all the experiments defined by my properties. The details aren’t relevant here—see the paper if you’re interested—but the key thing is that it saves the results in the properties mass_initial and mass_final.

Now that I’ve got an object that defines the experiment and contains the results, I can save it, e.g. using save in MATLAB or pickle in Python.[^binary]
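My real class is written in MATLAB, so the following is only a hypothetical Python analogue to show the shape of the idea. The names `SepExp`, `flow_rates` and `fake_weigh` are invented for illustration, and the masses are made up; only `runall`, `mass_initial` and `mass_final` echo the names used above.

```python
import pickle


class SepExp:
    """Hypothetical Python analogue of the sepexp class (the original is MATLAB)."""

    def __init__(self, flow_rates):
        # Independent variable(s), fixed when the object is created.
        # flow_rates is an illustrative name, not taken from the paper.
        self.flow_rates = list(flow_rates)
        # Results, filled in by runall().
        self.mass_initial = []
        self.mass_final = []

    def runall(self, weigh):
        """Run every experiment; weigh is a stand-in for the automated rig."""
        for rate in self.flow_rates:
            initial, final = weigh(rate)
            self.mass_initial.append(initial)
            self.mass_final.append(final)


def fake_weigh(rate):
    """Placeholder measurement returning made-up masses in grams."""
    return 10.0, 10.0 + 0.5 * rate


exp1 = SepExp(flow_rates=[1.0, 2.0, 4.0])
exp1.runall(fake_weigh)

# Serialise the whole object (parameters and results), then restore it.
blob = pickle.dumps(exp1)
restored = pickle.loads(blob)
print(restored.flow_rates, restored.mass_final)
```

In practice you would write the bytes to a file with `pickle.dump`. Loading them back requires the class definition to be importable, which is one reason to distribute the class and the saved objects together.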

The next step is to plot it, so I execute exp1.plot(), which does a straightforward calculation on the data collected to get the volumetric collection rate at the outlet and plots it. I then repeated this for each experiment.

What does this approach give you? You end up with a class definition and a series of objects that contain the parameters of each experiment, how it was carried out, the results, and a means to reproduce the analyses. You can zip this up, upload it to figshare, and you’ve got a publicly accessible link to your data with a DOI.

An OOP approach saves time when analysing data, because you define how the data is analysed once in the class definition, and apply it repeatedly to every object/experiment. It’s easy to iterate through all your objects (see the scripts in the /plotting_scripts folder). Distributing the class definition and the objects together means others can reproduce your analysis. I think that’s pretty cool. If you’ve got MATLAB, download my archive and give it a go.

Or even better, try it out with your next project. There are lots of resources for learning OOP in your language of choice online. The MATLAB OOP documentation is good (although I think MATLAB’s OOP syntax is horrible). I personally like books and learnt about OOP for the first time in the excellent book Learning Python by Mark Lutz.

[^binary]: The main disadvantage of these methods is that they save the data as binary objects. There are also security issues around opening pickle objects from untrusted sources. Therefore I recommend that when you come to publishing your data you also export it as ASCII, which is straightforward. See the export_mat2csv.m script in the figshare archive.

## How I beat the second year blues

I’m not sure why, but I thought I would never suffer from the second year PhD blues. Despite it taking me about two years of work (including part of my MRes) to get decent results, I remained positive. Last November, I started to get particularly exciting results and it laid out a clear path to the end of my doctorate.

But a few months ago, my reactions stopped working. Endless repeat reactions and tweaks were unsuccessful; I wanted to quit. The second year blues had found me and hit me hard.

In the last few weeks I’ve managed to get everything back on track. In fact the failed reactions might have shed some light on why the reaction works so well in the first place.

For anyone else in a similar position, I think it’s most important to stay motivated. I adopted a strategy of working on my main project and easier side projects on alternate days.

I think this has several benefits. By breaking up the disappointing results with easier work, I feel happier. Dealing with negative results for weeks on end was too much for me to handle.

I maintain momentum with side projects—something I struggled with before. I see side projects as backup publications, in case my main project goes down the drain. The time I spend not thinking about the main project helps me approach it with a fresh perspective too.

I find it helpful to tell people, like my supervisor and friends, whether I’m having a “main project day” or “side project day”. This stops me risking two consecutive days on the same project.

I recommend this strategy to any struggling students. There’s no point in slogging along, miserable. At least until I submit my thesis, it’s how I will work.

## No funding, no placement

Today the Social Mobility and Child Poverty Commission published Elitist Britain?, a report on social mobility in the UK. The conclusions aren’t surprising. Numerous outlets have covered it (e.g. BBC, Guardian).

Careers in the media, politics and law are often singled out as being tough to crack unless you’re from a privileged background. What about science? A search for “science” in the report returns zero hits. It’s interesting that it’s not mentioned.

Every summer, departments at Imperial host students on the Undergraduate Research Opportunities Programme, my department included. I’m sure other universities run similar things. There are bursaries for living costs but competition is tough. In roughly five years I’ve yet to come across a recipient. So really they are unpaid internships, no different to those that are criticised in industries like the press or fashion. Only the offspring of the rich can afford to work for free, especially in London.

In 2009 I applied for a college bursary but I was unsuccessful. I thought it was game over, as I worked full time every holiday to pay off debts that accumulated during term time (when I only worked part time). But the principal investigator generously paid me to do the project anyway, for which I’m still very grateful. I think this is quite rare.

If I had not completed the placement would I have been accepted onto my fully-funded PhD programme? I don’t know. Is it fair that only the wealthiest students can afford to undertake placements and gain valuable research experience? No. If I were a PI, I would not employ unpaid students in my lab, even though I’d be losing out on free labour. Considering the gulf between the richest and poorest and lack of social mobility in the UK, I think a policy of “no funding, no placement” is well overdue.

## Christmas wishes for nanoparticle synthesis

Back in August The Baran Laboratory blog posted some thoughts on yields and the need for qualitative assessment of reactions. I agree with their points, particularly about only believing in “0%, 25%, 50%, 75%, or quantitative” yields.

Unfortunately I can’t recall ever reading a paper on nanoparticle synthesis where the authors report a yield. It is difficult to define the yield of a nanoparticle because, unlike molecules, nanoparticles have a distribution of sizes and shapes which makes the concept of molecular weight and consequently yield somewhat hazy. Still, it’d be useful if chemists reported the mass of dry nanoparticles obtained per batch. Practically, it’s quite straightforward and I think it would be a handy metric for assessing reactions.

Their post inspired me to write some Christmas wishes for the field of nanoparticle synthesis. Here we go:

## Reporting centrifugation speeds in relative centrifugal force rather than revolutions per minute

Centrifugation is the nanoparticle equivalent of column chromatography. Different size and shape particles sediment at different rates so by centrifuging them you can achieve separation. Typically papers report revolutions per minute (rpm), but the relative centrifugal force (RCF) is a more useful number as it’s this force which causes the particles to sediment at different velocities and separate. RCF is dependent on not just the angular velocity $\omega$ (i.e. rpm) but also the radius $r$ of the centrifuge rotor:

$\textrm{RCF} = r \omega^2 / g$

where $g$ is acceleration due to gravity. Our centrifuge has the option to set this instead of rpm. As an example, if some authors centrifuged at 2000 rpm and your centrifuge has a rotor with a radius $n$ times larger, the RCF at the same rpm will be $n$ times higher and you’ll need to reduce either the rpm or the centrifuging time. But you probably won’t know what rotor they had so you’ll have to guess and waste time working it out for yourself…
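To make the conversion concrete, here is a minimal sketch of the formula in Python. The rotor radii (10 cm and 20 cm) are illustrative values I've picked, not taken from any paper, and rpm is converted to angular velocity with $\omega = 2\pi \cdot \textrm{rpm} / 60$.

```python
import math

G = 9.80665  # standard acceleration due to gravity, m s^-2


def rcf(rpm, radius_m):
    """Relative centrifugal force (in multiples of g) from rpm and rotor radius."""
    omega = 2 * math.pi * rpm / 60  # angular velocity in rad s^-1
    return radius_m * omega**2 / G


def rpm_for_rcf(target_rcf, radius_m):
    """rpm needed to reach target_rcf on a rotor of the given radius."""
    omega = math.sqrt(target_rcf * G / radius_m)
    return omega * 60 / (2 * math.pi)


# Illustrative radii only: the same 2000 rpm on two different rotors.
small = rcf(2000, 0.10)
large = rcf(2000, 0.20)
print(round(small), round(large))

# Matching the smaller rotor's RCF on the larger one needs a lower rpm.
print(round(rpm_for_rcf(small, 0.20)))
```

Doubling the radius doubles the RCF at fixed rpm, but because RCF scales with $\omega^2$ the rpm only needs to drop by a factor of $\sqrt{2}$ to compensate.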

## Characterisation of what’s been washed away during washing/purification procedures

This is obvious to me but no one does it. If you have to centrifuge, decant the supernatant off the sediment, then wash the product multiple times, what are you getting rid of? I want to know.

## Representative electron microscopy

I want big, high resolution images with good contrast. Close ups of particles of interest are fine but I also want to see lower magnification shots showing representative samples of the product. A paper claiming to make nanoparticle X but only one image showing just a few of X? Then I won’t bother trying to reproduce it. Electron microscopy leads me on to…

## Histograms!

I love histograms and I want to see them showing size distributions of your product. At least 50 particle measurements, preferably a hundred or more. Rather than simply stating that your product “is monodisperse”, actually give statistical data like the standard deviation to back up your claims. I also like seeing ratios of shape X to shape Y. If you used ImageJ to automatically measure the particles, then include what algorithm and parameters you used so that others can reproduce it.

## Papers that do what they claim

If you claim to make nanoparticle X, you should be making at least 75% X by mass. I don’t have a problem with other crud as long as it’s a minority product and you acknowledge that it’s there.

I’d be a very happy boy if I got all of these granted…

## Nature Chemistry Blogroll: Exposing fraud

Every month Nature Chemistry’s Blogroll column features interesting posts from the chemistry blogosphere. I wrote the column for the November 2013 issue, titled Exposing Fraud. Despite having to submit the copy by 16th September for publication on 24th October, the theme turned out to be quite timely, coinciding with the publication of ACS Nano’s editorial Be Critical But Fair. “The best way to avoid potential academic fraud is through rigorous peer review”—it’s a way for sure, but the best way? I’m not convinced.

## The size of the challenge for OPV

In a recent paper titled Green chemistry for organic solar cells, published (open access) in Energy and Environmental Science, the authors Burke and Lipomi ask an interesting question: how much organic semiconductor do we need in order to make a sizeable dent in global energy consumption with organic photovoltaics (OPV)?

[I]f 10 TW of the 30 TW of power demanded in the year 2050 is to be generated by photovoltaics, and if organics account for 500 GW to 5 TW, then 10–100 kilotonnes of organic semiconductors will be required… given an average solar flux of 200 W m$^{-2}$, a module efficiency of 5 %, a typical thickness of the active layer of 200 nm, and a density 1000 kg m$^{-3}$. This extremely rough estimate assumes 100 % yield of working modules, no waste in the coating processes, and infinite lifetime of devices.
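The arithmetic behind that estimate is easy to check. The sketch below is my own reconstruction from the numbers in the quote, not the authors' calculation; the function and parameter names are mine.

```python
def semiconductor_mass_tonnes(power_w, flux_w_m2=200.0, efficiency=0.05,
                              thickness_m=200e-9, density_kg_m3=1000.0):
    """Mass of active-layer material needed to generate power_w, in tonnes."""
    power_per_area = flux_w_m2 * efficiency      # W generated per m^2 of module
    area_m2 = power_w / power_per_area           # total module area required
    mass_per_area = thickness_m * density_kg_m3  # kg of active layer per m^2
    return area_m2 * mass_per_area / 1000.0      # kg -> tonnes


low = semiconductor_mass_tonnes(500e9)  # 500 GW
high = semiconductor_mass_tonnes(5e12)  # 5 TW
print(f"{low:,.0f} to {high:,.0f} tonnes")
```

At 5 % efficiency each square metre yields 10 W, so 500 GW needs 5 × 10$^{10}$ m$^2$ of module. Each square metre carries 0.2 g of active layer, which gives 10 kilotonnes at the low end and 100 kilotonnes at the high end, in line with the quoted range.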

To put 10,000–100,000 tonnes in perspective, polyethylene is the most common plastic with an annual production of 80 million tonnes (according to Wikipedia, but it sounds reasonable). Taking the higher end of their rough estimate, we’re talking 0.1 million tonnes over nearly 40 years—basically nothing compared to commodity polymers. But organic semiconductors have more complicated structures than commodity polymers and require more complicated chemistry. Burke and Lipomi also compare their estimate with pharmaceuticals, with their figure “2–3 orders of magnitude greater than those of top-selling small-molecule drugs of similar structural complexity”.

Making this amount of high quality polymer—with a specific molecular weight and acceptable PDI, high purity and low batch-to-batch variation—is a challenge. Material currently on the market is terrible, meeting none of these requirements.

Both conjugated polymers and structurally complex drugs require synthetic sequences of 5–10 steps to produce. The multi-tonne synthesis of conjugated polymers will be a challenge in process chemistry with few precedents, and will in consequence the materials that could be seriously considered for installations that cover many square kilometers.

The challenge is made even greater by the need for it to be cheap because if OPV isn’t cheap, it’s commercially inviable. If OPV is to stand a chance researchers must remember that they’re working on a technology that has to cost <\$10 m$^{-2}$ to compete with fossil fuels,[^price] which Burke and Lipomi point out is roughly the same price as carpet.

No matter how efficient your polymer, if its synthesis can’t be scaled up or is too expensive (e.g. PCDTBT), it will fail. It worries me a little that this is lost in the quest for high device efficiency (because that’s what makes for a HIGH IMPACT[^IMPACT] paper). Taking a “not my problem” approach is unacceptable. We must stop passing the responsibility off to process chemists and engineers and instead remember the scale of the problem that OPV is trying to solve, otherwise it’ll never make the slightest dent in our global energy consumption.

[^IMPACT]: Just to be clear, I’m being flippant.

[^price]: Burke and Lipomi get this price from Lewis and Nocera.

## Correcting the literature

Mathias Brust in Chemistry World:

Ideally, science ought to be self-correcting. … In general, once a new phenomenon has been described in print, it is almost never challenged unless contradicting direct experimental evidence is produced. Thus, it is almost certain that a substantial body of less topical but equally false material remains archived in the scientific literature, some of it perhaps forever.

Philip Moriarty expresses similar concern in a post at Physics Focus. Openly criticising other scientists’ work is generally frowned upon—flaws in the literature are “someone else’s problem”. Erroneous papers sit in the scientific record, accumulating a few citations. Moriarty thinks this is a problem because bibliometrics are (unfortunately) used to assess the performance of scientists.

I think this is a problem too, although for a different reason. During my MRes I wasted a lot of time trying to replicate a nanoparticle synthesis that I’m now convinced is totally wrong. Published in June 2011, it now has five citations according to Web of Knowledge. I blogged about it and asked what I should do. The overall response was to email the authors but in the end I didn’t bother. I wanted to cut my losses and move on. But it still really bugs me that other people could be wasting their limited time and money trying to repeat it when all along it’s (probably) total crap.

I did take my commenters’ advice and email an author about another reaction that has turned out to be a “bit of an art”. (Pro tip: if someone tells you a procedure is a bit of an art, find a different procedure.) I asked some questions about a particular procedure and quoted a couple of contradictions in their papers, asking for clarification/correction. His responses were unhelpful and after a couple of exchanges he stopped replying. Unlike the first case, I don’t believe the results are flat out wrong. Instead I suspect a few experimental details are missing or they don’t really know what happens. I think I’ll get to the bottom of it eventually, but it’s frustrating.

What are your options if you can’t replicate something or think it’s wrong? I can think of four (excluding doing nothing):

1. Email the corresponding author. They don’t have an incentive to take it seriously. You are ignored.

2. Email the journal editor. Again, unless they’re receiving a lot of emails, what incentive does the journal have to take it seriously? I suspect you’d be referred to the authors.

3. Try and publish a rebuttal. Can you imagine the amount of work this would entail? Last time I checked, research proposals don’t get funded to disprove papers. This is only really a viable option if it’s something huge, e.g. arsenic life.

4. Take to the Internet. Scientists, being irritatingly conservative, think you’re crazy. Potentially career damaging.

With these options, science is hardly self-correcting. I’d like to see a fifth: a proper mechanism for post-publication review. Somewhere it’s academically acceptable to ask questions and present counter results. I think discussion should be public (otherwise authors have little incentive to be involved) and comments signed (to discourage people from writing total nonsense). Publishers could easily integrate such a system into their web sites.

Do you think this would work? Would you use it? This does raise another question: should science try and be self-correcting at all?

Thanks to Adrian for bringing Mathias Brust’s article to my attention.

## Routine operations

On Friday I went to a talk by Steven Ley titled Going with the Flow: Enabling Technologies for Molecule Makers. His group at Cambridge have done a lot of impressive work on flow chemistry over many years, both developing the technology and using it to synthesise organic molecules.

He covered a lot of ground in the talk, but one of his main points was that it is “unsustainable to use people for routine operations”. Chemists train for 10 years to then stand in front of a fume hood running columns. Ley wants to develop tools that allow researchers to make better use of their time in the laboratory. Flow chemistry has many benefits over batch chemistry, one of them being that it is easy to automate.

His talk left me wondering where I’m particularly inefficient in the lab. Sample collection and recording absorption spectra are particularly time consuming. Last year I started to build an (Arduino-powered) automatic sample collector, but made it far too complicated and never finished it. Now I’ve drastically simplified it (to the design my supervisor said I should use in the first place, as he often likes to remind me) and hope to have it working by the end of next week. I reckon it could save me anywhere between 5 and 10 hours a week of standing around swapping vials. I’m also going to make a start on recording absorption spectra inline. Again, this will save me a few hours a week, leaving me to do something more valuable.

I completely agree with Ley about the benefits of flow chemistry, but you can’t ignore that all this equipment costs money. Ley’s group use a lot of commercially available equipment and it’s not cheap. In my group, we build a lot of apparatus ourselves because we can tailor it to our needs and it’s a lot more “hackable” (as well as cheaper).

Someone in the audience tried to make the point during questions that funding is tight, especially for those working in organic synthesis. How are they meant to afford equipment like £40,000 inline infrared spectrometers? Ley didn’t really answer this question (and I’m not sure he can). He’s obviously very well funded so he can build and develop the “lab of the future”.[^1] A lot of this technology might be out of the budget of the chemists who will benefit from it the most. Unfortunately they might be performing “routine operations” for some time to come.

[^1]: M.D. Hopkin, I.R. Baxendale, S.V. Ley, Chim. Oggi./Chemistry Today, 2011, 29, 28-32.

## Details matter

Blog Syn is a new chemistry blog where chemists post their attempts to reproduce reactions from the literature. Each post starts with the following disclaimer:

The following experiments do not constitute rigorous peer review, but rather illustrate typical yields obtained and observations gleaned by trained synthetic chemists attempting to reproduce literature procedures…

I disagree completely. What could be more rigorous than actually trying a reaction?

So far there are three posts. The first gave a lower yield than reported. The second was “moderately reproducible”. The paper omitted details essential to the reaction’s success. The third was “difficult to reproduce” and is well worth reading—there’s a great response from one of the authors, Prof. Phil Baran.

It’s unacceptable for anyone to publish a paper without all the information necessary to replicate the results. It wastes researchers’ time and money. I’ve written before about my difficulties trying to replicate results. It’s infuriating. How do papers like this slip through peer review?

I suspect some authors don’t really know why a reaction gives a particular product, especially in nanoparticle synthesis. They manage to pull something off a few times and publish their findings, but (unknowingly) neglect parameters crucial for other researchers to be able to reproduce it. It could be something seemingly trivial, like the method used to wash the glassware. The next researcher does it differently because it’s not mentioned in the paper and gets a different result.

The only way to deal with this is for reviewers to demand thorough experimental sections. (But to do so they must have a good understanding of typical experimental procedures. This is a problem if your reviewer hasn’t been in the lab for years.)

An alternative scenario could be that the researchers, in the early stages of the work, find that doing X doesn’t work. Later they find doing Y does work. Y gets published. X stays in the laboratory notebook.

X is a negative result. On its own, it’s not very useful. Loads of attempted reactions don’t work. But in the context of the positive result (i.e. the paper) the negative result is actually very valuable to anyone who wants to repeat the paper. Serious consideration should be given to including such results in the supplementary information.

Experimental methods are grossly oversimplified. We like things to be elegant and simple, but chemistry is complicated. There’s no excuse not to include more information because everything is published online and space constraints aren’t a problem.

Blog Syn shows that subtleties in chemistry are important. We should all acknowledge that in our own papers and demand that others do the same.

## Tools and technologies for researchers

The Library at Imperial run a course called Blogs, Twitter, wikis and other web-based tools. They asked me (and also Jon Tennant) to give a quick talk to the attendees yesterday on the things I use to do my work.

Rather than give a slide-based presentation I decided the best thing to do was give a demo. I quite like mind mapping to help me structure ideas so I made one for this. I’ve included links to web sites where appropriate. You can download a PDF of the mind map here (PDF).

It’s split into two halves: the tools that I do use, categorised into “inputs” (e.g. Twitter and RSS) and “outputs” (e.g. Google Drive), and those that I don’t with some short reasons why. If you’re interested in trying some of this out, give one or two a go and see if you find them useful. If you use something that I haven’t mentioned, let me know in the comments.