Author Interview with Alex Clark
Two weeks ago, we published “Fast and accurate semantic annotation of bioassays exploiting a hybrid of machine learning and user confirmation”. In this study, Alex Clark and his colleagues describe a hybrid machine learning / interactive method for marking up bioassay data. Alex shared his “very positive experience” of submitting to us on his blog so we got in touch with him, as we wanted to hear more about his work.
PJ: Can you tell us a bit about yourself?
AC: I grew up in New Zealand, and migrated abroad as scientists tend to do, living in the United States initially before settling in Canada. As a kid, I became fascinated with computer programming, but quickly realized that I did not want to make that my only profession, and so went to university and eventually ended up with a doctorate in chemistry in 1999. Pursuing a career that involves equal parts science and software engineering has been quite the balancing act, and in 2010 entrepreneur was added to my list of day jobs when I founded Molecular Materials Informatics, Inc., which is dedicated to helping bring chemical informatics into the modern software era. The most visible products to date are a variety of chemistry themed mobile apps for Apple iOS and Google Android devices, though there are a number of advanced original algorithms keeping things moving under the hood.
While many of the projects that I work on are exclusive to my own company, about half of them are collaborative in nature, involving joint efforts with individuals and companies, such as Collaborative Drug Discovery, Inc.
PJ: Can you briefly explain the research you published in PeerJ?
AC: The research addresses the fact that when scientists set up a new screening experiment for testing small molecules for bioactivity, they document the details using plain scientific English text. This is a problem for informaticians who would like to create software capable of analyzing activity measurements: progress is rate-limited by the inability of a computer to determine whether two screening configurations are measuring the same thing. The solution is to express the experiments using semantic markup, where the important properties such as disease target, cell line, protein, measurement type, reference controls, etc., are annotated using a consistent scheme.
Past efforts to solve this problem have mainly focused on either fully automated parsing of text, or completely manual user-operated markup. Unfortunately, the former tends to have an unacceptable error rate, while the latter consumes far too much of a scientist’s time. We built a proof of concept software tool that splits the difference between the two extremes: taking the best of automated text-to-markup machine learning in order to get the right answer most of the time, while keeping the user in the loop to confirm the automated assignments when they are correct, and to step in when they are not. In this way scientists can make their experiment descriptions useful to informatics software with just a few minutes of their time per experiment – a burden which decreases as more annotations are collected, because each confirmed annotation improves the quality of the training sets.
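The suggest-then-confirm loop described above can be illustrated with a minimal Python sketch. This is not the tool from the paper: the class `AnnotationSuggester`, the function `review`, the toy word-count scoring, and the example assay texts and annotation labels are all hypothetical, chosen only to show how confirmed annotations could feed back into the training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a description and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class AnnotationSuggester:
    """Toy word-count model (hypothetical) that ranks known annotations."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # annotation -> word -> assays
        self.doc_counts = Counter()              # annotation -> assays seen
        self.vocab = set()
        self.total_docs = 0

    def add_example(self, text, annotations):
        """Record one assay description with its confirmed annotations."""
        words = set(tokenize(text))
        self.vocab |= words
        self.total_docs += 1
        for annotation in annotations:
            self.doc_counts[annotation] += 1
            self.word_counts[annotation].update(words)

    def score(self, text, annotation):
        """Laplace-smoothed log-score over the known words in the text."""
        n = self.doc_counts[annotation]
        log_score = math.log((n + 1) / (self.total_docs + 2))
        for word in set(tokenize(text)) & self.vocab:
            seen = self.word_counts[annotation][word]
            log_score += math.log((seen + 1) / (n + 2))
        return log_score

    def suggest(self, text, top_n=3):
        """Return the top_n highest-scoring annotations for the text."""
        ranked = sorted(self.doc_counts,
                        key=lambda a: self.score(text, a),
                        reverse=True)
        return ranked[:top_n]


def review(text, suggester, confirm, top_n=3):
    """Keep the suggestions the user approves and learn from them."""
    accepted = [a for a in suggester.suggest(text, top_n) if confirm(a)]
    suggester.add_example(text, accepted)  # confirmed results extend training
    return accepted


# Hypothetical training data: two previously annotated assay descriptions.
suggester = AnnotationSuggester()
suggester.add_example(
    "Inhibition of HIV protease measured by fluorescence",
    ["target: HIV protease", "readout: fluorescence"],
)
suggester.add_example(
    "Cell viability in HeLa cells measured by luminescence",
    ["cell line: HeLa", "readout: luminescence"],
)

new_text = "HeLa cells viability luminescence assay"
suggestions = suggester.suggest(new_text, top_n=2)

# The lambda stands in for a scientist clicking "approve" or "reject".
accepted = review(new_text, suggester,
                  confirm=lambda a: a.startswith("cell line"), top_n=2)
print(suggestions, accepted)
```

In this sketch the `confirm` callback is where the user interface would sit: the model only proposes, and nothing enters the training data without approval.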
PJ: Do you have any anecdotes about this research?
AC: Once we built the initial prototype and had it working well in our hands, we demonstrated it to a number of colleagues in the industry. One of the many things we learned is that recognition of the problem is widespread: it seems like every organization that has collected a significant number of textual assay descriptions is well aware of the limitations, and many have already looked into trying to find a solution. I’m accustomed to having to provide a reasonably thorough introduction to why a problem is important and why current solutions are not as good as they could be, but in this case that part was pretty much taken as given.
PJ: What surprised you the most with these results?
AC: First of all, that the first approach I tried (black-box natural language processing followed by Bayesian analysis) worked well. And secondly, as I mentioned before, that explaining the need for this research was relatively easy, due to high awareness of the importance of the problem and of the fact that it remains largely unsolved.
PJ: What kinds of lessons do you hope the public takes away from the research?
AC: That writing up an experiment in human readable text is only the first half of the exercise. To be fully useful, documentation has to be processed into a form that computers can use for precise searching, categorization and large-scale decision support informatics. If your data remains as words and arbitrary diagrams, it will remain just an isolated data point that will only ever be read by a handful of other humans. If it is machine readable, it will be able to influence every relevant scientific decision that follows. The research seeks to demonstrate that by balancing the best of natural language processing and the best of user interface design, it is possible to reduce the amount of time a scientist needs to invest in this process to a nominal commitment that is quickly paid back in terms of new capabilities.
PJ: Where do you hope to go from here?
AC: We intend to upgrade the prototype into a modular web interface that can be plugged into a number of data entry systems, starting with CDD Vault. As users annotate more bioassay descriptions with semantic terminology, the training set will continue to improve. As domain coverage increases, the likelihood that an assay can be marked up very quickly increases, i.e. the user just approves all the predicted annotations, rather than having to hunt through and dig them out. As the data grows, the capabilities that we can build on top of it grow too: being able to search for assay properties, or compare assays for similarity, are immediate examples, but there are larger scale options too: once the marked up data becomes prevalent, analysis software can observe trends over the entire domain of drug discovery, revealing patterns that might have otherwise been very difficult to spot.
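As a rough illustration of the two immediate examples mentioned above, searching by assay property and comparing assays for similarity, here is a minimal Python sketch. It assumes each assay is stored as a set of (property, value) pairs and uses Jaccard overlap as the similarity measure; both choices, along with the function names and the sample data, are assumptions made for illustration and are not taken from the paper.

```python
def jaccard_similarity(annotations_a, annotations_b):
    """Shared annotations divided by all annotations across both assays."""
    a, b = set(annotations_a), set(annotations_b)
    union = a | b
    # Two assays with no annotations at all give no evidence of similarity.
    return len(a & b) / len(union) if union else 0.0


def find_assays(assays, prop, value):
    """Return the names of assays annotated with the given property value."""
    return [name for name, annotations in assays.items()
            if (prop, value) in annotations]


# Hypothetical annotated assays, each a set of (property, value) pairs.
assays = {
    "assay_1": {("cell line", "HeLa"), ("readout", "luminescence")},
    "assay_2": {("cell line", "HeLa"), ("readout", "fluorescence")},
    "assay_3": {("target", "HIV protease"), ("readout", "fluorescence")},
}

fluorescence_assays = find_assays(assays, "readout", "fluorescence")
similarity_1_2 = jaccard_similarity(assays["assay_1"], assays["assay_2"])
similarity_1_3 = jaccard_similarity(assays["assay_1"], assays["assay_3"])
print(fluorescence_assays, similarity_1_2, similarity_1_3)
```

Neither operation is possible on free text alone, since exact matching on (property, value) pairs only works once the descriptions share a consistent annotation scheme.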
PJ: If you had unlimited resources, what study would you run?
AC: First of all, I would hire enough expert professionals to painstakingly annotate every biological assay ever written down, and thus create an exhaustive training set. Then I would commission every creator of data entry software for biological content to make use of the annotation interface, so that all lab notebook software would provide scientists with the opportunity to conveniently describe their bioassays in a machine-friendly format.
PJ: Why did you choose to reproduce the complete peer-review history of your article?
AC: The reviews were thoughtful and constructive, and I saw no reason to keep them private. And since the first reviewer had taken the first step and made her identity known to us, it only seemed fair.
PJ: How did you first hear about PeerJ, and what persuaded you to submit to us?
AC: I heard about it before it went live, on a blog or a tweet, I forget which. The PeerJ decision was a combination of moral and financial reasons: I am personally a gigantic fan of open access scientific literature, since the peer review process is essentially an exercise in crowd sourcing, and the whole point of science is to be open. Unfortunately the current breed of scientific publishers has carried over its legacy cost structure from the dead-tree era, which means that scientists have a choice between reader-pays and author-pays, and the fees involved can be prohibitive to many. That system works fine if everyone who is involved in creating or consuming science is rolling around in excess grant money, but that certainly does not describe all of us. PeerJ could be summed up as bringing the lean startup technology movement to scientific publishing. In my opinion it’s not a moment too soon, and I welcome the opportunity to play a small part.
PJ: Do you have any anecdotes about your overall experience with us? Anything surprising?
AC: In context it is not surprising, but the responsiveness of the staff took some getting used to: having an email conversation with an identifiable person on the other end is unusual for scientific publishing. I tend to expect to receive automated messages from firstname.lastname@example.org whenever I have an inquiry. The manuscript submission process gave me the overall impression that the journal was a partner with an interest in making this happen smoothly, rather than a system that is indifferent to my contribution.
PJ: How would you describe your experience of our submission/review process?
AC: The website for receiving submissions is very well designed. Given that it is quite detailed, it is no surprise that there are one or two ambiguities, but as long as the staff keep paying attention and iterating, I am optimistic that the design will reduce a significant amount of manual labour for both the authors and publishers. Also, the peer review was done very promptly, and no less thoroughly for its quick turnaround.
PJ: Did you get any comments from your colleagues about your publication with PeerJ?
AC: Just the usual congratulatory encouragement, but it’s early days yet.
PJ: Would you submit again, and would you recommend that your colleagues submit?
AC: I would ideally like to make PeerJ my go-to journal, but there are no categories for chemistry or cheminformatics, which means I can only use it when I occasionally venture out into bioinformatics. I look forward to the day when PeerJ branches out in the direction of chemistry, and/or when other publishing startups recognize that the PeerJ business model is successful and rush to fill all of the other vacant niches.
PJ: Anything else you would like to talk about?
AC: I wish the company every success. In terms of the bigger picture, it’s really important that PeerJ can successfully demonstrate that its lean business model works, because the contemporary journal fee structures are keeping authors and readers out of science. This is indefensible in the information age, but somebody has to step up and show that there is a better way.
PJ: In conclusion, how would you describe PeerJ in three words?
AC: Disrupting scientific publishing.
PJ: Many thanks for your time!
AC: My pleasure.