4
Would a virtual machine be a useful link for this paper?

Thanks to the excellent instructions in this paper and the open source code, I was able to get the PeptideBuilder examples running on my Ubuntu machine. I did have to change the first line of runUnitTests.sh to

#!/bin/bash

Also, as noted in the paper, I needed the biopython library, which I obtained with pip:

pip install biopython # using sudo
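
A quick way to confirm that the install worked, before running any of the examples, is to check whether the package can be found on the import path. This is only a minimal sketch: the helper name `is_installed` is made up for illustration, it assumes Python 3.4 or newer, and the only thing it relies on about Biopython is that it installs under the top-level package name `Bio`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Biopython installs under the top-level package name "Bio".
    if is_installed("Bio"):
        print("biopython is available")
    else:
        print("biopython is missing; try: pip install biopython")
```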

The unit tests (so far) fail for me, for unknown reasons, but the evaluation.py code seems to produce structures that look good with PyMOL.

My question is a bit out there: I'm wondering if the authors or anyone else think it would be useful for me to post a link to a VirtualBox virtual machine that runs the code correctly (once I figure out the tweaks)? That way users could get around some of the quirks that arise from different linux installs and more quickly get to exploring and using PeptideBuilder? What do you think?

1

Note: I see now that the unit tests are failing for a simple reason: the new test files have an extra line at the end that is simply "END"
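
If the extra "END" line really is the only difference, one way to check is to compare the old and new files while ignoring that record. The sketch below is an illustration only and is not part of PeptideBuilder or its test script; the function names and the tiny sample strings are hypothetical.

```python
def normalize_pdb_lines(text):
    """Split PDB text into lines, dropping blank lines and bare END records."""
    lines = []
    for line in text.splitlines():
        stripped = line.rstrip()
        # Skip empty lines and a line that is exactly "END";
        # other records such as ENDMDL are kept.
        if not stripped or stripped == "END":
            continue
        lines.append(stripped)
    return lines


def same_structure_text(expected, actual):
    """Return True if two PDB texts match once END records are ignored."""
    return normalize_pdb_lines(expected) == normalize_pdb_lines(actual)


if __name__ == "__main__":
    # Hypothetical sample content, not taken from the real test files.
    old_text = "ATOM      1  N   GLY A   1\nTER\n"
    new_text = old_text + "END\n"
    print(same_structure_text(old_text, new_text))
```

A plain byte-for-byte comparison would report these two files as different, which matches the failure described above.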

4 Answers
2
Accepted answer

I think that virtual machines could be a good answer to the problem of reproducibility in science, given that the use of complex software is increasing and that package dependencies, slightly different versions, and other issues I can't imagine now could be crucial. But reproducibility is about reproducing a piece of work, so virtual machines could also serve as a reference for working out why the software doesn't work in another setup. There should also be a permanent repository for virtual machines; I think Zenodo could work. In summary, I think virtual machines are a good idea for open science and reproducibility!

2

Thank you for your perspective! I think I am going to give it a whirl and see what happens. I'm going to use our Institutional Repository as a start, since I know that's quick. I can imagine issues emerging, but rather than speculating, I think just trying it out and seeing what problems arise is a good step to take.

1

Just an update: I have a virtual machine ready to go and have run into my first problem. It's too big to post anywhere! Because of the size (5-6 GB), it will take a few days to weeks to get it linked on my own university repository. It's too big for Zenodo and Figshare, though both say they can make exceptions. I've sent an email to Figshare asking for an exception, and am also working with our expert at the Library to see if I can get it on the repository sooner. I'm also considering rebuilding it with a smaller virtual disk--that wouldn't take too much time, but time that I don't have at the moment. And this is just the first of many problems I expect to encounter with this experiment :)

1

QIIME is also distributing a ready-to-run VirtualBox image, and we've had good experience running it with people using various OSes and architectures. Definitely an option to consider for big ambitious projects.

0

@Timothée Poisot that looks really useful and impressive! I can see that some (all?) of the previous VMs are available at ftp://thebeast.colorado.edu/pub/qiime-release-VMs/ . Is there any effort to preserve those VMs? The ability to cite and link to the specific VM image when publishing would be great for reproducibility. In the case of QIIME it's not too much data (tens of GB per year, it looks like). On the other hand, if everyone were as supportive as your team, that would be a lot of VMs to preserve. This problem of scale is just one of probably many problems with my VM experiment on this Q&A. I can envision a solution, which would be to start with common VM images, accompanied by installation scripts that refer to permanent URLs, but I don't really know how to do this. Something like Vagrant would probably be involved, but like I said, I don't know how to implement it :)

1

Just came across these comments/questions about the QIIME VM (I'm one of the core QIIME devs, have been involved with the project since the beginning). We do plan to continue to host the old ones, and we do the same with our Amazon Virtual Machines, but something that was guaranteed to be longer term than a university server would be ideal.

I was skeptical about sharing QIIME via Virtual Machines at first - I didn't think people would use them, and they are a fair amount of work to build and test at each release - but they've been very popular, and I'm sold on their usefulness at this point. For software with complex dependencies (like QIIME) VMs are an excellent way to let new users try out the system without a big upfront investment in installation.

To go even one step further, if all bioinformatics steps in an analysis were run in an IPython Notebook in a Virtual Machine, and then the IPython Notebook and VM were published, you'd essentially have fully reproducible bioinformatics methods. We talked about this idea in an ISME commentary and it's definitely something I'm striving for in my analyses these days!

1
Accepted answer

Virtual machines for scientific computing are incredibly useful! Docker (http://www.docker.com) is gaining momentum in both industry and academia. We are looking at being able to support this as a general solution at NERSC for the genomics users. For a cool example of how this facilitates reproducibility in scientific research, as well as a good platform for comparing solution methods, check out http://nucleotid.es - a site dedicated to the evaluation of genome assemblers through Docker containers.

0

Thanks! I think I've heard Docker mentioned a few times over the past year, but you finally inspired me to look into it today. Docker would have been much better than the VM I created, for at least three reasons: (1) the Docker image would be much smaller, seemingly more than 10x smaller; (2) distributing via DockerHub is much simpler than requiring a VM player; and (3) instead of publishing a whole image, I think a paper like this one could simply publish a Dockerfile with a few instructions to build the image from an existing Ubuntu image. In fact, I should probably give that a whirl and update this thread!

1

Those are excellent points! Let us know how the Docker container creation goes for you... I should have mentioned DockerHub in my post, glad you mentioned it!

1
Accepted answer

I'm glad you figured out why the unit tests didn't work. This is probably due to a change in the underlying biopython code. We don't write the PDB files ourselves; we just use the biopython PDB framework for that. I'm surprised you had to use #!/bin/bash though. Normally all Unix installs should have /bin/sh. On my system, it's set as a symlink to /bin/bash.

1

I'm using Ubuntu 12.04 LTS and I do have /bin/sh. However, the file on github had the line:

#!/usr/bin/sh

I'm not a power user, but I don't have bash or sh in the /usr/bin directory, nor symlinks to them. An easy fix, of course. Did you have an opinion on linking to a virtual machine that is "ready to go"? (Sorry I don't have the reputation points to upvote your answer :) )

0

Oh, it's our fault then. The line should be #!/bin/sh for maximum compatibility. I'll fix it in the github repository.
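
For anyone who runs into the same thing on another install, the mismatch discussed above can be checked by reading the script's first line and testing whether the named interpreter exists on the local system. This is a rough sketch with made-up helper names, not something shipped with PeptideBuilder; it assumes the script sits in the current directory under the name runUnitTests.sh.

```python
import os


def shebang_interpreter(script_path):
    """Return the interpreter path named on a script's first line, or None."""
    with open(script_path, "r") as handle:
        first_line = handle.readline().strip()
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    return parts[0] if parts else None


def shebang_is_valid(script_path):
    """Return True if the script names an interpreter that exists locally."""
    interpreter = shebang_interpreter(script_path)
    return interpreter is not None and os.path.exists(interpreter)


if __name__ == "__main__":
    script = "runUnitTests.sh"  # assumed to be in the current directory
    if os.path.exists(script):
        print(shebang_interpreter(script), shebang_is_valid(script))
    else:
        print("script not found:", script)
```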

I'm not sure the virtual machine would be useful. It seems more work getting that to work than installing this relatively straightforward python script. However, others may have a different opinion, so if you want to go ahead, sure why not.

1

I agree with you that you've done excellent work and there really isn't any problem to address. I'm thinking of this as more of a test case for future cases that may not be as straightforward. I won't try it out unless I think it may benefit someone. For example, it might help if I set up the virtual machine with PyMOL visualization scripts and some other things that some people (like me) weren't very familiar with. I definitely don't want to do anything that you deem a distraction, but on the other hand, it may drive a little traffic.

1
Accepted answer

I have created a virtual machine that is now available from UNM's institutional repository, LoboVault: virtual machine. I've also added this under the "links" section of this article.

As discussed on this question, I hope that this may be helpful to those reproducing the work described in the PeptideBuilder paper, perhaps even accelerating the re-use and citation of Tien et al.'s work. I am also very interested in feedback from anyone who tries out the virtual machine! Is it useful? Should people consider including virtual machines along with computational work in some cases?

I am aware that my virtual machine is not even close to optimal -- I am just hoping I've done enough to be useful and to spark conversation about the future use of VMs to promote reproducibility of research.

Thank you to the authors of the PeerJ paper for responding to my original question and allowing me to post the VM!
