The abstract workflow describes the steps in a manner that is independent of the software used to implement them. Third, we published the workflow and all of its constituents, including input and output data, software, and scripts for the steps, as Linked Data, which means that each constituent of the workflow can be accessed by its URI through HTTP, and its properties are described using W3C RDF standards.
This means that the published workflow is accessible over the Web, in a way that does not require figuring out how to access institutional catalogs or file systems. With this maximally open form of publication, the effort that we invested in reproducing the workflow does not have to be incurred by others. Each step, with its inputs and outputs, is explicitly and separately represented and linked to the workflow. The software for each step is available, as are the intermediate and final results.
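To make the idea concrete, the sketch below shows what describing one workflow step as Linked Data could look like: each constituent gets a URI, and its relationships to the workflow, its inputs, outputs, and software are stated as triples. The namespace, vocabulary, and step names here are hypothetical illustrations, not the actual RDF published with the workflow.

```python
# Minimal sketch of describing one workflow step as Linked Data.
# The URIs, vocabulary, and step names are hypothetical, not the
# authors' actual published RDF.

def triple(s, p, o):
    """Format one RDF triple in N-Triples-like syntax."""
    return f"<{s}> <{p}> <{o}> ."

BASE = "http://example.org/workflow"  # hypothetical namespace
WF = f"{BASE}/tb-drugome"
STEP = f"{WF}/step/structure-alignment"

# Each property links the step to another URI-addressable constituent,
# so a client can dereference any of them over HTTP.
triples = [
    triple(STEP, f"{BASE}/ns#partOfWorkflow", WF),
    triple(STEP, f"{BASE}/ns#hasInput", f"{WF}/data/protein-structures"),
    triple(STEP, f"{BASE}/ns#hasOutput", f"{WF}/data/alignment-scores"),
    triple(STEP, f"{BASE}/ns#usesSoftware", f"{WF}/software/aligner-v1"),
]

for t in triples:
    print(t)
```

Because every constituent is named by a URI and described with standard properties, a reader can start from the workflow URI and follow links to any step, dataset, or software component.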
The effort involved in creating a workflow is negligible compared with the time to implement the computational method. Implementing the computational method typically takes months and involves activities such as finding software packages that implement some of the steps and figuring out how to set up the software. Once this is all done, creating the workflow can be done in a few hours, and can be as simple as wrapping each step so it can be invoked as a software component and expressing the dataflow among the components.
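The wrapping-and-dataflow idea can be sketched in a few lines: each step becomes a callable component, and the dataflow is expressed by feeding each step's output to the next. The step names and the linear pipeline below are illustrative placeholders, not the actual TB drugome method.

```python
# A minimal sketch of wrapping steps as components and expressing
# the dataflow among them. Step names and logic are hypothetical.

def load_structures(path):
    """Step 1: read input data (stubbed for illustration)."""
    return [f"structure-{i}" for i in range(3)]

def align(structures):
    """Step 2: transform the output of step 1."""
    return [(s, len(s)) for s in structures]

def report(scores):
    """Step 3: summarize the results."""
    return {name: score for name, score in scores}

# The dataflow is explicit: each step consumes the previous step's
# output, so the whole method can be re-run end to end.
PIPELINE = [load_structures, align, report]

def run(pipeline, initial_input):
    data = initial_input
    for step in pipeline:
        data = step(data)
    return data

result = run(PIPELINE, "structures/")
```

Real workflow systems add provenance capture, distributed execution, and publication on top of this basic pattern, but the wrapping effort for the author is of this order.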
Learning to create simple workflows requires only a few hours, though more advanced capabilities clearly require additional time investment. Similarly, publishing workflows takes no effort at all, since the workflow system takes care of the publication. Technical details on how the workflow is published can be found in . All the materials related to the workflow and its execution results have been published online. Additionally, the input and output datasets have been associated with DOIs and uploaded to a persistent data sharing repository.
Reproducibility is considered a cornerstone of the scientific method, and yet scientific research is rarely reproducible without significant effort, if at all. Authors submitting papers know this, as do those reading the papers and trying to reproduce the experiment. For computational work like that described here, where data, methods, and control parameters are all explicitly defined, there is less excuse for not making the work reproducible.
Note that making the software available or accessible through a webserver, while commendable, is not the same as making the work reproducible. Workflows, which define the scientific process as well as all of its components, provide the tools for improved reproducibility. While workflows are commonly used for highly repetitive tasks, they are less used for earlier-stage research. Whether this is a result of shortcomings in the tools or of insufficient emphasis on the need to make work reproducible requires further consideration.
This then raises the further issue of whether the emphasis itself is justified. Do we really care if work is exactly reproducible? This generally becomes important only when some variation of the original work cannot be reproduced at all; only then is the original work fully scrutinized. This speaks to a need for better quantification of what is really needed to improve productivity in science.
When, as is the case here, the experiment is conducted completely in silico, accurately capturing what has transpired becomes a relatively straightforward task. What does doing better imply? We believe it is rare that work is purposely made irreproducible; rather, the system of peer review speaks to reproducibility but is cursory in demanding it. The scientific reward is in publishing another paper, not in making your current paper more reproducible.
Tools help, but changes in policy are also needed. It will be a brave publisher indeed that demands that workflows be deposited with the paper. Publishing, after all, is a business, and if one publisher demands workflows, authors are more likely to publish elsewhere than go to the trouble. Journals are beginning to provide guidelines for reproducibility and minimum requirements for method descriptions. Credit for software comes indirectly, from acknowledgement by the community that it is useful. Perhaps publishing end-to-end methods as workflows would bring similar reputation.
For this to work, authors must be recognized and credited by other researchers reusing their workflow. We posit that the authors of the original method need not be the ones publishing the workflow. Third parties interested in reproducing the method could publish the workflow once reproduced, and get credit not for the method but for the workflow as a reusable software instrument.
In one sense this is no different from taking other scientists' data and developing a database that extends the use of those data to a wider community. It is a value-added service worthy of attention through publication. Federal mandates similar to those emerging around shared data could also be put in place for reproducibility.
In the end, funding for science ultimately comes from public taxes, and we need to be responsible in making science as efficient and productive as possible. Many government agencies already require data to be published and shared with other researchers. Workflows should follow the same path. The recent emphasis on open availability of research products resulting from public funds will eventually include the publication of software and of methods as workflows. This will likely be some time coming, as the easier issue of meaningful data provision is not yet fully understood and solved.
Notwithstanding, even if this remains a difficult issue on a global scale, we can make progress in our own laboratories. A new researcher coming to almost any laboratory and picking up tools used by previous laboratory members can likely testify to what is described in this paper.
If we are to accelerate scientific discovery we must surely do better, both within a laboratory and beyond. This is particularly important in an era of interdisciplinary science, where we often wish to apply methods in which we are not experts. Some would argue that irreproducibility in the laboratory is part of the learning process; we would agree, but with so much to learn that is more relevant to discovery, we should do better now that we have tools to assist us.
Or should we? Reproducibility aside, is there indeed a favorable cost:benefit ratio in using workflows with respect to productivity? There is a dearth of literature that addresses this question. Rather, the value of the workflow is assumed, and different workflow systems on different computer architectures are analyzed for their relative performance. At best, the question can be addressed by looking at work habits.
We must be careful, as such work habits could be mandated, in a large company say, rather than adopted by choice, as would be the case in an independent research laboratory. Creating workflows introduces overhead for exploratory research, where many paths are discarded. However, once created, a workflow can be reused many times.
This makes them ideal for repetitive procedures such as those found in aspects of the pharmaceutical industry, and pharmaceutical companies do use workflows for computational experiments. For an independent computational biology laboratory, as is the case for this study, it is fair to say that workflows are making inroads into daily work habits.
These inroads are still localized to specific subareas of study (Galaxy for high-throughput genomic sequence analysis, KNIME for high-throughput drug screening, and so on), but with that nucleation, and with new applications being added by an open-source-minded community, adoption is increasing. Adoption would suggest a favorable cost:benefit ratio, in that use of a workflow system provides increased productivity over not using such a system. The cost here is measured in time rather than money, since most academic laboratories in computational biology would use free, open-source workflow systems.
Finally, when articles cannot easily be reproduced, the authors are often contacted to clarify or describe additional details. This requires effort that might as well have been invested in writing the article more precisely in the first place. Workflows can also be seen as an important tool for making the research in a lab more rigorous.
Analyses must be captured so they can be inspected by others and errors detected as easily as possible. For example, writing code to transform data makes the transformation inspectable, while using a spreadsheet to do the task makes it much harder to verify that it was done correctly. Ensuring consistency and reproducibility requires more effort without workflows.
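The point about inspectability can be illustrated with a small example: the same unit conversion that might be done by hand in a spreadsheet, expressed as code that can be read, checked, and re-run. The column names and the kDa-to-Da conversion are hypothetical examples, not data from this study.

```python
# A small illustration: a data transformation written as code is
# inspectable and re-runnable, unlike the same edit made by hand
# in a spreadsheet. Column names and values are hypothetical.

import csv
import io

RAW = """name,mass_kda
protein_a,12.5
protein_b,48.0
"""

def to_daltons(rows):
    """Convert molecular mass from kDa to Da, keeping other fields."""
    return [
        {"name": r["name"], "mass_da": float(r["mass_kda"]) * 1000}
        for r in rows
    ]

rows = list(csv.DictReader(io.StringIO(RAW)))
converted = to_daltons(rows)
# Every step of the transformation is visible here and can be
# versioned, reviewed, and re-applied to new data.
```

A reviewer can verify the conversion factor at a glance and re-run the script on corrected input, whereas a manual spreadsheet edit leaves no record of what was done.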
In our own laboratory we find that the workflow can act as a reference: new users familiarize themselves with the various applications more quickly than they could without the benefit of the workflow organization, but may then choose to go on and run applications outside of the workflow system. As workflow systems continue to become easier to use and more intuitive, we anticipate that more work will be done within the workflow system itself, presumably improving productivity. For the practitioner, what are the pluses and minuses of workflow use today?
An obvious minus is the time required to establish the workflow itself. In some sense this is analogous to documenting a procedure for running a set of software programs. But in most cases, once codes are prepared for publication, little additional effort is required to include them in a workflow. The advantage of a workflow is that capturing the steps defines the procedure, and it can then be re-run, in principle, without any further effort.
Virtual machines offer the promise of capturing the complete executable environment for future use; however, they introduce other issues. For example, virtual machines often act as black boxes that allow repeating the experiment verbatim but do not allow any changes to the computational execution pipeline, limiting reproducibility. Furthermore, virtual machines cannot store external dynamic databases accessed at runtime, such as the PDB in our work, because of their size. Such databases are commonly used for experiments in computational biology.
Taken together, it may be that we are at a tipping point of broad workflow adoption, and it will be interesting to review workflow use by the computational biology community two or more years from now. We conclude by summarizing the main observations resulting from our work, leading to the desiderata for reproducibility shown in Table 2 and the set of guidelines for authors shown in Table 3.
We have refrained from drawing too many absolute conclusions from a single instance of applying a workflow to a scientific method. It would be interesting to carry out similar studies in other domains and compare findings. Performed the experiments: DG. Analyzed the data: DG YG.
Tested the automated workflow and reported feedback: YZ.

Abstract
How easy is it to reproduce the results found in a typical computational biology paper?

Introduction
Computation is now an integral part of the biological sciences, either applied as a technique or as a science in its own right: bioinformatics.
Related Work
As stated, scientific articles describe computational methods informally, as the computational aspects of the method may not be the main focus of the article.

Methods and Analysis
Quantifying Reproducibility
We focus on an article that describes a method that lends itself to workflow representation, since others can, in principle, use the same exact procedures.

Methodology
The workflow was reproduced as a joint effort between computer scientists and the original authors of the article.
We considered reproducibility by researchers of four types. REP-AUTHOR is a researcher who did the original work and who may need to reproduce the method to update or extend the published results. It is assumed that the authors have enough backup materials to answer any questions that arise in reconstructing the method.
In practice, some authors may be students who move away from the lab, and their materials and notes may or may not be available, confounding reproducibility. These researchers could reproduce the method even if the methods section of the paper is incomplete and ambiguous. They can use their knowledge of the domain, the software tools, and the process to make very complex inferences from the text and reconstruct the method.
However, there may be some non-trivial inferences that require significant effort. They may be asked to use the method with new data, but are only able to make limited inferences based on analyzing the text and software tools. For them, reproducibility can be very costly, since it may involve a lot of trial and error, or perhaps additional research. In some cases reproducibility may become impossible. They need some programming skills to assemble the software necessary to run the different steps of the method.
They represent researchers from other areas of science with minimal knowledge about biology, students, and even entrepreneurial citizen scientists. Unless the steps of the method are explicitly stated, they would not be able to reproduce the results. They have minimal background knowledge in biology. REP-NOVICE - The computer scientists subsequently consulted the documentation of the software tools mentioned in the article to try to infer how the data were being processed by each step of the method.
Based on this, they refined their initial workflows. REP-AUTHOR - Lastly, the computer scientists approached the original paper authors to ask specific questions, resolve execution failures and errors, and consult concerning the validity of the results for each step. They created the final workflow based on these conversations with the authors.
Conceptual Overview of the Method and Final Workflow
An interesting result of our initial discussions of the method was a collaborative diagram that indicated each of the steps in the method and how data were generated and used by each step.

Figure 1. A high-level dataflow diagram of the TB drugome method.

Figure 2. The reproduced TB drugome workflow with the different subsections highlighted.

Reproducibility Analysis
We now analyze each of the subsections of the method as described in the original paper, discussing the difficulties encountered in reproducing the method, highlighting recommendations to improve reproducibility, and showing reproducibility scores for each step of the final workflow.
Reproducibility Maps
We present reproducibility maps created as a summary of the reproducibility scores for all the major steps in the workflow.

Figure 3. Reproducibility maps of the three major subsections of the workflow.

Productivity and Effort
We kept detailed records in a wiki of the effort involved in reproducing the method throughout the project.

Publishing the Reproduced Workflow
Now that we had invested significant effort in reproducing the workflow, our goal was to maximize its reusability.