How “High-Throughput” Collaboration Improves Speed to Discovery
Over the last couple of decades two bottlenecks slowing down life science research have been successfully addressed. Now a third one challenges progress: the difficulty and complexity of sharing research work. Here is how we can successfully eliminate this new constraint on research productivity.
Data and Analysis Tools – No Longer Limiting
For the longest time, data was the limiting factor in life science research: we couldn’t get enough of it. Sequencing is a good example. Sanger sequencing, with its limited read lengths, was never going to make whole-genome sequencing routine. It took the monumental effort of the Human Genome Project (13 years and $2.7 billion) to get the job done and usher in an era of abundant genomic data. Now the cost of sequencing a whole genome has dropped well below $1,000, and organizations like The Sanger Institute are looking to sequence 500,000 full genomes.
Genomics data, however, is only one of many sources contributing to big data in the biopharma and biotechnology industry: high-throughput and high-content screening, microarrays, and data associated with the full complement of ‘omics all add up. The data explosion in healthcare overall is even more impressive: one 2018 paper estimates the total amount of healthcare data to have reached 40 zettabytes (1 zettabyte = 10²¹ bytes) in 2020.
The next bottleneck was the lack of software tools powerful enough to analyze all that data in a reasonable time frame. Back in the mid-2000s, bioinformatics algorithms frequently took days to crunch through a decent-sized dataset, turning comprehensive data analysis into a weeks- or months-long process.
Limited computing power made this bottleneck worse, and the growing volume of data created a need for more sophisticated algorithms, such as advanced clustering, pattern search, structure analysis and data visualization tools. Now cloud computing allows for storage of large datasets, while powerful instances make rapid analysis with novel, increasingly AI/ML-based algorithms possible. These innovations have allowed us to address both bottlenecks, so researchers can analyze large datasets quickly.
Limiting Now: Our Speed of Collaboration
Nowadays, we have huge amounts of data in different formats and different locations, a wealth of sophisticated analysis tools, ever more complex computing environments and increasingly large teams of specialized scientists. Combined, these factors have created a third bottleneck: the ability of scientists, bioinformaticians and computational researchers to collaborate efficiently to turn all that data into knowledge and insights.
Collaboration between multidisciplinary and often geographically dispersed teams is challenging. Fundamentally, it requires that every researcher working with the data use not just the same dataset but also the same analysis packages and the same code in the same computing environment. This is difficult for a number of reasons:
- Creating and/or operationalizing the right analysis packages at the right time requires a level of IT expertise and infrastructure that many (computational) scientists and bioinformaticians do not have.
- Maintaining the exact same version of all analysis packages on all computing environments is highly challenging and gets increasingly difficult as teams grow in size and span different countries. Sharing with external collaborators and research partners is an extra challenge, one that even IT can’t solve.
- Security related to sharing data sets is particularly critical given the sensitive nature of healthcare data. However, it is difficult for users to implement and manage security, access control, centralized management and cost control themselves or to communicate with IT for support.
- For IT, it is difficult to continuously keep track of all the dependencies, configurations, changes and versions of all the related research artifacts.
These challenges impact reproducibility, slow down progress and innovation, and take up significant IT resources. Because collaborating can be complex and frustratingly slow, research team members often end up developing suboptimal and non-compliant work-arounds that create more problems in the long run and/or compromise security.
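To make the version problem concrete, here is a minimal Python sketch of checking one environment against an agreed set of package versions. It is an illustration only, not part of any Code Ocean tooling: the function names and the pinned versions are hypothetical, and it covers only Python packages, not the operating system or other libraries a real analysis depends on.

```python
from importlib import metadata


def installed_versions(package_names):
    """Return {name: version or None} for the current Python environment."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return versions


def find_version_drift(pinned, installed):
    """Return {name: (expected, found)} for every package that differs."""
    drift = {}
    for name, expected in pinned.items():
        found = installed.get(name)
        if found != expected:
            drift[name] = (expected, found)
    return drift


if __name__ == "__main__":
    # Hypothetical pins, chosen for illustration only.
    pinned = {"numpy": "1.26.4", "pandas": "2.2.2"}
    drift = find_version_drift(pinned, installed_versions(pinned))
    for name, (expected, found) in drift.items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Every team member, on every machine, would have to run a check like this and fix whatever it reports, each time a dependency changes. That manual effort is what grows with team size.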
Solving this third bottleneck means providing a new way for research teams and their supporting IT teams to easily share computational research in a highly traceable and reproducible manner without specialized software engineering knowledge. In short, we need “high-throughput collaboration”.
Enabling High-Throughput Collaboration
At Code Ocean, we spent significant time analyzing what the ideal platform for high-throughput collaboration would look like by asking and answering the following questions:
Q: What makes a project easily shareable between researchers?
A: The ability of all team members, even those without deep software engineering expertise, to quickly access data, code, computing environment and results so they can easily work on, update or reproduce the analysis. This requires a user experience that is optimized for bioinformaticians, computational researchers and scientists.
Q: What needs to be tracked when researchers share?
A: While this depends on the specific project, generally data, code, compute environment and results need to be tracked from concept to analysis. Tracking means recording who accessed which files and when, the specific configurations and dependencies, and the version history, along with all metadata needed for compliance, security, and performance.
Q: What needs to be reproduced to support collaborative research?
A: Everything required to replay and reuse the exact data and analysis for the entire life of the project. This level of reproducibility makes it possible for both internal and external researchers to quickly and easily build on previous project assets and results.
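As a rough illustration of the tracking and reproducibility answers above, the following Python sketch records who did what and when, together with a single fingerprint over code, data and environment. It is a sketch under stated assumptions, not a description of how the Code Ocean platform works: the function names, the record fields and the idea of hashing in-memory byte strings are all hypothetical simplifications.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_bytes(content):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(content).hexdigest()


def fingerprint(artifacts):
    """Hash a {label: bytes} mapping into one digest.

    Labels are sorted before hashing, so the result does not depend on
    the order in which artifacts are listed.
    """
    digests = {label: sha256_of_bytes(content) for label, content in artifacts.items()}
    canonical = json.dumps(digests, sort_keys=True).encode("utf-8")
    return sha256_of_bytes(canonical)


def access_record(user, action, artifacts, when=None):
    """Build one log entry: who did what, when, and on which exact state."""
    when = when or datetime.now(timezone.utc)
    return {
        "user": user,
        "action": action,
        "timestamp": when.isoformat(),
        "fingerprint": fingerprint(artifacts),
    }
```

Two runs with the same fingerprint used identical code, data and environment description; if any one of them changes, the fingerprint changes too, which gives collaborators a quick way to tell whether they are replaying the same analysis.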
These three key criteria – shareability, traceability and reproducibility – are the foundation of the Code Ocean platform, specifically the Compute Capsules™. Compute Capsules are computational “apps” that enable seamless collaboration by allowing researchers to package code, data, the computing environment, and the results. These self-contained executable software packages can then be easily shared with and used by others without the need to install operating systems, libraries, and application packages in the correct versions on all computing machines. A simple click is enough to share a Compute Capsule with internal and external collaborators and facilitate high-throughput collaboration between bioinformaticians, computational researchers, and scientists with different specializations and varying software engineering/devops expertise all over the globe.
In a soon-to-be-published Code Ocean eBook titled “A Research Guide to High Throughput Collaboration”, we discuss in more detail why shareability, traceability and reproducibility matter for speeding up time to discovery, and we provide concrete examples and use cases of how the Code Ocean platform can take collaboration from hassle to high-throughput.
If you have any questions about Code Ocean and how our Compute Capsules enable high-throughput collaboration, please contact us here.
Liang Hong, Mengqi Luo, Ruixue Wang, Peixin Lu, Wei Lu, Long Lu. “Big Data in Health Care: Applications and Challenges”. DOI: https://doi.org/10.2478/dim-2018-0014, https://sciendo.com/article/10.2478/dim-2018-0014