Introducing Code Ocean Models: a unified environment for ML in CompBio
We’re launching new machine learning tools in Code Ocean 3.0, the latest version of our computational science platform. They make it easier for computational biologists, bioinformaticians, and ML engineers to work together in training, validating, and using ML models alongside their existing workflows. They also come with full traceability and lineage from development to pre-production.
Here's what this post covers:
- Demo video: see new ML features in action
- What’s changing in version 3.0
- New features at a glance
- Why we're doing this
- What is pre-production?
- What is MLflow?
- Security and compliance
- About Code Ocean
- One more thing: Hugging Face 🤗 integration
For more information, you can take a look at our Models page.
We're also running a webinar about managing ML models in bioinformatics on Wednesday, September 25th. You can register here (and get an email with the recording if you can’t make it live).
Product demo video
What’s changing in Code Ocean 3.0
The main update in this release is a native integration of MLflow into our platform. This allows us to introduce a whole host of features for anyone working with ML models.
Users can now:
- Use a simple toggle to track Compute Capsules during development and training
- View information about models and register them from within MLflow
- View all their registered models in a single dashboard UI
- Drag and drop models into Nextflow-based Pipelines, or reference them directly
- Quickly swap and attach new Data for training or validation
- Stand up no-code versions of models for others to validate with their own data
- Get full model lineage and traceability with the Lineage Graph
New features at a glance
Built-in model tracking with native MLflow integration
Track any Compute Capsule during development by toggling on MLflow tracking. Then, click "View in MLflow" to view runs and model information in our natively integrated MLflow dashboard.
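For readers new to MLflow, here is a minimal sketch of what tracked training code could look like in Python. It uses MLflow's public tracking calls (`mlflow.set_experiment`, `mlflow.start_run`, `mlflow.log_params`, `mlflow.log_metrics`). The helper names, the experiment name, and the config values are hypothetical. It also assumes the tracking server is already configured for the Capsule, which this post does not spell out.

```python
from typing import Dict


def flatten_params(config: Dict, prefix: str = "") -> Dict[str, str]:
    """Flatten a nested config into dotted keys; MLflow params are flat key/value pairs."""
    flat: Dict[str, str] = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections, e.g. {"model": {"layers": 4}} -> "model.layers"
            flat.update(flatten_params(value, prefix=f"{name}."))
        else:
            flat[name] = str(value)
    return flat


def track_run(experiment: str, config: Dict, metrics: Dict[str, float]) -> str:
    """Log one training run to the configured MLflow server and return its run ID.

    Assumption: the tracking URI is already set in the environment.
    """
    import mlflow  # imported lazily so flatten_params works without MLflow installed

    mlflow.set_experiment(experiment)
    with mlflow.start_run() as run:
        mlflow.log_params(flatten_params(config))
        mlflow.log_metrics(metrics)
        return run.info.run_id
```

A call such as `track_run("variant-classifier", {"lr": 0.001, "model": {"layers": 4}}, {"val_auc": 0.91})` would then show up as a single run in the MLflow dashboard. The names and numbers here are purely illustrative.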
Full lineage for code, data, and environments used in model development and training
The Code Ocean Lineage Graph covers all development and training of ML models, automating an otherwise difficult process. Code Ocean already has native integration of Git and Docker, meaning all work is tracked and containerized.
Git integration: Git is built into every Capsule, Pipeline, and Model. This tracks all development work and syncs with multiple Git providers.
Docker integration: Capsules auto-generate a Dockerfile in the background while users select a base image and install required packages. Advanced users can edit the Dockerfile themselves.
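To make the Dockerfile idea concrete, here is a rough sketch of how a base image and a package list can be turned into a Dockerfile. This is not Code Ocean's actual generator. The function name, its parameters, and the example image and package versions are all assumptions made for illustration.

```python
from typing import Sequence


def render_dockerfile(
    base_image: str,
    pip_packages: Sequence[str] = (),
    apt_packages: Sequence[str] = (),
) -> str:
    """Render a small Dockerfile from a base image and package lists (hypothetical helper)."""
    lines = [f"FROM {base_image}"]
    if apt_packages:
        # Sort packages so the same selection always yields the same file
        pkgs = " ".join(sorted(apt_packages))
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            f"{pkgs} && rm -rf /var/lib/apt/lists/*"
        )
    if pip_packages:
        pkgs = " ".join(sorted(pip_packages))
        lines.append(f"RUN pip install --no-cache-dir {pkgs}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example image and pinned versions are illustrative only
    print(render_dockerfile("python:3.11-slim", pip_packages=["scikit-learn==1.4.2", "mlflow==2.12.1"]))
```

Pinning versions and sorting the package list keeps the generated file stable between runs, which is what makes an environment practical to track in Git alongside the code.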
Deploy no-code models for inference and validation with non-coding domain scientists
Stand up an inference version of your model and share it with a domain expert in a couple of minutes. They can then swap in their data and hit "run" just as fast.
Drag and drop ML models directly into the Nextflow-based Pipeline UI (or reference their containers)
You can now drag and drop models into our visual Pipeline tool. This means you can either use registered models within a Pipeline, or insert MLflow-tracked Capsules into Pipelines to train your models with parallel compute (Code Ocean Pipelines run on AWS Batch).
Users writing their own Nextflow scripts can also copy and paste the UID of the model container or Compute Capsule (we support both DSL1 and DSL2).
Build reproducible environments and scale GPU resources up (and down)
Straightforward provisioning from within Compute Capsules makes it easy to get the compute you need for training or fine-tuning your models. (This feature already existed, but it's worth mentioning again in this context.) Because Code Ocean installs into your cloud architecture as a VPC, compute availability will vary depending on your AWS region.
Why introduce machine learning features? Why now?
We speak to different CompBio and bioinformatics teams every week, if not every day. Many of them have long told us that they need better ways to work with ML models, with ML engineers, and especially with non-coding experts who can help validate their work during what we're calling the "pre-production" phase of model development.
The problem is, aside from the hard work of model development and training, there’s a lot of other work needed to get ML models ready for use and into production:
- It’s hard to keep track of what data was used to train different versions of a model
- It’s hard to keep track of multiple versions of a model during development
- It’s hard to know the exact lineage and provenance of a given model
- It’s hard to validate models without pushing them to production first
- It’s hard for non-coding users to run models
- It's hard to keep up with (often rapidly changing) GPU hardware requirements
- It's hard to set up compute environments compatible with changing GPU hardware (e.g. CUDA versions, specific dependency versions)
We're working at a time when there’s an increasing need for reality to catch up with hype in this field, and with FDA guidance on using AI/ML in drug development expected soon. We believe this new functionality will help our customers accelerate adoption of ML in their research while improving its usability and trustworthiness.
What is pre-production?
The machine learning lifecycle is usually split in two: a development phase and a production phase. However, there is another mostly unacknowledged phase: pre-production.
Pre-production is when the dev team has finished tweaking the model but it still needs testing by its first early adopters outside of that team. Their main interest is trying the model with real-life data they are familiar with.
Non-dev early adopters are characterized by two things:
- Their domain expertise (e.g. as pharmacologists, radiologists, or cell biologists), and
- Their relative lack of engineering skills
These users can uniquely judge the outcomes of a model, assess assumed success criteria, pick relevant test data, and give valuable feedback. In short, they can actually kick the tires on a given model before it gets pushed to production.
The challenge in this phase is to place a model in development in the hands of these domain experts without requiring them to set up a test environment, make API calls, or use scripts in a terminal.
Our VP of Product, Daniel Koster, has just published a more involved post about pre-production. You can read it here.
What is MLflow?
MLflow is an open-source tool developed by Databricks to manage the machine learning lifecycle. Its primary function is to keep track of models in development, making it easier to manage, identify, and select models for further development or use.
It can be difficult to stand up a standalone MLflow server to function with existing DIY systems, especially if security and compliance are a concern. Native integration of MLflow into Code Ocean solves this problem because it inherits all of the existing security, authentication, and compliance that Code Ocean already has (along with all of the features and tech that CompBio/bioinformatics teams use in Code Ocean every day).
Security and compliance
Code Ocean installs directly into our customers’ cloud architecture as a virtual private cloud (VPC). This gives greater peace of mind than typical SaaS platforms can provide because it installs where our customers' data already is.
Because we’ve natively integrated MLflow into our platform, it inherits our existing security, compliance, and authentication features:
- Industry-standard permissions and access management with identity provision and single sign-on (IDP and SSO)
- You’ll only be able to see the MLflow experiments that you’re allowed to see, i.e. models you’ve created or that have been shared with you
- HIPAA, GDPR, and ISO 27001 compliance (with SOC 2 underway)
- Group permissions support allows you to share experiments with groups, not just individual users
We’ve also made it secure from the POV of the user’s code:
- Users who train models in their Capsules register them with MLflow from their Capsule code, which communicates with the MLflow server and tracks the models there
- We have ensured that this connection is secure (it can't be intercepted or used to affect other users' models)
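As a rough illustration of registering a model from Capsule code, the sketch below uses MLflow's public `mlflow.register_model` call and its `runs:/<run_id>/<artifact_path>` URI convention. The helper names, the default artifact path, and the example model name are hypothetical. How Code Ocean authenticates the connection to the MLflow server is not shown here.

```python
def model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build the runs:/<run_id>/<artifact_path> URI MLflow uses to locate a logged model."""
    if not run_id or "/" in run_id:
        raise ValueError(f"invalid run ID: {run_id!r}")
    # Strip stray slashes so "/artifacts/clf/" and "artifacts/clf" give the same URI
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


def register(run_id: str, name: str, artifact_path: str = "model"):
    """Register a logged model under `name` in the MLflow Model Registry.

    Assumption: the tracking server and credentials are already configured.
    """
    import mlflow  # imported lazily so model_uri works without MLflow installed

    return mlflow.register_model(model_uri(run_id, artifact_path), name)
```

For example, `register(run_id, "variant-classifier")` would create a new version of a registered model with that (made-up) name, pointing at the artifacts logged during the given run.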
About Code Ocean
Code Ocean is a no lock-in computational science platform built for CompBio and bioinformatics teams working in biotech and pharma. It installs as a VPC where our customers' data already is: in their private cloud architecture. This means there's no data egress and they can take full advantage of our suite of tools with full ISO 27001, HIPAA, and GDPR compliance, as well as SSO, IDP, and SCIM integration and an open API.
If you have any questions or want to discuss how you can use Code Ocean in your organization, feel free to book a demo.
Not ready for a demo just yet?
- Subscribe to our newsletter for future product updates
- Register for our upcoming webinar
- Follow us on LinkedIn
Hugging Face 🤗 integration
One more thing: we're already hard at work on an integration with Hugging Face that will work directly with these new ML features. Users will be able to:
- Enter a public Hugging Face model's ID and provide metadata
- Import models into the Code Ocean Models dashboard
- Share them with other users
- Attach them to Capsules and Pipelines
- Add them to Collections to improve internal use/visibility