June 26, 2024

Map of foundational models for use in biotech and pharma R&D

New foundational models are published all the time. Knowing where to start can be daunting. It can also be an overwhelming task to keep track of new models as they're released, let alone find the time to even use them in an already busy schedule.

That's why we've been producing different resources tailored for computational biologists and bioinformaticians in biotech and pharma R&D to help them cut through the noise, find what they're looking for, and make use of these models.

[Image: close-up of the map of foundational models for biotech and pharma, example 1]

Here's what we've made so far:

What are generative foundational models?

Generative foundational models refer to a class of AI models, such as large language models (LLMs) and other generative AI technologies, which are pre-trained on vast amounts of data. These models can generate new content—such as text, images, or even molecular structures—based on their training data. In the life sciences, particularly in biotech and pharmaceutical research, these models are accelerating various aspects of research and development.

These models have three core characteristics:

  1. Large-scale pre-training
  2. Generative capabilities
  3. Fine-tuning and customization

Large-scale pre-training

Generative foundational models are pre-trained on extensive datasets that include scientific literature, clinical trial data, genetic sequences, and molecular structures, enabling them to comprehend and generate complex biological and chemical information. Using transformer architectures, these models handle sequential data and capture long-range dependencies essential for understanding intricate biological sequences and molecular interactions.

Generative capabilities

These models can generate human-like text, aiding in summarizing scientific papers and extracting insights from large datasets. They are also capable of designing new molecules by predicting chemical properties and interactions, significantly contributing to tasks such as protein folding, protein sequence design, and molecular optimization, which are crucial for advancing drug discovery and personalized medicine.

Fine-tuning and customization

Generative foundational models can be fine-tuned with domain-specific data to enhance their performance on specialized tasks such as antibody design or gene expression analysis. They are integrated into proprietary research pipelines, supporting seamless data flow and enabling more precise and effective research capabilities in areas like in silico screening and virtual trials.

[Image: close-up of the map of foundational models for biotech and pharma, example 2]

Interactive foundational model map

We've developed a map of 59 different foundational models suited to use cases in biotech and pharmaceutical companies. We've categorized them in two ways: by research domain and by input-to-output.

You can either:

Foundational model use cases

Each foundational model included in the map falls into one of the seven use case categories below. Each model is linked to its summary in the full list further down.

Important note: this list is not exhaustive, and will be out of date as soon as new models emerge. However, we'll attempt to keep this list as up-to-date as we can.

Full list of foundational models 

We've compiled all 59 foundational models into a single list below, with anchor links above for navigation. The list contains a short summary of each model and a link to the relevant GitHub repo where available.

AF_unmasked

The AF_unmasked model was developed by Claudio Mirabello et al. to enhance AlphaFold's ability to predict multimeric protein structures by integrating experimental data. It addresses AlphaFold's limitations with large complexes and experimental data integration by using a new template strategy and structural inpainting without retraining the neural network. AF_unmasked improves prediction accuracy and speed, even without evolutionary information, and effectively resolves model clashes.

GitHub: https://github.com/clami66/AF_unmasked

AlphaFold

AlphaFold was developed by DeepMind to predict protein structures from amino acid sequence. Chop Yan Lee et al. explored its use for predicting protein-protein interaction (PPI) interfaces, particularly those involving folded domains binding to short linear motifs, and found high sensitivity but low specificity, especially with longer or disordered protein fragments. They developed a fragmentation strategy to enhance sensitivity and applied it to neurodevelopmental disorder-associated proteins. Experimental validation confirmed several predicted interactions, providing new molecular insights.

GitHub: https://github.com/google-deepmind/alphafold

AlphaFold 3

The AlphaFold 3 model was trained to predict the structures of biomolecular complexes with high accuracy, building on the success of AlphaFold 2. Developed by Josh Abramson et al., AF3 introduces a diffusion-based architecture and innovations like the Pairformer and Diffusion Modules to handle proteins, nucleic acids, small molecules, ions, and modified residues. The model excels in predicting protein-ligand, protein-nucleic acid, and antibody-antigen interactions, outperforming specialized tools.

GitHub: https://github.com/lucidrains/alphafold3-pytorch

AncLearn

The AncLearn model was trained to enhance the robustness of holistic indoor scene understanding by addressing noisy instance feature learning and difficulties in retrieving instances from sparse point clouds. Developed by Mingyue Dong et al., AncLearn generates dynamic shape anchors to fit instance surfaces, reducing noise and outliers during detection and reconstruction. Integrated into the AncRec system, it operates in an instance-oriented manner to produce high-quality semantic scene models.

AttentiveChrome

The AttentiveChrome model was trained to predict gene expression from chromatin data by addressing the spatially structured, high-dimensional nature of chromatin signals and their interactions. Developed by Ritambhara Singh et al., it uses a hierarchical LSTM architecture with dual attention mechanisms to identify relevant regions and significant chromatin marks. Evaluated on datasets from the Roadmap Epigenome Project covering 56 human cell types, AttentiveChrome outperforms existing methods in accuracy and interpretability.

GitHub: https://github.com/QData/AttentiveChrome
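
To make the attention idea concrete, here is a minimal sketch of attention-weighted pooling over genomic bins. It is not code from the AttentiveChrome repository: the function names and the toy numbers are assumptions for illustration, and the real model learns its scores inside the network rather than taking them as input.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(values: List[float], scores: List[float]) -> Tuple[float, List[float]]:
    """Summarize per-bin signal values as a weighted sum.

    `values` are hypothetical chromatin signal readings, one per bin.
    `scores` are hypothetical relevance scores, one per bin.
    The returned weights show which bins dominated the summary.
    """
    weights = softmax(scores)
    pooled = sum(w * v for w, v in zip(weights, values))
    return pooled, weights


if __name__ == "__main__":
    signal = [1.0, 3.0, 0.5]          # made-up signal in three bins
    relevance = [0.0, 2.0, -1.0]      # made-up relevance scores
    pooled, weights = attention_pool(signal, relevance)
    print(pooled, weights)
```

With equal scores the pooled value is the plain mean. Raising one bin's score shifts weight toward that bin, which is why attention weights can be read as a rough relevance signal.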

Basenji

The Basenji model was trained to predict cell-type-specific epigenetic and transcriptional profiles in large mammalian genomes using only DNA sequence. Developed by Kelley et al., this convolutional neural network (CNN) identifies promoters and distal regulatory elements to make accurate gene expression predictions. The model processes large input sequences and uses multiple convolutional layers to predict read coverage across long chromosome sequences. It was trained on comprehensive datasets including DNase-seq, histone modification ChIP-seq, and CAGE experiments.

GitHub: https://github.com/calico/basenji

Basset

The Basset model was trained to predict the functional activity of DNA sequences using deep convolutional neural networks (CNNs), as described by David R. Kelley, Jasper Snoek, and John L. Rinn. It leverages DNaseI-seq data from 164 cell types to learn the chromatin accessibility code, achieving superior predictive accuracy compared to previous methods. Basset annotates mutations in the genome with their influence on accessibility, aiding researchers in interpreting noncoding variants associated with human disease.

GitHub: https://github.com/davek44/Basset

BioNeMo

BioNeMo is a database that provides detailed molecular information on biodegradation metabolism, as described by the authors in the article "Bionemo: Molecular Information on Biodegradation Metabolism," published in Nucleic Acids Research in December 2008. The database focuses on the molecular details of proteins and genes involved in biodegradation, offering data on protein sequences, domains, structures, gene sequences, regulatory elements, and transcription units. Unlike other databases such as UM-BBD and Metarouter, BioNeMo emphasizes sequence-level information,...

GitHub: https://github.com/NVIDIA/BioNeMo

BiomedParse

The BiomedParse model was trained to provide comprehensive biomedical image parsing by integrating segmentation, detection, and recognition tasks across 82 object types and 9 imaging modalities. Traditional methods often handle these tasks separately, leading to inefficiencies. BiomedParse leverages task interdependencies to improve accuracy and enable applications like text-prompt-based segmentation. It uses the BiomedParseData dataset, which includes over six million image, mask, and text description triples, harmonized using GPT-4. The model outperforms state-of-the-art methods, particularly excelling...

Brain TokenGT

The Brain TokenGT model was trained to analyze brain functional connectome (FC) trajectories for diagnosing and prognosing neurodegenerative diseases like Alzheimer's Disease (AD). Developed by Zijian Dong et al., the model addresses the limitations of traditional Graph Neural Networks (GNNs) by embedding FC trajectories and incorporating node and spatio-temporal edge embeddings through its Graph Invariant and Variant Embedding (GIVE) module. The Brain Informed Graph Transformer Readout (BIGTR) module processes these embeddings using a transformer encoder.

GitHub: https://github.com/ZijianD/Brain-TokenGT

Caduceus

The Caduceus model was trained to address key challenges in genomics, such as long-range token interactions, bi-directional context, and reverse complementarity (RC) of DNA sequences. Developed by Yair Schiff et al., Caduceus builds on the long-range Mamba block, extending it to BiMamba for bi-directionality and MambaDNA for RC equivariance. This architecture outperforms previous models in tasks like variant effect prediction (VEP) and achieves superior performance on benchmarks such as Genomics Benchmarks and Nucleotide Transformer tasks.

GitHub: https://github.com/kuleshov-group/caduceus
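
Reverse complementarity is easy to show in a few lines. The sketch below is not taken from the Caduceus repository, and it uses a simpler technique than the paper's: Caduceus builds RC equivariance into its architecture, whereas this example only averages a score over both strands after the fact. All function names are assumptions for illustration.

```python
from typing import Callable

# Complement table for upper-case DNA; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def rc_averaged(score: Callable[[str], float], seq: str) -> float:
    """Make any scoring function strand-invariant by averaging both strands."""
    return 0.5 * (score(seq) + score(reverse_complement(seq)))


def starts_with_a(seq: str) -> float:
    """Toy, strand-dependent 'model': 1.0 if the read starts with A."""
    return 1.0 if seq.upper().startswith("A") else 0.0


if __name__ == "__main__":
    read = "AACG"
    print(reverse_complement(read))
    print(rc_averaged(starts_with_a, read))
    print(rc_averaged(starts_with_a, reverse_complement(read)))
```

The toy scorer gives different answers for a read and its reverse complement, while the averaged version gives the same answer for both. That is the property a genomics model needs, because the same double-stranded DNA can be sequenced from either strand.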

CellOracle

The CellOracle model was trained to understand cell identity regulation through gene-regulatory networks (GRNs) using single-cell multi-omics data. Developed by Kenji Kamimoto et al., this machine-learning-based tool simulates transcription factor (TF) perturbations to predict changes in cell identity without experimental data. Applied to systems like mouse and human hematopoiesis and zebrafish embryogenesis, CellOracle accurately models known phenotypic changes and identifies new regulatory factors. It constructs cell-type-specific GRNs, estimates cell-identity transitions, and visualizes changes.

GitHub: https://github.com/morris-lab/CellOracle

ChemCrow

The ChemCrow model was trained to enhance performance in chemistry-related tasks such as organic synthesis, drug discovery, and materials design by integrating 18 expert-designed tools with GPT-4. Discussed in a Nature Machine Intelligence article, ChemCrow autonomously planned and executed the synthesis of DEET and organocatalysts, and guided the discovery of a novel chromophore. Tested on 14 use cases, it demonstrated the ability to adapt and execute standardized synthesis procedures.

GitHub: https://github.com/ur-whitelab/chemcrow-public

Chemprop

The Chemprop model was trained to enhance molecular generative models for structure-based drug design by incorporating target 3D structural information. The method integrates a message-passing neural network (D-MPNN) that predicts docking scores with a generative neural network (RNN) to efficiently explore chemical space and identify molecules that bind favorably to specific targets.

GitHub: https://github.com/chemprop/chemprop

Chromoformer

The Chromoformer model was trained to quantitatively decipher histone codes in gene regulation by incorporating three-dimensional chromatin interactions, addressing limitations of traditional models that focus on narrow genomic regions. It leverages large genomic windows and three-dimensional interactions to achieve state-of-the-art performance in predicting gene expression levels.

GitHub: https://github.com/dohlee/chromoformer

ClinicalBERT

The ClinicalBERT model was trained to evaluate the effectiveness of OpenAI's GPT-4 in extracting clinical phenotypes from Electronic Health Records (EHRs) of non-small cell lung cancer (NSCLC) patients, as detailed in the study by Kriti Bhattarai et al. The research compared GPT-4's performance with GPT-3.5-turbo, Flan-T5-xl, Flan-T5-xxl, and spaCy’s methods, focusing on identifying disease stages, treatments, recurrence, and affected organs from 13,646 records of 63 NSCLC patients.

GitHub: https://github.com/EmilyAlsentzer/clinicalBERT

ClinicalGPT

The ClinicalGPT model was trained to enhance clinical applications by incorporating diverse real-world medical data, including medical records, domain-specific knowledge, and multi-round dialogue consultations. Developed by Guangyu Wang et al., the model aims to address limitations in traditional large language models, such as factual inaccuracies and insufficient real-world grounding. ClinicalGPT excels in multiple clinical tasks, including medical knowledge question-answering, patient consultations, and diagnostic analysis, as demonstrated by a comprehensive evaluation framework.

GitHub: https://github.com/jiyingz/clinicalGPT-2

CPA

The CPA model was trained to predict and interpret single-cell responses to various perturbations, such as drug treatments and genetic modifications, by integrating the interpretability of linear models with the flexibility of deep learning. It handles complex, high-dimensional single-cell RNA sequencing (scRNA-seq) data, enabling predictions for unseen conditions and facilitating drug similarity analysis. The model's performance was validated on datasets including drug dose-response, drug combinations, and cross-species time-series data.

GitHub: https://github.com/theislab/cpa

CROssBAR

The CROssBAR model was trained to enhance drug-target interaction (DTI) prediction by evaluating various protein representation techniques. The original study highlights the superior performance of learned embeddings like transformer-avg and unirep1900, which leverage deep learning to capture complex protein sequence patterns. Conventional descriptors such as k-sep_pssm and dde also show competitive results, emphasizing the importance of evolutionary and sequence composition features.

Evo

The Evo model was trained to predict and generate DNA sequences at scales ranging from molecular to genome level, leveraging deep signal processing and a 7-billion-parameter architecture. It uses a byte-level, single-nucleotide tokenizer and was trained on 2.7 million prokaryotic and phage genomes. Evo excels in zero-shot function prediction across DNA, RNA, and protein modalities, and can generate synthetic CRISPR-Cas complexes and transposable systems.

GitHub: https://github.com/evo-design/evo
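
A byte-level, single-nucleotide tokenizer can be sketched in a few lines: each base becomes one token, and the token ID is the base's byte value. This is an illustration of the idea only, not Evo's actual tokenizer. The function names are assumptions, and real tokenizers usually add special tokens that are left out here.

```python
from typing import List


def tokenize(seq: str) -> List[int]:
    """Map each nucleotide to one token ID (its ASCII byte value)."""
    return list(seq.upper().encode("ascii"))


def detokenize(token_ids: List[int]) -> str:
    """Invert tokenize(): turn byte values back into a DNA string."""
    return bytes(token_ids).decode("ascii")


if __name__ == "__main__":
    ids = tokenize("ACGT")
    print(ids)
    print(detokenize(ids))
```

The trade-off is sequence length. One token per base keeps single-nucleotide resolution, so a single-base change is always visible to the model, but a genome-scale input becomes a very long token sequence.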

DeepProteomics

The DeepProteomics model was trained to enhance feature matching in single-cell proteomics, addressing the limitations of low proteomic depth and throughput. Developed by Karl K. Krull, Syed A. Ali, and Jeroen Krijgsveld, the DIA-ME (Data-Independent Acquisition with Matching Enhancer) strategy co-analyzes low-input samples with higher-input matching enhancers to improve proteome coverage and data completeness.

DeepTox

The DeepTox model was trained to predict chemical toxicity in aquatic organisms using a transformer-based approach, significantly improving upon traditional QSAR methods. It leverages a pre-trained RoBERTa transformer to convert chemical structures into numerical representations, which are then processed by a deep neural network to predict toxic effects on algae, aquatic invertebrates, and fish. The model was trained on tens of thousands of exposure experiments.

DeepVariant

The DeepVariant model was trained to enhance long-read small variant calling by integrating a new local haplotype approximation method, as detailed by Alexey Kolesnikov et al. in their article. This method simplifies the variant calling process while maintaining high accuracy across multiple sequencing platforms, including PacBio Revio and ONT R10.4. By incorporating local haplotagging directly within the DeepVariant framework, it eliminates the need for external tools like WhatsHap, reducing complexity and overhead.

GitHub: https://github.com/google/deepvariant

DeepRadiology

The DeepRadiology model was trained to rapidly and accurately detect pneumonia and Covid-19 from chest X-rays and CT scans, addressing the global health burden posed by these diseases. Aakash Shah and Manan Shah's article reviews various deep learning architectures, such as ResNet, Yolo, and GANs, highlighting their design, challenges, and trade-offs. The model aims to improve diagnostic speed and accuracy compared to traditional methods, which are time-consuming and require expert radiologists.

GitHub: https://github.com/pyaf/DeepRadiology

DNAGPT

The DNAGPT model was trained to handle a variety of DNA sequence analysis tasks by integrating sequence and numerical data into a single framework. Developed by Daoan Zhang et al., DNAGPT enhances the classic GPT model with tasks like binary classification and numerical regression, using over 200 billion base pairs from all mammals. It excels in genomic signal and region recognition, mRNA abundance prediction, and artificial genome generation, outperforming existing models like DNABERT and Nucleotide Transformer.

GitHub: https://github.com/TencentAILabHealthcare/DNAGPT

EHR-GPT

The EHR-GPT model was trained to enhance clinical decision support and precision medicine by leveraging a hybrid approach that combines unsupervised learning of word embeddings, semi-supervised learning for clinical vocabulary and concept building, and deterministic rules for fine-grained information extraction. This method addresses the limitations of existing NLP techniques in healthcare, such as performance, efficiency, and transparency.

GitHub: https://github.com/Anitej185/live-doc-gpt

Enformer

The Enformer model was trained to predict gene expression and chromatin states from DNA sequences by leveraging a transformer-based architecture that integrates long-range interactions up to 100 kb. This approach overcomes the limitations of traditional deep convolutional neural networks, which struggle with sequences beyond 20 kb from the transcription start site. Enformer was trained on extensive human and mouse genomic data and tested on held-out sequences.

GitHub: https://github.com/lucidrains/enformer-pytorch

GEARS

The GEARS model was trained to predict transcriptional responses to genetic perturbations by integrating deep learning with a knowledge graph of gene-gene relationships. It addresses the challenge of combinatorial explosion in multigene perturbations, using single-cell RNA-sequencing data from perturbational screens. GEARS outperforms existing methods in precision and effectiveness, leveraging gene coexpression and Gene Ontology-derived knowledge graphs to predict outcomes for genes without prior experimental data.

GitHub: https://github.com/snap-stanford/GEARS

GeminiMol

The GeminiMol model was trained to enhance molecular representation learning by incorporating the dynamic conformational space of molecules, which is often neglected in traditional models. Developed by Lin Wang et al., this model employs a hybrid contrastive learning framework combining inter-molecular contrastive learning with molecular similarity projection heads. It was trained on a dataset of 39,290 molecules without needing experimental molecular properties.

GitHub: https://github.com/Wang-Lin-boop/GeminiMol

GeneBERT

The GeneBERT model was trained to enhance the understanding of regulatory genome interactions across different cell types by integrating 1D genome sequences and 2D matrices of transcription factors and regions. Developed by Shentong Mo et al., it employs three pre-training tasks (masked genome modeling, next genome segment prediction, and sequence-region matching) to capture complex regulatory element interactions. It was pre-trained on an ATAC-seq dataset with 17 million genome sequences.

GitHub: https://github.com/ZovcIfzm/GeneBERT

Geneformer

The GeneFormer model was trained to efficiently compress gene sequencing data, addressing the limitations of traditional methods like G-zip and 7zip, which are not optimized for the repetitive sequences of nucleotides (A, G, C, T). GeneFormer leverages a modified transformer architecture to capture dependencies in nucleotide sequences, incorporating a latent array and multi-level grouping to enhance compression efficiency and reduce latency.

GitHub: https://github.com/cx0/geneformer-finetune

GenSLM

The GenSLM model was trained to identify and classify emergent variants of SARS-CoV-2 using large language models adapted for genomic data. Developed by Maxim Zvyagin et al., it leverages over 110 million prokaryotic gene sequences for pre-training and 1.5 million SARS-CoV-2 genomes for fine-tuning. This approach enables rapid and accurate identification of variants of concern, addressing the need for efficient computational tools in pandemic monitoring.

GitHub: https://github.com/ramanathanlab/genslm

HyenaDNA

The HyenaDNA model was trained to handle long-range genomic sequences at single nucleotide resolution, overcoming the limitations of traditional Transformer-based models. Developed by Eric Nguyen et al., it leverages implicit convolutions to process up to 1 million tokens, achieving 160x faster training and maintaining global context. HyenaDNA excels in detecting subtle genetic variations and adapts to novel tasks without updating pretrained weights. It outperforms previous models on multiple benchmarks, including Nucleotide Transformer and GenomicBenchmarks datasets,...

GitHub: https://github.com/HazyResearch/hyena-dna

iDNA

The iDNA-OpenPrompt model was trained to identify DNA methylation sites using the OpenPrompt learning framework, addressing the limitations of traditional and current deep learning methods. It leverages a prompt template, prompt verbalizer, and Pre-trained Language Model (PLM) to enhance accuracy, incorporating a DNA vocabulary library, BERT tokenizer, and specific label words.

INTERACT

The INTERACT model was trained to predict the effects of genetic variations on DNA methylation (DNAm) levels at CpG sites in the human brain by integrating convolutional neural networks (CNN) with transformer models. It addresses the challenge of identifying causal genetic variations influencing DNAm levels amidst extensive linkage disequilibrium (LD). The model captures both local and distant DNA sequence features using a self-attention mechanism.

LLaVA-Med

The LLaVA-Med model was trained to create a multimodal conversational AI tailored for the biomedical domain, addressing the limitations of general-domain vision-language models in interpreting biomedical images. Chunyuan Li et al. developed a cost-efficient training method using a large-scale biomedical figure-caption dataset from PubMed Central and GPT-4 generated instruction-following data. The model underwent a two-stage training process: aligning biomedical vocabulary and mastering open-ended conversational semantics.

GitHub: https://github.com/microsoft/LLaVA-Med

M3-CAD

The M3-CAD model was trained to design antimicrobial peptides (AMPs) to combat multidrug-resistant organisms (MDROs) using a novel AI-driven approach. Developed by Yue Wang et al., the model leverages the QLAPD database, which contains sequences, structures, and antimicrobial properties of 12,914 AMPs. The M3-CAD pipeline integrates generation, regression, and classification modules, utilizing a 3D voxel coloring method to enhance peptide structural characterization.

Med-PaLM

The Med-PaLM model was trained to evaluate and enhance the clinical knowledge of large language models (LLMs) for medical applications. Karan Singhal et al. introduced MultiMedQA, a benchmark combining six existing medical question-answering datasets and a new dataset, HealthSearchQA, to assess LLMs' performance. The study used PaLM and its instruction-tuned variant, Flan-PaLM, achieving state-of-the-art accuracy on multiple-choice datasets.

GitHub: https://github.com/kyegomez/Med-PaLM

mEthAE

The mEthAE model was trained to provide interpretable dimensionality reduction for high-dimensional DNA methylation data, addressing the challenge of understanding CpG site relationships. Developed by Sonja Katz et al., this chromosome-wise autoencoder significantly reduces data dimensions—up to 400-fold—while maintaining reconstruction accuracy and predictive power.

MuLan-Methyl

The MuLan-Methyl model was trained to predict DNA methylation sites using an ensemble of five transformer-based language models: BERT, DistilBERT, ALBERT, XLNet, and ELECTRA. It targets three types of DNA methylation: N6-adenine (6mA), N4-cytosine (4mC), and 5-hydroxymethylcytosine (5hmC). The model leverages a custom corpus of DNA fragments and taxonomic lineages, pre-trained with self-supervised learning and fine-tuned for specific methylation tasks.

GitHub: https://github.com/husonlab/mulan-methyl
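
One common way to combine several classifiers is to average their predicted probabilities and threshold the result. The sketch below shows that pattern for five models. It is an assumption for illustration: the summary above does not say how MuLan-Methyl combines its five language models, and the function names, threshold, and probabilities here are made up.

```python
from statistics import mean
from typing import Dict


def ensemble_probability(per_model_probs: Dict[str, float]) -> float:
    """Average the methylation-site probability predicted by each model."""
    return mean(list(per_model_probs.values()))


def ensemble_call(per_model_probs: Dict[str, float], threshold: float = 0.5) -> bool:
    """Call a site methylated when the averaged probability reaches the threshold."""
    return ensemble_probability(per_model_probs) >= threshold


if __name__ == "__main__":
    # Hypothetical per-model outputs for one candidate 6mA site.
    site = {"BERT": 0.9, "DistilBERT": 0.8, "ALBERT": 0.7, "XLNet": 0.6, "ELECTRA": 0.5}
    print(ensemble_probability(site), ensemble_call(site))
```

Averaging tends to smooth out the idiosyncratic errors of any single model, which is the usual motivation for ensembling models with different architectures.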

Nucleotide Transformer

The Nucleotide Transformer model was trained to predict molecular phenotypes from DNA sequences, addressing challenges in human genomics. Developed by Hugo Dalla-Torre et al., it leverages deep learning to generate context-specific nucleotide representations from 3,202 human genomes and 850 multispecies genomes. The model excels in identifying genomic patterns and predicting gene expression, outperforming specialized methods in up to 15 out of 18 tasks after fine-tuning.

GitHub: https://github.com/instadeepai/nucleotide-transformer

Pathformer

The Pathformer model was trained to integrate multi-omics data for disease diagnosis and prognosis, particularly in cancer, addressing limitations in interpretability and utilization of prior biological knowledge. Developed by Xiaofan Liu and colleagues, it employs a pathway-based sparse neural network and criss-cross attention mechanism to enhance prediction accuracy and interpretability. Pathformer demonstrated superior performance in cancer survival, stage, and drug response predictions across multiple datasets, with significant improvements in F1 scores.

ProGen

The ProGen model was trained to investigate the role and characteristics of axonal mitochondria in cortical pyramidal neurons (CPNs) of the mammalian central nervous system (CNS). Contrary to the traditional view that mitochondria are essential for ATP production, the study reveals that most axonal mitochondria in CPNs lack mitochondrial DNA (mtDNA) and instead consume ATP.

GitHub: https://github.com/salesforce/progen

Prov-GigaPath

The Prov-GigaPath model was trained to address computational challenges in digital pathology by leveraging a vast and diverse dataset of 1.3 billion pathology image tiles from 171,189 whole slides, sourced from over 30,000 patients across 31 tissue types within the Providence health network. Utilizing the GigaPath vision transformer architecture and the LongNet method, the model excels in long-context modeling of gigapixel slides, capturing both local and global pathological patterns.

GitHub: https://github.com/prov-gigapath/prov-gigapath

Puffin

The Puffin model was trained to elucidate the mechanisms of transcription initiation in the human genome at basepair resolution. Developed by Kseniia Dudnyk, Chenlai Shi, and Jian Zhou from the University of Texas Southwestern Medical Center, Puffin uses deep learning to predict transcription initiation signals from sequence data, identifying key sequence patterns with distinct position-specific effects. The model integrates data from multiple techniques and outperforms existing methods in predicting transcription initiation.

RiskPredictionNet

The RiskPredictionNet model was trained to predict continuous pain intensity using machine learning (ML) models and electroencephalographic (EEG) data, as investigated by Tyler Mari et al. in their study. The research aimed to address the gap in externally validated ML models for pain assessment, particularly for continuous pain prediction on a 101-point scale. The study involved 91 subjects who underwent pneumatic pressure stimuli, with EEG data collected and preprocessed for model training.

scBERT

The scBERT model was trained to enhance cell type annotation in single-cell RNA-seq (scRNA-seq) data by overcoming limitations of traditional methods, such as reliance on curated marker gene lists and batch effects. Inspired by the BERT model in natural language processing, scBERT pretrains on vast unlabelled scRNA-seq data to learn gene-gene interactions and fine-tunes on specific datasets for improved generalizability.

GitHub: https://github.com/TencentAILabHealthcare/scBERT

scBOL

The scBOL model was trained to address the challenges in single-cell and spatial transcriptomics data, particularly the identification of novel cell types not present in reference data. Developed by Yuyao Zhai, Liang Chen, and Minghua Deng, scBOL employs an end-to-end algorithm based on Bipartite prototype alignment to improve cell type identification.

scELMo

The scELMo model leverages Large Language Models (LLMs) for single-cell data analysis, transforming sequencing data into meaningful text descriptions and embeddings. It addresses tasks such as cell clustering, batch effect correction, and cell-type annotation without requiring new model training. scELMo uses GPT 3.5 for generating embeddings, demonstrating effectiveness in clustering, annotation, therapeutic target identification, and perturbation analysis.

GitHub: https://github.com/HelloWorldLTY/scELMo

scFormer

The scFormer model was trained to optimize cell and gene embeddings for single-cell RNA sequencing (scRNA-seq) data using a transformer-based deep learning framework. It employs self-attention mechanisms to capture complex relationships between cells and genes through masked gene modeling. This unsupervised approach enables scFormer to excel in various downstream tasks, including data integration, gene function analysis, and perturbation response prediction.

GitHub: https://github.com/bowang-lab/scFormer
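
Masked gene modeling follows the same recipe as masked language modeling: hide some values, then train the model to predict them from the rest. The sketch below shows only the data-preparation step. It is not code from the scFormer repository, and the function name, mask sentinel, mask fraction, and expression values are all assumptions for illustration.

```python
import random
from typing import Dict, Optional, Tuple

MASK = None  # hypothetical sentinel standing in for a hidden expression value


def mask_genes(
    expression: Dict[str, float],
    mask_fraction: float = 0.15,
    seed: int = 0,
) -> Tuple[Dict[str, Optional[float]], Dict[str, float]]:
    """Hide a random subset of genes in one cell's expression profile.

    Returns (inputs, targets): `inputs` is the profile with hidden values
    replaced by MASK, and `targets` holds the true values the model
    would be trained to recover.
    """
    genes = sorted(expression)
    if not genes:
        return {}, {}
    rng = random.Random(seed)  # seeded so the example is reproducible
    n_mask = min(len(genes), max(1, round(mask_fraction * len(genes))))
    hidden = set(rng.sample(genes, n_mask))
    inputs = {g: (MASK if g in hidden else v) for g, v in expression.items()}
    targets = {g: expression[g] for g in hidden}
    return inputs, targets


if __name__ == "__main__":
    cell = {"CD3E": 5.1, "CD19": 0.0, "GAPDH": 9.3, "MS4A1": 0.2}  # made-up values
    inputs, targets = mask_genes(cell, mask_fraction=0.5)
    print(inputs)
    print(targets)
```

Because the targets come from the data itself, no cell-type labels are needed, which is what makes the approach unsupervised.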

scFoundation

The scFoundation model was trained to analyze single-cell transcriptomics data using a large-scale transformer-based architecture called xTrimoGene. It contains 100 million parameters and was trained on over 50 million human single-cell RNA sequencing profiles. The model aims to decipher complex molecular features of cells and is evaluated on tasks such as gene expression enhancement, drug response prediction, single-cell drug response classification, and perturbation prediction.

GitHub: https://github.com/biomap-research/scFoundation

scGPT

The scGPT model was trained to advance single-cell multi-omics research by leveraging generative AI and transformer architecture. It utilizes a vast dataset of over 33 million human cells to perform tasks such as cell type annotation, multi-batch and multi-omic integration, genetic perturbation response prediction, and gene network inference. The model excels in accurately representing cell types, predicting genetic perturbations, and integrating diverse datasets.

GitHub: https://github.com/bowang-lab/scGPT

SCimilarity

The SCimilarity model was trained to create a unified and interpretable representation of single-cell RNA-seq (scRNA-seq) data, facilitating efficient annotation and querying of cell states across a vast dataset of 22.7 million cells from 399 studies. Developed by Graham Heimberg et al., this metric learning framework addresses the challenge of integrating and querying rapidly growing scRNA-seq data by learning a low-dimensional representation that clusters similar cells together.

GitHub: https://github.com/Genentech/scimilarity
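
Once cells live in a shared low-dimensional space, querying reduces to a nearest-neighbor search. The sketch below ranks reference cells by cosine similarity to a query embedding. It does not use the SCimilarity package or its API: the function names, the two-dimensional embeddings, and the choice of cosine similarity are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query(reference: Dict[str, Sequence[float]], cell: Sequence[float], k: int = 1) -> List[str]:
    """Return the labels of the k reference cells most similar to `cell`."""
    ranked = sorted(reference.items(), key=lambda item: cosine(item[1], cell), reverse=True)
    return [label for label, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical 2-D embeddings; a real embedding has many more dimensions.
    atlas = {"T cell": [1.0, 0.0], "B cell": [0.0, 1.0], "NK cell": [0.8, 0.3]}
    print(query(atlas, [0.9, 0.1], k=2))
```

A linear scan like this is fine for a handful of cells. At the scale of millions of cells, an approximate nearest-neighbor index would be the usual replacement.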

Sent-e-Med and LLaMA2-EHR

The Sent-e-Med and LLaMA2-EHR models were trained to enhance clinical risk prediction using structured EHR data by leveraging textual descriptions of medical codes. These models demonstrated superior performance, particularly in scenarios with limited data, by generalizing across different medical vocabularies and datasets.

TransferChrome

The TransferChrome model was trained to predict gene expression from histone modifications, leveraging a densely connected convolutional network and self-attention layers for feature extraction and aggregation. Developed by Yuchi Chen, Minzhu Xie, and Jie Wen, the model employs transfer learning to enhance prediction accuracy across different cell lines.

TULIP

The TULIP model was trained to predict the binding between T-cell receptors (TCR) and epitopes using an unsupervised learning approach, addressing limitations in current methods such as data scarcity and biases from negative training data. It leverages incomplete data and integrates various data sources, outperforming state-of-the-art models in recognizing TCRs binding to unseen epitopes.

UMedPT

The UMedPT model was trained to address data scarcity in biomedical imaging by leveraging a multi-task learning strategy that decouples training tasks from memory requirements. It efficiently handles various imaging modalities and labeling strategies, such as classification, segmentation, and object detection. UMedPT outperformed ImageNet pretraining and state-of-the-art models, maintaining high performance with significantly less training data.

XA4C

The XA4C model was trained to enhance the interpretability of autoencoders in gene expression analysis by identifying "Critical genes" that significantly contribute to the learned representations. Utilizing SHapley Additive exPlanations (SHAP), XA4C quantifies each gene's contribution to latent variables, prioritizing genes for pathway enrichment and connectivity analysis.

GitHub: https://github.com/QingrunZhangLab/XA4C
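
The quantity behind SHAP is the Shapley value: a feature's average marginal contribution over all subsets of the other features. The sketch below computes it exactly for a tiny toy example. It is not the SHAP library's API and not code from XA4C: the function names, gene names, and weights are assumptions, and exact enumeration only works for a handful of features, which is why practical tools approximate it.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    features: List[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley value of each feature for a set-valued function `value`."""
    n = len(features)
    result = {}
    for feature in features:
        others = [f for f in features if f != feature]
        phi = 0.0
        for size in range(n):
            # Weight of every subset of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi += weight * (value(s | {feature}) - value(s))
        result[feature] = phi
    return result


# Toy stand-in for "contribution to a latent variable": an additive score.
TOY_WEIGHTS = {"geneA": 2.0, "geneB": 1.0, "geneC": 0.0}


def toy_latent(present: FrozenSet[str]) -> float:
    """Hypothetical latent value when only the genes in `present` are included."""
    return sum(TOY_WEIGHTS[g] for g in present)


if __name__ == "__main__":
    print(shapley_values(list(TOY_WEIGHTS), toy_latent))
```

For an additive score like this one, each gene's Shapley value equals its own weight, so geneC gets zero credit. Ranking genes by the magnitude of such values is one way to arrive at a shortlist of high-contribution genes.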

xTrimoGene

The xTrimoGene model was trained to efficiently represent single-cell RNA sequencing (scRNA-seq) data, addressing the computational and memory challenges posed by its vast and sparse nature. It employs an asymmetric encoder-decoder transformer architecture that significantly reduces computational load while maintaining high accuracy. Key use cases include cell type annotation, perturbation effect prediction, and drug combination prediction.

Daniel Koster, PhD

VP Product, Code Ocean
