PROTOplast
Accelerate your scRNA ML training.

A lightweight, open-source Python library for fast data loading, cloud-native workflows, and scalable ML training, letting you harness massive-scale datasets like Tahoe-100M and X-Atlas/Orion.

Challenges

The Problems We're Solving.

Working with molecular data at scale presents unique challenges that traditional ML pipelines weren't designed to handle:

Data Management

Staging data adds overhead
AnnData reads from local file paths, requiring data to be copied to the compute instance before analysis can begin.
Loading data is time consuming
Large scRNA datasets remain slow to load—often hours to days—even on cloud or HPC systems.
Densification is costly
Sparse matrices keep scRNA storage compact, but most ML operations require densifying them, which inflates memory and compute costs.
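A back-of-the-envelope calculation makes the densification cost concrete. The matrix shape and density below are illustrative assumptions (roughly Tahoe-100M-sized), not measured properties of any specific dataset:

```python
# Illustrative cost of densifying a large scRNA expression matrix:
# ~100M cells x ~20k genes, assuming ~5% of entries are non-zero.
n_cells, n_genes, density = 100_000_000, 20_000, 0.05

# Dense float32: every entry stored, 4 bytes each.
dense_bytes = n_cells * n_genes * 4

# CSR sparse: a 4-byte value + 4-byte column index per non-zero,
# plus an 8-byte row pointer per row.
nnz = int(n_cells * n_genes * density)
csr_bytes = nnz * (4 + 4) + (n_cells + 1) * 8

print(f"dense: {dense_bytes / 1e12:.1f} TB")  # ~8.0 TB
print(f"CSR:   {csr_bytes / 1e12:.1f} TB")    # ~0.8 TB
```

At these assumed sizes, densifying inflates storage roughly tenfold, which is why naive pipelines run out of memory long before the science does.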

Scalability

Memory is constrained
Bottlenecks occur when the size of the data exceeds the amount of physical memory available on a machine
Cluster management is complex
Managing distributed workloads across multiple workers requires specialized expertise
Code environments are fragmented
Rewriting entire analysis pipelines is often necessary when scaling to cluster environments.
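The usual remedy for memory-bound workloads is out-of-core batching: touch only one slice of rows at a time instead of materializing the full matrix. A minimal, framework-agnostic sketch of that pattern (not PROTOplast's internal implementation):

```python
def batch_slices(n_rows: int, batch_size: int):
    """Yield (start, stop) row ranges so only one batch is ever resident in memory."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# 100M cells streamed as 10k-cell batches: 10,000 slices, constant memory.
slices = list(batch_slices(100_000_000, 10_000))
print(len(slices), slices[0], slices[-1])
```

Distributed loaders build on the same idea, sharding these ranges across workers so no single machine ever needs the whole dataset.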

How PROTOplast helps

PROTOplast was built to remove these bottlenecks

1300X faster I/O than standard AnnData: training 1 epoch on the entire Tahoe-100M dataset once required 22.5 days; with PROTOplast it takes only 14.5 minutes on a 4x NVIDIA L40S instance, transforming large-scale ML training from a bottleneck into a routine step.
Workflow             Elapsed       # of workers
AnnLoader (AnnData)  22.5 days     12
PROTOplast           14.5 minutes  12

* The benchmark was timed on 1 epoch of a 2-layer MLP classifier on 4 NVIDIA L40S GPUs (see benchmarking scripts).

Seamless integration: Easily plug in your secret sauce by subclassing PyTorch Lightning's LightningModule. This keeps full compatibility with the PyTorch ecosystem while giving you the flexibility to build specialized models for your molecular and single-cell data.
from state.tx.models.embed_sum import EmbedSumPerturbationModel
from protoplast import RayTrainRunner

trainer = RayTrainRunner(
    EmbedSumPerturbationModel,
    ...
)
Read the tutorial
Native cloud integration: stream data directly from remote storage (S3, GCS, Azure) without intermediate downloads.
trainer.train([
    "s3://collaborator-1/cohort_1.h5ad",
    "gcs://collaborator-2/cohort_2.h5ad",
    "adl://collaborator-3/cohort_3.h5ad",
    "dnanexus://project-xxx:/cohort_4.h5ad",
], ...)
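Because each input is just a URI, you can sanity-check schemes before launching a long-running job. The helper below is a hypothetical convenience for illustration, not part of PROTOplast's API, and the supported-scheme set is assumed from the example above:

```python
from urllib.parse import urlparse

# Schemes from the example above, plus "" for plain local paths (assumption).
SUPPORTED_SCHEMES = {"s3", "gcs", "adl", "dnanexus", ""}

def check_paths(paths):
    """Fail fast on unsupported URI schemes before kicking off training."""
    for p in paths:
        scheme = urlparse(p).scheme
        if scheme not in SUPPORTED_SCHEMES:
            raise ValueError(f"unsupported scheme {scheme!r} in {p}")
    return paths

check_paths(["s3://collaborator-1/cohort_1.h5ad", "/data/local.h5ad"])
```

Failing here costs milliseconds; failing an hour into a distributed training run costs a cluster reservation.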
With PROTOplast, you can use 4 NVIDIA L40S GPUs to train 1 epoch on the entire Tahoe-100M dataset in 14.5 minutes, something that was previously infeasible.

Quick Start

(It’s simple)


Installation guide:

pip install protoplast

A minimal end-to-end code example:

from protoplast import RayTrainRunner, DistributedCellLineAnnDataset, LinearClassifier
import glob

trainer = RayTrainRunner(
    LinearClassifier,  # replace with your own model
    DistributedCellLineAnnDataset,  # replace with your own Dataset
    ["num_genes", "num_classes"],  # change according to what your model needs
)

file_paths = glob.glob("/data/tahoe100/*.h5ad")
trainer.train(file_paths)

That’s it — no extra code, no tuning. PROTOplast automatically scales across GPUs, nodes, or clusters.


Resources

{ 1 } Examples
Training perturbation prediction models on scRNA-seq data.

Advancing precision in drug and gene response modeling

Use with any classification models

Seamless integration with external and custom models

Create a submission to the Virtual Cell Challenge

Step-by-step guide to packaging and submitting your model for evaluation

{ 2 } Get started

Related Blog Posts

Perturbation effect is not an on-off switch
{Sci-tech}
5 mins read

In this blog, we examine how the “perturbation effect” can vary depending on the metrics used to define it, and why these differences matter. While these metrics may appear interchangeable, they often capture fundamentally different aspects of the underlying biology. As Perturb-seq datasets continue to grow exponentially, understanding how perturbation effects are measured becomes critical for reliable downstream analysis. When suppression is not an on-off switch In 2025, Nadig and colleagues

Virtual Cell: It Might Start From The Mean
{Insights}
{Sci-tech}
3 mins read

The Virtual Cell is a concept at the intersection of computational biology and systems science. At its core, it aims to represent a predictive model of how a living cell responds to internal and external cues. In its ideal form, a Virtual Cell captures every molecular detail - dynamic proteins, metabolic fluxes, physical interactions, and more. Building such a fully mechanistic model remains beyond current computational capabilities. A more tractable approximation has emerged over the past deca

A Note on Parquet-based scRNA ML Pipelines
{Insights}
{Sci-tech}
2 mins read

Single-cell RNA sequencing (scRNA-seq) is revolutionizing our understanding of cellular biology, but the computational challenges of processing these massive datasets continue to evolve. As datasets grow from thousands to millions of cells, the choice of data format and processing pipeline becomes critical.  Parquet files, with their columnar storage and excellent compression ratios, seem like a natural fit for intermediate data storage in machine learning workflows. In a previous blog post, we

More articles


Have an idea?
Drop us a line