PROTOplast

{Developer Preview}

We've accelerated the timeline for this early preview to support contestants in the Arc Institute's Virtual Cell Challenge (VCC), recognizing the immediate need for robust tools in this rapidly evolving field. If you are participating in the VCC, try out PROTOplast and share your experience and suggestions for how we can better support your ML training process!

PROTOplast
Accelerate your scRNA ML training.

A lightweight, open-source Python library for fast data loading, cloud-native workflows, and scalable ML training, built to harness massive datasets like Tahoe-100M and X-Atlas/Orion.

Challenges

The Problems We're Solving.

Working with molecular data at scale presents unique challenges that traditional ML pipelines weren't designed to handle:

Data Management

Staging data adds overhead
AnnData reads from local file paths, so data must be copied to the compute instance before analysis can begin.
Loading data is time consuming
Large scRNA datasets remain slow to load, often taking hours to days, even on cloud or HPC systems.
Densification is costly
Sparse matrices keep scRNA storage compact, but most ML operations require densifying them, which multiplies memory use (see the sketch after this list).
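
A back-of-the-envelope sketch of that densification cost, using illustrative sizes rather than any particular dataset:

# Illustrative sizes: 1M cells x 20k genes at ~5% non-zero entries,
# in the typical range for scRNA count matrices.
n_cells, n_genes, density = 1_000_000, 20_000, 0.05

# CSR sparse storage: one float32 value and one int32 column index per
# non-zero entry, plus one int64 row pointer per cell.
nnz = int(n_cells * n_genes * density)
sparse_bytes = nnz * (4 + 4) + (n_cells + 1) * 8

# Dense storage after densification: every entry held as a float32.
dense_bytes = n_cells * n_genes * 4

print(f"sparse: ~{sparse_bytes / 1e9:.0f} GB")  # ~8 GB
print(f"dense:  ~{dense_bytes / 1e9:.0f} GB")   # ~80 GB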

Scalability

Memory is constrained
Bottlenecks occur when the size of the data exceeds the physical memory available on a single machine (the common workaround, AnnData's backed mode, is sketched after this list).
Cluster management is complex
Distributing workloads across multiple workers requires specialized infrastructure expertise.
Code environments are fragmented
Scaling to a cluster often means rewriting entire analysis pipelines for a different framework.
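
For context, the workaround most pipelines reach for today is AnnData's backed mode, sketched below with a hypothetical file path. It relieves memory pressure, but it still assumes a local file and leaves the staging and loading problems above unsolved.

import anndata as ad

# Open the .h5ad in backed mode so X stays on disk instead of being
# loaded into RAM up front. (The path is hypothetical; backed mode
# still requires a local filesystem.)
adata = ad.read_h5ad("/data/tahoe100/plate1.h5ad", backed="r")

print(adata.shape)                  # dimensions known without loading X
batch = adata[:4096].to_memory()    # only this slice is materialized in RAM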

How PROTOplast helps

PROTOplast was built to remove these bottlenecks:

1300X faster I/O than standard AnnData: training 1 epoch on the entire Tahoe-100M dataset drops from 22.5 days to 14.5 minutes on a single instance with 4 NVIDIA L40S GPUs, turning large-scale ML training from a bottleneck into a routine step.
Workflow              Elapsed       # of workers
AnnLoader (AnnData)   22.5 days     12
PROTOplast            14.5 minutes  12

* The benchmark was timed on 1 epoch of a 2-layer MLP classifier on 4 NVIDIA L40S GPUs (see benchmarking scripts).

Seamless integration: Easily plug in your secret sauce by subclassing PyTorch Lightning's LightningModule. This keeps full compatibility with the PyTorch ecosystem while giving you the flexibility to build specialized models for your molecular and single-cell data.
from state.tx.models.embed_sum import EmbedSumPerturbationModel
from protoplast import RayTrainRunner

trainer = RayTrainRunner(
    EmbedSumPerturbationModel,  # any LightningModule subclass
    ...
)
Read the tutorial
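
As an illustration, here is a minimal sketch of such a subclass. The class name, architecture, batch layout, and hyperparameters are assumptions for the example, not PROTOplast requirements:

import torch
import torch.nn.functional as F
from lightning.pytorch import LightningModule

class MyClassifier(LightningModule):
    """Hypothetical example model; swap in your own architecture."""

    def __init__(self, num_genes: int, num_classes: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(num_genes, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, num_classes),
        )

    def training_step(self, batch, batch_idx):
        # Assumes each batch arrives as an (expression, label) pair.
        x, y = batch
        loss = F.cross_entropy(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

Because this is a plain LightningModule, it can be passed to RayTrainRunner in place of EmbedSumPerturbationModel above.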
Native cloud integration: stream data directly from remote storage (S3, GCS, Azure, DNAnexus) without intermediate downloads.
trainer.train([
    "s3://collaborator-1/cohort_1.h5ad",
    "gcs://collaborator-2/cohort_2.h5ad",
    "adl://collaborator-3/cohort_3.h5ad",
    "dnanexus://project-xxx:/cohort_4.h5ad",
], ...)
With PROTOplast, 4 NVIDIA L40S GPUs can train 1 epoch on the entire Tahoe-100M dataset in 14.5 minutes, something that was previously infeasible.

Quick Start

(It’s simple)


Installation guide:

pip install protoplast

A minimal end-to-end example:

import glob

from protoplast import RayTrainRunner, DistributedCellLineAnnDataset, LinearClassifier

trainer = RayTrainRunner(
    LinearClassifier,  # replace with your own model
    DistributedCellLineAnnDataset,  # replace with your own Dataset
    ["num_genes", "num_classes"],  # change according to what your model needs
)

file_paths = glob.glob("/data/tahoe100/*.h5ad")
trainer.train(file_paths)

That’s it — no extra code, no tuning. PROTOplast automatically scales across GPUs, nodes, or clusters.


Resources

{ 1 } Examples
Training perturbation prediction models on scRNA-seq data.

Advancing precision in drug and gene response modeling

Use with any classification model

Seamless integration with external and custom models

Create a submission to the Virtual Cell Challenge

Step-by-step guide to packaging and submitting your model for evaluation

{ 2 } Get started

Related Blog Posts

Introducing PROTOplast: Scalable Machine Learning for Molecular Data Analysis
{News}
{scRNA-seq}
{PROTOplast}
3 mins read

We're excited to announce the early developer preview of PROTOplast, our new Python library designed for fast, scalable analysis of molecular data. PROTOplast addresses the unique challenges of working with large-scale molecular datasets while maintaining the flexibility needed for cutting-edge research. What is PROTOplast? PROTOplast is an open-source Python library, released under the Apache License 2.0, that bridges the gap between molecular data analysis and modern machine learning infrastructure…

A Note on Parquet-based scRNA ML Pipelines
{Insight}
{scRNA-seq}
2 mins read

Single-cell RNA sequencing (scRNA-seq) is revolutionizing our understanding of cellular biology, but the computational challenges of processing these massive datasets continue to evolve. As datasets grow from thousands to millions of cells, the choice of data format and processing pipeline becomes critical. Parquet files, with their columnar storage and excellent compression ratios, seem like a natural fit for intermediate data storage in machine learning workflows. In a previous blog post, we…

Tahoe-100M in Practice: Workflows, Pitfalls, and Pathways to Scalable scRNA Analysis
{scRNA-seq}
{Insight}
9 mins read

Single-cell transcriptomics (scRNA) studies now profile millions of cells, revealing identity, state, and tissue heterogeneity, and creating unprecedented opportunities to extract biological insights that would be invisible in smaller studies. Tahoe-100M, a groundbreaking resource hosted by the Arc Institute that contains 100 million cells covering 379 distinct drugs and 50 cancer cell lines, is one such study. At Tahoe-100M scale, however, even routine queries pose significant computational challenges…

More articles


Have an idea?
Drop us a line