PROTOplast

{Developer Preview}

We've accelerated the timeline for this early preview to support contestants in the Arc Institute's Virtual Cell Challenge (VCC), recognizing the immediate need for robust tools in this rapidly evolving field. If you are participating in the VCC, try out PROTOplast and share your experience and suggestions for how we can better support your ML training process!

PROTOplast
Accelerate your scRNA ML training.

A lightweight, open-source Python library for fast data loading, cloud-native workflows, and scalable ML training, built to harness massive datasets like Tahoe-100M and X-Atlas/Orion.

Challenges

The Problems We're Solving.

Working with molecular data at scale presents unique challenges that traditional ML pipelines weren't designed to handle:

Data Management

Staging data adds overhead
AnnData reads from local file paths, so data must be copied to the compute instance before analysis can begin.
Loading data is time consuming
Large scRNA datasets remain slow to load, often taking hours to days, even on cloud or HPC systems.
Densification is costly
Sparse matrices keep scRNA storage compact, but most ML operations require densifying them, which multiplies memory use (see the sketch after this list).
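
A back-of-the-envelope sketch of that densification cost, using illustrative sizes rather than any particular dataset:

# Illustrative sizes: 1M cells x 20k genes at ~5% non-zero entries,
# in the typical range for scRNA count matrices.
n_cells, n_genes, density = 1_000_000, 20_000, 0.05

# CSR sparse storage: one float32 value and one int32 column index per
# non-zero entry, plus one int64 row pointer per cell.
nnz = int(n_cells * n_genes * density)
sparse_bytes = nnz * (4 + 4) + (n_cells + 1) * 8

# Dense storage after densification: every entry held as a float32.
dense_bytes = n_cells * n_genes * 4

print(f"sparse: ~{sparse_bytes / 1e9:.0f} GB")  # ~8 GB
print(f"dense:  ~{dense_bytes / 1e9:.0f} GB")   # ~80 GB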

Scalability

Memory is constrained
Bottlenecks occur when the size of the data exceeds the physical memory available on a single machine (the common workaround, AnnData's backed mode, is sketched after this list).
Cluster management is complex
Distributing workloads across multiple workers requires specialized infrastructure expertise.
Code environments are fragmented
Scaling to a cluster often means rewriting entire analysis pipelines for a different framework.
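
For context, the workaround most pipelines reach for today is AnnData's backed mode, sketched below with a hypothetical file path. It relieves memory pressure, but it still assumes a local file and leaves the staging and loading problems above unsolved.

import anndata as ad

# Open the .h5ad in backed mode so X stays on disk instead of being
# loaded into RAM up front. (The path is hypothetical; backed mode
# still requires a local filesystem.)
adata = ad.read_h5ad("/data/tahoe100/plate1.h5ad", backed="r")

print(adata.shape)                  # dimensions known without loading X
batch = adata[:4096].to_memory()    # only this slice is materialized in RAM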

How PROTOplast helps

PROTOplast was built to remove these bottlenecks:

1300X faster I/O than standard AnnData: training 1 epoch on the entire Tahoe-100M dataset drops from 22.5 days to 14.5 minutes on a single instance with 4 NVIDIA L40S GPUs, turning large-scale ML training from a bottleneck into a routine step.
Workflow              Elapsed       # of workers
AnnLoader (AnnData)   22.5 days     12
PROTOplast            14.5 minutes  12

* The benchmark was timed on 1 epoch of a 2-layer MLP classifier on 4 NVIDIA L40S GPUs (see benchmarking scripts).

Seamless integration: Easily plug in your secret sauce by subclassing PyTorch Lightning's LightningModule. This keeps full compatibility with the PyTorch ecosystem while giving you the flexibility to build specialized models for your molecular and single-cell data.
from state.tx.models.embed_sum import EmbedSumPerturbationModel
from protoplast import RayTrainRunner

trainer = RayTrainRunner(
    EmbedSumPerturbationModel,  # any LightningModule subclass
    ...
)
Read the tutorial
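
As an illustration, here is a minimal sketch of such a subclass. The class name, architecture, batch layout, and hyperparameters are assumptions for the example, not PROTOplast requirements:

import torch
import torch.nn.functional as F
from lightning.pytorch import LightningModule

class MyClassifier(LightningModule):
    """Hypothetical example model; swap in your own architecture."""

    def __init__(self, num_genes: int, num_classes: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(num_genes, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, num_classes),
        )

    def training_step(self, batch, batch_idx):
        # Assumes each batch arrives as an (expression, label) pair.
        x, y = batch
        loss = F.cross_entropy(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

Because this is a plain LightningModule, it can be passed to RayTrainRunner in place of EmbedSumPerturbationModel above.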
Native cloud integration: stream data directly from remote storage (S3, GCS, Azure, DNAnexus) without intermediate downloads.
trainer.train([
    "s3://collaborator-1/cohort_1.h5ad",
    "gcs://collaborator-2/cohort_2.h5ad",
    "adl://collaborator-3/cohort_3.h5ad",
    "dnanexus://project-xxx:/cohort_4.h5ad",
], ...)
With PROTOplast, 4 NVIDIA L40S GPUs can train 1 epoch on the entire Tahoe-100M dataset in 14.5 minutes, something that was previously infeasible.

Quick Start

(It’s simple)


Installation guide:

pip install protoplast

A minimal end-to-end example:

import glob

from protoplast import RayTrainRunner, DistributedCellLineAnnDataset, LinearClassifier

trainer = RayTrainRunner(
    LinearClassifier,  # replace with your own model
    DistributedCellLineAnnDataset,  # replace with your own Dataset
    ["num_genes", "num_classes"],  # change according to what your model needs
)

file_paths = glob.glob("/data/tahoe100/*.h5ad")
trainer.train(file_paths)

That’s it — no extra code, no tuning. PROTOplast automatically scales across GPUs, nodes, or clusters.


Resources

{ 1 } Examples
Training perturbation prediction models on scRNA-seq data.

Advancing precision in drug and gene response modeling

Use with any classification model

Seamless integration with external and custom models

Create a submission to the Virtual Cell Challenge

Step-by-step guide to packaging and submitting your model for evaluation

{ 2 } Get started

Related Blog Posts

Introducing PROTOplast: Scalable Machine Learning for Molecular Data Analysis
{News}
{scRNA-seq}
{PROTOplast}
3 mins read

We're excited to announce the early developer preview of PROTOplast, our new Python library designed for fast, scalable analysis of molecular data. PROTOplast addresses the unique challenges of working with large-scale molecular datasets while maintaining the flexibility needed for cutting-edge research. What is PROTOplast? PROTOplast is an open-source Python library, released under the Apache License 2.0, that bridges the gap between molecular data analysis and modern machine learning infrastructure…

A Note on Parquet-based scRNA ML Pipelines
{Insight}
{scRNA-seq}
2 mins read

Single-cell RNA sequencing (scRNA-seq) is revolutionizing our understanding of cellular biology, but the computational challenges of processing these massive datasets continue to evolve. As datasets grow from thousands to millions of cells, the choice of data format and processing pipeline becomes critical. Parquet files, with their columnar storage and excellent compression ratios, seem like a natural fit for intermediate data storage in machine learning workflows. In a previous blog post, we…

Tahoe-100M in Practice: Workflows, Pitfalls, and Pathways to Scalable scRNA Analysis
{scRNA-seq}
{Insight}
9 mins read

Single-cell transcriptomics (scRNA) studies now profile millions of cells, revealing identity, state, and tissue heterogeneity, and creating unprecedented opportunities to extract biological insights that would be invisible in smaller studies. Tahoe-100M, a groundbreaking resource hosted by the Arc Institute that contains 100 million cells covering 379 distinct drugs and 50 cancer cell lines, is one such study. At Tahoe-100M scale, however, even routine queries pose significant computational challenges…

More articles


Have an idea?
Drop us a line