PROTOplast
Accelerate your scRNA ML training.

A lightweight, open-source Python library for fast data loading, cloud-native workflows, and scalable ML training, letting you harness massive-scale datasets like Tahoe-100M and X-Atlas/Orion.

Challenges

The Problems We're Solving.

Working with molecular data at scale presents unique challenges that traditional ML pipelines weren't designed to handle:

Data Management

Staging data adds overhead
AnnData reads from local file paths, requiring data to be copied to the compute instance before analysis can begin.
Loading data is time consuming
Large scRNA datasets remain slow to load—often hours to days—even on cloud or HPC systems.
Densification is costly
Sparse matrices keep scRNA storage compact, but most ML operations require densifying them, which inflates memory and compute costs.
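A back-of-the-envelope calculation makes the densification cost concrete. The matrix shape and density below are illustrative assumptions (roughly Tahoe-100M-sized), not measured properties of any specific dataset:

```python
# Illustrative cost of densifying a large scRNA expression matrix:
# ~100M cells x ~20k genes, assuming ~5% of entries are non-zero.
n_cells, n_genes, density = 100_000_000, 20_000, 0.05

# Dense float32: every entry stored, 4 bytes each.
dense_bytes = n_cells * n_genes * 4

# CSR sparse: a 4-byte value + 4-byte column index per non-zero,
# plus an 8-byte row pointer per row.
nnz = int(n_cells * n_genes * density)
csr_bytes = nnz * (4 + 4) + (n_cells + 1) * 8

print(f"dense: {dense_bytes / 1e12:.1f} TB")  # ~8.0 TB
print(f"CSR:   {csr_bytes / 1e12:.1f} TB")    # ~0.8 TB
```

At these assumed sizes, densifying inflates storage roughly tenfold, which is why naive pipelines run out of memory long before the science does.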

Scalability

Memory is constrained
Bottlenecks occur when the size of the data exceeds the amount of physical memory available on a machine
Cluster management is complex
Managing distributed workloads across multiple workers requires specialized expertise
Code environments are fragmented
Rewriting entire analysis pipelines is often necessary when scaling to cluster environments.
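The usual remedy for memory-bound workloads is out-of-core batching: touch only one slice of rows at a time instead of materializing the full matrix. A minimal, framework-agnostic sketch of that pattern (not PROTOplast's internal implementation):

```python
def batch_slices(n_rows: int, batch_size: int):
    """Yield (start, stop) row ranges so only one batch is ever resident in memory."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# 100M cells streamed as 10k-cell batches: 10,000 slices, constant memory.
slices = list(batch_slices(100_000_000, 10_000))
print(len(slices), slices[0], slices[-1])
```

Distributed loaders build on the same idea, sharding these ranges across workers so no single machine ever needs the whole dataset.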

How PROTOplast helps

PROTOplast was built to remove these bottlenecks

1300X faster I/O than standard AnnData: training 1 epoch on the entire Tahoe-100M dataset once required 22.5 days; with PROTOplast it takes only 14.5 minutes on a 4x NVIDIA L40S instance, transforming large-scale ML training from a bottleneck into a routine step.
Workflow             Elapsed       # of workers
AnnLoader (AnnData)  22.5 days     12
PROTOplast           14.5 minutes  12

* The benchmark was timed on 1 epoch of a 2-layer MLP classifier on 4 NVIDIA L40S GPUs (see benchmarking scripts).

Seamless integration: Easily plug in your secret sauce by subclassing PyTorch Lightning's LightningModule. This keeps full compatibility with the PyTorch ecosystem while giving you the flexibility to build specialized models for your molecular and single-cell data.
from state.tx.models.embed_sum import EmbedSumPerturbationModel
from protoplast import RayTrainRunner

trainer = RayTrainRunner(
    EmbedSumPerturbationModel,
    ...
)
Read the tutorial
Native cloud integration: stream data directly from remote storage (S3, GCS, Azure) without intermediate downloads.
trainer.train([
    "s3://collaborator-1/cohort_1.h5ad",
    "gcs://collaborator-2/cohort_2.h5ad",
    "adl://collaborator-3/cohort_3.h5ad",
    "dnanexus://project-xxx:/cohort_4.h5ad",
], ...)
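Because each input is just a URI, you can sanity-check schemes before launching a long-running job. The helper below is a hypothetical convenience for illustration, not part of PROTOplast's API, and the supported-scheme set is assumed from the example above:

```python
from urllib.parse import urlparse

# Schemes from the example above, plus "" for plain local paths (assumption).
SUPPORTED_SCHEMES = {"s3", "gcs", "adl", "dnanexus", ""}

def check_paths(paths):
    """Fail fast on unsupported URI schemes before kicking off training."""
    for p in paths:
        scheme = urlparse(p).scheme
        if scheme not in SUPPORTED_SCHEMES:
            raise ValueError(f"unsupported scheme {scheme!r} in {p}")
    return paths

check_paths(["s3://collaborator-1/cohort_1.h5ad", "/data/local.h5ad"])
```

Failing here costs milliseconds; failing an hour into a distributed training run costs a cluster reservation.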
With PROTOplast, you can use 4 NVIDIA L40S GPUs to train 1 epoch on the entire Tahoe-100M dataset in 14.5 minutes, something that was previously infeasible.

Quick Start

(It’s simple)


Installation guide:

pip install protoplast

A minimal end-to-end code example:

from protoplast import RayTrainRunner, DistributedCellLineAnnDataset, LinearClassifier
import glob

trainer = RayTrainRunner(
    LinearClassifier,  # replace with your own model
    DistributedCellLineAnnDataset,  # replace with your own Dataset
    ["num_genes", "num_classes"],  # change according to what your model needs
)

file_paths = glob.glob("/data/tahoe100/*.h5ad")
trainer.train(file_paths)

That’s it — no extra code, no tuning. PROTOplast automatically scales across GPUs, nodes, or clusters.


Resources

{ 1 } Examples
Training perturbation prediction models on scRNA-seq data.

Advancing precision in drug and gene response modeling

Use with any classification models

Seamless integration with external and custom models

Create a submission to the Virtual Cell Challenge

Step-by-step guide to packaging and submitting your model for evaluation

{ 2 } Get started

Related Blog Posts

Perturbation effect is not an on-off switch
{Sci-tech}
5 mins read

In this blog, we examine how the “perturbation effect” can vary depending on the metrics used to define it, and why these differences matter. While these metrics may appear interchangeable, they often capture fundamentally different aspects of the underlying biology. As Perturb-seq datasets continue to grow exponentially, understanding how perturbation effects are measured becomes critical for reliable downstream analysis. When suppression is not an on-off switch In 2025, Nadig and colleagues

Virtual Cell: It Might Start From The Mean
{Insights}
{Sci-tech}
3 mins read

The Virtual Cell is a concept at the intersection of computational biology and systems science. At its core, it aims to represent a predictive model of how a living cell responds to internal and external cues. In its ideal form, a Virtual Cell captures every molecular detail - dynamic proteins, metabolic fluxes, physical interactions, and more. Building such a fully mechanistic model remains beyond current computational capabilities. A more tractable approximation has emerged over the past deca

A Note on Parquet-based scRNA ML Pipelines
{Insights}
{Sci-tech}
2 mins read

Single-cell RNA sequencing (scRNA-seq) is revolutionizing our understanding of cellular biology, but the computational challenges of processing these massive datasets continue to evolve. As datasets grow from thousands to millions of cells, the choice of data format and processing pipeline becomes critical.  Parquet files, with their columnar storage and excellent compression ratios, seem like a natural fit for intermediate data storage in machine learning workflows. In a previous blog post, we

More articles


Have an idea?
Drop us a line