
Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis
April 17, 2025
David van Dijk, Assistant Professor, Yale University, and Bryan Perozzi, Research Scientist, Google Research
C2S-Scale explores how best to represent cells and biological information as text, opening up exciting applications for language-driven single-cell analysis with large language models.
Every human is made up of trillions of cells, each with its own function, whether it’s carrying oxygen, fighting infections, or building organs. Even within the same tissue, no two cells are exactly alike. Single-cell RNA sequencing (scRNA-seq) allows us to measure the gene expression of individual cells, revealing what each cell is doing at a given moment.
But there’s a catch: single-cell data are massive, high-dimensional, and hard to interpret. Each cell can be represented by thousands of numbers — its gene expression measurements — which traditionally require specialized tools and models to analyze. This makes single-cell analysis slow, difficult to scale, and limited to expert users.
What if we could turn those thousands of numbers into language that humans and language models can understand? That is, what if we could ask a cell how it's feeling, what it’s doing, or how it might respond to a drug or disease — and get an answer back in plain English? From individual cells to entire tissues, understanding biological systems at this level could transform how we study, diagnose, and treat disease.
Today in "Scaling Large Language Models for Next-Generation Single-Cell Analysis", we’re excited to introduce Cell2Sentence-Scale (C2S-Scale), a family of powerful, open-source large language models (LLMs) trained to “read” and “write” biological data at the single-cell level. In this post, we’ll walk through the basics of single-cell biology, how we transform cells into sequences of words, and how C2S-Scale opens up new possibilities for biological discovery.
From cells to sentences
C2S-Scale transforms each cell’s gene expression profile into a sequence of text, called a “cell sentence”, which lists the most active genes in that cell ordered by their expression level. This makes it possible to apply natural language models, like those used in Google’s Gemini or Gemma models, to scRNA-seq data.

C2S-Scale orders gene names by expression and converts them into natural language “cell sentences”.
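For intuition, here is a minimal Python sketch of that transformation, assuming a simple count vector and gene-name list; the actual Cell2Sentence pipeline applies its own preprocessing and a fixed number of top genes.

```python
import numpy as np

def cell_to_sentence(expression: np.ndarray, gene_names: list, top_k: int = 100) -> str:
    """Rank genes by expression and keep the names of the top_k expressed genes."""
    order = np.argsort(expression)[::-1]  # highest-expressed genes first
    ranked = [gene_names[i] for i in order if expression[i] > 0][:top_k]
    return " ".join(ranked)  # the "cell sentence"

# Toy example with made-up counts; real data would come from an scRNA-seq matrix.
genes = ["CD3D", "MS4A1", "NKG7", "LYZ"]
counts = np.array([5.0, 0.0, 2.0, 9.0])
print(cell_to_sentence(counts, genes))  # -> "LYZ CD3D NKG7"
```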
By using language as the interface, we make single-cell data more accessible, interpretable, and flexible. And because much of biology — like gene names, cell types, and experimental metadata — is already expressed in text, LLMs are a natural fit for processing and understanding this information.
Meet the C2S-Scale model family
C2S-Scale builds on Google's Gemma family of open models, adapting them for biological reasoning through data engineering and carefully designed prompts that integrate cell sentences, metadata, and other relevant biological context. The underlying LLM architecture remains unchanged, allowing C2S-Scale to fully benefit from the infrastructure, scalability, and rich ecosystem built around general-purpose language models. The result is a suite of LLMs trained on over 1 billion tokens from real-world transcriptomic datasets, biological metadata, and scientific literature.
C2S-Scale includes a family of models ranging from 410 million to 27 billion parameters, designed to meet the diverse needs of the research community. Smaller models are more efficient and accessible — they can be fine-tuned or deployed with limited compute, making them ideal for exploratory analyses or resource-constrained environments. Larger models, while more computationally intensive, offer higher performance across a wide range of biological tasks. By releasing this spectrum of model sizes, we empower users to choose the best model for their specific use case, balancing performance, speed, and compute requirements. All models will be made open-source and available for fine-tuning or downstream use.
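As a rough sketch, one of these models could be loaded with the Hugging Face transformers library as below; the model identifier is a placeholder, not necessarily a released name, so check the official C2S-Scale release pages for the exact models available.

```python
# Minimal loading sketch with Hugging Face transformers; the model id is a
# placeholder -- check the official C2S-Scale release for exact names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vandijklab/C2S-Scale-Gemma-2B"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```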
C2S-Scale can respond to diverse input queries for both prediction and generation tasks, enabling conversational single-cell analysis.
What can C2S-Scale do?
Chat with biology: Question answering from single-cell data
Imagine someone asking, “How will this T cell respond to anti-PD-1 therapy (a common cancer immunotherapy)?”
As shown on the left below, C2S-Scale models can answer in natural language, drawing from both the cellular data and biological knowledge they’ve seen during pre-training. This enables conversational analysis, where researchers can interact with their data through natural language in a way that was previously not possible, as shown on the right below.

Question answering performance of C2S-Scale compared with other LLMs.
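Continuing from the loading sketch above and reusing `model` and `tokenizer`, such a query could be posed as a prompt that combines a cell sentence with a question. The prompt format below is illustrative; the released models define their own instruction formats.

```python
# Hypothetical question-answering prompt; not the model's required format.
prompt = (
    "Cell sentence: CD8A CD3D GZMB NKG7 PDCD1 ...\n"
    "Question: How will this T cell respond to anti-PD-1 therapy?\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```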
Interpret data with natural language
C2S-Scale can automatically generate biological summaries of scRNA-seq data at different levels of complexity, from describing the cell types of single cells to generating summaries of entire tissues or experiments. This helps researchers interpret new datasets faster and with greater confidence, even without writing complex code.
Scaling laws in biology
A central finding of our work is that biological language models follow clear scaling laws — performance improves predictably as model size increases. Larger C2S-Scale models consistently outperform smaller ones across a range of biological tasks, from cell type annotation to cell and tissue generation. For dataset interpretation, we observed consistent gains in semantic similarity scores when scaling model size in the parameter-efficient regime. With full fine-tuning, gene overlap percentage in tissue generation significantly improved as model capacity increased to 27 billion parameters. This trend mirrors what’s observed in general-purpose LLMs and underscores a powerful insight: with more data and compute, biological LLMs will keep getting better, opening the door to increasingly sophisticated and generalizable tools for biological discovery.

C2S-Scale performance across prediction and generation tasks improves as model capacity increases. Predictive tasks are measured by BERTScore, a measure of semantic similarity, and generative tasks are measured by the percentage of expressed genes that overlap between generated cells and real cells.
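As an illustration, a gene-overlap score along these lines can be computed as below; the precise definition used in the paper may differ, for example in how many top genes are compared.

```python
def gene_overlap_percent(generated: str, reference: str, top_k: int = 100) -> float:
    """Percent of the reference cell's top genes that also appear in the generated cell sentence."""
    gen_genes = set(generated.split()[:top_k])
    ref_genes = set(reference.split()[:top_k])
    if not ref_genes:
        return 0.0
    return 100.0 * len(gen_genes & ref_genes) / len(ref_genes)

print(gene_overlap_percent("LYZ CD3D NKG7 MS4A1", "LYZ NKG7 CD14 FCGR3A"))  # 50.0
```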
Predicting the future of cells
One of the most exciting applications of C2S-Scale is forecasting how a cell will respond to a perturbation — like a drug, a gene knockout, or exposure to a cytokine. Given a baseline cell sentence and a description of the treatment, the model can generate a new sentence representing the expected gene expression changes.
This ability to simulate cellular behavior in silico could accelerate drug discovery and personalized medicine, and help prioritize experiments before they’re performed in the lab. C2S-Scale represents a major step towards creating realistic “virtual cells", which have been proposed as the next generation of model systems, potentially offering faster, cheaper, and more ethical alternatives to traditional cell lines and animal models.
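Sketched as a prompt, and again reusing the `model` and `tokenizer` from the loading example, a perturbation query might look like the following; the field names and phrasing are assumptions for illustration, not the model's required format.

```python
# Illustrative perturbation-prediction prompt; field names are assumptions.
perturbation_prompt = (
    "Baseline cell sentence: CD8A CD3D GZMK IL7R ...\n"
    "Perturbation: 24-hour treatment with interferon-gamma\n"
    "Predict the perturbed cell sentence:"
)
inputs = tokenizer(perturbation_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# The decoded output is itself a cell sentence, which can be parsed back
# into a ranked gene list for downstream analysis.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```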
Optimizing with reinforcement learning
Just as large language models like Gemini are fine-tuned with reinforcement learning to follow instructions and respond in helpful, human-aligned ways, we apply similar techniques to optimize C2S-Scale models for biological reasoning. By using reward functions designed for semantic text evaluation (e.g., BERTScore), we train C2S-Scale to produce biologically accurate and informative answers that more closely match reference answers in the dataset. This guides the model toward responses that are useful for scientific discovery — especially in complex tasks like modeling therapeutic interventions.
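For example, a BERTScore-based reward could be wired up with the open-source bert-score package roughly as follows; this is a minimal sketch under that assumption, not the exact reward design used to train C2S-Scale.

```python
# Minimal reward sketch using the bert-score package (pip install bert-score);
# the actual reward design for C2S-Scale may differ.
from bert_score import score

def bertscore_reward(candidates, references):
    """BERTScore F1 between model outputs and reference answers, used as a scalar reward."""
    _, _, f1 = score(candidates, references, lang="en", verbose=False)
    return f1.tolist()

rewards = bertscore_reward(
    ["The cell shows an activated cytotoxic T cell phenotype."],
    ["This cell is an activated CD8+ cytotoxic T cell."],
)
```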
Try it yourself
Cell2Sentence models and resources are now available on platforms such as HuggingFace and GitHub. We invite you to explore these tools, experiment with your own single-cell data, and see how far we can go when we teach machines to understand the language of life — one cell at a time.
Acknowledgements
Key contributors to this project include: Syed Rizvi^1,2, Daniel Levine^2, Aakash Patel^2, Shiyang Zhang^2, Eric Wang^3, Sizhuang He^2, David Zhang^2, Cerise Tang^2, Zhuoyang Lyu^4, Rayyan Darji^2, Chang Li^2, Emily Sun^2, David Jeong^2, Lawrence Zhao^2, Jennifer Kwan^2, David Braun^2, Brian Hafler^2, Jeffrey Ishizuka^2, Rahul M. Dhodapkar^5, Hattie Chung^2, Shekoofeh Azizi^3, Bryan Perozzi^1, and David van Dijk^2.
Affiliations:
1. Google Research, Graph Mining Team
2. Yale University
3. Google DeepMind
4. Brown University
5. University of Southern California