Mapping Biopharma’s AI Strategy: From Custom Datasets to Foundation Models

Biotech companies are pursuing diverse AI strategies beyond expensive custom data generation: foundation model fine-tuning, data-efficient computational methods and targeted proprietary datasets. In Vivo takes a look at some examples.

Key Takeaways
  • Biotech companies are pursuing three distinct strategies to move beyond expensive custom data generation: foundation model fine-tuning, data-efficient computational methods using existing databases and continued targeted investment in proprietary experimental datasets.
  • Success depends on matching computational strategy to a company's specific therapeutic focus and resources: startups may favor foundation models or data-efficient approaches, while large pharma can integrate multiple methods across different programs.
  • Rather than following a single industry trend, companies should choose their AI approach based on their unique circumstances, as multiple viable paths exist for accelerating drug discovery while reducing R&D costs.


The biopharma industry is grappling with a fundamental challenge: how to move beyond expensive, time-intensive custom data generation that has been the bottleneck in artificial intelligence-driven drug discovery.

At SynBioBeta 2025, a panel of industry leaders revealed markedly different solutions to this problem, with no clear consensus emerging on the optimal path forward.

The diversity of approaches suggests the industry is entering a period of strategic experimentation, where success may depend less on following a single trend and more on matching the right computational strategy to specific therapeutic challenges and resource constraints.

Three Distinct Strategic Paths

The panel discussion revealed three primary approaches companies are taking to address the data generation bottleneck:

Foundation Model Fine-Tuning: Leveraging pre-trained AI models and customizing them with smaller, targeted datasets rather than building massive proprietary databases from scratch.

Data-Efficient Computational Methods: Using sophisticated algorithms that extract maximum insight from existing public databases and evolutionary data, requiring minimal experimental validation.

Continued Custom Data Investment: Maintaining the traditional approach of generating comprehensive proprietary experimental datasets, but with a more targeted focus and improved efficiency.

The Foundation Model Strategy

“There’s a new way of thinking about how to deal with data,” explained Nicholas Sofroniew, a research scientist at EvolutionaryScale, a biotech that offers its own large multimodal protein language model, ESM3. “You can now start from a foundation model that has already been trained on a vast amount of data. You couldn’t start there four or five years ago. These models weren’t at that stage – but now they are.”

Eric Kelsic, CEO, Dyno Therapeutics (Dyno Therapeutics)

Eric Kelsic, CEO and co-founder of Dyno Therapeutics, exemplifies how some companies are adapting their strategies. Dyno originally built its business around solving a specific, high-value problem: gene therapy delivery. Initially, it built a massive custom dataset.

“I thought a lot about all the technology that was out there, and what it would mean in terms of what could be done that was different from before,” Kelsic explained. “We really had to invest in the data generation up front. Data was going to be really key, and it was clear that we didn’t have enough.”

After a decade of building proprietary datasets, Dyno is now incorporating foundation models into its strategy. The company recently announced a collaboration with Nvidia to develop a foundation-scale model, Dynofold, for predicting protein structure and dynamics.

“Foundation models enable us to leverage all the protein data and then fine-tune that using our own experimental data,” Kelsic said. “It’s a more hybrid approach.”

However, Kelsic emphasized that this evolution builds on, rather than replaces, the company’s substantial experimental foundation. When asked whether he would trade his company’s accumulated data for better computational resources, he chose to keep the data, noting that “things are changing really fast on the compute side.”

Ravit Netzer, Scala Biodesign (Scala)

The Data-Efficient Alternative: Scala’s Approach

Not all companies see foundation models as the answer. Ravit Netzer, co-founder and CEO of Scala Biodesign, has built her company around a fundamentally different philosophy: extracting maximum value from existing evolutionary data, rather than generating new experimental datasets or depending on foundation models.

“Generating data is very expensive. It takes a lot of time. It’s very laborious.”

Ravit Netzer, CEO, Scala Biodesign

“Generating data is very expensive. It takes a lot of time. It’s very laborious,” Netzer explained. “Sometimes you can only do it in proxy systems, which means you can’t generate a lot of specific data for the system you’re interested in.”

Scala’s proprietary PROSS algorithm combines phylogenetic analysis of evolutionary data with physics-based modeling, requiring only “several dozen sequences of naturally occurring homologues” from public databases. This requires a fraction of the data needed for either custom experimental approaches or foundation model training.
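The data-efficiency idea behind this kind of evolutionary analysis can be illustrated with a toy consensus calculation. The sketch below is not Scala's PROSS algorithm (which adds phylogenetic weighting and physics-based filtering); it only shows how a few dozen aligned homologue sequences can, by themselves, suggest candidate mutations without any new experimental data. All names and sequences here are invented for illustration.

```python
from collections import Counter

def consensus_suggestions(target: str, homologues: list[str], min_freq: float = 0.6):
    """Toy consensus analysis: for each aligned position, if most homologues
    agree on a residue that differs from the target, flag it as a candidate
    mutation. Real design methods layer phylogenetic weighting and
    physics-based modeling on top of this basic signal."""
    suggestions = []
    for i, target_aa in enumerate(target):
        column = [seq[i] for seq in homologues if seq[i] != "-"]  # skip gaps
        if not column:
            continue
        aa, count = Counter(column).most_common(1)[0]
        if aa != target_aa and count / len(column) >= min_freq:
            suggestions.append((i + 1, target_aa, aa))  # 1-based position
    return suggestions

# A handful of aligned homologue sequences (toy data) vs. a target sequence.
target = "MKTAYIA"
homs = ["MKSAYIA", "MKSAYLA", "MKSAYIA", "MKTAYIA"]
print(consensus_suggestions(target, homs))  # [(3, 'T', 'S')]
```

Here the homologues point to a single candidate substitution (T3S) from only four sequences, which is the essence of the "learn from general databases" approach Netzer describes.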

The company’s success with a malaria vaccine candidate demonstrates the potential of this strategy. Working with researchers at Oxford University, Scala engineered variants of an exotic protein with virtually no preliminary mutational data available, generating three optimized variants that required only minimal experimental testing.

“Within this very small experimental space, they could find a molecule that solved their problem, and now it’s advancing to Phase II clinical trials,” Netzer said. “For many applications, you can really get away without specific data on the system, because we can learn a lot from general databases that we have.”

Alexis Rovner, 64x Bio (64x Bio)

Continued Investment In Custom Data: 64x Bio’s Success

Meanwhile, other companies continue to demonstrate that the traditional custom data generation approach can deliver significant results when executed with sufficient focus and resources.

Alexis Rovner, co-founder and CEO of 64x Bio, a spinout from Harvard’s Wyss Institute that develops “purpose-built” cell lines to improve viral vector production yields and quality, described how her company has invested heavily in building proprietary datasets for AAV manufacturing optimization.

“We’ve been spending a good amount of time generating very rich datasets for perturbations and cell lines that would lead to very specific production outcomes,” she explained.

This approach has yielded tangible results: 64x Bio has developed products that generate some of the highest yields in the AAV manufacturing market and secured multiple partnerships ranging from big pharma to contract development and manufacturing organizations (CDMOs).

“We’ve been spending a good amount of time generating very rich datasets for perturbations and cell lines that would lead to very specific production outcomes.”

Alexis Rovner, 64x Bio

However, even successful practitioners of the custom data approach acknowledge its limitations. As Rovner noted, the company is now looking toward computational tools to help them “leverage more combinatorial aspects of the data” to generate even better products.

The Startup Vs. Big Pharma Divide

Resource differences between start-ups and large pharmaceutical companies significantly influence which data strategy makes the most sense. Justin Farlow, who transitioned from co-founding Serotiny (acquired by Johnson & Johnson) to working within J&J, offered insights into how these dynamics play out.

“In the context of a start-up, you don’t often have that luxury [of extensive data],” Farlow explained. “You have one goal, you really need to achieve that, and you think really long and hard about what key piece of data is most important.”

Within a large pharma environment, however, companies can bring more diverse resources to bear. “You have the luxury to bring in knowledge and data from foundational models, from other modalities, from the clinic,” he said.

This resource difference creates interesting strategic implications. Start-ups may find foundation models or data-efficient approaches more accessible than building comprehensive proprietary datasets, while large pharma companies can potentially integrate multiple approaches across different therapeutic programs.

Emerging Considerations: Access And Compute

The panel also highlighted emerging strategic considerations that may influence which approaches prove most viable:

Model Access: The debate over the need for open source versus proprietary access to foundation models could significantly impact adoption. “Open source is really important, because you can modify the model,” Kelsic argued, emphasizing the need for customization beyond simple API access.

Test-Time Compute: Several panelists discussed the potential for “test-time compute,” meaning the ability to scale computational resources during actual use rather than just during training. “What we found is there’s also a scaling component when you pay more for compute to test more sequences – you get better results,” Kelsic explained.
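The scaling effect Kelsic describes can be sketched as best-of-N sampling: spend more inference-time compute to generate and score more candidate sequences, and the best one found can only improve. The scoring function below is a hypothetical stand-in for a trained model; the names and alphabet are illustrative, not any company's actual pipeline.

```python
import random

def score(seq: str) -> float:
    """Stand-in for an expensive learned fitness predictor (hypothetical;
    a real workflow would call a trained model here)."""
    return sum(1 for aa in seq if aa in "AILV") / len(seq)

def best_of_n(n: int, length: int = 20, seed: int = 0) -> tuple[str, float]:
    """Sample n random candidate sequences and keep the best-scoring one.
    Larger n = more test-time compute = a result at least as good."""
    rng = random.Random(seed)
    alphabet = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
    best_seq, best = "", float("-inf")
    for _ in range(n):
        seq = "".join(rng.choice(alphabet) for _ in range(length))
        s = score(seq)
        if s > best:
            best_seq, best = seq, s
    return best_seq, best

# With a fixed seed, spending 100x more compute can only improve the best score.
_, s10 = best_of_n(10)
_, s1000 = best_of_n(1000)
assert s1000 >= s10
```

The same principle applies whether the "compute" is sampling more designs, running longer structure-prediction rollouts, or deeper search: test-time spend becomes a tunable lever on output quality, separate from training cost.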

Integration Challenges: As Farlow noted from his big pharma perspective, different therapeutic areas are at vastly different levels of AI sophistication, creating new challenges around integration and resource allocation. “Using AI and ML in a chemistry context or a biologics context or a gene editing context are at such different levels,” he explained. “What is quite sophisticated and really challenging for a cell therapy context is nothing compared to what the chemists can do or even the folks building biologics.”

Strategic Implications For Biopharma

The diversity of successful approaches revealed by the panel suggests several key implications for biopharma companies:

There Is No Universal Solution: Rather than a single winning strategy, the optimal approach appears to depend heavily on a company’s specific therapeutic focus, resource constraints and competitive positioning.

Timing Is A Consideration: Companies may need to carefully consider whether to adopt emerging approaches like foundation models immediately or wait for further maturation, particularly given the rapid pace of advancement in computational resources.

Allocating Limited Resources: The choice between custom data generation, foundation model adoption, or data-efficient computational methods has profound implications for how companies structure their R&D operations and allocate capital.

Collaboration Opportunities: The complexity of modern computational approaches may drive increased collaboration between companies with different core competencies, as evidenced by partnerships like Dyno’s work with Nvidia.

The Bottom Line

Rather than a simple transition from one paradigm to another, the biotech industry appears to be entering a period of strategic diversification in computational approaches. Success will likely depend on companies’ ability to match their computational strategy to their specific therapeutic challenges and organizational capabilities.

As the panel discussion demonstrated, multiple viable paths exist for moving beyond the traditional custom data generation bottleneck. The companies that succeed will be those that choose their approach thoughtfully, execute it well, and remain flexible enough to adapt as the computational landscape continues to evolve.

For biopharma leaders, the key insight may be that there is no single “right” answer to the data challenge. There are only approaches that are well-suited to specific circumstances and objectives. The winners will be those who make this strategic choice most wisely.
