The Data Conundrum: Bigger isn’t always better

June 20, 2023

As the saying goes, "bigger isn't always better". Would you rather listen to the late, great Tina Turner belting out one last song live in concert, or to a primary school orchestra of a hundred children butchering Beethoven’s 5th Symphony for an hour? Quality matters more than quantity, and the same is true of biological data.

Data collected from scientific experiments can be harnessed to generate key insights and accelerate drug discovery. However, as the volume of available data increases, inconsistencies in how that data is collected and collated can lead to messy models and poor outcomes.

So what can we do to harness the power of biological data, without getting stuck with nothing but noise?

Using biological data to accelerate drug discovery

Drug discovery is slow and expensive, with estimates suggesting it costs over $2bn and takes 12-15 years to bring a new drug to market. Predictive computational methods are appealing because they are faster and more cost-effective than traditional lab-based experiments. However, the effectiveness of these methods relies heavily on the quality and relevance of the experimental data used to train the algorithms.

There are two distinct types of experimental data to consider. In vivo data is taken from experiments conducted within living organisms, e.g., animal tests or clinical trials. In vitro data comes from experiments performed outside of living organisms, e.g., single cells in petri dishes. Both data types play vital roles in understanding the complexity of what happens when a human takes a drug.

The in vivo approach: Powerful, but unwieldy

In vivo experiments provide a holistic view of how drugs interact with the whole body, including factors such as drug metabolism. In vivo data can enable you to gain insights into drug efficacy, safety profiles, and early identification of potential toxic side effects.

However, in vivo data poses unique challenges. Genetic variations between individuals significantly impact how drugs work, introducing variability that creates noise in the data. Extrinsic factors such as diet and lifestyle, along with age-related changes in organ function, complicate the picture further. Finally, ethical and practical limitations restrict the scale and scope of in vivo experiments, making it difficult to gather enough data for analysis, so predictive models built on in vivo data alone typically perform poorly.

The in vitro approach: Controlled, accurate, but limited

In vitro experiments are conducted in controlled conditions and allow you to investigate specific cellular processes and responses, removing the complexity of the wider organism. Additionally, public domain in vitro data provides what is, in theory, a wealth of information for machine learning models. These models leverage molecular features, cellular characteristics, and experimental conditions to predict drug responses, aiding in drug candidate prioritisation, target identification, and experimental design optimisation.

As always, though, there is a trade-off. In vitro experiments, while highly controlled, may not fully capture the complexity of a whole organism. Reductionist approaches, such as using the inhibition of a specific protein as a proxy for broader adverse drug reactions, might not encompass the true nature of in vivo responses. In vitro experiments also lack important factors such as absorption, distribution, metabolism, and excretion (ADME), which influence drug behaviour within living organisms. Perhaps the biggest challenge comes from a lack of standardised curation. More often than not, in vitro data is presented without proper context and metadata, making it especially challenging to combine datasets. Notable examples include issues with combining assay types, synchronising compound representations, and even reaching a strong consensus about the output of certain assays.

So what approach is best? How do I best leverage experimental data?

It may seem like you’re damned if you do and damned if you don’t, but fear not: times are changing, and more focus than ever is being put into addressing these challenges. Ultimately, in the modern era, data must be stored and presented with AI in mind.

To address in vitro data challenges, focus on improving data quality and standardisation. Collaborative initiatives (such as the FAIR data principles) promote consistent and accurate metadata collection, making data more suitable for machine learning and informatics applications. Integrating multiple data sources and assay types through data curation and harmonisation allows for comprehensive analysis and reduces noise in datasets, as long as low-quality data and incompatible assays are filtered out. Advancements in experimental techniques, such as organ-on-a-chip and 3D cell culture models, bridge the gap between in vitro and in vivo environments, providing more physiologically relevant data and enhancing the predictive power of computational models.
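To make the curation and harmonisation step concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular pipeline: the record fields (`compound_id`, `assay_type`, `value`, `unit`), the unit labels, and the `harmonise` function are all hypothetical. It keeps one assay type, drops records whose units or values cannot be trusted, converts everything to nanomolar, and takes the median across replicate measurements.

```python
from statistics import median

# Conversion factors to nanomolar. The unit labels are illustrative assumptions.
UNIT_TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}


def harmonise(records, assay_type):
    """Keep one assay type, convert values to nM, and take the median per compound."""
    by_compound = {}
    for rec in records:
        if rec["assay_type"] != assay_type:
            continue  # incompatible assay: do not mix it with the others
        factor = UNIT_TO_NM.get(rec["unit"])
        if factor is None or rec["value"] is None or rec["value"] <= 0:
            continue  # low-quality record: unknown unit or unusable value
        by_compound.setdefault(rec["compound_id"], []).append(rec["value"] * factor)
    return {cid: median(vals) for cid, vals in by_compound.items()}


# Hypothetical records from two sources reporting in different units.
records = [
    {"compound_id": "CPD-1", "assay_type": "binding", "value": 0.5, "unit": "uM"},
    {"compound_id": "CPD-1", "assay_type": "binding", "value": 700.0, "unit": "nM"},
    {"compound_id": "CPD-1", "assay_type": "functional", "value": 20.0, "unit": "nM"},
    {"compound_id": "CPD-2", "assay_type": "binding", "value": 3.0, "unit": "ug/mL"},
    {"compound_id": "CPD-2", "assay_type": "binding", "value": 40.0, "unit": "nM"},
]

result = harmonise(records, "binding")
print(result)
```

The median is used here only because it is less sensitive to a single outlying replicate than the mean. A real workflow would also need to reconcile compound identifiers and assay metadata before any values are combined.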

As for in vivo data, you can leverage advancements in genetics and genomics to account for individual variability. Integration of genetic information, including single nucleotide polymorphisms (SNPs) and gene expression profiles, enables a personalised approach to drug discovery. Genetic data incorporated into computational models helps identify genetic markers associated with drug responses, facilitating treatments tailored to specific patient populations. Large-scale collaborative efforts, such as biobanks and data sharing initiatives, aid in the collection and analysis of diverse in vivo datasets, providing insights into population-specific drug responses.


At Ignota Labs, we use a mixture of many data types and leverage a range of computational techniques. This enables us to break down the complex problem of safety in humans into component parts, solve them individually, and then recombine them so that the results are relevant in vivo. We devote a lot of time and effort to data cleansing, ensuring our data is of the highest scientific quality and there is no noise. It is a complex challenge, but by bringing together the right people from our interdisciplinary team, we are committed to solving it.
