Standardisation of Machine Learning Variants

The need for a diverse, universally accepted standard set of variants for machine learning

Dàmi Rebergen
January 16, 2020
March 5, 2020

In the age of information, everybody has a definition for what data is and what it encompasses. There are tables, small and large, binding values together into a single element of data: the datum. But is that datum the truth, and where did it come from? In many applications this is not a problem, as the real-world entity that the data describes can be inspected by humans. However, as the thing you are measuring gets smaller this becomes harder, especially with the added uncertainty that nature brings to the table. This brings us to the smallest unit of genetic data, the nucleotide: one of four small molecules that chain together to code for life, in the form of DNA.

Here, we are interested in what happens at this starting point of genetics, and in how a couple of small alterations accumulating throughout life can have huge consequences. We are interested in how these small changes, called somatic variants, can lead to cancer, one of the world's deadliest diseases[1].

With the ever-declining cost of genetic sequencing, we are getting closer to measuring these DNA errors in patients with high confidence. But detecting these somatic variants is a complex task: their signal can vary anywhere between as low as 1% and as high as 100% of reads in the tumour sample. This low signal is determined by the variant allele fraction (VAF), the fraction of DNA copies at a site that carry the mutation, and is complicated by the fact that normal samples (non-tumour cells) can be contaminated with DNA from the tumour. Additionally, sequencing techniques are still only about 99.9% accurate per base.
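To make the numbers above concrete, here is a small back-of-the-envelope sketch (the parameter values are hypothetical, not from any specific study) of how tumour purity dilutes the variant signal and how it compares to the per-base error rate:

```python
# Illustrative arithmetic: expected fraction of sequencing reads that
# support a heterozygous somatic variant, given tumour purity.
# All parameter values below are hypothetical examples.

def expected_variant_read_fraction(purity: float,
                                   copies_with_variant: float = 1.0,
                                   total_copies: float = 2.0) -> float:
    """Fraction of reads at a site expected to carry the variant allele."""
    return purity * copies_with_variant / total_copies

purity = 0.2        # only 20% of sampled cells are tumour cells
vaf = expected_variant_read_fraction(purity)
error_rate = 0.001  # ~99.9% per-base sequencing accuracy

print(f"expected variant read fraction: {vaf:.3f}")   # 0.100
print(f"per-base sequencing error rate: {error_rate:.3f}")
# At 10% of reads the signal sits well above the error rate, but as
# purity (and thus VAF) drops towards 1%, true variants become much
# harder to distinguish from sequencing errors.
```

This is why low-purity samples are so challenging: the signal shrinks towards the noise floor while the error rate stays fixed.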

Many statistical models have been proposed and have shown some effectiveness in detecting these mutations. This includes sifting real mutations out from the noise induced by various sources, such as the aforementioned contamination and sequencing errors.
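The core statistical idea can be sketched in a few lines. The following is a deliberately minimal toy (not any published caller, which would also compare against a matched normal sample): a one-sided binomial test asking whether the variant-supporting reads at a site exceed what the sequencing error rate alone would produce. The thresholds are illustrative assumptions.

```python
# Toy sketch of the statistical idea behind somatic variant calling:
# are the alt-supporting reads at a site explainable by errors alone?
from math import comb

def binom_sf(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def looks_somatic(alt_reads: int, depth: int,
                  error_rate: float = 0.001, alpha: float = 1e-6) -> bool:
    """Call a site if errors alone are very unlikely to produce the alt reads."""
    return binom_sf(alt_reads, depth, error_rate) < alpha

# 5 variant reads out of 100 at a 0.1% error rate is far beyond noise...
print(looks_somatic(5, 100))   # True
# ...while a single mismatching read is consistent with sequencing error.
print(looks_somatic(1, 100))   # False
```

Real callers layer much more on top of this, such as base qualities, mapping artefacts, and tumour/normal comparisons, which is precisely where their disagreements arise.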

However, many of these models, called somatic variant callers (SVCs), disagree with each other to a surprising degree. Independent studies have shown that, when the tools are run by third parties, overlap as low as 30% between callers is not rare[2–4], in sharp contrast to the 99% accuracy claimed in many papers by the creators of these tools[5–7] and in competitions such as the ICGC-TCGA DREAM Mutation Calling challenge[8]. In this challenge, every participant was given the task of classifying mutations produced by a simulation with a strict and short list of VAF values to choose from.
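The overlap figures above are typically computed as set concordance between callers' variant lists. A minimal sketch, with made-up caller names and variant records:

```python
# Hypothetical illustration of inter-caller concordance: the Jaccard
# index between two callers' variant sets. Variants are keyed by
# (chromosome, position, ref allele, alt allele); all calls are invented.

def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

caller_a = {("chr1", 12345, "A", "T"),
            ("chr2", 555, "G", "C"),
            ("chr7", 901, "C", "A")}
caller_b = {("chr1", 12345, "A", "T"),
            ("chr9", 42, "T", "G")}

print(f"overlap: {jaccard(caller_a, caller_b):.0%}")   # 25%
```

Note that a caller can score highly against its own benchmark while still sharing only a minority of calls with its peers, which is exactly the discrepancy the cited studies report.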

While these challenges are good for competition in the field, their data is being misused as a set of representative somatic mutations, which was never the goal. With no strict limit on the number of models each participant could enter, many people took the opportunity to submit many models, effectively running a grid search for the ideal parameters. This leads to models that perform very well on a given task and dataset, but not to generalizable tools that work out of the box.

To move the field of somatic variant calling forward we need a universal standard set of mutations that:

  • Represents real somatic mutations in cancer, to the best of our knowledge
  • Encompasses a wide range of mutation types
  • Covers multiple sequencing techniques
  • Is widely accepted within the community of SVC developers for its representativeness

This dataset should come with community guidelines on how it may be used, to prevent people from using it as training or tuning data in their ML models rather than for final evaluation only, a misuse sometimes seen with image-recognition benchmarks such as CIFAR-10 and ImageNet.

Currently, data from the DREAM challenge is used for this purpose, in combination with data from The Cancer Genome Atlas (TCGA) and from the MC3 study[9]. But these sources come with drawbacks, such as the fact that the tools being tested are also used to select the candidate mutations, creating feedback loops.

Overall, the field of oncology has made huge progress in somatic variant calling. With a dataset as proposed above, the whole field could come to an agreement about performance on given types of sequencing data, without crowning anybody the king of SVC. This would result in a more transparent and reproducible overview of tools and methods, which the field desperately needs.

  • Written by Dàmi Rebergen, Bioinformatician at
  • Edited by Belle Taylor, Strategic Communications and Partnerships Manager at
  • Thanks to Harry Clifford, Geoffroy Dubourg-Felonneau, Nirmesh Patel and Christopher Parsons for valuable discussions

References consulted:
