ML Approaches for Cancer Mutation Datasets

NeurIPS 2019 fact-sheets: overviews of the research we presented, and its future impact

Yasmeen Kussad
January 16, 2020
February 10, 2020


Analysis of somatic (non-inherited) mutations across all cancer patients can bring very useful insights that are essential to the development of cancer research. However, the low frequency of most mutations and the varying rates of mutations across patients make the data extremely challenging to analyse statistically. As a result, cancer data is also difficult to use for classification, clustering, visualisation, or learning useful information. Creating low-dimensional representations of somatic mutation profiles that hold useful information about the DNA of cancer cells will therefore facilitate the use of such data in applications that advance precision medicine.

In this paper, we talk about the open problem of learning from somatic mutations, and explore two different approaches: Flatsomatic, a solution that utilises Variational AutoEncoders (VAEs) to create latent representations of somatic profiles; and set-based learning for mutation features.

The work done in this paper shows great potential for both methods separately, but we also go a step further and combine representations from both methods. We believe the methods presented can be of great value in future research and in bringing data-driven models into precision oncology.

"The low frequency of cancer mutations and the varying rates of these mutations across patients make cancer datasets extremely challenging to analyse statistically"

How Does It Work?

Our exploration started with creating embeddings using Variational AutoEncoders (VAEs), unsupervised machine learning models able to learn hidden patterns in the data, based on known positional features of mutations. The aim was to capture underlying rules and patterns among thousands of cancer patients and to create a less sparse representation of this data.
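To make the VAE idea more concrete, here is a minimal sketch of a VAE forward pass and its loss in NumPy. It is not the Flatsomatic implementation: the layer sizes, the random (untrained) weights, the synthetic binary mutation profiles and every function name are assumptions made purely for illustration, and a real model would be trained with an autodiff framework.

```python
import numpy as np

def init_vae(n_features, n_hidden, n_latent, rng):
    """Random, untrained weights for a one-hidden-layer encoder and decoder."""
    def w(n_in, n_out):
        return rng.normal(0.0, 0.1, size=(n_in, n_out))
    return {
        "enc": w(n_features, n_hidden),
        "mu": w(n_hidden, n_latent),
        "logvar": w(n_hidden, n_latent),
        "dec_h": w(n_latent, n_hidden),
        "dec_out": w(n_hidden, n_features),
    }

def encode(params, x):
    """Map profiles to the mean and log-variance of the latent Gaussian."""
    h = np.maximum(x @ params["enc"], 0.0)  # ReLU
    return h @ params["mu"], h @ params["logvar"]

def decode(params, z):
    """Map latent vectors back to per-position mutation probabilities."""
    h = np.maximum(z @ params["dec_h"], 0.0)
    return 1.0 / (1.0 + np.exp(-(h @ params["dec_out"])))  # sigmoid

def vae_loss(params, x, rng):
    """Negative ELBO: reconstruction error plus KL divergence to N(0, I)."""
    mu, logvar = encode(params, x)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # reparameterisation trick
    x_hat = np.clip(decode(params, z), 1e-7, 1 - 1e-7)
    recon = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat), axis=1)
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1)
    return np.mean(recon + kl)

rng = np.random.default_rng(0)
# Synthetic stand-in for sparse positional profiles: 8 patients, 200 positions.
profiles = (rng.random((8, 200)) < 0.02).astype(float)
params = init_vae(n_features=200, n_hidden=32, n_latent=4, rng=rng)

embeddings, _ = encode(params, profiles)  # dense 4-dimensional representation
loss = vae_loss(params, profiles, rng)
```

After training, the encoder mean (`embeddings` here) is what would typically be used as the dense, low-dimensional representation of each patient.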

We also used an approach based on the “Deep Sets” [1] paper that handles mutations in each patient as a set. We used this approach to build a classification model using several other mutation features (including Variant Allele Frequency (VAF), impact, consequence, pathways, etc.); we then extracted the embeddings from the model after training.
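The set-based idea can be sketched in a few lines. The version below is an assumption-laden illustration rather than our trained model: it applies a per-mutation transform (often called phi), sums over the mutations so that order and set size do not matter, then applies a second transform (rho). The weights are random, the feature vectors are synthetic, and the names `deep_sets_embed`, `w_phi` and `w_rho` are hypothetical.

```python
import numpy as np

def deep_sets_embed(mutations, w_phi, w_rho):
    """Embed a variable-sized set of mutation feature vectors.

    mutations: array of shape (n_mutations, n_features), one row per mutation.
    """
    per_mutation = np.tanh(mutations @ w_phi)  # phi, applied to each element
    pooled = per_mutation.sum(axis=0)          # order-independent pooling
    return np.tanh(pooled @ w_rho)             # rho, applied to the pooled vector

rng = np.random.default_rng(0)
n_features, n_hidden, n_embed = 5, 16, 8
w_phi = rng.normal(0.0, 0.3, (n_features, n_hidden))
w_rho = rng.normal(0.0, 0.3, (n_hidden, n_embed))

patient_a = rng.random((3, n_features))   # a patient with 3 mutations
patient_b = rng.random((40, n_features))  # a patient with 40 mutations

emb_a = deep_sets_embed(patient_a, w_phi, w_rho)
emb_b = deep_sets_embed(patient_b, w_phi, w_rho)

# Reordering the mutations leaves the embedding unchanged.
emb_a_shuffled = deep_sets_embed(patient_a[::-1], w_phi, w_rho)
```

Both patients end up with an embedding of the same length even though their mutation counts differ, which is what makes this formulation a natural fit for varying mutation rates.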

Our approach is unique because we then combined the two representations of mutation data created by these models, and used this new data for a classification task.

Figure: Learned representations from the VAE (left) and the Deep Sets model (right) are combined to create a more meaningful representation of mutations for the main machine learning task: classifying each patient's cancer type.

Combining the representations allowed us to use both positional and non-positional mutation features, thus creating more useful and meaningful representations. In a classification task that predicts the patient's cancer type, our low-dimensional version of the data performed better than both the raw data and either of the two separate representations. This combined approach therefore promises a more meaningful representation of cancer mutation data.
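As a rough illustration of the combination step, the sketch below concatenates two per-patient embeddings and feeds the result to a classifier. Everything in it is assumed for the example: the embeddings and labels are synthetic, the embedding sizes are arbitrary, concatenation is only one plausible way to combine the two representations, and the nearest-centroid classifier is a simple stand-in, not the classifier used in the paper.

```python
import numpy as np

def combine(vae_emb, set_emb):
    """Concatenate the two per-patient embeddings along the feature axis."""
    return np.concatenate([vae_emb, set_emb], axis=1)

def fit_centroids(x, y):
    """Compute one mean vector per class from the training patients."""
    classes = np.unique(y)
    return classes, np.stack([x[y == c].mean(axis=0) for c in classes])

def predict(x, classes, centroids):
    """Assign each patient to the class with the nearest centroid."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
n_patients = 60
cancer_type = rng.integers(0, 3, n_patients)  # synthetic labels, 3 classes

# Synthetic embeddings that each carry some class signal.
vae_emb = rng.normal(size=(n_patients, 4)) + cancer_type[:, None]
set_emb = rng.normal(size=(n_patients, 8)) - cancer_type[:, None]

combined = combine(vae_emb, set_emb)  # shape (60, 12)

# Fit on the first 40 patients, evaluate on the remaining 20.
classes, centroids = fit_centroids(combined[:40], cancer_type[:40])
pred = predict(combined[40:], classes, centroids)
accuracy = float((pred == cancer_type[40:]).mean())
```

In practice the two embeddings may live on different scales, so standardising each block using training-set statistics before concatenating is a sensible extra step.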

What’s the Impact?

The use of multiple mutation features across cancer patients can make the data more meaningful. However, because some features vary widely between patients (e.g. the number of mutations in each cancer patient), the datasets can be large and sparse, and therefore restrictive to learn from.

The work we presented at NeurIPS holds a lot of promise for creating meaningful representations of the data that are lower in dimension than the raw data and do not lose valuable information. We aim to use these representations to develop machine learning algorithms that will bring us closer to true precision oncology.

Find out more

This blog gives a high-level overview of a paper presented at the NeurIPS 2019 workshop Sets & Partitions.

To learn more about this research, read the full paper. To find out why we think all patients deserve precision oncology, read this blog post.

We published 5 papers in total at NeurIPS 2019. Check out our press release to learn about our other Machine Learning advances.

  • Written by Yasmeen Kussad, Machine Learning Researcher at
  • Edited by Belle Taylor, Strategic Communications and Partnerships Manager at
  • Thanks to Geoffroy Dubourg-Felonneau, Harry Clifford, and Dominic Kirkham for valuable discussions
