NeurIPS 2019 fact-sheets: overviews of the research we presented, and its future impact
Analysing somatic (non-inherited) mutations across cancer patients can yield insights that are essential to cancer research. However, the low frequency of most mutations and the varying mutation rates across patients make the data extremely challenging to analyse statistically. As a result, cancer data is also difficult to use for classification, clustering, visualisation or learning useful information. Low-dimensional representations of somatic mutation profiles that retain useful information about the DNA of cancer cells would therefore make such data easier to use in applications that advance precision medicine.
In this paper, we address the open problem of learning from somatic mutations and explore two approaches: Flatsomatic, a solution that uses Variational AutoEncoders (VAEs) to create latent representations of somatic profiles; and set-based learning over mutation features.
Both methods show great potential on their own, and we go a step further by combining the representations from both. We believe the methods presented can be of great value in future research and in bringing data-driven models into precision oncology.
Our exploration started with creating embeddings using a Variational AutoEncoder (VAE), an unsupervised machine learning model able to learn hidden patterns in data, based on known positional features of mutations. The aim was to capture underlying rules and patterns among thousands of cancer patients and create a less sparse representation of this data.
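To make the VAE idea concrete, here is a minimal sketch of the encoding step only, written in plain Python. It is not the Flatsomatic implementation: the layer sizes, the random untrained weights and the toy mutation profile are all assumptions chosen for illustration. It shows how a sparse binary profile is mapped to the mean and log-variance of a latent distribution, how a latent vector is sampled with the reparameterisation trick, and how the KL term of the VAE loss is computed.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def encode(x, w_mu, b_mu, w_logvar, b_logvar):
    """Map an input profile to the mean and log-variance of q(z|x)."""
    return linear(w_mu, b_mu, x), linear(w_logvar, b_logvar, x)


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_divergence(mu, logvar):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))


rng = random.Random(0)
n_positions, latent_dim = 12, 3  # toy sizes, assumed for illustration


def random_matrix(rows, cols):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sparse profile: 1 marks a position where the patient is mutated.
profile = [0.0] * n_positions
profile[2] = 1.0
profile[7] = 1.0

w_mu = random_matrix(latent_dim, n_positions)
w_logvar = random_matrix(latent_dim, n_positions)
zero_bias = [0.0] * latent_dim

mu, logvar = encode(profile, w_mu, zero_bias, w_logvar, zero_bias)
z = reparameterize(mu, logvar, rng)   # dense, low-dimensional embedding
kl = kl_divergence(mu, logvar)        # regulariser added to the reconstruction loss
```

In a trained model the weights would be learned by minimising a reconstruction loss plus this KL term, and `mu` (or `z`) would serve as the patient's embedding.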
We also used an approach based on the “Deep Sets” paper, which treats the mutations of each patient as a set. We used this approach to build a classification model using several other mutation features (including Variant Allele Frequency (VAF), impact, consequence, pathways, etc.); we then extracted the embeddings from the model after training.
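The set-based idea can be sketched in a few lines of plain Python. This is an illustrative sketch rather than our trained model: the feature layout, layer sizes and random weights are assumptions. Each mutation is encoded independently, the encodings are summed, and a second layer maps the pooled vector to a fixed-size embedding, so the result does not depend on the order or the number of mutations.

```python
import random


def linear(weights, bias, x):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def deep_set_embedding(mutations, w_phi, b_phi, w_rho, b_rho):
    """Compute rho(sum of phi(m)) over a patient's set of mutations."""
    pooled = [0.0] * len(b_phi)
    for mutation in mutations:
        encoded = relu(linear(w_phi, b_phi, mutation))   # phi: per-mutation encoder
        pooled = [p + e for p, e in zip(pooled, encoded)]  # sum pooling
    return linear(w_rho, b_rho, pooled)                  # rho: set-level layer


rng = random.Random(0)
n_features, hidden_dim, embed_dim = 3, 4, 2  # toy sizes, assumed for illustration


def random_matrix(rows, cols):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


w_phi, b_phi = random_matrix(hidden_dim, n_features), [0.0] * hidden_dim
w_rho, b_rho = random_matrix(embed_dim, hidden_dim), [0.0] * embed_dim

# Hypothetical per-mutation features, e.g. [VAF, impact flag, pathway flag].
patient = [[0.42, 1.0, 0.0], [0.10, 0.0, 1.0], [0.73, 1.0, 1.0]]

embedding = deep_set_embedding(patient, w_phi, b_phi, w_rho, b_rho)
reordered = deep_set_embedding(list(reversed(patient)), w_phi, b_phi, w_rho, b_rho)
fewer = deep_set_embedding(patient[:1], w_phi, b_phi, w_rho, b_rho)
```

Here `embedding` and `reordered` agree up to floating-point rounding, and `fewer` has the same length even though that patient has a single mutation, which is what makes the approach suitable for patients with very different mutation counts.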
Our approach is unique because we then combined the two representations of mutation data created by these models, and used this new data for a classification task.
Combining the representations allowed us to use both positional and non-positional mutation features, creating more useful and meaningful representations. In a classification task that predicts the patient's cancer type, our low-dimensional version of the data performed better than both the raw data and either representation on its own. The combined approach therefore promises a more meaningful representation of cancer mutation data.
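As a rough illustration of how two embeddings can feed a single classifier, the sketch below joins them by concatenation and classifies with a nearest-centroid rule. Both choices are assumptions made for this example; this overview does not state how the representations were combined or which classifier was used. The embedding values and the labels `type_a` and `type_b` are made up.

```python
import math
from collections import defaultdict


def combine(vae_embedding, set_embedding):
    """Join the two per-patient embeddings into one feature vector."""
    return list(vae_embedding) + list(set_embedding)


def fit_centroids(vectors, labels):
    """Average the training vectors of each class."""
    grouped = defaultdict(list)
    for vector, label in zip(vectors, labels):
        grouped[label].append(vector)
    return {label: [sum(col) / len(members) for col in zip(*members)]
            for label, members in grouped.items()}


def predict(centroids, vector):
    """Return the label of the closest class centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Toy training data: (VAE embedding, set-based embedding, cancer-type label).
training = [
    ([0.9, 0.1], [0.8], "type_a"),
    ([1.0, 0.0], [0.9], "type_a"),
    ([0.1, 0.9], [0.2], "type_b"),
    ([0.0, 1.0], [0.1], "type_b"),
]

vectors = [combine(vae, deep_set) for vae, deep_set, _ in training]
labels = [label for _, _, label in training]
centroids = fit_centroids(vectors, labels)

query = combine([0.8, 0.2], [0.7])
predicted = predict(centroids, query)
```

Any classifier that accepts fixed-length vectors could replace the nearest-centroid rule; the point is only that the combined vector carries both positional and non-positional information.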
Using multiple mutation features across cancer patients can make the data more meaningful. However, because the frequencies of some features vary (e.g. the number of mutations in each cancer patient), the datasets can be large and sparse, and therefore restrictive to learn from.
The work we presented at NeurIPS holds a lot of promise for creating meaningful representations of the data that are lower in dimension than the raw data and do not lose valuable information. At CCG.ai, we aim to use these representations to develop machine learning algorithms that will bring us closer to true precision oncology.
This blog post gives a high-level overview of a paper presented at the NeurIPS 2019 workshop: Sets & Partitions.
We published five papers in total at NeurIPS 2019. Check out our press release to learn about our other machine learning advances.