NeurIPS 2019 fact-sheets: overviews of the research we presented, and its future impact
With the widespread adoption of next generation sequencing technologies, we are beginning to understand how each cancer tumour is unique at the genetic level, leading to increased optimism about the development of personalised cancer treatments. Precision oncology involves identifying the genomic features driving an individual tumour and designing a personalised therapeutic strategy in response. Classifying these genomic features is a problem well suited to supervised machine learning algorithms.
Due to the high complexity and dimensionality of genomic data, applying models directly to the raw data can be difficult. Whilst we cannot reduce the complexity, there are techniques to reduce the dimensionality of this data. A common approach is to select features with known impact (e.g. driver genes, cell signalling pathways, etc.). Another is to use models that compress the data whilst keeping most of the signal. In this paper, we present Flatsomatic, a Variational Autoencoder (VAE) optimised to compress somatic mutations, allowing for unbiased data compression whilst maintaining the signal. We show that the Flatsomatic representations keep the same predictive power as the original vector for drug response prediction.
We have trained Variational Autoencoders (VAEs), unsupervised machine learning models able to learn hidden patterns that represent the distribution of the data, on binary data showing the location of mutations in the genome for each cancer patient. Our intuition is that the VAEs will be able to capture underlying rules and patterns among thousands of cancer patients and create a low-dimensional representation of this data that can then be used as input to other machine learning algorithms.
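To make the idea concrete, here is a minimal NumPy sketch of a single VAE forward pass on binary mutation profiles. It is an illustration under stated assumptions, not the Flatsomatic architecture: the layer sizes (`N_GENES`, `LATENT_DIM`), the single linear encoder and decoder, the untrained random weights and the simulated input are all hypothetical. The loss shown is the standard VAE objective (Bernoulli reconstruction term plus KL divergence to a standard normal), without the loss modifications described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N_GENES = 1000   # hypothetical input width (one bit per mutation location)
LATENT_DIM = 32  # hypothetical size of the compressed representation


def init_params(n_in=N_GENES, n_latent=LATENT_DIM, scale=0.01):
    """Random, untrained weights for a single-layer encoder and decoder."""
    return {
        "W_mu": rng.normal(0, scale, (n_in, n_latent)),
        "W_logvar": rng.normal(0, scale, (n_in, n_latent)),
        "W_dec": rng.normal(0, scale, (n_latent, n_in)),
    }


def encode(x, params):
    """Map binary profiles to the mean and log-variance of q(z | x)."""
    return x @ params["W_mu"], x @ params["W_logvar"]


def reparameterise(mu, logvar):
    """Sample z = mu + sigma * eps so the sampling step stays differentiable."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def decode(z, params):
    """Map latent codes back to per-position mutation probabilities."""
    return 1.0 / (1.0 + np.exp(-(z @ params["W_dec"])))


def vae_loss(x, x_prob, mu, logvar, eps=1e-7):
    """Negative ELBO: Bernoulli reconstruction term plus KL to N(0, I)."""
    x_prob = np.clip(x_prob, eps, 1 - eps)
    recon = -np.sum(x * np.log(x_prob) + (1 - x) * np.log(1 - x_prob), axis=1)
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1)
    return np.mean(recon + kl)


# Simulated sparse binary profiles: 8 patients, roughly 1% of positions mutated.
x = (rng.random((8, N_GENES)) < 0.01).astype(float)
params = init_params()
mu, logvar = encode(x, params)
z = reparameterise(mu, logvar)
loss = vae_loss(x, decode(z, params), mu, logvar)
```

After training, the mean vector `mu` for each patient would typically serve as the low-dimensional representation passed on to other models.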
To design the best VAE, we trialled many neural network architectures and devised several changes to the VAE loss function to help it learn better representations of the data. Our results show that the low-dimensional representations generated by our best VAE perform better than a low-dimensional representation of the same data created by Principal Component Analysis (PCA), and that they perform just as well as the raw data on a classification task, which suggests that the compression preserves the predictive signal needed for that task. More details about the results and performance can be found in the paper.
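For readers who want to reproduce this kind of comparison, a PCA baseline of the same width as the VAE latent space can be sketched in a few lines of NumPy. This is an assumed setup rather than the evaluation code from the paper: the matrix sizes, the number of components `k` and the simulated data are placeholders chosen for illustration.

```python
import numpy as np


def pca_fit(X, k):
    """Return the column means and the top-k principal axes of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]


def pca_transform(X, mean, components):
    """Project profiles onto the retained principal axes."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
# Simulated stand-in for the mutation matrix: 50 patients x 200 positions.
X = (rng.random((50, 200)) < 0.05).astype(float)

mean, components = pca_fit(X, k=10)
Z = pca_transform(X, mean, components)  # one 10-dimensional vector per patient
```

A like-for-like comparison would then train the same downstream classifier three times, on the raw matrix `X`, on the PCA codes `Z` and on the VAE representations, and compare the scores on held-out patients.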
The advantage of working with smaller spaces (lower dimensions) is that it leaves room to combine the compressed profile with other relevant features, including clinical features or other abstract representations of genomic profiles, in future work. Our work has shown that there is great potential for the use of VAEs in creating usable lower-dimensional representations of somatic profiles. At CCG.ai, we plan to use this data to develop more machine learning algorithms to understand similarities among cancer patients, without neglecting what makes each patient unique: bringing us closer to delivering the right treatment for cancer patients at the right time.
This blog post gives a high-level overview of a paper presented at the NeurIPS 2019 workshop: Learning Meaningful Representations of Life.
We published five papers in total at NeurIPS 2019. Check out our press release to learn about our other machine learning advances.