Metabolomics is the scientific field that aims to chart all small molecules in an organism and is thus perfectly positioned to map this structural diversity. Fortunately, technical advances in analytical chemistry have enabled mass spectrometers to acquire information-dense metabolic profiles. However, these profiles typically consist of mass spectra rather than structures. Spectral libraries containing spectra of known structures are growing, but they currently cover only about 2.5% of the known natural products, and typically only 2–25% of experimental data is annotated through library matching. A lot of information thus remains hidden in metabolomics data, waiting to be uncovered. My research vision is therefore to close the gap between what we can see in metabolomics and what we can actually learn from it. Closing this gap will enable biochemical interpretation of spectral data obtained from complex metabolite mixtures through structural and functional annotations. Doing so depends on finding out: i) which structural information is encoded in metabolomics data; ii) how novel chemistry can be recognised in spectral data; and iii) how relevant metabolite groups can be effectively identified in metabolomics profiles of complex metabolite mixtures.
My research agenda is therefore to develop algorithms and models that improve the structural annotation of metabolite features and extract direct biochemical knowledge from metabolomics profiles. In my group, I will develop computational metabolomics approaches inspired by two other fields: natural language processing (NLP) and genomics. For example, I have demonstrated the use of topic modeling NLP algorithms to discover substructures from metabolomics profiles, and I am currently pioneering the use of word embedding NLP approaches to aid metabolomics analyses. Furthermore, genomics analysis tools have expanded rapidly over the last decade. By making metabolomics data compatible with those tools, it will be possible to exploit a wide range of tools originally developed for genomics data and to make better use of the growing amount of consistently curated sample information. In addition, linking the outcomes of genomics and metabolomics data mining tools will accelerate natural products discovery by connecting biosynthetic gene clusters to the spectra of the molecular structures they encode. These links reveal which organisms can produce these molecules and allow structural information to be transferred in both directions. I will use the plant root microbiome and the human food metabolome as prime applications, since they represent complex metabolite mixtures full of as yet unknown metabolic matter that, once elucidated, will boost our insight into the molecular mechanisms underpinning the regulation of growth, development, and health.
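To make the NLP analogy concrete, the sketch below shows one possible way to turn a fragmentation spectrum into a "document" of words, which is the kind of input that topic modeling and word embedding algorithms expect. It is a minimal illustration under stated assumptions, not the implementation of any particular tool: the function name `spectrum_to_document`, the two-decimal binning, the 1% relative intensity cut-off, and the two toy spectra are all hypothetical choices made for this example.

```python
def spectrum_to_document(precursor_mz, peaks, decimals=2, min_rel_intensity=0.01):
    """Convert one MS/MS spectrum into a list of 'words'.

    precursor_mz: m/z of the precursor ion.
    peaks: list of (mz, intensity) tuples.
    Each retained peak yields a fragment word and, if the fragment is
    lighter than the precursor, a neutral-loss word.
    All parameter defaults are illustrative assumptions.
    """
    if not peaks:
        return []
    base_intensity = max(intensity for _, intensity in peaks)
    if base_intensity <= 0:
        return []

    words = []
    for mz, intensity in peaks:
        # Skip low-intensity peaks, which are more likely to be noise.
        if intensity / base_intensity < min_rel_intensity:
            continue
        words.append(f"fragment_{mz:.{decimals}f}")
        loss = precursor_mz - mz
        if loss > 0:
            words.append(f"loss_{loss:.{decimals}f}")
    return words


# Toy spectra with invented values, for illustration only.
doc_a = spectrum_to_document(465.10, [(303.05, 100.0), (153.02, 20.0), (465.10, 0.5)])
doc_b = spectrum_to_document(449.11, [(287.06, 100.0), (153.02, 35.0)])

# Words shared between spectra hint at shared substructures.
shared = set(doc_a) & set(doc_b)
print(sorted(shared))
```

In this toy case the two spectra have different precursors yet share a fragment word and a neutral-loss word. Co-occurrence patterns of this kind across many spectra are what a topic model or an embedding model could pick up as candidate substructures.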