Towards a better understanding of proteins through protein-family data integration

Bergh, Tom van den


Proteins are key to many different processes in and around living cells; they are the molecular mechanisms that facilitate life. Protein function and structure have therefore been of great interest to researchers over the past decades. A better understanding of how proteins work is important, as it will help us understand the various diseases caused by their malfunction and help us use proteins in applications such as industrial processes. Research on proteins has resulted in an ever-increasing amount of protein-related data.

Protein information systems were developed to make sense of all the data available for specific protein classes. The work described in this thesis uses the 3DM platform to store and organize the enormous amount of protein data that is now available. The 3DM platform integrates heterogeneous datasets for complete protein superfamilies to generate molecular-class-specific information systems (3DM systems).

The main objective of this thesis is to facilitate a better understanding of protein function through data integration and analysis tools that support both DNA diagnostics and protein engineering efforts. To that end, chapters 2 and 3 focus on the collection and integration of mutational data from scientific literature, followed by protein superfamily analysis to reveal predictive patterns. In chapter 4, mutational data mined for whole families are combined with a superfamily alignment analysis to assign function to networks of residues that appear to have co-evolved. Finally, in chapter 5, the data of a 3DM system are used to train a machine learning model that predicts the pathogenicity of novel genetic variants.

In chapter 2 a method for the automated collection and integration of mutation data from scientific literature is described. This text-mining tool, Mutator, was designed to automatically identify and retrieve relevant literature for a complete protein family.
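
The core idea of recognizing mutation mentions in text can be illustrated with a minimal sketch. This is not the Mutator implementation, which is not described here; it assumes only that point mutations are written in the common one-letter ("A143T") or three-letter ("Ala143Thr") notation. The function name `extract_mutations` and the pattern are hypothetical and chosen for illustration.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Illustrative pattern: wild-type residue, position, mutant residue,
# in either one-letter or three-letter notation.
MUTATION_RE = re.compile(
    rf"\b(?:(?P<w1>[{ONE}])(?P<p1>[1-9]\d*)(?P<m1>[{ONE}])"
    rf"|(?P<w3>{THREE})(?P<p3>[1-9]\d*)(?P<m3>{THREE}))\b"
)


def extract_mutations(text):
    """Return (wild_type, position, mutant) tuples found in the text."""
    found = []
    for match in MUTATION_RE.finditer(text):
        if match.group("w1"):
            wt, pos, mut = match.group("w1"), match.group("p1"), match.group("m1")
        else:
            wt = THREE_TO_ONE[match.group("w3")]
            pos = match.group("p3")
            mut = THREE_TO_ONE[match.group("m3")]
        # Skip mentions where the residue does not change.
        if wt != mut:
            found.append((wt, int(pos), mut))
    return found


sentence = "The A143T and Arg112His variants reduced activity."
print(extract_mutations(sentence))
```

A pattern this simple also matches strings that are not mutations, such as some cell-line names, so a real system would need to check each hit against the protein sequence and its context.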

Mutator was tested on the alpha-amylase protein family, which contains a protein that, when mutated, can lead to Fabry disease. The resulting Fabry disease mutation database contained 30% more mutation data than the manually curated HGMD. To differentiate between natural variation and pathogenic mutations, alignment patterns in the alpha-amylase superfamily were investigated, and Validator, a tool to evaluate the potential pathogenicity of individual mutations, was developed.

In a follow-up study, chapter 3, nine disease-related proteins across two different protein families are investigated. An improved version of Mutator was developed and used to collect mutational data for these protein families. A comparison with the manually curated HGMD showed that Mutator extracts far more mutation data, both in the number of unique mutations extracted and in the number of references per mutation.

These mutation datasets were analyzed in the context of the protein families to identify patterns that can be used to assess the pathogenicity of novel variants. This analysis showed that pathogenic mutations are observed more often in the structurally conserved core of the family than in structurally variable regions. A significant overlap in protein positions with pathogenic mutations was observed between very distant homologs, indicating that data can be transferred between equivalent residue positions even across large phylogenetic distances. Protein superfamily information systems that use structural alignments, such as 3DM, can facilitate this transfer of data across equivalent residues at structurally conserved positions.
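
Transferring data between equivalent positions can be sketched as a lookup through shared alignment columns. The example below is an illustration under stated assumptions, not the 3DM method: it assumes two sequences from the same alignment, written with "-" for gaps and numbered from 1, and the function names are hypothetical.

```python
def residue_to_column(aligned_seq, first_residue_number=1):
    """Map each residue number of a gapped sequence to its 0-based alignment column."""
    mapping = {}
    number = first_residue_number
    for column, symbol in enumerate(aligned_seq):
        if symbol != "-":
            mapping[number] = column
            number += 1
    return mapping


def transfer_positions(source_aligned, target_aligned, source_positions):
    """Map source residue numbers to the target residues in the same columns.

    Positions that fall in a gap of the target sequence have no
    equivalent residue and are left out of the result.
    """
    source_columns = residue_to_column(source_aligned)
    target_by_column = {
        column: number
        for number, column in residue_to_column(target_aligned).items()
    }
    transferred = {}
    for position in source_positions:
        column = source_columns.get(position)
        if column is not None and column in target_by_column:
            transferred[position] = target_by_column[column]
    return transferred


# Toy alignment; the sequences are made up for illustration.
source = "MK-TAYI"
target = "M-RTA-I"
print(transfer_positions(source, target, [2, 4, 5, 6]))
```

In this toy alignment, source residues 2 and 5 align to gaps in the target and are therefore not transferred, which mirrors the restriction to structurally conserved positions.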

Chapter 4 describes a web-based tool for the analysis of co-evolving residues in protein superfamily alignments. This tool, CorNet, combines networks of co-evolving positions with literature data to shed light on the function of these residues. When keywords, such as “specificity”, are mentioned in the sentences that contain mutational data for an alignment position, they are used to annotate that position in a network.

For six different protein families, a list of literature keywords was used to annotate network positions and to calculate, for each keyword, an enrichment score that indicates whether the keyword is specifically associated with a network of co-evolving positions. Enriched keywords indicate the function behind the evolutionary pressures that caused networks of residues to co-evolve. A mutagenesis study that targeted only the non-annotated positions of a network enriched for the keyword “enantioselectivity” showed that an assigned function can be transferred from annotated to non-annotated positions in a network.
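
How the enrichment score is calculated is not specified here. One common way to quantify keyword enrichment, shown below purely as an assumption, is a one-sided hypergeometric test: given how many alignment positions carry a keyword overall, how surprising is the number found inside one network? The function name and argument names are hypothetical.

```python
from math import comb


def enrichment_p_value(total_positions, keyword_positions, network_size, overlap):
    """Probability of seeing at least `overlap` keyword positions in a network.

    Assumes the network is a random draw of `network_size` positions from
    `total_positions`, of which `keyword_positions` carry the keyword
    (hypergeometric upper tail). A small value suggests enrichment.
    """
    upper = min(keyword_positions, network_size)
    tail = sum(
        comb(keyword_positions, k)
        * comb(total_positions - keyword_positions, network_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(total_positions, network_size)


# Toy numbers: 10 annotated positions, 3 carry the keyword,
# and all 3 fall inside a network of 3 positions.
print(enrichment_p_value(10, 3, 3, 3))
```

With these toy numbers only one of the 120 possible networks of three positions contains all three keyword positions, so the probability is 1/120.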

In chapter 5, the integrated data of 3DM systems are used to train a model that predicts variant pathogenicity with machine learning techniques. For three genes related to Long QT syndrome and Brugada syndrome, both cardiac arrhythmia syndromes, datasets of benign and pathogenic variants were collected. 3DM was used to generate a wide range of heterogeneous data describing each variant. Using these data, a pathogenicity predictor was generated to separate disease-causing from non-disease-causing variants. This predictor outperformed frequently used genome-wide prediction tools and even gene-family-specific predictors.

In the general discussion, an outlook is given on scaling up protein information systems to cover the complete structural space and even the whole human exome. This thesis provides a variety of avenues for applying integrated data to gain insight into proteins and to support protein-related research and development.