
From existing data to novel hypotheses : design and application of structure-based Molecular Class Specific Information Systems

Kuipers, R.K.P.

Summary

As the active component of many biological systems, proteins are of great interest to life scientists. Proteins are used in a large number of different applications, such as the production of precursors and compounds, bioremediation, drug targeting, and the diagnosis of patients suffering from genetic disorders. Many research projects have therefore focused on the characterization of proteins and on improving the understanding of their functional and mechanistic properties. Studies have examined folding mechanisms, reaction mechanisms, stability under stress, effects of mutations, etc. Together, these research projects have produced an enormous amount of data in many different formats, which makes the data difficult to retrieve, combine, and use efficiently.

The main topic of this thesis is the 3DM platform that was developed to generate Molecular Class Specific Information Systems (3DM systems) for protein superfamilies. These superfamily systems can be used to collect and interlink heterogeneous data sets based on structure-based multiple sequence alignments. 3DM systems can be used to integrate protein, structure, mutation, reaction, conservation, correlation, contact, and many other types of data. Data is visualized using websites, directly in protein structures using YASARA, and in literature using Utopia Documents. 3DM systems contain a number of modules that can be used to analyze superfamily characteristics, namely Comulator for correlated mutation analyses, Mutator for mutation retrieval, and Validator for mutant pathogenicity prediction. A powerful filtering mechanism is available to determine the characteristics of subsets of proteins and to compare the characteristics of different subsets. 3DM systems can be used as a central knowledge base for projects in protein engineering, DNA diagnostics, and drug design.

The scientific and technical background of the 3DM platform is described in the first two chapters. Chapter 1 describes the scientific background, starting with an overview of the foundations of the 3DM platform. Alignment methods and tools for both structure and sequence alignments, and the techniques used in the 3DM modules, are described in detail. Alternative methods are also described, together with the advantages and disadvantages of the various strategies. Chapter 2 contains a technical description of the implementation of the 3DM platform and the 3DM modules. A schematic overview of the database used to store the data is provided, together with a description of the various tables and the steps required to create new 3DM systems. The techniques used in the Comulator, Mutator, and Validator modules of the 3DM platform are discussed in more detail.

Chapter 3 contains a concise overview of the 3DM platform, its capabilities, and the results of protein engineering projects using 3DM systems. Thirteen 3DM systems were generated for superfamilies such as the PEPM/ICL and Nuclear Receptors. These systems are available online for further examination. Protein engineering studies aimed at optimizing substrate specificity, enzyme activity, or thermostability were designed targeting proteins from these superfamilies. Preliminary results of drug design and DNA diagnostics projects are also included to highlight the diversity of projects 3DM systems can be applied to.

Chapter 4 describes Project HOPE, a biomedical tool that predicts the effect of a mutation on the structure of a protein. Project HOPE was developed at the Radboud University Nijmegen Medical Center under the supervision of H. Venselaar. Project HOPE employs web services to optimally reuse existing databases and computing facilities. After selection of a mutant in a protein, data is collected from various sources such as UniProt and PISA. A homology model is created to determine features such as contacts and side-chain accessibility directly in the structure. Using a decision tree, the available data is evaluated to predict the effects of the mutation on the protein.

Chapter 5 describes Comulator, the 3DM module for correlated mutation analyses. Two positions in an alignment correlate when they co-evolve, that is, they mutate simultaneously or not at all. Comulator uses a statistical coupling algorithm to perform correlated mutation analyses. Correlated mutations are visualized using heatmaps, or directly in protein structures using YASARA. Analyses of correlated mutations in various superfamilies showed that positions that correlate are often found in networks and that the positions in these networks often share a common function. Using these networks, mutants were predicted that should increase the specificity or activity of proteins. Mutational studies confirmed that correlated mutation analyses are a valuable tool for the rational design of proteins.
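
The idea that two alignment positions "mutate simultaneously or not at all" can be illustrated with a small sketch. The code below does not reproduce the statistical coupling algorithm used by Comulator; it substitutes plain mutual information between two alignment columns, a common and simpler co-variation measure. The function name and the toy alignment are hypothetical and serve only as illustration.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `alignment` is a list of equal-length aligned sequences.
    This is a stand-in measure, not Comulator's actual algorithm.
    """
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))

    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with raw counts
        ratio = n_ab * n / (count_i[a] * count_j[b])
        mi += p_ab * log2(ratio)
    return mi


# Hypothetical toy alignment: columns 1 and 2 always change together,
# column 0 is fully conserved, column 3 varies independently of column 1.
toy_alignment = [
    "AKLD",
    "AKLE",
    "ARMD",
    "ARME",
]

coupled = mutual_information(toy_alignment, 1, 2)
independent = mutual_information(toy_alignment, 1, 3)
conserved = mutual_information(toy_alignment, 0, 1)
```

In this toy case the coupled pair scores 1 bit, while the independent pair and the pair involving a fully conserved column both score 0. A matrix of such pairwise scores over all alignment positions is the kind of data a heatmap would display.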

Mutator, the text mining tool used to incorporate mutations into 3DM systems, is described in chapter 6. Mutator was designed to automatically retrieve mutations from literature and store these mutations in a 3DM system. A PubMed search using keywords from the 3DM system is used to preselect articles of interest. These articles are retrieved from the internet, converted to text, and parsed for mutations. Mutations are then grounded to proteins and stored in a 3DM database. Mutation retrieval was tested on the alpha-amylase superfamily as this superfamily contains the enzyme involved in Fabry’s disease, an X-linked lysosomal storage disease. Compared to existing mutant databases, such as the HGMD and SwissProt, Mutator retrieved 30% more mutations from literature. A major problem in DNA diagnostics is the differentiation between natural variants and pathogenic mutations. To distinguish between pathogenic mutations and natural variation in proteins, the Validator module was added to 3DM. Validator uses the data available in a 3DM system to predict the pathogenicity of a mutant using, for example, the residue conservation of the mutant's alignment position, the side-chain accessibility of the mutant in the structure, and the number of mutations found in literature for the alignment position. Mutator and Validator can be used to study mutants found in disorder-related genes. Although these tools are not the definitive solution for DNA diagnostics, they can hopefully be used to increase our understanding of the molecular basis of disorders.
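
The "parsed for mutations" step can be sketched as follows. This is a minimal illustration of how point mutations in common notations (for example `R112H` or `Ala143Pro`) might be pulled out of article text with a regular expression; it is an assumption about the general approach, not Mutator's actual implementation. The pattern, the function name `find_mutations`, and the example sentence are all hypothetical.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Matches either one-letter (R112H) or three-letter (Ala143Pro) notation.
MUTATION_PATTERN = re.compile(
    rf"\b(?:(?P<w1>[{ONE}])(?P<p1>[1-9]\d*)(?P<m1>[{ONE}])"
    rf"|(?P<w3>{THREE})(?P<p3>[1-9]\d*)(?P<m3>{THREE}))\b"
)


def find_mutations(text):
    """Return (wild_type, position, mutant) tuples found in `text`."""
    found = []
    for match in MUTATION_PATTERN.finditer(text):
        if match.group("w1"):
            wild_type = match.group("w1")
            position = match.group("p1")
            mutant = match.group("m1")
        else:
            wild_type = THREE_TO_ONE[match.group("w3")]
            position = match.group("p3")
            mutant = THREE_TO_ONE[match.group("m3")]
        found.append((wild_type, int(position), mutant))
    return found


# Hypothetical example sentence.
sentence = "The R112H and Ala143Pro variants reduced enzyme activity."
mutations = find_mutations(sentence)
```

A pattern like this also matches strings that are not mutations, such as cell line names that happen to look like `T47D`, which is one reason the subsequent grounding step, linking each candidate to a residue in a specific protein, is needed before storage.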

Chapters 7 and 8 describe applied research projects using 3DM systems containing proteins of potential commercial interest. A 3DM system for the alpha/beta-hydrolase superfamily is described in chapter 7. This superfamily consists of almost 20,000 proteins with a diverse range of functions. Superfamily alignments were generated for the common beta-barrel fold shared by all superfamily members, and for five distinct subtypes within the superfamily. Due to the size and functional diversity of the superfamily, there is considerable potential for industrial application of superfamily members. Chapter 8 describes a study focusing on a sucrose phosphorylase enzyme from the alpha-amylase superfamily. This enzyme can potentially be used in an industrial setting for the transfer of glucose to a wide variety of molecules. The aim of the study was to increase the stability of the protein at higher temperatures. A combination of rational design using a 3DM system and in-depth study of the protein structure led to a series of mutations that more than doubled the half-life of the protein at 60°C.

3DM systems have been successfully applied in a wide range of protein engineering and DNA diagnostics studies. Currently, 3DM systems are applied most successfully in projects studying a single protein family or monogenetic disorder. In the future, we hope to be able to apply 3DM to more complex scenarios such as enzyme factories and polygenetic disorders by combining multiple 3DM systems for interacting proteins.