Thesis subject

Protein Function Prediction

Protein function information is available in the Gene Ontology (GO). While reliable, the GO is far from complete, so machine learning algorithms are applied to predict novel annotations based on additional (measurement) data. This is challenging because: (a) the available data sources are noisy, biased and incomplete; (b) the GO contains only positive labels, so there is no information on functions that proteins do not have; and (c) there is inherent structure between proteins and between functions that should be exploited.

In recent years, a successful protein function prediction tool called BMRF [1] has been developed at Wageningen University. The goal of this project is to extend BMRF in three ways: take into account additional available data, such as various known (tissue-specific) networks and/or QTL data [2]; investigate learning methods tailored to settings with only positive labels, such as PU learning [3]; and model relationships between annotations, such as the hierarchical nature of the Gene Ontology [4], for example by structured output learning of protein/function modules rather than individual proteins.
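As a small illustration of what the hierarchical nature of the GO implies for annotations, the sketch below applies the true-path rule: a protein annotated with a term is implicitly annotated with all ancestors of that term. This is not part of BMRF; the function name `propagate_annotations`, the toy term identifiers and the protein names are hypothetical and chosen only for illustration.

```python
def propagate_annotations(annotations, parents):
    """Close each protein's term set under the true-path rule.

    annotations: dict mapping protein -> set of directly annotated terms
    parents:     dict mapping term -> list of its parent terms (a DAG)
    Returns a dict mapping protein -> set of direct and inherited terms.
    """
    closed = {}
    for protein, terms in annotations.items():
        seen = set()
        stack = list(terms)
        while stack:
            term = stack.pop()
            if term in seen:
                continue  # already visited via another path in the DAG
            seen.add(term)
            stack.extend(parents.get(term, ()))
        closed[protein] = seen
    return closed


# Toy hierarchy with hypothetical identifiers (not real GO terms).
toy_parents = {
    "T:leaf": ["T:mid_a", "T:mid_b"],
    "T:mid_a": ["T:root"],
    "T:mid_b": ["T:root"],
}
toy_annotations = {"protA": {"T:leaf"}, "protB": {"T:mid_a"}}

closed = propagate_annotations(toy_annotations, toy_parents)
```

A predictor that scores each term independently can violate this rule, for instance by predicting a specific term while rejecting its parent, which is one motivation for modelling the terms jointly.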

[1] Y.A.I. Kourmpetis et al. (2010) Bayesian Markov Random Field analysis for protein function prediction based on network data. PLoS ONE 5(2):e9293.
[2] J.W. Bargsten et al. (in preparation) Linking rice traits to candidate genes via biological processes obtained by integrating gene function prediction with QTL data.
[3] B. Calvo (2008) Positive unlabeled learning with applications in computational biology. PhD thesis, Dept. of Computer Science and Artificial Intelligence, University of the Basque Country.
[4] A. Sokolov and A. Ben-Hur (2010) Hierarchical classification of Gene Ontology terms using the GOstruct method. Journal of Bioinformatics and Computational Biology 8(2):357-76.

Skills used: programming, statistics.

Requirements: INF-22306 Programming in Python; ABG-30306 Genomics; and one of MAT-20306 Advanced statistics, BRD-31806 Parameter estimation and model structure ident., or ABG-30806 Modern statistics for the life sciences.