Thesis subject

Mining structural variation in co-assembly graphs

Structural variation between genomes is usually assessed by comparing finished genomes. However, when many genomes of an organism are compared, it is often not affordable to generate enough next-generation sequencing (NGS) data to assemble the genome of each individual de novo. In such cases, the individuals are resequenced at lower depth and the resulting reads are mapped to a reference genome [1]. The drawback is that genomic content absent from the reference is not taken into account, and that any comparison between two individuals has to go through a (potentially dissimilar) reference genome.

A solution is to combine the NGS data and create a colored co-assembly graph. In such a graph, nodes represent contigs, edges represent links between these contigs, and colors indicate which nodes and edges are present in which genome [2]. Structural variation between genomes leads to particular structures in such graphs, such as branches, bubbles and cross-links [3]. The goal of this project is to explore methods for mining such graphs for interesting structures that can be related to structural variation between genomes. The desired outcome is a method that takes assembly graphs as input and produces an annotated list of the structural variation detected.
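
As an illustration of the kind of structure mining meant here, the sketch below builds a tiny colored contig graph and reports "simple bubbles": a source contig whose paths diverge into two or more single-contig branches that reconverge on the same sink contig. This is a minimal sketch under stated assumptions, not the method of [2] or [3]. The class ColoredGraph, the function find_simple_bubbles, the contig names and the sample labels "A" and "B" are all hypothetical, and real bubbles may span several contigs per branch.

```python
from collections import defaultdict


class ColoredGraph:
    """Directed contig graph; every edge stores the set of samples (colors) supporting it."""

    def __init__(self):
        self.succ = defaultdict(dict)  # contig -> {successor contig: set of colors}
        self.pred = defaultdict(set)   # contig -> set of predecessor contigs

    def add_edge(self, u, v, colors):
        # Merge colors if the edge already exists.
        self.succ[u].setdefault(v, set()).update(colors)
        self.pred[v].add(u)


def find_simple_bubbles(graph):
    """Return bubbles of the form source -> {branch contigs} -> sink.

    A branch qualifies only if its sole predecessor is the source and it has
    exactly one successor (the sink). This is a deliberately restrictive,
    hypothetical definition chosen to keep the example short.
    """
    bubbles = []
    for source in list(graph.succ):
        by_sink = defaultdict(list)
        for branch, in_colors in graph.succ[source].items():
            outs = graph.succ.get(branch, {})
            if len(outs) == 1 and graph.pred[branch] == {source}:
                sink, out_colors = next(iter(outs.items()))
                # Colors that support the branch on both of its edges.
                by_sink[sink].append((branch, frozenset(in_colors & out_colors)))
        for sink, branches in by_sink.items():
            if len(branches) >= 2:
                bubbles.append(
                    {"source": source, "sink": sink, "branches": dict(sorted(branches))}
                )
    return bubbles


# Toy input: samples A and B share contigs c1 and c4 but differ in between.
graph = ColoredGraph()
graph.add_edge("c1", "c2", {"A"})
graph.add_edge("c2", "c4", {"A"})
graph.add_edge("c1", "c3", {"B"})
graph.add_edge("c3", "c4", {"B"})
graph.add_edge("c4", "c5", {"A", "B"})

bubbles = find_simple_bubbles(graph)
for bubble in bubbles:
    print(bubble["source"], "->", bubble["sink"], bubble["branches"])
```

In this toy graph the two branches carry disjoint colors, which is the pattern one would expect where the two samples differ, for example by a substitution or an insertion/deletion. An annotated list of structural variation could start from records like these, extended with contig lengths and sequences.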

[1] P. Medvedev et al. (2009). Computational methods for discovering structural variation with next-generation sequencing. Nature Methods 6(11 Suppl), S13-S20.
[2] Z. Iqbal et al. (2012). De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics 44(2), 226-232.
[3] J.F. Nijkamp et al. (2013). Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold. Bioinformatics 29(22), 2826-2834.

Skills used: genomics, programming.

Requirements: INF-22306 Programming in Python, BIF-30806 Advanced bioinformatics, ABG-30306 Genomics