As next-generation sequencing (NGS) becomes increasingly affordable, researchers are attempting to sequence the genomes of large numbers of species, some of which have never been thoroughly studied before. When sequencing a novel genome for which no close reference is available, it is hard to gain an a priori idea of the size and structure of the genome and of its relation to already sequenced genomes. This makes it hard to choose optimal settings for the assembly and scaffolding algorithms and to validate the results.
In this project, we would like to explore the use of summary statistics of NGS data, such as k-mer distributions, to develop measures of genome size, complexity, zygosity, repeat content and (partial) similarity [1,2]. Such measures could aid in assembling new genomes, but may also be useful in comparative genomics and metagenomics, for example to define new measures for evolutionary relations or to learn about the makeup of complex mixtures of genomes. The desired outcome is a suite of tools that calculate these measures on a given NGS data set.
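As an illustration of the kind of measure meant here, the following minimal Python sketch estimates genome size from a k-mer spectrum: it counts k-mers in the reads, builds the histogram of k-mer multiplicities, takes the most common multiplicity as the k-mer depth, and divides the total number of k-mer occurrences by that depth. The function names, the `min_depth` error cutoff and the in-memory counting are assumptions made for this example, not part of the project. The sketch also does not merge reverse-complement k-mers and ignores heterozygosity and repeats, which a real tool would need to model.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) occurring in the reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity d to the number of distinct k-mers seen d times."""
    return Counter(counts.values())


def estimate_genome_size(spectrum, min_depth=2):
    """Estimate genome size as (total k-mer occurrences) / (peak depth).

    k-mers seen fewer than min_depth times are treated as likely
    sequencing errors and ignored (an assumption of this sketch).
    Returns a tuple (estimated_size, peak_depth).
    """
    usable = {d: n for d, n in spectrum.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The most frequent multiplicity approximates the k-mer coverage depth.
    peak_depth = max(usable, key=lambda d: usable[d])
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth
```

Given a list of read sequences `reads`, a call such as `estimate_genome_size(kmer_spectrum(count_kmers(reads, 21)))` returns the estimated size together with the depth at which the spectrum peaks. For realistic data sets a dedicated k-mer counter would replace the in-memory `Counter`.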
[1] B. Chor et al. (2009) Genomic DNA k-mer spectra: models and modalities. Genome Biology 10:R108.
[2] B. Liu et al. (2013) Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv:1308.2012.
Used skills: Genomics, statistics
INF-22306 Programming in Python
BIF-30806 Advanced bioinformatics