Chloroplast DNA sequence polymorphisms are a primary source of data in many plant phylogenetic studies. The chloroplast genome evolves relatively slowly, which makes it well suited to retaining phylogenetic signal. It is also largely, but not completely, free from other evolutionary processes such as gene duplication, concerted evolution, pseudogene formation and genome rearrangements. Because the sequence is so conserved, primers can be designed against regions that are conserved well beyond species boundaries and used to amplify these targets.
The small size of chloroplast genomes, together with their high copy number in leaf cells, greatly facilitates chloroplast genome sequencing. In this thesis, chloroplast phylogenomics was conducted using complete chloroplast genomes obtained by a newly developed method of de novo assembly. The method is not only cost-effective but also has the potential to extract a wealth of useful information on thousands of chloroplast genomes from Whole Genome Shotgun (WGS) data. We used k-mer frequency tables to identify and extract the chloroplast reads from the WGS reads, and assembled these using a highly integrated and automated custom pipeline. This pipeline includes steps aimed at optimizing assemblies and filling gaps that are left due to coverage variation in the WGS dataset. The pipeline enabled successful de novo assembly across a range of nuclear genome sizes, from Solanum lycopersicum (tomato, 0.9 Gb) to Paphiopedilum henryanum (slipper orchid, 35 Gb).
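The k-mer based read selection described above can be illustrated with a minimal sketch. It assumes that reads from the high-copy chloroplast genome are made up of k-mers that occur far more often in the WGS data than k-mers from nuclear reads, and it keeps reads whose median k-mer count passes a threshold. The function names, the value of k and the threshold are hypothetical choices for illustration, not the settings of the pipeline used in this thesis, and reverse-complement handling is left out for brevity.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_table(reads, k):
    """Count how often each k-mer occurs across all reads."""
    table = Counter()
    for read in reads:
        table.update(kmers(read, k))
    return table


def select_high_copy_reads(reads, k=5, min_median_count=3):
    """Keep reads whose median k-mer count is at least min_median_count.

    Assumption: high median k-mer frequency is used here as a proxy for
    chloroplast origin; k and the threshold are illustrative values only.
    """
    table = build_kmer_table(reads, k)
    selected = []
    for read in reads:
        counts = [table[km] for km in kmers(read, k)]
        # Reads shorter than k yield no k-mers and are skipped.
        if counts and median(counts) >= min_median_count:
            selected.append(read)
    return selected


# Toy data: one sequence present five times (high copy) and one singleton.
toy_reads = ["ACGTACGTAC"] * 5 + ["TTGGCCAATG"]
high_copy = select_high_copy_reads(toy_reads, k=5, min_median_count=3)
```

On real data the threshold would have to be chosen from the observed k-mer frequency distribution, since the gap between nuclear and chloroplast coverage depends on nuclear genome size and sequencing depth.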
Unlike the common procedure of mapping reads against a reference genome, the pipeline is suitable for studying structural variation in the chloroplast genome. To support putative rearrangements, a flexible assembly quality comparison tool was created that combines and visualizes read mapping and alignment results in a two-dimensional plot. We evaluated this tool using the de novo assemblies of S. lycopersicum and P. henryanum chloroplasts. The results show that we can not only immediately select the better of two options, but also determine the location of specific artefacts.
To explore and evaluate the utility of complete chloroplast phylogenomics, tomato and Paphiopedilum spp. were used for phylogenetic inference based on the complete chloroplast genome. In total, 84 tomato chloroplast genomes within the section Lycopersicon were assembled and phylogenetic trees were produced. The analyses revealed that, next to the chloroplast regions and spacers traditionally used for phylogenetics, additional regions of protein-coding and non-coding DNA may be exploited for intraspecific phylogenetic studies. In particular, more than 50% of all phylogenetically relevant information could be captured by using just four genes (ycf1, ndhF, ndhA, and ndhH), with ycf1 alone accounting for 34%. The topology of the phylogenetic tree inferred from ycf1 was the same as that of trees based on all other protein-coding genes, although with lower bootstrap values. The phylogenetic analyses based on 32 complete Paphiopedilum spp. chloroplast genomes confirmed the division of the genus into three subgenera: Parvisepalum, Brachypetalum and Paphiopedilum. The division of subgenus Paphiopedilum into five sections was also recovered. The de novo assemblies revealed several structural rearrangements, including gene loss and inversion. In addition, the chloroplast genome of Paphiopedilum has experienced extreme expansion of the inverted repeat (IR) that has included part or all of the small single-copy (SSC) region, resulting in larger IR regions than commonly observed among monocots.
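How the share of phylogenetic information per gene might be tallied can be shown with a small sketch. It assumes that parsimony-informative sites (alignment columns in which at least two character states each occur in at least two sequences) serve as a proxy for "phylogenetically relevant information"; the measure actually used in this thesis may differ. The function names and the toy gene alignments are hypothetical.

```python
from collections import Counter


def count_informative_sites(alignment):
    """Count parsimony-informative columns in a list of aligned sequences.

    A column counts when at least two different nucleotides each occur in
    at least two sequences. Gaps and ambiguity codes are ignored.
    """
    if not alignment:
        return 0
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    informative = 0
    for column in zip(*alignment):
        counts = Counter(base for base in (c.upper() for c in column)
                         if base in "ACGT")
        shared_states = [b for b, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            informative += 1
    return informative


def informative_share_per_gene(alignments):
    """Return each gene's fraction of all informative sites."""
    per_gene = {gene: count_informative_sites(aln)
                for gene, aln in alignments.items()}
    total = sum(per_gene.values())
    if total == 0:
        return {gene: 0.0 for gene in per_gene}
    return {gene: n / total for gene, n in per_gene.items()}


# Toy alignments for three hypothetical genes, four taxa each.
toy_alignments = {
    "geneA": ["ACGT", "ACGA", "TCGA", "TCGT"],  # columns 1 and 4 informative
    "geneB": ["AAAA", "AAAA", "AAAC", "AAAA"],  # singleton change only
    "geneC": ["AC", "AC", "GC", "GC"],          # column 1 informative
}
shares = informative_share_per_gene(toy_alignments)
```

Ranking genes by such a share is one way to see which few loci would carry most of the signal, in the way ycf1, ndhF, ndhA and ndhH are singled out above.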
In conclusion, WGS data offer opportunities to generate partial or entire chloroplast genomes for phylogenetic studies. Species discrimination can already be achieved with partial data (subsets of genes), but evolutionarily young lineages may require more informative characters. Therefore, it is expected that many complete chloroplast genomes will be produced in the years to come. When generating these genomes, there is a pressing need for de novo assembly rather than mapping against reference genomes, so that structural rearrangements in the chloroplast genome are also uncovered.