Accuracy of Imputation to Whole-Genome Sequence Data in Holstein Friesian Cattle

Binsbergen, R. van; Bink, M.C.A.M.; Calus, M.P.L.; Hayes, B.; Eeuwijk, F.A. van; Veerkamp, R.F.


The use of whole-genome sequence data can lead to more accurate genomic predictions in animals and plants. Although sequencing costs are falling, sequencing large numbers of individuals is still far too expensive. A promising approach is to sequence the genomes of a core set of individuals and impute the missing genotypes for the remaining individuals, which are genotyped with currently available marker arrays. Two questions are relevant: how many animals need to be sequenced, and from which SNP arrays can genotypes be imputed accurately? Sequence data of 124 Holstein Friesian bulls from different countries were provided by the 1000 bull genomes project consortium. Two chromosomes of distinct sizes (1 and 29) were selected for this study. The Beagle software was used for imputation, and accuracy was assessed via cross-validation. The 124 bulls were randomly divided into five sets: four sets were merged into a reference set (n_ref=100), and the remaining set served as the validation set, with each set taking this role in turn. For the validation individuals, all markers were set to missing except those that occur on two commonly used arrays, which carry 777k and 54k SNPs across the genome. In a second scenario the same procedure was followed, except that half of the reference individuals were randomly removed (n_ref=50). Accuracy of imputation was calculated per locus as the correlation between true and imputed genotypes. The results will be presented, and the impact of the size of the reference set and of the marker density will be discussed.
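
The cross-validation design and the accuracy measure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`split_into_sets`, `halve_reference`, `mask_to_array`, `per_locus_accuracy`) are hypothetical, genotypes are assumed to be coded as allele dosages (0, 1, 2), and the imputation step itself (done with Beagle in the study) is left outside the sketch.

```python
import random
from statistics import mean


def split_into_sets(ids, n_sets=5, seed=0):
    """Randomly divide individuals into n_sets sets of near-equal size."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_sets] for i in range(n_sets)]


def halve_reference(reference, seed=0):
    """Second scenario: randomly keep half of the reference individuals."""
    return random.Random(seed).sample(reference, len(reference) // 2)


def mask_to_array(genotypes, array_loci):
    """Set all genotypes to missing (None) except loci present on the array."""
    return [g if j in array_loci else None for j, g in enumerate(genotypes)]


def pearson(x, y):
    """Pearson correlation; None if either vector has no variation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a monomorphic locus
    return sxy / (sxx * syy) ** 0.5


def per_locus_accuracy(true_g, imputed_g):
    """Correlation between true and imputed genotypes, one value per locus.

    Both arguments are lists over individuals of lists over loci.
    """
    n_loci = len(true_g[0])
    return [
        pearson([ind[j] for ind in true_g], [ind[j] for ind in imputed_g])
        for j in range(n_loci)
    ]


# Each set serves once as validation; the other four form the reference.
sets = split_into_sets(range(124))
for validation in sets:
    reference = [i for s in sets if s is not validation for i in s]
    small_reference = halve_reference(reference)
    # Imputation of the masked validation genotypes would happen here.
```

With 124 individuals, the split gives four sets of 25 and one of 24, so the merged reference set holds 99 or 100 individuals depending on the fold. Loci without variation among the validation individuals yield no correlation and would need to be excluded or handled separately when averaging accuracy.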