Split-and-merge Bayesian variable selection enables efficient genomic prediction using sequence data

Calus, M.P.L.; Schrooten, C.; Veerkamp, R.F.


Simultaneous use of more than 10,000,000 SNPs imputed from sequence data has hardly improved the accuracy of genomic prediction, even with commonly used Bayesian variable selection models. One reason may be that the overparametrization problem is more severe than with, e.g., 50k SNPs. We hypothesize that split-and-merge Bayesian variable selection may overcome this issue. Our application of split-and-merge, also known as divide and conquer, combined with Bayesian variable selection involves two steps. The first step divides the SNPs into ~300 subsets of ~40k SNPs each. Subsets are formed by going through the list of SNPs ordered by their position on the genome and assigning each successive SNP to the next subset in line, so that every subset contains SNPs spread across the whole genome. In each subset, BayesC is used for genomic prediction, and SNPs are ranked by their posterior probability of being strongly associated with the investigated trait. In the second step, the few hundred SNPs with the largest posterior probabilities in each subset are selected into a final set of SNPs, which is used to build the final genomic prediction model with BayesC. Besides attempting to alleviate the overparametrization problem, this modelling procedure has a practical benefit: the first step comprises ~300 analyses with ~40,000 SNPs each, rather than one analysis with >10,000,000 SNPs. Since all these analyses can be run in parallel, computation time will be a matter of hours instead of more than a month. Results will be presented at the conference.
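The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `split_round_robin` and `merge_top_snps` are hypothetical, the posterior probabilities are made-up toy values standing in for the output of a BayesC run on each subset, and the sizes (12 SNPs, 3 subsets, top 2 per subset) are far smaller than the ~300 subsets of ~40,000 SNPs in the abstract.

```python
from typing import List, Sequence


def split_round_robin(n_snps: int, n_subsets: int) -> List[List[int]]:
    """Step 1: assign position-ordered SNP indices to subsets in rotation.

    SNP 0 goes to subset 0, SNP 1 to subset 1, and so on, wrapping around,
    so each subset holds SNPs spread across the whole genome.
    """
    subsets: List[List[int]] = [[] for _ in range(n_subsets)]
    for snp in range(n_snps):
        subsets[snp % n_subsets].append(snp)
    return subsets


def merge_top_snps(
    subsets: Sequence[Sequence[int]],
    posterior: Sequence[float],
    top_k: int,
) -> List[int]:
    """Step 2: keep the top_k SNPs per subset, ranked by posterior probability.

    `posterior[s]` is assumed to be the posterior probability for SNP s
    obtained from the analysis of the subset that contains it.
    """
    selected: List[int] = []
    for subset in subsets:
        ranked = sorted(subset, key=lambda s: posterior[s], reverse=True)
        selected.extend(ranked[:top_k])
    return sorted(selected)  # restore genome order for the final model


# Toy example with placeholder posterior probabilities (not real results).
toy_posterior = [0.10, 0.90, 0.20, 0.80, 0.30, 0.70,
                 0.40, 0.60, 0.50, 0.05, 0.95, 0.15]
toy_subsets = split_round_robin(n_snps=len(toy_posterior), n_subsets=3)
final_set = merge_top_snps(toy_subsets, toy_posterior, top_k=2)
print(toy_subsets)
print(final_set)
```

Because each subset is analysed independently, the per-subset analyses in step 1 could be dispatched as separate jobs; only the merge in step 2 needs their combined output.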