Classifying aneuploidy in genotype intensity data using deep learning

Bouwman, Aniek C.; Hulsegge, Ina; Hawken, Rachel J.; Henshall, John M.; Veerkamp, Roel F.; Schokker, Dirkjan; Kamphuis, Claudia


Aneuploidy is the loss or gain of one or more chromosomes. Although it is rare in liveborn individuals, it is observed in livestock breeding populations. These populations are often routinely genotyped, and the genotype intensity data from single nucleotide polymorphism (SNP) arrays can be exploited to identify aneuploidy cases. This identification is time-consuming and costly, because it is usually performed by an expert who visually inspects plots of the intensity data for each chromosome. We therefore explored the feasibility of automated image classification to replace (part of) the visual detection procedure for any diploid species. The aim of this study was to develop a deep learning Convolutional Neural Network (CNN) classification model, based on chromosome-level plots of SNP array intensity data, that classifies the images as disomic, monosomic or trisomic. A multispecies dataset enriched for aneuploidy cases was collected, containing genotype intensity data of 3321 disomic, 1759 monosomic and 164 trisomic chromosomes. The final CNN model had an accuracy of 99.9%; overall precision was 1, recall was 0.98 and the F1 score was 0.99 for classifying images from intensity data. The high precision means that detected cases are most likely true cases; however, some trisomy cases may be missed (the recall of the trisomic class was 0.94). This supervised CNN model performed much better than unsupervised k-means clustering, which reached an accuracy of 0.73 and had particular difficulty classifying trisomic cases correctly. The developed CNN classification model classifies aneuploidy cases with high accuracy based on images of plotted X and Y genotype intensity values.
The classification model can be used as a tool for routine screening in large genotyped diploid populations, to better understand the incidence and inheritance of aneuploidy and, in addition, to avoid anomalies in breeding candidates.
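The evaluation measures reported above (accuracy, precision, recall and F1 score) can all be derived from a confusion matrix over the three classes. The following is a minimal Python sketch of that calculation, not the authors' evaluation code. The function names and the example counts are hypothetical and chosen only for illustration; they are not results from this study. The example shows how a rare class can have perfect precision while its recall is lower, because a single missed case weighs heavily when the class is small.

```python
from typing import Dict, List

# Class order used for the rows and columns of the confusion matrix.
CLASSES = ["disomic", "monosomic", "trisomic"]


def accuracy(confusion: List[List[int]]) -> float:
    """Fraction of all images that were assigned their true class."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def per_class_metrics(confusion: List[List[int]]) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 score per class.

    confusion[i][j] is the number of images whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    metrics: Dict[str, Dict[str, float]] = {}
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(confusion[k]) - tp                       # truly k, predicted another class
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[CLASSES[k]] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical counts for illustration only (NOT data from the study):
# one trisomic image is misclassified as disomic.
EXAMPLE = [
    [100, 0, 0],  # true disomic
    [0, 50, 0],   # true monosomic
    [1, 0, 9],    # true trisomic
]

if __name__ == "__main__":
    print("accuracy:", accuracy(EXAMPLE))
    for name, values in per_class_metrics(EXAMPLE).items():
        print(name, values)
```

In this hypothetical example every image predicted as trisomic is truly trisomic, so the precision of that class is 1, while one of the ten trisomic images is missed, so its recall is 0.9. Overall accuracy stays high because the trisomic class is small relative to the others.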