Cargando…

The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines

A downside of next-generation sequencing technology is the high technical error rate. We built a tool, which uses array-based genotype information to classify next-generation sequencing–based SNPs into the correct and the incorrect calls. The deep learning algorithms were implemented via Keras. Seve...

Descripción completa

Detalles Bibliográficos
Autores principales: Kotlarz, Krzysztof, Mielczarek, Magda, Suchocki, Tomasz, Czech, Bartosz, Guldbrandtsen, Bernt, Szyda, Joanna
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Springer Berlin Heidelberg 2020
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7652806/
https://www.ncbi.nlm.nih.gov/pubmed/32996082
http://dx.doi.org/10.1007/s13353-020-00586-0
Descripción
Sumario:A downside of next-generation sequencing technology is the high technical error rate. We built a tool, which uses array-based genotype information to classify next-generation sequencing–based SNPs into the correct and the incorrect calls. The deep learning algorithms were implemented via Keras. Several algorithms were tested: (i) the basic, naïve algorithm, (ii) the naïve algorithm modified by pre-imposing different weights on incorrect and correct SNP class in calculating the loss metric and (iii)–(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs to match 30%/60%/100% of the number of correct SNPs. The training data set was composed of data from three bulls and consisted of 2,227,995 correct (97.94%) and 46,920 incorrect SNPs, while the validation data set consisted of data from one bull with 749,506 correct (98.05%) and 14,908 incorrect SNPs. The results showed that for a rare event classification problem, like incorrect SNP detection in NGS data, the most parsimonious naïve model and a model with the weighting of SNP classes provided the best results for the classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct and resulted in the F1 score of 0.21 — the highest among the compared algorithms. We conclude the basic models were less adapted to the specificity of a training data set and thus resulted in better classification of the independent, validation data set, than the other tested models. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1007/s13353-020-00586-0) contains supplementary material, which is available to authorized users.