
The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines

A downside of next-generation sequencing technology is its high technical error rate. We built a tool that uses array-based genotype information to classify next-generation sequencing–based SNPs into correct and incorrect calls. The deep learning algorithms were implemented in Keras. Seve...


Bibliographic Details
Main authors:	Kotlarz, Krzysztof; Mielczarek, Magda; Suchocki, Tomasz; Czech, Bartosz; Guldbrandtsen, Bernt; Szyda, Joanna
Format:	Online Article Text
Language:	English
Published:	Springer Berlin Heidelberg 2020
Subjects:	Animal Genetics • Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7652806/ https://www.ncbi.nlm.nih.gov/pubmed/32996082 http://dx.doi.org/10.1007/s13353-020-00586-0

author	Kotlarz, Krzysztof; Mielczarek, Magda; Suchocki, Tomasz; Czech, Bartosz; Guldbrandtsen, Bernt; Szyda, Joanna
author_sort	Kotlarz, Krzysztof
collection	PubMed
description	A downside of next-generation sequencing (NGS) technology is its high technical error rate. We built a tool that uses array-based genotype information to classify NGS-based SNPs into correct and incorrect calls. The deep learning algorithms were implemented in Keras. Five algorithms were tested: (i) the basic, naïve algorithm; (ii) the naïve algorithm modified by imposing different weights on the incorrect and correct SNP classes when calculating the loss; and (iii)–(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs to match 30%, 60% or 100% of the number of correct SNPs. The training data set comprised data from three bulls, with 2,227,995 correct (97.94%) and 46,920 incorrect SNPs, while the validation data set comprised data from one bull, with 749,506 correct (98.05%) and 14,908 incorrect SNPs. For a rare-event classification problem such as incorrect SNP detection in NGS data, the most parsimonious naïve model and the model with weighted SNP classes gave the best classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct, giving an F1 score of 0.21, the highest among the compared algorithms. We conclude that the basic models were less adapted to the specifics of the training data set and therefore classified the independent validation data set better than the other tested models. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1007/s13353-020-00586-0) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-7652806
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Springer Berlin Heidelberg
record_format	MEDLINE/PubMed
spelling	pubmed-7652806 2020-11-12 The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines Kotlarz, Krzysztof; Mielczarek, Magda; Suchocki, Tomasz; Czech, Bartosz; Guldbrandtsen, Bernt; Szyda, Joanna J Appl Genet Animal Genetics • Original Paper Springer Berlin Heidelberg 2020-09-29 2020 /pmc/articles/PMC7652806/ /pubmed/32996082 http://dx.doi.org/10.1007/s13353-020-00586-0 Text en © The Author(s) 2020, corrected publication 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title	The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines
topic	Animal Genetics • Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7652806/ https://www.ncbi.nlm.nih.gov/pubmed/32996082 http://dx.doi.org/10.1007/s13353-020-00586-0
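
The description field names two ways of handling the rarity of incorrect SNPs: weighting the two classes in the loss, and re-sampling the incorrect SNPs with replacement up to 30%, 60% or 100% of the number of correct SNPs. The sketch below illustrates both ideas with NumPy only. It is not the authors' code: the inverse-frequency weighting scheme, the label coding (0 = correct, 1 = incorrect), the function names and the toy label vector are all assumptions made for illustration, since the record does not state the weights or the data layout that were actually used.

```python
import numpy as np

# Training-set class counts as reported in the description field.
N_CORRECT = 2_227_995
N_INCORRECT = 46_920


def balanced_class_weights(n_correct, n_incorrect):
    """Inverse-frequency class weights (assumed scheme, not from the record).

    Each class contributes the same total weight to the loss.
    Labels are assumed to be 0 = correct SNP, 1 = incorrect SNP.
    """
    total = n_correct + n_incorrect
    return {0: total / (2 * n_correct), 1: total / (2 * n_incorrect)}


def oversample_minority(labels, ratio, rng):
    """Return shuffled row indices in which class 1 is re-sampled with
    replacement until it numbers `ratio` times the size of class 0."""
    majority = np.flatnonzero(labels == 0)
    minority = np.flatnonzero(labels == 1)
    target = int(round(ratio * majority.size))
    resampled = rng.choice(minority, size=target, replace=True)
    return rng.permutation(np.concatenate([majority, resampled]))


weights = balanced_class_weights(N_CORRECT, N_INCORRECT)
print("class weights:", weights)

# Hypothetical toy labels with a 2% share of incorrect SNPs.
rng = np.random.default_rng(0)
toy_labels = np.zeros(10_000, dtype=int)
toy_labels[:200] = 1

for ratio in (0.3, 0.6, 1.0):
    idx = oversample_minority(toy_labels, ratio, rng)
    n_incorrect = int((toy_labels[idx] == 1).sum())
    n_correct = int((toy_labels[idx] == 0).sum())
    print(f"ratio {ratio}: {n_incorrect} incorrect vs {n_correct} correct")
```

In Keras, a weight dictionary of this shape can be passed to `Model.fit` through its `class_weight` argument, and the re-sampled indices can be used to select training rows before fitting. Whether the authors did either in this way is not stated in the record.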

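The description field reports that the best models recovered 19% of truly incorrect SNPs and 99% of truly correct SNPs on the validation set, with an F1 score of 0.21. The sketch below shows how those quantities relate, using the reported validation counts. It treats incorrect SNPs as the positive class, which is an assumption, and it starts from the rounded percentages, so it can only approximate the published F1 score rather than reproduce it.

```python
def f1_from_rates(n_pos, n_neg, recall, specificity):
    """Derive precision and F1 from class counts and per-class rates."""
    true_pos = recall * n_pos                # incorrect SNPs flagged as incorrect
    false_pos = (1.0 - specificity) * n_neg  # correct SNPs flagged as incorrect
    precision = true_pos / (true_pos + false_pos)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1


# Validation-set counts as reported in the description field.
N_INCORRECT_VAL = 14_908
N_CORRECT_VAL = 749_506

# Rounded rates from the description field (19% and 99%).
precision, f1 = f1_from_rates(N_INCORRECT_VAL, N_CORRECT_VAL, 0.19, 0.99)

# Accuracy of a trivial classifier that calls every SNP correct.
trivial_accuracy = N_CORRECT_VAL / (N_CORRECT_VAL + N_INCORRECT_VAL)

print(f"precision ~ {precision:.2f}, F1 ~ {f1:.2f}")
print(f"accuracy of 'always correct' baseline: {trivial_accuracy:.4f}")
```

Because roughly 98% of validation SNPs are correct, a model that never flags an error already reaches about 98% accuracy, which is why a class-sensitive measure such as F1 is the more informative comparison here.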