
Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences

Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate in...


Bibliographic Details
Main Authors:	He, Cheng; Lin, Guifang; Wei, Hairong; Tang, Haibao; White, Frank F; Valent, Barbara; Liu, Sanzhen
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Standard Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671381/ https://www.ncbi.nlm.nih.gov/pubmed/33575622 http://dx.doi.org/10.1093/nargab/lqaa075

author	He, Cheng; Lin, Guifang; Wei, Hairong; Tang, Haibao; White, Frank F; Valent, Barbara; Liu, Sanzhen
author_sort	He, Cheng
collection	PubMed
description	Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences and are typically not completely corrected by using either long reads or more accurate short reads. The issue is common, but few tools are dedicated to computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics enable the classification of k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public use.
format	Online Article Text
id	pubmed-7671381
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7671381 2021-02-10 NAR Genom Bioinform Standard Article Oxford University Press 2020-09-21 /pmc/articles/PMC7671381/ /pubmed/33575622 http://dx.doi.org/10.1093/nargab/lqaa075 Text en © The Author(s) 2019. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences
topic	Standard Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671381/ https://www.ncbi.nlm.nih.gov/pubmed/33575622 http://dx.doi.org/10.1093/nargab/lqaa075


Similar Items