Improving the sensitivity of long read overlap detection using grouped short k-mer matches

BACKGROUND: Single-molecule, real-time sequencing (SMRT) developed by Pacific BioSciences produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and cha...

Bibliographic Details
Main Authors:	Du, Nan; Chen, Jiao; Sun, Yanni
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456931/ https://www.ncbi.nlm.nih.gov/pubmed/30967123 http://dx.doi.org/10.1186/s12864-019-5475-x

_version_	1783409829343133696
author	Du, Nan Chen, Jiao Sun, Yanni
collection	PubMed
description	BACKGROUND: Single-molecule, real-time sequencing (SMRT) developed by Pacific BioSciences produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize the intra-species variations. It also holds the promise to decipher the community structure in complex microbial communities because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads forming overlaps. Because PacBio data has higher sequencing error rate and lower coverage than popular short read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly for metagenomic data produced by third-generation sequencing technologies. RESULTS: In this work, we designed and implemented an overlap detection program named GroupK, for third-generation sequencing reads based on grouped k-mer hits. While using k-mer hits for detecting reads’ overlaps has been adopted by several existing programs, our method uses a group of short k-mer hits satisfying statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hit was originally designed for homology search. We are the first to apply group hit for long read overlap detection. The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets of low sequencing coverage. CONCLUSIONS: GroupK is best used for detecting small overlaps for third-generation sequencing data. 
It provides a useful supplementary tool to existing ones for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.
format	Online Article Text
id	pubmed-6456931
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-64569312019-04-19 Improving the sensitivity of long read overlap detection using grouped short k-mer matches Du, Nan Chen, Jiao Sun, Yanni BMC Genomics Research BACKGROUND: Single-molecule, real-time sequencing (SMRT) developed by Pacific BioSciences produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize the intra-species variations. It also holds the promise to decipher the community structure in complex microbial communities because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads forming overlaps. Because PacBio data has higher sequencing error rate and lower coverage than popular short read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly for metagenomic data produced by third-generation sequencing technologies. RESULTS: In this work, we designed and implemented an overlap detection program named GroupK, for third-generation sequencing reads based on grouped k-mer hits. While using k-mer hits for detecting reads’ overlaps has been adopted by several existing programs, our method uses a group of short k-mer hits satisfying statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hit was originally designed for homology search. We are the first to apply group hit for long read overlap detection. 
The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets of low sequencing coverage. CONCLUSIONS: GroupK is best used for detecting small overlaps for third-generation sequencing data. It provides a useful supplementary tool to existing ones for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK. BioMed Central 2019-04-04 /pmc/articles/PMC6456931/ /pubmed/30967123 http://dx.doi.org/10.1186/s12864-019-5475-x Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Improving the sensitivity of long read overlap detection using grouped short k-mer matches
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456931/ https://www.ncbi.nlm.nih.gov/pubmed/30967123 http://dx.doi.org/10.1186/s12864-019-5475-x

Similar Items