
Improving the sensitivity of long read overlap detection using grouped short k-mer matches


Bibliographic Details
Main authors: Du, Nan; Chen, Jiao; Sun, Yanni
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456931/
https://www.ncbi.nlm.nih.gov/pubmed/30967123
http://dx.doi.org/10.1186/s12864-019-5475-x
author Du, Nan
Chen, Jiao
Sun, Yanni
collection PubMed
description BACKGROUND: Single-molecule, real-time (SMRT) sequencing, developed by Pacific Biosciences, produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize intra-species variation. It also holds promise for deciphering the community structure of complex microbial communities, because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads that overlap. Because PacBio data have a higher sequencing error rate and lower coverage than popular short read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps, or overlaps with high error rates in both reads. Addressing this need will enable better assembly of metagenomic data produced by third-generation sequencing technologies.
RESULTS: In this work, we designed and implemented an overlap detection program named GroupK for third-generation sequencing reads, based on grouped k-mer hits. While several existing programs use k-mer hits to detect read overlaps, our method uses a group of short k-mer hits that satisfy statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hits were originally designed for homology search; we are the first to apply them to long read overlap detection. The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets with low sequencing coverage.
CONCLUSIONS: GroupK is best used for detecting small overlaps in third-generation sequencing data. It is a useful supplement to existing tools for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.
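The idea of grouping short k-mer hits under distance constraints can be illustrated with a minimal Python sketch. This is not the GroupK implementation: the function names (`find_hits`, `group_hits`, `has_overlap_candidate`), the greedy chaining strategy, and the fixed thresholds `max_gap`, `max_diag_shift`, and `min_hits` are all assumptions made for illustration, whereas the abstract states that GroupK derives its distance constraints statistically.

```python
from collections import defaultdict


def find_hits(a, b, k):
    """Return (i, j) pairs where the k-mer at a[i:i+k] equals b[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    hits = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], ()):
            hits.append((i, j))
    return hits


def group_hits(hits, max_gap, max_diag_shift):
    """Greedily chain hits into groups.

    A hit joins the first group whose last hit precedes it in both reads
    by at most max_gap positions and whose diagonal (i - j) differs by at
    most max_diag_shift. Both thresholds are illustrative placeholders.
    """
    groups = []
    for i, j in sorted(hits):
        for group in groups:
            li, lj = group[-1]
            close = 0 < i - li <= max_gap and 0 < j - lj <= max_gap
            same_diagonal = abs((i - j) - (li - lj)) <= max_diag_shift
            if close and same_diagonal:
                group.append((i, j))
                break
        else:
            # No existing group accepted this hit, so it starts a new one.
            groups.append([(i, j)])
    return groups


def has_overlap_candidate(a, b, k=5, max_gap=20, max_diag_shift=3, min_hits=3):
    """Report whether any group of hits is large enough to suggest an overlap."""
    groups = group_hits(find_hits(a, b, k), max_gap, max_diag_shift)
    return any(len(group) >= min_hits for group in groups)


if __name__ == "__main__":
    shared = "GATTACAGGCTTACCGATGCAAGT"
    read_a = "TTTTTTTTTT" + shared  # suffix of read_a ...
    read_b = shared + "CCCCCCCCCC"  # ... matches prefix of read_b
    print(has_overlap_candidate(read_a, read_b))
```

Requiring several short hits that agree on position, rather than one long exact match, is what lets a scheme like this tolerate errors inside a small overlap; a real tool would also have to handle reverse complements and indel-driven diagonal drift more carefully than this sketch does.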
format Online
Article
Text
id pubmed-6456931
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6456931 2019-04-19 Improving the sensitivity of long read overlap detection using grouped short k-mer matches. Du, Nan; Chen, Jiao; Sun, Yanni. BMC Genomics. Research. BioMed Central 2019-04-04 /pmc/articles/PMC6456931/ /pubmed/30967123 http://dx.doi.org/10.1186/s12864-019-5475-x Text en
© The Author(s) 2019. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Improving the sensitivity of long read overlap detection using grouped short k-mer matches
topic Research