Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads

BACKGROUND: Adapter trimming is a prerequisite step for analyzing next-generation sequencing (NGS) data when the reads are longer than the target DNA/RNA fragments. Although typically used in small RNA sequencing, adapter trimming is also used widely in other applications, such as genome DNA sequenc...

Bibliographic Details
Main authors: Jiang, Hongshan, Lei, Rong, Ding, Shou-Wei, Zhu, Shuifang
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4074385/
https://www.ncbi.nlm.nih.gov/pubmed/24925680
http://dx.doi.org/10.1186/1471-2105-15-182
_version_ 1782323210986979328
author Jiang, Hongshan
Lei, Rong
Ding, Shou-Wei
Zhu, Shuifang
author_facet Jiang, Hongshan
Lei, Rong
Ding, Shou-Wei
Zhu, Shuifang
author_sort Jiang, Hongshan
collection PubMed
description BACKGROUND: Adapter trimming is a prerequisite step for analyzing next-generation sequencing (NGS) data when the reads are longer than the target DNA/RNA fragments. Although typically used in small RNA sequencing, adapter trimming is also used widely in other applications, such as genome DNA sequencing and transcriptome RNA/cDNA sequencing, where fragments shorter than a read are sometimes obtained because of the limitations of NGS protocols. For the newly emerged Nextera long mate-pair (LMP) protocol, junction adapters are located in the middle of all properly constructed fragments; hence, adapter trimming is essential to gain the correct paired reads. However, our investigations have shown that few adapter trimming tools meet both efficiency and accuracy requirements simultaneously. The performances of these tools can be even worse for paired-end and/or mate-pair sequencing. RESULTS: To improve the efficiency of adapter trimming, we devised a novel algorithm, the bit-masked k-difference matching algorithm, which has O(kn) expected time with O(m) space, where k is the maximum number of differences allowed, n is the read length, and m is the adapter length. This algorithm makes it possible to fully enumerate all candidates that meet a specified threshold, e.g. error ratio, within a short period of time. To improve the accuracy of this algorithm, we designed a simple and easy-to-explain statistical scoring scheme to evaluate candidates in the pattern matching step. We also devised scoring schemes to fully exploit the paired-end/mate-pair information when it is applicable. All these features have been implemented in an industry-standard tool named Skewer (https://sourceforge.net/projects/skewer). Experiments on simulated data, real data of small RNA sequencing, paired-end RNA sequencing, and Nextera LMP sequencing showed that Skewer outperforms all other similar tools that have the same utility. 
Further, Skewer is considerably faster than other tools that have comparative accuracies; namely, one times faster for single-end sequencing, more than 12 times faster for paired-end sequencing, and 49% faster for LMP sequencing. CONCLUSIONS: Skewer achieved as yet unmatched accuracies for adapter trimming with low time bound.
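To make the notion of k-difference matching in the abstract concrete, here is a minimal sketch of the classical dynamic-programming approach (Sellers-style approximate matching) applied to adapter trimming. It is not Skewer's bit-masked algorithm: this version runs in O(mn) time rather than O(kn) expected time, although it keeps the O(m) working space by storing a single column. The function names, the example adapter string, and the rule of picking the candidate with the fewest differences are illustrative assumptions; Skewer's statistical scoring scheme is not reproduced here, and partial adapters at the read end are not handled.

```python
from typing import List, Tuple


def k_difference_matches(read: str, adapter: str, k: int) -> List[Tuple[int, int, int]]:
    """Enumerate candidate adapter occurrences in `read`.

    Returns (start, end, differences) for every end position where the whole
    adapter aligns to read[start:end] with at most k differences
    (substitutions, insertions, deletions). Hypothetical helper, not Skewer's API.
    """
    m = len(adapter)
    # col[i] = (edit distance of adapter[:i] against the best substring of the
    # read ending at the current position, start index of that substring).
    # Before any read character is consumed, adapter[:i] costs i deletions.
    col = [(i, 0) for i in range(m + 1)]
    matches = []
    for j, ch in enumerate(read, start=1):
        prev_diag = col[0]   # value of row 0 in the previous column
        col[0] = (0, j)      # an alignment may start at any read position
        for i in range(1, m + 1):
            prev_left = col[i]  # same row, previous column
            substitute = (prev_diag[0] + (adapter[i - 1] != ch), prev_diag[1])
            extra_read_char = (prev_left[0] + 1, prev_left[1])
            skipped_adapter_char = (col[i - 1][0] + 1, col[i - 1][1])
            # Tuples compare by distance first, then by start index.
            col[i] = min(substitute, extra_read_char, skipped_adapter_char)
            prev_diag = prev_left
        if col[m][0] <= k:
            matches.append((col[m][1], j, col[m][0]))
    return matches


def trim_adapter(read: str, adapter: str, k: int) -> str:
    """Cut the read at the candidate with the fewest differences (assumed rule)."""
    matches = k_difference_matches(read, adapter, k)
    if not matches:
        return read
    start, _, _ = min(matches, key=lambda t: (t[2], t[0]))
    return read[:start]


if __name__ == "__main__":
    adapter = "AGATCGGAAG"  # illustrative adapter prefix, an assumption
    print(trim_adapter("ACGTACGT" + adapter, adapter, k=1))
    print(trim_adapter("ACGTACGT" + "AGATCGCAAG", adapter, k=1))  # one mismatch
```

Enumerating every end position that satisfies the threshold mirrors the abstract's point that all candidates can be listed and then ranked; a real trimmer would replace the fewest-differences rule with a scoring scheme that also accounts for base qualities and paired-end information.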
format Online
Article
Text
id pubmed-4074385
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-40743852014-07-01 Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads Jiang, Hongshan Lei, Rong Ding, Shou-Wei Zhu, Shuifang BMC Bioinformatics Methodology Article BACKGROUND: Adapter trimming is a prerequisite step for analyzing next-generation sequencing (NGS) data when the reads are longer than the target DNA/RNA fragments. Although typically used in small RNA sequencing, adapter trimming is also used widely in other applications, such as genome DNA sequencing and transcriptome RNA/cDNA sequencing, where fragments shorter than a read are sometimes obtained because of the limitations of NGS protocols. For the newly emerged Nextera long mate-pair (LMP) protocol, junction adapters are located in the middle of all properly constructed fragments; hence, adapter trimming is essential to gain the correct paired reads. However, our investigations have shown that few adapter trimming tools meet both efficiency and accuracy requirements simultaneously. The performances of these tools can be even worse for paired-end and/or mate-pair sequencing. RESULTS: To improve the efficiency of adapter trimming, we devised a novel algorithm, the bit-masked k-difference matching algorithm, which has O(kn) expected time with O(m) space, where k is the maximum number of differences allowed, n is the read length, and m is the adapter length. This algorithm makes it possible to fully enumerate all candidates that meet a specified threshold, e.g. error ratio, within a short period of time. To improve the accuracy of this algorithm, we designed a simple and easy-to-explain statistical scoring scheme to evaluate candidates in the pattern matching step. We also devised scoring schemes to fully exploit the paired-end/mate-pair information when it is applicable. All these features have been implemented in an industry-standard tool named Skewer (https://sourceforge.net/projects/skewer). 
Experiments on simulated data, real data of small RNA sequencing, paired-end RNA sequencing, and Nextera LMP sequencing showed that Skewer outperforms all other similar tools that have the same utility. Further, Skewer is considerably faster than other tools that have comparative accuracies; namely, one times faster for single-end sequencing, more than 12 times faster for paired-end sequencing, and 49% faster for LMP sequencing. CONCLUSIONS: Skewer achieved as yet unmatched accuracies for adapter trimming with low time bound. BioMed Central 2014-06-12 /pmc/articles/PMC4074385/ /pubmed/24925680 http://dx.doi.org/10.1186/1471-2105-15-182 Text en Copyright © 2014 Jiang et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Methodology Article
Jiang, Hongshan
Lei, Rong
Ding, Shou-Wei
Zhu, Shuifang
Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads
title Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4074385/
https://www.ncbi.nlm.nih.gov/pubmed/24925680
http://dx.doi.org/10.1186/1471-2105-15-182