High-Throughput Identification of Adapters in Single-Read Sequencing Data

Sequencing datasets available in public repositories are already high in number, and their growth is exponential. Raw sequencing data files constitute a substantial portion of these data, and they need to be pre-processed for any downstream analyses. The removal of adapter sequences is the first ess...


Bibliographic Details
Main authors: Mohideen, Asan M.S.H., Johansen, Steinar D., Babiak, Igor
Format: Online Article Text
Language: English
Published: MDPI 2020
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7356586/
https://www.ncbi.nlm.nih.gov/pubmed/32521604
http://dx.doi.org/10.3390/biom10060878
Collection: PubMed
Description: Public repositories already hold a large number of sequencing datasets, and that number is growing exponentially. Raw sequencing data files constitute a substantial portion of these data and must be pre-processed before any downstream analysis; the removal of adapter sequences is the first essential step. Tools available for the automated detection of adapters in single-read sequencing datasets have certain limitations. To explore such datasets, one must retrieve the adapter sequences from the methods sections of the corresponding research articles, which is time-consuming in metadata analyses; moreover, not all research articles report the adapter sequences. We have developed adapt_find, a tool that automates the identification of adapter sequences in raw single-read sequencing datasets and needs no prior knowledge of those sequences. We have verified adapt_find by testing it on a number of publicly available datasets. It provides robust, reliable, high-throughput adapter identification across different sequencing technologies and various adapter designs. We have also produced two associated tools: random_mer, which detects random N bases at one or both termini of the reads, and fastqc_parser, which consolidates the results from FastQC outputs. Together, these form a valuable tool set for metadata analyses of multiple sequencing datasets.
Record ID: pubmed-7356586
Institution: National Center for Biotechnology Information
Record format: MEDLINE/PubMed
Journal: Biomolecules
Published: MDPI 2020-06-08
License: © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).