Zseq: An Approach for Preprocessing Next-Generation Sequencing Data

Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that...

Bibliographic Details
Main authors:	Alkhateeb, Abedalrhman, Rueda, Luis
Format:	Online Article Text
Language:	English
Published:	Mary Ann Liebert, Inc. 2017
Subjects:	Research Articles
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5563921/ https://www.ncbi.nlm.nih.gov/pubmed/28414515 http://dx.doi.org/10.1089/cmb.2017.0021

_version_	1783258182716489728
author	Alkhateeb, Abedalrhman Rueda, Luis
author_facet	Alkhateeb, Abedalrhman Rueda, Luis
author_sort	Alkhateeb, Abedalrhman
collection	PubMed
description	Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting its unique k-mers and using that count as the sequence's score; it also takes into account other factors, such as ambiguous nucleotides or a high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters out those with a z-score below the user-defined threshold. The Zseq algorithm provides a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. The filtered reads were evaluated by aligning them and assembling the transcripts, both against the reference genome and by de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than those from the other tested methods. A way of estimating the cutoff threshold using labeling rules is also introduced, with promising results.
format	Online Article Text
id	pubmed-5563921
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Mary Ann Liebert, Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-5563921 2017-08-22 Zseq: An Approach for Preprocessing Next-Generation Sequencing Data Alkhateeb, Abedalrhman Rueda, Luis J Comput Biol Research Articles Mary Ann Liebert, Inc. 2017-08-01 2017-08-01 /pmc/articles/PMC5563921/ /pubmed/28414515 http://dx.doi.org/10.1089/cmb.2017.0021 Text en © Abedalrhman Alkhateeb and Luis Rueda, 2017. Published by Mary Ann Liebert, Inc.
This Open Access article is distributed under the terms of the Creative Commons Attribution Noncommercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
spellingShingle	Research Articles Alkhateeb, Abedalrhman Rueda, Luis Zseq: An Approach for Preprocessing Next-Generation Sequencing Data
title	Zseq: An Approach for Preprocessing Next-Generation Sequencing Data
title_full	Zseq: An Approach for Preprocessing Next-Generation Sequencing Data
title_fullStr	Zseq: An Approach for Preprocessing Next-Generation Sequencing Data
title_full_unstemmed	Zseq: An Approach for Preprocessing Next-Generation Sequencing Data
title_short	Zseq: An Approach for Preprocessing Next-Generation Sequencing Data
title_sort	zseq: an approach for preprocessing next-generation sequencing data
topic	Research Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5563921/ https://www.ncbi.nlm.nih.gov/pubmed/28414515 http://dx.doi.org/10.1089/cmb.2017.0021
work_keys_str_mv	AT alkhateebabedalrhman zseqanapproachforpreprocessingnextgenerationsequencingdata AT ruedaluis zseqanapproachforpreprocessingnextgenerationsequencingdata
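
The description field above outlines a two-pass idea: score each read by its number of unique k-mers, then drop reads whose z-score falls below a user-defined threshold. The following is a minimal sketch of that idea only, not the authors' implementation. The function names, the default values of `k` and `threshold`, and the choice to simply skip k-mers containing `N` are assumptions made for illustration; the GC-content adjustment mentioned in the description is not modeled here.

```python
from statistics import mean, pstdev


def unique_kmer_count(read: str, k: int) -> int:
    """Score a read by its number of distinct k-mers.

    Assumption: k-mers containing an ambiguous base (N) are not counted.
    """
    read = read.upper()
    kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    return sum(1 for kmer in kmers if "N" not in kmer)


def zscore_filter(reads, k=5, threshold=-1.0):
    """Keep reads whose complexity z-score is at least `threshold`.

    First pass: compute one score per read.
    Second pass: convert scores to z-scores and filter.
    """
    reads = list(reads)
    if not reads:
        return []
    scores = [unique_kmer_count(r, k) for r in reads]
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation
    if sigma == 0:
        # All reads have the same score, so no read stands out.
        return reads
    return [r for r, s in zip(reads, scores) if (s - mu) / sigma >= threshold]


if __name__ == "__main__":
    example_reads = [
        "ACGTACGTTGCA",  # varied sequence, many distinct 3-mers
        "AAAAAAAAAAAA",  # homopolymer, a single distinct 3-mer
        "ACGGTCATGCAT",  # varied sequence
        "NNNNNNNNNNNN",  # all ambiguous, scores zero
    ]
    for read in zscore_filter(example_reads, k=3, threshold=0.0):
        print(read)
```

Both passes touch each read once, which is consistent with the description of Zseq as a linear method. A lower threshold keeps more reads, while a higher one filters more aggressively.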
