
Masking as an effective quality control method for next-generation sequencing data analysis

BACKGROUND: Next generation sequencing produces base calls with low quality scores that can affect the accuracy of identifying simple nucleotide variation calls, including single nucleotide polymorphisms and small insertions and deletions. Here we compare the effectiveness of two data preprocessing...


Bibliographic Details
Main Authors: Yun, Sajung; Yun, Sijung
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4268903/
https://www.ncbi.nlm.nih.gov/pubmed/25494997
http://dx.doi.org/10.1186/s12859-014-0382-2
_version_ 1782349309479485440
author Yun, Sajung
Yun, Sijung
author_facet Yun, Sajung
Yun, Sijung
author_sort Yun, Sajung
collection PubMed
description BACKGROUND: Next generation sequencing produces base calls with low quality scores that can affect the accuracy of identifying simple nucleotide variation calls, including single nucleotide polymorphisms and small insertions and deletions. Here we compare the effectiveness of two data preprocessing methods, masking and trimming, and the accuracy of simple nucleotide variation calls on whole-genome sequence data from Caenorhabditis elegans. Masking substitutes low quality base calls with ‘N’s (undetermined bases), whereas trimming removes low quality bases, which results in shorter read lengths. RESULTS: We demonstrate that masking is more effective than trimming in reducing the false-positive rate in single nucleotide polymorphism (SNP) calling. However, neither preprocessing method affected the false-negative rate in SNP calling with statistical significance compared to the data analysis without preprocessing. The false-positive and false-negative rates for small insertions and deletions did not differ between masking and trimming. CONCLUSIONS: We recommend masking over trimming as a more effective preprocessing method for next generation sequencing data analysis, since masking reduces the false-positive rate in SNP calling without sacrificing the false-negative rate, although trimming is currently more commonly used in the field. The Perl script for masking is available at http://code.google.com/p/subn/. The sequencing data used in the study were deposited in the Sequence Read Archive (SRX450968 and SRX451773).
format Online
Article
Text
id pubmed-4268903
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4268903 2014-12-18 Masking as an effective quality control method for next-generation sequencing data analysis Yun, Sajung Yun, Sijung BMC Bioinformatics Methodology Article BioMed Central 2014-12-13 /pmc/articles/PMC4268903/ /pubmed/25494997 http://dx.doi.org/10.1186/s12859-014-0382-2 Text en © Yun and Yun; licensee BioMed Central Ltd. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Methodology Article
Yun, Sajung
Yun, Sijung
Masking as an effective quality control method for next-generation sequencing data analysis
title Masking as an effective quality control method for next-generation sequencing data analysis
title_full Masking as an effective quality control method for next-generation sequencing data analysis
title_fullStr Masking as an effective quality control method for next-generation sequencing data analysis
title_full_unstemmed Masking as an effective quality control method for next-generation sequencing data analysis
title_short Masking as an effective quality control method for next-generation sequencing data analysis
title_sort masking as an effective quality control method for next-generation sequencing data analysis
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4268903/
https://www.ncbi.nlm.nih.gov/pubmed/25494997
http://dx.doi.org/10.1186/s12859-014-0382-2
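
The abstract in this record contrasts masking (replacing low quality base calls with 'N') with trimming (removing low quality bases, which shortens the read). The following is a minimal Python sketch of that difference, not the authors' Perl script. The function names, the Phred+33 quality encoding, the threshold of 20, and the choice to trim only from the 3' end are all assumptions made for illustration; the record does not state which settings the study used.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 (Phred+33 encoding) is an assumption.
    """
    return [ord(ch) - offset for ch in quality]


def mask_read(seq, quality, threshold=20, offset=33):
    """Masking: replace every base whose score is below threshold with 'N'.

    The read keeps its original length; the threshold is hypothetical.
    """
    scores = phred_scores(quality, offset)
    return "".join(
        base if score >= threshold else "N"
        for base, score in zip(seq, scores)
    )


def trim_read(seq, quality, threshold=20, offset=33):
    """Trimming (one simple variant): drop low quality bases from the 3' end.

    Stops at the first base, scanning from the end, that meets the
    threshold, so the read becomes shorter. Real trimmers use a range of
    strategies; this one is chosen only to illustrate the idea.
    """
    scores = phred_scores(quality, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end], quality[:end]


# Toy read: 'I' decodes to Phred 40, '#' decodes to Phred 2.
seq = "ACGTACGT"
quality = "IIII#II#"

masked = mask_read(seq, quality)
trimmed_seq, trimmed_quality = trim_read(seq, quality)
print(masked)       # ACGTNCGN
print(trimmed_seq)  # ACGTACG
```

In this toy read, masking preserves the read length and marks both low quality positions as undetermined, while the 3' trimming variant shortens the read by one base and leaves the internal low quality base in place.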