Mapinsights: deep exploration of quality issues and error profiles in high-throughput sequence data
High-throughput sequencing (HTS) has revolutionized science by enabling super-fast detection of genomic variants at base-pair resolution. Consequently, it poses the challenging problem of identification of technical artifacts, i.e. hidden non-random error patterns. Understanding the properties of se...
Main authors: | Das, Subrata; Biswas, Nidhan K; Basu, Analabha |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | Methods |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10415152/ https://www.ncbi.nlm.nih.gov/pubmed/37378434 http://dx.doi.org/10.1093/nar/gkad539 |
_version_ | 1785087459476897792 |
---|---|
author | Das, Subrata; Biswas, Nidhan K; Basu, Analabha |
author_facet | Das, Subrata; Biswas, Nidhan K; Basu, Analabha |
author_sort | Das, Subrata |
collection | PubMed |
description | High-throughput sequencing (HTS) has revolutionized science by enabling super-fast detection of genomic variants at base-pair resolution. Consequently, it poses the challenging problem of identification of technical artifacts, i.e. hidden non-random error patterns. Understanding the properties of sequencing artifacts holds the key in separating true variants from false positives. Here, we develop Mapinsights, a toolkit that performs quality control (QC) analysis of sequence alignment files, capable of detecting outliers based on sequencing artifacts of HTS data at a deeper resolution compared with existing methods. Mapinsights performs a cluster analysis based on novel and existing QC features derived from the sequence alignment for outlier detection. We applied Mapinsights on community standard open-source datasets and identified various quality issues including technical errors related to sequencing cycles, sequencing chemistry, sequencing libraries and across various orthogonal sequencing platforms. Mapinsights also enables identification of anomalies related to sequencing depth. A logistic regression-based model built on the features of Mapinsights shows high accuracy in detecting ‘low-confidence’ variant sites. Quantitative estimates and probabilistic arguments provided by Mapinsights can be utilized in identifying errors, bias and outlier samples, and also aid in improving the authenticity of variant calls. |
format | Online Article Text |
id | pubmed-10415152 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10415152 2023-08-12 Mapinsights: deep exploration of quality issues and error profiles in high-throughput sequence data Das, Subrata; Biswas, Nidhan K; Basu, Analabha Nucleic Acids Res Methods High-throughput sequencing (HTS) has revolutionized science by enabling super-fast detection of genomic variants at base-pair resolution. Consequently, it poses the challenging problem of identification of technical artifacts, i.e. hidden non-random error patterns. Understanding the properties of sequencing artifacts holds the key in separating true variants from false positives. Here, we develop Mapinsights, a toolkit that performs quality control (QC) analysis of sequence alignment files, capable of detecting outliers based on sequencing artifacts of HTS data at a deeper resolution compared with existing methods. Mapinsights performs a cluster analysis based on novel and existing QC features derived from the sequence alignment for outlier detection. We applied Mapinsights on community standard open-source datasets and identified various quality issues including technical errors related to sequencing cycles, sequencing chemistry, sequencing libraries and across various orthogonal sequencing platforms. Mapinsights also enables identification of anomalies related to sequencing depth. A logistic regression-based model built on the features of Mapinsights shows high accuracy in detecting ‘low-confidence’ variant sites. Quantitative estimates and probabilistic arguments provided by Mapinsights can be utilized in identifying errors, bias and outlier samples, and also aid in improving the authenticity of variant calls. Oxford University Press 2023-06-28 /pmc/articles/PMC10415152/ /pubmed/37378434 http://dx.doi.org/10.1093/nar/gkad539 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of Nucleic Acids Research. 
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methods Das, Subrata; Biswas, Nidhan K; Basu, Analabha Mapinsights: deep exploration of quality issues and error profiles in high-throughput sequence data |
title | Mapinsights: deep exploration of quality issues and error profiles in high-throughput sequence data |
title_sort | mapinsights: deep exploration of quality issues and error profiles in high-throughput sequence data |
topic | Methods |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10415152/ https://www.ncbi.nlm.nih.gov/pubmed/37378434 http://dx.doi.org/10.1093/nar/gkad539 |