
Improving the filtering of false positive single nucleotide variations by combining genomic features with quality metrics

MOTIVATION: Technical errors in sequencing or bioinformatics steps and difficulties in alignment at some genomic sites result in false positive (FP) variants. Filtering based on quality metrics is a common method for detecting FP variants, but setting thresholds to reduce FP rates may reduce the num...


Bibliographic Details
Main authors: Eren, Kazım Kıvanç; Çınar, Esra; Karakurt, Hamza U; Özgür, Arzucan
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10692869/
https://www.ncbi.nlm.nih.gov/pubmed/38019945
http://dx.doi.org/10.1093/bioinformatics/btad694
author Eren, Kazım Kıvanç
Çınar, Esra
Karakurt, Hamza U
Özgür, Arzucan
collection PubMed
description MOTIVATION: Technical errors in sequencing or bioinformatics steps and difficulties in alignment at some genomic sites result in false positive (FP) variants. Filtering based on quality metrics is a common method for detecting FP variants, but setting thresholds to reduce FP rates may reduce the number of true positive variants by overlooking the more complex relationships between features. The goal of this study is to develop a machine learning-based model for identifying FPs that integrates quality metrics with genomic features and with the feature interpretability property to provide insights into model results. RESULTS: We propose a random forest-based model that utilizes genomic features to improve identification of FPs. Further examination of the features shows that the newly introduced features have an important impact on the prediction of variants misclassified by VEF, GATK-CNN, and GARFIELD, recently introduced FP detection systems. We applied cost-sensitive training to avoid errors in misclassification of true variants and developed a model that provides a robust mechanism against misclassification of true variants while increasing the prediction rate of FP variants. This model can be easily re-trained when factors such as experimental protocols might alter the FP distribution. In addition, it has an interpretability mechanism that allows users to understand the impact of features on the model’s predictions. AVAILABILITY AND IMPLEMENTATION: The software implementation can be found at https://github.com/ideateknoloji/FPDetect.
format Online
Article
Text
id pubmed-10692869
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10692869 2023-12-03
Bioinformatics
Original Paper
Oxford University Press 2023-11-29
Text en
© The Author(s) 2023. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Improving the filtering of false positive single nucleotide variations by combining genomic features with quality metrics
topic Original Paper