A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing

MOTIVATION: Low-copy repeats (LCRs) or segmental duplications are long segments of duplicated DNA that cover [Formula: see text] 5% of the human genome. Existing tools for variant calling using short reads exhibit low accuracy in LCRs due to ambiguity in read mapping and extensive copy number variat...

Bibliographic Details
Main Authors: Prodanov, Timofey; Bansal, Vikas
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects: Genome Sequence Analysis
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311303/
https://www.ncbi.nlm.nih.gov/pubmed/37387146
http://dx.doi.org/10.1093/bioinformatics/btad268
_version_ 1785066714388496384
author Prodanov, Timofey
Bansal, Vikas
collection PubMed
description MOTIVATION: Low-copy repeats (LCRs), or segmental duplications, are long segments of duplicated DNA that cover [Formula: see text] 5% of the human genome. Existing tools for variant calling using short reads exhibit low accuracy in LCRs due to ambiguity in read mapping and extensive copy number variation. Variants in more than 150 genes overlapping LCRs are associated with risk for human diseases.
METHODS: We describe a short-read variant calling method, ParascopyVC, that performs variant calling jointly across all repeat copies and utilizes reads independent of mapping quality in LCRs. To identify candidate variants, ParascopyVC aggregates reads mapped to different repeat copies and performs polyploid variant calling. Subsequently, paralogous sequence variants that can differentiate repeat copies are identified using population data and used for estimating the genotype of variants for each repeat copy.
RESULTS: On simulated whole-genome sequence data, ParascopyVC achieved higher precision (0.997) and recall (0.807) than three state-of-the-art variant callers (best precision = 0.956 for DeepVariant and best recall = 0.738 for GATK) in 167 LCR regions. Benchmarking of ParascopyVC using the Genome in a Bottle high-confidence variant calls for the HG002 genome showed that it achieved a very high precision of 0.991 and a high recall of 0.909 across LCR regions, significantly better than FreeBayes (precision = 0.954 and recall = 0.822), GATK (precision = 0.888 and recall = 0.873) and DeepVariant (precision = 0.983 and recall = 0.861). ParascopyVC demonstrated a consistently higher accuracy (mean F1 = 0.947) than other callers (best F1 = 0.908) across seven human genomes.
AVAILABILITY AND IMPLEMENTATION: ParascopyVC is implemented in Python and is freely available at https://github.com/tprodanov/ParascopyVC.
format Online
Article
Text
id pubmed-10311303
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10311303 2023-07-01 Prodanov, Timofey; Bansal, Vikas. Bioinformatics. Genome Sequence Analysis. Oxford University Press 2023-06-30 /pmc/articles/PMC10311303/ /pubmed/37387146 http://dx.doi.org/10.1093/bioinformatics/btad268 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing
title_sort multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing
topic Genome Sequence Analysis