A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing
MOTIVATION: Low-copy repeats (LCRs) or segmental duplications are long segments of duplicated DNA that cover [Formula: see text] 5% of the human genome. Existing tools for variant calling using short reads exhibit low accuracy in LCRs due to ambiguity in read mapping and extensive copy number variat...
Main authors: | Prodanov, Timofey; Bansal, Vikas |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | Genome Sequence Analysis |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311303/ https://www.ncbi.nlm.nih.gov/pubmed/37387146 http://dx.doi.org/10.1093/bioinformatics/btad268 |
_version_ | 1785066714388496384 |
---|---|
author | Prodanov, Timofey Bansal, Vikas |
author_facet | Prodanov, Timofey Bansal, Vikas |
author_sort | Prodanov, Timofey |
collection | PubMed |
description | MOTIVATION: Low-copy repeats (LCRs) or segmental duplications are long segments of duplicated DNA that cover [Formula: see text] 5% of the human genome. Existing tools for variant calling using short reads exhibit low accuracy in LCRs due to ambiguity in read mapping and extensive copy number variation. Variants in more than 150 genes overlapping LCRs are associated with risk for human diseases. METHODS: We describe a short-read variant calling method, ParascopyVC, that performs variant calling jointly across all repeat copies and utilizes reads independent of mapping quality in LCRs. To identify candidate variants, ParascopyVC aggregates reads mapped to different repeat copies and performs polyploid variant calling. Subsequently, paralogous sequence variants that can differentiate repeat copies are identified using population data and used for estimating the genotype of variants for each repeat copy. RESULTS: On simulated whole-genome sequence data, ParascopyVC achieved higher precision (0.997) and recall (0.807) than three state-of-the-art variant callers (best precision = 0.956 for DeepVariant and best recall = 0.738 for GATK) in 167 LCR regions. Benchmarking of ParascopyVC using the genome-in-a-bottle high-confidence variant calls for HG002 genome showed that it achieved a very high precision of 0.991 and a high recall of 0.909 across LCR regions, significantly better than FreeBayes (precision = 0.954 and recall = 0.822), GATK (precision = 0.888 and recall = 0.873) and DeepVariant (precision = 0.983 and recall = 0.861). ParascopyVC demonstrated a consistently higher accuracy (mean F1 = 0.947) than other callers (best F1 = 0.908) across seven human genomes. AVAILABILITY AND IMPLEMENTATION: ParascopyVC is implemented in Python and is freely available at https://github.com/tprodanov/ParascopyVC. |
format | Online Article Text |
id | pubmed-10311303 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10311303 2023-07-01 A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing Prodanov, Timofey Bansal, Vikas Bioinformatics Genome Sequence Analysis MOTIVATION: Low-copy repeats (LCRs) or segmental duplications are long segments of duplicated DNA that cover [Formula: see text] 5% of the human genome. Existing tools for variant calling using short reads exhibit low accuracy in LCRs due to ambiguity in read mapping and extensive copy number variation. Variants in more than 150 genes overlapping LCRs are associated with risk for human diseases. METHODS: We describe a short-read variant calling method, ParascopyVC, that performs variant calling jointly across all repeat copies and utilizes reads independent of mapping quality in LCRs. To identify candidate variants, ParascopyVC aggregates reads mapped to different repeat copies and performs polyploid variant calling. Subsequently, paralogous sequence variants that can differentiate repeat copies are identified using population data and used for estimating the genotype of variants for each repeat copy. RESULTS: On simulated whole-genome sequence data, ParascopyVC achieved higher precision (0.997) and recall (0.807) than three state-of-the-art variant callers (best precision = 0.956 for DeepVariant and best recall = 0.738 for GATK) in 167 LCR regions. Benchmarking of ParascopyVC using the genome-in-a-bottle high-confidence variant calls for HG002 genome showed that it achieved a very high precision of 0.991 and a high recall of 0.909 across LCR regions, significantly better than FreeBayes (precision = 0.954 and recall = 0.822), GATK (precision = 0.888 and recall = 0.873) and DeepVariant (precision = 0.983 and recall = 0.861). ParascopyVC demonstrated a consistently higher accuracy (mean F1 = 0.947) than other callers (best F1 = 0.908) across seven human genomes. AVAILABILITY AND IMPLEMENTATION: ParascopyVC is implemented in Python and is freely available at https://github.com/tprodanov/ParascopyVC. Oxford University Press 2023-06-30 /pmc/articles/PMC10311303/ /pubmed/37387146 http://dx.doi.org/10.1093/bioinformatics/btad268 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Genome Sequence Analysis Prodanov, Timofey Bansal, Vikas A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing |
title | A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing |
title_full | A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing |
title_fullStr | A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing |
title_full_unstemmed | A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing |
title_short | A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing |
title_sort | multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing |
topic | Genome Sequence Analysis |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311303/ https://www.ncbi.nlm.nih.gov/pubmed/37387146 http://dx.doi.org/10.1093/bioinformatics/btad268 |
work_keys_str_mv | AT prodanovtimofey amultilocusapproachforaccuratevariantcallinginlowcopyrepeatsusingwholegenomesequencing AT bansalvikas amultilocusapproachforaccuratevariantcallinginlowcopyrepeatsusingwholegenomesequencing AT prodanovtimofey multilocusapproachforaccuratevariantcallinginlowcopyrepeatsusingwholegenomesequencing AT bansalvikas multilocusapproachforaccuratevariantcallinginlowcopyrepeatsusingwholegenomesequencing |
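
The description field quotes precision and recall for each caller on HG002 but gives an F1 value only as a mean across seven genomes. As a reading aid, the sketch below applies the standard F1 definition (the harmonic mean of precision and recall) to the HG002 figures quoted in the record. The helper `f1_score` and the dictionary names are illustrative and are not part of ParascopyVC; the resulting per-caller F1 values are derived by arithmetic here and are not reported in the record itself.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) on HG002 across LCR regions, copied from the description field.
hg002 = {
    "ParascopyVC": (0.991, 0.909),
    "FreeBayes": (0.954, 0.822),
    "GATK": (0.888, 0.873),
    "DeepVariant": (0.983, 0.861),
}

# Derived values: these F1 scores are computed here, not quoted from the record.
f1_by_caller = {name: f1_score(p, r) for name, (p, r) in hg002.items()}

# Print callers from highest to lowest derived F1.
for name, score in sorted(f1_by_caller.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: F1 = {score:.3f}")
```

Under this definition the ParascopyVC figures give an F1 of about 0.948 on HG002, which is in line with the mean F1 of 0.947 across seven genomes stated in the description.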