
An optimized GATK4 pipeline for Plasmodium falciparum whole genome sequencing variant calling and analysis

BACKGROUND: Accurate variant calls from whole genome sequencing (WGS) of Plasmodium falciparum infections are crucial in malaria population genomics. Here a falciparum variant calling pipeline based on GATK version 4 (GATK4) was optimized and applied to 6626 public Illumina WGS samples. METHODS: Con...


Bibliographic Details
Main authors: Niaré, Karamoko, Greenhouse, Bryan, Bailey, Jeffrey A.
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10327343/
https://www.ncbi.nlm.nih.gov/pubmed/37420214
http://dx.doi.org/10.1186/s12936-023-04632-0
_version_ 1785069604940283904
author Niaré, Karamoko
Greenhouse, Bryan
Bailey, Jeffrey A.
author_facet Niaré, Karamoko
Greenhouse, Bryan
Bailey, Jeffrey A.
author_sort Niaré, Karamoko
collection PubMed
description BACKGROUND: Accurate variant calls from whole genome sequencing (WGS) of Plasmodium falciparum infections are crucial in malaria population genomics. Here a falciparum variant calling pipeline based on GATK version 4 (GATK4) was optimized and applied to 6626 public Illumina WGS samples. METHODS: Control WGS and accurate PacBio assemblies of 10 laboratory strains were leveraged to optimize parameters that control the heterozygosity, local assembly region size, ploidy, mapping and base quality in both GATK HaplotypeCaller and GenotypeGVCFs. From these controls, a high-quality training dataset was generated to recalibrate the raw variant data. RESULTS: On current high-quality samples (read length = 250 bp, insert size = 405–524 bp), the optimized pipeline shows improved sensitivity (86.6 ± 1.7% for SNPs and 82.2 ± 5.9% for indels) compared to the default GATK4 pipeline (77.7 ± 1.3% for SNPs and 73.1 ± 5.1% for indels, adjusted P < 0.001) and previous variant calling with GATK version 3 (GATK3, 70.3 ± 3.0% for SNPs and 59.7 ± 5.8% for indels, adjusted P < 0.001). Its sensitivity on simulated mixed infection samples (80.8 ± 6.1% for SNPs and 78.3 ± 5.1% for indels) was again improved relative to default GATK4 (68.8 ± 6.0% for SNPs and 38.9 ± 0.7% for indels, adjusted P < 0.001). Precision was high and comparable across all pipelines on each type of data tested. The resulting combination of high-quality SNPs and indels increases the resolution of local population structure detection in sub-Saharan Africa. Finally, increasing ploidy improves the detection of drug resistance mutations and estimation of complexity of infection. CONCLUSIONS: Overall, this study provides an optimized falciparum GATK4 pipeline resource for variant calling which should help improve genomic studies of malaria. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12936-023-04632-0.
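The sensitivity and precision figures in the description can be read as comparisons between a pipeline's calls and a truth set. The following is a minimal Python sketch of that arithmetic, assuming variants are matched exactly on chromosome, position, reference allele and alternate allele. The record does not describe the authors' actual comparison procedure, and the function name, variant keys and example data below are hypothetical.

```python
from typing import Iterable, Set, Tuple

# A variant is keyed by (chromosome, position, ref allele, alt allele).
# Exact-match keying is an assumption made for illustration only.
Variant = Tuple[str, int, str, str]


def sensitivity_and_precision(
    called: Iterable[Variant], truth: Iterable[Variant]
) -> Tuple[float, float]:
    """Return (sensitivity, precision) of a call set against a truth set."""
    called_set: Set[Variant] = set(called)
    truth_set: Set[Variant] = set(truth)

    true_pos = len(called_set & truth_set)   # called and in truth
    false_neg = len(truth_set - called_set)  # in truth but missed
    false_pos = len(called_set - truth_set)  # called but not in truth

    # Sensitivity = TP / (TP + FN); precision = TP / (TP + FP).
    sens = true_pos / (true_pos + false_neg) if truth_set else float("nan")
    prec = true_pos / (true_pos + false_pos) if called_set else float("nan")
    return sens, prec


# Hypothetical toy data: four true variants, three of them recovered,
# plus one spurious call.
truth = [
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 40, "T", "TA"),
    ("chr2", 900, "C", "G"),
]
called = [
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 40, "T", "TA"),
    ("chr2", 555, "A", "G"),
]

sens, prec = sensitivity_and_precision(called, truth)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

Under this definition a pipeline can gain sensitivity without losing precision only if the additional calls it makes are mostly true variants, which is the pattern the description reports.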
format Online
Article
Text
id pubmed-10327343
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10327343 2023-07-08 An optimized GATK4 pipeline for Plasmodium falciparum whole genome sequencing variant calling and analysis Niaré, Karamoko Greenhouse, Bryan Bailey, Jeffrey A. Malar J Methodology BACKGROUND: Accurate variant calls from whole genome sequencing (WGS) of Plasmodium falciparum infections are crucial in malaria population genomics. Here a falciparum variant calling pipeline based on GATK version 4 (GATK4) was optimized and applied to 6626 public Illumina WGS samples. METHODS: Control WGS and accurate PacBio assemblies of 10 laboratory strains were leveraged to optimize parameters that control the heterozygosity, local assembly region size, ploidy, mapping and base quality in both GATK HaplotypeCaller and GenotypeGVCFs. From these controls, a high-quality training dataset was generated to recalibrate the raw variant data. RESULTS: On current high-quality samples (read length = 250 bp, insert size = 405–524 bp), the optimized pipeline shows improved sensitivity (86.6 ± 1.7% for SNPs and 82.2 ± 5.9% for indels) compared to the default GATK4 pipeline (77.7 ± 1.3% for SNPs and 73.1 ± 5.1% for indels, adjusted P < 0.001) and previous variant calling with GATK version 3 (GATK3, 70.3 ± 3.0% for SNPs and 59.7 ± 5.8% for indels, adjusted P < 0.001). Its sensitivity on simulated mixed infection samples (80.8 ± 6.1% for SNPs and 78.3 ± 5.1% for indels) was again improved relative to default GATK4 (68.8 ± 6.0% for SNPs and 38.9 ± 0.7% for indels, adjusted P < 0.001). Precision was high and comparable across all pipelines on each type of data tested. The resulting combination of high-quality SNPs and indels increases the resolution of local population structure detection in sub-Saharan Africa. Finally, increasing ploidy improves the detection of drug resistance mutations and estimation of complexity of infection.
CONCLUSIONS: Overall, this study provides an optimized falciparum GATK4 pipeline resource for variant calling which should help improve genomic studies of malaria. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12936-023-04632-0. BioMed Central 2023-07-07 /pmc/articles/PMC10327343/ /pubmed/37420214 http://dx.doi.org/10.1186/s12936-023-04632-0 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Methodology
Niaré, Karamoko
Greenhouse, Bryan
Bailey, Jeffrey A.
An optimized GATK4 pipeline for Plasmodium falciparum whole genome sequencing variant calling and analysis
title An optimized GATK4 pipeline for Plasmodium falciparum whole genome sequencing variant calling and analysis
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10327343/
https://www.ncbi.nlm.nih.gov/pubmed/37420214
http://dx.doi.org/10.1186/s12936-023-04632-0