Benchmarking bacterial genome-wide association study methods using simulated genomes and phenotypes

Genome-wide association studies (GWASs) have the potential to reveal the genetics of microbial phenotypes such as antibiotic resistance and virulence. Capitalizing on the growing wealth of bacterial sequence data, microbial GWAS methods aim to identify causal genetic variants while ignoring spurious...

Bibliographic Details
Main authors: Saber, Morteza M., Shapiro, B. Jesse
Format: Online Article Text
Language: English
Published: Microbiology Society 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7200059/
https://www.ncbi.nlm.nih.gov/pubmed/32100713
http://dx.doi.org/10.1099/mgen.0.000337
_version_ 1783529266756976640
author Saber, Morteza M.
Shapiro, B. Jesse
collection PubMed
description Genome-wide association studies (GWASs) have the potential to reveal the genetics of microbial phenotypes such as antibiotic resistance and virulence. Capitalizing on the growing wealth of bacterial sequence data, microbial GWAS methods aim to identify causal genetic variants while ignoring spurious associations. Bacteria reproduce clonally, leading to strong population structure and genome-wide linkage, making it challenging to separate true ‘hits’ (i.e. mutations that cause a phenotype) from non-causal linked mutations. GWAS methods attempt to correct for population structure in different ways, but their performance has not yet been systematically and comprehensively evaluated under a range of evolutionary scenarios. Here, we developed a bacterial GWAS simulator (BacGWASim) to generate bacterial genomes with varying rates of mutation, recombination and other evolutionary parameters, along with a subset of causal mutations underlying a phenotype of interest. We assessed the performance (recall and precision) of three widely used single-locus GWAS approaches (cluster-based, dimensionality-reduction and linear mixed models, implemented in plink, pyseer and gemma) and one relatively new multi-locus model implemented in pyseer, across a range of simulated sample sizes, recombination rates and causal mutation effect sizes. As expected, all methods performed better with larger sample sizes and effect sizes. The performance of clustering and dimensionality-reduction approaches to correct for population structure varied considerably with the choice of parameters. Notably, the multi-locus elastic net (lasso) approach was consistently amongst the highest-performing methods, and had the highest power in detecting causal variants with both low and high effect sizes. Most methods reached the level of good performance (recall >0.75) for identifying causal mutations of strong effect size [log odds ratio (OR) ≥2] with a sample size of 2000 genomes. However, only elastic nets reached the level of reasonable performance (recall=0.35) for detecting markers with weaker effects (log OR ~1) in smaller samples. Elastic nets also showed superior precision and recall in controlling for genome-wide linkage, relative to single-locus models. However, all methods performed relatively poorly on highly clonal (low-recombining) genomes, suggesting room for improvement in method development. These findings show the potential for multi-locus models to improve bacterial GWAS performance. BacGWASim code and simulated data are publicly available to enable further comparisons and benchmarking of new methods.
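The abstract reports method performance as recall and precision measured against the simulated causal mutations. As a minimal sketch of how these two metrics are commonly computed from a list of significant hits, the Python below compares called variants with a known causal set. The function name, the variant identifiers and the exact-match scoring rule are assumptions made for illustration; this is not BacGWASim's own scoring code, and the record does not say how the study scored variants linked to a causal site.

```python
from typing import Iterable, Tuple


def precision_recall(called: Iterable[str], causal: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) of called variants against the simulated causal set."""
    called_set, causal_set = set(called), set(causal)
    # True positives: variants that were both called and truly causal.
    true_positives = len(called_set & causal_set)
    # Precision: fraction of called variants that are causal (0.0 if nothing was called).
    precision = true_positives / len(called_set) if called_set else 0.0
    # Recall: fraction of causal variants that were called (0.0 if there are no causal variants).
    recall = true_positives / len(causal_set) if causal_set else 0.0
    return precision, recall


# Hypothetical variant identifiers, for illustration only.
causal_variants = {"snp_101", "snp_205", "snp_333", "snp_478"}
called_variants = {"snp_101", "snp_205", "snp_333", "snp_900", "snp_901"}

precision, recall = precision_recall(called_variants, causal_variants)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```

In this toy case three of the four causal variants are recovered, so recall is 0.75, the value the abstract uses as its threshold for good performance; two of the five calls are false positives, giving a precision of 0.60.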
format Online
Article
Text
id pubmed-7200059
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Microbiology Society
record_format MEDLINE/PubMed
spelling pubmed-7200059 2020-05-06 Benchmarking bacterial genome-wide association study methods using simulated genomes and phenotypes Saber, Morteza M. Shapiro, B. Jesse Microb Genom Research Article Microbiology Society 2020-02-25 /pmc/articles/PMC7200059/ /pubmed/32100713 http://dx.doi.org/10.1099/mgen.0.000337 Text en © 2020 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License.
title Benchmarking bacterial genome-wide association study methods using simulated genomes and phenotypes
topic Research Article