Evaluating the effect of reference genome divergence on the analysis of empirical RADseq datasets

The advent of high‐throughput sequencing (HTS) has made genomic‐level analyses feasible for nonmodel organisms. A critical step of many HTS pipelines involves aligning reads to a reference genome to identify variants. Despite recent initiatives, only a fraction of species have publicly available re...

Bibliographic Details
Main Author: Bohling, Justin
Format: Online Article Text
Language: English
Published: John Wiley and Sons Inc. 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7391306/
https://www.ncbi.nlm.nih.gov/pubmed/32760550
http://dx.doi.org/10.1002/ece3.6483
author Bohling, Justin
collection PubMed
description The advent of high‐throughput sequencing (HTS) has made genomic‐level analyses feasible for nonmodel organisms. A critical step of many HTS pipelines involves aligning reads to a reference genome to identify variants. Despite recent initiatives, only a fraction of species have publicly available reference genomes. Therefore, a common practice is to align reads to the genome of an organism related to the target species; however, this could affect read alignment and bias genotyping. In this study, I conducted an experiment using empirical RADseq datasets generated for two species of salmonids (Actinopterygii; Teleostei; Salmonidae) to address these questions. There are currently reference genomes for six salmonids of varying phylogenetic distance. I aligned the RADseq data to all six genomes and identified variants with several different genotypers; the resulting variants were then fed into population genetic analyses. Increasing phylogenetic distance between target species and reference genome reduced both the proportion of reads that aligned successfully and mapping quality. Reference genome also influenced the number of SNPs that were generated and the depth at those SNPs, although the effect varied by genotyper. Inferences of population structure were mixed: increasing reference genome divergence reduced estimates of differentiation, but similar patterns of population relationships were found across scenarios. These findings reveal how the choice of reference genome can influence the output of bioinformatic pipelines. They also emphasize the need to identify best practices and guidelines for the burgeoning field of biodiversity genomics.
format Online
Article
Text
id pubmed-7391306
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher John Wiley and Sons Inc.
record_format MEDLINE/PubMed
spelling pubmed-7391306 2020-08-04 Evaluating the effect of reference genome divergence on the analysis of empirical RADseq datasets Bohling, Justin Ecol Evol Original Research John Wiley and Sons Inc. 2020-06-28 /pmc/articles/PMC7391306/ /pubmed/32760550 http://dx.doi.org/10.1002/ece3.6483 Text en Published 2020. This article is a U.S. Government work and is in the public domain in the USA. This is an open access article under the terms of the http://creativecommons.org/licenses/by/4.0/ License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
title Evaluating the effect of reference genome divergence on the analysis of empirical RADseq datasets
topic Original Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7391306/
https://www.ncbi.nlm.nih.gov/pubmed/32760550
http://dx.doi.org/10.1002/ece3.6483