
Similarity thresholds used in DNA sequence assembly from short reads can reduce the comparability of population histories across species

Comparing inferences among datasets generated using short read sequencing may provide insight into the concerted impacts of divergence, gene flow and selection across organisms, but comparisons are complicated by biases introduced during dataset assembly. Sequence similarity thresholds allow the de...


Bibliographic Details
Main authors: Harvey, Michael G.; Judy, Caroline Duffie; Seeholzer, Glenn F.; Maley, James M.; Graves, Gary R.; Brumfield, Robb T.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2015
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4411482/
https://www.ncbi.nlm.nih.gov/pubmed/25922792
http://dx.doi.org/10.7717/peerj.895
author Harvey, Michael G.
Judy, Caroline Duffie
Seeholzer, Glenn F.
Maley, James M.
Graves, Gary R.
Brumfield, Robb T.
author_sort Harvey, Michael G.
collection PubMed
description Comparing inferences among datasets generated using short read sequencing may provide insight into the concerted impacts of divergence, gene flow and selection across organisms, but comparisons are complicated by biases introduced during dataset assembly. Sequence similarity thresholds allow the de novo assembly of short reads into clusters of alleles representing different loci, but the resulting datasets are sensitive to both the similarity threshold used and to the variation naturally present in the organism under study. Thresholds that require high sequence similarity among reads for assembly (stringent thresholds) as well as highly variable species may result in datasets in which divergent alleles are lost or divided into separate loci (‘over-splitting’), whereas liberal thresholds increase the risk of paralogous loci being combined into a single locus (‘under-splitting’). Comparisons among datasets or species are therefore potentially biased if different similarity thresholds are applied or if the species differ in levels of within-lineage genetic variation. We examine the impact of a range of similarity thresholds on assembly of empirical short read datasets from populations of four different non-model bird lineages (species or species pairs) with different levels of genetic divergence. We find that, in all species, stringent similarity thresholds result in fewer alleles per locus than more liberal thresholds, which appears to be the result of high levels of over-splitting. The frequency of putative under-splitting, conversely, is low at all thresholds. Inferred genetic distances between individuals, gene tree depths, and estimates of the ancestral mutation-scaled effective population size (θ) differ depending upon the similarity threshold applied. Relative differences in inferences across species differ even when the same threshold is applied, but may be dramatically different when datasets assembled under different thresholds are compared. 
These differences not only complicate comparisons across species, but also preclude the application of standard mutation rates for parameter calibration. We suggest some best practices for assembling short read data to maximize comparability, such as using more liberal thresholds and examining the impact of different thresholds on each dataset.
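The over-splitting and under-splitting described above can be illustrated with a minimal sketch. It is not the assembly pipeline used in the article: the greedy, seed-based clustering rule, the function names, the toy 20 bp sequences and the three thresholds are all assumptions chosen for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs: list[str], threshold: float) -> list[list[str]]:
    """Greedy clustering (an assumed, simplified rule): each sequence joins
    the first cluster whose seed (first member) it matches at or above the
    threshold; otherwise it starts a new cluster."""
    clusters: list[list[str]] = []
    for s in seqs:
        for c in clusters:
            if identity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Hypothetical toy data: two alleles of one locus that differ at 2 of 20
# sites (90% identity), and a paralog that differs from allele_a at 6 of
# 20 sites (70% identity).
allele_a = "ACGTACGTACGTACGTACGT"
allele_b = "ACGTACGAACGTACGTACCT"
paralog = "TCGAACGTTCGTACCTGGGT"
seqs = [allele_a, allele_b, paralog]

for threshold in (0.95, 0.85, 0.65):
    n_clusters = len(cluster(seqs, threshold))
    print(f"threshold={threshold:.2f} -> {n_clusters} cluster(s)")
```

With these toy inputs, the stringent threshold (0.95) places the two alleles in separate clusters (over-splitting), the intermediate threshold (0.85) groups the alleles while keeping the paralog apart, and the liberal threshold (0.65) merges all three sequences into one cluster (under-splitting). Real assemblers use alignment-based similarity and more elaborate clustering, so the sketch only shows the direction of the bias.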
format Online
Article
Text
id pubmed-4411482
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-4411482 2015-04-28 PeerJ Bioinformatics PeerJ Inc. 2015-04-21 /pmc/articles/PMC4411482/ /pubmed/25922792 http://dx.doi.org/10.7717/peerj.895 Text en © 2015 Harvey et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title Similarity thresholds used in DNA sequence assembly from short reads can reduce the comparability of population histories across species
topic Bioinformatics