A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes

BACKGROUND: Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for methods to distinguish paralogs, i.e. duplicate sequences within a genome, from Mendelian loci, i.e. single copy sequences that pair at meiosis. The rat...

Bibliographic Details
Main Authors: Clark, Lindsay V., Mays, Wittney, Lipka, Alexander E., Sacks, Erik J.
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8939213/
https://www.ncbi.nlm.nih.gov/pubmed/35317727
http://dx.doi.org/10.1186/s12859-022-04635-9
_version_ 1784672697803866112
author Clark, Lindsay V.
Mays, Wittney
Lipka, Alexander E.
Sacks, Erik J.
collection PubMed
description BACKGROUND: Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for methods to distinguish paralogs, i.e. duplicate sequences within a genome, from Mendelian loci, i.e. single copy sequences that pair at meiosis. The ratio of observed to expected heterozygosity is an effective tool for filtering loci but requires genotyping to be performed first at a high computational cost, whereas counting the number of sequence tags detected per genotype is computationally quick but very ineffective in inbred or polyploid populations. Therefore, new methods are needed for filtering paralogs. RESULTS: We introduce a novel statistic, H(ind)/H(E), that uses the probability that two reads sampled from a genotype will belong to different alleles, instead of observed heterozygosity. The expected value of H(ind)/H(E) is the same across all loci in a dataset, regardless of read depth or allele frequency. In contrast to methods based on observed heterozygosity, it can be estimated and used for filtering loci prior to genotype calling. In addition to filtering paralogs, it can be used to filter loci with null alleles or high overdispersion, and identify individuals with unexpected ploidy and hybrid status. We demonstrate that the statistic is useful at read depths as low as five to 10, well below the depth needed for accurate genotype calling in polyploid and outcrossing species. CONCLUSIONS: Our methodology for estimating H(ind)/H(E) across loci and individuals, as well as determining reasonable thresholds for filtering loci, is implemented in polyRAD v1.6, available at https://github.com/lvclark/polyRAD. In large sequencing datasets, we anticipate that the ability to filter markers and identify problematic individuals prior to genotype calling will save researchers considerable computational time. 
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04635-9.
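The idea behind the statistic can be illustrated with a small Python sketch. This is not the polyRAD implementation: the function names (hind, expected_het, hind_he_locus), the estimator of allele frequencies from per-individual read-depth ratios, and the toy read depths are all assumptions made for illustration. The numerator is computed as the probability that two reads drawn without replacement from one individual belong to different alleles, which is why no genotype call is needed.

```python
def hind(depths):
    """Probability that two reads drawn without replacement from one
    individual's reads at a locus belong to different alleles.
    `depths` holds that individual's read depth for each allele."""
    total = sum(depths)
    if total < 2:
        return None  # undefined with fewer than two reads
    same_allele_pairs = sum(d * (d - 1) for d in depths)
    return 1.0 - same_allele_pairs / (total * (total - 1))


def expected_het(freqs):
    """Expected heterozygosity H_E from population allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def hind_he_locus(depth_matrix):
    """Mean H_ind across individuals divided by H_E for one locus.
    `depth_matrix` is a list of per-individual allele depth lists.
    Assumption: allele frequencies are estimated as the mean of the
    per-individual read-depth ratios."""
    usable = [row for row in depth_matrix if sum(row) >= 2]
    if not usable:
        return None
    n_alleles = len(usable[0])
    freqs = [
        sum(row[a] / sum(row) for row in usable) / len(usable)
        for a in range(n_alleles)
    ]
    he = expected_het(freqs)
    if he == 0:
        return None  # monomorphic locus: the ratio is undefined
    mean_hind = sum(hind(row) for row in usable) / len(usable)
    return mean_hind / he


# Hypothetical read depths (rows = individuals, columns = two alleles).
mendelian_like = [[10, 0], [0, 10], [5, 5], [5, 5]]
paralog_like = [[5, 5], [5, 5], [5, 5], [5, 5]]  # every individual "heterozygous"

print(hind_he_locus(mendelian_like))
print(hind_he_locus(paralog_like))
```

In this toy data the second locus, where every individual carries reads from both alleles, gives a ratio twice that of the first (10/9 versus 5/9). That pattern of apparent fixed heterozygosity is the kind of signal one would expect from collapsed paralogs, though real thresholds should come from the methodology described in the article rather than from this sketch.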
format Online
Article
Text
id pubmed-8939213
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8939213 2022-03-23 BMC Bioinformatics Research BioMed Central 2022-03-22 /pmc/articles/PMC8939213/ /pubmed/35317727 http://dx.doi.org/10.1186/s12859-022-04635-9 Text en © The Author(s) 2022. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8939213/
https://www.ncbi.nlm.nih.gov/pubmed/35317727
http://dx.doi.org/10.1186/s12859-022-04635-9