A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes
BACKGROUND: Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for methods to distinguish paralogs, i.e. duplicate sequences within a genome, from Mendelian loci, i.e. single copy sequences that pair at meiosis. The rat...
Main authors: | Clark, Lindsay V., Mays, Wittney, Lipka, Alexander E., Sacks, Erik J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: |
BioMed Central
2022
|
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8939213/ https://www.ncbi.nlm.nih.gov/pubmed/35317727 http://dx.doi.org/10.1186/s12859-022-04635-9 |
_version_ | 1784672697803866112 |
---|---|
author | Clark, Lindsay V. Mays, Wittney Lipka, Alexander E. Sacks, Erik J. |
author_facet | Clark, Lindsay V. Mays, Wittney Lipka, Alexander E. Sacks, Erik J. |
author_sort | Clark, Lindsay V. |
collection | PubMed |
description | BACKGROUND: Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for methods to distinguish paralogs, i.e. duplicate sequences within a genome, from Mendelian loci, i.e. single copy sequences that pair at meiosis. The ratio of observed to expected heterozygosity is an effective tool for filtering loci but requires genotyping to be performed first at a high computational cost, whereas counting the number of sequence tags detected per genotype is computationally quick but very ineffective in inbred or polyploid populations. Therefore, new methods are needed for filtering paralogs. RESULTS: We introduce a novel statistic, H(ind)/H(E), that uses the probability that two reads sampled from a genotype will belong to different alleles, instead of observed heterozygosity. The expected value of H(ind)/H(E) is the same across all loci in a dataset, regardless of read depth or allele frequency. In contrast to methods based on observed heterozygosity, it can be estimated and used for filtering loci prior to genotype calling. In addition to filtering paralogs, it can be used to filter loci with null alleles or high overdispersion, and identify individuals with unexpected ploidy and hybrid status. We demonstrate that the statistic is useful at read depths as low as five to 10, well below the depth needed for accurate genotype calling in polyploid and outcrossing species. CONCLUSIONS: Our methodology for estimating H(ind)/H(E) across loci and individuals, as well as determining reasonable thresholds for filtering loci, is implemented in polyRAD v1.6, available at https://github.com/lvclark/polyRAD. In large sequencing datasets, we anticipate that the ability to filter markers and identify problematic individuals prior to genotype calling will save researchers considerable computational time. 
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04635-9. |
format | Online Article Text |
id | pubmed-8939213 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-89392132022-03-23 A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes Clark, Lindsay V. Mays, Wittney Lipka, Alexander E. Sacks, Erik J. BMC Bioinformatics Research BACKGROUND: Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for methods to distinguish paralogs, i.e. duplicate sequences within a genome, from Mendelian loci, i.e. single copy sequences that pair at meiosis. The ratio of observed to expected heterozygosity is an effective tool for filtering loci but requires genotyping to be performed first at a high computational cost, whereas counting the number of sequence tags detected per genotype is computationally quick but very ineffective in inbred or polyploid populations. Therefore, new methods are needed for filtering paralogs. RESULTS: We introduce a novel statistic, H(ind)/H(E), that uses the probability that two reads sampled from a genotype will belong to different alleles, instead of observed heterozygosity. The expected value of H(ind)/H(E) is the same across all loci in a dataset, regardless of read depth or allele frequency. In contrast to methods based on observed heterozygosity, it can be estimated and used for filtering loci prior to genotype calling. In addition to filtering paralogs, it can be used to filter loci with null alleles or high overdispersion, and identify individuals with unexpected ploidy and hybrid status. We demonstrate that the statistic is useful at read depths as low as five to 10, well below the depth needed for accurate genotype calling in polyploid and outcrossing species. CONCLUSIONS: Our methodology for estimating H(ind)/H(E) across loci and individuals, as well as determining reasonable thresholds for filtering loci, is implemented in polyRAD v1.6, available at https://github.com/lvclark/polyRAD. 
In large sequencing datasets, we anticipate that the ability to filter markers and identify problematic individuals prior to genotype calling will save researchers considerable computational time. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04635-9. BioMed Central 2022-03-22 /pmc/articles/PMC8939213/ /pubmed/35317727 http://dx.doi.org/10.1186/s12859-022-04635-9 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Clark, Lindsay V. Mays, Wittney Lipka, Alexander E. Sacks, Erik J. A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes |
title | A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes |
title_full | A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes |
title_fullStr | A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes |
title_full_unstemmed | A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes |
title_short | A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes |
title_sort | population-level statistic for assessing mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8939213/ https://www.ncbi.nlm.nih.gov/pubmed/35317727 http://dx.doi.org/10.1186/s12859-022-04635-9 |
work_keys_str_mv | AT clarklindsayv apopulationlevelstatisticforassessingmendelianbehaviorofgenotypingbysequencingdatafromhighlyduplicatedgenomes AT mayswittney apopulationlevelstatisticforassessingmendelianbehaviorofgenotypingbysequencingdatafromhighlyduplicatedgenomes AT lipkaalexandere apopulationlevelstatisticforassessingmendelianbehaviorofgenotypingbysequencingdatafromhighlyduplicatedgenomes AT sackserikj apopulationlevelstatisticforassessingmendelianbehaviorofgenotypingbysequencingdatafromhighlyduplicatedgenomes AT clarklindsayv populationlevelstatisticforassessingmendelianbehaviorofgenotypingbysequencingdatafromhighlyduplicatedgenomes AT mayswittney populationlevelstatisticforassessingmendelianbehaviorofgenotypingbysequencingdatafromhighlyduplicatedgenomes AT lipkaalexandere populationlevelstatisticforassessingmendelianbehaviorofgenotypingbysequencingdatafromhighlyduplicatedgenomes AT sackserikj populationlevelstatisticforassessingmendelianbehaviorofgenotypingbysequencingdatafromhighlyduplicatedgenomes |
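
The abstract above describes H(ind)/H(E) as built from the probability that two reads sampled from a genotype belong to different alleles, divided by expected heterozygosity. The following is a minimal Python sketch of that idea, not the polyRAD implementation. It assumes that reads are sampled without replacement, that allele frequencies are estimated as the mean within-individual read proportion, and that individuals with fewer than two reads are skipped. The function names `h_ind`, `expected_heterozygosity`, and `hind_he_locus` are hypothetical and exist only for this illustration.

```python
from typing import List, Optional, Sequence


def h_ind(depths: Sequence[int]) -> Optional[float]:
    """Probability that two reads drawn without replacement from one
    individual at one locus belong to different alleles.

    `depths` holds the read depth of each allele. Returns None when fewer
    than two reads are available (assumption: such individuals are skipped).
    """
    total = sum(depths)
    if total < 2:
        return None
    # Ordered pairs of reads that share an allele, over all ordered pairs.
    same = sum(d * (d - 1) for d in depths)
    return 1.0 - same / (total * (total - 1))


def expected_heterozygosity(freqs: Sequence[float]) -> float:
    """Expected heterozygosity H_E = 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def hind_he_locus(read_depths: List[Sequence[int]]) -> Optional[float]:
    """Ratio of mean H_ind to H_E for one locus.

    `read_depths` has one entry per individual, each a list of per-allele
    read depths. Allele frequencies are estimated from the reads themselves
    (assumption: mean of within-individual read proportions), so no
    genotype calls are needed.
    """
    usable = [d for d in read_depths if sum(d) >= 2]
    if not usable:
        return None
    n_alleles = len(usable[0])
    freqs = [
        sum(d[a] / sum(d) for d in usable) / len(usable)
        for a in range(n_alleles)
    ]
    he = expected_heterozygosity(freqs)
    if he == 0.0:
        return None  # monomorphic locus: the ratio is undefined
    mean_hind = sum(h_ind(d) for d in usable) / len(usable)
    return mean_hind / he


if __name__ == "__main__":
    # Four individuals, two alleles: two apparent homozygotes, two heterozygotes.
    locus = [[10, 0], [5, 5], [5, 5], [0, 10]]
    print(hind_he_locus(locus))
```

Under these assumptions, a diploid heterozygote whose reads come equally from both alleles has an expected `h_ind` of 0.5 at any read depth of two or more, which is consistent with the abstract's statement that the expected value does not depend on read depth. A locus whose ratio sits well above the value seen at most other loci would be a candidate paralog under this sketch; the thresholds actually used by polyRAD are not reproduced here.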