
A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes

BACKGROUND: Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for methods to distinguish paralogs, i.e. duplicate sequences within a genome, from Mendelian loci, i.e. single copy sequences that pair at meiosis. The rat...


Bibliographic Details
Main authors:	Clark, Lindsay V., Mays, Wittney, Lipka, Alexander E., Sacks, Erik J.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8939213/ https://www.ncbi.nlm.nih.gov/pubmed/35317727 http://dx.doi.org/10.1186/s12859-022-04635-9

_version_	1784672697803866112
author	Clark, Lindsay V. Mays, Wittney Lipka, Alexander E. Sacks, Erik J.
author_sort	Clark, Lindsay V.
collection	PubMed
description	BACKGROUND: Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for methods to distinguish paralogs, i.e. duplicate sequences within a genome, from Mendelian loci, i.e. single copy sequences that pair at meiosis. The ratio of observed to expected heterozygosity is an effective tool for filtering loci but requires genotyping to be performed first at a high computational cost, whereas counting the number of sequence tags detected per genotype is computationally quick but very ineffective in inbred or polyploid populations. Therefore, new methods are needed for filtering paralogs. RESULTS: We introduce a novel statistic, H(ind)/H(E), that uses the probability that two reads sampled from a genotype will belong to different alleles, instead of observed heterozygosity. The expected value of H(ind)/H(E) is the same across all loci in a dataset, regardless of read depth or allele frequency. In contrast to methods based on observed heterozygosity, it can be estimated and used for filtering loci prior to genotype calling. In addition to filtering paralogs, it can be used to filter loci with null alleles or high overdispersion, and identify individuals with unexpected ploidy and hybrid status. We demonstrate that the statistic is useful at read depths as low as five to 10, well below the depth needed for accurate genotype calling in polyploid and outcrossing species. CONCLUSIONS: Our methodology for estimating H(ind)/H(E) across loci and individuals, as well as determining reasonable thresholds for filtering loci, is implemented in polyRAD v1.6, available at https://github.com/lvclark/polyRAD. In large sequencing datasets, we anticipate that the ability to filter markers and identify problematic individuals prior to genotype calling will save researchers considerable computational time. 
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04635-9.
format	Online Article Text
id	pubmed-8939213
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8939213 2022-03-23 A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes Clark, Lindsay V. Mays, Wittney Lipka, Alexander E. Sacks, Erik J. BMC Bioinformatics Research BioMed Central 2022-03-22 /pmc/articles/PMC8939213/ /pubmed/35317727 http://dx.doi.org/10.1186/s12859-022-04635-9 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8939213/ https://www.ncbi.nlm.nih.gov/pubmed/35317727 http://dx.doi.org/10.1186/s12859-022-04635-9
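
The description field in this record defines H(ind)/H(E) only in words: H(ind) is the probability that two reads sampled from one genotype belong to different alleles, and H(E) is expected heterozygosity. The Python sketch below illustrates that definition on raw allelic read depths and is not the polyRAD implementation. It assumes that the two reads are drawn without replacement, that allele frequencies are estimated as the mean per-individual read proportion, and that individuals with fewer than two reads are skipped. The function names `h_ind`, `expected_heterozygosity`, and `hind_he` are hypothetical and are not part of polyRAD's API.

```python
from typing import List, Sequence


def h_ind(depths: Sequence[int]) -> float:
    """Probability that two reads drawn without replacement from one
    individual at one locus belong to different alleles.
    (Sampling without replacement is an assumption of this sketch.)"""
    total = sum(depths)
    if total < 2:
        raise ValueError("need at least two reads to sample a pair")
    # Ordered pairs of reads that carry the same allele.
    same_allele_pairs = sum(d * (d - 1) for d in depths)
    return 1.0 - same_allele_pairs / (total * (total - 1))


def expected_heterozygosity(freqs: Sequence[float]) -> float:
    """Probability that two alleles drawn at random from the population
    differ, given allele frequencies that sum to 1."""
    return 1.0 - sum(p * p for p in freqs)


def hind_he(depth_matrix: Sequence[Sequence[int]]) -> List[float]:
    """Per-individual H(ind)/H(E) at one locus.

    depth_matrix[i][a] is the read depth of allele a in individual i.
    Individuals with fewer than two reads are skipped (assumption).
    """
    usable = [row for row in depth_matrix if sum(row) >= 2]
    if not usable:
        raise ValueError("no individual has two or more reads")
    n_alleles = len(usable[0])
    # Allele frequency estimate: mean read proportion across individuals
    # (an assumption of this sketch, made before any genotype calling).
    freqs = [
        sum(row[a] / sum(row) for row in usable) / len(usable)
        for a in range(n_alleles)
    ]
    he = expected_heterozygosity(freqs)
    if he == 0.0:
        raise ValueError("locus is monomorphic; the ratio is undefined")
    return [h_ind(row) / he for row in usable]


if __name__ == "__main__":
    # Toy diploid locus: two homozygotes and one heterozygote, depth 10 each.
    locus = [[10, 0], [5, 5], [0, 10]]
    ratios = hind_he(locus)
    print(ratios)
    print(sum(ratios) / len(ratios))  # locus-level mean across individuals
```

Under Hardy-Weinberg proportions a diploid is heterozygous with probability H(E), and two reads from a heterozygote carry different alleles about half the time, so a Mendelian diploid locus should average close to 0.5. A locus whose mean sits well above that value would be a candidate paralog under this reasoning; the filtering thresholds themselves come from polyRAD, not from this sketch.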
