A computational framework for improving genetic variants identification from 5,061 sheep sequencing data
BACKGROUND: Pan-genomics is a recently emerging strategy that can be utilized to provide a more comprehensive characterization of genetic variation. Joint calling is routinely used to combine identified variants across multiple related samples. However, the improvement of variants identification usi...
Main authors: | Xie, Shangqian; Isaacs, Karissa; Becker, Gabrielle; Murdoch, Brenda M. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2023 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10544426/ https://www.ncbi.nlm.nih.gov/pubmed/37779189 http://dx.doi.org/10.1186/s40104-023-00923-3 |
_version_ | 1785114501754912768 |
---|---|
author | Xie, Shangqian Isaacs, Karissa Becker, Gabrielle Murdoch, Brenda M. |
author_facet | Xie, Shangqian Isaacs, Karissa Becker, Gabrielle Murdoch, Brenda M. |
author_sort | Xie, Shangqian |
collection | PubMed |
description | BACKGROUND: Pan-genomics is an emerging strategy for characterizing genetic variation more comprehensively. Joint calling is routinely used to combine variants identified across multiple related samples. However, the gain in variant identification from mutual support information across samples remains quite limited for population-scale genotyping. RESULTS: In this study, we developed a computational framework for joint calling of genetic variants from 5,061 sheep that incorporates sequencing error and optimizes mutual support information from multiple samples' data. Variants were identified from multiple samples in four steps: (1) probabilities of variants called by two widely used algorithms, GATK and Freebayes, were calculated with a Poisson model incorporating the base sequencing error potential; (2) variants with high mapping quality, or consistently identified in at least two samples by GATK and Freebayes, were used to construct the raw high-confidence identification (rHID) variants database; (3) the high-confidence variants identified in a single sample were ordered by probability value and controlled by false discovery rate (FDR) using the rHID database; (4) to avoid eliminating potentially true variants from the rHID database, the variants that failed FDR control were reexamined to rescue potentially true variants and ensure highly accurate variant identification. The percentage of concordant SNPs and Indels between Freebayes and GATK improved significantly, by 12% to 32%, after applying the new method compared with the raw variants. The method also found low-frequency variants in individual sheep in genes involved in several traits, including nipple number (GPC5), scrapie pathology (PAPSS2), seasonal reproduction and litter size (GRM1), coat color (RAB27A), and lentivirus susceptibility (TMEM154).
CONCLUSION: The new method uses a computational strategy to reduce the number of false positives while improving the identification of genetic variants. The strategy incurs no extra cost, because it requires no additional samples or sequencing data, and it identifies rare variants that can be important for practical applications in animal breeding. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s40104-023-00923-3. |
format | Online Article Text |
id | pubmed-10544426 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-10544426 2023-10-03 A computational framework for improving genetic variants identification from 5,061 sheep sequencing data Xie, Shangqian Isaacs, Karissa Becker, Gabrielle Murdoch, Brenda M. J Anim Sci Biotechnol Research BACKGROUND: Pan-genomics is an emerging strategy for characterizing genetic variation more comprehensively. Joint calling is routinely used to combine variants identified across multiple related samples. However, the gain in variant identification from mutual support information across samples remains quite limited for population-scale genotyping. RESULTS: In this study, we developed a computational framework for joint calling of genetic variants from 5,061 sheep that incorporates sequencing error and optimizes mutual support information from multiple samples' data. Variants were identified from multiple samples in four steps: (1) probabilities of variants called by two widely used algorithms, GATK and Freebayes, were calculated with a Poisson model incorporating the base sequencing error potential; (2) variants with high mapping quality, or consistently identified in at least two samples by GATK and Freebayes, were used to construct the raw high-confidence identification (rHID) variants database; (3) the high-confidence variants identified in a single sample were ordered by probability value and controlled by false discovery rate (FDR) using the rHID database; (4) to avoid eliminating potentially true variants from the rHID database, the variants that failed FDR control were reexamined to rescue potentially true variants and ensure highly accurate variant identification.
The percentage of concordant SNPs and Indels between Freebayes and GATK improved significantly, by 12% to 32%, after applying the new method compared with the raw variants. The method also found low-frequency variants in individual sheep in genes involved in several traits, including nipple number (GPC5), scrapie pathology (PAPSS2), seasonal reproduction and litter size (GRM1), coat color (RAB27A), and lentivirus susceptibility (TMEM154). CONCLUSION: The new method uses a computational strategy to reduce the number of false positives while improving the identification of genetic variants. The strategy incurs no extra cost, because it requires no additional samples or sequencing data, and it identifies rare variants that can be important for practical applications in animal breeding. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s40104-023-00923-3. BioMed Central 2023-10-02 /pmc/articles/PMC10544426/ /pubmed/37779189 http://dx.doi.org/10.1186/s40104-023-00923-3 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Xie, Shangqian Isaacs, Karissa Becker, Gabrielle Murdoch, Brenda M. A computational framework for improving genetic variants identification from 5,061 sheep sequencing data |
title | A computational framework for improving genetic variants identification from 5,061 sheep sequencing data |
title_full | A computational framework for improving genetic variants identification from 5,061 sheep sequencing data |
title_fullStr | A computational framework for improving genetic variants identification from 5,061 sheep sequencing data |
title_full_unstemmed | A computational framework for improving genetic variants identification from 5,061 sheep sequencing data |
title_short | A computational framework for improving genetic variants identification from 5,061 sheep sequencing data |
title_sort | computational framework for improving genetic variants identification from 5,061 sheep sequencing data |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10544426/ https://www.ncbi.nlm.nih.gov/pubmed/37779189 http://dx.doi.org/10.1186/s40104-023-00923-3 |
work_keys_str_mv | AT xieshangqian acomputationalframeworkforimprovinggeneticvariantsidentificationfrom5061sheepsequencingdata AT isaacskarissa acomputationalframeworkforimprovinggeneticvariantsidentificationfrom5061sheepsequencingdata AT beckergabrielle acomputationalframeworkforimprovinggeneticvariantsidentificationfrom5061sheepsequencingdata AT murdochbrendam acomputationalframeworkforimprovinggeneticvariantsidentificationfrom5061sheepsequencingdata AT xieshangqian computationalframeworkforimprovinggeneticvariantsidentificationfrom5061sheepsequencingdata AT isaacskarissa computationalframeworkforimprovinggeneticvariantsidentificationfrom5061sheepsequencingdata AT beckergabrielle computationalframeworkforimprovinggeneticvariantsidentificationfrom5061sheepsequencingdata AT murdochbrendam computationalframeworkforimprovinggeneticvariantsidentificationfrom5061sheepsequencingdata |
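
The four steps listed in the description field can be illustrated with a small Python sketch. This is one possible reading of the abstract, not the authors' implementation: the record does not give the exact Poisson formula, the mapping-quality threshold, or how the rHID database enters the FDR estimate. The `Call` record, the function names, the default `error_rate`, `min_mapq`, and `alpha` values, and the rule that treats variants absent from rHID as putative false discoveries are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One variant call from one caller in one sample (hypothetical record)."""
    sample: str
    site: str       # e.g. "1:100:A>G"
    caller: str     # "gatk" or "freebayes"
    depth: int      # total reads covering the site
    alt_reads: int  # reads supporting the alternate allele
    mapq: float     # mapping quality reported for the call


def error_pvalue(depth, alt_reads, error_rate=0.01):
    """Step 1 (assumed form): Poisson probability that at least `alt_reads`
    alternate-supporting reads arise from sequencing error alone, with
    expected error count lambda = depth * error_rate."""
    if alt_reads <= 0:
        return 1.0
    lam = depth * error_rate
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i)
              for i in range(alt_reads))
    return max(0.0, 1.0 - cdf)


def build_rhid(calls, min_mapq=50.0, min_samples=2):
    """Step 2 (assumed thresholds): a site enters rHID if any call has high
    mapping quality, or if both callers agree on it in >= min_samples samples."""
    rhid = set()
    support = {}  # site -> caller -> set of samples
    for c in calls:
        support.setdefault(c.site, {}).setdefault(c.caller, set()).add(c.sample)
        if c.mapq >= min_mapq:
            rhid.add(c.site)
    for site, by_caller in support.items():
        agreed = by_caller.get("gatk", set()) & by_caller.get("freebayes", set())
        if len(agreed) >= min_samples:
            rhid.add(site)
    return rhid


def filter_sample(sample_calls, rhid, alpha=0.05):
    """Steps 3 and 4 (assumed rule): rank one sample's calls by p-value, keep
    the longest ranked prefix whose fraction of non-rHID sites is <= alpha,
    then rescue calls below the cutoff whose site is in rHID."""
    ranked = sorted(sample_calls,
                    key=lambda c: error_pvalue(c.depth, c.alt_reads))
    cutoff, misses = 0, 0
    for n, c in enumerate(ranked, start=1):
        misses += (c.site not in rhid)  # count putative false discoveries
        if misses / n <= alpha:
            cutoff = n
    kept = ranked[:cutoff]
    rescued = [c for c in ranked[cutoff:] if c.site in rhid]
    return kept, rescued
```

In this sketch `build_rhid` would be run once over the calls from all samples and both callers, and `filter_sample` once per sample. A production pipeline would read these values from the VCF files produced by GATK and Freebayes rather than from in-memory records.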