Generation of SNP datasets for orangutan population genomics using improved reduced-representation sequencing and direct comparisons of SNP calling algorithms

BACKGROUND: High-throughput sequencing has opened up exciting possibilities in population and conservation genetics by enabling the assessment of genetic variation at genome-wide scales. One approach to reduce genome complexity, i.e. investigating only parts of the genome, is reduced-representation...

Bibliographic Details
Main Authors: Greminger, Maja P, Stölting, Kai N, Nater, Alexander, Goossens, Benoit, Arora, Natasha, Bruggmann, Rémy, Patrignani, Andrea, Nussberger, Beatrice, Sharma, Reeta, Kraus, Robert H S, Ambu, Laurentius N, Singleton, Ian, Chikhi, Lounes, van Schaik, Carel P, Krützen, Michael
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3897891/
https://www.ncbi.nlm.nih.gov/pubmed/24405840
http://dx.doi.org/10.1186/1471-2164-15-16
author Greminger, Maja P
Stölting, Kai N
Nater, Alexander
Goossens, Benoit
Arora, Natasha
Bruggmann, Rémy
Patrignani, Andrea
Nussberger, Beatrice
Sharma, Reeta
Kraus, Robert H S
Ambu, Laurentius N
Singleton, Ian
Chikhi, Lounes
van Schaik, Carel P
Krützen, Michael
author_sort Greminger, Maja P
collection PubMed
description BACKGROUND: High-throughput sequencing has opened up exciting possibilities in population and conservation genetics by enabling the assessment of genetic variation at genome-wide scales. One approach to reduce genome complexity, i.e. investigating only parts of the genome, is reduced-representation library (RRL) sequencing. Like similar approaches, RRL sequencing reduces ascertainment bias due to simultaneous discovery and genotyping of single-nucleotide polymorphisms (SNPs) and does not require reference genomes. Yet, generating such datasets remains challenging due to laboratory and bioinformatical issues. In the laboratory, current protocols require improvements with regard to sequencing homologous fragments to reduce the number of missing genotypes. From the bioinformatical perspective, the reliance of most studies on a single SNP caller disregards the possibility that different algorithms may produce disparate SNP datasets.
RESULTS: We present an improved RRL (iRRL) protocol that maximizes the generation of homologous DNA sequences, thus achieving improved genotyping-by-sequencing efficiency. Our modifications facilitate generation of single-sample libraries, enabling individual genotype assignments instead of pooled-sample analysis. We sequenced ~1% of the orangutan genome with 41-fold median coverage in 31 wild-born individuals from two populations. SNPs and genotypes were called using three different algorithms. We obtained substantially different SNP datasets depending on the SNP caller. Genotype validations revealed that the Unified Genotyper of the Genome Analysis Toolkit and SAMtools performed significantly better than a caller from CLC Genomics Workbench (CLC). Of all conflicting genotype calls, CLC was correct in only 17% of the cases. Furthermore, conflicting genotypes between two algorithms showed a systematic bias in that one caller almost exclusively assigned heterozygotes, while the other one almost exclusively assigned homozygotes.
CONCLUSIONS: Our enhanced iRRL approach greatly facilitates genotyping-by-sequencing and thus direct estimates of allele frequencies. Our direct comparison of three commonly used SNP callers emphasizes the need to question the accuracy of SNP and genotype calling, as we obtained considerably different SNP datasets depending on caller algorithms, sequencing depths and filtering criteria. These differences affected scans for signatures of natural selection, but will also exert undue influences on demographic inferences. This study presents the first effort to generate a population genomic dataset for wild-born orangutans with known population provenance.
format Online
Article
Text
id pubmed-3897891
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3897891 2014-01-23 Generation of SNP datasets for orangutan population genomics using improved reduced-representation sequencing and direct comparisons of SNP calling algorithms. BMC Genomics, Research Article. BioMed Central, 2014-01-10. /pmc/articles/PMC3897891/ /pubmed/24405840 http://dx.doi.org/10.1186/1471-2164-15-16 Text en. Copyright © 2014 Greminger et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Generation of SNP datasets for orangutan population genomics using improved reduced-representation sequencing and direct comparisons of SNP calling algorithms
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3897891/
https://www.ncbi.nlm.nih.gov/pubmed/24405840
http://dx.doi.org/10.1186/1471-2164-15-16