Reference-free SNP calling: improved accuracy by preventing incorrect calls from repetitive genomic regions

BACKGROUND: Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and have recently become the marker of choice in a wide variety of ecological and evolutionary studies. The advent of next-generation sequencing (NGS) technologies has made it pos...

Bibliographic Details
Main authors: Dou, Jinzhuang, Zhao, Xiqiang, Fu, Xiaoteng, Jiao, Wenqian, Wang, Nannan, Zhang, Lingling, Hu, Xiaoli, Wang, Shi, Bao, Zhenmin
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3472322/
https://www.ncbi.nlm.nih.gov/pubmed/22682067
http://dx.doi.org/10.1186/1745-6150-7-17
author Dou, Jinzhuang
Zhao, Xiqiang
Fu, Xiaoteng
Jiao, Wenqian
Wang, Nannan
Zhang, Lingling
Hu, Xiaoli
Wang, Shi
Bao, Zhenmin
author_sort Dou, Jinzhuang
collection PubMed
description BACKGROUND: Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and have recently become the marker of choice in a wide variety of ecological and evolutionary studies. The advent of next-generation sequencing (NGS) technologies has made it possible to efficiently genotype a large number of SNPs in the non-model organisms with no or limited genomic resources. Most NGS-based genotyping methods require a reference genome to perform accurate SNP calling. Little effort, however, has yet been devoted to developing or improving algorithms for accurate SNP calling in the absence of a reference genome. RESULTS: Here we describe an improved maximum likelihood (ML) algorithm called iML, which can achieve high genotyping accuracy for SNP calling in the non-model organisms without a reference genome. The iML algorithm incorporates the mixed Poisson/normal model to detect composite read clusters and can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions. Through analysis of simulation and real sequencing datasets, we demonstrate that in comparison with ML or a threshold approach, iML can remarkably improve the accuracy of de novo SNP genotyping and is especially powerful for the reference-free genotyping in diploid genomes with high repeat contents. CONCLUSIONS: The iML algorithm can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions, and thus outperforms the original ML algorithm by achieving much higher genotyping accuracy. Our algorithm is therefore very useful for accurate de novo SNP genotyping in the non-model organisms without a reference genome. REVIEWERS: This article was reviewed by Dr. Richard Durbin, Dr. Liliana Florea (nominated by Dr. Steven Salzberg) and Dr. Arcady Mushegian.
format Online
Article
Text
id pubmed-3472322
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3472322 2012-10-23 Reference-free SNP calling: improved accuracy by preventing incorrect calls from repetitive genomic regions Dou, Jinzhuang Zhao, Xiqiang Fu, Xiaoteng Jiao, Wenqian Wang, Nannan Zhang, Lingling Hu, Xiaoli Wang, Shi Bao, Zhenmin Biol Direct Research BioMed Central 2012-06-08 /pmc/articles/PMC3472322/ /pubmed/22682067 http://dx.doi.org/10.1186/1745-6150-7-17 Text en Copyright ©2012 Dou et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Reference-free SNP calling: improved accuracy by preventing incorrect calls from repetitive genomic regions
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3472322/
https://www.ncbi.nlm.nih.gov/pubmed/22682067
http://dx.doi.org/10.1186/1745-6150-7-17