
Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing


Bibliographic Details
Main authors:	Liu, Yu; Koyutürk, Mehmet; Maxwell, Sean; Xiang, Min; Veigl, Martina; Cooper, Richard S; Tayo, Bamidele O; Li, Li; LaFramboise, Thomas; Wang, Zhenghe; Zhu, Xiaofeng; Chance, Mark R
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4148959/ https://www.ncbi.nlm.nih.gov/pubmed/25129063 http://dx.doi.org/10.1186/1471-2164-15-685

collection	PubMed
description	BACKGROUND: Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome. These sequences may be common in populations, and their absence in the reference genome may indicate rare variants in the genomes of individuals who served as donors for the human genome project. As the reference genome is used in probe design for microarray technology and in mapping short reads in next generation sequencing (NGS), this missing sequence could be a source of bias in functional genomic studies and variant analysis. One End Anchor (OEA) and/or orphan reads from paired-end sequencing have been used to identify novel sequences that are absent in the reference genome. However, no study has investigated the distribution, evolution and functionality of those sequences in human populations.
RESULTS: To systematically identify and study the missing common sequences (micSeqs), we extended the previous method by pooling OEA reads from a large number of individuals and applying strict filtering methods to remove false sequences. The pipeline was applied to data from phase 1 of the 1000 Genomes Project. We identified 309 micSeqs that are present in at least 1% of the human population, but absent in the reference genome. We confirmed 76% of these 309 micSeqs by comparison to other primate genomes, individual human genomes, and gene expression data. Furthermore, we randomly selected fifteen micSeqs and confirmed their presence using PCR validation in 38 additional individuals. Functional analysis using published RNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed in human brain and three micSeqs contain transcription factor (TF) binding regions, suggesting they are functional elements. In addition, the identified micSeqs are absent in non-primates and show dynamic acquisition during primate evolution culminating with most micSeqs being present in Africans, suggesting some micSeqs may be important sources of human diversity.
CONCLUSIONS: 76% of micSeqs were confirmed by a comparative genomics approach. Fourteen micSeqs are expressed in human brain or contain TF binding regions. Some micSeqs are primate-specific, conserved and may play a role in the evolution of primates.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2164-15-685) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4148959
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4148959 2014-09-05 BMC Genomics Research Article BioMed Central 2014-08-16 /pmc/articles/PMC4148959/ /pubmed/25129063 http://dx.doi.org/10.1186/1471-2164-15-685 Text en © Liu et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4148959/ https://www.ncbi.nlm.nih.gov/pubmed/25129063 http://dx.doi.org/10.1186/1471-2164-15-685
