Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing

BACKGROUND: Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome. These sequences may be common in populations, and their absence in the reference genome may indicate rare variants in the genomes of individuals who s...

Bibliographic Details
Main authors: Liu, Yu, Koyutürk, Mehmet, Maxwell, Sean, Xiang, Min, Veigl, Martina, Cooper, Richard S, Tayo, Bamidele O, Li, Li, LaFramboise, Thomas, Wang, Zhenghe, Zhu, Xiaofeng, Chance, Mark R
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4148959/
https://www.ncbi.nlm.nih.gov/pubmed/25129063
http://dx.doi.org/10.1186/1471-2164-15-685
author Liu, Yu
Koyutürk, Mehmet
Maxwell, Sean
Xiang, Min
Veigl, Martina
Cooper, Richard S
Tayo, Bamidele O
Li, Li
LaFramboise, Thomas
Wang, Zhenghe
Zhu, Xiaofeng
Chance, Mark R
author_sort Liu, Yu
collection PubMed
description BACKGROUND: Sequences up to several megabases in length have been found in individual genomes but are absent from the human reference genome. These sequences may be common in populations, and their absence from the reference genome may reflect rare variants in the genomes of the individuals who served as donors for the Human Genome Project. Because the reference genome is used for probe design in microarray technology and for mapping short reads in next generation sequencing (NGS), this missing sequence could be a source of bias in functional genomic studies and variant analysis. One End Anchor (OEA) and/or orphan reads from paired-end sequencing have been used to identify novel sequences that are absent from the reference genome. However, no study has investigated the distribution, evolution and functionality of those sequences in human populations.
RESULTS: To systematically identify and study the missing common sequences (micSeqs), we extended the previous method by pooling OEA reads from a large number of individuals and applying strict filtering methods to remove false sequences. The pipeline was applied to data from phase 1 of the 1000 Genomes Project. We identified 309 micSeqs that are present in at least 1% of the human population but absent from the reference genome. We confirmed 76% of these 309 micSeqs by comparison to other primate genomes, individual human genomes, and gene expression data. Furthermore, we randomly selected fifteen micSeqs and confirmed their presence using PCR validation in 38 additional individuals. Functional analysis using published RNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed in human brain and three micSeqs contain transcription factor (TF) binding regions, suggesting they are functional elements. In addition, the identified micSeqs are absent in non-primates and show dynamic acquisition during primate evolution, culminating with most micSeqs being present in Africans, suggesting some micSeqs may be important sources of human diversity.
CONCLUSIONS: 76% of micSeqs were confirmed by a comparative genomics approach. Fourteen micSeqs are expressed in human brain or contain TF binding regions. Some micSeqs are primate-specific and conserved, and may play a role in the evolution of primates.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2164-15-685) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4148959
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4148959 2014-09-05
BMC Genomics
Research Article
BioMed Central 2014-08-16
/pmc/articles/PMC4148959/
/pubmed/25129063
http://dx.doi.org/10.1186/1471-2164-15-685
Text en
© Liu et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4148959/
https://www.ncbi.nlm.nih.gov/pubmed/25129063
http://dx.doi.org/10.1186/1471-2164-15-685