Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing
BACKGROUND: Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome. These sequences may be common in populations, and their absence in the reference genome may indicate rare variants in the genomes of individuals who s...
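Main authors: | Liu, Yu; Koyutürk, Mehmet; Maxwell, Sean; Xiang, Min; Veigl, Martina; Cooper, Richard S; Tayo, Bamidele O; Li, Li; LaFramboise, Thomas; Wang, Zhenghe; Zhu, Xiaofeng; Chance, Mark R |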
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4148959/ https://www.ncbi.nlm.nih.gov/pubmed/25129063 http://dx.doi.org/10.1186/1471-2164-15-685 |
_version_ | 1782332689825660928 |
---|---|
author | Liu, Yu; Koyutürk, Mehmet; Maxwell, Sean; Xiang, Min; Veigl, Martina; Cooper, Richard S; Tayo, Bamidele O; Li, Li; LaFramboise, Thomas; Wang, Zhenghe; Zhu, Xiaofeng; Chance, Mark R |
author_facet | Liu, Yu; Koyutürk, Mehmet; Maxwell, Sean; Xiang, Min; Veigl, Martina; Cooper, Richard S; Tayo, Bamidele O; Li, Li; LaFramboise, Thomas; Wang, Zhenghe; Zhu, Xiaofeng; Chance, Mark R |
author_sort | Liu, Yu |
collection | PubMed |
description | BACKGROUND: Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome. These sequences may be common in populations, and their absence in the reference genome may indicate rare variants in the genomes of individuals who served as donors for the Human Genome Project. As the reference genome is used in probe design for microarray technology and in mapping short reads in next generation sequencing (NGS), this missing sequence could be a source of bias in functional genomic studies and variant analysis. One End Anchor (OEA) and/or orphan reads from paired-end sequencing have been used to identify novel sequences that are absent in the reference genome. However, no study has investigated the distribution, evolution and functionality of those sequences in human populations. RESULTS: To systematically identify and study the missing common sequences (micSeqs), we extended the previous method by pooling OEA reads from a large number of individuals and applying strict filtering methods to remove false sequences. The pipeline was applied to data from phase 1 of the 1000 Genomes Project. We identified 309 micSeqs that are present in at least 1% of the human population, but absent in the reference genome. We confirmed 76% of these 309 micSeqs by comparison to other primate genomes, individual human genomes, and gene expression data. Furthermore, we randomly selected fifteen micSeqs and confirmed their presence using PCR validation in 38 additional individuals. Functional analysis using published RNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed in human brain and three micSeqs contain transcription factor (TF) binding regions, suggesting they are functional elements. In addition, the identified micSeqs are absent in non-primates and show dynamic acquisition during primate evolution, culminating with most micSeqs being present in Africans, suggesting some micSeqs may be important sources of human diversity. CONCLUSIONS: 76% of micSeqs were confirmed by a comparative genomics approach. Fourteen micSeqs are expressed in human brain or contain TF binding regions. Some micSeqs are primate-specific, conserved and may play a role in the evolution of primates. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2164-15-685) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4148959 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4148959 2014-09-05 Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing Liu, Yu Koyutürk, Mehmet Maxwell, Sean Xiang, Min Veigl, Martina Cooper, Richard S Tayo, Bamidele O Li, Li LaFramboise, Thomas Wang, Zhenghe Zhu, Xiaofeng Chance, Mark R BMC Genomics Research Article BACKGROUND: Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome. These sequences may be common in populations, and their absence in the reference genome may indicate rare variants in the genomes of individuals who served as donors for the Human Genome Project. As the reference genome is used in probe design for microarray technology and in mapping short reads in next generation sequencing (NGS), this missing sequence could be a source of bias in functional genomic studies and variant analysis. One End Anchor (OEA) and/or orphan reads from paired-end sequencing have been used to identify novel sequences that are absent in the reference genome. However, no study has investigated the distribution, evolution and functionality of those sequences in human populations. RESULTS: To systematically identify and study the missing common sequences (micSeqs), we extended the previous method by pooling OEA reads from a large number of individuals and applying strict filtering methods to remove false sequences. The pipeline was applied to data from phase 1 of the 1000 Genomes Project. We identified 309 micSeqs that are present in at least 1% of the human population, but absent in the reference genome. We confirmed 76% of these 309 micSeqs by comparison to other primate genomes, individual human genomes, and gene expression data. Furthermore, we randomly selected fifteen micSeqs and confirmed their presence using PCR validation in 38 additional individuals. Functional analysis using published RNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed in human brain and three micSeqs contain transcription factor (TF) binding regions, suggesting they are functional elements. In addition, the identified micSeqs are absent in non-primates and show dynamic acquisition during primate evolution, culminating with most micSeqs being present in Africans, suggesting some micSeqs may be important sources of human diversity. CONCLUSIONS: 76% of micSeqs were confirmed by a comparative genomics approach. Fourteen micSeqs are expressed in human brain or contain TF binding regions. Some micSeqs are primate-specific, conserved and may play a role in the evolution of primates. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2164-15-685) contains supplementary material, which is available to authorized users. BioMed Central 2014-08-16 /pmc/articles/PMC4148959/ /pubmed/25129063 http://dx.doi.org/10.1186/1471-2164-15-685 Text en © Liu et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Liu, Yu Koyutürk, Mehmet Maxwell, Sean Xiang, Min Veigl, Martina Cooper, Richard S Tayo, Bamidele O Li, Li LaFramboise, Thomas Wang, Zhenghe Zhu, Xiaofeng Chance, Mark R Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing |
title | Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing |
title_full | Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing |
title_fullStr | Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing |
title_full_unstemmed | Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing |
title_short | Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing |
title_sort | discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4148959/ https://www.ncbi.nlm.nih.gov/pubmed/25129063 http://dx.doi.org/10.1186/1471-2164-15-685 |
work_keys_str_mv | AT liuyu discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing AT koyuturkmehmet discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing AT maxwellsean discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing AT xiangmin discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing AT veiglmartina discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing AT cooperrichards discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing AT tayobamideleo discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing AT lili discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing AT laframboisethomas discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing AT wangzhenghe discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing AT zhuxiaofeng discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing AT chancemarkr discoveryofcommonsequencesabsentinthehumanreferencegenomeusingpooledsamplesfromnextgenerationsequencing |
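
The abstract in the description field builds on One End Anchor (OEA) and orphan reads from paired-end sequencing, pooled across many individuals. As a rough illustration only, and not the authors' pipeline, the Python sketch below classifies reads by the paired, unmapped and mate-unmapped bits of the SAM FLAG field and pools the unmapped ends of OEA pairs across samples. The function names `classify_read` and `pool_oea_reads`, the category labels, and the input layout (a dict of sample name to SAM text lines) are assumptions made for this example; the filtering and assembly steps mentioned in the abstract are not shown.

```python
# SAM FLAG bits as defined in the SAM format specification.
FLAG_PAIRED = 0x1         # read comes from a paired-end template
FLAG_UNMAPPED = 0x4       # this read is unmapped
FLAG_MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def classify_read(flag: int) -> str:
    """Classify one read from its SAM FLAG value (labels are hypothetical)."""
    if not flag & FLAG_PAIRED:
        return "unpaired"
    unmapped = bool(flag & FLAG_UNMAPPED)
    mate_unmapped = bool(flag & FLAG_MATE_UNMAPPED)
    if unmapped and mate_unmapped:
        return "orphan"        # neither end maps to the reference
    if unmapped:
        return "oea_unmapped"  # unmapped end of a one-end-anchored pair
    if mate_unmapped:
        return "oea_anchor"    # mapped end that anchors the pair
    return "mapped_pair"


def pool_oea_reads(sam_lines_by_sample):
    """Pool the unmapped ends of OEA pairs from several samples.

    sam_lines_by_sample: dict mapping a sample name to an iterable of
    SAM text lines (assumed input layout for this sketch).
    Returns a list of (sample, read_name, sequence) tuples.
    """
    pooled = []
    for sample, lines in sam_lines_by_sample.items():
        for line in lines:
            if line.startswith("@"):  # skip SAM header lines
                continue
            fields = line.rstrip("\n").split("\t")
            # Mandatory SAM columns: QNAME is column 1, FLAG column 2, SEQ column 10.
            name, flag, seq = fields[0], int(fields[1]), fields[9]
            if classify_read(flag) == "oea_unmapped":
                pooled.append((sample, name, seq))
    return pooled
```

Pooled sequences of this kind would then be candidates for assembly and filtering; how the published pipeline performs those steps, and how it estimates the 1% population frequency, is not stated in this record.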