Gene2vec: distributed representation of genes based on co-expression

BACKGROUND: Existing functional description of genes are categorical, discrete, and mostly through manual process. In this work, we explore the idea of gene embedding, distributed representation of genes, in the spirit of word embedding. RESULTS: From a pure data-driven fashion, we trained a 200-dim...

Bibliographic Details
Main Authors: Du, Jingcheng; Jia, Peilin; Dai, Yulin; Tao, Cui; Zhao, Zhongming; Zhi, Degui
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6360648/
https://www.ncbi.nlm.nih.gov/pubmed/30712510
http://dx.doi.org/10.1186/s12864-018-5370-x
_version_ 1783392538704478208
author Du, Jingcheng
Jia, Peilin
Dai, Yulin
Tao, Cui
Zhao, Zhongming
Zhi, Degui
author_facet Du, Jingcheng
Jia, Peilin
Dai, Yulin
Tao, Cui
Zhao, Zhongming
Zhi, Degui
author_sort Du, Jingcheng
collection PubMed
description BACKGROUND: Existing functional description of genes are categorical, discrete, and mostly through manual process. In this work, we explore the idea of gene embedding, distributed representation of genes, in the spirit of word embedding. RESULTS: From a pure data-driven fashion, we trained a 200-dimension vector representation of all human genes, using gene co-expression patterns in 984 data sets from the GEO databases. These vectors capture functional relatedness of genes in terms of recovering known pathways - the average inner product (similarity) of genes within a pathway is 1.52X greater than that of random genes. Using t-SNE, we produced a gene co-expression map that shows local concentrations of tissue specific genes. We also illustrated the usefulness of the embedded gene vectors, laden with rich information on gene co-expression patterns, in tasks such as gene-gene interaction prediction. CONCLUSIONS: We proposed a machine learning method that utilizes transcriptome-wide gene co-expression to generate a distributed representation of genes. We further demonstrated the utility of our distribution by predicting gene-gene interaction based solely on gene names. The distributed representation of genes could be useful for more bioinformatics applications. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-018-5370-x) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6360648
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-63606482019-02-21 Gene2vec: distributed representation of genes based on co-expression Du, Jingcheng Jia, Peilin Dai, Yulin Tao, Cui Zhao, Zhongming Zhi, Degui BMC Genomics Research BACKGROUND: Existing functional description of genes are categorical, discrete, and mostly through manual process. In this work, we explore the idea of gene embedding, distributed representation of genes, in the spirit of word embedding. RESULTS: From a pure data-driven fashion, we trained a 200-dimension vector representation of all human genes, using gene co-expression patterns in 984 data sets from the GEO databases. These vectors capture functional relatedness of genes in terms of recovering known pathways - the average inner product (similarity) of genes within a pathway is 1.52X greater than that of random genes. Using t-SNE, we produced a gene co-expression map that shows local concentrations of tissue specific genes. We also illustrated the usefulness of the embedded gene vectors, laden with rich information on gene co-expression patterns, in tasks such as gene-gene interaction prediction. CONCLUSIONS: We proposed a machine learning method that utilizes transcriptome-wide gene co-expression to generate a distributed representation of genes. We further demonstrated the utility of our distribution by predicting gene-gene interaction based solely on gene names. The distributed representation of genes could be useful for more bioinformatics applications. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-018-5370-x) contains supplementary material, which is available to authorized users. BioMed Central 2019-02-04 /pmc/articles/PMC6360648/ /pubmed/30712510 http://dx.doi.org/10.1186/s12864-018-5370-x Text en © The Author(s). 
2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Du, Jingcheng
Jia, Peilin
Dai, Yulin
Tao, Cui
Zhao, Zhongming
Zhi, Degui
Gene2vec: distributed representation of genes based on co-expression
title Gene2vec: distributed representation of genes based on co-expression
title_full Gene2vec: distributed representation of genes based on co-expression
title_fullStr Gene2vec: distributed representation of genes based on co-expression
title_full_unstemmed Gene2vec: distributed representation of genes based on co-expression
title_short Gene2vec: distributed representation of genes based on co-expression
title_sort gene2vec: distributed representation of genes based on co-expression
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6360648/
https://www.ncbi.nlm.nih.gov/pubmed/30712510
http://dx.doi.org/10.1186/s12864-018-5370-x
work_keys_str_mv AT dujingcheng gene2vecdistributedrepresentationofgenesbasedoncoexpression
AT jiapeilin gene2vecdistributedrepresentationofgenesbasedoncoexpression
AT daiyulin gene2vecdistributedrepresentationofgenesbasedoncoexpression
AT taocui gene2vecdistributedrepresentationofgenesbasedoncoexpression
AT zhaozhongming gene2vecdistributedrepresentationofgenesbasedoncoexpression
AT zhidegui gene2vecdistributedrepresentationofgenesbasedoncoexpression