The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins

Bibliographic Details
Main Authors: Rouillard, Andrew D., Gundersen, Gregory W., Fernandez, Nicolas F., Wang, Zichen, Monteiro, Caroline D., McDermott, Michael G., Ma’ayan, Avi
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4930834/
https://www.ncbi.nlm.nih.gov/pubmed/27374120
http://dx.doi.org/10.1093/database/baw100
_version_ 1782440793240240128
author Rouillard, Andrew D.
Gundersen, Gregory W.
Fernandez, Nicolas F.
Wang, Zichen
Monteiro, Caroline D.
McDermott, Michael G.
Ma’ayan, Avi
author_sort Rouillard, Andrew D.
collection PubMed
description Genomics, epigenomics, transcriptomics, proteomics and metabolomics efforts rapidly generate a plethora of data on the activity and levels of biomolecules within mammalian cells. At the same time, curation projects that organize knowledge from the biomedical literature into online databases are expanding. Hence, there is a wealth of information about genes, proteins and their associations, with an urgent need for data integration to achieve better knowledge extraction and data reuse. For this purpose, we developed the Harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins from over 70 major online resources. We extracted, abstracted and organized data into ∼72 million functional associations between genes/proteins and their attributes. Such attributes could be physical relationships with other biomolecules, expression in cell lines and tissues, genetic associations with knockout mouse or human phenotypes, or changes in expression after drug treatment. We stored these associations in a relational database along with rich metadata for the genes/proteins, their attributes and the original resources. The freely available Harmonizome web portal provides a graphical user interface, a web service and a mobile app for querying, browsing and downloading all of the collected data. To demonstrate the utility of the Harmonizome, we computed and visualized gene–gene and attribute–attribute similarity networks, and through unsupervised clustering, identified many unexpected relationships by combining pairs of datasets such as the association between kinase perturbations and disease signatures. We also applied supervised machine learning methods to predict novel substrates for kinases, endogenous ligands for G-protein coupled receptors, mouse phenotypes for knockout genes, and classified unannotated transmembrane proteins for likelihood of being ion channels. 
The Harmonizome is a comprehensive resource of knowledge about genes and proteins, and as such, it enables researchers to discover novel relationships between biological entities, as well as form novel data-driven hypotheses for experimental validation. Database URL: http://amp.pharm.mssm.edu/Harmonizome.
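The gene-gene similarity networks mentioned in the description can be illustrated with a minimal sketch. It assumes each gene is represented by the set of attributes it is associated with and uses cosine similarity between those sets; the abstract does not name the similarity measure the authors used, so that choice, the 0.5 threshold, the function names and the toy gene and attribute names are all assumptions made for illustration only.

```python
from math import sqrt

# Hypothetical toy data: each gene maps to the set of attributes it is
# associated with. These names are placeholders, not Harmonizome records.
associations = {
    "GENE_A": {"attr1", "attr2", "attr3"},
    "GENE_B": {"attr2", "attr3"},
    "GENE_C": {"attr4"},
}


def cosine_similarity(a, b):
    """Cosine similarity between two attribute sets (binary vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def similarity_network(assoc, threshold=0.5):
    """Return (gene1, gene2, score) edges whose score meets the threshold.

    The threshold is an assumed cut-off chosen for this sketch.
    """
    genes = sorted(assoc)
    edges = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            score = cosine_similarity(assoc[g1], assoc[g2])
            if score >= threshold:
                edges.append((g1, g2, score))
    return edges


edges = similarity_network(associations)
```

Swapping the roles of genes and attributes in the input mapping would give an attribute-attribute network under the same assumptions.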
format Online
Article
Text
id pubmed-4930834
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4930834 2016-07-05
Database (Oxford)
Original Article
Oxford University Press 2016-07-02
/pmc/articles/PMC4930834/ /pubmed/27374120 http://dx.doi.org/10.1093/database/baw100
Text en © The Author(s) 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins
title_sort harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4930834/
https://www.ncbi.nlm.nih.gov/pubmed/27374120
http://dx.doi.org/10.1093/database/baw100