
An approach for proteins and their encoding genes synonyms integration based on protein ontology

BACKGROUND: Biological research is generating high volumes of data distributed across various sources. The inconsistent naming of proteins and their encoding genes brings great challenges to protein data integration: proteins and their coding genes usually have multiple related names and notations,...


Bibliographic Details
Main authors: Wang, Xiaohong; Jing, Xiaoli; Dou, Fangkun; Cao, Haowei
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10496362/
https://www.ncbi.nlm.nih.gov/pubmed/37700258
http://dx.doi.org/10.1186/s12859-023-05464-0
author Wang, Xiaohong
Jing, Xiaoli
Dou, Fangkun
Cao, Haowei
collection PubMed
description BACKGROUND: Biological research is generating high volumes of data distributed across various sources. The inconsistent naming of proteins and their encoding genes poses great challenges to protein data integration: proteins and their coding genes usually have multiple related names and notations, which are difficult to match exactly; the nomenclature of genes and proteins is complex and varies from species to species; some less-studied species have no gene or protein nomenclature; and the annotation of the same protein/gene varies greatly across databases. In summary, a comprehensive set of protein/gene synonyms is necessary for relevant studies. RESULTS: In this study, we propose an approach for integrating the synonyms of proteins and their encoding genes based on protein ontology. The workflow is composed of three modules: data acquisition, entity and attribute alignment, and attribute integration and deduplication. The resulting integrated synonym set of proteins and their coding genes contains over 128.59 million terminologies covering 560,275 proteins/genes and 13,781 species. As the semantic basis, the comprehensive synonym set was used to develop a data platform that provides one-stop data retrieval regardless of the diversity of protein nomenclature and species. CONCLUSION: The synonym set constructed here can serve as an important resource for biological named entity identification, text mining and information retrieval without name ambiguity. In particular, synonyms associated with well-defined species categories can help to study the evolutionary relationships between species at the molecular level. More importantly, the comprehensive synonym set is the semantic basis for our subsequent studies on a protein–protein interaction (PPI) knowledge graph. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05464-0.
format Online
Article
Text
id pubmed-10496362
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10496362 2023-09-13 An approach for proteins and their encoding genes synonyms integration based on protein ontology Wang, Xiaohong Jing, Xiaoli Dou, Fangkun Cao, Haowei BMC Bioinformatics Research BioMed Central 2023-09-12 /pmc/articles/PMC10496362/ /pubmed/37700258 http://dx.doi.org/10.1186/s12859-023-05464-0 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title An approach for proteins and their encoding genes synonyms integration based on protein ontology
topic Research