An approach for proteins and their encoding genes synonyms integration based on protein ontology
BACKGROUND: Biological research is generating high volumes of data distributed across various sources. The inconsistent naming of proteins and their encoding genes brings great challenges to protein data integration: proteins and their coding genes usually have multiple related names and notations,...
Main authors: | Wang, Xiaohong; Jing, Xiaoli; Dou, Fangkun; Cao, Haowei |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2023 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10496362/ https://www.ncbi.nlm.nih.gov/pubmed/37700258 http://dx.doi.org/10.1186/s12859-023-05464-0 |
_version_ | 1785105089090813952 |
---|---|
author | Wang, Xiaohong Jing, Xiaoli Dou, Fangkun Cao, Haowei |
author_sort | Wang, Xiaohong |
collection | PubMed |
description | BACKGROUND: Biological research is generating high volumes of data distributed across various sources. The inconsistent naming of proteins and their encoding genes poses great challenges to protein data integration: proteins and their coding genes usually have multiple related names and notations, which are difficult to match exactly; the nomenclature of genes and proteins is complex and varies from species to species; some less-studied species have no gene or protein nomenclature; and the annotation of the same protein/gene varies greatly across databases. In summary, a comprehensive set of protein/gene synonyms is necessary for relevant studies. RESULTS: In this study, we propose an approach for integrating the synonyms of proteins and their encoding genes based on protein ontology. The workflow is composed of three modules: data acquisition, entity and attribute alignment, and attribute integration and deduplication. The resulting integrated synonym set of proteins and their coding genes contains over 128.59 million terms covering 560,275 proteins/genes and 13,781 species. Serving as the semantic basis, the comprehensive synonym set was used to develop a data platform that provides one-stop data retrieval regardless of the diversity of protein nomenclature and species. CONCLUSION: The synonym set constructed here can serve as an important resource for biological named entity identification, text mining and information retrieval without name ambiguity. In particular, synonyms associated with well-defined species categories can help to study the evolutionary relationships between species at the molecular level. More importantly, the comprehensive synonym set is the semantic basis for our subsequent studies on a Protein–protein Interaction (PPI) knowledge graph. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05464-0. |
format | Online Article Text |
id | pubmed-10496362 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-10496362 2023-09-13 An approach for proteins and their encoding genes synonyms integration based on protein ontology Wang, Xiaohong Jing, Xiaoli Dou, Fangkun Cao, Haowei BMC Bioinformatics Research BioMed Central 2023-09-12 /pmc/articles/PMC10496362/ /pubmed/37700258 http://dx.doi.org/10.1186/s12859-023-05464-0 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | An approach for proteins and their encoding genes synonyms integration based on protein ontology |
title_sort | approach for proteins and their encoding genes synonyms integration based on protein ontology |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10496362/ https://www.ncbi.nlm.nih.gov/pubmed/37700258 http://dx.doi.org/10.1186/s12859-023-05464-0 |