An approach for proteins and their encoding genes synonyms integration based on protein ontology

Bibliographic Details
Main Authors: Wang, Xiaohong; Jing, Xiaoli; Dou, Fangkun; Cao, Haowei
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10496362/
https://www.ncbi.nlm.nih.gov/pubmed/37700258
http://dx.doi.org/10.1186/s12859-023-05464-0
Description
Summary: BACKGROUND: Biological research is generating high volumes of data distributed across various sources. The inconsistent naming of proteins and their encoding genes poses major challenges for protein data integration: proteins and their coding genes usually have multiple related names and notations, which are difficult to match exactly; gene and protein nomenclature is complex and varies from species to species; some less-studied species have no gene or protein nomenclature at all; and the annotation of the same protein or gene varies greatly between databases. A comprehensive set of protein/gene synonyms is therefore necessary for relevant studies. RESULTS: In this study, we propose an approach for integrating the synonyms of proteins and their encoding genes based on protein ontology. The integration workflow is composed of three modules: data acquisition, entity and attribute alignment, and attribute integration and deduplication. The resulting integrated synonym set of proteins and their coding genes contains over 128.59 million terms covering 560,275 proteins/genes and 13,781 species. Serving as the semantic basis, the comprehensive synonym set was used to develop a data platform that provides one-stop data retrieval regardless of the diversity of protein nomenclature and species. CONCLUSION: The synonym set constructed here can serve as an important resource for biological named entity identification, text mining and information retrieval without name ambiguity. In particular, synonyms associated with well-defined species categories can help in studying the evolutionary relationships between species at the molecular level. More importantly, the comprehensive synonym set is the semantic basis for our subsequent studies on a protein–protein interaction (PPI) knowledge graph. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05464-0.
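The alignment, integration and deduplication steps named in the summary can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record layout (keys "id", "species", "synonyms"), the identifier "EXAMPLE:0001", the function names and the case-insensitive matching rule are all assumptions made for illustration. Records from two toy sources are aligned on a shared entity identifier, their synonym lists are merged, and spellings that differ only in case or whitespace are collapsed.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(name.split()).casefold()


def integrate_synonyms(sources):
    """Merge synonym lists from several sources, keyed by a shared entity ID.

    Each source is a list of dicts with the hypothetical keys
    "id", "species" and "synonyms" (assumed layout, not from the paper).
    """
    merged = defaultdict(lambda: {"species": None, "synonyms": {}})
    for records in sources:
        for rec in records:
            entry = merged[rec["id"]]  # entity alignment on the shared ID
            entry["species"] = entry["species"] or rec.get("species")
            for name in rec.get("synonyms", []):
                # Deduplication: keep the first spelling seen per normalized form.
                entry["synonyms"].setdefault(normalize(name), name.strip())
    return {
        eid: {"species": e["species"], "synonyms": sorted(e["synonyms"].values())}
        for eid, e in merged.items()
    }


# Toy input: two hypothetical sources describing the same entity.
source_a = [{"id": "EXAMPLE:0001", "species": "Homo sapiens",
             "synonyms": ["TP53", "p53", "Tumor protein p53"]}]
source_b = [{"id": "EXAMPLE:0001", "species": "Homo sapiens",
             "synonyms": ["tp53", "Cellular tumor antigen p53", "P53 "]}]

result = integrate_synonyms([source_a, source_b])
print(result["EXAMPLE:0001"]["synonyms"])
```

Keeping the first spelling seen for each normalized form is only one possible policy; a real pipeline would likely need species-aware rules, since the summary notes that nomenclature varies from species to species.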