UPCLASS: a deep learning-based classifier for UniProtKB entry publications

In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computational...

Bibliographic Details
Main Authors: Teodoro, Douglas; Knafou, Julien; Naderi, Nona; Pasche, Emilie; Gobeill, Julien; Arighi, Cecilia N; Ruch, Patrick
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7198315/
https://www.ncbi.nlm.nih.gov/pubmed/32367111
http://dx.doi.org/10.1093/database/baaa026
_version_ 1783528971380457472
author Teodoro, Douglas
Knafou, Julien
Naderi, Nona
Pasche, Emilie
Gobeill, Julien
Arighi, Cecilia N
Ruch, Patrick
collection PubMed
description In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.
format Online
Article
Text
id pubmed-7198315
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7198315 2020-05-05 Database (Oxford) Original Article Oxford University Press 2020-05-04 /pmc/articles/PMC7198315/ /pubmed/32367111 http://dx.doi.org/10.1093/database/baaa026 Text en © The Author(s) 2020. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title UPCLASS: a deep learning-based classifier for UniProtKB entry publications
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7198315/
https://www.ncbi.nlm.nih.gov/pubmed/32367111
http://dx.doi.org/10.1093/database/baaa026
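
Note on the reported metrics: the description gives both a micro F1-score (0.72) and a macro F1-score (0.62) for a task in which one publication can carry several categories. The sketch below is only an illustration of how the two averages are commonly computed in a multi-label setting and why they can differ; it is not the authors' evaluation code. The function names and the toy gold/predicted label sets are hypothetical, and only the category names (function, interaction, expression) come from the record.

```python
from typing import Dict, List, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[str]], pred: List[Set[str]], labels: List[str]
) -> Tuple[float, float]:
    """Return (micro F1, macro F1) for multi-label predictions."""
    # Per-category counts: [true positives, false positives, false negatives].
    counts: Dict[str, List[int]] = {c: [0, 0, 0] for c in labels}
    for g, p in zip(gold, pred):
        for c in labels:
            if c in p and c in g:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in g:
                counts[c][2] += 1
    # Macro: average the per-category F1 scores, so rare categories weigh
    # as much as frequent ones.
    macro = sum(f1(*counts[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all categories first, so frequent
    # categories dominate the score.
    tp = sum(v[0] for v in counts.values())
    fp = sum(v[1] for v in counts.values())
    fn = sum(v[2] for v in counts.values())
    return f1(tp, fp, fn), macro


# Hypothetical toy data: a classifier that always predicts "function".
LABELS = ["function", "interaction", "expression"]
gold = [{"function"}, {"function", "expression"}, {"function"}, {"interaction"}]
pred = [{"function"}, {"function"}, {"function"}, {"function"}]

micro, macro = micro_macro_f1(gold, pred, LABELS)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the frequent category is predicted well and the two rare ones are missed entirely, so the micro average comes out higher than the macro average. A gap in the same direction as the one reported in the description would be consistent with weaker performance on less frequent categories, although the record itself does not state the cause.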