
Building a protein name dictionary from full text: a machine learning term extraction approach

BACKGROUND: The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often ext...


Bibliographic Details
Main Authors: Shi, Lei; Campagne, Fabien
Format: Text
Language: English
Published: BioMed Central 2005
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1090555/
https://www.ncbi.nlm.nih.gov/pubmed/15817129
http://dx.doi.org/10.1186/1471-2105-6-88
_version_ 1782123883882610688
author Shi, Lei
Campagne, Fabien
author_facet Shi, Lei
Campagne, Fabien
author_sort Shi, Lei
collection PubMed
description BACKGROUND: The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature. RESULTS: We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text. CONCLUSION: This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt.
format Text
id pubmed-1090555
institution National Center for Biotechnology Information
language English
publishDate 2005
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1090555 2005-05-07 Building a protein name dictionary from full text: a machine learning term extraction approach Shi, Lei Campagne, Fabien BMC Bioinformatics Methodology Article BACKGROUND: The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature. RESULTS: We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text. CONCLUSION: This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt. BioMed Central 2005-04-07 /pmc/articles/PMC1090555/ /pubmed/15817129 http://dx.doi.org/10.1186/1471-2105-6-88 Text en Copyright © 2005 Shi and Campagne; licensee BioMed Central Ltd.
spellingShingle Methodology Article
Shi, Lei
Campagne, Fabien
Building a protein name dictionary from full text: a machine learning term extraction approach
title Building a protein name dictionary from full text: a machine learning term extraction approach
title_full Building a protein name dictionary from full text: a machine learning term extraction approach
title_fullStr Building a protein name dictionary from full text: a machine learning term extraction approach
title_full_unstemmed Building a protein name dictionary from full text: a machine learning term extraction approach
title_short Building a protein name dictionary from full text: a machine learning term extraction approach
title_sort building a protein name dictionary from full text: a machine learning term extraction approach
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1090555/
https://www.ncbi.nlm.nih.gov/pubmed/15817129
http://dx.doi.org/10.1186/1471-2105-6-88
work_keys_str_mv AT shilei buildingaproteinnamedictionaryfromfulltextamachinelearningtermextractionapproach
AT campagnefabien buildingaproteinnamedictionaryfromfulltextamachinelearningtermextractionapproach