Stopwords in technical language processing

There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While rese...

Bibliographic Details
Main Authors: Sarica, Serhad, Luo, Jianxi
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8341615/
https://www.ncbi.nlm.nih.gov/pubmed/34351911
http://dx.doi.org/10.1371/journal.pone.0254937
_version_ 1783733950422712320
author Sarica, Serhad
Luo, Jianxi
collection PubMed
description There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. Researchers use readily available stopwords lists that are derived from non-technical resources, but the technical jargon of engineering fields has its own highly frequent and uninformative words, and no standard stopwords list exists for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and by curating a stopwords dataset ready for technical language processing applications.
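
The three measures named in the description can be illustrated with a small sketch. This is not the authors' procedure or dataset: the toy corpus, the function names `term_statistics` and `stopword_candidates`, and the thresholds are all hypothetical, and the entropy used here is simply the Shannon entropy of a term's occurrences across documents. The idea shown is only that a term spread evenly over every document (low inverse document frequency, high entropy) is a plausible stopword candidate.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: (tf, idf, entropy)} for a list of tokenized documents.

    tf      : share of all tokens in the corpus that are this term
    idf     : natural log of (number of documents / documents containing the term)
    entropy : Shannon entropy (bits) of the term's counts across documents
    """
    n_docs = len(docs)
    counts_per_doc = [Counter(doc) for doc in docs]

    total = Counter()     # corpus-wide term counts
    doc_freq = Counter()  # number of documents containing each term
    for counts in counts_per_doc:
        total.update(counts)
        doc_freq.update(counts.keys())
    n_tokens = sum(total.values())

    stats = {}
    for term, count in total.items():
        tf = count / n_tokens
        idf = math.log(n_docs / doc_freq[term])
        entropy = 0.0
        for counts in counts_per_doc:
            if counts[term]:
                p = counts[term] / count
                entropy -= p * math.log2(p)
        stats[term] = (tf, idf, entropy)
    return stats


def stopword_candidates(stats, max_idf, min_entropy):
    """Hypothetical rule: keep terms with low idf and high entropy."""
    return sorted(
        term
        for term, (_tf, idf, entropy) in stats.items()
        if idf <= max_idf and entropy >= min_entropy
    )


# Toy corpus (an assumption for illustration only).
docs = [
    ["the", "method", "uses", "a", "gear"],
    ["the", "method", "uses", "a", "valve"],
    ["the", "method", "uses", "a", "sensor"],
]
stats = term_statistics(docs)
candidates = stopword_candidates(stats, max_idf=0.1, min_entropy=1.5)
print(candidates)
```

In this toy corpus, "method" and "uses" behave like "the" and "a": they occur in every document, so they would be flagged alongside the general-language stopwords, while "gear", "valve" and "sensor" would not. Real thresholds would have to be chosen from the corpus at hand.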
format Online
Article
Text
id pubmed-8341615
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8341615 2021-08-06 Stopwords in technical language processing. Sarica, Serhad; Luo, Jianxi. PLoS One, Research Article. Public Library of Science, 2021-08-05. /pmc/articles/PMC8341615/ /pubmed/34351911 http://dx.doi.org/10.1371/journal.pone.0254937 Text en. © 2021 Sarica, Luo. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Stopwords in technical language processing
topic Research Article