Stopwords in technical language processing
There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While rese...
Main Authors: | Sarica, Serhad; Luo, Jianxi |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8341615/ https://www.ncbi.nlm.nih.gov/pubmed/34351911 http://dx.doi.org/10.1371/journal.pone.0254937 |
_version_ | 1783733950422712320 |
---|---|
author | Sarica, Serhad Luo, Jianxi |
author_facet | Sarica, Serhad Luo, Jianxi |
author_sort | Sarica, Serhad |
collection | PubMed |
description | There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopwords lists that are derived from non-technical resources, the technical jargon of engineering fields contains their own highly frequent and uninformative words and there exists no standard stopwords list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and curating a stopwords dataset ready for technical language processing applications. |
format | Online Article Text |
id | pubmed-8341615 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-8341615 2021-08-06 Stopwords in technical language processing Sarica, Serhad Luo, Jianxi PLoS One Research Article There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopwords lists that are derived from non-technical resources, the technical jargon of engineering fields contains their own highly frequent and uninformative words and there exists no standard stopwords list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and curating a stopwords dataset ready for technical language processing applications. Public Library of Science 2021-08-05 /pmc/articles/PMC8341615/ /pubmed/34351911 http://dx.doi.org/10.1371/journal.pone.0254937 Text en © 2021 Sarica, Luo https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Sarica, Serhad Luo, Jianxi Stopwords in technical language processing |
title | Stopwords in technical language processing |
title_full | Stopwords in technical language processing |
title_fullStr | Stopwords in technical language processing |
title_full_unstemmed | Stopwords in technical language processing |
title_short | Stopwords in technical language processing |
title_sort | stopwords in technical language processing |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8341615/ https://www.ncbi.nlm.nih.gov/pubmed/34351911 http://dx.doi.org/10.1371/journal.pone.0254937 |
work_keys_str_mv | AT saricaserhad stopwordsintechnicallanguageprocessing AT luojianxi stopwordsintechnicallanguageprocessing |
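
The description in this record names three statistical measures used to identify stopwords: term frequency, inverse document frequency, and entropy. The following is a minimal sketch of how such measures might be computed over a tokenized corpus. It is not the authors' procedure: the function names, the toy corpus, and the ranking rule (lowest IDF first, then highest entropy, then highest term frequency) are assumptions made for illustration only.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: (tf, idf, entropy)} for a list of tokenized documents.

    tf      : total number of occurrences of the term in the corpus
    idf     : log(N / df), where df is the number of documents containing the term
    entropy : Shannon entropy (bits) of the term's distribution over documents
    """
    n_docs = len(docs)
    counts_per_doc = [Counter(doc) for doc in docs]

    total_tf = Counter()
    doc_freq = Counter()
    for counts in counts_per_doc:
        total_tf.update(counts)          # add occurrence counts
        doc_freq.update(counts.keys())   # add 1 per document containing the term

    stats = {}
    for term, tf in total_tf.items():
        idf = math.log(n_docs / doc_freq[term])
        entropy = 0.0
        for counts in counts_per_doc:
            c = counts.get(term, 0)
            if c:
                p = c / tf
                entropy -= p * math.log2(p)
        stats[term] = (tf, idf, entropy)
    return stats


def candidate_stopwords(docs, top_k=10):
    """Rank terms as stopword candidates (hypothetical rule, for illustration).

    Terms spread evenly over many documents (low IDF, high entropy) and
    occurring often (high TF) are ranked first; ties are broken alphabetically.
    """
    stats = term_statistics(docs)
    ranked = sorted(
        stats,
        key=lambda t: (stats[t][1], -stats[t][2], -stats[t][0], t),
    )
    return ranked[:top_k]


# Toy corpus (invented for this example)
docs = [
    ["the", "device", "comprises", "a", "valve"],
    ["the", "device", "comprises", "a", "rotor"],
    ["the", "device", "includes", "a", "sensor"],
]
print(candidate_stopwords(docs, top_k=3))  # prints ['a', 'device', 'the']
```

In this toy corpus, "device" ranks alongside "the" and "a" because it appears evenly in every document. That is the kind of domain-specific but uninformative word the description says general-purpose stopword lists miss.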