
A study on the classification of stylistic and formal features in English based on corpus data testing


Bibliographic Details
Main Author:	Li, Shuhui
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2023
Subjects:	Algorithms and Analysis of Algorithms
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280581/ https://www.ncbi.nlm.nih.gov/pubmed/37346606 http://dx.doi.org/10.7717/peerj-cs.1297

author	Li, Shuhui
collection	PubMed
description	Traditional algorithms that combine statistics and rules do not measure the internal cohesion of words, and the N-gram algorithm places no limit on the length N, so it produces a large number of invalid word strings, which costs time and reduces experimental efficiency. Therefore, this article first constructs a Chinese neologism corpus, adopts an improved multi-PMI, and sets a double threshold to filter new words. Branch entropy is used to calculate the probabilities between words. Finally, the N-gram algorithm is used to segment the preprocessed corpus. Multi-word mutual information and a double mutual information threshold are used to identify new words and improve recognition accuracy. Experimental results show that the proposed algorithm improves accuracy, recall and F-measure by 7%, 3% and 5%, respectively, which can promote the sharing of language information resources so that people can intuitively and accurately obtain language information services from the internet.
format	Online Article Text
id	pubmed-10280581
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-10280581 2023-06-21. Li, Shuhui. A study on the classification of stylistic and formal features in English based on corpus data testing. PeerJ Comput Sci. Algorithms and Analysis of Algorithms. PeerJ Inc. 2023-04-25. /pmc/articles/PMC10280581/ /pubmed/37346606 http://dx.doi.org/10.7717/peerj-cs.1297. Text en ©2023 Li. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title	A study on the classification of stylistic and formal features in English based on corpus data testing
topic	Algorithms and Analysis of Algorithms
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280581/ https://www.ncbi.nlm.nih.gov/pubmed/37346606 http://dx.doi.org/10.7717/peerj-cs.1297
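
The pipeline summarized in the description field (N-gram candidates with a capped N, a mutual-information filter with two thresholds, and branch entropy) might be sketched roughly as below. This is only an illustration under stated assumptions, not the article's implementation: the function names, the threshold values, the choice of "minimum PMI over all two-way splits" as the multi-word PMI, and the rule that candidates between the two thresholds are decided by branch entropy are all hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n; capping N bounds the candidate set."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def multi_pmi(gram, counts, total):
    """Lowest base-2 PMI over all two-way splits of an n-gram (len >= 2).

    Assumption: a common multi-word generalization of PMI, not necessarily
    the article's formula. Probabilities are approximated as count / total
    for every n-gram length. The gram must occur in the counted tokens.
    """
    p_gram = counts[gram] / total
    return min(
        math.log2(p_gram / ((counts[gram[:k]] / total) * (counts[gram[k:]] / total)))
        for k in range(1, len(gram))
    )


def branch_entropy(gram, tokens, side):
    """Shannon entropy (bits) of the tokens adjacent to gram on one side."""
    n = len(gram)
    neighbors = Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == gram:
            j = i - 1 if side == "left" else i + n
            if 0 <= j < len(tokens):
                neighbors[tokens[j]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbors.values())


def find_new_words(tokens, max_n=3, low=1.0, high=3.0, min_entropy=1.0):
    """Hypothetical double-threshold filter over capped n-gram candidates.

    Below `low`: rejected. At or above `high`: accepted. In between: accepted
    only if both left and right branch entropy reach `min_entropy`.
    """
    counts = ngram_counts(tokens, max_n)
    total = len(tokens)
    found = []
    for gram, count in counts.items():
        if len(gram) < 2 or count < 2:
            continue  # single tokens and one-off strings are not candidates
        score = multi_pmi(gram, counts, total)
        if score < low:
            continue
        free = min(branch_entropy(gram, tokens, "left"),
                   branch_entropy(gram, tokens, "right"))
        if score >= high or free >= min_entropy:
            found.append("".join(gram))
    return found


if __name__ == "__main__":
    # Character-level tokens, as is usual for unsegmented Chinese text.
    sample = list("深度学习很有趣深度学习很实用我们喜欢深度学习")
    print(find_new_words(sample))
```

Capping `max_n` is what keeps the candidate set from growing with every possible string length; the thresholds here are placeholders and would need tuning on a real corpus.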
