Are n-gram Categories Helpful in Text Classification?

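Character n-grams are widely used in text categorization problems and are the single most successful type of feature in authorship attribution. Their primary advantage is language independence, as they can be applied to a new language with no additional effort. Typed character n-grams reflect information about their content and context. According to previous research, typed character n-grams improve the accuracy of authorship attribution. This paper examines their effectiveness in three domains: authorship attribution, author profiling and sentiment analysis. The problem of a very high number of features is tackled with distributed Apache Spark processing.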
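As a rough illustration of what a typed character n-gram is, the sketch below extracts character trigrams and tags each one with a coarse category. The three categories used here (`punct`, `boundary`, `inner`) and the helper names are assumptions made for this example only; they are a simplification and not the category scheme evaluated in the paper.

```python
from collections import Counter
import string


def char_ngrams(text, n=3):
    """Return every contiguous substring of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simple_type(ngram):
    """Assign a coarse, illustrative category (hypothetical scheme)."""
    if any(ch in string.punctuation for ch in ngram):
        return "punct"      # contains a punctuation mark
    if " " in ngram:
        return "boundary"   # touches a word boundary
    return "inner"          # lies entirely inside a word


def typed_ngram_counts(text, n=3):
    """Count (category, n-gram) pairs so identical strings of different types stay distinct."""
    return Counter((simple_type(g), g) for g in char_ngrams(text, n))


counts = typed_ngram_counts("the cat, the hat")
```

Each key of `counts` is a `(category, n-gram)` pair, so a classifier sees the category as part of the feature. The feature space grows quickly with `n` and corpus size, which is the scaling problem the abstract says is handled with distributed processing.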

Bibliographic Details
Main authors: Kruczek, Jakub; Kruczek, Paulina; Kuta, Marcin
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7302864/
http://dx.doi.org/10.1007/978-3-030-50417-5_39
author Kruczek, Jakub
Kruczek, Paulina
Kuta, Marcin
collection PubMed
description Character n-grams are widely used in text categorization problems and are the single most successful type of feature in authorship attribution. Their primary advantage is language independence, as they can be applied to a new language with no additional effort. Typed character n-grams reflect information about their content and context. According to previous research, typed character n-grams improve the accuracy of authorship attribution. This paper examines their effectiveness in three domains: authorship attribution, author profiling and sentiment analysis. The problem of a very high number of features is tackled with distributed Apache Spark processing.
format Online
Article
Text
id pubmed-7302864
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-7302864
2020-06-19
Are n-gram Categories Helpful in Text Classification?
Kruczek, Jakub
Kruczek, Paulina
Kuta, Marcin
Computational Science – ICCS 2020
Article
Character n-grams are widely used in text categorization problems and are the single most successful type of feature in authorship attribution. Their primary advantage is language independence, as they can be applied to a new language with no additional effort. Typed character n-grams reflect information about their content and context. According to previous research, typed character n-grams improve the accuracy of authorship attribution. This paper examines their effectiveness in three domains: authorship attribution, author profiling and sentiment analysis. The problem of a very high number of features is tackled with distributed Apache Spark processing.
2020-06-15
/pmc/articles/PMC7302864/
http://dx.doi.org/10.1007/978-3-030-50417-5_39
Text en
© Springer Nature Switzerland AG 2020
This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title Are n-gram Categories Helpful in Text Classification?
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7302864/
http://dx.doi.org/10.1007/978-3-030-50417-5_39