Cargando…

Are n-gram Categories Helpful in Text Classification?

Character n-grams are widely used in text categorization problems and are the single most successful type of feature in authorship attribution. Their primary advantage is language independence, as they can be applied to a new language with no additional effort. Typed character n-grams reflect inform...

Descripción completa

Detalles Bibliográficos
Autores principales: Kruczek, Jakub, Kruczek, Paulina, Kuta, Marcin
Formato: Online Artículo Texto
Lenguaje:English
Publicado: 2020
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7302864/
http://dx.doi.org/10.1007/978-3-030-50417-5_39
Descripción
Sumario:Character n-grams are widely used in text categorization problems and are the single most successful type of feature in authorship attribution. Their primary advantage is language independence, as they can be applied to a new language with no additional effort. Typed character n-grams reflect information about their content and context. According to previous research, typed character n-grams improve the accuracy of authorship attribution. This paper examines their effectiveness in three domains: authorship attribution, author profiling and sentiment analysis. The problem of a very high number of features is tackled with distributed Apache Spark processing.