
Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks

Automatic identification of authorship in disputed documents has benefited from complex network theory as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and understand network grow...


Bibliographic Details
Main Authors:	Akimushkin, Camilo; Amancio, Diego Raphael; Oliveira, Osvaldo Novais
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2017
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5268788/ https://www.ncbi.nlm.nih.gov/pubmed/28125703 http://dx.doi.org/10.1371/journal.pone.0170527

_version_	1782500882667012096
author	Akimushkin, Camilo; Amancio, Diego Raphael; Oliveira, Osvaldo Novais
author_facet	Akimushkin, Camilo; Amancio, Diego Raphael; Oliveira, Osvaldo Novais
author_sort	Akimushkin, Camilo
collection	PubMed
description	Automatic identification of authorship in disputed documents has benefited from complex network theory, as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and understand network growth mechanisms, but only a few studies have probed the suitability of networks in modeling small chunks of text to grasp stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with an equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remainder were integrable of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped non-Gaussian distributions, and therefore distribution moments were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e. a remarkable 88.75% author matching success rate was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, thus paving the way for a robust description of large texts in terms of small evolving networks.
format	Online Article Text
id	pubmed-5268788
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-5268788 2017-02-06 Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks Akimushkin, Camilo; Amancio, Diego Raphael; Oliveira, Osvaldo Novais PLoS One Research Article
Public Library of Science 2017-01-26 /pmc/articles/PMC5268788/ /pubmed/28125703 http://dx.doi.org/10.1371/journal.pone.0170527 Text en © 2017 Akimushkin et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Akimushkin, Camilo Amancio, Diego Raphael Oliveira, Osvaldo Novais Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks
title	Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks
title_full	Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks
title_fullStr	Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks
title_full_unstemmed	Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks
title_short	Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks
title_sort	text authorship identified using the dynamics of word co-occurrence networks
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5268788/ https://www.ncbi.nlm.nih.gov/pubmed/28125703 http://dx.doi.org/10.1371/journal.pone.0170527
work_keys_str_mv	AT akimushkincamilo textauthorshipidentifiedusingthedynamicsofwordcooccurrencenetworks AT amanciodiegoraphael textauthorshipidentifiedusingthedynamicsofwordcooccurrencenetworks AT oliveiraosvaldonovais textauthorshipidentifiedusingthedynamicsofwordcooccurrencenetworks
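
The feature-extraction steps named in the description (equal-sized sections of tokens, one co-occurrence network per section, a time series per topological metric, distribution moments as learning attributes) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: linking adjacent words, the two metrics shown, the window size, and the choice of four moments are all assumptions, and every function name is hypothetical. The ARIMA stationarity analysis, the Isomap transformation, and the K-nearest neighbors classifier mentioned in the record are not reproduced here.

```python
from collections import defaultdict
from statistics import fmean


def cooccurrence_network(tokens):
    """Undirected graph linking adjacent, distinct tokens (assumed linking rule)."""
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def average_degree(adj):
    """Mean number of neighbours per node; 0.0 for an empty network."""
    return fmean([len(nbrs) for nbrs in adj.values()]) if adj else 0.0


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        nbr_list = list(nbrs)
        # Count edges that exist between pairs of this node's neighbours.
        links = sum(
            1
            for i, u in enumerate(nbr_list)
            for v in nbr_list[i + 1:]
            if v in adj[u]
        )
        coeffs.append(2.0 * links / (k * (k - 1)))
    return fmean(coeffs) if coeffs else 0.0


def metric_series(tokens, window, metric):
    """Evaluate `metric` on consecutive, non-overlapping windows of `window` tokens.

    Trailing tokens that do not fill a whole window are dropped.
    """
    n_windows = len(tokens) // window
    return [
        metric(cooccurrence_network(tokens[i * window:(i + 1) * window]))
        for i in range(n_windows)
    ]


def moments(series):
    """Mean, variance, skewness and excess kurtosis (population definitions)."""
    mu = fmean(series)

    def central(p):
        return fmean([(x - mu) ** p for x in series])

    var = central(2)
    if var == 0:
        return mu, 0.0, 0.0, 0.0
    return mu, var, central(3) / var ** 1.5, central(4) / var ** 2 - 3.0
```

Under these assumptions, a call such as `moments(metric_series(tokens, 200, average_clustering))` would give one small feature vector per metric for a tokenized text; concatenating the vectors for all metrics would yield the attributes passed to a classifier. The window size of 200 is a placeholder, not a value taken from the record.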
