Authorship identification of documents with high content similarity

The goal of our work is inspired by the task of associating segments of text with their real authors. In this work, we focus on analyzing the way humans judge different writing styles. This analysis can help to better understand this process and thus to simulate or mimic such behavior accordingly. Unlik...

Bibliographic Details
Main authors: Rexha, Andi; Kröll, Mark; Ziak, Hermann; Kern, Roman
Format: Online Article Text
Language: English
Published: Springer Netherlands 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5838116/
https://www.ncbi.nlm.nih.gov/pubmed/29527072
http://dx.doi.org/10.1007/s11192-018-2661-6
_version_ 1783304188428550144
author Rexha, Andi
Kröll, Mark
Ziak, Hermann
Kern, Roman
author_facet Rexha, Andi
Kröll, Mark
Ziak, Hermann
Kern, Roman
author_sort Rexha, Andi
collection PubMed
description The goal of our work is inspired by the task of associating segments of text with their real authors. In this work, we focus on analyzing the way humans judge different writing styles. This analysis can help to better understand this process and thus to simulate or mimic such behavior accordingly. Unlike the majority of the work done in this field (e.g. authorship attribution, plagiarism detection), which uses content features, we focus only on the stylometric, i.e. content-agnostic, characteristics of authors. Therefore, we conducted two pilot studies to determine whether humans can identify authorship among documents with high content similarity. The first was a quantitative experiment involving crowd-sourcing, while the second was a qualitative one executed by the authors of this paper. Both studies confirmed that this task is quite challenging. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis on the results of the studies. In the first experiment, we compared the decisions against content features and stylometric features. In the second, the evaluators described the process and the features on which their judgment was based. The findings of our detailed analysis could (1) help to improve algorithms such as automatic authorship attribution as well as plagiarism detection, (2) assist forensic experts or linguists in creating profiles of writers, (3) support intelligence applications in analyzing aggressive and threatening messages, and (4) support editorial conformity, for instance adherence to a journal-specific writing style.
format Online
Article
Text
id pubmed-5838116
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Springer Netherlands
record_format MEDLINE/PubMed
spelling pubmed-5838116 2018-03-09 Authorship identification of documents with high content similarity Rexha, Andi Kröll, Mark Ziak, Hermann Kern, Roman Scientometrics Article The goal of our work is inspired by the task of associating segments of text with their real authors. In this work, we focus on analyzing the way humans judge different writing styles. This analysis can help to better understand this process and thus to simulate or mimic such behavior accordingly. Unlike the majority of the work done in this field (e.g. authorship attribution, plagiarism detection), which uses content features, we focus only on the stylometric, i.e. content-agnostic, characteristics of authors. Therefore, we conducted two pilot studies to determine whether humans can identify authorship among documents with high content similarity. The first was a quantitative experiment involving crowd-sourcing, while the second was a qualitative one executed by the authors of this paper. Both studies confirmed that this task is quite challenging. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis on the results of the studies. In the first experiment, we compared the decisions against content features and stylometric features. In the second, the evaluators described the process and the features on which their judgment was based. The findings of our detailed analysis could (1) help to improve algorithms such as automatic authorship attribution as well as plagiarism detection, (2) assist forensic experts or linguists in creating profiles of writers, (3) support intelligence applications in analyzing aggressive and threatening messages, and (4) support editorial conformity, for instance adherence to a journal-specific writing style.
Springer Netherlands 2018-02-02 2018 /pmc/articles/PMC5838116/ /pubmed/29527072 http://dx.doi.org/10.1007/s11192-018-2661-6 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
spellingShingle Article
Rexha, Andi
Kröll, Mark
Ziak, Hermann
Kern, Roman
Authorship identification of documents with high content similarity
title Authorship identification of documents with high content similarity
title_full Authorship identification of documents with high content similarity
title_fullStr Authorship identification of documents with high content similarity
title_full_unstemmed Authorship identification of documents with high content similarity
title_short Authorship identification of documents with high content similarity
title_sort authorship identification of documents with high content similarity
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5838116/
https://www.ncbi.nlm.nih.gov/pubmed/29527072
http://dx.doi.org/10.1007/s11192-018-2661-6
work_keys_str_mv AT rexhaandi authorshipidentificationofdocumentswithhighcontentsimilarity
AT krollmark authorshipidentificationofdocumentswithhighcontentsimilarity
AT ziakhermann authorshipidentificationofdocumentswithhighcontentsimilarity
AT kernroman authorshipidentificationofdocumentswithhighcontentsimilarity