Finding Dutch natives in online forums

Law enforcement agencies have a restricted area in which their powers apply, which is called their jurisdiction. These restrictions also apply to the Internet. However, on the Internet, the physical borders of the jurisdiction, typically country borders, are hard to discover. In our case, it is hard...

Bibliographic Details
Main Authors:	van den Boom, Bernard; Veenman, Cor J.
Format:	Online Article Text
Language:	English
Published:	Taylor & Francis 2018
Subjects:	Original Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6201805/ https://www.ncbi.nlm.nih.gov/pubmed/30483673 http://dx.doi.org/10.1080/20961790.2018.1482042

_version_	1783365574887211008
author	van den Boom, Bernard; Veenman, Cor J.
author_facet	van den Boom, Bernard; Veenman, Cor J.
author_sort	van den Boom, Bernard
collection	PubMed
description	Law enforcement agencies have a restricted area in which their powers apply, which is called their jurisdiction. These restrictions also apply to the Internet. However, on the Internet, the physical borders of the jurisdiction, typically country borders, are hard to discover. In our case, it is hard to establish whether someone involved in criminal online behavior is indeed a Dutch citizen. We propose a way to overcome the arduous task of manually investigating whether a user on an Internet forum is Dutch or not. More precisely, we aim to detect whether a given English text was written by a Dutch native author. To develop a detector, we follow a machine learning approach. Therefore, we need to prepare a specific training corpus. To obtain a corpus that is representative of online forums, we collected a large number of English forum posts from Dutch and non-Dutch authors on Reddit. To learn a detection model, we used a bag-of-words representation to capture potential misspellings, grammatical errors or unusual turns of phrase that are characteristic of the mother tongue of the authors. For this learning task, we compare the linear support vector machine and regularized logistic regression using the appropriate performance metrics: F1 score, precision, and average precision. Our results show that logistic regression with frequency-based feature selection performs best at predicting Dutch natives. Further study should address the general applicability of the results, that is, whether the developed models apply to other forums with comparably high performance.
format	Online Article Text
id	pubmed-6201805
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Taylor & Francis
record_format	MEDLINE/PubMed
spelling	pubmed-6201805 2018-11-27 Finding Dutch natives in online forums van den Boom, Bernard Veenman, Cor J. Forensic Sci Res Original Article Law enforcement agencies have a restricted area in which their powers apply, which is called their jurisdiction. These restrictions also apply to the Internet. However, on the Internet, the physical borders of the jurisdiction, typically country borders, are hard to discover. In our case, it is hard to establish whether someone involved in criminal online behavior is indeed a Dutch citizen. We propose a way to overcome the arduous task of manually investigating whether a user on an Internet forum is Dutch or not. More precisely, we aim to detect whether a given English text was written by a Dutch native author. To develop a detector, we follow a machine learning approach. Therefore, we need to prepare a specific training corpus. To obtain a corpus that is representative of online forums, we collected a large number of English forum posts from Dutch and non-Dutch authors on Reddit. To learn a detection model, we used a bag-of-words representation to capture potential misspellings, grammatical errors or unusual turns of phrase that are characteristic of the mother tongue of the authors. For this learning task, we compare the linear support vector machine and regularized logistic regression using the appropriate performance metrics: F1 score, precision, and average precision. Our results show that logistic regression with frequency-based feature selection performs best at predicting Dutch natives. Further study should address the general applicability of the results, that is, whether the developed models apply to other forums with comparably high performance. Taylor & Francis 2018-09-28 /pmc/articles/PMC6201805/ /pubmed/30483673 http://dx.doi.org/10.1080/20961790.2018.1482042 Text en © 2018 The Author(s). Published by Taylor & Francis Group on behalf of the Academy of Forensic Science. 
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Article van den Boom, Bernard Veenman, Cor J. Finding Dutch natives in online forums
title	Finding Dutch natives in online forums
title_full	Finding Dutch natives in online forums
title_fullStr	Finding Dutch natives in online forums
title_full_unstemmed	Finding Dutch natives in online forums
title_short	Finding Dutch natives in online forums
title_sort	finding dutch natives in online forums
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6201805/ https://www.ncbi.nlm.nih.gov/pubmed/30483673 http://dx.doi.org/10.1080/20961790.2018.1482042
work_keys_str_mv	AT vandenboombernard findingdutchnativesinonlineforums AT veenmancorj findingdutchnativesinonlineforums

Similar Items