Short text classification approach to identify child sexual exploitation material

Producing or sharing Child Sexual Exploitation Material (CSEM) is a severe crime that Law Enforcement Agencies (LEAs) fight daily. When the LEA seizes a computer from a potential producer or consumer of the CSEM, it analyzes the storage devices of the suspect looking for evidence. Manual inspection...

Bibliographic Details
Main authors: Al-Nabki, MHD Wesam; Fidalgo, Eduardo; Alegre, Enrique; Alaiz-Rodriguez, Rocio
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10522674/
https://www.ncbi.nlm.nih.gov/pubmed/37752214
http://dx.doi.org/10.1038/s41598-023-42902-8
collection PubMed
description Producing or sharing Child Sexual Exploitation Material (CSEM) is a severe crime that Law Enforcement Agencies (LEAs) fight daily. When an LEA seizes a computer from a potential producer or consumer of CSEM, it analyzes the suspect's storage devices for evidence. Manual inspection for CSEM is time-consuming, given the limited time that Spanish police have to execute a search warrant. Our approach to speeding up the identification of CSEM-related files is to analyze only the file names and their absolute paths rather than the file contents. The main challenge lies in handling short, sparse texts that file owners deliberately distort with obfuscated words and user-defined naming patterns. We present two approaches to CSEM identification. The first employs two independent classifiers, one for the file name and one for the file path, and then combines their outputs. The second uses only the file name classifier and iterates it over the absolute path. Both operate at the character n-gram level, and we introduce novel binary and orthographic features to enrich the text representation. We benchmarked six classification models based on machine learning and convolutional neural networks. The proposed classifier achieves an F1 score of 0.988, making it a promising tool for LEAs.
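The text representation described in the abstract (character n-grams plus orthographic features, computed separately for a file name and its path) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the n-gram range, and the specific orthographic features are all assumptions made for the example, and the sample path in the usage note is a neutral placeholder.

```python
from collections import Counter
from pathlib import PurePosixPath


def char_ngrams(text: str, n_min: int = 2, n_max: int = 3) -> Counter:
    """Count overlapping character n-grams of lengths n_min..n_max.

    The range 2..3 is an assumption; the record does not state the sizes used.
    """
    text = text.lower()
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams


def orthographic_features(text: str) -> dict:
    """Hypothetical surface features; not the feature set from the article."""
    length = max(len(text), 1)  # avoid division by zero on empty input
    return {
        "length": len(text),
        "digit_ratio": sum(c.isdigit() for c in text) / length,
        "upper_ratio": sum(c.isupper() for c in text) / length,
        "has_separator": any(c in "_-. " for c in text),
    }


def split_path(path: str) -> tuple[str, str]:
    """Split an absolute POSIX-style path into (file name, parent directory)."""
    p = PurePosixPath(path)
    return p.name, str(p.parent)
```

For example, `split_path("/home/user/docs/report_2021.pdf")` yields the file name and its directory, each of which could then be passed to `char_ngrams` and `orthographic_features` to build the input for a separate classifier.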
id pubmed-10522674
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-10522674 2023-09-28 Sci Rep Article
Nature Publishing Group UK 2023-09-26 /pmc/articles/PMC10522674/ /pubmed/37752214 http://dx.doi.org/10.1038/s41598-023-42902-8 Text en © The Author(s) 2023, corrected publication 2023. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.