Short text classification approach to identify child sexual exploitation material
Producing or sharing Child Sexual Exploitation Material (CSEM) is a severe crime that Law Enforcement Agencies (LEAs) fight daily. When the LEA seizes a computer from a potential producer or consumer of the CSEM, it analyzes the storage devices of the suspect looking for evidence. Manual inspection...
Main authors: | Al-Nabki, MHD Wesam; Fidalgo, Eduardo; Alegre, Enrique; Alaiz-Rodriguez, Rocio |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10522674/ https://www.ncbi.nlm.nih.gov/pubmed/37752214 http://dx.doi.org/10.1038/s41598-023-42902-8 |
_version_ | 1785110402875523072 |
---|---|
author | Al-Nabki, MHD Wesam Fidalgo, Eduardo Alegre, Enrique Alaiz-Rodriguez, Rocio |
author_facet | Al-Nabki, MHD Wesam Fidalgo, Eduardo Alegre, Enrique Alaiz-Rodriguez, Rocio |
author_sort | Al-Nabki, MHD Wesam |
collection | PubMed |
description | Producing or sharing Child Sexual Exploitation Material (CSEM) is a severe crime that Law Enforcement Agencies (LEAs) fight daily. When the LEA seizes a computer from a potential producer or consumer of the CSEM, it analyzes the storage devices of the suspect looking for evidence. Manual inspection of CSEM is time-consuming given the limited time available for Spanish police to use a search warrant. Our approach to speeding up the identification of CSEM-related files is to analyze only the file names and their absolute paths rather than their content. The main challenge lies in handling short and sparse texts that are deliberately distorted by file owners using obfuscated words and user-defined naming patterns. We present two approaches to CSEM identification. The first employs two independent classifiers, one for the file name and the other for the file path, and their outputs are then combined. Conversely, the second approach uses only the file name classifier to iterate over an absolute path. Both operate at the character n-gram level, whereas novel binary and orthographic features are presented to enrich the text representation. We benchmarked six classification models based on machine learning and convolutional neural networks. The proposed classifier has an F1 score of 0.988, which can be a promising tool for LEAs. |
format | Online Article Text |
id | pubmed-10522674 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-10522674 2023-09-28 Short text classification approach to identify child sexual exploitation material Al-Nabki, MHD Wesam Fidalgo, Eduardo Alegre, Enrique Alaiz-Rodriguez, Rocio Sci Rep Article Nature Publishing Group UK 2023-09-26 /pmc/articles/PMC10522674/ /pubmed/37752214 http://dx.doi.org/10.1038/s41598-023-42902-8 Text en © The Author(s) 2023, corrected publication 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
title | Short text classification approach to identify child sexual exploitation material |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10522674/ https://www.ncbi.nlm.nih.gov/pubmed/37752214 http://dx.doi.org/10.1038/s41598-023-42902-8 |
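
The description in this record mentions character n-gram representations, orthographic features, and a second approach in which a file name classifier iterates over an absolute path. The following is a minimal sketch of those ideas, not the authors' implementation: every function name, the particular surface features, and the choice to aggregate segment scores with `max` are assumptions made for illustration, and the scorer passed to `score_path` is a placeholder for a trained model.

```python
from collections import Counter
from pathlib import PurePosixPath


def char_ngrams(text, n_min=2, n_max=3):
    """Count overlapping character n-grams in a lowercased string."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def orthographic_features(text):
    """Surface statistics of a name (hypothetical stand-ins for the paper's features)."""
    length = max(len(text), 1)  # avoid division by zero on empty names
    return {
        "length": len(text),
        "digit_ratio": sum(c.isdigit() for c in text) / length,
        "upper_ratio": sum(c.isupper() for c in text) / length,
        "symbol_ratio": sum(not c.isalnum() for c in text) / length,
    }


def path_segments(path):
    """Split a POSIX-style absolute path into its directory and file name segments."""
    return [part for part in PurePosixPath(path).parts if part != "/"]


def score_path(path, score_name):
    """Apply a name-level scorer to every segment and keep the highest score.

    Taking the maximum is an assumption; the record does not say how
    per-segment outputs are combined.
    """
    return max((score_name(segment) for segment in path_segments(path)), default=0.0)
```

In use, `score_name` would wrap a trained classifier that consumes `char_ngrams` and `orthographic_features` for each segment; any callable returning a number works for experimentation.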