A dataset to facilitate automated workflow analysis

Datasets that provide a ground truth for quantifying the efficacy of automated algorithms are rare because manually annotating observations, although highly valuable, is time-consuming and expensive. Such datasets exist for niche problems in developed fields such as Natural Language Processing...

Bibliographic Details
Main authors: Allard, Tony; Alvino, Paul; Shing, Leslie; Wollaber, Allan; Yuen, Joseph
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6366754/
https://www.ncbi.nlm.nih.gov/pubmed/30730921
http://dx.doi.org/10.1371/journal.pone.0211486
author Allard, Tony
Alvino, Paul
Shing, Leslie
Wollaber, Allan
Yuen, Joseph
collection PubMed
description Datasets that provide a ground truth for quantifying the efficacy of automated algorithms are rare because manually annotating observations, although highly valuable, is time-consuming and expensive. Such datasets exist for niche problems in developed fields such as Natural Language Processing (NLP) and Business Process Mining (BPM); however, it is difficult to find a suitable dataset for use cases that span multiple fields, such as the one described in this study. The lack of established ground truth maps between cyberspace and the human-interpretable, persona-driven tasks that occur therein is one of the principal barriers preventing reliable, automated situation awareness of dynamically evolving events and the consequences of loss due to cybersecurity breaches. Automated workflow analysis (the machine-learning-assisted identification of templates of repeated tasks) is the likely missing link between semantic descriptions of mission goals and observable events in cyberspace. We summarize our efforts to establish a ground truth for an email dataset pertaining to the operation of an open source software project. The ground truth defines semantic labels for each email and the arrangement of emails within sequences that describe actions observed in the dataset. Identified sequences are then used to define template workflows that describe the possible tasks undertaken for a project and their business process model. We present the overall purpose of the dataset, the methodology for establishing a ground truth, and lessons learned from the effort. Finally, we report on the proposed use of the dataset for the workflow discovery problem and its effect on system accuracy.
format Online
Article
Text
id pubmed-6366754
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-6366754 2019-02-22 A dataset to facilitate automated workflow analysis Allard, Tony Alvino, Paul Shing, Leslie Wollaber, Allan Yuen, Joseph PLoS One Research Article
Public Library of Science 2019-02-07 /pmc/articles/PMC6366754/ /pubmed/30730921 http://dx.doi.org/10.1371/journal.pone.0211486 Text en © 2019 Allard et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title A dataset to facilitate automated workflow analysis
topic Research Article