Relating instance hardness to classification performance in a dataset: a visual approach

Machine Learning studies often involve a series of computational experiments in which the predictive performance of multiple models is compared across one or more datasets. The results obtained are usually summarized through average statistics, either in numeric tables or simple plots. Such approac...

Bibliographic Details
Main authors:	Paiva, Pedro Yuri Arbs; Moreno, Camila Castro; Smith-Miles, Kate; Valeriano, Maria Gabriela; Lorena, Ana Carolina
Format:	Online Article Text
Language:	English
Published:	Springer US 2022
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9217125/ https://www.ncbi.nlm.nih.gov/pubmed/35761958 http://dx.doi.org/10.1007/s10994-022-06205-9

author	Paiva, Pedro Yuri Arbs; Moreno, Camila Castro; Smith-Miles, Kate; Valeriano, Maria Gabriela; Lorena, Ana Carolina
collection	PubMed
description	Machine Learning studies often involve a series of computational experiments in which the predictive performance of multiple models is compared across one or more datasets. The results obtained are usually summarized through average statistics, either in numeric tables or simple plots. Such approaches fail to reveal interesting subtleties about algorithmic performance, including which observations an algorithm may find easy or hard to classify, and also which observations within a dataset may present unique challenges. Recently, a methodology known as Instance Space Analysis (ISA) was proposed for visualizing algorithm performance across different datasets. This methodology relates predictive performance to estimated instance hardness measures extracted from the datasets. However, the analysis considered an instance as being an entire classification dataset, and the algorithm performance was reported for each dataset as an average error across all observations in the dataset. In this paper, we developed a more fine-grained analysis by adapting the ISA methodology. The adapted version of ISA allows the analysis of an individual classification dataset through a 2-D hardness embedding, which provides a visualization of the data according to the difficulty level of its individual observations. This allows deeper analyses of the relationships between instance hardness and the predictive performance of classifiers. We also provide an open-access Python package named PyHard, which encapsulates the adapted ISA and provides an interactive visualization interface. We illustrate through case studies how our tool can provide insights about data quality and algorithm performance in the presence of challenges such as noisy and biased data.
format	Online Article Text
id	pubmed-9217125
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer US
record_format	MEDLINE/PubMed
spelling	pubmed-9217125 2022-06-23 Mach Learn Article Springer US 2022-06-22 2022 /pmc/articles/PMC9217125/ /pubmed/35761958 http://dx.doi.org/10.1007/s10994-022-06205-9 Text en © The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2022. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title	Relating instance hardness to classification performance in a dataset: a visual approach
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9217125/ https://www.ncbi.nlm.nih.gov/pubmed/35761958 http://dx.doi.org/10.1007/s10994-022-06205-9
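
The description above speaks of hardness measures estimated for individual observations but does not say which measures are used. As a rough illustration only, the sketch below computes one measure from the instance hardness literature, k-Disagreeing Neighbors (the fraction of an observation's k nearest neighbours that carry a different class label). The function name, the toy data, and the choice of this measure are assumptions made for this example; this is not the PyHard API.

```python
from math import dist


def k_disagreeing_neighbors(X, y, k=5):
    """Return, for each observation, the fraction of its k nearest
    neighbours (Euclidean distance) whose label differs from its own.

    Hypothetical helper for illustration; values near 1 suggest an
    observation that is hard to classify, values near 0 an easy one.
    """
    scores = []
    for i, xi in enumerate(X):
        # Distances from observation i to every other observation.
        others = sorted((dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = [j for _, j in others[:k]]
        disagree = sum(1 for j in neighbours if y[j] != y[i])
        scores.append(disagree / len(neighbours))
    return scores


# Toy data (made up): two well-separated clusters plus one observation
# labelled 1 that sits in the middle of the class-0 cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11),
     (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

hardness = k_disagreeing_neighbors(X, y, k=3)
for point, label, h in zip(X, y, hardness):
    print(point, label, round(h, 2))
```

In this toy set the last observation receives the highest score because all of its nearest neighbours belong to the other class, which is the kind of per-observation signal the record describes relating to classifier performance.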