
Enabling interpretable machine learning for biological data with reliability scores

Bibliographic Details

Main Authors: Ahlquist, K. D., Sugden, Lauren A., Ramachandran, Sohini
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10249903/
https://www.ncbi.nlm.nih.gov/pubmed/37235578
http://dx.doi.org/10.1371/journal.pcbi.1011175
_version_ 1785055645326639104
author Ahlquist, K. D.
Sugden, Lauren A.
Ramachandran, Sohini
author_facet Ahlquist, K. D.
Sugden, Lauren A.
Ramachandran, Sohini
author_sort Ahlquist, K. D.
collection PubMed
description Machine learning tools have proven useful across biological disciplines, allowing researchers to draw conclusions from large datasets, and opening up new opportunities for interpreting complex and heterogeneous biological data. Alongside the rapid growth of machine learning, there have also been growing pains: some models that appear to perform well have later been revealed to rely on features of the data that are artifactual or biased; this feeds into the general criticism that machine learning models are designed to optimize model performance over the creation of new biological insights. A natural question arises: how do we develop machine learning models that are inherently interpretable or explainable? In this manuscript, we describe the SWIF(r) reliability score (SRS), a method building on the SWIF(r) generative framework that reflects the trustworthiness of the classification of a specific instance. The concept of the reliability score has the potential to generalize to other machine learning methods. We demonstrate the utility of the SRS when faced with common challenges in machine learning including: 1) an unknown class present in testing data that was not present in training data, 2) systemic mismatch between training and testing data, and 3) instances of testing data that have missing values for some attributes. We explore these applications of the SRS using a range of biological datasets, from agricultural data on seed morphology, to 22 quantitative traits in the UK Biobank, and population genetic simulations and 1000 Genomes Project data. With each of these examples, we demonstrate how the SRS can allow researchers to interrogate their data and training approach thoroughly, and to pair their domain-specific knowledge with powerful machine-learning frameworks. We also compare the SRS to related tools for outlier and novelty detection, and find that it has comparable performance, with the advantage of being able to operate when some data are missing. The SRS, and the broader discussion of interpretable scientific machine learning, will aid researchers in the biological machine learning space as they seek to harness the power of machine learning without sacrificing rigor and biological insight.
format Online
Article
Text
id pubmed-10249903
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-102499032023-06-09 Enabling interpretable machine learning for biological data with reliability scores Ahlquist, K. D. Sugden, Lauren A. Ramachandran, Sohini PLoS Comput Biol Research Article Machine learning tools have proven useful across biological disciplines, allowing researchers to draw conclusions from large datasets, and opening up new opportunities for interpreting complex and heterogeneous biological data. Alongside the rapid growth of machine learning, there have also been growing pains: some models that appear to perform well have later been revealed to rely on features of the data that are artifactual or biased; this feeds into the general criticism that machine learning models are designed to optimize model performance over the creation of new biological insights. A natural question arises: how do we develop machine learning models that are inherently interpretable or explainable? In this manuscript, we describe the SWIF(r) reliability score (SRS), a method building on the SWIF(r) generative framework that reflects the trustworthiness of the classification of a specific instance. The concept of the reliability score has the potential to generalize to other machine learning methods. We demonstrate the utility of the SRS when faced with common challenges in machine learning including: 1) an unknown class present in testing data that was not present in training data, 2) systemic mismatch between training and testing data, and 3) instances of testing data that have missing values for some attributes. We explore these applications of the SRS using a range of biological datasets, from agricultural data on seed morphology, to 22 quantitative traits in the UK Biobank, and population genetic simulations and 1000 Genomes Project data. With each of these examples, we demonstrate how the SRS can allow researchers to interrogate their data and training approach thoroughly, and to pair their domain-specific knowledge with powerful machine-learning frameworks. We also compare the SRS to related tools for outlier and novelty detection, and find that it has comparable performance, with the advantage of being able to operate when some data are missing. The SRS, and the broader discussion of interpretable scientific machine learning, will aid researchers in the biological machine learning space as they seek to harness the power of machine learning without sacrificing rigor and biological insight. Public Library of Science 2023-05-26 /pmc/articles/PMC10249903/ /pubmed/37235578 http://dx.doi.org/10.1371/journal.pcbi.1011175 Text en © 2023 Ahlquist et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Ahlquist, K. D.
Sugden, Lauren A.
Ramachandran, Sohini
Enabling interpretable machine learning for biological data with reliability scores
title Enabling interpretable machine learning for biological data with reliability scores
title_full Enabling interpretable machine learning for biological data with reliability scores
title_fullStr Enabling interpretable machine learning for biological data with reliability scores
title_full_unstemmed Enabling interpretable machine learning for biological data with reliability scores
title_short Enabling interpretable machine learning for biological data with reliability scores
title_sort enabling interpretable machine learning for biological data with reliability scores
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10249903/
https://www.ncbi.nlm.nih.gov/pubmed/37235578
http://dx.doi.org/10.1371/journal.pcbi.1011175
work_keys_str_mv AT ahlquistkd enablinginterpretablemachinelearningforbiologicaldatawithreliabilityscores
AT sugdenlaurena enablinginterpretablemachinelearningforbiologicaldatawithreliabilityscores
AT ramachandransohini enablinginterpretablemachinelearningforbiologicaldatawithreliabilityscores