Ten quick tips for sequence-based prediction of protein properties using machine learning

The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have no...

Bibliographic Details
Main authors: Hou, Qingzhen; Waury, Katharina; Gogishvili, Dea; Feenstra, K. Anton
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects: Education
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9714715/
https://www.ncbi.nlm.nih.gov/pubmed/36454728
http://dx.doi.org/10.1371/journal.pcbi.1010669
author Hou, Qingzhen
Waury, Katharina
Gogishvili, Dea
Feenstra, K. Anton
author_sort Hou, Qingzhen
collection PubMed
description The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to “state-of-the-art,” take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.
format Online
Article
Text
id pubmed-9714715
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9714715 2022-12-02 Ten quick tips for sequence-based prediction of protein properties using machine learning Hou, Qingzhen Waury, Katharina Gogishvili, Dea Feenstra, K. Anton PLoS Comput Biol Education
Public Library of Science 2022-12-01 /pmc/articles/PMC9714715/ /pubmed/36454728 http://dx.doi.org/10.1371/journal.pcbi.1010669 Text en © 2022 Hou et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Ten quick tips for sequence-based prediction of protein properties using machine learning
topic Education