Ten quick tips for sequence-based prediction of protein properties using machine learning

The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have no...

Bibliographic Details
Main Authors:	Hou, Qingzhen; Waury, Katharina; Gogishvili, Dea; Feenstra, K. Anton
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2022
Subjects:	Education
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9714715/ https://www.ncbi.nlm.nih.gov/pubmed/36454728 http://dx.doi.org/10.1371/journal.pcbi.1010669

_version_	1784842288480911360
author	Hou, Qingzhen Waury, Katharina Gogishvili, Dea Feenstra, K. Anton
author_facet	Hou, Qingzhen Waury, Katharina Gogishvili, Dea Feenstra, K. Anton
author_sort	Hou, Qingzhen
collection	PubMed
description	The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to “state-of-the-art,” take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.
format	Online Article Text
id	pubmed-9714715
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-9714715 2022-12-02 Ten quick tips for sequence-based prediction of protein properties using machine learning Hou, Qingzhen Waury, Katharina Gogishvili, Dea Feenstra, K. Anton PLoS Comput Biol Education
Public Library of Science 2022-12-01 /pmc/articles/PMC9714715/ /pubmed/36454728 http://dx.doi.org/10.1371/journal.pcbi.1010669 Text en © 2022 Hou et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Ten quick tips for sequence-based prediction of protein properties using machine learning
topic	Education
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9714715/ https://www.ncbi.nlm.nih.gov/pubmed/36454728 http://dx.doi.org/10.1371/journal.pcbi.1010669
