
GUANinE v0.9: Benchmark Datasets for Genomic AI Sequence-to-Function Models


Bibliographic Details
Main Authors: robson, eyes s., Ioannidis, Nilah M.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10614795/
https://www.ncbi.nlm.nih.gov/pubmed/37904945
http://dx.doi.org/10.1101/2023.10.12.562113
author robson, eyes s.
Ioannidis, Nilah M.
collection PubMed
description Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous model specification and controlled evaluation, problems familiar to other fields of AI. Research strategies that have greatly benefited other fields — including benchmarking, auditing, and algorithmic fairness — are also needed to advance the field of genomic AI and to facilitate model development. Here we propose a genomic AI benchmark, GUANinE, for evaluating model generalization across a number of distinct genomic tasks. Compared to existing task formulations in computational genomics, GUANinE is large-scale, de-noised, and suitable for evaluating pretrained models. GUANinE v0.9 primarily focuses on functional genomics tasks such as functional element annotation and gene expression prediction, but also draws upon connections to evolutionary biology through sequence conservation tasks. The current GUANinE tasks provide insight into the performance of existing genomic AI models and non-neural baselines, with opportunities to be refined, revisited, and broadened as the field matures. Finally, the GUANinE benchmark allows us to evaluate new self-supervised T5 models and explore the tradeoffs between tokenization and model performance, while showcasing the potential for self-supervision to complement existing pretraining procedures.
format Online
Article
Text
id pubmed-10614795
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10614795 2023-10-31 GUANinE v0.9: Benchmark Datasets for Genomic AI Sequence-to-Function Models robson, eyes s. Ioannidis, Nilah M. bioRxiv Article
Cold Spring Harbor Laboratory 2023-10-17 /pmc/articles/PMC10614795/ /pubmed/37904945 http://dx.doi.org/10.1101/2023.10.12.562113 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
title GUANinE v0.9: Benchmark Datasets for Genomic AI Sequence-to-Function Models
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10614795/
https://www.ncbi.nlm.nih.gov/pubmed/37904945
http://dx.doi.org/10.1101/2023.10.12.562113