Genomic benchmarks: a collection of datasets for genomic sequence classification

BACKGROUND: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, a deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent...

Bibliographic Details
Main Authors:	Grešová, Katarína; Martinek, Vlastimil; Čechák, David; Šimeček, Petr; Alexiou, Panagiotis
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2023
Subjects:	Database
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10150520/ https://www.ncbi.nlm.nih.gov/pubmed/37127596 http://dx.doi.org/10.1186/s12863-023-01123-8

author	Grešová, Katarína; Martinek, Vlastimil; Čechák, David; Šimeček, Petr; Alexiou, Panagiotis
collection	PubMed
description	BACKGROUND: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years was possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we face similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks comparable to the protein folding competition. RESULTS: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection is based on a combination of novel datasets constructed from the mining of publicly available databases and existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package ‘genomic-benchmarks’, and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks. CONCLUSIONS: Deep learning techniques have revolutionized many biological fields, but mainly thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.
format	Online Article Text
id	pubmed-10150520
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-10150520 2023-05-02 Genomic benchmarks: a collection of datasets for genomic sequence classification Grešová, Katarína; Martinek, Vlastimil; Čechák, David; Šimeček, Petr; Alexiou, Panagiotis BMC Genom Data Database BioMed Central 2023-05-01 /pmc/articles/PMC10150520/ /pubmed/37127596 http://dx.doi.org/10.1186/s12863-023-01123-8 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Genomic benchmarks: a collection of datasets for genomic sequence classification
topic	Database
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10150520/ https://www.ncbi.nlm.nih.gov/pubmed/37127596 http://dx.doi.org/10.1186/s12863-023-01123-8
