
RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures

MOTIVATION: Applied research in machine learning progresses faster when a clean dataset is available and ready to use. Several datasets have been proposed and released over the years for specific tasks such as image classification, speech-recognition and more recently for protein structure predictio...


Bibliographic Details
Main authors: Becquey, Louis; Angel, Eric; Tahi, Fariza
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8189678/
https://www.ncbi.nlm.nih.gov/pubmed/33135044
http://dx.doi.org/10.1093/bioinformatics/btaa944
author Becquey, Louis
Angel, Eric
Tahi, Fariza
collection PubMed
description MOTIVATION: Applied research in machine learning progresses faster when a clean dataset is available and ready to use. Several datasets have been proposed and released over the years for specific tasks such as image classification, speech recognition and, more recently, protein structure prediction. However, for the fundamental problem of RNA structure prediction, information is spread across several databases depending on the level of interest: sequence, secondary structure, 3D structure or interactions with other macromolecules. To speed up advances in machine-learning-based approaches for RNA secondary and/or 3D structure prediction, a dataset integrating all this information is required, to avoid spending time on data gathering and cleaning. RESULTS: Here, we propose the first attempt at a standardized and automatically generated dataset dedicated to RNA, combining RNA sequences, homology information (in the form of position-specific scoring matrices) and information derived by annotation of available 3D structures (including secondary structure, canonical and non-canonical interactions and backbone torsion angles). The data are retrieved from the public databases PDB, Rfam and SILVA. The paper describes the procedure to build such a dataset and the RNA structure descriptors we provide. Some statistical descriptions of the resulting dataset are also provided. AVAILABILITY AND IMPLEMENTATION: The dataset is updated every month and available online (in flat-text file format) on the EvryRNA software platform (https://evryrna.ibisc.univ-evry.fr/evryrna/rnanet). An efficient parallel pipeline to build the dataset is also provided for easy reproduction or modification. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-8189678
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8189678 2021-06-10 RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures Becquey, Louis Angel, Eric Tahi, Fariza Bioinformatics Original Papers Oxford University Press 2020-12-07 /pmc/articles/PMC8189678/ /pubmed/33135044 http://dx.doi.org/10.1093/bioinformatics/btaa944 Text en © The Author(s) 2020. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures
topic Original Papers