
DNABERT-based explainable lncRNA identification in plant genome assemblies

Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small p...


Bibliographic Details
Main authors: Danilevicz, Monica F., Gill, Mitchell, Fernandez, Cassandria G. Tay, Petereit, Jakob, Upadhyaya, Shriprabha R., Batley, Jacqueline, Bennamoun, Mohammed, Edwards, David, Bayer, Philipp E.
Format: Online Article Text
Language: English
Published: Research Network of Computational and Structural Biotechnology 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10696397/
http://dx.doi.org/10.1016/j.csbj.2023.11.025
author Danilevicz, Monica F.
Gill, Mitchell
Fernandez, Cassandria G. Tay
Petereit, Jakob
Upadhyaya, Shriprabha R.
Batley, Jacqueline
Bennamoun, Mohammed
Edwards, David
Bayer, Philipp E.
collection PubMed
description Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small peptides. Machine learning models have predominantly used transcriptome data with manually defined features to detect lncRNAs; however, they often underrepresent the abundance of lncRNAs and can be biased in their detection. Here we present a study using Natural Language Processing (NLP) models to identify plant lncRNAs from genomic sequences rather than transcriptomic data. The NLP models were trained to predict lncRNAs for seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) using publicly available genomic references. We demonstrated that lncRNAs can be accurately predicted from genomic sequences, with the highest accuracy of 83.4% for Z. mays and the lowest accuracy of 57.9% for B. rapa, suggesting that genome assembly quality might affect the accuracy of lncRNA identification. Furthermore, we demonstrated the potential of using NLP models for cross-species prediction, with an average accuracy of 63.1% on target species not previously seen by the model. As more species are incorporated into the training datasets, we expect the accuracy to increase, making the approach a more reliable tool for uncovering novel lncRNAs. Finally, we show that the models can be interpreted using explainable artificial intelligence to identify motifs important to lncRNA prediction and that these motifs frequently flanked the lncRNA sequence.
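The abstract describes applying NLP models to raw genomic sequence. As a rough illustration only, DNABERT-style models read DNA as overlapping k-mer "words"; the sketch below shows that tokenization step. The function name kmer_tokens and the choice k=6 are assumptions made for this example and are not taken from the record, which does not state the k-mer size or any preprocessing details.

```python
def kmer_tokens(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Hypothetical helper for illustration; k=6 is an assumed default,
    not a value reported in the abstract.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    # Toy sequence, not real genomic data.
    print(kmer_tokens("ATGCGTAC", k=6))
```

Each token overlaps its neighbour by k-1 bases, so a sequence of length n yields n-k+1 tokens that a BERT-style tokenizer can then map to vocabulary indices.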
format Online
Article
Text
id pubmed-10696397
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Research Network of Computational and Structural Biotechnology
record_format MEDLINE/PubMed
spelling pubmed-10696397 2023-12-06 Comput Struct Biotechnol J Research Article Research Network of Computational and Structural Biotechnology 2023-11-17 /pmc/articles/PMC10696397/ http://dx.doi.org/10.1016/j.csbj.2023.11.025 Text en © 2023 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title DNABERT-based explainable lncRNA identification in plant genome assemblies
topic Research Article