
Feature selection for RNA cleavage efficiency at specific sites using the LASSO regression model in Arabidopsis thaliana

BACKGROUND: RNA degradation is important for the regulation of gene expression. Despite the identification of proteins and sequences related to deadenylation-dependent RNA degradation in plants, endonucleolytic cleavage-dependent RNA degradation has not been studied in detail. Here, we developed tru...


Bibliographic Details
Main Authors: Ueno, Daishin, Kawabe, Harunori, Yamasaki, Shotaro, Demura, Taku, Kato, Ko
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8299621/
https://www.ncbi.nlm.nih.gov/pubmed/34294042
http://dx.doi.org/10.1186/s12859-021-04291-5
_version_ 1783726305316962304
author Ueno, Daishin
Kawabe, Harunori
Yamasaki, Shotaro
Demura, Taku
Kato, Ko
collection PubMed
description BACKGROUND: RNA degradation is important for the regulation of gene expression. Despite the identification of proteins and sequences related to deadenylation-dependent RNA degradation in plants, endonucleolytic cleavage-dependent RNA degradation has not been studied in detail. Here, we developed truncated RNA end sequencing in Arabidopsis thaliana to identify cleavage sites and evaluate the efficiency of cleavage at each site. Although several features are related to RNA cleavage efficiency, the effect of each feature on cleavage efficiency has not been evaluated by considering multiple putative determinants in A. thaliana.
RESULTS: Cleavage site information was acquired from a previous study, and cleavage efficiency at the site level (CS(site) value), which indicates the number of reads at each cleavage site normalized to RNA abundance, was calculated. To identify features related to cleavage efficiency at the site level, multiple putative determinants (features) were used to perform feature selection using the Least Absolute Shrinkage and Selection Operator (LASSO) regression model. The results indicated that whole RNA features were important for the CS(site) value, in addition to features around cleavage sites. Whole RNA features related to the translation process and nucleotide frequency around cleavage sites were major determinants of cleavage efficiency. The results were verified in a model constructed using only sequence features, which showed that the prediction accuracy was similar to that determined using all features including the translation process, suggesting that cleavage efficiency can be predicted using only sequence information. The LASSO regression model was validated in exogenous genes, which showed that the model constructed using only sequence information can predict cleavage efficiency in both endogenous and exogenous genes.
CONCLUSIONS: Feature selection using the LASSO regression model in A. thaliana identified 155 features. Correlation coefficients revealed that whole RNA features are important for determining cleavage efficiency in addition to features around the cleavage sites. The LASSO regression model can predict cleavage efficiency in endogenous and exogenous genes using only sequence information. The model revealed the significance of the effect of multiple determinants on cleavage efficiency, suggesting that sequence features are important for RNA degradation mechanisms in A. thaliana.
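As an illustration of the feature-selection idea the abstract describes, the following is a minimal sketch of LASSO regression fitted by cyclic coordinate descent. It is not the authors' pipeline: the synthetic data, the penalty `alpha = 0.2`, the six placeholder features, and every function name are assumptions made for this example, and the target `y` merely stands in for a per-site cleavage efficiency such as the CS(site) value. The point it shows is that the L1 penalty drives the coefficients of uninformative features to exactly zero, so the surviving non-zero coefficients form the selected feature set.

```python
import random

def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty (the L1 proximal step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def centre_columns(X):
    """Subtract each column mean so that no intercept term is needed."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual once its own
            # current contribution has been added back.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w

# Synthetic data (an assumption for illustration): six candidate features,
# of which only features 0 and 3 actually influence the response.
random.seed(0)
n_samples, n_features = 200, 6
raw_X = [[random.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
raw_y = [2.0 * row[0] - 1.5 * row[3] + random.gauss(0.0, 0.1) for row in raw_X]

X = centre_columns(raw_X)
y_mean = sum(raw_y) / n_samples
y = [value - y_mean for value in raw_y]

weights = lasso_coordinate_descent(X, y, alpha=0.2)
selected = [j for j, coef in enumerate(weights) if coef != 0.0]
print("coefficients:", [round(coef, 3) for coef in weights])
print("selected feature indices:", selected)
```

In practice the penalty strength would typically be chosen by cross-validation rather than fixed by hand, and an established implementation would usually be preferred over this hand-written loop.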
format Online
Article
Text
id pubmed-8299621
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8299621 2021-07-28 Feature selection for RNA cleavage efficiency at specific sites using the LASSO regression model in Arabidopsis thaliana Ueno, Daishin Kawabe, Harunori Yamasaki, Shotaro Demura, Taku Kato, Ko BMC Bioinformatics Research BioMed Central 2021-07-22 /pmc/articles/PMC8299621/ /pubmed/34294042 http://dx.doi.org/10.1186/s12859-021-04291-5 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Feature selection for RNA cleavage efficiency at specific sites using the LASSO regression model in Arabidopsis thaliana
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8299621/
https://www.ncbi.nlm.nih.gov/pubmed/34294042
http://dx.doi.org/10.1186/s12859-021-04291-5