OCCAM: prediction of small ORFs in bacterial genomes by means of a target-decoy database approach and machine learning techniques
Small open reading frames (ORFs) have been systematically disregarded by automatic genome annotation. The difficulty of finding patterns in such short sequences is the main reason small ORFs are overlooked by computational procedures. However, advances in experimental methods show that small...
Main authors: | R. Cerqueira, Fabio; Vasconcelos, Ana Tereza Ribeiro |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | © Published by Oxford University Press 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673341/ https://www.ncbi.nlm.nih.gov/pubmed/33206960 http://dx.doi.org/10.1093/database/baaa067 |
_version_ | 1783611299162226688 |
---|---|
author | R. Cerqueira, Fabio; Vasconcelos, Ana Tereza Ribeiro |
author_facet | R. Cerqueira, Fabio; Vasconcelos, Ana Tereza Ribeiro |
author_sort | R. Cerqueira, Fabio |
collection | PubMed |
description | Small open reading frames (ORFs) have been systematically disregarded by automatic genome annotation. The difficulty of finding patterns in such short sequences is the main reason small ORFs are overlooked by computational procedures. However, advances in experimental methods show that small proteins can play vital roles in cellular activities. Hence, progress is urgently needed in the development of computational approaches that speed up the identification of potential small ORFs. In this work, our focus is on bacterial genomes. We improve a previous approach to identifying small ORFs in bacteria. Our method uses machine learning techniques and decoy subject sequences to filter out spurious ORF alignments. We show that an advanced multivariate analysis can be more effective in terms of sensitivity than applying the simplistic and widely used e-value cutoff. This is particularly important for small ORFs, whose alignments present higher e-values than usual. Experiments with control datasets show that the machine learning algorithms used in our method to curate significant alignments can achieve average sensitivity and specificity of 97.06% and 99.61%, respectively. Therefore, an important step is provided here toward the construction of more accurate computational tools for the identification of small ORFs in bacteria. |
format | Online Article Text |
id | pubmed-7673341 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | © Published by Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-7673341 2020-11-24 OCCAM: prediction of small ORFs in bacterial genomes by means of a target-decoy database approach and machine learning techniques R. Cerqueira, Fabio; Vasconcelos, Ana Tereza Ribeiro Database (Oxford) Original Article © Published by Oxford University Press 2020-11-18 /pmc/articles/PMC7673341/ /pubmed/33206960 http://dx.doi.org/10.1093/database/baaa067 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | OCCAM: prediction of small ORFs in bacterial genomes by means of a target-decoy database approach and machine learning techniques |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673341/ https://www.ncbi.nlm.nih.gov/pubmed/33206960 http://dx.doi.org/10.1093/database/baaa067 |
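
The description in this record compares an e-value cutoff with a multivariate classifier in terms of sensitivity and specificity. As a minimal sketch of how those two metrics are computed for a simple e-value cutoff (this is not the authors' code; the function names, the cutoff of 1e-2 and the toy alignments are all hypothetical):

```python
from typing import Sequence, Tuple


def predict_by_evalue(evalues: Sequence[float], cutoff: float) -> list:
    """Call an alignment significant when its e-value is at or below the cutoff."""
    return [e <= cutoff for e in evalues]


def sensitivity_specificity(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for boolean labels and predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy data (hypothetical): True marks a genuine alignment, False a spurious
# one, e.g. a hit against a decoy subject sequence.
evalues = [1e-10, 1e-3, 0.5, 2.0, 5.0, 1e-6]
is_true = [True, True, True, False, False, True]

sens, spec = sensitivity_specificity(is_true, predict_by_evalue(evalues, 1e-2))
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy set the genuine alignment with e-value 0.5 is rejected by the cutoff, which lowers sensitivity. That is the kind of loss the description attributes to e-value thresholds on small ORFs.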