Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes

One of the main steps in gene-finding in prokaryotes is determining which open reading frames encode a protein, and which occur by chance alone. There are many different methods to differentiate the two; the most prevalent approach is using shared homology with a database of known genes. This me...

Bibliographic Details
Main Authors:	McNair, Katelyn; Ecale Zhou, Carol L.; Souza, Brian; Malfatti, Stephanie; Edwards, Robert A.
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7827183/ https://www.ncbi.nlm.nih.gov/pubmed/33429904 http://dx.doi.org/10.3390/microorganisms9010129

author	McNair, Katelyn; Ecale Zhou, Carol L.; Souza, Brian; Malfatti, Stephanie; Edwards, Robert A.
author_sort	McNair, Katelyn
collection	PubMed
description	One of the main steps in prokaryotic gene finding is determining which open reading frames (ORFs) encode a protein and which occur by chance alone. There are many methods for differentiating the two; the most prevalent is to use shared homology with a database of known genes. This method has many pitfalls, most notably that it only finds genes that have been seen before. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, PHANOTATE) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing the training model to be created ab initio from the input genome. Different methods are available for creating the training model, and to increase the accuracy of such tools, we present here GOODORFS, a method for identifying protein-coding genes within a set of all possible ORFs. Our workflow takes the amino acid frequencies of each ORF, calculates an entropy density profile (EDP) from them, uses KMeans to cluster the EDPs, and then selects the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS was the most accurate (0.94) and had the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).
format	Online Article Text
id	pubmed-7827183
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-7827183 2021-01-25 Microorganisms Article MDPI 2021-01-08 /pmc/articles/PMC7827183/ /pubmed/33429904 http://dx.doi.org/10.3390/microorganisms9010129 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title	Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7827183/ https://www.ncbi.nlm.nih.gov/pubmed/33429904 http://dx.doi.org/10.3390/microorganisms9010129
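
The workflow summarized in the description field (amino acid frequencies, entropy density profile, clustering, selection of the lowest-variation cluster) can be illustrated with a minimal Python sketch. This is not the GOODORFS implementation. It assumes the EDP is defined as s_i = -(p_i log p_i) / H, where H is the Shannon entropy of the amino acid frequencies; it assumes "lowest variation" means the smallest summed per-dimension variance; and the function names are hypothetical. The clustering step itself is left out: the labels passed to `lowest_variation_cluster` could come from any k-means implementation, such as scikit-learn's `KMeans`.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_frequencies(protein):
    """Relative frequency of each standard amino acid in a translated ORF."""
    counts = Counter(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def entropy_density_profile(freqs):
    """Assumed EDP definition: s_i = -(p_i * log p_i) / H, H = Shannon entropy."""
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:
        # Only one residue type present: the profile is undefined, return zeros.
        return [0.0] * len(freqs)
    return [t / entropy for t in terms]


def lowest_variation_cluster(profiles, labels):
    """Label of the cluster with the smallest summed per-dimension variance.

    Assumption: 'variation' is measured as total population variance.
    """
    clusters = {}
    for profile, label in zip(profiles, labels):
        clusters.setdefault(label, []).append(profile)

    def total_variance(rows):
        n = len(rows)
        total = 0.0
        for column in zip(*rows):
            mean = sum(column) / n
            total += sum((x - mean) ** 2 for x in column) / n
        return total

    return min(clusters, key=lambda label: total_variance(clusters[label]))
```

In use, each candidate ORF would be translated, passed through `amino_acid_frequencies` and `entropy_density_profile`, and the resulting 20-dimensional vectors clustered; the ORFs in the cluster returned by `lowest_variation_cluster` would then be treated as the putative coding set.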

Similar Items