
A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences

In this study, we investigated the modalities of coding open reading frame (cORF) classification of expressed sequence tags (EST) by using the universal feature method (UFM). The UFM algorithm is based on the scoring of purine bias (Rrr) and stop codon frequencies. UFM classifies ORFs as coding or n...


Bibliographic Details
Main Authors:	Carels, Nicolas, Frías, Diego
Format:	Online Article Text
Language:	English
Published:	Libertas Academica 2013
Subjects:	Original Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3561939/ https://www.ncbi.nlm.nih.gov/pubmed/23400232 http://dx.doi.org/10.4137/BBI.S10053

_version_	1782258016883572736
author	Carels, Nicolas Frías, Diego
author_facet	Carels, Nicolas Frías, Diego
author_sort	Carels, Nicolas
collection	PubMed
description	In this study, we investigated the modalities of coding open reading frame (cORF) classification of expressed sequence tags (EST) by using the universal feature method (UFM). The UFM algorithm is based on the scoring of purine bias (Rrr) and stop codon frequencies. UFM classifies ORFs as coding or non-coding through a score based on 5 factors: (i) stop codon frequency; (ii) the product of the probabilities of purines occurring in the three positions of nucleotide triplets; (iii) the product of the probabilities of Cytosine (C), Guanine (G), and Adenine (A) occurring in the 1st, 2nd, and 3rd positions of triplets, respectively; (iv) the probabilities of a G occurring in the 1st and 2nd positions of triplets; and (v) the probabilities of a T occurring in the 1st and an A in the 2nd position of triplets. Because UFM is based on primary determinants of coding sequences that are conserved throughout the biosphere, it is suitable for cORF classification of any sequence in eukaryote transcriptomes without prior knowledge. Considering the protein sequences of the Protein Data Bank (RCSB PDB or more simply PDB) as a reference, we found that UFM classifies cORFs of ≥200 bp (if the coding strand is known) and cORFs of ≥300 bp (if the coding strand is unknown), and releases them in their coding strand and coding frame, which allows their automatic translation into protein sequences with a success rate equal to or higher than 95%. We first established the statistical parameters of UFM using ESTs from Plasmodium falciparum, Arabidopsis thaliana, Oryza sativa, Zea mays, Drosophila melanogaster, Homo sapiens and Chlamydomonas reinhardtii in reference to the protein sequences of PDB. Second, we showed that the success rate of cORF classification using UFM is expected to apply to approximately 95% of higher eukaryote genes that encode for proteins. 
Third, we used UFM in combination with CAP3 to assemble large EST samples into cORFs that we used to analyze transcriptome phenotypes in rice, maize, and humans. We discuss the error rate and the interference of noisy sequences such as pseudogenes, transposons, and retrotransposons. This method is suitable for rapid cORF extraction from transcriptome data and allows correct description of the genome phenotypes of plant genomes without prior knowledge. Additional care is necessary when addressing the human transcriptome due to the interference caused by large amounts of noisy sequences. UFM can be regarded as a low complexity tool for prior knowledge extraction concerning the coding fraction of the transcriptome of any eukaryote. Due to its low level of complexity, UFM is also very robust to variations of codon usage.
format	Online Article Text
id	pubmed-3561939
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	Libertas Academica
record_format	MEDLINE/PubMed
spelling	pubmed-35619392013-02-11 A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences Carels, Nicolas Frías, Diego Bioinform Biol Insights Original Research In this study, we investigated the modalities of coding open reading frame (cORF) classification of expressed sequence tags (EST) by using the universal feature method (UFM). The UFM algorithm is based on the scoring of purine bias (Rrr) and stop codon frequencies. UFM classifies ORFs as coding or non-coding through a score based on 5 factors: (i) stop codon frequency; (ii) the product of the probabilities of purines occurring in the three positions of nucleotide triplets; (iii) the product of the probabilities of Cytosine (C), Guanine (G), and Adenine (A) occurring in the 1st, 2nd, and 3rd positions of triplets, respectively; (iv) the probabilities of a G occurring in the 1st and 2nd positions of triplets; and (v) the probabilities of a T occurring in the 1st and an A in the 2nd position of triplets. Because UFM is based on primary determinants of coding sequences that are conserved throughout the biosphere, it is suitable for cORF classification of any sequence in eukaryote transcriptomes without prior knowledge. Considering the protein sequences of the Protein Data Bank (RCSB PDB or more simply PDB) as a reference, we found that UFM classifies cORFs of ≥200 bp (if the coding strand is known) and cORFs of ≥300 bp (if the coding strand is unknown), and releases them in their coding strand and coding frame, which allows their automatic translation into protein sequences with a success rate equal to or higher than 95%. We first established the statistical parameters of UFM using ESTs from Plasmodium falciparum, Arabidopsis thaliana, Oryza sativa, Zea mays, Drosophila melanogaster, Homo sapiens and Chlamydomonas reinhardtii in reference to the protein sequences of PDB. 
Second, we showed that the success rate of cORF classification using UFM is expected to apply to approximately 95% of higher eukaryote genes that encode for proteins. Third, we used UFM in combination with CAP3 to assemble large EST samples into cORFs that we used to analyze transcriptome phenotypes in rice, maize, and humans. We discuss the error rate and the interference of noisy sequences such as pseudogenes, transposons, and retrotransposons. This method is suitable for rapid cORF extraction from transcriptome data and allows correct description of the genome phenotypes of plant genomes without prior knowledge. Additional care is necessary when addressing the human transcriptome due to the interference caused by large amounts of noisy sequences. UFM can be regarded as a low complexity tool for prior knowledge extraction concerning the coding fraction of the transcriptome of any eukaryote. Due to its low level of complexity, UFM is also very robust to variations of codon usage. Libertas Academica 2013-01-23 /pmc/articles/PMC3561939/ /pubmed/23400232 http://dx.doi.org/10.4137/BBI.S10053 Text en © 2013 the author(s), publisher and licensee Libertas Academica Ltd. This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.
spellingShingle	Original Research Carels, Nicolas Frías, Diego A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
title	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
title_full	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
title_fullStr	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
title_full_unstemmed	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
title_short	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
title_sort	statistical method without training step for the classification of coding frame in transcriptome sequences
topic	Original Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3561939/ https://www.ncbi.nlm.nih.gov/pubmed/23400232 http://dx.doi.org/10.4137/BBI.S10053
work_keys_str_mv	AT carelsnicolas astatisticalmethodwithouttrainingstepfortheclassificationofcodingframeintranscriptomesequences AT friasdiego astatisticalmethodwithouttrainingstepfortheclassificationofcodingframeintranscriptomesequences AT carelsnicolas statisticalmethodwithouttrainingstepfortheclassificationofcodingframeintranscriptomesequences AT friasdiego statisticalmethodwithouttrainingstepfortheclassificationofcodingframeintranscriptomesequences
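
The five factors named in the description can be illustrated with a short Python sketch. It only computes the raw per-frame quantities: stop codon frequency and the position-specific nucleotide probabilities. It does not reproduce the published UFM score, because the record does not state how the factors are combined or which thresholds are applied. The function names, the dictionary keys, and the use of the standard stop codons TAA, TAG, and TGA are assumptions made for illustration.

```python
# Illustrative sketch only: computes the raw quantities behind the five UFM
# factors for one reading frame. How UFM combines them into a score is NOT
# given in this record, so no final score or classification is attempted.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
PURINES = {"A", "G"}


def codons(seq, frame=0):
    """Split seq into complete triplets starting at offset frame (0, 1, or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def position_prob(triplets, pos, bases):
    """Fraction of triplets whose nucleotide at 0-based position pos is in bases."""
    if not triplets:
        return 0.0
    return sum(t[pos] in bases for t in triplets) / len(triplets)


def ufm_factors(seq, frame=0):
    """Return the five raw factors for one frame (hypothetical key names)."""
    t = codons(seq, frame)
    n = len(t)
    return {
        # (i) stop codon frequency
        "stop_freq": sum(c in STOP_CODONS for c in t) / n if n else 0.0,
        # (ii) product of purine probabilities at positions 1, 2, and 3
        "purine_product": (position_prob(t, 0, PURINES)
                           * position_prob(t, 1, PURINES)
                           * position_prob(t, 2, PURINES)),
        # (iii) product of P(C at 1st), P(G at 2nd), P(A at 3rd)
        "cga_product": (position_prob(t, 0, {"C"})
                        * position_prob(t, 1, {"G"})
                        * position_prob(t, 2, {"A"})),
        # (iv) P(G at 1st) and P(G at 2nd), kept separate
        "g1_g2": (position_prob(t, 0, {"G"}), position_prob(t, 1, {"G"})),
        # (v) P(T at 1st) and P(A at 2nd), kept separate
        "t1_a2": (position_prob(t, 0, {"T"}), position_prob(t, 1, {"A"})),
    }


if __name__ == "__main__":
    example = "ATGGCAGGATAA"  # toy sequence, not real EST data
    for frame in range(3):
        print(frame, ufm_factors(example, frame))
```

Calling `ufm_factors` on each of the three frames of a sequence (and, when the coding strand is unknown, on the three frames of its reverse complement) yields the inputs that a frame classifier of this kind would compare.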
