
Classifying Coding DNA with Nucleotide Statistics

In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate...
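The two signals named in the summary, purine bias by codon position and stop codon frequency, can be illustrated with a minimal Python sketch. It is not the authors' UFM scoring: the function names (`codons`, `purine_fraction_by_position`, `stop_codon_count`) are hypothetical, the standard stop codons TAA, TAG, and TGA are assumed, and no classification threshold or score is applied. It only computes the raw statistics for one reading frame.

```python
PURINES = {"A", "G"}
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def codons(seq, frame=0):
    """Yield complete triplets of seq starting at offset frame (0, 1, or 2)."""
    seq = seq.upper()
    for i in range(frame, len(seq) - 2, 3):
        yield seq[i:i + 3]


def purine_fraction_by_position(seq, frame=0):
    """Return the fraction of purines (A or G) at codon positions 1, 2, and 3."""
    counts = [0, 0, 0]
    total = 0
    for codon in codons(seq, frame):
        total += 1
        for pos, base in enumerate(codon):
            if base in PURINES:
                counts[pos] += 1
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(c / total for c in counts)


def stop_codon_count(seq, frame=0):
    """Count stop codons read in the given frame."""
    return sum(1 for codon in codons(seq, frame) if codon in STOP_CODONS)


if __name__ == "__main__":
    example = "ATGGCCAAGTAA"  # toy sequence, not real data
    for frame in range(3):
        print(frame,
              purine_fraction_by_position(example, frame),
              stop_codon_count(example, frame))
```

Comparing these statistics across the three frames of each strand is one way to see why a frame rich in first-position purines and poor in stop codons looks more like a coding frame; how the published method combines them into a score is described in the article itself.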


Bibliographic Details
Main Authors:	Carels, Nicolas, Frías, Diego
Format:	Text
Language:	English
Published:	Libertas Academica 2009
Subjects:	Original Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808172/ https://www.ncbi.nlm.nih.gov/pubmed/20140062

author	Carels, Nicolas Frías, Diego
collection	PubMed
description	In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in the 1st and 2nd positions of triplets, and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method returns coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
format	Text
id	pubmed-2808172
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	Libertas Academica
record_format	MEDLINE/PubMed
spelling	pubmed-2808172 2010-02-04 Classifying Coding DNA with Nucleotide Statistics Carels, Nicolas Frías, Diego Bioinform Biol Insights Original Research Libertas Academica 2009-10-28 /pmc/articles/PMC2808172/ /pubmed/20140062 Text en Copyright © 2009 The authors. http://creativecommons.org/licenses/by/2.0 This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/2.0/).
title	Classifying Coding DNA with Nucleotide Statistics
topic	Original Research
