
Distinguishing Protein-Coding from Non-Coding RNAs through Support Vector Machines


Bibliographic Details
Main authors:	Liu, Jinfeng; Gough, Julian; Rost, Burkhard
Format:	Text
Language:	English
Published:	Public Library of Science 2006
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1449884/ https://www.ncbi.nlm.nih.gov/pubmed/16683024 http://dx.doi.org/10.1371/journal.pgen.0020029

Collection:	PubMed
Description:	RIKEN's FANTOM project has revealed many previously unknown coding sequences, as well as an unexpected degree of variation in transcripts resulting from alternative promoter usage and splicing. Ever more transcripts that do not code for proteins have been identified by transcriptome studies, in general. Increasing evidence points to the important cellular roles of such non-coding RNAs (ncRNAs). The distinction of protein-coding RNA transcripts from ncRNA transcripts is therefore an important problem in understanding the transcriptome and carrying out its annotation. Very few in silico methods have specifically addressed this problem. Here, we introduce CONC (for “coding or non-coding”), a novel method based on support vector machines that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologs from database searches, and alignment entropy. Nucleotide frequencies are also incorporated into the method. Confirmed coding cDNAs for eukaryotic proteins from the Swiss-Prot database constituted the set of true positives, ncRNAs from RNAdb and NONCODE the true negatives. Ten-fold cross-validation suggested that CONC distinguished coding RNAs from ncRNAs at about 97% specificity and 98% sensitivity. Applied to 102,801 mouse cDNAs from the FANTOM3 dataset, our method reliably identified over 14,000 ncRNAs and estimated the total number of ncRNAs to be about 28,000.
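As a rough illustration of two of the sequence-derived features named in the description (amino acid composition and compositional entropy) and of the reported evaluation measures (sensitivity and specificity), the following is a minimal sketch, not the CONC implementation. It assumes that "compositional entropy" means the Shannon entropy of the amino acid composition, which the record does not spell out. All function names are hypothetical.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide):
    """Return the fraction of each standard amino acid in the peptide."""
    counts = Counter(aa for aa in peptide.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def compositional_entropy(peptide):
    """Shannon entropy, in bits, of the composition (assumed definition)."""
    composition = amino_acid_composition(peptide)
    return -sum(p * math.log2(p) for p in composition.values() if p > 0)


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical peptide, used only to exercise the functions.
    print(compositional_entropy("MKTAYIAKQR"))
    # Illustrative counts only; these are not data from the paper.
    print(sensitivity_specificity(tp=98, fn=2, tn=97, fp=3))
```

A low-complexity peptide (for example, a run of one residue) gives an entropy near zero, while a peptide using all 20 residues equally gives the maximum of log2(20) bits. In a classifier of the kind described, such values would be entries in the feature vector passed to the support vector machine.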
Record ID:	pubmed-1449884
Institution:	National Center for Biotechnology Information
Record format:	MEDLINE/PubMed
Journal:	PLoS Genet
Publication date:	2006-04-28
Copyright:	© 2006 Liu et al.
License:	http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
