PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme

BACKGROUND: High-throughput transcriptome sequencing (RNA-seq) technology promises to discover novel protein-coding and non-coding transcripts, particularly the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that are not restricted by prior gene an...

Bibliographic Details
Main Authors: Li, Aimin; Zhang, Junying; Zhou, Zhongyin
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4177586/
https://www.ncbi.nlm.nih.gov/pubmed/25239089
http://dx.doi.org/10.1186/1471-2105-15-311
author Li, Aimin
Zhang, Junying
Zhou, Zhongyin
collection PubMed
description BACKGROUND: High-throughput transcriptome sequencing (RNA-seq) technology promises to discover novel protein-coding and non-coding transcripts, particularly the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that are not restricted by prior gene annotations, genomic sequences and high-quality sequencing. RESULTS: We present an alignment-free tool called PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme), which uses a computational pipeline based on an improved k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from messenger RNAs (mRNAs), in the absence of genomic sequences or annotations. The performance of PLEK was evaluated on well-annotated mRNA and lncRNA transcripts. 10-fold cross-validation tests on human RefSeq mRNAs and GENCODE lncRNAs indicated that our tool could achieve accuracy of up to 95.6%. We demonstrated the utility of PLEK on transcripts from other vertebrates using the model built from human datasets. PLEK attained >90% accuracy on most of these datasets. PLEK also performed well using a simulated dataset and two real de novo assembled transcriptome datasets (sequenced by PacBio and 454 platforms) with relatively high indel sequencing errors. In addition, PLEK is approximately eightfold faster than a newly developed alignment-free tool, named Coding-Non-Coding Index (CNCI), and 244 times faster than the most popular alignment-based tool, Coding Potential Calculator (CPC), in a single-threading running manner. CONCLUSIONS: PLEK is an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes. PLEK is especially suitable for PacBio or 454 sequencing data and large-scale transcriptome data. Its open-source software can be freely downloaded from https://sourceforge.net/projects/plek/files/. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2105-15-311) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4177586
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4177586 2014-09-29 PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme Li, Aimin Zhang, Junying Zhou, Zhongyin BMC Bioinformatics Software BioMed Central 2014-09-19 /pmc/articles/PMC4177586/ /pubmed/25239089 http://dx.doi.org/10.1186/1471-2105-15-311 Text en © Li et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Software
Li, Aimin
Zhang, Junying
Zhou, Zhongyin
PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme
title PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4177586/
https://www.ncbi.nlm.nih.gov/pubmed/25239089
http://dx.doi.org/10.1186/1471-2105-15-311
work_keys_str_mv AT liaimin plekatoolforpredictinglongnoncodingrnasandmessengerrnasbasedonanimprovedkmerscheme
AT zhangjunying plekatoolforpredictinglongnoncodingrnasandmessengerrnasbasedonanimprovedkmerscheme
AT zhouzhongyin plekatoolforpredictinglongnoncodingrnasandmessengerrnasbasedonanimprovedkmerscheme
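
The abstract in this record describes PLEK as alignment-free: it classifies transcripts from k-mer features fed to a support vector machine, with no genome or annotation required. The record does not spell out what the "improved" k-mer scheme changes, so the following is only a minimal sketch of plain k-mer frequency features, not PLEK's actual method. The function name `kmer_profile` and the choice of `max_k=5` are assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_profile(seq, max_k=5):
    """Return relative k-mer frequencies for k = 1..max_k as one flat list.

    Hypothetical helper: a generic k-mer profile, not PLEK's improved scheme.
    """
    # Treat RNA and DNA spellings alike.
    seq = seq.upper().replace("U", "T")
    features = []
    for k in range(1, max_k + 1):
        # Fixed ordering of all 4**k possible k-mers.
        kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        # Slide a window of width k along the sequence, one base at a time.
        for i in range(max(len(seq) - k + 1, 0)):
            word = seq[i:i + k]
            if word in counts:  # skip windows containing N or other symbols
                counts[word] += 1
        total = sum(counts.values())
        # Normalise within each k so that transcript length does not dominate.
        features.extend(counts[m] / total if total else 0.0 for m in kmers)
    return features
```

With `max_k=5` the vector has 4 + 16 + 64 + 256 + 1024 = 1364 entries per transcript. Vectors like this, computed for labelled mRNAs and lncRNAs, could then be passed to any SVM implementation for training.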