PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme
BACKGROUND: High-throughput transcriptome sequencing (RNA-seq) technology promises to discover novel protein-coding and non-coding transcripts, particularly the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that are not restricted by prior gene an...
Main Authors: | Li, Aimin; Zhang, Junying; Zhou, Zhongyin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | Software |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4177586/ https://www.ncbi.nlm.nih.gov/pubmed/25239089 http://dx.doi.org/10.1186/1471-2105-15-311 |
_version_ | 1782336789677080576 |
---|---|
author | Li, Aimin; Zhang, Junying; Zhou, Zhongyin |
author_sort | Li, Aimin |
collection | PubMed |
description | BACKGROUND: High-throughput transcriptome sequencing (RNA-seq) technology promises to discover novel protein-coding and non-coding transcripts, particularly the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that are not restricted by prior gene annotations, genomic sequences and high-quality sequencing. RESULTS: We present an alignment-free tool called PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme), which uses a computational pipeline based on an improved k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from messenger RNAs (mRNAs), in the absence of genomic sequences or annotations. The performance of PLEK was evaluated on well-annotated mRNA and lncRNA transcripts. 10-fold cross-validation tests on human RefSeq mRNAs and GENCODE lncRNAs indicated that our tool could achieve accuracy of up to 95.6%. We demonstrated the utility of PLEK on transcripts from other vertebrates using the model built from human datasets. PLEK attained >90% accuracy on most of these datasets. PLEK also performed well using a simulated dataset and two real de novo assembled transcriptome datasets (sequenced by PacBio and 454 platforms) with relatively high indel sequencing errors. In addition, PLEK is approximately eightfold faster than a newly developed alignment-free tool, named Coding-Non-Coding Index (CNCI), and 244 times faster than the most popular alignment-based tool, Coding Potential Calculator (CPC), in a single-threading running manner. CONCLUSIONS: PLEK is an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes. PLEK is especially suitable for PacBio or 454 sequencing data and large-scale transcriptome data. Its open-source software can be freely downloaded from https://sourceforge.net/projects/plek/files/. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2105-15-311) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4177586 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4177586 2014-09-29 PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme Li, Aimin Zhang, Junying Zhou, Zhongyin BMC Bioinformatics Software BioMed Central 2014-09-19 /pmc/articles/PMC4177586/ /pubmed/25239089 http://dx.doi.org/10.1186/1471-2105-15-311 Text en © Li et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4177586/ https://www.ncbi.nlm.nih.gov/pubmed/25239089 http://dx.doi.org/10.1186/1471-2105-15-311 |
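
The description above says PLEK is alignment-free and builds its features from k-mers before passing them to an SVM. This record does not say what the "improved" scheme changes, so the sketch below shows only plain k-mer relative frequencies, as a rough illustration of the kind of feature vector such a classifier could consume. The function name `kmer_frequencies`, the parameter `k_max`, and the choice of k = 1..3 are illustrative assumptions, not taken from PLEK.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k_max=3):
    """Relative frequency of every k-mer over ACGT for k = 1..k_max.

    Hypothetical helper for illustration only; this is NOT PLEK's
    improved k-mer scheme, whose details are not given in this record.
    """
    # Treat RNA (U) and DNA (T) spellings of a transcript the same way.
    seq = seq.upper().replace("U", "T")
    features = {}
    for k in range(1, k_max + 1):
        # Slide a window of width k along the sequence, one base at a time.
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        alphabet_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        # Windows containing other symbols (e.g. N) are ignored.
        total = sum(counts[kmer] for kmer in alphabet_kmers)
        for kmer in alphabet_kmers:
            features[kmer] = counts[kmer] / total if total else 0.0
    return features


if __name__ == "__main__":
    profile = kmer_frequencies("ACGUACGU", k_max=2)
    print(len(profile), profile["A"], profile["AC"])
```

Because the vector has a fixed length regardless of transcript length (4 + 16 + 64 entries for k up to 3), it can be fed directly to a standard classifier without any alignment step.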