CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification

BACKGROUND: Conserved nucleic acid sequences play an essential role in transcriptional regulation. The motifs/templates derived from nucleic acid sequence datasets are usually used as biomarkers to predict biochemical properties such as protein binding sites or to identify specific non-coding RNAs....

Bibliographic Details
Main author: Peng, He
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2020
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7179567/
https://www.ncbi.nlm.nih.gov/pubmed/32341900
http://dx.doi.org/10.7717/peerj.8965
author Peng, He
collection PubMed
description BACKGROUND: Conserved nucleic acid sequences play an essential role in transcriptional regulation. The motifs/templates derived from nucleic acid sequence datasets are commonly used as biomarkers to predict biochemical properties such as protein binding sites, or to identify specific non-coding RNAs. In many cases, template-based nucleic acid sequence classification performs better than some feature extraction methods, such as N-gram and k-spaced pairs classification. The availability of large-scale experimental data provides an unprecedented opportunity to improve motif extraction methods, and extracting patterns from large-scale data is crucial for building predictive models. METHODS: This article proposes CFSP, a Teiresias-like feature extraction algorithm that discovers frequent sub-sequences. Although some motif discovery algorithms allow gaps, they limit the distance and number of gaps. The proposed algorithm can find frequent sequence pairs separated by a larger gap. Combinations of frequent sub-sequences within long sequences capture long-distance correlations, which imply a specific molecular biological property, so the algorithm aims to discover these combinations. An ordered set of frequent sub-sequences derived from nucleic acid sequences is used as a base frequent sub-sequence array. Mutation information is attached to each sub-sequence array to implement fuzzy matching: each mutation record encodes a single-nucleotide variant or a nucleotide insertion/deletion (indel), capturing a slight difference between a frequent sequence and the matched subsequence of the sequence under investigation. CONCLUSIONS: The proposed algorithm has been validated in several nucleic acid sequence prediction case studies. On experimental data sets such as miRNA, piRNA, and Sigma 54 promoters, it produced better results than recently available feature-descriptor-based methods.
CFSP is implemented in C++ and shell script; the source code and related data are available at https://github.com/HePeng2016/CFSP.
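
The ideas summarized in the abstract (frequent sub-sequences, ordered pairs separated by an arbitrary gap, and fuzzy matching that tolerates one SNV or indel) can be illustrated with a small sketch. This is not the CFSP implementation, which is written in C++ and is available at the repository above; it is a simplified Python illustration under stated assumptions. The function names (`frequent_kmers`, `frequent_ordered_pairs`, `within_one_edit`), the fixed sub-sequence length `k`, and the support threshold `min_support` are all hypothetical and do not come from the article.

```python
from collections import Counter


def frequent_kmers(seqs, k, min_support):
    """Return the k-mers that occur in at least `min_support` sequences."""
    support = Counter()
    for s in seqs:
        # A set counts each k-mer at most once per sequence.
        support.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return {kmer for kmer, n in support.items() if n >= min_support}


def frequent_ordered_pairs(seqs, k, min_support):
    """Count ordered pairs (a, b) of frequent k-mers where some occurrence
    of `a` ends before some occurrence of `b` starts, with any gap length.
    Returns {(a, b): support} for pairs meeting `min_support`."""
    kmers = frequent_kmers(seqs, k, min_support)
    support = Counter()
    for s in seqs:
        first, last = {}, {}
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if kmer in kmers:
                first.setdefault(kmer, i)  # earliest start of this k-mer
                last[kmer] = i             # latest start of this k-mer
        # `a` precedes `b` without overlap iff the earliest `a` ends
        # at or before the latest `b` starts.
        pairs = {(a, b) for a in first for b in last
                 if a != b and first[a] + k <= last[b]}
        support.update(pairs)
    return {p: n for p, n in support.items() if n >= min_support}


def within_one_edit(pattern, window):
    """True if `window` equals `pattern` up to one substitution (SNV)
    or one single-nucleotide insertion/deletion (indel)."""
    if abs(len(pattern) - len(window)) > 1:
        return False
    if len(pattern) == len(window):
        return sum(x != y for x, y in zip(pattern, window)) <= 1
    short, long_ = sorted((pattern, window), key=len)
    # Deleting one character from the longer string must give the shorter.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


if __name__ == "__main__":
    toy = ["ACGTTTTTGGCA", "ACGTCCGGCA", "TTACGTAGGCAT", "GGCAACGT"]
    print(frequent_ordered_pairs(toy, k=4, min_support=3))
    print(within_one_edit("ACGT", "ACCT"), within_one_edit("ACGT", "AGCT"))
```

In the toy input, `ACGT` and `GGCA` occur in every sequence, but `ACGT` precedes `GGCA` in only three of the four, with gaps of different lengths; a pair feature of this kind records order while ignoring the exact distance. The sketch treats all sub-sequences as having one fixed length and allows only a single edit, which is a deliberate simplification.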
format Online
Article
Text
id pubmed-7179567
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-7179567 2020-04-27 CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification Peng, He PeerJ Bioinformatics PeerJ Inc. 2020-04-20 /pmc/articles/PMC7179567/ /pubmed/32341900 http://dx.doi.org/10.7717/peerj.8965 Text en ©2020 Peng https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7179567/
https://www.ncbi.nlm.nih.gov/pubmed/32341900
http://dx.doi.org/10.7717/peerj.8965