
CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification

BACKGROUND: Conserved nucleic acid sequences play an essential role in transcriptional regulation. The motifs/templates derived from nucleic acid sequence datasets are usually used as biomarkers to predict biochemical properties such as protein binding sites or to identify specific non-coding RNAs....


Bibliographic Details
Main author:	Peng, He
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2020
Subjects:	Bioinformatics
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7179567/ https://www.ncbi.nlm.nih.gov/pubmed/32341900 http://dx.doi.org/10.7717/peerj.8965

_version_	1783525674602987520
author	Peng, He
author_facet	Peng, He
author_sort	Peng, He
collection	PubMed
description	BACKGROUND: Conserved nucleic acid sequences play an essential role in transcriptional regulation. The motifs/templates derived from nucleic acid sequence datasets are usually used as biomarkers to predict biochemical properties such as protein binding sites or to identify specific non-coding RNAs. In many cases, template-based nucleic acid sequence classification performs better than some feature extraction methods, such as N-gram and k-spaced pairs classification. The availability of large-scale experimental data provides an unprecedented opportunity to improve motif extraction methods. The process for pattern extraction from large-scale data is crucial for the creation of predictive models. METHODS: In this article, a Teiresias-like feature extraction algorithm to discover frequent sub-sequences (CFSP) is proposed. Although gaps are allowed in some motif discovery algorithms, the distance and number of gaps are limited. The proposed algorithm can find frequent sequence pairs with a larger gap. The combinations of frequent sub-sequences in given protracted sequences capture the long-distance correlation, which implies a specific molecular biological property. Hence, the proposed algorithm intends to discover the combinations. An ordered set of frequent sub-sequences derived from nucleic acid sequences is used as a base frequent sub-sequence array. The mutation information is attached to each sub-sequence array to implement fuzzy matching. Thus, each mutation entry records a single-nucleotide variant or a nucleotide insertion/deletion (indel) to encode a slight difference between the frequent sequences and a matched subsequence of a sequence under investigation. CONCLUSIONS: The proposed algorithm has been validated with several nucleic acid sequence prediction case studies. The results are better than those of recently available feature-descriptor-based methods on experimental data sets such as miRNA, piRNA, and Sigma 54 promoters. CFSP is implemented in C++ and shell script; the source code and related data are available at https://github.com/HePeng2016/CFSP.
format	Online Article Text
id	pubmed-7179567
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-7179567 2020-04-27 CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification Peng, He PeerJ Bioinformatics BACKGROUND: Conserved nucleic acid sequences play an essential role in transcriptional regulation. The motifs/templates derived from nucleic acid sequence datasets are usually used as biomarkers to predict biochemical properties such as protein binding sites or to identify specific non-coding RNAs. In many cases, template-based nucleic acid sequence classification performs better than some feature extraction methods, such as N-gram and k-spaced pairs classification. The availability of large-scale experimental data provides an unprecedented opportunity to improve motif extraction methods. The process for pattern extraction from large-scale data is crucial for the creation of predictive models. METHODS: In this article, a Teiresias-like feature extraction algorithm to discover frequent sub-sequences (CFSP) is proposed. Although gaps are allowed in some motif discovery algorithms, the distance and number of gaps are limited. The proposed algorithm can find frequent sequence pairs with a larger gap. The combinations of frequent sub-sequences in given protracted sequences capture the long-distance correlation, which implies a specific molecular biological property. Hence, the proposed algorithm intends to discover the combinations. An ordered set of frequent sub-sequences derived from nucleic acid sequences is used as a base frequent sub-sequence array. The mutation information is attached to each sub-sequence array to implement fuzzy matching. Thus, each mutation entry records a single-nucleotide variant or a nucleotide insertion/deletion (indel) to encode a slight difference between the frequent sequences and a matched subsequence of a sequence under investigation. CONCLUSIONS: The proposed algorithm has been validated with several nucleic acid sequence prediction case studies. The results are better than those of recently available feature-descriptor-based methods on experimental data sets such as miRNA, piRNA, and Sigma 54 promoters. CFSP is implemented in C++ and shell script; the source code and related data are available at https://github.com/HePeng2016/CFSP. PeerJ Inc. 2020-04-20 /pmc/articles/PMC7179567/ /pubmed/32341900 http://dx.doi.org/10.7717/peerj.8965 Text en ©2020 Peng https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
spellingShingle	Bioinformatics Peng, He CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification
title	CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification
title_full	CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification
title_fullStr	CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification
title_full_unstemmed	CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification
title_short	CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification
title_sort	cfsp: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7179567/ https://www.ncbi.nlm.nih.gov/pubmed/32341900 http://dx.doi.org/10.7717/peerj.8965
work_keys_str_mv	AT penghe cfspacollaborativefrequentsequencepatterndiscoveryalgorithmfornucleicacidsequenceclassification
