
(m, n)-mer—a simple statistical feature for sequence classification

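SUMMARY: The (m, n)-mer is a simple alternative classification feature based on conditional probability distributions. In this application note, we compared k-mer and (m, n)-mer frequency features in 11 distinct datasets used for binary, multiclass and clustering classifications. Our findings show that the (m, n)-mer frequency features are related to the highest performance metrics and often statistically outperformed the k-mers. Here, the (m, n)-mer frequencies improved performance for classifying smaller sequence lengths (as short as 300 bp) and yielded higher metrics when using short values of k (ranging from 2 to 4). Therefore, we present the (m, n)-mer frequencies to the scientific community as a feature that seems to be quite effective in identifying complex discriminatory patterns and classifying polyphyletic sequence groups. AVAILABILITY AND IMPLEMENTATION: The (m, n)-mer algorithm is released as an R package within the CRAN project (https://cran.r-project.org/web/packages/mnmer) and is also available at https://github.com/labinfo-lncc/mnmer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.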
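
The contrast between the two feature types can be illustrated with a minimal Python sketch. It rests on an assumption that this record does not spell out: that an (m, n)-mer feature is the frequency of an n-mer conditioned on the m-mer immediately preceding it, with k = m + n. The function names (`kmer_counts`, `kmer_frequencies`, `mn_mer_frequencies`) are hypothetical and are not the API of the mnmer R package.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping windows with non-ACGT symbols."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set(ALPHABET):
            counts[window] += 1
    return counts


def kmer_frequencies(seq, k):
    """Unconditional relative frequency of each of the 4**k possible k-mers."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(ALPHABET, repeat=k)
    }


def mn_mer_frequencies(seq, m, n):
    """Assumed (m, n)-mer feature: frequency of an n-mer given the m-mer before it.

    Each (m + n)-mer count is divided by the total count of (m + n)-mers
    that share the same m-long prefix, so values for one prefix sum to 1.
    """
    counts = kmer_counts(seq, m + n)
    prefix_totals = Counter()
    for kmer, c in counts.items():
        prefix_totals[kmer[:m]] += c
    features = {}
    for p in product(ALPHABET, repeat=m + n):
        kmer = "".join(p)
        prefix, suffix = kmer[:m], kmer[m:]
        denom = prefix_totals[prefix]
        features[f"{prefix}|{suffix}"] = counts[kmer] / denom if denom else 0.0
    return features
```

For example, `kmer_frequencies("ACGTACGA", 2)` and `mn_mer_frequencies("ACGTACGA", 1, 1)` both return 16 features, but the second normalizes each dinucleotide count within its first base rather than over the whole sequence.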


Bibliographic Details
Main authors:	de Andrade, Amanda Araújo Serrão; Grivet, Marco; Brustolini, Otávio; Vasconcelos, Ana Tereza Ribeiro
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Application Note
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10338135/ https://www.ncbi.nlm.nih.gov/pubmed/37448814 http://dx.doi.org/10.1093/bioadv/vbad088

author	de Andrade, Amanda Araújo Serrão; Grivet, Marco; Brustolini, Otávio; Vasconcelos, Ana Tereza Ribeiro
collection	PubMed
description	SUMMARY: The (m, n)-mer is a simple alternative classification feature based on conditional probability distributions. In this application note, we compared k-mer and (m, n)-mer frequency features in 11 distinct datasets used for binary, multiclass and clustering classifications. Our findings show that the (m, n)-mer frequency features are related to the highest performance metrics and often statistically outperformed the k-mers. Here, the (m, n)-mer frequencies improved performance for classifying smaller sequence lengths (as short as 300 bp) and yielded higher metrics when using short values of k (ranging from 2 to 4). Therefore, we present the (m, n)-mers frequencies to the scientific community as a feature that seems to be quite effective in identifying complex discriminatory patterns and classifying polyphyletic sequence groups. AVAILABILITY AND IMPLEMENTATION: The (m, n)-mer algorithm is released as an R package within the CRAN project (https://cran.r-project.org/web/packages/mnmer) and is also available at https://github.com/labinfo-lncc/mnmer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.
format	Online Article Text
id	pubmed-10338135
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10338135 2023-07-13. (m, n)-mer—a simple statistical feature for sequence classification. de Andrade, Amanda Araújo Serrão; Grivet, Marco; Brustolini, Otávio; Vasconcelos, Ana Tereza Ribeiro. Bioinform Adv. Application Note. Oxford University Press 2023-07-11. /pmc/articles/PMC10338135/ /pubmed/37448814 http://dx.doi.org/10.1093/bioadv/vbad088 Text en. © The Author(s) 2023. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	(m, n)-mer—a simple statistical feature for sequence classification
topic	Application Note
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10338135/ https://www.ncbi.nlm.nih.gov/pubmed/37448814 http://dx.doi.org/10.1093/bioadv/vbad088
