(m, n)-mer—a simple statistical feature for sequence classification

SUMMARY: The (m, n)-mer is a simple alternative classification feature based on conditional probability distributions. In this application note, we compared k-mer and (m, n)-mer frequency features in 11 distinct datasets used for binary, multiclass and clustering classifications. Our findings show t...
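The summary contrasts plain k-mer frequencies with (m, n)-mer frequencies "based on conditional probability distributions". The sketch below illustrates one plausible reading of that contrast, assuming k = m + n and that an (m, n)-mer feature is the relative frequency of an n-mer given the m-mer immediately before it. That definition, the DNA alphabet, and the function names `kmer_frequencies` and `mn_mer_frequencies` are assumptions made for illustration; they are not taken from this record, and the mnmer R package remains the reference implementation.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: DNA sequences; windows with other symbols are ignored


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer over ALPHABET in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in kmers)
    return {w: (counts[w] / total if total else 0.0) for w in kmers}


def mn_mer_frequencies(seq, m, n):
    """Assumed (m, n)-mer feature: estimate of P(n-mer | preceding m-mer).

    Each (m + n)-mer count is divided by the total count of all
    (m + n)-mers that share the same m-mer prefix.
    """
    k = m + n
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    suffixes = ["".join(s) for s in product(ALPHABET, repeat=n)]
    features = {}
    for p in product(ALPHABET, repeat=m):
        prefix = "".join(p)
        prefix_total = sum(counts[prefix + s] for s in suffixes)
        for s in suffixes:
            value = counts[prefix + s] / prefix_total if prefix_total else 0.0
            features[(prefix, s)] = value
    return features


if __name__ == "__main__":
    seq = "ACGTACGA"
    print(kmer_frequencies(seq, 2)["GT"])             # share of all 2-mers that are GT
    print(mn_mer_frequencies(seq, 1, 1)[("G", "T")])  # share of G-prefixed 2-mers that are GT
```

Both feature vectors have the same length (4^k entries for DNA), so under this reading the two features differ only in normalization: k-mer frequencies sum to 1 over the whole vector, while the conditional frequencies sum to 1 within each group sharing an m-mer prefix.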


Bibliographic Details
Main Authors: de Andrade, Amanda Araújo Serrão; Grivet, Marco; Brustolini, Otávio; Vasconcelos, Ana Tereza Ribeiro
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10338135/
https://www.ncbi.nlm.nih.gov/pubmed/37448814
http://dx.doi.org/10.1093/bioadv/vbad088
_version_ 1785071563572248576
author de Andrade, Amanda Araújo Serrão
Grivet, Marco
Brustolini, Otávio
Vasconcelos, Ana Tereza Ribeiro
collection PubMed
description SUMMARY: The (m, n)-mer is a simple alternative classification feature based on conditional probability distributions. In this application note, we compared k-mer and (m, n)-mer frequency features in 11 distinct datasets used for binary, multiclass and clustering classifications. Our findings show that the (m, n)-mer frequency features are related to the highest performance metrics and often statistically outperformed the k-mers. Here, the (m, n)-mer frequencies improved performance for classifying smaller sequence lengths (as short as 300 bp) and yielded higher metrics when using short values of k (ranging from 2 to 4). Therefore, we present the (m, n)-mer frequencies to the scientific community as a feature that seems to be quite effective in identifying complex discriminatory patterns and classifying polyphyletic sequence groups. AVAILABILITY AND IMPLEMENTATION: The (m, n)-mer algorithm is released as an R package within the CRAN project (https://cran.r-project.org/web/packages/mnmer) and is also available at https://github.com/labinfo-lncc/mnmer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.
format Online
Article
Text
id pubmed-10338135
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10338135 2023-07-13 (m, n)-mer—a simple statistical feature for sequence classification de Andrade, Amanda Araújo Serrão; Grivet, Marco; Brustolini, Otávio; Vasconcelos, Ana Tereza Ribeiro Bioinform Adv Application Note SUMMARY: The (m, n)-mer is a simple alternative classification feature based on conditional probability distributions. In this application note, we compared k-mer and (m, n)-mer frequency features in 11 distinct datasets used for binary, multiclass and clustering classifications. Our findings show that the (m, n)-mer frequency features are related to the highest performance metrics and often statistically outperformed the k-mers. Here, the (m, n)-mer frequencies improved performance for classifying smaller sequence lengths (as short as 300 bp) and yielded higher metrics when using short values of k (ranging from 2 to 4). Therefore, we present the (m, n)-mer frequencies to the scientific community as a feature that seems to be quite effective in identifying complex discriminatory patterns and classifying polyphyletic sequence groups. AVAILABILITY AND IMPLEMENTATION: The (m, n)-mer algorithm is released as an R package within the CRAN project (https://cran.r-project.org/web/packages/mnmer) and is also available at https://github.com/labinfo-lncc/mnmer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online. Oxford University Press 2023-07-11 /pmc/articles/PMC10338135/ /pubmed/37448814 http://dx.doi.org/10.1093/bioadv/vbad088 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title (m, n)-mer—a simple statistical feature for sequence classification
topic Application Note
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10338135/
https://www.ncbi.nlm.nih.gov/pubmed/37448814
http://dx.doi.org/10.1093/bioadv/vbad088