Recent Experiences in Parameter-Free Data Mining

Recent results supporting the usefulness of the normalized compression distance for classifying virus genome sequences are reported. Specifically, the problem of clustering the hemagglutinin (HA) gene sequences of influenza viruses according to the host and subtype of...

Bibliographic Details
Main Authors: Ito, Kimihito, Zeugmann, Thomas, Zhu, Yu
Format: Online Article Text
Language: English
Published: 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7121110/
http://dx.doi.org/10.1007/978-90-481-9794-1_68
Description
Summary: Recent results supporting the usefulness of the normalized compression distance for classifying virus genome sequences are reported. Specifically, two problems are studied: clustering the hemagglutinin (HA) gene sequences of influenza viruses according to the host and subtype of the virus, and classifying dengue virus genome data with respect to their four serotypes. Hierarchical clustering is compared with spectral clustering via the kLine algorithm by Fischer and Poland (2004), and the standard compressors bzlib, ppmd, and zlib are compared with one another. The results are very promising and show that an (almost) perfect clustering can be obtained for all the problems studied.
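
For readers unfamiliar with the normalized compression distance (NCD) named in the summary, the following is a minimal sketch of how it is commonly computed, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length in bytes. It is not the authors' implementation: the function names `compressed_size`, `ncd`, and `distance_matrix` are hypothetical, and Python's standard `zlib` and `bz2` modules stand in for the compressors mentioned in the record (ppmd has no standard-library counterpart and is omitted).

```python
import bz2
import zlib


def compressed_size(data: bytes, compressor=zlib.compress) -> int:
    """Length in bytes of `data` after compression (approximates C(data))."""
    return len(compressor(data))


def ncd(x: bytes, y: bytes, compressor=zlib.compress) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest similar inputs; values near 1 suggest
    unrelated inputs. Real compressors can yield values slightly above 1.
    """
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)  # C of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(sequences, compressor=zlib.compress):
    """Pairwise NCD matrix, usable as input to a clustering routine."""
    return [[ncd(a, b, compressor) for b in sequences] for a in sequences]
```

Calling `distance_matrix(seqs, compressor=bz2.compress)` on a list of byte-encoded sequences would give the kind of pairwise matrix that a hierarchical or spectral clustering step could then consume; the choice of compressor is a parameter here precisely because the record reports comparing several.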