
The extension of the largest generalized-eigenvalue based distance metric D(ij)(γ(1)) in arbitrary feature spaces to classify composite data points

Analyzing patterns in data points embedded in linear and non-linear feature spaces is a common research problem across several areas, including data mining, machine learning, pattern recognition, and multivariate analysis. In this paper, data points are heterogen...


Bibliographic Details
Main Author: Daoud, Mosaab
Format: Online Article Text
Language: English
Published: Korea Genome Organization 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6944050/
https://www.ncbi.nlm.nih.gov/pubmed/31896239
http://dx.doi.org/10.5808/GI.2019.17.4.e39
_version_ 1783484987594506240
author Daoud, Mosaab
author_facet Daoud, Mosaab
author_sort Daoud, Mosaab
collection PubMed
description Analyzing patterns in data points embedded in linear and non-linear feature spaces is a common research problem across several areas, including data mining, machine learning, pattern recognition, and multivariate analysis. In this paper, the data points are heterogeneous sets of biosequences (composite data points). A composite data point is a set of ordinary data points (e.g., a set of feature vectors). We theoretically extend the derivation of the largest generalized eigenvalue-based distance metric D(ij) (γ(1)) to arbitrary linear and non-linear feature spaces. We prove that D(ij) (γ(1)) is a metric under any linear or non-linear feature transformation function. We show the sufficiency and efficiency of the decision rule [Formula: see text] (i.e., the mean of D(ij) (γ(1))) for classifying heterogeneous sets of biosequences, compared with the decision rules min(Ξ)(i) and median(Ξ)(i). We analyze the impact of linear and non-linear transformation functions on classifying and clustering collections of heterogeneous sets of biosequences. We also show empirically how the length of a sequence in a simulated heterogeneous sequence-set affects the classification and clustering results in linear and non-linear feature spaces. We propose a new concept: the limiting dispersion map of the existing clusters in heterogeneous sets of biosequences embedded in linear and non-linear feature spaces, which is based on the limiting distribution of nucleotide compositions estimated from real data sets. Finally, empirical conclusions and scientific evidence are drawn from the experiments to support the theory presented in this paper.
format Online
Article
Text
id pubmed-6944050
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Korea Genome Organization
record_format MEDLINE/PubMed
spelling pubmed-6944050 2020-01-09 The extension of the largest generalized-eigenvalue based distance metric D(ij)(γ(1)) in arbitrary feature spaces to classify composite data points Daoud, Mosaab Genomics Inform Original Article
Korea Genome Organization 2019-11-14 /pmc/articles/PMC6944050/ /pubmed/31896239 http://dx.doi.org/10.5808/GI.2019.17.4.e39 Text en (c) 2019, Korea Genome Organization (CC) This is an open-access article distributed under the terms of the Creative Commons Attribution license (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Daoud, Mosaab
The extension of the largest generalized-eigenvalue based distance metric D(ij)(γ(1)) in arbitrary feature spaces to classify composite data points
title The extension of the largest generalized-eigenvalue based distance metric D(ij)(γ(1)) in arbitrary feature spaces to classify composite data points
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6944050/
https://www.ncbi.nlm.nih.gov/pubmed/31896239
http://dx.doi.org/10.5808/GI.2019.17.4.e39