
Robust High-dimensional Bioinformatics Data Streams Mining by ODR-ioVFDT

Outlier detection in bioinformatics data stream mining has received significant attention from research communities in recent years. The problems of how to distinguish noise from an exception and deciding whether to discard it or to devise an extra decision path for accommodating it are causing dil...


Bibliographic Details
Main Authors: Wang, Dantong; Fong, Simon; Wong, Raymond K.; Mohammed, Sabah; Fiaidhi, Jinan; Wong, Kelvin K. L.
Format: Online Article Text
Language: English
Published: Nature Publishing Group 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5322330/
https://www.ncbi.nlm.nih.gov/pubmed/28230161
http://dx.doi.org/10.1038/srep43167
author Wang, Dantong
Fong, Simon
Wong, Raymond K.
Mohammed, Sabah
Fiaidhi, Jinan
Wong, Kelvin K. L.
collection PubMed
description Outlier detection in bioinformatics data stream mining has received significant attention from research communities in recent years. How to distinguish noise from a genuine exception, and whether to discard an exceptional value or to devise an extra decision path to accommodate it, remains a dilemma. In this paper, we propose a novel algorithm called ODR with incrementally Optimized Very Fast Decision Tree (ODR-ioVFDT) for handling outliers in the course of continuous data learning. An adaptive interquartile-range-based identification method sets a tolerance threshold, which is then used to judge whether a data instance with an exceptional value should be included in training. This differs from traditional approaches, in which outlier detection and removal are two separate steps in processing the data. The proposed algorithm is tested on datasets from five bioinformatics scenarios, comparing the performance of our model with that of other models without ODR. The results show that ODR-ioVFDT performs better in classification accuracy, kappa statistics, and time consumption. Applied to bioinformatics streaming data processing for detecting and quantifying information about the life phenomena, states, characters, variables and components of an organism, ODR-ioVFDT can help to diagnose and treat disease more effectively.
format Online
Article
Text
id pubmed-5322330
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Nature Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-5322330 2017-03-01 Robust High-dimensional Bioinformatics Data Streams Mining by ODR-ioVFDT Wang, Dantong Fong, Simon Wong, Raymond K. Mohammed, Sabah Fiaidhi, Jinan Wong, Kelvin K. L. Sci Rep Article Nature Publishing Group 2017-02-23 /pmc/articles/PMC5322330/ /pubmed/28230161 http://dx.doi.org/10.1038/srep43167 Text en Copyright © 2017, The Author(s) http://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
title Robust High-dimensional Bioinformatics Data Streams Mining by ODR-ioVFDT
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5322330/
https://www.ncbi.nlm.nih.gov/pubmed/28230161
http://dx.doi.org/10.1038/srep43167
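
The abstract in this record describes an interquartile-range-based tolerance threshold that decides whether an incoming value is passed to the incremental learner. The Python sketch below illustrates that general idea only; it is not the authors' ODR implementation. The class name IQRStreamFilter, the sliding window, the multiplier k = 1.5 and the warm-up length are all assumptions introduced here for illustration, and the decision tree itself is omitted.

```python
from collections import deque
from statistics import quantiles


class IQRStreamFilter:
    """Hypothetical sliding-window filter using the IQR fence rule.

    A value is accepted when it lies inside
    [Q1 - k * IQR, Q3 + k * IQR], computed over recent values.
    """

    def __init__(self, window=100, k=1.5, warmup=8):
        self.values = deque(maxlen=window)  # most recent observations
        self.k = k                          # fence width (assumed value)
        self.warmup = warmup                # values accepted unconditionally

    def bounds(self):
        # quantiles(..., n=4) returns the three quartile cut points.
        q1, _, q3 = quantiles(self.values, n=4)
        spread = q3 - q1
        return q1 - self.k * spread, q3 + self.k * spread

    def accept(self, x):
        """Return True if x should be forwarded to the learner."""
        if len(self.values) < self.warmup:
            self.values.append(x)
            return True
        low, high = self.bounds()
        ok = low <= x <= high
        # Every value enters the window so the fences can follow
        # gradual changes in the stream (a design assumption).
        self.values.append(x)
        return ok


stream = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 500, 10, 11]
flt = IQRStreamFilter(window=50, k=1.5, warmup=8)
kept = [x for x in stream if flt.accept(x)]
print(kept)
```

Because quartiles are insensitive to a few extreme values, keeping rejected values in the window barely moves the fences, while a sustained shift in the stream gradually widens or relocates them. A stricter variant would store only accepted values, at the cost of adapting more slowly.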