Pearson Correlation-Based Feature Selection for Document Classification Using Balanced Training

Documents are stored in digital form across several organizations. Printing this volume of data and filing it in folders instead of storing it digitally is impractical, uneconomical, and ecologically unsound. An efficient way of retrieving data from digitally stored documents is also req...

Bibliographic Details
Main authors: Nasir, Inzamam Mashood, Khan, Muhammad Attique, Yasmin, Mussarat, Shah, Jamal Hussain, Gabryel, Marcin, Scherer, Rafał, Damaševičius, Robertas
Format: Online Article Text
Language: English
Published: MDPI 2020
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7730850/
https://www.ncbi.nlm.nih.gov/pubmed/33261136
http://dx.doi.org/10.3390/s20236793
author Nasir, Inzamam Mashood
Khan, Muhammad Attique
Yasmin, Mussarat
Shah, Jamal Hussain
Gabryel, Marcin
Scherer, Rafał
Damaševičius, Robertas
collection PubMed
description Documents are stored in digital form across several organizations. Printing this volume of data and filing it in folders instead of storing it digitally is impractical, uneconomical, and ecologically unsound. An efficient way of retrieving data from digitally stored documents is also required. This article presents a real-time supervised learning technique for document classification based on a deep convolutional neural network (DCNN), which aims to reduce the impact of adverse document image issues such as signatures, marks, logos, and handwritten notes. The proposed technique's major steps are data augmentation, feature extraction using pre-trained neural network models, feature fusion, and feature selection. We propose a novel data augmentation technique, which normalizes the imbalanced dataset using the secondary dataset RVL-CDIP. The DCNN features are extracted using the VGG19 and AlexNet networks. The extracted features are fused, and the fused feature vector is optimized by applying a Pearson correlation coefficient-based technique that selects an optimized feature subset while removing redundant features. The proposed technique is tested on the Tobacco3482 dataset, where it achieves a classification accuracy of 93.1% using a cubic support vector machine classifier, supporting the validity of the proposed technique.
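The fusion and selection steps named in the description can be illustrated with a minimal sketch. The record does not say how the features are fused or how the Pearson coefficient is applied, so everything here is an assumption: fusion is taken to be plain column-wise concatenation, a feature is treated as redundant when its absolute Pearson correlation with an already kept feature exceeds a threshold, and the function names `fuse_features` and `select_by_pearson` and the threshold value 0.9 are hypothetical. The random matrices stand in for VGG19 and AlexNet features and are not real data.

```python
import numpy as np


def fuse_features(feats_a, feats_b):
    """Serially concatenate two (n_samples, n_features) matrices.

    Assumption: fusion is plain column-wise concatenation.
    """
    return np.concatenate([feats_a, feats_b], axis=1)


def select_by_pearson(features, threshold=0.9):
    """Return indices of the columns kept after redundancy pruning.

    Assumption: a column is redundant when its absolute Pearson correlation
    with an already kept column exceeds `threshold` (a hypothetical value).
    """
    # Constant columns have undefined correlation, so drop them up front.
    candidates = np.flatnonzero(features.std(axis=0) > 0)
    if candidates.size == 0:
        return candidates
    corr = np.atleast_2d(
        np.abs(np.corrcoef(features[:, candidates], rowvar=False))
    )
    kept = []
    for i in range(candidates.size):
        # Keep column i only if no kept column is strongly correlated with it.
        if all(corr[i, j] <= threshold for j in kept):
            kept.append(i)
    return candidates[kept]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for the two networks' features (illustration only).
    vgg_like = rng.normal(size=(50, 3))
    alex_like = np.column_stack(
        [2.0 * vgg_like[:, 0] + 1.0, rng.normal(size=50)]
    )
    fused = fuse_features(vgg_like, alex_like)
    print(select_by_pearson(fused))
```

The greedy pass keeps the first column of each highly correlated group, so the result depends on column order. The selected columns would then be passed to a classifier such as the cubic support vector machine mentioned in the description.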
format Online
Article
Text
id pubmed-7730850
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7730850 2020-12-12 Sensors (Basel) Article MDPI 2020-11-27 /pmc/articles/PMC7730850/ /pubmed/33261136 http://dx.doi.org/10.3390/s20236793 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
topic Article