Pearson Correlation-Based Feature Selection for Document Classification Using Balanced Training
Documents are stored in a digital form across several organizations. Printing this amount of data and placing it into folders instead of storing digitally is against the practical, economical, and ecological perspective. An efficient way of retrieving data from digitally stored documents is also req...
Main authors: | Nasir, Inzamam Mashood; Khan, Muhammad Attique; Yasmin, Mussarat; Shah, Jamal Hussain; Gabryel, Marcin; Scherer, Rafał; Damaševičius, Robertas |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7730850/ https://www.ncbi.nlm.nih.gov/pubmed/33261136 http://dx.doi.org/10.3390/s20236793 |
_version_ | 1783621779253624832 |
---|---|
author | Nasir, Inzamam Mashood Khan, Muhammad Attique Yasmin, Mussarat Shah, Jamal Hussain Gabryel, Marcin Scherer, Rafał Damaševičius, Robertas |
author_sort | Nasir, Inzamam Mashood |
collection | PubMed |
description | Documents are stored in a digital form across several organizations. Printing this amount of data and placing it into folders instead of storing digitally is against the practical, economical, and ecological perspective. An efficient way of retrieving data from digitally stored documents is also required. This article presents a real-time supervised learning technique for document classification based on deep convolutional neural network (DCNN), which aims to reduce the impact of adverse document image issues such as signatures, marks, logo, and handwritten notes. The proposed technique’s major steps include data augmentation, feature extraction using pre-trained neural network models, feature fusion, and feature selection. We propose a novel data augmentation technique, which normalizes the imbalanced dataset using the secondary dataset RVL-CDIP. The DCNN features are extracted using the VGG19 and AlexNet networks. The extracted features are fused, and the fused feature vector is optimized by applying a Pearson correlation coefficient-based technique to select the optimized features while removing the redundant features. The proposed technique is tested on the Tobacco3482 dataset, which gives a classification accuracy of 93.1% using a cubic support vector machine classifier, proving the validity of the proposed technique. |
format | Online Article Text |
id | pubmed-7730850 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-77308502020-12-12 Pearson Correlation-Based Feature Selection for Document Classification Using Balanced Training Nasir, Inzamam Mashood Khan, Muhammad Attique Yasmin, Mussarat Shah, Jamal Hussain Gabryel, Marcin Scherer, Rafał Damaševičius, Robertas Sensors (Basel) Article Documents are stored in a digital form across several organizations. Printing this amount of data and placing it into folders instead of storing digitally is against the practical, economical, and ecological perspective. An efficient way of retrieving data from digitally stored documents is also required. This article presents a real-time supervised learning technique for document classification based on deep convolutional neural network (DCNN), which aims to reduce the impact of adverse document image issues such as signatures, marks, logo, and handwritten notes. The proposed technique’s major steps include data augmentation, feature extraction using pre-trained neural network models, feature fusion, and feature selection. We propose a novel data augmentation technique, which normalizes the imbalanced dataset using the secondary dataset RVL-CDIP. The DCNN features are extracted using the VGG19 and AlexNet networks. The extracted features are fused, and the fused feature vector is optimized by applying a Pearson correlation coefficient-based technique to select the optimized features while removing the redundant features. The proposed technique is tested on the Tobacco3482 dataset, which gives a classification accuracy of 93.1% using a cubic support vector machine classifier, proving the validity of the proposed technique. MDPI 2020-11-27 /pmc/articles/PMC7730850/ /pubmed/33261136 http://dx.doi.org/10.3390/s20236793 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Nasir, Inzamam Mashood Khan, Muhammad Attique Yasmin, Mussarat Shah, Jamal Hussain Gabryel, Marcin Scherer, Rafał Damaševičius, Robertas Pearson Correlation-Based Feature Selection for Document Classification Using Balanced Training |
title | Pearson Correlation-Based Feature Selection for Document Classification Using Balanced Training |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7730850/ https://www.ncbi.nlm.nih.gov/pubmed/33261136 http://dx.doi.org/10.3390/s20236793 |
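
The description above mentions fusing the extracted features and then using a Pearson correlation coefficient-based step to remove redundant features. The record does not say how the fusion or the selection is carried out, so the following is only a minimal sketch under stated assumptions: fusion is taken to be simple concatenation of the two feature vectors per sample, and a feature is treated as redundant when its absolute Pearson correlation with an already kept feature reaches a threshold. The function names, the threshold of 0.9, and the toy values are all hypothetical and are not taken from the article.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Constant feature: the coefficient is undefined; this sketch uses 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def fuse(features_a: Sequence[Sequence[float]],
         features_b: Sequence[Sequence[float]]) -> List[List[float]]:
    """Assumed fusion: concatenate the two feature vectors of each sample."""
    return [list(a) + list(b) for a, b in zip(features_a, features_b)]


def select_features(samples: Sequence[Sequence[float]],
                    threshold: float = 0.9) -> List[int]:
    """Return indices of feature columns kept after dropping redundant ones.

    A column is kept only if its absolute correlation with every column
    kept so far is below the (assumed) threshold.
    """
    columns = list(zip(*samples))
    kept: List[int] = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept


# Toy stand-ins for features from two networks (four samples, two features each).
net_a = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
net_b = [[0.5, 3.0], [1.0, 1.0], [1.5, 4.0], [2.0, 2.0]]

fused = fuse(net_a, net_b)
kept = select_features(fused, threshold=0.9)
reduced = [[row[j] for j in kept] for row in fused]
print(kept)
```

In this toy input, columns 1 and 2 are nearly linear copies of column 0, so only columns 0 and 3 survive. The reduced vectors would then be passed to a classifier; the description names a cubic support vector machine for that step, which is not shown here.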