
Imbalanced classification for protein subcellular localization with multilabel oversampling

MOTIVATION: Subcellular localization of human proteins is essential to comprehend their functions and roles in physiological processes, which in turn helps in diagnostic and prognostic studies of pathological conditions and impacts clinical decision-making. Since proteins reside at multiple location...


Bibliographic Details
Main authors: Rana, Priyanka; Sowmya, Arcot; Meijering, Erik; Song, Yang
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9825308/
https://www.ncbi.nlm.nih.gov/pubmed/36579866
http://dx.doi.org/10.1093/bioinformatics/btac841
author Rana, Priyanka
Sowmya, Arcot
Meijering, Erik
Song, Yang
collection PubMed
description MOTIVATION: Subcellular localization of human proteins is essential to comprehend their functions and roles in physiological processes, which in turn helps in diagnostic and prognostic studies of pathological conditions and impacts clinical decision-making. Since proteins reside at multiple locations at the same time and a few subcellular locations host far more proteins than other locations, the computational task for their subcellular localization is to train a multilabel classifier while handling data imbalance. In imbalanced data, minority classes are underrepresented, thus leading to a heavy bias towards the majority classes and the degradation of predictive capability for the minority classes. Furthermore, data imbalance in multilabel settings is an even more complex problem due to the coexistence of majority and minority classes. RESULTS: Our studies reveal that based on the extent of concurrence of majority and minority classes, oversampling of minority samples through appropriate data augmentation techniques holds promising scope for boosting the classification performance for the minority classes. We measured the magnitude of data imbalance per class and the concurrence of majority and minority classes in the dataset. Based on the obtained values, we identified minority and medium classes, and a new oversampling method is proposed that includes non-linear mixup, geometric and colour transformations for data augmentation and a sampling approach to prepare minibatches. Performance evaluation on the Human Protein Atlas Kaggle challenge dataset shows that the proposed method is capable of achieving better predictions for minority classes than existing methods. AVAILABILITY AND IMPLEMENTATION: Data used in this study are available at https://www.kaggle.com/competitions/human-protein-atlas-image-classification/data. Source code is available at https://github.com/priyarana/Protein-subcellular-localisation-method.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
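The abstract says the authors measured the imbalance of each class and the concurrence of majority and minority classes, but this record does not name the measures they used. As an illustration only, the sketch below uses two measures from the multilabel-learning literature: the per-label imbalance ratio (IRLbl) and the per-sample SCUMBLE concurrence score. The toy label matrix `Y`, the helper names, and the "IRLbl above the mean" rule for flagging minority labels are all assumptions made for this example, not details taken from the paper.

```python
from math import prod

def label_counts(Y):
    """Number of positive samples per label (column sums)."""
    return [sum(col) for col in zip(*Y)]

def irlbl(Y):
    """Per-label imbalance ratio: count of the most frequent label
    divided by the count of this label (1.0 for the majority label)."""
    counts = label_counts(Y)
    top = max(counts)
    return [top / c if c else float("inf") for c in counts]

def scumble(Y, ir):
    """Per-sample concurrence score in [0, 1): 0 when all active labels
    are equally imbalanced, larger when frequent and rare labels co-occur."""
    scores = []
    for row in Y:
        active = [ir[j] for j, y in enumerate(row) if y]
        if not active:
            scores.append(0.0)
            continue
        mean_ir = sum(active) / len(active)
        geo_mean = prod(active) ** (1 / len(active))
        scores.append(1 - geo_mean / mean_ir)
    return scores

# Hypothetical toy data: 6 samples, 3 labels (label 0 is the majority).
Y = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
]

ir = irlbl(Y)
mean_ir = sum(ir) / len(ir)
# Assumed rule of thumb: a label is "minority" if its IRLbl exceeds the mean.
minority = [j for j, v in enumerate(ir) if v > mean_ir]
scores = scumble(Y, ir)

print("IRLbl per label:", ir)
print("Minority labels:", minority)
print("SCUMBLE per sample:", [round(s, 3) for s in scores])
```

In this toy data, the sample that pairs the majority label with the rarest label gets the highest concurrence score. That is the situation the abstract describes as making multilabel imbalance harder to correct by oversampling.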
format Online
Article
Text
id pubmed-9825308
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9825308 2023-01-10 Imbalanced classification for protein subcellular localization with multilabel oversampling Rana, Priyanka Sowmya, Arcot Meijering, Erik Song, Yang Bioinformatics Original Paper Oxford University Press 2022-12-29 /pmc/articles/PMC9825308/ /pubmed/36579866 http://dx.doi.org/10.1093/bioinformatics/btac841 Text en © The Author(s) 2022. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Imbalanced classification for protein subcellular localization with multilabel oversampling
topic Original Paper