
Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling)

MOTIVATION: The size of today’s biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data by downsampling is therefore a common early step in data processing, often performed as rando...


Bibliographic Details
Main authors: Lötsch, Jörn, Malkusch, Sebastian, Ultsch, Alfred
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8341664/
https://www.ncbi.nlm.nih.gov/pubmed/34352006
http://dx.doi.org/10.1371/journal.pone.0255838
author Lötsch, Jörn
Malkusch, Sebastian
Ultsch, Alfred
collection PubMed
description MOTIVATION: The size of today’s biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data by downsampling is therefore a common early step in data processing, often performed as random uniform class-proportional downsampling. In this report, we hypothesized that this can be optimized to obtain samples that better reflect the entire data set than those obtained using the current standard method. RESULTS: By repeating the random sampling and comparing the distribution of the drawn sample with the distribution of the original data, it was possible to establish a method for obtaining subsets of data that better reflect the entire data set than taking only the first randomly selected subsample, as is the current standard. Experiments on artificial and real biomedical data sets showed that the reconstruction of the remaining data from the original data set from the downsampled data improved significantly. This was observed with both principal component analysis and autoencoding neural networks. The fidelity was dependent on both the number of cases drawn from the original and the number of samples drawn. CONCLUSIONS: Optimal distribution-preserving class-proportional downsampling yields data subsets that reflect the structure of the entire data better than those obtained with the standard method. By using distributional similarity as the only selection criterion, the proposed method does not in any way affect the results of a later planned analysis.
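The procedure summarized in the description (draw a class-proportional random sample several times, compare each draw with the full data, keep the draw that matches best) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' opdisDownsampling implementation: the abstract does not name the similarity measure, so the two-sample Kolmogorov-Smirnov statistic, taken as the worst case over features, is assumed here, and all function and parameter names are hypothetical.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def ks_statistic(sample, reference):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample), sorted(reference)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def class_proportional_draw(labels, fraction, rng):
    """Draw the same fraction of row indices from every class, uniformly at random."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        k = max(1, round(fraction * len(members)))
        chosen.extend(rng.sample(members, k))  # sampling without replacement
    return sorted(chosen)


def distribution_gap(rows, indices):
    """Worst per-feature KS statistic between the subsample and the full data.

    Assumption: the abstract only says distributions are compared;
    this particular score is chosen for illustration.
    """
    n_features = len(rows[0])
    return max(
        ks_statistic([rows[i][j] for i in indices], [row[j] for row in rows])
        for j in range(n_features)
    )


def best_of_n_downsample(rows, labels, fraction, n_trials, seed=0):
    """Repeat the random draw n_trials times and keep the closest-matching draw.

    Returns (indices, best_gap, first_gap); first_gap is the score of the
    first draw, i.e. what a single random downsampling would have produced.
    """
    rng = random.Random(seed)
    best_indices, best_gap, first_gap = None, float("inf"), None
    for _ in range(n_trials):
        indices = class_proportional_draw(labels, fraction, rng)
        gap = distribution_gap(rows, indices)
        if first_gap is None:
            first_gap = gap
        if gap < best_gap:
            best_indices, best_gap = indices, gap
    return best_indices, best_gap, first_gap


if __name__ == "__main__":
    data_rng = random.Random(42)
    rows = [[data_rng.gauss(0, 1), data_rng.gauss(5, 2)] for _ in range(300)]
    labels = [0] * 200 + [1] * 100
    indices, gap, first_gap = best_of_n_downsample(rows, labels, 0.1, 50)
    print(len(indices), round(gap, 3), round(first_gap, 3))
```

Because the selection score looks only at how closely the subsample's feature distributions match those of the full data, and never at the outcome of a downstream analysis, the sketch is consistent with the abstract's remark that distributional similarity is the only selection criterion.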
format Online
Article
Text
id pubmed-8341664
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8341664 2021-08-06 Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling) Lötsch, Jörn Malkusch, Sebastian Ultsch, Alfred PLoS One Research Article
Public Library of Science 2021-08-05 /pmc/articles/PMC8341664/ /pubmed/34352006 http://dx.doi.org/10.1371/journal.pone.0255838 Text en © 2021 Lötsch et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling)
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8341664/
https://www.ncbi.nlm.nih.gov/pubmed/34352006
http://dx.doi.org/10.1371/journal.pone.0255838