Maximizing the reusability of gene expression data by predicting missing metadata

Bibliographic Details
Main Authors: Lung, Pei-Yau; Zhong, Dongrui; Pang, Xiaodong; Li, Yan; Zhang, Jinfeng
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673503/
https://www.ncbi.nlm.nih.gov/pubmed/33156882
http://dx.doi.org/10.1371/journal.pcbi.1007450
author Lung, Pei-Yau
Zhong, Dongrui
Pang, Xiaodong
Li, Yan
Zhang, Jinfeng
collection PubMed
description Reusability is part of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One current effort to increase the reusability of public genomics data focuses on including high-quality metadata with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
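The abstract names the PCAP metric but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: rank cases by prediction confidence and report the largest fraction of cases that can be kept while the accuracy of the kept cases stays at or above a target. The function name `pcap`, its parameters, and the default target of 0.95 are hypothetical and are not taken from the article.

```python
from typing import Sequence


def pcap(confidences: Sequence[float],
         correct: Sequence[bool],
         target_accuracy: float = 0.95) -> float:
    """Hypothetical PCAP: largest fraction of cases, taken in order of
    decreasing confidence, whose accuracy is at least target_accuracy.

    This definition is an assumption for illustration, not the authors'.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    # Rank cases from the most to the least confident prediction.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    best = 0
    hits = 0
    for k, i in enumerate(order, start=1):
        hits += bool(correct[i])
        # Remember the largest prefix whose accuracy meets the target.
        if hits / k >= target_accuracy:
            best = k
    return best / n


# Toy validation set: five cases with made-up confidences and outcomes.
example_conf = [0.99, 0.95, 0.90, 0.60, 0.55]
example_correct = [True, True, True, False, True]
example_score = pcap(example_conf, example_correct, target_accuracy=0.9)
```

Under this reading, only the three most confident cases in the toy data would be kept at a 0.9 accuracy target, which mirrors the abstract's point that downstream analyses should use the accurately predicted subset rather than every prediction.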
format Online
Article
Text
id pubmed-7673503
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7673503 2020-11-19 Maximizing the reusability of gene expression data by predicting missing metadata Lung, Pei-Yau; Zhong, Dongrui; Pang, Xiaodong; Li, Yan; Zhang, Jinfeng PLoS Comput Biol Research Article Public Library of Science 2020-11-06 /pmc/articles/PMC7673503/ /pubmed/33156882 http://dx.doi.org/10.1371/journal.pcbi.1007450 Text en © 2020 Lung et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Maximizing the reusability of gene expression data by predicting missing metadata
topic Research Article