
Maximizing the reusability of gene expression data by predicting missing metadata

Reusability is one of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One current effort to increase the reusability of public genomics data focuses on including quality metadata with the data. When necessary me...


Bibliographic Details
Main authors:	Lung, Pei-Yau; Zhong, Dongrui; Pang, Xiaodong; Li, Yan; Zhang, Jinfeng
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2020
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673503/ https://www.ncbi.nlm.nih.gov/pubmed/33156882 http://dx.doi.org/10.1371/journal.pcbi.1007450

author	Lung, Pei-Yau; Zhong, Dongrui; Pang, Xiaodong; Li, Yan; Zhang, Jinfeng
collection	PubMed
description	Reusability is one of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One current effort to increase the reusability of public genomics data focuses on including quality metadata with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when predicted data are used in other analyses, it is not optimal to use all of them. Instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach performed better than pipelines using common metrics such as F1-score in maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
format	Online Article Text
id	pubmed-7673503
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-7673503 2020-11-19 PLoS Comput Biol Research Article Public Library of Science 2020-11-06 /pmc/articles/PMC7673503/ /pubmed/33156882 http://dx.doi.org/10.1371/journal.pcbi.1007450 Text en © 2020 Lung et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Maximizing the reusability of gene expression data by predicting missing metadata
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673503/ https://www.ncbi.nlm.nih.gov/pubmed/33156882 http://dx.doi.org/10.1371/journal.pcbi.1007450
