Maximizing the reusability of gene expression data by predicting missing metadata
Reusability is part of the FAIR data principle, which aims to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary me...
Main Authors: | Lung, Pei-Yau; Zhong, Dongrui; Pang, Xiaodong; Li, Yan; Zhang, Jinfeng |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2020 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673503/ https://www.ncbi.nlm.nih.gov/pubmed/33156882 http://dx.doi.org/10.1371/journal.pcbi.1007450 |
_version_ | 1783611331227680768 |
---|---|
author | Lung, Pei-Yau Zhong, Dongrui Pang, Xiaodong Li, Yan Zhang, Jinfeng |
author_facet | Lung, Pei-Yau Zhong, Dongrui Pang, Xiaodong Li, Yan Zhang, Jinfeng |
author_sort | Lung, Pei-Yau |
collection | PubMed |
description | Reusability is part of the FAIR data principle, which aims to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should only use the subset of data, which can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically-designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses. |
format | Online Article Text |
id | pubmed-7673503 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-76735032020-11-19 Maximizing the reusability of gene expression data by predicting missing metadata Lung, Pei-Yau Zhong, Dongrui Pang, Xiaodong Li, Yan Zhang, Jinfeng PLoS Comput Biol Research Article Reusability is part of the FAIR data principle, which aims to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should only use the subset of data, which can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically-designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses. 
Public Library of Science 2020-11-06 /pmc/articles/PMC7673503/ /pubmed/33156882 http://dx.doi.org/10.1371/journal.pcbi.1007450 Text en © 2020 Lung et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Lung, Pei-Yau Zhong, Dongrui Pang, Xiaodong Li, Yan Zhang, Jinfeng Maximizing the reusability of gene expression data by predicting missing metadata |
title | Maximizing the reusability of gene expression data by predicting missing metadata |
title_full | Maximizing the reusability of gene expression data by predicting missing metadata |
title_fullStr | Maximizing the reusability of gene expression data by predicting missing metadata |
title_full_unstemmed | Maximizing the reusability of gene expression data by predicting missing metadata |
title_short | Maximizing the reusability of gene expression data by predicting missing metadata |
title_sort | maximizing the reusability of gene expression data by predicting missing metadata |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673503/ https://www.ncbi.nlm.nih.gov/pubmed/33156882 http://dx.doi.org/10.1371/journal.pcbi.1007450 |