Personal Health Information Inference Using Machine Learning on RNA Expression Data from Patients With Cancer: Algorithm Validation Study
BACKGROUND: As the need for sharing genomic data grows, privacy issues and concerns, such as the ethics surrounding data sharing and disclosure of personal information, are raised. OBJECTIVE: The main purpose of this study was to verify whether genomic data is sufficient to predict a patient's...
Main authors: | Kweon, Solbi; Lee, Jeong Hoon; Lee, Younghee; Park, Yu Rang |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7445622/ https://www.ncbi.nlm.nih.gov/pubmed/32773372 http://dx.doi.org/10.2196/18387 |
_version_ | 1783574022203637760 |
---|---|
author | Kweon, Solbi Lee, Jeong Hoon Lee, Younghee Park, Yu Rang |
author_facet | Kweon, Solbi Lee, Jeong Hoon Lee, Younghee Park, Yu Rang |
author_sort | Kweon, Solbi |
collection | PubMed |
description | BACKGROUND: As the need for sharing genomic data grows, privacy issues and concerns, such as the ethics surrounding data sharing and disclosure of personal information, are raised. OBJECTIVE: The main purpose of this study was to verify whether genomic data is sufficient to predict a patient's personal information. METHODS: RNA expression data and matched patient personal information were collected from 9538 patients in The Cancer Genome Atlas program. Five personal information variables (age, gender, race, cancer type, and cancer stage) were recorded for each patient. Four different machine learning algorithms (support vector machine, decision tree, random forest, and artificial neural network) were used to determine whether a patient's personal information could be accurately predicted from RNA expression data. Performance measurement of the prediction models was based on the accuracy and area under the receiver operating characteristic curve. We selected five cancer types (breast carcinoma, kidney renal clear cell carcinoma, head and neck squamous cell carcinoma, low-grade glioma, and lung adenocarcinoma) with large sample sizes to verify whether predictive accuracy would differ between them. We also validated the efficacy of our four machine learning models in analyzing normal samples from 593 cancer patients. RESULTS: In most samples, personal information with high genetic relevance, such as gender and cancer type, could be predicted from RNA expression data alone. The prediction accuracies for gender and cancer type, which were the best models, were 0.93-0.99 and 0.78-0.94, respectively. Other aspects of personal information, such as age, race, and cancer stage, were difficult to predict from RNA expression data, with accuracies ranging from 0.0026-0.29, 0.76-0.96, and 0.45-0.79, respectively.
Among the tested machine learning methods, the highest predictive accuracy was obtained using the support vector machine algorithm (mean accuracy 0.77), while the lowest accuracy was obtained using the random forest method (mean accuracy 0.65). Gender and race were predicted more accurately than other variables in the samples. On average, the accuracy of cancer stage prediction ranged between 0.71-0.67, while the age prediction accuracy ranged between 0.18-0.23 for the five cancer types. CONCLUSIONS: We attempted to predict patient information using RNA expression data. We found that some identifiers could be predicted, but most others could not. This study showed that personal information available from RNA expression data is limited and this information cannot be used to identify specific patients. |
format | Online Article Text |
id | pubmed-7445622 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-7445622 2020-08-31 Personal Health Information Inference Using Machine Learning on RNA Expression Data from Patients With Cancer: Algorithm Validation Study Kweon, Solbi Lee, Jeong Hoon Lee, Younghee Park, Yu Rang J Med Internet Res Original Paper JMIR Publications 2020-08-10 /pmc/articles/PMC7445622/ /pubmed/32773372 http://dx.doi.org/10.2196/18387 Text en ©Solbi Kweon, Jeong Hoon Lee, Younghee Lee, Yu Rang Park. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 10.08.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Kweon, Solbi Lee, Jeong Hoon Lee, Younghee Park, Yu Rang Personal Health Information Inference Using Machine Learning on RNA Expression Data from Patients With Cancer: Algorithm Validation Study |
title | Personal Health Information Inference Using Machine Learning on RNA Expression Data from Patients With Cancer: Algorithm Validation Study |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7445622/ https://www.ncbi.nlm.nih.gov/pubmed/32773372 http://dx.doi.org/10.2196/18387 |
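
The abstract reports model performance as accuracy and area under the receiver operating characteristic curve (AUC). As a reading aid, the sketch below shows how these two measures can be computed for a binary attribute such as gender. It is not the authors' code: the function names `accuracy` and `roc_auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the AUC is computed with the pairwise (Mann-Whitney) formulation rather than whatever library the study used.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Binary AUC: probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative sample.
    Tied scores count as half a win."""
    if len(y_true) != len(scores):
        raise ValueError("inputs must be of equal length")
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data: 1 and 0 stand for the two classes of a binary
    # attribute; scores stand for a classifier's predicted probability of class 1.
    y_true = [1, 0, 1, 1, 0, 0]
    scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
    # Assumed decision threshold of 0.5 to turn scores into hard labels.
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print("accuracy:", accuracy(y_true, y_pred))
    print("AUC:", roc_auc(y_true, scores))
```

Accuracy depends on the chosen threshold, while AUC depends only on how the scores rank positives against negatives, which is one reason the two measures are often reported together. For multi-class attributes such as cancer type, AUC needs an extension such as one-vs-rest averaging, which this sketch does not cover.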