Application of unsupervised analysis techniques to lung cancer patient data
This study applies unsupervised machine learning techniques for classification and clustering to a collection of descriptive variables from 10,442 lung cancer patient records in the Surveillance, Epidemiology, and End Results (SEER) program database. The goal is to automatically classify lung cancer...
Main authors: | Lynch, Chip M.; van Berkel, Victor H.; Frieboes, Hermann B. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2017 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5598970/ https://www.ncbi.nlm.nih.gov/pubmed/28910336 http://dx.doi.org/10.1371/journal.pone.0184370 |
_version_ | 1783264010950410240 |
---|---|
author | Lynch, Chip M.; van Berkel, Victor H.; Frieboes, Hermann B. |
author_sort | Lynch, Chip M. |
collection | PubMed |
description | This study applies unsupervised machine learning techniques for classification and clustering to a collection of descriptive variables from 10,442 lung cancer patient records in the Surveillance, Epidemiology, and End Results (SEER) program database. The goal is to automatically classify lung cancer patients into groups based on clinically measurable disease-specific variables in order to estimate survival. Variables selected as inputs for machine learning include Number of Primaries, Age, Grade, Tumor Size, Stage, and TNM, which are numeric or can readily be converted to numeric type. Minimal up-front processing of the data enables exploring the out-of-the-box capabilities of established unsupervised learning techniques, with little human intervention through the entire process. The output of the techniques is used to predict survival time, with the efficacy of the prediction representing a proxy for the usefulness of the classification. A basic single variable linear regression against each unsupervised output is applied, and the associated Root Mean Squared Error (RMSE) value is calculated as a metric to compare between the outputs. The results show that self-ordering maps exhibit the best performance, while k-Means performs the best of the simpler classification techniques. Predicting against the full data set, it is found that their respective RMSE values (15.591 for self-ordering maps and 16.193 for k-Means) are comparable to supervised regression techniques, such as Gradient Boosting Machine (RMSE of 15.048). We conclude that unsupervised data analysis techniques may be of use to classify patients by defining the classes as effective proxies for survival prediction. |
format | Online Article Text |
id | pubmed-5598970 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-5598970 2017-09-22 Application of unsupervised analysis techniques to lung cancer patient data Lynch, Chip M.; van Berkel, Victor H.; Frieboes, Hermann B. PLoS One Research Article Public Library of Science 2017-09-14 /pmc/articles/PMC5598970/ /pubmed/28910336 http://dx.doi.org/10.1371/journal.pone.0184370 Text en © 2017 Lynch et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
title | Application of unsupervised analysis techniques to lung cancer patient data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5598970/ https://www.ncbi.nlm.nih.gov/pubmed/28910336 http://dx.doi.org/10.1371/journal.pone.0184370 |
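
The evaluation pipeline summarized in the description field (cluster the records without labels, regress survival on the cluster output with a single-variable linear model, and compare RMSE values) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the data are synthetic, the two features (`age`, `size`), the survival formula, `k=3`, and all function names are assumptions made for the example, and the raw cluster index is used as the single predictor even though its numeric ordering is arbitrary.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise centers from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx if sxx else 0.0  # constant predictor: fall back to the mean
    return my - b * mx, b


def rmse(y_true, y_pred):
    """Root Mean Squared Error between observed and predicted values."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Synthetic stand-in for patient records (hypothetical, not SEER data).
rng = random.Random(42)
patients, survival = [], []
for _ in range(200):
    age = rng.uniform(40, 85)   # years
    size = rng.uniform(5, 80)   # tumor size, arbitrary units
    patients.append((age, size))
    survival.append(max(0.0, 60 - 0.3 * age - 0.4 * size + rng.gauss(0, 5)))

labels = kmeans(patients, k=3)              # unsupervised output
a, b = fit_line(labels, survival)           # survival ~ cluster index
score = rmse(survival, [a + b * lab for lab in labels])

# Reference point: predicting the mean survival for everyone.
mean_survival = sum(survival) / len(survival)
baseline = rmse(survival, [mean_survival] * len(survival))
print(f"RMSE using cluster index: {score:.3f}; mean-only baseline: {baseline:.3f}")
```

A lower RMSE than the mean-only baseline suggests the clusters carry some information about survival, which is the sense in which the record describes prediction quality as a proxy for the usefulness of the classification.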