
Application of unsupervised analysis techniques to lung cancer patient data

This study applies unsupervised machine learning techniques for classification and clustering to a collection of descriptive variables from 10,442 lung cancer patient records in the Surveillance, Epidemiology, and End Results (SEER) program database. The goal is to automatically classify lung cancer...

Bibliographic Details
Main Authors: Lynch, Chip M., van Berkel, Victor H., Frieboes, Hermann B.
Format: Online Article Text
Language: English
Published: Public Library of Science 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5598970/
https://www.ncbi.nlm.nih.gov/pubmed/28910336
http://dx.doi.org/10.1371/journal.pone.0184370
author Lynch, Chip M.
van Berkel, Victor H.
Frieboes, Hermann B.
collection PubMed
description This study applies unsupervised machine learning techniques for classification and clustering to a collection of descriptive variables from 10,442 lung cancer patient records in the Surveillance, Epidemiology, and End Results (SEER) program database. The goal is to automatically classify lung cancer patients into groups based on clinically measurable disease-specific variables in order to estimate survival. Variables selected as inputs for machine learning include Number of Primaries, Age, Grade, Tumor Size, Stage, and TNM, which are numeric or can readily be converted to numeric type. Minimal up-front processing of the data enables exploring the out-of-the-box capabilities of established unsupervised learning techniques, with little human intervention through the entire process. The output of the techniques is used to predict survival time, with the efficacy of the prediction representing a proxy for the usefulness of the classification. A basic single variable linear regression against each unsupervised output is applied, and the associated Root Mean Squared Error (RMSE) value is calculated as a metric to compare between the outputs. The results show that self-ordering maps exhibit the best performance, while k-Means performs the best of the simpler classification techniques. Predicting against the full data set, it is found that their respective RMSE values (15.591 for self-ordering maps and 16.193 for k-Means) are comparable to supervised regression techniques, such as Gradient Boosting Machine (RMSE of 15.048). We conclude that unsupervised data analysis techniques may be of use to classify patients by defining the classes as effective proxies for survival prediction.
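The evaluation idea in the description (cluster the records, regress survival on the cluster output alone, compare by RMSE) can be illustrated with a minimal sketch. This is not the authors' code: the records are synthetic, the three features and the survival formula are invented for illustration, `k=4` is an arbitrary choice, and the helper names `kmeans`, `fit_line`, and `rmse` are hypothetical. It uses only the Python standard library.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Synthetic stand-in for patient records (ASSUMPTION: not SEER data).
rng = random.Random(42)
features, survival = [], []
for _ in range(200):
    stage = rng.randint(1, 4)
    age = rng.uniform(40, 85)
    size = rng.uniform(5, 30) * stage
    features.append((age, size, float(stage)))
    survival.append(max(0.0, 60 - 12 * stage + rng.gauss(0, 5)))

# Unsupervised step: the survival values are never shown to k-means.
labels = kmeans(features, k=4)

# Single-variable regression: the cluster label is the only predictor.
x = [float(lab) for lab in labels]
slope, intercept = fit_line(x, survival)
predictions = [slope * xi + intercept for xi in x]

score = rmse(survival, predictions)
baseline = rmse(survival, [sum(survival) / len(survival)] * len(survival))
print(f"RMSE with cluster label as predictor: {score:.3f} (mean-only baseline: {baseline:.3f})")
```

Treating the cluster label as a numeric predictor mirrors the "single variable linear regression against each unsupervised output" wording, but label numbering from k-means is arbitrary, so a per-cluster mean would be a natural alternative design. The mean-only baseline is included only to give the RMSE a point of comparison.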
format Online
Article
Text
id pubmed-5598970
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5598970 2017-09-22 Application of unsupervised analysis techniques to lung cancer patient data Lynch, Chip M. van Berkel, Victor H. Frieboes, Hermann B. PLoS One Research Article
Public Library of Science 2017-09-14 /pmc/articles/PMC5598970/ /pubmed/28910336 http://dx.doi.org/10.1371/journal.pone.0184370 Text en © 2017 Lynch et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Application of unsupervised analysis techniques to lung cancer patient data
topic Research Article