Patient-Level Cancer Prediction Models From a Nationwide Patient Cohort: Model Development and Validation
BACKGROUND: Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models at the patient level, and claim data are one of the more useful resources to this end. To avoid unnecessary diagnostic intervention after cancer screening tests, patient-level predicti...
Main authors: | Lee, Eunsaem; Jung, Se Young; Hwang, Hyung Ju; Jung, Jaewoo |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications 2021 |
Subjects: | Original Paper |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8438609/ https://www.ncbi.nlm.nih.gov/pubmed/34459743 http://dx.doi.org/10.2196/29807 |
_version_ | 1783752382324146176 |
---|---|
author | Lee, Eunsaem; Jung, Se Young; Hwang, Hyung Ju; Jung, Jaewoo |
author_facet | Lee, Eunsaem; Jung, Se Young; Hwang, Hyung Ju; Jung, Jaewoo |
author_sort | Lee, Eunsaem |
collection | PubMed |
description | BACKGROUND: Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models at the patient level, and claim data are one of the more useful resources to this end. To avoid unnecessary diagnostic intervention after cancer screening tests, patient-level prediction models should be developed. OBJECTIVE: We aimed to develop cancer prediction models using nationwide claim databases with machine learning algorithms, which are explainable and easily applicable in real-world environments. METHODS: As source data, we used the Korean National Insurance System Database. Every Korean aged ≥40 years undergoes a national health checkup every 2 years. We gathered all variables from the database, including demographic information, basic laboratory values, anthropometric values, and previous medical history. We applied conventional logistic regression methods, light gradient boosting methods, neural networks, survival analysis, and one-class embedding classifier methods to effectively analyze high-dimensional data based on deep learning–based anomaly detection. Performance was measured with area under the curve and area under the precision-recall curve. We validated our models externally with a health checkup database from a tertiary hospital. RESULTS: The one-class embedding classifier model achieved the highest area under the curve scores, with values of 0.868, 0.849, 0.798, 0.746, 0.800, 0.749, and 0.790 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively. For area under the precision-recall curve, the light gradient boosting models had the highest scores, with values of 0.383, 0.401, 0.387, 0.300, 0.385, 0.357, and 0.296 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively. CONCLUSIONS: Our results show that it is possible to easily develop applicable cancer prediction models with nationwide claim data using machine learning. The 7 models showed acceptable performance and explainability, and thus can be distributed easily in real-world environments. |
format | Online Article Text |
id | pubmed-8438609 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-8438609 2021-09-27 Patient-Level Cancer Prediction Models From a Nationwide Patient Cohort: Model Development and Validation Lee, Eunsaem Jung, Se Young Hwang, Hyung Ju Jung, Jaewoo JMIR Med Inform Original Paper BACKGROUND: Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models at the patient level, and claim data are one of the more useful resources to this end. To avoid unnecessary diagnostic intervention after cancer screening tests, patient-level prediction models should be developed. OBJECTIVE: We aimed to develop cancer prediction models using nationwide claim databases with machine learning algorithms, which are explainable and easily applicable in real-world environments. METHODS: As source data, we used the Korean National Insurance System Database. Every Korean aged ≥40 years undergoes a national health checkup every 2 years. We gathered all variables from the database, including demographic information, basic laboratory values, anthropometric values, and previous medical history. We applied conventional logistic regression methods, light gradient boosting methods, neural networks, survival analysis, and one-class embedding classifier methods to effectively analyze high-dimensional data based on deep learning–based anomaly detection. Performance was measured with area under the curve and area under the precision-recall curve. We validated our models externally with a health checkup database from a tertiary hospital. RESULTS: The one-class embedding classifier model achieved the highest area under the curve scores, with values of 0.868, 0.849, 0.798, 0.746, 0.800, 0.749, and 0.790 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively. For area under the precision-recall curve, the light gradient boosting models had the highest scores, with values of 0.383, 0.401, 0.387, 0.300, 0.385, 0.357, and 0.296 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively. CONCLUSIONS: Our results show that it is possible to easily develop applicable cancer prediction models with nationwide claim data using machine learning. The 7 models showed acceptable performance and explainability, and thus can be distributed easily in real-world environments. JMIR Publications 2021-08-30 /pmc/articles/PMC8438609/ /pubmed/34459743 http://dx.doi.org/10.2196/29807 Text en ©Eunsaem Lee, Se Young Jung, Hyung Ju Hwang, Jaewoo Jung. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 30.08.2021. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Lee, Eunsaem Jung, Se Young Hwang, Hyung Ju Jung, Jaewoo Patient-Level Cancer Prediction Models From a Nationwide Patient Cohort: Model Development and Validation |
title | Patient-Level Cancer Prediction Models From a Nationwide Patient Cohort: Model Development and Validation |
title_full | Patient-Level Cancer Prediction Models From a Nationwide Patient Cohort: Model Development and Validation |
title_fullStr | Patient-Level Cancer Prediction Models From a Nationwide Patient Cohort: Model Development and Validation |
title_full_unstemmed | Patient-Level Cancer Prediction Models From a Nationwide Patient Cohort: Model Development and Validation |
title_short | Patient-Level Cancer Prediction Models From a Nationwide Patient Cohort: Model Development and Validation |
title_sort | patient-level cancer prediction models from a nationwide patient cohort: model development and validation |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8438609/ https://www.ncbi.nlm.nih.gov/pubmed/34459743 http://dx.doi.org/10.2196/29807 |
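
The abstract reports two metrics, area under the (ROC) curve and area under the precision-recall curve. As a rough illustration of how these two quantities can be computed, here is a minimal pure-Python sketch. It is not the authors' code: the function names `roc_auc` and `average_precision`, and the toy `labels` and `scores`, are hypothetical, and average precision is used as one common step-wise estimate of the area under the precision-recall curve.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the Mann-Whitney U statistic.

    This equals the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of precision@k over the ranks k of positives.

    Assumes scores are distinct; tied scores are ordered by input position.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank cases from highest to lowest predicted risk.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Hypothetical toy data: 1 = cancer diagnosed, 0 = not diagnosed.
labels = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05, 0.60, 0.70]

print(f"AUROC: {roc_auc(labels, scores):.3f}")
print(f"AUPRC (average precision): {average_precision(labels, scores):.3f}")
```

With rare outcomes such as cancer diagnoses, the precision-recall metric depends strongly on prevalence, which is one plausible reason the two metrics in the abstract rank the models differently.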