
Cohort Selection for Clinical Trials From Longitudinal Patient Records: Text Mining Approach

BACKGROUND: Clinical trials are an important step in introducing new interventions into clinical practice by generating data on their safety and efficacy. Clinical trials need to ensure that participants are similar so that the findings can be attributed to the interventions studied and not to some...


Bibliographic Details
Main Authors: Spasic, Irena, Krzeminski, Dominik, Corcoran, Padraig, Balinsky, Alexander
Format: Online Article Text
Language: English
Published: JMIR Publications 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6913747/
https://www.ncbi.nlm.nih.gov/pubmed/31674914
http://dx.doi.org/10.2196/15980
author Spasic, Irena
Krzeminski, Dominik
Corcoran, Padraig
Balinsky, Alexander
collection PubMed
description BACKGROUND: Clinical trials are an important step in introducing new interventions into clinical practice by generating data on their safety and efficacy. Clinical trials need to ensure that participants are similar so that the findings can be attributed to the interventions studied and not to some other factors. Therefore, each clinical trial defines eligibility criteria, which describe characteristics that must be shared by the participants. Unfortunately, the complexities of eligibility criteria may not allow them to be translated directly into readily executable database queries. Instead, they may require careful analysis of the narrative sections of medical records. Manual screening of medical records is time consuming, thus negatively affecting the timeliness of the recruitment process. OBJECTIVE: Track 1 of the 2018 National Natural Language Processing Clinical Challenge focused on the task of cohort selection for clinical trials, aiming to answer the following question: Can natural language processing be applied to narrative medical records to identify patients who meet eligibility criteria for clinical trials? The task required the participating systems to analyze longitudinal patient records to determine if the corresponding patients met the given eligibility criteria. We aimed to describe a system developed to address this task. METHODS: Our system consisted of 13 classifiers, one for each eligibility criterion. All classifiers used a bag-of-words document representation model. To prevent the loss of relevant contextual information associated with such representation, a pattern-matching approach was used to extract context-sensitive features. They were embedded back into the text as lexically distinguishable tokens, which were consequently featured in the bag-of-words representation. Supervised machine learning was chosen wherever a sufficient number of both positive and negative instances was available to learn from. A rule-based approach focusing on a small set of relevant features was chosen for the remaining criteria. RESULTS: The system was evaluated using microaveraged F measure. Overall, 4 machine learning algorithms, including support vector machine, logistic regression, naïve Bayesian classifier, and gradient tree boosting (GTB), were evaluated on the training data using 10-fold cross-validation. Overall, GTB demonstrated the most consistent performance. Its performance peaked when oversampling was used to balance the training data. The final evaluation was performed on previously unseen test data. On average, the F measure of 89.04% was comparable to 3 of the top ranked performances in the shared task (91.11%, 90.28%, and 90.21%). With an F measure of 88.14%, we significantly outperformed these systems (81.03%, 78.50%, and 70.81%) in identifying patients with advanced coronary artery disease. CONCLUSIONS: The holdout evaluation provides evidence that our system was able to identify eligible patients for the given clinical trial with high accuracy. Our approach demonstrates how rule-based knowledge infusion can improve the performance of machine learning algorithms even when trained on a relatively small dataset.
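The abstract's two central ideas, embedding pattern-matched context features back into the text before building a bag-of-words representation, and scoring with a microaveraged F measure, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the regular expression, the token `NEG_MI_TOKEN`, and the helper functions are hypothetical and are not taken from the authors' system. The scoring function pools counts for a single positive label across criteria, which may differ from the shared task's official scorer.

```python
import re
from collections import Counter

# Hypothetical pattern for a negated mention of myocardial infarction.
# The regex and the replacement token are illustrative assumptions only.
NEGATED_MI = re.compile(
    r"\b(?:no|denies|without)\s+(?:history\s+of\s+)?"
    r"(?:mi|myocardial\s+infarction)\b",
    re.IGNORECASE,
)


def embed_context_features(text: str) -> str:
    """Replace each pattern match with one lexically distinguishable token."""
    return NEGATED_MI.sub("NEG_MI_TOKEN", text)


def bag_of_words(text: str) -> Counter:
    """Count lowercased word tokens; underscores keep embedded tokens whole."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def micro_f1(gold: dict, pred: dict) -> float:
    """Micro-averaged F1: pool TP, FP, and FN over all criteria, then score.

    gold and pred map a criterion name to a list of booleans, one per
    patient, where True means the criterion is met.
    """
    tp = fp = fn = 0
    for criterion, gold_labels in gold.items():
        for g, p in zip(gold_labels, pred[criterion]):
            if g and p:
                tp += 1
            elif p and not g:
                fp += 1
            elif g and not p:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    note = "Patient denies history of myocardial infarction. Prior MI ruled out."
    print(bag_of_words(embed_context_features(note)))
    gold = {"A": [True, True, False], "B": [False, True]}
    pred = {"A": [True, False, False], "B": [True, True]}
    print(micro_f1(gold, pred))
```

In this sketch the negated mention survives as its own feature in the word counts, whereas a plain bag-of-words would reduce it to unordered words and lose the negation. Pooling counts before computing precision and recall is what makes the average "micro": criteria with more positive instances weigh more heavily in the final score.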
format Online
Article
Text
id pubmed-6913747
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-6913747 2020-01-06 Cohort Selection for Clinical Trials From Longitudinal Patient Records: Text Mining Approach Spasic, Irena Krzeminski, Dominik Corcoran, Padraig Balinsky, Alexander JMIR Med Inform Original Paper JMIR Publications 2019-10-31 /pmc/articles/PMC6913747/ /pubmed/31674914 http://dx.doi.org/10.2196/15980 Text en ©Irena Spasic, Dominik Krzeminski, Padraig Corcoran, Alexander Balinsky. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 31.10.2019.
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
title Cohort Selection for Clinical Trials From Longitudinal Patient Records: Text Mining Approach
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6913747/
https://www.ncbi.nlm.nih.gov/pubmed/31674914
http://dx.doi.org/10.2196/15980