Identifying the status of genetic lesions in cancer clinical trial documents using machine learning

Bibliographic Details
Main Authors: Wu, Yonghui; Levy, Mia A; Micheel, Christine M; Yeh, Paul; Tang, Buzhou; Cantrell, Michael J; Cooreman, Stacy M; Xu, Hua
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3535695/
https://www.ncbi.nlm.nih.gov/pubmed/23282337
http://dx.doi.org/10.1186/1471-2164-13-S8-S21
author Wu, Yonghui
Levy, Mia A
Micheel, Christine M
Yeh, Paul
Tang, Buzhou
Cantrell, Michael J
Cooreman, Stacy M
Xu, Hua
collection PubMed
description
BACKGROUND: Many cancer clinical trials now specify the particular status of a genetic lesion in a patient's tumor in the inclusion or exclusion criteria for trial enrollment. To facilitate search and identification of gene-associated clinical trials by potential participants and clinicians, it is important to develop automated methods to identify genetic information from narrative trial documents.
METHODS: We developed a two-stage classification method to identify genes and genetic lesion statuses in clinical trial documents extracted from the National Cancer Institute's (NCI's) Physician Data Query (PDQ) cancer clinical trial database. The method consists of two steps: 1) to distinguish gene entities from non-gene entities such as English words; and 2) to determine whether and which genetic lesion status is associated with an identified gene entity. We developed and evaluated the performance of the method using a manually annotated data set containing 1,143 instances of the eight most frequently mentioned genes in cancer clinical trials. In addition, we applied the classifier to a real-world task of cancer trial annotation and evaluated its performance using a larger sample size (4,013 instances from 249 distinct human gene symbols detected from 250 trials).
RESULTS: Our evaluation using a manually annotated data set showed that the two-stage classifier outperformed the single-stage classifier and achieved the best average accuracy of 83.7% for the eight most frequently mentioned genes when optimized feature sets were used. It also showed better generalizability when we applied the two-stage classifier trained on one set of genes to another independent gene. When a gene-neutral, two-stage classifier was applied to the real-world task of cancer trial annotation, it achieved a highest accuracy of 89.8%, demonstrating the feasibility of developing a gene-neutral classifier for this task.
CONCLUSIONS: We presented a machine learning-based approach to detect gene entities and the genetic lesion statuses from clinical trial documents and demonstrated its use in cancer trial annotation. Such methods would be valuable for building information retrieval tools targeting gene-associated clinical trials.
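The two-stage idea in the METHODS paragraph can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the learning algorithm or the features, so the example assumes a small bag-of-words Naive Bayes model. The class name, the label names (`gene`, `non_gene`, `altered`, `wild_type`, `no_status`), and every training sentence are hypothetical toy data invented for illustration. KIT and MET are used only as examples of gene symbols that collide with English words.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes over lowercase word counts (add-one smoothing)."""

    def __init__(self):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            n_words = sum(self.word_counts[label].values())
            score = math.log(count / total)  # log prior
            for w in words:
                # Smoothed log likelihood of each context word under this label.
                score += math.log(
                    (self.word_counts[label][w] + 1) / (n_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Stage 1 (hypothetical toy data): is the mention a gene or an ordinary word?
stage1 = NaiveBayes().fit(
    [
        "tumor must harbor a MET amplification",
        "patients with KIT mutation positive tumors",
        "patients who met the eligibility criteria",
        "a test kit was used for the assay",
    ],
    ["gene", "gene", "non_gene", "non_gene"],
)

# Stage 2 (hypothetical toy data): which lesion status, if any, goes with the gene?
stage2 = NaiveBayes().fit(
    [
        "tumor must harbor a MET amplification",
        "patients with KIT mutation positive tumors",
        "tumors must be KIT wild type",
        "MET wild type disease is required",
        "prior treatment with a MET inhibitor is allowed",
        "KIT inhibitor therapy within four weeks",
    ],
    ["altered", "altered", "wild_type", "wild_type", "no_status", "no_status"],
)


def annotate(context):
    """Return (is_gene, status); stage 2 runs only if stage 1 says 'gene'."""
    if stage1.predict(context) != "gene":
        return (False, None)
    return (True, stage2.predict(context))


if __name__ == "__main__":
    for sentence in (
        "known MET amplification in the tumor",
        "patients who met the criteria",
        "tumor must be KIT wild type",
    ):
        print(sentence, "->", annotate(sentence))
```

Splitting the decision this way means the second model only ever sees contexts already judged to contain a gene. A single-stage alternative would fold `non_gene` into the status labels and train one model on everything.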
format Online
Article
Text
id pubmed-3535695
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3535695 2013-01-04
Identifying the status of genetic lesions in cancer clinical trial documents using machine learning
Wu, Yonghui; Levy, Mia A; Micheel, Christine M; Yeh, Paul; Tang, Buzhou; Cantrell, Michael J; Cooreman, Stacy M; Xu, Hua
BMC Genomics
Research
BioMed Central 2012-12-17
/pmc/articles/PMC3535695/ /pubmed/23282337 http://dx.doi.org/10.1186/1471-2164-13-S8-S21
Text en
Copyright ©2012 Wu et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Identifying the status of genetic lesions in cancer clinical trial documents using machine learning
topic Research