Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features

The database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contribution to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval...

Bibliographic Details
Main Authors: Ross, Mindy K.; Lin, Ko-Wei; Truong, Karen; Kumar, Abhishek; Conway, Mike
Format: Online Article Text
Language: English
Published: Libertas Academica 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3728208/
https://www.ncbi.nlm.nih.gov/pubmed/23926434
http://dx.doi.org/10.4137/BII.S11987
author Ross, Mindy K.
Lin, Ko-Wei
Truong, Karen
Kumar, Abhishek
Conway, Mike
collection PubMed
description The database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contribution to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques to improve study retrieval in the context of the dbGaP database. We utilized standard machine learning algorithms (naive Bayes, support vector machines, and the C4.5 decision tree) trained on dbGaP study text and incorporated n-gram features and study metadata to identify heart, lung, and blood studies. We used the χ² feature selection algorithm to identify features that contributed most to classification performance and experimented with dbGaP-associated PubMed papers as a proxy for topicality. Classifier performance was favorable in comparison to keyword-based search results. It was determined that text categorization is a useful complement to document retrieval techniques in the dbGaP.
format Online
Article
Text
id pubmed-3728208
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Libertas Academica
record_format MEDLINE/PubMed
spelling pubmed-3728208 2013-08-07 Biomed Inform Insights Original Research Libertas Academica 2013-07-22 /pmc/articles/PMC3728208/ /pubmed/23926434 http://dx.doi.org/10.4137/BII.S11987 Text en © 2013 the author(s), publisher and licensee Libertas Academica Ltd. This is an open access article published under the Creative Commons CC-BY-NC 3.0 license.
title Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features
topic Original Research