Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids

BACKGROUND: Computational methods for mining of biomedical literature can be useful in augmenting manual searches of the literature using keywords for disease-specific biomarker discovery from biofluids. In this work, we develop and apply a semi-automated literature mining method to mine abstracts o...

Bibliographic Details
Main authors: Jordan, Rick; Visweswaran, Shyam; Gopalakrishnan, Vanathi
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4215335/
https://www.ncbi.nlm.nih.gov/pubmed/25379168
http://dx.doi.org/10.1186/2043-9113-4-13
_version_ 1782342073264898048
author Jordan, Rick
Visweswaran, Shyam
Gopalakrishnan, Vanathi
author_facet Jordan, Rick
Visweswaran, Shyam
Gopalakrishnan, Vanathi
author_sort Jordan, Rick
collection PubMed
description BACKGROUND: Computational methods for mining of biomedical literature can be useful in augmenting manual searches of the literature using keywords for disease-specific biomarker discovery from biofluids. In this work, we develop and apply a semi-automated literature mining method to mine abstracts obtained from PubMed to discover putative biomarkers of breast and lung cancers in specific biofluids.
METHODOLOGY: A positive set of abstracts was defined by the terms ‘breast cancer’ and ‘lung cancer’ in conjunction with 14 separate ‘biofluids’ (bile, blood, breastmilk, cerebrospinal fluid, mucus, plasma, saliva, semen, serum, synovial fluid, stool, sweat, tears, and urine), while a negative set of abstracts was defined by the terms ‘(biofluid) NOT breast cancer’ or ‘(biofluid) NOT lung cancer.’ More than 5.3 million total abstracts were obtained from PubMed and examined for biomarker-disease-biofluid associations (34,296 positive and 2,653,396 negative for breast cancer; 28,355 positive and 2,595,034 negative for lung cancer). Biological entities such as genes and proteins were tagged using ABNER, and processed using Python scripts to produce a list of putative biomarkers. Z-scores were calculated, ranked, and used to determine significance of putative biomarkers found. Manual verification of relevant abstracts was performed to assess our method’s performance.
RESULTS: Biofluid-specific markers were identified from the literature, assigned relevance scores based on frequency of occurrence, and validated using known biomarker lists and/or databases for lung and breast cancer [NCBI’s On-line Mendelian Inheritance in Man (OMIM), Cancer Gene annotation server for cancer genomics (CAGE), NCBI’s Genes & Disease, NCI’s Early Detection Research Network (EDRN), and others]. The specificity of each marker for a given biofluid was calculated, and the performance of our semi-automated literature mining method assessed for breast and lung cancer.
CONCLUSIONS: We developed a semi-automated process for determining a list of putative biomarkers for breast and lung cancer. New knowledge is presented in the form of biomarker lists; ranked, newly discovered biomarker-disease-biofluid relationships; and biomarker specificity across biofluids.
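The abstract states that positive and negative abstract sets were built per biofluid and that Z-scores were calculated and ranked, but it does not give the query syntax or the Z-score formula. The following is a minimal Python sketch under stated assumptions: the query strings use plain Boolean AND/NOT, the Z-score is a pooled two-proportion z-test comparing how often an entity is mentioned in the positive versus the negative set, and every function name and entity count is hypothetical. It is not the authors' script.

```python
from math import sqrt

# The 14 biofluids named in the abstract.
BIOFLUIDS = [
    "bile", "blood", "breastmilk", "cerebrospinal fluid", "mucus",
    "plasma", "saliva", "semen", "serum", "synovial fluid",
    "stool", "sweat", "tears", "urine",
]


def build_queries(disease, biofluid):
    """Return (positive, negative) query strings.

    Assumption: plain Boolean syntax; the abstract only says the disease term
    was used "in conjunction with" each biofluid, and "(biofluid) NOT disease".
    """
    positive = f'"{disease}" AND "{biofluid}"'
    negative = f'"{biofluid}" NOT "{disease}"'
    return positive, negative


def two_proportion_z(pos_hits, pos_total, neg_hits, neg_total):
    """Pooled two-proportion z statistic (an assumed stand-in for the paper's Z-score)."""
    p_pos = pos_hits / pos_total
    p_neg = neg_hits / neg_total
    pooled = (pos_hits + neg_hits) / (pos_total + neg_total)
    se = sqrt(pooled * (1 - pooled) * (1 / pos_total + 1 / neg_total))
    # With no variation (entity never or always mentioned) the statistic is undefined; use 0.
    return 0.0 if se == 0 else (p_pos - p_neg) / se


def rank_entities(pos_counts, neg_counts, pos_total, neg_total):
    """Score every entity seen in the positive set and sort by descending z."""
    scored = [
        (name, two_proportion_z(hits, pos_total, neg_counts.get(name, 0), neg_total))
        for name, hits in pos_counts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Set sizes for breast cancer are the ones reported in the abstract.
    pos_total, neg_total = 34_296, 2_653_396
    # Hypothetical per-entity abstract counts (placeholders, not real data).
    pos_counts = {"ENTITY_A": 300, "ENTITY_B": 10}
    neg_counts = {"ENTITY_A": 500, "ENTITY_B": 5000}

    print(build_queries("breast cancer", BIOFLUIDS[8]))
    for name, z in rank_entities(pos_counts, neg_counts, pos_total, neg_total):
        print(f"{name}\t{z:.2f}")
```

In this sketch an entity that is mentioned proportionally more often in the disease-plus-biofluid set than in the biofluid-only set gets a positive score and rises in the ranking. The counts would in practice come from the ABNER-tagged abstracts, which are not reproduced here.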
format Online
Article
Text
id pubmed-4215335
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4215335 2014-11-06 Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids Jordan, Rick Visweswaran, Shyam Gopalakrishnan, Vanathi J Clin Bioinforma Research BioMed Central 2014-10-23 /pmc/articles/PMC4215335/ /pubmed/25379168 http://dx.doi.org/10.1186/2043-9113-4-13 Text en Copyright © 2014 Jordan et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Jordan, Rick
Visweswaran, Shyam
Gopalakrishnan, Vanathi
Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids
title Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4215335/
https://www.ncbi.nlm.nih.gov/pubmed/25379168
http://dx.doi.org/10.1186/2043-9113-4-13