Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids
BACKGROUND: Computational methods for mining of biomedical literature can be useful in augmenting manual searches of the literature using keywords for disease-specific biomarker discovery from biofluids. In this work, we develop and apply a semi-automated literature mining method to mine abstracts o...
Main authors: | Jordan, Rick; Visweswaran, Shyam; Gopalakrishnan, Vanathi |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2014 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4215335/ https://www.ncbi.nlm.nih.gov/pubmed/25379168 http://dx.doi.org/10.1186/2043-9113-4-13 |
_version_ | 1782342073264898048 |
---|---|
author | Jordan, Rick; Visweswaran, Shyam; Gopalakrishnan, Vanathi |
author_sort | Jordan, Rick |
collection | PubMed |
description | BACKGROUND: Computational methods for mining of biomedical literature can be useful in augmenting manual searches of the literature using keywords for disease-specific biomarker discovery from biofluids. In this work, we develop and apply a semi-automated literature mining method to mine abstracts obtained from PubMed to discover putative biomarkers of breast and lung cancers in specific biofluids. METHODOLOGY: A positive set of abstracts was defined by the terms ‘breast cancer’ and ‘lung cancer’ in conjunction with 14 separate ‘biofluids’ (bile, blood, breastmilk, cerebrospinal fluid, mucus, plasma, saliva, semen, serum, synovial fluid, stool, sweat, tears, and urine), while a negative set of abstracts was defined by the terms ‘(biofluid) NOT breast cancer’ or ‘(biofluid) NOT lung cancer.’ More than 5.3 million total abstracts were obtained from PubMed and examined for biomarker-disease-biofluid associations (34,296 positive and 2,653,396 negative for breast cancer; 28,355 positive and 2,595,034 negative for lung cancer). Biological entities such as genes and proteins were tagged using ABNER, and processed using Python scripts to produce a list of putative biomarkers. Z-scores were calculated, ranked, and used to determine significance of putative biomarkers found. Manual verification of relevant abstracts was performed to assess our method’s performance. RESULTS: Biofluid-specific markers were identified from the literature, assigned relevance scores based on frequency of occurrence, and validated using known biomarker lists and/or databases for lung and breast cancer [NCBI’s On-line Mendelian Inheritance in Man (OMIM), Cancer Gene annotation server for cancer genomics (CAGE), NCBI’s Genes & Disease, NCI’s Early Detection Research Network (EDRN), and others]. The specificity of each marker for a given biofluid was calculated, and the performance of our semi-automated literature mining method assessed for breast and lung cancer. CONCLUSIONS: We developed a semi-automated process for determining a list of putative biomarkers for breast and lung cancer. New knowledge is presented in the form of biomarker lists; ranked, newly discovered biomarker-disease-biofluid relationships; and biomarker specificity across biofluids. |
format | Online Article Text |
id | pubmed-4215335 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4215335 2014-11-06 Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids Jordan, Rick; Visweswaran, Shyam; Gopalakrishnan, Vanathi J Clin Bioinforma Research BioMed Central 2014-10-23 /pmc/articles/PMC4215335/ /pubmed/25379168 http://dx.doi.org/10.1186/2043-9113-4-13 Text en Copyright © 2014 Jordan et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4215335/ https://www.ncbi.nlm.nih.gov/pubmed/25379168 http://dx.doi.org/10.1186/2043-9113-4-13 |
work_keys_str_mv | AT jordanrick semiautomatedliteratureminingtoidentifyputativebiomarkersofdiseasefrommultiplebiofluids AT visweswaranshyam semiautomatedliteratureminingtoidentifyputativebiomarkersofdiseasefrommultiplebiofluids AT gopalakrishnanvanathi semiautomatedliteratureminingtoidentifyputativebiomarkersofdiseasefrommultiplebiofluids |
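
The methodology summarized in the description field can be illustrated with a small Python sketch. It is not the authors' code. The function names `build_queries` and `two_proportion_z` are hypothetical, the exact query syntax is an assumption modeled on the abstract's wording, and the pooled two-proportion z-test is a stand-in chosen for illustration, since the record does not state which z-score formula was used.

```python
import math

# The 14 biofluids listed in the abstract.
BIOFLUIDS = [
    "bile", "blood", "breastmilk", "cerebrospinal fluid", "mucus",
    "plasma", "saliva", "semen", "serum", "synovial fluid", "stool",
    "sweat", "tears", "urine",
]


def build_queries(disease, biofluids=BIOFLUIDS):
    """Return {biofluid: (positive_query, negative_query)}.

    Assumption: the quoting and Boolean layout are illustrative only;
    the abstract gives the terms, not the literal search strings.
    """
    queries = {}
    for fluid in biofluids:
        positive = f'"{disease}" AND "{fluid}"'   # disease in conjunction with biofluid
        negative = f'"{fluid}" NOT "{disease}"'   # '(biofluid) NOT disease'
        queries[fluid] = (positive, negative)
    return queries


def two_proportion_z(pos_hits, pos_total, neg_hits, neg_total):
    """Z-score for a term's frequency in positive vs. negative abstracts.

    Assumption: a pooled two-proportion z-test, used here as a stand-in
    because the record does not specify the formula.
    """
    p_pos = pos_hits / pos_total
    p_neg = neg_hits / neg_total
    pooled = (pos_hits + neg_hits) / (pos_total + neg_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / pos_total + 1 / neg_total))
    if se == 0:
        return 0.0  # term never (or always) occurs; no evidence either way
    return (p_pos - p_neg) / se


if __name__ == "__main__":
    pos_q, neg_q = build_queries("breast cancer")["serum"]
    print(pos_q)
    print(neg_q)
    # Set sizes are taken from the abstract; the hit counts (120, 500)
    # are made-up numbers for demonstration only.
    print(two_proportion_z(120, 34296, 500, 2653396))
```

Under these assumptions, candidate entities would be ranked by descending z-score, so that terms over-represented in the positive set rise to the top of the putative biomarker list.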