
Benchmark data set for breast cancer associated genes

Breast cancer is one of the leading causes of death in women worldwide. The main causes may be inheritance, changes in environmental conditions, or mutations in certain cancer-causing genes. These genes are not negligible; on the contrary, a wide range of genes are involved in the de...


Bibliographic Details
Main authors: Raj, Sushrutha, Anil, Athira P, Shukla, Anshita, Anoosha, Kadambala, Srivastava, Alok
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9679444/
https://www.ncbi.nlm.nih.gov/pubmed/36425994
http://dx.doi.org/10.1016/j.dib.2022.108583
author Raj, Sushrutha
Anil, Athira P
Shukla, Anshita
Anoosha, Kadambala
Srivastava, Alok
collection PubMed
description Breast cancer is one of the leading causes of death in women worldwide. The main causes may be inheritance, changes in environmental conditions, or mutations in certain cancer-causing genes. These genes are not negligible; on the contrary, a wide range of genes are involved in the development and progression of the different stages of breast cancer. In this article, we explore the association of genes with breast cancer and classify them into different association classes, viz. positive, negative and neutral. Among the available biomedical literature resources for a disease, HuGE Navigator is a major resource comprising continually updated human genome epidemiology data controlled by the Centers for Disease Control and Prevention. However, the literature finder module of HuGE Navigator yields only PubMed IDs for a specific disease, which are then used to retrieve abstract data from PubMed. These abstracts are filtered to retain those reference sentences that contain at least one gene term and one disease term. This reference sentence data has been used to apply double-fold cross-validation, compile the most comprehensive list, and then classify the genes into different association classes, viz. positive, negative or neutral, along with the reference sentences confirming the association of the disease with the gene. The positively associated data generated here can be used for breast cancer modelling or meta-analysis of breast cancer. The data generated in the present work can also serve as standard reference data for training text mining-based biological literature classifiers that predict the class of published literature, not only for breast cancer but for other diseases as well.
format Online
Article
Text
id pubmed-9679444
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9679444 2022-11-23 Benchmark data set for breast cancer associated genes Raj, Sushrutha Anil, Athira P Shukla, Anshita Anoosha, Kadambala Srivastava, Alok Data Brief Data Article
Elsevier 2022-09-13 /pmc/articles/PMC9679444/ /pubmed/36425994 http://dx.doi.org/10.1016/j.dib.2022.108583 Text en © 2022 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Benchmark data set for breast cancer associated genes
topic Data Article
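
The abstract describes filtering PubMed abstracts down to reference sentences that mention a gene and a disease term. A minimal sketch of that filtering step might look like the following. The term sets `GENE_TERMS` and `DISEASE_TERMS`, the naive sentence splitter, and both function names are assumptions made for illustration; the record does not state which lexicons or tools the authors used.

```python
import re
from typing import List

# Hypothetical lexicons for illustration only; the source does not list its term sets.
GENE_TERMS = {"BRCA1", "BRCA2", "TP53"}
DISEASE_TERMS = {"breast cancer", "breast carcinoma"}


def split_sentences(abstract: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [p.strip() for p in parts if p.strip()]


def reference_sentences(abstract: str) -> List[str]:
    """Keep sentences that mention at least one gene term and one disease term."""
    kept = []
    for sentence in split_sentences(abstract):
        # Gene symbols are matched case-sensitively as whole alphanumeric tokens.
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        has_gene = bool(tokens & GENE_TERMS)
        # Disease terms are matched as case-insensitive substrings.
        lowered = sentence.lower()
        has_disease = any(term in lowered for term in DISEASE_TERMS)
        if has_gene and has_disease:
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    example = (
        "BRCA1 mutations increase the risk of breast cancer. "
        "The cohort included 500 women. "
        "TP53 was sequenced in all samples."
    )
    for s in reference_sentences(example):
        print(s)
```

Only the first example sentence is kept, because it is the only one containing both a gene symbol and a disease term. Assigning each kept sentence to a positive, negative or neutral association class is a separate step that this sketch does not attempt.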