Statistical representation models for mutation information within genomic data
BACKGROUND: As DNA sequencing technologies improve and become cheaper, genomic data can be used to diagnose many diseases, such as cancer. Raw human genome data is very large for computational systems to process. Therefore, there is a need for a compact and accurate representation of the valua...
Main Authors: | ÖZCAN ŞİMŞEK, N. Özlem; ÖZGÜR, Arzucan; GÜRGEN, Fikret |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6567431/ https://www.ncbi.nlm.nih.gov/pubmed/31195961 http://dx.doi.org/10.1186/s12859-019-2868-4 |
_version_ | 1783427075861905408 |
---|---|
author | ÖZCAN ŞİMŞEK, N. Özlem ÖZGÜR, Arzucan GÜRGEN, Fikret |
author_facet | ÖZCAN ŞİMŞEK, N. Özlem ÖZGÜR, Arzucan GÜRGEN, Fikret |
author_sort | ÖZCAN ŞİMŞEK, N. Özlem |
collection | PubMed |
description | BACKGROUND: As DNA sequencing technologies improve and become cheaper, genomic data can be used to diagnose many diseases, such as cancer. Raw human genome data is very large for computational systems to process. Therefore, there is a need for a compact and accurate representation of the valuable information in DNA. Complex genetic disorders often result from multiple gene mutations, and not every mutation contributes equally to the development of a disease. Inspired by the field of information retrieval, we propose using the term frequency (tf) and BM25 term weighting measures with the inverse document frequency (idf) and relevance frequency (rf) measures to weight genes based on their mutations. The underlying assumption is that the more mutations a gene has in patients with a certain disease and the fewer mutations it has in other patients, the more discriminative that gene is. RESULTS: We evaluated the proposed representations on the task of cancer type classification. We applied various machine learning techniques using the tf-idf and tf-rf schemes and their BM25 versions. Our results show that the BM25-tf-rf representation leads to improved classification accuracy and f-score values compared to the other representations. The highest accuracy (76.44%) and f-score (76.95%) are achieved with the BM25-tf-rf based data representation. CONCLUSIONS: In our experiments, the BM25-tf-rf scheme combined with the proposed neural network model is the best-performing classification system for our case study of cancer type classification. This system is further used for causal gene analysis. Several of the genes that most influence its decisions are reported in the literature as target or causal genes. |
format | Online Article Text |
id | pubmed-6567431 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6567431 2019-06-17 Statistical representation models for mutation information within genomic data ÖZCAN ŞİMŞEK, N. Özlem ÖZGÜR, Arzucan GÜRGEN, Fikret BMC Bioinformatics Research Article BioMed Central 2019-06-13 /pmc/articles/PMC6567431/ /pubmed/31195961 http://dx.doi.org/10.1186/s12859-019-2868-4 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article ÖZCAN ŞİMŞEK, N. Özlem ÖZGÜR, Arzucan GÜRGEN, Fikret Statistical representation models for mutation information within genomic data |
title | Statistical representation models for mutation information within genomic data |
title_full | Statistical representation models for mutation information within genomic data |
title_fullStr | Statistical representation models for mutation information within genomic data |
title_full_unstemmed | Statistical representation models for mutation information within genomic data |
title_short | Statistical representation models for mutation information within genomic data |
title_sort | statistical representation models for mutation information within genomic data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6567431/ https://www.ncbi.nlm.nih.gov/pubmed/31195961 http://dx.doi.org/10.1186/s12859-019-2868-4 |
work_keys_str_mv | AT ozcansimseknozlem statisticalrepresentationmodelsformutationinformationwithingenomicdata AT ozgurarzucan statisticalrepresentationmodelsformutationinformationwithingenomicdata AT gurgenfikret statisticalrepresentationmodelsformutationinformationwithingenomicdata |
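
The weighting idea summarized in the description field can be illustrated with a minimal Python sketch. It treats each patient as a "document" and each gene as a "term" whose frequency is its mutation count. The formulas are the textbook forms of BM25 term-frequency saturation, idf, and relevance frequency; the article's exact variants may differ. The function names, the `k1` and `b` defaults, the one-vs-rest handling of the target cancer type, and the toy mutation counts are all assumptions made for illustration and are not taken from the record.

```python
import math
from collections import Counter


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency component (k1, b are common defaults)."""
    if tf <= 0:
        return 0.0
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


def idf(n_patients, n_with_gene):
    """Inverse document frequency: log(N / df); 0.0 if no patient has the gene."""
    if n_with_gene == 0:
        return 0.0
    return math.log(n_patients / n_with_gene)


def rf(pos_with_gene, neg_with_gene):
    """Relevance frequency: log2(2 + a / max(1, c)).

    a = target-class patients with a mutation in the gene,
    c = other patients with a mutation in the gene.
    """
    return math.log2(2.0 + pos_with_gene / max(1, neg_with_gene))


def weight_patients(counts, labels, target, scheme="rf", use_bm25=True):
    """Return one {gene: weight} dict per patient.

    counts: list of {gene: mutation_count} dicts (hypothetical input layout).
    labels: cancer type of each patient; `target` is the positive class.
    """
    pos, neg = Counter(), Counter()
    for row, label in zip(counts, labels):
        bucket = pos if label == target else neg
        for gene, n in row.items():
            if n > 0:
                bucket[gene] += 1

    lengths = [sum(row.values()) for row in counts]
    avg_len = sum(lengths) / len(lengths)

    vectors = []
    for row, length in zip(counts, lengths):
        vec = {}
        for gene, n in row.items():
            tf = bm25_tf(n, length, avg_len) if use_bm25 else float(n)
            if scheme == "idf":
                collection_weight = idf(len(counts), pos[gene] + neg[gene])
            else:
                collection_weight = rf(pos[gene], neg[gene])
            vec[gene] = tf * collection_weight
        vectors.append(vec)
    return vectors


# Invented toy data: mutation counts per gene for four patients.
counts = [
    {"TP53": 3, "BRCA1": 1},
    {"TP53": 1, "BRCA1": 2},
    {"TP53": 2, "KRAS": 4},
    {"TP53": 1, "KRAS": 1},
]
labels = ["breast", "breast", "lung", "lung"]

tf_rf = weight_patients(counts, labels, "breast", scheme="rf", use_bm25=False)
tf_idf = weight_patients(counts, labels, "breast", scheme="idf", use_bm25=False)
bm25_tf_rf = weight_patients(counts, labels, "breast", scheme="rf", use_bm25=True)
```

In this toy data, a gene mutated in every patient receives an idf of zero, whereas rf still assigns it a positive weight that grows with the ratio of target-class to other patients carrying a mutation. The BM25 variant caps the contribution of a heavily mutated gene at `k1 + 1` times its collection weight, so a single gene with many mutations cannot dominate a patient's vector.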