
Statistical representation models for mutation information within genomic data

BACKGROUND: As DNA sequencing technologies are improving and getting cheaper, genomic data can be utilized for diagnosis of many diseases such as cancer. Human raw genome data is huge in size for computational systems. Therefore, there is a need for a compact and accurate representation of the valua...


Bibliographic Details
Main authors: ÖZCAN ŞİMŞEK, N. Özlem, ÖZGÜR, Arzucan, GÜRGEN, Fikret
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6567431/
https://www.ncbi.nlm.nih.gov/pubmed/31195961
http://dx.doi.org/10.1186/s12859-019-2868-4
author ÖZCAN ŞİMŞEK, N. Özlem
ÖZGÜR, Arzucan
GÜRGEN, Fikret
collection PubMed
description BACKGROUND: As DNA sequencing technologies are improving and getting cheaper, genomic data can be utilized for diagnosis of many diseases such as cancer. Human raw genome data is huge in size for computational systems. Therefore, there is a need for a compact and accurate representation of the valuable information in DNA. The occurrence of complex genetic disorders often results from multiple gene mutations. The effect of each mutation is not equal for the development of a disease. Inspired by the field of information retrieval, we propose using the term frequency (tf) and BM25 term weighting measures with the inverse document frequency (idf) and relevance frequency (rf) measures to weight genes based on their mutations. The underlying assumption is that the more mutations a gene has in patients with a certain disease and the fewer mutations it has in other patients, the more discriminative that gene is. RESULTS: We evaluated the proposed representations on the task of cancer type classification. We applied various machine learning techniques using the tf-idf and tf-rf schemes and their BM25 versions. Our results show that the BM25-tf-rf representation leads to improved classification accuracy and f-score values compared to the other representations. The highest accuracy (76.44%) and f-score (76.95%) are achieved with the BM25-tf-rf based data representation. CONCLUSIONS: As a result of our experiments, the BM25-tf-rf scheme and the proposed neural network model are shown to be the best performing classification system for our case study of cancer type classification. This system is further utilized for causal gene analysis. Examples of the most effective genes used for decision making are found in the literature as target or causal genes.
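Note: the weighting idea in the description above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the textbook information-retrieval forms of idf, rf, and the BM25 tf component; the paper's exact variants may differ. Each patient is treated as a "document" and each gene as a "term" whose frequency is its mutation count. The gene names, counts, labels, and the parameters k1 and b are all hypothetical.

```python
import math

def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating BM25 tf component (k1 and b are assumed defaults)."""
    if tf == 0:
        return 0.0
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + norm)

def idf(n_patients, n_with_gene):
    """Textbook inverse document frequency; 0 if no patient has the gene mutated."""
    return math.log(n_patients / n_with_gene) if n_with_gene else 0.0

def rf(pos_with_gene, neg_with_gene):
    """Relevance frequency: high when the gene is mutated mostly in the target class."""
    return math.log2(2.0 + pos_with_gene / max(1, neg_with_gene))

def weight_patients(counts, labels, target, use_bm25=True):
    """counts: one dict per patient mapping gene -> mutation count."""
    genes = sorted({g for row in counts for g in row})
    avg_len = (sum(sum(r.values()) for r in counts) / len(counts)) or 1.0
    # Global (per-gene) rf weight with respect to the target class.
    gene_rf = {}
    for g in genes:
        a = sum(1 for r, y in zip(counts, labels) if y == target and r.get(g, 0) > 0)
        c = sum(1 for r, y in zip(counts, labels) if y != target and r.get(g, 0) > 0)
        gene_rf[g] = rf(a, c)
    # Local (per-patient) tf weight, multiplied by the global weight.
    vectors = []
    for row in counts:
        doc_len = sum(row.values())
        vec = {}
        for g in genes:
            tf = row.get(g, 0)
            local = bm25_tf(tf, doc_len, avg_len) if use_bm25 else float(tf)
            vec[g] = local * gene_rf[g]
        vectors.append(vec)
    return vectors

# Hypothetical toy data: three patients, two made-up genes.
toy_counts = [{"GENE_A": 3, "GENE_B": 1}, {"GENE_A": 2}, {"GENE_B": 4}]
toy_labels = ["cancer_x", "cancer_x", "cancer_y"]
toy_vectors = weight_patients(toy_counts, toy_labels, target="cancer_x")
```

In this toy set GENE_A is mutated only in the target class, so its rf weight is larger than that of GENE_B, which matches the stated assumption that such a gene is more discriminative.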
format Online
Article
Text
id pubmed-6567431
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6567431 2019-06-17 Statistical representation models for mutation information within genomic data ÖZCAN ŞİMŞEK, N. Özlem ÖZGÜR, Arzucan GÜRGEN, Fikret BMC Bioinformatics Research Article
BioMed Central 2019-06-13 /pmc/articles/PMC6567431/ /pubmed/31195961 http://dx.doi.org/10.1186/s12859-019-2868-4 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Statistical representation models for mutation information within genomic data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6567431/
https://www.ncbi.nlm.nih.gov/pubmed/31195961
http://dx.doi.org/10.1186/s12859-019-2868-4