An entropy-reducing data representation approach for bioinformatic data

Bibliographic Details
Main authors: McCulloch, Alan F; Jauregui, Ruy; Maclean, Paul H; Ashby, Rachael L; Moraga, Roger A; Laugraud, Aurelie; Brauning, Rudiger; Dodds, Ken G; McEwan, John C
Format: Online Article Text
Language: English
Published: Oxford University Press 2018
Subjects: Original Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5887302/
https://www.ncbi.nlm.nih.gov/pubmed/29688382
http://dx.doi.org/10.1093/database/bay029
author McCulloch, Alan F
Jauregui, Ruy
Maclean, Paul H
Ashby, Rachael L
Moraga, Roger A
Laugraud, Aurelie
Brauning, Rudiger
Dodds, Ken G
McEwan, John C
collection PubMed
description Non-semantic approaches to bioinformatic data analysis have potential relevance where semantic resources such as annotated finished reference genomes are lacking, such as in the analysis and utilisation of growing amounts of sequence data from non-model organisms, often associated with sequence-based agricultural, aquacultural and environmental sampling studies and commercial services. Even where rich semantic resources are available, semantic approaches to problems such as contrasting and comparing reference assemblies, and utilising multiple references in parallel to avoid reference bias, are costly and difficult to fully automate. We introduce and discuss a non-semantic data representation approach intended mainly for bioinformatic data called non-semantic labelling. Non-semantic labelling involves tensorially combining multiple kinds of model-based entropy-reducing data representation, with multiple representation models, so as to map both data and models into dual metric representation spaces, with goals of both reducing the statistical complexity of the data, and highlighting latent structure via machine learning and statistical analyses conducted within the dual representation spaces. As part of the framework, we introduce a novel algebraic abstraction of data representation mappings, and present four proof-of-concept examples of its application, to problems such as comparing and contrasting sequence assemblies, utilisation of multiple references for annotation and development of quality control diagnostics in a variety of high-throughput sequencing contexts.
Database URL: https://github.com/AgResearch/data_prism
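The phrase "entropy-reducing data representation" in the abstract can be illustrated with a small sketch. It is not taken from the data_prism repository and is not the authors' method: the mapping `PURINE_PYRIMIDINE`, the helper names `shannon_entropy` and `represent`, and the toy sequence are all assumptions chosen for illustration. It shows only the general idea that a many-to-one representation of sequence symbols cannot increase the Shannon entropy of their empirical distribution.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy, in bits, of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical representation model (an assumption, not from the paper):
# collapse each nucleotide into its purine (R) or pyrimidine (Y) class.
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def represent(sequence, model):
    """Apply a symbol-wise representation mapping to a sequence."""
    return [model[base] for base in sequence]


# Toy sequence with each nucleotide appearing equally often.
seq = "ACGTACGTAAGGCCTT"
raw_entropy = shannon_entropy(seq)
mapped_entropy = shannon_entropy(represent(seq, PURINE_PYRIMIDINE))

print(f"raw: {raw_entropy:.3f} bits, mapped: {mapped_entropy:.3f} bits")
```

In this toy case the four nucleotides are uniformly distributed, so the raw entropy is 2 bits per symbol and the two-class representation has 1 bit per symbol. The framework described in the abstract combines several such representations and models; this sketch covers only a single mapping.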
format Online
Article
Text
id pubmed-5887302
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5887302 2018-04-11 Database (Oxford) Original Article Oxford University Press 2018-04-05 Text en
© The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title An entropy-reducing data representation approach for bioinformatic data
topic Original Article