
Harvestman: a framework for hierarchical feature learning and selection from whole genome sequencing data

BACKGROUND: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can bring about inaccurate models or those that require extensive compute time and resources. Additi...


Bibliographic Details
Main Authors:	Frisby, Trevor S.; Baker, Shawn J.; Marçais, Guillaume; Hoang, Quang Minh; Kingsford, Carl; Langmead, Christopher J.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8017869/ https://www.ncbi.nlm.nih.gov/pubmed/33794760 http://dx.doi.org/10.1186/s12859-021-04096-6

author	Frisby, Trevor S.; Baker, Shawn J.; Marçais, Guillaume; Hoang, Quang Minh; Kingsford, Carl; Langmead, Christopher J.
author_sort	Frisby, Trevor S.
collection	PubMed
description	BACKGROUND: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can bring about inaccurate models or those that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building. RESULTS: We demonstrate that Harvestman scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that Harvestman selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare Harvestman to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining accuracy of the resulting classifier. CONCLUSION: Harvestman is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, Harvestman automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, Harvestman is faster and selects features more parsimoniously.
format	Online Article Text
id	pubmed-8017869
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8017869 2021-04-05 Harvestman: a framework for hierarchical feature learning and selection from whole genome sequencing data Frisby, Trevor S.; Baker, Shawn J.; Marçais, Guillaume; Hoang, Quang Minh; Kingsford, Carl; Langmead, Christopher J. BMC Bioinformatics Methodology Article BioMed Central 2021-04-01 /pmc/articles/PMC8017869/ /pubmed/33794760 http://dx.doi.org/10.1186/s12859-021-04096-6 Text en © The Author(s) 2021 Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Harvestman: a framework for hierarchical feature learning and selection from whole genome sequencing data
topic	Methodology Article