Identifying genetic determinants of complex phenotypes from whole genome sequence data

BACKGROUND: A critical goal in biology is to relate the phenotype to the genotype, that is, to find the genetic determinants of various traits. However, while simple monofactorial determinants are relatively easy to identify, the underpinnings of complex phenotypes are harder to predict. While tradi...

Bibliographic Details
Main Authors:	Long, George S., Hussen, Mohammed, Dench, Jonathan, Aris-Brosou, Stéphane
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6558885/ https://www.ncbi.nlm.nih.gov/pubmed/31182025 http://dx.doi.org/10.1186/s12864-019-5820-0

author	Long, George S. Hussen, Mohammed Dench, Jonathan Aris-Brosou, Stéphane
collection	PubMed
description	BACKGROUND: A critical goal in biology is to relate the phenotype to the genotype, that is, to find the genetic determinants of various traits. However, while simple monofactorial determinants are relatively easy to identify, the underpinnings of complex phenotypes are harder to predict. While traditional approaches rely on genome-wide association studies based on Single Nucleotide Polymorphism data, the ability of machine learning algorithms to find these determinants in whole proteome data is still not well known. RESULTS: To better understand the applicability of machine learning in this case, we implemented two such algorithms, adaptive boosting (AB) and repeated random forest (RRF), and developed a chunking layer that facilitates the analysis of whole proteome data. We first assessed the performance of these algorithms and tuned them on an influenza data set, for which the determinants of three complex phenotypes (infectivity, transmissibility, and pathogenicity) are known based on experimental evidence. This allowed us to show that chunking improves runtimes by an order of magnitude. Based on simulations, we showed that chunking also increases the sensitivity of the predictions, reaching 100% with as few as 20 sequences in a small proteome as in the influenza case (5k sites), but may require at least 30 sequences to reach 90% on larger alignments (500k sites). While RRF has less specificity than random forest, it was never <50%, and RRF sensitivity was significantly higher at smaller chunk sizes. We then used these algorithms to predict the determinants of three types of drug resistance (to Ciprofloxacin, Ceftazidime, and Gentamicin) in a bacterium, Pseudomonas aeruginosa. While both algorithms performed well in the case of the influenza data, results were more nuanced in the bacterial case, with RRF making more sensible predictions, with smaller error rates, than AB. CONCLUSIONS: Altogether, we demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals. We further showed that our RRF algorithm may deserve more scrutiny, which should be facilitated by the decreasing costs of both sequencing and phenotyping of large cohorts of individuals. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-019-5820-0) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6558885
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6558885 2019-06-13 Identifying genetic determinants of complex phenotypes from whole genome sequence data Long, George S. Hussen, Mohammed Dench, Jonathan Aris-Brosou, Stéphane BMC Genomics Research Article BioMed Central 2019-06-10 /pmc/articles/PMC6558885/ /pubmed/31182025 http://dx.doi.org/10.1186/s12864-019-5820-0 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Identifying genetic determinants of complex phenotypes from whole genome sequence data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6558885/ https://www.ncbi.nlm.nih.gov/pubmed/31182025 http://dx.doi.org/10.1186/s12864-019-5820-0
