
Selecting Clustering Algorithms for Identity-By-Descent Mapping

Bibliographic Details
Main authors:	Shemirani, Ruhollah; Belbin, Gillian M; Burghardt, Keith; Lerman, Kristina; Avery, Christy L; Kenny, Eimear E; Gignoux, Christopher R; Ambite, José Luis
Format:	Online Article Text
Language:	English
Published:	2023
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9782725/ https://www.ncbi.nlm.nih.gov/pubmed/36540970

collection	PubMed
description	Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks using IBD mapping. Clustering algorithms play an important role in finding these groups accurately and at scale. We set out to analyze the fitness of commonly used, fast, and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and used it to compare the statistical power of clustering algorithms by simulating 2.3 million clusters across 850 experiments. We found that the Infomap and Markov Clustering (MCL) community detection methods have high statistical power in most scenarios. They yield a 30% increase in power compared to the current state-of-the-art approach, with a runtime three orders of magnitude lower. We also found that standard clustering metrics, such as modularity, cannot predict the statistical power of algorithms in IBD mapping applications. We extend our findings to real datasets by analyzing the Population Architecture using Genomics and Epidemiology (PAGE) Study dataset with 51,000 samples and 2 million shared segments on Chromosome 1, resulting in the extraction of 39 million local IBD clusters. We demonstrate the power of our approach by recovering signals of rare genetic variation in the Whole-Exome Sequence data of 200,000 individuals in the UK Biobank. We provide an efficient implementation to enable clustering at scale for IBD mapping for various populations and scenarios.
id	pubmed-9782725
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
journal	Pac Symp Biocomput
license	Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License. https://creativecommons.org/licenses/by-nc/4.0/