Fast, low-memory detection and localization of large, polymorphic inversions from SNPs

BACKGROUND: Large (>1 Mb), polymorphic inversions have substantial impacts on population structure and maintenance of genotypes. These large inversions can be detected from single nucleotide polymorphism (SNP) data using unsupervised learning techniques like PCA. Construction and analysis of a fe...

Bibliographic Details
Main authors: Nowling, Ronald J., Fallas-Moya, Fabian, Sadovnik, Amir, Emrich, Scott, Aleck, Matthew, Leskiewicz, Daniel, Peters, John G.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8784018/
https://www.ncbi.nlm.nih.gov/pubmed/35116204
http://dx.doi.org/10.7717/peerj.12831
_version_ 1784638655752568832
author Nowling, Ronald J.
Fallas-Moya, Fabian
Sadovnik, Amir
Emrich, Scott
Aleck, Matthew
Leskiewicz, Daniel
Peters, John G.
author_facet Nowling, Ronald J.
Fallas-Moya, Fabian
Sadovnik, Amir
Emrich, Scott
Aleck, Matthew
Leskiewicz, Daniel
Peters, John G.
author_sort Nowling, Ronald J.
collection PubMed
description BACKGROUND: Large (>1 Mb), polymorphic inversions have substantial impacts on population structure and maintenance of genotypes. These large inversions can be detected from single nucleotide polymorphism (SNP) data using unsupervised learning techniques like PCA. Construction and analysis of a feature matrix from millions of SNPs requires a large amount of memory and limits the sizes of data sets that can be analyzed. METHODS: We propose using feature hashing to construct a feature matrix from a VCF file of SNPs in order to reduce memory usage. The matrix is constructed in a streaming fashion such that the entire VCF file is never loaded into memory at one time. RESULTS: When evaluated on Anopheles mosquito and Drosophila fly data sets, our approach reduced memory usage by 97% with minimal reductions in accuracy for inversion detection and localization tasks. CONCLUSION: With these changes, inversions in larger data sets can be analyzed easily and efficiently on common laptop and desktop computers. Our method is publicly available through our open-source inversion analysis software, Asaph.
format Online
Article
Text
id pubmed-8784018
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-8784018 2022-02-02 Fast, low-memory detection and localization of large, polymorphic inversions from SNPs Nowling, Ronald J. Fallas-Moya, Fabian Sadovnik, Amir Emrich, Scott Aleck, Matthew Leskiewicz, Daniel Peters, John G. PeerJ Bioinformatics BACKGROUND: Large (>1 Mb), polymorphic inversions have substantial impacts on population structure and maintenance of genotypes. These large inversions can be detected from single nucleotide polymorphism (SNP) data using unsupervised learning techniques like PCA. Construction and analysis of a feature matrix from millions of SNPs requires a large amount of memory and limits the sizes of data sets that can be analyzed. METHODS: We propose using feature hashing to construct a feature matrix from a VCF file of SNPs in order to reduce memory usage. The matrix is constructed in a streaming fashion such that the entire VCF file is never loaded into memory at one time. RESULTS: When evaluated on Anopheles mosquito and Drosophila fly data sets, our approach reduced memory usage by 97% with minimal reductions in accuracy for inversion detection and localization tasks. CONCLUSION: With these changes, inversions in larger data sets can be analyzed easily and efficiently on common laptop and desktop computers. Our method is publicly available through our open-source inversion analysis software, Asaph. PeerJ Inc. 2022-01-20 /pmc/articles/PMC8784018/ /pubmed/35116204 http://dx.doi.org/10.7717/peerj.12831 Text en © 2022 Nowling et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
spellingShingle Bioinformatics
Nowling, Ronald J.
Fallas-Moya, Fabian
Sadovnik, Amir
Emrich, Scott
Aleck, Matthew
Leskiewicz, Daniel
Peters, John G.
Fast, low-memory detection and localization of large, polymorphic inversions from SNPs
title Fast, low-memory detection and localization of large, polymorphic inversions from SNPs
title_full Fast, low-memory detection and localization of large, polymorphic inversions from SNPs
title_fullStr Fast, low-memory detection and localization of large, polymorphic inversions from SNPs
title_full_unstemmed Fast, low-memory detection and localization of large, polymorphic inversions from SNPs
title_short Fast, low-memory detection and localization of large, polymorphic inversions from SNPs
title_sort fast, low-memory detection and localization of large, polymorphic inversions from snps
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8784018/
https://www.ncbi.nlm.nih.gov/pubmed/35116204
http://dx.doi.org/10.7717/peerj.12831
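
The abstract describes building a feature matrix from a VCF file with feature hashing, one record at a time, so that the file is never held in memory as a whole. The sketch below illustrates that general idea only; it is not taken from Asaph. The function names, the use of MD5 as the hash function, the signed-hashing variant, and the choice of one feature per (chromosome, position, allele) triple are all assumptions made for illustration. It also assumes the GT subfield comes first in each sample column.

```python
import hashlib


def hashed_index_and_sign(key, n_features):
    """Map a feature name to a column index and a +1/-1 sign.

    Hypothetical helper: MD5 is used only because it is deterministic
    across runs (Python's built-in hash() of str is salted per process).
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "little")
    index = (value >> 1) % n_features
    sign = 1.0 if value & 1 else -1.0
    return index, sign


def stream_vcf_to_hashed_matrix(lines, n_features):
    """Build a samples x n_features matrix from an iterable of VCF lines.

    Only one VCF line is processed at a time, so memory use depends on the
    number of samples and on n_features, not on the number of SNPs.
    """
    samples = None
    matrix = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            matrix = [[0.0] * n_features for _ in samples]
            continue
        if matrix is None:
            raise ValueError("header line (#CHROM ...) must precede records")
        chrom, pos = fields[0], fields[1]
        for row, call in zip(matrix, fields[9:]):
            # Assumption: GT is the first colon-separated subfield.
            genotype = call.split(":")[0].replace("|", "/")
            for allele in genotype.split("/"):
                if allele == ".":
                    continue  # missing call contributes nothing
                key = f"{chrom}:{pos}:{allele}"
                index, sign = hashed_index_and_sign(key, n_features)
                row[index] += sign
    return samples, matrix
```

Passing an open file handle as `lines` keeps the streaming property, since Python file objects yield one line at a time. The resulting matrix could then be handed to PCA; a smaller `n_features` saves memory at the cost of more hash collisions.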