PyKleeBarcode: Enabling representation of the whole animal kingdom in information space

As biological sequence databases continue growing, so does the insight they promise into the shape of the genetic diversity of life. However, to fulfil this promise the software must remain usable, be able to accommodate a large amount of data and allow use of modern high-performance comput...

Bibliographic Details
Main authors: Duchemin, Wandrille, Thaler, David S.
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10237437/
https://www.ncbi.nlm.nih.gov/pubmed/37267256
http://dx.doi.org/10.1371/journal.pone.0286314
_version_ 1785053155388555264
author Duchemin, Wandrille
Thaler, David S.
author_facet Duchemin, Wandrille
Thaler, David S.
author_sort Duchemin, Wandrille
collection PubMed
description As biological sequence databases continue growing, so does the insight they promise into the shape of the genetic diversity of life. However, to fulfil this promise the software must remain usable, be able to accommodate a large amount of data and allow use of modern high-performance computing infrastructure. In this study we present a reimplementation as well as an extension of a technique using indicator vectors to compute and visualize similarities between sets of nucleotide sequences. We provide a flexible and easy-to-use Python program relying on standard and open-source libraries. Our tool allows analysis of very large complements of sequences using code parallelization, as well as by providing routines to split a computational task into smaller and manageable subtasks whose results are then merged. This implementation also facilitates adding new sequences into an indicator vector-based representation without re-computing the whole set. The efficient synthesis of data into knowledge is no trivial matter given the size and rapid growth of biological sequence databases. Based on previous results regarding the properties of indicator vectors, the open-source approach proposed here efficiently and flexibly supports comparative analysis of genetic diversity at a large scale. Our software is freely available at: https://github.com/WandrilleD/pyKleeBarcode.
format Online
Article
Text
id pubmed-10237437
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10237437 2023-06-03 PyKleeBarcode: Enabling representation of the whole animal kingdom in information space Duchemin, Wandrille Thaler, David S. PLoS One Research Article As biological sequence databases continue growing, so does the insight they promise into the shape of the genetic diversity of life. However, to fulfil this promise the software must remain usable, be able to accommodate a large amount of data and allow use of modern high-performance computing infrastructure. In this study we present a reimplementation as well as an extension of a technique using indicator vectors to compute and visualize similarities between sets of nucleotide sequences. We provide a flexible and easy-to-use Python program relying on standard and open-source libraries. Our tool allows analysis of very large complements of sequences using code parallelization, as well as by providing routines to split a computational task into smaller and manageable subtasks whose results are then merged. This implementation also facilitates adding new sequences into an indicator vector-based representation without re-computing the whole set. The efficient synthesis of data into knowledge is no trivial matter given the size and rapid growth of biological sequence databases. Based on previous results regarding the properties of indicator vectors, the open-source approach proposed here efficiently and flexibly supports comparative analysis of genetic diversity at a large scale. Our software is freely available at: https://github.com/WandrilleD/pyKleeBarcode.
Public Library of Science 2023-06-02 /pmc/articles/PMC10237437/ /pubmed/37267256 http://dx.doi.org/10.1371/journal.pone.0286314 Text en © 2023 Duchemin, Thaler https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Duchemin, Wandrille
Thaler, David S.
PyKleeBarcode: Enabling representation of the whole animal kingdom in information space
title PyKleeBarcode: Enabling representation of the whole animal kingdom in information space
title_full PyKleeBarcode: Enabling representation of the whole animal kingdom in information space
title_fullStr PyKleeBarcode: Enabling representation of the whole animal kingdom in information space
title_full_unstemmed PyKleeBarcode: Enabling representation of the whole animal kingdom in information space
title_short PyKleeBarcode: Enabling representation of the whole animal kingdom in information space
title_sort pykleebarcode: enabling representation of the whole animal kingdom in information space
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10237437/
https://www.ncbi.nlm.nih.gov/pubmed/37267256
http://dx.doi.org/10.1371/journal.pone.0286314
work_keys_str_mv AT ducheminwandrille pykleebarcodeenablingrepresentationofthewholeanimalkingdomininformationspace
AT thalerdavids pykleebarcodeenablingrepresentationofthewholeanimalkingdomininformationspace