
GraphPart: homology partitioning for biological sequence analysis

When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms hav...


Bibliographic Details
Main authors: Teufel, Felix; Gíslason, Magnús Halldór; Almagro Armenteros, José Juan; Johansen, Alexander Rosenberg; Winther, Ole; Nielsen, Henrik
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10578201/
https://www.ncbi.nlm.nih.gov/pubmed/37850036
http://dx.doi.org/10.1093/nargab/lqad088
_version_ 1785121468228567040
author Teufel, Felix
Gíslason, Magnús Halldór
Almagro Armenteros, José Juan
Johansen, Alexander Rosenberg
Winther, Ole
Nielsen, Henrik
collection PubMed
description When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms have been proposed for homology reduction, where sequences are removed until no too-closely related pairs remain. We present GraphPart, an algorithm for homology partitioning that divides the data such that closely related sequences always end up in the same partition, while keeping as many sequences as possible in the dataset. Evaluation of GraphPart on Protein, DNA and RNA datasets shows that it is capable of retaining a larger number of sequences per dataset, while providing homology separation on a par with reduction approaches.
format Online
Article
Text
id pubmed-10578201
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10578201 2023-10-17 GraphPart: homology partitioning for biological sequence analysis Teufel, Felix Gíslason, Magnús Halldór Almagro Armenteros, José Juan Johansen, Alexander Rosenberg Winther, Ole Nielsen, Henrik NAR Genom Bioinform Standard Article When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms have been proposed for homology reduction, where sequences are removed until no too-closely related pairs remain. We present GraphPart, an algorithm for homology partitioning that divides the data such that closely related sequences always end up in the same partition, while keeping as many sequences as possible in the dataset. Evaluation of GraphPart on Protein, DNA and RNA datasets shows that it is capable of retaining a larger number of sequences per dataset, while providing homology separation on a par with reduction approaches. Oxford University Press 2023-10-16 /pmc/articles/PMC10578201/ /pubmed/37850036 http://dx.doi.org/10.1093/nargab/lqad088 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title GraphPart: homology partitioning for biological sequence analysis
topic Standard Article
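The description in this record contrasts homology reduction (removing sequences) with homology partitioning (keeping related sequences together in one partition). The following is a minimal Python sketch of the partitioning idea only, and it is not the GraphPart algorithm described in the article. It assumes that the pairs of too-closely related sequences have already been computed by some alignment tool, and the function name `partition_by_homology` and its parameters are hypothetical. It groups sequences into connected components of the "related" graph and then assigns whole components to partitions, so no related pair is ever split and no sequence is removed.

```python
from collections import defaultdict


def partition_by_homology(ids, similar_pairs, n_partitions):
    """Assign ids to partitions so that every pair in similar_pairs
    ends up in the same partition (illustrative sketch, not GraphPart)."""
    # Union-find structure: each id starts as its own cluster root.
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every too-closely related pair.
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each connected component.
    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)

    # Greedy balancing: place the largest cluster first, always into
    # the partition that currently holds the fewest sequences.
    partitions = [[] for _ in range(n_partitions)]
    for members in sorted(clusters.values(), key=len, reverse=True):
        smallest = min(partitions, key=len)
        smallest.extend(members)
    return partitions


if __name__ == "__main__":
    ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
    pairs = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
    for k, part in enumerate(partition_by_homology(ids, pairs, 2)):
        print(k, part)
```

With real data, connected components under a fixed threshold can become very large, which makes balanced partitions impossible without removing some sequences. The record's description indicates that GraphPart keeps as many sequences as possible rather than all of them, so this sketch should be read as an illustration of the separation constraint and not of the published procedure.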