GraphPart: homology partitioning for biological sequence analysis
When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms hav...
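Main Authors: | Teufel, Felix; Gíslason, Magnús Halldór; Almagro Armenteros, José Juan; Johansen, Alexander Rosenberg; Winther, Ole; Nielsen, Henrik |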
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10578201/ https://www.ncbi.nlm.nih.gov/pubmed/37850036 http://dx.doi.org/10.1093/nargab/lqad088 |
_version_ | 1785121468228567040 |
---|---|
author | Teufel, Felix Gíslason, Magnús Halldór Almagro Armenteros, José Juan Johansen, Alexander Rosenberg Winther, Ole Nielsen, Henrik |
author_sort | Teufel, Felix |
collection | PubMed |
description | When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms have been proposed for homology reduction, where sequences are removed until no too-closely related pairs remain. We present GraphPart, an algorithm for homology partitioning that divides the data such that closely related sequences always end up in the same partition, while keeping as many sequences as possible in the dataset. Evaluation of GraphPart on Protein, DNA and RNA datasets shows that it is capable of retaining a larger number of sequences per dataset, while providing homology separation on a par with reduction approaches. |
format | Online Article Text |
id | pubmed-10578201 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10578201 2023-10-17 GraphPart: homology partitioning for biological sequence analysis Teufel, Felix Gíslason, Magnús Halldór Almagro Armenteros, José Juan Johansen, Alexander Rosenberg Winther, Ole Nielsen, Henrik NAR Genom Bioinform Standard Article When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms have been proposed for homology reduction, where sequences are removed until no too-closely related pairs remain. We present GraphPart, an algorithm for homology partitioning that divides the data such that closely related sequences always end up in the same partition, while keeping as many sequences as possible in the dataset. Evaluation of GraphPart on Protein, DNA and RNA datasets shows that it is capable of retaining a larger number of sequences per dataset, while providing homology separation on a par with reduction approaches. Oxford University Press 2023-10-16 /pmc/articles/PMC10578201/ /pubmed/37850036 http://dx.doi.org/10.1093/nargab/lqad088 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Standard Article Teufel, Felix Gíslason, Magnús Halldór Almagro Armenteros, José Juan Johansen, Alexander Rosenberg Winther, Ole Nielsen, Henrik GraphPart: homology partitioning for biological sequence analysis |
title | GraphPart: homology partitioning for biological sequence analysis |
topic | Standard Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10578201/ https://www.ncbi.nlm.nih.gov/pubmed/37850036 http://dx.doi.org/10.1093/nargab/lqad088 |
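
The description above contrasts homology reduction (removing sequences) with homology partitioning (keeping related sequences together). As a rough illustration of the partitioning idea only, the sketch below groups sequences that are linked by a "too closely related" pair and then places each whole group into one partition. It does not attempt to reproduce GraphPart's actual procedure, which this record does not describe. The function name `partition_by_homology`, the sequence IDs, and the list of similar pairs are all hypothetical, and the pairs are assumed to come from some external alignment step with a chosen similarity threshold.

```python
from collections import defaultdict


def partition_by_homology(ids, similar_pairs, n_partitions):
    """Toy homology partitioning (NOT the GraphPart algorithm).

    ids            -- sequence identifiers (hypothetical)
    similar_pairs  -- (a, b) pairs judged too closely related (assumed input)
    n_partitions   -- number of partitions to produce
    """
    # Union-find: each sequence starts as its own group.
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every too-closely-related pair.
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each group (connected component).
    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)

    # Greedy balancing: largest groups first, each into the currently
    # smallest partition. A group is never split across partitions.
    partitions = [[] for _ in range(n_partitions)]
    for members in sorted(groups.values(), key=len, reverse=True):
        min(partitions, key=len).extend(members)
    return partitions


# Hypothetical example: six sequences, three related pairs, two partitions.
example_ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
example_pairs = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
example_parts = partition_by_homology(example_ids, example_pairs, 2)
print(example_parts)
```

In this toy version every sequence is retained and no related pair is split. On real data a single linked group can grow very large and make balanced partitions impossible, which is one reason a practical method may still need to remove some sequences.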