A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen

BACKGROUND: Sequence similarity networks are useful for classifying and characterizing biologically important proteins. Threshold-based approaches to similarity network construction using exact distance measures are prohibitively slow to compute and rely on the difficult task of selecting an appropr...

Bibliographic Details
Main Authors:	Catanese, Helen N., Brayton, Kelly A., Gebremedhin, Assefaw H.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6291930/ https://www.ncbi.nlm.nih.gov/pubmed/30541438 http://dx.doi.org/10.1186/s12859-018-2453-2

_version_	1783380308590067712
author	Catanese, Helen N. Brayton, Kelly A. Gebremedhin, Assefaw H.
author_facet	Catanese, Helen N. Brayton, Kelly A. Gebremedhin, Assefaw H.
author_sort	Catanese, Helen N.
collection	PubMed
description	BACKGROUND: Sequence similarity networks are useful for classifying and characterizing biologically important proteins. Threshold-based approaches to similarity network construction using exact distance measures are prohibitively slow to compute and rely on the difficult task of selecting an appropriate threshold, while similarity networks based on approximate distance calculations compromise useful structural information. RESULTS: We present an alternative network representation for a set of sequence data that overcomes these drawbacks. In our model, called the Directed Weighted All Nearest Neighbors (DiWANN) network, each sequence is represented by a node and is connected via a directed edge to only the closest sequence, or sequences in the case of ties, in the dataset. Our contributions span several aspects. Specifically, we: (i) Apply an all nearest neighbors network model to protein sequence data from three different applications and examine the structural properties of the networks; (ii) Compare the model against threshold-based networks to validate their semantic equivalence, and demonstrate the relative advantages the model offers; (iii) Demonstrate the model’s resilience to missing sequences; and (iv) Develop an efficient algorithm for constructing a DiWANN network from a set of sequences. We find that the DiWANN network representation attains similar semantic properties to threshold-based graphs, while avoiding weaknesses of both high and low threshold graphs. Additionally, we find that approximate distance networks, using BLAST bitscores in place of exact edit distances, can cause significant loss of structural information. We show that the proposed DiWANN network construction algorithm provides a fourfold speedup over a standard threshold-based approach to network construction. We also identify a relationship between the centrality of a sequence in a similarity network of an Anaplasma marginale short sequence repeat dataset and how broadly that sequence is dispersed geographically. CONCLUSION: We demonstrate that using approximate distance measures to rapidly construct similarity networks may lead to significant deficiencies in the structure of that network in terms of centrality and clustering analyses. We present a new network representation that maintains the structural semantics of threshold-based networks while increasing connectedness, and an algorithm for constructing the network using exact distance measures in a fraction of the time it would take to build a threshold-based equivalent.
format	Online Article Text
id	pubmed-6291930
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6291930 2018-12-17 A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen Catanese, Helen N. Brayton, Kelly A. Gebremedhin, Assefaw H. BMC Bioinformatics Research Article BioMed Central 2018-12-12 /pmc/articles/PMC6291930/ /pubmed/30541438 http://dx.doi.org/10.1186/s12859-018-2453-2 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Catanese, Helen N. Brayton, Kelly A. Gebremedhin, Assefaw H. A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen
title	A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen
title_full	A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen
title_fullStr	A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen
title_full_unstemmed	A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen
title_short	A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen
title_sort	nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6291930/ https://www.ncbi.nlm.nih.gov/pubmed/30541438 http://dx.doi.org/10.1186/s12859-018-2453-2
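
Illustrative note: the abstract describes the DiWANN model as connecting each sequence, by a directed weighted edge, only to its closest sequence (or sequences, in the case of ties). The following is a minimal brute-force sketch of that definition, assuming plain Levenshtein edit distance as the exact distance measure. It is not the authors' efficient construction algorithm, which the record does not detail, and the function names edit_distance and nearest_neighbor_edges are hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def nearest_neighbor_edges(seqs: List[str]) -> Dict[int, List[Tuple[int, int]]]:
    """Map each sequence index to (neighbor_index, distance) pairs.

    Each node gets a directed edge to every sequence at its minimum
    distance, so ties produce more than one outgoing edge.
    Brute force: all pairs are compared (assumption, for clarity only).
    """
    edges: Dict[int, List[Tuple[int, int]]] = {}
    for i, s in enumerate(seqs):
        dists = [(j, edit_distance(s, t)) for j, t in enumerate(seqs) if j != i]
        if not dists:  # a single sequence has no neighbors
            edges[i] = []
            continue
        best = min(d for _, d in dists)
        edges[i] = [(j, d) for j, d in dists if d == best]
    return edges


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy = ["AAAA", "AAAT", "AATT", "GGGG"]
    for node, out_edges in nearest_neighbor_edges(toy).items():
        print(node, out_edges)
```

In this toy input, the outlier sequence still receives outgoing edges to its nearest neighbors, whereas a low-threshold network would leave it isolated. This is one way to read the abstract's remark about increased connectedness.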
