
A domain sequence approach to pangenomics: applications to Escherichia coli

The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the ann...

Bibliographic Details
Main authors: Snipen, Lars-Gustav; Ussery, David W
Format: Online Article Text
Language: English
Published: F1000Research 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3901455/
https://www.ncbi.nlm.nih.gov/pubmed/24555018
http://dx.doi.org/10.12688/f1000research.1-19.v2
author Snipen, Lars-Gustav
Ussery, David W
collection PubMed
description The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli we find on average around 4500 proteins having hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps' law analysis indicates the population of domain sequences is larger, but this analysis is also very sensitive to small changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families, which are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is needed to cope with the expected explosion in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored.
format Online
Article
Text
id pubmed-3901455
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher F1000Research
record_format MEDLINE/PubMed
spelling pubmed-3901455 2014-01-27 F1000Res Research Article F1000Research 2013-05-29 Text en Copyright: © 2013 Snipen LG et al.
http://creativecommons.org/licenses/by/3.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
http://creativecommons.org/publicdomain/zero/1.0/ Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
title A domain sequence approach to pangenomics: applications to Escherichia coli
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3901455/
https://www.ncbi.nlm.nih.gov/pubmed/24555018
http://dx.doi.org/10.12688/f1000research.1-19.v2