A domain sequence approach to pangenomics: applications to Escherichia coli
The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the ann...
Main authors: | Snipen, Lars-Gustav; Ussery, David W |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | F1000Research 2013 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3901455/ https://www.ncbi.nlm.nih.gov/pubmed/24555018 http://dx.doi.org/10.12688/f1000research.1-19.v2 |
_version_ | 1782300851479511040 |
---|---|
author | Snipen, Lars-Gustav Ussery, David W |
author_facet | Snipen, Lars-Gustav Ussery, David W |
author_sort | Snipen, Lars-Gustav |
collection | PubMed |
description | The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli we find on average around 4500 proteins having hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps' law analysis indicates the population of domain sequences is larger, but this analysis is also very sensitive to small changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families, which are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is needed for the future explosion in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored. |
format | Online Article Text |
id | pubmed-3901455 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | F1000Research |
record_format | MEDLINE/PubMed |
spelling | pubmed-3901455 2014-01-27 A domain sequence approach to pangenomics: applications to Escherichia coli Snipen, Lars-Gustav Ussery, David W F1000Res Research Article The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli we find on average around 4500 proteins having hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps' law analysis indicates the population of domain sequences is larger, but this analysis is also very sensitive to small changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families, which are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is needed for the future explosion in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored. F1000Research 2013-05-29 /pmc/articles/PMC3901455/ /pubmed/24555018 http://dx.doi.org/10.12688/f1000research.1-19.v2 Text en Copyright: © 2013 Snipen LG et al.
http://creativecommons.org/licenses/by/3.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. http://creativecommons.org/publicdomain/zero/1.0/ Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication). |
spellingShingle | Research Article Snipen, Lars-Gustav Ussery, David W A domain sequence approach to pangenomics: applications to Escherichia coli |
title | A domain sequence approach to pangenomics: applications to Escherichia coli |
title_full | A domain sequence approach to pangenomics: applications to Escherichia coli |
title_fullStr | A domain sequence approach to pangenomics: applications to Escherichia coli |
title_full_unstemmed | A domain sequence approach to pangenomics: applications to Escherichia coli |
title_short | A domain sequence approach to pangenomics: applications to Escherichia coli |
title_sort | domain sequence approach to pangenomics: applications to escherichia coli |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3901455/ https://www.ncbi.nlm.nih.gov/pubmed/24555018 http://dx.doi.org/10.12688/f1000research.1-19.v2 |
work_keys_str_mv | AT snipenlarsgustav adomainsequenceapproachtopangenomicsapplicationstoescherichiacoli AT usserydavidw adomainsequenceapproachtopangenomicsapplicationstoescherichiacoli AT snipenlarsgustav domainsequenceapproachtopangenomicsapplicationstoescherichiacoli AT usserydavidw domainsequenceapproachtopangenomicsapplicationstoescherichiacoli |
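
The description field defines a domain sequence as the ordered sequence of domains in a protein and proposes grouping proteins that share the same one. A minimal Python sketch of that grouping idea follows. It is not the authors' implementation: the input layout (a mapping from protein ID to `(start, domain_name)` hits), the function names, and the toy protein IDs are all assumptions made for illustration, and proteins without any domain hit are simply skipped.

```python
from collections import defaultdict


def domain_sequence(hits):
    """Return the domain names of one protein, ordered by start position.

    `hits` is a list of (start, domain_name) pairs (hypothetical layout).
    """
    return tuple(name for _start, name in sorted(hits))


def cluster_by_domain_sequence(proteins):
    """Group protein IDs that share an identical ordered domain sequence."""
    families = defaultdict(list)
    for protein_id, hits in proteins.items():
        if not hits:
            # No domain hits: the protein has no domain sequence to cluster on.
            continue
        families[domain_sequence(hits)].append(protein_id)
    return dict(families)


# Toy input, invented for illustration only (not data from the article).
proteins = {
    "g1_p1": [(5, "ABC_tran"), (210, "ABC_membrane")],
    "g2_p7": [(8, "ABC_tran"), (190, "ABC_membrane")],
    "g2_p9": [(12, "ABC_membrane"), (300, "ABC_tran")],  # same domains, other order
    "g1_p4": [],  # no hits
}

families = cluster_by_domain_sequence(proteins)
for sequence, members in families.items():
    print(" > ".join(sequence), members)
```

Because the key is the ordered tuple of domain names, two proteins with the same domains in a different order land in different families, and small differences in predicted start positions do not affect the grouping. Each protein is processed once, so the work grows linearly with the number of proteins.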