About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature

BACKGROUND: Although Escherichia coli (E. coli) is the most studied prokaryote organism in the history of life sciences, many molecular mechanisms and gene functions encoded in its genome remain to be discovered. This work aims at quantifying the illumination of the E. coli gene function space by th...


Bibliographic Details
Main Authors: Tantoso, Erwin, Eisenhaber, Birgit, Sinha, Swati, Jensen, Lars Juhl, Eisenhaber, Frank
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9976479/
https://www.ncbi.nlm.nih.gov/pubmed/36855185
http://dx.doi.org/10.1186/s13062-023-00362-0
author Tantoso, Erwin
Eisenhaber, Birgit
Sinha, Swati
Jensen, Lars Juhl
Eisenhaber, Frank
collection PubMed
description BACKGROUND: Although Escherichia coli (E. coli) is the most studied prokaryote organism in the history of life sciences, many molecular mechanisms and gene functions encoded in its genome remain to be discovered. This work aims at quantifying the illumination of the E. coli gene function space by the scientific literature and how close we are towards the goal of a complete list of E. coli gene functions. RESULTS: The scientific literature about E. coli protein-coding genes has been mapped onto the genome via mentions of names for genomic regions in scientific articles, both for the strain K-12 MG1655 and for the 95%-threshold softcore genome of 1324 E. coli strains with known complete genomes. The article match was quantified with the ratio of a given gene name's occurrences to the mentions of any gene names in the paper. The various genome regions have an extremely uneven literature coverage. A group of elite genes with ≥ 100 full publication equivalents (FPEs; FPE = 1 is an idealized publication devoted to just a single gene) attracts the lion's share of the papers. For K-12, ~65% of the literature covers just 342 elite genes; for the softcore genome, ~68% of the FPEs are about only 342 elite gene families (GFs). We also find that most genes/GFs have at least one mention in a dedicated scientific article (with the exception of at least 137 protein-coding transcripts for K-12 and 26 GFs from the softcore genome). Whereas the literature growth rates were highest for uncharacterized or understudied genes until 2005–2010 compared with other groups of genes, they became negative thereafter. At the same time, literature for already well-studied genes, from the threshold T10 (≥ 10 FPEs) upward, started to grow explosively. Typically, a body of ~20 actual articles generated over ~15 years of research effort was necessary to reach T10. Lineage-specific co-occurrence analysis of genes belonging to the accessory genome of E. coli, together with genomic co-localization and sequence-analytic exploration, hints that the previously completely uncharacterized genes yahV and yddL are associated with osmotic stress response/motility mechanisms. CONCLUSION: If the numbers of scientific articles about uncharacterized and understudied genes remain at least at present levels, full gene function lists for the strain K-12 MG1655 and the E. coli softcore genome are within reach in the next 25–30 years. Once the literature body for a gene crosses 10 FPEs, most of the critical fundamental research risk appears to be overcome and steady incremental research becomes possible. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13062-023-00362-0.
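The per-article ratio described in the abstract can be illustrated with a minimal sketch. The abstract states only that an article's match to a gene is the ratio of that gene name's occurrences to all gene-name mentions in the paper; summing these ratios over papers to obtain a gene's FPE total, the dictionary input format, and the function name `full_publication_equivalents` are assumptions made here for illustration and are not taken from the article. The toy counts are invented.

```python
from collections import defaultdict


def full_publication_equivalents(papers):
    """Accumulate a per-gene score from per-paper mention counts.

    `papers` is an iterable of dicts mapping gene name -> number of
    mentions in one paper (hypothetical input format). Each paper
    contributes count / total_mentions to a gene, so a paper devoted
    to a single gene contributes exactly 1.0 to that gene.
    """
    fpe = defaultdict(float)
    for mentions in papers:
        total = sum(mentions.values())
        if total == 0:
            continue  # paper mentions no gene names; nothing to assign
        for gene, count in mentions.items():
            fpe[gene] += count / total
    return dict(fpe)


# Toy data (invented): three papers with gene-name mention counts.
papers = [
    {"lacZ": 8, "yahV": 2},
    {"lacZ": 5},
    {"yahV": 1, "yddL": 1},
]
fpe = full_publication_equivalents(papers)
```

Under this assumed aggregation, the scores of all genes add up to the number of papers that mention at least one gene, which matches the reading of FPE = 1 as one idealized single-gene publication.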
format Online
Article
Text
id pubmed-9976479
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9976479 2023-03-02 Biol Direct Research BioMed Central 2023-02-28 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature
topic Research