PubChem chemical structure standardization
BACKGROUND: PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an...
Main authors: | Hähnke, Volker D., Kim, Sunghwan, Bolton, Evan E. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2018 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6086778/ https://www.ncbi.nlm.nih.gov/pubmed/30097821 http://dx.doi.org/10.1186/s13321-018-0293-8 |
_version_ | 1783346559950258176 |
---|---|
author | Hähnke, Volker D. Kim, Sunghwan Bolton, Evan E. |
author_facet | Hähnke, Volker D. Kim, Sunghwan Bolton, Evan E. |
author_sort | Hähnke, Volker D. |
collection | PubMed |
description | BACKGROUND: PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure. RESULTS: The observed rejection rate for substances processed by PubChem standardization was 0.36%, which is predominantly attributed to structures with invalid atom valences that cannot be readily corrected without additional information from contributors. Of all structures that pass standardization, 44% are modified in the process, reducing the count of unique structures from 53,574,724 in substance to 45,808,881 in compound as identified by de-aromatized canonical isomeric SMILES. Even though the processing time is very low on average (only 0.4% of structures have individual standardization time above 0.1 s), total standardization time is completely dominated by edge cases: 90% of the time to standardize all structures in PubChem substance is spent on the 2.05% of structures with the highest individual standardization time. It is worth noting that 60% of the structures obtained from PubChem structure standardization are not identical to the chemical structure resulting from the InChI (primarily due to preferences for a different tautomeric form). CONCLUSIONS: Standardization of chemical structures is complicated by the diversity of chemical information and its representation approaches. The PubChem standardization is an effective and efficient tool to account for molecular diversity and to eliminate invalid/incomplete structures. Further development will concentrate on improved tautomer consideration and an expanded stereocenter definition. Modifications are difficult to thoroughly validate, with slight changes often affecting many thousands of structures and various edge cases. The PubChem structure standardization service is accessible as a public resource (https://pubchem.ncbi.nlm.nih.gov/standardize), and via programmatic interfaces. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13321-018-0293-8) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-6086778 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-6086778 2018-08-24 PubChem chemical structure standardization Hähnke, Volker D. Kim, Sunghwan Bolton, Evan E. J Cheminform Research Article Springer International Publishing 2018-08-10 /pmc/articles/PMC6086778/ /pubmed/30097821 http://dx.doi.org/10.1186/s13321-018-0293-8 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Hähnke, Volker D. Kim, Sunghwan Bolton, Evan E. PubChem chemical structure standardization |
title | PubChem chemical structure standardization |
title_full | PubChem chemical structure standardization |
title_fullStr | PubChem chemical structure standardization |
title_full_unstemmed | PubChem chemical structure standardization |
title_short | PubChem chemical structure standardization |
title_sort | pubchem chemical structure standardization |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6086778/ https://www.ncbi.nlm.nih.gov/pubmed/30097821 http://dx.doi.org/10.1186/s13321-018-0293-8 |
work_keys_str_mv | AT hahnkevolkerd pubchemchemicalstructurestandardization AT kimsunghwan pubchemchemicalstructurestandardization AT boltonevane pubchemchemicalstructurestandardization |
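
The abstract reports two different figures for the effect of standardization: 44% of passing structures are modified, while the count of unique structures falls from 53,574,724 to 45,808,881. The following minimal sketch only works out the second figure as an absolute and relative reduction, using the two counts quoted in the abstract. The function and variable names are illustrative assumptions and are not part of any PubChem tool.

```python
# Counts of unique structures quoted in the abstract
# (identified by de-aromatized canonical isomeric SMILES).
substance_unique = 53_574_724  # before standardization (Substance)
compound_unique = 45_808_881   # after standardization (Compound)


def reduction(before: int, after: int) -> tuple[int, float]:
    """Return the absolute and fractional drop from `before` to `after`."""
    removed = before - after
    return removed, removed / before


removed, fraction = reduction(substance_unique, compound_unique)
print(f"Unique structures merged away: {removed:,}")
print(f"Relative reduction: {fraction:.1%}")
```

The relative reduction comes out well below 44%, which fits the abstract's wording: a structure can be modified during standardization without becoming identical to another one.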