A Compressed Language Model Embedding Dataset of ICD 10 CM Descriptions

This paper presents novel datasets providing numerical representations of ICD-10-CM codes by generating description embeddings using a large language model followed by a dimension reduction via autoencoder. The embeddings serve as informative input features for machine learning models by capturing r...

Bibliographic Details
Main authors: Kane, Michael J., King, Casey, Esserman, Denise, Latham, Nancy K., Greene, Erich J., Ganz, David A.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10168496/
https://www.ncbi.nlm.nih.gov/pubmed/37162903
http://dx.doi.org/10.1101/2023.04.24.23289046
author Kane, Michael J.
King, Casey
Esserman, Denise
Latham, Nancy K.
Greene, Erich J.
Ganz, David A.
collection PubMed
description This paper presents novel datasets providing numerical representations of ICD-10-CM codes by generating description embeddings using a large language model followed by a dimension reduction via autoencoder. The embeddings serve as informative input features for machine learning models by capturing relationships among categories and preserving inherent context information. The model generating the data was validated in two ways. First, the dimension reduction was validated using an autoencoder, and secondly, a supervised model was created to estimate the ICD-10-CM hierarchical categories. Results show that the dimension of the data can be reduced to as few as 10 dimensions while maintaining the ability to reproduce the original embeddings, with the fidelity decreasing as the reduced-dimension representation decreases. Multiple compression levels are provided, allowing users to choose as per their requirements. The readily available datasets of ICD-10-CM codes are anticipated to be highly valuable for researchers in biomedical informatics, enabling more advanced analyses in the field. This approach has the potential to significantly improve the utility of ICD-10-CM codes in the biomedical domain.
format Online
Article
Text
id pubmed-10168496
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10168496 2023-05-10 A Compressed Language Model Embedding Dataset of ICD 10 CM Descriptions medRxiv Article Cold Spring Harbor Laboratory 2023-05-15 /pmc/articles/PMC10168496/ /pubmed/37162903 http://dx.doi.org/10.1101/2023.04.24.23289046 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
title A Compressed Language Model Embedding Dataset of ICD 10 CM Descriptions
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10168496/
https://www.ncbi.nlm.nih.gov/pubmed/37162903
http://dx.doi.org/10.1101/2023.04.24.23289046