Performance analysis of large language models in the domain of legal argument mining

Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tun...

Bibliographic Details
Main authors:	Al Zubaer, Abdullah; Granitzer, Michael; Mitrović, Jelena
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2023
Subjects:	Artificial Intelligence
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691378/ https://www.ncbi.nlm.nih.gov/pubmed/38045763 http://dx.doi.org/10.3389/frai.2023.1278796

author	Al Zubaer, Abdullah; Granitzer, Michael; Mitrović, Jelena
collection	PubMed
description	Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining are yet to be realized and have not been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the models' performance considering diverse prompt formulation and example selection in the prompt via semantic search using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our results statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT-3.5 and GPT-4 in the F1-score for premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that the performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection. This suggests that local models are as semantically rich as the embeddings from the OpenAI model. Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.
format	Online Article Text
id	pubmed-10691378
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-10691378 2023-12-02 Front Artif Intell Frontiers Media S.A. 2023-11-17 /pmc/articles/PMC10691378/ /pubmed/38045763 http://dx.doi.org/10.3389/frai.2023.1278796 Text en Copyright © 2023 Al Zubaer, Granitzer and Mitrović. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Performance analysis of large language models in the domain of legal argument mining
topic	Artificial Intelligence
