Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers

Large language models such as ChatGPT can produce increasingly realistic text, with unknown information on the accuracy and integrity of using these models in scientific writing. We gathered fifty research abstracts from five high-impact factor medical journals and asked ChatGPT to generate research...


Bibliographic Details
Main Authors: Gao, Catherine A., Howard, Frederick M., Markov, Nikolay S., Dyer, Emma C., Ramesh, Siddhi, Luo, Yuan, Pearson, Alexander T.
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10133283/
https://www.ncbi.nlm.nih.gov/pubmed/37100871
http://dx.doi.org/10.1038/s41746-023-00819-6
_version_ 1785031535066349568
author Gao, Catherine A.
Howard, Frederick M.
Markov, Nikolay S.
Dyer, Emma C.
Ramesh, Siddhi
Luo, Yuan
Pearson, Alexander T.
author_facet Gao, Catherine A.
Howard, Frederick M.
Markov, Nikolay S.
Dyer, Emma C.
Ramesh, Siddhi
Luo, Yuan
Pearson, Alexander T.
author_sort Gao, Catherine A.
collection PubMed
description Large language models such as ChatGPT can produce increasingly realistic text, with unknown information on the accuracy and integrity of using these models in scientific writing. We gathered fifty research abstracts from five high-impact factor medical journals and asked ChatGPT to generate research abstracts based on their titles and journals. Most generated abstracts were detected using an AI output detector, ‘GPT-2 Output Detector’, with % ‘fake’ scores (higher meaning more likely to be generated) of median [interquartile range] of 99.98% ‘fake’ [12.73%, 99.98%] compared with median 0.02% [IQR 0.02%, 0.09%] for the original abstracts. The AUROC of the AI output detector was 0.94. Generated abstracts scored lower than original abstracts when run through a plagiarism detector website and iThenticate (higher scores meaning more matching text found). When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT, but incorrectly identified 14% of original abstracts as being generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, though abstracts they suspected were generated were vaguer and more formulaic. ChatGPT writes believable scientific abstracts, though with completely generated data. Depending on publisher-specific guidelines, AI output detectors may serve as an editorial tool to help maintain scientific standards. The boundaries of ethical and acceptable use of large language models to help scientific writing are still being discussed, and different journals and conferences are adopting varying policies.
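The description reports detector results as median [IQR] ‘fake’ scores and an AUROC of 0.94. As a minimal sketch of how such summary statistics could be computed, the snippet below uses purely illustrative scores (not the study's data) and implements AUROC via the Mann-Whitney U formulation; the function names and values are assumptions, not the authors' code.

```python
import statistics

# Hypothetical detector 'fake' scores (0-100); higher means more
# likely AI-generated. These values are illustrative only.
original_scores = [0.02, 0.02, 0.05, 0.09, 0.30]      # real abstracts
generated_scores = [12.73, 85.0, 99.5, 99.98, 99.98]  # ChatGPT abstracts

def median_iqr(scores):
    """Return the median and the (Q1, Q3) interquartile range."""
    q = statistics.quantiles(scores, n=4)
    return statistics.median(scores), (q[0], q[2])

def auroc(neg, pos):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive scores above a randomly chosen negative,
    counting ties as half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(median_iqr(generated_scores))
print(auroc(original_scores, generated_scores))
```

With these toy values every generated score exceeds every original score, so the sketch reports an AUROC of 1.0; the study's real distributions overlap (note the 12.73% lower quartile), which is why its detector achieved 0.94 rather than perfect separation.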
format Online
Article
Text
id pubmed-10133283
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-101332832023-04-28 Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers Gao, Catherine A. Howard, Frederick M. Markov, Nikolay S. Dyer, Emma C. Ramesh, Siddhi Luo, Yuan Pearson, Alexander T. NPJ Digit Med Brief Communication Large language models such as ChatGPT can produce increasingly realistic text, with unknown information on the accuracy and integrity of using these models in scientific writing. We gathered fifty research abstracts from five high-impact factor medical journals and asked ChatGPT to generate research abstracts based on their titles and journals. Most generated abstracts were detected using an AI output detector, ‘GPT-2 Output Detector’, with % ‘fake’ scores (higher meaning more likely to be generated) of median [interquartile range] of 99.98% ‘fake’ [12.73%, 99.98%] compared with median 0.02% [IQR 0.02%, 0.09%] for the original abstracts. The AUROC of the AI output detector was 0.94. Generated abstracts scored lower than original abstracts when run through a plagiarism detector website and iThenticate (higher scores meaning more matching text found). When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT, but incorrectly identified 14% of original abstracts as being generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, though abstracts they suspected were generated were vaguer and more formulaic. ChatGPT writes believable scientific abstracts, though with completely generated data. Depending on publisher-specific guidelines, AI output detectors may serve as an editorial tool to help maintain scientific standards. The boundaries of ethical and acceptable use of large language models to help scientific writing are still being discussed, and different journals and conferences are adopting varying policies.
Nature Publishing Group UK 2023-04-26 /pmc/articles/PMC10133283/ /pubmed/37100871 http://dx.doi.org/10.1038/s41746-023-00819-6 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) .
spellingShingle Brief Communication
Gao, Catherine A.
Howard, Frederick M.
Markov, Nikolay S.
Dyer, Emma C.
Ramesh, Siddhi
Luo, Yuan
Pearson, Alexander T.
Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers
title Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers
title_full Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers
title_fullStr Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers
title_full_unstemmed Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers
title_short Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers
title_sort comparing scientific abstracts generated by chatgpt to real abstracts with detectors and blinded human reviewers
topic Brief Communication
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10133283/
https://www.ncbi.nlm.nih.gov/pubmed/37100871
http://dx.doi.org/10.1038/s41746-023-00819-6
work_keys_str_mv AT gaocatherinea comparingscientificabstractsgeneratedbychatgpttorealabstractswithdetectorsandblindedhumanreviewers
AT howardfrederickm comparingscientificabstractsgeneratedbychatgpttorealabstractswithdetectorsandblindedhumanreviewers
AT markovnikolays comparingscientificabstractsgeneratedbychatgpttorealabstractswithdetectorsandblindedhumanreviewers
AT dyeremmac comparingscientificabstractsgeneratedbychatgpttorealabstractswithdetectorsandblindedhumanreviewers
AT rameshsiddhi comparingscientificabstractsgeneratedbychatgpttorealabstractswithdetectorsandblindedhumanreviewers
AT luoyuan comparingscientificabstractsgeneratedbychatgpttorealabstractswithdetectorsandblindedhumanreviewers
AT pearsonalexandert comparingscientificabstractsgeneratedbychatgpttorealabstractswithdetectorsandblindedhumanreviewers