A data package for abstractive opinion summarization, title generation, and rating-based sentiment prediction for airline reviews


Bibliographic Details
Main authors: Syed, Ayesha Ayub, Gaol, Ford Lumban, Boediman, Alfred, Matsuo, Tokuro, Budiharto, Widodo
Format: Online Article Text
Language: English
Published: Elsevier 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10504493/
https://www.ncbi.nlm.nih.gov/pubmed/37720686
http://dx.doi.org/10.1016/j.dib.2023.109535
Description
Summary: Customer reviews are valuable resources containing customer opinions and sentiments toward a product. Reviews are informative but can be lengthy or repetitive, which calls for opinion summarization systems that retain only the significant opinion information from the review. Abstractive summarization is a form of text summarization that generates a summary mimicking a human-written summary [1]. When pretrained language models are fine-tuned for abstractive review summarization, a problem known as ‘domain shift’ usually occurs, because the source and target domains exhibit data from different distributions [2]. This issue degrades model performance on the target domain. This paper contributes a data package comprising three datasets: an annotated abstractive summarization dataset (annotated_abs_summ) of airline reviews containing 500 review and abstractive summary pairs; a dataset (review_titles_data) of 7079 review and review title pairs for review title generation or domain adaptive training [3], which addresses the domain shift problem in abstractive opinion summarization; and an annotated reviews dataset (annotated_sentiment) for rating-based sentiment classification. All datasets were collected from the Skytrax Review Portal via web scraping using the Python programming language. The datasets have several potential use cases. The abstractive summarization dataset can serve as a benchmark dataset for airline review summarization. The dataset for domain adaptive training can be used as a standalone dataset for review title generation. The sentiment analysis dataset is multipurpose: it has columns such as user rating and recommendation value, which can be used for statistical analysis, such as finding the correlation between these data items, as well as for other Natural Language Processing (NLP) tasks such as predicting the rating or recommendation value from the customer reviews.
The datasets can be extended using various data augmentation techniques [4,5]. Moreover, the datasets are related and can be collectively used to develop a multi-task learning model [6] for better learning efficiency and improved performance.
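
The correlation analysis mentioned in the summary, between user rating and recommendation value, could look roughly like the sketch below. It is only an illustration: the field names `rating` and `recommended`, the yes/no encoding, and the six toy rows are all assumptions invented here, not the actual schema or contents of annotated_sentiment.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up toy rows for illustration only; NOT taken from the dataset.
# The keys "rating" and "recommended" are hypothetical column names.
toy_rows = [
    {"rating": 9, "recommended": "yes"},
    {"rating": 2, "recommended": "no"},
    {"rating": 7, "recommended": "yes"},
    {"rating": 4, "recommended": "no"},
    {"rating": 8, "recommended": "yes"},
    {"rating": 1, "recommended": "no"},
]

ratings = [row["rating"] for row in toy_rows]
# Encode the binary recommendation as 1/0 so it can be correlated numerically.
recommended = [1 if row["recommended"] == "yes" else 0 for row in toy_rows]

r = pearson(ratings, recommended)
print(f"Pearson r between rating and recommendation: {r:.3f}")
```

With one numeric and one binary variable, this Pearson coefficient is the point-biserial correlation. On the real data, the toy rows would be replaced by values read from the published file, using whatever column names it actually defines.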