Recent advances in natural language processing (NLP) have led to the development of Large Language Models (LLMs), which can generate text that is virtually indistinguishable from human-written text. These models are adept at interpreting and executing human instructions, which has facilitated their incorporation into a broad spectrum of mainstream applications, including summarization systems, interactive chatbots, and virtual assistants. Amazon, for instance, integrates LLMs to enhance customer experiences, using them to summarize customer reviews on its website and to power innovative shopping chatbots like Rufus. Moreover, Amazon Web Services (AWS) has introduced Amazon Q, an LLM-powered assistant for commercial applications. However, studies have shown that text generated by LLMs frequently exhibits factual inconsistencies, such as contradictions with the provided input or hallucinations that are irrelevant to the context at hand. Identifying these inconsistencies is challenging because the errors often align with the task's overarching structure and theme, making them subtle and hard to detect. There is therefore a need for objective evaluation of text generated by LLMs. Human evaluation is considered the gold standard, but it requires subject-matter expertise and is time-consuming, which makes it difficult to scale. This underscores the need for automated metrics that can efficiently evaluate LLM-generated text. In this talk, we will explore automated evaluation strategies from the literature, ranging from simple text-similarity-based methods to using LLMs themselves as evaluators. We will also discuss ways to benchmark these automated evaluation methods and cover publicly available datasets for evaluation.
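To make the simplest end of this spectrum concrete, the sketch below (illustrative code, not part of the talk materials) computes a unigram-overlap F1 score between a model-generated summary and a reference; metrics such as ROUGE and BERTScore refine this basic idea. Function and example strings are hypothetical.

```python
# Minimal sketch: token-overlap F1, the simplest form of a
# text-similarity-based evaluation metric for generated text.
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Compute unigram-overlap F1 between a candidate and a reference text."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts how many tokens the two texts share.
    overlap = Counter(cand_tokens) & Counter(ref_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(cand_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: score a hypothetical LLM-generated review summary against a reference.
print(unigram_f1(
    "The battery lasts two days and charges quickly.",
    "Reviewers say the battery lasts about two days and charges fast.",
))
```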