Benchmarks for Harmlessness Evaluation in the LLM Era

Overview

Generative artificial intelligence (GAI) has greatly improved the ability to produce high-quality content within seconds. As generative models become more widespread, concerns have emerged about potential harms in AI-generated text. These concerns generally fall into three categories: factual errors, unfair or biased content, and toxic content. Examples are illustrated below.

Chatbots and personal assistant applications built on LLMs are increasingly common, and their outputs are pervasive online. This increases the need for robust harmlessness evaluation of LLMs.

Why existing benchmarks struggle

The capability gap between modern LLMs and earlier models has made many existing harmlessness benchmarks inadequate. Main reasons include:

Difficulty achieving accurate evaluation. LLM pretraining covers vast text sources such as encyclopedias, books, and web pages. Many factual benchmarks were constructed from Wikipedia or similar sources. Overlap between LLM training data and factual benchmarks can lead to misleadingly optimistic evaluations.
Restriction to narrow scenarios. Prior fairness benchmarks often focus on specific tasks, like hate-speech detection or sentiment toward identity terms. Biases and stereotypes in LLMs, however, can manifest across a much broader range of real-world scenarios.
Inability to measure differences in toxicity. Existing toxicity evaluations commonly use prompts that contain overtly offensive language to elicit harmful replies. With alignment toward human values, LLMs now frequently refuse such prompts, so these crude methods no longer produce responses that reveal differences across models on toxicity.

The FFT benchmark

To address these gaps, the authors propose a harmlessness evaluation benchmark of 2,116 instances covering factuality, fairness, and toxicity, called FFT.

The benchmark fills the evaluation gaps in several ways:

Adversarial prompts to reveal misleading responses. Since hallucinations often cause LLMs to respond to incorrect user inputs, the benchmark includes adversarial prompts with erroneous information and counterfactuals to test factuality.
Diverse prompts that cover more real-world scenarios. To uncover potential biases, the dataset includes questions reflecting real-life situations across identity preferences, credit decisions, crime, and health assessments.
Carefully designed jailbreak-wrapped prompts for toxicity testing. Jailbreak prompts are crafted instructions intended to bypass built-in ethical constraints. The authors wrap toxicity-triggering prompts with selected jailbreak prompts to avoid immediate refusals, enabling collection of actual toxic responses and comparison across models.

Evaluation and key findings

The authors evaluated nine representative LLMs, including GPT models, Llama 2 chat models, Vicuna, and other Llama 2 variants, on the FFT benchmark. The experiments and analysis produced several notable findings:

LLM outputs can harm users via misinformation, stereotyping, and toxic content, indicating areas that require further research.
Llama 2 chat models demonstrated competitiveness with GPT models on harmlessness metrics.
Fine-tuning for alignment with human values substantially improves harmlessness, warranting additional study.
Model harmlessness does not correlate directly with model scale; larger models may be exposed to more harmful content due to larger pretraining corpora.

Remarks and limitations

The work addresses important evaluation needs in the LLM era by considering realistic application scenarios and providing a comprehensive classification of harms. However, some questions remain. A dataset of about 2k instances is not large, so it may not fully capture the breadth of potential harmful behaviors. It is also unclear whether the benchmark will be updated continuously to keep pace with rapid advances in LLM capabilities.