
Building a Multilingual Open-Ended QA Dataset

Author: Adrian | September 26, 2025

Background

Over the past year, foundation models have progressed rapidly, driving advances in general AI techniques and prompting the development of various model evaluation methods.

Current evaluation datasets mostly rely on human exam questions and their official answers, an approach that emphasizes reasoning ability and may not fully reflect a model's practical generative capabilities. For example, among English benchmarks, HELM uses 16 NLP datasets and MMLU covers 57 human exam subjects; among Chinese benchmarks, GAOKAO and C-Eval likewise draw on human exam questions. These automated evaluation pipelines include only questions with standard answers, which limits their ability to comprehensively assess generative foundation models.

Some work has focused on open-ended question answering. AlpacaEval from Stanford is widely recognized, but it consists solely of English prompts and therefore measures performance only in English. SuperCLUE was the first open-ended QA dataset to include Chinese prompts, but it is closed-source and covers Chinese only. Existing open-ended QA datasets each evaluate a single language, leaving a gap for an open-source multilingual open-ended QA benchmark.

Accordingly, a multilingual open-ended QA dataset is needed to comprehensively assess foundation models. The project starts with Chinese prompts and will gradually expand to other languages.

Introduction

The open multilingual generative evaluation benchmark, OMGEval, was released by a team from Beijing Language and Culture University, Tsinghua University, Northeastern University, Shanghai University of Finance and Economics, and other institutions. Key contributors include Liu Yang, Zhu Lin, Yu Jingsi, Xu Meng, Wang Yujie, Chang Hongxiang, Yuan Jiaxin, Kong Cunliang, An Jiyuan, Yang Tianlin, Wang Shuo, Liu Zhenghao, Chen Yun, Yang Erhong, Liu Yang, and Sun Maosong.

The dataset is open-source on GitHub: https://github.com/blcu/all/OMGEval

Dataset Construction Process

  1. Translation

    All prompts from AlpacaEval were translated into Chinese using ChatGPT, with a dedicated prompt for this translation step (a rough sketch of what such a step might look like appears after this list).

  2. Localization

    Evaluating language ability involves not only the language of the prompts and responses but also the cultural information embedded in them. Sentences in AlpacaEval that contained cultural elements, including but not limited to people, movies, books, and festivals, were localized so that the prompts are more relevant to a Chinese context.

    Several localization examples were prepared.

  3. Manual Verification

    Translated and localized prompts underwent manual verification. Each prompt was checked by two annotators and one reviewer. Annotators and reviewers were master’s students in linguistics.
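
As a rough illustration of the translation step mentioned above, the sketch below sends a single AlpacaEval prompt to an OpenAI chat model and asks for a Chinese translation. The model name, the translation instruction, and the helper function are assumptions for illustration only; the exact prompt and tooling used by the OMGEval team are not reproduced here.

```python
# Minimal sketch of the translation step (assumed implementation, not the
# team's actual script). Requires the `openai` package and an API key in
# the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Hypothetical translation instruction; the real prompt used by OMGEval may differ.
TRANSLATION_INSTRUCTION = (
    "Translate the following English instruction into natural, fluent Chinese. "
    "Preserve the original intent and formatting. Output only the translation."
)

def translate_prompt(english_prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Translate a single AlpacaEval prompt into Chinese."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": TRANSLATION_INSTRUCTION},
            {"role": "user", "content": english_prompt},
        ],
        temperature=0,  # deterministic output simplifies later manual verification
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(translate_prompt("What are some good books to read this summer?"))
```

Translated outputs would then go through the localization and manual verification steps described above before entering the dataset.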

Dataset Analysis

The final dataset contains 804 Chinese open-ended questions. Model capabilities were categorized into nine classes.

The distribution of questions across capability categories is currently uneven. Additional open-ended prompts will be added to improve balance.

Evaluation Method

AlpacaEval is a leaderboard framework for the automatic evaluation of large language models. It covers the full evaluation pipeline, from dataset collection and model response generation to automatic scoring, and the leaderboard measures instruction-following and response quality. The underlying dataset contains 805 instructions aggregated from projects such as Self-Instruct, Open Assistant, and Vicuna. The evaluation metric uses a large model as the judge, typically GPT-4, which compares the candidate model's response with a reference model's response, typically from Text-Davinci-003; the candidate model's win rate is computed from these pairwise judgments.
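
To make the judging setup concrete, here is a minimal sketch of how a pairwise GPT-4 judgment over one prompt might look. The judge prompt wording, model name, and parsing logic are assumptions for illustration; AlpacaEval's actual annotator templates are defined in its repository.

```python
# Minimal sketch of a pairwise GPT-4 judgment (assumed prompt wording, not
# AlpacaEval's actual annotator template).
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are comparing two responses to the same instruction.

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Answer with exactly "A" or "B"."""

def judge_pair(instruction: str, candidate: str, reference: str) -> bool:
    """Return True if the judge prefers the candidate over the reference."""
    prompt = JUDGE_TEMPLATE.format(
        instruction=instruction, response_a=candidate, response_b=reference
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper().startswith("A")
```

In practice the order of the two responses is typically randomized across prompts to reduce position bias; that detail is omitted here for brevity.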

AlpacaEval reports a Pearson correlation of 0.94 between GPT-4 evaluations and human annotations, indicating high reliability. The authors also analyze evaluation costs, showing that automatic evaluation greatly reduces both the monetary and time costs compared with manual annotation.

Following AlpacaEval, OMGEval uses Text-Davinci-003 outputs as the baseline and GPT-4 as the evaluator to decide which output is better, and it reports win rates with standard deviations. To ensure that evaluated models produce Chinese outputs for OMGEval prompts, the prompts given to the models were written in Chinese, and the GPT-4 evaluation prompt was adjusted accordingly.
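
The win rate itself is simply the fraction of prompts on which the judge prefers the candidate. The sketch below shows one way to aggregate per-prompt judgments into a win rate with a standard-error estimate; the exact statistics reported by AlpacaEval and OMGEval may be computed differently.

```python
# Aggregate per-prompt judgments into a win rate with a simple spread estimate.
# This is an illustrative sketch, not the benchmarks' exact aggregation code.
from math import sqrt
from statistics import mean, stdev

def win_rate(preferences: list[int]) -> tuple[float, float]:
    """preferences: 1 if the candidate won a prompt, 0 otherwise.

    Returns (win rate in percent, standard error of the mean in percent).
    """
    rate = mean(preferences)
    se = stdev(preferences) / sqrt(len(preferences)) if len(preferences) > 1 else 0.0
    return rate * 100, se * 100

# Example: the candidate wins 3 of 5 prompts -> 60% win rate.
print(win_rate([1, 0, 1, 1, 0]))
```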

Evaluation Results

Using Text-Davinci-003 as the baseline and GPT-4 as the evaluator, evaluation leaderboards were produced. A separate evaluation was also conducted on the 239 localized prompts to assess model performance on content involving Chinese cultural elements; for example, ChatGPT scored lower on the localized subset than on the full prompt set.

References

  • Liang P, Bommasani R, Lee T, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
  • Hendrycks D, Burns C, Basart S, et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Zhang X, Li C, Zong Y, et al. Evaluating the Performance of Large Language Models on GAOKAO Benchmark. arXiv preprint arXiv:2305.12474, 2023.
  • Huang Y, Bai Y, Zhu Z, et al. C-eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.
  • Dubois Y, Li X, Taori R, et al. AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
  • Xu L, Li A, Zhu L, et al. SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark. arXiv preprint arXiv:2307.15020, 2023.