arxiv:2505.21115

Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

Published on May 27, 2025

Submitted by Maria Marina on Jun 9, 2025

#1 Paper of the day


Authors: Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii

Abstract

EverGreenQA, a multilingual QA dataset with evergreen labels, is introduced to benchmark LLMs on temporality encoding and assess their performance through verbalized judgments and uncertainty signals.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions: whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves state-of-the-art (SoTA) performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.
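As an illustration of the dataset-filtering application mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes some scoring function that returns the probability that a question is evergreen. The names `QAItem`, `filter_evergreen`, and `toy_score` are hypothetical, and `toy_score` is a crude keyword heuristic standing in for a trained classifier such as EG-E5.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAItem:
    question: str
    answer: str


def filter_evergreen(
    items: Iterable[QAItem],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[QAItem]:
    """Keep only items whose question is scored as evergreen.

    score_fn is assumed to return P(evergreen) in [0, 1] for a question.
    """
    return [item for item in items if score_fn(item.question) >= threshold]


def toy_score(question: str) -> float:
    """Hypothetical stand-in for a trained evergreen classifier.

    Flags a question as likely mutable when it contains a time-sensitive cue word.
    """
    mutable_cues = {"current", "currently", "latest", "now", "today"}
    words = set(re.findall(r"[a-z]+", question.lower()))
    return 0.1 if words & mutable_cues else 0.9


items = [
    QAItem("Who wrote Hamlet?", "William Shakespeare"),
    QAItem("Who is the current UN Secretary-General?", "(changes over time)"),
]
kept = filter_evergreen(items, toy_score)
print([item.question for item in kept])
```

In practice, `toy_score` would be replaced by a real classifier's predicted probability, and the threshold would be tuned on held-out labeled questions.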

