Papers
arxiv:2505.21115

Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

Published on May 27, 2025
· Submitted by
Maria Marina
on Jun 9, 2025
#1 Paper of the day

Abstract

EverGreenQA, a multilingual QA dataset with evergreen labels, is introduced to benchmark LLMs on temporality encoding and assess their performance through verbalized judgments and uncertainty signals.

Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.

Community

Paper author Paper submitter

Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.

Paper author Paper submitter
Paper author Paper submitter

The clear example of evergreen vs non-evergreen questions:

EG_main_leaves.png

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/will-it-still-be-true-tomorrow-multilingual-evergreen-question-classification-to-improve-trustworthy-qa-3077-c4a5d4cf

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2505.21115
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 2

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2505.21115 in a Space README.md to link it from this page.

Collections including this paper 3