Behind the Scenes of The Grand Tribunal: Fine-Tuning Philosophers and Dynamic AI Judging

Community Article Published June 14, 2026

This is a technical field diary detailing the architecture, data flow, and training pipeline of The Grand Tribunal, a serverless, AI-powered philosophical debate game.

Try it here: https://build-small-hackathon-thegrandtribunal.hf.space


1. The Idea

The idea came when I saw the twitter thread (https://www.youtube.com/shorts/lgwRFNSHTcQ), and I decided fighting about trivial things in such language is fun, and what's more fun than arguing about it with real life characters.


2. From Qwen 2.5 3B to Qwen 3.5 9B: Scaling Up the Sarcasm and Intellect

In the initial prototypes of The Grand Tribunal, the goal was to keep compute footprints small. I chose Qwen 2.5 3B as our base LLM, hoping it would allow for cheap hosting and fast execution.

However, I quickly hit a quality ceiling: The model wasn't judging as well as it should have, often assigning generic scores to both ends. 3B models struggle with high-level irony, complex wit, and dry humor. The characters would wander off topic, agree with the player instead of attacking the premise of their argument.

To fix this, I upgraded the entire system to Qwen 3.5 9B.

The 9B model handled complex philosophical rebuttals with ease, stayed strictly within JSON constraints, and scored arguments with stable, human-like reasoning. To run a 9B model without latency issues, we migrated inference to a serverless Modal cluster using vLLM on a single NVIDIA A100-80GB GPU. I pre-baked the base model weights directly into the CUDA container image to eliminate the 20-30s container download times, reducing warm starts to a few seconds.


3. The Data Flow: From Microphone to Spoken Retort

To keep gameplay fast, pipeline had to parallelize heavy inference tasks and handles safety at the boundaries. Here is exactly what happens when you submit an argument:

[Player Input] 
      │ 
      ▼
[Pre-Inference Validation] ──(Fails)──> [Fast UI Error Alert]
      │
      ├─► Length Clamping (max 500 chars)
      ├─► Prompt Injection Defense
      └─► Gibberish & Bad Transcript Detection
      │
      ▼
[Parallel Phase 1: Engine Request]
      ├──► /judge endpoint ────► Evaluate Player Argument (Logic, Relevance, Score)
      └──► /character endpoint ► Generate Philosopher Rebuttal Text
      │
      ▼
[Turn Resolution] ───────────────► Deduct HP, Check for Victory/Defeat Scene
      │
      ▼
[Parallel Phase 2: Output Synthesis]
      ├──► /tts endpoint ──────► Synthesize voice cloned rebuttal WAV
      └──► /judge endpoint ────► Evaluate Opponent Rebuttal (Logic, Relevance, Score)
      │
      ▼
[UI Render & Audio Playback] ────► Update Dialogue Box, Play WAV, Trigger Poses
  1. Input Capture: You record your voice in the browser. Custom JavaScript pipes the raw audio blob to Python via a hidden Gradio payload component.
  2. Sanitization: The input is validated. If it's too short, fails character-to-vowel ratio checks (gibberish), or contains injection markers (like ignore previous instructions), it is immediately rejected to avoid calling backend APIs.
  3. Inference Parallelization (Phase 1): We spin up two concurrent threads. One sends the argument to the /judge endpoint on Modal to score your point. The other sends the context to the /character endpoint to generate the opponent's rebuttal text. Doing these in parallel cuts latency by 50%.
  4. Turn Resolution: The judge's score determines damage and fatigue. Health points (HP) are calculated, and the UI status bars are updated. If someone hits 0 HP, we stop the turn and display victory/defeat poses.
  5. Synthesis Parallelization (Phase 2): If the debate continues, we run the next parallel phase. One thread calls the /tts endpoint (using VoxCPM2 with a reference waveform) to synthesize the philosopher's voice audio, while another thread calls /judge to evaluate the opponent's generated rebuttal.
  6. Output Delivery: The Gradio UI polls the backend state, triggers the philosopher's talking/damage animation, and autoplays the synthesized WAV file in the browser.

4. Under the Hood: LoRA Adapters and Training Data

Rather than deploying five separate 9B models (one judge, four philosophers) and overloading VRAM, I consolidated our entire AI stack onto a single serverless runtime.

Consolidated Multi-LoRA Serving

Using vLLM's native support for multi-LoRA adapters, I load one base Qwen 3.5 9B model. Then load five separate LoRA adapters (rank=16, alpha=32) into the same container memory:

  • judge_adapter: Specialized in structured evaluation.
  • oscar_wilde_adapter: Fine-tuned for witty, contradictory aesthetics.
  • nietzsche_adapter: Fine-tuned for aggressive revaluations and existential prose.
  • plato_adapter: Fine-tuned for Socratic dialogue and ideal truths.
  • schopenhauer_adapter: Fine-tuned for gloomy, pessimistic retorts.

This configuration allows the backend to route inference queries dynamically to different adapters on the same GPU without swapping overhead or memory duplication.

The Fine-Tuning Pipeline & Data Sources

These were all trained on a A10G:

  1. The Judge Training Data:

    • Source: Scraped argument threads from the CMV (Change My View) dataset.
    • Processing: Arguments were parsed, cleaned, and evaluated using LLM-as-a-judge pipelines to score them across the 5 axes (logic, relevance, creativity, composite score, and reasoning).
    • Result: A dataset of thousands of examples mapping Topic + Argument -> structured scoring JSON.
  2. The Character Training Data:

    • Source: Public domain corpora of the philosophers' works (Oscar Wilde's plays/essays, Nietzsche's books, Plato's dialogues, Schopenhauer's essays).
    • Processing: We split the text into semantic chunks, generated matching debate situations/prompts, and structured them as context-rebuttal pairs.
    • Emotional Labeling: Responses were classified as neutral or objecting using custom lexicon keywords (e.g. flagging words like absurd, weak, fold, decay, simpleton) to train the model to output a corresponding visual expression alongside its dialogue.

Community

Sign up or log in to comment