Wall Street of AI Agents
A multi-agent trading firm powered by small language models
I wanted to build a reality TV show.
The result is Wall Street of AI Agents, a fully autonomous, high-frequency trading floor simulation where four AI agents with clashing personalities trade fake money, eavesdrop on each other, and panic during market crashes.
And the craziest part? The entire cognitive engine driving all four autonomous agents concurrently runs locally on a 1-Billion to 4-Billion parameter model. No cloud APIs at all.
▶️ Play the Live Simulation on Hugging Face Spaces
🎬 [Watch the Demo Video][https://youtu.be/1XZuUsiwuTA]
Here are my field notes on how I escaped the bloat of modern agentic frameworks, broke out of Gradio's default UI, and forced a tiny language model to output perfect JSON.
Sarah leads the leaderboard at $10,700. Alice is arguing with Mike in the hallway. Alex is alone in the Office. The market is Stagnant. Panic ensues.
The premise of the simulation is simple. All Four traders are given $10,000 to start and these four traders share a retro pixel-art office:
Every few seconds, the global market shifts (Tech Boom, Crypto Frenzy, Stagnant, etc.). The agents read the news, look at who is in the room with them, and make a trade.
But beneath the retro graphics, this is actually a lightweight visual benchmark for LLM reasoning. By trapping a tiny model in a high-stakes financial simulation, you are stress-testing three things:
If you want to build a multi-agent system today, the default advice is to use heavy orchestration frameworks like AutoGen or CrewAI. For a highly visual, fast-paced game running on edge hardware, these frameworks are far too bloated.
Instead, I built a Spatially-Aware Polling System using FastAPI and an ultra-fast SQLite database.
There is no complex "Message Bus" routing messages between agents. Instead, physical proximity drives the conversation. Before an agent takes a turn, the Python engine queries the database: "Who is physically standing in the VC Office right now?"
If Sarah is in the room with Alex, the backend intercepts her last spoken sentence and injects it directly into Alex's prompt:
“Sarah is here. She just said: 'Hold cash for safety.' INSTRUCTION: Reply directly to what she said.”
This creates organic, localized conversations where agents actively influence each other's trades, mimicking a real trading floor.
I added a giant red button to the dashboard: ⚡ Trigger Chaos. As the "Producer" of the show, I can click this button at any time. It injects a catastrophic headline ("Anonymous leak reveals massive data breach!") and crashes the market.
AI agents can go bankrupt too—one wrong decision in the wrong market regime can wipe out everything. Watching them react in real-time is hilarious. But forcing a 1B model to process that context without breaking its output formatting was the real challenge.
Running a continuous game loop on a Free CPU Space is incredibly difficult because 1B and 4B models (like OpenBMB's MiniCPM5-1B or NVIDIA's Nemotron-3-Nano-4B) are notoriously bad at outputting clean JSON. They hallucinate brackets, inject Markdown, and politely add "Sure! Here's your trading decision:" before the data. Every single one of these quirks crashes a standard game loop.
I didn't try to sanitize the text in Python with messy Regex. Instead, I prevented the hallucinations in C++ at the token sampling level.
Leveraging the llama-cpp-python runtime, I used Native JSON Schema enforcement. By passing a strict Pydantic-style schema into the API, llama.cpp hooks into the neural network's logit generator.
response_format={
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"trade": {"type": "string", "enum": ["Tech", "Crypto", "Bonds", "Hold"]},
"speech": {"type": "string"},
"thought": {"type": "string"},
"location": {"type": "string", "enum": ["Startup Offices", "VC Office", "Coffee Shop", "News Room"]}
},
"required": ["trade", "speech", "thought", "location"]
}
}
The model is physically prevented from choosing a token that violates the JSON structure, and is forced to pick a trade exclusively from my Enum array.
This optimization is what makes this project hum. It allows a 1-Billion parameter model to punch massively above its weight class, running 4 autonomous agents continuously on a basic 2-vCPU Hugging Face Space without a single crash.
Gradio is a fantastic tool for ML demos, but its default UI screams "AI Tool." I wanted this to feel like a video game.
To achieve the "Off-Brand" aesthetic, I completely bypassed the stock Gradio layout. I built a 2D RPG environment using Phaser.js (compiled via Vite) and mounted the static dist/ folder directly into a FastAPI route. I then injected this HTML5 canvas into Gradio using a gr.HTML iframe.
The result is a Neubrutalist dashboard with three distinct columns:
"speech" from their private "thought". If you click an agent's sprite in the game, the UI updates to show their hidden anxieties. You get to read the LLM's inner monologue while it publicly projects confidence to the other agents.We've spent the last year obsessed with massive frontier models serving as omniscient chatbots. But Wall Street of AI Agents proves that there is a massive, untapped design space for tiny models acting as NPC brains.
By applying strict grammar constraints to a quantized GGUF model, you can run a localized, dynamic, and hilarious multi-agent simulation entirely in RAM.
The market doesn't care about your architecture choices—but your CPU definitely does.
Built for the Hugging Face Build Small Hackathon 2026. You can play the live simulation here on Hugging Face Spaces, or check out the [Post on X][https://x.com/ashdebugs/status/2065443044833562840].
A multi-agent trading firm powered by small language models
More from this author