Aranya: Teaching a Small Model to See Plants and Tell Their Stories

Community Article Published June 13, 2026

Field Notes from the Build Small Hackathon


Here's a stat that caught my attention: the average plant parent has killed at least seven plants. Not from neglect - from confusion. Overwatering, misidentification, missed early signs of disease. There are 86 million households in the US alone with plants on their windowsills, and most of them are winging it.

I wanted to build something that actually helps. Not another dry plant encyclopedia, not another app that spits out a Latin name and a wall of text. Something that makes you want to learn about the green thing on your desk.

The question I set out to answer: can a 1.3-billion-parameter vision-language model, fine-tuned on community knowledge, deliver genuinely useful plant identification and health diagnosis? And can I wrap it in an experience that doesn't feel like homework?

The answer, it turns out, is yes. With caveats. Here's how.

UI concept for Aranya : the Wildkeeper adventure aesthetic Early UI concept : generated through iterative prompting with OpenAI's image generation. The adventure/gamification aesthetic clicked immediately.


The Vision: Plant Care as an Expedition

Most plant care apps feel clinical. You scan a leaf, you get a paragraph of text, you close the app. The retention problem isn't information - it's engagement.

I kept coming back to this idea: what if every time you point your phone at a plant, you're not "scanning" it — you're discovering it? What if you're a Wildkeeper on an expedition, and each plant you encounter adds to your journal?

The gamification frame emerged through rapid iteration with OpenAI's image generation. I'd describe aesthetics like "jungle expedition notebook,", "aged parchment with botanical sketches," "adventure game quest log" and the AI would generate concepts I could react to. Within an afternoon I had a visual language: gold-on-dark-green, weathered paper textures, a jungle canopy backdrop. The "Wildkeeper" metaphor locked in everything else.

Key UX decisions that followed from the metaphor:

  • Narrated responses via text-to-speech so you don't read about a plant, you hear its story told to you
  • A discovery journal that grows as you explore hence the species counter, rescue counter, your personal botanical record
  • Storytelling tabs after each identification: folklore and legends, "plant superpowers" (medicinal/ecological uses), fun facts, care tips
  • A community leaderboard because adventure is better with others on the trail

Data: 310K Training Conversations from Community Knowledge

The backbone of Aranya's intelligence isn't a massive pretrained model. It's a fine-tuned specialist trained on real-world plant discussions. The kind of knowledge that lives in online community posts where people share photos of their struggling houseplants and experienced gardeners diagnose the problem in the comments.

Sourcing and Scale

I collected approximately 806,000 image posts from online plant care communities, forums where people photograph their plants, describe symptoms or ask "what is this?", and other enthusiasts respond with identifications and care advice. Each post typically contained: a photo of the plant, the owner's description of the problem or question, and community responses (often detailed diagnoses from experienced growers).

The signal-to-noise ratio in raw community data is poor. Plenty of posts are just someone showing off a healthy monstera. Others are memes, or low-effort questions with no useful answers. The real gold is the subset where someone posts a clear photo of a specific problem and gets a thoughtful, detailed community response. Getting to that subset required a multi-stage pipeline.

The Classification Layer

Not every community post is useful for training. People share healthy plants for fun, ask general questions, or post memes. I needed to separate the signal.

Stage 1 - Rule-based classification: A keyword scoring system with community-specific priors bucketed posts into five categories:

  • Disease/stress (sick plants, health concerns)
  • Species identification (what is this plant?)
  • Showcase (look at my healthy baby)
  • General care questions
  • Other

Stage 2 - ML refinement: A TF-IDF + Logistic Regression model trained on the high-confidence rule-based labels, then re-classified the ambiguous posts. Only predictions above 0.60 probability got re-labeled.

Final breakdown: 239K disease/stress posts and 304K species identification posts.

Synthetic Training Data via Gemini Batch API

Raw community threads aren't in a format suitable for fine-tuning a vision-language model. The conversations are messy, multi-party, and unstructured. I needed to distill them into clean user/assistant training pairs.

Enter Gemini 2.5 Flash Lite on Vertex AI's Batch Prediction API.

The idea: use a larger model (Gemini) to distill messy community knowledge into clean training pairs for a smaller model (MiniCPM-V). This is a form of synthetic data generation but grounded in real-world examples rather than generated from thin air.

For each community thread, I built a structured prompt containing the original post text and community responses, asking Gemini to synthesize a clean training conversation: one where the user describes their plant's situation, and the assistant provides a visually-grounded diagnosis or identification. The key instruction: the assistant response must reference what's visible in the image and not just regurgitate textbook information.

Two parallel batch jobs:

  • Disease diagnosis: "Look at the plant in the image. Based on visible symptoms, diagnose the issue and provide recovery steps." (288K requests)
  • Plant identification: "Identify this plant from the image. Provide common and scientific names with visual reasoning." (250K requests)

Each batch file held 50,000 requests. I submitted them via Vertex AI with temperature=0.4 for consistency. The Batch API's 50% cost discount made processing at this scale feasible.

Results: 228K disease training records and 105K plant-ID training records (92% acceptance rate : the rest filtered as irrelevant or malformed).

Data pipeline funnel — from 806K community posts to 333K training records

Community plant posts (~806K)
    ↓ classify (rules + ML)
239K disease + 304K plant_id posts
    ↓ Gemini Batch API (Vertex AI)
310K+ multimodal training conversations
    ↓ ms-swift LoRA on Modal (L40S)
Two fine-tuned adapters
    ↓ merge + GGUF quantize
F16 + Q8_0 weights → Hugging Face Hub

Fine-Tuning: LoRA on MiniCPM-V 4.6

Why MiniCPM-V 4.6

I needed a vision-language model that:

  1. Stays under the hackathon's 32B parameter limit (ideally way under)
  2. Handles multimodal input natively (image + text in, text out)
  3. Has a permissive license
  4. Actually works well enough at small scale to be worth fine-tuning

OpenBMB's MiniCPM-V 4.6 checked every box. At 1.32B parameters, it's genuinely small, fits on a single consumer GPU, but punches above its weight on vision-language tasks thanks to a solid architecture (vision tower + language model + aligner).

The LoRA Configuration

Training was orchestrated through ms-swift v4.2.3 (ModelScope's SWIFT framework) with PEFT as the LoRA backend. All runs executed on Modal with L40S GPUs.

Parameter Value
LoRA Rank 16
LoRA Alpha 32 (scaling ratio 2x)
Dropout 0.05
Target Modules all linear layers
Vision Tower unfrozen (LoRA applied)
Aligner frozen
Trainable Parameters ~20.3M (1.54% of model)

The interesting decision here: vision-inclusive LoRA. Most practitioners freeze the vision encoder during fine-tuning. I didn't. My reasoning: plant photographs look nothing like the web images MiniCPM-V was pretrained on. Community posts are often poorly lit, blurry, close-up shots of diseased leaves taken with phone cameras. The visual domain shift was significant enough that I wanted the vision tower to adapt too.

Freezing just the aligner preserved the learned vision-language alignment while letting both the vision encoder and language model specialize.

Training Runs

Plant Disease Plant ID
Dataset size ~210K records ~100K records
Epochs 1 2
Steps 1,641 1,576
Learning rate 2e-6 2e-6
LR schedule cosine, 3% warmup cosine, 3% warmup
Effective batch 128 (8 x 16 grad accum) 128
Time 26.7 hours 22.6 hours
Final train loss 0.79 1.54
Final eval loss 1.40 1.31
Final token acc 63.1% 66.7%

The learning rate (2e-6) is deliberately conservative. With a 1.3B model and 200K+ samples, there's enough data that you don't need aggressive learning and you really don't want to blow past the good loss basin early on.

Results

Evaluated on 300 held-out test samples per task, comparing the base MiniCPM-V 4.6 against the fine-tuned models:

Metric Disease: Base → Fine-tuned Plant ID: Base → Fine-tuned
BLEU 2.52 → 3.87 (+53.9%) 1.69 → 2.30 (+36.3%)
ROUGE-1 0.298 → 0.357 (+19.6%) 0.240 → 0.237
ROUGE-2 0.063 → 0.089 (+40.8%) 0.053 → 0.064 (+21.3%)
ROUGE-L 0.153 → 0.163 (+6.7%) 0.144 → 0.140
MoverScore 0.531 → 0.539 0.519 → 0.520

The disease model showed stronger gains across the board. That makes sense as disease diagnosis requires specific vocabulary ("root rot," "fungal infection," "overwatering damage") that the base model simply hadn't seen enough of. The plant-ID model improved BLEU and ROUGE-2 (phrasal precision) while ROUGE-1 stayed flat — it got better at constructing correct multi-word identifications without necessarily changing its unigram vocabulary.

Training loss curves for both models

Metrics comparison — baseline vs fine-tuned across BLEU, ROUGE, and MoverScore

Token accuracy progression during training

Merging and Quantization

After training, I merged the LoRA adapters back into the base weights on Modal:

swift export \
  --model openbmb/MiniCPM-V-4.6 \
  --adapter checkpoint-1641 \
  --merge_lora true \
  --output_dir merged-checkpoint-1641

Then converted to GGUF using llama.cpp's conversion tool:

  • F16 — full precision for ZeroGPU (where VRAM isn't the constraint)
  • Q8_0 — 8-bit quantized for local deployment
  • Q4_K_M — aggressive quantization for truly resource-constrained scenarios
  • mmproj — the vision projector, exported separately

Published models:


The App: Stitching It Together

Architecture

Aranya runs as a Gradio Space on Hugging Face, but it doesn't look like a Gradio app. I'm using gr.Server as a headless FastAPI host. It handles deployment, ZeroGPU lifecycle, and OAuth while I serve a completely custom frontend.

Browser (vanilla JS)                     FastAPI via gr.Server
┌──────────────────┐                     ┌───────────────────────────────┐
│ Upload image     │── POST /aranya/run ─▶│ Image preprocessing          │
│ Choose mode      │                     │         ↓                     │
│                  │◀── NDJSON stream ───│ MiniCPM-V GGUF (ZeroGPU)     │
│ Text trickle     │                     │         ↓                     │
│ + audio playback │                     │ Pocket TTS (CPU, streaming)   │
└──────────────────┘                     └───────────────────────────────┘

Inference: llama-cpp-python with the MTMDChatHandler for multimodal chat. Images are preprocessed (EXIF transpose, center crop, resize to 1024px) and passed as base64 data URIs alongside the text prompt. ZeroGPU provides the GPU for inference with no standing reservation.

TTS: Pocket TTS (~100M parameters) runs on CPU. It's fast enough for streaming. I don't need to wait for the full LLM response before starting narration.

The Streaming Pipeline

This is where it gets fun. The LLM generates text token-by-token. The TTS needs sentence-length chunks to produce natural-sounding audio. The frontend needs to sync text appearance with audio playback.

My solution: parallel producers feeding a single NDJSON stream.

  1. The LLM streams text deltas into a buffer
  2. A "speakable segmenter" watches the buffer and extracts natural phrase boundaries (min 32 chars, max 350 — balancing TTS latency against prosody)
  3. Each segment goes to Pocket TTS, which produces PCM audio
  4. Both text deltas and audio chunks flow as NDJSON events to the client
  5. The frontend trickles text at 80ms/character until audio playback catches up, then syncs reveal rate to the audio stream via Web Audio API

It's a bit of an asyncio circus, queues, producers, consumers, backpressure, but the result feels seamless. You see words appearing as you hear them spoken.

One gnarly bug I hit: llama-cpp-python's multimodal handling has a quirk where hybrid models wipe image embeddings on state reset. I had to monkey-patch Llama.generate() to preserve the multimodal context across the generation loop. Not something you'll find in the docs, just hours of debugging why the model was not seeing the images.

Off the Grid

No external APIs are called during core inference. The plant identification model, the disease diagnosis model, and the TTS engine all run within the Space. The only outbound calls happen for follow-up content (fun facts, mythology, etc.) which uses DuckDuckGo search + the base MiniCPM-V as a utility model.

Frontend assets are vendored too, images inlined as base64, only marked.js from a CDN for markdown rendering. It's about as "off the grid" as you can get while still being a web app.

The SQLite Backbone

Every discovery, every rescue, every leaderboard entry, all persisted in SQLite with WAL mode. The database handles concurrent reads during streaming (one request generating, another user browsing the leaderboard), automatic backups, and corruption recovery. On Hugging Face Spaces, the /data persistent volume can get into weird states after cold starts so I added stale WAL sidecar cleanup and auto-recovery to handle that gracefully.


Making Plant Care Interesting: The Storytelling Angle

Traditional plant care apps treat information delivery as the product. Here's your plant name. Here's a care schedule. Done.

The problem: nobody retains information presented that way. People remember stories.

Aranya wraps every interaction in narrative. When you identify a plant, you don't get a data sheet, you get a spoken explanation of what the model sees, how it arrived at the identification, and why this plant is interesting. The voice creates a sense of being told something by a guide, not looking it up in a reference book.

After identification, four storytelling tabs extend the engagement:

  • Did You Know? — surprising facts that make you look at the plant differently
  • Keep It Happy — care tips framed as a relationship ("your Monstera likes humidity because...")
  • Stories & Legends — mythology, folklore, cultural significance across civilizations
  • Plant Superpowers — medicinal uses, ecological roles, culinary applications

Each tab is generated on-demand by the utility model with web search context.

The gamification layer is light by design. You're a "Wildkeeper." Each scan is either a "Discovery" (identification) or a "Rescue" (health diagnosis). Your stats grow. You can see where you stand on the leaderboard. But there's no XP grinding, no artificial gates. The motivation is curiosity, and the storytelling feeds it.

I think this is an underexplored design space. We spend so much time optimizing model quality and squeezing another point of BLEU out of a fine-tune when the UX framing might matter more for actual user retention. A slightly worse model wrapped in a compelling narrative will outperform a better model behind a clinical interface. At least for consumer apps.


Sponsors: How They Made This Possible

None of this happens without the infrastructure and credits provided by the hackathon sponsors. Here's what each contributed to Aranya specifically:

Sponsor What They Enabled
Hugging Face Spaces hosting with ZeroGPU for inference, Model Hub for publishing GGUF weights, $20 in platform credits, and the hackathon itself
OpenAI $100 in Codex credits that powered end-to-end development of all code : data pipeline, training orchestration, the Space app, and eval tooling. Plus image generation for UI concept iteration
Modal $250 in compute credits for all LoRA training runs AND LoRA merging AND GGUF quantization on L40S GPUs (more than 50 hours of total compute)
OpenBMB Created MiniCPM-V 4.6 : the base model that made small multimodal AI accessible. Without a strong 1.3B VLM, this project doesn't exist
Gradio The SDK for deployment. gr.Server specifically enabled the fully custom UI while keeping all the Spaces infrastructure benefits

What I Learned

Small models can be specialists. The base MiniCPM-V 4.6 is a generalist, okay at a lot of things, great at nothing specific. After fine-tuning on 310K plant conversations, it knows what root rot looks like. It knows the difference between a Pothos and a Philodendron. Specialization through data works, even at 1.3B params.

Community knowledge scales. Thousands of real plant enthusiasts diagnosing real problems over years of forum activity, that's an incredible training signal. Combined with synthetic data generation via Gemini to clean and structure it, you get a scalable pipeline that doesn't require a single manual annotation.

Unfreeze the vision encoder. The conventional wisdom says freeze the ViT during fine-tuning. For domain-specific visual tasks where your images look nothing like ImageNet or web crawls, that advice is wrong. Plant photos from community posts are noisy, close-up, poorly lit. The vision tower needs to adapt to that domain or you're fighting with one hand tied behind your back.

Limitations are real. A 1.3B VLM has a ceiling. Rare species, subtle symptoms, unusual lighting - it struggles. The community data has geographic and species-popularity biases. The TTS occasionally butchers unusual Latin names. These aren't dealbreakers for a hackathon, but they're honest constraints.


What's Next

Everything I learned building Aranya: the data pipeline, the fine-tuning workflow, the streaming TTS architecture, the storytelling UX — is being carried forward into a full locally-deployed mobile plant care app at getsounth.com.

The hackathon was the proving ground. Ten days to validate that a small model can be a genuine plant specialist, that voice narration transforms engagement, and that community knowledge can bootstrap training data at scale. Now the real product is next.


Built for the Build Small Hackathon — small models, big adventure.

Links:

Community

Sign up or log in to comment