GharScan: Teaching a 2B-Parameter Model to Read Indian Walls

Community Article Published June 15, 2026

"My neighbor Aunty Puja — a retired school teacher in a 1982 DDA flat in Delhi — called three masons about a crack above her window frame. She got three different diagnoses, three different quotes, and zero clarity. That moment became the problem statement for GharScan."

The Problem Nobody Talks About

Over 62% of India's urban housing stock was built before 1990. DDA flats in Delhi, MHADA buildings in Mumbai, old housing board colonies in every tier-1 city — they are all aging. Every monsoon season, millions of Indian homeowners discover cracks in walls, damp patches on ceilings, white salt deposits, or rust streaks seeping through their concrete.

The choices they face today are:

Pay ₹2,000–8,000 for a civil engineer site visit
Ask a local mason (who has an inherent financial incentive to recommend expensive repairs)
Search Google and get terrifying results
Ignore it and hope for the best

There is no affordable, unbiased first answer. GharScan is that first answer.

Point your phone camera at any building defect (a crack, a damp patch, concrete spalling, rebar rust) and within 12 seconds you get the defect type, a severity score from 1 to 5, whether it is a structural risk, what to do next, and an estimated cost range in Indian Rupees.

This is what AI at its best should feel like: specific, useful, honest, and small enough to run without sending your home photos to an API in California.

Why a Small Model?

This was built for the Build Small Hackathon 2026 by Gradio and HuggingFace — a competition specifically asking builders to think smaller. The constraint: ≤32 billion parameters, hosted on a HuggingFace Space.

We chose Qwen2-VL-2B-Instruct (2.07 billion parameters) for several reasons beyond just fitting the constraint:

Privacy-first architecture: Photos of your home interior — cracked walls, water damage, structural defects — are inherently private. Running inference entirely within HF's ZeroGPU, with no external API calls, means your images never leave a controlled environment. A 175B cloud model is genuinely wrong for this problem.
Right-sized for the task: Building defect classification is a narrow, structured visual reasoning task. A fine-tuned 2B VLM that has seen thousands of crack images can outperform a general-purpose giant model on this specific domain. More parameters do not mean better crack detection.
Tiny Titan eligible: At 2.07B, it qualifies for the hackathon's Tiny Titan prize category (≤4B params).
GGUF-friendly: A 2B model quantized to Q4_K_M fits in ~941MB — genuinely laptop-portable.

The Dataset: From 58,000 Images to 8,860 Unique Ones

Getting training data for Indian residential building defects is harder than it sounds. There is no "DDA flat crack" dataset on Kaggle.

Tier 1: Public Research Datasets

We aggregated five public datasets covering concrete and masonry defects:

Dataset	Source	Images	What it covers
CODEBRIM	Zenodo DOI 10.5281/zenodo.2620293	10,457	Bridge defects: cracks, spalling, efflorescence, rebar
SDNET2018	Kaggle (IEEE DataPort)	~56,000	Bridge decks, pavements, walls — cracked vs uncracked
Surface Crack Detection	Kaggle (arunrk7)	40,000	Concrete surface cracks, binary positive/negative
MCrack1300	GitHub arXiv:2401.15266	1,300	Masonry phone-camera images
dacl1k	CVPR 2023 workshop	1,474	Multi-label building inspection

A critical design decision here: we did not blindly use everything. The Surface Crack Detection dataset, for example, was collected by sliding a window across concrete surfaces — meaning thousands of its 20,000 "Positive" images are near-identical patches of the same wall. Training on those would teach the model to memorize specific textures, not to generalize.

The Class Imbalance Problem (And Why We Solved It Differently)

The naive approach would be to use all available images. But looking at the actual counts:

no_defect images (non-cracked): ~70,000 available
structural_crack images: ~16,000
hairline_crack images: ~20,000

Using everything would create an imbalance of roughly 4:1 in favour of no_defect (about 70,000 images against 16,000 to 20,000 per defect class). A model trained on that distribution learns to predict "nothing wrong here" for anything ambiguous, which is exactly the wrong behaviour for a safety-critical inspection tool.

Our solution was balanced sampling at ~20,000 per class, matching the largest defect class:

# Defect classes: use ALL available (they're already under 20k)
# no_defect: cap at 20k to prevent majority-class bias
NO_DEFECT_CAP = 20_000

# SDNET non-cracked: split evenly across surfaces
NC_SPLIT = NO_DEFECT_CAP // 3  # ~6,666 from Walls, Decks, Pavements each

This gave us 58,939 images going into deduplication.
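The capping rule above can be sketched in a few lines. This is a minimal illustration, not the project's actual sampling script: the helper name `cap_per_class`, the fixed seed, and the placeholder file names are assumptions, and only the three class counts quoted above are modelled.

```python
import random

NO_DEFECT_CAP = 20_000  # cap quoted in the article


def cap_per_class(paths_by_class, cap, capped_classes, seed=0):
    """Return a copy of paths_by_class with the listed classes down-sampled to `cap`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = {}
    for label, paths in paths_by_class.items():
        if label in capped_classes and len(paths) > cap:
            # Sample without replacement so no image is picked twice
            sampled[label] = rng.sample(paths, cap)
        else:
            # Defect classes are kept in full
            sampled[label] = list(paths)
    return sampled


# Toy pools sized like the counts quoted above (file names are placeholders)
pools = {
    "no_defect": [f"nd_{i}.jpg" for i in range(70_000)],
    "structural_crack": [f"sc_{i}.jpg" for i in range(16_000)],
    "hairline_crack": [f"hc_{i}.jpg" for i in range(20_000)],
}
balanced = cap_per_class(pools, NO_DEFECT_CAP, capped_classes={"no_defect"})
counts = {label: len(paths) for label, paths in balanced.items()}
```

Only the majority class is touched, so every available defect image still reaches the deduplication stage.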

CLIP Deduplication: The Most Important Step Nobody Talks About

After sampling, we ran CLIP-based semantic deduplication using OpenCLIP's ViT-B/32. Every image was embedded into a 512-dimensional feature space. Any pair with cosine similarity >0.95 was considered a near-duplicate — and only one was kept.

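import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
model = model.to(device).eval()

# E: (N, 512) tensor of L2-normalised image embeddings, so E @ E.T gives cosine similarity
# Compare embeddings in blocks to avoid OOM
for i in range(0, len(E), BLOCK):
    sims = E @ E[i:i+BLOCK].T  # (N, BLOCK) similarities for this block
    # Mark duplicates for removal (pairs with similarity > 0.95)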
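The loop above leaves out the marking step. One way to fill it in is a greedy pass that keeps the first image seen and drops anything too similar to an image already kept. The sketch below uses plain Python lists and tiny 3-dimensional vectors in place of CLIP embeddings so it runs anywhere. The article does not say which image of a duplicate pair was kept, so the "first seen wins" rule and the name `greedy_dedup` are assumptions.

```python
import math


def normalise(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def greedy_dedup(embeddings, threshold=0.95):
    """Return indices to keep; an item is dropped if it is too similar to a kept one."""
    unit = [normalise(v) for v in embeddings]
    kept = []
    for i, candidate in enumerate(unit):
        is_duplicate = any(
            sum(a * b for a, b in zip(candidate, unit[j])) > threshold
            for j in kept
        )
        if not is_duplicate:
            kept.append(i)
    return kept


# Toy stand-ins for image embeddings
vectors = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of item 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],   # near-duplicate of item 2
    [0.5, 0.5, 0.7],
]
kept = greedy_dedup(vectors)
```

This version compares each candidate against every kept item, which is quadratic in the worst case. At tens of thousands of images, the blocked matrix product shown above is the practical way to compute the same similarities.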

The result: 50,079 near-duplicates removed from 58,939 images, leaving 8,860 genuinely unique images.

Why does this matter? Because those 50,079 removed images weren't adding information — they were adding noise and memorization risk. The final training set of 8,860 unique images, each appearing in two VQA conversation formats (classification and severity assessment), gave us 17,720 training records of real signal.
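The record count follows directly from the numbers above. As a rough sketch, with placeholder prompt wording and field names that are assumptions and not the dataset's actual schema:

```python
def build_vqa_records(image_paths):
    """Emit one classification record and one severity record per image."""
    # Prompt wording is a placeholder, not the dataset's actual text
    prompts = {
        "classification": "What type of building defect is shown?",
        "severity": "How severe is this defect on a 1-5 scale?",
    }
    return [
        {"image": path, "task": task, "question": question}
        for path in image_paths
        for task, question in prompts.items()
    ]


unique_images = 58_939 - 50_079  # images left after deduplication
records = build_vqa_records([f"img_{i}.jpg" for i in range(unique_images)])
```

Each image contributes exactly two records, so 8,860 images produce 17,720 records.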

This dataset lives at: ritvik360/gharscan-defect-dataset

Fine-Tuning: LoRA on Modal Labs A100

With the dataset on HF Hub, we ran LoRA fine-tuning on Modal Labs using an A100-80GB GPU. The full training cost approximately $12 from the $250 Modal credits provided by the hackathon.

CFG = {
    "base_model_id":   "Qwen/Qwen2-VL-2B-Instruct",
    "lora_r":          16,
    "lora_alpha":      32,
    "target_modules":  ["q_proj", "v_proj", "k_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    "num_epochs":      3,
    "batch_size":      2,
    "grad_accum":      4,      # Effective batch = 8
    "learning_rate":   2e-4,
    "lr_scheduler":    "cosine",
}

Targeting both attention and MLP layers with r=16 gave us ~18.4M trainable parameters out of 2.27B total, about 0.8%, a small fraction that is typical of LoRA fine-tuning.
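The trainable-parameter figure can be checked by hand. For each wrapped linear layer, LoRA adds two matrices holding r × (in_features + out_features) weights in total. The sketch below assumes the decoder shapes listed in the constants (hidden size 1536, MLP size 8960, 28 layers, 256-wide key/value projections) and assumes the seven target names match only the language-model layers, not the vision tower. Both assumptions should be checked against the model's config before relying on the result.

```python
# Assumed decoder shapes for the Qwen2-VL-2B language model; verify against config.json
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_LAYERS = 28
KV_DIM = 256  # assumed: 2 key/value heads x 128 head dimension

# (in_features, out_features) for each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes, num_layers):
    """Count LoRA weights: A is (rank x in), B is (out x rank) per wrapped layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(16, TARGET_SHAPES, NUM_LAYERS)
```

Under these assumptions the count comes to 18,464,768, in line with the ~18.4M quoted above.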

Three epochs later:

train_loss: 5.086
eval_loss: 5.027
Training time: ~1 hour 54 minutes
Published: ritvik360/gharscan-qwen2vl-lora

One lesson learned the hard way: modal run requires your local terminal to stay connected. The first training run hit 68% and then died when the laptop was closed. The fix was simple but non-obvious: modal run --detach, which keeps the job alive on Modal's cloud regardless of local connectivity.

The Agentic Pipeline: Three Steps, Not One

GharScan doesn't just call the model once. Every image triggers a three-step agentic reasoning chain:

Image
  │
  ▼
Step 1: CLASSIFY
  What defect type is this? (8 classes)
  → defect_type, description, primary_cause, monsoon_risk
  │
  ▼
Step 2: SEVERITY ASSESSMENT
  Given defect_type, how serious is this?
  → severity (1–5), is_structural, immediate_action, urgency
  │
  ▼
Step 3: COST LOOKUP (deterministic)
  defect_type × severity → INR cost range, professional type
  (No model call — pure lookup table of 2026 Delhi/NCR market rates)

The cost estimation is deliberately not a model output. Models hallucinate prices. A hardcoded lookup table of verified market rates (sourced from HouseYog, Comaron, and UrbanCompany) is more accurate, more honest, and completely auditable.
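A minimal sketch of how the chain and the lookup could fit together is shown below. The table holds only the three rows quoted later in this article. The function names, the dictionary layout, and the stubbed model calls are illustrative assumptions, not GharScan's actual code.

```python
# (defect_type, severity) -> (low INR, high INR, professional or None).
# Only the three rows quoted in this article are filled in.
COST_TABLE = {
    ("settlement_crack", 2): (600, 1_200, "local mason"),
    ("water_seepage", 4): (10_000, 25_000, "waterproofing contractor"),
    ("efflorescence", 2): (500, 1_500, None),
}


def lookup_cost(defect_type, severity):
    """Step 3: deterministic lookup that returns None instead of guessing."""
    row = COST_TABLE.get((defect_type, severity))
    if row is None:
        return None
    low, high, professional = row
    return {"low_inr": low, "high_inr": high, "professional": professional}


def run_chain(image, classify, assess):
    """Chain the two model calls, then the table lookup."""
    step1 = classify(image)                      # Step 1: defect type
    step2 = assess(image, step1["defect_type"])  # Step 2: severity, given the type
    step3 = lookup_cost(step1["defect_type"], step2["severity"])
    return {**step1, **step2, "cost": step3}


# Stub model calls so the sketch runs without a GPU
report = run_chain(
    image=None,
    classify=lambda img: {"defect_type": "water_seepage"},
    assess=lambda img, defect_type: {"severity": 4},
)
```

Because the price never passes through the model, a missing table entry surfaces as `None` and can be reported as "no estimate available" instead of an invented number.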

Every reasoning chain is logged as an agent trace and uploaded to ritvik360/gharscan-agent-traces — qualifying for the Sharing is Caring hackathon badge.
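The stack summary at the end of this post notes that traces are uploaded every 20 calls. A batching buffer along those lines might look like the sketch below. The class name, the JSONL payload format, and the injected `upload` callable are assumptions. In the real app that callable would wrap a Hub upload; here it just appends to a list.

```python
import json


class TraceBuffer:
    """Collect agent traces in memory and flush them in fixed-size batches."""

    def __init__(self, upload, batch_size=20):
        self.upload = upload  # callable that receives one JSONL string per batch
        self.batch_size = batch_size
        self.pending = []

    def log(self, trace):
        self.pending.append(trace)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One JSON object per line
        payload = "\n".join(json.dumps(t, ensure_ascii=False) for t in self.pending)
        self.upload(payload)
        self.pending = []


uploads = []
buffer = TraceBuffer(upload=uploads.append)
for call in range(45):
    buffer.log({"call": call, "steps": ["classify", "severity", "cost_lookup"]})
```

After 45 calls the buffer has uploaded two batches of 20 and still holds 5 traces, so a final `flush()` on shutdown would be needed to avoid losing the tail.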

Deployment: gr.Server → gr.Blocks (A Painful Lesson)

The original plan was to use Gradio's new gr.Server() API for a fully custom HTML/CSS/JS frontend — the "Off-Brand" hackathon badge strategy.

# What we tried first
from gradio import Server
app = Server()

@app.api(name="analyze_defect")
@spaces.GPU
def analyze_defect(...): ...

This worked locally. On HF Spaces with ZeroGPU, it started and immediately shut down:

Application startup complete.
Uvicorn running on http://0.0.0.0:7860
Shutting down
Application shutdown complete.

No error. Just silence. After debugging, the root cause became clear: ZeroGPU works by monitoring Gradio's internal event queue for GPU allocation requests. When @spaces.GPU is stacked inside @app.api(), ZeroGPU's startup scanner sees the @app.api() wrapper in the module namespace — not the underlying @spaces.GPU function — and can't register it properly. The app starts, ZeroGPU fails to initialize, the process exits cleanly.

The fix: replace gr.Server() with gr.Blocks() and use HTML rendering inside Gradio's standard event handling. We lose the pure custom HTML frontend but keep a fully styled dark-theme report card rendered as HTML via gr.HTML(). The @spaces.GPU function sits at module level, visible to ZeroGPU's scanner:

import gradio as gr
import spaces

@spaces.GPU
def analyze_image(image, language):
    # ZeroGPU can find this: it's at module level
    return run_gharscan_pipeline(image, language)

with gr.Blocks(css=CUSTOM_CSS, theme=gr.themes.Base()) as demo:
    image_input = gr.Image(sources=["upload", "webcam"], type="pil")
    analyze_btn = gr.Button("🔍 Analyse Defect", variant="primary")
    report_output = gr.HTML()

    # `language` (the language selector component) and `analyze_and_render`
    # (the handler that produces the HTML report) are omitted from this excerpt
    analyze_btn.click(fn=analyze_and_render,
                      inputs=[image_input, language],
                      outputs=report_output)
demo.queue()
demo.launch()
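The report card itself is just an HTML string handed to gr.HTML(). A minimal sketch of such a renderer is shown below. The function name `render_report`, the result keys, and the `gharscan-report` class name are hypothetical, and the real report is more elaborate. The one design point worth showing is that model-generated text is escaped before it is placed into markup.

```python
from html import escape


def render_report(result):
    """Build an HTML table fragment from a pipeline result dict (keys are assumed)."""
    rows = "".join(
        # Escape both keys and values so model text cannot inject markup
        f"<tr><th>{escape(str(key))}</th><td>{escape(str(value))}</td></tr>"
        for key, value in result.items()
    )
    return f'<table class="gharscan-report">{rows}</table>'


html_report = render_report(
    {"defect_type": "efflorescence", "severity": 2, "note": "<b>brush off</b>"}
)
```

Styling would then come from the custom CSS passed to gr.Blocks, keyed on the table's class name.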

The Space: build-small-hackathon/GharScan

GGUF Quantization: Taking GharScan Offline

Beyond the Space demo, we wanted to prove GharScan could run entirely locally — on a laptop, offline, without internet. This meant converting the merged model to GGUF via llama.cpp.

The pipeline, executed in WSL Ubuntu from a Windows machine:

# 1. Build llama.cpp
sudo apt install -y build-essential cmake git
git clone https://github.com/ggml-org/llama.cpp
cmake -B llama.cpp/build && cmake --build llama.cpp/build -j$(nproc)

# 2. Merge LoRA into base
python - <<'PY'
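from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from peft import PeftModel
import torch

base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "ritvik360/gharscan-qwen2vl-lora")
model.merge_and_unload().save_pretrained("./gharscan-merged")
# The GGUF converter reads tokenizer files from the model directory
AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct").save_pretrained("./gharscan-merged")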
PY

# 3. Convert to GGUF
python llama.cpp/convert_hf_to_gguf.py gharscan-merged \
    --outfile gharscan-f16.gguf --outtype f16

# 4. Quantize to Q4_K_M
./llama.cpp/build/bin/llama-quantize \
    gharscan-f16.gguf gharscan-q4_k_m.gguf Q4_K_M

Final artifacts:

gharscan-f16.gguf — 3.09 GB
gharscan-q4_k_m.gguf — 941 MB

A 941MB building inspector that runs on a laptop with no internet. That is the promise of small models, delivered.
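The two file sizes also give a quick sanity check on the quantization. The arithmetic below assumes decimal units (1 GB = 10^9 bytes), assumes the f16 file stores exactly 2 bytes per weight, and ignores GGUF metadata, so the results are estimates.

```python
def implied_weights(f16_bytes):
    """Weights implied by an f16 file at 2 bytes each (metadata ignored)."""
    return f16_bytes / 2


def bits_per_weight(file_bytes, n_weights):
    """Average storage cost per weight in bits."""
    return file_bytes * 8 / n_weights


F16_BYTES = 3.09e9  # 3.09 GB, decimal units assumed
Q4_BYTES = 941e6    # 941 MB, decimal units assumed

n_weights = implied_weights(F16_BYTES)
q4_bpw = bits_per_weight(Q4_BYTES, n_weights)
compression = F16_BYTES / Q4_BYTES
```

Under these assumptions the Q4_K_M file works out to about 4.9 bits per weight, a roughly 3.3x reduction from f16.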

Published at: ritvik360/gharscan-qwen2vl-gguf

Three bugs encountered and fixed along the way:

PEFT version mismatch (torch.distributed tensor error) → fixed by pinning peft==0.10.0
PIL missing in WSL venv → pip install pillow
Windows/WSL clock skew warning → harmless, build succeeded regardless

What the Model Actually Learned

GharScan classifies defects into eight classes:

Class	Visual Signature	Severity Range
`hairline_crack`	Thin surface cracks, paint-only	1–3
`settlement_crack`	Diagonal at window/door corners	1–5
`structural_crack`	Wide, horizontal, load-bearing	2–5
`water_seepage`	Brown/yellow ceiling/wall stains	1–5
`efflorescence`	White salt crystal deposits	1–4
`spalling`	Concrete chunks detaching	2–5
`rebar_rust`	Orange streaks through concrete	2–5
`plaster_delamination`	Bubbling, hollow-sounding plaster	1–4

The severity scale maps directly to urgency:

1 – Cosmetic: At your next renovation
2 – Minor: Within 6 months
3 – Moderate: Within 1 month
4 – Serious: This week
5 – Critical: Immediately / Vacate
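The taxonomy table and the urgency scale above translate naturally into two lookup tables. The sketch below also clamps a predicted severity into its class's allowed range. The article does not describe how out-of-range predictions are handled, so the clamping and the function name `normalise_severity` are assumptions.

```python
# Severity level -> (label, urgency), copied from the scale above
URGENCY = {
    1: ("Cosmetic", "At your next renovation"),
    2: ("Minor", "Within 6 months"),
    3: ("Moderate", "Within 1 month"),
    4: ("Serious", "This week"),
    5: ("Critical", "Immediately / Vacate"),
}

# Defect class -> (min severity, max severity), copied from the taxonomy table
SEVERITY_RANGE = {
    "hairline_crack": (1, 3),
    "settlement_crack": (1, 5),
    "structural_crack": (2, 5),
    "water_seepage": (1, 5),
    "efflorescence": (1, 4),
    "spalling": (2, 5),
    "rebar_rust": (2, 5),
    "plaster_delamination": (1, 4),
}


def normalise_severity(defect_type, severity):
    """Clamp a predicted severity into the class's range and attach its urgency."""
    low, high = SEVERITY_RANGE[defect_type]
    clamped = max(low, min(high, severity))
    label, urgency = URGENCY[clamped]
    return {"severity": clamped, "label": label, "urgency": urgency}


checked = normalise_severity("hairline_crack", 5)
```

Here a hairline crack predicted at severity 5 is pulled back to 3, the top of its range, and reported as "Moderate: Within 1 month".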

The Real Impact: Aunty Puja's Answer in 12 Seconds

The technical stack is interesting. But the actual value is simpler.

Aunty Puja photographed three things in her DDA flat: a diagonal crack above her east window, a damp patch on the bedroom ceiling, and white deposits on the exterior wall.

GharScan told her:

Settlement crack, Severity 2/5 — Not structural. Seal before monsoon. ₹600–₹1,200, local mason.
Water seepage, Severity 4/5 — Waterproofing failure. Get a waterproofing contractor before next monsoon. ₹10,000–₹25,000.
Efflorescence, Severity 2/5 — Moisture seeping through masonry. Brush off, apply water-repellent primer. ₹500–₹1,500.

She didn't need a civil engineer. She didn't need three masons. She had a clear, prioritized action plan in under two minutes, with cost ranges she could verify herself. That is what small AI should do.

Hackathon Badge Scorecard

Badge	Claimed	How
🔌 Off the Grid	✅	No cloud APIs — ZeroGPU inference, no external model calls
🎯 Well-Tuned	✅	Fine-tuned LoRA published: ritvik360/gharscan-qwen2vl-lora
🎨 Off-Brand	✅	Custom dark CSS inspection report aesthetic in gr.Blocks
🦙 Llama Champion	✅	GGUF (Q4_K_M, 941MB): ritvik360/gharscan-qwen2vl-gguf
📡 Sharing is Caring	✅	Agent traces: ritvik360/gharscan-agent-traces
📓 Field Notes	✅	This post

Technical Stack Summary

Layer	Technology
Base model	Qwen2-VL-2B-Instruct (2.07B params)
Fine-tuning	LoRA r=16, 7 target modules, 3 epochs
Training compute	Modal Labs A100-80GB (~$12 spent)
Dataset	8,860 deduplicated images, 17,720 VQA records
Deduplication	CLIP ViT-B/32, cosine similarity >0.95
Inference	HF ZeroGPU, gr.Blocks, gr.HTML report card
Cost estimation	Deterministic INR lookup table (2026 Delhi rates)
Offline runtime	llama.cpp GGUF Q4_K_M, 941MB
Agent tracing	Auto-uploaded to HF Hub every 20 calls