GharScan: Teaching a 2B-Parameter Model to Read Indian Walls
"My neighbor Aunty Puja — a retired school teacher in a 1982 DDA flat in Delhi — called three masons about a crack above her window frame. She got three different diagnoses, three different quotes, and zero clarity. That moment became the problem statement for GharScan."
The Problem Nobody Talks About
Over 62% of India's urban housing stock was built before 1990. DDA flats in Delhi, MHADA buildings in Mumbai, old housing board colonies in every tier-1 city — they are all aging. Every monsoon season, millions of Indian homeowners discover cracks in walls, damp patches on ceilings, white salt deposits, or rust streaks seeping through their concrete.
The choices they face today are:
- Pay ₹2,000–8,000 for a civil engineer site visit
- Ask a local mason (who has an inherent financial incentive to recommend expensive repairs)
- Search Google and get terrifying results
- Ignore it and hope for the best
There is no affordable, unbiased first answer. GharScan is that first answer.
Point your phone camera at any building defect — a crack, a damp patch, concrete spalling, rebar rust — and get in 12 seconds: the defect type, a severity score from 1–5, whether it's a structural risk, exactly what to do, and an estimated cost range in Indian Rupees.
This is what AI at its best should feel like: specific, useful, honest, and small enough to run without sending your home photos to an API in California.
Why a Small Model?
This was built for the Build Small Hackathon 2026 by Gradio and HuggingFace — a competition specifically asking builders to think smaller. The constraint: ≤32 billion parameters, hosted on a HuggingFace Space.
We chose Qwen2-VL-2B-Instruct (2.07 billion parameters) for several reasons beyond just fitting the constraint:
Privacy-first architecture: Photos of your home interior — cracked walls, water damage, structural defects — are inherently private. Running inference entirely within HF's ZeroGPU, with no external API calls, means your images never leave a controlled environment. A 175B cloud model is genuinely wrong for this problem.
Right-sized for the task: Building defect classification is a narrow, structured visual reasoning task. A 2B VLM fine-tuned on thousands of crack images can match or beat a much larger general-purpose model on this specific domain; more parameters do not automatically mean better crack detection.
Tiny Titan eligible: At 2.07B, it qualifies for the hackathon's Tiny Titan prize category (≤4B params).
GGUF-friendly: A 2B model quantized to Q4_K_M fits in ~941MB — genuinely laptop-portable.
The Dataset: From 58,000 Images to 8,860 Unique Ones
Getting training data for Indian residential building defects is harder than it sounds. There is no "DDA flat crack" dataset on Kaggle.
Tier 1: Public Research Datasets
We aggregated five public datasets covering concrete and masonry defects:
| Dataset | Source | Images | What it covers |
|---|---|---|---|
| CODEBRIM | Zenodo DOI 10.5281/zenodo.2620293 | 10,457 | Bridge defects: cracks, spalling, efflorescence, rebar |
| SDNET2018 | Kaggle (IEEE DataPort) | ~56,000 | Bridge decks, pavements, walls — cracked vs uncracked |
| Surface Crack Detection | Kaggle (arunrk7) | 40,000 | Concrete surface cracks, binary positive/negative |
| MCrack1300 | GitHub arXiv:2401.15266 | 1,300 | Masonry phone-camera images |
| dacl1k | CVPR 2023 workshop | 1,474 | Multi-label building inspection |
A critical design decision here: we did not blindly use everything. The Surface Crack Detection dataset, for example, was collected by sliding a window across concrete surfaces — meaning thousands of its 20,000 "Positive" images are near-identical patches of the same wall. Training on those would teach the model to memorize specific textures, not to generalize.
The Class Imbalance Problem (And Why We Solved It Differently)
The naive approach would be to use all available images. But looking at the actual counts:
- no_defect images (non-cracked): ~70,000 available
- structural_crack images: ~16,000
- hairline_crack images: ~20,000
Using everything would create a 4:1 imbalance favouring no_defect. A model trained on that distribution learns to predict "nothing wrong here" for anything ambiguous — exactly the wrong behaviour for a safety-critical inspection tool.
Our solution was balanced sampling at ~20,000 per class, matching the largest defect class:
# Defect classes: use ALL available (they're already under 20k)
# no_defect: cap at 20k to prevent majority-class bias
NO_DEFECT_CAP = 20_000
# SDNET non-cracked: split evenly across surfaces
NC_SPLIT = NO_DEFECT_CAP // 3 # ~6,666 from Walls, Decks, Pavements each
This gave us 58,939 images going into deduplication.
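The capping step can be sketched as below. This is a minimal illustration rather than the project's actual sampling script: the function name `cap_classes`, the `(path, label)` record layout, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict

NO_DEFECT_CAP = 20_000


def cap_classes(records, caps, seed=0):
    """Down-sample only the classes listed in `caps`; keep every other class whole.

    records: iterable of (path, label) pairs (hypothetical layout).
    caps:    dict mapping label -> maximum number of samples to keep.
    """
    by_label = defaultdict(list)
    for path, label in records:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sampled = []
    for label, paths in by_label.items():
        limit = caps.get(label)
        if limit is not None and len(paths) > limit:
            paths = rng.sample(paths, limit)
        sampled.extend((p, label) for p in paths)
    return sampled
```

Calling `cap_classes(records, {"no_defect": NO_DEFECT_CAP})` would leave the defect classes untouched and trim only the majority class.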
CLIP Deduplication: The Most Important Step Nobody Talks About
After sampling, we ran CLIP-based semantic deduplication using OpenCLIP's ViT-B/32. Every image was embedded into a 512-dimensional feature space. Any pair with cosine similarity >0.95 was considered a near-duplicate — and only one was kept.
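import open_clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
model = model.to(device).eval()

# E: (N, 512) matrix of L2-normalised image embeddings from model.encode_image
# BLOCK: number of rows compared per chunk
# Compare embeddings in blocks to avoid OOM
for i in range(0, len(E), BLOCK):
    sims = E @ E[i:i+BLOCK].T  # cosine similarity of every image against this block
    # Mark duplicates for removal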
The result: 50,079 near-duplicates removed from 58,939 images, leaving 8,860 genuinely unique images.
Why does this matter? Because those 50,079 removed images weren't adding information — they were adding noise and memorization risk. The final training set of 8,860 unique images, each appearing in two VQA conversation formats (classification and severity assessment), gave us 17,720 training records of real signal.
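The snippet above leaves the marking step as a comment. One way to fill it in is a greedy "keep the first, drop later matches" pass, sketched here with NumPy on pre-computed embeddings. The greedy policy, the function name `dedup_indices`, and the block size are assumptions; the post does not say which member of a duplicate pair was kept.

```python
import numpy as np


def dedup_indices(E, threshold=0.95, block=1024):
    """Greedy near-duplicate removal on L2-normalised embeddings.

    E: (N, D) float array whose rows have unit norm, so E @ E.T is cosine similarity.
    Returns the sorted indices of rows to keep; the earliest member of each
    near-duplicate group survives.
    """
    n = E.shape[0]
    removed = np.zeros(n, dtype=bool)
    for start in range(0, n, block):
        stop = min(start + block, n)
        sims = E[start:stop] @ E.T          # (block, N) similarities
        for row in range(stop - start):
            i = start + row
            if removed[i]:
                continue                     # a removed image does not evict others
            dup = sims[row] > threshold
            dup[: i + 1] = False             # only consider later images
            removed |= dup
    return np.flatnonzero(~removed)
```

Processing one block of rows at a time keeps the similarity matrix at `block x N` instead of `N x N`, which is what makes the comparison fit in memory at tens of thousands of images.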
This dataset lives at: ritvik360/gharscan-defect-dataset
Fine-Tuning: LoRA on Modal Labs A100
With the dataset on HF Hub, we ran LoRA fine-tuning on Modal Labs using an A100-80GB GPU. The full training cost approximately $12 from the $250 Modal credits provided by the hackathon.
CFG = {
    "base_model_id": "Qwen/Qwen2-VL-2B-Instruct",
    "lora_r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "num_epochs": 3,
    "batch_size": 2,
    "grad_accum": 4,  # Effective batch = 8
    "learning_rate": 2e-4,
    "lr_scheduler": "cosine",
}
Targeting both attention and MLP layers with r=16 gave us ~18.4M trainable parameters out of 2.27B total, about 0.8%, which is a common range for LoRA fine-tuning of VLMs.
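The trainable-parameter count can be sanity-checked by hand. The sketch below assumes the language-model dimensions of Qwen2-VL-2B-Instruct (hidden size 1536, MLP size 8960, 28 layers, 2 key/value heads of dimension 128) and assumes the adapters attach only to the language-model layers, since those are the modules with these names. Both assumptions should be checked against the model's config.json.

```python
# Assumed language-model dimensions for Qwen2-VL-2B-Instruct; verify against config.json.
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_LAYERS = 28
KV_DIM = 2 * 128  # 2 key/value heads x head dim 128 (grouped-query attention)
LORA_R = 16

# (in_features, out_features) of each targeted linear layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(r, shapes, num_layers):
    # Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights.
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


total = lora_params(LORA_R, SHAPES, NUM_LAYERS)
print(f"{total:,}")  # 18,464,768
```

Under these assumptions the count comes to about 18.5M, in line with the ~18.4M reported above.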
Three epochs later:
- train_loss: 5.086
- eval_loss: 5.027
- Training time: ~1 hour 54 minutes
- Published: ritvik360/gharscan-qwen2vl-lora
One lesson learned the hard way: modal run requires your local terminal to stay connected. The first training run hit 68% and then died when the laptop was closed. The fix was simple but non-obvious: modal run --detach, which keeps the job alive on Modal's cloud regardless of local connectivity.
The Agentic Pipeline: Three Steps, Not One
GharScan doesn't just call the model once. Every image triggers a three-step agentic reasoning chain:
Image
│
▼
Step 1: CLASSIFY
What defect type is this? (8 classes)
→ defect_type, description, primary_cause, monsoon_risk
│
▼
Step 2: SEVERITY ASSESSMENT
Given defect_type, how serious is this?
→ severity (1–5), is_structural, immediate_action, urgency
│
▼
Step 3: COST LOOKUP (deterministic)
defect_type × severity → INR cost range, professional type
(No model call — pure lookup table of 2026 Delhi/NCR market rates)
The cost estimation is deliberately not a model output. Models hallucinate prices. A hardcoded lookup table of verified market rates (sourced from HouseYog, Comaron, and UrbanCompany) is more accurate, more honest, and completely auditable.
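A minimal sketch of how the three steps could chain together, with the cost step as a plain dictionary lookup. `ask_model` is a hypothetical wrapper around the VLM that returns a parsed dict, the prompts are placeholders, and the table holds only the three cost ranges quoted later in this post. The real table covers every defect type and severity and also returns a professional type.

```python
from typing import Callable, Optional, Tuple

# (defect_type, severity) -> (min_inr, max_inr). Illustrative subset only.
COST_TABLE = {
    ("settlement_crack", 2): (600, 1_200),
    ("water_seepage", 4): (10_000, 25_000),
    ("efflorescence", 2): (500, 1_500),
}


def lookup_cost(defect_type: str, severity: int) -> Optional[Tuple[int, int]]:
    """Step 3: pure dictionary lookup, no model call."""
    return COST_TABLE.get((defect_type, severity))


def run_pipeline(image, ask_model: Callable[[object, str], dict]) -> dict:
    """Chain classify -> severity -> cost lookup.

    `ask_model(image, prompt)` is a hypothetical callable that queries the
    model and returns its answer as a dict.
    """
    step1 = ask_model(image, "Classify the building defect in this photo.")
    defect_type = step1["defect_type"]

    step2 = ask_model(
        image, f"The defect is {defect_type}. Assess its severity from 1 to 5."
    )
    severity = int(step2["severity"])

    return {**step1, **step2, "cost_inr": lookup_cost(defect_type, severity)}
```

Because the lookup returns `None` for a combination missing from the table, the caller can show "no estimate available" instead of a made-up number.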
Every reasoning chain is logged as an agent trace and uploaded to ritvik360/gharscan-agent-traces — qualifying for the Sharing is Caring hackathon badge.
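Trace logging with batched uploads might look like the sketch below. The class name `TraceBuffer`, the record fields, and the `upload` callable are assumptions made for illustration. In the real app the upload would push a file to the Hub dataset repo, and the batch size of 20 matches the interval listed in the stack summary.

```python
import json
import time


class TraceBuffer:
    """Collect agent traces in memory and hand them off in batches."""

    def __init__(self, upload, flush_every=20):
        self.upload = upload            # callable taking a JSONL string (hypothetical sink)
        self.flush_every = flush_every
        self.pending = []

    def log(self, steps):
        # One record per analysed image: the list of reasoning steps plus a timestamp.
        self.pending.append({"timestamp": time.time(), "steps": steps})
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        payload = "\n".join(json.dumps(record) for record in self.pending)
        self.upload(payload)
        self.pending = []
```

Batching keeps the number of Hub commits low; calling `flush()` on shutdown would avoid losing a partial batch.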
Deployment: gr.Server → gr.Blocks (A Painful Lesson)
The original plan was to use Gradio's new gr.Server() API for a fully custom HTML/CSS/JS frontend — the "Off-Brand" hackathon badge strategy.
# What we tried first
from gradio import Server
app = Server()
@app.api(name="analyze_defect")
@spaces.GPU
def analyze_defect(...): ...
This worked locally. On HF Spaces with ZeroGPU, it started and immediately shut down:
Application startup complete.
Uvicorn running on http://0.0.0.0:7860
Shutting down
Application shutdown complete.
No error. Just silence. After debugging, the root cause became clear: ZeroGPU works by monitoring Gradio's internal event queue for GPU allocation requests. When @spaces.GPU is stacked inside @app.api(), ZeroGPU's startup scanner sees the @app.api() wrapper in the module namespace — not the underlying @spaces.GPU function — and can't register it properly. The app starts, ZeroGPU fails to initialize, the process exits cleanly.
The fix: replace gr.Server() with gr.Blocks() and use HTML rendering inside Gradio's standard event handling. We lose the pure custom HTML frontend but keep a fully styled dark-theme report card rendered as HTML via gr.HTML(). The @spaces.GPU function sits at module level, visible to ZeroGPU's scanner:
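import gradio as gr
import spaces

@spaces.GPU
def analyze_image(image, language):
    # ZeroGPU can find this: it's at module level
    return run_gharscan_pipeline(image, language)

with gr.Blocks(css=CUSTOM_CSS, theme=gr.themes.Base()) as demo:
    image_input = gr.Image(sources=["upload", "webcam"], type="pil")
    # Report language selector (the choices shown here are illustrative)
    language = gr.Dropdown(choices=["English", "Hindi"], value="English", label="Language")
    analyze_btn = gr.Button("🔍 Analyse Defect", variant="primary")
    report_output = gr.HTML()
    # analyze_and_render (not shown) is assumed to wrap analyze_image
    # and return the report card as an HTML string
    analyze_btn.click(fn=analyze_and_render,
                      inputs=[image_input, language],
                      outputs=report_output)

demo.queue()
demo.launch()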
The Space: build-small-hackathon/GharScan
GGUF Quantization: Taking GharScan Offline
Beyond the Space demo, we wanted to prove GharScan could run entirely locally — on a laptop, offline, without internet. This meant converting the merged model to GGUF via llama.cpp.
The pipeline, executed in WSL Ubuntu from a Windows machine:
# 1. Build llama.cpp
sudo apt install -y build-essential cmake git
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build && cmake --build llama.cpp/build -j$(nproc)
# 2. Merge LoRA into base
python - <<'PY'
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from peft import PeftModel
import torch
base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "ritvik360/gharscan-qwen2vl-lora")
model.merge_and_unload().save_pretrained("./gharscan-merged")
# The converter also needs the tokenizer files next to the weights
AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct").save_pretrained("./gharscan-merged")
PY
# 3. Convert to GGUF
python llama.cpp/convert_hf_to_gguf.py gharscan-merged \
    --outfile gharscan-f16.gguf --outtype f16
# 4. Quantize to Q4_K_M
./llama.cpp/build/bin/llama-quantize \
    gharscan-f16.gguf gharscan-q4_k_m.gguf Q4_K_M
Final artifacts:
- gharscan-f16.gguf: 3.09 GB
- gharscan-q4_k_m.gguf: 941 MB
A 941MB building inspector that runs on a laptop with no internet. That's the promise of small models kept.
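As a rough check on those file sizes, the arithmetic below estimates the effective bits per weight after quantization. It assumes "GB" and "MB" are decimal units and ignores GGUF metadata, so the result is only approximate.

```python
F16_BYTES = 3.09e9   # gharscan-f16.gguf, reading "GB" as 10**9 bytes
Q4_BYTES = 941e6     # gharscan-q4_k_m.gguf, reading "MB" as 10**6 bytes

approx_weights = F16_BYTES / 2                    # f16 stores 2 bytes per weight
bits_per_weight = Q4_BYTES * 8 / approx_weights   # average bits after quantization
compression = F16_BYTES / Q4_BYTES

print(f"{bits_per_weight:.2f} bits/weight, {compression:.1f}x smaller than f16")
```

This works out to a little under 5 bits per weight and a file roughly 3.3 times smaller than the f16 export.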
Published at: ritvik360/gharscan-qwen2vl-gguf
Three bugs encountered and fixed along the way:
- PEFT version mismatch (torch.distributed tensor error) → fixed by pinning peft==0.10.0
- PIL missing in WSL venv → pip install pillow
- Windows/WSL clock skew warning → harmless, build succeeded regardless
What the Model Actually Learned
The defect taxonomy GharScan classifies across:
| Class | Visual Signature | Severity Range |
|---|---|---|
| hairline_crack | Thin surface cracks, paint-only | 1–3 |
| settlement_crack | Diagonal at window/door corners | 1–5 |
| structural_crack | Wide, horizontal, load-bearing | 2–5 |
| water_seepage | Brown/yellow ceiling/wall stains | 1–5 |
| efflorescence | White salt crystal deposits | 1–4 |
| spalling | Concrete chunks detaching | 2–5 |
| rebar_rust | Orange streaks through concrete | 2–5 |
| plaster_delamination | Bubbling, hollow-sounding plaster | 1–4 |
The severity scale maps directly to urgency:
- 1 – Cosmetic: At your next renovation
- 2 – Minor: Within 6 months
- 3 – Moderate: Within 1 month
- 4 – Serious: This week
- 5 – Critical: Immediately / Vacate
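The taxonomy and the urgency scale translate naturally into two small lookup tables. The sketch below also clamps a predicted severity into its class's allowed range. That clamping policy and the function names are assumptions for illustration; the post does not say how out-of-range predictions are handled.

```python
SEVERITY_URGENCY = {
    1: ("Cosmetic", "At your next renovation"),
    2: ("Minor", "Within 6 months"),
    3: ("Moderate", "Within 1 month"),
    4: ("Serious", "This week"),
    5: ("Critical", "Immediately / Vacate"),
}

SEVERITY_RANGE = {
    "hairline_crack": (1, 3),
    "settlement_crack": (1, 5),
    "structural_crack": (2, 5),
    "water_seepage": (1, 5),
    "efflorescence": (1, 4),
    "spalling": (2, 5),
    "rebar_rust": (2, 5),
    "plaster_delamination": (1, 4),
}


def normalise_severity(defect_type, severity):
    """Clamp a predicted severity into the class's range (an assumed policy)."""
    low, high = SEVERITY_RANGE[defect_type]
    return max(low, min(high, int(severity)))


def describe(defect_type, severity):
    level = normalise_severity(defect_type, severity)
    label, urgency = SEVERITY_URGENCY[level]
    return f"{defect_type}: {level}/5 ({label}), act: {urgency}"
```

For example, a hairline crack predicted at severity 5 would be reported as 3/5, the top of that class's range.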
The Real Impact: Aunty Puja's Answer in 12 Seconds
The technical stack is interesting. But the actual value is simpler.
Aunty Puja photographed three things in her DDA flat: a diagonal crack above her east window, a damp patch on the bedroom ceiling, and white deposits on the exterior wall.
GharScan told her:
- Settlement crack, Severity 2/5 — Not structural. Seal before monsoon. ₹600–₹1,200, local mason.
- Water seepage, Severity 4/5 — Waterproofing failure. Get a waterproofing contractor before next monsoon. ₹10,000–₹25,000.
- Efflorescence, Severity 2/5 — Moisture seeping through masonry. Brush off, apply water-repellent primer. ₹500–₹1,500.
She didn't need a civil engineer. She didn't need three masons. She had a clear, prioritized action plan in under two minutes, with cost ranges she could verify herself. That is what small AI should do.
Hackathon Badge Scorecard
| Badge | Claimed | How |
|---|---|---|
| 🔌 Off the Grid | ✅ | No cloud APIs — ZeroGPU inference, no external model calls |
| 🎯 Well-Tuned | ✅ | Fine-tuned LoRA published: ritvik360/gharscan-qwen2vl-lora |
| 🎨 Off-Brand | ✅ | Custom dark CSS inspection report aesthetic in gr.Blocks |
| 🦙 Llama Champion | ✅ | GGUF (Q4_K_M, 941MB): ritvik360/gharscan-qwen2vl-gguf |
| 📡 Sharing is Caring | ✅ | Agent traces: ritvik360/gharscan-agent-traces |
| 📓 Field Notes | ✅ | This post |
Technical Stack Summary
| Layer | Technology |
|---|---|
| Base model | Qwen2-VL-2B-Instruct (2.07B params) |
| Fine-tuning | LoRA r=16, 7 target modules, 3 epochs |
| Training compute | Modal Labs A100-80GB (~$12 spent) |
| Dataset | 8,860 deduplicated images, 17,720 VQA records |
| Deduplication | CLIP ViT-B/32, cosine similarity >0.95 |
| Inference | HF ZeroGPU, gr.Blocks, gr.HTML report card |
| Cost estimation | Deterministic INR lookup table (2026 Delhi rates) |
| Offline runtime | llama.cpp GGUF Q4_K_M, 941MB |
| Agent tracing | Auto-uploaded to HF Hub every 20 calls |
Repos and Links
- 🏗️ Space: build-small-hackathon/GharScan
- 🤖 LoRA model: ritvik360/gharscan-qwen2vl-lora
- 📦 GGUF: ritvik360/gharscan-qwen2vl-gguf
- 📊 Dataset: ritvik360/gharscan-defect-dataset
- 🔍 Agent traces: ritvik360/gharscan-agent-traces
Built in 10 days for the Build Small Hackathon 2026 by Gradio and HuggingFace. Track: Backyard AI.
GharScan is a triage tool. For Severity 4–5 defects, always obtain a professional structural assessment.