Field Notes: Building a Chest X-ray Draft Auditor with a Tiny VLM

Community Article Published June 14, 2026

Summary

CXR Draft Auditor is a research and educational QA tool, not a report generator and not a diagnostic device. You hand it a chest X-ray plus a human-written draft impression, and it flags where the image and the draft disagree, drawing an evidence box on the image for each MISSING finding (present on the image but absent or denied in the draft), each UNSUPPORTED over-call (asserted in the draft but absent on the image), and each URGENT can't-miss finding (pneumothorax or nodule/mass present on the image, surfaced for a second look).

The architecture is deliberately decomposed and transparent rather than one black box: a fine-tuned MedGemma 4B vision-language model grounds the image into a constrained label-and-box set, NVIDIA Nemotron-3 Nano 4B parses the draft text into the same labels (reading explicit denials and keeping verbatim spans), and a deterministic, model-free comparator does the only judging, so every flag traces back to a specific image finding and draft phrase.

The whole project was shaped by an open-data constraint, no PhysioNet access, which is why it leans on Kaggle-reachable VinDr-CXR boxes and a synthetic-draft method (faithfully describe the real box labels, then corrupt the draft in exactly one controlled way: drop a finding, add an absent finding, or change nothing) that manufactured known ground truth and unblocked credible evaluation.

The hardest lesson was about evaluation integrity: a silent output-handling bug was discarding a chunk of the better model's generations before scoring, making v2 look worse on the urgent findings I cared about most and nearly leading me to serve the worse v1, until I forced the pipeline to stop dropping data and the v2-versus-v1 decision flipped to the correct model.

This is a write-up of what I built for the Build Small Hackathon and what I learned along the way: the problem, the data story (which was the hard part), the transparent decomposed architecture (a fine-tuned MedGemma grounds the image, NVIDIA Nemotron parses the draft, and a deterministic comparator does the only judging), the synthetic-draft method that let me test the audit loop without a perfect dataset, the evaluation-integrity lesson that nearly cost me the better model, and a look at what the resulting system does and does not do well, backed by a real held-out evaluation.

How this maps to the hackathon badges

For judges, here is how the project lines up with the tracks and badges, each tied to a concrete decision in the build:

  • 🏡 Backyard AI (track:backyard): I built this for a real backyard problem. A radiologist I know needed a reliable second read on chest X-ray draft impressions, and the whole tool is shaped around that need.
  • 🟢 NVIDIA (sponsor:nvidia): NVIDIA Nemotron-3 Nano 4B is one of the two models that make the audit work. It parses the free-text draft into the canonical labels and their denials, the text half of the loop that the grounding model could not read reliably.
  • 🎯 Well-Tuned (achievement:welltuned): the grounding model is a MedGemma I fine-tuned, twice (v1, then a cleaner-data v2), for constrained CXR finding-and-box extraction, both published on the Hub with a held-out evaluation behind the choice.
  • 🎨 Off-Brand (achievement:offbrand): the app wears a custom "Reading Room / Clinical Light" interface with a light and dark toggle, not the stock Gradio look.
  • 📡 Sharing is Caring (achievement:sharing): I published a small open dataset of real audit traces from the Space, so anyone can see what the tool decided on a real run.
  • 📓 Field Notes (achievement:fieldnotes): this write-up, including the evaluation-integrity lesson that nearly cost me the better model.
  • 🤏 Tiny Titan (achievement:tinytitan): every model in the pipeline is genuinely tiny. The fine-tuned MedGemma grounding model and NVIDIA Nemotron-3 Nano 4B are each 4B parameters, within the badge's 4B limit, with no large model anywhere in the stack.
  • 🎬 Best Demo (achievement:bestdemo): the submission is the full package, a custom Reading Room app, a demo video that walks through an audit end to end, and a social post (see Links).

The rest of this post is the story behind those decisions.

Links

The problem

Automated chest X-ray report generation is not solved. Recent work shows generated reports are error-free on fewer than half of abnormal cases, and the literature on grounded fact-checking explicitly leaves omission detection as future work. The two failure modes that matter clinically are the two hardest to catch automatically: a draft that misses a finding that is actually present (an omission), and a draft that asserts a finding the image does not support (an over-call).

So instead of generating yet another report, I built an auditor. It takes a chest X-ray and a human-written draft impression, and it asks a narrower question: where do the image and the draft appear to disagree, and can I show the evidence? The output is never a verdict. It is a set of flags, each tied to a region on the image, that send a person back to look again.

The no-PhysioNet data story

The single biggest constraint shaped everything: no PhysioNet credentials. That rules out MIMIC-CXR and most of the paired image-plus-report-plus-box datasets the field leans on. I needed real radiologist bounding boxes from a source I could actually access.

The crux turned out to be VinDr-CXR. It is commonly described as PhysioNet-gated, but it is reachable without PhysioNet through Kaggle: the VinBigData Chest X-ray Abnormalities Detection competition (via the rules click-through, or Late Submission) and public resized PNG mirrors that need no competition entry at all. The boxes are radiologist-drawn, with up to three readers per image. The important caveat is licensing: the upstream VinDr Data Use Agreement is non-commercial research only, and a CC0 tag on a downstream mirror does not override that. My project is research and educational, which fits. Because of that DUA I keep the SFT corpora private and never redistribute VinDr pixels on the public Space.

I layered a few more open sources on top. The VinDr-CXR-VQA dataset (faizan711/VinDR-CXR-VQA) is annotations only, no images: it ships a single data_v1.json that I join to the Kaggle VinDr pixels by image_id, a 32-character hex filename. Its gt_location boxes are in original full-resolution pixel space, so I rescale them per image whenever I pair them against a resized image mirror. ChestX-Det (natealberti/ChestX-Det) gave me a second box source, with its annotations under Apache-2.0. NIH ChestX-ray14 with BBox_List_2017.csv is held out for box evaluation, and because its images are openly licensed I use a few of them as the example images on the public Space (via the natealberti/ChestX-Det redistribution). IU-Xray / Open-i gave me real radiology reports, used only to check that my draft parser handles realistic phrasing.
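The per-image rescaling mentioned above is simple arithmetic, sketched here under stated assumptions: the function name rescale_box is hypothetical, the box order (x_min, y_min, x_max, y_max) is an assumption about gt_location rather than something confirmed here, and the mirror is assumed to be a plain resize with no padding or cropping.

```python
from typing import Tuple

# Assumed layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def rescale_box(box: Box, orig_size: Tuple[int, int], new_size: Tuple[int, int]) -> Box:
    """Map a box from original-resolution pixels to a resized image's pixels.

    Sizes are (width, height). Assumes a plain resize: no letterboxing, no crop.
    """
    orig_w, orig_h = orig_size
    new_w, new_h = new_size
    if orig_w <= 0 or orig_h <= 0:
        raise ValueError("original size must be positive")
    sx = new_w / orig_w  # horizontal scale factor
    sy = new_h / orig_h  # vertical scale factor
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


# A box drawn on a 2000x2000 original, mapped onto a 1000x500 resized copy.
scaled = rescale_box((100, 200, 300, 400), (2000, 2000), (1000, 500))
```

The scale factors are computed per image because the originals do not share one resolution, so a single global factor would misplace boxes.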

The gap: there is no instant-access open dataset with images, real free-text reports, and boxes all at once. PadChest-GR is the closest, but it is request-gated, so I never put it on the critical path. The way around the gap is the synthetic-draft method below.

I normalized every dataset's native labels into one small canonical set of six findings: pleural effusion, pneumothorax, lung opacity / consolidation, nodule / mass, cardiomegaly, and no-finding. Labels with no canonical counterpart (aortic enlargement, atelectasis, calcification, and so on) are dropped rather than forced.
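As a rough illustration of that normalization, a lookup table with an explicit "drop" outcome is enough. The canonical identifiers and the native spellings below are illustrative assumptions, not the project's actual mapping table.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical canonical identifiers for the six findings.
_NATIVE_TO_CANONICAL: Dict[str, str] = {
    "pleural effusion": "pleural_effusion",
    "effusion": "pleural_effusion",
    "pneumothorax": "pneumothorax",
    "lung opacity": "lung_opacity",
    "consolidation": "lung_opacity",
    "nodule/mass": "nodule_mass",
    "nodule": "nodule_mass",
    "mass": "nodule_mass",
    "cardiomegaly": "cardiomegaly",
    "no finding": "no_finding",
}


def normalize_label(native: str) -> Optional[str]:
    """Return the canonical label, or None when there is no counterpart."""
    key = " ".join(native.strip().lower().split())  # case- and whitespace-insensitive
    return _NATIVE_TO_CANONICAL.get(key)


def normalize_all(native_labels: Iterable[str]) -> List[str]:
    """Normalize a list of native labels, dropping unmapped ones and duplicates."""
    out: List[str] = []
    for native in native_labels:
        canonical = normalize_label(native)
        if canonical is not None and canonical not in out:
            out.append(canonical)
    return out
```

Returning None for an unmapped label, instead of guessing the nearest class, is what "dropped rather than forced" looks like in code.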

The decomposed, transparent architecture

I deliberately did not build one end-to-end black box. The system has three layers: each of the two perception layers uses the model that is actually good at its job, and the only layer that makes a judgment is the one with no model in it.

  1. Image to grounded findings. A fine-tuned MedGemma 4B vision-language model, running on the GPU, emits a constrained JSON list of findings over the six labels, each with a normalized bounding box in MedGemma's native [y0, x0, y1, x1] format.
  2. Draft to labels. NVIDIA Nemotron-3 Nano 4B, running on the GPU through Hugging Face transformers, parses the draft impression into the same six labels, marking each as present or absent and keeping the verbatim draft phrase that produced each label. It reasons briefly over the draft before emitting the label JSON, which materially improves extraction on multi-clause drafts; the reasoning trace is stripped before the labels are parsed. Crucially, it reads explicit denials: paste "Cardiomegaly is present. No pneumothorax." and it returns cardiomegaly present plus pneumothorax absent, each with the exact span it came from.
  3. Deterministic comparison. A pure-logic comparator, no model, no randomness, applies three rules: a finding present in the image but absent or denied in the draft is MISSING; a finding asserted in the draft but absent from the image is UNSUPPORTED; any image-present finding on the urgent whitelist (pneumothorax and nodule / mass, both can't-miss findings) is surfaced as URGENT.
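The three rules above are small enough to sketch in full. This is a minimal illustration, not the project's comparator: the class names, the label identifiers, and the choice to leave the box empty on an UNSUPPORTED flag are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # normalized [y0, x0, y1, x1]
URGENT_LABELS = frozenset({"pneumothorax", "nodule_mass"})  # the urgent whitelist


@dataclass(frozen=True)
class ImageFinding:
    label: str
    box: Box


@dataclass(frozen=True)
class DraftLabel:
    label: str
    present: bool  # False means the draft explicitly denies the finding
    span: str      # verbatim draft phrase


@dataclass(frozen=True)
class Flag:
    kind: str  # "MISSING", "UNSUPPORTED", or "URGENT"
    label: str
    box: Optional[Box]
    span: Optional[str]


def compare(image: List[ImageFinding], draft: List[DraftLabel]) -> List[Flag]:
    """Pure logic: no model and no randomness, so identical inputs give identical flags."""
    draft_by_label: Dict[str, DraftLabel] = {d.label: d for d in draft}
    image_labels = {f.label for f in image}
    flags: List[Flag] = []
    for f in image:
        if f.label == "no_finding":
            continue
        d = draft_by_label.get(f.label)
        span = d.span if d is not None else None
        if d is None or not d.present:
            flags.append(Flag("MISSING", f.label, f.box, span))
        if f.label in URGENT_LABELS:
            flags.append(Flag("URGENT", f.label, f.box, span))
    for d in draft:
        if d.present and d.label != "no_finding" and d.label not in image_labels:
            # Assumption: no image region exists to point at, so the box stays empty.
            flags.append(Flag("UNSUPPORTED", d.label, None, d.span))
    return flags
```

Because every Flag carries the box and the span it came from, the explanation for a flag is the flag itself.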

The reason for the two-model split is practical, not decorative. My first design used the fine-tuned MedGemma for both jobs: grounding the image and parsing the draft. It is genuinely good at the first, because that is what I fine-tuned it to do, but I had narrowed it so hard on grounded finding extraction that it had become an unreliable reader of free text. It would miss denials, drop the verbatim span, or wander outside the label set on ordinary report phrasing. The draft parser does not need to look at the image at all; it needs to follow instructions over text and respect a strict schema. So I gave that job to a model built for it. NVIDIA Nemotron-3 Nano 4B is a small, instruction-following text model. Its native nemotron_h architecture (a Mamba2-Transformer hybrid) is supported directly by the transformers library, so it loads from the bf16 weights and runs on the GPU with no extra runtime and no CUDA build of its own. It parses the draft cleanly, including the denials and the spans, while MedGemma keeps doing the grounding it was tuned for. Because both models run on the GPU, a full audit takes well under a minute, roughly 15 to 30 seconds. If the draft cannot be parsed at all, the audit degrades to an image-only pass with a visible note rather than failing.
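A minimal sketch of the post-processing around the draft parser, under loud assumptions: the reasoning trace is assumed to end with a closing think tag, the parser is assumed to emit a JSON list of {label, present, span} objects, and every name below (ParseResult, parse_draft_output, the note text) is hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

LABELS = frozenset({"pleural_effusion", "pneumothorax", "lung_opacity",
                    "nodule_mass", "cardiomegaly", "no_finding"})
IMAGE_ONLY_NOTE = "Draft could not be parsed; showing an image-only pass."


@dataclass
class ParseResult:
    labels: List[dict] = field(default_factory=list)
    note: Optional[str] = None  # a visible message whenever something was lost


def strip_reasoning(raw: str, close_tag: str = "</think>") -> str:
    """Keep only the text after the last closing tag (assumed trace delimiter)."""
    _, sep, tail = raw.rpartition(close_tag)
    return tail if sep else raw


def _is_valid(item: object, draft: str) -> bool:
    """An item must use a known label, a boolean flag, and a span copied from the draft."""
    if not isinstance(item, dict):
        return False
    label, present, span = item.get("label"), item.get("present"), item.get("span")
    return (isinstance(label, str) and label in LABELS
            and isinstance(present, bool)
            and isinstance(span, str) and bool(span) and span in draft)


def parse_draft_output(raw: str, draft: str) -> ParseResult:
    try:
        items = json.loads(strip_reasoning(raw).strip())
    except json.JSONDecodeError:
        return ParseResult(note=IMAGE_ONLY_NOTE)
    if not isinstance(items, list):
        return ParseResult(note=IMAGE_ONLY_NOTE)
    kept = [item for item in items if _is_valid(item, draft)]
    rejected = len(items) - len(kept)
    if rejected and not kept:
        return ParseResult(note=IMAGE_ONLY_NOTE)
    note = f"{rejected} draft label(s) ignored as invalid." if rejected else None
    return ParseResult(labels=kept, note=note)
```

Checking that each span is a literal substring of the draft is one cheap way to enforce "verbatim"; a rejected item surfaces in the note rather than disappearing.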

The reason for the decomposition as a whole is trust. Because the comparator is deterministic and reads two explicit label sets, every flag is explainable: you can see which image finding and which draft phrase produced it, and you can see the box. Swapping the draft parser for Nemotron did not add a judgment layer; the only thing that decides MISSING, UNSUPPORTED, or URGENT is still the model-free comparator. The two perception models can each be wrong, but the flag they feed into is never a mystery.

The pure-logic core (the label set, the schema, the prompts, the comparator, the metrics, and the synthetic-draft generator) depends only on the Python standard library, numpy, and pydantic. It unit-tests with no GPU, no torch, and no network. The heavy stacks are optional extras, imported lazily. That separation kept iteration fast and the tests objective.

The synthetic-draft method

I could not get the perfect triple of image, real report, and box, so I manufactured the part I was missing. Starting from the real box labels, I generate a synthetic draft impression that faithfully describes those findings, and then I corrupt it in exactly one of three controlled ways:

  • Drop a present finding. This produces a MISSING case with known ground truth.
  • Add an absent finding. This produces an UNSUPPORTED case with known ground truth.
  • Change nothing. This is the faithful draft, my negative control, which must produce no flags.

Because I know the corruption I applied, I know the correct audit decision, so I can measure audit precision and recall directly. The real IU-Xray reports are used only to validate that the parser reads realistic prose, never as box supervision. This decoupling is what made a credible audit loop possible inside one week from open data alone.
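The corruption step can be sketched in a few lines. The template phrases, label identifiers, and function names here are illustrative assumptions; the real generator's wording is not shown in this post.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical one-sentence templates per finding.
PHRASES = {
    "pleural_effusion": "There is a pleural effusion.",
    "pneumothorax": "There is a pneumothorax.",
    "lung_opacity": "There is a lung opacity.",
    "nodule_mass": "There is a pulmonary nodule or mass.",
    "cardiomegaly": "The heart is enlarged.",
}
NORMAL = "No acute cardiopulmonary abnormality."


def faithful_draft(present: Sequence[str]) -> str:
    """Describe exactly the findings given, or a normal study if there are none."""
    return " ".join(PHRASES[label] for label in present) if present else NORMAL


def make_case(present: Sequence[str], mode: str,
              rng: random.Random) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (draft, expected corruption-induced flags) for one controlled corruption."""
    present = list(present)
    if mode == "faithful":
        return faithful_draft(present), []  # negative control: no flags expected
    if mode == "drop":
        if not present:
            raise ValueError("nothing to drop from a normal study")
        victim = rng.choice(present)
        kept = [label for label in present if label != victim]
        return faithful_draft(kept), [("MISSING", victim)]
    if mode == "add":
        absent = [label for label in PHRASES if label not in present]
        if not absent:
            raise ValueError("every finding is already present")
        extra = rng.choice(absent)
        return faithful_draft(present + [extra]), [("UNSUPPORTED", extra)]
    raise ValueError(f"unknown mode: {mode}")
```

The expected list covers only the flag the corruption should induce; URGENT flags depend on the image alone and would be accounted for separately. Passing in a seeded random.Random keeps a generated evaluation set reproducible.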

Training

Before I fine-tuned anything, I tried the stock base model. I pointed google/medgemma-1.5-4b-it, exactly as it ships, at the grounding job with the same production prompt I planned to use, on the handful of example chest X-rays. The encouraging part was that the medical knowledge was plainly already in there: read its output and it reasons sensibly about opacities, heart size, the things a chest X-ray model should know. The discouraging part was that it would not honor the contract the rest of the auditor depends on: a clean, on-vocabulary JSON list of {label, box} and nothing else, which is what gives the deterministic comparator an explicit label set it can read without guessing. The knowledge was present; the discipline was not.

A small, illustrative pass over the four example images, not a benchmark, just enough to see the shape of the problem, made the gap concrete. On one image it narrated paragraphs of step-by-step reasoning instead of emitting the JSON, so nothing parsed and the audit came back with zero findings on an image that had them. On another it called a real cardiomegaly normal and missed the finding entirely. On the mass image it labeled the mass as "consolidation," off-vocabulary for what it was, and then tacked on a contradictory "no finding" to the same output, which silently knocked out the URGENT flag, the single case I least wanted to lose. None of this was a knowledge gap; it was a reliability gap, narrating instead of emitting, missing findings, and mislabels that quietly cost the urgent flag. That is what sent me to fine-tuning, and it framed exactly what fine-tuning was for: not to teach the model to see a chest X-ray, which it already could, but to discipline it into reliable, parseable, on-vocabulary findings-with-boxes, every time, so the comparator downstream could trust what it was handed.

The fine-tune itself used QLoRA (4-bit NF4) on that same base, through TRL's SFTTrainer with PEFT: rank 16, alpha 16, learning rate 2e-4, with LoRA on the attention and MLP projections and the loss computed over the assistant target only. The training target is the constrained finding JSON with boxes. I ran it on a single A100 through Hugging Face Jobs (after a cheap smoke run to catch container and CUDA issues first), merged the adapter into a clean bf16 base rather than the 4-bit model, and verified that the merge actually captured the adapter before publishing, because a merge that silently fails would publish the unchanged base. The merged 16-bit model fits the ZeroGPU tier at bf16 with no quantization and serves through a Gradio Space. An Unsloth FastVisionModel path is also provided for free Kaggle or local training.

I trained this twice. V1 was one epoch over roughly 6,800 curated, class-balanced grounding examples (alex-feeel/cxr-sft). When I looked at v1 closely it under-localized on harder images and kept double-labeling one region as both opacity and nodule, so I re-curated the corpus: alex-feeel/cxr-sft-v2 deduplicates same-region cross-finding overlaps down to the more specific label (cross-finding overlap at IoU >= 0.6) and merges duplicate triple-reader boxes (union-find at IoU >= 0.5). V2 is two epochs over that cleaner corpus. V2 (alex-feeel/medgemma-cxr-auditor-v2) is the model the Space serves today; v1 (alex-feeel/medgemma-cxr-auditor) is kept public for reference and comparison.
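The two curation steps can be sketched as follows. The thresholds (0.5 and 0.6) come from the text above; everything else is an assumption for illustration, including the (x0, y0, x1, y1) box order, averaging the coordinates of a merged group, and the SPECIFICITY ranking.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)
Ann = Tuple[str, Box]                    # (label, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_reader_boxes(anns: List[Ann], thr: float = 0.5) -> List[Ann]:
    """Union-find over same-label boxes; each connected group becomes one mean box."""
    parent = list(range(len(anns)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(anns)):
        for j in range(i + 1, len(anns)):
            if anns[i][0] == anns[j][0] and iou(anns[i][1], anns[j][1]) >= thr:
                parent[find(i)] = find(j)
    groups: Dict[int, List[int]] = {}
    for i in range(len(anns)):
        groups.setdefault(find(i), []).append(i)
    merged: List[Ann] = []
    for members in groups.values():
        # Assumption: a merged box is the coordinate-wise mean of its group.
        coords = [sum(anns[m][1][k] for m in members) / len(members) for k in range(4)]
        merged.append((anns[members[0]][0], (coords[0], coords[1], coords[2], coords[3])))
    return merged


# Hypothetical ranking: a higher value means a more specific label.
SPECIFICITY = {"lung_opacity": 0, "nodule_mass": 1}


def drop_generic_overlaps(anns: List[Ann], thr: float = 0.6) -> List[Ann]:
    """Remove a box when a more specific label covers the same region."""
    kept: List[Ann] = []
    for i, (label_i, box_i) in enumerate(anns):
        shadowed = any(
            SPECIFICITY.get(label_j, 0) > SPECIFICITY.get(label_i, 0)
            and iou(box_i, box_j) >= thr
            for j, (label_j, box_j) in enumerate(anns) if j != i
        )
        if not shadowed:
            kept.append((label_i, box_i))
    return kept
```

Union-find matters for the triple-reader case because overlap is not transitive: reader A may overlap B and B may overlap C without A overlapping C, and all three should still collapse into one box.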

What the evaluation showed

For a while my own impression was that v2 had regressed: on a couple of out-of-distribution cancer X-rays it seemed to drop an urgent flag, and on the held-out comparison v2 looked worse than v1 on exactly the can't-miss findings I cared about most. I came close to keeping v1 as the served model on the strength of that comparison. That would have been the wrong call, and catching why is the most useful thing I learned this week.

The numbers were not measuring the model. A silent output-handling failure was quietly corrupting the evaluation itself. The step that turned a raw generation into a scored result threw away any output it could not consume cleanly, with no error and no count, so a chunk of v2's generations were being dropped before they were ever scored. Six of v2's nine held-out nodule cases vanished that way, which is exactly why v2 looked worse on urgent recall. The model was producing the findings; the evaluation was discarding them and then scoring v2 as if it had stayed silent. A measurement that silently drops data does not produce a smaller dataset, it produces wrong conclusions, and this one was about to flip a real decision toward the worse model.

The lesson I took from it is about evaluation integrity, not about one bug. Once I made the output handling refuse to drop anything silently (recover what it can from imperfect output, fall back visibly when it genuinely cannot, and never discard a generation without saying so) and re-ran the held-out comparison, the real signal showed through and the v1-versus-v2 decision flipped: v2 was the better model and had been all along. I would not have known that without distrusting my own first set of numbers and tracing where the data went. The same principle now governs the live system: when the draft parser cannot make sense of a draft, the audit degrades to an image-only pass with a visible note rather than silently dropping the draft, so a failure is always something you can see.
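One way to make that principle concrete in an evaluation loop is to return a status with every parse and keep a tally. This is a sketch under assumptions: the function names and the recovery heuristic (take the outermost bracketed span) are illustrative, not the fix actually used in the project.

```python
import json
from collections import Counter
from typing import List, Sequence, Tuple


def extract_findings(raw: str) -> Tuple[List[dict], str]:
    """Return (findings, status), with status one of 'clean', 'recovered', 'failed'."""
    text = raw.strip()
    try:
        data = json.loads(text)
        if isinstance(data, list):
            return data, "clean"
    except json.JSONDecodeError:
        pass
    # Recovery attempt: parse the outermost [...] span, ignoring prose or fences around it.
    start, end = text.find("["), text.rfind("]")
    if start != -1 and end > start:
        try:
            data = json.loads(text[start:end + 1])
            if isinstance(data, list):
                return data, "recovered"
        except json.JSONDecodeError:
            pass
    return [], "failed"


def parse_all(generations: Sequence[str]) -> Tuple[List[List[dict]], Counter]:
    """Parse every generation and report how each one was handled."""
    tally: Counter = Counter()
    parsed: List[List[dict]] = []
    for raw in generations:
        findings, status = extract_findings(raw)
        tally[status] += 1
        parsed.append(findings)  # a failed generation keeps its slot; it is never removed
    if len(parsed) != len(generations):
        raise RuntimeError("a generation was dropped before scoring")
    return parsed, tally
```

Printing the tally next to every metric table is the cheap part of the fix: a "failed" count far above zero means the numbers describe the parser, not the model.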

With the evaluation trustworthy I ran the systematic held-out comparison: 273 images held out from both models, a single greedy generation that matches exactly what production does, scored end to end with nothing silently discarded. V2 wins on every axis that matters.

Metric                          V1      V2
Presence macro-F1               0.646   0.735
Box IoU@0.3 rate                0.484   0.633
Box IoU@0.3 precision           0.613   0.791
Box IoU@0.5 rate                0.360   0.531
Mean IoU on matched boxes       0.614   0.700
Urgent recall, nodule / mass    3/9     4/9
Urgent recall, pneumothorax     0/1     1/1

So v2 detects better, localizes better, and catches more of the can't-miss findings, which is why it is the served model. The earlier "v2 regressed" story was entirely an artifact of the corrupted evaluation.
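For readers unfamiliar with the first row of the table, presence macro-F1 can be computed from per-image label sets as sketched below. This is a generic definition, not the project's scoring code: which labels are averaged over, and scoring a label with no support as 0.0, are assumptions.

```python
from typing import Sequence, Set


def presence_macro_f1(truth: Sequence[Set[str]], pred: Sequence[Set[str]],
                      labels: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over image-level presence or absence."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must cover the same images")
    if not labels:
        raise ValueError("at least one label is required")
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(truth, pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(truth, pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(truth, pred) if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a label that never occurs and is never predicted scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging gives a rare class the same weight as a common one, so a model cannot score well on the strength of the frequent labels alone.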

I want to be clear about the limits of these numbers. The urgent classes are scarce in the held-out data (nodule / mass N=9, pneumothorax N=1), so the urgent recall figures are directional, not statistically robust, and I would not present them as anything stronger. The held-out ground truth still carries the same-region double-labels that v2's curation deduplicated, which slightly understates v2's generic-label recall against that ground truth. And the headline remains: a 4B model on noisy, triple-annotated boxes is good enough to demonstrate the audit loop and to flag can't-miss findings for a second look, but it is frequently wrong and is research and educational only, never a diagnosis and never a substitute for a radiologist.
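To put a number on "directional, not statistically robust", a Wilson score interval shows how little nine cases can pin down. This is an illustration added here, not part of the original evaluation; the 95% level (z = 1.96) is a conventional choice.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low_v1, high_v1 = wilson_interval(3, 9)  # v1 nodule / mass recall
low_v2, high_v2 = wilson_interval(4, 9)  # v2 nodule / mass recall
```

For 4/9 the interval spans roughly 0.19 to 0.73, and the 3/9 interval overlaps it almost entirely, so the urgent-recall rows cannot separate the two models on their own. The larger-sample metrics in the table carry the comparison.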

Where this came from: a radiologist's need

This project did not start with a model. It started with a person. I built it because of a radiologist I know, Alexey Amelin (https://vk.ru/xraydiag), a pediatric radiologist, and the everyday reality of his work: under real reading volume, fatigue near the end of a long list, time pressure, and the genuinely subtle findings, a second read is quietly valuable. That is not a knock on anyone's skill; it is the shape of the workload. A quiet second pair of eyes, with the radiologist always in the loop, is the problem this tool tries to help with. That need is what the auditor is built around.

Alexey has since tried it himself. Here, in his own words, is what he thinks:

When a colleague told me about this project, the idea landed on familiar ground right away. Every radiologist knows the value of a second look. The reading volume is high, the eyes tire toward the end of a long worklist, time is short — add night shifts and even an experienced specialist may not catch a detail immediately. This is no reproach to the profession; it is its everyday reality. And some findings are genuinely subtle: the smallest pleural effusion; focal changes superimposed on dense tissue; abnormalities of the chest organs in patients with coexisting somatic disease; the early changes of a disseminating process. It is not always possible to ask a colleague for a second read, even when the need for one is obvious. Emerging radiograph-audit models are a safety net for the radiologist and the patient alike.

The concept of the tool itself resonates with me. It does not make a diagnosis; it compares the doctor's draft impression against the image and highlights the disagreements: a finding present on the image but missed or denied in the text; a claim in the text that the image does not support; and, separately, it surfaces potentially urgent findings for a second look — all marked on the image so a person can look again. The decision always stays with the doctor. That is precisely the kind of helper I would value — a calm second pair of eyes, not a replacement for the specialist.

At the same time, as a radiologist whose practice is mainly pediatric, I should be clear: for pediatric chest radiography, artificial intelligence is less validated than it is for adults. Large, high-quality, age-diverse pediatric image sets are scarce — children make up only a small share of publicly available medical imaging, and the major open chest-radiograph databases were collected from adults. Because of this, models trained mostly on adult data can show clinically meaningful age-related bias in children — for example, a noticeable rise in false-positive cardiomegaly and thymomegaly in infants. So tools that grew out of adult data should not be assumed reliable by default: they need to be independently validated and recalibrated on pediatric data before there is any talk of using them in children.

And let me underline this separately: this is a research and educational quality-assurance tool, not a medical device and not a diagnostic instrument. Imaging findings cannot be interpreted in isolation from a particular patient's clinical and laboratory picture and history. The final word always rests with a qualified radiologist.

— Alexey Amelin (https://vk.ru/xraydiag), pediatric radiologist

What I learned

  • The data constraint, not the model, was the real problem. Once I accepted that the perfect triple did not exist as open data, the synthetic-draft method unblocked everything.
  • Decomposition buys trust. Putting the only judgment in a deterministic, model-free comparator made every flag explainable, which matters more than a slightly higher end-to-end score would.
  • Trust your evaluation before you trust its verdict. A silent output-handling failure was corrupting my numbers and nearly led me to serve the worse model. Nothing in a measurement pipeline should ever discard data without saying so; the moment it does, you are scoring an artifact, not a model.
  • Use the right model for each job, not the same model for everything. Fine-tuning MedGemma hard on grounded extraction made it worse at reading free text, so the draft is now parsed by a small instruction-following text model (NVIDIA Nemotron-3 Nano 4B, on the GPU through transformers) while MedGemma keeps doing the grounding. Two narrow models each doing what they are good at beat one model stretched across two jobs.
  • Keeping the core dependency-light made the week survivable. Pure-logic modules that test without a GPU meant I could iterate on the audit rules and the schema without waiting on the model stack.
  • A tiny model is enough for this framing. I am not asking the vision model to write a report; I am asking it for a constrained label set with boxes, and I am asking a small text model to read a draft into the same labels. Both are narrow asks that small models do usefully.

Try it and read the rest

The Space runs the full loop end to end. The served model is alex-feeel/medgemma-cxr-auditor-v2, with the earlier v1 alex-feeel/medgemma-cxr-auditor kept up for comparison.

Research and educational QA only. The system described here is NOT a medical device, NOT diagnosis, and NOT for clinical use. Outputs are frequently wrong. Always consult a qualified radiologist.
