CVPR 2026 Workshop Submission

FRAC-MAS: Fracture Radiograph Analysis using
Conformal Multi-Agent System

Anonymous Authors

89.3%

Ensemble Accuracy

8-class HBFMID test set

96.4%

External Detection Rate

Roboflow dataset (stacking)

4.14 / 5

Clinical Accuracy Score

Mean rating by 3 orthopedic surgeons

68.2%

Readability ≥ 4/5

Layperson comprehensibility

Abstract

Accurate interpretation of X-ray images is essential for fracture diagnosis and management. However, this task remains challenging due to complex fracture patterns, variability in clinical interpretation, and high case volumes. We combine a stacked ensemble of four fine-tuned models (HyperColumn-CBAM DenseNet-169, MaxViT, RAD-DINO, and YOLOv26m-cls) with four agents (Knowledge, Critic, Educational, and Patient Interface), which respectively address four clinical deployment risks: hallucination, false confidence, interpretability, and accessibility.

The results indicate that the pipeline distills technical fracture data into summaries that patients can understand and act upon. The modest AUC on the external FracAtlas dataset (best AUC = 0.652) suggests that features learned from hand and wrist X-rays do not transfer robustly to other anatomical regions, such as femoral neck or vertebral fractures. Bridging this gap will require multi-anatomy training data and explicit domain adaptation strategies. Ultimately, our system, Frac-MAS, helps patients better understand their orthopedic conditions and assists in identifying fractures in resource-constrained settings, acting as a complementary diagnostic aid for medical professionals.

System Architecture

End-to-End Multi-Agent Clinical Pipeline

Figure 1. End-to-end system architecture integrating ensemble inference, Grad-CAM localization, multi-agent reasoning, and conformal prediction for verifiable clinical report generation. The Patient Interface Agent orchestrates downstream routing to the Knowledge Agent, Critic Agent, and Educational Agent.

Contributions

Three Core Technical Contributions

Ensemble Inference with Conformal Prediction

A stacked ensemble of four fine-tuned vision backbones combined via two-pass weighted soft voting. Outputs are wrapped in conformal prediction sets with a distribution-free coverage guarantee, converting point predictions into statistically rigorous differential diagnoses (92.0% empirical coverage at α = 0.10).

Multi-Agent Clinical Decision Support

A four-agent workflow (Patient Interface, Knowledge, Critic, Educational) addressing accessibility, hallucination, false confidence, and interpretability. The Critic Agent performs blind two-step post-hoc verification, flagging ambiguous cases for human review. Post-Critic accuracy improves from 89.3% to 94.6%.

External Validation & Clinical Assessment

Pipeline validated on two external datasets (FracAtlas, Roboflow Bone Break Classification) and assessed by 3 practicing orthopedic surgeons. Substantial inter-rater reliability (Fleiss' κ = 0.72 for accuracy), with 75.0% of responses rated clinically accurate or highly accurate.

Methodology

Model Ensemble Design

Fractures vary widely in appearance — from hairline cracks to full structural breaks. Our system employs an ensemble of four architectures to capture this morphological spectrum:

HyperColumn-CBAM DenseNet-169: Custom backbone aggregating multi-scale DenseNet-169 features with CBAM channel-spatial attention, preserving both low-level edge detail and high-level structural context.
MaxViT: Multi-axis attention Vision Transformer capturing long-range anatomical dependencies across the full radiograph. Strongest standalone performer at 96.2% accuracy.
RAD-DINO: Microsoft's self-supervised vision backbone pre-trained on radiology images, contributing domain-specific representations that ImageNet-pretrained backbones miss.
YOLOv26m-cls: The classification variant of the YOLO family, contributing a complementary CNN-based inductive bias with efficient feature extraction and a strong detection rate on unseen fracture types.

Predictions are combined via a two-pass weighted soft voting scheme. A second pass elevates the HyperColumn-CBAM weight for commonly confused categories (Oblique, Transverse and their displaced variants). A stacking meta-learner (logistic regression) achieves the best external detection rate of 96.4%.
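
The two-pass scheme can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Frac-MAS implementation: the model keys, the weights, the boost factor, and the rule that triggers the second pass (the first-pass top-1 class falls in the confusable group) are all hypothetical.

```python
# Illustrative two-pass weighted soft voting. All weights and the boost
# factor are hypothetical; the source does not report the actual values.
CLASSES = ["Comminuted", "Greenstick", "Healthy", "Oblique",
           "Oblique Displaced", "Spiral", "Transverse", "Transverse Displaced"]
CONFUSABLE = {"Oblique", "Oblique Displaced",
              "Transverse", "Transverse Displaced"}


def soft_vote(probs, weights):
    """Weighted average of per-model probability vectors."""
    total = sum(weights[m] for m in probs)
    return [sum(weights[m] * probs[m][i] for m in probs) / total
            for i in range(len(CLASSES))]


def argmax_label(vector):
    """Name of the class with the highest averaged probability."""
    return CLASSES[max(range(len(vector)), key=vector.__getitem__)]


def two_pass_vote(probs, weights, boost_model="hc_cbam", boost=2.0):
    """Pass 1: plain weighted vote. Pass 2 (assumed trigger): if the
    first-pass label is in the confusable group, re-vote with the
    boosted model's weight multiplied by `boost`."""
    first = soft_vote(probs, weights)
    label = argmax_label(first)
    if label not in CONFUSABLE:
        return label, first
    boosted = dict(weights)
    boosted[boost_model] = weights[boost_model] * boost
    second = soft_vote(probs, boosted)
    return argmax_label(second), second
```

Calling `two_pass_vote(probs, weights)` with `probs` mapping each model key to an 8-element probability vector returns the final label and the averaged distribution. Because both passes are convex combinations of the input vectors, the output still sums to 1.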

Table 1. Single-model ablation on the held-out 8-class test set (no augmentation).

Configuration	Accuracy	F1 (macro)
MaxViT	96.2%	0.966
HC-CBAM-DenseNet-169	93.4%	0.936
YOLOv26m-cls	93.1%	0.938
RAD-DINO	92.5%	0.931
Ensemble (Stacking) †	89.3% test	—

† Stacking also achieves 96.4% detection rate on the external Roboflow dataset (140 images, all positive; F1 = 0.982).
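
The F1 value in the footnote follows from the detection rate once the all-positive composition of the dataset is taken into account: with no negative images there can be no false positives, so precision is 1 and F1 reduces to 2R / (1 + R). The short check below illustrates this; the count of 135 detected images is inferred from 96.4% of 140 and is not stated in the source.

```python
def f1_all_positive(n_images, n_detected):
    """F1 for a binary detector evaluated on an all-positive set.

    With no negatives there are no false positives, so precision is 1
    (as long as something is detected) and F1 depends on recall alone.
    """
    recall = n_detected / n_images
    precision = 1.0 if n_detected > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 135 is inferred from 96.4% of 140; it is not reported in the source.
recall = 135 / 140
f1 = f1_all_positive(140, 135)
print(f"recall={recall:.3f}, F1={f1:.3f}")  # prints "recall=0.964, F1=0.982"
```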

Frac-MAS — Multi-Agent Clinical Decision Support System

Four-Agent Workflow

Every case is routed through a structured four-agent pipeline with clear handoffs, each targeting a specific clinical deployment risk.

Patient Interface Agent

Entry point & orchestrator

Provides the chat interface, personalises responses based on patient background, and uses a LangGraph workflow to decide whether to answer immediately or invoke downstream agents for evidence gathering and verification.

Knowledge Agent

Hallucination mitigation via RAG

Queries a curated ChromaDB vector store with ICD-10 codes, severity ratings, treatment guidelines, and indexed AO/OTA, Radiopaedia, and FDA AI/ML references. Semantic search via sentence-transformer embeddings; generates only from retrieved material.

Critic Agent

Blind verification & triage

Two-step blind-first protocol: (1) independent assessment of the radiograph, then (2) evaluate the ensemble label. Flags cases for human review when it rejects the ensemble, disagrees with confidence ≥ 0.6, or detects statistical ambiguity. Improves confirmed accuracy from 89.3% → 94.6%.
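
The three flagging conditions can be written as a single predicate. The sketch below is illustrative only: the `CriticView` fields and the function name are hypothetical, and treating "statistical ambiguity" as a conformal prediction set with more than one element is an assumption, since the source does not say how ambiguity is detected. Only the 0.6 confidence threshold comes from the text.

```python
from dataclasses import dataclass

DISAGREE_CONF = 0.6  # disagreement threshold stated in the text


@dataclass
class CriticView:
    blind_label: str         # step 1: label assigned without seeing the ensemble
    blind_confidence: float  # Critic's confidence in its own blind label
    accepts_ensemble: bool   # step 2: verdict after seeing the ensemble label


def needs_human_review(view, ensemble_label, prediction_set):
    """Return True if any of the three flagging conditions holds."""
    rejected = not view.accepts_ensemble
    confident_disagreement = (
        view.blind_label != ensemble_label
        and view.blind_confidence >= DISAGREE_CONF
    )
    # Assumption: ambiguity is approximated by a non-singleton conformal set.
    ambiguous = len(prediction_set) > 1
    return rejected or confident_disagreement or ambiguous
```

Under this sketch a case is auto-confirmed only when the Critic accepts the ensemble label, has no confident blind disagreement, and the prediction set is a singleton. Everything else goes to the uncertain cohort rather than being rejected outright.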

Educational Agent

Patient-friendly translation

Uses the Grad-CAM heatmap and verified classification to generate lay summaries with severity and next-steps guidance via Gemini 2.5 Pro. Falls back to template-based generation when API is unavailable. Outputs validated by orthopedic clinicians.

Results & Analysis

Quantitative Evaluation

Grad-CAM Attention Analysis

Grad-CAM heatmaps comparison for Comminuted fracture

Figure 2. Grad-CAM heatmaps for a Comminuted fracture. MaxViT (centre) attends to 2.4% of the image, precisely targeting the cortical break. HyperColumn-CBAM DenseNet-169 (right) activates 59.8%, capturing the surrounding bone and tissue context — a 25× difference in spatial coverage reflecting complementary diagnostic strategies.

Sample: Oblique Displaced Fracture

Original X-ray: Oblique Displaced fracture

Grad-CAM heatmap: Oblique Displaced fracture, DenseNet-169

Figure 3. Left: Raw X-ray of an Oblique Displaced fracture. Right: HyperColumn-CBAM DenseNet-169 Grad-CAM activation map highlighting the fracture region and surrounding cortical context.

Conformal Coverage & Confusion Matrix

Per-class accuracy and conformal coverage

Figure 4. Per-class accuracy vs. conformal coverage at α = 0.05 and α = 0.10. The largest coverage gain over top-1 accuracy is for Oblique (70.6% → 82.4% at α = 0.10), followed by Transverse Displaced (88.2% → 94.1%); Transverse stays at 70.6% at both levels.

Ensemble confusion matrix on 8-class test set

Figure 5. Ensemble confusion matrix on the 112-sample held-out test set. Misclassifications concentrate among morphologically adjacent categories: Oblique ↔ Transverse.

External Validation — FracAtlas ROC Curves

ROC curves on FracAtlas external dataset

Figure 6. ROC curves for base models, ensemble strategies, and triplet configurations evaluated on the external FracAtlas dataset (200 balanced samples). Optimal decision thresholds via Youden's J statistic are shown as solid markers. Best AUC = 0.652, reflecting domain shift from hand/wrist to full-body musculoskeletal radiographs.
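
For readers unfamiliar with Youden's J, the statistic is J = TPR − FPR, and the optimal threshold is the one that maximises it. A minimal sketch follows; the function name is hypothetical and any scores passed to it are toy values, not FracAtlas outputs.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximising J = TPR - FPR.

    A sample is predicted positive when score >= threshold.
    labels: 1 = fracture, 0 = no fracture.
    Ties are resolved in favour of the lowest threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are required to compute TPR and FPR")
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

J ranges from 0 for a classifier on the chance diagonal to 1 for perfect separation, so the marker on each ROC curve sits at the point farthest above the diagonal.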

Critic Agent Operational Summary

Table 2. Critic Agent operational summary on the 112-sample test set (hybrid protocol, t̂ = 0.529).

Metric	Value
Raw ensemble accuracy	89.3%
Confirmed accuracy (post-Critic)	94.6%
Safety margin (confirmed vs. raw)	+5.3 pp
Auto-confirmed / total	92 / 112 (82%)
Flagged uncertain / total	20 / 112 (18%)

The Critic concentrates a high-confidence confirmed set (82%) and isolates a compact uncertain cohort (18%) for clinician review, without issuing hard automatic rejections.

Human Validation

Clinician Reader Study

Three practicing orthopedic surgeons evaluated system outputs using the Frac-MAS grading protocol. Each rater first provided a blind diagnosis, then reviewed the AI output and scored it on a 5-point Likert scale.

Table 3. Clinician ratings of educational outputs (1–5 scale) and inter-rater agreement across three orthopedic raters.

Dimension	Mean Score	Score ≥ 4 (%)	Fleiss' κ
Technical Accuracy	4.14	75.0%	0.72
Comprehensibility	3.95	68.2%	0.65

The system demonstrates strong alignment with clinical standards. Fleiss' Kappa indicates substantial inter-rater reliability for technical accuracy (κ = 0.72) and moderate-to-substantial agreement for comprehensibility (κ = 0.65).
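
As a reference for how Fleiss' κ is computed, the sketch below implements the standard formula from a table of per-item category counts. It is illustrative only: the source does not describe how the Likert scores were mapped to categories before computing κ, so the input layout here is an assumption.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a fixed number of raters per item.

    counts[i][j] = number of raters who assigned item i to category j.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1)) for row in counts]
    p_bar = sum(p_item) / n_items      # observed agreement
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    if p_exp == 1.0:
        # Degenerate case: every rating is in one category, kappa is
        # undefined. Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_bar - p_exp) / (1 - p_exp)
```

With three raters, each row of `counts` sums to 3. For example, a row `[0, 0, 0, 2, 1]` means two raters gave a score of 4 and one gave a score of 5.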

A perfect score of 5 was awarded in 43.2% of accuracy evaluations. Edge-case failures surfaced on complex displaced fractures (proximal radius shaft, PIP joint dislocation), reinforcing the necessity of a human-in-the-loop for definitive treatment planning.

Statistical Rigor

Conformal Prediction & Differential Diagnosis

Rather than a single hard classification, the system wraps ensemble outputs in conformal prediction sets with a distribution-free coverage guarantee using split conformal calibration on the validation set.

At α = 0.10, the procedure achieves 92.0% empirical coverage with an average set size of 1.07. Of 112 predictions, 104 (92.9%) are singletons and only 8 are two-element differentials.

This maps naturally to a clinical differential: a set of {Oblique, Oblique Displaced} at 90% coverage is more actionable than a potentially wrong single prediction, directly communicating model uncertainty to clinicians.
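
A minimal sketch of split conformal calibration for classification is given below. It assumes the common nonconformity score 1 − p(true class); the source does not state which score Frac-MAS uses, and the function names are hypothetical.

```python
import math


def conformal_quantile(cal_probs, cal_labels, alpha):
    """Calibrate on held-out data with score = 1 - p(true class).

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)),
    the finite-sample corrected rank. If k exceeds n, the threshold is
    infinite and every class is included in every set.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[k - 1] if k <= n else float("inf")


def prediction_set(probs, q_hat):
    """Indices of all classes whose score 1 - p does not exceed q_hat."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= q_hat]


def summarise(test_probs, test_labels, q_hat):
    """Empirical coverage and average set size on a labelled test set."""
    sets = [prediction_set(p, q_hat) for p in test_probs]
    coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
    avg_size = sum(len(s) for s in sets) / len(sets)
    return coverage, avg_size
```

The reported average set size is consistent with the reported counts: 104 singletons and 8 two-element sets give (104 + 2 × 8) / 112 ≈ 1.07.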

Table 4. Per-class accuracy and conformal coverage on the test set. Coverage failures concentrate among morphologically similar types.

Class	N	Acc.	Cov. α=0.05	Cov. α=0.10
Comminuted	17	100%	100%	100%
Greenstick	13	100%	100%	100%
Healthy	10	100%	100%	100%
Oblique	17	70.6%	76.5%	82.4%
Oblique Disp.	9	100%	100%	100%
Spiral	12	100%	100%	100%
Transverse	17	70.6%	70.6%	70.6%
Transverse Disp.	17	88.2%	88.2%	94.1%
Overall	112	89.3%	90.2%	92.0%
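
The Overall row can be reproduced from the per-class rows as a sample-weighted average. In the check below, the per-class hit counts are recovered by rounding percentage × N; those counts are inferred and are not reported in the source.

```python
# (class, N, accuracy %, coverage % at alpha=0.05, coverage % at alpha=0.10),
# copied from Table 4.
ROWS = [
    ("Comminuted", 17, 100.0, 100.0, 100.0),
    ("Greenstick", 13, 100.0, 100.0, 100.0),
    ("Healthy", 10, 100.0, 100.0, 100.0),
    ("Oblique", 17, 70.6, 76.5, 82.4),
    ("Oblique Disp.", 9, 100.0, 100.0, 100.0),
    ("Spiral", 12, 100.0, 100.0, 100.0),
    ("Transverse", 17, 70.6, 70.6, 70.6),
    ("Transverse Disp.", 17, 88.2, 88.2, 94.1),
]


def overall(rows, col):
    """Sample-weighted percentage for one column of the table.

    Per-class hit counts are recovered by rounding pct * N; they are
    inferred here, not taken from the source.
    """
    total = sum(r[1] for r in rows)
    hits = sum(round(r[col] / 100 * r[1]) for r in rows)
    return hits, total, 100 * hits / total


for name, col in [("accuracy", 2), ("coverage a=0.05", 3), ("coverage a=0.10", 4)]:
    hits, total, pct = overall(ROWS, col)
    print(f"{name}: {hits}/{total} = {pct:.1f}%")
# prints:
# accuracy: 100/112 = 89.3%
# coverage a=0.05: 101/112 = 90.2%
# coverage a=0.10: 103/112 = 92.0%
```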

Resources

Code, Demo & Protocol

Live Demo

Interactive fracture diagnosis interface

Code Repository

View anonymous source code

Clinician Rubric

Frac-MAS evaluation protocol (PDF + web)
