89.3%
Ensemble Accuracy
8-class HBFMID test set
96.4%
External Detection Rate
Roboflow dataset (stacking)
4.14 / 5
Clinical Accuracy Score
Mean rating by 3 orthopedic surgeons
68.2%
Readability ≥ 4/5
Layperson comprehensibility
Abstract
Accurate interpretation of X-ray images is essential for fracture diagnosis and management. However, this task remains challenging due to complex fracture patterns, variability in clinical interpretation, and high case volumes. We trained a stacked ensemble of four fine-tuned models (HyperColumn-CBAM DenseNet-169, MaxViT, RAD-DINO, and YOLOv26m-cls) and built four agents around it: Knowledge, Critic, Educational, and Patient Interface. Each agent addresses a distinct clinical deployment risk: hallucination, false confidence, interpretability, and accessibility, respectively.
The results indicate that the pipeline can distill technical fracture data into summaries that patients can understand and act upon. The modest AUC on the external FracAtlas dataset (best 0.652) suggests that features learned from hand and wrist X-rays do not transfer robustly to other anatomy, such as femoral neck or vertebral fractures. Bridging this gap will require multi-anatomy training data and explicit domain adaptation strategies. Our system, Frac-MAS, helps patients better understand their orthopedic conditions and assists in identifying fractures in resource-constrained settings, acting as a complementary diagnostic aid for medical professionals.
System Architecture
End-to-End Multi-Agent Clinical Pipeline

Figure 1. End-to-end system architecture integrating ensemble inference, Grad-CAM localization, multi-agent reasoning, and conformal prediction for verifiable clinical report generation. The Patient Interface Agent orchestrates downstream routing to the Knowledge Agent, Critic Agent, and Educational Agent.
Contributions
Three Core Technical Contributions
01
Ensemble Inference with Conformal Prediction
A stacked ensemble of four fine-tuned vision backbones combined via two-pass weighted soft voting. Outputs are wrapped in conformal prediction sets with a distribution-free coverage guarantee, converting point predictions into statistically rigorous differential diagnoses (92.0% empirical coverage at α = 0.10).
02
Multi-Agent Clinical Decision Support
A four-agent workflow (Patient Interface, Knowledge, Critic, Educational) addressing accessibility, hallucination, false confidence, and interpretability. The Critic Agent performs blind two-step post-hoc verification, flagging ambiguous cases for human review. Post-Critic accuracy improves from 89.3% to 94.6%.
03
External Validation & Clinical Assessment
Pipeline validated on two external datasets (FracAtlas, Roboflow Bone Break Classification) and assessed by 3 practicing orthopedic surgeons. Substantial inter-rater reliability (Fleiss' κ = 0.72 for accuracy), with 75.0% of responses rated clinically accurate or highly accurate.
Methodology
Model Ensemble Design
Fractures vary widely in appearance — from hairline cracks to full structural breaks. Our system employs an ensemble of four architectures to capture this morphological spectrum:
- HyperColumn-CBAM DenseNet-169: Custom backbone aggregating multi-scale DenseNet-169 features with CBAM channel-spatial attention, preserving both low-level edge detail and high-level structural context.
- MaxViT: Multi-axis attention Vision Transformer capturing long-range anatomical dependencies across the full radiograph. Strongest standalone performer at 96.2% accuracy.
- RAD-DINO: Microsoft's self-supervised vision backbone pre-trained on radiology images, contributing domain-specific representations that ImageNet-pretrained backbones miss.
- YOLOv26m-cls: Classification variant that provides a complementary CNN-based inductive bias, efficient feature extraction, and a strong detection rate on unseen fracture types.
Predictions are combined via a two-pass weighted soft voting scheme. A second pass elevates the HyperColumn-CBAM weight for commonly confused categories (Oblique, Transverse and their displaced variants). A stacking meta-learner (logistic regression) achieves the best external detection rate of 96.4%.
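The two-pass scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the base weights, the boost factor `HC_CBAM_BOOST`, the model keys, and the rule that the second pass is triggered when the first-pass winner is a confusable class are all assumptions made for the example.

```python
# Class names follow Table 4; everything else here is hypothetical.
CLASSES = [
    "Comminuted", "Greenstick", "Healthy", "Oblique",
    "Oblique Displaced", "Spiral", "Transverse", "Transverse Displaced",
]
CONFUSABLE = {"Oblique", "Oblique Displaced", "Transverse", "Transverse Displaced"}

BASE_WEIGHTS = {"hc_cbam": 1.0, "maxvit": 1.0, "rad_dino": 1.0, "yolo": 1.0}  # assumed
HC_CBAM_BOOST = 2.0  # assumed multiplier for the second pass


def weighted_average(probs, weights):
    """Weighted mean of per-model class probabilities (soft voting)."""
    total = sum(weights[name] for name in probs)
    n_classes = len(next(iter(probs.values())))
    return [
        sum(weights[name] * probs[name][k] for name in probs) / total
        for k in range(n_classes)
    ]


def two_pass_vote(probs):
    """Return (label, fused probabilities) after at most two voting passes."""
    # Pass 1: fuse all models with the base weights.
    first = weighted_average(probs, BASE_WEIGHTS)
    top = CLASSES[first.index(max(first))]
    if top not in CONFUSABLE:
        return top, first
    # Pass 2: the first-pass winner is a commonly confused class,
    # so re-vote with a larger HyperColumn-CBAM weight.
    boosted = dict(BASE_WEIGHTS, hc_cbam=BASE_WEIGHTS["hc_cbam"] * HC_CBAM_BOOST)
    second = weighted_average(probs, boosted)
    return CLASSES[second.index(max(second))], second
```

`probs` maps each model key to a list of eight class probabilities in `CLASSES` order. In practice the weights would be tuned on validation data rather than fixed by hand.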
Table 1. Single-model ablation on the held-out 8-class test set (no augmentation).
| Configuration | Accuracy | F1 (macro) |
|---|---|---|
| MaxViT | 96.2% | 0.966 |
| HC-CBAM-DenseNet-169 | 93.4% | 0.936 |
| YOLOv26m-cls | 93.1% | 0.938 |
| RAD-DINO | 92.5% | 0.931 |
| Ensemble (Stacking) † | 89.3% test | — |
† Stacking also achieves 96.4% detection rate on the external Roboflow dataset (140 images, all positive; F1 = 0.982).
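The F1 value in the footnote follows from the detection rate. Because every image in the external set is positive, there can be no false positives, so precision is 1 and the detection rate equals recall $r$. A short worked check, using the rounded rate of 0.964:

```latex
\text{precision} = \frac{TP}{TP + FP} = 1, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot r}{\text{precision} + r}
    = \frac{2r}{1 + r}
    = \frac{2 \times 0.964}{1.964} \approx 0.982
```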
Frac-MAS — Multi-Agent Clinical Decision Support System
Four-Agent Workflow
Every case is routed through a structured four-agent pipeline with clear handoffs, each targeting a specific clinical deployment risk.
Patient Interface Agent
Entry point & orchestrator
Provides the chat interface, personalises responses based on patient background, and uses a LangGraph workflow to decide whether to answer immediately or invoke downstream agents for evidence gathering and verification.
Knowledge Agent
Hallucination mitigation via RAG
Queries a curated ChromaDB vector store with ICD-10 codes, severity ratings, treatment guidelines, and indexed AO/OTA, Radiopaedia, and FDA AI/ML references. Semantic search via sentence-transformer embeddings; generates only from retrieved material.
Critic Agent
Blind verification & triage
Two-step blind-first protocol: (1) independently assess the radiograph, then (2) evaluate the ensemble label. Flags a case for human review when it rejects the ensemble label, disagrees with confidence ≥ 0.6, or detects statistical ambiguity. Improves confirmed accuracy from 89.3% to 94.6%.
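The three flagging conditions can be written as a single predicate. The sketch below is illustrative only: the `CriticVerdict` fields and function name are hypothetical, and "statistical ambiguity" is interpreted here as a conformal prediction set with more than one class, which is an assumption rather than the documented rule.

```python
from dataclasses import dataclass

DISAGREEMENT_CONFIDENCE = 0.6  # threshold stated in the text


@dataclass
class CriticVerdict:
    """Hypothetical container for the Critic's two-step output."""
    blind_label: str         # step 1: label assigned without seeing the ensemble
    blind_confidence: float  # Critic's confidence in its blind label, in [0, 1]
    accepts_ensemble: bool   # step 2: does the Critic accept the ensemble label?


def needs_human_review(verdict, ensemble_label, prediction_set):
    """Return True if any of the three flagging conditions holds."""
    rejects = not verdict.accepts_ensemble
    confident_disagreement = (
        verdict.blind_label != ensemble_label
        and verdict.blind_confidence >= DISAGREEMENT_CONFIDENCE
    )
    # Assumption: ambiguity means the conformal set has more than one class.
    ambiguous = len(prediction_set) > 1
    return rejects or confident_disagreement or ambiguous
```

Cases that pass all three checks are auto-confirmed; the rest form the uncertain cohort routed to a clinician.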
Educational Agent
Patient-friendly translation
Uses the Grad-CAM heatmap and verified classification to generate lay summaries with severity and next-steps guidance via Gemini 2.5 Pro. Falls back to template-based generation when API is unavailable. Outputs validated by orthopedic clinicians.
Results & Analysis
Quantitative Evaluation
Grad-CAM Attention Analysis

Figure 2. Grad-CAM heatmaps for a Comminuted fracture. MaxViT (centre) attends to 2.4% of the image, precisely targeting the cortical break. HyperColumn-CBAM DenseNet-169 (right) activates 59.8%, capturing the surrounding bone and tissue context — a 25× difference in spatial coverage reflecting complementary diagnostic strategies.
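One way to obtain coverage percentages like those in Figure 2 is to count heatmap pixels above a cut-off. The sketch below assumes a Grad-CAM map normalised to [0, 1] and a threshold of 0.5; the source does not state how the 2.4% and 59.8% figures were computed, so both choices are assumptions.

```python
def activated_fraction(heatmap, threshold=0.5):
    """Fraction of pixels whose normalised activation is >= threshold.

    heatmap: 2D list (rows of floats in [0, 1]).
    threshold: assumed cut-off; not specified in the source.
    """
    total = sum(len(row) for row in heatmap)
    if total == 0:
        raise ValueError("heatmap is empty")
    active = sum(1 for row in heatmap for value in row if value >= threshold)
    return active / total


# Ratio of the two coverage figures reported in the caption.
coverage_ratio = 59.8 / 2.4  # about 24.9, reported as roughly 25x
```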
Sample: Oblique Displaced Fracture


Figure 3. Left: Raw X-ray of an Oblique Displaced fracture. Right: HyperColumn-CBAM DenseNet-169 Grad-CAM activation map highlighting the fracture region and surrounding cortical context.
Conformal Coverage & Confusion Matrix

Figure 4. Per-class accuracy vs. conformal coverage at α = 0.05 and α = 0.10. The largest gain over point accuracy is for Oblique (70.6% accuracy to 82.4% coverage at α = 0.10); Transverse stays at 70.6% at both levels.

Figure 5. Ensemble confusion matrix on the 112-sample held-out test set. Misclassifications concentrate among morphologically adjacent categories: Oblique ↔ Transverse.
External Validation — FracAtlas ROC Curves

Figure 6. ROC curves for base models, ensemble strategies, and triplet configurations evaluated on the external FracAtlas dataset (200 balanced samples). Optimal decision thresholds via Youden's J statistic are shown as solid markers. Best AUC = 0.652, reflecting domain shift from hand/wrist to full-body musculoskeletal radiographs.
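Youden's J statistic is J = TPR - FPR, and the optimal threshold is the one that maximises it. A minimal pure-Python sketch follows; the function name and the convention that a score at or above the threshold counts as a positive prediction are assumptions of this example.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = TPR - FPR.

    y_true: list of 0/1 labels (1 = fracture).
    scores: list of predicted fracture scores, same length.
    A sample is predicted positive when score >= threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_threshold, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / positives - fp / negatives
        if j > best_j:  # ties keep the lowest threshold
            best_threshold, best_j = t, j
    return best_threshold, best_j
```

With a modest AUC such as the one reported here, the maximum J is necessarily small, so the chosen operating point trades sensitivity against specificity with little margin.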
Critic Agent Operational Summary

Table 2. Critic Agent operational summary on the 112-sample test set (hybrid protocol, t̂ = 0.529).
| Metric | Value |
|---|---|
| Raw ensemble accuracy | 89.3% |
| Confirmed accuracy (post-Critic) | 94.6% |
| Safety margin (confirmed vs. raw) | +5.3 pp |
| Auto-confirmed / total | 92 / 112 (82%) |
| Flagged uncertain / total | 20 / 112 (18%) |
The Critic concentrates a high-confidence confirmed set (82%) and isolates a compact uncertain cohort (18%) for clinician review, without issuing hard automatic rejections.
Human Validation
Clinician Reader Study
Three practicing orthopedic surgeons evaluated system outputs using the Frac-MAS grading protocol. Each rater first provided a blind diagnosis, then reviewed the AI output and scored it on a 5-point Likert scale.
Table 3. Clinician ratings of educational outputs (1–5 scale) and inter-rater agreement across three orthopedic raters.
| Dimension | Mean Score | Score ≥ 4 (%) | Fleiss' κ |
|---|---|---|---|
| Technical Accuracy | 4.14 | 75.0% | 0.72 |
| Comprehensibility | 3.95 | 68.2% | 0.65 |
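With mean ratings of 4.14 for technical accuracy and 3.95 for comprehensibility, the outputs align well with the surgeons' clinical judgment. On the conventional Landis and Koch scale, both Fleiss' κ values fall in the substantial band (0.61 to 0.80): κ = 0.72 for technical accuracy and κ = 0.65 for comprehensibility, the latter near the lower edge of that band.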
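For reference, Fleiss' κ can be computed from a table of per-item rating counts. The sketch below uses the standard formula; the helper names are illustrative, and whether the study computed κ on the full 5-point scale or on binarised scores is not stated, so the 5-category default is an assumption.

```python
def ratings_to_counts(ratings, categories=(1, 2, 3, 4, 5)):
    """Convert per-item rater scores into per-item category counts."""
    return [[list(item).count(c) for c in categories] for item in ratings]


def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[i][j] = raters giving item i category j.

    Every item must be rated by the same number of raters (n >= 2).
    Undefined (division by zero) when all ratings fall in one category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])
    # Proportion of all ratings that fall in each category.
    p = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    expected = sum(x * x for x in p)  # agreement expected by chance
    return (observed - expected) / (1 - expected)
```

Usage: `fleiss_kappa(ratings_to_counts(scores))`, where `scores` is a list with one tuple of three surgeon ratings per evaluated case.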
A perfect score of 5 was awarded in 43.2% of accuracy evaluations. Edge-case failures surfaced on complex displaced fractures (proximal radius shaft, PIP joint dislocation), reinforcing the necessity of a human-in-the-loop for definitive treatment planning.
Statistical Rigor
Conformal Prediction & Differential Diagnosis
Rather than emitting a single hard classification, the system wraps ensemble outputs in conformal prediction sets with a distribution-free coverage guarantee. The sets are calibrated by split conformal prediction on the validation set.
At α = 0.10, the procedure achieves 92.0% empirical coverage with an average set size of 1.07. Of 112 predictions, 104 (92.9%) are singletons and only 8 are two-element differentials.
This maps naturally to a clinical differential: a set of {Oblique, Oblique Displaced} at 90% coverage is more actionable than a potentially wrong single prediction, directly communicating model uncertainty to clinicians.
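A minimal sketch of split conformal calibration is shown below. It assumes the nonconformity score is one minus the ensemble probability of the true class, which is a common choice but is not confirmed by the source; all function names are illustrative.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.10):
    """Calibrate the score threshold on a held-out calibration set.

    cal_probs: list of per-class probability lists.
    cal_labels: list of true class indices.
    Score (assumed): 1 - probability assigned to the true class.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: every class is included
    return scores[rank - 1]


def prediction_set(probs, qhat, classes):
    """All classes whose score 1 - p does not exceed the threshold."""
    return [c for c, p in zip(classes, probs) if 1.0 - p <= qhat]


def average_set_size(sets):
    """Mean number of classes per prediction set."""
    return sum(len(s) for s in sets) / len(sets)
```

As a check against the reported figures, 104 singletons and 8 two-element sets give an average size of (104 + 16) / 112 ≈ 1.07.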
Table 4. Per-class accuracy and conformal coverage on the test set. Coverage failures concentrate among morphologically similar types.
| Class | N | Acc. | Cov. α=0.05 | Cov. α=0.10 |
|---|---|---|---|---|
| Comminuted | 17 | 100% | 100% | 100% |
| Greenstick | 13 | 100% | 100% | 100% |
| Healthy | 10 | 100% | 100% | 100% |
| Oblique | 17 | 70.6% | 76.5% | 82.4% |
| Oblique Disp. | 9 | 100% | 100% | 100% |
| Spiral | 12 | 100% | 100% | 100% |
| Transverse | 17 | 70.6% | 70.6% | 70.6% |
| Transverse Disp. | 17 | 88.2% | 88.2% | 94.1% |
| Overall | 112 | 89.3% | 90.2% | 92.0% |