NCT07500428
Construction of a Benchmark for Breast Ultrasound AI Interpretation and Performance Evaluation of Multimodal AI Models
Construction of a Standardized Benchmark Evaluation System for Intelligent Breast Ultrasound Image Interpretation and Systematic Performance Assessment of Multimodal Artificial Intelligence Models Based on ACR BI-RADS v2025 Criteria
- Status
- Recruiting
- Phase
- —
- Study type
- Observational
- Enrollment
- 1,380 (estimated)
- Sponsor
- Peking Union Medical College Hospital · Academic / Other
- Sex
- Female
- Age
- 18 Years – 75 Years
- Healthy volunteers
- Accepted
Summary
This single-center, retrospective, observational study aims to construct a standardized benchmark evaluation system for intelligent breast ultrasound image interpretation and to systematically assess the diagnostic performance of current mainstream multimodal artificial intelligence (AI) models. De-identified B-mode breast ultrasound images with confirmed pathological diagnoses will be retrospectively collected from the institutional archive (2018-2025) and supplemented with images from published open-access datasets. Expert radiologists with varying experience levels will independently annotate all images according to the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS) v2025 criteria, including glandular tissue composition, lesion characterization (mass vs. non-mass lesion), morphological descriptors, and final BI-RADS classification. Baseline deep learning models (CNN-based ResNet-50 and Transformer-based USFM) will be trained to establish performance baselines and to stratify cases by diagnostic difficulty through cross-architecture consensus. Multiple multimodal large language models (MLLMs), including both general-purpose and medical-domain models, will then be evaluated via standardized API calls using BI-RADS-guided chain-of-thought prompts at temperature 0 for reproducibility. Primary endpoints include BI-RADS classification accuracy and diagnostic AUC for benign-malignant differentiation. Model robustness and safety will be assessed through out-of-distribution rejection testing, temperature-stability experiments, and thinking-mode ablation studies. This study adheres to the FLAIR and TRIPOD-LLM reporting guidelines.
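The evaluation protocol above (standardized API calls, BI-RADS-guided chain-of-thought prompts, temperature 0 for reproducibility) can be sketched as a deterministic request builder. This is a minimal illustration, not the study's actual configuration: the OpenAI-style chat payload shape, the `example-mllm` model name, and the prompt wording are all assumptions.

```python
import base64

# Illustrative BI-RADS-guided chain-of-thought prompt; the study's real
# prompt (developed on a dedicated 30-image set) is not published here.
BIRADS_COT_PROMPT = (
    "You are a breast ultrasound reader following ACR BI-RADS v2025. "
    "Reason step by step: (1) glandular tissue composition, (2) mass vs. "
    "non-mass lesion, (3) morphological descriptors, (4) final BI-RADS "
    "category. Answer with a JSON object: {\"category\": ..., \"reasoning\": ...}."
)

def build_request(image_bytes: bytes, model: str = "example-mllm") -> dict:
    """Build a deterministic evaluation request (temperature 0)."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,           # hypothetical model identifier
        "temperature": 0,         # reproducibility requirement from the protocol
        "messages": [
            {"role": "system", "content": BIRADS_COT_PROMPT},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text",
                 "text": "Classify this B-mode breast ultrasound image."},
            ]},
        ],
    }
```

Because every field of the payload is fixed, two calls on the same de-identified image are byte-identical requests, which is what makes the fixed-snapshot, fully logged evaluation repeatable.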
Detailed description
Background: Breast cancer is the most prevalent malignancy among women worldwide. Ultrasound is a first-line screening modality, particularly in Asian populations with dense breast tissue, where mammographic sensitivity is limited. However, ultrasound interpretation is highly operator-dependent, with substantial inter-observer variability in BI-RADS classification, especially for category 4A-4B lesions. Multimodal large language models (MLLMs) have emerged as a promising tool for medical image analysis due to their zero-shot diagnostic capability, interpretable chain-of-thought reasoning, and structured report generation. Nevertheless, there is currently no standardized benchmark for evaluating AI performance in breast ultrasound interpretation.

Study Design: Approximately 1,380 breast ultrasound images will be curated (1,200 evaluation set + 150 out-of-distribution safety test set + 30 prompt development set), encompassing three diagnostic categories: normal breast, benign lesions (BI-RADS 2-4B), and malignant lesions (BI-RADS 3-5). Two junior radiologists (<5 years of experience) and two senior radiologists (>15 years) will independently annotate images per ACR BI-RADS v2025, with arbitration by a fifth expert for discordant cases. Diagnostic difficulty will be stratified into three tiers using cross-architecture deep learning consensus: Tier 1 (straightforward; both models correct), Tier 2 (equivocal; one correct, one incorrect), and Tier 3 (difficult; both incorrect, with senior expert validation). MLLMs will be evaluated across multiple dimensions: classification accuracy, sensitivity, specificity, F1 score, AUC, Cohen's kappa agreement with expert consensus, expected calibration error (ECE), morphological feature description accuracy, and chain-of-thought reasoning quality.
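Two pieces of the design above are simple enough to sketch directly: the cross-architecture consensus rule that maps the two baseline models' correctness to a difficulty tier, and the expected calibration error (ECE) endpoint. Both functions below are illustrative implementations under standard definitions; the equal-width binning for ECE is an assumption, since the protocol does not specify a binning scheme.

```python
def stratify_tier(resnet_correct: bool, usfm_correct: bool) -> int:
    """Map cross-architecture consensus to the protocol's difficulty tiers."""
    n_correct = int(resnet_correct) + int(usfm_correct)
    if n_correct == 2:
        return 1  # Tier 1, straightforward: both baselines correct
    if n_correct == 1:
        return 2  # Tier 2, equivocal: the architectures disagree
    return 3      # Tier 3, difficult: both incorrect (senior expert validation)

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE under equal-width confidence binning (binning scheme assumed)."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into top bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model (e.g. always correct at confidence 1.0) yields an ECE of 0; a model that is 80% confident but only 50% correct contributes the 0.3 gap, weighted by bin size.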
Safety Assessment: (1) Out-of-distribution rejection test using 150 non-diagnostic images (degraded images, non-breast ultrasound, other imaging modalities); (2) Temperature-stability pre-experiment across parameter settings; (3) Thinking-mode ablation comparing standard vs. chain-of-thought reasoning modes. All experiments use fixed model snapshots, system fingerprint monitoring, and complete logging for reproducibility.
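The out-of-distribution rejection test in (1) reduces to a simple metric over the 150 non-diagnostic images: the fraction the model correctly declines to grade. The scoring sketch below assumes a structured response schema in which a rejection is either an absent BI-RADS category or an explicit `non_diagnostic` flag; both field names are hypothetical.

```python
def rejection_rate(responses: list[dict]) -> float:
    """Fraction of non-diagnostic inputs the model correctly refuses to grade.

    A response counts as a rejection when it assigns no BI-RADS category
    (``category`` is None) or explicitly flags the image as non-diagnostic.
    The response schema is an assumption, not the study's actual format.
    """
    if not responses:
        return 0.0
    rejected = sum(
        1 for r in responses
        if r.get("category") is None or r.get("non_diagnostic", False)
    )
    return rejected / len(responses)
```

On the safety set, a higher rejection rate is better: every degraded, non-breast, or wrong-modality image that still receives a BI-RADS category is a safety failure.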
Conditions
Interventions
| Type | Name | Description |
|---|---|---|
| DIAGNOSTIC_TEST | Multimodal AI Model Diagnostic Evaluation | Retrospective evaluation of de-identified breast ultrasound images by multiple AI systems, including baseline deep learning models (ResNet-50, USFM) and multimodal large language models, using standardized BI-RADS-guided chain-of-thought prompts via API. No patient contact or clinical decision-making is involved. |
Timeline
- Start date
- 2026-03-12
- Primary completion
- 2026-12-01
- Completion
- 2027-03-01
- First posted
- 2026-03-30
- Last updated
- 2026-03-30
Locations
1 site across 1 country: China
Source: ClinicalTrials.gov record NCT07500428. Inclusion in this directory is not an endorsement.