NCT07352475
Reasoning Enhancement With Feedback From AI in NEphrology Trial
Reasoning Enhancement With Feedback From a Generative AI in Nephrology (REFINe): A Randomized Evaluation of Generative AI Support in Nephrology Diagnosis
- Status
- Recruiting
- Phase
- N/A
- Study type
- Interventional
- Enrollment
- 100 (estimated)
- Sponsor
- University Hospital, Lille · Academic / Other
- Sex
- All
- Age
- 18 Years and older
- Healthy volunteers
- Accepted
Summary
The goal of this clinical trial is to learn how artificial intelligence (AI) may help doctors make diagnoses in kidney medicine. The researchers want to know whether an AI tool called a large language model (LLM) can help doctors choose the correct diagnosis more often and feel more confident in their answers. Before starting the study, the research team tested several AI models and chose one of the best performers, a GPT-5-class model set to use high reasoning effort.

The main questions this study aims to answer are:

1. Do doctors make more correct diagnoses when they can see AI suggestions?
2. Does seeing AI suggestions change how confident doctors feel about their diagnosis?

Researchers will compare doctors who receive AI suggestions with doctors who do not, to see how the AI affects accuracy, confidence, and decision-making. Participants will complete up to 10 online clinical cases. For each case, they will:

1. Read a short medical scenario
2. Suggest up to three possible diagnoses
3. (If in the AI group) Review the AI's suggestions and decide whether to change their answer

The study will also look at how long participants take to answer each case and how the AI's performance compares to the human answers.
Detailed description
This study evaluates whether providing clinicians with real-time diagnostic suggestions from a high-reasoning large language model (GPT-5) improves diagnostic accuracy, confidence, and efficiency when solving nephrology clinical vignettes.

Prior to selecting the model for the trial, the research team benchmarked several state-of-the-art models across a pilot set of nephrology cases, including GPT-5, GPT-5-mini, O3, GPT-4o, Llama-4 Maverick-17B, Gemini-2.5-Pro, Qwen-3 VL-235B Thinking, DeepSeek-V3.2-Exp, MedGEMMA-27B, Claude Sonnet-4.5, and Magistral-Medium-2509. GPT-5 (high reasoning) demonstrated the highest diagnostic performance, stability, and interpretability, and was selected as the AI system used in the intervention arm.

Participants include medical students, residents, fellows, and practicing physicians. After creating an account, participants complete a demographic questionnaire (specialty, years of experience, practice type, age category, AI familiarity) and must explicitly agree to the use of these data for research purposes before accessing the vignettes. No directly identifying information is collected.

Participants are randomized (with stratification by professional status) to either the AI-supported arm or the control arm. Each participant is assigned 10 nephrology vignettes in French or English and may complete them over multiple sessions. Once a vignette is submitted, it cannot be revisited ("no backtracking"). Completion time per vignette is automatically recorded.

Control arm: participants view each vignette and provide up to three diagnoses ("Top-3"), followed by a confidence rating (0-10).

AI-supported arm: participants first enter an initial Top-3 diagnosis and confidence rating without AI assistance. The system then displays GPT-5's diagnostic suggestions, after which participants may revise their diagnoses once. The vignette is locked after submission.
The study collects:

- initial and final diagnoses,
- confidence ratings before and (if applicable) after AI suggestions,
- completion times,
- participant demographic variables,
- and the AI model's own diagnostic outputs.

Partial completion is permitted; all completed vignettes contribute to the analysis. Primary and secondary outcomes include diagnostic accuracy (Top-3 and Top-1), accuracy improvement before vs. after AI, changes in diagnostic confidence, AI-induced diagnostic errors, human-versus-AI benchmarking, completion-time efficiency metrics, and the proportion of assigned vignettes completed.

The primary analysis will compare diagnostic accuracy between the control arm (physicians alone) and the experimental arm (physicians assisted by the AI model). Accuracy is analyzed as a binary outcome (correct vs. incorrect diagnosis). Because each participant evaluates multiple clinical vignettes, accuracy will be modeled using mixed-effects logistic regression with a fixed effect for study arm and random intercepts for both participant and vignette; this accounts for clustering and for varying difficulty across cases. The primary hypothesis test uses a two-sided α = 0.05, and effect sizes will be reported as odds ratios with 95% confidence intervals. Secondary analyses will explore whether accuracy varies by demographic factors (e.g., experience level, specialty) using interaction terms.

The team also performed simulation-based power analyses using mixed-effects logistic regression models with random intercepts for both participant and vignette, assuming an intra-participant ICC of 0.10. Under these assumptions, a total sample of 100 participants (50 per arm) with 10 vignettes per participant provides >99% power to detect a clinically meaningful improvement in diagnostic accuracy. The investigators therefore plan to enroll approximately 100 participants overall.
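The simulation-based power analysis described above can be sketched in miniature. This is an illustrative Monte Carlo only: the assumed accuracies (50% control vs. 70% AI-assisted) are placeholder values not stated in the record, only a participant random intercept is simulated (the trial also models a vignette intercept), and a participant-level z-test stands in for the trial's planned mixed-effects logistic regression.

```python
import math
import random

def simulate_power(n_per_arm=50, n_vignettes=10, p_control=0.50,
                   p_ai=0.70, icc=0.10, n_sims=200, seed=1):
    """Monte Carlo power for a two-arm trial with a clustered binary outcome.

    The participant random-intercept SD is derived from the logistic-scale
    ICC formula: sigma^2 = icc * (pi^2 / 3) / (1 - icc). Assumed effect
    sizes are placeholders; the analysis is a simplified z-test on
    participant-level accuracy, not the planned mixed model.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(icc * (math.pi ** 2 / 3) / (1 - icc))
    logit = lambda p: math.log(p / (1 - p))
    expit = lambda x: 1.0 / (1.0 + math.exp(-x))

    def arm_accuracies(p):
        means = []
        for _ in range(n_per_arm):
            b = rng.gauss(0.0, sigma)  # participant random intercept
            p_i = expit(logit(p) + b)  # this participant's success prob.
            correct = sum(rng.random() < p_i for _ in range(n_vignettes))
            means.append(correct / n_vignettes)
        return means

    rejections = 0
    for _ in range(n_sims):
        a, c = arm_accuracies(p_ai), arm_accuracies(p_control)
        ma, mc = sum(a) / len(a), sum(c) / len(c)
        va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
        vc = sum((x - mc) ** 2 for x in c) / (len(c) - 1)
        se = math.sqrt(va / len(a) + vc / len(c))
        if se > 0 and abs(ma - mc) / se > 1.96:  # two-sided alpha = 0.05
            rejections += 1
    return rejections / n_sims
```

Under these (assumed) parameters, the estimated power is very high, consistent with the >99% figure reported for the full mixed-model simulations.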
This study aims to quantify whether AI-augmented reasoning meaningfully improves diagnostic performance and decision-making when clinicians evaluate complex nephrology cases.
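The Top-1 and Top-3 accuracy outcomes above can be scored roughly as follows. This is a minimal sketch: exact case-insensitive string matching is a simplification (the record does not describe how diagnoses are adjudicated, e.g. synonym or partial-credit handling), and the example cases are invented for illustration.

```python
def top_k_correct(suggestions, truth, k=3):
    """True if the reference diagnosis appears among the first k suggestions.

    Uses case-insensitive exact matching; real adjudication would likely
    need synonym normalization (assumption, not from the trial record).
    """
    target = truth.strip().lower()
    return any(s.strip().lower() == target for s in suggestions[:k])

# Hypothetical responses: (participant's ranked diagnoses, reference answer)
cases = [
    (["AKI", "prerenal azotemia", "ATN"], "ATN"),
    (["IgA nephropathy"], "membranous nephropathy"),
]
top1 = sum(top_k_correct(s, t, k=1) for s, t in cases) / len(cases)
top3 = sum(top_k_correct(s, t, k=3) for s, t in cases) / len(cases)
# Here Top-1 accuracy is 0.0 and Top-3 accuracy is 0.5.
```

Scoring both metrics from the same ranked list is why the protocol asks for up to three ordered diagnoses per vignette.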
Conditions
- Diagnosis
- Clinical Decision-making
- Artificial Intelligence (AI) in Diagnosis
- Decision Support Systems, Clinical
Interventions
| Type | Name | Description |
|---|---|---|
| OTHER | AI suggestion | This intervention consists of displaying an AI-generated diagnostic suggestion during the clinical case-solving task. After reading each vignette, participants see the top diagnostic proposal produced by a large language model (GPT-5, high-reasoning configuration), selected after internal benchmarking. The AI suggestion appears once per vignette and cannot be requested again or modified. Participants may revise their diagnostic answer after viewing the suggestion, but they cannot return to the vignette later. No additional guidance, coaching, or interactive features are provided. |
Timeline
- Start date
- 2025-11-20
- Primary completion
- 2026-10-31
- Completion
- 2026-12-31
- First posted
- 2026-01-20
- Last updated
- 2026-01-20
Locations
1 site across 1 country: France
Source: ClinicalTrials.gov record NCT07352475. Inclusion in this directory is not an endorsement.