Frontier multimodal models readily generate detailed image descriptions and elaborate reasoning traces – including pathology-biased clinical findings – for images they were never shown, a phenomenon we term "mirage reasoning." In the most extreme case, a small text-only model reached the top of a chest X-ray question-answering benchmark without access to any images. We introduce the Mirage Score, the Phantom-0 benchmark, and the B-Clean decontamination framework to measure and contain the effect.
MARCUS: An Agentic, Multimodal Vision-Language Model for Cardiac Diagnosis and Management
Mohammad Asadi*, Jack W. O’Sullivan*, Lennart Elbe, Akshay Chaudhari, Tahoura Nedaee, Francois Haddad, Michael Salerno, Fei-Fei Li, Ehsan Adeli, Rima Arnaout, and Euan A. Ashley
MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals) is an agentic vision-language assistant that reads and reasons over ECG, echocardiography, and cardiac MRI – individually or jointly. Modality-specific expert models are coordinated by a multimodal orchestrator that, by design, resists the "mirage reasoning" failure mode.
Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain
Mohammad Asadi, Tahoura Nedaee, Jack W. O’Sullivan, Euan Ashley, and Ehsan Adeli