
The AortaAI2025 project represents a continued effort by Think Tank Aorta to monitor the development of freely available AI chatbots, following the foundational ChatAortaAI project initiated in 2023. At that time, chatbots were a novelty for both patients and clinicians. Since then, the development of Large Language Models (LLMs) has been rapid and significant. The models of October 2025 are faster, more persuasive, and possess multimodal capabilities, including the processing of sound, images, and video, while relying on substantially larger training datasets. Despite these advancements, a critical question remained: do the fundamental flaws identified in 2023 persist in 2025? Our investigation in the AortaAI2025 project confirms that they do. The models continue to produce factual inaccuracies (“hallucinate”) and make errors remarkably similar to those of their 2023 predecessors. However, a significant new risk has emerged: today’s AI chatbots present misinformation with a much higher degree of confidence and may even attempt to persuade the user that their answers are absolutely true.
Project ChatAortaAI: An Academic Summary of a Pro-Bono Global Initiative (2023)
Introduction
In early 2023, responding to the rapid and widespread adoption of Large Language Models (LLMs), Think Tank Aorta initiated the ChatAortaAI project. This global, pro-bono initiative was established to provide a pragmatic assessment of the capabilities and risks of freely available generative AI tools (e.g., ChatGPT, Bing Chat) within the high-stakes medical field of rare aortic diseases. The project’s mission was to convene an international, multi-disciplinary group of medical experts, researchers, patient advocates, and industry representatives to systematically evaluate these technologies from both professional and patient perspectives.
Project Objectives
Following a multi-week preparatory phase, the project’s formal objectives were structured into three core sub-projects to guide the investigation:
• Evaluate AI Tool Limitations: To investigate the inaccuracy, inconsistency, and inherent limitations of the AI models, with a specific focus on performance discrepancies across different languages to guide users toward more reliable outcomes.
• Assess Expert Expectations: To understand the requirements and expectations of medical experts and researchers seeking to utilize these AI models for professional purposes.
• Explore Patient & Non-Expert Perspectives: To examine the use of AI from the viewpoint of patients and non-expert healthcare providers, identifying potential benefits for information access alongside necessary cautions.
Methodology
The project employed a structured, multi-lingual, mixed-methods approach designed for a global volunteer team with varying technical expertise. The initiative launched with approximately 40 participants from 15 countries, a group later refined to a core investigative team of 21 dedicated individuals from 12 countries, comprising seasoned surgeons, patient advocates, and researchers.
A critical methodological finding was that over two-thirds of the initial medical expert participants had no prior hands-on experience with LLMs. This necessitated the development of a comprehensive technical onboarding program, including detailed guides, videos, and live support sessions.
The investigation for each sub-project followed a systematic four-step model:
1. Preparations: An advisory team defined objectives and standardized tasks.
2. Literature and News Search: Participants conducted environmental scans in their native languages to understand regional discourse on AI in healthcare.
3. AI Exploration: The team systematically queried ChatGPT and Bing Chat in 19 languages using standardized tasks, documenting all results in English.
4. Discussion and Insights: Findings were centrally collected, shared, and collaboratively analyzed to synthesize observations and draw conclusions.
The project was managed using freely available digital tools, including LinkedIn, Zoom, and Google Forms, to facilitate global collaboration.
Key Findings
The project’s initial execution focused on evaluating the limitations and risks of the AI models. The analysis revealed critical deficiencies concerning their cross-lingual performance and clinical reliability.
• Significant Multi-Lingual Discrepancies: A key finding was that AI chatbot performance was highly language-dependent. The team discovered that responses considered acceptable in English were frequently incorrect, misleading, or nonsensical when the identical query was posed in other languages (e.g., Swedish, Tamil, German). This represents a significant patient safety and health equity issue for non-anglophone populations.
• Questionable Reliability for Healthcare: The team reached a clear consensus that the reliability of these early-stage tools for substantive healthcare decisions was highly questionable. This led Think Tank Aorta to issue a formal cautionary alert, stating, “While AI chatbots can offer general information and support, they should not be seen as a replacement for consultations with healthcare professionals… their reliability for making any healthcare decisions remains highly questionable at this stage.”
• Imperative for Source Validation: The analysis yielded one core, non-negotiable recommendation: users must always validate information obtained from AI chatbots against independent, reliable medical sources before relying on it for any health-related decision.
Discussion and Project Outcomes
The ChatAortaAI initiative served as an early model for agile, volunteer-led, participatory action research in response to the unregulated spread of generative AI in healthcare. By mobilizing a global team, the project provided a vital, real-world assessment of the technology’s readiness for patient and professional use.
The project successfully transitioned from an exploratory phase to one focused on formal academic dissemination. A dedicated working group documented the methodology and findings, leading to three peer-reviewed publications. These findings support the argument that, without rigorous, independent, and expert-led validation, the promise of AI in healthcare risks being undermined by preventable patient harm, especially in non-anglophone communities.
Resulting Peer-Reviewed Publications
The project’s findings were published in the European Journal of Vascular and Endovascular Surgery, contributing to the formal academic discourse on AI applications in vascular medicine.
1. Research Letter: Current Artificial Intelligence–Based Chatbots May Produce Inaccurate and Potentially Harmful Information for Patients With Aortic Disease, by G. Melissano, G. Tinelli & T. Söderlund.
2. Letter to the Editor: Moving Forward: Evaluation of Artificial Intelligence Chatbots in Vascular Diseases, by A. Lareyre et al.
3. Correspondence: Power Is Nothing Without Control: Are Artificial Intelligence Chatbots in Vascular Diseases Tested and Verified?, by G. Melissano, G. Tinelli & T. Söderlund.
These publications collectively underscored the need for rigorous testing, validation, and transparency before the widespread clinical integration of AI systems in vascular medicine.
