Clinical Summary | Artificial Intelligence | Chatbots

From Text to Treatment: Evaluating Multimodal AI Chatbots in Clinical Oncology


Time to read: 05:16
Time to listen: 12:20

 
Published on MedED: 29 October 2024 
Originally Published: 23 October 2024
Source: JAMA Network Open

Type of article: Clinical Research Summary
MedED Catalogue Reference:  MNCS001

Category: Artificial Intelligence
Cross-reference: Oncology

Keywords: AI, oncology, multimodal, unimodal, chatbots, clinical practice
 


 

Key Takeaways

1. Multimodal AI chatbots do not necessarily outperform unimodal ones in medical accuracy

2. Multimodal chatbots struggled with cases involving multiple images, indicating that increased complexity in visual input can negatively affect performance

3. Effective integration of AI chatbots in clinical practice requires clinician familiarity and collaboration to enhance functionality and address patient concerns


Overview | Study Purpose | Study Design | Findings | Discussion | Limitations | Conclusion | Original Research | Funding | References

 

Overview

Multimodal AI chatbots represent a leap forward in artificial intelligence applications in healthcare. 


Unlike traditional, unimodal AI systems that process only one type of data - usually text - multimodal AI chatbots can interpret and combine different types of information, including text, images, and, in some cases, audio. This capability is especially valuable in clinical settings where doctors need to make quick, data-informed decisions. For instance, a clinician could interact with a multimodal chatbot to evaluate a patient’s symptoms described in text form alongside imaging data like X-rays or MRIs. By synthesising data from these multiple “modes,” the chatbot can provide more insightful recommendations, potentially improving diagnostic accuracy and treatment planning.2
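
To make the contrast concrete, here is a minimal sketch of how a text-only request differs from a multimodal one that attaches an image. The message layout mimics an OpenAI-style chat format and is an assumption for illustration only; the study does not describe the interfaces it used.

    # A unimodal (text-only) request carries nothing but the question.
    text_only_message = {
        "role": "user",
        "content": "Patient with a persistent cough and unexplained weight loss. "
                   "What is the most likely diagnosis?",
    }

    # A multimodal request sends an image (e.g. a chest X-ray) in the same
    # message, letting the model combine both "modes" of information.
    multimodal_message = {
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Patient with a persistent cough and unexplained weight loss. "
                     "What is the most likely diagnosis?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.org/chest-xray.png"}},  # placeholder URL
        ],
    }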

Previous studies have shown that these chatbots can achieve over 70% accuracy when responding to general medical questions, underscoring their potential to alleviate clinical workload by offering timely and reliable support.1

Multimodal AI tools could be especially impactful in fields that require analysing diverse data types, such as oncology, where text-based data like patient history and treatment notes are routinely combined with visual data from imaging scans to monitor tumour growth or assess treatment effectiveness.

The researchers of this study aimed to test and benchmark the diagnostic and response accuracy of multimodal AI chatbots—those that can interpret both text and images—with an emphasis on complex oncology cases. 

They sought to understand whether these AI systems can accurately answer specialist-level questions in oncology care. To do this, they compared responses from multimodal and text-only chatbots to oncology-focused multiple-choice and free-text questions. 
Additionally, they explored prompt-engineering techniques such as "zero-shot chain-of-thought prompting", which asks the model to reason through intermediate steps before answering, with the aim of improving accuracy on complex queries.
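
As a rough illustration, the sketch below shows the idea in miniature: the only difference from a direct prompt is an appended instruction to reason step by step, with no worked examples supplied (which is what makes it "zero-shot"). The case text is a placeholder; the study's exact prompts are not reproduced here.

    # Zero-shot chain-of-thought prompting: the same question is posed twice,
    # once directly and once with a reasoning cue appended.
    case_text = (
        "Clinical vignette goes here..."  # placeholder, not a study case
        " Which of the following is the most appropriate next step?"
    )

    direct_prompt = case_text + "\nAnswer with the single best option."

    chain_of_thought_prompt = (
        case_text + "\nLet's think step by step, then give the single best option."
    )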

This study represents one of the first to explore multimodal AI's effectiveness in a specialised medical domain, setting a benchmark for future clinical decision-support tools.



Back to top
Study Purpose

The researchers aimed to "...evaluate the utility of prompt engineering (zero-shot chain-of-thought) and compare the competency of multimodal and unimodal AI chatbots to generate medically accurate responses to questions about clinical oncology cases."

Back to top

Study Design & Selection Criteria
 
In this cross-sectional study, researchers evaluated the medical accuracy of 10 chatbots, 3 multimodal (text and image) and 7 text-only models, in answering oncology-related questions.


The chatbot engines used were as follows:

Multimodal chatbots
Naming convention    Chatbot used
Chatbot 1            ChatGPT-4 Vision
Chatbot 2            Claude-3 Sonnet Vision
Chatbot 3            Gemini Vision

Text-only chatbots
Naming convention    Chatbot used
Chatbot 4            ChatGPT-3.5
Chatbot 5            ChatGPT-4
Chatbot 6            Claude-2.1
Chatbot 7            Claude-3 Sonnet
Chatbot 8            Gemini
Chatbot 9            Llama2
Chatbot 10           Mistral Large

The study used 79 unique cases obtained from JAMA Network Learning. Each case had associated questions and images aimed at specialist-level knowledge. 

Each chatbot was tested on both multiple-choice and free-text responses:
  • Each chatbot received the case information in an individual session and was asked to select or generate a response to each question.
  • Pairs of oncologists then evaluated the chatbots’ answers, reviewing the cases and question responses and marking each answer as correct or incorrect.

The primary outcome was medical accuracy, measured as the number of correct responses given by each AI chatbot.

Multiple-choice responses were marked as correct based on ground-truth answers.

Free-text responses were rated in duplicate by oncology specialists and marked as correct by consensus; disagreements were resolved by review by a third oncology specialist.

This detailed methodology enabled researchers to rigorously assess each chatbot’s ability to interpret both textual and visual medical data accurately.
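
The consensus rule described above is simple enough to summarise in a few lines. The sketch below is illustrative only and is not the authors' code; the function and variable names are invented for the example.

    # Consensus scoring for free-text answers: two oncologists rate each
    # response independently; a third specialist resolves any disagreement.
    def consensus_mark(rater_a: bool, rater_b: bool, third_rater: bool) -> bool:
        """Return the final correct/incorrect mark for one response."""
        if rater_a == rater_b:
            return rater_a        # the two duplicate ratings agree
        return third_rater        # disagreement resolved by a third review

    def accuracy(marks: list[bool]) -> float:
        """Medical accuracy: the proportion of responses marked correct."""
        return sum(marks) / len(marks)

    # For example, 56 correct answers across the 79 cases gives 70.89%.
    print(f"{accuracy([True] * 56 + [False] * 23):.2%}")  # -> 70.89%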
 
Back to top
 
Findings 
The results of this study highlighted significant differences in the performance of multimodal versus text-only chatbots in responding to oncology-related questions. 
Notably, while one of the text-only chatbots (Chatbot 10) excelled in multiple-choice responses, only one of the multimodal chatbots (Chatbot 2) exhibited strong capabilities in both multiple-choice and free-text formats, indicating that integrating visual data could enhance diagnostic accuracy.

The following specific findings were of interest:


Multiple-Choice Response Accuracy 
In the multiple-choice evaluation, 89% of non-correct responses were coded as incorrect. Text-only Chatbot 10 led with 72.15% accuracy (57 of 79), followed by multimodal Chatbot 2 with 70.89% (56 of 79) and text-only Chatbot 5 with 68.35% (54 of 79).

In the comparison of diagnostic versus clinical management questions, one text-only chatbot (Chatbot 7) showed higher accuracy on diagnostic questions than on management questions.


Free-Text Response Accuracy
In the free-text evaluation, 90% of non-correct responses were likewise coded as incorrect. Here the results were mixed: text-only Chatbots 5 and 7 and multimodal Chatbot 2 each achieved 37.97% accuracy (30 of 79), while Chatbot 10 followed closely with 36.71% (29 of 79).

Comparative Accuracy
The three multimodal chatbots displayed varying accuracy relative to the text-only chatbots and generally performed better on multiple-choice evaluations than on free-text evaluations. Zero-shot chain-of-thought prompting did not consistently enhance chatbot accuracy.

Response Error Analysis
Multimodal chatbots were less accurate on questions with multiple images than on single-image questions, suggesting difficulty in processing more complex visual input. Correct responses also tended to have word counts equal to or greater than those of incorrect responses.

 
Back to top  

Discussion

The findings challenge the assumption that multimodal AI chatbots inherently provide better medical accuracy due to their ability to process various types of information. 

Notably, one of the unimodal chatbots (Chatbot 10) outperformed the others in multiple-choice accuracy, while unimodal (Chatbots 5 and 7) and multimodal (Chatbot 2) models achieved equal success in free-text responses.

The study identified that multimodal chatbots struggled more with cases involving multiple images, suggesting that more complex visual input may hinder performance. Additionally, the study emphasizes the critical role of quality training data in achieving accurate diagnostics. It points out inconsistencies in chatbot responses and highlights the need for improved instruction tuning to enhance free-text accuracy.

Furthermore, the research underscores the importance of realistic clinical assessments of chatbot capabilities. It indicates that zero-shot prompt engineering strategies require more exploration to enhance reasoning and contextually appropriate responses in clinical applications. 

Clinicians' acceptance and familiarity with AI chatbots are crucial for their effective integration into clinical practice. This emphasizes the need for collaboration among healthcare professionals, engineers, and researchers to improve chatbot functionalities and address patient concerns.

 

Limitations

The study's primary limitations include potential overlap between study cases and chatbot training data, a modest sample size of 79 cases, and a high representation of haematological cancer questions. Variability in oncologists' ratings of chatbot accuracy may arise from differing professional judgments. Future research should develop standardized measures for evaluating chatbot response quality and assess the reliability of multimodal chatbots across independent replicates in various oncology contexts.


Back to top  

Conclusion

This study found that multimodal chatbots performed similarly to unimodal chatbots in answering clinical oncology questions but were less accurate on cases featuring multiple images. Additionally, chatbots showed lower accuracy for free-text responses than for multiple-choice formats. Further research is needed to refine prompt-engineering methods and to enhance the reliability and utility of AI chatbots as decision-support tools in oncology.

 

Back to top

Conflict of Interest, Funding and Support

Role of the Funder/Sponsor
The study's funder had no role in the design, data collection, data analysis, data interpretation, or writing of the report.

Conflict of Interest Disclosures
Dr Hope reported grants from the Canadian Institute of Health Research, personal fees from AstraZeneca Canada, and nonfinancial support from Elekta Inc outside the submitted work. No other disclosures were reported.

Funding/Support
This work was partially supported by a Canadian Association of Radiation Oncology–Canadian Radiation Oncology Foundation Pamela Catton Summer Studentship and Robert L. Tundermann and Christine E. Couturier philanthropic funds.

This study was reproduced under a Creative Commons CC-BY licence.


Back to top
 


References
 

Back to top


Disclaimer
This is an open-access article distributed under the terms of the CC-BY Licence. The following is a summary of the clinical study and is not a substitute for the original research. Unless otherwise indicated, all work contained here is attributed to the original authors and trial. Links to all original material can be found at the end of this summary.

Every effort has been made to attribute quotes and content correctly. Where possible, all information has been independently verified. The Medical Education Network bears no responsibility for any inaccuracies arising from the use of third-party sources. If you have any queries regarding this article, please contact us.

Fact-checking Policy

The Medical Education Network makes every effort to review and fact-check the articles used as source material in our summaries and original material. We have strict guidelines in relation to the publications we use as our source data, favouring peer-reviewed research wherever possible. Every effort is made to ensure that the information contained here accurately reflects the original material. Should you find inaccuracies or out-of-date content or have any additional issues with our articles, please make use of the Contact Us form to notify us.
