Welcome to our repository dedicated to the evaluation of DxGPT across various AI models, both for common and rare diseases.
This project aims to explore the capabilities and limitations of different AI models in diagnosing diseases through synthetic and real-world datasets. Our comprehensive analysis includes closed models like GPT-4, Claude3 and open models like Llama2, Mistral and Cohere Command R +, providing insights into their diagnostic accuracy and potential applications in healthcare.
In this repository, you will find detailed evaluations, comparisons, and insights derived from multiple AI models. Our goal is to contribute to the advancement of AI in healthcare, specifically in improving diagnostic processes and outcomes for a wide range of diseases. We encourage collaboration, discussion, and feedback to further enhance the understanding and development of AI-driven diagnostics.
Stay tuned for updates and findings as we delve deeper into the world of AI and healthcare.
The naming convention of the files in this repository is systematic and provides quick insights into the contents and purpose of each file. Understanding the naming structure will help you navigate and utilize the data effectively.
Each file name is composed of four main parts:
-
Evaluation data prefix: All files related to model evaluation scores begin with
scores_. This prefix is a clear indicator that the file contains data from the evaluation process.diagnoses_prefix is used for the files that contain the actual diagnoses from each test run, same naming convention as the scores files.synthetic_*prefix is used for the synthetic datasets. -
Additionally, the dataset name is included to provide context. Example datasets include:
(empty)is a gpt4 synthetic datasetclaudeis a claude 2 synthetic datasetmedisearchis a medisearch synthetic datasetRAMEDISis the RAMEDIS dataset from RareBenchPUMCH_ADMis the PUMCH dataset from RareBenchURG_Torre_Dic_200is our proprietary dataset from common diseases in urgency care.
-
This is followed by the version of the dataset used for the evaluation (
(empty)is v1,v2is the second version of the dataset). -
Model identifier: Following the prefix, the name includes an identifier for the AI model used during the evaluation. Some of the possible model identifiers are:
_gpt4_0613: Data evaluated using the GPT-4 model checkpoint 0613._llama: Data evaluated using the LLaMA model._c3: Data evaluated using the Claude 3 model._mistral: Data evaluated using the Mistral model._geminipro15: Data evaluated using the Gemini Pro 1.5 model.
In addition to the main parts, file names may include modifiers that provide further context about the evaluation:
_improved: Indicates that the file contains data from an evaluation using an improved version of the prompt._rare_only_prompt: Specifies that the evaluation prompt was a test focused exclusively on rare diseases.
scores_v2_gpt4_0613.csv: Evaluation scores from the second version of the dataset using the GPT-4 model checkpoint 0613.scores_medisearch_v2_gpt4turbo1106.csv: Evaluation scores from the medisearch synthetic dataset using the GPT-4 model turbo checkpoint 1106.scores_URG_Torre_Dic_200_improved_c3sonnet.csv: Evaluation scores from the urgency care dataset from December using the Claude 3 Sonnet model with an improved prompt.scores_RAMEDIS_cohere_cplus.csv: Evaluation scores from the RAMEDIS dataset using the Cohere Command R + model.scores_PUMCH_ADM_mistralmoe.csv: Evaluation scores from the PUMCH dataset using the Mistral MoE 8x7B model.
This structured approach to file naming ensures that each file is easily identifiable and that its contents are self-explanatory based on the name alone.
This repository contains all the code, data, and results for the evaluation of DxGPT's diagnostic accuracy on synthetic rare disease cases. The paper "Evaluation of DxGPT Accuracy for Rare Diseases Diagnoses" describes the methodology and findings of this analysis in detail. The goal of open sourcing this content is to provide full transparency on the evaluation process and enable further research to build on this work.
This paper evaluates DxGPT, a web platform designed to accelerate the diagnosis of rare diseases. The platform uses GPT-4 to provide diagnostic suggestions based on a brief clinical description. The evaluation utilized 200 synthetic patient cases, derived from three models: GPT-4, Claude2, and MediSearch.
These 200 cases were selected from a curated list from Orphanet Prevalency List, a list of diseases that are the most common in rare diseases.
- Accelerated diagnosis: Targets rare diseases that often face long diagnosis delays (average of 5-6 years).
- Input: Takes a brief clinical description.
- Output: Provides a ranked list of potential diagnoses.
- Strict Accuracy (P1): Top suggestion matches the ground truth.
- Top-5 Accuracy (P1+P5): Ground truth appears within the top 5 suggestions.
- 67.5% Strict Accuracy (for GPT-4 cases)
- 57% Strict Accuracy (for MediSearch cases)
- 88.5% Top-5 Accuracy (for GPT-4 cases)
- 83.5% Top-5 Accuracy (for MediSearch cases)
The results are promising but require further validation on real clinical data and against human expert diagnoses.
- Examine model performance per disease type.
- Investigate qualitative errors.
- Compare model performance to clinician baselines.
With further rigorous evaluation, DxGPT shows potential to significantly assist doctors in diagnosing rare diseases faster, thus leading to improved patient outcomes.



