
ChatGPT vs the GSSE: Evaluating AI Performance on The Generic Surgical Sciences Examination


ChatGPT has proven itself able to pass the Generic Surgical Sciences Examination, scoring 88% against a pass mark of 63.8%. Discover how AI is transforming surgical education.


Study Overview

Recent research published in the International Journal of Surgical Education examined ChatGPT-4's performance on the Generic Surgical Sciences Examination (GSSE), a mandatory prerequisite for surgical training in Australia. The study assessed whether the large language model could achieve passing scores across multiple-choice questions in anatomy, pathology, and physiology.

The GSSE is administered by the Royal Australasian College of Surgeons (RACS) and consists of 185 questions delivered over two days. Candidates must pass each individual section as well as achieve an overall passing score.

Methodology

Researchers selected 100 questions from RACS's database of 5,179 practice questions using computer-based pseudo-randomization. The distribution maintained exam proportions: 50 anatomy questions, 25 pathology questions, and 25 physiology questions. Questions were presented to ChatGPT-4 in open-ended format without multiple-choice options.
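To make the sampling step concrete, here is a minimal Python sketch of a stratified pseudo-random draw. Only the target counts (50/25/25) come from the study; the question bank contents, the subject proportions within it, and the random seed are illustrative assumptions.

```python
import random

# Placeholder question bank standing in for the RACS database of 5,179
# practice questions (not publicly available); the subject split below
# is an assumption for illustration only.
question_bank = (
    [{"id": f"anat-{i}", "subject": "anatomy"} for i in range(2600)]
    + [{"id": f"path-{i}", "subject": "pathology"} for i in range(1300)]
    + [{"id": f"phys-{i}", "subject": "physiology"} for i in range(1279)]
)

# Target counts mirroring the exam proportions described in the study.
targets = {"anatomy": 50, "pathology": 25, "physiology": 25}

rng = random.Random(2023)  # fixed seed so the pseudo-random draw is repeatable
selected = []
for subject, n in targets.items():
    pool = [q for q in question_bank if q["subject"] == subject]
    selected.extend(rng.sample(pool, n))

print(len(selected))  # 100 questions, stratified by subject
```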

Responses were evaluated against official RACS answers using binary classification. The study compared ChatGPT's performance against pass marks from the October 2022 examination sitting.
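A rough sketch of how binary marking and pass-mark comparison could be tallied is shown below. The per-question outcomes and the section-level pass marks are placeholders, not values reported in the study; only the overall pass mark of 63.8% comes from the article.

```python
# Each response is marked correct (True) or incorrect (False) against the
# official RACS answer, then accuracies are compared with pass marks.
# The outcome lists and section pass marks below are illustrative only.
marks = {
    "anatomy":    [True] * 42 + [False] * 8,
    "pathology":  [True] * 23 + [False] * 2,
    "physiology": [True] * 23 + [False] * 2,
}
pass_marks = {"anatomy": 0.60, "pathology": 0.65, "physiology": 0.65}  # placeholders

section_accuracy = {s: sum(m) / len(m) for s, m in marks.items()}
overall_accuracy = sum(sum(m) for m in marks.values()) / sum(len(m) for m in marks.values())

for section, acc in section_accuracy.items():
    print(f"{section}: {acc:.1%} (pass mark {pass_marks[section]:.0%})")
print(f"overall: {overall_accuracy:.1%} (pass mark 63.8%)")
```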

Results

ChatGPT achieved an overall accuracy of 88.0%, well exceeding the required pass mark of 63.8%. Performance across individual sections also surpassed minimum standards.

All results demonstrated statistical significance, with pathology showing the highest accuracy at 93.9%.
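For readers curious what a significance claim of this kind can look like in practice, below is a hedged sketch of a one-sided exact binomial test comparing the observed overall accuracy (88 of 100 correct) against the published pass mark. The study's actual statistical analysis is not detailed here, so this illustrates the general approach rather than reproducing it.

```python
from scipy.stats import binomtest

# One-sided exact binomial test: is 88 correct out of 100 consistent with
# an underlying accuracy no better than the 63.8% pass mark?
result = binomtest(k=88, n=100, p=0.638, alternative="greater")
print(f"p-value: {result.pvalue:.2e}")  # a small p-value supports "above the pass mark"
```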

Study Limitations

Several methodological constraints limit interpretation of these findings. The sample size of 100 questions represents approximately 19% of the complete examination. The study excluded anatomy spotter stations, which constitute significant portions of the actual GSSE. Current large language models lack native image processing capabilities necessary for anatomical specimen identification.

Large language models exhibit documented tendencies toward generating plausible but factually incorrect information, termed "hallucinations" in the literature. While educational applications are generally low risk, this study does not demonstrate that such performance generalises to clinical applications.

Implications and Considerations

The results indicate that large language models can successfully recall and apply medical knowledge across broad domains when assessed through multiple-choice formats. However, medical examinations evaluate competencies beyond factual recall, including communication skills, professional judgment, ethical reasoning, and clinical acumen.

As AI becomes more integrated into healthcare, several key considerations emerge. Protecting patient data, ensuring transparency in how algorithms make decisions, and preventing the amplification of existing biases are essential. Equally important is clearly defining when AI should support clinicians—and when decisions must remain entirely human-led.

There is also a risk that leaning too heavily on AI could weaken the development of core clinical reasoning skills if it is not paired with thoughtful educational strategies. For this reason, AI should be used to enhance human abilities, not replace them. Strong governance, oversight, and training frameworks will be crucial to embedding these technologies safely and effectively into modern healthcare.

Conclusions

This study demonstrates that ChatGPT-4 can achieve passing scores on the multiple-choice component of the GSSE, with performance significantly exceeding minimum standards. These findings align with emerging literature on large language model capabilities in medical knowledge assessment.

However, results should not be extrapolated beyond the specific testing conditions. The examination components not evaluated in this study—including practical assessments, visual identification tasks, and open-ended clinical reasoning—represent critical elements of surgical competency that warrant further investigation.

Have you used AI in your GSSE or surgical examination prep?

Let us know about your experience over at GetThru!
