Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank.

Publication

Ali R, Tang OY, Connolly ID, Fridley JS, Shin JH, Zadnik Sullivan PL, Cielo D, Oyelese AA, Doberstein CE, Telfeian AE, Gokaslan ZL, Asaad WF. Neurosurgery. 2023 Jun 12. doi:10.1227/neu.0000000000002551. Epub ahead of print. PMID: 37306460.

Three large language models (LLMs), ChatGPT (GPT-3.5), GPT-4, and Google Bard, were tested on a neurosurgery oral boards preparation question bank made up of complex questions. GPT-4 was the most accurate, answering 82.6% of questions correctly, compared with 62.4% for GPT-3.5 and 44.2% for Bard. GPT-4 performed well across question categories, especially on spine-related questions. Questions that required higher-order problem solving were harder for GPT-3.5 and Bard, but not for GPT-4. On imaging questions, GPT-4 outperformed both GPT-3.5 and Bard, and it had fewer “hallucinations” in its responses. The study highlights GPT-4’s effectiveness in answering complex neurosurgery questions and its potential for medical applications.