Beyond "Good Enough" in Medical Education
Inside our three-model system for creating exam-ready medical content at scale.
Putting Together High-Quality Study Resources With AI
Medical education has always demanded something close to impossible: content that is simultaneously broad and precise, current and timeless, accessible to a student encountering a concept for the first time yet rigorous enough to hold up under exam pressure and clinical practice.
The challenge isn't intent — it's scale and complexity.
Medicine moves fast. Guidelines update. Evidence shifts hierarchies. A drug that was first-line becomes third-line. A contraindication gets a new exception. A clinical decision that seemed straightforward turns out to depend on a nuance that wasn't in the original explanation.
Keeping pace with that across every subject, every topic, every level of detail is genuinely hard. And even when the facts are right, producing content that is not just accurate but clear — that doesn't inadvertently flatten important distinctions, doesn't create false confidence by covering a topic without covering it fully — is harder still.
Getting something mostly right in medicine isn't good enough. The gaps matter.
AI Can Help — If Used Carefully
AI, if used carefully, can help close some of those gaps — not by replacing expertise, but by adding a layer of independent, tireless, multi-perspective checking that is difficult to replicate at scale any other way.
The key word is carefully.
A single model generating and self-reviewing content simply inherits its own blind spots. The gains come from structure: different models, different training, different tendencies, each interrogating the output in a different way.
Three Models That Don't Trust Each Other
That's the thinking behind how we generate and validate study resources at Oncourse. We use three models that don't trust each other:
Generator (Anthropic Claude) — Creates the content: structured lessons, clinical vignettes, MCQs calibrated to Bloom's taxonomy, all tailored to your specific exam format.
Validator (OpenAI GPT) — Reads it like a strict examiner: Is the answer correct? Is the clinical reasoning sound? Are the distractors fair without being misleading?
Adversarial Reviewer (Google Gemini) — Tries to break it: Can a well-read student justify a different answer? Is something stated in a way that sounds right but leaves a subtly wrong mental model? Is there a nuance that would confuse more than it teaches?
Using different models matters beyond just having a second opinion. Each has its own training data, its own tendencies, its own blind spots. When they converge, you can have reasonable confidence. When they disagree, that's a signal worth investigating.
In practice, this works far better than the single-model generate-and-validate approach we used a year ago.
Only content that survives both reviewers gets approved.
Example #1: Rate Control in Atrial Fibrillation
Let's make this real with concrete examples from our Lessons and QBank for UKMLA Prep.
What the generator produced:
"Rate control in AF can be achieved with beta-blockers such as bisoprolol, calcium channel blockers such as diltiazem, or digoxin. All three reduce the ventricular rate and are appropriate first-line choices depending on the clinical context. The target resting heart rate is below 110 bpm in most patients."
At first pass, this looks completely fine — accurate, readable, covers the main options.
What the validator caught:
Digoxin controls rate at rest but not on exertion, making it third-line by current guidelines. Calling all three "appropriate first-line choices" isn't technically false, but it flattens a distinction that costs marks and builds the wrong clinical intuition.
After validator review:
"Beta-blockers (e.g. bisoprolol) and rate-limiting calcium channel blockers (e.g. diltiazem) are first-line for rate control in AF. Digoxin is reserved for sedentary patients or as an add-on — it provides rate control at rest only. This distinction is clinically important to remember."
What the adversarial reviewer flagged:
Something different and subtler. The lesson covered cardioversion technique in detail, but never connected it to the CHA₂DS₂-VASc decision. A student finishing the section would know how to cardiovert and have no idea when it is and isn't safe — which is exactly what exams ask.
The lesson created a false sense of completeness.
Paragraph added:
"Before attempting cardioversion in AF lasting more than 48 hours — or of unknown duration — adequate anticoagulation for at least three weeks is required, or a TOE to exclude left atrial thrombus, regardless of CHA₂DS₂-VASc score. The score determines long-term anticoagulation need; it does not determine the safety of cardioversion."
One small section. A factual hierarchy corrected, a clinical nuance made explicit, and a conceptual gap closed that wasn't visible until something looked for it independently.
Example #2: Variceal Bleeding Question
The original question:
A 45-year-old man with a history of alcohol excess presents with haematemesis. He is haemodynamically stable. Upper GI endoscopy confirms active bleeding from oesophageal varices. IV terlipressin has been administered. What is the next most appropriate intervention?
A) Oesophageal band ligation
B) Sengstaken-Blakemore tube insertion
C) TIPSS
D) IV propranolol
E) Repeat endoscopy in 24 hours
The validator approved it. Correct answer (A), plausible distractors, clean clinical stem, tests application-level reasoning.
The adversarial reviewer rejected it — not because the answer was wrong, but because antibiotics were absent from the vignette.
IV ceftriaxone is mandatory in variceal bleeding in current UK practice, given before or concurrent with endoscopy. A student who knows this pauses: Has prophylaxis been given — does that change the priority? A student who doesn't gets it right and walks away never having learned that antibiotics belong in this bundle at all.
The question works as a test. It fails as a teaching tool.
The fix:
One clause added to the stem: "IV ceftriaxone has been administered."
The question now tests purely endoscopic decision-making, and every student who reads it absorbs the full management bundle — not because it was explained, but because it was shown.
Why This Matters
The result isn't just factually accurate content. It's content stress-tested for the exact way medical students think — and mis-think — under exam pressure.
The validator catches what's wrong.
The adversarial reviewer catches what's missing, what's subtly misleading, what creates false confidence by covering a topic without covering it fully.
Different reviews by different models bring different perspectives and different failure modes — so the gaps one misses, the other finds.
That's the standard every student studying for a high-stakes exam deserves. And it's what careful use of AI can — for the first time — make possible at scale.
