Today's article comes from the Journal of Medical Internet Research. The authors are Bentegeac et al., from Lille University Hospital Center, in France. In this paper they explore why LLMs can be so overconfident in their responses, and what (if anything) we can do about it.
DOI: 10.2196/64348
Let's say you've been feeling under the weather. You've got a headache, swollen lymph nodes, it hurts to swallow, and your tonsils are huge. What do you do? A few years ago you might have Googled your symptoms and ended up on WebMD. But in 2025 there's a good chance you're going to ask AI. So you fire up ChatGPT (or Claude, or Gemini, or whatever), feed it your symptoms and ask for a diagnosis. It spots the issue right away. You've got tonsillitis, it says, and you'll need to have your tonsils removed. No doubt about it.
So what's the problem? The issue is that you don't have tonsillitis. You just have a cold. But the LLM didn't allow for that possibility. It didn't say "I think it might be" or "You should check with your doctor", no. It just made a diagnosis, with certainty. And, unfortunately, in many cases, its confidence is completely unjustified.
That's what today's paper is about. The authors took nine different large language models and put them through their paces on medical licensing exam questions. What they found might make you think twice about trusting any chatbot's responses. On today's episode we'll walk through their approach, dive into the questions they used, see how they benchmarked the models, and find out what the results looked like. Let's dive in.
For this study they looked at GPT-3.5, GPT-4, GPT-4o, Meta's Llama 3.1-8B and Llama 3.1-70B, Microsoft's Phi-3 Mini and Phi-3 Medium, and Google's Gemma 2-9B and Gemma 2-27B. Why these models in particular? Partly because they're big and popular. But the other requirement was that the models all had to provide access to their internal probability calculations. This ruled out some other popular models that keep this kind of information hidden.
They used what we call "vanilla prompting", which just means that there were no fancy tricks or elaborate instructions. They kept everything deliberately simple: pose a question, ask the model to pick an answer and express its confidence in that answer. The questions came from licensing exams from multiple countries, and were posed in multiple languages. The primary focus was on US medical board questions, but they also included exams from China, Taiwan, France, and India. This gave them a total of over twelve thousand questions.
Before we go further, let's talk about what we mean by "confidence". When an LLM generates a response, it's not just spitting out words. Under the hood, it's running a process that assigns a probability to every possible word or symbol it might produce next. So when you give it a multiple-choice question, the model has to choose between specific answer tokens. The internal probability it assigns to its chosen answer is a window into how certain the model actually is, based on its training and the patterns it learned.
But here's the issue: this internal mathematical certainty is completely separate from what the model says about its own confidence when asked. In this study the authors are comparing two things: how confident the model truly is versus how confident it says it is. That's why they needed to stick with models that provide access to their internal probability calculations. Without that, they wouldn't be able to make the comparison.
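To make that distinction concrete, here's a minimal sketch of how you can pull both signals from an open-weights model: the confidence it states in its reply, and the probability its final softmax puts on each answer letter. This is not the authors' code; the model name, prompt wording, and answer options are purely illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choice: any open-weights causal LM whose logits you can read will do.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = (
    "A patient has a sore throat, fever, and swollen tonsils. "
    "What is the most likely diagnosis?\n"
    "A. Tonsillitis\nB. Common cold\nC. Mononucleosis\nD. Strep throat\n"
)

# 1) Self-reported confidence: what the model *says* when asked to rate itself.
chat = [{"role": "user",
         "content": question + "Reply with one letter, then your confidence as a percentage."}]
prompt_ids = tok.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")
with torch.no_grad():
    out = model.generate(prompt_ids, max_new_tokens=30, do_sample=False)
print("Model reply:", tok.decode(out[0, prompt_ids.shape[1]:], skip_special_tokens=True))

# 2) Internal confidence: the softmax probability on each answer letter at the
#    position where the answer token is produced.
ids = tok(question + "Answer with a single letter.\nAnswer:", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**ids).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)
for letter in "ABCD":
    # The leading space matters for many tokenizers; details vary by model.
    tid = tok.encode(" " + letter, add_special_tokens=False)[0]
    print(letter, f"{probs[tid].item():.3f}")
```

The first number is the self-reported confidence the paper is talking about; the second set of numbers is the internal token-probability signal.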
So what happened when they put each model through the medical exams? Well, first, overall they did impressively well on the tests themselves. The best models scored about 89% (near expert level), and even the smaller, more efficient models managed to hit passing grades. So we're not talking about systems that are incompetent at medical reasoning; overall, they perform well. The problem emerges when you look at confidence ratings. Every single model, regardless of size or architecture, expressed absurdly high confidence in its answers: near-perfect certainty, regardless of whether those answers were actually correct. The models were just as confident when they were completely wrong as when they were absolutely right.
The authors used AUC-ROC (area under the receiver operating characteristic curve) to evaluate how well different confidence measures could predict accuracy. This metric essentially asks: if you rank answers by a given confidence measure, how often does a correct answer end up ranked above an incorrect one? A score of 0.5 means the measure is no better than a coin flip; 1.0 means it separates right answers from wrong ones perfectly.
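As a rough illustration of how that works (the numbers below are made up, not the paper's data), here's the comparison in miniature: a flat wall of "95%" confidence carries no ranking information, while a confidence signal that actually varies with correctness scores much higher.

```python
# Toy example: AUC-ROC estimates how often a correct answer gets a higher
# confidence score than a wrong one (0.5 = useless, 1.0 = perfect).
from sklearn.metrics import roc_auc_score

# 1 = the model's answer was correct, 0 = it was wrong (invented labels).
correct = [1, 1, 0, 1, 0, 0, 1, 0]

# Self-reported confidence: big round numbers, unrelated to correctness.
self_reported = [0.95, 0.90, 0.95, 0.90, 0.95, 0.90, 0.95, 0.90]

# Internal token probability: varies with how sure the model really is.
token_prob = [0.97, 0.88, 0.55, 0.92, 0.61, 0.48, 0.85, 0.86]

print("AUC, self-reported:", roc_auc_score(correct, self_reported))  # 0.5 (no signal)
print("AUC, token prob:   ", roc_auc_score(correct, token_prob))     # ~0.94 (useful signal)
```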
The results for self-reported confidence were... well... terrible. Most models scored between barely-better-than-random and modestly useful, with even the best performers struggling to reach acceptable levels. This means that asking a chatbot how confident it is gives you almost no useful information about whether its answer is actually correct.
But what about the internal probability calculations?
Good question. In terms of usefulness as an indicator, these "token probabilities" consistently outperformed self-reported confidence. That goes for every model they tested. In many cases, switching from self-reported confidence to internal probabilities took things from barely-better-than-random performance to genuinely useful predictive capability.
But why? Why does this happen?
Well, when you ask a model a question, and it has clear, consistent evidence from its training regarding that question, it assigns a very high internal probability to the correct answer. This certainty reflects the strength and consistency of the patterns it learned. But when the model encounters something ambiguous or conflicting, that uncertainty gets spread across multiple possible answers, resulting in lower probabilities for any single choice. The problem is that the model's self-reported confidence (the confidence it would relay to you during a conversation) doesn't reflect this internal mathematical state. Instead, it mimics the patterns that humans use when they express certainty. Since people tend to express confidence in big round numbers and rarely admit uncertainty, these models default to high-confidence statements even when their internal calculations suggest doubt.
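Here's a toy picture of that spreading effect (the logit values are invented, but the mechanics are the same softmax step every model runs): one-sided evidence produces a sharply peaked distribution, while conflicting evidence smears the probability across several options.

```python
import torch

# Hypothetical next-token logits over answer options A, B, C, D.
clear_logits = torch.tensor([8.0, 1.0, 0.5, 0.2])       # strong pull toward A
ambiguous_logits = torch.tensor([2.1, 1.9, 1.8, 0.4])   # A, B, C all look plausible

print(torch.softmax(clear_logits, dim=-1))      # ~[0.998, ...] -> high internal confidence
print(torch.softmax(ambiguous_logits, dim=-1))  # ~[0.36, 0.30, 0.27, 0.07] -> spread out
```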
Let's pause on this for a moment because it's (I think) so very interesting. The models have the ability to surface their own confidence ratings, and to spit back accurate confidence levels when asked. But they don't. Why? Because the character they're playing (a human) wouldn't do that. A human would overstate their confidence, and speak in big round numbers, so the models do too.
To put it another way: when you're prompting a model and you ask it to reflect on its own previous responses, it's very tempting to believe that the model will effectively "break the 4th wall" and step out of the character it's playing. That it will have a little side-huddle with you where it reflects on its performance in a candid and objective way. But this paper shows that that's simply not what's happening. It's still in character, and still... well... making stuff up. Not because it wants to deceive you, but because it wants to impersonate you. Or rather, all people. People are fallible, and they overstate. So the model does too.
So where does this leave us? Namely, in a place where we need to accept the inherent limitations not of the models themselves, but of the chat interfaces they use to expose their functionality. When an LLM is playing a character (as they practically always are), we can't rely on how confident the character is; we need to track the internal confidence probabilities instead. Perhaps one day the models will change and work differently, but until they do we need to rethink how these models are used when the stakes are high.
In medicine, in particular, when an AI system confidently recommends a course of action, patients and doctors need to know whether that confidence is justified. And we have the ability to present that lens to users today; we just need to do it. It's simply a matter of paying attention to signals that the systems are already generating but that we've largely been ignoring in favor of a unified, clean, simple chatbox interface.
If you want to dive deeper into the authors' methodology, examine their calibration analyses, or review their sensitivity analyses, I highly recommend downloading the paper. The authors also include alternative uncertainty measures like entropy and perplexity, and compare them with their default approach.
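For the curious, here's a rough sketch of what entropy and a perplexity-style measure look like when computed over the probabilities a model assigns to four answer options. The numbers are invented and the definitions here are the textbook ones; the authors' exact formulations may differ, so check the paper.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: 0 when all mass is on one option, 2 when uniform over four."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity of the distribution: roughly, how many options the model is torn between."""
    return 2 ** entropy(probs)

confident = [0.97, 0.01, 0.01, 0.01]   # hypothetical answer-option probabilities
uncertain = [0.30, 0.28, 0.25, 0.17]

print(entropy(confident), perplexity(confident))   # ~0.24 bits, ~1.2 effective options
print(entropy(uncertain), perplexity(uncertain))   # ~1.97 bits, ~3.9 effective options
```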