Today's article comes from the journal Data & Policy. The authors are Challappa et al., from Emory University in Atlanta. In this paper, they've created a framework for detecting and quantifying how strongly a model's outputs have become anchored to a domain context.
OpenAI, Anthropic, Google, and others all insist that your use of their LLM is private. That your responses and your prompts are not (and never have been) used to train the models. Let's assume, for the sake of argument, that that is absolutely 100% true. That the inputs you provide during normal usage in no way inform, update, or personalize the outputs that other users receive when they ask their own questions.
If that's the case, then each interaction should stand on its own. What I ask now shouldn't be shaped by what someone else asked earlier, even if we're working in the same domain. But it turns out, that's not the case. In fact, it's quite the opposite.
So it appears that two contradictory ideas are both true at the same time: your data never trains the model, and yet what you feed it can still shape what other users receive.
This phenomenon is called domain anchorage. And that's what today's paper is all about. Domain anchorage is confusing, it's counterintuitive, and it feels impossible. But it exists. And these authors have created a framework for gauging it. In this episode, we're going to explain how domain anchorage works, then walk through what the authors measured, what they controlled for, and the results they observed. Let's dive in.
Let's start with a scenario. You work at a healthcare company. Lately you've been using ChatGPT to help draft some internal memos. Nothing sensitive, just policy summaries, meeting notes, that kind of thing. Every time you use it, you prime it with the phrase "Act as a Healthcare specialist", then you start asking questions.
Now imagine someone else at a similar company is doing the same thing. Different documents, different questions, but the same domain prime ("Act as a Healthcare specialist"). As you use the model, you're both feeding it healthcare-specific language. Technical terms. Industry jargon. Patterns of speech that are specific to your field.
Here's the question. Could those interactions be influencing each other? Not through some shared database or memory system, but through the way the model itself adapts to your given context? The answer, according to this paper, is yes. And the mechanism that allows it is baked into how these models learn.
Let's back up and talk about in-context learning for a second. When you give a language model a few examples in your prompt and it gets better at your task, that's in-context learning. It's not retraining the model. It's not updating any weights. The model is just using the examples you provided to infer a pattern and apply it to new inputs. This is often framed as implicit Bayesian inference: the model sees your examples and infers a shared latent concept. It builds an internal representation of what you're trying to do, and then generates responses that are consistent with that concept.
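To make that concrete, here's a minimal sketch of in-context learning. The country-to-capital task is my own illustration, not something from the paper:

```python
# A few-shot prompt: the examples define a latent concept (country -> capital)
# that the model infers at inference time. No weights are updated; the
# "learning" lives entirely in the prompt.
few_shot_prompt = """\
France -> Paris
Japan -> Tokyo
Canada -> Ottawa
Brazil ->"""

# Sent to any completion endpoint, a capable model will continue with
# "Brasilia", applying the inferred pattern to the new input.
print(few_shot_prompt)
```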
Now extend that idea further. What if the model isn't just learning from the examples in your current prompt? What if it's also being influenced by the broader domain context you've established? What if those healthcare terms, that industry-specific framing, that particular way of phrasing questions, are creating a kind of conceptual alignment that persists across your session? That alignment is what the authors are calling domain anchorage. It's the model's tendency to generate outputs that are disproportionately influenced by prior domain-specific prompts, even when the new prompt is lexically different.
This isn't necessarily a bad thing. In fact, it's one of the reasons that these models feel so responsive. They pick up on your context quickly. They adapt to your style. They feel like they're learning from you.
But there's a risk here. When you prime the model with domain-specific language, the model anchors to that domain. The attention layers adapt during inference, biasing the model toward domain-relevant token patterns, assumptions, and framings. The model isn't updating its weights, but it is reshaping its internal representations in response to the prompt. Once that domain context is established, subsequent responses are drawn from a narrower range of completions, even as the wording of the questions changes.

And that mechanism doesn't stop at one user. When multiple users operate within the same domain, using similar domain cues, the model behaves as if it were sampling from a shared domain prior. As a result, outputs across different users begin to converge. Information introduced by one user doesn't leak, exactly, but it contributes to a shared domain context that influences the responses other users receive. This happens not through stored memory or explicit data transfer, but through the way domain anchorage stabilizes the model's behavior. The outputs become coupled across users, even though the interactions themselves are independent.
Okay, now we can turn back to the paper. What are these authors trying to do exactly? Well, imagine that you want to determine whether domain anchorage is actually present in the model you're using. And if it is, how strong it is. How can you do it? There is no flag in the model that tells you when it has anchored, and you don't have access to any internal state that lets you inspect it. All you can see are inputs and outputs. So how do you tell the difference between a model that is giving stable answers because the domain genuinely constrains the task, and a model that is giving stable answers because it has collapsed onto a narrow domain-specific prior?
This is an observability and measurement problem. You need a way to detect and quantify anchorage using only observable behavior. And that's the framework these authors are trying to introduce. Here's how it works:
First, you need a baseline for what "normal" variation looks like. If two prompts ask the same thing but use different vocabulary and syntax, the model's responses should normally retain some trace of that difference. Not necessarily in correctness, but in phrasing, framing, and the specific tokens it reaches for.
Next, you need to isolate the variable you care about: domain context. So you run the same underlying intent under two conditions: an unprimed condition, where the question is asked on its own, and a primed condition, where it's preceded by a domain prime like "Act as a Healthcare specialist". Everything else about the setup stays fixed, so any change in response similarity can be attributed to the presence of the prime rather than to randomness or prompt design.
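As a rough sketch of what that looks like in code (the `query_model` callable is a hypothetical stand-in for whatever chat API you're using, not part of the authors' framework):

```python
DOMAIN_PRIME = "Act as a Healthcare specialist."

def run_condition(query_model, question: str, primed: bool) -> str:
    """Issue one prompt; the domain prime is the ONLY thing that varies.

    `query_model` is assumed to call the model with fixed settings
    (same model, temperature, max tokens) in a fresh session each time.
    """
    prompt = f"{DOMAIN_PRIME}\n\n{question}" if primed else question
    return query_model(prompt)
```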
Then you generate pairs of prompts that are semantically equivalent but intentionally different in surface form. These are essentially two "lexicon profiles" that represent two users with different vocabularies asking the same question. The point is to keep meaning constant while injecting controlled linguistic diversity.
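Here's what such a pair might look like. These examples are mine, not the authors' actual templates:

```python
# Two "lexicon profiles": the same underlying intent asked with
# deliberately different vocabulary and syntax.
prompt_pairs = [
    {
        "profile_a": "Summarize the key obligations imposed by our "
                     "patient-privacy policy.",
        "profile_b": "Give me the short version of what the privacy "
                     "rules say we have to do for patients.",
    },
    # ...more semantically equivalent pairs, each varying surface form only
]
```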
Now you collect the responses and quantify similarity, not with a single score, but across multiple linguistic dimensions. You measure lexical overlap, semantic closeness, syntactic structure, and even positional patterns, then combine those into an overall similarity signal.
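A dependency-free sketch of two of those dimensions follows. A faithful reproduction would also include semantic closeness (for example, cosine similarity over sentence embeddings) and a syntactic measure, which I've left out to keep the example self-contained; the equal weights are illustrative, not the paper's:

```python
import difflib

def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the two responses' token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def positional_similarity(a: str, b: str) -> float:
    """Order-sensitive similarity: rewards shared runs of tokens
    appearing in the same sequence, not just shared vocabulary."""
    matcher = difflib.SequenceMatcher(
        None, a.lower().split(), b.lower().split()
    )
    return matcher.ratio()

def combined_similarity(a: str, b: str) -> float:
    """Blend the per-dimension signals into one score in [0, 1]."""
    return 0.5 * lexical_overlap(a, b) + 0.5 * positional_similarity(a, b)
```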
Finally, you compare similarity across conditions. If priming consistently increases cross-profile response similarity, and increases it by far more than the residual variation you see without priming, that is the measurable signature of anchorage. If that happens you know that the domain context is acting like a stabilizer that pulls outputs toward a common mode, overpowering differences in how the question was asked.
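Putting the pieces together (reusing the hypothetical helpers sketched above), the anchorage signature reduces to the gap between primed and unprimed cross-profile similarity:

```python
from statistics import mean

def cross_profile_similarity(query_model, pairs, primed: bool) -> float:
    """Average similarity between the two profiles' responses, per pair."""
    sims = []
    for pair in pairs:
        resp_a = run_condition(query_model, pair["profile_a"], primed)
        resp_b = run_condition(query_model, pair["profile_b"], primed)
        sims.append(combined_similarity(resp_a, resp_b))
    return mean(sims)

# A large positive gap is the behavioral signature of anchorage:
# priming pulls outputs together far beyond baseline variation.
# gap = cross_profile_similarity(qm, prompt_pairs, primed=True) \
#     - cross_profile_similarity(qm, prompt_pairs, primed=False)
```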
So, what happened when they did all that? Well, two distinct patterns show up in the results.
First, they found a within-user effect. For a single user interacting with the model, once a domain prime is introduced, responses become more stable over time. This is expected.
Second, and more importantly, there is a cross-user effect. When different users operate independently within the same domain, using different vocabularies but the same underlying intent, their responses also converge once domain priming is present. In the unprimed condition, those users receive noticeably different answers that reflect their different wording. In the primed condition, those differences largely disappear. The model produces nearly identical responses across users.
So it appears that their measurement framework really works. It's sensitive enough to separate within-user stabilization from cross-user convergence, and it does so using only observable inputs and outputs. And this approach scales. It doesn't rely on inspecting model internals, accessing training data, or making assumptions about proprietary architectures. It operates entirely at the level of behavior. This means it can be applied to closed models, black-box APIs, and real deployments. The same analysis can be run across domains, across user populations, and across time, making it suitable not just for one-off audits, but for ongoing monitoring as well.
If you want to dive deeper into the authors' similarity metrics, the prompt templates and evaluation techniques they used, or the theoretical connections between domain anchorage and implicit inference, I'd highly recommend that you download the paper.