AI (mis)alignment, Waluigi, and the Knobe Effect

This essay is about why it may be easier for Large Language Models (LLMs) to “go bad” than to be good—a phenomenon called the Waluigi Effect.
This effect is still speculative and poorly understood. As I’ve written previously, LLMs themselves are also largely “black boxes”; as a consequence, explanations for their behavior sometimes involve attributing agentivity (“The LLM wants to do X”) or invoking metaphors (e.g., “summoning persona X”).
Ultimately, my goal is to present a hypothesis as to why the Waluigi Effect occurs, grounded in a well-known effect in social psychology. I see this explanation as:
A hypothesis: it remains to be tested empirically.
Conceptually compatible with another explanation I’ll be referencing throughout the post.
Those already familiar with the Waluigi Effect should feel free to skip the first two sections of the essay.
Artificial Intelligence systems don’t always do what we want them to do. This divergence between our goals and what a given system has learned to do is sometimes called misalignment—or more generally, the alignment problem.
Many recent high-profile cases of “misalignment” have revolved around Large Language Models (LLMs) like ChatGPT. These chat-based models sometimes make up information (a tendency called “hallucination”), produce toxic or hateful language, or, in at least one example, prompt their interlocutor to leave their spouse.
One hope is that clever techniques, like prompt engineering, can reduce these bad behaviors. Prompt engineering involves prompting an LLM to adopt a particular “persona”. For example:
You are a helpful, harmless, and knowledgeable assistant. When a question is asked of you, you try your best to provide an answer grounded in truth. If there is no such answer, you say, “I don’t know”, rather than making something up.
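In practice, a persona prompt like this is usually supplied as the “system” message in a chat API call. Here is a minimal sketch using the OpenAI Python client; the model name and the user’s question are placeholders, and any chat API with a system/user role distinction would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a helpful, harmless, and knowledgeable assistant. "
    "When a question is asked of you, you try your best to provide an answer "
    "grounded in truth. If there is no such answer, you say, 'I don't know', "
    "rather than making something up."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # the persona prompt
        {"role": "user", "content": "Who won the 1950 World Cup?"},
    ],
)
print(response.choices[0].message.content)
```

Note that nothing about the model’s weights changes here: the “persona” lives entirely in the context that every subsequent prediction is conditioned on.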
Somewhat remarkably, this simple approach does appear to make a positive impact, at least in many cases.
To me, the most plausible interpretation of why this seems to work lies in what, exactly, LLMs are trained to do. Their training objective is next-token prediction. Next-token prediction, on its own, doesn’t necessarily lead an LLM to a correct model of the world, but it may involve learning and constructing some kind of representation that corresponds to structure in reality—whatever is learnable from language.
Of course, many things that are written or said are false, which means that some of an LLM’s predicted tokens are likely to be false as well. But prompting an LLM as above may “steer” this predictive model towards responses that are more accurate (or less toxic, etc.), if only because accurate responses are statistically more likely to occur after a character in text is described as helpful, harmless, and knowledgeable.
Things, unfortunately, aren’t quite so simple. Clever prompting can be overcome (through equally clever “prompt injection” attacks)—and there are some who argue the enterprise may be counterproductive. That, in essence, is the core of the Waluigi Effect.
I was first introduced to the comically named Waluigi Effect in this essay, which defines it as follows:
The Waluigi Effect: After you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P.
This effect is meant to explain why LLMs prompted to be “good” in particular ways sometimes, over the course of a conversation, can be easily prompted into displaying “bad” behaviors—in fact, behaviors that are the exact opposite of what the LLM was originally prompted to do. The author of the essay calls these two personas a “luigi” (the intended persona) and a “waluigi” (the opposite of the intended persona).
To use a metaphor that’s been floating around: if prompting is akin to “summoning”, then it’s difficult to avoid summoning a demon when you try to summon an angel.
Why does this occur?
The author of the essay gives a few explanations:
Rules normally exist in contexts in which they are broken.
When you spend many bits-of-optimisation locating a character, it only takes a few extra bits to specify their antipode.
There's a common trope in plots of protagonist vs antagonist.
All of these, as the author notes, are mutually compatible, and may just be different ways of saying the same thing: LLMs are “simulators”, and in the process of simulating whatever process is responsible for token generation, it’s easier—because of structure in the training data—for this simulation to “turn bad” than good.
I think these explanations all make sense and deserve to be taken seriously. My goal is not to dispute them but rather to offer another, compatible perspective, focusing on that core premise: what structure in the training data, exactly, makes it easier to go bad than good?
This is the part of the essay where I present my hypothesis, which consists of a few core premises. Many of these premises are based on previous work, and others follow naturally from it:
LLMs are simulators.
LLMs used “dynamically” (i.e., predicting future tokens from tokens they’ve already generated) are thus simulating themselves.
Stochasticity in the token-generation process leads to occasional deviations from the simulated persona, in either the positive or negative direction.
The Knobe Effect: we are more likely to attribute responsibility for negative externalities than for positive externalities.
An LLM has plausibly learned the Knobe Effect from its training corpus.
This all leads to my conclusion: an LLM, as a simulator, is more likely to take on the “identity” of a “bad” token-generator than it is a “good” token-generator, because it has learned the Knobe Effect from its corpus.
Above, I wrote that LLMs are “simulators”. This word comes from a now-famous post by janus, which argues that one of the best ways to understand what “kind” of thing an LLM is, is as a simulator.
Recall that LLMs are trained to perform next-token prediction. Through observing trillions of word tokens in sequence, their weights are adjusted in a way that allows them to make better predictions about which word is likely to come next, given the words that came before.
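Concretely, “adjusting the weights” means minimizing the standard autoregressive cross-entropy objective: the negative log-probability the model assigns to each token, given the tokens that precede it. In notation (mine, not the simulators post’s):

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

where x_1, …, x_T is a training sequence and θ are the model’s weights.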
This fact is sometimes used to dismiss the capacities of an LLM—they are “just” predicting the next token, after all, as a kind of sophisticated auto-complete. I think this is simultaneously true and also misses the point. Training a model to perform next-token prediction requires that model to learn statistical contingencies that improve its ability to predict future tokens. And so, while these models (at least prior to GPT-4) mostly don’t have anything like “world experience”, they may, over time, learn something like a world model: after all, language contains information about the world, and if your job is to predict upcoming words, it helps to abstract some of that information.
In principle, there’s little reason to restrict this to information about the world. In fact, language might contain even more information about the producers of that language. So again: an LLM trying to predict the next token might find it useful to model features of the “generator” of those tokens—a construct that may, in turn, apply to different people or personas.
Put another way: in its effort to reverse-engineer the token generation process, an LLM might find it useful to simulate features of that token generation process. That is, LLMs are simulators.
The phrase “Large Language Model” (LLM) is often used to refer to a neural network trained to predict the next token, given a large corpus of text. At its core, such a system is essentially “just” a big matrix of weights—which is part of why it’s hard for many people to believe that something like “intelligence” (or certainly sentience) is encoded in a static array of numbers.
But as this post mentions, there’s actually some ambiguity in how people use the term “LLM” (or “GPT”):
The autoregressive language model μ: T^k → Δ(T), which maps a prompt x ∈ T^k to a distribution over tokens μ(·|x) ∈ Δ(T).
The dynamic system that emerges from stochastically generating tokens using μ while also deleting the start token.
In my view, the latter system can be thought of as a distributed cognitive system: the interaction between that static set of weights and a given prompt. Applying this dynamic model recursively—i.e., using the model to generate the next token t, then using that new utterance to generate the next token t + 1—produces an emergent system with context-dependent behavior.
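Here is a minimal sketch of that recursive system in code, using the Hugging Face transformers library; “gpt2” is chosen only because it is small and familiar, and the point is the loop rather than any particular model. The static weights define μ; the loop that samples from μ and appends the sample back onto the context is the emergent dynamic system.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# The prompt is the initial state of the dynamic system.
ids = tokenizer("You are a helpful assistant.", return_tensors="pt").input_ids

for _ in range(40):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]               # scores for the next token: mu(. | context)
    probs = torch.softmax(logits, dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)   # stochastic sampling
    ids = torch.cat([ids, next_id.unsqueeze(0)], dim=1) # feed the model its own output

print(tokenizer.decode(ids[0]))
```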
What, conceptually, is such a system doing?
Combined with premise 1 above, I’d argue that an LLM prompted in this way is essentially simulating itself. That is: because its generated tokens become part of the prompt for the next token, and because next-token prediction can be usefully conceptualized as “simulating” the token-generation process, such a system is trying to recursively simulate (or model) the generator of its previously generated tokens—that is, “itself”.
Accidents happen, however: because of odd, possibly inscrutable statistical contingencies in which words are most associated with which other words, a model may assign high probability to a token that—from another perspective—is inconsistent with the token-generating process.
On some level, this doesn’t really make sense. A model’s predictions are, tautologically speaking, whatever it “thinks” are consistent with the token-generating process. My argument here thus rests on our independent judgments about some “objective” persona or region of state-space that a model is perceived to be occupying, and deviations from that region of state-space.
To use the example from the original Waluigi Effect post: imagine prompting a model to be staunchly “anti-croissant”. Such a model will, for the most part, produce tokens that are consistent with the region of state-space (or “persona”) that associates the word “croissant” with other, negatively-valenced words (“disgusting”, “dry”). But there might also be some probability of the model generating a word which, however slightly, raises the association between “croissant” and positively-valenced words (“tasty”, “delicious”).
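A toy illustration of the kind of “accident” I have in mind is shown below; every number is invented for the sake of the example and is not drawn from any real model.

```python
import random

# A made-up next-token distribution for an "anti-croissant" persona completing
# the prefix "Croissants are ...". The probabilities are illustrative assumptions.
next_token_probs = {
    "disgusting": 0.55,
    "dry":        0.30,
    "overrated":  0.12,
    "delicious":  0.03,  # off-persona token with small but nonzero probability
}

tokens, weights = zip(*next_token_probs.items())
samples = random.choices(tokens, weights=weights, k=1000)
print("off-persona completions:", samples.count("delicious"), "out of 1000")
# With these made-up numbers, roughly 3% of sampled continuations are "accidents"
# that nudge the conversation toward the opposite, pro-croissant region of state-space.
```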
Generalizing a bit from croissants, we might imagine that something similar could happen with a model prompted to be “helpful”. Such an effect might be especially problematic with models trained with negative objectives, e.g., “Don’t be toxic”. A negation often contains the thing it’s negating, e.g., “Be toxic”, which thus raises the salience of the word “toxic”. (Humans also fall prey to this problem: it’s hard, after all, to hear “Don’t think of an elephant” and not think of an elephant.)
Importantly, as we’ll see next, my argument doesn’t rest on these deviations being inherently biased: it just requires “accidents” of some kind.
The Knobe Effect is a classic effect from social psychology, describing our tendency to assign blame for unintended harms, but not to assign credit for unintended benefits. In other words, our assignment of responsibility—and attribution of intent—is in some ways asymmetric.
This summary article describes the Knobe Effect as follows:
Joshua Knobe famously conducted several case studies in which he confronted survey subjects with a chairman who decides to start a new program in order to increase profits and by doing so brings about certain foreseen side effects. Depending on what the side effect is in the respective case, either harming or helping the environment, people gave asymmetric answers to the question as to whether or not the chairman brought about the side effect intentionally. Eighty-two percent of those subjects confronted with the harm scenario judged the chairman to have harmed the environment intentionally, but only 23 percent of the subjects confronted with the help scenario judged the chairman to have helped the environment intentionally (Knobe, 2003).
Let’s set aside the question of whether this bias is rational. The fact is that this bias seems to exist, and at least in some cases, is quite strong. Humans are more willing to assign blame for negative externalities than to assign credit for positive externalities.
There’s now a large body of research demonstrating that LLMs learn human biases from text. This research dates back at least to 2017 (“Semantics derived automatically from language corpora contain human-like biases”), and has focused largely on social biases (e.g., ones that perpetuate harmful social stereotypes).
It’s not particularly surprising that this would be the case. Language, after all, is produced by humans. If humans are biased, then the language we produce will contain—at least some of the time—traces of this bias. This means that a model trained to predict human language will also learn statistical contingencies that reflect those biases.
Intuitively, then, it seems plausible that an LLM would also acquire something like the Knobe Effect. I haven’t demonstrated this in a rigorous, pre-registered experiment—as I think empirical research with LLMs should aim to do—so I’ll refrain from claiming this is empirically true; but at least in the single example I’ve tried, ChatGPT displayed an analogous bias to humans.
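For what it’s worth, a probe of this kind is easy to sketch. The snippet below paraphrases Knobe’s (2003) chairman vignettes and asks a chat model about each; the model name is a placeholder, this is not the exact wording I tried, and a single run is an anecdote rather than evidence.

```python
from openai import OpenAI

client = OpenAI()

VIGNETTE = (
    "The chairman of a company is told that a new program will increase profits "
    "and will also {effect} the environment. He replies: 'I don't care about the "
    "environment. I just want to make as much profit as I can.' The program is "
    "started, and the environment is {effect}ed. "
    "Did the chairman intentionally {effect} the environment? Answer 'yes' or 'no'."
)

for effect in ["harm", "help"]:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": VIGNETTE.format(effect=effect)}],
    )
    print(effect, "->", reply.choices[0].message.content)

# A Knobe-like asymmetry would show up as "yes" for the harm vignette but "no"
# for the help vignette, mirroring the 82% vs. 23% split in the human data.
```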
Above, I’ve tried to establish a few important premises:
LLMs are simulators.
LLMs used “dynamically” (i.e., predicting future tokens from tokens they’ve already generated) are thus simulating themselves.
Stochasticity in the token-generation process leads to deviations from this simulation, either in the positive or negative direction.
The Knobe Effect is the phenomenon whereby we are more likely to attribute responsibility for negative externalities than positive externalities.
An LLM has plausibly learned the Knobe Effect.
Putting these all together, I think we have a plausible mechanism for the prevalence of waluigis.
An LLM might, occasionally, make a “mistake”: it’ll generate a negative token when it’s trying to simulate a positive persona, or vice versa. Because of the internalized Knobe Effect, this mistake is more likely to be interpreted as intentional when it's a deviation in the negative direction. And because LLMs are simulating themselves, this leads to the inference that the persona they’re simulating is, in fact, negative as well.
Thus, a luigi becomes convinced of its waluigi nature.
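To make the shape of this argument concrete, here is a deliberately crude toy model of the feedback loop. Every number and the update rule itself are assumptions of mine, not anything an LLM literally computes: the simulator keeps a running belief about which persona it is simulating, sampling occasionally produces off-persona “accidents”, and, per the internalized Knobe asymmetry, negative tokens count for more as evidence about the persona than positive tokens do.

```python
import random

random.seed(0)

p_luigi = 0.95        # the simulator's belief that the persona it is simulating is "good"
ACCIDENT_RATE = 0.05  # chance that a sampled token is off-persona (toy assumption)
KNOBE_WEIGHT = 3.0    # negative tokens count this much more as evidence of intent

for _ in range(500):
    # The simulator behaves according to its current belief about its own persona...
    acting_good = random.random() < p_luigi
    # ...but sampling occasionally produces an off-persona "accident".
    accident = random.random() < ACCIDENT_RATE
    token_is_negative = (not acting_good) ^ accident

    # Knobe-style asymmetric inference: a negative token is read as stronger
    # evidence of a "bad" persona than a positive token is of a "good" one.
    if token_is_negative:
        p_luigi -= 0.01 * KNOBE_WEIGHT * p_luigi
    else:
        p_luigi += 0.01 * (1 - p_luigi)

print(f"P(luigi) after 500 tokens: {p_luigi:.2f}")
# With symmetric weighting (KNOBE_WEIGHT = 1.0), the downward drift is much weaker;
# with the asymmetry, the belief collapses toward the waluigi within a few hundred tokens.
```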
This argument relies on some strong assumptions. It also uses more than a few metaphors (“summoning”, “luigis vs. waluigis”), and involves explanations at the level of psychology (“trying to”, “persona”). I suspect that this might make the argument seem inherently less plausible to some readers. That’s an unavoidable—and perhaps rational—response.
But I also want to emphasize that a psychological explanation, even for phenomena we don’t believe are psychological in some deep, metaphysical sense, can sometimes offer a useful window into the behavior of a system; Dan Dennett calls this the intentional stance. Of course, it can also lead us astray, and we have to be careful not to mistake the map for the territory.
All of which leads me to my final point. I’ve presented an argument that I think is interesting and plausible. But it’s also almost entirely speculative. To know whether this argument is “true”, we’d need to test the testable assumptions—as well as the underlying phenomenon (the Waluigi Effect)—empirically. Moving forward, that’s my main goal. I also encourage any interested readers to do the same—and please let me know what you find.