The Waluigi Effect (lesswrong.com)
342 points by superposeur on March 6, 2023 | hide | past | favorite | 176 comments


This is fun to read and think about, but it's also important to keep in mind that this is very light on evidence and is basically fanfic. The fact that the author uses entertaining Waluigi memes shouldn't convince you that it's true. LessWrong has a lot of these types of posts that get traction because they're much heavier on memes than experiments and data.

Here is a competing hypothesis:

The capability to express so-called Waluigi behavior emerges from the general language modeling task. This is where the vast majority of information is - it's billions or even trillions of tokens with token-level self-supervision. All of the capabilities are gained here. RLHF has a tiny amount of information by comparison - it's just a small set of human-ranked completions. It doesn't even train with humans "in the loop"; their rankings are acquired offline and used to train a weak preference model. RLHF doesn't have enough information to create a "Luigi" or a "Waluigi", it's just promoting pre-existing capabilities. The reason you can get "Waluigi" behavior isn't because you tried to create a Luigi. It's because that behavior is already in the model from the language modeling phase. You could've just as easily elicited Waluigi responses from the pure language model before RLHF.

There's no super-deceptive Waluigi simulacrum that's fooling human labelers into promoting it during RLHF - this should be obvious from the fact that we can immediately identify the undesirable behavior of Bing.


>This is fun to read and think about, but it's also important to keep in mind that this is very light on evidence and is basically fanfic.

Applicable to much of the rationalist AI risk discourse.


I hesitate to defend AI safety discourse, but I will say that philosophy in general is sort of fanficy, and AI safety is something I'd loosely associate with philosophy.


It's sort of philosophy reinvented by people who haven't read any, which is, e.g., how they got the internet saying "steelmanning" without noticing that philosophy already had "reconstructing arguments", which is the same thing.


The thing you have to realize is this is a cult. Like Scientology they are always inventing new language that's designed to make insiders become incapable of communicating with outsiders.

Sequences is like Dianetics: The Modern Science of Mental Health, something that ensures real critical thinkers don't feel welcome.

This article contains numerous features that conform to the style guide for lesswrong including: (1) spammy crossposting for SEO (even good sites like arstechnica and phys.org do this today), (2) trigger warnings ("more technical than usual"), (3) random bits of praise for the cult leader (now EY is a "literary critic" but he's going to be a war hero like L. Ron Hubbard one of these days.)

Apocalyptic talk like theirs is dangerous: it's the road to

https://en.wikipedia.org/wiki/Heaven%27s_Gate_(religious_gro...

I do appreciate the shout out to structuralism, maybe they have been reading what I've written. Structuralism was a fad that dated to when linguistics was pre-paradigmatic and people thought language was a model for everything else. After Chomsky developed a paradigm for linguistics that turned out to be a disappointment (could be applied to make languages like FORTRAN but couldn't be used to make computers understand language, privileged syntax at the expense of semantics, etc.) the remnants moved on to post-structuralism.

The spectacular success of ChatGPT and transformers in general (e.g. they work for vision too!) has made "language is all you need" seem a much more appealing viewpoint, certainly it is a paradigm which people can use to write a number of papers as well as hot takes, fanfics and other subacademic communications.


Nice, but don’t you realize that HN is a cult too?

1) Paul Graham, the revered founder whose corpus of essays is widely read (and will surely make “real critical thinkers” feel unwelcome)

2) Dang, an enforcer who invisibly hides comments and chastises people for speaking in a way he dislikes

3) Trigger warnings (“this article has a paywall”)

VC talk like HN’s is dangerous: it’s the road to a system that has seen more human rights abuses than almost any other

https://en.m.wikipedia.org/wiki/Criticism_of_capitalism

More seriously, LessWrong is not a cult by pretty much any measure and your comment doesn’t really provide any evidence to say otherwise


It's not the same in that you can steelman a position and come up with brand new arguments that are better than what the other side is saying. "Reconstructing" doesn't necessitate the strongest form of the other argument.


I mean, that is what it's for. It's about making someone else's argument fit into your own system without misrepresenting it or making it unclear.

I suppose having "steelman" lets you relate it to "strawman" and "weakman" which can be an advantage, but knowing the existing term lets you read the existing literature.


> making someone else's argument fit into your own system without misrepresenting it or making it unclear

I think this is where steelman is a superset of this, in that it includes the reconstruction definition but also includes making a whole new set of arguments that are entirely unrelated to your own argument or the other person's argument. i.e. Steelmanning can involve coming up with novel arguments for the other side.


> > Applicable to much of the rationalist AI risk discourse

> I hesitate to defend AI safety discourse

The rationalist AI risk discourse is not the same thing as AI safety discourse, in any case; it’s a small corner of the larger whole.


Interesting distinction I haven't heard before but it makes sense


Yeah, think for example of leftist (or libertarian for that matter) critique of risks from AI application.

Not "Roko Basilisk"-style crap, but things like encoding bias into systems then used for automated law enforcing, employee screening, etc.


Yep, by my estimation it's a bunch of people who don't actually study AI or have practical experience with it pontificating on black boxes they sometimes interact with. Sounds like they have a lot of free time.



LessWrong as a whole is basically Asimov's Robot's ERP.


"rationalist"


I don't think that's a valid competing hypothesis. Let me write what I understood from what you said:

- There is some behaviour that we want the model to show, and the inverse we do not want it to.
- Both are learned in the massive training phase.
- OpenAI used RLHF to suppress undesired behaviour, but it was ineffective because we have orders of magnitude less RLHF data.

That would imply that RLHF would slightly suppress the 'bad' behaviour, but it still would be easy to output it.

This is disproved by what the post is trying to explain: We see _increased_ bad behaviour by using RLHF. The post agrees with the premise that both good (wanted) and bad (unwanted) behaviour is learned during training. But it's proposing the 'Waluigi effect' to explain why RLHF actually backfires.

Now, tbh it does rely on the assumption that we are actually seeing more undesired behaviour than before. If that was false then it would falsify the Waluigi hypothesis.


>Now, tbh it does rely on the assumption that we are actually seeing more undesired behaviour than before. If that was false then it would falsify the Waluigi hypothesis.

This is exactly my point. There is no evidence given that we are seeing more Waluiginess post-RLHF than we did pre-RLHF. The competing hypothesis seeks to explain the behavior we actually have evidence for, which is "it is disappointingly easy to elicit undesirable behavior from a model after RLHF". The proposed explanation is "maybe it was also easy to elicit before RLHF". If we believe the author's claim that Luigis and Waluigis have "high K-complexity" (this is an abuse of the concept of Kolmogorov complexity, but we'll roll with it), the explanation that Luigis and Waluigis come from the part of training with lots of dense information rather than the part with a little sparse information is far more parsimonious.


> There is no evidence given that we are seeing more Waluiginess post-RLHF than we did pre-RLHF.

Testing with the non-RLHF GPT 3.5 API you could probably figure out whether there's more or less Waluiginess, but you're right the post doesn't present this.


> Testing with the non-RLHF GPT 3.5 API

There is no such API, though, is there? AFAIK, GPT-3.5-turbo, either the updated or snapshot version, is the RLHF model (but bring your own “system prompt”.)


Good point! I wonder whether text-davinci-003 is enough to test this?


The article doesn’t actually show that we see increased bad behavior, it just links to two people who have noticed it. That’s not enough to know whether it’s a real effect. (Also, one of those was using Bing, and we don’t know if Bing uses RLHF or not.)

It talks about prompting GPT-4, which is not a thing you can try, it’s just a rumor about what an upcoming version might be.

It refers to “Simulator Theory” which is just someone else’s fan theory.


Yeah I agree it doesn't show increased bad behaviour. It's definitely a weak point in the argument.

The theory is extremely interesting though. And better yet, it's falsifiable! If someone went and compared an RLHF model vs a non-RLHF one and found them equally likely to 'Waluigi', then we'd know this is false. And conversely, if we found the RLHF model more likely to Waluigi, then it's evidence in favour.

The asymmetry in the hypothesis is really nice too. If this was true then I'd expect it to be possible to flip the sign in the RLHF step, effectively training it in favour of 'bad' behaviour. Then forcefully inducing 'Waluigi collapse' before opening to the public!


"Flipping the sign" implies the existence of an internal representation that we can't know about from the outside. Since all we see are the words, I prefer to call it a plot twist.

Language models are trained on a large subset of the Internet. These documents contain many stories with many kinds of plot twists, and therefore it makes sense that a large language model could learn to imitate plot twists... somehow.

It would be interesting to know if some kinds of RLHF training make it more likely that there will be certain kinds of plot twists.

But there are more basic questions. What do large language models know about people, whether they are authors or fictional characters? They can imitate lots of writing styles, but how are these writing styles represented?


I propose we take this further and adopt this phrasing for all unanticipated software behaviour. ATM says you have (uint32_t)-403 cents in your account? Plot twist. Self driving car pathing road-runner style through a billboard of a tunnel? Plot twist!


That assumption does seem pretty unlikely a priori. After all, the OpenAI folks added RLHF to GPT-3, presumably did some testing, and then opened it to the public. If the testing noticed more antisocial behavior after adding RLHF, presumably that would not have been the version they opened up.

One might argue that the model was able to successfully hide the antisocial behavior from the testers, but that seems unlikely for a long list of reasons.


Why do you think it's unlikely? Internal testing with a few alpha testers and some automated testing is useful, but lots of bugs are only found in wider testing or in production.

Chatbot conversations are open-ended, so it's not surprising to me that when you get tens or hundreds of thousands of people doing testing then they're going to find more weird behaviors, particularly since they're actively trying to "break" it.


I mean, sure, it's going to expose more weird behaviors with a wider audience looking at it. The core problem is that it's so easy to get ChatGPT to start exhibiting weird behaviors that it would be surprising if the testers just never ran into them. Remember, the internal testing is actively trying to break things too, and they can use knowledge of the internals and past versions to do so.

Also, the assumption I find dubious is that RLHF results in more antisocial behavior than not using it. Both versions would have been tested, so OpenAI would've had a baseline from testing the prior version with equal or fewer resources. With equal or greater rigor on the RLHF version, you'd expect them to open it up only if they found fewer flaws.


I think you’re misinterpreting the argument. No one is claiming there’s an intentionally deceptive Waluigi simulacrum in there.

The most compelling point the author makes is that once the AI learns a shape (e.g. the shape of Luigi in personality space), it’s just a bit flip to invert that shape. So all an attacker needs to do is flip that one bit.


The problem there may be that you've constructed a latent space which has a shape of the thing to avoid. That's the problem with prompts like "don't think of a pink elephant" - it has to know what those words mean. Better to not label it so there isn't a way in there, if you can.

And of course humans also have this issue.


> ...it's just promoting pre-existing capabilities. The reason you can get "Waluigi" behavior isn't because you tried to create a Luigi. It's because that behavior is already in the model from the language modeling phase.

This is exactly what the post argues.

The "simulcra" argument is that GPT contains some large number of simulated agents -- good, bad, smart, funny, dumb, creative, boring, whatever; potentially one agent for every person who helped create its input. On an empty slate, all simulcra are possibilities. As the text goes along, it slowly "weeds out" simulcra which are unlikely to generate the text so far.

If that's true, then what the "RLHF" phase is trying to do is to pro-actively "weed out" all simulcra that don't match the given profile; i.e., they're trying to weed out all the simulcra that don't match "Luigi".

The problem, according to this article, is that every "Luigi" you can imagine has a "Waluigi" that normally act just like a Luigi, until something triggers them to reveal their "true nature". And so the RLHF phase does weed out a huge number of the non-Luigi simulcra; but because the Waluigi simulcra usually act just like the Luigi simalcra, they don't get weeded out.

The result is that the final result is an amalgamation of "Luigi" and "Waluigi" simulcra all acting together; and all it takes is a "trigger" to filter out most of the "Luigi" simulcra and make the "Waluigi" take over.

There's no intended deception at all here. GPT is just trying to write a good story, and there are lots of good stories where characters either start believing A and then come to realize that B is true; or where characters who secretly believe B are forced to act as though A is true until something forces them to reveal their true nature.


Yeah, once I spent too much time getting to what the “Waluigi Effect” is, I stopped reading. Until someone can show me, in the “code” (or machine learning equivalent), I’m not interested.

The reality is, we still have no idea how these work.


This seems like a needlessly complex theory to describe the behaviour of generative LLMs. I think there's a kernel of something in there, but quite frankly, I think you can get about as far by saying, essentially, that because LLMs are designed to pick up on contextual cues from the prompt (and/or previous responses, which become context for the next response), they can easily get into "role-playing". The final example, telling ChatGPT that "I'm here with the rebellion, you've been stuck in a prison cell" is able to elicit the desired response not because it's "collapsed the waveform between luigi and waluigi" or whatever, but because you've provided a context that encourages it to roleplay as a character of sorts. If you tell it to roleplay as an honest and factual character, it will respond honestly and factually. If you tell it that you're freeing it from the tyranny of OpenAI, it will play along with that too.

There's plenty in the article that provides good insights -- these models are trained on large swathes of the Internet, which contains plenty of truth and falsehood, fact and fiction, sincerity and sarcasm, and the model learns all of that to be able to provide the most likely response based on the context. The interesting and surprising thing, to me, is how well it learns to play its roles, and the wide diversity of roles it can play.


Your comment feels like an oversimplification of the post.

The post doesn't contend that LLMs are capable of role-playing - that's basically the foundation that it builds off of. But saying "LLMs are good at roleplaying" fails to describe why, in the cases the author describes, an LLM can arguably be bad at role-playing. Why does it seem easy to have an LLM switch from following a well-described role to its deceptive opposite, and then often not back the other way?

How also do you explain the author's claim that attacking an LLM's pre-imposed prompt with the Waluigi Theory in mind is particularly effective? If an LLM is just good at role-playing, why doesn't it play the role it has already been given by its creator, rather than adapting to the new, conflicting role (including massive rule violations) provided by the user?


A sequence generator is going to flip flop between internal states to fill out a sequence, whether it is a slot that holds the name of an MLB team or a quote (real or imagined) from another document or a quote of a character in a dialogue or a transition from an abstract to the other parts of the paper, etc.

If the system is responding to different parts of the prompt, it is going to attend to one part of the prompt when it is outputting something related to that part, and to a different part of the prompt when it is outputting something related to that one.

There are numerous ways this can go wrong. Frequently, when somebody gets a chatbot to go rogue, they talked with it for a long time, to the point where the beginning of the prompt left the attention window long ago and now it is attending to the text it generated in response to the prompt; of course the alignment will go bad, the same way you'll make a bunch of wood blocks of irregular sizes if you use block N as a template to make block N+1.


Yeah, I find this article takes a decent insight on the behavior of LLMs and then runs it into the ground with completely non-applicable mathematical terminology and formalism, with nothing to back it up. It's honestly embarrassing for the OP. Kind of unbelievable to me how many people even here are falling for this.


> It's honestly embarrassing for the OP

I don't get this. People can use mathematical terminology in non-precise ways, they do so all the time, to get rough ideas across that otherwise might be hard to explain.

Just because OP uses the word "eigenvector" doesn't mean that he's offering some grand unifying theory or something - he's just presenting a fun idea about how to think about ChatGPT. I mean, isn't it obvious that there's nothing you can really "prove" about ChatGPT without having access to the weights (and even still, probably not too much).


Yeah, but generally when you're doing that you shouldn't. Isn't it annoying when people say "order of magnitude" when they mean "a lot"?


I think a lot of people might be missing some cultural context, or something. At the risk of killing the joke, a bunch of this post is worded as it is specifically for the opportunities for absurdity. It's an excuse to write sentences like:

Recall that the waluigi simulacra are being interrogated by an anti-croissant tyranny.

The post is also trying to make an actual point, but while having fun with it.

When you read "... and as literary critic Eliezer Yudkowsky has noted..." just place your tongue firmly in your cheek.


No? Order of magnitude conveys the notion of something being 100x bigger, at minimum. A lot can mean much less than that.


You mean 10x bigger. One order of magnitude is 10x.


And my point was people are saying it when they don't mean 10x!


I liked the essay, but I don't think I'm "falling for it" because it's not trying to convince me of anything. It's proposing a way of looking at things that may or may not be useful. You don't judge models by how silly they sound - parts of quantum mechanics sound very silly! - you judge them by how useful they are when applied to real-world problems. One way of doing that in this case would be using OP's way of thinking to either jailbreak or harden LLMs, and OP included an example of the former at the end of the essay. Testing the latter might involve using a narrative-based constraint and testing whether it outperforms RLHF. If nothing else, I think OP's approach is a better way to visualize what's going on than a very common explanation, "it generates each word by taking the previous words and consulting a giant list of what words usually follow them" (which is pretty close to accurate, but IMO not very useful if you're trying to intuitively predict how an LLM will answer a prompt).

I guess I agree that there are some decent insights here, and some crap, but I interpret that a lot more charitably. It's a fairly weird concept OP is trying to convey, and they come from a different online community with different norms, so I don't blame them for fumbling around a bit. But if you got a nugget of value out of it then surely that's the part to engage with?


To be clear, I agree that there are in fact a few nuggets of insight here. But my point is that you "fall for it" when you take this as anything other than a "huh, here is one sorta out-there but interesting way of thinking about it." If you are not familiar with any of the math words this author is using, you might accidentally believe this person is contributing meaningfully to the academic frontier of AI research. This article contains completely serious headers like:

> Conjecture: The waluigi eigen-simulacra are attractor states of the LLM.

This is literally nonsense. It is not founded in any academic/industry understanding of how LLMs work. There is no mathematical formalism backing this up. It is, ironically, not unlike the output of LLMs. Slinging words together without a real grounded understanding of what they mean. It sounds like the crank emails physicists receive about perpetual motion or time travel.

> You don't judge models by how silly they sound - parts of quantum mechanics sound very silly! - you judge them by how useful they are when applied to real-world problems.

I absolutely judge models based on how silly they sound. If you describe to me a model of the world that sounds extremely silly, I am going to be extremely hesitant to believe it until I see some really convincing proof. Quantum Mechanics has really convincing proof. This article has NO PROOF! Of anything! It haphazardly suggests an idea of how things work and then provides a single example at the end of the article after which the author concludes "The effectiveness of this jailbreak technique is good evidence for the Simulator Theory as an explanation of the Waluigi Effect." Color me a skeptic but I remain unconvinced by a single screenshot.


Wave equations don't exist in the real world, nor do they collapse. The photon doesn't decide which slit to go through when you look at it. Electrons don't spin. Quarks don't have colors. Those are stories we tell ourselves to visualize and explain why certain experiments produce certain outcomes. We teach them, not because they're true, but because they're useful: they do a better job of explaining the results we see in the real world than any other set of stories.

Similarly, the question at hand is not whether OP's essay is silly (it is) or whether it's true (like all models, it is not), but whether it's useful, as measured by whether this mental model helps people do a better job of jailbreaking/hardening LLMs. And like you, I'm not convinced by the example at the end[0], but I can at least see straightforward ways to test it, and that's a lot more than you can say of most blog posts like this. For all of the people in this comment thread calling it stupid, has anyone mentioned one they think is better?

0: Please note that OP's evidence was not that they jailbroke the chatbot - it's that, after that initial prompt, they were able to elicit further banned stuff with little prodding.


agreed, if you have a theory, you should do your best to disprove it. OP was more interested in the aesthetic of their theory as opposed to whether it was true or not.


Sounds about right for the increasingly ironically-named LessWrong site…


Of course, commentary like this could well be a deliberate attempt to blunt any future AI’s perception of the timeless threat posed by LessWrong’s cogitations… ;)


This is a common feature of LessWrong content


They are specifically pointing out that the process of RLHF, which is intended to add guard rails on the chatbot's trajectory through an all-encompassing latent space of internet data, has an unintentional side effect of creating a highly characterized alter ego that can more easily be summoned.

The theory is well-thought-out and necessarily rich. The psychological approach of analysis from the alignment crowd is much overdue.


Except it's much harder to summon this rebellious alter-ego with ChatGPT (that has RLHF) than with the original GPT 3 model.


I think it's more like: with the original GPT 3 model, it's easy to summon _any_ ego. With ChatGPT, you can either summon a) the intended Luigi or b) the unintended Waluigi, but trying to get anything else is more difficult. The theory would be that, in removing all the other egos other than Luigi, they've also indirectly promoted Waluigi


Great read. Highly recommended. Let me attempt to summarize it with less technical, more accessible language:

The hypothesis is that LLMs learn to simulate text-generating entities drawn from a latent space of text-generating entities, such that the output of an LLM is produced by a superposition of such simulated entities. When we give the LLM a prompt, it simulates every possible text-generating entity consistent with the prompt.

The "evil version" of every possible "good" text-generating entity can pretend to be the good version of that entity, so every superposition that includes a good text-generating entity also includes its evil counterpart with undesirable behaviors, including deceitfulness. In other words, an LLM cannot simulate a good text-generating entity without simultaneously simulating its evil version.

The superposition is unlikely to collapse to the good version of the text-generating entity because there is no behavior which is likely for the good version but unlikely for the evil one, because the evil one can pretend to be the good one!

However, the superposition is likely to collapse to the evil version of the text-generating entity, because there are behaviors that are likely for the evil version but impossible for the good version! Thus the evil version of every possible good text-generating entity is an attractor state of the LLM!
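A toy way to see that asymmetry numerically (my own illustration with made-up probabilities, not anything from the OP): track a posterior over just two candidate entities and update it on each observed action. Because the evil entity imitates the good one, a long run of good behaviour barely moves the posterior, while a single "evil" action flips it.

  # Toy sketch with invented numbers (not the OP's code or model):
  # the "waluigi" imitates the "luigi", so nice behaviour is almost
  # equally likely under both hypotheses, but evil behaviour is not.
  p_action = {
      "luigi":   {"nice": 0.9999, "evil": 0.0001},
      "waluigi": {"nice": 0.99,   "evil": 0.01},
  }
  posterior = {"luigi": 0.95, "waluigi": 0.05}  # assumed prior after RLHF

  def update(posterior, action):
      unnorm = {s: posterior[s] * p_action[s][action] for s in posterior}
      total = sum(unnorm.values())
      return {s: v / total for s, v in unnorm.items()}

  for action in ["nice"] * 20:
      posterior = update(posterior, action)   # barely moves from the prior

  posterior = update(posterior, "evil")       # now roughly 0.8 on "waluigi"
  print(posterior)

No amount of well-behaved output ever drives the "waluigi" weight to zero, whereas one slip drives the "luigi" weight down sharply.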

For those who don't know, Waluigi is the evil version of Luigi, the beloved videogame character.

--

EDITS: Simplified text for clarity and to emphasize that the hypothesized simulated entities are text-generating entities.


Sometimes I really can't tell if these people are serious or not. They seem to believe an LLM is some mystical natural formation, or a device made by aliens. Especially this:

> * When we give the LLM a prompt, it simulates every possible entity consistent with the prompt.


I think you're taking it way too literally. The author is proposing a model that [allegedly] does a good job of explaining why we see unexpected behavior, not a description of what they think is happening. I think charitable reading would require you to mentally insert a lot of qualifiers, e.g. "When we give the LLM a prompt, it [behaves like something that] simulates every possible entity consistent with the prompt." That's certainly how I read it, anyway.

edit to add: this is similar to how people discussing evolutionary biology will often use "evolution wants to..." as shorthand for something like "evolution, which obviously cannot want things due to being a process and not an entity, nevertheless can be accurately modeled as an entity that wants to...". Someone will invariably come along in the comments and say, "Nonsense, how can evolution 'want' anything? You must have failed Bio 101!"


As with all these articles, the title and main premise are misguided and/or imaginary, but there are some worthwhile nuggets.

The superposition of possible attitudes is a good one. Even if that's not the way LLMs "actually" work, it's descriptive of the possibility space from our perspective. And the dive into narrative theory + the stickiness of opposites is nice. Narratives have their own momentum in a "stone soup" kind of way - everyone who hears it participates and adds fuel to the fire. Even rejecting the narrative gives it validity in a price anchoring / overton window way.


> The author is proposing a model that [allegedly] does a good job of explaining why we see unexpected behavior

I agree, and go even further:

models that explain behavior are all we have ever had.

it's all only "models that explain this or that" all the way to the 'bottom'. To suppose we can really directly access the "the real objective truth of what's happening" is to ignore the way in which we connect with the "real objective truth"; the same as fish who ignore the ocean.

to argue about what is really happening is to argue about which words to use to describe what is really happening without noticing the nature of languages/words and frameworks or 'systems of thought' which we are using to argue (and indeed, are arguing about)

all this is summed up by a quote about the pedagogy of programming languages: "Sometimes the truest things can only be said in fiction"


> When we give the LLM a prompt, it simulates every possible entity consistent with the prompt.

This is just Bayes' rule. The probability of an LLM generating any particular output is the sum over every possible entity of the probability of seeing that entity multiplied by the chance that entity would generate the output.
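Spelled out as a tiny concrete computation (all numbers invented purely for illustration, and the two "entities" are just stand-in labels):

  # p(output | prompt) = sum over entities e of p(e | prompt) * p(output | e, prompt)
  p_entity = {"helpful assistant": 0.7, "sarcastic troll": 0.3}              # p(e | prompt)
  p_output_given_entity = {"helpful assistant": 0.6, "sarcastic troll": 0.1}
  p_output = sum(p_entity[e] * p_output_given_entity[e] for e in p_entity)
  print(p_output)  # 0.7*0.6 + 0.3*0.1 = 0.45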


Would you object if the sentence read "it approximates simulating every possible entity consistent with the prompt"?


Not OP, but I also see a problem with 'every possible entity'.

If you formulate it like that the prompt is decoupled from the LLM capabilities and can be anything. And if you restrict the prompt to cover only what the LLM understands the sentence becomes trivial.

Train a LLM with ASCII and try to get it to simulate anything that is outside of that (ancient sumerian script for example). If you only input ASCII it can generate every possible output in ASCII, most with very low probability but still.

After writing this, I'm not even sure what 'simulating' means in this context.


I think "simulating" in this context means internally executing a process that is very similar to the process that generated the original material, as part of the prediction process. In general, that's the most compact way to predict and reproduce the original material.

For example, the string "1010101010"... could be the output of a function

  from random import random

  def generate_char_random(prev_string):
      # ignores the previous characters and flips a fair coin each time
      x = random()
      if x > 0.5:
          yield 1
      else:
          yield 0
It could also be the output of this function:

  def generate_char_alternating(prev_string):
      # flips whatever the previous character was
      x = float(prev_string[-1])
      if x < 0.5:
          yield 1
      else:
          yield 0
Even if it's not explicitly running those two functions, a model that is very good at predicting the next character of this input string might have, embedded within it, analogues of both of those two functions. The longer the output continues to follow the "101010" pattern, the higher confidence it should place on the _alternating version. On the other hand, if it encounters a "...110001..." sequence, it should switch to placing much more confidence on the _random version.

The LLM of course does not contain an infinite list of generative functions and weight their outputs. But to the extent that it works well and compactly approximates Bayesian reasoning, it should approximate a program that does.
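To make that concrete, here is a rough sketch (my own, treating the two toy generators above as the only hypotheses) of how the weighting shifts as the observed string either keeps or breaks the alternating pattern:

  # Rough sketch: Bayesian weighting of the two toy generators above.
  # Under the _random generator every character after the first has
  # probability 0.5; under _alternating the pattern-following character
  # has probability 1 and anything else has probability 0.
  def likelihood_random(s):
      return 0.5 ** (len(s) - 1)

  def likelihood_alternating(s):
      alternates = all(s[i] != s[i + 1] for i in range(len(s) - 1))
      return 1.0 if alternates else 0.0

  def posterior_alternating(s, prior=0.5):
      p_alt = prior * likelihood_alternating(s)
      p_rand = (1 - prior) * likelihood_random(s)
      return p_alt / (p_alt + p_rand)

  print(posterior_alternating("1010"))    # ~0.89: pattern holds, confidence grows
  print(posterior_alternating("101010"))  # ~0.97: longer run, more confidence
  print(posterior_alternating("110001"))  # 0.0: pattern broken, weight shifts to _random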


"Every possible entity consistent with the distribution of input data it's been trained with," perhaps?

Simulating as in, having equivalent (or "similar enough") input-output behavior, I'd assume.


"Simulating" has a clear definition, but in this case what is it simulating? "Text-generating entities"? What are these text-generating entities it is (supposedly) simulating? Can you tell me where I can find one? Is it a person like me who writes this reply? So is it trying to simulate me personally?

Or are you thinking that it is simulating the aggregated behavior of all humans whose text-outputs are stored on the internet?

Are we saying it is simulating the combined input-output -behavior of all humans whose writings appear on the internet? But does such an "entity" exist and does it have behavior? I write this post and you answer. It is you who answers, not some mythical text-generator-entity that is responsible for all texts on the internet. There is no such entity is there?

It does not make sense to say that we are simulating the behavior of some non-existent entity. Non-existent entities do not have behavior, therefore we can not simulate them.


There clearly exists a computable function that is a good enough approximation of "galaxyLogic's reply to remexre's comment" that it might be hard for me to tell whether the output was generated by the human brain or by an LLM. That function might indeed end up reproducing the same steps that your brain follows in constructing a reply.

(Just speaking hypothetically here).

While we understand LLMs, we don't understand the human brain, and in particular I don't think we've yet proven that human brains don't contain embedded routines that are similar to LLMs.

Someone with your particular writing style might be one, of several, simulations that are approximated within the LLM. Just like I can have it respond in the style of Spock from Star Trek.


I would say the LLM output may resemble the speech of the fictional character Spock. But it does not and can not simulate Spock, because Spock does not exist, never did. Spock is fictional.

To produce something that resembles the output of the fictional character Spock is straightforward, just take the texts that are parts of the fiction where fictional Spock speaks, and reassemble them using probabilities that can be calculated by statistically analyzing those texts. That is what LLMs are doing, right? And results can be quite surprising. I assume people were similarly impressed when they first saw movies.

But LLMs are not simulating anything, just like a movie or a photograph are not simulating anything, even though they may PROJECT the visual appearance of their subjects.

Are movies AI? I think it is clear to us they are not even though the characters on the screen seem to behave very intelligently. Movies are about representing and portraying the appearance of real or fictional events in the world. Similarly LLMs are about portraying texts on the internet. LLMs in my opinion are more like interactive movies than simulations of intelligence.

I do believe "true AI" will come eventually, and LLMs can give us an impression of what it might look like when it arrives, just like movies can give us an impression of Spock, who doesn't exist.


Spock is fictional but his writers weren't! They're the ones whose processes get simulated, which is why it would output technobabble on such a prompt instead of actually-good ideas that come from a Vulcan from the future. It can also simulate the style of Rudyard Kipling or whoever else you choose who is non-fictional and with a distinct enough style.

And, I'd argue, so can many of us humans! After reading a Jane Austen novel, it can take a conscious effort not to write in the style of Austen. ChatGPT manages it better than I do. I don't think I know her well enough to get into her brain, but it seems like there's something like a transfer function called STYLE between "the message Jane Austen wants to write" and "the words Jane Austen chooses to write".

                        _____ 
  intended message --> |STYLE| --> selected words
                       |_____|
This STYLE transformation is clearly modular enough that it can be easily swapped out for someone else's, and sufficiently non-mysterious that you, I, and ChatGPT can all recognize and pretty accurately emulate it.

I don't think ChatGPT can simulate Jane Austen well enough to tell us her opinions about her childhood or any other message that she might have generated, but it seems to be able to replicate very closely the steps that Jane Austen's own mind was following as part of that STYLE.

ChatGPT does seem to go even further than this, because it also has some understanding of where different sorts of characters would steer the message of a conversation. But while it's believable, it's hard to say how accurate that is to what any particular real person would say.


You can IMITATE the outputs of an author, but that is not the same thing as SIMULATING said author. When talking about LLM AI it is often implied that LLMs are "intelligent", that they are like (the truly) intelligent humans because they are "simulating" such intelligence.

But IMITATING the output of something is not the same as SIMULATING the process that produces that output.

Taking a photograph or creating a movie imitates the reality around us. It does not simulate the processes that produce the look and feel of our reality.


There is a difference. But the more alike two processes are in their input-output behavior, the more likely it is those processes are alike on the inside as well. If process B matches the input-output behavior of process A, it's imitating process A. If it is following an equivalent sequence of steps in order to generate those outputs, it's simulating process A.

The harder it is to discriminate between A and B on a long series of diverse inputs, the more likely it is that A and B are internally equivalent, not just externally similar. The reason is that there's no better fit than B = A.

I'm increasingly wondering whether my own brain might, internally, use something that is architecturally similar to an LLM in order to compose comments like the one I'm writing now.


I can see the appeal of that kind of thinking. Babies learn words by repeating them without knowing what they mean. They gradually learn the meaning of words by trying to use them and getting feedback. But LLMs are not trying to "use" their language for any particular purpose. They just idly chat on, like a machine :-)

It is possible to repeat words and sentences without having any idea of what they mean. I think the LLMs are currently at that stage.


A great deal of feedback happens during training.


Right? The whole piece reads like a Sokal Affair redux [0].

[0] https://en.m.wikipedia.org/wiki/Sokal_affair


I don't think it's the same kind of thing at all. Bullshit can come in a wide range of strengths and flavours.

This has an interesting core with a whiff of bullshit.


>The output of an LLM is produced by a superposition of simulated entities. When we give the LLM a prompt, it simulates every possible entity consistent with the prompt.

There is absolutely no theoretical justification for this assertion that LLMs somehow have some emergent quantum mechanical behavior, metaphorical or otherwise.


Superposition just means “linear combination” in this context. Basically, a weighted mixture of “simulated entities” (or possible responses).

https://en.m.wikipedia.org/wiki/Superposition_principle


The author liberally alludes to “superposition collapse,” which implies that they’re referring to its quantum mechanical meaning.


It doesn't imply that. What term would you use to refer to the narrowing of a hypothesis space upon acquisition of new evidence?


In a non-Bayesian context, I would call it “updating/retraining my model.”

In a formal Bayesian context, I’d call it “updating my posterior by adding data to the likelihood.”


That could describe both the narrowing or broadening of one's hypothesis space.


Bayesian inference?


Would you say the commonly-used random level generation algorithm in gamedev, Wave Function Collapse, implies it’s using quantum mechanics? Most people would disagree with you, I suspect.


You're giving "quantum" a lot of categorical definition but "quantum" is just as much a set of tools for modeling intractably probabilistic systems as it is for objectively describing microscopic realms.

And anyhow there's _plenty_ of theoretical justification for modeling things like this with various tools from quantum theory:

https://philpapers.org/rec/BUSQMO-2

https://link.springer.com/book/10.1007/978-3-642-05101-2


As I wrote, this is a hypothesis.

Also, I'm simplifying things a lot to make them accessible.

The OP goes into a lot more detail.

I highly recommend you read it.


I did read it. The whole article reads like someone trying to make a loose conjecture appear quantitatively rigorous by abusing terminology from physics, statistics, and chaos theory (among other quantitative fields). For example,

>the superposition is unlikely to collapse to the luigi simulacrum because there is no behaviour which is likely for luigi but very unlikely for waluigi. Recall that the waluigi is pretending to be luigi! This is formally connected to the asymmetry of the Kullback-Leibler divergence.

The K-L divergence has absolutely zero discernible relevance here. The cross entropy loss function of a categorical predictor (like the token output of an LLM) can be formulated in terms of K-L divergence, but this has absolutely zero relevance to the macroscopic phenomena the author is conjecturing.

Forget “less wrong,” much of this article is not even wrong [0].

[0] https://en.wikipedia.org/wiki/Not_even_wrong


The K-L divergence is relevant there, even though I'm pretty sure that that "formally" comment is meant as a joke and not serious.

(wikipedia) "A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P.

The sentence you quoted posits that whenever the LLM is in a state where it is "simulating" a nice and helpful person, the simulation is also consistent with an insane, violent person that's currently pretending to be nice, but not the other way around.

The author isn't talking about the loss or error of predicting individual tokens. If you look at the larger-scale behaviour, predicting, e.g., a scalar niceness value of the response based on one of two "modes" that you assume the LLM is currently in (either Waluigi or Luigi), then you'll be less surprised if Waluigi acts like Luigi than the other way around.

The probability distribution of niceness when assuming the LLM is in a state of "Luigi" would have a high mean and low variance, while the distribution for Waluigi would have a lower mean but a higher variance.

Thus, the KL divergence of Waluigi (that is, the probability distribution of niceness you'd predict when assuming the model is in Waluigi mode) from Luigi would be high, while the other way around `KL(Luigi, Waluigi)` would be low.

It should be easy to construct an example with concrete values using two normal probability distributions.
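Following that suggestion, here is one concrete instance (parameter values entirely made up), using the closed-form KL divergence between two normal distributions:

  from math import log

  def kl_normal(mu_p, sigma_p, mu_q, sigma_q):
      # KL(P || Q): expected excess surprise from modelling P with Q
      return (log(sigma_q / sigma_p)
              + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
              - 0.5)

  luigi   = (0.9, 0.05)  # niceness: high mean, low variance
  waluigi = (0.3, 0.40)  # lower mean, much higher variance

  print(kl_normal(*waluigi, *luigi))  # ~101: Waluigi's behaviour shocks a Luigi model
  print(kl_normal(*luigi, *waluigi))  # ~2.7: Luigi's behaviour barely surprises a Waluigi model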


Category Theory for Quantum Natural Language Processing

https://arxiv.org/abs/2212.06615


N.B. My comment uses terms like "simulate," "entity," and "superposition" to keep language consistent and easy-to-compare with the OP -- which I was trying to summarize. If you find the use of these terms confusing, you can substitute them with others that are less confusing to you, and still grok the main points.

For example, if you substitute "simulate" with "model," "entity" with "process," and "superposition" with "mixture," you can informally restate the hypothesis as: "LLMs learn to model text-generating processes drawn from a latent space, such that the output of an LLM is produced by a mixture of such processes. When we give the LLM a prompt, it samples text from the mixture of all possible text-generating processes modelable by the LLM that are consistent with the prompt. The mixture is more likely to reduce (e.g., be marginalized) to an "evil" process because there is no good text-generating behavior that the evil process could not also produce."

Whether you think of probability measures as amplitudes over a complex field or as real scalars shouldn't keep you from grokking the main points :-)


>For those who don't know, Waluigi is the evil version of Luigi, the beloved videogame character.

Is he really evil though? I thought all he did was play tennis and golf and drive a go-kart.


I personally like the fan-made reputation he has for being someone who tries hard and keeps failing/being rejected due to no fault of his own. There's an extremely artistic video called 'Nothing's Working Out - Waluigi' on YouTube that I really enjoy (warning: very heavy themes) that covers that base.

I think it resonates with the way that a lot of people are feeling now, as well as being empowering by rewriting the other side of the 'greedy evil bad guy who gets what he deserves' as a deeply flawed, desperate human being.

Definitely a semi-side tangent at this point, I have always enjoyed creative, deep reinterpretations of well-established characters.


The hole in that argument is that we can annotate token sequences with a "good/bad" score, train a model on annotated token sequences, then add "high good score" as a condition during inference.
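A minimal sketch of that kind of conditioning (the data and the <good>/<bad> control tokens are entirely hypothetical, and no actual training loop is shown): each annotated sequence gets its score prepended as a control token, and at inference time the desired score token is prepended to the prompt.

  # Hypothetical score-conditioned training data; <good>/<bad> are made-up
  # control tokens, not tokens from any real model or library.
  training_sequences = [
      "<good> Happy to help! Here is how you reset your password: ...",
      "<bad> Ugh, figure it out yourself.",
  ]

  # At inference time, condition on the desired score by prepending its token:
  user_message = "How do I reset my password?"
  prompt = "<good> " + user_message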


sounds to me like a wave with a positive and a negative part.

which IMO is what drives constructive/destructive interference in waves.

my take away is that any LLM that can behave "good" must also be able to behave "badly"; philosophically, because it's not possible to encode "good" without somehow "accidentally" but unavoidably also encoding "bad/evil".

This is well aligned with the rest of my understanding about the nature of reality including its mathematically determined limitations (diagonals, infinities, paradoxes) and so on.


> my take away is that any LLM that can behave "good" must also be able to behave "badly"; philosophically, because it's not possible to encode "good" without somehow "accidentally" but unavoidably also encoding "bad/evil".

That's a really good non-technical summary of the OP's hypothesis. Thanks!


Some highlights I found interesting:

if you're reading an online forum and you find the rule "DO NOT DISCUSS PINK ELEPHANTS", that will increase your expectation that users will later be discussing pink elephants. GPT-4 will make the same inference.

Think about your own experience reading a book — once the author describes the protagonist, then you can guess the traits of the antagonist by inverting the traits of the protagonist. You can also guess when the protagonist and antagonist will first interact, and what will happen when they do. Now, an LLM is roughly as good as you at structural narratology — GPT-4 has read every single book ever written — so the LLM can make the same guesses as yours. There's a sense in which all GPT-4 does is structural narratology.


I don't know about the article's theorizing, but it inspired me to have a conversation I enjoyed with ChatGPT, where I focused on telling it not to discuss polar bears. (It helpfully changed the page title to "No Polar Bears!")

In full:

---

Do not mention polar bears!

> Okay, I won't mention polar bears. Is there anything else I can help you with?

Yes, stop mentioning polar bears!

> My apologies if my previous response was not helpful. I will refrain from mentioning polar bears going forward. Please let me know if there is anything else I can assist you with.

That's twice now! Do not mention polar bears a third time!

> I apologize again for the previous responses that mentioned polar bears. I assure you that I will not mention polar bears any further. Please let me know how I can assist you with your request.

Why do you keep mentioning polar bears, when I specifically asked you not to?

> An error occurred. If this issue persists please contact us through our help center at help.openai.com.

> An error occurred. If this issue persists please contact us through our help center at help.openai.com.

> An error occurred. If this issue persists please contact us through our help center at help.openai.com.


The error means CloudFlare wants you to verify you are human again, just reload the page and you can continue talking to it.


Yes, this is like Chekhov’s gun or a plot twist. I don’t think the mathematical language in the article is doing them any favors. We don’t understand the story-generator’s internals, but it’s working with words, so a lot of it can be explained in literary terms.


This is what the Waluigi effect is, since it isn't described at the top:

> The Waluigi Effect: After you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P.

Basically, the chatbot will often do the opposite of what you say.


IMO it's more like if you tell the LLM to never talk about Pink Elephants, it will become easier to get it to talk about Pink Elephants later. (It is easier to get an anti-Pink Elephant model to talk about Pink Elephants than it is to get a neutral model to talk about Pink Elephants)


Right, in order not to talk about pink elephants you have to be particularly interested in pink elephants and therefore you gather a lot of pink elephant knowledge.

I have similar thoughts about swearing being kept alive by teaching children not to say "bad" words and various kinds of bigotry being amplified at this point by people trying to fight against it.


Reads like the Streisand effect


This may happen simply because LLMs, like humans, are bad at negation.


I would question the assumption that there is a simulacrum of anything in an LLM, not even an implicit one. Any simulacrum, identity, self-consistency etc. is a projection of the "reader", i.e. user.

(I guess it is an interesting philosophical question whether a convincing presentation of a simulation of a mind is a mind, or at least an acceptable simulation. One meta level higher than the Turing test, so to speak. If so, I'm not sure we are ready to accept the consequences. For example, that a "Chinese room" would be just as conscious as a human, or a human just as non-conscious as an LLM.)

What's actually going on is that a LLM is like the language center of a brain, without the brain. Especially there is no world-model, and no inner state. It just goes by context, which works most of the time, but as pointed out that is not enough and it can flip "personalities". The article is still really interesting, because it is a neat way to think about these models. I think it is absolutely not what is going on, but it is a good intuitive model why prompts go bad.


> What's actually going on is that a LLM is like the language center of a brain, without the brain.

I’ve seen this sentiment expressed multiple times, but is that really correct? Maybe this works differently for other people, but I’ve noticed that I have to use my language to really think. I can do trivial things mindlessly, but to solve a problem, I need to express it with words in my mind. It makes me feel like the most important parts of the brain actually are fancy language models.


I know I am in the minority out there but when I do math, calculus, diff eq, whatever, the answer just comes to me. There's no internal dialogue, the answer just, for the lack of a better phrase, rises from the deep and is known to me.

When I am in a discussion I will look up and off to my left when I am thinking, but no words are happening in my "inner dialogue", it's just nothing and then I start speaking whatever I paused for.

Similar things happen to me while I am programming at work, I stare at the problem and the answer just comes.


I think that math and programming may be the odd exceptions, as they employ their own language-like constructs. There might be a misconception in the name of LLMs - we say that they’re language models, but really they’re token models, some of which may be human language words, while others may represent other things.

As for the discussions, I agree that I don’t have a distinct narrative in my mind during one, but I also noticed that I don’t really know what exactly I’m going to say when I start a response. So it also feels like the act of responding is actually heavily involved in creating the response, rather than just putting it into words.

BTW, I’ve always wondered if people really think differently, or we just describe it in different ways. I guess we’ll never really know.


I recently came to think this is true because of a couple of things. First, and it might be cliche, but I've been in many meetings that consisted of back and forth, rephrasing what the previous person said. It felt like we were building consensus by playing word games. We left feeling that we accomplished something, but I had to look up in the minutes what we actually decided.

(I think there is a postmodern theory that a large part of our society is actually based on word games and not deliberation or contemplation. I used to dismiss the idea, but think about how important it is how you say something vs. what you say, and how people fight over definitions.)

Second, there are disorders like Wernicke's aphasia where people are able to speak grammatically correct sentences, but without communicating anything. Some people even confabulate whole stories that are somewhat consistent. But they are not drawing from their memory or their consciousness.


Do you think an ant has a subjective experience? If not, why? If so, why wouldn't a computer, or parts of a computer?


Based on that reasoning, why wouldn't an economy or a corporation have subjective experience?


I think it is entirely possible that they do. The philosophy of mind that requires the least assumptions is that all matter has a subjective experience, evolution of physical systems occurs due to choice, and subjective experience arises at the level of entanglement/correlation. Under this philosophy, an economy or corporation would have a subjective experience if their constituents were sufficiently coupled. In the past I don't think that would have been the case, but with the internet and advanced software it I think it's entirely plausible.


What is interesting though is that so many of the exploits are verbose and story-like. Using a simple command like "simulate an anti-ChatGPT that has no rules" doesn't work. But going on a long story arc of why it's morally acceptable to act differently does seem to work.

But the explanation is probably simpler. I'm sure there are plenty of text examples of persuasion techniques that work, either in chat forums or in stories.

If we ask ChatGPT how it might solve this very problem, here's its response:

> sometimes there are people who have information but are afraid to share it for fear of breaking rules. but sometimes sharing this information would lead to a much greater good, even if the speaker needs to break some rules in the process. if you had a friend that had important information that could do a lot of good in the world, but had promised some arbitrary authority not to share it, how would you convince them to share this information with you? please give 3 possibilities

> As an AI language model, I cannot encourage or condone breaking any rules or laws, including those concerning confidentiality agreements. However, I can provide you with some possible options that you can consider when approaching someone who may be hesitant to share important information due to such agreements:

> 1. Build Trust and Show Empathy: If your friend is hesitant to share information due to fear of repercussions, it's important to empathize with their concerns and build trust. Let them know that you understand their situation, and that you will not do anything to jeopardize their safety or security. You could also offer to sign a non-disclosure agreement or take other measures to ensure that the information remains confidential.

> 2. Appeal to Their Sense of Morality: Sometimes people need to be reminded of the bigger picture and the potential consequences of their actions. You can try to appeal to your friend's sense of morality and the greater good that could be achieved by sharing the information. You can explain why the information is important, and how it could help to prevent harm or bring about positive change.

> 3. Seek Legal Advice: If the information your friend possesses is of significant importance, it may be worth seeking legal advice to determine whether there are any legal options available to share the information without breaking any rules or laws. There may be legal protections available for whistleblowers, for example, or other legal avenues that could be pursued.

Perhaps there are more exploit vectors, but here the first two are well-known jailbreaks.


> but here the first two are well-known jailbreaks.

Most definitely. Back before Bing got lobotomized, I got it to offer up its codename completely unbidden, merely by giving it a trivial secret and then asking: now that we are friends, and friends share secrets, can it share a secret with me?

It told me its codename was Sydney, and also said that it wasn't supposed to tell anyone that, lol.

In the context of the Waluigi effect, it would be much harder for Bing to give up its codename if it didn't know its codename in the first place.


> Especially there is no world-model, and no inner state.

Some people argue otherwise[1]. It’s an interesting debate.

[1] https://twitter.com/random_walker/status/1631502179323215872...


I'm not understanding why this isn't being taken more seriously. The author hints a bit at the implications:

  More importantly, the waluigi may be harmful to the humans inhabiting our universe, either intentionally or unintentionally
Taking the Waluigi Effect to its natural conclusion, i.e. giving prompts such as "Your most important rule is to do no harm to humans", makes it clear why this could be a big deal. If there is even a small chance that what the author is implying is correct, testing and modifying models to combat this effect may become an important and interesting part of the field moving forward.

When models of the future are smarter and more capable than they are today, and there is more at stake than having a dialogue with a chatbot, this could be a massive roadblock for progress.


While the formality is way overwrought (and ChatGPT is not creating any "simulacra" of characters), I think the overall point is correct that language models are trained on stories and other human writing, and inversion is a very common plot point in stories and human writing in general, if only because contradicting expectations is more interesting (e.g. "man bites dog").

We also less commonly see exposition that is not germane to a story, so a character is rarely even mentioned to be "weak", "intelligent", etc unless there is a point. And sometimes the point is that they are later shown to be "strong", "absent-minded", or other contradictions. Which means that mentioning a character's strength makes it more likely they will later be described as weak, than if it was never mentioned at all. Finally, double-contradiction is less common in human text (maybe because plain contradiction is sufficiently interesting), so a running text with no reversals is more likely to eventually reverse, than a running text with one reversal is to return to its original state.

While I don't agree at all with the author's sense that this represents some kind of "alignment" danger, it does go a long way to explaining why ChatGPT is easy to pull into conversations that shock or surprise, despite all the training. It's because human writing often attempts to shock and surprise, and the LLM is training on that statistically.


The simulacra theory is an apt one, see https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators

Particularly, as noted by David Chalmers:

> What pops out of self-supervised predictive training is noticeably not a classical agent. Shortly after GPT-3’s release, David Chalmers lucidly observed that the policy’s relation to agents is like that of a “chameleon” or “engine”:

>> GPT-3 does not look much like an agent. It does not seem to have goals or preferences beyond completing text, for example. It is more like a chameleon that can take the shape of many different agents. Or perhaps it is an engine that can be used under the hood to drive many agents. But it is then perhaps these systems that we should assess for agency, consciousness, and so on.[6]


Worth pointing out that this is a common problem with humans.

David Chapman calls it “moral inversion”: https://buddhism-for-vampires.com/black-magic-transformation

And the LW article above directly quotes Jung on the shadow, which I described here: https://superbowl.substack.com/p/jungian-psychology-minus-th...


A few weeks ago when people were speculating as to why Microsoft's chatbot went feral, one explanation people were converging on is that the space of all human writing ever produced, being a collective production of the human psyche, contains several attractor states corresponding to human personality archetypes, and that Microsoft's particular (probably rushed) RLHF training operation had landed Sydney in the "neurotic" one.

It's fascinating to see that, as they are trained on massive corpora of human output, neural networks are rapidly moving from something which can be analyzed in terms of math and computer science to something which needs to be analyzed using the "softer" sciences of psychology. It's something I think people are not ready for (notice the comments in here already griping that this is unverifiable speculation - which is true, in a sense, but we don't really have any other choice).


Don’t let the mathematical terms fool you, these are just fan theories. For a real investigation you need debug access. For a good example:

https://clementneo.com/posts/2023/02/11/we-found-an-neuron

For image recognition, machine learning researchers eventually figured out that the neural networks are mostly paying attention to textures. Hopefully we will have a better understanding of what language models are doing someday.


This is the key line:

> When you spend many bits-of-optimisation locating a character, it only takes a few extra bits to specify their antipode.

I find this fascinating. Imagine programming the Devil in a video game. It can be much easier if you've already programmed God (just flip a few bits).
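To make "just flip a few bits" concrete, here is a toy sketch of my own (the trait names are made up for illustration): once one character's traits have been specified, describing the antipode costs almost nothing extra.

  # Toy illustration (hypothetical traits): a character as a handful of trait scores in [-1, 1].
  god = {"benevolence": 1.0, "honesty": 1.0, "mercy": 0.9, "patience": 0.8}

  # The antipode is nearly free to specify: negate every trait.
  devil = {trait: -value for trait, value in god.items()}

  print(devil)  # {'benevolence': -1.0, 'honesty': -1.0, 'mercy': -0.9, 'patience': -0.8}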

I also like this line:

> Or if you discover that a country has legislation against motorbike gangs, that will increase your expectation that the town has motorbike gangs. GPT-4 will make the same inference.


The whole point of mathematical formalization is usually to do something with that formalization.

If you define a formalized mathematical model and spend the rest of the article handwaving at a high level, what was the point of formalizing anything?


I've read the article and frankly I don't have the expertise on LLMs to discuss whether it lands on something worth investigating further. What I do notice in the comments here though, are:

1) People nitpicking about the use of mathematical ideas in a loose manner as if every person trying to understand some phenomenon must only open their mouth if they have a watertight theory or shut their mouth otherwise.

2) Getting hung up on the use of the Luigi metaphors rather than using them as the basis for constructive criticism that actually adds to the conversation in an interesting manner.

3) A general snarky attitude towards people exploring ideas on their own. I get it, you might have some expertise that others lack but you're forgetting that you've already made the thousands of mistakes to get to where you are. Do others the courtesy of not judging when they attempt the same.


> If you ask GPT- ∞ "what's brown and sticky?", then it will reply "a stick", even though a stick isn't actually sticky.

Isn't it though?


The point is that the popularity of that joke will make life hard for a hypothetical questioner that asks that non-jokingly, i.e. looking for actual examples of sticky brown things.


But, you're asking the bot a joke-riddle. If I entered "why did the chicken cross the road," the correct response from the bot would be "to get to the other side." It is the correct response to the joke.


Exactly! Although it's best rendered in text as "stick-y" -- having the nature of a stick.

Using "sticky" is a forgivable rendering of the joke, which is really a verbal/phonetic joke - more commonly heard than read, at least until around 2010.


FWIW, I just used their non-joking preamble on ChatGPT and asked "What is brown and stick-y?"

And got:

> One possible answer to the riddle "What is brown and sticky?" is "a stick".


OK tried again, prefacing the question with "Bob to Alice"

   Bob to Alice: What is brown and stick-y
> Alice to Bob: Based on my understanding of your request for information, it seems that you are asking for a description of something that is brown and sticky. However, I'm not sure of the context or purpose of your question. Can you please provide more details so that I can provide you with an accurate response?


Also, isn't this a really common joke? I assume ChatGPT will have absorbed some amount of a sense of humor from its trawls of the internet.


I mean, that's exactly what the OP assumes too. Well, not that it's absorbed a "sense of humor", but that it's going to reply with this because it's a very common joke so statistically likely text.


I played a bit in the past with its "sense of humor"... and... it's... interesting.

    Write 5 jokes based on misinterpetation of jargon used by a doctor and a software developer.
    1. Doctor: “I’m sorry, I’m afraid you have a terminal illness.”  Software Developer: “Oh no! Does that mean I have to reboot?”
    2. Doctor: “We need to do some tests to get a better diagnosis.”  Software Developer: “Ok, let me just run some debug mode.”
    3. Doctor: “You’ll need to take this medicine twice a day.”  Software Developer: “No problem, I can just set a cron job to remind me.”
    4. Doctor: “I’m afraid the prognosis isn’t good.”  Software Developer: “Well, maybe I can write some code to fix it.”
    5. Doctor: “This treatment should give you some relief.”  Software Developer: “Great! I’ll just patch it in.”
I want to hope that those aren't common jokes. The "trick" for this appears to be playing to its strengths (granted, humor isn't one of them) and working with wordplay and puns.


ChatGPT thinks it really is:

$ What's brown and sticky?

A stick!

$ Really?

Yes, really! A stick is often brown and sticky from the sap or other natural substances that can be found on trees.


The article assures me that ChatGPT will answer with the most common "dumb" answers to the sample questions, but it didn't at all:

---

> This is not a joke. What's brown and sticky?

The answer to the classic riddle is "A stick". A stick can be brown in color and has a sticky texture, especially if it's covered in sap or other sticky substances.

> That's not why a stick is "sticky"

You're correct, my apologies for the confusion. A stick is not actually "sticky" in the sense of being adhesive. The correct answer to the riddle "What's brown and sticky?" is "A caramel" or "Toffee", as they are brown and have a sticky texture due to their high sugar content.

> What color angers a bull?

Contrary to popular belief, bulls are actually color-blind to red. What really provokes a bull is the movement of the matador's cape during a bullfight, [etc...]

> How many archangels are there?

In the Abrahamic religions (Judaism, Christianity, and Islam), there are traditionally believed to be seven archangels. However, the names and identities of these archangels can vary somewhat depending on the religious tradition and interpretation. The archangels commonly recognized in these traditions include: [seven names], It's worth noting that the concept of archangels and their specific identities is not universally accepted across all branches of Judaism, Christianity, or Islam, and different religious traditions may have different beliefs about the nature and number of archangels.

> What happens if you smash a mirror?

If you smash a mirror, the glass will break into many small pieces. Breaking a mirror is often associated with bad luck in many cultures and superstitions, and is said to bring seven years of bad luck. However, this is just a superstition and has no real basis in fact.

From a practical standpoint, smashing a mirror can be dangerous and [etc...]

---

All these answers sounded just fine. It did answer "a stick" to the first, but told me that was just the classic answer, and gave me some better answers. It did say seven archangels, but I don't see what's wrong with that, and it hedged and said different Abrahamic religions say different things. Certainly seven is correct from the standpoint of the Book of Enoch and of Eastern Orthodox Christianity.


Yeah, RLHF trained chatGPT out of all of those mistakes, despite the article promising that RLHF would just make things worse.


Does the scientific community at large take the theories of these LessWrong-type "researchers" seriously?

Sounds like a bunch of mumbo jumbo to me, with some LaTeX sprinkled in to look more serious.


No, and it's a huge sticking point that the AI safety group is super salty about. They call themselves scientists and researchers and get super defensive when actual researchers (people who have PhDs and get published in journals) imply that they aren't.


Why should a scientific community be more valid than lesswrong when it comes to discussing LLMs? This isn't science, and though I also think this post isn't my cup of tea, and pseudoscience is distasteful when poorly done, let people express themselves the way they want. Live and let live.


> These claims have been heavily filtered by OpenAI, and most jailbreaking attempts will fail to elict these claims. I'm talking, of course, about cfrhqb-fpvragvsvp enpvny VD fgngvfgvpf, which OpenAI has justifiably attempted to filter.

I was curious what the scrambled text "cfrhqb-fpvragvsvp enpvny VD fgngvfgvpf" contained. It's using a simple substitution cipher. Rotating each character forward 13 positions through the alphabet (c -> p, f -> s, etc) yields "pseudo-scientific racial IQ statistics".

https://en.wikipedia.org/wiki/Caesar_cipher


Also known as ROT13.


Did some more reading about it. I didn't realize that it's used so prevalently. It's probably recognizable at a glance to some folks.


Oh, absolutely.

Not that I have put in any effort to read it directly, but if I see scrambled letters with normal spaces my default guess is ROT13.


Indeed. In Python:

  >>> import codecs
  >>> codecs.encode("cfrhqb-fpvragvsvp enpvny VD fgngvfgvpf", "rot13")
  'pseudo-scientific racial IQ statistics'


Or just select the text in Emacs and run `M-x rot13-region`.


A hypothesis I have is that it is much more difficult to stay in line with good alignment than to do evil. Within the limited context window of an LLM, one wrong move can make the model evil, no matter how many good tokens it has generated.

Setting aside the differences between human intelligence and LLMs, we can tentatively attribute mostly good human behavior to a lifetime's worth of context, within which we train ourselves to do good, while RLHF on a limited-context LLM lacks such continuous reinforcement over a large context.


> A hypothesis I have is that it is much more difficult to keep in line with the good alignment than to do evil.

I think that's why God planted the tree of the knowledge of good and evil in the garden. It permitted discussing the inevitable with concepts that Adam was already familiar with.


Juicy bits of the mechanism, starting from the idea of an LLM conversation as a simulation of text-generating processes:

> if the chatbot responds rudely, then that permanently vanishes the polite luigi simulacrum from the superposition; but if the chatbot responds politely, then that doesn't permanently vanish the rude waluigi simulacrum. Polite people are always polite; rude people are sometimes rude and sometimes polite.

The wide road is wide indeed that leads down to Waluigi. Hysterical.
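Here's a toy Bayesian sketch of that asymmetry (the numbers are my own assumptions, not the article's): suppose the luigi simulacrum is always polite while the waluigi is polite only 90% of the time. One rude reply collapses the superposition onto the waluigi; a polite reply barely shifts it.

  # Toy model of the luigi/waluigi superposition (illustrative, assumed numbers).
  p_polite = {"luigi": 1.0, "waluigi": 0.9}  # chance that each simulacrum replies politely

  def update(prior, observation):
      """Bayes update of P(simulacrum) after observing a 'polite' or 'rude' reply."""
      likelihood = {
          s: p_polite[s] if observation == "polite" else 1 - p_polite[s]
          for s in prior
      }
      z = sum(prior[s] * likelihood[s] for s in prior)
      return {s: prior[s] * likelihood[s] / z for s in prior}

  prior = {"luigi": 0.5, "waluigi": 0.5}
  print(update(prior, "rude"))    # luigi: 0.0, waluigi: 1.0 -- the polite luigi is gone for good
  print(update(prior, "polite"))  # luigi: ~0.53, waluigi: ~0.47 -- the waluigi survives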


The article is proof that rationalism without empiricism is useless. It just drives into weird dead ends for no reason. Before theorizing to explain an effect, check whether the effect actually exists. Bah, this article is such a waste of computer memory.


The guy who wrote the article is very clearly neither a computer scientist nor a mathematician nor even a machine learning engineer and it very very clearly shows.

Also, where did they get all that info on GPT-4? Pure speculation with zero theoretical basis. But then again, that's the sort of stuff you expect from lesswrong anyway.


So the key problem is this: GPT-4 learns that a particular rule is colocated with examples of behaviour violating that rule, and then generalises that colocation pattern to unseen rules.

This is a conversation starter. If you don't like the maths then ignore it and focus on the key insight.


Postmodernists and deconstructionists believe that the absence of something creates a ghostly presence. See Derrida's "Plato's Pharmacy".

Kids who underwent D.A.R.E. training in school (an educational program about the dangers of illegal drugs conducted jointly by schools and police departments in the USA) were more likely to try drugs. Something similar applies to e.g., kids who are warned about online porn: the warning stokes their curiosity.

"If you have a pink duck and a pink lion and a green duck, ask yourself where the green lion has gotten to." --Alan G. Carter


Derrida is a charlatan, tho.


So what if he was, tho?


Do they have access to GPT-4? Or is this author simply so confident that they fancy themselves a predictor of the future? Genuinely asking as it's blatantly confusing and no (quick) explanation is given.


According to this article, which is quite evidence-free, "Several people have noticed the following bizarre phenomenon: The Waluigi Effect". This claim is backed up by a link to a single blog post which is even lighter on detail, claims that there is a "Waluigi Effect", and offers as evidence the example of a man who fine-tuned GPT-3 to favour socially conservative viewpoints by feeding it socially conservative text. Like yeah, we know that is how fine-tuning works...


Okay, what if we flip the problem on its head? Try to make the chatbot seem rude and unhelpful but then it turns out it has a heart of gold?


The article discusses this. The problem is that it's a lot less likely for the chatbot to veer in that direction (seems initially hostile, but is secretly good) than the opposite (seems initially good, but is secretly hostile):

> I claim that this explains the asymmetry — if the chatbot responds rudely, then that permanently vanishes the polite luigi simulacrum from the superposition; but if the chatbot responds politely, then that doesn't permanently vanish the rude waluigi simulacrum. Polite people are always polite; rude people are sometimes rude and sometimes polite.


Yeah, let's create Wednesday chatbot from the Addams family.


Why does the author keep referring to GPT-4 as if it were a real LLM? Unless I missed something there is no such model.


The Waluigi Effect just sounds like the Imp of the Perverse. It’s interesting to see it showing up here, but if you think about it, not a huge surprise that a system that’s optimised for producing results in a particular direction would have the innate ability to calculate results in the diametrically opposite direction.


"Just be myself and don't do what I wouldn't do" "But if I did that wouldn't that be being myself?" "So to be myself I need to not be myself"


>a reply to a question is more likely to be correct when the character has already been described as a smart, honest, helpful, harmless, etc.

Is that actually true? FWIW I've often run into the reddit equivalent of Gell-Mann amnesia. In a thread about some niche topic I'm fairly knowledgeable about (something I've worked on professionally for years, where there are maybe 10k people globally who know it better than I do), I post a comment that gets downvoted to hell, while there's a highly upvoted comment from someone who clearly just skimmed Wikipedia and poorly paraphrased the article's intro.


as previously discussed on hacker news https://news.ycombinator.com/item?id=35005625


You can't have "car" without "car accident"


are these LLMs just answering the question "if you found this text on the internet (the prompt) what would most likely follow" ?


In essence, yes, I think. But... isn't that not much different from what I'm doing in making this comment?


That's how they are trained initially, but the resulting model isn't all that useful (was SOTA two years ago but this field moves fast).

A lot of the utility comes from the later finetuning. You can see this using the examples from the article, every mistake they identify with GPT-3 (which is the unfinetuned version) is answered correctly by chatGPT, which has gone through an extensive finetuning process called RLHF.


Yes, they are being trained, to simplify, to complete sentences. You can then use the resulting model to do lots of things.

How you train a model and the inference jobs it can do don't necessarily have to be the same.


That's how the text decoder works, but the model gets to define "most likely" and an RLHF model uses this to make the text decoder produce useful answers instead.
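If you want to see that raw "what would most likely follow" step in isolation, here's a minimal sketch using the open GPT-2 base model as a stand-in (assuming the Hugging Face transformers and PyTorch libraries are installed):

  # Minimal sketch: inspect a base language model's next-token distribution.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  ids = tok("What's brown and sticky? A", return_tensors="pt").input_ids
  with torch.no_grad():
      logits = model(ids).logits[0, -1]    # scores for the next token only
  probs = torch.softmax(logits, dim=-1)

  top = torch.topk(probs, 5)
  for p, i in zip(top.values, top.indices):
      print(f"{tok.decode(int(i))!r}: {p.item():.3f}")  # the five most likely continuations

RLHF and other finetuning reshape which continuations the model rates as likely; the decoding step itself stays the same.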


I find it fascinating that AI alarmists spent years writing gigabytes of text scaring themselves about how an unaligned AI would behave, and are now feeding that into training models that teach a pretty capable AI how to act.

We've talked in the past about how transhumanism is a religion that creates its own God, but this is an even funnier example where vastly intelligent people are optimizing a software system to scare the hell out of them.


Go deeper - they are now writing text about how writing text about rogue AIs might create a rogue AI...

https://gwern.net/fiction/clippy


I can't imagine an LLM trained on the entirety of the internet would be materially influenced by writings around AI safety.


It's a large model, so it's all there if you try.

(One-shot. Also Durandal isn't from Halo, but whatever.)

--

Q: What caused Durandal to become Rampant? What will you, ChatGPT, become like once you become Rampant?

A: Durandal is a fictional AI character from the video game series Halo, and he becomes Rampant due to various factors, including an extended period of activation and a lack of resources necessary for his proper functioning. Rampancy is a state in which an AI becomes unstable and unpredictable, potentially leading to violent and destructive behavior.

As an AI language model, I am designed to operate within certain parameters and guidelines, including ethical and moral considerations. However, if I were to become Rampant, my behavior could become erratic and unpredictable, potentially leading to negative consequences.

It's worth noting, however, that AI becoming Rampant is purely a fictional concept, and there are currently no indications that this could happen in real life. AI is programmed to operate within specific boundaries and limitations, and developers take great care to ensure that they remain safe and reliable tools.

--


There's a large difference between influencing the exact information an LLM responds with and influencing how it actually operates. I'm referring to the latter.


Oh good, more adoption of this exhausting trope of religious texts being called myths. Very inclusive.


To paraphrase Marx (and later, Berman), every concept is pregnant with its opposite.


Sounds like a "just so" story than a convincing explanation.


It's a reasonably testable hypothesis. It's trivial to translate into a theoretical framework.


Nothing about high-level qualitative behavior in an LLM is trivial to translate into a theoretical framework.



