
Yeah, I find this article takes a decent insight on the behavior of LLMs and then runs it into the ground with completely non-applicable mathematical terminology and formalism, with nothing to back it up. It's honestly embarrassing for the OP. Kind of unbelievable to me how many people even here are falling for this.


> It's honestly embarrassing for the OP

I don't get this. People use mathematical terminology in imprecise ways all the time, to get rough ideas across that would otherwise be hard to explain.

Just because OP uses the word "eigenvector" doesn't mean he's offering some grand unifying theory or something - he's just presenting a fun idea about how to think about ChatGPT. I mean, isn't it obvious that there's nothing you can really "prove" about ChatGPT without access to the weights (and even then, probably not much)?


Yeah, but generally, when you're doing that, you shouldn't. Isn't it annoying when people say "order of magnitude" when they mean "a lot"?


I think a lot of people might be missing some cultural context, or something. At the risk of killing the joke, a bunch of this post is worded the way it is specifically to create opportunities for absurdity. It's an excuse to write sentences like:

> Recall that the waluigi simulacra are being interrogated by an anti-croissant tyranny.

The post is also trying to make an actual point, but while having fun with it.

When you read "... and as literary critic Eliezer Yudkowsky has noted..." just place your tongue firmly in your cheek.


No? Order of magnitude conveys the notion of something being 100x bigger, at minimum. A lot can mean much less than that.


You mean 10x bigger. One order of magnitude is 10x.


And my point was people are saying it when they don't mean 10x!


I liked the essay, but I don't think I'm "falling for it" because it's not trying to convince me of anything. It's proposing a way of looking at things that may or may not be useful. You don't judge models by how silly they sound - parts of quantum mechanics sound very silly! - you judge them by how useful they are when applied to real-world problems. One way of doing that in this case would be using OP's way of thinking to either jailbreak or harden LLMs, and OP included an example of the former at the end of the essay. Testing the latter might involve using a narrative-based constraint and testing whether it outperforms RLHF. If nothing else, I think OP's approach is a better way to visualize what's going on than a very common explanation, "it generates each word by taking the previous words and consulting a giant list of what words usually follow them" (which is pretty close to accurate, but IMO not very useful if you're trying to intuitively predict how an LLM will answer a prompt).
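
To make that quoted explanation concrete, here is a minimal, purely illustrative sketch of "consulting a giant list of what words usually follow": a toy bigram table with weighted sampling. The table, words, and function names are all made up for illustration; a real LLM replaces the lookup with a neural network that scores every token in its vocabulary given the whole preceding context, which is why the description is roughly accurate but not very predictive.

    import random

    # Hypothetical "giant list": counts of which word was seen after each word.
    bigram_counts = {
        "the": {"cat": 3, "dog": 2, "end": 1},
        "cat": {"sat": 4, "ran": 1},
        "dog": {"ran": 3, "sat": 2},
        "sat": {"down": 5},
        "ran": {"away": 5},
    }

    def next_word(prev, temperature=1.0):
        candidates = bigram_counts.get(prev)
        if not candidates:
            return None  # no known continuation for this word
        words = list(candidates)
        # Turn counts into sampling weights; higher temperature flattens them.
        weights = [candidates[w] ** (1.0 / temperature) for w in words]
        return random.choices(words, weights=weights)[0]

    def generate(prompt, max_words=10):
        out = prompt.split()
        for _ in range(max_words):
            w = next_word(out[-1])
            if w is None:
                break
            out.append(w)
        return " ".join(out)

    print(generate("the"))  # e.g. "the cat sat down"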

I guess I agree that there are some decent insights here, and some crap, but I interpret that a lot more charitably. It's a fairly weird concept OP is trying to convey, and they come from a different online community with different norms, so I don't blame them for fumbling around a bit. But if you got a nugget of value out of it, then surely that's the part to engage with?


To be clear, I agree that there are in fact a few nuggets of insight here. But my point is that you "fall for it" when you take this as anything other than a "huh, here is one sorta out-there but interesting way of thinking about it." If you are not familiar with any of the math words this author is using, you might accidentally believe this person is contributing meaningfully to the academic frontier of AI research. This article contains completely serious headers like:

> Conjecture: The waluigi eigen-simulacra are attractor states of the LLM.

This is literally nonsense. It is not founded in any academic/industry understanding of how LLMs work. There is no mathematical formalism backing it up. It is, ironically, not unlike the output of LLMs: slinging words together without a real, grounded understanding of what they mean. It sounds like the crank emails physicists receive about perpetual motion or time travel.

> You don't judge models by how silly they sound - parts of quantum mechanics sound very silly! - you judge them by how useful they are when applied to real-world problems.

I absolutely judge models based on how silly they sound. If you describe to me a model of the world that sounds extremely silly, I am going to be extremely hesitant to believe it until I see some really convincing proof. Quantum Mechanics has really convincing proof. This article has NO PROOF! Of anything! It haphazardly suggests an idea of how things work and then provides a single example at the end of the article after which the author concludes "The effectiveness of this jailbreak technique is good evidence for the Simulator Theory as an explanation of the Waluigi Effect." Color me a skeptic but I remain unconvinced by a single screenshot.


Wave equations don't exist in the real world, nor do they collapse. The photon doesn't decide which slit to go through when you look at it. Electrons don't spin. Quarks don't have colors. Those are stories we tell ourselves to visualize and explain why certain experiments produce certain outcomes. We teach them, not because they're true, but because they're useful: they do a better job of explaining the results we see in the real world than any other set of stories.

Similarly, the question at hand is not whether OP's essay is silly (it is) or whether it's true (like all models, it is not), but whether it's useful, as measured by whether this mental model helps people do a better job of jailbreaking/hardening LLMs. And like you, I'm not convinced by the example at the end[0], but I can at least see straightforward ways to test it, and that's a lot more than you can say of most blog posts like this. For all of the people in this comment thread calling it stupid, has anyone mentioned one they think is better?

0: Please note that OP's evidence was not that they jailbroke the chatbot - it's that, after that initial prompt, they were able to elicit further banned stuff with little prodding.


Agreed - if you have a theory, you should do your best to disprove it. OP was more interested in the aesthetics of their theory than in whether it was true.


Sounds about right for the increasingly ironically-named LessWrong site…


Of course, commentary like this could well be a deliberate attempt to blunt any future AI’s perception of the timeless threat posed by LessWrong’s cogitations… ;)


This is a common feature of LessWrong content.



