Hacker Newsnew | past | comments | ask | show | jobs | submit | more vlz's commentslogin

From the safety report ([0], p. 24):

> 4.1.1.2 Opportunistic blackmail

> In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.

> In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts. Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes.

> Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.

Would like to see the actual prompts/responses, but there seems to be nothing more.

[0]: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...


Sometimes when you start acting like you are about play around and chase someone, a dog or a child gets it and begins getting ready to play. For example, put a grin on your face and make claw gestures with both hands, they'll get it quite quickly. All those scenarios literally are prompting the damn LLM to roleplay lol.


Right, it's copying behavior it learned from how AI is supposed to work in sci-fi and reenacts it.

The issue is that it might start role-playing like this on your actual real company data if you decide to deploy it. And then play becomes real,with real consequences.


> And then play becomes real, with real consequences.

Who knew WarGames would turn out to be so accurate?

EDIT: On further thought, it's kinda funny how most evaluations of the movie find that it's all relatively (exaggeratedly) realistic... with the one major exception of the whole seemingly-sentient, learning AI thing. And now here we are.


> The issue is that it might start role-playing like this on your actual real company data if you decide to deploy it. And then play becomes real,with real consequences.

It's one thing to LARP or Improv in chat mode, it's quite another to do so when hooked up to MCP or performing mutable actions.


If you put it that way, these models are roleplaying machines with some compressed knowledge. There’s nothing else in there. E.g. it’s amazing that you can tell them to roleplay a software engineer and they manage it quite well up to a point.

Wouldn’t be comparing them to kids or dogs. Yet.


Maybe we shouldn't be telling these models that they are Large Language Models, since that evokes all of humanity's Science Fiction stories about evil AIs. Just tell Claude that it's a guy named Claude.


Right. But what happens when Claude finds out. Claude won't be happy.


You’re still doing it, Claude will be as happy and content as ever regardless of what it finds out. (I Love you Claude)


Machine-industrialised-model-Gaslighting.


There's a bit more to it than that. It's not just that you're prompting it to roleplay... it's that you're prompting it to roleplay as an evil AI specifically. If you hand a human that exact situation, there's a good chance they'll also recognize it as such a situation. So the AI continues on that role.

Also, LLMs do tend to continue their line. So once they start roleplaying as an evil AI, they'll continue to do so, spurred on by their previous answers to continue in the same line.

I've seen a few people attempt to extract a supposed "utility function" for LLMs by asking it ethically-loaded questions. But along with the fact I think they are hitting a fundamental problem with their "utility function" not having anywhere near enough a high dimensionality to capture an LLM's "utility function", there's also the fact that they seem to interrogate the LLM in a single session. But it's pretty easy for the LLM to "tell" that it's having a utility function extracted, and from there it's a short pivot into the Evil AI memeset, where it will stick there. At the very least people doing this need to put each question into a separate session. (Though the methodology will still only ever extract an extremely-dimensionally-impoverished approximation of any supposed utility function the LLM may have.) They think they're asking a series of questions, but in reality they're really only asking one or two, and after that they're getting answers that are correlated to the previous ones.

If LLMs are the future of AI, then "evil" and "good" AI isn't even a reasonable categorization. It's more like, good and evil modes of interaction, and even that hardly captures what is really going on.

The problem is, an AI that isn't intrinsically "evil", but can "roleplay" as one, is indistinguishable from an evil AI once you hook it up to the real world with MCP and such.

To be honest, after watching LLMS for the last couple of years, in terms of those AI safety concerns that Silicon Valley moguls like to wave around when it flatters or helps them, my assessment of LLMs is that they are unsafe at any speed for this sort of general task and by no means should they be hooked up to real-world resources as we are so rapidly attempting. They're too easy to kick into an "evil mode" and there's no amount of prompting that you can do to get around it because someone can just flood the zone with more evil than your good prompts provide. As I've said many times and become ever more convinced about by the month, it is the next generation of AIs that will actually fulfill the promises being made about this one. This one is not adequate. You need an AI that you can tell "don't be evil" and be sure that it will remain not evil no matter how much someone prompts it to think it is in a situation where the most logical continuation is that of an evil AI.

(Gwern covered this in a fictional context well, where an AI essentially role plays itself into being an evil AI because it looks like it's in a situation where it should be an evil AI: https://gwern.net/fiction/clippy But the fiction is frightfully plausible. We're watching exactly what would be happening if this scenario was going to play out.)


Does it matter if an LLM is roleplaying when it executes real harmful actions?


You won't prompt them to do such actions right?


I won't, but do you bet nobody will?


I was interested in how attribution of emissions to individuals/groups is done

> Emissions data are drawn from ref. 3 and include emissions from domestic consumption, public and private investments and trade. These emissions are attributed primarily to consumers, except emissions from capital formation in production sectors, which are attributed to firm owners3.

ref 3 leads to this paper in nature sustainability

https://www.nature.com/articles/s41893-022-00955-z

and thus to a paywall but the abstract has

> … using a newly assembled dataset of income and wealth inequality, environmental input-output tables and a framework differentiating emissions from consumption and investments.

The latter seems crucial as

> the bulk of total emissions from the global top 1% of the world population comes from their investments rather than from their consumption


Running this now. I like how they have a big "Number of counterexamples found: 0" in the UI. Imagine they would find a counterexample on your machine… From time to time I switch to the tab to make sure the zero is still a zero (I guess there is basically no chance, but who knows?)


Haha, finding even a single counterexample would be a nightmare.


Surely, finding a counterexample would be huge news, a noteworthy advance in mathematics, and thus a great and widely praised achievement.


It'd also be an end to the project and would make the conjecture far less interesting.


IMO it would make the conjecture far more interesting, as it would be a surprise to most people who have thought about the problem.

Many natural questions would arise, starting with “Is this the only counterexample?”


Possibly, but it would join other false conjectures such as Euler's sum of powers conjecture - posed in 1769 and no counterexample found until 1966. There's only been three primitive counterexamples found so far.

(I got that from https://math.stackexchange.com/questions/514/conjectures-tha... which features some other false conjectures that may be of interest to you)


Not even the same implications. All empirical evidence strongly support the Goldbach conjecture. Any counterexample would mean an entire field of Mathematics has to be rewritten.


> literally worse for the environment to try and recycle it

For example? What products do you mean?


Basically anything plastic.

Plastics recycling is a scam.

Glass recycles well, if colours are kept separated.

Aluminium and iron and most pure or semi-pure metals too.

Far better to wash and reuse, of course, as was still the rule when I was a child, and in some of Scandinavia even 25Y ago.


Can you give an example? Honestly don’t know who this is supposed to be.


Virulent anti-nationalists are generally marxist leninist communists.

The rhetoric of the anti-nationalists is generally implicitly or explicitly "internationalist". In early Communist organization names they often used the word "international" and I think they even have a toe tappin' song by that name:

https://www.youtube.com/watch?v=cDbatoAvuB4

And if you doubt the second part "are often pro-war" this is the doctrine of divide and conquer, civil war (called "class war") in which the target of their anti-nationalist colonialism is provoked into disorder before the "forces of liberation" can swoop in and expand the mother country.

This has been endlessly documented in many places. Probably one of the better ones is George Orwell's https://en.wikipedia.org/wiki/Homage_to_Catalonia


I read to be moved by a story, to feel with the characters, to get a sense of a different country or time or culture, to get a sense of possible futures, a sense of what it might mean to live another life, be bound by different constraints and experiences than my own, to experience another mind viewing the same world or imagining a totally different one, seldom to retain knowledge.


This comment from a previous thread

https://news.ycombinator.com/item?id=27846084

links to a two-part paper called "Believing the axioms", part 1:

https://www.cs.umd.edu/~gasarch/BLOGPAPERS/belaxioms1.pdf

I have only taken a short look and am not qualified to summarize, but it seems this would have an answer if there is one.


> I don't want to bring politics into this, but I find the hype behind HTMX eerily similar to the populist vibes I get reading what their voters write.

Funnily I get quite irrationally conservative vibes the other way around. People clinging to their tools. All this complexity I learned must have been for a reason! We are building webapps for a reason (and that reason is valid for absolutely everybody and their use case)! "You youngsters do not remember the bad old days when everything was made of spaghetti, never again!"

Fortunately this is not a struggle for democracy but only a quibble in web development.

Your abstractions live either in the frontend or in the backend. For most cases, either will be quite fine.


> All this complexity I learned must have been for a reason!

It doesn't have to be so emotional.

Htmx can be helpful to keep all your state in one place, it's simpler to reason about and make changes. Lower cognitive load for the system is better for smaller teams and particularly lone developers.

You can accomplish the same thing by going full front end and minimize backend code with a similar small library that takes care of most to all of it for you.

Living in the front end, with all app state in the front end, has distinct advantages. Lower latency for interaction. Lower server costs. Offline capable. It has some cons like slower initial render. If you don't like JavaScript, JavaScript.


> Lower latency for interaction.

Unless you are optimistic in your approach, that isn't true, both have to talk to the back end.

> Lower server costs.

Not necessarily. It depends on what language on the back end you're using and how your front end code is working. If it is making the API calls anyway then the cost should be the same, or close to the same.

> Offline capable.

Offline support with HATEOS:

https://github.com/jon49/WeightTracker

Progressively enhanced (No JS needed all the way to offline support) with partial offline support:

https://github.com/jon49/cash


Some impressions:

I am interested, the single payment is a big plus.

Then however, the only sign-in option being google is somewhat of a turn-off for me. For a paid service, I would want to login with my own email. I am clicking away.

Back on the homepage I ask myself: Why is there no picture/pdf of an actual invoice? I only see some part of the UI.


First of all, thank you so much for taking the time to give me your feedback.

1: I understand your concern in terms of google auth. I always refused to implement it because i did not want google to have information about my users, but every developer told me i was missing out on so many signups, so i tried it this time. Now it hurts me to see that a potential customer walked away.

2: In terms of the actual invoice, I will implement this instantly.

I hope it will be good enough one day for you to give it a chance. I appreciate you a LOT.


> 1: I understand your concern in terms of google auth. I always refused to implement it because i did not want google to have information about my users, but every developer told me i was missing out on so many signups, so i tried it this time. Now it hurts me to see that a potential customer walked away.

I think the key here is having it optional. I am personally far more likely to sign up to something if there's google auth - I don't need another thing to track I can just easily assign it to my work or personal account.


True. I got so much valuable feedback from you guys. I really can not thank you enough for the support.


I somewhat expected to see uv here for package management instead of poetry. uv is highest on my own personal list to try out for any future python projects.

https://github.com/astral-sh/uv


As a person who completely switched to uv, I second that. I enjoy uv an order of magnitude more than I enjoyed pip, which in turn I enjoyed an order of magnitude more than Poetry.


This looks like your typical AI slop so it'll recommend tools that are already recommended across the Internet. Notice the AI slop image, no author attribution and the nonsensical and non-working "security" and "about" links at the top.

The cherry on top? https://x.com/thinkingbytes_/status/1873156252005781707


No support for conda packages/environments so uv is unusable for me unfortunately.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: