gbrindisi's comments

are agents/ still relevant now that we have skills? I am genuinely confused about why I would need custom system prompts for specific agents. What should I use them for?

thanks for raising the alarm and sharing this, very insightful

(also beautifully presented!)


1. I don't have hard metrics at hand, but with the latest Sonnet I'd say we reach consensus around 80% of the time. With Opus it's almost always, but we are not using it due to cost.

2. The difference I see in agent behavior when they don't reach consensus is usually either

- when one of them didn't explore enough and lacks context

- and/or when their risk assessment is off

The latter happens often. In other agent-based workflows we are now giving clear instructions on how to assess risk and where to draw the line for considering something a true positive.

3. Validation is on Sonnet. We don't use persona-based prompts; all 3 validators get the same task and context. The agent orchestrating them takes their output and makes the final decision. We use an internal fork of the Claude Code GitHub Action for now.
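To make the "consensus" idea concrete, here is a minimal sketch of how three validator verdicts could be tallied. It assumes each validator returns a plain label such as "true_positive" or "false_positive" and that consensus means all validators agree. The function name and the labels are hypothetical, and the majority fallback is only a stand-in: in the setup described above, an orchestrating agent makes the final call.

```python
from collections import Counter


def summarize_verdicts(verdicts):
    """Tally validator verdicts (hypothetical helper, not the real workflow).

    Returns whether all validators agree, the strict-majority label if
    there is one, and the raw vote counts.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")

    counts = Counter(verdicts)
    top_label, top_votes = counts.most_common(1)[0]

    return {
        # Assumption: consensus means every validator gave the same label.
        "consensus": top_votes == len(verdicts),
        # Strict majority, or None when no label has more than half the votes.
        "majority": top_label if top_votes > len(verdicts) / 2 else None,
        "votes": dict(counts),
    }


if __name__ == "__main__":
    print(summarize_verdicts(["true_positive", "true_positive", "false_positive"]))
```

When "consensus" is false, the vote counts plus each validator's reasoning would be what the orchestrator needs to look at.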


I like openspec, it lets you tune the workflow to your liking and doesn’t get in the way.

I started with all the standard spec flow and as I got more confident and opinionated I simplified it to my liking.

I think the point of any spec-driven framework is that you want to eventually own the workflow yourself, so that you can constrain code generation on your own terms.


I also like openspec.

I think these types of systems (gsd/superpowers) are way too opinionated.

It's not that they can't or don't work. I just think that the best way to truly stay on top of the crazy pace of changes is to not attach yourself to super opinionated workflows like these.

I'm building an orchestrator library on top of openspec for that reason.


I am doing something similar: I use openspec to create context and a sequential task list that I feed to ralph loops, so that I’m involved in the planning and verification steps but completely hands off the wheel during code generation.


Exactly that. I created an "Open Ralph" loop initially within Claude directly with review gates per phase in the OpenSpec task list.

But it was always just a workaround to what I truly wanted (what I'm building now), a full external managed orchestrator loop. The agents aren't aware of the loop, they don't need to be.
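A rough sketch of what an externally managed loop with a review gate per task might look like. This is not the orchestrator being built, just an illustration under assumptions: `run_agent` and `review` are hypothetical callables supplied by the caller, and the task list is a plain list of strings standing in for an OpenSpec task list.

```python
def run_orchestrator(tasks, run_agent, review, max_attempts=3):
    """Run tasks sequentially, gating each one on a review step.

    run_agent(task) -> output   (hypothetical: invokes a coding agent)
    review(task, output) -> bool (hypothetical: the review gate)

    The agent only ever sees a single task; the loop lives outside it.
    """
    results = []
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            output = run_agent(task)
            if review(task, output):
                results.append({"task": task, "output": output, "attempts": attempt})
                break
        else:
            # No attempt passed the gate: stop instead of building on a bad step.
            raise RuntimeError(
                f"task failed review after {max_attempts} attempts: {task}"
            )
    return results
```

Because the loop stops at the first task that cannot pass its gate, later tasks never build on unreviewed work.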


fifteen years ago I used to do mobile pentests for banks, and when we could not find anything significant for the reports we could always count on “lack of rooting detection” and pin the risk on some vague mobile banking malware threat pushed by marketing. I am sorry I contributed to this nonsense.

100% security theater, and here we are.


It's understandable; I would maybe expect to undergo an extra step in verification for a sensitive app like, "we noticed this is the first time you are using this system that is not locked down; please type in the token we have mailed you".

But locking users out (which may not directly be the bank's fault for relying on OS's security APIs) seems anti-competitive.


ah I also did my own sandbox and at least twice the agent inside tried really hard to go around the firewall, so I ended up intercepting calls to `connect` to return a message that says "Connection refused by the sandbox, don't try to bypass".

Code here: https://github.com/gbrindisi/agentbox
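For readers who want to see the idea in miniature, here is an in-process Python sketch that wraps `socket.socket.connect` and refuses anything outside an allowlist with an explicit message. This is an illustration only and not how the linked repository does it: a real sandbox would intercept at a lower level, since code inside the process can undo a monkeypatch. The `sandboxed_connect` name and the `allowed_hosts` parameter are hypothetical.

```python
import socket
from contextlib import contextmanager

SANDBOX_MESSAGE = "Connection refused by the sandbox, don't try to bypass"


@contextmanager
def sandboxed_connect(allowed_hosts=frozenset()):
    """Temporarily refuse outbound connects to hosts not in allowed_hosts."""
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        # AF_INET/AF_INET6 addresses are tuples whose first item is the host;
        # other families (e.g. Unix socket paths) are compared as-is.
        host = address[0] if isinstance(address, tuple) else address
        if host not in allowed_hosts:
            raise ConnectionRefusedError(SANDBOX_MESSAGE)
        return original_connect(self, address)

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        # Always restore the original method, even if the body raised.
        socket.socket.connect = original_connect
```

The point of the explicit message is that the agent reads the error text, so it can tell a deliberate policy from a transient network failure.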


the most annoying thing with Google Workspace is that you need super admin privilege to properly audit the environment programmatically, I believe because of the cloud-identity api.


I noticed that too and it’s kinda scary. Soon we will have the opposite of canceling, where the target will be deepfaked to say everything and its opposite to nullify their signal to noise ratio.


The CrowdStrike incident taught us that no one is going to review any dependency whatsoever.


Yep, that's what late stage capitalism leaves you with: consolidation, abuse, helplessness and complacency/widespread incompetence as a result


I wonder how far I could go with a bare-bones agent, prompted to take advantage of this, with Sonnet and the Bash tool only, so that it will always use the tool just to run `python -c …`
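As a sketch of the tool side of that idea, the handler below runs a snippet the way `python -c` would and returns what an agent would need to see. It is an assumption-laden illustration: `run_python_snippet` is a hypothetical name, it skips the shell entirely by calling the interpreter directly, and it does no sandboxing at all.

```python
import subprocess
import sys


def run_python_snippet(code, timeout=10):
    """Run `python -c <code>` in a child process and capture the result.

    Hypothetical tool handler: no sandboxing, only a timeout.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],  # argument list, so no shell quoting issues
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


if __name__ == "__main__":
    print(run_python_snippet("print(sum(range(10)))"))
```

Returning stderr and the exit code alongside stdout gives the model a chance to fix its own snippet on the next turn.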

