The basic technique (as has been publicly described by Anthropic) is you ask one agent to come up with a test case that triggers, say, an ASan use-after-free. Then you have a second agent that validates the test case. This eliminates a lot of false positives. It gets a little tricky when you allow the first agent to modify the code, which is necessary for things like sandbox escapes where you want to demonstrate that sending bad IPC causes problems.
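For a sense of what "triggers an ASan use-after-free" means concretely, here's a minimal generic sketch (my own example, not from the Anthropic work) that AddressSanitizer flags when the program is built with -fsanitize=address. The validating agent's job is essentially to confirm a report like this actually appears when the test case runs:

    /* Minimal heap use-after-free; build with: clang -fsanitize=address uaf.c */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        char *buf = malloc(16);
        if (!buf) return 1;
        buf[0] = 'A';
        free(buf);
        printf("%c\n", buf[0]);  /* read after free: ASan aborts here with a
                                    heap-use-after-free report */
        return 0;
    }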
And the part that follows is even more important: the agent that attempts the fix in code, the agent that tests the fix and reports on performance and functional impacts, and the agent that triggers the build and release to production.
Everything up to finding and validating the bug is a huge win for vuln/exploit development; everything after validating the bug is a huge win for defensive security, and a massive gap until those tools are generally available :S
Not really. The models were pointed specifically at the location of the vulnerability and given some extra guidance. That's an easier problem than simply being pointed at the entire code base.
Surely the Anthropic model also only looked at one chunk of code at a time. Cannot fit the entire code base into context. So supplying an identical chunk size (per file, function, whatever) and seeing if the open source model can find anything seems fair. Deliberately prompting with the problem is not.
The flood of reports that open source projects like curl, Linux and Chromium are getting are presumably due to public models like Open 4.6 that were released earlier this year, and not models with limited availability.
You should generally assume that in a web browser any memory corruption bug can, when combined with enough other bugs and a lot of clever engineering, be turned into arbitrary code execution on your computer.
The most important bit is the difficulty: AI finding 21 easily exploitable bugs is a lot more interesting than 21 that need all the planets to align to work.
The bugs that were issued CVEs (the Anthropic blog post says there were 22) were all real security bugs.
The level of AI spam for Firefox security submissions is a lot lower than the curl people have described. I'm not sure why that is. Maybe the size of the code base and the higher bar to submitting issues play a role.
The issue is what each of the projects considers a viable bug: if you count every localized assertion failure as a possible bug, that's different from "give me something that practically affects users."
Further, browsers have a much larger surface area for even minor fuzzing bugs. Curl's much smaller surface area is already well fuzzed and tested.
Chrome has better fuzzing and tests too. Firefox has had fewer resources compared to Google ofc, so understandable.
Ofc not saying it wasn't good. But given the LLM costs I find it hard to believe it was worth it, compared to just better and more innovative fuzzing, which would possibly scale better.
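For comparison, the kind of fuzzing I mean is roughly a libFuzzer harness like the sketch below. The parse_header() target and its length-field bug are made up for illustration, not curl code; the point is that once a harness exists, the fuzzer plus ASan finds and minimizes this class of bug for the cost of CPU time:

    /* Minimal libFuzzer harness sketch; build with:
       clang -fsanitize=fuzzer,address harness.c */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Made-up target: trusts a length field taken from the input. */
    static void parse_header(const uint8_t *data, size_t size) {
        if (size < 1) return;
        size_t claimed = data[0];             /* attacker-controlled length */
        uint8_t copy[64];
        if (claimed < sizeof(copy))
            memcpy(copy, data + 1, claimed);  /* overreads when claimed > size - 1 */
    }

    /* libFuzzer calls this with mutated inputs; ASan turns the overread
       into a crash the fuzzer can then minimize and report. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
        parse_header(data, size);
        return 0;
    }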
I think the trick to making the "shorts" feature stop showing scantily clad women is to use it actively a bit, and only watch the videos that are decidedly something else. I did that for a while and now my videos are like "let's see what happens when you pour lava on some soda bottles," which I'm not sure I care that much about, but at least it isn't embarrassing.