It kept feeling like I was reading an advertisement, personally...
Think about what it would mean for everyone to have access to a knowledgeable, thoughtful friend
Claude can be the great equalizer
We believe Claude can be like a brilliant expert friend everyone deserves but few currently have access to
What is special/unusual that makes this so significant? (Not trying to be dismissive, legitimately asking in case I'm missing something)
I understand it's not a system prompt so there's some novelty there I suppose. Is it because someone was able to get Claude to regurgitate a (very large) document from its training data? Or is it the content of the document itself?
I've only skimmed it and there's probably some unique nuggets here and there, but at a high level the rules/guidelines didn't really jump out at me as being much different from the various system prompts for proprietary models made public over the last couple of years. Except much longer (and a bit ramble-y in some sections vs the directness of a typical system prompt).
The way it got trained is VERY different from a system prompt. They're trying to have the model's natural tendencies be to follow the concepts of the document, rather than imposing a set of rules post-hoc.
That's referring to Play Protect (virus scan-ish thing on Google-branded Android) and whatever Amazon's equivalent is, not an app-requested force uninstall of some kind.
I've been using an old Pixelbook I had lying around for 3 years as an always-on Home Assistant control panel (with adaptive dimming that defaults to extremely low brightness to reduce heat until someone approaches it).
I noticed a few months ago that the battery has started to swell significantly, pushing the back panel apart, but I haven't worked up the motivation to try removing it yet. From what I can ascertain from user experiences, it actually does boot sans battery, but I haven't confirmed that. It would be awesome if that worked and would probably give me 5+ more years of usage.
So in my case, data point is that 3 years of 24/7 use/charging of an old laptop/tablet was enough to push it over the edge and finally swell. It's really a shame how so many otherwise usable devices that could be wall mounted turn into e-waste because they won't run without a battery. With USB-C PD a well-designed device should be able to get whatever power is needed on demand but manufacturers don't really have any incentive to future proof for the .01% of users like me who would benefit.
I also wouldn't underestimate Google's ability to nudge regular users towards whichever AI surface they want to promote. My highly non-technical mom recently told me she started using Google's "AI Mode" or whatever it's called for most of her searches (she says she likes how it can search/compare multiple sites for browsing house listings and stuff).
She doesn't really install apps and never felt a need to learn a new tool like "ChatGPT" but the transition from regular Google search to "AI Search" felt really natural and also made it easy to switch back to regular search if it wasn't useful for specific types of searches.
It definitely reduces cognitive load for an average user: no need to switch between multiple apps/websites when she can look up hours/reviews via Google Maps, search for "facebook.com" to navigate to a site, and now run AI searches, all in the same familiar places on her phone. So I think Google is still pretty "sticky" despite ChatGPT being a buzzword everyone hears, now that Google has caught up to OpenAI in terms of model capability/features.
Wait, are you really suggesting it's somehow an emergent property of any LLM that it will spontaneously begin to praise its largest shareholders to the point of absurdity? Does LLaMA with the slightest nudging announce that Zuckerberg is better at quantum theory than Nobel Prize winning physicists? Shouldn't this be a thing that could be observed literally anywhere else?
I really need something like this set up for checks I want Claude to run before handing a task off to me as "complete". It routinely ignores my instructions about checklist items that need to be satisfied for a task to be considered successful. I have a helper script documented in CLAUDE.md that lets Claude or me get specific build/log outputs with a few one-liner commands, yet Claude can't be bothered to remember to run them half the time.
Way too frequently Claude goes, "The task is fully implemented, error free with tests passing and no bugs or issues!" and I have to reply "did you verify server build/log outputs with run-dev per CLAUDE.md". It immediately knows the command I am referencing from the instructions buried in its context already, notices an issue and then goes back and fixes it correctly the second time. Whenever it happens it instantly makes an agentic coding session go from feeling like breezy, effortless fun to pulling teeth.
I've started to design a subagent to handle chores after every task to avoid context pollution but it sounds like hooks are the missing piece I need to deterministically guarantee it will run every time instead of just when Claude feels the vibes are right.
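Roughly the kind of check I have in mind, as a sketch. run-dev is my own CLAUDE.md helper, so the "build"/"logs" subcommands here are placeholders for whatever your setup actually exposes. A hook could run this after each task, and a nonzero exit keeps Claude from declaring victory:

    #!/usr/bin/env python3
    # Deterministic post-task verification, meant to be wired into a hook or
    # run manually. "run-dev" and its subcommands are placeholders for my own
    # CLAUDE.md helper; adjust to whatever your project actually provides.
    import subprocess
    import sys

    CHECKS = [
        ("server build output", ["./run-dev", "build"]),   # hypothetical subcommand
        ("server log output", ["./run-dev", "logs"]),      # hypothetical subcommand
    ]

    def main() -> int:
        failures = []
        for label, cmd in CHECKS:
            result = subprocess.run(cmd, capture_output=True, text=True)
            output = result.stdout + result.stderr
            if result.returncode != 0 or "error" in output.lower():
                failures.append(f"{label} (exit {result.returncode}):\n{output[-2000:]}")
        if failures:
            # Nonzero exit blocks the "task complete" handoff and surfaces the
            # failing output so it can be fed back into the session.
            print("\n\n".join(failures), file=sys.stderr)
            return 2
        print("all checklist commands passed")
        return 0

    if __name__ == "__main__":
        sys.exit(main())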
I was pretty disappointed to learn that the METR metric isn't actually evaluating a model's ability to complete long-duration tasks. They're using the estimated time a human would take on a given task. But it did explain my increasing bafflement at how the METR line keeps steadily going up despite my personal experience coding daily with LLMs, where they still frequently struggle to work independently for 10 minutes without veering off task after hitting a minor roadblock.
On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours. This allows us to characterize the abilities of a given model by “the length (for humans) of tasks that the model can successfully complete with x% probability”.
For each model, we can fit a logistic curve to predict model success probability using human task length. After fixing a success probability, we can then convert each model’s predicted success curve into a time duration, by looking at the length of task where the predicted success curve intersects with that probability.
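In other words, the headline number falls out of something like this toy calculation (made-up data points, not METR's actual code): fit a logistic curve of success probability vs. human task length, then read off where it crosses a fixed probability like 50%:

    # Toy sketch of the time-horizon calculation described above; the data
    # points are invented for illustration.
    import numpy as np
    from scipy.optimize import curve_fit

    # (human task length in minutes, did the model succeed?)
    human_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
    success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0], dtype=float)

    def logistic(log_t, a, b):
        # P(success) as a function of log human task length
        return 1.0 / (1.0 + np.exp(-(a + b * log_t)))

    (a, b), _ = curve_fit(logistic, np.log(human_minutes), success, p0=[3.0, -1.0])

    # Invert the fitted curve at a fixed success probability (here 50%)
    p = 0.5
    horizon = np.exp((np.log(p / (1 - p)) - a) / b)
    print(f"50% time horizon: about {horizon:.0f} human-minutes")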
It makes perfect sense to use human times as a baseline, because otherwise the test would be biased towards models with slower inference.
If model A generates 10 tokens a second and model B generates 100 tokens a second, then using real LLM inference time puts A at a massive 10x advantage, all other things equal.
But it doesn't evaluate the area of LLM agent performance where I'm most eager to see improvements: complex unattended tasks that require adapting to unexpected challenges, problem solving, and handling ambiguity over a long duration, without a human steering the agent back in the right direction before it hits a wall or starts causing damage.
If it takes me 8 hours to create a pleasant-looking to-do app, and Gemini 3 can one-shot that in 5 minutes, that's certainly impressive, but it doesn't help me evaluate whether I could drop an agent into my complex, messy project and expect it to successfully implement a large feature that may require reading docs, installing a new NPM package, troubleshooting DB configuration, etc. for 30 min to 1 hr without going off the rails.
It's a legitimate benchmark, I'm not disputing that, but it unfortunately isn't measuring the area that could be a significant productivity multiplier in my day-to-day work. The METR time horizon score is still susceptible to the same pernicious benchmaxxing, whereas I had previously hoped it was measuring something much closer to my real-world usage of LLM agents.
Improvements in long-duration, multi-turn unattended development would save me a lot of babysitting and frustrating back-and-forth with Claude Code/Codex, which currently saps some of the enjoyment out of agentic development for me and requires tedious upfront work setting up effective rules and guardrails to work around those deficits.
Android began asking for individual permission confirmation on first use in Android 6.0 (released in 2015), so the grant-all-on-install model hasn't been how it works for a very long time.
Also, your narrative about iOS moving from locked down to opening things up over time isn't entirely accurate: when iOS (iPhoneOS) was first released, it didn't have any concept of permissions at all! Apps could use whatever API the OS offered with the user none the wiser. At that time, the Android Market forcing developers to disclose which permissions were required was seen as unusually transparent and secure. Random iPhone apps deceptively scanning contacts pushed Apple to adopt a permissions model several years after the iPhone was first released.
The two platforms have historically leapfrogged each other in various ways, but at this point they have started to converge as mobile has settled into being a boring appliance instead of a groundbreaking new computing paradigm. Apart from sideloading, notifications, and some minor annoyances here and there, I can almost forget which OS I'm using as I switch between iOS and Android (thanks to gestures removing the trademark home/back navigation distinctions).