> I never really understood why you have to stuff all the tools in the context.
You probably don't for... like, trivial cases?
...but, tool use is usually the most fine-grained point in an agent's step-by-step implementation plan; so when planning, if you don't know what tool definitions exist, an agent might end up solving a problem naively with primitive operations, when a single tool already exists that does that, or does part of it.
Like, it's not quite as simple as "Hey, do X"
It's more like: "Hey, make a plan to do X. When you're planning, first fetch a big list of the tools that seem vaguely related to the task and make a step-by-step plan keeping in mind the tools available to you"
...and then, for each step in the plan, you can do a tool search to find the best tool for x, then invoke it.
Without a top-level context of the tools, or tool categories, I think you'll end up in dead ends with agents trying to use very low-level tools to do high-level tasks and just spinning.
The higher level your tool definitions are, the worse the problem is.
I've found this is the case even now with MCP, where sometimes you have to explicitly tell an agent to use particular tools, not to try to re-invent stuff or use bash commands.
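To make the shape of that concrete, here's a rough sketch of the two-phase flow: plan against a coarse tool catalog, then do a per-step tool search. None of this is a real framework or MCP API; the registry, search_tools, and the stubbed LLM call are made-up stand-ins.

    from dataclasses import dataclass

    @dataclass
    class Tool:
        name: str
        description: str

    # Hypothetical registry; in practice this would hold every tool definition.
    REGISTRY = [
        Tool("convert_format", "Convert an image between file formats"),
        Tool("resize_image", "Resize an image to the given dimensions"),
        Tool("batch_apply", "Apply another tool to every file in a directory"),
    ]

    def search_tools(query: str, top_k: int = 5) -> list[Tool]:
        """Naive keyword match; a real system would use embeddings or categories."""
        terms = query.lower().split()
        scored = [(sum(t in tool.description.lower() for t in terms), tool) for tool in REGISTRY]
        return [tool for score, tool in sorted(scored, key=lambda st: -st[0]) if score > 0][:top_k]

    def call_llm(prompt: str) -> list[str]:
        # Stand-in for a real model call; returns a canned plan for illustration.
        return ["convert every image in the folder to png format",
                "resize each converted image to 512 by 512 dimensions"]

    def plan_with_catalog(task: str) -> list[str]:
        # Phase 1: plan *with* a coarse catalog in context, so steps are phrased
        # in terms of tools that actually exist rather than primitive operations.
        catalog = "\n".join(f"- {t.name}: {t.description}" for t in search_tools(task))
        return call_llm(f"Task: {task}\nAvailable tools:\n{catalog}\nMake a step-by-step plan.")

    def run(task: str) -> None:
        for step in plan_with_catalog(task):
            tool = search_tools(step, top_k=1)[0]   # Phase 2: per-step tool search
            print(f"[{tool.name}] {step}")          # stand-in for actually invoking it

    run("convert and resize every image in ./photos")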
Ah, the dream: a cross-platform App Store where you can install apps into any client application that supports MCP, but which is open, free and agentic.
It’s basically a “web App Store” and we side step the existing app stores (and their content guidelines, security restrictions and billing requirements) because it’s all done via a mega app (the MCP client).
How could it go wrong?
If only someone had done this before, we wouldn't be stuck in Apple et al.'s walled gardens…
Seriously though; honest question: this is literally circumventing platform requirements to use the platform app stores. How do you imagine this is going to be allowed?
Is ChatGPT really big enough that they can pull the "we're gonna do it, whatcha gonna do?" move on Apple?
Who’s going to curate this app store so non technical users (the explicitly stated audience) can discover these MCP apps?
It feels like MCP itself: half-baked. Overly ambitious. "We'll figure the details out later."
The apps are LLM-agnostic, so all MCP apps will be portable. Economically, this means developers don't have to convince users to pay $20 a month; these users are already paying that. Devs just have to convince users to buy the app on the platform.
I don’t see this being the future state. We’d be talking about a world where any and all apps exist inside of fucking ChatGPT and that just sounds ridiculous.
This was sort of my takeaway too. The OP got help from someone else and thought to herself “if only I’d tried harder I could’ve done this on my own”. That doesn’t seem like a healthy takeaway.
I didn’t take it that way at all. I took it as “I was blinded from the actual solution because my vision was artificially narrow due to my past experiences with this person.” They didn’t ask for help, their partner intervened for them with a completely different and more direct approach.
I have a kid going thru this right now. It’s very disheartening and frustrating to see, because even with coaching and help, they don’t see the help and suggestions as solutions because they simply can’t see it. And as a parent you don’t want to have to intervene, you want them to learn how to dig their way out of it. But it’s tough to get them to dig when they don’t believe in shovels.
I guess I really don’t like this message because I am a disabled person. In the exercise that she describes where an instructor tells people to stand up from a position that they think they can’t stand up from, what if I actually can’t stand up? It might lead me to believe that perhaps I’m simply not trying enough.
You might think this contrived, but when people tell you over and over that you’re not trying hard enough because of things you can’t control, you internalize it.
To me — someone who has to ask for help — it seems that she didn't really notice that the help was the thing that helped.
What if the cops, the friend, and the consulate had all said, "We don't care about a random mentally ill stranger, on a different continent, sending threats. You said he's been doing this for years and has done nothing yet? Sounds like you're safe. We have real crimes to solve. We have real murders to figure out. Call back if he shows up at your house, but he almost certainly never will." Or maybe the FBI says, "Oh, okay. Thanks. We'll keep an eye out, but now this guy's part of an investigation, so we can't talk about him with you," and then they do nothing, the friend doesn't reply, and the consulate says, "We're not obligated to reply." Those seem like very plausible outcomes of the husband's help, too.

So would that no longer have been the "actual solution"? It seems the "actual solution" is only determined after the fact, once there is a success, and that success is used as a proxy for whether the actions were really trying. If she had never replied and the guy had stopped texting after a year, would that also have been Actually Trying? Maybe it would have, because one could come up with a post-hoc explanation as to why it was an Actual Try.

It feels sloppy not to distinguish what makes something an Actual Try versus a successful try, because Actually Trying should be able to count failures as sincere attempts. Otherwise, Actually Trying collapses into a synonym for success.
1) You're moving state into an arbitrary, untrusted, easy-to-modify location.
2) You're allowing users to "deep link" into a page deep inside some funnel that may or may not be valid, or even exist, at some future point in time, never mind letting them skip the messages/whatever further up.
You probably don't want to do either of those things.
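If funnel state has to round-trip through a URL at all, one common mitigation (sketched below with stdlib only; the step names and secret are made up) is to sign it server-side and fall back to the start of the funnel whenever the link is forged, stale, or points at a step that no longer exists:

    import hmac, hashlib

    SECRET = b"rotate-me-server-side"          # hypothetical; never hard-code a secret in real life
    VALID_STEPS = ["cart", "shipping", "payment", "confirm"]

    def sign_step(step: str) -> str:
        """Produce the token you would actually put in the deep link."""
        sig = hmac.new(SECRET, step.encode(), hashlib.sha256).hexdigest()
        return f"{step}.{sig}"

    def resolve_deep_link(token: str) -> str:
        """Render the requested step only if it verifies; otherwise start the funnel over."""
        try:
            step, sig = token.rsplit(".", 1)
        except ValueError:
            return VALID_STEPS[0]
        expected = hmac.new(SECRET, step.encode(), hashlib.sha256).hexdigest()
        if step not in VALID_STEPS or not hmac.compare_digest(sig, expected):
            return VALID_STEPS[0]
        return step

    print(resolve_deep_link(sign_step("payment")))   # -> payment
    print(resolve_deep_link("payment.forged-sig"))   # -> cart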
Really, the interface isn't a meaningful part of it. I also like cmd-L, but Claude just does better at writing code.
...also, it's nice that Anthropic is just focusing on making cool stuff (like skills), while the folks at Cursor are... I dunno. Whatever it is they're doing with Cursor 2.0 :shrug:
There's something in the prompting, tooling and heuristics inside the Claude Code CLI itself that makes it more than just the model it's talking to, and that becomes clear if you point your ANTHROPIC_URL at another model: the results are often almost equivalent.
Whereas I tried Kilo Code and Copilot and JetBrains' agent and others directly against Sonnet 4 and the output was ... not good ... in comparison.
I have my criticisms of Claude but still find it very impressive.
Tokens are a fine-grained billable attribute that lets you add microtransactions to your service.
Not in all cases, but in many; we exist in a complicated world of enshittification + inflation.
Inflation means you need to somehow make more money.
You can either: raise prices (unpopular), make your product cheaper (unpopular), or add new features and raise the price on the basis of "new value!".
You see major organisations doing this: same product, but now with AI! …and it's more expensive. Or it's a mandatory bundle. Or it's "premium".
Long story short, a lot of companies see the way that cloud providers do billing (usage-based billing, no caps, you get the bill after using it) as the ideal end state.
Token-based billing moves towards that world, which isn't just "profit!" …it's companies trying to deal with the reality of a complicated marketplace that will punish them for raising prices.
…and it is bad. I’m just saying that it’s kind of naive to think so many companies are doing this just as a “me tooooo!”. Come on; even if you’re hunting a funding round, the people running these companies are (mostly) not complete idiots.
No one is adding AI features because it’s fun, or they’re bored.
…
…ok, there are some idiots. Most people have a bigger vision for these features than just annoying their users.
> We argue that systematic problem solving is vital and call for rigorous assurance of such capability in AI models. Specifically, we provide an argument that structureless wandering will cause exponential performance deterioration as the problem complexity grows, while it might be an acceptable way of reasoning for easy problems with small solution spaces.
I.e., thinking harder still samples randomly from the solution space.
You can allocate more compute to the “thinking step”, but they are arguing that for problems with a very big solution space, adding more compute is never going to find a solution, because you’re just sampling randomly.
…and that it only works for simple problems because if you just randomly pick some crap from a tiny distribution you’re pretty likely to find a solution pretty quickly.
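A toy calculation of that scaling claim (my numbers, not the paper's): treat each reasoning attempt as an independent uniform guess over a solution space of size N, and see how far a fixed thinking budget gets you.

    # Chance that a fixed budget of 100 random guesses hits the solution,
    # for solution spaces of increasing size.
    for n in (10, 1_000, 1_000_000):
        p = 1 - (1 - 1 / n) ** 100
        print(f"N = {n:>9,}: P(success within 100 samples) = {p:.3f}")

    # N =        10: 1.000   (easy problems: random sampling is fine)
    # N =     1,000: 0.095
    # N = 1,000,000: 0.000   (big spaces: more budget barely moves the needle)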
I dunno. The key here is that this is entirely on the model-inference side. I feel like agents can help contain the solution space for complex problems with procedural tool calling.
So… dunno. I feel kind of "eh, whatever" about the result.
If someone opens a PR to one of my repos with no context, I ban them.
There’s too much AI spam out there right now.
Publishing ‘@provenance-labs/lodash’ as a test, I suppose. Ok. Leaving it up? Looks like spam.
Badgering the author in a private email? Mmm. Definitely not.
This isn't a bug, it's a feature. There's a contributing guide which clearly says: unless a feature gets community interest, it's not happening. If you want a feature, talk about it and rouse community interest.
Overall: maybe this wasn’t the right way to engage.
Sometimes you just have to walk away from these situations, because the harder you chase, the more it looks like you’re in the wrong.
…it certainly looks, right now, like the lodash author wasn’t out of line with this, to me.
> Overall: maybe this wasn’t the right way to engage
Lex Livingroom. If you are among friends you can surely criticize a sweater, but if you come barging in uninvited and criticize the same sweater, you're in for a bad time.
Anyone seriously using these tools knows that context engineering and detailed specific prompting is the way to be effective with agent coding.
Just take it to the extreme and you'll see: what if you autocomplete from a single word? A single character?
The system you're using is increasingly generating random output instead of what you were either a) trying to do, or b) told to do.
It's funny because it's like,
"How can we make vibe coding even worse?"
"…I know, let's just generate random code from random prompts"
There have been multiple recent posts about how to direct agents using a combination of a planning step, context summary/packing, etc. to craft detailed prompts that agents can effectively act on in large code bases.
…or yeah, just hit tab and go make a coffee. Yolo.
This could have been a killer feature about using a research step to enhance a user prompt and turn it into a super prompt; but it isn't.
What’s wrong with autocompleting the prompt? There exists entropy even in the English language and especially in the prompts we feed to the LLMs. If I write something like “fix the ab..” and it autocompletes to AbstractBeanFactory based on the context, isn’t it useful?
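For what it's worth, a toy version of that kind of context-aware completion (purely illustrative; the symbol list and scoring are made up) could be no more than ranking identifiers pulled from the open project:

    def complete(prefix: str, identifiers: list[str], recently_used: set[str]) -> list[str]:
        """Rank project identifiers by prefix match, preferring recently touched symbols."""
        matches = [i for i in identifiers if i.lower().startswith(prefix.lower())]
        return sorted(matches, key=lambda i: (i not in recently_used, len(i)))

    project_symbols = ["AbstractBeanFactory", "AbortController", "absolute_path"]
    print(complete("ab", project_symbols, recently_used={"AbstractBeanFactory"}))
    # -> ['AbstractBeanFactory', 'absolute_path', 'AbortController']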