Totally agree, though ironically Claude Code works way better with Excel than I expected.
I even tried telling Copilot to convert each sheet to a CSV on one attempt, THEN do calculations. It just ignored that and failed miserably, ironically outputting a list of files that it should have made, along with the broken Python script. I found this very amusing.
It compacted at least twice but continued with no real issues.
Anyway, please try it if you find it unbelievable. FWIW, I didn't expect it to work like it did. Opus 4.5 is pretty amazing at long-running tasks like this.
I think the skepticism here is that without tests or a _lot_ of manual QA, how would you know that it did it correctly?
Maybe you did one or the other, but “nearly one-shotted” doesn’t tend to mean that.
Claude Code more than occasionally likes to make weird assumptions, and it’s well known that it hallucinates quite a bit more near the context limit, and that compaction only partially helps this issue.
If you’re porting some formulas from one language to another, “correct” can be defined as “gets the same answers as before.” Assuming you can run both easily, this is easy to write a property test for.
Sure, maybe that’s just building something that’s bug-for-bug compatible, but it’s something Claude can work with.
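To make that concrete, here's a minimal sketch of the kind of property test I mean. The two model functions are made-up stand-ins: run_excel_model would be however you drive the original workbook (xlwings, LibreOffice headless, a formula engine), and run_python_model is the ported code.

    # Property test sketch for "gets the same answers as before".
    # Both model functions are placeholders: swap in the real spreadsheet
    # driver and the real Python port.
    import math
    from hypothesis import given, settings, strategies as st

    def run_excel_model(revenue: float, cost: float) -> float:
        return revenue - cost  # placeholder: evaluate the original workbook here

    def run_python_model(revenue: float, cost: float) -> float:
        return revenue - cost  # placeholder: call the ported model here

    @given(revenue=st.floats(0, 1e9, allow_nan=False),
           cost=st.floats(0, 1e9, allow_nan=False))
    @settings(max_examples=200, deadline=None)
    def test_port_matches_spreadsheet(revenue, cost):
        assert math.isclose(run_python_model(revenue, cost),
                            run_excel_model(revenue, cost),
                            rel_tol=1e-9, abs_tol=1e-9)

If the old sheet has copy-paste bugs, this only proves bug-for-bug compatibility, which is exactly the caveat above.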
I generally agree with you, but I tried to get it to modernize a fairly old SaaS codebase, and it couldn't. It had all the code right there, all it had to do was change a few lines, upgrade a few libraries, etc, but it kept getting lots of things wrong. The HTML was wrong, the CSS was completely missing, basic views wouldn't work, things like that.
I have no idea why it had so much trouble with this generally easy task. Bizarre.
I'm having trouble reconciling "30 sheet mind numbingly complicated Excel financial model" and "Two or three prompts got it there, using plan mode to figure out the structure of the Excel sheet, then prompting to implement it. It even added unit tests to the Python model itself, which I was impressed with!"
"1 or 2 plan mode prompts" to fully describe a 30-sheet complicated doc suggests a massively higher level of granularity than Opus initial plans on existing codebases give me or a less-than-expected level of Excel craziness.
And the tooling harnesses have been telling the models to add testing to things they make for months now, so why's that impressive or surprising?
No, it didn't make a giant plan of every detail. It made a plan of the core concepts, and then when it was in implementation mode it kept checking the Excel file to get more info. It took around 30 minutes in implementation mode to build it.
I was impressed because the prompt didn't ask it to do that. It doesn't normally add tests for me without asking, YMMV.
Did it build a test suite for the Excel side? A fuzzer or such?
It's the cross-concern interactions that still get me.
80% of what I think about these days when writing software is how to test more exhaustively without build times being absolute shit (and not necessarily actually being exhaustive anyway).
I'm not sure being able to verify that it's vaguely correct really solves the issue. Consider how many edge cases inhabit a "30 sheet, mind-numbingly complicated" Excel document. Verifying equivalence sounds nontrivial, to put it mildly.
They don't care. This is clearly someone looking to score points and impress with the AI magic trick.
The best part is that they can say the AI will get some stuff wrong, they knew that, and it's not their fault when it breaks. Or more likely, it'll break in subtle ways, nobody will ever notice and the consequences won't be traced back to this. YOLO!
> They have an excel sheet next to it - they can test it against that.
It used to be that we'd fix the copy-paste bugs in the Excel sheet when we converted it to a proper model; good to know that we'll now preserve them forever.
Ollama is CLI/API "first". LM Studio is a proper full-blown GUI with chat features etc. It's far easier to use than Ollama, at least for non-technical users (though they are increasingly merging in functionality, with LM Studio adding CLI/API features and Ollama adding more UI).
Even as a technical person, when I wanted to play with running models locally, LM Studio turned it into a couple of button clicks.
Without much background, you’re finding models, chatting with them, and have an OpenAI-compatible API with logging. I haven’t seen the new version, but LM Studio was already pretty great.
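The OpenAI-compatible part is genuinely handy - a quick sketch, assuming LM Studio's local server is running on its default port (1234) and you've loaded a model; the model name below is a placeholder for whatever identifier the app shows you:

    # Talk to LM Studio's local server through the standard OpenAI client.
    # base_url/port are LM Studio's defaults; the api_key just has to be non-empty.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    resp = client.chat.completions.create(
        model="local-model",  # placeholder: use the model identifier LM Studio shows
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(resp.choices[0].message.content)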
There are a lot of problems with bureaucracy there (I've lived in London and Lisbon). It's a great city, but the government is insanely inefficient (compared to the UK, IME).
Long-term visa waits are 2+ years. As a personal example, Portugal was the last country _by far_ in the EU to be able to issue residency cards for UK people after Brexit (despite having a very sizeable British population). This caused a lot of practical problems, as it stuck everyone in a massive limbo - other EU countries wouldn't accept that you were a Portuguese resident with the piece of paper they gave you. It took intense lobbying by the British embassy and the European Commission to get the system in place at all.
In a commercial sense there are other problems. The court system is completely non-functional. A simple civil case can take _years_ just to get a hearing. With appeals etc. you can easily be looking at a decade. Again, there are a lot of problems with courts in the UK too, but it is on a different scale there. This causes a lot of problems, because businesses can get away with various shady stuff knowing it is basically impossible to enforce contractual terms - everything from landlords to B2B has issues.
It's got an enormous amount of promise, but until the immigration/court system improves it is very hard to do business there.
The CEO of Cloudflare has occasionally posted about this kind of stuff on Twitter (Cloudflare is a huge employer there). It's not positive, to say the least.
I suspect this year we are going to see a _lot_ more of this.
While it's good these bugs are being found and closed, the problem is twofold:
1) It takes time to get the patches through distribution
2) The vast majority of projects are not well equipped to handle complex security bugs in a "reasonable" time frame.
2 is a killer. There's so much abandonware out there, either as full apps/servers or libraries. These can't ever really be patched. Previously these weren't really worth spending effort on - might have a few thousand targets of questionable value.
Now you can spin up potentially thousands of exploits against thousands of long tail services. In aggregate this is millions of targets.
And even if this case didn't exist it's going to be difficult to patch systems quickly enough. Imagine an adversary that can drip feed zero days against targets.
Not really sure how this can be solved. I guess you'd hope that the good guys can do some sort of mega-patching of software quicker than the bad actors can exploit it.
But really, as the npm debacle showed, the industry is not in a good place when it comes to timely, secure software delivery, even without millions of potential new zero-days flying around.
No, the biggest problem at the root of all this is complexity. OpenSSL is a garbled mess. AI or not, such software should not be the security backbone of the internet.
People writing and maintaining software need to optimize for simplicity, readability, and maintainability. Whether they use an LLM to achieve that is secondary. The humans in the loop must understand what's going on.
> People writing and maintaining software need to optimize for simplicity, readability, and maintainability. Whether they use an LLM to achieve that is secondary. The humans in the loop must understand what's going on.
> 2 is a killer. There's so much abandonware out there, either as full apps/servers or libraries. These can't ever really be patched. Previously these weren't really worth spending effort on - might have a few thousand targets of questionable value.
It's worse than that. Before, the operator of a system could upgrade the distro's OpenSSL version, restart the service, and be pretty much done. Even if it was a 3rd-party vendor app, at the very least you could provide security updates for the shared libs.
Nowadays, when everything runs in containers, you have to make sure every single vendor you take containers from did that update.
The people developing exploits have an obvious way to recoup their token investment. How do the open source maintainers recoup their costs? There's a huge disparity here.
Of course, but how do you distribute the patches? My point isn't that AI can't solve it, but that if the project is abandoned there is no way to get the patches to users.
And even if there is, there is an inherent lag. Take these OpenSSL vulns. A fix has to go from OpenSSL to (say) Ubuntu. They have to backport the fixes, which isn't trivial, as they need to be tested and applied to old code versions. These fixes then need to be applied, and there are no doubt a lot of users not on a "supported" version who won't get the fix.
Even worse, something like OpenSSL is almost certainly statically linked into many apps/servers. That then requires those projects to pull the fix from upstream and repackage, and users to deploy the fix.
So it's a real issue. I'd argue that the industry isn't really able to do this well currently, never mind if the patch frequency suddenly goes up 1000x.
Probably (?) not related, but there is an issue with Claude Code for web with NuGet. It doesn't support the proxy auth mechanism that Anthropic gives it. I wonder if it's the same problem here.