Totally agree, though ironically Claude Code works way better with Excel than I expected.
I even tried telling Copilot to convert each sheet to a CSV on one attempt, THEN do calculations. It just ignored that and failed miserably, ironically outputting a list of files that it should have made, along with the broken Python script. I found this very amusing.
It compacted at least twice but continued with no real issues.
Anyway, please try it if you find it unbelievable. FWIW, I didn't expect it to work like it did. Opus 4.5 is pretty amazing at long-running tasks like this.
I think the skepticism here is that without tests or a _lot_ of manual QA, how would you know that it did it correctly?
Maybe you did one or the other, but “nearly one-shotted” doesn’t tend to mean that.
Claude Code more than occasionally likes to make weird assumptions, and it’s well known that it hallucinates quite a bit more near the context limit, and that compaction only partially helps this issue.
If you’re porting some formulas from one language to another, “correct” can be defined as “gets the same answers as before.” Assuming you can run both easily, this is easy to write a property test for.
Sure, maybe that’s just building something that’s bug-for-bug compatible, but it’s something Claude can work with.
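To make that concrete, here's a minimal sketch of the kind of property test I mean. The two model functions are made-up stand-ins: run_excel_model would be however you drive the original workbook (xlwings, LibreOffice headless, a formula engine), and run_python_model is the ported code.

    # Property test sketch for "gets the same answers as before".
    # Both model functions are placeholders: swap in the real spreadsheet
    # driver and the real Python port.
    import math
    from hypothesis import given, settings, strategies as st

    def run_excel_model(revenue: float, cost: float) -> float:
        return revenue - cost  # placeholder: evaluate the original workbook here

    def run_python_model(revenue: float, cost: float) -> float:
        return revenue - cost  # placeholder: call the ported model here

    @given(revenue=st.floats(0, 1e9, allow_nan=False),
           cost=st.floats(0, 1e9, allow_nan=False))
    @settings(max_examples=200, deadline=None)
    def test_port_matches_spreadsheet(revenue, cost):
        assert math.isclose(run_python_model(revenue, cost),
                            run_excel_model(revenue, cost),
                            rel_tol=1e-9, abs_tol=1e-9)

If the old sheet has copy-paste bugs, this only proves bug-for-bug compatibility, which is exactly the caveat above.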
I generally agree with you, but I tried to get it to modernize a fairly old SaaS codebase, and it couldn't. It had all the code right there, all it had to do was change a few lines, upgrade a few libraries, etc, but it kept getting lots of things wrong. The HTML was wrong, the CSS was completely missing, basic views wouldn't work, things like that.
I have no idea why it had so much trouble with this generally easy task. Bizarre.
I'm having trouble reconciling "30 sheet mind numbingly complicated Excel financial model" and "Two or three prompts got it there, using plan mode to figure out the structure of the Excel sheet, then prompting to implement it. It even added unit tests to the Python model itself, which I was impressed with!"
"1 or 2 plan mode prompts" to fully describe a 30-sheet complicated doc suggests a massively higher level of granularity than Opus initial plans on existing codebases give me or a less-than-expected level of Excel craziness.
And the tooling harnesses have been telling the models to add testing to things they make for months now, so why's that impressive or surprising?
No, it didn't make a giant plan of every detail. It made a plan of the core concepts, and then when it was in implementation mode it kept checking the Excel file to get more info. It took around 30 minutes in implementation mode to build it.
I was impressed because the prompt didn't ask it to do that. It doesn't normally add tests for me without asking, YMMV.
Did it build a test suite for the Excel side? A fuzzer or such?
It's the cross-concern interactions that still get me.
80% of what I think about these days when writing software is how to test more exhaustively without build times being absolute shit (and not necessarily actually being exhaustive anyway).
I'm not sure being able to verify that it's vaguely correct really solves the issue. Consider how many edge cases inhabit a "30 sheet, mind-numbingly complicated" Excel document. Verifying equivalence sounds nontrivial, to put it mildly.
They don't care. This is clearly someone looking to score points and impress with the AI magic trick.
The best part is that they can say the AI will get some stuff wrong, they knew that, and it's not their fault when it breaks. Or more likely, it'll break in subtle ways, nobody will ever notice and the consequences won't be traced back to this. YOLO!
> They have an excel sheet next to it - they can test it against that.
It used to be that we'd fix the copy-paste bugs in the Excel sheet when we converted it to a proper model; good to know that we'll now preserve them forever.
Ollama is CLI/API "first". LM Studio is a proper full-blown GUI with chat features etc. It's far easier to use than Ollama, at least for non-technical users (though they are increasingly merging in functionality, with LM Studio adding CLI/API features and Ollama adding more UI).
Even as a technical person, when I wanted to play with running models locally, LM Studio turned it into a couple of button clicks.
Without much background, you’re finding models, chatting with them, and have an OpenAI-compatible API with logging. I haven’t seen the new version, but LM Studio was already pretty great.
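The OpenAI-compatible part is genuinely handy - a quick sketch, assuming LM Studio's local server is running on its default port (1234) and you've loaded a model; the model name below is a placeholder for whatever identifier the app shows you:

    # Talk to LM Studio's local server through the standard OpenAI client.
    # base_url/port are LM Studio's defaults; the api_key just has to be non-empty.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    resp = client.chat.completions.create(
        model="local-model",  # placeholder: use the model identifier LM Studio shows
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(resp.choices[0].message.content)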
There are a lot of problems with bureaucracy there (I've lived in London and Lisbon). It's a great city, but the government is insanely inefficient (compared to the UK, IME).
Long-term visa waits are 2+ years. As a personal example, Portugal was the last country _by far_ in the EU to be able to issue residency cards for UK people after Brexit (despite having a very sizeable British population). This caused a lot of practical problems, as it stuck everyone in a massive limbo - other EU countries wouldn't accept that you were a Portuguese resident with the piece of paper they gave you. It took intense lobbying by the British embassy and the European Commission to get the system in place at all.
In a commercial sense there are other problems. The court system is completely non-functional. A simple civil case can take _years_ just to get a hearing. With appeals etc. you can easily be looking at a decade. Again, there are a lot of problems with courts in the UK too, but it is on a different scale there. This causes a lot of problems, because businesses can get away with various shady stuff knowing it is basically impossible to enforce contractual terms - everything from landlords to B2B has issues.
It's got an enormous amount of promise, but until the immigration/court system improves it is very hard to do business there.
The CEO of Cloudflare has occasionally posted about this kind of stuff on Twitter (Cloudflare is a huge employer there). It's not positive, to say the least.
I suspect this year we are going to see a _lot_ more of this.
While it's good these bugs are being found and closed, the problem is twofold:
1) It takes time to get the patches through distribution
2) The vast majority of projects are not well equipped to handle complex security bugs in a "reasonable" time frame.
2 is a killer. There's so much abandonware out there, either as full apps/servers or libraries. These can't ever really be patched. Previously these weren't really worth spending effort on - might have a few thousand targets of questionable value.
Now you can spin up potentially thousands of exploits against thousands of long tail services. In aggregate this is millions of targets.
And even if this case didn't exist it's going to be difficult to patch systems quickly enough. Imagine an adversary that can drip feed zero days against targets.
Not really sure how this can be solved. I guess you'd hope that the good guys can do some sort of mega-patching of software quicker than the bad actors can exploit it.
But really, as the npm debacle showed, the industry is not in a good place when it comes to timely, secure software delivery, even without millions of potential new zero-days flying around.
No, the biggest problem at the root of all this is complexity. OpenSSL is a garbled mess. AI or not, such software should not be the security backbone of the internet.
People writing and maintaining software need to optimize for simplicity, readability, and maintainability. Whether they use an LLM to achieve that is secondary. The humans in the loop must understand what's going on.
> People writing and maintaining software need to optimize for simplicity, readability, and maintainability. Whether they use an LLM to achieve that is secondary. The humans in the loop must understand what's going on.
> 2 is a killer. There's so much abandonware out there, either as full apps/servers or libraries. These can't ever really be patched. Previously these weren't really worth spending effort on - might have a few thousand targets of questionable value.
It's worse than that. Before, the operator of a system could upgrade the distro's OpenSSL version, restart the service, and be pretty much done. Even if it was a 3rd-party vendor app, at the very least you could provide security updates for the shared libs.
Nowadays, when everything runs in containers, you have to make sure every single vendor you take containers from did that update.
The people developing exploits have an obvious way to recoup their token investment. How do the open source maintainers recoup their costs? There's a huge disparity here.
Of course, but how do you distribute the patches? My point isn't that AI can't solve it, but that if the project is abandoned there is no way to get the patches to users.
And even if there is, there is an inherent lag. Take these OpenSSL vulns. A fix has to go from OpenSSL to (say) Ubuntu. They have to backport the fixes, which isn't trivial, as they need to be tested and applied to old code versions. These fixes then need to be applied, and there are no doubt a lot of users not on a "supported" version who won't get the fix.
Even worse, something like OpenSSL is almost certainly statically linked into many apps/servers. That then requires those projects to pull the fix from upstream and repackage, and users to deploy the fix.
So it's a real issue. I'd argue that the industry isn't really able to do this well currently, never mind if the patch frequency suddenly goes up 1000x.
Probably (?) not related, but there is an issue with Claude Code for web with NuGet. It doesn't support the proxy auth mechanism that Anthropic gives it. I wonder if it's the same problem here.