There are already some great responses, but I want to add that one effective way to coach senior employees is to give them responsibilities one level above their current role and then provide feedback.
For engineers aiming to move into management or staff engineering, you can assign them a project at the level they aspire to reach and give feedback once they complete it. For example, for an engineer aiming to be an EM, I expect them to lead not only meetings but also all communications related to this project, while I act as their director. Afterwards, I provide feedback.
It doesn't have to be that extensive right away. You can start small, like asking them to lead a roadmap meeting, and then increase responsibilities as they improve. Essentially, create a safe environment for them to grow.
I am glad that I don't need to use Windows anymore. When I did, the LTSC version (the one made for ATMs and kiosks) was the only one that was productivity-friendly.
Microsoft doesn't want to accept that no one cares about Windows itself; the OS is just the thing that gets you to the thing you actually want to do.
I've seen two instances of people hit with an "Updating Windows" screen on their personal laptops when they tried to present something, wasting everyone's time. I imagine this happens many times every day. And now they're breaking everyone's systems by forcing updates as well.
If nobody cared about Windows, there wouldn't be posts like this one on HN every day. The problem is that people do care and need to use Windows, which is what makes all this stupidity from Microsoft in recent years so frustrating.
I am personally just a lurker. Windows used to be my only OS for a -very- long time (DOS 2.x, yes 2, was my first MS OS). Now I click on these just to see how far it has fallen. It is like that friend you drifted away from and now you look at their FB posts now and again to watch as they get crazier and crazier.
I think the point is that they don't care about Windows in particular as much as they care about having something that works out of the box for them, and for years that was Windows. Not many people really want Windows to iterate with new features, and in practice it seems that Microsoft isn't really able to push those features without breaking the stability, which is the one feature people actually do want, and that leads to backlash like this. The best Windows is the one that gets out of the way and lets users not care about it.
It's more like "nobody cares about Windows" as in: no one is impressed that you added Copilot into Notepad, no one wants you to move the cheese, they just want to get on with their actual work without being interrupted by "good things coming your way" which is inevitably just more annoyance, more bugs, more Copilot buttons.
Most people would probably have preferred if Windows had zero feature updates* since Windows 7, just security patches.
* Well, OK, fine. Task Manager is better now, I'll grant them that one.
If for example you take a bus to work, and starting this month, the bus shows up only every hour (last year it was every 10 minutes), would you be frustrated?
But if you complained about this and someone said "Well in the 90's this bus showed up only every 4 hours!"...
I'm not certain what point you are attempting to make. The Windows install hasn't gotten smaller, thereby improving install times; rather, the hardware has gotten so much faster. Installing from a USB 3 key is far quicker than from floppy or optical media, and NVMe drives are receiving the data instead of old spinning rust.
You can find complaints about Windows Vista, but then find praise around Windows 7. Being better than a single point in the past doesn't imply a trend. The perceived quality varies between releases, and it's clear that Windows 11 has dipped in that regard.
Even Windows 7 drew complaints when it was released, but it kept improving. Windows 11, on the other hand, is deteriorating in stability, and anything new that's added is either half-baked, unnecessary, or borderline user-hostile.
LTSC is what mainstream Windows should be. It doesn't load up a bunch of apps you don't ask for or throw ads in your face all the time. Solid, dependable, reliable, and stable.
Microsoft understands its position very well; that's precisely why the updates are so bad. Windows is just an entry point. The actual critical parts are the Office suite, Teams, Visual Studio and the rest; those are the cash cows and can't be easily replaced, hence Windows gets picked even if it's hated.
Slightly off topic, but does anyone feel that they nerfed Claude Opus?
It's screwing up even in very simple rebases. I got a bug where a value wasn't being retrieved correctly, and Claude's solution was to create an endpoint and use an HTTP GET from within the same back-end! Now it feels worse than Sonnet.
All the engineers I asked today have said the same thing. Something is not right.
A model or new model version X is released, everyone is really impressed.
3 months later, "Did they nerf X?"
It's been this way since the original chatGPT release.
The answer is typically no; it's just that your expectations have risen. What was previously a mind-blowing improvement is now expected, and any missteps feel amplified.
This is not always true. LLMs do get nerfed, and quite regularly, usually because the provider discovers that users are using them more than expected, because of user abuse, or simply because the product attracts a larger user base. One of the recent nerfs was the Gemini context window, which was drastically reduced.
What we need is an open and independent way of testing LLMs, and stricter regulation on the disclosure of product changes when the product is paid for under a subscription or prepaid plan.
Unfortunately, it has paywalled most of the historical data since I last looked at it, but it's interesting that Opus has dipped below Sonnet on overall performance.
Interesting! I was just thinking about pinging the creator of simple-bench.com and asking them if they intend to re-benchmark models after 3 months. I've noticed, in particular, Gemini models dramatically dropping in quality after the initial hype cycle. Gemini 3 Pro _was_ my top performer and has slowly declined to 'is it worth asking', complete with gpt-4o style glazing. It's been frustrating. I had been working on a very custom benchmark, and over the course of it Gemini 3 Pro and Flash both started underperforming by 20% or more. I wondered if I had subtly broken my benchmark but ultimately started seeing the same behavior in general online queries (Google AI Studio).
> What we need is an open and independent way of testing LLMs
I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try and make the case that the regressions aren't being captured by the benchmarks somehow, but until we have a repeatable benchmark which shows the regression, none of these companies are going to give you a refund based on your vibes.
We've got a lot of available benchmarks & modifying at least some of those benchmarks doesn't seem particularly difficult: https://arc.markbarney.net/re-arc
To reduce cost & maintain credibility, we could have the benchmarks run through a public CI system.
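For what it's worth, the core of such a harness is tiny. Here's a rough sketch (the prompt set, the ask() callable, and the pass criteria are all placeholders you'd fill in yourself, not any existing tool) of something a public CI job could run on a schedule:

    import datetime
    import hashlib
    import json

    # Fixed, versioned prompt set: the whole point is that it never changes silently.
    PROMPTS = [
        {"id": "arith-1",
         "prompt": "What is 17 * 23? Answer with the number only.",
         "expected": "391"},
    ]

    def run_benchmark(ask, model_name):
        """Run the fixed prompts through ask() (a callable you supply that queries
        the model, ideally at temperature 0) and return a reproducible report."""
        results = []
        for case in PROMPTS:
            answer = ask(case["prompt"])
            results.append({
                "case": case["id"],
                "passed": case["expected"] in answer,
                "answer_sha256": hashlib.sha256(answer.encode()).hexdigest(),
            })
        return {
            "model": model_name,
            "run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "prompt_set_sha256": hashlib.sha256(
                json.dumps(PROMPTS, sort_keys=True).encode()).hexdigest(),
            "pass_rate": sum(r["passed"] for r in results) / len(results),
            "results": results,
        }

    # e.g. print(json.dumps(run_benchmark(lambda p: "391", "example-model"), indent=2))

Run it daily against each paid model, commit the JSON reports, and "did they nerf X?" becomes a question about a pass-rate time series instead of vibes.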
I usually agree with this. But I am using the same workflows and skills that used to be a breeze for Claude, and now they cause it to run in circles and require intervention.
This is not the same thing as "omg, the vibes are off". It's reproducible: I am using the same prompts and files and getting far worse results than with any other model.
Also, people who were lucky and had lots of success early on, but then start to run into the actual problems of LLMs, will experience that as "it was good and then it got worse", even when nothing actually changed.
If LLMs have a 90% chance of working, there will be some who have only success and some who have only failure.
People are really failing to understand the probabilistic nature of all of this.
"You have a radically different experience with the same model" is perfectly possible with less than hundreds of thousands of interactions, even when you both interact in comparable ways.
Opus was, is, and for the foreseeable future will be a non-deterministic probability machine. The variance eventually shows up when you push it hard.
Eh, I've definitely had issues where Claude can no longer easily do what it previously did. That's with constantly documenting things well in the appropriate markdown files and resetting context here and there to keep confusion minimal.
I don't care what anyone says about the cycle, or the implication that it's all in our heads. It's bad, bad.
I'm a Max x20 subscriber who had to stop using it this week. Opus was regularly failing on the most basic things.
I regularly use the front-end skill to pass in mockups, and Opus was always pixel-perfect. This last week it seemed like the skill had no effect.
I don’t think they are purposely nerfing it but they are definitely using us as guinea pigs. Quantized model? The next Sonnet? The next Haiku? New tokenizing strategies?
I noticed that this week. I have a very straightforward Claude command that lists the exact steps to follow to fetch PR comments and bring them into the context window. Stuff like: step one, call gh pr view my/repo. Instead, it would call it with anthropiclabs/repo, it wouldn't follow all the instructions, and it wouldn't pass the exact command I had written. I pointed out the mistake and it goes "oh, you're right!", then proceeded to make the same mistake again.
I used this command with Sonnet 4.5 too and never had a problem until this week. Something changed, either in the harness or the model. This is not just vibes: workflows I have run hundreds of times have stopped working with Opus 4.5.
They're A/B testing on the latest Opus model; sometimes it's good, sometimes it's worse than Sonnet, which is annoying as hell. I think they trigger it when you have excessive usage or high context use.
Or maybe, when usage is high, they tweak a setting that uses the cache when it shouldn't.
For all we know, they run whatever experiments they want, to demonstrate theoretically better margins or to analyse user patterns when a performance drop occurs.
Given what is done in other industries that don't even face an existential issue, it wouldn't surprise me if, in a few years, some whistleblowers tell us what's been going on.
This has been said about every LLM product from every provider since ChatGPT4.
I'm sure nerfing happens, but I think the more likely explanation is that humans have a tendency to find patterns in random noise.
They are constantly trying to reduce costs which means they're constantly trying to distill & quantize the models to reduce the energy cost per request. The models are constantly being "nerfed", the reduction in quality is a direct result of seeking profitability. If they can charge you $200 but use only half the energy then they pocket the difference as their profit. Otherwise they are paying more to run their workloads than you are paying them which means every request loses them money. Nerfing is inevitable, the only question is how much it reduces response quality & what their customers are willing to put up with.
I've observed the same random foreign-language characters (Chinese or Japanese, I believe) interspersed without rhyme or reason that I've come to expect from low-quality, low-parameter-count models, even while using "opus 4.5".
An upcoming IPO increases pressure to make financials look prettier.
I've noticed a significant drop in Opus' performance in Claude Code since last week. It's more about "reasoning" than syntax. Feels more like Sonnet 4.1 than Opus 4.5.
In fact, as my prompts and documents get better, it seems to do increasingly better.
Still, it can't replace a human. I regularly need to correct it, and if I try to one-shot a feature I always end up spending more time refactoring it a few days later.
Still, it's a huge boost to productivity, but the point where it can take over without detailed instructions and oversight is far away.
I imagine they're building this system with the goal of extracting user demographics (age, sex, income) from chat conversations to improve advertising monetization.
This seems to be a side project toward that goal, and a good way to calibrate the future ad system's predictions.
For context though, people have been screaming lately at OpenAI and other AI companies about not doing enough to protect the children. Almost like there is no winning, and one should just make everything 18+ to actually make people happy.
What a coincidence: the "protect the children" narrative got amplified right around the time profiling became necessary for OpenAI's profits. Pure magic.
I get why you're questioning motives, I'm sure it's convenient for them at this time.
But age verification is all over the place. Entire countries (see Australia) have either passed laws, or have laws moving through legislative bodies.
Many platforms have voluntarily complied. I expect that by 2030 there won't be a place on Earth where not just age verification but full identity verification isn't required to access online platforms. If it weren't for all the massive attempts to subvert our democracies by state actors, and even by political movements within democratic societies, it wouldn't be pushed so hard.
But with AI generated videos, chats, audio, images, I don't think anyone will be able to post anything on major platforms without their ID being verified. Not a chat, not an upload, nothing.
I think consumption will be age vetted, not ID vetted.
But any form of publishing, linked to ID. Posting on X. Anything.
I've fought for freedom on the Internet, grew up when IRC was a thing, and knew more freedom on the net than most using it today. But when 95% of what is posted on the net is placed there with the aim to harm? Harm our societies, our peoples?
Well, something's got to give.
Then conjoin that with the great mental harm that smart phones and social media do to youth, and.. well, anonymity on the net is over. Like I said at the start, likely by 2030.
(Note: having your ID known doesn't mean it's public. You can be registered, with ID, on X, on youtube, so the platform knows who you are. You can still be MrDude as an alias...)
Every "protect the children" measure that involves increased surveillance is balanced by an equal and opposing "now the criminal pedophiles in positions of power have more information on targets".
And also to reduce account sharing. How will a family share an account when they simultaneously make the adult account more "adult" and make the kids' account more annoying for adults?
Safe to say everything in existence is created to extract user demographics.
It was never about you finding information; this trillion-dollar market exists to find you.
It was never about you summarizing information or getting around advertising; it is about profiling you.
This trillion-dollar market is not about empowering users to style their pages more quickly, heh.
I think the biggest issue is that you'll get a certain sub-segment of the HN audience, even if many use ad blockers. For example, all your latest submissions seem to be about AI/LLMs, career growth, and product engineering; someone submitting stories on other themes would probably get different data.
I've used it in some repos. It doesn't catch all code review issues, especially around product requirements and logic simplification, and occasionally produces irrelevant comments (suggesting a temporary model downgrade).
But it's well worth it. It has saved me considerable time. I let it run first, even before my own final self-review (so if others do the same, the article's data might be biased). It's particularly good at identifying dead code and logical issues. If you tune it with its own custom rules (like Claude.md), you can also cut a lot of noise.
TL;DR: Ask for a line edit, "Line edit this Slack message / HN comment." It goes beyond fixing grammar (because it improves flow) without killing your meaning or adding AI-isms.
I recently got a Samsung device for testing, and the experience was terrible. It took three hours to get the device into a usable state.
First, it essentially forces you to create both a Samsung account and a Google account, with numerous shady prompts for "improving services" and "allowing targeted ads."
Then it required nine system updates (apparently, it can only update incrementally), and worst of all, after a while, it automatically started downloading bloatware like "Kawai" and other questionable apps, and you cannot cancel the downloads.
I wonder how much Samsung gets paid to preinstall all that crap. The phone wasn't cheap, either. The company seems penny wise and pound foolish.
While I think there's significant AI "offloading" in writing, the article's methodology relies on "AI-detectors," which reads like PR for Pangram. I don't need to explain why AI detectors are mostly bullshit and harmful for people who have never used LLMs. [1]
I am not sure if you are familiar with Pangram (co-founder here), but we are a group of research scientists who have made significant progress in this problem space. If your mental model of AI detectors is still GPTZero or the ones that say the Declaration of Independence is AI, then you probably haven't seen how much better they've gotten.
Nothing points out that a benchmark is invalid like a zero false positive rate. Seemingly it is pre-2020 text vs. a few models' reworkings of texts. I can see this model falling apart in many real-world scenarios. Yes, LLMs use strange language if left to their own devices, and this can surely be detected. But a 0% false positive rate under all circumstances? Implausible.
Max, there are two problems I see with your comment.
1) The paper didn't show a 0% FNR. I mean, tables 4, 7, and B.2 are pretty explicit. It's not hard to figure out from the others either.
2) A 0% error rate requires some pretty serious assumptions to be true. For that type of result not to be incredibly suspect, there has to be zero noise in the data, the analysis, and every other step. I do not see that being true of the mentioned dataset.
Even high scores are suspect. Generalizing the previous point: a score is suspect whenever it is better than the noise level of the data should allow. Can you truly attest that this condition holds?
I suspect that you're introducing data leakage. I haven't looked enough into your training and data to determine how that's happening, but you'll probably need a pretty deep analysis, as leakage is really easy to sneak in, and it can do so in non-obvious ways. A very common one is tuning hyperparameters on test results; you don't have to pass data to pass information. Another sly way for this to happen is that the test set isn't sufficiently disjoint from the training set. If the perturbation is too small, then you aren't testing generalization, you're testing a slightly noisy training set (which your training should be introducing noise to help regularize, so you end up just measuring your training performance).
Your numbers are too good and that's suspect. You need a lot more evidence to suggest they mean what you want them to mean.
EditLens (Ours)

                       Predicted Label
                     Human      Mix       AI
                  ┌─────────┬─────────┬─────────┐
           Human  │   1770  │    111  │      0  │
   True           ├─────────┼─────────┼─────────┤
   Label   Mix    │    265  │   1945  │     28  │
                  ├─────────┼─────────┼─────────┤
           AI     │      0  │    186  │   1695  │
                  └─────────┴─────────┴─────────┘
It looks like about 5% of human texts from your paper are marked as mixed, and the mixed and AI texts are misclassified at roughly 10% rates as well, from your paper.
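For reference, here's the arithmetic behind those percentages, computed straight from the counts in the table above (rows are the true labels):

    # Per-class rates from the confusion matrix above (row = true label).
    rows = {
        "Human": [1770, 111, 0],    # predicted as Human / Mix / AI
        "Mix":   [265, 1945, 28],
        "AI":    [0, 186, 1695],
    }
    for label, (as_human, as_mix, as_ai) in rows.items():
        total = as_human + as_mix + as_ai
        print(f"{label}: {100 * as_human / total:.1f}% -> Human, "
              f"{100 * as_mix / total:.1f}% -> Mix, {100 * as_ai / total:.1f}% -> AI")
    # Human: 94.1% -> Human, 5.9% -> Mix, 0.0% -> AI
    # Mix: 11.8% -> Human, 86.9% -> Mix, 1.3% -> AI
    # AI: 0.0% -> Human, 9.9% -> Mix, 90.1% -> AI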
I guess I don’t see that this is much better than what’s come before, using your own paper.
Edit: this is an irresponsible Nature news article, too - we should see a graph of this detector over the past ten years to see how much of this ‘deluge’ is algorithmic error
It is not wise to brag about your product when the GP is pointing out that the article "reads like PR for Pangram", no matter whether AI detectors are reliable or not.
I would say it's important to hold off on the moralizing until after showing visible effort to reflect on the substance of the exchange, which in this case is about the fairness of asserting that the detection methodology employed in this particular case shares the flaws of familiar online AI checkers. That's an importantly substantive and rebuttable point and all the meaningful action in the conversation is embedded in those details.
In this case, several important distinctions are drawn, including being open about criteria, about such things as "perplexity" and "burstiness" as properties being tested for, and an explanation of why they incorrectly claim the Declaration of Independence is AI generated (it's ubiquitous). So it seems like a lot of important distinctions are being drawn that testify to the credibility of the model, which has to matter to you if you're going to start moralizing.
There are dozens of first-generation AI detectors and they all suck. I'm not going to defend them. Most of them use perplexity-based methods, which are a decent separator of AI and human text (80-90%) but have flaws that can't be overcome and high FPRs on ESL text.
Pangram is fundamentally different technology, it's a large deep learning based model that is trained on hundreds of millions of human and AI examples. Some people see a dozen failed attempts at a problem as proof that the problem is impossible, but I would like to remind you that basically every major and minor technology was preceded by failed attempts.
Some people see a dozen extremely profitable, extremely destructive attempts at a problem as proof that the problem is not a place for charitable interpretation.
GAN... Just feed the output of your algorithm back into the LLM while it's learning. At the end of the day the problem is impossible, but we're not there yet.
Pangram is trained on this task as well to add additional signal during training, but it's only ~90% accurate so we don't show the prediction in public-facing results
> Are you concerned with your product being used to improve AI to be less detectable?
The big AI providers don't have any obvious incentive to do this. If it happens 'naturally' in the pursuit of quality then sure, but explicitly training for stealth is a brand concern in the same way that offering a fully uncensored model would be.
Smaller providers might do this (again, in the same way they now offer uncensored models), but they occupy a minuscule fraction of the market and will be a generation or two behind the leaders.
They don't have an incentive to make their AIs better? If your product can genuinely detect AI writing, of course they would use it to make their models sound more human. The biggest criticism of AI right now is how robotic and samey it sounds.
It's definitely going to be a back and forth - model providers like OpenAI want their LLMs to sound human-like. But this is the battle we signed up for, and we think we're more nimble and can iterate faster to stay one step ahead of the model providers.
Hi Max! Thank you for updating my mental model of AI detectors.
I was, with total certainty, under the impression that detecting AI-written text is an impossible-to-solve problem. I think that's because it's just so deceptively intuitive to believe that "for every detector, there'll just be a better LLM and it'll never stop."
I had recently published a macOS app called Pudding to help humans prove they wrote a text mainly under the assumption that this problem can't be solved with measurable certainty and traditional methods.
Now I'm of course a bit sad that the problem (and hence my solution) can be solved much more directly. But, hey, I fell in love with the problem, so I'm super impressed with what y'all are accomplishing at and with Pangram!
AI detectors are only harmful if you use them to convict people; it isn't harmful to gather statistics like this. They didn't find many AI-written papers, just AI-written peer reviews, which is what you would expect, since not many would generate their whole paper submissions, while peer review is thankless work.
If you have a bullshit measure that claims some phenomenon (e.g. crime) happens in some area, you will become biased to expect it in that area. It wrongly creates a spotlight effect, by which other questionable measures are used to do the actual conviction ("Look! We found an em dash!").
I think there is a funny bit of mental gymnastics that goes on here sometimes, definitely. LLM skeptics (which I'm not saying the Pangram folks are in particular) would say: "LLMs are unreliable and therefore useless, it's producing slop at great cost to the environment and other people." But if a study comes out that confirms their biases and uses an LLM in the process, or if they themselves use an LLM to identify -- or in many cases just validate their preconceived notion -- that something was drafted using an LLM, then all of a sudden things are above board.
Me neither, but I've noticed that when searching for hotels and Airbnbs, I only filter for places rated 8+/10 domestically and 9+/10 internationally, which filters out many of the hotels that have those kinds of issues (and filtering by score doesn't affect the budget much).
Booking.com has this grade inflation issue. If something is shit but you rate everything else fairly (location, staff friendliness, etc.), the final score will be 7 or 8. In summary: I had a lousy experience, 7/10!
It takes some experience to realize that a place graded 7.x probably has serious issues.
The problem here is that "mean" is a poor average. For hotels, if you're rating in 10 different categories, you really want a single 0/10 to bring the overall score down by way more than one point.
The opposite situation can also occur. At my university, entrance scholarships were decided a few years ago based on students' aggregate score across 25ish dimensions (I can't remember the exact number) where students were each rated 1-4. Consequently a student who was absolutely exceptional in one area would be beaten out by a student who was marginally above average in all the other areas. I suggested that rather than scoring 1-4 the scores should be 1/2/5/25 instead.
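To put toy numbers on both of those (the blended hotel score and the 1/2/5/25 remapping below are illustrations I made up, not formulas any site or university actually uses):

    # Hotel: nine 9/10 categories plus one deal-breaker 0/10.
    scores = [9] * 9 + [0]
    arithmetic = sum(scores) / len(scores)
    print(arithmetic)                           # 8.1 -- the 0 costs barely one point
    penalized = 0.5 * arithmetic + 0.5 * min(scores)
    print(penalized)                            # 4.05 -- one crude way to make a deal-breaker hurt

    # Scholarship: convex remapping so an exceptional score can dominate.
    remap = {1: 1, 2: 2, 3: 5, 4: 25}
    exceptional = [4, 2]                        # outstanding in one dimension, average in the other
    steady = [3, 3]                             # marginally above average in both
    print(sum(exceptional), sum(steady))        # 6 6  -- a tie on the raw 1-4 scale
    print(sum(remap[s] for s in exceptional),
          sum(remap[s] for s in steady))        # 27 10 -- remapped, the exceptional candidate wins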
The problem here begins even before the mathematical issue - it's that websites that live off listing bookings have an incentive to offer a way to delete reviews that are not in line with what the owner wants to see.
Honestly, the ratings on those sites are essentially useless anyway, because people are bad at reviewing.
I generally sample the lowest-rated written reviews to check whether people are complaining about real stuff or are just confused. For instance, if a hotel doesn't have a bar, some of the negative reviews will usually be about how the hotel doesn't have a bar; these can be safely ignored as having been written by idiots (it is not like the hotel is hiding the fact that it doesn't have a bar).
Occasionally some of the positive reviews are similarly baffling. Was recently booking a hotel in Berlin in January, and the top review's main positive comment about the hotel was that it had heating. Well, yeah, I mean, you'd hope so. I can only assume that the reviewer was a visitor from the 19th century.
The worst thing I’ve found with positive reviews is ones that are obviously fake/incentivized. I looked up reviews recently for a hotel that I used to stay at a lot for work, and had gone way downhill with many issues (broken ACs, mold, leaking ceilings, etc.). I was curious if they ever fixed their problems. I was at first surprised that they had a fairly positive overall review rating. But looking deeper, the many negative reviews were just crowded out by obviously fake reviews. Dead giveaways: every single one named multiple people by name. “Dave at the front desk was just so friendly and welcoming! Barbara the housecleaner did a fantastic job cleaning. And Steve the bartender just made my day! I love this hotel! 5 stars!” (Almost) nobody reviewing a hotel for real does that.