There are already some great responses, but I want to add that one effective way to coach senior employees is to give them responsibilities one level above their current role and then provide feedback.
For engineers aiming to move into management or staff engineering, you can assign them a project at the level they aspire to reach and give feedback once they complete it. For example, for an engineer aiming to be an EM, I expect them to lead not only meetings but also all communications related to this project, while I act as their director. Afterwards, I provide feedback.
It doesn't have to be that extensive right away. You can start small, like asking them to lead a roadmap meeting, and then increase responsibilities as they improve. Essentially, create a safe environment for them to grow.
I am glad that I don't need to use Windows anymore. When I did, the LTSC version (the one made for ATMs and kiosks) was the only one that was productivity-friendly.
Microsoft doesn't want to accept that no one cares about Windows itself; the OS is just the thing that gets you to the thing you actually want to do.
I've seen two instances of people hit with an "Updating Windows" screen on their personal laptops when they tried to present something, wasting everyone's time. I imagine this happens many times every day. And now they're breaking everyone's systems by forcing updates as well.
If nobody cared about Windows, there wouldn't be posts like this one on HN every day. The problem is that people do care and need to use Windows, which is what makes all this stupidity from Microsoft in recent years so frustrating.
I am personally just a lurker. Windows used to be my only OS for a -very- long time (DOS 2.x, yes 2, was my first MS OS). Now I click on these just to see how far it has fallen. It is like that friend you drifted away from and now you look at their FB posts now and again to watch as they get crazier and crazier.
I think the point is that they don't care about Windows in particular as much as they care about having something that works out of the box for them, and for years that was Windows. Not many people really want Windows to iterate with new features, and in practice it seems that Microsoft isn't really able to push those features without breaking the stability, which is the one feature people actually do want, and that leads to backlash like this. The best Windows is the one that gets out of the way and lets users not care about it.
It's more like "nobody cares about Windows" as in: no one is impressed that you added Copilot into Notepad, no one wants you to move the cheese, they just want to get on with their actual work without being interrupted by "good things coming your way" which is inevitably just more annoyance, more bugs, more Copilot buttons.
Most people would probably have preferred if Windows had zero feature updates* since Windows 7, just security patches.
* Well, OK, fine. Task Manager is better now, I'll grant them that one.
If for example you take a bus to work, and starting this month, the bus shows up only every hour (last year it was every 10 minutes), would you be frustrated?
But if you complained about this and someone said "Well in the 90's this bus showed up only every 4 hours!"...
I'm not certain what point you are attempting to make. The Windows install hasn't gotten smaller, thereby improving install times; rather, the hardware has gotten so much faster. Installing from a USB 3 key is far quicker than from floppy or optical media, and NVMe drives are receiving the data instead of old spinning rust.
You can find complaints about Windows Vista, but then find praise around Windows 7. Being better than a single point in the past doesn't imply a trend. The perceived quality varies between releases, and it's clear that Windows 11 has dipped in that regard.
Even Windows 7 drew complaints when it was released, but it kept improving. Windows 11, on the other hand, is deteriorating in stability, and anything new that's added is either half-baked, unnecessary, or borderline user-hostile.
LTSC is what mainstream Windows should be. It doesn't load up a bunch of apps you don't ask for or throw ads in your face all the time. Solid, dependable, reliable, and stable.
Microsoft understands its position very well; that's precisely why the updates are so bad. Windows is just an entry point. The actual critical parts are the Office suite, Teams, Visual Studio and the rest; those are the cash cows and can't be easily replaced, hence Windows gets picked even if it's hated.
Slightly off topic, but does anyone feel that they nerfed Claude Opus?
It's screwing up even in very simple rebases. I got a bug where a value wasn't being retrieved correctly, and Claude's solution was to create an endpoint and use an HTTP GET from within the same back-end! Now it feels worse than Sonnet.
All the engineers I asked today have said the same thing. Something is not right.
A model or new model version X is released, everyone is really impressed.
3 months later, "Did they nerf X?"
It's been this way since the original chatGPT release.
The answer is typically no; it's just that your expectations have risen. What was previously a mind-blowing improvement is now expected, and any missteps feel amplified.
This is not always true. LLMs do get nerfed, and quite regularly, usually because the provider discovers that users are using them more than expected, because of user abuse, or simply because the product attracts a larger user base. One of the recent nerfs was the Gemini context window, which was drastically reduced.
What we need is an open and independent way of testing LLMs, and stricter regulation on the disclosure of product changes when the product is paid for under a subscription or prepaid plan.
Unfortunately, it has paywalled most of the historical data since I last looked at it, but it's interesting that Opus has dipped below Sonnet on overall performance.
Interesting! I was just thinking about pinging the creator of simple-bench.com and asking them if they intend to re-benchmark models after 3 months. I've noticed, in particular, Gemini models dramatically dropping in quality after the initial hype cycle. Gemini 3 Pro _was_ my top performer and has slowly declined to 'is it worth asking', complete with gpt-4o style glazing. It's been frustrating. I had been working on a very custom benchmark, and over the course of it Gemini 3 Pro and Flash both started underperforming by 20% or more. I wondered if I had subtly broken my benchmark but ultimately started seeing the same behavior in general online queries (Google AI Studio).
> What we need is an open and independent way of testing LLMs
I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try and make the case that the regressions aren't being captured by the benchmarks somehow, but until we have a repeatable benchmark which shows the regression, none of these companies are going to give you a refund based on your vibes.
We've got a lot of available benchmarks & modifying at least some of those benchmarks doesn't seem particularly difficult: https://arc.markbarney.net/re-arc
To reduce cost & maintain credibility, we could have the benchmarks run through a public CI system.
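For what it's worth, the core of such a harness is tiny. Here's a rough sketch (the prompt set, the ask() callable, and the pass criteria are all placeholders you'd fill in yourself, not any existing tool) of something a public CI job could run on a schedule:

    import datetime
    import hashlib
    import json

    # Fixed, versioned prompt set: the whole point is that it never changes silently.
    PROMPTS = [
        {"id": "arith-1",
         "prompt": "What is 17 * 23? Answer with the number only.",
         "expected": "391"},
    ]

    def run_benchmark(ask, model_name):
        """Run the fixed prompts through ask() (a callable you supply that queries
        the model, ideally at temperature 0) and return a reproducible report."""
        results = []
        for case in PROMPTS:
            answer = ask(case["prompt"])
            results.append({
                "case": case["id"],
                "passed": case["expected"] in answer,
                "answer_sha256": hashlib.sha256(answer.encode()).hexdigest(),
            })
        return {
            "model": model_name,
            "run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "prompt_set_sha256": hashlib.sha256(
                json.dumps(PROMPTS, sort_keys=True).encode()).hexdigest(),
            "pass_rate": sum(r["passed"] for r in results) / len(results),
            "results": results,
        }

    # e.g. print(json.dumps(run_benchmark(lambda p: "391", "example-model"), indent=2))

Run it daily against each paid model, commit the JSON reports, and "did they nerf X?" becomes a question about a pass-rate time series instead of vibes.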
I usually agree with this. But I am using the same workflows and skills that used to be a breeze for Claude, and now they cause it to run in circles and require intervention.
This is not the same thing as "omg, the vibes are off". It's reproducible: I am using the same prompts and files and getting far worse results than with any other model.
Also, people who were lucky and had lots of success early on, but then start to run into the actual problems of LLMs, will experience that as "it was good and then it got worse", even when nothing actually changed.
If LLMs have a 90% chance of working, there will be some who have only success and some who have only failure.
People are really failing to understand the probabilistic nature of all of this.
"You have a radically different experience with the same model" is perfectly possible with less than hundreds of thousands of interactions, even when you both interact in comparable ways.
Opus was, is, and for the foreseeable future will be a non-deterministic probability machine. The variance eventually shows up when you push it hard.
Eh, I've definitely had issues where Claude can no longer easily do what it previously did. That's with constantly documenting things well in the appropriate markdown files and resetting context here and there to keep confusion minimal.
I don't care what anyone says about the cycle, or the implication that it's all in our heads. It's bad, bad.
I'm a Max x20 subscriber who had to stop using it this week. Opus was regularly failing on the most basic things.
I regularly use the front-end skill to pass in mockups, and Opus was always pixel-perfect. This last week it seemed like the skill had no effect.
I don’t think they are purposely nerfing it but they are definitely using us as guinea pigs. Quantized model? The next Sonnet? The next Haiku? New tokenizing strategies?
I noticed that this week. I have a very straightforward Claude command that lists the exact steps to follow to fetch PR comments and bring them into the context window. Stuff like: step one, call gh pr view my/repo. Instead, it would call it with anthropiclabs/repo, it wouldn't follow all the instructions, and it wouldn't pass the exact command I had written. I pointed out the mistake and it goes "oh, you're right!", then proceeded to make the same mistake again.
I used this command with Sonnet 4.5 too and never had a problem until this week. Something changed, either in the harness or the model. This is not just vibes: workflows I have run hundreds of times have stopped working with Opus 4.5.
They're A/B testing on the latest Opus model; sometimes it's good, sometimes it's worse than Sonnet, which is annoying as hell. I think they trigger it when you have excessive usage or high context use.
Or maybe, when usage is high, they tweak a setting that uses the cache when it shouldn't.
For all we know, they run whatever experiments they want, to demonstrate theoretically better margins or to analyse user patterns when a performance drop occurs.
Given what is done in other industries that don't even face an existential issue, it wouldn't surprise me if, in a few years, some whistleblowers tell us what's been going on.
This has been said about every LLM product from every provider since ChatGPT4.
I'm sure nerfing happens, but I think the more likely explanation is that humans have a tendency to find patterns in random noise.
They are constantly trying to reduce costs which means they're constantly trying to distill & quantize the models to reduce the energy cost per request. The models are constantly being "nerfed", the reduction in quality is a direct result of seeking profitability. If they can charge you $200 but use only half the energy then they pocket the difference as their profit. Otherwise they are paying more to run their workloads than you are paying them which means every request loses them money. Nerfing is inevitable, the only question is how much it reduces response quality & what their customers are willing to put up with.
I've observed the same random foreign-language characters (Chinese or Japanese, I believe) interspersed without rhyme or reason that I've come to expect from low-quality, low-parameter-count models, even while using "opus 4.5".
An upcoming IPO increases pressure to make financials look prettier.
I've noticed a significant drop in Opus' performance in Claude Code since last week. It's more about "reasoning" than syntax. Feels more like Sonnet 4.1 than Opus 4.5.
In fact, as my prompts and documents get better, it seems to do increasingly better.
Still, it can't replace a human. I regularly need to correct it, and if I try to one-shot a feature I always end up spending more time refactoring it a few days later.
Still, it's a huge boost to productivity, but the point where it can take over without detailed instructions and oversight is far away.
I imagine they're building this system with the goal of extracting user demographics (age, sex, income) from chat conversations to improve advertising monetization.
This seems to be a side project toward that goal, and a good way to calibrate the future ad system's predictions.
For context though, people have been screaming lately at OpenAI and other AI companies about not doing enough to protect the children. Almost like there is no winning, and one should just make everything 18+ to actually make people happy.
What a coincidence: the "protect the children" narrative got amplified right around the time profiling became necessary for OpenAI's profits. Pure magic.
I get why you're questioning motives, I'm sure it's convenient for them at this time.
But age verification is all over the place. Entire countries (see Australia) have either passed laws, or have laws moving through legislative bodies.
Many platforms have voluntarily complied. I expect that by 2030 there won't be a place on Earth where not just age verification but full identity verification isn't required to access online platforms. If it weren't for all the massive attempts to subvert our democracies by state actors, and even by political movements within democratic societies, it wouldn't be pushed so hard.
But with AI generated videos, chats, audio, images, I don't think anyone will be able to post anything on major platforms without their ID being verified. Not a chat, not an upload, nothing.
I think consumption will be age vetted, not ID vetted.
But any form of publishing, linked to ID. Posting on X. Anything.
I've fought for freedom on the Internet, grew up when IRC was a thing, and knew more freedom on the net than most using it today. But when 95% of what is posted on the net is placed there with the aim to harm? Harm our societies, our peoples?
Well, something's got to give.
Then conjoin that with the great mental harm that smart phones and social media do to youth, and.. well, anonymity on the net is over. Like I said at the start, likely by 2030.
(Note: having your ID known doesn't mean it's public. You can be registered, with ID, on X, on youtube, so the platform knows who you are. You can still be MrDude as an alias...)
Every "protect the children" measure that involves increased surveillance is balanced by an equal and opposing "now the criminal pedophiles in positions of power have more information on targets".
And also to reduce account sharing. How will a family share an account when they simultaneously make the adult account more "adult" and make the kids' account more annoying for adults?
Safe to say everything in existence is created to extract user demographics.
It was never about you finding information; this trillion-dollar market exists to find you.
It was never about you summarizing information or getting around advertising; it is about profiling you.
This trillion-dollar market is not about empowering users to style their pages more quickly, heh.
I think the biggest issue is that you'll get a certain sub-segment of the HN audience, even if many use ad blockers. For example, all your latest submissions seem to be about AI/LLMs, career growth, and product engineering; someone submitting stories on other themes would probably get different data.
I've used it in some repos. It doesn't catch all code review issues, especially around product requirements and logic simplification, and occasionally produces irrelevant comments (suggesting a temporary model downgrade).
But it's well worth it. It has saved me considerable time. I let it run first, even before my own final self-review (so if others do the same, the article's data might be biased). It's particularly good at identifying dead code and logical issues. If you tune it with its own custom rules (like Claude.md), you can also cut a lot of noise.
TL;DR: Ask for a line edit, "Line edit this Slack message / HN comment." It goes beyond fixing grammar (because it improves flow) without killing your meaning or adding AI-isms.
I recently got a Samsung device for testing, and the experience was terrible. It took three hours to get the device into a usable state.
First, it essentially forces you to create both a Samsung account and a Google account, with numerous shady prompts for "improving services" and "allowing targeted ads."
Then it required nine system updates (apparently, it can only update incrementally), and worst of all, after a while, it automatically started downloading bloatware like "Kawai" and other questionable apps, and you cannot cancel the downloads.
I wonder how much Samsung gets paid to preinstall all that crap. The phone wasn't cheap, either. The company seems penny wise and pound foolish.
While I think there's significant AI "offloading" in writing, the article's methodology relies on "AI-detectors," which reads like PR for Pangram. I don't need to explain why AI detectors are mostly bullshit and harmful for people who have never used LLMs. [1]
I am not sure if you are familiar with Pangram (co-founder here), but we are a group of research scientists who have made significant progress in this problem space. If your mental model of AI detectors is still GPTZero or the ones that say the Declaration of Independence is AI, then you probably haven't seen how much better they've gotten.
Nothing points out that a benchmark is invalid like a zero false positive rate. Seemingly it is pre-2020 text vs. a few models' reworkings of texts. I can see this model falling apart in many real-world scenarios. Yes, LLMs use strange language if left to their own devices, and this can surely be detected. But a 0% false positive rate under all circumstances? Implausible.
Max, there are two problems I see with your comment.
1) The paper didn't show a 0% FNR. I mean, tables 4, 7, and B.2 are pretty explicit. It's not hard to figure out from the others either.
2) A 0% error rate requires some pretty serious assumptions to be true. For that type of result not to be incredibly suspect, there has to be zero noise in the data, the analysis, and every other step. I do not see that being true of the mentioned dataset.
Even high scores are suspect. Generalizing the previous point: a score is suspect whenever it is better than the noise level of the data should allow. Can you truly attest that this condition holds?
I suspect that you're introducing data leakage. I haven't looked enough into your training and data to determine how that's happening, but you'll probably need a pretty deep analysis, as leakage is really easy to sneak in, and it can do so in non-obvious ways. A very common one is tuning hyperparameters on test results; you don't have to pass data to pass information. Another sly way for this to happen is that the test set isn't sufficiently disjoint from the training set. If the perturbation is too small, then you aren't testing generalization, you're testing a slightly noisy training set (which your training should be introducing noise to help regularize, so you end up just measuring your training performance).
Your numbers are too good and that's suspect. You need a lot more evidence to suggest they mean what you want them to mean.
EditLens (Ours)

                       Predicted Label
                     Human      Mix       AI
                  ┌─────────┬─────────┬─────────┐
           Human  │   1770  │    111  │      0  │
   True           ├─────────┼─────────┼─────────┤
   Label   Mix    │    265  │   1945  │     28  │
                  ├─────────┼─────────┼─────────┤
           AI     │      0  │    186  │   1695  │
                  └─────────┴─────────┴─────────┘
It looks like about 5% of human texts from your paper are marked as mixed, and the mixed and AI texts are misclassified at roughly 10% rates as well, from your paper.
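For reference, here's the arithmetic behind those percentages, computed straight from the counts in the table above (rows are the true labels):

    # Per-class rates from the confusion matrix above (row = true label).
    rows = {
        "Human": [1770, 111, 0],    # predicted as Human / Mix / AI
        "Mix":   [265, 1945, 28],
        "AI":    [0, 186, 1695],
    }
    for label, (as_human, as_mix, as_ai) in rows.items():
        total = as_human + as_mix + as_ai
        print(f"{label}: {100 * as_human / total:.1f}% -> Human, "
              f"{100 * as_mix / total:.1f}% -> Mix, {100 * as_ai / total:.1f}% -> AI")
    # Human: 94.1% -> Human, 5.9% -> Mix, 0.0% -> AI
    # Mix: 11.8% -> Human, 86.9% -> Mix, 1.3% -> AI
    # AI: 0.0% -> Human, 9.9% -> Mix, 90.1% -> AI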
I guess I don’t see that this is much better than what’s come before, using your own paper.
Edit: this is an irresponsible Nature news article, too - we should see a graph of this detector over the past ten years to see how much of this ‘deluge’ is algorithmic error
It is not wise to brag about your product when the GP is pointing out that the article "reads like PR for Pangram", no matter whether AI detectors are reliable or not.
I would say it's important to hold off on the moralizing until after showing visible effort to reflect on the substance of the exchange, which in this case is about the fairness of asserting that the detection methodology employed in this particular case shares the flaws of familiar online AI checkers. That's an importantly substantive and rebuttable point and all the meaningful action in the conversation is embedded in those details.
In this case, several important distinctions are drawn, including being open about criteria, about such things as "perplexity" and "burstiness" as properties being tested for, and an explanation of why they incorrectly claim the Declaration of Independence is AI generated (it's ubiquitous). So it seems like a lot of important distinctions are being drawn that testify to the credibility of the model, which has to matter to you if you're going to start moralizing.
There are dozens of first-generation AI detectors and they all suck. I'm not going to defend them. Most of them use perplexity-based methods, which are a decent separator of AI and human text (80-90%) but have flaws that can't be overcome and high FPRs on ESL text.
Pangram is fundamentally different technology, it's a large deep learning based model that is trained on hundreds of millions of human and AI examples. Some people see a dozen failed attempts at a problem as proof that the problem is impossible, but I would like to remind you that basically every major and minor technology was preceded by failed attempts.
Some people see a dozen extremely profitable, extremely destructive attempts at a problem as proof that the problem is not a place for charitable interpretation.
GAN... Just feed the output of your algorithm back into the LLM while it's learning. At the end of the day the problem is impossible, but we're not there yet.
Pangram is trained on this task as well to add additional signal during training, but it's only ~90% accurate so we don't show the prediction in public-facing results
> Are you concerned with your product being used to improve AI to be less detectable?
The big AI providers don't have any obvious incentive to do this. If it happens 'naturally' in the pursuit of quality then sure, but explicitly training for stealth is a brand concern in the same way that offering a fully uncensored model would be.
Smaller providers might do this (again, in the same way they now offer uncensored models), but they occupy a minuscule fraction of the market and will be a generation or two behind the leaders.
They don't have an incentive to make their AIs better? If your product can genuinely detect AI writing, of course they would use it to make their models sound more human. The biggest criticism of AI right now is how robotic and samey it sounds.
It's definitely going to be a back and forth - model providers like OpenAI want their LLMs to sound human-like. But this is the battle we signed up for, and we think we're more nimble and can iterate faster to stay one step ahead of the model providers.
Hi Max! Thank you for updating my mental model of AI detectors.
I was, with total certainty, under the impression that detecting AI-written text is an impossible-to-solve problem. I think that's because it's just so deceptively intuitive to believe that "for every detector, there'll just be a better LLM and it'll never stop."
I had recently published a macOS app called Pudding to help humans prove they wrote a text mainly under the assumption that this problem can't be solved with measurable certainty and traditional methods.
Now I'm of course a bit sad that the problem (and hence my solution) can be solved much more directly. But, hey, I fell in love with the problem, so I'm super impressed with what y'all are accomplishing at and with Pangram!
AI detectors are only harmful if you use them to convict people; it isn't harmful to gather statistics like this. They didn't find many AI-written papers, just AI-written peer reviews, which is what you would expect, since not many would generate their whole paper submissions, while peer review is thankless work.
If you have a bullshit measure that claims some phenomenon (e.g. crime) happens in some area, you will become biased to expect it in that area. It wrongly creates a spotlight effect, by which other questionable measures are used to do the actual conviction ("Look! We found an em dash!").
I think there is a funny bit of mental gymnastics that goes on here sometimes, definitely. LLM skeptics (which I'm not saying the Pangram folks are in particular) would say: "LLMs are unreliable and therefore useless, it's producing slop at great cost to the environment and other people." But if a study comes out that confirms their biases and uses an LLM in the process, or if they themselves use an LLM to identify -- or in many cases just validate their preconceived notion -- that something was drafted using an LLM, then all of a sudden things are above board.
Me neither, but I've noticed that when searching for hotels and Airbnbs, I only filter for places rated 8+/10 domestically and 9+/10 internationally, which filters out many of the hotels that have those kinds of issues (and filtering by score doesn't affect the budget much).
Booking.com has this grade inflation issue. If something is shit but you rate everything else fairly (location, staff friendliness, etc.), the final score will be 7 or 8. In summary: I had a lousy experience, 7/10!
It takes some experience to realize that a place graded 7.x probably has serious issues.
The problem here is that "mean" is a poor average. For hotels, if you're rating in 10 different categories, you really want a single 0/10 to bring the overall score down by way more than one point.
The opposite situation can also occur. At my university, entrance scholarships were decided a few years ago based on students' aggregate score across 25ish dimensions (I can't remember the exact number) where students were each rated 1-4. Consequently a student who was absolutely exceptional in one area would be beaten out by a student who was marginally above average in all the other areas. I suggested that rather than scoring 1-4 the scores should be 1/2/5/25 instead.
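To put toy numbers on both of those (the blended hotel score and the 1/2/5/25 remapping below are illustrations I made up, not formulas any site or university actually uses):

    # Hotel: nine 9/10 categories plus one deal-breaker 0/10.
    scores = [9] * 9 + [0]
    arithmetic = sum(scores) / len(scores)
    print(arithmetic)                           # 8.1 -- the 0 costs barely one point
    penalized = 0.5 * arithmetic + 0.5 * min(scores)
    print(penalized)                            # 4.05 -- one crude way to make a deal-breaker hurt

    # Scholarship: convex remapping so an exceptional score can dominate.
    remap = {1: 1, 2: 2, 3: 5, 4: 25}
    exceptional = [4, 2]                        # outstanding in one dimension, average in the other
    steady = [3, 3]                             # marginally above average in both
    print(sum(exceptional), sum(steady))        # 6 6  -- a tie on the raw 1-4 scale
    print(sum(remap[s] for s in exceptional),
          sum(remap[s] for s in steady))        # 27 10 -- remapped, the exceptional candidate wins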
The problem here begins even before the mathematical issue - it's that websites that live off listing bookings have an incentive to offer a way to delete reviews that are not in line with what the owner wants to see.
Honestly, the ratings on those sites are essentially useless anyway, because people are bad at reviewing.
I generally sample the lowest-rated written reviews to check whether people are complaining about real stuff or are just confused. For instance, if a hotel doesn't have a bar, some of the negative reviews will usually be about how the hotel doesn't have a bar; these can be safely ignored as having been written by idiots (it is not like the hotel is hiding the fact that it doesn't have a bar).
Occasionally some of the positive reviews are similarly baffling. Was recently booking a hotel in Berlin in January, and the top review's main positive comment about the hotel was that it had heating. Well, yeah, I mean, you'd hope so. I can only assume that the reviewer was a visitor from the 19th century.
The worst thing I’ve found with positive reviews is ones that are obviously fake/incentivized. I looked up reviews recently for a hotel that I used to stay at a lot for work, and had gone way downhill with many issues (broken ACs, mold, leaking ceilings, etc.). I was curious if they ever fixed their problems. I was at first surprised that they had a fairly positive overall review rating. But looking deeper, the many negative reviews were just crowded out by obviously fake reviews. Dead giveaways: every single one named multiple people by name. “Dave at the front desk was just so friendly and welcoming! Barbara the housecleaner did a fantastic job cleaning. And Steve the bartender just made my day! I love this hotel! 5 stars!” (Almost) nobody reviewing a hotel for real does that.