Hacker News

As someone with lots of anti-anti-botting knowledge - both are ineffective.

Even if it's a "global rate limit" I'll find out the value (I've never run into anyone randomizing it) and jump on the web request faster than anyone else RIGHT as it comes up.
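
The claim that a fixed limit is easy to find can be sketched with a toy simulation. This is a minimal illustration, not real tooling: `FixedWindowLimiter` is a hypothetical stand-in for server-side logic, and the point is only that a non-randomized window is discoverable by bursting requests and counting how many succeed before the first rejection.

```python
# Toy fixed-window rate limiter (hypothetical server-side logic).
class FixedWindowLimiter:
    def __init__(self, limit, window_s):
        self.limit = limit          # max requests per window
        self.window_s = window_s    # window length in seconds
        self.window_start = None
        self.count = 0

    def allow(self, now):
        # Start a new window when the old one expires.
        if self.window_start is None or now - self.window_start >= self.window_s:
            self.window_start, self.count = now, 0
        self.count += 1
        return self.count <= self.limit

def probe_limit(limiter):
    """Burst inside one window and count successes: that count IS the limit."""
    now = 0.0  # fixed timestamp -> every request lands in a single window
    allowed = 0
    while limiter.allow(now):
        allowed += 1
    return allowed

print(probe_limit(FixedWindowLimiter(limit=30, window_s=60)))  # -> 30
```

Randomizing the limit per window would break this single-burst probe, which is presumably why the commenter notes nobody doing it.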

With CAPTCHAs I'll bypass them with a solving service and/or computer vision if it's easy, or even just get past the noCAPTCHA solutions with primed browser instances from credible networks.

But don't kid yourself - that would not solve botting at all.



> But don't kid yourself - that would not solve botting at all.

What would solve it? Or rather, what is the best defensive measure these days?


At this time - I honestly don't know.

Even the reputation-based stuff is laughable, and one can hide a Puppeteer instance by using good originating networks and spoofing a ton of details in the browser. Even if that's a no-go, you can also automate plain old Chromium/Chrome with extensions and run it in a headless session through something like Xpra. I'm experimenting with Firefox solutions as well.

All in all, I've never been stopped - and that's not me stroking my ego... there are TONS of resources out there for this stuff that are just a DuckDuckGo search away.

The biggest thing is that if they start aggressively fingerprinting bots, they're going to start blocking real people. It's all based on a score - and getting a good score is just a matter of a credible proxy, CAPTCHA-bypassing services, and making a browser look highly credible.

---

For a "real" answer of some value - as a web developer myself, I'd try to make it as expensive as possible for them. Which specifically would be to implement a non-standard CAPTCHA solution and do rate/conversion-limiting per-network. The reason I didn't say this up-front is because it's not a solid solution - it's just increasing the barrier of difficulty and cost for those that are trying to automate around your solution.


> But don't kid yourself - that would not solve botting at all.

Surely CAPTCHAs and rate limiting raise the bar for people botting.

It couldn't make it any worse right?


It increases the barrier and cost for sure! But the thing is, if you get someone who's even remotely sophisticated, they can get past this sort of stuff in short order.

For something that's highly desirable like tickets, Nike drops, or apparently campsites there are many people with sufficient ability to bypass this stuff.


If it doesn't make it any better, then CAPTCHA makes it worse for real humans


Yep - and as someone who's run a lot of conversion-based online solutions, this is 100% true. Even when you account for automated sign-ups etc., the inclusion of a CAPTCHA will ding your rates.




> As someone with lots of anti-anti-botting knowledge

Any recommendations for books/other information sources? Currently doing some backend work for a company that mainly does scraping and a lot of this seems to be based on the tribal knowledge of the resident old wise one.


I sorta disagree on this being tribal knowledge - a lot of this stuff is out there if you're willing to dig a bit. Tons of it comes down to network reputation and having a legitimate-looking bot. If you're scraping at scale it's an infrastructure problem just as much as it is a fingerprinting one.

Stuff like https://news.ycombinator.com/item?id=20479015 is a goldmine for me, and working around said methods in a lab-like environment usually makes for a fun weekend.

> resident old wise one

He's just invested the time in picking this skillset up - it's definitely not just something you're an expert on after a few small projects. It takes years of having things break over, and over, and over [...]

There's a small Discord server that I set up for people who do a lot of RPA/web scraping, if you're interested in joining a "tribe". My contact info is accessible through my profile =)


re: tribal knowledge - didn't mean the field as a whole, just that bus factor is pretty low at the place I'm currently at.


So what does help against bots?


A lottery, like Yosemite's Camp 4 has. I really enjoyed my experience with that. You could run lotteries months or weeks ahead to match different needs.


Answered further up in the thread: https://news.ycombinator.com/item?id=21631112

It's a cat and mouse game for sure, but the answer is usually nothing if the person is sophisticated enough. You can only increase the difficulty/cost (covered in that comment).


I like how the sibling comments all expect an abuser to reveal how to prevent their abuse.

To paraphrase Sinclair: It is difficult to get a man to divulge how to prevent something, when his salary depends on his not preventing it.


I'm an open book! Ask any question that you would like!

Here's the thing - all of this info is out there (largely in other HN threads on this topic) and I'm nothing special in my field-of-knowledge ;)

I'm confident if people fully prevented my "abuse" they'd start to block actual users... simple as that.


What's your take on ML for bot classification? How successful has that family of strategies been, in your opinion? One could speculate on what particular features of client behavior a model would home in on to detect a bot, but it would likely be unexpected behavioral oddities not shared by legit clients - things a human developer's intuition wouldn't think of.


This is an excellent question!

When it comes to a web "interaction" there are a ton of differences depending on the job at hand. If it's jumping on a Nike drop you're not going to run into those sorts of methods, whereas if it's scraping hundreds of thousands of SKUs there's a huge chance you will.

When a human opens a "conversation" with a website, it's usually for a specific action in a small burst of data, e.g. looking at a few products on Amazon. So as long as you design your bot's "conversation" with said website in the same way, you can totally bypass ML detection (and everything else for that matter).

Basically, if you think of ALL of the unique variables going back and forth in that conversation, it's a finite list. If you can check off each and every one of them, you can get past anti-bot measures no-problemo; it's just a matter of budget/time. ML can HELP to identify bot traffic, but if you're sending perfect headers/traffic/browser metrics/CC numbers/shipping addresses (this list gets long) you're still going to squeak by with an acceptable risk score without a problem. The other thing is that going raw HTTP often bypasses almost all of that crap - way more than one would think! I will often get "fingerprinted" with a very valid browser, then continue my work after treating their site as an API; ie: talking directly to the backend (or rendered HTML page) and no one else.
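
The "site as an API" idea can be sketched with the standard library alone. Everything here is hypothetical - the endpoint, cookie value, and header set are made up - the point is that the raw request carries the same identity the fingerprinted browser session earned:

```python
from urllib.request import Request

# Hypothetical endpoint and session values; the technique is reusing the
# browser session's headers and cookie on plain HTTP requests.
API_URL = "https://shop.example/api/products?page=1"
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://shop.example/products",
}

def api_request(url, session_cookie):
    """Build a raw-HTTP request that reuses the browser session's identity."""
    req = Request(url, headers=dict(BROWSER_HEADERS))
    req.add_header("Cookie", f"session={session_cookie}")
    return req  # pass to urllib.request.urlopen(...) to actually send it

req = api_request(API_URL, "cookie-from-real-browser-session")
print(req.get_header("User-agent").split("/")[0])  # -> Mozilla
```

To the backend, each of these requests looks like the same "valid" browser continuing its session, minus the page loads around it.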

The #1 place where bot classification works is network reputation... so I use services that let me proxy out to VERY reputable networks that are actually business/residential ISP connections. This gets me past the majority of countermeasures, because if they start blocking those sorts of IP addresses/ranges they're going to block real users at some point. Unfortunately, highly reputable proxy services do cost money - but for someone who does this professionally it's all about budgeting etc.
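
A minimal sketch of rotating across such a pool, with entirely hypothetical proxy URLs and reputation scores (whatever your provider or your own measurements assign to each exit network):

```python
import random

# Hypothetical proxy pool; "reputation" scores are made up for illustration.
PROXY_POOL = [
    {"url": "http://residential-1.example:8080", "reputation": 0.9},
    {"url": "http://residential-2.example:8080", "reputation": 0.7},
    {"url": "http://datacenter-1.example:8080",  "reputation": 0.2},
]

def pick_proxy(pool, rng=random):
    """Weighted pick: the most reputable exits carry most of the traffic."""
    return rng.choices(pool, weights=[p["reputation"] for p in pool], k=1)[0]

rng = random.Random(7)
picks = [pick_proxy(PROXY_POOL, rng)["url"] for _ in range(1000)]
# The residential exits dominate; the datacenter exit is used sparingly.
print(picks.count("http://residential-1.example:8080") >
      picks.count("http://datacenter-1.example:8080"))  # -> True
```

In practice you'd also pin a stable exit per "origin"/browser profile rather than re-rolling on every request, so each identity keeps a consistent network story.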

Implementing these systems well requires really tight integration between your front-end and back-end, and most companies (even the big ones) don't have the development sophistication to pull it off. Those that do are just more expensive to beat, because you'll be bouncing around between 40 different "origins" with unique browser profiles.

Sorry for the essay =)


> I will often get "fingerprinted" with a very valid browser, then continue my work after treating their site as an API; ie: talking directly to the backend (or rendered HTML page) and no one else.

This seems like just the kind of behavior an ML approach could easily identify. Even just feeding a trained model your basic request log data would show you as quite a different kind of user: you're not fetching images, JavaScript, etc., and you'd have a substantially different traffic profile. Obviously you could get around that by scripting a browser, but that just kicks the can down the road: a scripted browser will still likely behave in some measurably different way from a human-driven browser, and the specifics of those differences are unlikely to be found by intuition but rather by ML, which can home in on things we wouldn't think of. For example, the time spent on certain pages/activities, or the position of the cursor on links being clicked (when you tell a headless browser to click a link, do the mouse coordinates in the click event look normal, or are they at the upper-left coordinate of the link's position?). The more data you feed such an approach, the more details it can use to find anomalies that differentiate humans from bots.
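
A toy version of the cursor-position feature described above (all names and thresholds are hypothetical): normalize each click's position within the link's bounding box, then flag sessions where every click lands on the identical point - e.g. always the box's top-left corner:

```python
# Toy behavioral feature: where inside a link's bounding box do clicks land?
# Naive automation clicks the exact same point every time; humans scatter.
def click_offsets(clicks, bbox):
    """clicks: [(x, y)] page coordinates; bbox: (left, top, width, height)."""
    left, top, w, h = bbox
    return [((x - left) / w, (y - top) / h) for x, y in clicks]

def looks_scripted(offsets, tol=1e-6):
    """Flag sessions where every click lands on the identical point."""
    if len(offsets) < 3:
        return False  # not enough evidence
    fx, fy = offsets[0]
    return all(abs(x - fx) < tol and abs(y - fy) < tol for x, y in offsets)

link_bbox = (100, 200, 80, 20)
bot_clicks = [(100, 200)] * 5                                    # always the corner
human_clicks = [(131, 207), (140, 211), (152, 206), (137, 214)]  # scattered
print(looks_scripted(click_offsets(bot_clicks, link_bbox)))    # -> True
print(looks_scripted(click_offsets(human_clicks, link_bbox)))  # -> False
```

A real classifier would feed features like these (plus dwell times, scroll cadence, and so on) into a trained model rather than hand-coding a rule; this only shows the kind of feature extraction involved.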


Totally agree, but the thing is it takes a TON of sophistication to pull that off and block proactively - to the point that I'd say the only people I've encountered employing methods like that are Amazon. Typically when you hit that, you spoof a crawler and it gets you by just fine.


That’s not a paraphrase of Sinclair. Sinclair’s saying is about people not understanding something that conflicts with their income.

Refusing to give away a secret has absolutely no overlap with Sinclair’s saying.


I guess I should have said "adapt" or "inspired by" or "riffing on" or something to indicate to the literal-minded that I was merely copying Sinclair's phraseology, rather than his meaning.

I thought "paraphrase" was sufficient, but I appear to have misjudged.



