As someone with lots of anti-anti-botting knowledge - both are ineffective.
Even if it's a "global rate limit" I'll find out the value (I've never run into anyone randomizing it) and jump on the web request faster than anyone else RIGHT as it comes up.
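To make that concrete - a toy sketch of what "finding the value" looks like, assuming you've already probed the endpoint and logged the times when the limiter resets (all numbers below are made up for illustration):

```python
import statistics

def estimate_reset_interval(reset_times):
    """Estimate a fixed rate-limit window from observed reset timestamps.

    reset_times: increasing timestamps (seconds) at which the limiter was
    observed to reset, e.g. a 429 flipping back to a 200.
    """
    gaps = [b - a for a, b in zip(reset_times, reset_times[1:])]
    return statistics.median(gaps)  # median is robust to one noisy probe

def next_fire_time(last_reset, interval, margin=0.05):
    """Schedule the next request to land just after the window reopens."""
    return last_reset + interval + margin

# Probes saw the limiter resetting roughly every 60 seconds:
resets = [0.0, 60.1, 120.0, 180.2]
interval = estimate_reset_interval(resets)
fire_at = next_fire_time(resets[-1], interval)
```

The point is that a non-randomized limit is trivially measurable from the outside, so the bot can sit idle and wake up exactly when the window reopens.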
With CAPTCHAs I'll bypass them with a solving service and/or computer vision if it's easy, or even just get past the noCAPTCHA solutions with primed browser instances from credible networks.
But don't kid yourself - that would not solve botting at all.
Even the reputation-based stuff is laughable - one can hide a Puppeteer instance by originating from good networks and spoofing a ton of details in the browser. Even if that's a no-go, you can also automate plain old Chromium/Chrome with extensions and run it in a headless session through something like Xpra. I'm experimenting with Firefox solutions as well.
All-in-all, I've never been stopped - and that's not me stroking my ego... there's TONS of resources out there for this stuff that are just a DuckDuckGo search away.
The biggest thing is that if they start aggressively fingerprinting bots, they're going to start blocking real users. It's all based on a score - and getting a good score is just a matter of a credible proxy, CAPTCHA-bypassing services, and making a browser look highly credible.
---
For a "real" answer of some value - as a web developer myself, I'd try to make it as expensive as possible for them. Which specifically would be to implement a non-standard CAPTCHA solution and do rate/conversion-limiting per-network. The reason I didn't say this up-front is because it's not a solid solution - it's just increasing the barrier of difficulty and cost for those that are trying to automate around your solution.
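For the per-network rate-limiting piece, here's a rough sketch of what I mean - keying a sliding window on the /24 prefix instead of the individual IP, so rotating through addresses inside one subnet doesn't reset the counter. The class name and thresholds are illustrative, not from any particular framework:

```python
import ipaddress
import time
from collections import defaultdict, deque

class PerNetworkRateLimiter:
    """Sliding-window limiter keyed by network prefix rather than single IP,
    so cycling through a proxy pool inside one subnet doesn't help."""

    def __init__(self, max_requests, window_seconds, prefix=24):
        self.max_requests = max_requests
        self.window = window_seconds
        self.prefix = prefix
        self.hits = defaultdict(deque)  # network -> recent request times

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        net = ipaddress.ip_network(f"{ip}/{self.prefix}", strict=False)
        q = self.hits[net]
        # Drop requests that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True
```

In practice you'd also want the prefix width and limits tunable per route, since checkout endpoints deserve far tighter limits than product pages.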
It increases the barrier and cost for sure! But the thing is, if you get someone who's even remotely sophisticated, they can get past this sorta stuff in short order.
For something that's highly desirable like tickets, Nike drops, or apparently campsites there are many people with sufficient ability to bypass this stuff.
Yep - and as someone who's run a lot of conversion-based online solutions, this is 100% true. Even when you account for automated sign-ups etc., the inclusion of a CAPTCHA will ding your rates.
> As someone with lots of anti-anti-botting knowledge
Any recommendations for books/other information sources? Currently doing some backend work for a company that mainly does scraping and a lot of this seems to be based on the tribal knowledge of the resident old wise one.
I sorta disagree on this being tribal knowledge - a lot of this stuff is out there if you're willing to dig a bit. Tons of it comes down to network reputation and having a legitimate-looking bot. If you're scraping at scale it's an infrastructure problem just as much as it is a fingerprinting one.
He's just invested the time in picking this skillset up - it's definitely not just something you're an expert on after a few small projects. It takes years of having things break over, and over, and over [...]
There's a small Discord server that I set up for people who do a lot of RPA/web scraping if you're interested in joining a "tribe". My contact info is accessible through my profile if you're interested =)
A lottery, like Yosemite's Camp 4 has. I really enjoyed my experience with that. You could run some lotteries months or weeks ahead to match different needs.
It's a cat and mouse game for sure, but the answer is usually nothing if the person is sophisticated enough. You can only increase the difficulty/cost (covered in that comment).
What's your take on ML for bot classification? How successful has that family of strategies been in your opinion? One could speculate on what particular features of client behavior a model would hone in on to detect a bot, but it would actually likely be unexpected behavioral oddities not shared by legit clients that a human developer's intuition wouldn't think of.
When it comes to a web "interaction" there's a ton of difference depending on the job at hand. If it's jumping on a Nike drop you're not going to run into those sorts of methods, whereas if it's scraping hundreds of thousands of SKUs there's a huge chance you will.
When a human opens a "conversation" with a website usually it's for a specific action which is a small burst of data, ie: looking at a few products on Amazon. So as long as you can design your bot's "conversation" with said website in the same way you can totally bypass ML detection (and everything else for that matter).
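As a toy illustration of shaping that "conversation" - a short burst of page views with varied dwell times, then silence. The lognormal timing is my own assumption for the sketch, not some known-good distribution; the point is bursty, irregular pacing instead of a fixed-interval crawl:

```python
import random

def conversation_plan(pages, rng=None):
    """Plan one short 'conversation': a handful of page views with
    human-plausible dwell times, then go quiet."""
    rng = rng or random.Random()
    visits = rng.randint(2, min(5, len(pages)))  # small burst, like a human
    plan = []
    for page in rng.sample(pages, visits):       # no repeats within a burst
        dwell = rng.lognormvariate(2.5, 0.6)     # skewed dwell, median ~12 s
        plan.append((page, round(dwell, 1)))
    return plan
```

You'd run one of these plans per "session", sleep between them, and keep each session on its own browser profile and network.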
Basically, if you think of ALL of the unique variables going back and forth in that conversation, it's a finite list. If you can check off each and every one of those you can get past anti-bot measures no problemo - it's just a matter of budget/time. ML can HELP to identify bot traffic, but if you're sending perfect headers/traffic/browser metrics/CC numbers/shipping addresses (this list gets long) you're still going to squeak by with an acceptable risk score without a problem.

The other thing is that going raw HTTP often bypasses almost all of that crap - way more often than one would think! I will often get "fingerprinted" with a very valid browser, then continue my work after treating their site as an API; ie: talking directly to the backend (or rendered HTML page) and no one else.
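The raw-HTTP trick is mostly about sending a browser-credible header set straight at the backend. A minimal sketch with Python's stdlib - the endpoint is hypothetical and the request is built but deliberately not sent:

```python
import urllib.request

# Hypothetical endpoint - the point is the header set, not the URL.
API_URL = "https://example.com/api/products?page=1"

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com/products",
    "X-Requested-With": "XMLHttpRequest",  # many XHR backends look for this
}

req = urllib.request.Request(API_URL, headers=BROWSER_HEADERS)
# urllib.request.urlopen(req)  # not executed here - no live endpoint
```

In real use you'd first drive a genuine browser session to collect cookies and any tokens the frontend stashes, then replay those alongside headers like these.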
The #1 thing I'd say where bot classification works is network reputation... so I use services that allow me to proxy out to VERY reputable networks that are actually business/residential ISP connections. This lets me get past the majority of countermeasures, because if they start blocking those sorts of IP addresses/ranges they're going to block real users at some point. Unfortunately, highly reputable proxy services do cost money - but for someone who does this professionally it's all about budgeting etc.
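The proxy side is mostly plumbing - the one detail that matters is keeping each logical session pinned to a single exit, so its traffic always comes from one consistent network. A minimal sketch (the addresses are placeholders, and this isn't any particular provider's API):

```python
import hashlib

class ProxyPool:
    """Spread sessions across a pool of exit nodes, but keep each
    session pinned to the same exit for its whole lifetime."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def for_session(self, session_id):
        # Deterministic hash pick: same session always maps to same exit.
        digest = hashlib.sha256(session_id.encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.proxies)
        return self.proxies[index]
```

Hash-pinning beats round-robin here because a session that hops exits mid-conversation is exactly the kind of anomaly reputation systems flag.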
Implementing these systems well takes really tight coordination between your front-end and back-end, and most companies (even the big ones) don't have the development sophistication to pull it off. Those that do are just more expensive to beat, because you'll be bouncing around between 40 different "origins" with unique browser profiles.
> I will often get "fingerprinted" with a very valid browser, then continue my work after treating their site as an API; ie: talking directly to the backend (or rendered HTML page) and no one else.
This seems like just the kind of behavior that an ML approach could easily identify. Even just feeding a trained model your basic request log data would show you as quite a different kind of user: you're not fetching images, javascript, etc and would have a substantially different traffic profile. Obviously you could get around that by scripting a browser, but that just kicks the can down the road: a scripted browser will still likely behave in some measurable way different than a human-driven browser, and the specifics of those differences are unlikely to be found by intuition but rather by ML which can hone in on things we wouldn't think of. For example, the time spent in certain pages/activities, or the position of the cursor on links being clicked (when you tell a headless browser to click a link, do the mouse coordinates in the click event look normal or are they at the upper left coordinate of the link's position?) etc. The more data you feed such an approach the more details it can use to find anomalies that differentiate humans from bots.
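To make that concrete: even a dumb, non-ML feature like "ratio of asset requests to total requests" cleanly separates an API-only client from a real browser - a trained model would just find many more features like this. A toy sketch with made-up log data:

```python
from collections import defaultdict

ASSET_EXTENSIONS = (".js", ".css", ".png", ".jpg", ".woff2", ".svg")

def traffic_features(log):
    """Per-client (asset_ratio, request_count) from (client, path) pairs.
    A client talking straight to the backend fetches pages/JSON but no
    images/JS/CSS, so its asset ratio collapses toward zero."""
    counts = defaultdict(lambda: [0, 0])  # client -> [assets, total]
    for client, path in log:
        counts[client][1] += 1
        if path.endswith(ASSET_EXTENSIONS):
            counts[client][0] += 1
    return {c: (assets / total, total) for c, (assets, total) in counts.items()}

def flag_api_only_clients(features, min_requests=5, max_asset_ratio=0.05):
    """Flag clients with enough traffic but almost no asset fetches."""
    return {c for c, (ratio, total) in features.items()
            if total >= min_requests and ratio <= max_asset_ratio}
```

A real system would feed dozens of such features (timing, ordering, cursor events) into a model rather than hand-picking thresholds, but the shape of the signal is the same.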
Totally agree, but the thing is it takes a TON of sophistication to pull that off and block proactively - to the point that I'd say the only people I've encountered who employ methods like that are Amazon. Typically when you run into that you just spoof a crawler and it gets you by just fine.
I guess I should have said "adapt" or "inspired by" or "riffing on" or something to indicate to the literal-minded that I was merely copying Sinclair's phraseology, rather than his meaning.
I thought "paraphrase" was sufficient, but I appear to have misjudged.