
We've had the same issue. They were doing huge bursts of tens of thousands of requests in a very short time, several times a day. The bots didn't identify as FB (they used "spoofed" UAs) but were all coming from FB-owned netblocks. I contacted FB about it, but they couldn't figure out why this was happening and didn't solve the problem. I found out that there is an option in the FB Catalog manager that lets FB auto-remove items from the catalog when their destination page is gone. Disabling this option solved the issue.


I just had an idea: if you control your own name server, I believe you could use a BIND view to send all their traffic back to themselves based on the source address.

By the way, if someone discovers how to trigger this issue, it would be easy to use it as a DoS pseudo-botnet.
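Roughly, the named.conf shape I have in mind - a sketch only: the ACL ranges below are examples taken from FB's published netblocks and should be verified against current WHOIS data for AS32934, and the whole trick assumes their crawler resolves DNS from inside those netblocks:

    // Sketch: answer Facebook's own resolvers with a zone that points back at them
    acl fb-netblocks { 31.13.64.0/18; 66.220.144.0/20; 69.171.224.0/19; 173.252.64.0/18; };

    view "facebook" {
        match-clients { fb-netblocks; };
        zone "example.com" {
            type master;
            file "db.example.com.fb";  // A/AAAA records here point at facebook.com's IPs
        };
    };

    view "default" {
        match-clients { any; };
        zone "example.com" {
            type master;
            file "db.example.com";     // the real zone data
        };
    };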


I had a different idea: maybe you could craft a zip-bomb response. The bot would fetch the small gzipped content and upon extraction discover it was GBs of data? Not sure that's possible here, when responding to a request, but it would surely draw the admins' attention to it.


Here's an example of things you can do against malicious crawlers: http://www.hackerfactor.com/blog/index.php?/archives/762-Att....


Thanks a lot! I've spent like 2h now reading his blog. Just amazing.


Fascinating read. :)


It is!


If their client asks for gzip compression of the HTTP traffic, you could do it.
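A minimal sketch of that in Python, assuming you only ever point it at traffic you're sure is abusive (the sizes are arbitrary; a run of zeros compresses at roughly 1000:1):

    # Serve a pre-compressed "bomb" only to clients advertising gzip support.
    import gzip
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # ~10 MB of zeros shrinks to roughly 10 KB on the wire
    BOMB = gzip.compress(b"\0" * 10_000_000)

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if "gzip" in self.headers.get("Accept-Encoding", ""):
                self.send_response(200)
                self.send_header("Content-Encoding", "gzip")
                self.send_header("Content-Type", "text/html")
                self.send_header("Content-Length", str(len(BOMB)))
                self.end_headers()
                self.wfile.write(BOMB)  # client inflates this to 10 MB
            else:
                self.send_response(204)  # no bomb for clients without gzip
                self.end_headers()

    HTTPServer(("", 8080), Handler).serve_forever()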


Nice idea, I like the thinking. I'll tuck that away for later use. PowerDNS has Lua built in, amongst a few other things.

My stack of projects to do is growing at a hell of a rate and I'm not popping them off the stack fast enough.


I know the feeling, it's one of the reasons I'm working on https://github.com/hofstadter-io/hof

Check out the code generation parts and modules, they are the most mature. We have HRDs (like CRDs in k8s for anything) and a scripting language between bash and Python coming out soon too.


I've tried to work out what your project does but I'm none the wiser. GEB is prominent on my bookshelf. I'm a sysadmin and I got as far as "hollow wold" in Go, or was it "Hail Marrow"? Can't remember.

I've checked out your repo for a look over tomorrow when I'm *cough* sober!


Stop by Gitter and I'd be happy to explain more.


Something about this idea sits uncomfortably with me. It also prompted an idea / thought experiment of my own.

We think of net neutrality as being for carriers and ISPs, but you could see it applied to a publicly accessible DNS service too. These DNS service providers are just as much part of the core service of the Internet as anyone else. It’s not a huge leap to require that those who operate a publicly accessible DNS service are bound by the same spirit of the regulations: that the infrastructure must not discriminate based on who is using it.

It’s different to operating a discriminatory firewall. DNS is a cacheable public service with bad consequences if poisonous data ends up in the system. Fiddling with DNS like this doesn’t seem like a good idea. Too much weird and bad stuff could go wrong.

Another analogy would be the use of encryption on amateur radio. It seems like an innocuous, even good, idea, but the radio waves were held open in public trust for public use. If you let them be used for a different (though arguably more useful) purpose, the resource ends up being degraded.

Also along these lines of thought [begin irony mode]: FCC fines for DNS wildcard abuse / usage.


Principled neutrality is fine for acceptable use. There’s no moral quandary in closing the door to abusers.


Isn't that the argument that providers make for wanting to meter usage? I.e. that video streamers, torrenters, Netflix and the like are 'abusing' the network by using a disproportionate amount of its capacity / bandwidth?

I guess my point is that "abuse" in this sense is pretty subjective.


Not really, those things are easy to contrast. Network providers have always been comfortable blackholing DoS routes, and it’s never been controversial. That’s clearly distinct from those wanting to double-dip on transport revenues for routine traffic.

The difference is in whether both endpoints want the traffic, not whether (or on what basis) the enabling infrastructure wants to bear it.


> There’s no moral quandary in closing the door to abusers

Doesn't necessarily apply to this conversation, but the moral mistake that people (and societies) frequently make is underestimating the nuance that should be exercised when identifying others as abusers.


If only there were a way to accidentally amplify that.


You could probably just send an HTTP redirect to their own site; no need to play with DNS for that.
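E.g. with Apache 2.4, something along these lines (a sketch; the netblock is one published FB range, verify before use):

    # Sketch: 301 anything arriving from a Facebook netblock back to facebook.com
    RewriteEngine On
    RewriteCond expr "-R '31.13.64.0/18'"
    RewriteRule ^ https://www.facebook.com/ [L,R=301]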


Probably better to find the highest-level FB employee's personal site that you can and send the million requests from FB there.


Except then you'd still have to deal with all the traffic.


>> The bots didn't identify as FB (used "spoofed" UAs)

That's surprising. What were the spoofed user agents that they used?

We've run into this issue also, but all Facebook bot activity had user agents that contained the string "facebookexternalhit".


I've seen these two user agents from FB IPs, maybe others:

Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53

Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Mobile/15E148 Safari/604.1

These also execute JavaScript in some modified sandbox or something, causing errors and firing error handlers. Interesting attempt to analyze the crawler here: https://github.com/aFarkas/lazysizes/issues/520#issuecomment...


Yep, these were the UAs we also saw (amongst others). And also in our case those bots were executing the JS, even hitting our Google Analytics. For some reason GA reported this traffic as coming from Peru and the Philippines, while an IP lookup showed it belonged to FB, registered in the US or Ireland.


Set up a robots.txt that disallows Facebook crawlers, sue Facebook for unauthorized access to computer systems if the crawling continues, profit.
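Step one is at least simple: Facebook documents facebookexternalhit as its crawler token, so the robots.txt would be just this (though as the replies note, it's a request rather than an enforcement mechanism, and it wouldn't match the spoofed browser UAs described upthread):

    User-agent: facebookexternalhit
    Disallow: /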


I believe this only works if it is coming from a corporation and targeted at an individual -_-


robots.txt is not a legal document. It is asking nicely, and plenty of crawlers purposefully ignore it.


I'm not a lawyer, but I believe I remember people being sued for essentially making a GET request to a URL they weren't supposed to GET.


Legal document is a tricky phrase to use. "No trespassing" signs are usually considered sufficient to justify prosecution for trespassing. If the sign is conspicuously placed, it does not usually matter if you actually see the sign or not.

I am not as familiar with law around accessing computer systems, but I imagine that given some of the draconian enforcement we've seen in the past that a robots.txt should be sufficient to support some legal action against someone who disregards it.


The no-trespassing sign does not mean I can't yell at you from the street, "Tell me your life story," which you are free to ignore.


Sure, but federal law does prevent you from repeatedly placing that request into my mailbox. I don't even need a sign for that.

Making an http request does not fit cleanly as an analogue to yelling from the street, nor does it fit as an analogue to throwing a written request on a brick through a window. It is something different that must be understood on its own terms.


Side question: how did you get in contact with Facebook? I have an ad account that was suspended last year and gave up trying to contact them.


Similar experience, closed ad account because of “suspicious activity”, at least ten support tickets (half closed automatically), four lame apologies (our system says no, sry) and then finally, “there was an error in our system, you’re good to go”


I lost it because there was some sort of UI bug in the ad preview page that led the page to reload a bunch of times really quick, and boom, I'm apparently a criminal. You're lucky though, I never got it back and I had to change my business credit card because they somehow kept charging me after suspension.


Really? I've never had problems contacting them by email. They're one of the easiest tech companies to talk to.


In my experience they're one of the most useless and difficult companies I've ever tried to interact with.


The entirety of the universe is contained in the preceding two comments.


Aah, finally it all makes sense!



What did you contact them for? Just curious as I've almost always heard they're like Google and impossible to get a human response from.


What is their email? I've never found one that they'd respond to.


I have an option to email them, live chat, or have our account manager contact us. I guess if you spend enough on ads per month you are entitled to more support...?


Try contacting their NOC; they _may_ give you a human who can help.


Try their live chat.


Do they actually have a live chat available for smaller advertisers? When I looked last year there wasn't one.


They do, I have used it a couple of times. Go to facebook.com/business/help, click on the get started button at the bottom for "find answers or contact support". Follow through and you will see a button called "chat with a representative".


Twitter does the same thing. It sends a bunch of spoofed visitors from Korea, Germany, and the US. The bots are spoofed to make it harder to filter them.


I wonder what the (legitimate?) reason is for them to spoof. Seems intentionally shady. Maybe there's a legit reason we're missing?


Possibly trying to avoid people sending them a different version of the page than users would see (of course they could change the page after the initial caching of a preview, but Twitter might refresh/check them later).

Also, you often need an impressive amount of the stuff that's in a normal UA string for random sites not to break or send you the "unsupported browser, please use Netscape 4 or newer!!!" page/..., although you can normally fit an identifier of what you really are at the end. (As an example, here's an iOS Safari user agent: "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1" - it's not Mozilla, it's not Gecko, but it has those keywords and patterns because sites expect to see them.)


Yeah, I once tried to tell my browser to send... I forget; either no UA, or a blank UA string, or nonsense or just "Firefox" or something. I figured, "hey, some sites might break, but it can't be that important!" It broke everything. IIRC, the breaking point was that my own server refused to talk to me. Now, I still think this is insane, but apparently this really is how it is right now.


It should make you realize just how much abuse there is on the internet that it's worth it to just filter traffic with no UA.

Usually people get stuck on the fact that we can't have nice things, so X sucks for not letting us have nice things; yet I never seem to see people acknowledge why we can't have nice things.

If they did, I'd see a lot more "ugh, bad actors suck!" and less "ugh, websites are just trying to make life miserable for me >:("


Is there an official error message for that? Filtering out empty UAs would trip me up every time I use wget: if I'm casually using wget for something, I don't bother with a UA. If sites started rejecting that, I'd want a clear error message so I wouldn't go crazy trying to figure out what the problem was. Given a clear "send a UA" message, I'd probably start wrapping my wget requests in a short bash script with some extra stuff thrown in to keep everyone happy. But I'd have to know what it is that's needed to keep everyone happy.


wget has a default User-Agent string.
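(It identifies as Wget/<version>, and you can override it when a site insists on something browser-shaped; the UA string here is just an illustration:)

    # --user-agent replaces wget's default Wget/<version> string
    wget --user-agent="Mozilla/5.0 (compatible; example-fetch)" https://example.com/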


Well, if filtering on UA really makes things difficult for bad actors, then they do suck, but more as a technical opinion than a moral statement.


That's amazing. Any idea what piece of middleware on your own server was doing that?


I don't remember, but what really got me was that I wasn't running anything that I expected to do fancy filtering; I think this was just Apache httpd running on CentOS. There was no web application firewall, no load balancers, and I'm pretty sure fail2ban was only set up for sshd. It at least appeared that just the stock Apache config was in play.


You removed a vital part of the http protocol and you're surprised when things break?


Seems like it follows the spec to me:

https://tools.ietf.org/html/rfc7231#section-5.5.3


That makes sense. I couldn't come up with a shady reason why they would do it to be honest, but I was curious.



When you know your target runs Internet Explorer, you serve phishing pages to IE users and some boring content to other users, at the server level. We've used this to keep our user tests out of Google Safe Browsing and so on. I'm sure similar tricks end up applied to Facebook's UA and "legitimate" marketing sites.


Thanks man! I'll have a look.


Hey! Facebook engineer here. If you have it, can you send me the User-Agent for these requests? That would definitely help speed up narrowing down what's happening here. If you can provide me the hostname being requested in the Host header, that would be great too.

I just sent you an e-mail, you can also reply to that instead if you prefer not to share those details here. :-)


I'm not sure I'd publicly post my email like that, if I worked at FB. But congratulations on your promotion to "official technical contact for all facebook issues forever".


My e-mail address is already public from my kernel commits and upstream work. :-)


I don't think I used my email for anything important during my time at FB. If it gets out of hand he could just request a new primary email and use the above one for "spam".


Curiosity question: does FB use Gmail/G Suite?


FB uses Office365 for email. It was on-premise Exchange many many years ago, but moved "to the cloud" a while back.


Feels odd to read that Facebook uses Office365/Exchange for email. They haven't built their FSuite yet; I thought they would simply promote Facebook Messenger internally. I'm only half joking.


Most communication is via Workplace (group posts and chat). Emails aren't very common any more - mainly for communication with external people and alerting.


My impression is that they pretty much roll their own communication suite.


That's somewhat correct.

But at least for email/calendar the backend is Exchange.

The internal replacement clients for calendar and other things are killer... have yet to find replacements.

For the most part, though, they use Facebook internally for messaging and regular communication (technically now Workplace, but before it was just Facebook).

Email is really just for external folks.


I don't want to share my website for personal reasons, but here is some data from the Cloudflare dashboard (a request made on 11 Jun 2020 21:30:55 from Ireland; I have 3 requests in the same second from 2 different IPs):

    user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
    ip 1: 2a03:2880:22ff:3::face:b00c (1 request)
    ip 2: 2a03:2880:22ff:b::face:b00c (2 requests)
    ASN: AS32934 FACEBOOK


Yes, requests are still coming. Thanks, Cloudflare, for saving my ass.
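(For anyone else behind Cloudflare: a firewall rule with an expression along these lines should match the whole ASN - field name as I recall it from Cloudflare's rules language, so check their current docs - paired with a Block or Challenge action:)

    (ip.geoip.asnum eq 32934)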


Hey,

By the way, I can confirm this issue. I work at a large newspaper in Norway and around a year ago we saw the same thing: thousands of requests per second until we blocked it. And after we blocked it, traffic to our Facebook page also plummeted. I assume Facebook considered our website down and thus wouldn't give users content from our Facebook page either, as that would serve them content that would give a bad user experience. The Facebook traffic did not normalize until the attack stopped AND we had told Facebook to reindex all our content.

If you want more info, send me an email and I'll dig out some logs etc. thu at db.no


Thanks for looking at this!


Thank goodness for mod_rewrite, which makes blocking/redirecting traffic on basic things like headers pretty easy.

https://www.usenix.org.uk/content/rewritemap.html

You could of course block upstream by IP, but if you want to send the traffic away from a CPU-heavy dynamic page to something static that 2xx's or 301's to https://developers.facebook.com/docs/sharing/webmasters/craw... then this could be the answer.
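A rough sketch of that (the stub path is made up for illustration; the tokens are Facebook's documented crawler UAs, which of course won't catch the spoofed browser strings discussed elsewhere in this thread):

    # Divert self-identified FB crawlers to a cheap static page
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|facebookcatalog) [NC]
    RewriteRule ^ /fb-stub.html [L,R=302]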


Here are some more details from my report to FB:

"My webserver is getting hit with bursts of hundreds of requests from Facebook's IP ranges. Google Analytics also reports these hits and shows them as coming from (mostly) Philippines and Peru, however, IP lookup shows that these IPs belong to Facebook (TFBNET3). The number of these hits during a burst typically exceeds my normal traffic by 200%, putting a lot of stress at our infrastructure, putting our business at risk.

This started happening after the Facebook Support team resolved a problem I reported earlier regarding connecting my Facebook Pixel as a data source to my Catalog. It seems Facebook is sending a bot to fetch information from the page, but it does so very aggressively and apparently calls other trackers on the page (such as Google Analytics)."

69.171.240.19 - - [13/Aug/2018:11:09:52 +0200] "GET /items/ley3xk/ford-dohc-20-sierra-mondeo-scorpio-luk-set.html HTTP/1.1" 200 15181 "https://www.facebook.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"

"e.g. IP addresses 173.252.87.* performed 15,211 hits between Aug 14 12:00 and 12:59, followed by 13,946 hits from 31.13.115.*"

"What is also interesting is that the user agents are very diverse. I would expect a Facebook crawler to identify itself with a unique User-Agent header (as suggested by the documentation page mentioned earlier), but instead I see User-Agent strings that belong to many different browsers. E.g. this file contains 53,240 hits from Facebook's IP addresses with User-Agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"

There are a few Facebook user agents in there, but far fewer than browser user agents:

    7,310 hits: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
    2,869 hits: facebookexternalhit/1.1
    1,439 hits: facebookcatalog/1.0
      120 hits: facebookexternalua

Surprisingly, there is even a user agent string that mentions Bing:

    6,280 hits: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b

These IPs don't only fetch the HTML page, but load all the page's resources (images, CSS, ...) including all third-party trackers (such as Google Analytics). Not only does this put unnecessary stress on our infrastructure, it drives up the usage costs of third-party tracking services and renders some of our reports unreliable."

Final response from FB: "Thanks for your patience while our team looked into this. They've added measures to reduce the amount of crawler calls made. Further optimizations are being worked on as well, but for now, this issue should be resolved." <- NOT.


I don't really understand what the issue is. On my welcome page (while all other URLs are impossible to guess) I give the browser something that requires a few seconds of CPU at 100% to crunch, track some user actions in between, have tarpitted URLs, etc. In the last few years no bot has come through. Why bother with robots.txt? Just give them something to break their teeth on...

(I would give you the URL, but I just don't want it to be visited.)


I don't think most users would appreciate having a site spike their CPU for a few seconds when they visit...at least I wouldn't.


The far-fetched "solution" along with the "I don't understand everyone doesn't do <thing that is easy to understand why everyone wouldn't do it>" make it hard for me to believe you're actually doing this. Sounds more like a shower thought, maybe fun weekend project, than something in production.


Sounds like terrible UX! Won't browsers give you a "Stop script" prompt if you do that?


I don’t think the solution is flatlining everyone’s CPUs.


> but I just dont want ti be visited

Maybe you should just take your website offline?


How much have you mined?



