
We've had the same issue. They were doing huge bursts of tens of thousands of requests in a very short time, several times a day. The bots didn't identify as FB (they used "spoofed" UAs) but were all coming from FB-owned netblocks. I contacted FB about it, but they couldn't figure out why this was happening and didn't solve the problem. I found out that there is an option in the FB Catalog manager that lets FB auto-remove items from the catalog when their destination page is gone. Disabling this option solved the issue.


I just had an idea: if you control your own name server, I believe you could use a BIND view to send all their traffic back to themselves based on the source address.

By the way, if someone discovers how to trigger this issue, it would be easy to use it as a DoS pseudo-botnet.
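Roughly, the named.conf shape I have in mind - a sketch only: the ACL ranges below are examples taken from FB's published netblocks and should be verified against current WHOIS data for AS32934, and the whole trick assumes their crawler resolves DNS from inside those netblocks:

    // Sketch: answer Facebook's own resolvers with a zone that points back at them
    acl fb-netblocks { 31.13.64.0/18; 66.220.144.0/20; 69.171.224.0/19; 173.252.64.0/18; };

    view "facebook" {
        match-clients { fb-netblocks; };
        zone "example.com" {
            type master;
            file "db.example.com.fb";  // A/AAAA records here point at facebook.com's IPs
        };
    };

    view "default" {
        match-clients { any; };
        zone "example.com" {
            type master;
            file "db.example.com";     // the real zone data
        };
    };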


I had a different idea: maybe you could craft a zip-bomb response. The bot would fetch the small gzipped content and upon extraction discover it was GBs of data? Not sure that's possible here, when responding to a request, but it would surely draw the admins' attention to it.


Here's an example of things you can do against malicious crawlers: http://www.hackerfactor.com/blog/index.php?/archives/762-Att....


Thanks a lot! I've spent like 2h now reading his blog. Just amazing.


Fascinating read. :)


It is!


If their client asks for gzip compression of the HTTP traffic, you could do it.
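A minimal sketch of that in Python, assuming you only ever point it at traffic you're sure is abusive (the sizes are arbitrary; a run of zeros compresses at roughly 1000:1):

    # Serve a pre-compressed "bomb" only to clients advertising gzip support.
    import gzip
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # ~10 MB of zeros shrinks to roughly 10 KB on the wire
    BOMB = gzip.compress(b"\0" * 10_000_000)

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if "gzip" in self.headers.get("Accept-Encoding", ""):
                self.send_response(200)
                self.send_header("Content-Encoding", "gzip")
                self.send_header("Content-Type", "text/html")
                self.send_header("Content-Length", str(len(BOMB)))
                self.end_headers()
                self.wfile.write(BOMB)  # client inflates this to 10 MB
            else:
                self.send_response(204)  # no bomb for clients without gzip
                self.end_headers()

    HTTPServer(("", 8080), Handler).serve_forever()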


Nice idea, I like the thinking. I'll tuck that away for later use. PowerDNS has Lua built in, amongst a few other things.

My stack of projects to do is growing at a hell of a rate and I'm not popping them off the stack fast enough.


I know the feeling, it's one of the reasons I'm working on https://github.com/hofstadter-io/hof

Check out the code generation parts and modules, they are the most mature. We have HRDs (like CRDs in k8s for anything) and a scripting language between bash and Python coming out soon too.


I've tried to work out what your project does but I'm none the wiser. GEB is prominent on my bookshelf. I'm a sysadmin and I got as far as "hollow wold" in Go, or was it "Hail Marrow"? Can't remember.

I've checked out your repo for a look over tomorrow when I'm *cough* sober!


Stop by Gitter and I'd be happy to explain more.


Something about this idea sits uncomfortably with me. It also prompted an idea / thought experiment of my own.

We think of net neutrality as being for carriers and ISPs, but you could see it applied to a publicly accessible DNS service too. These DNS service providers are just as much part of the core service of the Internet as anyone else. It’s not a huge leap to require that those who operate a publicly accessible DNS service are bound by the same spirit of the regulations: that the infrastructure must not discriminate based on who is using it.

It’s different to operating a discriminatory firewall. DNS is a cacheable public service with bad consequences if poisonous data ends up in the system. Fiddling with DNS like this doesn’t seem like a good idea. Too much weird and bad stuff could go wrong.

Another analogy would be the use of encryption on amateur radio. It seems like an innocuous, even good, idea, but the radio waves were held open in public trust for public use. If you let them be used for a different (though arguably more useful) purpose, the resource ends up being degraded.

Also along these lines of thought [begin irony mode]: FCC fines for DNS wildcard abuse / usage.


Principled neutrality is fine for acceptable use. There’s no moral quandary in closing the door to abusers.


Isn't that the argument that providers make for wanting to meter usage? I.e. that video streamers, torrenters, Netflix and the like are 'abusing' the network by using a disproportionate amount of its capacity / bandwidth?

I guess my point is that "abuse" in this sense is pretty subjective.


Not really, those things are easy to contrast. Network providers have always been comfortable blackholing DoS routes, and it’s never been controversial. That’s clearly distinct from those wanting to double-dip on transport revenues for routine traffic.

The difference is in whether both endpoints want the traffic, not whether (or on what basis) the enabling infrastructure wants to bear it.


> There’s no moral quandary in closing the door to abusers

Doesn't necessarily apply to this conversation, but the moral mistake that people (and societies) frequently make is underestimating the nuance that should be exercised when identifying others as abusers.


If only there were a way to accidentally amplify that.


You could probably just send an HTTP redirect to their own site; no need to play with DNS for that.
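E.g. with Apache 2.4, something along these lines (a sketch; the netblock is one published FB range, verify before use):

    # Sketch: 301 anything arriving from a Facebook netblock back to facebook.com
    RewriteEngine On
    RewriteCond expr "-R '31.13.64.0/18'"
    RewriteRule ^ https://www.facebook.com/ [L,R=301]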


Probably better to find the highest-level FB employee's personal site that you can and send the million requests from FB there.


Except then you'd still have to deal with all the traffic.


>> The bots didn't identify as FB (used "spoofed" UAs)

That's surprising. What were the spoofed user agents that they used?

We've run into this issue also, but all Facebook bot activity had user agents that contained the string "facebookexternalhit".


I've seen these two user agents from FB IPs, maybe others:

Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53

Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Mobile/15E148 Safari/604.1

These also execute JavaScript in some modified sandbox or something, causing errors and firing error handlers. Interesting attempt to analyze the crawler here: https://github.com/aFarkas/lazysizes/issues/520#issuecomment...


Yep, these were the UAs we also saw (amongst others). And also in our case those bots were executing the JS, even hitting our Google Analytics. For some reason GA reported this traffic as coming from Peru and the Philippines, while an IP lookup showed it belonged to FB, registered in the US or Ireland.


Set up a robots.txt that disallows Facebook crawlers, sue Facebook for unauthorized access to computer systems if the crawling continues, profit.
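Step one is at least simple: Facebook documents facebookexternalhit as its crawler token, so the robots.txt would be just this (though as the replies note, it's a request rather than an enforcement mechanism, and it wouldn't match the spoofed browser UAs described upthread):

    User-agent: facebookexternalhit
    Disallow: /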


I believe this only works if it is coming from a corporation and targeted at an individual -_-


robots.txt is not a legal document. It is asking nicely, and plenty of crawlers purposefully ignore it.


I'm not a lawyer, but I believe I remember people being sued for essentially making a GET request to a URL they weren't supposed to GET.


Legal document is a tricky phrase to use. "No trespassing" signs are usually considered sufficient to justify prosecution for trespassing. If the sign is conspicuously placed, it does not usually matter if you actually see the sign or not.

I am not as familiar with law around accessing computer systems, but I imagine that given some of the draconian enforcement we've seen in the past that a robots.txt should be sufficient to support some legal action against someone who disregards it.


The no-trespassing sign does not mean I can't yell at you from the street, "Tell me your life story," which you are free to ignore.


Sure, but federal law does prevent you from repeatedly placing that request into my mailbox. I don't even need a sign for that.

Making an http request does not fit cleanly as an analogue to yelling from the street, nor does it fit as an analogue to throwing a written request on a brick through a window. It is something different that must be understood on its own terms.


Side question: how did you get in contact with Facebook? I have an ad account that was suspended last year and gave up trying to contact them.


Similar experience, closed ad account because of “suspicious activity”, at least ten support tickets (half closed automatically), four lame apologies (our system says no, sry) and then finally, “there was an error in our system, you’re good to go”


I lost it because there was some sort of UI bug in the ad preview page that led the page to reload a bunch of times really quick, and boom, I'm apparently a criminal. You're lucky though, I never got it back and I had to change my business credit card because they somehow kept charging me after suspension.


Really? I've never had problems contacting them by email. They're one of the easiest tech companies to talk to.


In my experience they're one of the most useless and difficult companies I've ever tried to interact with.


The entirety of the universe is contained in the preceding two comments.


Aah, finally it all makes sense!



What did you contact them for? Just curious as I've almost always heard they're like Google and impossible to get a human response from.


What is their email? I've never found one that they'd respond to.


I have an option to email them, live chat, or have our account manager contact us. I guess if you spend enough on ads per month you are entitled to more support...?


Try contacting their NOC; they _may_ give you a human who can help.


Try their live chat.


Do they actually have a live chat available for smaller advertisers? When I looked last year there wasn't one.


They do, I have used it a couple of times. Go to facebook.com/business/help, click on the get started button at the bottom for "find answers or contact support". Follow through and you will see a button called "chat with a representative".


Twitter does the same thing. It sends a bunch of spoofed visitors from Korea, Germany, and the US. The bots are spoofed to make it harder to filter them.


I wonder what the (legitimate?) reason is for them to spoof. Seems intentionally shady. Maybe there's a legit reason we're missing?


Possibly trying to avoid people sending them a different version of the page than users would see (of course they could change the page after the initial caching of a preview, but Twitter might refresh/check them later).

Also, you often need an impressive amount of the stuff that's in a normal UA string for random sites not to break or send you the "unsupported browser, please use Netscape 4 or newer!!!" page/..., although you can normally fit an identifier of what you really are at the end. (As an example, here's an iOS Safari user agent: "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1" - it's not Mozilla, it's not Gecko, but it has those keywords and patterns because sites expect to see them.)


Yeah, I once tried to tell my browser to send... I forget; either no UA, or a blank UA string, or nonsense or just "Firefox" or something. I figured, "hey, some sites might break, but it can't be that important!" It broke everything. IIRC, the breaking point was that my own server refused to talk to me. Now, I still think this is insane, but apparently this really is how it is right now.


It should make you realize just how much abuse there is on the internet that it's worth it to just filter traffic with no UA.

Usually people get stuck on the fact that we can't have nice things, so X sucks for not letting us have nice things; yet I never seem to see people acknowledge why we can't have nice things.

If they did, I'd see a lot more "ugh, bad actors suck!" and less "ugh, websites are just trying to make life miserable for me >:("


Is there an official error message for that? Filtering out empty UAs would trip me up every time I use wget: if I'm casually using wget for something, I don't bother with a UA. If sites started rejecting that, I'd want a clear error message so I wouldn't go crazy trying to figure out what the problem was. Given a clear "send a UA" message, I'd probably start wrapping my wget requests in a short bash script with some extra stuff thrown in to keep everyone happy. But I'd have to know what it is that's needed to keep everyone happy.


wget has a default User-Agent string.
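(It identifies as Wget/<version>, and you can override it when a site insists on something browser-shaped; the UA string here is just an illustration:)

    # --user-agent replaces wget's default Wget/<version> string
    wget --user-agent="Mozilla/5.0 (compatible; example-fetch)" https://example.com/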


Well, if filtering on UA really makes things difficult for bad actors, then they do suck, but more as a technical opinion than a moral statement.


That's amazing. Any idea what piece of middleware on your own server was doing that?


I don't remember, but what really got me was that I wasn't running anything that I expected to do fancy filtering; I think this was just Apache httpd running on CentOS. There was no web application firewall, no load balancers, and I'm pretty sure fail2ban was only set up for sshd. It at least appeared that just the stock Apache config was in play.


You removed a vital part of the http protocol and you're surprised when things break?


Seems like it follows the spec to me:

https://tools.ietf.org/html/rfc7231#section-5.5.3


That makes sense. I couldn't come up with a shady reason why they would do it to be honest, but I was curious.



When you know your target runs Internet Explorer, you serve phishing pages to IE users and some boring content to other users, at the server level. We've used this to keep our user tests out of Google Safe Browsing and so on. I'm sure similar tricks end up applied to Facebook's UA and "legitimate" marketing sites.


Thanks man! I'll have a look.


Hey! Facebook engineer here. If you have it, can you send me the User-Agent for these requests? That would definitely help speed up narrowing down what's happening here. If you can provide me the hostname being requested in the Host header, that would be great too.

I just sent you an e-mail, you can also reply to that instead if you prefer not to share those details here. :-)


I'm not sure I'd publicly post my email like that, if I worked at FB. But congratulations on your promotion to "official technical contact for all facebook issues forever".


My e-mail address is already public from my kernel commits and upstream work. :-)


I don't think I used my email for anything important during my time at FB. If it gets out of hand he could just request a new primary email and use the above one for "spam".


Curiosity question: does FB use Gmail/G Suite?


FB uses Office365 for email. It was on-premise Exchange many many years ago, but moved "to the cloud" a while back.


Feels odd to read that Facebook uses Office365/Exchange for email. They haven't built their FSuite yet; I thought they would simply promote Facebook Messenger internally. I'm only half joking.


Most communication is via Workplace (group posts and chat). Emails aren't very common any more - mainly for communication with external people and alerting.


My impression is that they pretty much roll their own communication suite.


That's somewhat correct.

But at least for email/calendar the backend is Exchange.

The internal replacement clients for calendar and other things are killer... have yet to find replacements.

For the most part, though, they use Facebook internally for messaging and regular communication (technically now Workplace, but before it was just Facebook).

Email is really just for external folks.


I don't want to share my website for personal reasons, but here is some data from the Cloudflare dashboard (a request made on 11 Jun 2020 21:30:55 from Ireland; I have 3 requests in the same second from 2 different IPs):

    user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
    ip 1: 2a03:2880:22ff:3::face:b00c (1 request)
    ip 2: 2a03:2880:22ff:b::face:b00c (2 requests)
    ASN: AS32934 FACEBOOK


Yes, requests are still coming. Thanks, Cloudflare, for saving my ass.
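(For anyone else behind Cloudflare: a firewall rule with an expression along these lines should match the whole ASN - field name as I recall it from Cloudflare's rules language, so check their current docs - paired with a Block or Challenge action:)

    (ip.geoip.asnum eq 32934)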


Hey,

By the way, I can confirm this issue. I work at a large newspaper in Norway and around a year ago we saw the same thing: thousands of requests per second until we blocked it. And after we blocked it, traffic to our Facebook page also plummeted. I assume Facebook considered our website down and thus wouldn't give users content from our Facebook page either, as that would serve them content that would give a bad user experience. The Facebook traffic did not normalize until the attack stopped AND we had told Facebook to reindex all our content.

If you want more info, send me an email and I'll dig out some logs etc. thu at db.no


Thanks for looking at this!


Thank goodness for mod_rewrite, which makes blocking/redirecting traffic on basic things like headers pretty easy.

https://www.usenix.org.uk/content/rewritemap.html

You could of course block upstream by IP, but if you want to send the traffic away from a CPU-heavy dynamic page to something static that 2xx's or 301's to https://developers.facebook.com/docs/sharing/webmasters/craw... then this could be the answer.
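A rough sketch of that (the stub path is made up for illustration; the tokens are Facebook's documented crawler UAs, which of course won't catch the spoofed browser strings discussed elsewhere in this thread):

    # Divert self-identified FB crawlers to a cheap static page
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|facebookcatalog) [NC]
    RewriteRule ^ /fb-stub.html [L,R=302]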


Here are some more details from my report to FB:

"My webserver is getting hit with bursts of hundreds of requests from Facebook's IP ranges. Google Analytics also reports these hits and shows them as coming from (mostly) Philippines and Peru, however, IP lookup shows that these IPs belong to Facebook (TFBNET3). The number of these hits during a burst typically exceeds my normal traffic by 200%, putting a lot of stress at our infrastructure, putting our business at risk.

This started happening after the Facebook Support team resolved a problem I reported earlier regarding connecting my Facebook Pixel as a data source to my Catalog. It seems Facebook is sending a bot to fetch information from the page, but it does so very aggressively and apparently calls other trackers on the page (such as Google Analytics)."

69.171.240.19 - - [13/Aug/2018:11:09:52 +0200] "GET /items/ley3xk/ford-dohc-20-sierra-mondeo-scorpio-luk-set.html HTTP/1.1" 200 15181 "https://www.facebook.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"

"e.g. IP addresses 173.252.87.* performed 15,211 hits between Aug 14 12:00 and 12:59, followed by 13,946 hits from 31.13.115.*"

"What is also interesting is that the user agents are very diverse. I would expect a Facebook crawler to identify itself with a unique User-Agent header (as suggested by the documentation page mentioned earlier), but instead I see User-Agent strings that belong to many different browsers. E.g. this file contains 53,240 hits from Facebook's IP addresses with User-Agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"

There are a few Facebook user agents in there, but far fewer than browser user agents:

    7,310 hits: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
    2,869 hits: facebookexternalhit/1.1
    1,439 hits: facebookcatalog/1.0
      120 hits: facebookexternalua

Surprisingly, there is even a user agent string that mentions Bing:

    6,280 hits: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b

These IPs don't only fetch the HTML page, but load all the page's resources (images, CSS, ...) including all third-party trackers (such as Google Analytics). Not only does this put unnecessary stress on our infrastructure, it drives up the usage costs of third-party tracking services and renders some of our reports unreliable."

Final response from FB: "Thanks for your patience while our team looked into this. They've added measures to reduce the amount of crawler calls made. Further optimizations are being worked on as well, but for now, this issue should be resolved." <- NOT.


I don't really understand what the issue is. On my welcome page (while all other URLs are impossible to guess) I give the browser something that requires a few seconds of CPU at 100% to crunch, track some user actions in between, have tarpitted URLs, etc. In the last few years no bot has come through. Why bother with robots.txt? Just give them something to break their teeth on...

(I would give you the URL, but I just don't want it to be visited.)


I don't think most users would appreciate having a site spike their CPU for a few seconds when they visit...at least I wouldn't.


The far-fetched "solution" along with the "I don't understand everyone doesn't do <thing that is easy to understand why everyone wouldn't do it>" make it hard for me to believe you're actually doing this. Sounds more like a shower thought, maybe fun weekend project, than something in production.


Sounds like terrible UX! Won't browsers give you a "Stop script" prompt if you do that?


I don’t think the solution is flatlining everyone’s CPUs.


> but I just dont want ti be visited

Maybe you should just take your website offline?


How much have you mined?



