> 99% of the server costs of something like Facebook or Twitter, go toward the b...

beeboobaa3 · on May 17, 2024

> From direct personal experience, I strongly believe the backend infrastructure and staff required just to operate the core product functionality of a successful large-scale social network massively exceeds what could be provided by donations.

Because these social networks are designed for analytics. It's in their blood. It permeates everything they do, and causes immense overhead.

Check out mailing lists or usenet.

evanelias · on May 17, 2024

No, nothing about this is related to analytics. I was strictly describing storage, caching, and compute for core product functionality in my previous comment, which is written from direct first-hand experience working on infrastructure for social networks for a decade, including the two social networks I referenced in my previous comment.

Social networks store a lot of OLTP data just to function. Every user, post, comment, follow/friendship relation, like/favorite/interaction, media metadata -- that all gets stored in sharded relational databases and retrieved in order for the product to operate at all. For successful social networks, it adds up to trillions of rows of data (on the smaller end, for something like Tumblr) and that requires a lot of expensive infrastructure to operate. Again, none of this has any relation to analytics.

As for usenet, what? It's basically dead, after becoming an unmanageable cesspool of spam (or worse) more than two decades ago. It was great in the 90s, but the internet population was substantially smaller then.

beeboobaa3 · on May 17, 2024

> As for usenet, what? It's basically dead, after becoming an unmanageable cesspool of spam (or worse) more than two decades ago. It was great in the 90s, but the internet population was substantially smaller then.

Yeah, because no one is interested in promoting it because it doesn't have analytics baked in so you can't make money from doing so. Of course it deteriorated over the years. It's also cheap to run and can handle a massive amount of users.

> store a lot of OLTP data just to function

Right, so they can run analytics. You could reduce your tracking data to aggregates, but then you can't go back and run analytics on your users. You don't need to keep that data forever.

Especially with modern social media where content older than a day is effectively dead and ignored.

> it adds up to trillions of rows of data

This was a lot of data a decade ago. Nowadays a single postgres instance will handle billions of rows without breaking a sweat, and social media content is exceptionally shardable.

evanelias · on May 17, 2024

> Right, so they can run analytics.

Stop gaslighting me, it's not OK! I'm describing first-hand experience of things that were not related to analytics IN ANY WAY, SHAPE, OR FORM.

Try running OLAP queries on a massively sharded MySQL 5.1 deployment, or any aggregation at all on a Memcached cluster. These technologies were designed for OLTP data, and were woefully incapable of useful analytics over massive data sets.

I was Tumblr's fourth full-time software engineering hire. When I joined (nearly 4 years after the company was founded) the only thing remotely related to analytics was a tiny Hadoop cluster, where logs were dumped and largely ignored. Nothing about analytics is "in their blood". All you needed to sign up for Tumblr was an email address. WTF do you even think they are "analyzing"? Your comments are completely fabricated BS.

> You could reduce your tracking data to aggregates

Once again, I'm not describing "tracking data"! I'm talking about things like content that users have posted, comments they have written, content they have favorited, users they are following. These are core data models of a social network. It has nothing to do with tracking or analytics.

> You don't need to keep that data forever.

The OLTP product data I'm describing does need to be kept forever. Users don't like it when content they have written on their blog suddenly disappears.

> Nowadays a single postgres instance will handle billions of rows without breaking a sweat, and social media content is exceptionally shardable.

Yes, but running a massive cluster of hundreds or thousands of sharded database servers is still very expensive.

throwaway22032 · on May 18, 2024

I think that what they're getting at is that in the Usenet days half of what you've mentioned would be local data.

There is no central concept of "content that have favourited" or "users they are following", that's all handled locally in that model.

evanelias · on May 18, 2024

> half of what you've mentioned would be local data

Not by volume. Posts and comments make up the vast majority of the storage requirements, and none of that can be purely client-side.

> There is no central concept of "content that have favourited" or "users they are following"

I'm aware, I used Usenet quite a bit in the 90s, as well as dial-up BBSs.

Usenet is a distributed forum / discussion board, which is related but not equivalent to the core functionality of social media applications being discussed here.

With Usenet's model, there's no concept of a profile aggregating content from a single user. This means you simply cannot replicate the primary experience of Facebook, Twitter, Instagram, Tumblr, Pinterest, DeviantArt, MySpace, Friendster, or any other social media site/app with Usenet's approach. Nor can it reproduce the experience of even modern forums like HN or Reddit.

Usenet also didn't actually scale massively. Every estimate I've seen of the peak Usenet userbase puts it at a tiny fraction of modern social media.

In any case, Usenet essentially failed. We already have empirical evidence about how these ideas play out! Why are we even seriously discussing this?

giantrobot · on May 17, 2024

> Check out mailing lists or usenet.

These aren't really good "social media" examples. Both mailing lists and Usenet have limited retention, with mailing lists there may be almost no retention beyond the amount required to deliver a message.

While low retention might be a desirable feature and something you might actually want in a FOSS social network, it means old content will disappear from the central server. If it's not archived by clients it can easily disappear or end up locked away only in private backups. Google's buyout of Deja News should be a cautionary tale of retention and the locking up of public data behind a private gate.

Usenet history today is largely only available because someone at Google hasn't noticed Google Groups still exists and terminated it yet. If that happens tomorrow there's not any good complete archive of historical Usenet content. There's no guarantee Google won't kill those Usenet archives in the next year let alone the next five years.

derefr · on May 22, 2024

Also speaking from direct personal experience. I have an API SaaS (which, yes, I realize, as a B2B product, has a much smaller audience than a B2C platform, and so probably much lower reads per second... but I want to just talk about writes here, since reads and aggregates-of-reads get cached on so many other levels.)

We have 1.5 trillion distinct rows living on a PG cluster right now. Growing at a higher rate. And that requires... four DB servers (four shards). And no replicas — all our prod query traffic hits these servers directly, as a mixed workload with the writes from our queue-consumer ETL agents.

(The servers are "beefy", sure — but they're not that beefy. They've got 128 cores and 1TB of RAM each. That's not even top-of-the-line these days.)

Despite serving a whole ton of QPS, these DB servers are not CPU-bound or memory-bound or even IO-bound. Our primary horizontal-sharding factor, in fact, is PCIe lanes — we're constrained mainly by the number of NVMe drives we can keep connected performantly per CPU [1] — and thereby, generationally, by "fast storage capacity" per server platform.

One thing that's perhaps "special" about our use-case, though, is that our data is inherently append-only. Which is pretty great for OLTP performance: there's very little write contention, as we just partition the data by insert time (which is also a natural partitioning key[2] for most queries our business layer does) and then only write to the newest partition, with all previous partitions becoming effectively immutable.

Most workloads aren't like this... but they could be, if you model your data carefully, with temporal tables holding versioned records and so forth. You trade off more storage growth, for far less worry about write contention. ("Event streaming" is not the silver-bullet solution it would seem here — as you would still need to create an aggregate to query from; and writing into that aggregate would still be an OLTP write bottleneck. No, the true solution here is data-driven [schema] design — i.e. forcing your API engineers to invite the DBA to feature design meetings, and ensuring the DBA has the perogative and incentives to say "no" to a bad design. Or you could just hire database systems engineers to code your business layer, like we did.)

My point here, is that if you're very careful in designing for not just "scalability" but economies of scale — and if your org is engineering-driven rather than product-driven (as a non-profit social network would likely be!), and so has "mechanical sympathy" — then you can achieve things with just the budget of an average B2B SaaS, that would rival a (smaller) social network.

But with the budget of a benefit corp that nevertheless charges even $1 for each install of its social-network mobile app? Who knows what you could achieve!

(See also: WhatsApp, pre-acquisition. They were just charging $1 for each install. With just 50 employees, that got them a lot of operating budget.)

---

[1] And actually, that's part of why the DBs are not IO-bound. Put enough NVMe disks in RAID0 together, and you get something nearly functionally indistinguishable from memory. (Yes, RAID0 — because our DBs aren't the store-of-record; our data lake is.)

And even where NVMe reads are not indistinguishable from memory reads, they're still complementary. Postgres, in relying on mmap(2)ing heap files and thereby on the the OS page-cache as a buffer pool, makes per-query serial page faults expensive, and causes multiple threads wanting the same cold data to bottleneck by repeatedly becoming coalesced awaiters on the same sequence of pages faulting in. But when you've got highly-concurrent queries [= many different OS threads] that are all page-faulting for different cold data, then you get deep IO queuing, and everything works out optimally. So the expensive cases don't really come up — the page-cache serves the head of the request distribution, while the deep IO queues of the RAID0-NVMe-pool are a perfect match for the long tail.

This makes it almost irrelevant, at runtime, whether data is hot or cold. As long as you have sufficient memory to host the very hottest data (e.g. relatively-low-cardinality fact tables like users/blogs that get joined to everything, and esp. their indices), the rest can be read straight from disk with nearly no penalty.

---

[2] Funny enough, given that we're doing heavy joins for some queries, the one thing we do sometimes run out of at runtime, is timeslices of the Postgres postmaster to coordinate lightweight locking of the global locks table.

We not only have time-partitioned tables, but also something akin to tenant-partitioned schemas. This results in a lot of relations. (It takes hours for us to run vacuumdb --analyze-only.)

And some OLAP-ier queries (that we nevertheless have to run, synchronously, in response to client requests!) need to touch many of the partitions and many of the schemas. That's O(MN) locks they need to take — which take time to write into the global locks table, even though those are only memory writes.

Every once in a while, we have to "consolidate" our table partitions — not by rolling them up with aggregations to reduce row-count, but rather by just copying all the data into fewer, larger partitions — so that we can take fewer access-shared locks per query, so that transactions can spend less time waiting on a handle into the global locks table at startup just to write these O(MN) read locks into it. (In other words, we consolidate data to reduce O(MN) read locks per transaction, to O(N log M) read locks per transaction.)

But that's a PG problem, not a resource problem per se. We've been considering writing a patch for PG to allow relations to be marked as "immutable", where an "immutable" relation doesn't have locks tracked for it (or its indices) at all — but also can't be written to, re-indexed, or even dropped. We'd then just apply this setting to all of our historical partitions. (If you're curious: given its semantics, this property could be enabled for a relation at runtime; but would need an instance restart for disabling it to take effect, as the instance would have no idea what read-locks "would be being held if the table would have been tracking locks", and so couldn't safely do any writes to the table without a barrier that drops all existing MVCC read-states — i.e. an instance restart.) As a bonus, such relations could also have "perfect statistics" calculated for them by ANALYZE; and those stats could then be marked as never needing to be re-calculated — allowing both VACUUM and ANALYZE to be skipped for that relation forevermore.

evanelias · on May 23, 2024

I don't think this access pattern is generally applicable to social networks, and ditto for your operational considerations.

"Four shards" and "no replicas" is a complete non-starter for a popular social network. That low shard count means a hardware failure on a shard master results in downtime for 25% of your users. Putting ~400 billion rows on a single shard also means backups, restores, schema changes, db cloning, etc all can take a tremendous amount of time. No read replicas means you can't geographically distribute your reads to be closer to users.

I'm not sure why you're assuming that social networks aren't 'very careful in designing for not just "scalability" but economies of scale'. I can assure you, from first-hand experience as both a former member of Facebook's database team and the founding member of Tumblr's database team, that you have a lot of incorrect assumptions here about how social networks design and plan their databases.

I can't comment on any of the PG-specific stuff as I don't work with PG.

Append-only design is not usable today for social networks / user-generated content. That approach is literally illegal in the EU. In years past, afaik Twitter operated this way but I was always skeptical of the cost/benefit math for their approach.

The "200 expensive beefy database servers" I was describing were relative to 2012. By modern standards, they were each a tiny fraction of the size of the hardware you just described. That server count also included replicas, both read-replicas and HA/standby ones (the latter because we did not use online cloning at the time).

I haven't said anything about event sourcing as it generally is not used by social networks, except maybe LinkedIn.