When I was working there, I implemented my patent during a hack week (given a set of follows, return the list of matching tweet ids; very similar to his prototype):
I could definitely have served all the chronological timeline requests on a normal server with lower latency than the 1.1 home timeline API. A bunch of the numbers in his calculations are off, but not by an order of magnitude. The big issue is that since I left, Twitter has added ML ads, an ML timeline, and other features that make current Twitter much harder to fit on a machine than 2013 Twitter.
A few thoughts. The first is: are we asking the wrong question? Should it be, "If I spend $10M on hardware (storage/compute) for predicting ads that generates $25M in revenue, should I buy the hardware?" Sure, we can "minify" Twitter, and it's a wonderful thought experiment, but it seems devoid of the context of revenue generation.
The second is, it's interesting to understand industry-wide social media infra cost per user. If you look at FB, Snap, etc., they are all within an order of magnitude of each other in cost per DAU (cost of revenue / DAU). This can be verified via 10-Ks, which show Twitter at $1.4B vs. Snap's $1.7B cost of revenue. The major difference between the platforms is revenue per user, with FB being the notable exception.
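As a quick sanity check of the order-of-magnitude claim: the $1.4B and $1.7B cost-of-revenue figures are from the 10-Ks cited above, but the DAU counts below are rough placeholders (approximately the publicly reported figures of that era), so treat the per-user numbers as illustrative only.

```python
# Cost per DAU = cost of revenue / DAU.
# Cost-of-revenue figures are from the 10-Ks mentioned above; the DAU
# counts are approximate placeholders, not audited numbers.

def cost_per_dau(cost_of_revenue: float, dau: float) -> float:
    """Annual cost of revenue divided by daily active users."""
    return cost_of_revenue / dau

twitter = cost_per_dau(1.4e9, 217e6)  # assumed ~217M mDAU
snap = cost_per_dau(1.7e9, 319e6)     # assumed ~319M DAU

ratio = max(twitter, snap) / min(twitter, snap)
print(f"Twitter ${twitter:.2f}/DAU, Snap ${snap:.2f}/DAU, ratio {ratio:.1f}x")
assert ratio < 10  # well within an order of magnitude
```

With these assumed DAU counts the two platforms land within roughly 1.2x of each other, which is consistent with the claim.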
Also would you summarize the patent/architecture? The link is a bit opaque/hard to read.
Note: cost of revenue also includes TAC and revenue sharing (IIRC), not just infra costs, but in theory those would also be at similar levels.
The basic idea of the system was to scan a reverse-chronologically ordered list of (user id, tweet id) pairs, filtering out any tweet whose author wasn't in the follow set (or sets, in the case of scan sharing), until enough tweets had been retrieved for the timeline request. There are a bunch of variants in the patent, but that is the basic idea. At the time, I estimated that Twitter was spending 80% of its CPU time in the DC doing thrift/json/html serialization/deserialization and mused about merging all the separate services into a single process. Lots of opportunity for optimization.
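The scan described above can be sketched in a few lines. This is my reconstruction of the basic idea from the description, not Twitter's actual implementation; the function and variable names are made up for illustration.

```python
# Sketch of the timeline scan: walk a reverse-chronologically ordered
# stream of (user_id, tweet_id) pairs, keep tweets whose author is in the
# requester's follow set, and stop once the timeline page is full.
# Scan sharing (not shown) would check each pair against multiple follow
# sets in the same pass.

def home_timeline(tweet_stream, follow_set, page_size):
    """tweet_stream: iterable of (user_id, tweet_id), newest first."""
    timeline = []
    for user_id, tweet_id in tweet_stream:
        if user_id in follow_set:
            timeline.append(tweet_id)
            if len(timeline) == page_size:
                break  # page is full; no need to scan further back
    return timeline

# Hypothetical data, newest tweets first.
stream = [(3, 107), (1, 106), (2, 105), (9, 104), (1, 103), (2, 102)]
print(home_timeline(stream, follow_set={1, 2}, page_size=3))  # [106, 105, 103]
```

The appeal is that a single sequential scan over a hot, densely packed list is cheap, and the scan stops as soon as the page fills, so for users who follow active accounts only a short prefix of the list is ever touched.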
Interesting. 80% seems a bit on the high end nowadays, though? For example, Google quantified this as the "datacenter tax" and, through their cluster-wide profiling tooling, saw that it was 22-27% of all CPU cycles (still a huge amount). They go a different route and suggest hardware accelerators for common operations. Datacenter tax was defined as:
"The components that we included in the tax classification are: protocol buffer management, remote procedure calls (RPCs), hashing, compression, memory allocation and data movement."
https://patents.google.com/patent/US20120136905A1/en (licensed under Innovators Patent Agreement, https://github.com/twitter/innovators-patent-agreement)