"S3 costs include battle tested, multi-DC replication."
Sometimes we pay a bit too much for this multi-replication, battle-tested stuff. It's not like the probability of losing data is THAT high. For the 4x extra cost you could easily take a backup every 24h.
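To make that arithmetic concrete, here's a quick sketch; the per-GB prices and index size are illustrative assumptions, not quotes from the article:

```python
# Back-of-envelope sketch of the claim above. Prices are assumed,
# not current quotes: S3 Standard ~$0.023/GB-month vs a hypothetical
# single-replica store at ~1/4 of that, plus one full backup copy
# refreshed every 24h.

INDEX_GB = 1_000                 # assumed index size

s3_standard = INDEX_GB * 0.023   # multi-DC replicated storage

single_replica = INDEX_GB * 0.023 / 4  # the "4x cheaper" option
daily_backup   = INDEX_GB * 0.023 / 4  # one extra copy, same tier

diy = single_replica + daily_backup

print(f"S3 Standard:          ${s3_standard:,.2f}/month")
print(f"Single copy + backup: ${diy:,.2f}/month")
# With these assumed numbers the DIY route is still ~2x cheaper,
# at the cost of up to 24h of index updates on a total loss.
```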
"It means that we can host 100 different indices on S3, and use the same pool of search server to deal with the CPU-bound stuff"
You can do that with NFS.
It's amazing how much we are willing to pay for a bunch of computers in the cloud. Leasing a new car costs around $350/month. You could have three new cars at your disposal for the same price as this search implementation.
> For the 4x extra cost you could easily take a backup every 24h.
It's also worth considering the cost of simply regenerating the data, since something like this isn't the source of truth. You'd lose any indexed content that has since disappeared from the web, but that seems more like a feature than a bug.
> You can do that with NFS.
You're going to be bound by your NIC speed. You can bond NICs together, but the upper bound on NFS throughput is still going to be significantly lower than S3's. Whether that's going to be an issue for them or not, I don't know, but a big part of the reason for separating compute and storage is so that one of them can scale massively without the other.
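A rough sketch of that ceiling; all figures here are assumptions for illustration, not benchmarks:

```python
# A single NFS server with bonded NICs vs S3-style fan-out where
# every search node reads independently.

GBIT = 1e9 / 8  # bytes per second in one gigabit

nfs_nics = 4                        # assumed bonded 10GbE NICs
nfs_ceiling = nfs_nics * 10 * GBIT  # all clients share this pipe

nodes = 50                          # assumed search-server pool
per_node_s3 = 5 * GBIT              # assumed per-node S3 throughput
s3_ceiling = nodes * per_node_s3    # grows with the number of readers

print(f"NFS server ceiling: {nfs_ceiling / 1e9:.0f} GB/s shared")
print(f"S3 fan-out ceiling: {s3_ceiling / 1e9:.0f} GB/s aggregate")
# The NFS number stays fixed no matter how many search servers you
# add; the S3 number scales with the pool, which is the point of
# splitting compute from storage.
```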