> If we get 1 req/s, even for a dataset of that size, this is not as cost efficient.
How many req/s do you have in mind for your system to be a viable option?
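Purely as a back-of-envelope sketch of what "cost efficient" means at low traffic (the $1,000/month figure and the 1 req/s rate are assumptions for illustration, not numbers from this thread):

```python
# Back-of-envelope cost per query at low traffic.
# Assumptions (illustrative only): $1,000/month total cluster cost,
# a sustained 1 request/second, and a 30-day month.
monthly_cost_usd = 1000.0
requests_per_second = 1.0

seconds_per_month = 30 * 24 * 3600  # 2,592,000 seconds
requests_per_month = requests_per_second * seconds_per_month

cost_per_1k_requests = monthly_cost_usd / requests_per_month * 1000
print(f"{requests_per_month:,.0f} requests/month")
print(f"${cost_per_1k_requests:.3f} per 1,000 requests")
```

At higher sustained rates the per-request cost drops linearly, which is why the viable req/s threshold matters.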
> Also EBS throughput (even with SSD) is not good at all.
Still, it is not worse than S3, right?
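For rough intuition on the throughput question, here is a sketch of how long a full scan of a 6 TB index would take at different sequential read rates. The 125 MB/s figure is the published default baseline for a gp3 EBS volume; the parallel-S3 aggregate is a guess, since S3 throughput scales with the number of concurrent range requests:

```python
# Illustrative scan-time estimate for a 6 TB index.
# Throughput figures are assumptions (gp3 default baseline vs. an
# aggregate of parallel S3 GETs), not benchmarks from this discussion.
index_bytes = 6 * 10**12  # 6 TB

def scan_hours(throughput_mb_s: float) -> float:
    """Hours needed to read the whole index at the given MB/s."""
    return index_bytes / (throughput_mb_s * 10**6) / 3600

ebs_gp3_baseline = 125.0   # MB/s, default gp3 baseline throughput
s3_parallel = 8 * 90.0     # MB/s, e.g. 8 connections at ~90 MB/s each

print(f"EBS gp3 baseline: {scan_hours(ebs_gp3_baseline):.1f} h")
print(f"Parallel S3 GETs: {scan_hours(s3_parallel):.1f} h")
```

The point is that a single EBS volume's baseline is fixed, while S3 reads can be fanned out across connections until the network becomes the bottleneck.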
> Chatnoir.eu is the only other Common Crawl cluster we know of. It consists of 120 nodes.
I have no deep ES experience. Are you saying that to host 6 TB of indexed data (before replication) you'd need a 120-node ES cluster? If so, then reducing it to just 2 nodes is the real sales pitch, not the S3 usage :)
What about d3en instances? Clustered, and together with MinIO, you might reach similar performance. The only issue is the network traffic between nodes: it would need to stay inside the same AZ.
For that kind of use case, I'd probably start with MinIO.
> Seems comparable to AWS ElasticSearch service costs:
> - 3 nodes m5.2xlarge.elasticsearch = $1,200
> - 20TB EBS storage = $1,638
Don't forget S3 includes replication. Also EBS throughput (even with SSD) is not good at all. Our memory footprint is also tiny, which is what makes it possible to run on just two servers.
Finally, CPU-wise, our search engine is almost 2x faster than Lucene.
If you don't believe us, try to replicate our demo on an Elasticsearch cluster :D.
Chatnoir.eu is the only other Common Crawl cluster we know of. It consists of 120 nodes.