> If we get 1 req/s, even for a dataset of that size, this is not as cost efficient.
How many req/s do you have in mind for your system to be a viable option?
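Purely as a back-of-envelope sketch of what "cost efficient" means at low traffic (the $1,000/month figure and the 1 req/s rate are assumptions for illustration, not numbers from this thread):

```python
# Back-of-envelope cost per query at low traffic.
# Assumptions (illustrative only): $1,000/month total cluster cost,
# a sustained 1 request/second, and a 30-day month.
monthly_cost_usd = 1000.0
requests_per_second = 1.0

seconds_per_month = 30 * 24 * 3600  # 2,592,000 seconds
requests_per_month = requests_per_second * seconds_per_month

cost_per_1k_requests = monthly_cost_usd / requests_per_month * 1000
print(f"{requests_per_month:,.0f} requests/month")
print(f"${cost_per_1k_requests:.3f} per 1,000 requests")
```

At higher sustained rates the per-request cost drops linearly, which is why the viable req/s threshold matters.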
> Also EBS throughput (even with SSD) is not good at all.
Still, it is not worse than S3, right?
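For rough intuition on the throughput question, here is a sketch of how long a full scan of a 6 TB index would take at different sequential read rates. The 125 MB/s figure is the published default baseline for a gp3 EBS volume; the parallel-S3 aggregate is a guess, since S3 throughput scales with the number of concurrent range requests:

```python
# Illustrative scan-time estimate for a 6 TB index.
# Throughput figures are assumptions (gp3 default baseline vs. an
# aggregate of parallel S3 GETs), not benchmarks from this discussion.
index_bytes = 6 * 10**12  # 6 TB

def scan_hours(throughput_mb_s: float) -> float:
    """Hours needed to read the whole index at the given MB/s."""
    return index_bytes / (throughput_mb_s * 10**6) / 3600

ebs_gp3_baseline = 125.0   # MB/s, default gp3 baseline throughput
s3_parallel = 8 * 90.0     # MB/s, e.g. 8 connections at ~90 MB/s each

print(f"EBS gp3 baseline: {scan_hours(ebs_gp3_baseline):.1f} h")
print(f"Parallel S3 GETs: {scan_hours(s3_parallel):.1f} h")
```

The point is that a single EBS volume's baseline is fixed, while S3 reads can be fanned out across connections until the network becomes the bottleneck.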
> Chatnoir.eu is the only other Common Crawl cluster we know of. It consists of 120 nodes.
I have no deep ES experience. Are you saying that to host 6 TB of indexed data (before replication) you'd need a 120-node ES cluster? If so, then reducing it to just 2 nodes is the real sales pitch, not the S3 usage :)
What about d3en instances? Clustered, and together with MinIO, you might reach similar performance. The only issue is the network traffic between nodes: it would need to stay inside the same AZ.
For that kind of use case, I'd probably start with MinIO.
> Seems comparable to AWS ElasticSearch service costs:
> - 3 nodes m5.2xlarge.elasticsearch = $1,200
> - 20TB EBS storage = $1,638
Don't forget S3 includes replication. Also EBS throughput (even with SSD) is not good at all. Our memory footprint is also tiny, which is what makes it possible to run on just two servers.
Finally, CPU-wise, our search engine is almost 2x faster than Lucene.
If you don't believe us, try to replicate our demo on an Elasticsearch cluster :D.
Chatnoir.eu is the only other Common Crawl cluster we know of. It consists of 120 nodes.