Removing your files from S3 can cost thousands of dollars (valeski.org)
46 points by petewarden on May 10, 2010 | 16 comments


This guy must have an incredibly large number of files if 2 XL instances running deletes for a month only cuts his S3 usage by 10%. I routinely see 700 DELETEs per second from a small EC2 instance; if performance scales linearly with CPU speed, he should be able to do 700 * 8 * 2 = 11200 DELETEs per second, or 29 billion DELETEs per month; if that's 10% of his objects, he must have 290 billion objects stored.

Except that, oops, S3 only passed 100 billion objects a couple of months ago, and at its current rate of growth is probably still under 150 billion, never mind 200 or 290 billion.
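
Spelling out that arithmetic (all of the figures come from the estimate above; the 8x factor is the commenter's assumption that DELETE throughput scales linearly with instance size):

    deletes_per_sec_small = 700                       # observed on a small instance
    deletes_per_sec = deletes_per_sec_small * 8 * 2   # 2 XL instances, 8x scaling
    seconds_per_month = 30 * 24 * 3600                # 2,592,000
    deletes_per_month = deletes_per_sec * seconds_per_month
    print(deletes_per_month)          # 29,030,400,000 -- about 29 billion

    # If 29 billion DELETEs only removed 10% of his objects...
    implied_total = deletes_per_month / 0.10
    print(implied_total)              # ~290 billion, vs. ~100-150 billion
                                      # objects in all of S3 at the time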

My guess is that the "FAIL" here is whatever process he's using for deleting files -- not in S3 itself.


A few points I can confirm from talking with Jud about this over the last few weeks:

- Billions is the correct magnitude. It's the result of Gnip's aggregation of multiple Twitter-firehose-scale feeds over a year. They are massive users of S3 storage.

- I'm unsure why he's using an XL instance in this case, but I know he's been experimenting heavily with different configs to improve performance.

- How do you get sustained 700 deletes a second? I'm not being facetious; I see much slower performance using commercial and open-source interfaces to S3 like s3cmd or Bucket Explorer. I'd love to find some faster approaches.


Use non-blocking I/O, like PyCurl's CurlMulti interface (or any other curl multi bindings). You'd have to generate the DELETE request URLs from some S3 library; the rest is up to curl (calling those URLs).

It's not a trivial task, but a reasonably experienced Python (or maybe even C) programmer could do it easily -- I'd say in a couple of days. And I can confirm that even on EC2 small instances, 700 HTTP requests a second is achievable (I'm not sure about S3 API usage limits).
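
A minimal sketch of that approach, assuming the signed DELETE URLs have already been generated by an S3 library; the `signed_delete_urls` iterable and the concurrency number are placeholders, not anything from the thread:

    import pycurl

    CONCURRENCY = 100   # simultaneous connections; tune to taste

    def run_deletes(signed_delete_urls):
        """Issue many HTTP DELETEs concurrently via pycurl's CurlMulti."""
        urls = iter(signed_delete_urls)
        multi = pycurl.CurlMulti()
        active = 0

        def add_one():
            nonlocal active
            url = next(urls, None)
            if url is None:
                return False
            easy = pycurl.Curl()
            easy.setopt(pycurl.URL, url)
            easy.setopt(pycurl.CUSTOMREQUEST, "DELETE")
            easy.setopt(pycurl.WRITEFUNCTION, lambda data: len(data))  # discard body
            multi.add_handle(easy)
            active += 1
            return True

        while active < CONCURRENCY and add_one():   # prime the pipeline
            pass
        while active:
            while multi.perform()[0] == pycurl.E_CALL_MULTI_PERFORM:
                pass
            multi.select(1.0)                       # wait for socket activity
            _, finished, failed = multi.info_read()
            for easy in finished:
                multi.remove_handle(easy)
                easy.close()
                active -= 1
            for easy, errno, errmsg in failed:      # real code would retry here
                multi.remove_handle(easy)
                easy.close()
                active -= 1
            while active < CONCURRENCY and add_one():   # keep the pipeline full
                pass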


Non-blocking I/O isn't a magic wand; context switching is virtually free in a language without a GIL:

http://paultyma.blogspot.com/2008/03/writing-java-multithrea...


The solution I used leveraged non-blocking I/O in a roundabout way: 150 separate procs (quick and dirty) per instance, no threads. I let the OS manage the I/O in that regard. My thinking was that if I had "a lot" of procs doing I/O, I'd get the same effect. Again: quick and dirty.


And if it's going to cost you several thousand dollars to change S3 credentials on a "vast network of machines", you've been doing node/cluster management wrong.


Agree. This is just full of fail. Also, anyone could see this is likely to be an IO-bound operation, not a CPU-bound one. The smallest instance going should be just fine, and you should be able to parallelize the hell out of deleting this stuff.


I can confirm billions of files/keys. I went the quick-and-dirty route to parallelize the DELETEs and just let the OS manage the async I/O for me at the proc level: each instance spun up 150 Ruby procs that issued the DELETEs (tight, serial, while-loop style) and ran them in parallel.
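
The shape of that setup looks roughly like the sketch below. It's a hypothetical Python analogue of the Ruby procs described above; boto3 and the bucket name are used only for illustration, not what was actually run:

    # Hypothetical Python analogue of "150 serial delete loops, let the OS
    # juggle the I/O". boto3 and the bucket name are placeholders; the
    # original used plain Ruby procs against the S3 REST API.
    import multiprocessing

    import boto3

    BUCKET = "example-bucket"   # placeholder
    NUM_PROCS = 150             # one tight, serial loop per process

    def delete_worker(keys):
        s3 = boto3.client("s3")             # one client per process
        for key in keys:                    # serial within the process
            s3.delete_object(Bucket=BUCKET, Key=key)

    def run(all_keys):
        # Deal keys out round-robin so each proc gets an even slice.
        slices = [all_keys[i::NUM_PROCS] for i in range(NUM_PROCS)]
        procs = [multiprocessing.Process(target=delete_worker, args=(s,))
                 for s in slices]
        for p in procs:
            p.start()
        for p in procs:
            p.join()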

I spent quite a bit of time working with the S3 team at Amazon to try to come up with a better model, but we all agreed on my approach (mind you, it could likely be done more efficiently) as option 'a'; option 'b' was to delete the account and just let Amazon reap the data.


He claims to need to delete "billions of objects": http://one.valeski.org/2010/03/amazon-s3-file-deletion-fail....

Would it be naive of me to assume that no single person/company is making up a few percentage points of all S3 objects?


I checked to see how many files we have on our main CDN (not S3-hosted, but just to get an idea): about 500 million files per box, and there are 6 of those boxes, so that's 3 billion files.

We keep three copies of each, but since S3 does that for you I think we shouldn't multiply by 3.

3 billion files is a lot of files, but it isn't that extreme; he could very well be telling the truth.


I know of at least one S3 customer that accounts for a small (but significant) fraction of S3's total published object count. I would assume that there are more customers like this one.


What are you using to parallelize your DELETEs? 700/sec seems pretty nice.


Non-blocking I/O and 32 simultaneous connections. It could probably go faster -- I haven't made any serious attempt to optimize this since it isn't a performance-limiting operation for me.


I have also had problems deleting large amounts of data from S3.

The problem underlying the author's problem is that S3 is quite slow and you can't do batch deletes. On top of that, they limit the number of API requests you can make per second. In some cases it has been faster to download an object and re-upload it than to do a copy.
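
For reference, the two operations being compared in that last sentence look roughly like this (a sketch using boto3 for concreteness; the bucket and key names are made up):

    import boto3

    s3 = boto3.client("s3")

    # Server-side copy: S3 duplicates the object internally.
    s3.copy_object(
        Bucket="dst-bucket", Key="obj",
        CopySource={"Bucket": "src-bucket", "Key": "obj"},
    )

    # Download and re-upload: the bytes round-trip through the client.
    s3.download_file("src-bucket", "obj", "/tmp/obj")
    s3.upload_file("/tmp/obj", "dst-bucket", "obj")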


That seems to make the case for not needing such a large machine to do the deletes, since you're being rate-limited. Was your delete process CPU-intensive? What setup did you use?


$1500 would buy a really nice server or two on a colo rack.



