Removing your files from S3 can cost thousands of dollars (valeski.org)
46 points by petewarden on May 10, 2010 | 16 comments


This guy must have an incredibly large number of files if 2 XL instances running deletes for a month only cuts his S3 usage by 10%. I routinely see 700 DELETEs per second from a small EC2 instance; if performance scales linearly with CPU speed, he should be able to do 700 * 8 * 2 = 11200 DELETEs per second, or 29 billion DELETEs per month; if that's 10% of his objects, he must have 290 billion objects stored.

Except that, oops, S3 only passed 100 billion objects a couple of months ago, and at its current rate of growth is probably still under 150 billion, never mind 200 or 290 billion.
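
Spelling out that arithmetic (all of the figures come from the estimate above; the 8x factor is the commenter's assumption that DELETE throughput scales linearly with instance size):

    deletes_per_sec_small = 700                       # observed on a small instance
    deletes_per_sec = deletes_per_sec_small * 8 * 2   # 2 XL instances, 8x scaling
    seconds_per_month = 30 * 24 * 3600                # 2,592,000
    deletes_per_month = deletes_per_sec * seconds_per_month
    print(deletes_per_month)          # 29,030,400,000 -- about 29 billion

    # If 29 billion DELETEs only removed 10% of his objects...
    implied_total = deletes_per_month / 0.10
    print(implied_total)              # ~290 billion, vs. ~100-150 billion
                                      # objects in all of S3 at the time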

My guess is that the "FAIL" here is whatever process he's using for deleting files -- not in S3 itself.


A few points I can confirm from talking with Jud about this over the last few weeks:

- Billions is the correct magnitude. It's the result of Gnip's aggregation of multiple Twitter-firehose-scale feeds over a year. They are massive users of S3 storage.

- I'm unsure why he's using an XL instance in this case, but I know he's been experimenting heavily with different configs to improve performance.

- How do you get sustained 700 deletes a second? I'm not being facetious; I see much slower performance using commercial and open-source interfaces to S3 like s3cmd or Bucket Explorer. I'd love to find some faster approaches.


Use non-blocking I/O, like PyCurl's CurlMulti interface (or any other curl multi bindings). You'd have to generate the DELETE request URLs from some S3 library; the rest is up to curl (calling those URLs).

It's not a trivial task, but a reasonably experienced Python (or maybe even C) programmer could do it easily -- I'd say in a couple of days. And I can confirm that even on EC2 small instances, 700 HTTP requests a second is achievable (I'm not sure about S3 API usage limits).
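
A minimal sketch of that approach, assuming the signed DELETE URLs have already been generated by an S3 library; the `signed_delete_urls` iterable and the concurrency number are placeholders, not anything from the thread:

    import pycurl

    CONCURRENCY = 100   # simultaneous connections; tune to taste

    def run_deletes(signed_delete_urls):
        """Issue many HTTP DELETEs concurrently via pycurl's CurlMulti."""
        urls = iter(signed_delete_urls)
        multi = pycurl.CurlMulti()
        active = 0

        def add_one():
            nonlocal active
            url = next(urls, None)
            if url is None:
                return False
            easy = pycurl.Curl()
            easy.setopt(pycurl.URL, url)
            easy.setopt(pycurl.CUSTOMREQUEST, "DELETE")
            easy.setopt(pycurl.WRITEFUNCTION, lambda data: len(data))  # discard body
            multi.add_handle(easy)
            active += 1
            return True

        while active < CONCURRENCY and add_one():   # prime the pipeline
            pass
        while active:
            while multi.perform()[0] == pycurl.E_CALL_MULTI_PERFORM:
                pass
            multi.select(1.0)                       # wait for socket activity
            _, finished, failed = multi.info_read()
            for easy in finished:
                multi.remove_handle(easy)
                easy.close()
                active -= 1
            for easy, errno, errmsg in failed:      # real code would retry here
                multi.remove_handle(easy)
                easy.close()
                active -= 1
            while active < CONCURRENCY and add_one():   # keep the pipeline full
                pass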


Non-blocking I/O isn't a magic wand; context switching is virtually free in a language without a GIL:

http://paultyma.blogspot.com/2008/03/writing-java-multithrea...


The solution I used leveraged non-blocking I/O in a roundabout way: 150 separate procs (quick and dirty) per instance, no threads. I let the OS manage the I/O in that regard. My thinking was that if I had "a lot" of procs doing I/O, I'd get the same effect. Again: quick and dirty.


And if it's going to cost you several thousand dollars to change S3 credentials on a "vast network of machines", you've been doing node/cluster management wrong.


Agree. This is just full of fail. Also, anyone could see this is likely to be an IO-bound operation, not a CPU-bound one. The smallest instance going should be just fine, and you should be able to parallelize the hell out of deleting this stuff.


I can confirm billions of files/keys. I went the quick-and-dirty route to parallelize the DELETEs and just let the OS manage the async I/O for me at the proc level: each instance spun up 150 Ruby procs that issued the DELETEs (tight, serial, while-loop style) and ran them in parallel.
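
The shape of that setup looks roughly like the sketch below. It's a hypothetical Python analogue of the Ruby procs described above; boto3 and the bucket name are used only for illustration, not what was actually run:

    # Hypothetical Python analogue of "150 serial delete loops, let the OS
    # juggle the I/O". boto3 and the bucket name are placeholders; the
    # original used plain Ruby procs against the S3 REST API.
    import multiprocessing

    import boto3

    BUCKET = "example-bucket"   # placeholder
    NUM_PROCS = 150             # one tight, serial loop per process

    def delete_worker(keys):
        s3 = boto3.client("s3")             # one client per process
        for key in keys:                    # serial within the process
            s3.delete_object(Bucket=BUCKET, Key=key)

    def run(all_keys):
        # Deal keys out round-robin so each proc gets an even slice.
        slices = [all_keys[i::NUM_PROCS] for i in range(NUM_PROCS)]
        procs = [multiprocessing.Process(target=delete_worker, args=(s,))
                 for s in slices]
        for p in procs:
            p.start()
        for p in procs:
            p.join()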

I spent quite a bit of time working with the S3 team at Amazon to try to come up with a better model, but we all agreed on my approach (mind you, it could likely be done more efficiently) as option 'a'; option 'b' was to delete the account and just let Amazon reap the data.


He claims to need to delete "billions of objects": http://one.valeski.org/2010/03/amazon-s3-file-deletion-fail....

Would it be naive of me to assume that no single person/company is making up a few percentage points of all S3 objects?


I checked to see how many files we have on our main CDN (not S3-hosted, but just to get an idea): about 500 million files per box, and there are 6 of those boxes, so that's 3 billion files.

We keep three copies of each, but since S3 does that for you I think we shouldn't multiply by 3.

3 billion files is a lot of files, but it isn't that extreme; he could very well be telling the truth.


I know of at least one S3 customer that accounts for a small (but significant) fraction of S3's total published object count. I would assume that there are more customers like this one.


What are you using to parallelize your DELETEs? 700/sec seems pretty nice.


Non-blocking I/O and 32 simultaneous connections. It could probably go faster -- I haven't made any serious attempt to optimize this since it isn't a performance-limiting operation for me.


I have also had problems deleting large amounts of data from S3.

The problem underlying the author's problem is that S3 is quite slow and you can't do batch deletes. On top of that, they limit the number of API requests you can make per second. In some cases it has been faster to download an object and re-upload it than to do a copy.
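
For reference, the two operations being compared in that last sentence look roughly like this (a sketch using boto3 for concreteness; the bucket and key names are made up):

    import boto3

    s3 = boto3.client("s3")

    # Server-side copy: S3 duplicates the object internally.
    s3.copy_object(
        Bucket="dst-bucket", Key="obj",
        CopySource={"Bucket": "src-bucket", "Key": "obj"},
    )

    # Download and re-upload: the bytes round-trip through the client.
    s3.download_file("src-bucket", "obj", "/tmp/obj")
    s3.upload_file("/tmp/obj", "dst-bucket", "obj")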


That seems to make the case for not needing such a large machine to do the deletes, since you're being rate-limited. Was your delete process CPU-intensive? What setup did you use?


$1500 would buy a really nice server or two on a colo rack.



