As far as I can tell, getting good 'de-duplication' technology is more expensive right now than just buying disks.
Also, how much would de-duplication help? I mean, how much stuff is stored on S3 that is uncompressed? (or the same compressed file?)
I mean, sure, you could sell compressed/de-duplicated S3 storage, but do you really think you could even get 50%? I've done the math, and I could turn a profit renting uncompressed drive space at $0.05/gigabyte/month, even at my scale. Granted, not a huge profit, but something.
This is the thing, I think: there is much profit to be had buying disks and renting them out right now, if you can charge Amazon's prices. Amazon does have a pretty massive economy-of-scale advantage, but they are not passing those savings down to the end users.
Costs in the outsourced infrastructure market, as far as I can tell, have always been dominated by marketing. The difference with 'cloud infrastructure' seems to be that large corporations are trying to change that, but they aren't passing down much by way of savings to the end users.
> As far as I can tell, getting good 'de-duplication' technology is more expensive right now than just buying disks.
Working at the abstraction layer of disks is misleading because there are so many other pieces that go into making a service like S3.
Instead I'm proposing an arbitrage that works one layer up -- on top of S3.
> Also, how much would de-duplication help? I mean, how much stuff is stored on S3 that is uncompressed? (or the same compressed file?)
It's a good question. The answer is unclear. Certainly, if you're willing to get exotic with redundancy elimination (as opposed to dedup, which is a subset of it) and large-data-set compression algorithms, there are just going to be economies of scale. What's not clear is (a) how large those economies become, or (b) how exotic your algorithms have to be to capture them.
> ..I could turn a profit renting uncompressed drive space at $0.05/gigabyte/month..
I wouldn't want to compete with Amazon S3. Data safety and uptime would keep me up at night like a mofo. I just wouldn't want to run that company, though I can imagine folks like MS doing it.
> Costs in the outsourced infrastructure market, as far as I can tell, have always been dominated by marketing.
This is a really interesting question. The landscape is obviously changing with the current batch of cloud infrastructure services.
I do think there are pieces of cloud storage that will not be commoditized, most notably the peace of mind around data safety and uptime. I wouldn't want to use a small cloud storage provider. Thus: (a) they will be able to charge a premium for a long time, and (b) there will be a smallish number of providers, meaning marketing will be less important in the future than having a good product.
(I'm not sure how much the APIs and other development tools will be commoditizable. MSFT historically does a great job of embrace & extend when it comes to those parts, though their recent cloud services are too wedded to Windows and .NET.)
(EDIT: added responses to more of your original post.)
>Instead I'm proposing an arbitrage that works one layer up -- on top of S3.
Is transfer from S3 to EC2 free? If so, that might work. Otherwise you'd get eaten alive by bandwidth charges between S3 and your compression/decompression box.
>I wouldn't want to compete with Amazon S3. Data safety and uptime would keep me up at night like a mofo. I just wouldn't want to run that company, though I can imagine folks like MS doing it.
If you are storing data compressed in your special way on S3, you have the same problems. Using non-ECC RAM (or ECC RAM configured not to halt on errors) and a bad stick in one of your compression boxes? Data is corrupted. Gone. Compression boxes go down? You now have your dreaded uptime issues. And of course, if there is some bug in your compression/deduplication/whatever software, again, you have data loss.
I have personally lost much money and reputation trying to make my disk system 'better' by adding complexity and flexibility. After that pain, I now just try to keep the data as simple as possible (mirrors, and stripes of mirrors), and I haven't had data loss since I abandoned my SAN. Complexity == admin error == data loss. In a reasonably designed hardware system, data loss is pretty rare without admin error.
>I do think there are pieces of cloud storage that will not be commoditized, most notably the peace of mind around data safety and uptime. I wouldn't want to use a small cloud storage provider. Thus: (a) they will be able to charge a premium for a long time, and (b) there will be a smallish number of providers, meaning marketing will be less important in the future than having a good product.
If S3 costs keep going down 15% a year (and hard drive costs keep going down 50% a year), S3 will very shortly become a 'premium' provider. While yes, there is room for premium providers, it will be a whole lot cheaper for anyone with significant storage needs to run their own stuff (or to go with a smaller player).
I mean, paying premium prices makes sense sometimes, but not all the time. If you are speaking of premium markets, really, I'm the wrong guy to ask; I'm way over on the other end of the spectrum.
1) Most data in the cloud is going to be proprietary. How much is GE's internal factory data going to overlap with Starbucks' financials?
2) You would need global access at the byte level to truly dedupe across systems. So the 50 biggest companies on S3 are going to give some random company read access to all their data? And allow it to compare that data to all the data of their competitors?
3) And to do what? To save a few percent, max, in storage costs on S3? That's not going to dominate your cost structure by a long shot.
Outside of a web crawl I doubt there is that much redundancy.
The only one who can dedupe at scale and with trust would be Amazon themselves, and they would only do it if it weren't a huge headache to keep track of.
Deduping is definitely not going to work on proprietary encrypted data.
But for web services like Tumblr, Flickr, YouTube, or Hulu, deduping must be a killer app: there's a huge amount of overlapping image data between them, and it makes little sense not to analyze it. I'm sure Google is deduping YouTube content on a massive scale.
So it depends on the data and application of course.
Then again, purely deduping based on byte comparison is not going to cut it; you need to look into logical deduping (not storing the same image in different resolutions, formats, etc.).
Amazon's pricing model is even more busted when you look at the cost of PUT requests for small objects. The simplest answer is for Amazon to revisit their pricing based on their actual costs. You can use SimpleDB to index larger objects stored in S3 and save money. I've written up a description of how I did this, but it proved too much hassle and we've moved on to HBase.
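For what it's worth, the write-up isn't reproduced here, but one way that kind of packing-and-indexing might look is roughly this (a sketch using the classic boto library; the bucket and domain names are hypothetical, and this is not necessarily how the parent's system worked):

    import boto
    from boto.s3.key import Key

    s3 = boto.connect_s3()                       # credentials from env/boto config
    sdb = boto.connect_sdb()
    bucket = s3.get_bucket('my-archive-bucket')  # hypothetical bucket
    domain = sdb.create_domain('object-index')   # hypothetical SimpleDB domain

    def store_batch(batch_key, small_objects):
        """Pack many small objects into one S3 object (one PUT instead of
        hundreds) and record each object's location in SimpleDB."""
        parts, offset = [], 0
        for name, data in small_objects:         # data is bytes
            domain.put_attributes(name, {'s3_key': batch_key,
                                         'offset': str(offset),
                                         'length': str(len(data))})
            parts.append(data)
            offset += len(data)
        k = Key(bucket)
        k.key = batch_key
        k.set_contents_from_string(b''.join(parts))

    def fetch(name):
        """Look the object up in SimpleDB, then range-GET just its bytes."""
        item = domain.get_item(name)
        start = int(item['offset'])
        end = start + int(item['length']) - 1
        k = bucket.get_key(item['s3_key'])
        return k.get_contents_as_string(
            headers={'Range': 'bytes=%d-%d' % (start, end)})

The saving comes from trading many small PUTs (priced per request) for one large PUT plus SimpleDB index writes and ranged GETs.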
I'm skeptical of any business model too close to Amazon's core services. I'm thinking of things like Elastic MapReduce, which aren't perfect, aren't optimal compared to what you could do yourself, but aren't bad enough that I would ever choose a niche provider instead of EMR or running my own cluster.
Or it could be that AWS doesn't see S3 as a product per se, but rather as an infrastructure piece that is a building block for other, more full-featured products. My understanding of S3 (and Dynamo) is that they started as tech used for running Amazon's internal systems. Someone realized they could get more revenue by offering that internal tech publicly and stuck a price tag on it. Services like de-duping, compression, etc. are more in the realm of Jungle Disk and tarsnap: third-party resellers that become a front end to the infrastructure provided by S3.
What S3 and other storage services sell is not space as much as reliability. No question: S3 is more expensive than a few file servers in your back room. Cloud storage services sell reliability and availability.
I read all the comments posted here thus far, and the one thing I don't see is a concern that de-duping, if not done at the filesystem level, would be prohibitively slow.
Am I misguided in thinking that it would be? If you implement it on disk at the server level and have to retrieve the deduped blocks from wherever they live, you're adding latency. I might be looking at it from the wrong perspective, since there must be people using S3 purely for backups, where cost matters more than speed, but I honestly don't know whether that's the exception or the rule.
That depends on the block size. If you're using 40MB blocks, for example, then seek costs are not going to dominate, and you can afford to stitch together a couple of deduped blocks. Where you're going to find duplicated files large enough for blocks that size to make sense is another question.
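To put rough numbers on that (assumed figures: 10 ms average seek, 100 MB/s sequential throughput; adjust for the actual hardware):

    # Fraction of each read spent seeking, for a few block sizes.
    seek_ms = 10.0
    for block_mb in (0.064, 4.0, 40.0):
        transfer_ms = block_mb / 100.0 * 1000.0      # time to stream the block
        overhead = seek_ms / (seek_ms + transfer_ms)
        print("%6.3f MB block: seek is %4.1f%% of read time" % (block_mb, overhead * 100))
    # -> 64 KB: ~94%, 4 MB: ~20%, 40 MB: ~2.4%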
I don't agree with the article: it's an odd business premise, since we use S3 because it is replicated across availability zones, is very robust, and is probably safe enough to base a business on. Why "de-duplicate" when redundant storage is what makes S3 secure? Also, what is the risk of trusting your business to a smaller company?
The idea is to deduplicate / eliminate redundancy one layer up. So if you are storing an image in S3, let's say it's replicated 10 times to ensure data safety. If I store the same image, it doesn't need to be stored another 10 times.
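A toy sketch of what that layer might look like (everything here is invented for illustration; a real version would store each blob in S3 under its content hash and keep the index and reference counts somewhere durable):

    import hashlib

    class DedupStore(object):
        """Content-addressed layer: customers' logical names map to a hash,
        and each unique blob is stored only once no matter how many
        customers upload it. 'backend' stands in for S3 here."""
        def __init__(self, backend):
            self.backend = backend      # dict-like: hash -> bytes (S3 in real life)
            self.index = {}             # (customer, name) -> hash
            self.refcount = {}          # hash -> number of logical references

        def put(self, customer, name, data):
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.refcount:
                self.backend[digest] = data        # only the first copy hits S3
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            self.index[(customer, name)] = digest

        def get(self, customer, name):
            return self.backend[self.index[(customer, name)]]

        def delete(self, customer, name):
            digest = self.index.pop((customer, name))
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:         # last reference: reclaim the blob
                del self.backend[digest]
                del self.refcount[digest]

S3's internal replication still protects the one stored copy; the layer above just stops paying for logically identical uploads twice.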
Getting customers to trust a smaller company's software is definitely a top risk. I wouldn't start this company now, but it might be worth doing in 2015. Watch this space!
My biggest concern about the article is that it doesn't prove its premise. In fact, the S3 service is only arbitrageable if all the other things you need to do to build your proposed service are free, or less expensive (including opportunity cost) than the profits you generate by doing them. This is far from certain, because even with the infrastructure all built it's not clear to me that you could turn a profit when you factor in your running costs like CPU and RAM used to actually make the system function, much less the support staff.
De-duplication and replication are complements. The more you can de-duplicate, the better and cheaper you can make your replication. If you know your data is highly redundant (for example, you are backing up all the files on thousands of desktops created with a standard disk image), de-duping will allow you to dramatically improve the replication (perhaps replicated 4x between two data centers nearly in real time, instead of copied to a single RAID-10 array in one data center on a nightly basis).
If you don't de-dupe your data before you replicate it, then you end up storing redundant data in a way where the redundancy is totally unhelpful (in fact, hurtful) to you. So, you can look at de-duping as the process of converting hurtful redundancy into helpful redundancy.
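A minimal chunk-level sketch of that desktop-backup case (the fixed 4 MB chunk size and in-memory dicts are assumptions for illustration; real products use content-defined chunking and a durable chunk store):

    import hashlib

    CHUNK_SIZE = 4 * 1024 * 1024   # assumed 4 MB fixed-size chunks

    def backup(path, chunk_store, manifests):
        """Split a file into chunks, store only chunks not seen before, and
        record the file as a list of chunk hashes (its manifest)."""
        hashes = []
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                digest = hashlib.sha256(chunk).hexdigest()
                if digest not in chunk_store:   # identical OS files dedupe here
                    chunk_store[digest] = chunk
                hashes.append(digest)
        manifests[path] = hashes

Once only the unique chunks and the small per-file manifests need to be copied, replicating them 4x across two data centers costs a fraction of shipping every desktop's full image.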
Data de-duplication is clearly a win, paying ever larger dividends as your data under management grows. Why does the OP think that Amazon, Google, Microsoft, Apple, et al. are not already doing this at scale, which in turn allows them to provide the services that they do?
It's just that their business model -- charging per GB, instead of per GB of new data that nobody else is storing yet -- leaves them susceptible to arbitrage. I.e. customers can steal the economies of scale if they collaborate.
You know how Starbucks charges $5 for a small and $6 for a large? If two people each want a small but collaborate, they can just split a large and save $2 each. It's the same idea, except that the arbitrage becomes more realistic when it's electronic.
(I've ignored compression in the whole discussion for simplicity. If AWS wants a business model that isn't arbitrageable based on compression, they would have to charge based on how compressible your data is. Also, next-generation compression algorithms -- another blog post on that later, perhaps -- could achieve compression rates that have the same economies of cross-customer scale that data deduplication does, so really it would have to be priced based on how compressible your data is __given all of the other data S3 is storing__.)
I'd like to think users with large data sets are already deduping and/or compressing their data. Of course, the users still win by eating into the margins of the company performing the arbitrage.
Deduping and compression at a higher level may be easier, though less efficient. I had built a prototype mail system using a Twisted POP server, qpsmtpd, and S3 as a mail store that deduped and compressed email bodies.
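The prototype itself isn't shown here, but something of this shape works for that case (a sketch using boto and zlib; the bucket name is made up, and this is not the parent's actual code):

    import hashlib
    import zlib

    import boto
    from boto.s3.key import Key

    bucket = boto.connect_s3().get_bucket('mail-bodies')   # hypothetical bucket

    def store_body(body):
        """Compress an email body (bytes) and store it under its content hash,
        so the same message delivered to many mailboxes lands in S3 once."""
        digest = hashlib.sha1(body).hexdigest()
        if bucket.get_key(digest) is None:      # already stored? skip the PUT
            k = Key(bucket)
            k.key = digest
            k.set_contents_from_string(zlib.compress(body))
        return digest                           # the mailbox index keeps this hash

    def fetch_body(digest):
        return zlib.decompress(bucket.get_key(digest).get_contents_as_string())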
Not in my experience. My product uses >5PB of data, and we don't bother to do significant deduping. I've only heard of a few products at Google that do, and that's only when you can anticipate that most objects in the system are going to be duplicated many times, which is decidedly not the case in heterogeneous storage like S3's.
The problem is that it wouldn't be difficult for Amazon to start charging based on how much storage space you're actually consuming, which would pull the rug out from under them.
I'm wondering what the logistics of this would be. Say 1 customer is storing 100GB in a cloud storage service that does internal de-duping. A second customer uploads exactly the same 100GB of data. What is each charged? The whole 100GB*rate? Does each customer pay half? Do prices change for remaining customers when one "copy" of the data is removed? Without charging each customer for their total apparent storage use, I don't see how any customer can have any predictability as to their monthly bill.
This is exactly the point. Billing under such a dedupe scheme would be a nightmare. The only scale to be had is for the provider: either Amazon in this example, or some other service built on top of Amazon (or self-hosted) that kept the vagaries of price fluctuation away from the customer.
Notice that the way Amazon does price arbitrage with their compute nodes lets them pull the power at any time the price moves above what you had paid for it. Anybody feel good about that happening to their data?
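On the billing question above: the scheme that keeps customers' bills predictable is to bill each customer for apparent usage and let the provider keep the dedup savings, which is exactly the "only scale is for the provider" point. With made-up numbers:

    s3_rate = 0.15          # $/GB-month paid to S3 (assumed)
    resale_rate = 0.12      # $/GB-month the dedup reseller charges (assumed)
    apparent_gb = 2 * 100   # two customers each 'store' the same 100 GB
    actual_gb = 100         # what actually sits in S3 after dedup

    revenue = apparent_gb * resale_rate          # $24.00/month, fully predictable bills
    cost = actual_gb * s3_rate                   # $15.00/month
    print("reseller margin: $%.2f/month" % (revenue - cost))   # $9.00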