(Columns: provider / free egress allowance / price per additional TB of egress)

Amazing:

  Cloudflare             --                                 Free for most services
  OVH Cloud              --                                 Free and unlimited
  Scaleway               --                                 Free for most services

Great:

  Hetzner                20-60 TB / mo per instance         $1.08

Not bad:

  Linode                 1-20 TB / mo per instance          $5.00
  Oracle Cloud           10 TB / mo                         $8.50

A bit much:

  Backblaze              3x the amount of data stored       $10.00
  Bunny CDN              --                                 $10.00
  DigitalOcean           100 GB - 10 TB / mo per instance   $10.00
  UpCloud                500 GB - 24 TB / mo per instance   $10.77
  Vultr                  2 TB / mo for most services        $10.00

Uh...

  Fly.io                 100 GB / mo                        $20.00

Are you actually serious?

  Microsoft Azure        100 GB / mo                        $78.30
  Amazon Web Services    100 GB / mo                        $92.16
  Railway                --                                 $100.00
  Zeabur                 10-100 GB, depends on plan         $100.00
  Google Cloud           Depends on service                 $111.60

Screw you guys:

  Render                 100 GB - 1 TB, depends on plan     $300.00
  Vercel                 100 GB - 1 TB, depends on plan     $400.00
  Netlify                100 GB - 1 TB, depends on plan     $550.00

(We use Netlify and have well over 1TB of monthly traffic. They're insanely expensive for what they are. As soon as we have roadmap time to revisit it, we'll move away.)

I'm starting to think of cloud as less of an asset and more of a liability. We can leverage it for temporary scale, but in no way will we tie ourselves to a particular vendor.



One thing we should all have a subtle understanding of: if you are a company selling infrastructure tooling as a service (Vercel, for example), you should probably try to avoid hyperscalers.

The adage is that you should outsource the things that are not your core competence. However, if your product is infrastructure for developers, then outsourcing that is going to be painful.

You are forced to pass down costs from your upstream supplier, and you need some pricing headroom, because your provider can dictate terms essentially on a whim.

I feel like this is such an obvious statement, but we do seem to be in the gold-rush of avoiding responsibility so I'm not certain.


Practically, this means that if you are an infrastructure tooling company, you must reinvent all pieces of the cloud supply chain before building your product.

That's a very expensive proposition. While the paper cost of a few racks in a colo is low, the people time to get where you want to be is high. If you mess up along the way, there is a risk that others outcompete you on something that isn't your core product.


The people time to set up in a colo or with a managed provider is trivial if you hire someone who actually knows what they're doing, and usually lower than for a cloud setup. It's stupidly easy to mess up a cloud setup in ways that are hard with a rack.

I find this is mostly an issue with a lot of shops simply having nobody with actual ops experience.


It’s not so much the setup time for the hardware/networking. This takes some time, but can reasonably be done in a few months even for complex setups.

The problem comes when you then wish to have things like CI/CD, distributed databases etc.


The vast majority of companies never reach a scale where wiring up the infra needs to take more than days.

And CI/CD and distributed databases are equally not an issue if you actually have someone experienced do it.

Done this many times.


1) It's not like cloud databases are problem-free. You can have very non-trivial issues, and they especially have nonstandard (and very nontrivial) transaction semantics. I'm not saying this is necessarily a problem, but you have to be on top of it. You need someone with real database administration experience.

You especially CANNOT "START TRANSACTION; SELECT * FROM products; SELECT * FROM five_other_tables; UPDATE sales ...; COMMIT TRANSACTION" (so all product prices and ... are included in the transaction "and nothing goes wrong"). It does not scale, it does not scale on postgres, it does not scale on Aurora, it does not scale on Spanner, it does not scale anywhere. It will scale on MySQL and MariaDB because it will just ignore your SELECT statements, which will cause a disaster at some point.

2) The big problem with cloud databases is the same as with normal databases: let bad developers just do what they want, and you'll have 40 queries per pageload, each returning 20-30 MB of data. You NEED to optimize as soon as you hit even medium scale. And, sorry, "just slap a cache in front of it" is very likely to make the problem worse.

Cloud hyperscalers salivate when they see this; they have simply said: "don't pay an annoying database/system admin to tell you to FIX IT, simply pay us 10x what you pay that guy".

There are two kinds of people. There are people who look at the $10 pageload as both a technical embarrassment and a potential business disaster. And there are business people who build hyperscalers, dine the CTO and hold a competition for a $100 pageload, while ordering their next Ferrari.

3) There's nothing automatic about automatic sharding. And if you "just shard by UUID", your query performance will be limited. It will not scale. It just won't; you need to THINK about WHAT you shard, and HOW. No exceptions. (A small sketch follows at the end of this comment.)

4) You'd be surprised how far Ansible + k8s + a Postgres operator goes (for both CI/CD and distributed databases).
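To make point 3 concrete, here is a minimal sketch (my own illustration, not tied to any particular database; the in-memory shards and helper names are made up): route rows by hashing a key you actually query by, such as a tenant id, instead of a random per-row UUID, so tenant-scoped queries stay on one shard.

  import hashlib
  from collections import defaultdict

  NUM_SHARDS = 8
  # Stand-ins for eight real database shards; illustration only.
  shards = [defaultdict(list) for _ in range(NUM_SHARDS)]

  def shard_for(key: str) -> int:
      # Hash a meaningful key (tenant_id) rather than a random per-row UUID,
      # so every row for one tenant lands on the same shard.
      return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big") % NUM_SHARDS

  def insert_order(tenant_id: str, order: dict) -> None:
      shards[shard_for(tenant_id)][tenant_id].append(order)

  def orders_for_tenant(tenant_id: str) -> list:
      # Exactly one shard to consult, instead of a scatter-gather across all of them.
      return shards[shard_for(tenant_id)][tenant_id]

  insert_order("tenant-42", {"sku": "abc", "qty": 1})
  print(orders_for_tenant("tenant-42"))  # -> [{'sku': 'abc', 'qty': 1}]

Shard by a random UUID instead and the same "orders for tenant X" query has to fan out to every shard.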


Once in my career, I did a high-seven-figure migration from cloud to colo.

Our startup's lead investor was invested in a few firms that aimed to offer a cloud experience on-prem, so we enjoyed strong board support and had a strong team. On paper, the migration saved 10-20x its TCO over 3 years. We weren't a fly-by-night operation, and we used high-quality colo facilities near the cloud.

Everything went smoothly enough at first: the hardware was 12 weeks late, and there were some minor hiccups in the build. Time to turn off the original infra was maybe 2 months longer than planned. All in all, about 3 months of net delay during which we were running both stacks and the financial clock was ticking. We even had industry press and tech talks on the success of the whole effort.

Now here is where the problems actually started: 1 year post-build, folks wanted SSDs. While I had fought to get SSDs into the initial build, I was overruled by a consultant for the main DB vendor we had at the time. It's tough to ask for $500k+ in SSDs with 1/4 the storage volume of spinners when everyone else thinks the drives are a waste.

2 years in, the main cloud vendor dropped prices by north of 50% for the hardware we had bought. To make matters worse, the service team wasn't using the hardware we had to full effect; usage was around 25%.

By year four, contracts weren't renewed and the workload was migrated back to cloud. In total, we had probably saved 10-25% vs. aggressive negotiations with the cloud vendor. We could probably have beaten the colo price if we had aggressively tuned the workload on the cloud, but there were political reasons not to do this. If you add in opportunity cost and the use of VC funds… we were probably negative ROI.

Given this backdrop, I never worked on colos again (this all went down 10+ years ago).


So you had a badly planned transition, did it in a single big switch (why?), overprovisioned when your existing cloud experience should have made you perfectly placed to underprovision, and it sounds like you paid up front for no good reason?

Sounds like lots of valuable lessons there, but assuming from this that cloud was better for you seems flawed.

1. Don't do migrations all at once. In a large deployment you likely have multiple locations anyway, so do one at a time, and/or rack by rack.

2. Verify usage levels, and take advantage of your ability to scale into cloud instances to ensure you only provision tiny headroom on your non-cloud environment. Doing so makes the cost gap to a cloud environment much bigger.

3. The advantage is being able to tune servers to your workload. That you had to fight for SSDs is already a big red flag: if you don't have a culture of testing and picking based on what actually works, then yes, you may be best off in the cloud, because you'll overpay by a massive factor no matter what you choose, and in a cloud you can at least rapidly rectify the repeated major mistakes that will be made.

4. Don't pay upfront unless you're at a scale where it brings you discounts at a return that makes it worth it. Rent or lease to own. For smaller environments, rent managed servers. The cloud provider dropping prices 2 years in shouldn't matter, because by year 3 you should have paid off the hardware at an average cost well below cloud anyway, and you can start cycling out hardware as your load requires and as/when hardware fails or it's no longer cost-effective to have it take up space in your racks. Depending on your setup, that might be soon, or you might find yourself still running 6+ year old hardware you paid off 3 years ago for many tasks.

If you can't beat cloud on TCO, you're doing something very wrong. I use cloud for lots of things that aren't cost-sensitive, but there are extremely few scenarios where it is the cost-effective option.


Hardware beyond 3 years starts to be inefficient due to the space and power cost. Paid-off servers aren't free :) Otherwise, we did mortgage the servers on the back end; buying upfront gave us around a 50% discount on top of the steep discounts you get from buying direct from an integrator vs. Dell et al.

On the rent-to-own topic, the people-time cost of the migration was budgeted at around 10-20% of total TCO if I recall correctly, but we were in the realm of "we will have to hire people to maintain this project at 24/7 uptime". As the people time was relatively fixed by the need for a sane on-call rotation, shrinking the infra footprint would simply have increased the people-time portion of the TCO. If the footprint had shrunk by 75%, the maintenance cost would have become problematic. Because build-outs are higher-intensity work than maintenance, the people-time factor would have risen had we started with more experimental amounts. As it happened, tech salaries also rose ~2x over the 3-5 years that the colo existed.

On the workload front, as happens in many large organizations, there are teams with a "don't touch my stuff" attitude. At the time the buy was initiated, there were several estimates indicating that we would need 4x the hardware we were buying within 5 years.

Whenever someone says that they are beating cloud TCO, I'd suggest you do the following math (a rough sketch of the model follows the list).

- Sum up all costs related to the on-prem hardware, including smart hands, power, racks/network gear, and power adaptors.

- Calculate a depreciation schedule for the existing hardware targeted to 3 years.

- Add a 10% cost of capital to discount future savings.

- Amortize all "non-server" hardware costs onto your principal bottleneck (be it CPU/Network/Memory/Storage)

- Project how your savings compare against cloud costs under different assumptions of future cloud discounts. Both negotiated and public.

- Project how your savings compare under different utilization assumptions.

- Add a bus factor into the investment: what happens if your team leaves, hardware gets smashed, the workload becomes more efficient, or some other crisis occurs?

Next, and more controversially:

- Add in the people cost of everyone involved in the maintenance of your on-prem infrastructure.

- Project what happens under varying assumptions for how much of a raise they will ask for. What happens if tech compensation rises by another 2x over the next 3 years?

- Explore what happens if, on cloud, you needed X% fewer engineers. Depending on what you are doing, X could be 0% or even negative, but for many shops some fraction of existing work can be automated or avoided.
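To make the checklist concrete, here is a rough, illustrative sketch of the model in Python. Every number is a placeholder to replace with your own; the only structure taken from the list above is 3-year depreciation, a 10% cost of capital, and a sweep over cloud-discount scenarios.

  def npv(cashflows, annual_rate=0.10):
      # Discount a list of yearly costs back to present value (10% cost of capital).
      return sum(c / (1 + annual_rate) ** year for year, c in enumerate(cashflows))

  def on_prem_tco(hardware, colo_per_year, people_per_year, years=3):
      # Straight-line depreciation of the hardware over `years`, plus recurring opex.
      yearly = [hardware / years + colo_per_year + people_per_year for _ in range(years)]
      return npv(yearly)

  def cloud_tco(list_price_per_year, discount, rightsizing_savings, years=3):
      # Negotiated discount plus whatever you save by rightsizing / turning things off.
      yearly = [list_price_per_year * (1 - discount) * (1 - rightsizing_savings)
                for _ in range(years)]
      return npv(yearly)

  # Example scenario sweep: does on-prem still win if the cloud vendor discounts 50%?
  for discount in (0.0, 0.3, 0.5):
      print(discount,
            round(on_prem_tco(hardware=2_000_000, colo_per_year=300_000, people_per_year=600_000)),
            round(cloud_tco(list_price_per_year=2_500_000, discount=discount, rightsizing_savings=0.2)))

Run the same sweep over utilization, salary growth, and a bus-factor event and you get the scenario table the list is describing.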

It was this math that made me turn away from non-cloud offerings. I still use cloud alternatives for some personal projects, but I don't bill my own time in that setting.


> Hardware beyond 3 years starts to be inefficient due to the space and power cost. Paid-off servers aren't free

Hence my caveat, but racks are typically rented at a fixed price per rack, and most places never even reach a scale where they need a full rack per location, so my experience is that for most people it pays to keep servers quite a bit longer because most people have spare space in racks that are already being paid for.

Once you're at a scale where you typically have whole racks aging out at once, it shifts the calculation somewhat, but even then it varies greatly depending on e.g. your balance between storage and compute. It's very rare that it pays to throw out hardware at the 3-year mark, except in markets where power and real-estate costs are unusually high, but then moving your hosting wholesale often pays off; e.g. I've in the past moved entire workloads from London to cheaper locations.

> Otherwise, we did mortgage the servers in the backend - buying upfront gave us around a 50% discount on top of the steep discounts you get from buying direct from an integrator vs. Dell et. al.

Either you loan-financed the hardware, or the earlier mention of the "use of VC funds" was irrelevant, then. And nobody gives you a 50% discount for paying up front. You might have paid 50% less than the full purchase price plus interest, sure. That's not a discount; that's not having to pay interest. In the end, whether you buy on credit, rent, or lease to own, you get a cost curve per server per month either way, and that is what is relevant to plug into your models.
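(For a rough illustration with made-up numbers: a $10,000 server financed at 8% APR over 36 months comes to about $313/month; that per-server monthly figure, plus its share of rack, power, and support, is what you compare against the cloud bill.)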

> On the rent-to-own topic, the people time cost of the migration was budgeted at around 10-20% of total TCO if I recall correctly

Migration is a one-off, so this only makes sense given a time frame to write it off over. 10%-20% written off over 1-2 years wouldn't be completely crazy, though it's high. If your time horizon is so short that this matters to you, then you have organizational problems.

Put another way: when I did consulting in this space, I'd often offer to do the transition for clients for a percentage of their savings over the first few months, because I knew exactly how much these transitions would take us, and the clients would take a look at the proposals and accept my hourly rate instead when they realised how much they'd save, and how quickly.

Reduction in egress fees alone often paid for my fees in a couple of months (one system I migrated, which was admittedly atypical, saw hosting costs drop 90% thanks to egress fees alone), and we usually saw devops costs drop at the same time. Most of my clients outsourced 100% of their devops, so the costs were easy to quantify.

> If the footprint had shrunk by 75%, the maintenance cost would have become problematic.

This is backward thinking. If the footprint drops by 75%, the cost for that drops. If your maintenance costs don't drop as fast, it doesn't matter; your total cost is still lower. But if your maintenance cost isn't elastic, you have an organizational problem.

Whether the proportion spent on each factor then changes might be a political consideration, but if you surrender savings because the budget "would have become problematic", it's no wonder you ended up with a failed transition: there are incentives to avoid savings in order to maintain numbers that were broken from the outset. This sounds more and more dysfunctional to me.

> Whenever someone says that they are beating cloud TCO, I'd suggest you do the following math.

Done all of these many times, and never once had cloud come out remotely competitive.

To your "controversial" point, what I usually see when people think their cloud setup is comparative is that they carry out no accounting of how much time they actually spend maintaining cloud-specific things. When they get to the point of handing it over to someone specializing in it (as I did for years), it's often a surprise to them just how much time they offload from their teams.

A few of the other things that seem to shine through here are: 1) an assumption of capital outlays. I haven't needed capital outlays for colo'd environments in any setup I've done in the last 20 years (I did for a handful before that); the cost of financing directly with the provider or via a leasing company is priced in when I compare costs with cloud, because otherwise it wouldn't be comparable. If you then want to pay upfront, that's a choice, not a necessity, unless your credit is absolutely worthless.

2) Comparing only colo vs. cloud instead of adding in hybrid or managed hosting. If you build a purely self-hosted environment, a lot of the price advantage gets eaten up because you need to assume a far higher amount of spare capacity, which will drive up your cost even if you get it right; and people often end up far too conservative here and assume peaks far higher than what they ever need.

The moment you have a hybrid setup that can scale into cloud instances as needed, which is typically little extra effort to set up (you're going to be using an orchestrator anyway), you can easily double (or triple, if people were being conservative) the typical load on your colo servers and cut the hardware and rack cost accordingly. Usually when people do this they still end up hardly ever actually spinning up cloud instances, because most people's traffic varies far less than they'd like to think.

Even more so given there are now plenty of providers that offer a seamless transition from colo, via managed servers and VPSs, to cloud instances, often with surprisingly little difference in the time to spin up extra capacity. Your setup just has a method to register a new resource, irrespective of what is underneath; I've deployed systems that way for nearly 20 years now, with hybrid setups spanning the gamut from colo, via managed servers and VPSs, to AWS instances in a single setup.

The net effect tends to be having to defend retaining the capability to scale into cloud at all, because of how rarely it ends up being used.

3) An assumption that you need to hire people vs. e.g. outsourcing. Most companies never reach a scale where they need even a single full-time person doing hands-on ops; you're better off leaning on colo support, plus retainers for monitoring and out-of-hours coverage, for fractional scaling until you reach a scale where staffing several dozen full-time staff becomes viable. I've never had a problem scaling this up/down on an hour-by-hour basis, with commitments for base-level needs on a month-by-month basis. For years I used to provide fractional support for companies to facilitate this type of thing.

4) An assumption that you can't automate the same things in a colo environment as in a cloud environment. For a well-managed colo environment, past racking the hardware, if you can't boot the system via IPMI etc. straight into an image that automatically enrolls the server in your orchestrator, you're doing something wrong. If your cost of physically managing your servers is more than a rounding error, you're doing it wrong.
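As a rough illustration of that last point (my own sketch, assuming a Kubernetes-based orchestrator; the provisioning URL and JSON field names are hypothetical, and only the kubeadm join syntax is standard), the first-boot enrollment baked into the netbooted image can be as small as:

  import json
  import subprocess
  import urllib.request

  # Hypothetical internal service that hands out join parameters.
  PROVISIONER = "http://provisioner.internal/v1/join"

  def enroll() -> None:
      # Fetch the join parameters for this machine.
      with urllib.request.urlopen(PROVISIONER) as resp:
          params = json.load(resp)

      # Standard kubeadm worker join; typically run once on first boot,
      # e.g. from a systemd oneshot unit in the IPMI/PXE-booted image.
      subprocess.run(
          ["kubeadm", "join", params["endpoint"],
           "--token", params["token"],
           "--discovery-token-ca-cert-hash", params["ca_cert_hash"]],
          check=True,
      )

  if __name__ == "__main__":
      enroll()

After that, the orchestrator treats the box like any other node, which is what makes the physical side a rounding error.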

Yet the flexibility of public cloud environments is like a golden ticket for people doing devops consulting: when I was consulting in this space, the one constant was that clients in cloud environments ended up paying me 2x-3x as much for assistance on workloads of similar size and complexity. And I still usually cut their costs significantly compared to what they used to pay. E.g. the time that goes into maintaining network setups in a typical cloud environment, for things that are solved by plugging machines physically into isolated switches, is staggering. Yes, usually people could do it cheaper than they do in public cloud setups, but the risk of getting it wrong also tends to be far higher.


I don't get why you'd do this. This sounds like a false choice. If you want a company to deliver racks with hardware to you, plenty of companies do exactly this. You don't need cloud for that.

The choice is not between "do all hardware and building" and "cloud". There's a whole spectrum in between. In fact, it's quite hard to do the building. You can go colo (building + power + electrical done by vendor, optionally network), dedicated (servers done by vendor), hybrid hosting, hyperscaler (you get VMs), ...

As soon as you move above colo, you don't have the hardware issues mentioned here. Or, well, you do have issues, but you "just" file a ticket.


I wrote an orchestrator from scratch in the pre-K8s days, and even factoring in the opportunity cost of that, colo and eventually Hetzner was far cheaper than cloud, to the point where (and we costed this out yearly) the company wouldn't have survived a year if we had had to pay AWS-level prices for infrastructure.

Cloud is great for many things, but not cutting cost.


In the case of fly.io it's a little different, because they have an anycast network, with traffic being accepted at the closest region and proxied to your running VM. This is not unlike a CDN, and if you look at the North America/Europe pricing of most CDNs, fly.io is quite similar.


Anycast does not really make the pricing that different.

Especially since one does not have to backhaul the traffic to one specific PoP.

Cloudflare manages to run their Anycast CDN as a loss leader just fine.


I believe Cloudflare is playing the long game of commoditizing their competitors and making their profit through other auxiliary services.

Also, on the free plan, routing is atrocious, with all requests from Africa and Asia being directed to Europe. That helps keep costs down, but it can't then be compared with fly.io.


With Netlify, it costs more than hiring a courier to take an SSD you paid for, turn up in person, copy the data, and deliver it the next day.


I’m guessing you can negotiate for sharply reduced rates. No one pays list if they can help it.


It looks like Netlify's $550 also includes hosting and CDN, so it's not apples to apples.


AWS transfer costs seem off by a factor of ten. The sticker price is $0.09-0.14 per GB depending on region; how did you get 100 GB for $92?


If I'm not mistaken, this is copied from the OP, which is pointing out that you get 100GB/mo free (that's the "100GB/mo" part). The $92.16 cost is for 1TB of egress. It's clearer in the original article.


100 GB is the free allowance; the $92.16 is for 1 TB, as the pricing in the article is based on 1 TB chunks.
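(For what it's worth, the arithmetic lines up with the list rate quoted above: 1,024 GB × $0.09/GB = $92.16, so the figure appears to be exactly 1 TB at the first paid egress tier.)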


There's no way it actually costs them that much, right?


Not even close. I think a 1 Gbps IP transit hookup at a data center goes for $200-400 a month right now, and you can push 192 TB a month with that. That's not even the bulk cost: they buy larger amounts in volume and probably get a huge discount on it. In fact, given their size, they might not pay for it at all (peering agreements), so it's just operating costs.
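(Rough arithmetic for scale: a fully saturated 1 Gbps link moves about 0.125 GB/s × 86,400 s/day × 30 days ≈ 324 TB/month, so 192 TB corresponds to roughly 60% average utilization.)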


A good chunk of network capital is also amortized into these costs. It isn't just the 20 cents per megabit for IP settlement; it's also all the routers those ports are connected to, the racks those sit in, the electricity they consume, and the networks beneath that, from the hypervisor upwards.

Add on the usual and egregious $150-400/mo for an SMF/OS1 cross-connect in many datacenters.

Now, I'll grant you, "not even close" and handsome margins likely still apply, but "I'm going to hook a server directly into a 1 Gbps IP transit port for $200" vs. all the complexity of a hyperscaler network is not apples to apples.


> A good chunk of network capital is also amortized into these costs, it isn't just the 20 cents per megabit for IP settlement, it would also be all the routers those ports are connected to, the racks those sit in, the electricity those consume, and the networks beneath that from the hypervisor upwards.

Some of that is capex, and some is opex. But it's worth noting that the hyperscalers are doing something far, far more complex than almost any single-tenant physical installation would do.

In a hyperscaler cloud, I can run an instance anywhere in the AZ and connect it to my VPC, and I get what appears to be, functions like, and performs as if I'm plugged into conventional network infrastructure. This is amazing, and the clouds don't even charge for this service as such. There are papers written about how this works, presentations are given, costs are bragged about, and entire industries of software-defined networking are devoted to enabling use cases like this.

But an on-prem or normal datacenter installation doesn't need any of this. Your favorite networking equipment vendor will happily sell you a single switch with tens of Tbps of switching bandwidth and individual links pushing 1 Tbps. It takes maybe 2 RU. If you are serving 1 billion DAU, you can serve up a respectable amount of content on that one switch. You could stick all the content in one rack (good luck at that scale), or you can scale across a few racks, and you don't need any of the amazing things that hyperscalers use.

And, realistically, very, very few companies will ever need to scale past that in a single datacenter; there aren't a whole lot of places where even 100% market penetration can find 1 billion people all close enough to a single datacenter for it to make sense.

So the apples-to-oranges comparison cuts both ways. You really can get a lot of mileage out of just plugging some off-the-shelf hardware into a transit port or two.


Notably, "anywhere in the AZ" is simply a launch constraint, and the hyperscaler can and does place you in the way most advantageous for them that still matches the 'customer promise' (the constraint). In practice they do model both the volume and cost of east-west traffic in their network topology and will, wherever possible, not place Instance A and Instance B all that far apart in the AZ, particularly if that AZ is one of the big ones with 20+ datacenters' worth of capacity.

All the other points you made are also excellent.


The important takeaway I'm trying to deliver is that AWS is comically overpriced regardless of what you are trying to run. Direct connections aside, reasonable cloud competitors with similar networks charge $0-1,000 for transit where the same volume would cost $8,500 on AWS. Cloudflare in particular has demonstrated that you can run these systems so efficiently that they can charge nothing for egress and still make a healthy profit selling related cloud services.


AWS Lightsail is actually pretty good, comparable to Linode (except it maxes out at 7 TB/month).

https://aws.amazon.com/lightsail/pricing/


It's also in the SLA that you're not to use Lightsail as a proxy.


I'm comparing Lightsail standalone.



