The ARM server apocalypse (storagezilla.typepad.com)
104 points by Ecio78 on June 16, 2013 | 63 comments


"If, by 2015, 64-Bit ARM makes it within striking distance of x86 server processor performance [...]"

There isn't a big enough rolleyes in the universe. ARM's sink-or-swim pitch points will be power consumption and density. Competing on performance isn't even an option.

x86 server buys, today, are often about how much RAM I can get in a rack unit. If AArch64 servers can get enough RAM tied to a chip, and make memory access fast enough, there's a pitch to be made for ultra-low-power, high-density virt hosts.

If they can't, they remain novelties for "hyperscale" hosting of static content and toy sites.


Yeah, the chance of ARM improving their architecture enough to match Intel's 10nm SoCs (2015 is the release date for Intel's 10nm line, and they've said they will push their SoCs to the latest tech), let alone 2015 x86 generally, is laughable.

Intel's mad focus on power consumption might get them to within 10-50% of ARM's power draw, at 25-100% better performance.


The article is pretty shallow in the sense that it just casually glosses over a few key characteristics that make the server/datacenter space an interesting battlefield for the current round of architecture wars.

It will be an interesting next couple of years. Intel didn't sit on their hands with this one, like they did with the mobile market. They saw the potential for ARM SoCs, using a large number of smaller/lighter cores, to take market share in things like simple web servers or hosting static data. Their response was the Atom-based "Centerton", which is cannibalizing their own Xeon line.

As for more CPU-intensive tasks, ARM has yet to prove itself in the performance-per-watt arena, especially once you add 64-bit server features like error-correcting memory and RAS. Most sysadmins building infrastructure like to play things safe, since they have to live with their decisions for years.

It seems like ARM's biggest advantage will be price, plus the SoC business model of custom-tailoring silicon to a customer's needs.


Someone's been just about to beat Intel at the high-performance CPU game, any day now, for a couple of decades. The AIM (Apple/IBM/Motorola) alliance was very fond of hockey-stick graphs showing their planned crushing of other CPUs (shown as a flatter line, usually coyly labeled 'CISC'), for instance, back in the early 90s.

It's not an impossible thing, but given the track record of such claims and predictions, the only comment they should elicit is 'Shut up and show me the silicon'.


Throw enough "this time it is different" predictions at the web and eventually someone will guess right.

When I shop for a new desktop, I feel I am essentially paying an Intel tax. Since AMD began playing catch-up, the Intel desktop product line has moved from providing value for the customer to extracting consumer surplus more efficiently. I would be surprised if this is not the case in the server market.

The lack of VT-d on K-series processors and the totally locked-down other CPUs are a good example. Nobody likes being milked, which creates some desire to diversify away from Intel even if the savings are not enormous.


Are you saying Dell won't buy Intel because Intel is evil? Recently China started to wind down its anti-Intel stance in part due to a new supercomputer deal.


Dell doesn't buy Intel. They resell Intel, so it's all the same to them.

They will gladly resell whatever someone decides to buy. But every time there is a new chip on the market, the big cloud operators and the big end customers for Intel chips take out their calculators and evaluate very carefully.

Also, Intel isn't evil. They just milk their customers.


Numerous examples exist of firms beating Intel at creating high-performance CPUs. Nobody can best them at business, though. Otellini spent just enough on R&D to keep Intel caught up with its competitors. It will be interesting to see what they do without him.


Yep, that's what I meant, perhaps a little unclearly, by 'game' rather than just 'high performance CPU'.


Intel's Xeon Phi cards pack 240 x86 CPUs per card at very low power per core. You can run Linux on them. Soon we'll be serving web content and VMs on them.

The best bang-per-watt supercomputer in the world right now is a ton of Xeon hosts with 2 of these cards per machine.

We've got a stack of them in the office; they're pretty interesting.


> Intel's Xeon Phi cards pack 240 x86 CPUs per card

> We've got a stack of them in the office; they're pretty interesting.

That's funny, seeing as the only Phi coprocessor you can buy right now only has 60 cores [0].

I mean, what Intel has done is quite impressive, but let's not exaggerate.

[0] http://ark.intel.com/products/71992/Intel-Xeon-Phi-Coprocess...


One important observation is that the interconnect out of the Phi cards is not very good. While the interconnect inside the Phi is pretty good, all communication with other compute resources goes through the PCIe bus. This places the Phi cards close to GPUs: very good at graphics and a few specialized problems like protein folding, but not at supercomputer workloads like sparse matrix solving.


Have you tried benchmarking serving dynamic web content on them?


This is fine for VMs and maybe low-load servers, but the problem for high-performance servers is that ARM sucks memory-wise: ARM licensees don't hold the patents on the cache hierarchies that Intel has (and that AMD can use), and thus ship very limited cache systems.

AMD's future entry into the ARM market might change this, as they can use these patents.


"... with that comes all the awe & terror of living with Windows."

I had to LOL. There's not even any point stating the obvious. That's good writing; the author can coin quite a phrase, unless that's a quote from somewhere.

On the bigger picture, I dunno. I've seen the infinite circular wheel of IT rotate around quite a few times, and this overall idea sounds like a rehashed Transmeta marketing message. That didn't turn out so well last time, but maybe the situation is different this time. Probably not, but maybe.


Transmeta's CPUs worked fine, but the market wasn't looking for underpowered mobile processors at the time. They sold off their IP to Intel, who rolled it into the Pentium M, which was the basis for the Core series.


The Transmeta Crusoe (VLIW with software translation) was nothing like the Pentium M, which was a derivative of the P6. The Pentium M did add some cleverness in fusing some of its hardware-translated micro-ops, but that's not terribly related to what Transmeta were doing.

You may be misremembering the patent lawsuit Transmeta filed against Intel, long after the Pentium M had been released and Crusoe had failed commercially, basically as an alternative monetisation strategy. Intel did end up paying them to go away, but their IP was not "rolled into" Pentium M any more than Eolas's was into Internet Explorer.


Oh I didn't mean core architecture or instruction set or anything, just some of the power management.


What people don't realize is that Transmeta wasn't actually trying to develop underpowered mobile processors. That is just what they wound up trying to sell.

The goal was a high-performance architecture. The #1 barrier they saw was heat, so they developed a very power-efficient chip. They succeeded in that, but because they both took longer and were less efficient than hoped, the result was underpowered compared to the market. There was, however, a niche they could sell to.

From there their hope was that, being a fundamentally simpler chip, they could iterate faster. That would mean that eventually they would catch up on performance. Unfortunately for them, Intel's greater resources and experience meant that they iterated faster than Transmeta, despite having a much more complex architecture.

Transmeta was always going to either change the world or sink without a trace. They sank without a trace. But it was an interesting approach and was worth a try.


The Transmeta saga is a long and varied one, and of its many phases, this is the one I was referring to. Not becoming a patent troll or shipping cell-phone procs or whatever once they realized it wasn't going to work on servers and started scrambling for anything to keep their head above water.

The Transmeta-CPUs-on-servers pitch was almost word for word the ARM-CPUs-on-servers pitch, just a dozen or so years earlier. This PR pitch might even work someday, maybe even this time. Just saying I've heard it all before, and last time I heard it, it turned out this way...

From an engineering standpoint it seems almost impossible to optimize both for performance in this new market AND for heat/power in your traditional market, unless your R&D budget is absolutely insane compared to competitors optimizing for only one or the other. Note that R&D budget in dollars has almost no relationship to quantity of units shipped, yet the latter keeps being discussed in its place, probably because the real numbers are unfavorable.


The Wikipedia page for Transmeta says they sued Intel for patent infringement in 2006, then licensed their tech to AMD & Nvidia, before being bought by another failed company and having their patent portfolio wind up with our favorite patent troll, Intellectual Ventures.

https://en.wikipedia.org/wiki/Transmeta#Timeline


Hm, what happened to this deal in 2008? http://www.electronicsweekly.com/news/components/microproces...

The first agreement grants to Intel a non-exclusive license to use and exploit certain Transmeta technologies commercially. A second agreement is an amendment to a settlement and license agreement from December 2007, which granted to Intel a perpetual non-exclusive license to all Transmeta patents and patent applications.

As a result it will receive from Intel a one-time, non-refundable payment of $91.5m in the third quarter of 2008.


The business case for serving web content isn't as strong as you might think from a back-of-the-envelope calculation about cost/power vs. performance. It's true that you can get more throughput per dollar with ARM/Atom (mostly due to the lower power, but also because the machines are cheaper), but when you actually do it you'll find that latency is significantly higher, if you compare ARM/Atom boxes at low load vs. fast x86 machines at high utilization [1]. An argument people often make is that ARM is going to catch x86. Sure, possibly, but why wouldn't you expect x86 to catch ARM? [2]

A lesson that people seem to have to keep re-learning, over and over again, is that latency matters a lot on the web. A few ms increase in latency has a measurable effect on your income, as people just close the webpage and click elsewhere. A significant increase in latency is disastrous.

[1] http://users.ece.utexas.edu/~vjreddi/UT/Publications/Entries.... This paper describes a way to use low-power boxes to get better results for the same cost, but it doesn't involve simply swapping your Core i5s for Atoms or ARMs, which is what a lot of people seem to want to do.

[2] One of the major lessons computer architects learned perhaps 10-15 years ago is that the instruction set just doesn't matter that much, compared to the microarchitecture, the manufacturing process, and the quality of the circuit design. Intel has a decisive advantage in manufacturing that's been growing for approximately two decades, ARM doesn't even try to compete in circuit design with full-custom design or fancy circuit techniques, so that leaves the microarchitecture. Although I'd disagree, you might make a case that ARM simply has better architects, and that ARM would produce a better design than Intel if they targeted the exact same space. But, I doubt you'd try to make the case that you'd expect ARM's design to be so superior that it will obviously overcome Intel's other advantages.

Another case you might make is that Intel simply won't target the same space, to avoid cannibalizing their own market, but, in addition to obviously moving towards that space with Atom, they have a history of ruthlessness that makes that seem unlikely.

Intel used to be a dominant player in the DRAM industry, but they killed off their DRAM business while they were a leader in the field, because they recognized that it would become a commodity industry. Intel then became a market leader in SRAMs; when a competitor invented flash, they realized its significance and focused on flash and microprocessors, while, again, killing off what was (then) a major cash cow. It's very hard to imagine Intel just sitting and slowly losing their dominance of the microprocessor industry, ending up with a position like IBM or Sun. They've never done that in the past, so why would you expect them to start now?


Just to second this point: most people have no conception of what semiconductor manufacturing is like, and this article seems especially misinformed about the manufacturing side. For example: "The ARM collective simply show up to Samsung or TWSC, or TI or Global Foundries, ask for a zillion processors to run off the manufacturing lines and wonder if they can have them by Tuesday?" which is just total bullshit. The reason Apple had to buy so many processors from Samsung for so long is that setting up a new processor is expensive and difficult, whether it's Intel or Global Foundries, and it sure as hell doesn't take less than six months, let alone a week, before you can get a reliable supply of chips for a consumer product. (And before you talk about the volume of ARM vs. Intel, compare the dozen-plus companies that can manufacture at 24-60-something nm to the ONE company that can do 22 and lower: Intel.)

There are many competitors in the semiconductor manufacturing space because COMBINED they don't have the manufacturing or scientific resources of Intel (TI and Qualcomm have massive R&D arms, but they're more concerned with wireless and other electronics). Intel's first mass-market 14nm facility is supposed to go live this year, and Samsung just BARELY got their 14nm demo out this year. Intel's plant cost $5 billion and was started in 2011, which means Intel has years to improve their power, bus speeds, and reliability before any other manufacturer will even be able to sell their first processor. As the technology nears sub-10nm, the gaps in performance and power consumption between the architectures will become more and more obvious.

All of the other fab operators are extremely reliant on third-party suppliers for their factories, whereas Intel helped develop a huge portion of their technology, in many cases outright owning part of the equipment manufacturers (for example, http://www.extremetech.com/computing/132604-intel-invests-in...). 14nm was the point where a lot of physical phenomena started preventing e-beam lithography and some of the other methods from working, and Intel's pretty much the only one that can really push this technology forward.

Edit: Also, http://www.tomshardware.com/news/intel-cpu-processor-5nm,175... - it's over. Notice how Intel said they're going to push their SoCs to their current processes? Hopefully this means that in 2015 we'll have 10nm x86 SoCs (at which point ARM will be, best-case scenario, nearing end-of-life 22nm).


Shrinking feature size alone has had diminishing returns on efficiency for the last couple of generations, and I haven't heard any indication that sub-22nm would reverse that. Obviously it still increases your transistor budget, which is admittedly important for server CPUs, since caches are generally the single largest use of transistors.

But efficiency-wise it's FinFET, not feature size, that has given Intel the biggest advantage as of late. And indeed the ARM foundries might not have volume shipping FinFET SoCs until 2015.

I have no idea why you're comparing TI/Qualcomm to Intel when discussing foundry R&D - it's TSMC, GloFo, and Samsung that are relevant there.

Also, Tom's Hardware has its years off - Intel isn't shipping 14nm CPUs until next year, with 20nm ARM SoCs expected around the same timeframe. Then 10nm isn't expected until 2016 at the earliest, again with 14-16nm ARM SoCs in about the same timeframe.

So yes, Intel is about a generation ahead of its competition, and even more factoring in FinFET. But that's not the end of the world - IBM, AMD, and Oracle still make server CPUs despite this. Not as successfully as Intel obviously, but enough that there could be room for one or two makers of ARM microservers to enter.


I forgot about FinFET and just extrapolated from the i-series jump, but the only important differences I've seen in processors in the last 3-4 years have been power efficiency, cache, and bus speed. Clock speeds (especially with Turbo Boost thrown in) became more and more erratic as a metric for my use cases, so at this point it's all about Intel's microarchitecture and process (from my perspective). They've also been one of the most advanced firms materials-wise, so as things get smaller and smaller, I think Intel will begin to use new materials that compound the gains from FinFET and process. IIRC in 2012 they were two "generations" ahead on high-k dielectric materials and their silicon-straining process for mass production.

I just picked two companies off the cuff that are relatively well known and that I think could make an impact in the server market with ARM. I think comparing ARM mobile SoCs to Intel's x86 or AMD's AMD64 is disingenuous, and since all TSMC and Global Foundries can do is play catch-up (hence me touting Intel's process advantage), the brunt is left on the microarchitecture designers and the integrators to really make a server that can beat Intel's Xeon. I'm sure TSMC/Global Foundries are deeply involved in the design aspect, but I think other companies will make or break ARM in the server market.

As for the dates, that's a shame to hear :(. However, I just looked it up, and TSMC only finished their 20nm design this year, so I really doubt 10nm chips will be in a server-ready state in 2016. I think the next few years are on ARM's turf, but if Intel can hit the cost sweet spot that ARM is at (or even just the ballpark) with Intel's process and performance, it might be the end of ARM as a non-mobile/embedded contender.


This is marginal compared to the two major factors that matter for the majority of web services: acquisition cost and power consumption. On both fronts Intel's current business model cannot ever compete with ARM, no matter how much more advanced their manufacturing may be. Sure, Intel is likely to keep on driving the high-end single-machine market, but the vast majority of web services will go with the market. And that market is ARM's to take, unless Intel drives down its prices dramatically. A bloodbath is about to happen. Better start shorting Intel...


Please read this; it's old but still applies: http://www.geek.com/chips/why-amd-mhz-dont-equal-intel-mhz-5...

Since Intel is the driver for cutting edge technology, they have, by a wide margin, the most potential for drastic improvements. ARM is mostly at the mercy of Qualcomm (Snapdragon) and a few other companies who rely on third-party manufacturers of semiconductor fab equipment.

Also, I laughed out loud at "bloodshed." Intel has about half the assets of Apple, plenty of cash in the bank, and controls something like NINETY-SIX PERCENT of the server market. Even if they lost half the market in the next ten years, the only bloodshed would be the cost of engineering and IT time wasted moving architectures. Worst-case scenario, they become a contract manufacturer for the best ARM chips.


You are missing a point, I think. Intel burns huge amounts of cash while driving the business. That was made really clear when AMD started taking money away from their server business with the Opteron in the first days of AMD64 chips. The impact is that a small dent in their sales has a disproportionate impact on their cash accounting. That, and basically having to can an entire Pentium design cycle, was a big hit for them, and AMD got what, perhaps 7% of the share at the peak? Then there are desktops, which are also under attack by ARM (not really the case with the Opteron, and AMD still struggles there), so Intel is facing a two-front assault: tablets use ARM chips and are eating laptop PC sales, and ARM-64 is threatening to eat into server sales with the ability to innovate on the front-side bus (something not possible in the Intel space).

The texture of the ARM threat is very different from what the AMD threat has ever been; it's closer to what Intel faced back in the early days with Motorola and PowerPC, except ARM is a licensing business, not a semiconductor business, so there is no 'home team' to protect price-wise. All of the pieces are in place for ARM to seriously marginalize Intel; other folks can see that now, and that is why Windows for ARM even exists.


I don't know anything about the impact Intel's status as a publicly traded company could have on the distribution of this market (which may be why I missed the point), but semiconductor factories at Intel's level are insanely valuable. Intel also seems to learn very well from their expensive mistakes, so with the equity in the factories and their existing market share, there is a long, long way to go before Intel is truly "oh shit, we're going to go bankrupt really soon" threatened.

ARM processors don't have an FSB (at least not in the sense that I understand it). Even if ARM-64 introduced one, what innovation could they do there? The whole point of QuickPath Interconnect and HyperTransport was that the FSB-to-northbridge connection was a bottleneck. If you mean ARM can innovate in their general bus architecture, for instance by implementing NUMA and a few other optimizations which would really help, then I agree. But we have yet to see how these optimizations will work out in the ARM architecture, and implementing many of them is nontrivial.

I totally agree that this is a different threat and that's exactly why I think Intel will win this. Their response has been late but very promising and, from my cursory knowledge of their history, taken very seriously compared to previous threats.

As for "other folks can see that now," the opinions I've heard in my circles have all been largely pro-ARM except for the electrical engineers. None of them think that any of the ARM microarchitectures can compare to Intel's for servers (opinions gathered pre-SnapDragon 800) and that they're nowhere near ready to impact Intel's market share in servers beyond a rounding error.


Intel isn't "the driver for cutting edge technology" they eek out incremental improvements to their products in as miserly a fashion as they can get away with. When a firm challenges them and takes the lead in performance, they scramble and catch up while doing their best to ruin their new competitor with nasty business tricks.


This is so absolutely wrong that I can't even....

When a firm challenges them, they change philosophies and strategies (and they win). Remember the GHz race that Intel eventually won? They won not because they had better manufacturing technology than AMD (which they did), but because they totally changed direction away from increasing the clock to improving the communications buses and optimizing instructions/cycle. Yes, they're often anticompetitive, but that isn't something new in the semiconductor industry (Intel was just the only one with enough market share to go after).

As for incremental improvements, you're kidding, right? Every jump in semiconductor manufacturing is a huge leap, both in performance and in technological complexity, and they've been first to mass market with (I think) every new process in the last decade.


>they totally changed direction away from increasing the clock to improving the communications buses and optimizing instructions/cycle.

I remember Intel following AMD's lead in that. Remember HyperTransport, which predated QuickPath by a few years while Intel persisted in wringing every last dime out of selling the northbridge/southbridge? Intel hung on to that scheme long past its useful life. The only reason they got away with it was their huge scale as a company.


"Intel hung on to that scheme long past its useful life. The only reason they got away with it was their huge scale as a company."

I agree with the former, disagree with the latter. HyperTransport gave AMD a significant leg up on interprocessor communication, but Intel's front-side bus and cycles/instruction in the most important parts of the architecture were still superior to AMD's. When HyperTransport was introduced, the only thing that saved Intel from losing even more market share (in my largely uninformed opinion) was AMD's inability to market their own innovations as well as Intel and the fact that most of the other silicon on Intel's processors was implemented better (for real world use cases, not benchmarks).

Intel and AMD both make great tech, and yes, Intel is much better at the ruthless business side, but the comparisons you see in the media about their technology don't even begin to scratch the surface of reality. These systems are really complicated, and they differ enough that just about every marketing term describing CPUs is useless when comparing architectures. For example, there was a time when I had a 900 MHz Athlon and a newer Intel CPU (I believe it was a Celeron or Pentium) running at 1.8 GHz. There were use cases where the Intel could barely beat the Athlon by 20%, and other times it would show a 10x improvement because its processor cache was 3x faster.


>AMD's inability to market their own innovations as well

Look, we can quibble about the details all day long, but AMD never had a chance, even if you concede AMD's momentary advantage over Intel in having both HT and better CPUs for a short time. Intel is a behemoth with all of the market share, and OEM computer manufacturing can't flip overnight. Intel allowed AMD to live so they could tell the FTC and the EU trade commission that the CPU business was a competitive market. That tactic served them well.

> as Intel and the fact that most of the other silicon on Intel's processors was implemented better

Come on. You make it sound as though Intel is infallible at hardware design. Besides failing to drive innovation much of the time, they have had some huge failures. Itanic? Anything Intel ever produced to do with video? The 8051? (The 8051 is a financially successful product, but IMO a terrible microcontroller.) They're great at clinging on to their legacy designs, and adopting others' good ideas after they're proven.

>(for real world use cases, not benchmarks).

We both know that benchmarks are just a well-organized way to tell a lie, same as specifications.


Agreed on AMD.

I don't make them sound infallible; you make them sound incompetent. The point is that because of Intel's experience, sheer size, and resources, they have a massive leg up over the very fragmented ARM group. They fuck up all the time, but they almost always catch up on features or optimizations (except for power efficiency, which I believe is highly dependent on the architecture), and they do so very quickly. When they do innovate (which is very often), others can't catch up, because by the time they do, Intel is already ahead again (not entirely true, but close enough).

"They're great at clinging on to their legacy designs, and adopting others' good ideas after they're proven."

I guess everyone just really loves the underdog, but come on, read a book. AMD's history revolves around copying almost every aspect of Intel's business, and then they get hailed for coming up with TWO original things in TWO decades (AMD64 and HyperTransport). I'm sorry to oversimplify, but that's the only response I can think of to a silly statement like that.


Latency is important, yes, but for static content and other "simple" things, network latency and slow clients mean you can often serve the content "fast enough" with very slow CPUs. It's not unusual to serve hundreds of connections or more per core on an x86 server. In that case, serving fewer connections per core off a slower core can still be viable, and it becomes a question of cost per client served.
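To make that cost-per-client framing concrete, here's a back-of-the-envelope with completely made-up illustrative numbers (none of these prices, core counts, or connection counts come from real hardware):

    x86 box:  16 cores x 500 conns/core = 8000 conns, ~$2500, ~350 W
    ARM node:  4 cores x 150 conns/core =  600 conns,  ~$200,  ~15 W

    per connection:  x86 ~ $0.31 and ~44 mW
                     ARM ~ $0.33 and ~25 mW

With numbers like these, hardware cost per client is roughly a wash and power per client favors ARM, so the decision hinges on whether the slower core still meets your latency target.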


Interestingly enough, AMCC, the company making the ARM chips in question, is reportedly rejecting the "wimpy cores" idea. So these processors might have significantly better performance than what's been associated with ARM before. Clock speeds are said to be 2.4-3.0 GHz. The real-world performance remains to be seen.


Clock doesn't mean anything if the communication peripherals (technically buses, not peripherals, in processors) can't feed data from disk, RAM, and Ethernet fast enough. The problem is that ARM has never been a high data throughput architecture, since few devices that need to process a large stream of information also need to be mobile. In the early days, Intel won because it crushed everyone with its bridge speeds, cycles per instruction, etc., and now ARM is crushing Intel on low power consumption. The problem is that ARM is nowhere near Intel's performance in real-world scenarios, despite lower watts per MIPS.

I don't know of any benchmarks quantifying the difference, but I once compared a stripped-down Android on a Transformer Prime (1.6 GHz quad-core Tegra; I don't remember if it was DDR2 or 3) with Angstrom on a roughly equivalent Intel Atom COM Express module (all code running was C/C++, not Java). In almost all of the tests I ran (all specific to my own use cases for web apps, CV, FEA, and market analysis for EVE Online), the Atom processor blew away the Tegra, because the Tegra cores spent a lot more time idle waiting for peripherals.

Edit: Forgot to mention that this is true (afaik) even with operating-system overhead and mismanagement on ARM chips. If you use a microkernel and optimize your code for all of the processor's peripherals (e.g. using DMA directly instead of letting the kernel fuck it up), you can get ridiculous speed increases over Intel stuff. Of course, this is true only as long as Intel processors are as inaccessible as they are now. If they started selling surface-mount soldered chips, then you could do away with the OS overhead on Intel too.


I've recently benchmarked RAM bandwidth and latency on computers based on the i.MX53 (Cortex-A8), Exynos4 (Cortex-A9) and Exynos5 (Cortex-A15). RAM throughput increases by something like 20 times between the A8 and the A15 platforms. So while it's true that ARM systems used to suffer from low throughput, this is something they've been working on for some time, and the results look quite good from where I'm standing. I found this microbenchmark[0], and (looking at memcpy) the results for new ARM systems look alright to me.

Of course, this is an oversimplification, but I'm happy to go into more details if there's interest.

Since the first ARMv8 cores are built for server workloads, I expect they're bringing significant improvements compared to the A15.

Disclaimer: Not associated with ARM, but I have an interest in ARM being used more for things other than mobile devices.

[0] https://github.com/c2h2/arm_c_benchmark


Kudos for actually running a benchmark, but it must be seriously flawed. There's no way you'd get a 20x boost in bandwidth between the A8 and A15 (both of which are ARMv7 and DDR3 is not 20x faster than DDR2) in nontrivial cases if you're using technology from the last 5 years. I'm guessing you ran a very trivial benchmark that operated mostly out of the L1/L2 cache (hence you weren't really testing the RAM) on the Exynos5 and mostly out of RAM on the i.MX53. Depending on what operating system you ran on the different boards, there could also be major differences in the kernel's implementation of the DMA peripheral, which would also heavily skew results.

For example, in the Github link, some code used native memcpys while others used kernel call memcpys. The differences in the specific Ubuntu 13.04 and Android implementations could vary the results quite a bit, even if they have the same exact overhead.


> There's no way you'd get a 20x boost in bandwidth between the A8 and A15 (both of which are ARMv7 and DDR3 is not 20x faster than DDR2)

There are a few factors at work here:

* The A15 was designed to have more memory bandwidth in the first place, since this was a known issue. I think the bottleneck was (is?) the speed of the AMBA bus[0] connecting the peripherals (including the RAM controller) to the ARM core.

* The A15 has a vastly more advanced pipeline and multiple-issue capability than the A8. This should allow it to use its functional units more efficiently.

* Finally, my A15 system is running at 1.7 GHz compared to 1 GHz for the A8 system.

> I'm guessing you ran a very trivial benchmark that operated mostly out of the L1/L2 cache

The buffers were pre-initialized (to remove the kernel's lazy physical memory allocation as a factor) and far larger than the L2. The data caches were explicitly flushed at the beginning. But indeed it was a very simple benchmark, since it was a microbenchmark meant to measure memory bandwidth.
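For anyone who wants to reproduce something similar, here's a minimal sketch of that kind of memcpy microbenchmark (not the code I actually ran; the buffer size and iteration count are illustrative, and instead of an explicit cache flush this version just relies on the buffers being far larger than the L2):

    /* membench.c -- compile with: gcc -O2 -std=gnu99 membench.c -lrt */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_SIZE   (64 * 1024 * 1024)   /* far larger than any L2 */
    #define ITERATIONS 16

    int main(void)
    {
        char *src = malloc(BUF_SIZE);
        char *dst = malloc(BUF_SIZE);
        if (!src || !dst)
            return 1;

        /* Touch every page up front so the kernel's lazy physical
           allocation isn't measured as part of the copy. */
        memset(src, 1, BUF_SIZE);
        memset(dst, 0, BUF_SIZE);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERATIONS; i++)
            memcpy(dst, src, BUF_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* Read dst so the copies can't be optimized away. */
        printf("memcpy: %.2f GB/s (%d)\n",
               (double)BUF_SIZE * ITERATIONS / 1e9 / secs, dst[0]);

        free(src);
        free(dst);
        return 0;
    }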

> Depending on what operating system you ran on the different boards, there could also be major differences in the kernel's implementation of the DMA peripheral.

I wasn't using DMA.

> For example, in the Github link, some code used native memcpys while others used kernel call memcpys.

To clarify: the project I've linked isn't mine; it's a public project that supported my point. But as far as I'm aware there's no memcpy system call. memcpy is implemented completely in libc; for example, glibc has multiple optimized implementations written in assembly[1].

But you're right, the glibc version is different between these computers, so I need to repeat with a statically compiled benchmark.

Anyway, I've looked at this while trying to improve memcpy speed for A15, which is why I don't have solid comparative results. But I'm doing a write-up and now I'll probably also include this bit.

[0] https://en.wikipedia.org/wiki/Advanced_Microcontroller_Bus_A...

[1] http://sourceware.org/git/?p=glibc.git;a=tree;f=ports/sysdep...


Huh wow, I guess I'm behind the times. Didn't know they made that big of a jump within ARMv7. This makes me curious to look at the datasheets.

From what I can tell, the i.MX53[1] has a 64-bit AXI @ 200 MHz. The Exynos5[2], on the other hand, has 64-bit AXIs and also a ton of optimizations. The LCD spec says it operates off a 200 MHz AXI, so I wouldn't be surprised if the Exynos5 uses dual 200 MHz AXIs for memory. It's tough to tell how much of that 20x comes from clock speeds and how much from optimizations. I agree the A15 is just a sign of things to come: with ARMv8, although it carries legacy stuff, ARM still gets to improve the architecture a lot more, specifically for servers.

For comparison, these guys [3] say 12.8 GB/s, which kinda sounds ridiculous. If it's true, they really are in range of Intel. The Sandy Bridge Xeon E3-1220 boasts a theoretical 21 GB/s @ a whopping 80 watts (although the high-end E7-8870 [4] is a fucking monster; at this point I can't even begin to think how they compare once you add all of Intel's memory tricks). The numbers are clearly within the ballpark; we'll just have to see a real-world case. I'm curious how well GCC and the kernels utilize all the unique processor features and optimizations. If it's mostly down to market forces, Intel's 96% market share might make it a long and difficult journey before an ARM Linux kernel is optimized as well for a server environment as the x86/x64 ports.
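For what it's worth, 12.8 GB/s is at least self-consistent with the usual peak-bandwidth arithmetic, assuming the commonly reported memory configurations (the channel widths and transfer rates below are my assumptions, not from the linked sheets):

    Exynos 5:  2 ch x 32-bit LPDDR3 @ 1600 MT/s = 1600e6 x 8 B      = 12.8 GB/s
    E3-1220:   2 ch x 64-bit DDR3   @ 1333 MT/s = 2 x 1333e6 x 8 B ~= 21.3 GB/s

Theoretical peaks, of course; sustained memcpy numbers land well below both.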

I'm too lazy to check what that assembly code does, but if it uses NEON PLD optimizations (NEON is definitely in the Exynos5), that may give a speed bump in memcpy (even if you preloaded the buffers), because those optimizations use the L1/L2 cache intelligently. It's hard to tell without looking at the code and diving into the i.MX53's feature set more whether that played a factor. * I was thinking of malloc and free, which end up making system calls because of paging. memcpy is a straight pointer-to-pointer copy, except maybe for whatever that assembly code does.
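To illustrate the PLD idea in plain C: it's essentially software prefetching, which you can sketch with GCC's __builtin_prefetch (a hedged illustration of the technique, not what any libc actually does; the 64-byte line size and 256-byte prefetch distance are assumptions, not tuned values):

    #include <string.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Copy loop with software prefetch: hint the source a few cache
       lines ahead into L1/L2 while the current chunk is copied. */
    static void copy_with_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
    {
        size_t i = 0;
        for (; i + 64 <= n; i += 64) {
            __builtin_prefetch(src + i + 256, 0, 0); /* read, low locality */
            memcpy(dst + i, src + i, 64);            /* one assumed cache line */
        }
        if (i < n)
            memcpy(dst + i, src + i, n - i);         /* tail */
    }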

Time to just find a fluffy article on this topic [5]

[1] http://www.freescale.com/files/32bit/doc/data_sheet/IMX53IEC...

[2] http://www.samsung.com/global/business/semiconductor/file/pr...

[3] http://www.maximumpc.com/article/news/samsung_details_exynos...

[4] http://ark.intel.com/products/53580/Intel-Xeon-Processor-E7-...

[5] http://www.theregister.co.uk/2011/10/20/details_on_big_littl...


If one is designing a whole server architecture from the ground up, it's possible to connect smart components directly to each other, maybe even using different buses, and have the main processor issue only high-level commands to the components themselves, mainframe-style.

One of the things that plagues x86 servers is that they are overgrown PCs. A machine designed to be a web or Samba server could be very different from a machine designed to run Windows, even if, from the application's point of view, it's just a regular server, with all the exotic stuff nicely hidden under the OS, within its device drivers.

It'll be fun to see what gets invented in this space.


I haven't looked at ARMv8 in depth, but I doubt it's a "from the ground up" architecture, let alone one specifically for servers. I'm pretty sure ARMv8 is a microarchitecture anyway, which means it's probably stuck with a ton of the same (or incrementally improved) IP cores, except for the critical peripherals (memory manager, cache, etc.). The flexibility ARM gets from the speed of their microarchitecture iteration process is fantastic, but I don't think it's enough to compete with Intel's x86/64 architecture (and Xeon microarchitecture).

I agree that Intel's chips are overgrown for simple stuff like that, but there's no other option. You either make the general commodity cheap or increase the cost across the board for specialized designs. The question is, though: until ARM chips are as hefty as Intel's, is this legacy overhead from personal computing enough to wipe out Intel's advantage over ARM long-term? I.e., if ARM can't get yields as good as Intel's (which means ARM's chips have to be smaller in physical size), the lack of overhead might be outweighed by the overhead of communicating between more processors or running more operating systems per [whatever] of computing power.


I think you are conflating the ISA and the microarchitecture. ARMv8 is an ISA, which is implemented by ARM's Cortex-A5x series of microarchitectures and several others.

Most importantly, the X-Gene (which these servers are made of) is afaik not based on ARM IP cores, and is indeed built "from the ground up" for datacenter purposes.


Will have to read up on this X-Gene stuff.

I said microarchitecture because ARMv8 is incremental over ARMv7 and not a from-the-ground-up design, but you're right, it's an architecture. I think the terms get used interchangeably. Implementing 64-bit is a big leap, but since it's also backwards compatible, it carries improved/specialized ARMv7 components and instruction set (plus X-Gene and other custom cores per manufacturer/microarchitecture).


I wasn't talking about micro-architecture. What I was imagining were more intelligent peripherals that could better offload a relatively underpowered CPU with specialized hardware communicating between its parts.


I just poked around the AMCC website. Apparently they are claiming 80 GB/s memory bandwidth and a 10GbE NIC on-chip. They also say something about a "coherent terabit fabric" and "ultra-low latency," so it sounds like they really are tackling the comms seriously. And looking at their product portfolio, which mostly consists of communications processors, it looks like they have what it takes to make bits move fast.


The problem is that ARM has never been a high data throughput architecture, since few devices that need to process a large stream of information also need to be mobile.

What, the Acorn Archimedes was supposed to be mobile? And if it was a low data throughput architecture, does that mean the three-times-slower 80386 chips had an ultra-low data throughput architecture?


Can you link to sources for that data? From what I remember, the 80386s and Acorns had roughly the same MIPS rating, but I can't find datasheets comparing bus speeds.

Either way, yes I'm wrong, they started out as a non-mobile processor company.


> The business case for serving web content isn't as strong as you might think from a back of the envelope calculation about cost/power vs. performance.

How do you feel about other applications of ARM in the datacenter? I seem to hear a lot about it, now.


ARM SoCs have big problems when it comes to memory throughput, which is a big part of a server's performance. I ran http://code.google.com/p/byte-unixbench/ on an Exynos 4412 SoC and some of the memory-dependent results were horrendous.
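(If anyone wants to repeat this: assuming the layout of that project hasn't changed, it builds and runs with something like the below; the 'UnixBench' directory name is from my recollection, not gospel.)

    cd UnixBench
    make
    ./Run    # runs the standard index suite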


Try a server-optimised SoC rather than a mobile one; the optimisations are somewhat different.


Am I being naive here? Isn't the whole point of virtualisation that I can put 50 "average" VMs onto a top-end server, or 200 "small" VMs, or thousands of lightweight LXC containers?

What limitation are ARM micro-servers going to overcome? CPU-bound tasks? No. Memory-bound tasks? Well, with memory extension techniques x86 can have huge quantities of memory, so no, not so much. IO-bound tasks, then? Maybe... Is there something else I missed?

Workload per watt? Power management in x86 processors, chipsets, and general server design is getting better all the time. Of course, there are things where FPGAs or ASICs will win every time... so what is it that I get with an ARM that I can't get with an FPGA + x86? Or is it more that ARM SoC + FPGA/ASIC is where things get attractive?


Memcached/Redis sound like something ARM CPUs might be good at.

Maybe execution performance predictability, and the better isolation of running a single instance on bare-metal ARMv8 vs. a hypervisor running 50 instances on x86? In my experience, performance varies wildly on virtualized systems. Maybe the x86 VM worst case can be worse than running on bare metal on ARM?

Anything where you need only a few instances and have relatively low performance requirements, like SOHO servers? I'd love to have something generic that consumes just a few watts but could do diverse tasks from routing, VPN, file serving, etc. at 500+ Mbps.

I guess it remains to be seen what kind of niche ARM servers will carve. I'm excited to try them out, to see how far they can be pushed.


Modern websites have a lot of moving parts, and they all have to move to ARM before you can think about changing your server architecture.

Do proprietary databases run on ARM? What about the language your web application is written in? Last I heard (~6 months ago), official ARM Java was available for the bleeding edge; I'm not sure it's made it into an official major release yet.

Then there's the incompatible softfloat / hardfloat divide, just to make life fun for those who write JIT compilers (and regular compilers).
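For the curious, the divide is a compile-time ABI choice (the GCC flags below are real; treat the exact -mfpu value as an assumption, since it varies by core):

    # soft-float calling convention: FP arguments go in integer registers
    gcc -mfloat-abi=softfp -mfpu=vfpv3 -c foo.c

    # hard-float calling convention: FP arguments go in VFP registers
    gcc -mfloat-abi=hard -mfpu=vfpv3 -c foo.c

Objects built for one ABI won't link against the other, and a JIT has to emit calls matching whichever ABI the process was built with.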


Ordinary Windows 8 on x86 has the same waves of instability the author remarks on for the ARM builds.

I've noticed weird and wonderful changes in Windows 8, such as: a situation that would previously have been an intermittent BSOD now causes an instant, deliberate reboot and an attempt to pretend, as much as possible, that it didn't happen.

I say it's about time.


What? There's still going to be a memory.dmp somewhere in c:\windows that'll tell you what happened.


Sure, but it's just the right thing, the Unix thing, to do.


They say it only costs $30M to make an ARM chip, but how much would it cost to make one that runs at the same practical speed as an x86? That's IMO the real question...


Oh, I'd also like an ARM that can emulate x86 at native speed, please. Otherwise a lot of stuff will stop running :)


I think he meant TSMC when he said TWSC.



