
Interestingly enough, AMCC, the company making the ARM chips in question, is allegedly rejecting the "wimpy cores" idea. So these processors might have significantly better performance than what's been associated with ARM before. Clock speeds are said to be 2.4-3.0 GHz; the real-world performance remains to be seen.


Clock speed doesn't mean anything if the communication peripherals (technically buses, not peripherals, in processors) can't feed data from disk, RAM, and Ethernet fast enough. The problem is that ARM has never been a high-data-throughput architecture, since few devices that need to process a large stream of information also need to be mobile. In the early days, Intel won because it crushed everyone on bus speeds, cycles per instruction, etc., and now ARM is crushing it on low power consumption. The problem is that ARM is nowhere near Intel's performance in real-world scenarios, despite needing fewer watts per MIPS.

I don't know of any benchmarks quantifying the difference, but I once compared a stripped-down Android on a Transformer Prime (Tegra, 1.6 GHz quad-core; don't remember if it was DDR2 or DDR3) with Angstrom on a roughly equivalent Intel Atom COM Express module (all code running was C/C++, not Java). In almost all of the tests I ran (all specific to my own use cases for web apps, CV, FEA, and market analysis for EVE Online), the Atom processor blew away the Tegra, because the Tegra cores spent a lot more time idle waiting for peripherals.

Edit: Forgot to mention that this is true (afaik) only with operating-system overhead and mismanagement on ARM chips. If you use a microkernel and optimize your code for all of the peripherals of the processor (i.e. using DMA directly instead of letting the kernel fuck it up) you can get ridiculous speed increases over Intel stuff. Of course, this is true only as long as Intel processors are as inaccessible as they are now. If they start selling surface-mount soldered chips then you could do away with the OS overhead on Intel too.


I've recently benchmarked RAM bandwidth and latency on computers based on the i.MX53 (Cortex-A8), Exynos4 (Cortex-A9) and Exynos5 (Cortex-A15). RAM throughput increases by something like 20 times between the A8 and A15 platforms. So while it's true that ARM systems used to suffer from low throughput, this is something they've been working on for some time, and the results look quite good from where I'm standing. I found this microbenchmark[0] and (looking at memcpy) the results for new ARM systems look alright to me.

Of course, this is an oversimplification, but I'm happy to go into more details if there's interest.
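For the curious, the general shape of such a memory-bandwidth microbenchmark, sketched in Python for brevity (the linked project is in C; the buffer size and iteration count here are arbitrary assumptions):

```python
import time

def copy_bandwidth(buf_size=64 * 1024 * 1024, iterations=8):
    """Time repeated bulk copies and report throughput in GB/s."""
    src = bytearray(buf_size)   # zero-filled, so the pages are already touched
    dst = bytearray(buf_size)
    start = time.perf_counter()
    for _ in range(iterations):
        dst[:] = src            # bulk slice assignment, a memcpy under the hood
    elapsed = time.perf_counter() - start
    return buf_size * iterations / elapsed / 1e9

if __name__ == "__main__":
    print(f"{copy_bandwidth():.2f} GB/s")
```

The key detail is that the buffers must be far larger than the last-level cache, or you end up measuring the cache instead of RAM.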

Since the first ARMv8 cores are built for server workloads, I expect they're bringing significant improvements compared to the A15.

Disclaimer: Not associated with ARM, but I have an interest in ARM being used more for things other than mobile devices.

[0] https://github.com/c2h2/arm_c_benchmark


Kudos for actually trying out a benchmark, but it must be seriously flawed. There's no way you'd get a 20x boost in bandwidth between the A8 and A15 (both of which are ARMv7, and DDR3 is not 20x faster than DDR2) in nontrivial cases if you're using technology from the last 5 years. I'm guessing you ran a very trivial benchmark that operated mostly out of the L1/L2 cache (hence you weren't really testing the RAM) on the Exynos5 and mostly out of RAM on the i.MX53. Depending on what operating system you ran on the different boards, there could also be major differences in the kernel's implementation of the DMA peripheral, which would also heavily skew results.

For example, in the GitHub link, some code used native memcpys while other code used kernel-call memcpys. The differences between the specific Ubuntu 13.04 and Android implementations could vary the results quite a bit, even if the hardware overhead is exactly the same.


> There's no way you'd get a 20x boost in bandwidth between the A8 and A15 (both of which are ARMv7 and DDR3 is not 20x faster than DDR2)

There are a few factors at work here:

* The A15 was designed to have more memory bandwidth in the first place, since this was a known issue. I think the bottleneck was (is?) the speed of the AMBA bus[0] connecting the peripherals (including the RAM controller) and the ARM core.

* The A15 has a vastly more advanced pipeline and wider multiple-issue capability than the A8. This should allow it to use its functional units more efficiently.

* Finally, my A15 system is running at 1.7 GHz compared to 1 GHz for the A8 system.

> I'm guessing you ran a very trivial benchmark that operated mostly out of the L1/L2 cache

The buffers were pre-initialized (to remove the kernel's lazy physical memory allocation as a factor), and far larger than the L2. The data caches were explicitly flushed at the beginning. But indeed it was very simple; it was a microbenchmark meant only to measure memory bandwidth.
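The pre-initialization step matters because anonymous memory is allocated lazily: physical pages only appear when first written. A rough illustration in Python, using an anonymous mmap (the size is an arbitrary assumption, just large enough to dwarf an L2):

```python
import mmap
import time

SIZE = 32 * 1024 * 1024        # arbitrary; just needs to be much bigger than L2

buf = mmap.mmap(-1, SIZE)      # anonymous mapping; pages are allocated lazily
zeros = b"\x00" * SIZE         # source data, built once so it isn't timed

t0 = time.perf_counter()
buf.write(zeros)               # first touch: every page faults into existence
first = time.perf_counter() - t0

buf.seek(0)
t0 = time.perf_counter()
buf.write(zeros)               # pages are resident now: this measures bandwidth
second = time.perf_counter() - t0

print(f"first touch: {first:.4f}s, pre-faulted: {second:.4f}s")
```

Skipping the first pass would fold the kernel's page-fault handling into the measurement, which is exactly the skew being avoided here.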

> Depending on what operating system you ran on the different boards, there could also be major differences in the kernel's implementation of the DMA peripheral.

I wasn't using DMA.

> For example, in the Github link, some code used native memcpys while others used kernel call memcpys.

To clarify: the project I've linked isn't mine, it's a public project that supported my point. But as far as I'm aware there's no memcpy system call; memcpy is implemented completely in libc. For example, glibc has multiple optimized implementations written in assembly[1].
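That memcpy is an ordinary libc symbol rather than a kernel entry point is easy to poke at; a small sketch calling it directly via ctypes (assumes Linux/glibc, where `CDLL(None)` hands back the already-loaded C library):

```python
import ctypes

libc = ctypes.CDLL(None)             # the C library already loaded in this process
libc.memcpy.restype = ctypes.c_void_p

src = ctypes.create_string_buffer(b"hello, memcpy")
dst = ctypes.create_string_buffer(len(src.raw))

# A plain library call: no kernel transition happens here.
libc.memcpy(dst, src, len(src.raw))

print(dst.value)                     # b'hello, memcpy'
```

Running such a program under strace shows no memcpy-related syscalls, which is the point: its speed is down to the libc build, not the kernel.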

But you're right, the glibc version is different between these computers, so I need to repeat with a statically compiled benchmark.

Anyway, I've looked at this while trying to improve memcpy speed for A15, which is why I don't have solid comparative results. But I'm doing a write-up and now I'll probably also include this bit.

[0] https://en.wikipedia.org/wiki/Advanced_Microcontroller_Bus_A...

[1] http://sourceware.org/git/?p=glibc.git;a=tree;f=ports/sysdep...


Huh wow, I guess I'm behind the times. Didn't know they made that big of a jump within ARMv7. This makes me curious to look at the datasheets.

From what I can tell, the i.MX53[1] has a 64-bit AXI @ 200 MHz. The Exynos5[2], on the other hand, has 64-bit AXIs and also a ton of optimizations. The LCD spec says it operates off a 200 MHz AXI, so I wouldn't be surprised if the Exynos5 uses dual 200 MHz AXIs for memory. It's tough to tell how much of that 20x comes from the clock speeds and how much from the optimizations. I agree the A15 is just a sign of things to come: with ARMv8, even though it carries legacy stuff, ARM still gets to improve the architecture a lot more specifically for servers.

For comparison, these guys [3] say 12.8 GB/s, which kinda sounds ridiculous. If it's true, they really are in range of Intel. The Sandy Bridge Xeon E3-1220 boasts a theoretical 21 GB/s @ a whopping 80 watts (although the high-end E7-8870 [4] is a fucking monster; at this point I can't even begin to think how they compare with all of Intel's memory stuff). The numbers are clearly within the ballpark; we'll just have to see a real-world case. I'm curious how well GCC and the kernels utilize all the unique processor features and optimizations. If it's mostly down to market forces, Intel's 96% market share might make it a long and difficult journey to an ARM Linux kernel optimized as well for a server environment as the x86/x64 ports.
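Those headline numbers fall out of simple arithmetic: peak bandwidth is transfer rate × bus width × channels. A quick sketch (the specific DDR3 speed grades below are my guesses for these parts, not from the datasheets):

```python
def peak_bandwidth_gbs(transfers_per_sec, bus_bits, channels=1):
    """Theoretical peak DRAM bandwidth in GB/s: rate x width x channels."""
    return transfers_per_sec * (bus_bits / 8) * channels / 1e9

# 64 bits of DDR3-1600 (assumed for the Exynos 5 figure)
print(peak_bandwidth_gbs(1600e6, 64))       # 12.8
# Dual-channel 64-bit DDR3-1333 (assumed for the Xeon E3 figure)
print(peak_bandwidth_gbs(1333e6, 64, 2))    # ~21.3
```

So 12.8 GB/s and ~21 GB/s are just the theoretical ceilings of the respective memory interfaces; sustained real-world throughput is always some fraction of that.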

I'm too lazy to check what that assembly code is, but if it uses NEON PLD (prefetch) optimizations (NEON is definitely in the Exynos5) that may give a speed bump in memcpy (even if you preloaded the buffers), because those optimizations would use the L1/L2 cache intelligently. It's hard to tell, without looking at the code and diving into the i.MX53's feature set more, whether that played a factor.

Also, I was thinking of malloc and free, which can end up in system calls (brk/mmap) because of paging. Memcpy is a straight pointer-to-pointer copy, except maybe for whatever that assembly code does.

Time to just find a fluffy article on this topic [5]

[1] http://www.freescale.com/files/32bit/doc/data_sheet/IMX53IEC...

[2] http://www.samsung.com/global/business/semiconductor/file/pr...

[3] http://www.maximumpc.com/article/news/samsung_details_exynos...

[4] http://ark.intel.com/products/53580/Intel-Xeon-Processor-E7-...

[5] http://www.theregister.co.uk/2011/10/20/details_on_big_littl...


If one is designing a whole server architecture from the ground up, it's possible to connect smart components directly to each other, maybe even using different buses, and have the main processor issue only high-level commands to the components themselves, mainframe style.

One of the things that plagues x86 servers is that they are overgrown PCs. A machine designed to be a web or Samba server could be very different from a machine designed to run Windows, even if, from the application's point of view, it's just a regular server, with all the exotic stuff nicely hidden under the OS, within its device drivers.

It'll be fun to see what gets invented in this space.


I haven't looked at ARMv8 in depth but I doubt it's a "from the ground up" architecture, let alone one specifically for servers. I'm pretty sure that ARMv8 is a microarchitecture anyway, which means it's probably stuck with a ton of the same (or incrementally improved) IP cores except for the critical peripherals (memory manager, cache, etc). The flexibility that ARM has because of the speed of their microarchitecture iteration process is fantastic but I don't think it's enough to compete with Intel's x86/64 architecture (and Xeon microarchitecture).

I agree that Intel chips are overgrown for simple stuff like that, but there's no other option: you either make the general commodity cheap or increase the cost across the board for specialized designs. The question, though, is: until ARM chips are as hefty as Intel's, is this legacy overhead from personal computing enough to wipe out Intel's advantage over ARM long-term? I.e., if ARM can't get yields as good as Intel's (which means ARM's silicon chips have to be physically smaller), this lack of overhead might be outweighed by the overhead of communicating between more processors or running more operating systems per [whatever] of computing power.


I think you are conflating the ISA and the microarchitecture. ARMv8 is an ISA, which is implemented by ARM Cortex-A5x series microarchitecture and several others.

Most importantly, the X-Gene (which these servers are made of) is afaik not based on ARM IP cores, and is indeed built "from the ground up" for datacenter purposes.


Will have to read up on this X-gene stuff.

I said microarchitecture because ARMv8 is incremental over ARMv7 and not a from-the-ground-up design, but you're right, it's an architecture; I think the terms get used interchangeably. Implementing 64-bit is a big leap, but since it's also backwards compatible, it carries improved/specialized ARMv7 components and instruction set (plus X-Gene and other custom cores per manufacturer/microarchitecture).


I wasn't talking about micro-architecture. What I was imagining were more intelligent peripherals that could better offload a relatively underpowered CPU with specialized hardware communicating between its parts.


I just poked around the AMCC website. Apparently they are claiming 80 GB/s memory bandwidth and a 10GbE NIC on-chip. They also say something about a "coherent terabit fabric" and "ultra-low latency", so it sounds like they really are tackling the comms seriously. And looking at their product portfolio, which mostly consists of communications processors, it looks like they have what it takes to make bits move fast.


> The problem is that ARM has never been a high data throughput architecture, since few devices that need to process a large stream of information also need to be mobile.

What, the Acorn Archimedes was supposed to be mobile? And if it was a low-data-throughput architecture, does that mean the three-times-slower 80386 chips had an ultra-low-data-throughput architecture?


Can you link to sources for such data? From what I remember the 80386s and Acorns had roughly the same MIPS rating, but I can't find datasheets comparing bus speeds.

Either way, yes I'm wrong, they started out as a non-mobile processor company.




