FMA doesn't stand for fast multiply add, it's fused multiply add. Fused means the instruction computes the entire a * b + c expression using twice as many mantissa bits, and only then rounds the result to the precision of the arguments.
It might be that the Prism emulator failed to translate FMA instructions into pairs of FMLA instructions (the equally fused ARM64 equivalent) and instead emulated that fused behaviour in software, which in turn is what degraded the performance of the AVX2 emulation.
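To make the single-rounding point concrete, here is a minimal C sketch (the specific values and the -ffp-contract=off suggestion are illustrative assumptions, not from the thread):

```c
#include <math.h>
#include <stdio.h>

/* Compile with FP contraction disabled (e.g. -ffp-contract=off) so the
   compiler doesn't silently fuse the a*b+c expression itself. */
int main(void) {
    double a = 1.0 + 0x1p-27;   /* 1 + 2^-27 */
    double b = 1.0 - 0x1p-27;   /* 1 - 2^-27 */
    double c = -1.0;

    double separate = a * b + c;     /* a*b rounds to 1.0 first, so the result is 0.0 */
    double fused    = fma(a, b, c);  /* single rounding keeps -2^-54 (about -5.55e-17) */

    printf("separate: %g\nfused:    %g\n", separate, fused);
    return 0;
}
```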
vintagedave 5 hours ago [-]
Author here - thanks - my bad. Fixed 'fast' -> 'fused' :)
I don't have insight into how Prism works, but I have wondered if the right debugger would see the ARM code and let us debug exactly what was going on for sure.
Const-me 5 hours ago [-]
You’re welcome. Sadly, I don’t know how to observe ARM assembly produced by Prism.
And one more thing.
If you test on an AMD processor, you will probably see much less profit from FMA. Not because it's slower, but because the SSE4 version will run much faster.
On Intel processors like your Tiger Lake, all three operations (addition, multiplication and FMA) compete for the same execution units. On AMD processors, multiplication and FMA share units as well, but addition is independent: on Zen 4, multiplication and FMA run on execution units FP0 or FP1, while addition runs on execution units FP2 or FP3. This way, replacing a multiply/add combo with FMA on AMD doesn't substantially improve throughput in FLOPs. The only win is L1i cache and instruction decoder pressure.
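For illustration, the substitution being discussed looks roughly like this with AVX/FMA intrinsics (a sketch, not code from the article):

```c
#include <immintrin.h>

/* Multiply-add the classic way: two instructions, two roundings. */
static inline __m256 madd_separate(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

/* The same expression as a single fused instruction (requires FMA3 support). */
static inline __m256 madd_fused(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);
}
```

On Intel both versions contend for the same FP ports, while on Zen 4 the separate add can issue on FP2/FP3 alongside multiplies and FMAs on FP0/FP1, which is why the fused form gains less there.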
malkia 4 hours ago [-]
You can ... to a degree - Google for "XtaCache"
kbolino 9 hours ago [-]
I suspected this was because the vector units were not wide enough, and it seems that is the case. AVX2 is 256-bit, ARM NEON is only 128-bit.
The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support? It's not like Intel/AMD came up with these extensions for x86 yesterday; AVX2 is over a decade old.
Aurornis 7 hours ago [-]
> The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support?
Very wide SIMD instructions require a lot of die space and a lot of power.
The AVX-512 implementation in Intel's Knight's Landing took up 40% of the die area (Source https://chipsandcheese.com/p/knights-landing-atom-with-avx-5... which is an excellent site for architectural analysis)
Most ARM desktop/mobile parts are designed to be low power and low cost. Spending valuable die space on large logic blocks for instructions that are rarely used isn't a good tradeoff for consumer apps.
Most ARM server parts are designed to have very high core counts, which requires small individual die sizes. Adding very wide SIMD support would grow die space of individual cores a lot and reduce the number that could go into a single package.
Supporting 256-bit or 512-bit instructions would be hard to do without interfering with the other design goals for those parts.
Even Intel has started dropping support for the wider AVX instructions in their smaller efficiency cores as a tradeoff to fit more of them into the same chip. For many workloads this is actually a good tradeoff. As this article mentions, many common use cases of high throughput SIMD code are just moving to GPUs anyway.
aseipp 5 hours ago [-]
Knights Landing is a major outlier; the cores there were extremely small and had very few resources dedicated to them (e.g. 2-wide decode) relative to the vector units, so of course that will dominate. You aren't going to see 40% of the die dedicated to vector register files on anything looking like a modern, wide core. The entire vector unit (with SRAM) will be in the ballpark of like, cumulative L1/L2; a 512-bit register is only a single 64 byte cache line, after all.
dlcarrier 5 hours ago [-]
Also, the Knights Landing/Knights Mill implementation is completely different from modern AVX-512. It's Ice Lake and Zen 4 that introduced modern AVX-512.
Aurornis 5 hours ago [-]
True! But even if only 20% of the die area goes to AVX-512 in larger cores, that makes a big difference for high core count CPUs.
That would be like having a 50-core CPU instead of a 64-core CPU in the same space. For these cloud native CPU designs everything that takes significant die area translates to reduced core count.
wtallis 2 hours ago [-]
You're still grossly overestimating the area required for AVX-512. For example, on AMD Zen4, the entire FPU has been estimated as 25% of the core+L2 area, and that's including AVX-512. If you look at the extra area required for AVX-512 vs 256-bit AVX2, as a fraction of total die area including L3 cache and interconnect between cores, it's definitely not going to be a double digit percentage.
wtallis 6 hours ago [-]
> The AVX-512 implementation in Intel's Knight's Landing took up 40% of the die area
That chip family was pretty much designed to provide just enough CPU power to keep the vector engines fed. So that 40% is an upper bound, what you get when you try to build a GPU out of somewhat-specialized CPU cores (which was literally the goal of the first generation of that lineage).
For a general purpose chip, there's no reason to spend that large a fraction of the area on the vector units. Something like the typical ARM server chips with lots of weak cores definitely doesn't need each core to have a vector unit capable of doing 512-bit operations in a single cycle, and probably would be better off sharing vector units between multiple cores. For chips with large, high-performance CPU cores (eg. x86), a 512-bit vector unit will still noticeably increase the size of a CPU core, but won't actually dwarf the rest of the core the way it did for Xeon Phi.
kbolino 6 hours ago [-]
The rarity of use is a chicken-egg problem, though. The hardware makers consider it a waste because the software doesn't use it, and the software makers won't use it because it's not widely supported enough. Apple and Qualcomm not supporting it at all on any of their hardware tiers just exacerbates it. I think this is a good explanation for why mobile devices lack it, and even why say a MacBook Air or Mac Mini lacks it, but not why a MacBook Pro or Mac Studio lacks it.
It does seem like server hardware is adopting SVE at least, even if it's not always paired with wider registers. There are lots of non-math-focused instructions in there that benefit many kinds of software that isn't transferable to a GPU.
formerly_proven 6 hours ago [-]
KNL is an almost 15-year-old uarch expressly designed to compete with dedicated SIMD processors (GPGPU); dedicating the die to vector units is the point of that chip.
happyPersonR 7 hours ago [-]
Yeah this seems likely, but with all the LLM stuff it might be an outdated assumption.
Buy new chips next year! Haha :)
hajile 5 hours ago [-]
Wider SIMD is a solution in search of a problem in most cases.
If your code can go wide and has few branches (uses SIMD basically every cycle), either a GPU or matrix co-processor will handily beat the performance of several CPU cores all running together.
If your code can go wide, but is branchy (uses bursts of SIMD between branches), wider becomes even less worth it. If it takes 4 cycles to put through a 256-bit SIMD instruction and you have some branches between the next one, using a 128-bit SIMD with 2 instructions will either have them execute in parallel at the same 4 cycles or even in the worst case, they will pipeline to 5 cycles (that's just a single instruction bubble in the FPU pipeline).
You can increase this differential by going to a 512-bit pipeline, but if it's just occasional 512-bit, you can still match with 4 SIMD units (The latest couple of ARM cores have 6 SIMD units) and while pipelining out from 4 to 7 cycles means you need at least 3-cycle bubbles to break even, this still doesn't seem too unusual.
The one area where this seems to be potentially untrue is simulations working with loads of f64 numbers which can consistently achieve high density with code just branchy enough to make GPUs be inefficient. Most of these workloads are running on supercomputers though and the ARM competitor here is the Fujitsu A64FX which does have 512-bit SVE.
It's also worth noting that even modern x86 chips (by both AMD and Intel) seem to throttle under heavy 512-bit multi-core workloads. Reducing the clockspeed in turn reduces the integer performance, which may make applications slower in some cases.
All of this is why ARM/Qualcomm/Apple's chips with 128-bit SIMD and a couple AMX/SME units are very competitive in most workloads even though they seem significantly worse on paper.
dlcarrier 5 hours ago [-]
Video encoding and image compression is a huge use case, and not at all uncommon, so much so that a lot of hardware has dedicated hardware for it. Of course, offloading the SIMD instructions to dedicated hardware accelerators does reduce usage of SIMD instructions, but any time a specific CODEC or algorithm isn't accelerated, then the SIMD instructions are absolutely necessary.
Emulators also use them a lot, often in unintended ways, because they are very flexible. This is partially because the emulator itself can use the flexibility to optimize emulation, but also because hand-optimizing with SIMD instructions can significantly improve performance of any application, which is necessary for the low-performance processors common in videogame consoles.
jsheard 9 hours ago [-]
SVE was supposed to be the next step for ARM SIMD, but they went all-in on runtime variable width vectors and that paradigm is still really struggling to get any traction on the software side. RISC-V did the same thing with RVV, for better or worse.
camel-cdr 8 hours ago [-]
> SVE was supposed to be the next step for ARM SIMD, but they went all-in on runtime variable width vectors and that paradigm is still really struggling to get any traction on the software side.
You can treat both SVE and RVV as a regular fixed-width SIMD ISA.
"runtime variable width vectors" doesn't capture well how SVE and RVV work. An RVV and SVE implementation has 32 SIMD registers of a single fixed power-of-two size >=128. They also have good predication support (like AVX-512), which allows them to masked of elements after certain point.
If you want to emulate avx2 with SVE or RVV, you might require that the hardware has a native vector length >=256, and then you always mask off the bits beyond 256, so the same code works on any native vector length >=256.
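A minimal sketch of that masking idea using the SVE ACLE intrinsics (assuming hardware whose native vector length is at least 256 bits; the function and variable names are made up):

```c
#include <arm_sve.h>

/* Add two 8-float (256-bit) blocks with SVE. The predicate keeps only the
   first 8 lanes active, so the same code behaves identically on any
   implementation with a native vector length >= 256 bits. */
void add8(const float *a, const float *b, float *dst) {
    svbool_t pg = svwhilelt_b32(0, 8);           /* lanes 0..7 active */
    svfloat32_t va = svld1_f32(pg, a);
    svfloat32_t vb = svld1_f32(pg, b);
    svst1_f32(pg, dst, svadd_f32_x(pg, va, vb));
}
```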
jsheard 7 hours ago [-]
> You can treat both SVE and RVV as a regular fixed-width SIMD ISA.
Kind of, but the part which looks particularly annoying is that you can't put variable-width vectors on the stack or pass them around as values in most languages, because they aren't equipped to handle types with unknown size at compile time.
ARM seems to be proposing a C language extension which does require compilers to support variably sized types, but it's not clear to me how the implementation of that is going, and equivalent support in other languages like Rust seems basically non-existent for now.
camel-cdr 7 hours ago [-]
> Kind of, but the part which looks particularly annoying is that you can't put variable-width vectors on the stack or pass them around as values in most languages, because they aren't equipped to handle types with unknown size at compile time
Yes, you can't, which is annoying, but you can if you compile for a specific vector length.
This is mostly a library structure problem. E.g. simdjson has a generic backend that assumes a fixed vector length. I've written fixed width RVV support for it.
A vector length agnostic backend is also possible, but requires writing a full new backend. I'm planning to write it in the future (I already have a few json::minify implementations), but it will be more work. If the generic backend used a SIMD abstraction that supports scalable vectors, like highway, this wouldn't be a problem.
Toolchain support should also be improved, e.g. you could make all vregs take 512 bits on the stack, but have the codegen only utilize the lower 128 bits if you have 128-bit vregs, 256 bits if you have 256-bit vregs and 512 bits if you have >=512-bit vregs.
jsheard 7 hours ago [-]
> Toolchain support should also be improved, e.g. you could make all vregs take 512 bits on the stack, but have the codegen only utilize the lower 128 bits if you have 128-bit vregs, 256 bits if you have 256-bit vregs and 512 bits if you have >=512-bit vregs.
SVE theoretically supports hardware up to 2048-bit, so conservatively reserving the worst-case size at compile time would be pretty wasteful. That's 16x overhead in the base case of 128-bit hardware.
arka2147483647 44 minutes ago [-]
Surely you could have compiler types for 128, 256, 512, etc., and then choose the correct codepath with a simple if statement at runtime?
pertymcpert 4 hours ago [-]
You can definitely put SVE vectors on the stack; there are special instructions to load and store them with variable offsets. What you can't do is put them into structs, which need to have concretely sized types (i.e. each subsequent element needs a known byte offset).
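A small illustration of those rules with the ACLE sizeless types (hypothetical names; the commented-out struct shows the case that doesn't compile):

```c
#include <arm_sve.h>

svfloat32_t scale(svfloat32_t v, float s) {
    svbool_t pg = svptrue_b32();
    svfloat32_t t = v;              /* OK: sizeless locals live on the stack and are
                                       spilled/filled with vector-length-aware
                                       loads and stores */
    return svmul_n_f32_x(pg, t, s); /* OK: can be passed and returned by value */
}

/* Not allowed: sizeless types can't be struct members, array elements, or
   operands of sizeof, because their size isn't known at compile time.
struct particle {
    svfloat32_t position;           // error: member has sizeless type
};
*/
```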
Tuldok 8 hours ago [-]
The only time I've encountered ARM SVE being used in the wild is in the FEX x86 emulator (https://fex-emu.com/FEX-2407/).
kbolino 9 hours ago [-]
Yeah, the extensions exist, and as pointed out by a sibling comment to yours, have been implemented in supercomputer cores made by Fujitsu. However, as far as I know, neither Apple nor Qualcomm have made any desktop cores with SVE support. So the biggest reason there's no desktop software for it is because there's no hardware support.
jsheard 8 hours ago [-]
ARM's Neoverse IP does support SVE, so it's at least already relevant in cloud applications. Apparently AWS Graviton3 had 256-bit SVE, but Graviton4 regressed back to 128-bit SVE for some reason?
The problem with SVE is that ARM vendors need to make NEON as fast as possible to stay competitive, so there is little incentive to implement SVE with wider vectors.
https://ashvardanian.com/posts/aws-graviton-checksums-on-neo...
Graviton3 has 256-bit SVE vector registers but only four 128-bit SIMD execution units, because NEON needs to be fast.
Intel previously was in such a dominant market position that they could require all performance-critical software to be rewritten thrice.
my123 7 hours ago [-]
The Oryon 3rd gen in the Snapdragon X2 has SVE2 (as does NVIDIA N1x, currently pre-launched of sorts on the DGX Spark)
justincormack 8 hours ago [-]
I think the CIX P1 has support, but I haven't got one yet to verify; it's a cheap SoC.
otherjason 8 hours ago [-]
The only CPU I've encountered that supports SVE is the Cortex-X925/A725 that is used in the NVIDIA DGX Spark platform. The vector width is still only 128 bits, but you do get access to the other enhancements the SVE instructions give, like predication (one of the most useful features from Intel's AVX512).
0x000xca0xfe 8 hours ago [-]
RISC-V chip designers at least seem to be more bullish on vectors. There is seriously cool stuff coming like the SpacemiT K3 with 1024-bit vectors :)
camel-cdr 8 hours ago [-]
The 1024-bit RVV cores in the K3 are mostly that size to feed a matmul engine. While the vector registers are 1024-bit, the two execution units are only 256-bit wide.
The main cores in the K3 have 256-bit vectors with two 128-bit wide execution units, and two separate 128-bit wide vector load/store units.
See also: https://forum.spacemit.com/uploads/short-url/60aJ8cYNmrFWqHn...
But yes, RVV already has more diverse vector width hardware than SVE.
0x000xca0xfe 6 hours ago [-]
It's a low clocked (2.1GHz) dual-issue in-order core so obviously nowhere near the real-world performance of e.g. Zen5 which can retire multiple 256-bit or even 512-bit vector instructions per cycle at 5+ GHz.
But I find the RVV ISA just really fascinating. Grouping 8 1024-bit registers together gives us 8192-bit or 1-kilobyte registers! That's a tremendous amount of work that can be done using a single instruction.
Feels like the Lanz bulldog of CPUs. Not sure how practical it will be after all, but it's certainly interesting.
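For a feel of what that register grouping looks like in code, here is a sketch using the RVV C intrinsics (assuming the v1.0 __riscv_* naming; names are illustrative):

```c
#include <riscv_vector.h>

/* Scale an array using LMUL=8 register groups: each vector operation works
   on a group of 8 vector registers at once, i.e. 8192 bits (1 KiB) of vector
   state per operand on a 1024-bit implementation. */
void scale(float *x, size_t n, float s) {
    for (size_t vl; n > 0; n -= vl, x += vl) {
        vl = __riscv_vsetvl_e32m8(n);            /* lanes handled this pass */
        vfloat32m8_t v = __riscv_vle32_v_f32m8(x, vl);
        v = __riscv_vfmul_vf_f32m8(v, s, vl);
        __riscv_vse32_v_f32m8(x, v, vl);
    }
}
```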
brigade 6 hours ago [-]
ARM favored wider ILP and mostly symmetric ALUs, while x86 favored wider and asymmetric ALUs
Most high-end ARM cores were 4x128b FMA, and Cortex-X925 goes to 6x128b FMA. Contrast that to Intel that was 2x256b FMA for the longest, then 2x512b FMA, with another 1-2 pipelines that can't do FMA.
But ultimately, 4x128b ≈ 2x256b, and 2x256b < 6x128b < 2x512b in throughput. Permute is a different factor though, if your algorithm cares about it.
phonon 9 hours ago [-]
Well, you can always use a Fujitsu A64FX...let me check eBay.. :-)
Cold_Miserable 5 hours ago [-]
AVX2 isn't really 256-bit.
It's 2x128-bit.
leeter 9 hours ago [-]
[removed]
kbolino 8 hours ago [-]
Part of the reason, I think, is that Qualcomm and Apple cut their teeth on mobile devices, and yeah wider SIMD is not at all a concern there. It's also possible they haven't even licensed SVE from Arm Holdings and don't really want to spend the money on it.
In Apple's case, they have both the GPU and the NPU to fall back on, and a more closed/controlled ecosystem that breaks backwards compatibility every few years anyway. But Qualcomm is not so lucky; Windows is far more open and far more backwards compatible. I think the bet is that there are enough users who don't need/care about that, but I would question why they would even want Windows in the first place, when macOS, ChromeOS, or even GNU/Linux are available.
jovial_cavalier 8 hours ago [-]
A ton of vector math applications these days are high dimensional vector spaces. A good example of that for arm would I guess be something like fingerprint or face id.
Also, it doesn't just speed up vector math. Compilers these days with knowledge of these extensions can auto-vectorize your code, so it has the potential to speed up every for-loop you write.
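As a simple example of the kind of loop compilers auto-vectorize when a SIMD target is enabled (a generic sketch, not from the comment):

```c
/* With optimization enabled and a SIMD target selected (e.g. AVX2 or NEON),
   compilers will typically turn this scalar loop into vector loads, vector
   multiply-adds, and vector stores, processing several elements per iteration.
   The restrict qualifiers tell the compiler the arrays don't alias. */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```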
josefx 8 hours ago [-]
> A good example of that for arm would I guess be something like fingerprint or face id.
So operations that are not performance critical and are needed once or twice every hour? Are you sure you don't want to include a dedicated cluster of RTX 6090 Ti GPUs to speed them up?
jovial_cavalier 6 hours ago [-]
I'd argue that those are actually very performance critical because if it takes 5 seconds to unlock your phone, you're going to get a new phone.
The point is taken, though, that seemingly the performance is fine as it is for these applications. My point was only that you don't need to be running state of the art LLMs to be using vector math with more than 4 dimensions.
pertymcpert 4 hours ago [-]
Those are extremely performance critical operations. A lot of people use their phone many times an hour.
bhouston 8 hours ago [-]
Haven't there been issues with AVX2 causing such a heavy load on the CPU that frequency scaling would kick in, in a lot of cases slowing down the whole CPU?
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...
My experience is that trying to get benefits from the vector extensions is incredibly hard and the use cases are very narrow. Having them in a standard BLAS implementation, sure, but outside of that I think they are not worth the effort.
jsheard 8 hours ago [-]
Throttling was mainly an issue with AVX512, which is twice the width of AVX2, and only really on the early Skylake (2015) implementation. From your own source Ice Lake (2019) barely flinches and Rocket Lake (2021) doesn't proactively downclock at all. AMDs implementation came later but was solid right out of the gate.
kbolino 8 hours ago [-]
This is a bit short-sighted. Yes, it is kinda tricky to get right, and a number of programming languages are quite behind on good SIMD support (though many are catching up).
SIMD is not limited to mathy linear algebra things anymore. Did you know that lookup tables can be accelerated with AVX2? A lot of branchy code can be vectorized nowadays using scatter/gather/shuffle/blend/etc. instructions. The benefits vary, but can be significant. I think a view of SIMD as just being a faster/wider ALU is out of date.
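For instance, a sketch of the table-lookup point using the AVX2 gather intrinsic ('table' and 'idx' are hypothetical names; whether a gather beats scalar loads depends heavily on the microarchitecture):

```c
#include <immintrin.h>

/* Look up 8 table entries at once with an AVX2 gather.
   The scale of 4 is sizeof(int), since the indices address 32-bit elements. */
__m256i lookup8(const int *table, __m256i idx) {
    return _mm256_i32gather_epi32(table, idx, 4);
}
```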
kccqzy 8 hours ago [-]
That’s only on very old CPUs. Getting benefits from vector extensions is incredibly easy if you do any kind of data crunching. A lot of integer operations not covered by BLAS can benefit including modern hash tables.
vintagedave 6 hours ago [-]
Re hard to get benefits: a lot depends on the compiler. In Elements (the toolchain this article was tested with) we made a bunch of modifications to LLVM passes to prioritise vectorisation in situations where it could vectorise but previously did not.
I've heard anecdotally that the old pre-LLVM Intel C++ Compiler also focused heavily on vectorisation and had some specific tradeoffs to achieve it. I think they use LLVM now too and for all I know they've made similar modifications that we did. But we see a decent number of code patterns that can and now are optimised.
adgjlsfhk1 5 hours ago [-]
The modern approach is much more fine-grained throttling, so by the time it throttles you are already coming out ahead.
TheJoeMan 8 hours ago [-]
I tried searching "SSE2-4.x" and this is the top result in DDG and Google, so I was initially confused what instruction set the article is referring to. However, this appears to be shorthand for SSE2 through SSE4? Perhaps a rephrasing of the article title could be helpful.
vintagedave 7 hours ago [-]
Author here - yes, it's shorthand for the set of SSE2, SSE3, SSSE3 (not a typo), and SSE4 including SSE 4.1 and SSE 4.2. My bad for confusion!
That set matches the x86-64-v2 x64 microarchitecture level. Most of the article uses 'v2' or 'v3' or 'x86-64-v2', but I thought that more people would be familiar with the names of the instruction sets than with the fact that x64 is versioned. The versions only appeared quite recently (2020) and are rather retroactive.
cogman10 6 hours ago [-]
I read it as SSE2->4.x.
Generally speaking, when working with SSE instructions you'll end up using a mix of instructions from 2->4 as they are all effectively just additional operations on the SSE2 registers.
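A small sketch of that mixing (illustrative function, not from the thread): the loads and the add are SSE2, while the packed 32-bit multiply only arrived with SSE4.1, yet they all operate on the same 128-bit XMM registers.

```c
#include <smmintrin.h>  /* SSE4.1 header; pulls in SSE2/SSE3/SSSE3 as well */

__m128i mul_add(const int *a, const int *b, const int *c) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);   /* SSE2 */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);   /* SSE2 */
    __m128i vc = _mm_loadu_si128((const __m128i *)c);   /* SSE2 */
    return _mm_add_epi32(_mm_mullo_epi32(va, vb), vc);  /* SSE4.1 mul + SSE2 add */
}
```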
Aissen 9 hours ago [-]
Spoiler is in the conclusion:
> Yes, it is absolutely key to build your app as ARM, not to rely on Windows ARM emulation.
okanat 8 hours ago [-]
Is this actually surprising? Once you use stuff like vectorization you want to get as much performance out of a system. If you're not natively compiling for a system, you won't get any performance.
Using AVX2 and using an emulator have contradictory goals. Of course there can be a better emulator or actually matching hardware design (since both Apple and Microsoft actually exploit the similar register structure between ARM64 and x86_64). However, this means you have increased complexity and reduced reliability / predictability.
vintagedave 7 hours ago [-]
Author here - have to say, thanks for reading all the way to the end, you don't always see people do that ;)
I put a spoiler at the top too, to avoid trying to make people read the whole thing. The real bit is that chart, which I think is quite an amazing result.
You're right re building. We're a compiler vendor, so we have a natural interest in what people should be targeting. But even for us the results here were not what we expected ahead of time.
Aissen 7 hours ago [-]
Having written an emulator, the conclusion was a bit less surprising. It's also probably not definitive, as it might depend on the specific hardware (and future emulator optimizations); you even say in your blog that the hardware you use is not the hardware Microsoft targeted.
qingcharles 5 hours ago [-]
Is Chrome for Windows compiled for ARM too, or does it run on Windows under emulation?
The reason I ask is that I believe Windows Chrome is (like many Windows binaries) compiled with lots of the advanced CPU features disabled (e.g. AVX512) because they're not available on older PCs. Is that true?
mtklein 7 hours ago [-]
If I remember correctly, the AVX2 feature set is a fairly direct upscale of SSE4.1 to 256 bit. Very few instructions even allowed interaction between the top and bottom 128 bits, I assume to make implementation on existing 128 bit vector units easier. And the most notable new things that AVX2 added beyond that widening, fp16 conversion and FMA support, are also present in NEON, so I wouldn't expect that to be the issue either.
So I'd bet the issue is either newness of the codebase, as the article suggests, or perhaps that it is harder to schedule the work in 256 bit chunks than 128. It's got to be easier when you've got more than enough NEON q registers to handle the xmms, harder when you've got only exactly enough to pair up for handling ymms?
spacecadet_ 6 hours ago [-]
> Very few instructions even allowed interaction between the top and bottom 128 bits
That would be plain AVX, AVX2 has shuffles across the 128-bit boundary. To me that seems like the main hurdle for emulation with 128-bit vectors, in my experience compilers are very eager to emit shuffle instructions if allowed, and emulating a 256-bit shuffle with 128-bit operations would require 2 shuffles and a blend for each half of the emulated register.
EDIT: I just noticed that the benchmark in the article is pure math which probably wouldn't hit this particular issue, so this doesn't explain the performance difference...
ack_complete 6 hours ago [-]
There are also mode switching and calling convention issues.
The way that the vector registers were extended to 256-bit causes problems when legacy 128-bit and 256-bit ops are mixed. Doing so puts the CPU into a mode where all legacy 128-bit ops are forced to blend the high half, which can reduce throughput of existing SSE2-based library routines to as low as 1/4 throughput. For this reason, AVX code has to aggressively use the VZEROUPPER instruction to ensure that the CPU is not left in AVX 256-bit vector mode before possibly returning to any library or external code that uses SSE2. VZEROUPPER sets a flag to zero the high half of all 256-bit registers, so it's cheap on modern x86 CPUs but can be expensive to emulate without hardware support.
The other problem is that only the low 128 bits of vector registers are preserved across function calls due to the Windows x64 calling convention and the VZEROUPPER issue. This means that practically any call to external code forces the compiler to spill all AVX vectors to memory. Ideally 256-bit vector usage is concentrated in leaf routines so this isn't an issue, but where used in non-leaf routines, it can result in a lot of memory traffic.
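A rough sketch of the convention being described (illustrative code; compilers emit the zero-upper step automatically when generating AVX code, but an emulator still has to reproduce its effect):

```c
#include <immintrin.h>

void scale256(float *dst, const float *src, float s, int n) {
    __m256 vs = _mm256_set1_ps(s);
    for (int i = 0; i + 8 <= n; i += 8)  /* tail elements omitted for brevity */
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(_mm256_loadu_ps(src + i), vs));

    /* Zero the upper halves of the YMM registers before returning to code
       that may run legacy SSE instructions, avoiding the mixed-state penalty. */
    _mm256_zeroupper();
}
```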
rbanffy 3 hours ago [-]
When doing feature detection for execution path selection, it’s sometimes useful to run some quick benchmarks to see which path is objectively best.
Now we have two-ish implementations of x86, but back in the 1980s and 1990s we had quite a few, some with wildly different performance characteristics.
And, if we talk about ARM and RISC-V, we’ll have an order of magnitude more.
LeoNatan25 5 hours ago [-]
Any equivalent look at Apple's Rosetta 2? Perhaps if the author has time and availability of hardware, they can have a similar look. Rosetta 2 is going away next year, and it's a shame, if only for purely technical reasons. Apple will never open source it.
vintagedave 5 hours ago [-]
My daily driver is an M2 Mac, and we added the same set of optimisations to ARM on Mac as we did to ARM Windows (at the same time as the Intel Windows build whose emulation we measured in this blog). More info: https://blogs.remobjects.com/2026/01/26/fast-math-in-six-lan...
We did not try to especially optimise Intel Mac, but it's very tempting to do so in order to look at it as you ask.
I wish Rosetta was open sourced too. Same with Prism. I think any and all translation tech could only benefit everyone.
LeoNatan25 5 hours ago [-]
Well, Prism is likely to be with us for a decade, if not more, since Microsoft actually cares about backwards compatibility, whereas Apple much less so, and I guess, to them, we're at a "good enough" state. So Microsoft releasing it is less likely, but Apple could, especially after it is done. But I suspect some asshole sees a "competitive advantage" somewhere, and won't sign off a source release. What a gut punch for the team that worked on it.
Looking forward to a future look at Rosetta 2. Thanks!
crest 8 hours ago [-]
I wouldn't be surprised for SSE4 to be the fastest, because it's easiest to map to NEON: both use 128-bit registers and offer a fairly similar feature set.
vintagedave 7 hours ago [-]
Author here - agreed, we have been speculating that too.
iberator 9 hours ago [-]
AVX2 should be banned anyway.
Only expensive CPUs have it, ruining minimum game requirements and making hardware obsolete.
Most of the world lives off $300 per month.
Other Settings > AVX2 > 95.11% supported (+0.30% this month)
Tuldok 9 hours ago [-]
I, too, hate progress. By the way, the AMD Athlon 3000G system I helped build for a friend has AVX2. Even the old HP T630 thin client (https://www.parkytowers.me.uk/thin/hp/t630/) I bought for $15 as a home network router has AVX2.
thrtythreeforty 9 hours ago [-]
Au contraire: AVX2 is the vector ISA for x86 that doesn't suck. And it's basically ubiquitous at this point.
cogman10 6 hours ago [-]
Any x86 CPU manufactured in the last 10 years has AVX2.
Here's a laptop for $350 which has a CPU with AVX2 support:
https://ebay.us/m/yoznZ1
Almost every x86 cpu made in the last decade should have avx2.
Maybe you're thinking of avx512 or avx10?
jorvi 8 hours ago [-]
Yeah, sounds like they're confusing AVX2 for AVX512. AVX2 has been common for a decade at least and greatly accelerates performance.
AVX512 is so kludgy that it usually leads to a detriment in performance due to the extreme power requirements triggering thermal throttling.
kimixa 8 hours ago [-]
AMD's implementation very much doesn't have that issue - it throttles slightly, maybe, but it's still a net benefit. The problem with Intel's implementation is that the throttling was immediate - and took noticeable time to then settle and actually start processing again - from any avx512 instruction, so the "occasional" avx512 instruction (in autovectorized code, or something like the occasional optimized memcpy or similar) was a net negative in performance. This meant that it only benefited large chunks of avx512-heavy code, where that switching penalty could be overcome.
But there's plenty in avx512 that really helps real algorithms outside the 512-wide registers - I think it would be perceived very differently if it had initially been the new instructions on the same 256-wide registers - ie avx10 - in the first place, then extended to 512 as the transistor/power budgets allowed. AVX512 just tied too many things together too early, rather than arriving as incremental extensions.
AVX512 leading to thermal throttling is a common myth that, from what I can tell, traces its origins to a blog post about clock throttling on a particular set of low-TDP SKUs from the first generation of Xeon CPUs that supported it (Skylake-X), released over a decade ago: https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...
The results were debated shortly after by well-known SIMD authors who were unable to duplicate them: https://lemire.me/blog/2018/08/25/avx-512-throttling-heavy-i...
In practice, this has not been an issue for a long time, if ever; clock frequency scaling for AVX modes has been continually improved in subsequent Intel CPU generations (and even more so in AMD Zen 4/5 once AVX512 support was added).
adrian_b 7 hours ago [-]
That was true only for the 14-nm Intel Skylake derivatives, which had very bad management of the clock frequency and supply voltage, so they scaled down the clock prophylactically, for fear that they would not be able to prevent overheating fast enough.
All AMD Zen 4 and Zen 5 and all of the Intel CPUs since Ice Lake that support AVX-512, benefit greatly from using it in any application.
Moreover the AMD Zen CPUs have demonstrated clearly that for vector operations the instruction-set architecture really matters a lot. Unlike the Intel CPUs, the AMD CPUs use exactly the same execution units regardless whether they execute AVX2 or AVX-512 instructions. Despite this, their speed increases a lot when executing programs compiled for AVX-512 (in part for eliminating bottlenecks in instruction fetching and decoding, and in part because the AVX-512 instruction set is better designed, not only wider).
corysama 6 hours ago [-]
In gamedev it takes 7-10 years before you can require a new tech without getting a major backlash. AMD came out with AVX2 support in 2015. And, the (vocal minority) petitions to get AVX2 requirements removed from major games and VR systems are only now starting to quiet down.
So, in order to make use of users' new fancy hardware without abandoning other users' old and busted hardware, you have to support multiple back-ends. Same as it ever was.
Actually, a lot easier than it ever was today. Doom 3 famously required Carmack to reimplement the rendering 6 times to get the same results out of 6 different styles of GPUs that were popular at the time.
ARB Basic Fallback (R100), Multi-pass, Minimal effects, no specular.
NV10 GeForce 2 / 4 MX, 5 Passes, Used Register Combiners.
NV20 GeForce 3 / 4 Ti, 2–3 Passes, Vertex programs + Combiners.
R200 Radeon 8500–9200, 1 Pass, Used ATI_fragment_shader.
NV30 GeForce FX Series, 1 Pass, Precision optimizations (FP16).
ARB2 Radeon 9500+ / GF 6+, 1 Pass, Standard high-end GLSL-like assembly.
https://community.khronos.org/t/doom-3/37313
I think that's slightly old information as well, AVX512 works well on Zen5.
SecretDreams 8 hours ago [-]
Agree. It's only recently with modern architectures in the server space that avx512 has shown some benefit. But avx2 is legit and has been for a long time.
winstonwinston 8 hours ago [-]
Not really: the Intel Celeron/Pentium/Atom (Apollo Lake) parts made in the last decade do not have AVX. These CPUs were very popular for low-cost, low-TDP quad-core machines such as the Intel NUC mini PC.
Edit: Furthermore, I think that none of these (pre-2020) low-budget CPUs support AVX2, until Tiger Lake released in 2020.
nwellnhof 8 hours ago [-]
I think the last Intel CPUs that didn't support AVX were 10th gen (Comet Lake) Pentiums and Celerons, released in 2019.
Edit: That's wrong. Jasper Lake from 2021 also came without AVX support.
my123 7 hours ago [-]
It took until Alder Lake-N for the atom-grade stuff to have AVX2 across the board.