Floor and Ceil versus Denormals on CPU and GPU (asawicki.info)

36 points by ibobev 4 days ago | 14 comments

petermcneeley 35 minutes ago [-]

WebGPU (WGSL) handles this by having a specified accuracy for each operation.

https://www.w3.org/TR/WGSL/#concrete-float-accuracy

This is all fully tested in the CTS.

https://gpuweb.github.io/cts/standalone/?q=webgpu:shader,*

kevmo314 4 hours ago [-]

> This is not the first time we can see Nvidia taking shortcuts to achieve maximum performance of their GPUs

Why is implementing it correctly not performant? For context I have no idea how rounding is typically implemented anyways.

adrian_b 39 minutes ago [-]

It is not correct because it does not implement the FP arithmetic standard and this can lead to much greater numerical errors than expected.

NVIDIA is not solely responsible, because the Microsoft DirectX specification includes the non-standard behavior.

Nevertheless, as shown in TFA, both the AMD and Intel GPUs allow the user to choose between correct behavior and incorrect behavior that might be faster, while NVIDIA ignores what the user requests and implements only the non-standard behavior.

The developers of graphics or ML/AI applications do not care about errors, but there are also people who want to use GPUs for normal computations, where the accuracy of the results matters, so they want to be able to choose between correct behavior and incorrect but faster behavior.

Actually "faster" is a misnomer, because denormals can be handled correctly without diminishing the speed, but that costs additional die area. Thus what NVIDIA gains by not implementing the right behavior is a reduced production cost.
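To illustrate why floor and ceil are so sensitive to how denormals are treated, here is a minimal Python sketch. The `flush_to_zero` helper is a hypothetical software model of a flush-denormals-to-zero mode, not the behavior of any particular GPU or API, and it assumes IEEE 754 double precision as used by Python floats.

```python
import math
import sys

def flush_to_zero(x: float) -> float:
    """Hypothetical model of a flush mode: any denormal becomes a signed zero."""
    if x != 0.0 and abs(x) < sys.float_info.min:
        return math.copysign(0.0, x)
    return x

tiny = 5e-324  # smallest positive denormal double (2**-1074)

# IEEE 754 behavior: a denormal is a nonzero value, so it rounds away from 0.
ieee_ceil = math.ceil(tiny)     # 1
ieee_floor = math.floor(-tiny)  # -1

# With denormal inputs treated as zero, both results collapse to 0.
flushed_ceil = math.ceil(flush_to_zero(tiny))
flushed_floor = math.floor(flush_to_zero(-tiny))

print(ieee_ceil, flushed_ceil, ieee_floor, flushed_floor)
```

The input differs from zero by an amount that is negligible in most contexts, yet the result differs by a full unit, which is why these two functions make the flushing visible.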

Dwedit 55 minutes ago [-]

Denormals happen to be the way that Zero can even be represented at all?
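For reference, a small Python sketch of the bit layout behind this question, assuming IEEE 754 binary64. Zero and denormals share the all-zero exponent field, but zero has an all-zero mantissa and is normally classified separately, so it stays representable even when denormals are flushed. The `fields` helper is a name chosen for this example.

```python
import struct
import sys

def fields(x: float):
    """Split a double into (sign, exponent field, mantissa field)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return sign, exponent, mantissa

zero = fields(0.0)                          # exponent 0, mantissa 0
smallest_denormal = fields(5e-324)          # exponent 0, mantissa 1
smallest_normal = fields(sys.float_info.min)  # exponent 1, mantissa 0

print(zero, smallest_denormal, smallest_normal)
```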

crote 5 hours ago [-]

Another thing to keep in mind is that CPU processing of denormals tends to be extremely slow - I vaguely recall running into something like a 10x slowdown a decade ago.

For a lot of applications the difference between a denormal and zero is small enough to be irrelevant, so if you expect near-zero values to be common, enabling a denormals-to-zero compiler flag might give you a pretty nice performance boost for free.

adrian_b 31 minutes ago [-]

Denormal processing is slow only on certain CPUs, where the designers have been lazy, so denormals, when encountered, are handled by a microprogrammed sequence.

During the last half century there have been plenty of CPUs where denormals have been handled in hardware, so that any slowdown caused by them is negligible.

Except for generating graphic images seen by humans or in ML/AI applications, neither flushing results to zero nor treating denormal inputs as zero is acceptable, because they can lead to huge errors.

Whoever fears that denormals can slow down an application must enable the underflow exception. In that case denormals are never generated, but the underflow exceptions must be handled, because when denormals are not desired but underflows happen, that means that there are bugs in the program, which must be fixed.

Denormals have been created so that people can mask the underflow exception and avoid handling it, without dire consequences.

However this habit of no longer handling the floating-point exceptions, like before the IEEE 754 standard, has created younger developers who are no longer aware of how FP arithmetic must be handled to avoid errors, so now there are too many who believe that the use of "-ffast-math" is permitted in general-purpose programs, not only in special applications where result accuracy does not matter.

For correct results, you must use either denormals or underflow exception handling. There is no third choice. The third choice, like in GPUs, is only for when correctness is irrelevant.
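A minimal Python sketch of the kind of error at stake, assuming IEEE 754 doubles. With gradual underflow, `x != y` guarantees `x - y != 0`; the hypothetical `flush_to_zero` helper below models a flush-to-zero mode in software and shows that guarantee disappearing silently, with no exception raised.

```python
import math
import sys

def flush_to_zero(x: float) -> float:
    """Hypothetical model of flush-to-zero applied to a result."""
    if x != 0.0 and abs(x) < sys.float_info.min:
        return math.copysign(0.0, x)
    return x

# Two distinct normal numbers whose difference is a denormal.
y = sys.float_info.min        # 2**-1022, smallest normal double
x = 1.5 * sys.float_info.min  # exactly representable, still normal

gradual = x - y                 # 2**-1023, a denormal, computed exactly
flushed = flush_to_zero(x - y)  # 0.0 under the modeled flush mode

# A guard such as "if x != y: ratio = x / (x - y)" is safe with gradual
# underflow, but would divide by zero once the difference is flushed.
ratio = x / gradual
print(gradual, flushed, ratio)
```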

mananaysiempre 1 hour ago [-]

> CPU processing of denormals tends to be extremely slow - I vaguely recall running into something like a 10x slowdown a decade ago

Intel CPU processing, where slowdowns can be as bad as a couple hundred cycles. AMD CPUs penalize them much more mildly, usually single-digit cycles. (No idea about ARM.)

adgjlsfhk1 2 hours ago [-]

CPUs that aren't Intel are plenty fast on denormals. Intel is the only one where denormals are 100x slower. (And Intel has fixed that on their new CPUs, but only on their E-cores.)

andrepd 2 hours ago [-]

More like 100x, but not sure how true that is nowadays.

yosefk 3 hours ago [-]

Flush denormals to zero. Even their inventor had trouble writing correct code in their presence - see the Appendix to that "what every programmer should know..." paper

mananaysiempre 1 hour ago [-]

On the other hand, they (unexpectedly to the inventor, who intended them to be a debugging tool) underpin a few foundational results in correctly rounded computation, such as https://en.wikipedia.org/wiki/Sterbenz_lemma.
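The Sterbenz lemma says that if y/2 <= x <= 2y, then x - y is computed exactly, and with gradual underflow this holds all the way down to the denormal range. A rough Python sketch that checks this empirically with exact rational arithmetic follows; the helper names, sample sizes, and seed are choices made for this example.

```python
import random
import sys
from fractions import Fraction

def subtraction_is_exact(x: float, y: float) -> bool:
    """Compare the floating-point difference with the exact rational one."""
    return Fraction(x) - Fraction(y) == Fraction(x - y)

def all_exact(scale: float, samples: int = 1000, seed: int = 0) -> bool:
    rng = random.Random(seed)
    for _ in range(samples):
        y = rng.uniform(1.0, 2.0) * scale
        x = rng.uniform(0.5, 2.0) * y
        # Keep only pairs that satisfy the lemma's precondition exactly.
        if not (Fraction(y) / 2 <= Fraction(x) <= 2 * Fraction(y)):
            continue
        if not subtraction_is_exact(x, y):
            return False
    return True

# Ordinary magnitudes, and magnitudes where x - y is often a denormal.
exact_normal = all_exact(1.0)
exact_near_underflow = all_exact(sys.float_info.min)
print(exact_normal, exact_near_underflow)
```

Under a flush-to-zero mode the second check would not be expected to hold, since a nonzero denormal difference would be replaced by zero.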

loicd 3 hours ago [-]

> Even their inventor had trouble writing correct code in their presence

I didn't know that. Could you provide a more specific reference?

andrepd 2 hours ago [-]

It's one of several issues with the design of IEEE floats, unfortunately. I wish we could start thinking more seriously about a new design, to complement if not replace IEEE in the long term. Posits are an example https://github.com/andrepd/posit-rust

freeopinion 14 minutes ago [-]

Thank you for this contribution.

Your repo has a link to the standard[0], which might interest some people. It makes me unreasonably happy to know that this was funded out of Singapore.

[0] https://posithub.org/docs/posit_standard-2.pdf
