Computed goto for efficient dispatch tables (2012) (eli.thegreenplace.net)

20 points by firephox 3 days ago | 6 comments

jdw64 53 minutes ago [-]

I understood up to the comparison operations in this article, but I'm having trouble understanding branch prediction even after reading it. It seems like they used branch prediction optimization... Sometimes when I read articles like this, I start to question whether I can really call myself a programmer

adrian_b 9 minutes ago [-]

In a CPU, there are 2 kinds of branch predictors, predictors for conditional branches, which must predict only whether the branch is taken or not taken, and predictors for indirect branch instructions, which must predict the address of the next instruction that will be executed after the indirect branch instruction.

The indirect branch predictor stores information about indirect branches in a structure that is similar to a cache memory.

When the CPU loads some data from the main memory, the loaded data together with the address from where it was loaded is stored in the cache memory.

When another load instruction is executed later, the load address is used to query the cache memory, and if the query succeeds, the returned data value is used instead of accessing the main memory again.

Similarly, (simplifying a lot) when an indirect branch instruction is executed the address where the execution continues after the indirect branch is stored together with the address of the branch instruction in an associative memory similar to a cache memory.

When an indirect branch instruction is executed later, the address of the branch instruction is used to query the associative memory and if the query is successful it will return the address of the next instruction to be executed and execution will continue there speculatively, until it is confirmed by reading from the main memory that this is the correct jump address. If the jump address is wrong, then all the work done is discarded and execution continues at the right address.

This description is simplified, because modern branch predictors may store for a given branch instruction address not only the last jump address, but a sequence of past jump addresses, together with a pattern about how they have been used (e.g. alternating between jumping twice to the first address and jumping once to the second address). Thus successive queries for the branch instruction address will retrieve different jump addresses, based on their activation pattern.

The point of TFA is that if the dispatch loop contains a single indexed branch instruction, then the associative memory of the indirect branch predictor contains a single record, keyed by the address of that instruction. Depending on the CPU, the record will contain one or more past addresses, but it can predict correctly only a short periodic pattern of jump addresses, i.e. of opcodes used in dispatching.

This could work to predict correctly a short loop in the interpreted program, but typically most of the predictions will be wrong, and each branch misprediction takes much more time than the execution of any instruction. In smaller and cheaper CPUs, which might store a single jump address per record, the correct predictions would be extremely rare, as they would happen only if the interpreted program contains repeated instructions.

In the optimized program, the indexed branch instruction is replicated into each "case" so now the associative memory will store separate records, one for each "case", containing the address or addresses of the next cases where execution has continued in the past.

Because after each interpreted instruction the probabilities of the next instructions are not equal, but some instructions are more likely to follow, that greatly increases the chances that the indirect branch predictor will make correct predictions from time to time.
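To make the difference concrete, here is a toy model of the simplest case described above: a predictor that stores a single last jump address per record. It is only a sketch under that assumption, not a model of any real CPU, and the names mispredicts_shared, mispredicts_replicated and NUM_OPS are made up for illustration. The first function models one shared indexed branch (one record for the whole loop), the second models one branch per "case" (the record is selected by the opcode that has just executed).

```c
#include <stddef.h>

enum { NUM_OPS = 4 };   /* hypothetical number of opcodes; every trace value must be < NUM_OPS */

/* One indirect branch dispatches every opcode, so the predictor has a
   single record that holds only the last target it saw. */
size_t mispredicts_shared(const unsigned char *ops, size_t n)
{
    int last = -1;              /* the single record, initially empty */
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (last != ops[i])
            misses++;           /* predicted target differs from the real one */
        last = ops[i];
    }
    return misses;
}

/* Each handler ends with its own indirect branch, so there is one record
   per handler, plus one for the initial dispatch before any handler ran. */
size_t mispredicts_replicated(const unsigned char *ops, size_t n)
{
    int last[NUM_OPS + 1];
    for (int k = 0; k <= NUM_OPS; k++)
        last[k] = -1;           /* all records initially empty */

    size_t misses = 0;
    int prev = NUM_OPS;         /* slot used by the initial dispatch */
    for (size_t i = 0; i < n; i++) {
        if (last[prev] != ops[i])
            misses++;
        last[prev] = ops[i];    /* remember what followed this handler */
        prev = ops[i];
    }
    return misses;
}
```

In this model, a trace that repeats the opcodes 0, 1, 2 defeats the shared record on every dispatch, because the next opcode never equals the previous one. The per-handler records miss only while they are being filled, since each opcode is always followed by the same successor.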

kibwen 8 minutes ago [-]

> Therefore, the standard forces the compiler to generate "safe" code for the switch. Safety, as usual, has cost, so the switch version ends up doing a bit more per loop iteration.

Safety only has a cost in this case because the switch is fundamentally just operating on an integer. With an actual enumerated type (rather than C's primitive "enums as numeric aliases"), which even a basic type system could trivially enforce, there would be no need for this check, because the domain of the value would be guaranteed at compile-time.
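A minimal sketch of why C cannot drop the check, using a made-up opcode set (enum opcode and is_known_opcode are hypothetical names): any integer can be converted to the enum type, so a value outside the listed enumerators can legitimately reach the switch.

```c
/* Hypothetical opcode set, for illustration only. */
enum opcode { OP_HALT, OP_INC, OP_DEC };

/* Returns 1 for a listed opcode and 0 for anything else. */
int is_known_opcode(enum opcode op)
{
    switch (op) {
    case OP_HALT:
    case OP_INC:
    case OP_DEC:
        return 1;
    default:
        /* Reachable in C: the enum is just an integer type, so a
           caller may pass (enum opcode)42 without any diagnostic. */
        return 0;
    }
}
```

A call such as is_known_opcode((enum opcode)42) compiles and takes the default branch, which is exactly the path a stricter enumerated type would rule out at compile time.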

nly 2 hours ago [-]

All well and good provided your opcodes are sequential/dense

adrian_b 1 hours ago [-]

The same is true for a "switch" statement.

If the "case" values used in a "switch" are sparse, the "switch" will not be compiled into an indexed jump instruction, but into a sequence of conditional jump instructions, which test all the possible cases.

In this situation, the alternative to "switch" for implementing dispatching is no longer a computed "goto", but a multiple "if"/"else" sequence.
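As an illustration, with sparse values invented for this example, the two functions below behave identically. What code a compiler actually emits for the "switch" is its own choice, but a jump table indexed by these values would be mostly empty, so a chain of comparisons is the likely result.

```c
/* Hypothetical sparse opcode values, chosen only for illustration. */
enum { OP_LOAD = 3, OP_STORE = 200, OP_CALL = 70000 };

/* Dispatch written as a "switch" over sparse case values. */
int handler_via_switch(int op)
{
    switch (op) {
    case OP_LOAD:  return 1;
    case OP_STORE: return 2;
    case OP_CALL:  return 3;
    default:       return 0;   /* unknown opcode */
    }
}

/* The same dispatch written as an explicit "if"/"else" sequence. */
int handler_via_if_chain(int op)
{
    if (op == OP_LOAD)
        return 1;
    else if (op == OP_STORE)
        return 2;
    else if (op == OP_CALL)
        return 3;
    else
        return 0;              /* unknown opcode */
}
```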

A smarter compiler could detect when a "switch" forms the body of a loop and it would replicate the indexed jump instruction at the end of each case, instead of jumping to the beginning of the loop to execute there an indexed jump, or even worse, first jumping to the end of the loop to terminate the "switch", then jumping to the beginning of the loop to repeat the loop body.

With such a compiler, computed "goto" would not be necessary as an alternative to "switch".
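For reference, this is roughly what doing that replication by hand looks like. It is a minimal sketch with a made-up opcode set and function name (run_computed_goto), and it relies on the GNU "labels as values" extension (&&label and goto *ptr), so it needs GCC or Clang and is not standard C.

```c
/* Requires GCC or Clang: "labels as values" is a GNU extension. */

/* Hypothetical opcode set; values must be dense and match the table order. */
enum { OP_HALT, OP_INC, OP_DEC, OP_MUL2 };

int run_computed_goto(const unsigned char *code, int value)
{
    /* One label address per opcode, indexed by the opcode value. */
    static void *dispatch[] = { &&do_halt, &&do_inc, &&do_dec, &&do_mul2 };
    const unsigned char *pc = code;

    /* Every handler ends with its own indexed jump. There is no range
       check, so the bytecode must contain only valid opcodes and must
       end with OP_HALT. */
#define DISPATCH() goto *dispatch[*pc++]

    DISPATCH();
do_halt:
    return value;
do_inc:
    value++;
    DISPATCH();
do_dec:
    value--;
    DISPATCH();
do_mul2:
    value *= 2;
    DISPATCH();

#undef DISPATCH
}
```

Running the program INC, INC, MUL2, DEC, HALT with a start value of 1 gives 5. The point is only the shape of the code: each handler carries its own copy of the indirect jump instead of returning to a single one at the top of a loop.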

The range check inside the dispatch loop would not be necessary if the opcodes had an enumeration type (in a programming language where enumeration types are clearly distinct from integers) and the "switch" handled all the possible cases. In that situation, the range checks would be moved elsewhere in the program, wherever opcodes are generated.
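In C, which lacks such an enumeration type, the closest equivalent is to validate the bytecode once, before it reaches the dispatch loop. A sketch with hypothetical names (validate_bytecode, NUM_OPS):

```c
#include <stddef.h>

/* Hypothetical opcode set; NUM_OPS is one past the last valid opcode. */
enum { OP_HALT, OP_INC, OP_DEC, NUM_OPS };

/* Returns 1 if every byte is a valid opcode, 0 otherwise. Called once
   where the bytecode is produced or loaded, so the dispatch loop can
   assume that every opcode is in range. */
int validate_bytecode(const unsigned char *code, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (code[i] >= NUM_OPS)
            return 0;           /* out-of-range opcode */
    }
    return 1;
}
```

This moves the cost from once per executed instruction to once per loaded instruction. The guarantee then rests on the programmer calling the check, not on the type system.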

froh 2 hours ago [-]

(2012)
