Originally posted by: billo
This post is about output operands that may be assigned registers which overlap with input operands. Consider this inline asm, which uses addc and adde to perform a 128-bit addition...
asm ("addc %0, %2, %3 \n"
"adde %1, %4, %5 \n"
: "=r"(xl),"=r"(xu)
: "r"(yl),"r"(zl),"r"(yu),"r"(zu));
In this example all the inputs and outputs are unsigned longs. Unfortunately, although it looks pretty straightforward, there is a potentially fatal flaw in this asm: registers allocated for the outputs are allowed to overlap with registers allocated for the inputs. Even though we programmers know that these two instructions execute separately, the compiler will, by default, treat them as if they are an indivisible whole, and thus may choose to reuse as outputs any registers used as inputs. Imagine if the register allocated for %0 (the xl variable) was the same one used for %4 or %5. That would mean that by the time the adde was being executed, one of its inputs would have been trashed by the addc instruction.
When I compiled the above example, the assembly generated was as follows...
addc 4, 0, 3
adde 5, 4, 5
Oops! R4, which holds input %4, is overwritten by addc before adde reads it. The solution to this problem is the “&” output modifier (GCC's documentation calls such an operand an earlyclobber operand). It says “the operand may be modified before the instruction is finished using the input operands.” We could use it in this example as follows...
asm ("addc %0, %2, %3 \n"
"adde %1, %4, %5 \n"
: "=&r"(xl),"=&r"(xu)
: "r"(yl),"r"(zl),"r"(yu),"r"(zu));
Now register allocation will ensure that the registers allocated for %0 and %1 are distinct, not only from each other, but from all the inputs as well. When I compiled this example I got the following assembly...
addc 4, 0, 3
adde 5, 6, 7
This will execute correctly. However, the use of two “&” modifiers may be overkill. Would it hurt if, for instance, the register chosen for %1 (the xu variable) was the same one used for some input? The answer is no, it wouldn’t hurt. It wouldn’t matter if it overlapped with %2 or %3, as those inputs have already been used by the time adde executes. Nor would it matter if it overlapped with %4 or %5, as the semantics of the adde instruction ensure that its inputs are read before its output is written. Therefore it would be correct to write the example as...
asm ("addc %0, %2, %3 \n"
"adde %1, %4, %5 \n"
: "=&r"(xl),"=r"(xu)
: "r"(yl),"r"(zl),"r"(yu),"r"(zu));
When I compiled this, the assembly generated looked like this...
addc 4, 0, 3
adde 5, 5, 6
In this case, the compiler has overlapped output %1 with input %4, but it could have overlapped it with any input, and the program would still be correct. Why would we choose to use only one “&” instead of two? Because the example with one “&” used fewer registers in the compiled output for the inline asm, which could make a critical performance difference in some programs. For instance, it could mean less register spilling, or the use of fewer non-volatile registers, which would result in fewer loads to restore those registers.
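To make the difference between the three listings concrete, here is a rough C model of the two instructions running on a toy register file. Everything in it (the Regs and Alloc types, the C functions named addc and adde, upper_sum, the three constants and check_listings) is made up for illustration. The register numbers are read off the listings above on the assumption that each instruction is written as target, source, source, and the test values are my own choice.

```c
#include <assert.h>
#include <stdint.h>

/* Toy register file: 32 GPRs plus the carry bit (CA). Illustration only. */
typedef struct {
    uint64_t r[32];
    unsigned ca;
} Regs;

/* Register numbers assigned to each asm operand (hypothetical helper type). */
typedef struct {
    int xl, xu;          /* outputs %0, %1 */
    int yl, zl, yu, zu;  /* inputs  %2, %3, %4, %5 */
} Alloc;

/* Model of "addc rt, ra, rb": rt = ra + rb, CA = carry out. */
static void addc(Regs *m, int rt, int ra, int rb)
{
    uint64_t a = m->r[ra], b = m->r[rb];
    uint64_t sum = a + b;
    m->ca = sum < a;
    m->r[rt] = sum;
}

/* Model of "adde rt, ra, rb": rt = ra + rb + CA, CA = carry out.
   Both sources are read before rt is written. */
static void adde(Regs *m, int rt, int ra, int rb)
{
    uint64_t a = m->r[ra], b = m->r[rb];
    uint64_t partial = a + b;
    uint64_t sum = partial + m->ca;
    m->ca = (partial < a) | (sum < partial);
    m->r[rt] = sum;
}

/* Load the inputs into their assigned registers, run both
   instructions, and return the upper half of the result (xu). */
static uint64_t upper_sum(Alloc a, uint64_t yl, uint64_t zl,
                          uint64_t yu, uint64_t zu)
{
    Regs m = {{0}, 0};
    m.r[a.yl] = yl;
    m.r[a.zl] = zl;
    m.r[a.yu] = yu;
    m.r[a.zu] = zu;
    addc(&m, a.xl, a.yl, a.zl);
    adde(&m, a.xu, a.yu, a.zu);
    return m.r[a.xu];
}

/* Register assignments taken from the three listings. */
static const Alloc NO_AMP  = { 4, 5, 0, 3, 4, 5 };  /* addc 4,0,3 / adde 5,4,5 */
static const Alloc TWO_AMP = { 4, 5, 0, 3, 6, 7 };  /* addc 4,0,3 / adde 5,6,7 */
static const Alloc ONE_AMP = { 4, 5, 0, 3, 5, 6 };  /* addc 4,0,3 / adde 5,5,6 */

/* yl + zl wraps to 0 with a carry, so the upper half should be 1 + 2 + 1 = 4. */
void check_listings(void)
{
    assert(upper_sum(NO_AMP,  UINT64_MAX, 1, 1, 2) == 3);  /* yu was trashed */
    assert(upper_sum(TWO_AMP, UINT64_MAX, 1, 1, 2) == 4);
    assert(upper_sum(ONE_AMP, UINT64_MAX, 1, 1, 2) == 4);
}
```

In this model the assignment without “&” gets the upper half wrong because addc has already replaced yu in r4, while the one-“&” and two-“&” assignments agree with each other.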
Therefore, the rule with “&” is to use as few of them as possible, but no fewer.
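For anyone who wants to try the one-“&” version in a real program, here is a minimal sketch of it wrapped in a function. The names add128 and add128_example are hypothetical, and so is the plain C fallback, which is only there so the file still builds on other targets. I am assuming a 64-bit PowerPC build with a GCC-compatible compiler, where uint64_t is an unsigned long as in the original example; the asm statement itself is copied from the post unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper: (xu:xl) = (yu:yl) + (zu:zl), using the
   variable names from the post. */
static void add128(uint64_t yl, uint64_t yu, uint64_t zl, uint64_t zu,
                   uint64_t *xl, uint64_t *xu)
{
    uint64_t lo, hi;
#if defined(__GNUC__) && defined(__powerpc64__)
    /* Only %0 needs "&": it is written by addc while %4 and %5 are
       still waiting to be read by adde. */
    __asm__ ("addc %0, %2, %3 \n"
             "adde %1, %4, %5 \n"
             : "=&r"(lo), "=r"(hi)
             : "r"(yl), "r"(zl), "r"(yu), "r"(zu));
#else
    /* Portable fallback (an assumption, not part of the original post). */
    lo = yl + zl;
    hi = yu + zu + (lo < yl);  /* (lo < yl) is the carry out of the low half */
#endif
    *xl = lo;
    *xu = hi;
}

/* Example use: the low halves wrap around and carry into the upper half. */
void add128_example(void)
{
    uint64_t xl, xu;
    add128(UINT64_MAX, 1, 1, 2, &xl, &xu);
    assert(xl == 0 && xu == 4);
}
```

The outputs go to locals first and are stored through the pointers afterwards, so the compiler is free to pick any registers that satisfy the constraints.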
Till next time… Bill