The latest (and first) test result is clear: a basic branchless iteration is 35% faster than the reference code.
...
46 2693509383 334015557 0.1240076   2693511054 448954961 0.1666802   0.3441131
47 2693530123 332309597 0.1233733   2693583097 449581581 0.1669084   0.3528731
48 2693527773 333995349 0.1239992   2693637400 449048669 0.1667072   0.3444211
49 2693514118 334861793 0.1243215   2693502259 449658083 0.1669418   0.3428230
# 50: 0.1233530 0.1664931 0.3497823
OYIIIISSSSSS....
Thanks to the benchmarking framework developed before, I could clearly and accurately get this encouraging result right away (0.1665 vs 0.1233 in the summary line above, a ×1.35 ratio), and it is only a first step on the way to objective peephole optimisation. Already, using this new code saves 1/4 of the computation time for a mass-scale scan...
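For the curious, here is a minimal, self-contained timing sketch, much cruder than the real benchmark framework: it just calls the two routines listed below on the same negated modulus and compares wall-clock time. The file name in the build line and the default modulus are arbitrary placeholders.

// Crude timing sketch: compare orbit_1() and orbit_2() on one modulus.
// Build (example): gcc -O2 -o bench bench.c orbits.c   (orbits.c holds the two routines below)
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

uint64_t orbit_1(uint64_t modulo, volatile unsigned int *stop_flag);
uint64_t orbit_2(uint64_t modulo, volatile unsigned int *stop_flag);

static double seconds(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv) {
  volatile unsigned int stop = 0;          // never raised here
  uint64_t m = (argc > 1) ? strtoull(argv[1], NULL, 0) : 65537;  // pick your own modulus
  uint64_t neg = (uint64_t)0 - m;          // the routines want the NEGATED modulus

  double t0 = seconds();
  uint64_t l1 = orbit_1(neg, &stop);
  double t1 = seconds();
  uint64_t l2 = orbit_2(neg, &stop);
  double t2 = seconds();

  printf("orbit_1: len=%llu  %.6f\n", (unsigned long long)l1, t1 - t0);
  printf("orbit_2: len=%llu  %.6f\n", (unsigned long long)l2, t2 - t1);
  printf("relative speed-up: %.4f\n", (t1 - t0) / (t2 - t1) - 1.);
  return 0;
}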
The reference code is below:
uint64_t orbit_1(uint64_t modulo,            // !!! must be negative !
        volatile unsigned int *stop_flag) {  // uint64_t max) {
  uint64_t len = 0, X = 1, Y = 0, t;
reloop:
  Y += X;
  t = Y + modulo;        // Y - -modulo : reduction
  if (!(t >> 63)) {
    Y = t;
    X++;
  }
  if (((Y == 1) && (X == 0))
    || (*stop_flag != 0))
    goto return_1;
  X += Y;
  t = X + modulo;        // X - -modulo : reduction
  if (!(t >> 63)) {
    X = t;
    Y++;
  }
  len += 2;
  if ((X != 1) || (Y != 0))
    goto reloop;
  return len;
return_1:
  len++;
  return len;
}
It's close to the very original source code but it already includes one "optimisation": the modulo parameter is negative, so the inner code can use an addition opcode, which can be merged into a LEA when combined with other additions, following the coding trick of the log 122. Funky modulo.
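As a small side illustration (not part of the scanner), the snippet below shows what the call site and the reduction test look like with the negated modulus: in unsigned 64-bit arithmetic, Y + (-m) has exactly the same bits as Y - m, and the top bit of the result tells whether the subtraction "went through" (as long as m stays below 2^63).

// Illustration only: the "negative modulo" trick used by orbit_1()/orbit_2().
#include <assert.h>
#include <stdint.h>

int main(void) {
  uint64_t m      = 10;
  uint64_t modulo = (uint64_t)0 - m;  // what the orbit functions expect
  uint64_t Y, t;

  Y = 13;                             // Y >= m : the reduction is valid
  t = Y + modulo;                     // same bits as Y - m, but ADD/LEA-friendly
  assert(t == 3 && !(t >> 63));       // top bit clear -> keep t, increment X

  Y = 7;                              // Y < m : the reduction overshoots
  t = Y + modulo;                     // wraps to 2^64 - 3
  assert(t >> 63);                    // top bit set -> discard t
  return 0;
}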
We are already setting the stage for later: x86 has no MIN/MAX opcode, but it has CMOV, which is a natural fit right after the ADD in a version written in asm.
What prevents the compiler from turning the conditional jump into a conditional move is the conditional increment. Again, in asm, it's very simple: since the carry flag is not affected by the CMOV, one can put an ADC instruction right after it. Three opcodes and you're done! Well, not quite, since the carry flag must be complemented...
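To make the idea concrete, here is a rough, compilable sketch of one half-step written with that ADD/CMOV/ADC pattern (GCC extended asm, x86-64). It is only an illustration, not the routine used by the scanner, and it rests on one assumption: with the negated modulus, the carry out of the addq is set exactly when the reduction is valid, so the adc can consume it directly (whether a complement is needed depends on how the comparison is formed, as hinted above).

// Sketch only: one half-step (the Y half) of the loop, written with the
// ADD / CMOV / ADC pattern described above, as GCC extended asm for x86-64.
#include <stdint.h>

static inline void half_step(uint64_t *X, uint64_t *Y, uint64_t modulo) {
  uint64_t t;
  __asm__ (
    "addq  %[x], %[y]  \n\t"  /* Y += X                                 */
    "movq  %[y], %[t]  \n\t"
    "addq  %[m], %[t]  \n\t"  /* t = Y + modulo, CF = 1 when Y >= |m|   */
    "cmovc %[t], %[y]  \n\t"  /* keep the reduced value only on carry   */
    "adcq  $0,   %[x]  \n\t"  /* X += CF : CMOV left the carry intact   */
    : [y] "+r" (*Y), [x] "+r" (*X), [t] "=&r" (t)
    : [m] "r" (modulo)
    : "cc");
}

That is five opcodes per half-step and no branch; whether it actually beats the compiler's own sar-based sequence is something only the benchmark can tell.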
We can also see that the result of the conditional incrementation leaves useful flags for the following test of X == 0, for example. But first, let's see how to make and test an intermediate version. Here is one that is 35% faster:
uint64_t orbit_2(uint64_t modulo,            // !!! must be negative !
        volatile unsigned int *stop_flag) {  // uint64_t max) {
  uint64_t len = 0, X = 1, Y = 0, t, s;
reloop:
  Y += X;
  t = Y + modulo;        // Y - -modulo : reduction
  s = (int64_t)t >> 63;  // s = -1 if the reduction overshot, else 0
  if (!s)
    Y = t;
  X += 1+s;              // branchless: X++ only when the reduction happened
  if (((Y == 1) && (X == 0))
    || (*stop_flag != 0))
    goto return_1;
  X += Y;
  t = X + modulo;
  s = (int64_t)t >> 63;  // same sign trick for the X half
  if (!s)
    X = t;
  Y += 1+s;
  len += 2;
  if ((X != 1) || (Y != 0))
    goto reloop;
  return len;
return_1:
  len++;
  return len;
}
I only changed a few lines and the critical instructions are already looking very different:
  addq  %rcx, %rdx          # Y += X
  leaq  (%rdx,%rdi), %r8    # t = Y + modulo
  movq  %r8, %r9
  sarq  $63, %r9            # s = t >> 63 (also sets ZF when s == 0)
  cmove %r8, %rdx           # Y = t when s == 0
  leaq  1(%rcx,%r9), %rcx   # X += 1 + s
The unpredictable jump has disappeared, and the movq and sarq could be shaved off with hand-written asm.