I get tired of fighting GCC's weird code generator that won't get a clue from my source code. I'm (still) using gcc6.4.1 (2017, I know) but I don't know how to make it understand basic things... This is critical at my level because any little performance win saves many days of computation !
I get variable performance tiers (sometimes 50% bumps) and I can't see how to attribute (or force) them. I made many tests and I simplified the source code to this point, trying to remove as many jumps as possible out of the inner loop:
while (N < limit) {
chunk=0;
do {
do {
chunk++;
// Compute forward :
t = X + Y + C;
Y = X;
C = t >> WIDTH;
X = t & m;
} while (Y != 0);
crossings++;
if ((X == 1) && (C == 0)) {
N+=chunk;
// Something something
return NULL;
}
} while(chunk < chunksize);
N+=chunk;
printf("Forward Step %"PRIu64" : X=%d, Y=%d, C=%d \n",
N, X, Y, C);
}
I can see unexplained jumps in performance and I looked at the generated asm, where I found some good and some less good surprises...
gcc -g -S -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_05.c -o pt_orbit.S
And I get this:
.LVL6: .loc 1 86 0 cmpq limit(%rip), %r15 => while (N < limit) { movl 12(%rsp), %r9d jnb .L3 => go to end of main loop .LVL7: .L16: .loc 1 68 0 discriminator 1 xorl %esi, %esi => esi = chunk = 0 jmp .L8 => wait ? Why jump here ?? .LVL8: .p2align 4,,10 .p2align 3 .L5: .loc 1 112 0 cmpq %r14, %rsi => Why compare here and not at the bottom of the loop as I wrote ? ja .L20 .L4: => Oh and this unwanted prefix has eaten from the alignment... .loc 1 96 0 discriminator 1 movl %ebp, %ebx => wait, why a circular assignation ? movl %r12d, %ebp => X into ebp .LVL9: .L8: => and the loop starts only now ? .loc 1 93 0 discriminator 1 leal (%rbx,%r9), %eax => t = Y + C .loc 1 91 0 discriminator 1 addq $1, %rsi => chunk++ .LVL10: .loc 1 93 0 discriminator 1 addl %ebp, %eax => t += X .LVL11: .loc 1 95 0 discriminator 1 movl %eax, %r9d => wait, another assignation ? t goes to r9 now ? .LVL12: .loc 1 96 0 discriminator 1 movzwl %ax, %r12d => X = t & mask .loc 1 95 0 discriminator 1 shrl $16, %r9d => C = t >> 16 .LVL13: .loc 1 101 0 discriminator 1 testl %ebp, %ebp => loop again if ebp is not zero. jne .L4 .loc 1 103 0 addl $1, %r13d .LVL14: .loc 1 104 0 cmpl $1, %r12d => X == 1 ? jne .L5 testl %r9d, %r9d => C == 0 ? jne .L5 .LVL15: .loc 1 106 0 movl $.LC1, %edi => prepare printf... .loc 1 105 0 addq %r15, %rsi .LVL16: .loc 1 106 0 movl %r13d, %edx xorl %eax, %eax .LVL17: call printf
The movzwl thing is a good way to avoid moving the 16 bits of mask around, and it performs a move/copy to another register. But then WHY not exploit this and rename a few things around that ? EAX is not used after the movzwl so it should have saved one mov ! It should look like this :
movzwl %ax, %r12d => X = t & mask shrl $16, %eax => C = t >> 16
So the inner loop should look a bit like this:
; rsi => chunk ; r12 => X ; eax => t, C ; ebp => Y .L4: addl %ebp, %eax => t = Y + C addq $1, %rsi => chunk++ addl %r12d, %eax => t += X movl %r12d, %ebp => Y = X movzwl %ax, %r12d => X = t & mask shrl $16, %eax => C = t >> 16 testl %ebp, %ebp => loop again if ebp is not zero. jne .L4
As promised, that's 5 opcodes for the computation, 1 counter update and 2 conditional jump instructions. That one should fly, right ? But should I have to switch to LLVM to get this ?
Meanwhile the orbit search for w=26 is still running at maybe half the potential top speed...*
-o-O-0-O-o-
Grumble !!!!
Same source code, different optimisation level:
[yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -O3 pt_orbit_05.c -o pt_orbit && /usr/bin/time ./pt_orbit Measuring the first orbit of the Pisano-Carry state space for WIDTH = 16 bits, or about 2147516416 iterations. Starting to scan forward from Step 0 : X=1, Y=0, C=0 Forward scan reached init state after 2147516415 steps and 33035 crossings 4.04user 0.00system 0:04.15elapsed 97%CPU (0avgtext+0avgdata 1420maxresident)k 0inputs+0outputs (0major+69minor)pagefaults 0swaps [yg@E4-11-5B-38-2A-D8 FiboSum]$ gcc -g -Wall -lpthread -D_REENTRANT -DWIDTH=16 -Os pt_orbit_05.c -o pt_orbit && /usr/bin/time ./pt_orbit Measuring the first orbit of the Pisano-Carry state space for WIDTH = 16 bits, or about 2147516416 iterations. Starting to scan forward from Step 0 : X=1, Y=0, C=0 Forward scan reached init state after 2147516415 steps and 33035 crossings 2.40user 0.00system 0:02.42elapsed 99%CPU (0avgtext+0avgdata 1528maxresident)k 0inputs+0outputs (0major+69minor)pagefaults 0swaps
To achieve this I also reused a trick I just found and used in the "desired asm" code above:
C += X + Y; Y = X; X = C & MASK; C = C >> WIDTH;
Removing t helps a bit... The result is the fastest version so far, the last best time was around 2.7s. The asm dump follows:
.L2: .loc 1 84 0 cmpq limit(%rip), %rbp jnb .L6 xorl %esi, %esi => chunk = 0 .LVL8: .L3: .loc 1 93 0 discriminator 1 movl %r14d, %eax => whaaat ? again ?? movl %r12d, %r14d => Y = C .LVL9: .loc 1 89 0 discriminator 1 incq %rsi => chunk++ .LVL10: addl %eax, %ebx => C += Y .LVL11: .loc 1 91 0 discriminator 1 addl %r12d, %ebx => C += X .LVL12: .loc 1 93 0 discriminator 1 movzwl %bx, %r12d => X = C & mask .LVL13: .loc 1 94 0 discriminator 1 shrl $16, %ebx => C = C >> 16 .LVL14: .loc 1 104 0 discriminator 1 testl %r14d, %r14d => Y == 0 ? jne .L3 .loc 1 106 0 incl %r15d => crossings++ .LVL15: .loc 1 107 0 cmpl $1, %r12d jne .L4 testl %ebx, %ebx jne .L4
This version of the code gets about 700-750M steps per second on my old i7.
Discussions
Become a Hackaday.io Member
Create an account to leave a comment. Already have an account? Log In.