I did it !
The fastest checksum (128 bits per loop) that is also solid (unlike Fletcher/Adler etc.)
Here is the C source code:
#include <stdint.h>
#define PEAC64_INIT1 (0x7777777777777777ULL)
#define PEAC64_INIT2 (0xDDDDDDDDDDDDDDDDULL)
void w64_checksum_asm(
uint64_t *buffer,
uint64_t *r,
uint64_t *s,
uint64_t WordPairs) {
uint64_t
X=WordPairs,
Y=PEAC64_INIT2,
I=PEAC64_INIT1;
asm volatile("\n\t"
" addq %[I], %[X] \n\t" // X += I;
" clc \n\t" // C = 0;
" \n\t"
" test %[WordPairs],%[WordPairs] \n\t" // WordPairs=0?
" jz empty_buffer \n\t"
" \n\t"
" .p2align 4 \n\t"
"innerloop: \n\t"
" adcq %[X] , %[Y] \n\t" // Y += X
" adcq (%[buffer]), %[X] \n\t" // X += *buffer
" \n\t"
" adcq %[Y], %[X] \n\t" // X += Y
" adcq 8(%[buffer]), %[Y] \n\t" // Y += buffer[1]
" \n\t"
" lea 16(%[buffer]), %[buffer] \n\t" // buffer+=2
" decq %[WordPairs] \n\t" // WordPairs--
" jnz innerloop \n\t" // while (WordPairs);
" \n\t"
"empty_buffer: \n\t"
" adcq %[X], %[Y] \n\t" // Y += X
" adcq %[I], %[X] \n\t" // X += PEAC64_INIT1
: // outputs
[X] "+r" (X), [Y] "+r" (Y),
[WordPairs] "+r" (WordPairs),
[buffer] "+r" (buffer)
: // inputs
[I] "r" (I)
: // clobber
"cc" );
*s = X;
*r = Y;
}
And gcc -O3 -S compiles it down to:
w64_checksum_asm: movabsq $-2459565876494606883, %rax movq %rcx, %r9 movabsq $8608480567731124087, %r8 addq %r8, %rcx clc test %r9,%r9 jz empty_buffer .p2align 4 innerloop: adcq %rcx , %rax adcq (%rdi), %rcx adcq %rax, %rcx adcq 8(%rdi), %rax lea 16(%rdi), %rdi decq %r9 jnz innerloop empty_buffer: adcq %rcx, %rax adcq %r8, %rcx movq %rcx, (%rdx) movq %rax, (%rsi) ret
That's it !
With this, you get a 128-bit signature of a memory block, which can be filled with read(2) for example. It is hard to do significantly better now, because the memory becomes the bottleneck.
One last trick is the use of ADCX and ADOX opcodes, to implement a separate carry structure and break the 4-instruction carry chain into two halves. But it won't work because decq affects the overflow flag.
Now I have to check the primary orbit of w64...
Discussions
Become a Hackaday.io Member
Create an account to leave a comment. Already have an account? Log In.
Hi! does it have theoretically the same behaviour than in your Log.168 or Log.64 (with the same solution ? :-) ==> calculate the last byte 2times is better ?
Are you sure? yes | no