I did it !
The fastest checksum (128 bits per loop) that is also solid (unlike Fletcher/Adler etc.)
Here is the C source code:
#include <stdint.h>
#define PEAC64_INIT1 (0x7777777777777777ULL)
#define PEAC64_INIT2 (0xDDDDDDDDDDDDDDDDULL)
void w64_checksum_asm(
uint64_t *buffer,
uint64_t *r,
uint64_t *s,
uint64_t WordPairs) {
uint64_t
X=WordPairs,
Y=PEAC64_INIT2,
I=PEAC64_INIT1;
asm volatile("\n\t"
" addq %[I], %[X] \n\t" // X += I;
" clc \n\t" // C = 0;
" \n\t"
" test %[WordPairs],%[WordPairs] \n\t" // WordPairs=0?
" jz empty_buffer \n\t"
" \n\t"
" .p2align 4 \n\t"
"innerloop: \n\t"
" adcq %[X] , %[Y] \n\t" // Y += X
" adcq (%[buffer]), %[X] \n\t" // X += *buffer
" \n\t"
" adcq %[Y], %[X] \n\t" // X += Y
" adcq 8(%[buffer]), %[Y] \n\t" // Y += buffer[1]
" \n\t"
" lea 16(%[buffer]), %[buffer] \n\t" // buffer+=2
" decq %[WordPairs] \n\t" // WordPairs--
" jnz innerloop \n\t" // while (WordPairs);
" \n\t"
"empty_buffer: \n\t"
" adcq %[X], %[Y] \n\t" // Y += X
" adcq %[I], %[X] \n\t" // X += PEAC64_INIT1
: // outputs
[X] "+r" (X), [Y] "+r" (Y),
[WordPairs] "+r" (WordPairs),
[buffer] "+r" (buffer)
: // inputs
[I] "r" (I)
: // clobber
"cc" );
*s = X;
*r = Y;
}
And gcc -O3 -S compiles it down to:
w64_checksum_asm:
movabsq $-2459565876494606883, %rax
movq %rcx, %r9
movabsq $8608480567731124087, %r8
addq %r8, %rcx
clc
test %r9,%r9
jz empty_buffer
.p2align 4
innerloop:
adcq %rcx , %rax
adcq (%rdi), %rcx
adcq %rax, %rcx
adcq 8(%rdi), %rax
lea 16(%rdi), %rdi
decq %r9
jnz innerloop
empty_buffer:
adcq %rcx, %rax
adcq %r8, %rcx
movq %rcx, (%rdx)
movq %rax, (%rsi)
ret
That's it !
With this, you get a 128-bit signature of a memory block, which can be filled with read(2) for example. It is hard to do significantly better now, because the memory becomes the bottleneck.
One last trick is the use of ADCX and ADOX opcodes, to implement a separate carry structure and break the 4-instruction carry chain into two halves. But it won't work because decq affects the overflow flag.
Now I have to check the primary orbit of w64...
Yann Guidon / YGDES
Discussions
Become a Hackaday.io Member
Create an account to leave a comment. Already have an account? Log In.
Hi! does it have theoretically the same behaviour than in your Log.168 or Log.64 (with the same solution ? :-) ==> calculate the last byte 2times is better ?
Are you sure? yes | no
This checksum is meant to compute the footprint of a file, just like CRC32 or Adler/Fletcher is used today, but faster and better. It's not a substitute to SHAxxx, by far.
Adding a few extra rounds at the end is "better" as it mixes the bit even more but it shouldn't change the result since no extra external entropy is added.
It might affect the Hamming distance somewhat on average, but that's not the goal since it's not used as a hash table index for example. I just want to know if two files are different (not identical).
Are you sure? yes | no