I did it !
The fastest checksum (128 bits per loop) that is also solid (unlike Fletcher/Adler etc.)
Here is the C source code:
#include <stdint.h>
#define PEAC64_INIT1 (0x7777777777777777ULL)
#define PEAC64_INIT2 (0xDDDDDDDDDDDDDDDDULL)
void w64_checksum_asm(
uint64_t *buffer,
uint64_t *r,
uint64_t *s,
uint64_t WordPairs) {
uint64_t
X=WordPairs,
Y=PEAC64_INIT2,
I=PEAC64_INIT1;
asm volatile("\n\t"
" addq %[I], %[X] \n\t" // X += I;
" clc \n\t" // C = 0;
" \n\t"
" test %[WordPairs],%[WordPairs] \n\t" // WordPairs=0?
" jz empty_buffer \n\t"
" \n\t"
" .p2align 4 \n\t"
"innerloop: \n\t"
" adcq %[X] , %[Y] \n\t" // Y += X
" adcq (%[buffer]), %[X] \n\t" // X += *buffer
" \n\t"
" adcq %[Y], %[X] \n\t" // X += Y
" adcq 8(%[buffer]), %[Y] \n\t" // Y += buffer[1]
" \n\t"
" lea 16(%[buffer]), %[buffer] \n\t" // buffer+=2
" decq %[WordPairs] \n\t" // WordPairs--
" jnz innerloop \n\t" // while (WordPairs);
" \n\t"
"empty_buffer: \n\t"
" adcq %[X], %[Y] \n\t" // Y += X
" adcq %[I], %[X] \n\t" // X += PEAC64_INIT1
: // outputs
[X] "+r" (X), [Y] "+r" (Y),
[WordPairs] "+r" (WordPairs),
[buffer] "+r" (buffer)
: // inputs
[I] "r" (I)
: // clobber
"cc" );
*s = X;
*r = Y;
}
And gcc -O3 -S compiles it down to:
w64_checksum_asm: movabsq $-2459565876494606883, %rax movq %rcx, %r9 movabsq $8608480567731124087, %r8 addq %r8, %rcx clc test %r9,%r9 jz empty_buffer .p2align 4 innerloop: adcq %rcx , %rax adcq (%rdi), %rcx adcq %rax, %rcx adcq 8(%rdi), %rax lea 16(%rdi), %rdi decq %r9 jnz innerloop empty_buffer: adcq %rcx, %rax adcq %r8, %rcx movq %rcx, (%rdx) movq %rax, (%rsi) ret
That's it !
With this, you get a 128-bit signature of a memory block, which can be filled with read(2) for example. It is hard to do significantly better now, because the memory becomes the bottleneck.
One last trick is the use of ADCX and ADOX opcodes, to implement a separate carry structure and break the 4-instruction carry chain into two halves. But it won't work because decq affects the overflow flag.
Now I have to check the primary orbit of w64...
Discussions
Become a Hackaday.io Member
Create an account to leave a comment. Already have an account? Log In.
Hi! does it have theoretically the same behaviour than in your Log.168 or Log.64 (with the same solution ? :-) ==> calculate the last byte 2times is better ?
Are you sure? yes | no
This checksum is meant to compute the footprint of a file, just like CRC32 or Adler/Fletcher is used today, but faster and better. It's not a substitute to SHAxxx, by far.
Adding a few extra rounds at the end is "better" as it mixes the bit even more but it shouldn't change the result since no extra external entropy is added.
It might affect the Hamming distance somewhat on average, but that's not the goal since it's not used as a hash table index for example. I just want to know if two files are different (not identical).
Are you sure? yes | no