Close

PEAC w64 checksum : x86-64 edition

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

yann-guidon-ygdesYann Guidon / YGDES 10/18/2024 at 01:252 Comments

I did it !

The fastest checksum (128 bits per loop) that is also solid (unlike Fletcher/Adler etc.)

Here is the C source code:

#include <stdint.h>

#define PEAC64_INIT1 (0x7777777777777777ULL)
#define PEAC64_INIT2 (0xDDDDDDDDDDDDDDDDULL)

void w64_checksum_asm(
    uint64_t *buffer,
    uint64_t *r,
    uint64_t *s,
    uint64_t WordPairs) {

  uint64_t
    X=WordPairs,
    Y=PEAC64_INIT2,
    I=PEAC64_INIT1;

  asm volatile("\n\t"
"  addq %[I], %[X]                \n\t" //  X += I;
"  clc                            \n\t" //  C = 0;
"                                 \n\t"
"  test %[WordPairs],%[WordPairs] \n\t" // WordPairs=0?
"  jz empty_buffer                \n\t"
"                                 \n\t"
"  .p2align 4                     \n\t"
"innerloop:                       \n\t"
"    adcq %[X] , %[Y]             \n\t" // Y += X
"    adcq (%[buffer]), %[X]       \n\t" // X += *buffer
"                                 \n\t"
"    adcq %[Y], %[X]              \n\t" // X += Y
"    adcq 8(%[buffer]), %[Y]      \n\t" // Y += buffer[1]
"                                 \n\t"
"    lea 16(%[buffer]), %[buffer] \n\t" // buffer+=2
"    decq %[WordPairs]            \n\t" // WordPairs--
"    jnz innerloop                \n\t" // while (WordPairs);
"                                 \n\t"
"empty_buffer:                    \n\t"
"  adcq %[X], %[Y]                \n\t" // Y += X
"  adcq %[I], %[X]                \n\t" // X += PEAC64_INIT1
: // outputs
  [X] "+r" (X), [Y] "+r" (Y),
  [WordPairs] "+r" (WordPairs),
  [buffer]  "+r" (buffer)
: // inputs
  [I] "r" (I)
: // clobber
  "cc" );

  *s = X;
  *r = Y;
} 

And gcc -O3 -S compiles it down to:

w64_checksum_asm:
  movabsq $-2459565876494606883, %rax
  movq    %rcx, %r9
  movabsq $8608480567731124087, %r8
  addq %r8, %rcx
  clc
  test %r9,%r9
  jz empty_buffer
  .p2align 4
innerloop:
    adcq %rcx , %rax
    adcq (%rdi), %rcx
    adcq %rax, %rcx
    adcq 8(%rdi), %rax
    lea 16(%rdi), %rdi
    decq %r9
    jnz innerloop
empty_buffer:
  adcq %rcx, %rax
  adcq %r8, %rcx
  movq    %rcx, (%rdx)
  movq    %rax, (%rsi)
  ret

That's it !

With this, you get a 128-bit signature of a memory block, which can be filled with read(2) for example. It is hard to do significantly better now, because the memory becomes the bottleneck.

One last trick is the use of ADCX and ADOX opcodes, to implement a separate carry structure and break the 4-instruction carry chain into two halves. But it won't work because decq affects the overflow flag.

Now I have to check the primary orbit of w64...

Discussions

Thibaut wrote 10/28/2024 at 09:46 point

Hi! does it have theoretically the same behaviour than in your Log.168 or Log.64 (with the same solution ? :-) ==> calculate the last byte 2times is better ?

  Are you sure? yes | no

Yann Guidon / YGDES wrote 02/04/2025 at 23:46 point

This checksum is meant to compute the footprint of a file, just like CRC32 or Adler/Fletcher is used today, but faster and better. It's not a substitute to SHAxxx, by far.
Adding a few extra rounds at the end is "better" as it mixes the bit even more but it shouldn't change the result since no extra external entropy is added.

It might affect the Hamming distance somewhat on average, but that's not the goal since it's not used as a hash table index for example. I just want to know if two files are different (not identical).

  Are you sure? yes | no