Close

PEAC w64 checksum : x86-64 edition

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

yann-guidon-ygdesYann Guidon / YGDES 10/18/2024 at 01:251 Comment

I did it !

The fastest checksum (128 bits per loop) that is also solid (unlike Fletcher/Adler etc.)

Here is the C source code:

#include <stdint.h>

#define PEAC64_INIT1 (0x7777777777777777ULL)
#define PEAC64_INIT2 (0xDDDDDDDDDDDDDDDDULL)

void w64_checksum_asm(
    uint64_t *buffer,
    uint64_t *r,
    uint64_t *s,
    uint64_t WordPairs) {

  uint64_t
    X=WordPairs,
    Y=PEAC64_INIT2,
    I=PEAC64_INIT1;

  asm volatile("\n\t"
"  addq %[I], %[X]                \n\t" //  X += I;
"  clc                            \n\t" //  C = 0;
"                                 \n\t"
"  test %[WordPairs],%[WordPairs] \n\t" // WordPairs=0?
"  jz empty_buffer                \n\t"
"                                 \n\t"
"  .p2align 4                     \n\t"
"innerloop:                       \n\t"
"    adcq %[X] , %[Y]             \n\t" // Y += X
"    adcq (%[buffer]), %[X]       \n\t" // X += *buffer
"                                 \n\t"
"    adcq %[Y], %[X]              \n\t" // X += Y
"    adcq 8(%[buffer]), %[Y]      \n\t" // Y += buffer[1]
"                                 \n\t"
"    lea 16(%[buffer]), %[buffer] \n\t" // buffer+=2
"    decq %[WordPairs]            \n\t" // WordPairs--
"    jnz innerloop                \n\t" // while (WordPairs);
"                                 \n\t"
"empty_buffer:                    \n\t"
"  adcq %[X], %[Y]                \n\t" // Y += X
"  adcq %[I], %[X]                \n\t" // X += PEAC64_INIT1
: // outputs
  [X] "+r" (X), [Y] "+r" (Y),
  [WordPairs] "+r" (WordPairs),
  [buffer]  "+r" (buffer)
: // inputs
  [I] "r" (I)
: // clobber
  "cc" );

  *s = X;
  *r = Y;
} 

And gcc -O3 -S compiles it down to:

w64_checksum_asm:
  movabsq $-2459565876494606883, %rax
  movq    %rcx, %r9
  movabsq $8608480567731124087, %r8
  addq %r8, %rcx
  clc
  test %r9,%r9
  jz empty_buffer
  .p2align 4
innerloop:
    adcq %rcx , %rax
    adcq (%rdi), %rcx
    adcq %rax, %rcx
    adcq 8(%rdi), %rax
    lea 16(%rdi), %rdi
    decq %r9
    jnz innerloop
empty_buffer:
  adcq %rcx, %rax
  adcq %r8, %rcx
  movq    %rcx, (%rdx)
  movq    %rax, (%rsi)
  ret

That's it !

With this, you get a 128-bit signature of a memory block, which can be filled with read(2) for example. It is hard to do significantly better now, because the memory becomes the bottleneck.

One last trick is the use of ADCX and ADOX opcodes, to implement a separate carry structure and break the 4-instruction carry chain into two halves. But it won't work because decq affects the overflow flag.

Now I have to check the primary orbit of w64...

Discussions

Thibaut wrote 10/28/2024 at 09:46 point

Hi! does it have theoretically the same behaviour than in your Log.168 or Log.64 (with the same solution ? :-) ==> calculate the last byte 2times is better ?

  Are you sure? yes | no