Project | PEAC Pisano with End-Around Carry algorithm

« Back to project details Sort by:

Stutter

02/11/2025 at 11:38 • 0 comments

Playing a bit with PEAC w4 :

#include <stdint.h>

uint8_t PEAC_XC=1,  // the 5-bit part
        PEAC_Y=0;   // 4-bit

#include <stdio.h>

int main() {
  int i=0;

  do {
    printf("% 3d:   % 3d  % 3d \n", i, PEAC_XC, PEAC_Y);

    // split
    uint8_t PEAC_X = PEAC_XC & 15;
    PEAC_XC >>= 4;

    // PEAC scrambler
    PEAC_XC +=   PEAC_X + PEAC_Y;
    PEAC_Y   = ( PEAC_X + 0 ) & 15;

    i++;
  } while (i <= 136);

  return 0;
}

W4 is a maximal, not perfect, orbit so there are "only" 135 iterations. But the output sequence, either PEAC_Y or PEAC_XC, shows that some numbers repeat but only in consecutive pairs

        XC    Y
  0:     1    0 
  1:     1    1
  2:     2    1
  3:     3    2 
  4:     5    3 
  5:     8    5 
  6:    13    8 
  7:    21   13 
  8:    19    5 
  9:     9    3 
 10:    12    9 
 11:    21   12 
 12:    18    5 
 13:     8    2 
 14:    10    8 
 15:    18   10 
 16:    13    2 
 17:    15   13 
 18:    28   15 
 19:    28   12
 20:    25   12
 21:    22    9 
 22:    16    6 
 23:     7    0 
 24:     7    7
 25:    14    7
 26:    21   14 
 27:    20    5 
 28:    10    4 
 29:    14   10 
 30:    24   14 
 31:    23    8 
 32:    16    7 
 33:     8    0 
 34:     8    8 
 35:    16    8 
 36:     9    0 
 37:     9    9 
 38:    18    9 
 39:    12    2 
 40:    14   12

This "stutter" might be a sort of "signature" of a PEAC PRNG, which is why PEAC is better used as a scrambler IMHO.

Now, is this "stutter" a defect ? Not really. On the primary orbit of w4 above,

the number 2 appears 11 times and "stutters" 2 times.
3 appears 9 times and stutters 1×.
4 appears 12 times and stutters 1×

So in a maximal (and not perfect) orbit, there are some imbalances.

Init vectors
02/08/2025 at 14:34 • 0 comments
Until now I use 0xABCD and 0x4567 to initialise the PEAC16x2 algorithms, but they don't propagate the carries enough. This comes from the fact that even and odd hex digits are "aligned" so there are at least 4 bits that are identical at each position. Ideally : X|Y=1 for each bit position.
One way to solve this is to offset the digits by one position : X=0xBCDE and Y=0x4567
This is slightly better:
```
>>> hex( 0xABCD | 0x4567 )
'0xefef'
>>> hex( 0xBCDE | 0x4567 )
'0xfdff'
```
Time to update all the other references...

After the last log, here's a first draft of a POSIX-compliant PRNG to replace the dumb congruential generator. The combined period should exceed 2^64 yet no heavy operation (like multiplies or moduli) are used. Of course I should test the quality of the output...

#include <stdint.h>

#define RAND_MAX 32767

/* From the spec: If rand() is called before any calls
   to srand() are made, the same sequence shall
   be generated as when srand() is first called
   with a seed value of 1. */
uint32_t rand_LFSR    =  1,
         rand_PEAC_XC =  1;
uint16_t rand_PEAC_Y  =  0;

void srand(unsigned seed) {
  if (seed) rand_LFSR = seed;
      else  rand_LFSR = 0x89ABCDEFUL;
  rand_PEAC_XC  =  seed;
  rand_PEAC_Y   = ~seed & 1;
}

int rand(void) {
  // LFSR PRNG :
  uint32_t t = rand_LFSR & 1;
  rand_LFSR >>= 1;
  if (t)
    rand_LFSR ^= 0x82608EDB; // CRC32 poly

  // PEAC scrambler :
  uint16_t rand_X = rand_PEAC_XC; // 16 LSB
  rand_PEAC_XC = (rand_PEAC_XC >> 16)
               + rand_X + rand_PEAC_Y;
  rand_PEAC_Y  = rand_X + rand_LFSR; // 16 LSB

  return rand_PEAC_XC & RAND_MAX;
}

Let's see how far this one flies.

The first compile and run go smoothly:

[yg@ygrand]$ gcc -Wall -Os -o test_ygrand test_ygrand.c  && ./test_ygrand 
0000000000000001
0000111011011101
0101100010010100
0100110001001101
0110000110011000
0000110001000000
0000111111001111
0110110100001010
0110001101111111
0100001111011101
0001111011001111
0001111001100110
0101000000111101
0011010111111011
0010100111100101
0111000110110111
0110010010000111
0000000011101101
0011101011001100
0010000000101001
0100110100101110
0110011001110011
...

The PRNG is not even initialised, only using the default value of 1, and it immediately reached a high scrambling state. Calling srand() is a must though, and should call rand() a few times itself for "better results".

Initialising with srand(0) also gives a good-looking sequence with the help of some extra calls to rand().

The Hamming distance between consecutive numbers looks good after 10000000000 iterations. Note that slot 16 is cleared because only 15 bits are provided at each step.

  0 : 305938
  1 : 4575646
  2 : 32047057
  3 : 138854426
  4 : 416575520
  5 : 916453074
  6 : 1527396508
  7 : 1963876973
  8 : 1963793753
  9 : 1527393214
 10 : 916423856
 11 : 416545995
 12 : 138835287
 13 : 32036947
 14 : 4580472
 15 : 305334
 16 : 0

The peak is well centered around slots 7 and 8, the histogram is nicely symmetrical.

For comparison, I made a "fake PRNG" that pulls data out of /dev/random and here is the histogram:

  0 : 305586d
  1 : 4579839d
  2 : 32047565d
  3 : 138857968d
  4 : 416545970d
  5 : 916470025d
  6 : 1527349897d
  7 : 1963799989d
  8 : 1963842436d
  9 : 1527391622d
 10 : 916495473d
 11 : 416539621d
 12 : 138850583d
 13 : 32042754d
 14 : 4575830d
 15 : 304842d
 16 : 0d

It's pretty spot on but that's only one of many aspects to check.

___________________________________________________

Update : I fixed the init values so it matches srand(1) after 3 calls to rand().

uint32_t rand_LFSR    =  0x61A864DB,
         rand_PEAC_XC =  0x00015894;
uint16_t rand_PEAC_Y  =  0xF3B8;

I've put the files there ygrand.tgz

a rand()... or so.
02/03/2025 at 15:11 • 0 comments
I have recently received inquiries from two projects :
- Defora (anything at this point is better than this)
- TrapC (in need of a good library/implementation)
Both are interested by a new/better implementation of the rand() function and it would be beneficial to merge these two requests.

First, I need a good definition of the requirements : size of the state, width of the registers... I'll first try a typical version with two variables, not much larger than the historic implementation (congruential generators have only one register). I suppose that the variable seed must be adapted...

Already, a limitation : POSIX defines rand() as returning an int limited by RAND_MAX>=32767
```
The <stdlib.h> header shall define the following macros which shall expand
 to integer constant expressions:
{RAND_MAX}
    Maximum value returned by rand(); at least 32767.
```
On my local installation, /usr/include/stdlib.h says
```
/* The largest number rand will return (same as INT_MAX).  */
#define RAND_MAX        2147483647
```
(Note: yet another reason that POSIX IS DEAD ! You CAN'T know your local size until you test the system, instead of being sure to have a certain width that lets you program comfortably everywhere).
.
Now, the recent libc has more than one function :
```
void          srand(unsigned);
void          srand48(long);
void          srandom(unsigned);
....
int           rand(void);
int           rand_r(unsigned *);
long          jrand48(unsigned short [3]);
long          lrand48(void);
long          mrand48(void);
long          nrand48(unsigned short [3]);
long          random(void);
double        drand48(void);
double        erand48(unsigned short [3]);
```
So it's a whole crazy mess... and https://pubs.opengroup.org/onlinepubs/9699919799/functions/srand.html adds more mess. But who uses it anyway ?

For TrapC,

We need
- a crypto random for crypto stuff,
- a streaming random for telecom stuff,
- a repeatable random for scientific stuff,
- and a good checksum hash that isn't random for hash tables...
```
     crypto_random_t cr;// for casino real-money gaming
     packet_random_t sr;// for encrypted streaming transmission
     pseudo_random_t rr("test 1");// for repeatable A/B testing
     hash32_t h32("hello world");
     hash64_t h64("hello world");
```
And constructors to automatically do the right thing to seed generators. An application programmer doesn't need to know how they work to use them.

.
Let's go back to POSIX.
Checksum w16 on 6309
02/02/2025 at 01:02 • 0 comments
My friend François designed a SBC sporting a Hitachi 6309E, a faster and extended version of the venerable Motorola 6809. Looking at its instruction set table, I find that it has instructions that handle 16 bits at once on the D register (made of the A and B byte accumulators). This is actually pretty suitable for a PEAC16×2 algorithm, which is way better than the xorsum currently used...

But François said :
- the program must use the LEAST number of bytes
- the checksum is reduced down to one byte
- the scanning range can be from one to 2^16 bytes.
PEAC16×2 works with pairs of bytes so the eventual extra odd byte must be handled separately at the end. And there is no notion of alignment, but the 6809 is a Big Endian machine. Since the constraint is placed on the code size and not the speed, the algorithm loop would work on individual bytes, not 16-bit words.

At first look, the algorithm is more complex and larger than xorsum but a first enhancement would replace the xor with add, and reuse the carry: this is not foolproof but requires very little effort to implement.
Replacing xor8 with adc8 divides the error rate by two :
.
PEAC modulo
11/05/2024 at 13:43 • 0 comments

The log 122. Funky modulo has some funny/funky/hackmemesque observations for lower-level implementations, but what about the higher level ?
It only occured to me today, thinking about gPEAC, that the wraparound of the sum S goes like
S' = (S / m) + (S % m)
Not sure how it's going to help with analysing the large-scale behaviour though, it's above my pay grade.
Hashing with PEAC39/2
11/04/2024 at 06:02 • 0 comments
Look at the logs
155. More candidates and 741454
156. The good numbers
163. Hash enhancements

The scan has harvested a very interesting number:
- m= 2^(39/2)-1 = 741454 = 2×7×211×251 = 10110101000001001110b is known perfect, using 20-bit registers.
- orbit = 549.754.775.568 = 111111111111111111100000010100000010000 in binary, takes about 8 minutes to scan.
- offset: o = 0111.11010101.11010101 = 7D5D5h = 513493 = 107×4799
With these parameters, we get a very effective hash algorithm for text strings, 2 bytes at a time (among others). Aside from the PEAC structure itself and its Fibonacci/Pisano scrambling with the addition,
- The modulus is conditional and increases the scrambling by seeding avalanche with 8 more bits,
- The offset is unconditional and increases the avalanche in normal ASCII characters, while also triggering "End Around Carry" much more often than normally with empty input.
TODO
- Check common divisors of 741454 | 513493 : none !
- Check the orbit length of PRNG with offset 513493
For this, the "scanner" must be modified to add the offset, which creates all kinds of ... hmmm algorithmic questions. Fortunately there are two ways to design a checksum with PEAC : single-carry or dual-carry. Single-carry is less parallel but uses a bit fewer operations. It's easy to adapt with the binary PEAC but I realise it's the first time I do this operation on a gPEAC: it's my first gPEAC checksum.

.

.

PEAC w64 checksum : x86-64 edition

10/18/2024 at 01:25 • 2 comments

I did it !

The fastest checksum (128 bits per loop) that is also solid (unlike Fletcher/Adler etc.)

Here is the C source code:

#include <stdint.h>

#define PEAC64_INIT1 (0x7777777777777777ULL)
#define PEAC64_INIT2 (0xDDDDDDDDDDDDDDDDULL)

void w64_checksum_asm(
    uint64_t *buffer,
    uint64_t *r,
    uint64_t *s,
    uint64_t WordPairs) {

  uint64_t
    X=WordPairs,
    Y=PEAC64_INIT2,
    I=PEAC64_INIT1;

  asm volatile("\n\t"
"  addq %[I], %[X]                \n\t" //  X += I;
"  clc                            \n\t" //  C = 0;
"                                 \n\t"
"  test %[WordPairs],%[WordPairs] \n\t" // WordPairs=0?
"  jz empty_buffer                \n\t"
"                                 \n\t"
"  .p2align 4                     \n\t"
"innerloop:                       \n\t"
"    adcq %[X] , %[Y]             \n\t" // Y += X
"    adcq (%[buffer]), %[X]       \n\t" // X += *buffer
"                                 \n\t"
"    adcq %[Y], %[X]              \n\t" // X += Y
"    adcq 8(%[buffer]), %[Y]      \n\t" // Y += buffer[1]
"                                 \n\t"
"    lea 16(%[buffer]), %[buffer] \n\t" // buffer+=2
"    decq %[WordPairs]            \n\t" // WordPairs--
"    jnz innerloop                \n\t" // while (WordPairs);
"                                 \n\t"
"empty_buffer:                    \n\t"
"  adcq %[X], %[Y]                \n\t" // Y += X
"  adcq %[I], %[X]                \n\t" // X += PEAC64_INIT1
: // outputs
  [X] "+r" (X), [Y] "+r" (Y),
  [WordPairs] "+r" (WordPairs),
  [buffer]  "+r" (buffer)
: // inputs
  [I] "r" (I)
: // clobber
  "cc" );

  *s = X;
  *r = Y;
}

And gcc -O3 -S compiles it down to:

w64_checksum_asm:
  movabsq $-2459565876494606883, %rax
  movq    %rcx, %r9
  movabsq $8608480567731124087, %r8
  addq %r8, %rcx
  clc
  test %r9,%r9
  jz empty_buffer
  .p2align 4
innerloop:
    adcq %rcx , %rax
    adcq (%rdi), %rcx
    adcq %rax, %rcx
    adcq 8(%rdi), %rax
    lea 16(%rdi), %rdi
    decq %r9
    jnz innerloop
empty_buffer:
  adcq %rcx, %rax
  adcq %r8, %rcx
  movq    %rcx, (%rdx)
  movq    %rax, (%rsi)
  ret

That's it !

With this, you get a 128-bit signature of a memory block, which can be filled with read(2) for example. It is hard to do significantly better now, because the memory becomes the bottleneck.

One last trick is the use of ADCX and ADOX opcodes, to implement a separate carry structure and break the 4-instruction carry chain into two halves. But it won't work because decq affects the overflow flag.

Now I have to check the primary orbit of w64...

TODO: scans
10/17/2024 at 16:48 • 0 comments
Things have progressed again and a new branch of the PEAC family has appeared, tuned for SW implementations where the masking operation is simply skipped. This completely changes the mathematics of the system.

Below are 2 examples with w26 and w31 which are "extended" (hence the "+") by not masking the sum after the addition.

w26 is known to be "perfect" but w31 does not pass the sieve.

However messing with the MSB(s) changes the rules and everything has to be reconsidered now. The difficulty is that the 32-bit words require immense amounts of computations and the others (based on 64-bit registers) are even worse.

So a thorough computation exploration is required for
- w26+
- w31 (not holding my breath)
- w31+
- w63 (passes sieve)
- w63+
- w64
At least a first estimate of computability of the primary orbit.

Furthermore, another class of gPEAC needs to be evaluated: the 1.5× types, called "pseudo-binary", for use in Line Scramblers (see the previous logs such as 167. Line encoding with PEAC: it's alive) : the 2 MSB of the modulo are 1 and the rest is 0. These are powers of 2 multiplied by 3, so the subtraction only requires processing a few MSB and not the whole word. At least a few of these numbers should be in the list of the moduli I had scanned last year. I have found 3 only...
```
a=3 ; for i in $(seq 40); do a=$(($a*2)); echo $a; done
6 : perfect
12 : 10P
24 : maximal
48 : maximal
96 : maximal
192 : 191M 186P
384 : 380M, 378P, 374P
768 : 763M, 761M, 754P, 750P, 746P (w8)
1536 : 1528M, 1526P, 1516M, 1511M, 1504M, 1501M, 1498P
3072 : 3060M, 3038P, 3036M, 3018P
6144 : 6128M, 6118P, 6106P
12288 : 12280M, 12254P
24576 : 24563M, 24551M, 24549M, 24546P
49152 : 49050, 49084, 49094, 49128
98304  : 98300, 98298, 98285 98264 98254
196608 : 196605M 196598P 196594P          (w16)
393216 : 393208M 393195M 393186P 
786432 : 786418P  786403M 786388M 786390P 786391M
1572864
3145728
6291456
12582912
25165824
50331648
100663296
201326592
402653184
805306368
1610612736
3221225472
6442450944
12884901888
25769803776
51539607552
103079215104
206158430208
412316860416
824633720832
1649267441664
3298534883328
...
```
...

Stay tuned.
PEACLS error detection (and correction?)
10/13/2024 at 17:49 • 4 comments
I added error detection, it was in fact very simple and it seems to work very well.

I started playing with it, for example adding one bit of noise on the channel (in bold, scroll the listing).
```
Init=1 In=0000000000000000 Scr:10-0000000000000000 Sc2:10-0000000000000000  Out=0000000000000000 OK=U  
Init=0 In=0000000000000000 Scr:00-0000000000000001 Sc2:00-0000001000000001  Out=0000001000000000 OK=1  
Init=0 In=0000000000000000 Scr:00-0000000000000001 Sc2:00-0000000000000001  Out=0000000000000000 OK=1  
Init=0 In=0000000000000000 Scr:00-0000000000000010 Sc2:00-0000000000000010  Out=1111111000000000 OK=0
Init=0 In=0000000000000000 Scr:00-0000000000000011 Sc2:00-0000000000000011  Out=1111110111111111 OK=0
Init=0 In=0000000000000000 Scr:00-0000000000000101 Sc2:00-0000000000000101  Out=1111110111111111 OK=0
Init=0 In=0000000000000000 Scr:00-0000000000001000 Sc2:00-0000000000001000  Out=1111110111111111 OK=0  
Init=0 In=0000000000000000 Scr:00-0000000000001101 Sc2:00-0000000000001101  Out=1111110111111111 OK=0
Init=0 In=0000000000000000 Scr:00-0000000000010101 Sc2:00-0000000000010101  Out=1111110111111111 OK=0
Init=0 In=0000000000000000 Scr:00-0000000000100010 Sc2:00-0000000000100010  Out=1111110111111111 OK=0
```
The error is not detected immediately but the OK flag is consistently stuck to 0 afterwards.

What this means is : the epilogue of a packet should have 2 padding words to ensure the last data word has been correctly transmitted.

Another interesting aspect is the efficiency of the 2 prefix bits : even though they are not flipped during transmission, the reconstructed bits mismatch anyway ! So it does not matter so much that the prefix does not use the whole coding range, error detection occurs anyway and the carries help prevent all-0 words.

Even more interesting is the bit pattern of the altered reconstructed word: after 3 words, the output contains the inverse of the alteration ! Is this an opportunity for error correction ?
Further wiggling shows that it's a corner case, but finishing the packet with 2 blank words certainly detects multiple errors. Get the source code here.

TODO