-
PEAC modulo
11/05/2024 at 13:43 • 0 commentsThe log 122. Funky modulo has some funny/funky/hackmemesque observations for lower-level implementations, but what about the higher level ?
It only occured to me today, thinking about gPEAC, that the wraparound of the sum S goes like
S' = (S / m) + (S % m) Not sure how it's going to help with analysing the large-scale behaviour though, it's above my pay grade.
-
Hashing with PEAC39/2
11/04/2024 at 06:02 • 0 commentsLook at the logs
155. More candidates and 741454
156. The good numbers
163. Hash enhancementsThe scan has harvested a very interesting number:
- m= 2^(39/2)-1 = 741454 = 2×7×211×251 = 10110101000001001110b is known perfect, using 20-bit registers.
- orbit = 549.754.775.568 = 111111111111111111100000010100000010000 in binary, takes about 8 minutes to scan.
- offset: o = 0111.11010101.11010101 = 7D5D5h = 513493 = 107×4799
With these parameters, we get a very effective hash algorithm for text strings, 2 bytes at a time (among others). Aside from the PEAC structure itself and its Fibonacci/Pisano scrambling with the addition,
- The modulus is conditional and increases the scrambling by seeding avalanche with 8 more bits,
- The offset is unconditional and increases the avalanche in normal ASCII characters, while also triggering "End Around Carry" much more often than normally with empty input.
TODO
- Check common divisors of 741454 | 513493 : none !
- Check the orbit length of PRNG with offset 513493
For this, the "scanner" must be modified to add the offset, which creates all kinds of ... hmmm algorithmic questions. Fortunately there are two ways to design a checksum with PEAC : single-carry or dual-carry. Single-carry is less parallel but uses a bit fewer operations. It's easy to adapt with the binary PEAC but I realise it's the first time I do this operation on a gPEAC: it's my first gPEAC checksum.
.
.
-
PEAC w64 checksum : x86-64 edition
10/18/2024 at 01:25 • 1 commentI did it !
The fastest checksum (128 bits per loop) that is also solid (unlike Fletcher/Adler etc.)
Here is the C source code:
#include <stdint.h> #define PEAC64_INIT1 (0x7777777777777777ULL) #define PEAC64_INIT2 (0xDDDDDDDDDDDDDDDDULL) void w64_checksum_asm( uint64_t *buffer, uint64_t *r, uint64_t *s, uint64_t WordPairs) { uint64_t X=WordPairs, Y=PEAC64_INIT2, I=PEAC64_INIT1; asm volatile("\n\t" " addq %[I], %[X] \n\t" // X += I; " clc \n\t" // C = 0; " \n\t" " test %[WordPairs],%[WordPairs] \n\t" // WordPairs=0? " jz empty_buffer \n\t" " \n\t" " .p2align 4 \n\t" "innerloop: \n\t" " adcq %[X] , %[Y] \n\t" // Y += X " adcq (%[buffer]), %[X] \n\t" // X += *buffer " \n\t" " adcq %[Y], %[X] \n\t" // X += Y " adcq 8(%[buffer]), %[Y] \n\t" // Y += buffer[1] " \n\t" " lea 16(%[buffer]), %[buffer] \n\t" // buffer+=2 " decq %[WordPairs] \n\t" // WordPairs-- " jnz innerloop \n\t" // while (WordPairs); " \n\t" "empty_buffer: \n\t" " adcq %[X], %[Y] \n\t" // Y += X " adcq %[I], %[X] \n\t" // X += PEAC64_INIT1 : // outputs [X] "+r" (X), [Y] "+r" (Y), [WordPairs] "+r" (WordPairs), [buffer] "+r" (buffer) : // inputs [I] "r" (I) : // clobber "cc" ); *s = X; *r = Y; }
And gcc -O3 -S compiles it down to:
w64_checksum_asm: movabsq $-2459565876494606883, %rax movq %rcx, %r9 movabsq $8608480567731124087, %r8 addq %r8, %rcx clc test %r9,%r9 jz empty_buffer .p2align 4 innerloop: adcq %rcx , %rax adcq (%rdi), %rcx adcq %rax, %rcx adcq 8(%rdi), %rax lea 16(%rdi), %rdi decq %r9 jnz innerloop empty_buffer: adcq %rcx, %rax adcq %r8, %rcx movq %rcx, (%rdx) movq %rax, (%rsi) ret
That's it !
With this, you get a 128-bit signature of a memory block, which can be filled with read(2) for example. It is hard to do significantly better now, because the memory becomes the bottleneck.
One last trick is the use of ADCX and ADOX opcodes, to implement a separate carry structure and break the 4-instruction carry chain into two halves. But it won't work because decq affects the overflow flag.
Now I have to check the primary orbit of w64...
-
TODO: scans
10/17/2024 at 16:48 • 0 commentsThings have progressed again and a new branch of the PEAC family has appeared, tuned for SW implementations where the masking operation is simply skipped. This completely changes the mathematics of the system.
Below are 2 examples with w26 and w31 which are "extended" (hence the "+") by not masking the sum after the addition.
w26 is known to be "perfect" but w31 does not pass the sieve.
However messing with the MSB(s) changes the rules and everything has to be reconsidered now. The difficulty is that the 32-bit words require immense amounts of computations and the others (based on 64-bit registers) are even worse.
So a thorough computation exploration is required for
- w26+
- w31 (not holding my breath)
- w31+
- w63 (passes sieve)
- w63+
- w64
At least a first estimate of computability of the primary orbit.
Furthermore, another class of gPEAC needs to be evaluated: the 1.5× types, called "pseudo-binary", for use in Line Scramblers (see the previous logs such as 167. Line encoding with PEAC: it's alive) : the 2 MSB of the modulo are 1 and the rest is 0. These are powers of 2 multiplied by 3, so the subtraction only requires processing a few MSB and not the whole word. At least a few of these numbers should be in the list of the moduli I had scanned last year. I have found 3 only...
a=3 ; for i in $(seq 40); do a=$(($a*2)); echo $a; done 6 : perfect 12 : 10P 24 : maximal 48 : maximal 96 : maximal 192 : 191M 186P 384 : 380M, 378P, 374P 768 : 763M, 761M, 754P, 750P, 746P (w8) 1536 : 1528M, 1526P, 1516M, 1511M, 1504M, 1501M, 1498P 3072 : 3060M, 3038P, 3036M, 3018P 6144 : 6128M, 6118P, 6106P 12288 : 12280M, 12254P 24576 : 24563M, 24551M, 24549M, 24546P 49152 : 49050, 49084, 49094, 49128 98304 : 98300, 98298, 98285 98264 98254 196608 : 196605M 196598P 196594P (w16) 393216 : 393208M 393195M 393186P 786432 : 786418P 786403M 786388M 786390P 786391M 1572864 3145728 6291456 12582912 25165824 50331648 100663296 201326592 402653184 805306368 1610612736 3221225472 6442450944 12884901888 25769803776 51539607552 103079215104 206158430208 412316860416 824633720832 1649267441664 3298534883328 ...
...
Stay tuned.
-
PEACLS error detection (and correction?)
10/13/2024 at 17:49 • 2 commentsI added error detection, it was in fact very simple and it seems to work very well.
I started playing with it, for example adding one bit of noise on the channel (in bold, scroll the listing).
Init=1 In=0000000000000000 Scr:10-0000000000000000 Sc2:10-0000000000000000 Out=0000000000000000 OK=U Init=0 In=0000000000000000 Scr:00-0000000000000001 Sc2:00-0000001000000001 Out=0000001000000000 OK=1 Init=0 In=0000000000000000 Scr:00-0000000000000001 Sc2:00-0000000000000001 Out=0000000000000000 OK=1 Init=0 In=0000000000000000 Scr:00-0000000000000010 Sc2:00-0000000000000010 Out=1111111000000000 OK=0 Init=0 In=0000000000000000 Scr:00-0000000000000011 Sc2:00-0000000000000011 Out=1111110111111111 OK=0 Init=0 In=0000000000000000 Scr:00-0000000000000101 Sc2:00-0000000000000101 Out=1111110111111111 OK=0 Init=0 In=0000000000000000 Scr:00-0000000000001000 Sc2:00-0000000000001000 Out=1111110111111111 OK=0 Init=0 In=0000000000000000 Scr:00-0000000000001101 Sc2:00-0000000000001101 Out=1111110111111111 OK=0 Init=0 In=0000000000000000 Scr:00-0000000000010101 Sc2:00-0000000000010101 Out=1111110111111111 OK=0 Init=0 In=0000000000000000 Scr:00-0000000000100010 Sc2:00-0000000000100010 Out=1111110111111111 OK=0
The error is not detected immediately but the OK flag is consistently stuck to 0 afterwards.
What this means is : the epilogue of a packet should have 2 padding words to ensure the last data word has been correctly transmitted.
Another interesting aspect is the efficiency of the 2 prefix bits : even though they are not flipped during transmission, the reconstructed bits mismatch anyway ! So it does not matter so much that the prefix does not use the whole coding range, error detection occurs anyway and the carries help prevent all-0 words.
Even more interesting is the bit pattern of the altered reconstructed word: after 3 words, the output contains the inverse of the alteration ! Is this an opportunity for error correction ?
Further wiggling shows that it's a corner case, but finishing the packet with 2 blank words certainly detects multiple errors. Get the source code here.
-
Line encoding with PEAC: it's alive.
10/13/2024 at 03:21 • 0 commentsI just had a first successful test run in VHDL using w16. It's not finished but the result is almost as good as I expected.
[yg@Rizen5 PEACLS]$ ./runme.sh In=0000000000000000 X:010000000000000000 Out=0000000000000000 In=0000000000000000 X:011101110101101101 Out=0000000000000000 In=0000100000000000 X:010111010111011011 Out=0000100000000000 In=0000000000000000 X:000100101101001001 Out=0000000000000000 In=0000000000000000 X:011100000100100101 Out=0000000000000000 In=0000000000000000 X:000000110001101110 Out=0000000000000000 In=0000000000000000 X:001100110110010100 Out=0000000000000000 In=0000000000000000 X:011101101000000010 Out=0000000000000000 In=0000000000000000 X:011010011110010110 Out=0000000000000000 In=0000000000000000 X:011000000110011001 Out=0000000000000000 In=0000000000000000 X:000010100100110000 Out=0000000000000000 In=0000000000000000 X:001010101011001010 Out=0000000000000000 In=0000000000000000 X:011101001111111010 Out=0000000000000000 In=0000000000000000 X:010111111011000100 Out=0000000000000000 In=0000000000010000 X:000101001011001111 Out=0000000000010000 In=0000000000000000 X:011101000110000100 Out=0000000000000000 In=0000000000000000 X:000010010001010011 Out=0000000000000000 In=0000000000000000 X:011111010111011000 Out=0000000000000000 In=0000000000000000 X:010001101000101011 Out=0000000000000000 In=0000000000000000 X:000001000000000100 Out=0000000000000000 In=0000000000000000 X:000010101000110000 Out=0000000000000000 In=0000000000000000 X:000011101000110100 Out=0000000000000000 In=0000000000000000 X:000110010001100100 Out=0000000000000000 In=0000000000000000 X:011001111010011000 Out=0000000000000000 In=0000000000000000 X:000000001011111100 Out=0000000000000000 In=0000000000000000 X:001010000110010101 Out=0000000000000000 In=0000000000000000 X:011010010010010001 Out=0000000000000000 In=0000000000000000 X:000100011000100110 Out=0000000000000000 In=0000000000000000 X:011110101010111000 Out=0000000000000000 In=0000000000000000 X:010011000011011110 Out=0000000000000000 In=0000000000000000 X:000001101110010111 Out=0000000000000000 In=0000000000000000 X:000100110001110110 Out=0000000000000000
I send 2 individual bits and the data on the line (X) is a nice mess. The output is exactly as expected however!
The "schematics" are almost straight out of 79. Taxonomy:
Encoder:
Decoder:
There is a slight delay due to pipelining/buffering so the output log above is "compressed".
I can play with the INIT_X and INIT_Y values, for now I use
INIT_X <= "0110110111011011"; INIT_Y <= "1101110101101101";
One thing I notice though is that the D bit is very dependent on the input message. When there is no input entropy, the prefix/HA output is either 00 or 01, barely ever 10. C seems to have a good behaviour but I need a way to toggle D more. Or even remove D...
.
.
.
But otherwise, so far it's a success, almost like expected.
.
_____________________________________________
.
I updated the "circuit" because the D should subtracted from the message, instead of added.
So this solves a "little problem" with the output that gets off-by-one in certain situations.
Get PEACLS_20241013.tgz !
The carries will give a lot of problems, for sure. The 2-bit prefix depends a LOT on the message's MSB, for example. But the XOR part is quite good.
-
Line encoding with PEAC : OK
10/09/2024 at 23:23 • 0 commentsThe idea started more than a year ago: 161. PEAC for line encoding
So PEAC has many advantages and I wanted to turn it into a "line coder" for differential twisted pair transmission, for example. It didn't work back then but I finally cracked it !
As you can see, this is the same scrambler as in 79. Taxonomy and I added a pair of gates (AND+XOR) to form a half-adder. The result is concatenated in front of the first 2 bits of a scrambled word, and that's it.
With only this circuit, you get:
- enhanced parity,
- spectrum spreading / scrambling
- clock resynchronization,
- DC suppression
- strict run-length limitation (up to w+1 bits)
Packet-start signaling is added by ORing a 11 pattern to the prefix, and replacing the payload with a "start" symbol. This forced prefix can't map to any valid transmitted data, and can be used for signaling other conditions as well.
This system works for any PEAC width, be it 4, 10, 16, 26 bits, depending on the BER, the medium's bandwidth, the timing constraints and the maximum length of a sequence of identical bits (000000... or 111111....).
I worked all the theory and it only begs to be implemented, maybe in a future #Tiny Tapeout project ?
.
.
.
dropping the mic.
-
w32 wasn't such a dud
10/01/2024 at 05:24 • 0 commentsSomewhere I write that the primary orbit of w32's length is 13.603.546.275 iterations.
Well, I don't know what happened. I'm testing again with another program right now and this number is far exceeded:
Step 3423.088.934.912 : X=889508404, Y=3377359528, C=1
I'm also testing w32+ simultaneously: it uses 64-bit registers and the carry is simply the MSB shifted back to the LSB... it's pretty wild!
I adapted the scanning program to handle 64-bit data. When working with w63, -DNOMASK saves 25% of the computation time and the result looks as good.
20241002:
Computing further (about 27h) gives these results :
- w32 masked:
Step 127242701111296 : X=486058137, Y=1452857267, C=1 Step 127246996078592 : X=57168942, Y=2026275220, C=1 Step 127251291045888 : X=2130391839, Y=3854448558, C=1 Step 127255586013184 : X=3974460549, Y=1367440861, C=0 Step 127259880980480 : X=4171287861, Y=349814740, C=0 Step 127264175947776 : X=1548631083, Y=2991865890, C=1 Step 127268470915072 : X=635132602, Y=3437845154, C=1 Step 127272765882368 : X=2468530850, Y=365303063, C=0 ^C 99801.70user 7.42system 27:43:43elapsed 99%CPU
- w32 nomask:
Step 127783866990592 : X=1857706233519333428, Y=6363298141668173779, C=432530938 Step 127788161957888 : X=10278906262506799708, Y=11191247221860114708, C=2393244361 Step 127792456925184 : X=17161298025364128903, Y=3297960277882192750, C=3995676065 Step 127796751892480 : X=10933138637106185040, Y=14836593303031674861, C=2545569706 Step 127801046859776 : X=1746085887977577584, Y=4191515253591217683, C=406542301 Step 127805341827072 : X=9757511789714895600, Y=5605426172831249234, C=2271847750 Step 127809636794368 : X=18277385174601529875, Y=5102702571660683175, C=4255535354 Step 127813931761664 : X=11054218837186404983, Y=17005938597805104479, C=2573760886 Step 127818226728960 : X=8927644014085209424, Y=8319196262501969693, C=2078629102 Step 127822521696256 : X=7085261420112952104, Y=6817447390030532634, C=1649665976 Step 127826816663552 : X=6780497954159397175, Y=14411798433420106112, C=1578707702 ^C 100167.06user 5.81system 27:49:45elapsed 99%CPU
- w63 masked (classic):
Step 98187247353856 : X=7789805987906940537, Y=8997492253897967197, C=1 Step 98191542321152 : X=6219391496852744986, Y=2932353114636017441, C=0 Step 98195837288448 : X=3513660373162068975, Y=5583561552522910818, C=1 Step 98200132255744 : X=5464065101111839226, Y=6559200458595146488, C=1 Step 98204427223040 : X=1704903053962015866, Y=689819550266971419, C=0 Step 98208722190336 : X=5773997793067942377, Y=1739895575921382638, C=0 Step 98213017157632 : X=8265381472875106818, Y=112466157831344603, C=0 Step 98217312124928 : X=8866559370984867717, Y=1274529812375955528, C=0 Step 98221607092224 : X=2616865902061394463, Y=2296577278457820939, C=0 Step 98225902059520 : X=4885176899669488212, Y=5742267305191855321, C=1 Step 98230197026816 : X=5923363790209385457, Y=6818607223108001127, C=1 ^C 101740.10user 5.92system 28:15:59elapsed 99%CPU
- w63 nomask:
Step 129712307306496 : X=13898979293651408146, Y=10471030960484516614, C=1 Step 129716602273792 : X=15651397243760147896, Y=6773005075823389117, C=1 Step 129720897241088 : X=17954653194078123381, Y=16120148392752158639, C=1 Step 129725192208384 : X=198529025183803037, Y=13573698087745720944, C=0 Step 129729487175680 : X=11965111940566923107, Y=14889642454769736172, C=1 Step 129733782142976 : X=6123098105022721077, Y=15600159587929872239, C=0 Step 129738077110272 : X=9623131821115452626, Y=7793938344049970617, C=1 Step 129742372077568 : X=5460162401730428220, Y=481501917958797404, C=0 Step 129746667044864 : X=364843809283491869, Y=16020168061567238065, C=0 Step 129750962012160 : X=18379832403448125132, Y=9352390193144058009, C=1 Step 129755256979456 : X=3070579473553777869, Y=11817703077574477677, C=0 Step 129759551946752 : X=4276236740962303982, Y=2211040804269366145, C=0 Step 129763846914048 : X=15164859771112835864, Y=11754352737906713717, C=1 ^C 101738.11user 5.51system 28:15:56elapsed 99%CPU
Apparently, due to some coding/compiler mystery, the masked w63 is slower than the unmasked (a 3:4 ratio). The w32s are about as fast as nomasked w63.
I have found where the w32 number came from : it's mentioned in the header comments of gPEAC_scan.c:
/usr/bin/time ./gPEAC_scan 4294967296 4294967296 f 4294967296(2147483647)=13603546275 M 1 candidates among 1 numbers (100.00000%) 1 hits among 1 candidates (100.00000%) 17.23user 0.00system 0:17.26elapsed 99%CPU
but there was a capacity overflow issue in the orbit length calculation, fixed since, but it was too late.
I shall investigate.
-
The dawn of PEAC63 ?
09/28/2024 at 01:44 • 0 commentsWe already know that 2147483648 % 7559=1984 so w31 is out. However PEAC63 still passes the sieve so it could be a good candidate. The question is : how good ?
PEAC63 is equivalent to 126 bits of signature, the MSB work as carry bits so we get 128 bits just after the additions. There is quite little mixing so hashing is out of question but it's still very good for rough checksumming of files on a 64-bit platform, using 32-bit data at every cycle. Furthermore, it's not very different at all from the existing code. Getting bit 63 maybe done by sign extension or simply a shift, or even testing the sign bit of an unsigned number.
Another funky thing to test is whether clearing the MSB/sign bit after it's copied back to the LSB, would affect the orbit's size. NOT clearing the MSB could save one operation and at this stage, it's pretty critical. Though with modern CPUs with OOO, it could not matter so much. So it must be tested.
But exhaustive scans are out of question. I should run a sequential scan on the primary orbit, for the sake of it, but 2^126 states is not realistic, not even parallelizeable because even an arc is larger than the 2^52 states of PEAC26. That's why the sieve was so critical...
The code would go like this:
#define WIDTH (63) #define MASK (0x7FFFFFFFFFFFFFFFULL) uint64_t C, X, Y; uint32_t data; C += X + Y; Y = X + (uint64_t)data; X = C & MASK; <= can we remove this ? C = C >> WIDTH;
I'm not sure of the numerical properties but that code looks damn fast, with few instructions and able to gobble 32 bits at once per cycle. Fingers crossed.
-
w26's second life
09/25/2024 at 23:37 • 0 comments(Damnit, the last log was exactly one year ago :-O )
.
(Warning ! wrong assumption ! w32's primary orbit is way larger than 13G ! See w32 wasn't such a dud)
.
Ideally, w32 would be perfect or maximal. It's neither. So it's not possible to design a mathematically sound CRC/checksum with 32-bit wide registers. In fact, w32's primary orbit is much smaller than expected...
So what's perfect ? w26.
It's suitable to checksum 16-bit wide words only, so it's slower than w32, which is used in some database code. But compare w32's primary orbit containing only 13.603.546.275 iterations, vs w26's 4.503.599.694.479.358 ! So w32 could be "good enough" for some cases but it doesn't benefit from the whole mathematical potential of the PEAC system.
I have not found any suitable candidate that is larger than 32. The sieve says the following widths should be tested: 51 52 54 55 59 60 and 63 (this last one is alluring right ?)
However this is now a 64-bit number and it's less suitable for embedded computing (often limited to 32 bits). It would work on a PC though.
The other solution is to take w16 and parallelize it, SIMD-style. Embedded CPUs seldom have this feature.
But I have found a trick.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ So we're stuck with w26 and 32-bit registers. That's still a 64-bit checksum in the end. Now let's do the math.
32-26=6 bits are unused in the register and do not meaningfully contribute to the result. They are affected by the data but the carry output is lost, never recovered nor re-injected so it's like a dead-end.
26-16=10 bits are not affected by the data that is injected for checksumming. Normally the input data is added in the LSB for convenience.
I guess you'll agree that 10 > 6 : This means that the data that "escapes" the register can be fully re-injected in the lower bits, above or along the input data. In fact, the PEAC checksums a part of itself, while still preserving the w26 properties. There are now 32+32=64 bits worth of entropy, not just 52: that's 4K more codes that are now unlocked.
Of course, at this point, it's important to check the actual orbit's topology of this new generator. Is it still "perfect" ? This extended system still accepts 16 bits per cycle only so the advantage over PEAC16 is only when needing a larger signature. But using the extended w26 provides twice as many bits and works for 32-bit platforms.
...Now, it's time to code.