Close
0%
0%

Another Table-Based Stream Scrambler

Non-reversible, non-cryptographic scrambler for PRNG, 16 bits at a time.

Similar projects worth following
0 followers
Reinventing RC4 but faster and even worse.

I'm working on PRNGs these days.

I have made a replacement for the POSIX rand() and srand() functions (see log a POSIXy rand() ) using a mixed structure: a basic 32-bit LFSR PRNG has its LSB mixed with a PEAC 16x2 scrambler, outputting 16 bits (15 due to POSIX) with little effort and 65 bits of internal state. I want to make it even smoother and the best way is to add another scrambler with a very different structure. Naturally I remembered the RC4 stream cypher and a table-based system looks interesting, with its very large state and apparent simplicity. Though I'm not doing a cypher and I have forgotten the details. And memory accesses are expensive, particularly 16bit ones.

Anyway, a 256-byte table is nice but my algo works with 16 bits at once so a sort of "parallel RC4" would be required. I don't want to have to deal with key initialisation so let's just borrow the SBoxes of AES : one is forward, the other is reverse, let's put them in parallel as they are reciprocal permutations. They must be pretty good, I assume.

So with this system, I parallelize two table lookups, which is "good" for modern microprocessors, and the state is extended to 512 bytes, a bit like RC4A. But wait : if one has to swap values, then BOTH tables should be accessed by the same two indices. So the pair of tables is turned into a double-width, dual-port LUT (but I can do that with a FPGA, no biggie). This leads to a new structure that provides 2×16=32 bits per dual access: now the table works as an expansion table!

The extra 16 bits can be reused :

  • as extra parameters for the "swap" which is now a 32-bit rotation / circular shift
  • as offsets for the address inputs

With this extra feedback, the system becomes almost self-sufficient and can output a reasonable stream when fed with a constant (or short cyclical) value. Using a rotation instead of a swap is another "nice touch" that upsets the usual balance of a permutation, yet preserves the total number of set bits.

Even though the LUT starts with a pair of AES SBoxes which are permutations, the goal is only to fill the table with a well-behaved array with as many 1s as 0s. The rotation keeps that balance but the system does not rely on the table being a perfect permutation. In some degenerate cases, the entry should have at least a few bits set, no 0xFFFFFFFF or 0x00000000, that's all.

This is not cryptographically secure though, and the primary design assumption was that the input provided a half-decent entropy source (think LFSR or congruential PRNG). But with a careful design (the method to offset the address bits), it can become autonomous.

Looking back at RC4, this new algo can compete on throughput (only one read and one write per byte) though it certainly has glaring loopholes (and the code has many word casts). Yet it's still "good enough" for a semi-decent PRNG and the implementation in FPGA is a breeze.

I'm sure enhancements will appear, as I have identified a point or two to adjust...

scrambler_startup.tgz

show the LUT state as the scrambler tries to bootstrap from lowest possible entropy.

x-compressed-tar - 14.03 kB - 02/11/2025 at 15:55

Download

ATBSS_02.tgz

2nd version. A bit better, issue found.

x-compressed-tar - 2.19 kB - 02/11/2025 at 03:09

Download

ATBSS.tgz

First draft. Compiles and runs.

x-compressed-tar - 2.31 kB - 02/10/2025 at 21:39

Download

  • Startup

    Yann Guidon / YGDESan hour ago 0 comments

    So far I have chosen the AES SBoxes to initialise the bi-LUT.

    Remember : with the rotation trick the only thing that matters is to initialise the LUT with an equal amount of 0s and 1s. The AES SBox seemed like a quick way to get you there cheaply, you can immediately pump data through it.

    Note : this is also a great way to make a biased generator (if the bias is not too strong), just initialise the required number of 0s and 1s in the LUT for the required ratio.

    Now what's the "expensive way", the "worst case" ? Initialise the LUT with one half 0xFFFF and the other 0x0000. And no entropy at all in the inputs : keep them at 0. And FG=0 too. This forces the system to bootstraps into PRNG mode the hardest way... It was not easy and several shortcomings have been found.

    • First, one of the inputs now gets XORed by a constant, though it could be a bit better than that.
    • Second, the rotation is now extracted from a independent 5-bit LFSR (thus giving values from 1 to 31, excluding the 0 speeds mixing up).

    I was not happy to add more internal bits, and even less happy to add a counter : this is the sign that "something wrong" happens and needs to be broken up. Well, in this case, it is... But then what is the minimum number of bits ?

    First test estimate 10K iterations but it's hard to be sure, when can the LUT be considered "OK" ?

    Have a look at scrambler_startup.tgz

    (note: I use Braille Unicode to represent 8 bits in one glyph, displays OK with chrome)

    The test starts at

    LUT:
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
    ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
    ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
    ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
    ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
    ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
    ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
    ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
    ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
    

    Can it be worse ?

    The trick is to start it up. Hence the 0xAA as a XOR in one input, though it would be better fed from a LFSR8 (which also feeds the rotation). But...

    Read more »

  • First log, second stage.

    Yann Guidon / YGDES16 hours ago 0 comments

    I just started the project and I'm already questioning a few early design choices.

    The first stage mixes D with F & G to provide the two addresses, and R is a direct result of the table lookup. There is not much to be done there, and the input entropy is meant to be already "minimal".

    The second stage does the dual update (regardless of A and B being identical). That's what I feel needs an enhancement.

    • First, the source of the twiddling is F and G, which is a dubious choice because it's also used for mixing in the first stage. I would have preferred another source but it's the best I have so far, and using E and H would leak information. I don't want a counter because it could leave a linear trace at the output, in some way, and it's one weakness of RC4.
    • Then the compaction function is just a XOR of 2×5 bits, it's not bad but not the best. A different reduction function would be required. I have considered PopCount but it has a pretty strong bias toward 16, it would be more predictable.
    • And the barrel shifter (that performs the rotation) is one easy simple operation but it takes quite some area (5×32 MUX2). It's not the fastest but I don't know yet a better method to shuffle bits. The number of set bits MUST remain equal to prevent bias at the output, so multiplies or division (integer or Galois) are not possible either.

    I have thought of extending the width of the LUT but 32 bits would only make the problem worse and the source of S=E^H would not be solved.

    For now, it works "well enough" to complement the basic rand(), since the LUT adds a pretty strong non-linearity to the output so it masks short-term correlations and greatly extends the period.

View all 2 project logs

Enjoy this project?

Share

Discussions

Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates