Close
0%
0%

Another Table-Based Stream Scrambler

Non-reversible, non-cryptographic scrambler for PRNG, 16 bits at a time.

Similar projects worth following
0 followers
Reinventing RC4 but faster and even worse.

I'm working on PRNGs these days.

I have made a replacement for the POSIX rand() and srand() functions (see log a POSIXy rand() ) using a mixed structure: a basic 32-bit LFSR PRNG has its LSB mixed with a PEAC 16x2 scrambler, outputting 16 bits (15 due to POSIX) with little effort and 65 bits of internal state. I want to make it even smoother and the best way is to add another scrambler with a very different structure. Naturally I remembered the RC4 stream cypher and a table-based system looks interesting, with its very large state and apparent simplicity. Though I'm not doing a cypher and I have forgotten the details. And memory accesses are expensive, particularly 16bit ones.

Anyway, a 256-byte table is nice but my algo works with 16 bits at once so a sort of "parallel RC4" would be required. I don't want to have to deal with key initialisation so let's just borrow the SBoxes of AES : one is forward, the other is reverse, let's put them in parallel as they are reciprocal permutations. They must be pretty good, I assume.

So with this system, I parallelize two table lookups, which is "good" for modern microprocessors, and the state is extended to 512 bytes, a bit like RC4A. But wait : if one has to swap values, then BOTH tables should be accessed by the same two indices. So the pair of tables is turned into a double-width, dual-port LUT (but I can do that with a FPGA, no biggie). This leads to a new structure that provides 2×16=32 bits per dual access: now the table works as an expansion table!

The extra 16 bits can be reused :

  • as extra parameters for the "swap" which is now a 32-bit rotation / circular shift
  • as offsets for the address inputs

With this extra feedback, the system becomes almost self-sufficient and can output a reasonable stream when fed with a constant (or short cyclical) value. Using a rotation instead of a swap is another "nice touch" that upsets the usual balance of a permutation, yet preserves the total number of set bits.

Even though the LUT starts with a pair of AES SBoxes which are permutations, the goal is only to fill the table with a well-behaved array with as many 1s as 0s. The rotation keeps that balance but the system does not rely on the table being a perfect permutation. In some degenerate cases, the entry should have at least a few bits set, no 0xFFFFFFFF or 0x00000000, that's all.

This is not cryptographically secure though, and the primary design assumption was that the input provided a half-decent entropy source (think LFSR or congruential PRNG). But with a careful design (the method to offset the address bits), it can become autonomous.

Looking back at RC4, this new algo can compete on throughput (only one read and one write per byte) though it certainly has glaring loopholes (and the code has many word casts). Yet it's still "good enough" for a semi-decent PRNG and the implementation in FPGA is a breeze.

I'm sure enhancements will appear, as I have identified a point or two to adjust...

ATBSS_02.tgz

2nd version. A bit better, issue found.

x-compressed-tar - 2.19 kB - 02/11/2025 at 03:09

Download

ATBSS.tgz

First draft. Compiles and runs.

x-compressed-tar - 2.31 kB - 02/10/2025 at 21:39

Download

  • First log, second stage.

    Yann Guidon / YGDES7 hours ago 0 comments

    I just started the project and I'm already questioning a few early design choices.

    The first stage mixes D with F & G to provide the two addresses, and R is a direct result of the table lookup. There is not much to be done there, and the input entropy is meant to be already "minimal".

    The second stage does the dual update (regardless of A and B being identical). That's what I feel needs an enhancement.

    • First, the source of the twiddling is F and G, which is a dubious choice because it's also used for mixing in the first stage. I would have preferred another source but it's the best I have so far, and using E and H would leak information. I don't want a counter because it could leave a linear trace at the output, in some way, and it's one weakness of RC4.
    • Then the compaction function is just a XOR of 2×5 bits, it's not bad but not the best. A different reduction function would be required. I have considered PopCount but it has a pretty strong bias toward 16, it would be more predictable.
    • And the barrel shifter (that performs the rotation) is one easy simple operation but it takes quite some area (5×32 MUX2). It's not the fastest but I don't know yet a better method to shuffle bits. The number of set bits MUST remain equal to prevent bias at the output, so multiplies or division (integer or Galois) are not possible either.

    I have thought of extending the width of the LUT but 32 bits would only make the problem worse and the source of S=E^H would not be solved.

    For now, it works "well enough" to complement the basic rand(), since the LUT adds a pretty strong non-linearity to the output so it masks short-term correlations and greatly extends the period.

View project log

Enjoy this project?

Share

Discussions

Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates