-
Speeding up by 20%, profitability almost reached
04/13/2021 at 15:36 • 2 commentsComputing SHA256 is a challenge for 6502 CPU inside C64. The main problem is that this 8-bit CPU has only three 8-bit registers while 32-bit arithmetic is required. You need at least 4 instructions (one for every byte) for each 32-bit operation. In practice it will be closer to twelve: 4 times (load, change, store).
Even though C64 has only 64KB of RAM, the primary way of speeding up programs is using more memory: unrolling loops and heavy use of lookup tables. For example for drawing pixels on the screen you could sacrifice some RAM to precalculate lookup tables for the memory address of every line of the screen.
If you look on SHA256 pseudocode in Wikipedia there are only few basic operations needed: NOT, AND, XOR, addition and circular right bit shift rotation with varying shift length.
Using lookup tables won't get us anywhere for anything except bit rotation. There is no benefit of using lookup tables for AND or XOR - it would take more time than the built in instructions.
The right shift case is different because it depends on the shift length. The naïve and memory-conserving approach is to use a loop. This is what CC65 runtime does in the arithmetic shift to the right built-in function.
But we need a circular shift - bits that fall out on the right end have to appear on the left side. So the actual operation is:
uint32_t right_rot_gen(uint32_t value, uint8_t count) { return value >> count | value << (32 - count); }
There are four 32-bit calculations here: two shifts, subtraction and OR.
Can we do better? Yes.
---------- more ----------Let's dive into right_rot function from sha2.c
The first thing we need is to get into 8-bit mood and start thinking in terms of bytes.
A 32-bit unsigned integer is just 4 consecutive bytes. 6502 is little-endian so the first byte of this array contains the lowest 8 bits.
uint32_t right_rotbuf; uint8_t *r = (uint8_t*)&right_rotbuf; // could be uint8_t r[4] here
Now it should be clear that shifting these 32-bits by 8, 16 or 24 bits means just moving those bytes around, without extra calculations. Note that only one of those conditions can trigger. This runs almost in constant time - one subtraction and five/six byte moves.
if (c>=24) { c = c - 24; tmp = r[0]; r[0] = r[3]; r[3] = r[2]; r[2] = r[1]; r[1] = tmp; } if (c>=16) { c = c - 16; tmp = r[0]; r[0] = r[2]; r[2] = tmp; tmp = r[1]; r[1] = r[3]; r[3] = tmp; } if (c>=8) { c = c - 8; tmp = r[0]; r[0] = r[1]; r[1] = r[2]; r[2] = r[3]; r[3] = tmp; }
What's left to do is to handle the case of shifting an 8-bit value by 1..7 bits to the right. What happens there? A naive approach still means a loop (up to 7) that would apply bit shift four times - for every byte of the result.
But observe that the shift is done up to 7 bits only. So only the current byte and the following one can be affected.
This is where a lookup table comes into picture. We need 2 bytes of result: new low bits of the current byte and new top bits of following byte for 7 possible shift counts, for every 256 possible values of the current byte.
These values are precalculated in generate_rot_tables() function using the standard C library for simplicity:
for (c=0;c<8;c++) { rot_lobytes[c] = malloc(256); rot_hibytes[c] = malloc(256); for (i=0;i<256;i++) { j = right_rot_gen(i << 8, c); rot_lobytes[c][i] = (j & 0xff00) >> 8; rot_hibytes[c][i] = (j & 0x00ff); } }
Going back to right_rot function we can use this data directly just by indexing into rot_lobytes and rot_hibytes. For every byte of the result we take the result of operation on current value (rot_hi) OR result of operation on the byte to the left (rot_lo).
if (c>0) { rot_lo = rot_lobytes[c]; rot_hi = rot_hibytes[c]; tmp = rot_lo[r[0]]; r[0] = rot_hi[r[0]] | rot_lo[r[1]]; r[1] = rot_hi[r[1]] | rot_lo[r[2]]; r[2] = rot_hi[r[2]] | rot_lo[r[3]]; r[3] = rot_hi[r[3]] | tmp; }
If c>0, then this part also runs in constant time, independent of the actual value of c.
Taken together this change improved the whole round calculation (two hashes) by 20%, by going down from 4.6s to 3.6s.
This whole thing can now be simply rewritten directly in assembly, removing C overhead.