The other LRU

If I want to be sure that my "invented here" LRU4 is both valid and better than other systems, I must compare them, which means I must also care for the existing systems.

The first obvious choice is the IBM-invented 3-bit tree-LRU (as named and described in the wikipedia page). Which is actually a recursive version of the obvious 1-bit LRU for 2-way associative cache. You can't make simpler than that: it's an inverter!

But for 4 ways, it gets a bit more complex, as quoted in the project description. Fortunately a sort of lookup table is provided:

state | replace      ref to | next state
------+--------      -------+-----------
 00x  |  line_0      line_0 |    11_
 01x  |  line_1      line_1 |    10_
 1x0  |  line_2      line_2 |    0_1
 1x1  |  line_3      line_3 |    0_0
   ('x' means don't care)
   ('_' means unchanged)

This can be condensed in a few gates:

No wonder this is a favourite method, but is it actually good ? It seems this is out of fashion in the last 20 years.

This is to be compared to the "true LRU", which uses 5 bits in the optimised case.

An MRU mode could be implemented by not updating the bits.

------------------------------------------------------

Another, more modern method (Itanium, SPARC...) is the 4-bit "bit-LRU" with a pretty weird scheme. Quoting Wikipedia:

Bit-PLRU

Bit-PLRU stores one status bit for each cache line. These bits are called MRU-bits. Every access to a line sets its MRU-bit to 1, indicating that the line was recently used.
At this point, OK I get it, pretty simple and it makes sense: "mark the accessed way as it is used". But then, it is different because after a while, everything will be marked. The method here looks trivial but unheard of:

Whenever the last remaining 0 bit of a set's status bits is set to 1, all other bits are reset to 0.
WHAT ? you just clear the whole history ???

So far this is the corresponding circuit:

At cache misses, the leftmost line whose MRU-bit is 0 is replaced.

There is a priority encoder now? why not a random value to prevent biases ?

This increases the logic depth, unless a ROM is used. The priority encoder is added below:

This circuit provides both the binary encoded number of the new set to replace, as well as the 4-bit set. The logic depth is 5 here (like the one I designed).

And I'm not sure how to design a MRU mode with it.

Maybe the trick for this system is to send 5 signals to the target line : 4 "set" signals, and one "reset all" signal (overdriven by the one set signal). Set-reset flip-flops take fewer transistors (one half) than DFF gates.

The bias might be how, over time, some lines might get moved toward the "left" and create a sort of natural LRU where the "rightmost" are evicted more often but I have only vague ideas about it and it should be tested as well as documented.

---------------------------------------------------

Update:

I looked at the referenced paper from 2010, where it's called "NRU" (Not Recently Used) and the description is "slightly different" !

So the position of the first candidate rotates, preventing a bias. Is the counter general, or stored with the lines ? The latter would take 2 more bits per line. And a counter is more "concerning" than a PRNG.

...............................................................................................................

And since we're in 2024, power consumption matters, and in this case signal activity (level changes) have a significant importance.
So the rate of changes must be included in the comparisons.

Tree-LRU changes the polarity of 2 signals at every cycle ! (and only uses 8 of the possible 24 permutations).

Bit-LRU might swap 4 bits at once from time to time, only setting one most of the time.

...............................................................................................................

Another detail I seem to have missed is the Hit/Miss behaviour. That means another log about this subject...

Bit-PLRU

MRU mode

MRU in L1

Discussions

Become a Hackaday.io Member