The last post was the first ever in this project, which was created about 3 years ago as a placeholder, I just wanted to keep track of interesting algos and have fun with them. I didn't expect to stumble upon what I just described... So I start to take it more seriously, not a one-white-night curiosity.
As always, the goal is to get a fairly good replacement behaviour with the least resources and latency. I found something with a good ratio: at 4 bits of history, it's 1 more bit than the IBM tree-PLRU used in the 486 and others, but this is not a significant overhead, because you'll need 2 more MESI bits per line. So 4 ways require (2*4)+4=12 bits per set of ways, instead of 11, no big deal here. And it's not far from the ideal 5.
The decoding can be done with logic gates only, no ROM required. And it can be simply switched from LRU to MRU on the fly (by forcing a hit on a field).
The algo only compares one half of the ways : that's just 2×2 bits to compare. And it compares only the ones to be soon evicted, leaving the uncertainty to the MRU half. So there are fewer bad surprises about lines that get evicted too early : there are at least 2 turns before it happens.
I also like that it is scalable and easily configured. Higher numbers of ways can be handled with little effort, since only one half of the fields are compared.
It is also energy efficient since repeated hits on the MRU half do not change the state: thus it is not needed to rewrite it back. Extra energy is needed when actual replacement occurs.
The system is based on the ideal LRU queue. For N entries, there are N fields (A, B, C, D, E... from the LRU to the MRU) that act as a FIFO until it is full. Let's have an 8-deep queue with slots A to H containing a permutation of numbers 0 to 7, initialised in increasing order:
input sequence: 0 1 2 3 4 5 6 7 LRU MRU slot: A B C D E F G H val: 0 1 2 3 4 5 6 7
When a new value is added from the sequence, it is inserted at the MRU level, and the other values are shifted down the queue, until the inserted value is found. For example let's add 2:
LRU MRU slot: A B C D E F G H val: 0 1 3 4 5 6 7 2 <==insertion
The value 2 is found at the slot C so all the values at the slots on the right move down one position, overwriting the 2, which is re-inserted at the MRU slot H.
Here is the problem:
- The comparisons and shifting get heavy very fast as the queue depth increases.
- We only care about the LRU field because this is the one we need to replace an old entry.
Here is the solution:
Only keep the M least recently used slots.
For example, for N=8, we can set M to 3. This requires 3×3 bits&comparisons for the queue. This means that during replacement, a given entry must not appear M times to be evicted.
The other beauty is that neither N or M need to be a power of 2. A 5-deep would require 6 bits of history since the
- A needs 5 values
- B needs 4 values
- C needs 3 values
B is encoded in 2 bits, A and C are encoded together in 4 bits because 5×3=15.
Here is the new problem:
Since the MRU slots are ignored, during shifting there must be one new value inserted. The value could be random but must not be one of the inputs. This is a less trivial question but it is still limited since only log2(N) bits must be "guessed". The wide range of possible values make it a good candidate for boolean simplifications (where creativity can be unleashed). The guess only depends on the existing M×log2(N) inputs, not on the desired/matching set.
- For N=4 M=2 it requires a 16×2 lookup and 2×2 comparisons.
- For N=8 M=3, there is a 512×3 lookup and 3×3 comparisons.
- For N=8 M=4, there is a 4096×3 lookup and 4×3 comparisons.
Of course, some heuristics can further decrease the LUT size, using common boolean techniques.
Discussions
Become a Hackaday.io Member
Create an account to leave a comment. Already have an account? Log In.