Close

Enhanced Split Displacement

A project log for F-CPU

The Freedom CPU project has a log here too now :-)

yann-guidon-ygdesYann Guidon / YGDES 05/31/2024 at 07:340 Comments

(obsolete)

Thanks to the comments on the FB group, there are updates to the mechanism !

Duane said :

Yann's proposal is to replace a branch instruction's plain N-bit PC-relative address field with two fields: an M-bit signed offset added to the program counter's middle bits, and
an L-bit page offset that replaces the lowest L bits of the PC. Various values of M will work. His example was for M=2.

The page-offset address mode of the 6502 microprocessor is the M=0 case. A short branch could go anywhere within the current page being executed, but only in that page. Going anywhere else involved forming the target address in another way, such as indirect addressing through a pointer cell in page zero. PDP-8 had the same M=0 address mode.
If the branch op and the target op lie in different pages, the short branch cannot be used, even if their actual distance in words is small. How did users and tools cope with this? I think, by restricting such routines to lie entirely within a single page. If a routine crossed a page boundary, the program was rebuilt with a filler before that routine so that it began on the next boundary. The wasted space could maybe be partially filled by shuffling in smaller routines from elsewhere.

If L=0, you get conventional PC-relative addressing. This is easy for writers and tools to use. You can tell whether a given displacement will fit into the instruction at assembly time, just by counting how many instruction words lie between the branch and target.

Yann's example is with M=2 and L>0. The short branch will work, if the interval between branch and target spans at most 2 page boundaries. If the programmer and tools know nothing about how code chunks get positioned at run time, they must use the short branch format conservatively. Only for shorter distances that will work regardless of their actual offsets within page.
However, the tools could arrange that longer routines are moved up to the next page boundary, like in the 6502 and PDP8 cases. With that help, the tools could allow somewhat longer reaches between branch and target. This does not help for routines larger than 3 pages.

By picking M bigger than 2, the cases benefitting from page fillers get larger. But this means shortening L, and losing some of the hoped-for benefits for carry propagation delays or cache indexing.

So this is a great summary of the last log.

First new trick:

Now let me introduce K : the field with the index inside the cache line. Usually it is 2 or 3 bits for 32-bit instructions, for 128-bit and 256-bit wide lines respectively. This shortens the L field but does not change the range because they are simply concatenated.

But if M is +1 or -1, this can be qualified as a "inter-page" jump and there is a chance of cache miss or such. And compilers tend to align the targets of such branches (or calls) to the beginning of the cache line to reduce the penalty of having to fetch a second line immediately. Usually, branches get aligned to the first half of the line, so a 0 can be added between L and K. This doubles the range of the M field, so inter-page jumps get boosted to +2 and -2 pages.

The cost is a whole line of MUX...

Second trick:

The other trick (which is complementary) is to use M=2 to indicate:

00 : current page
01 : +1 page
10 : -1 page
11 : custom/indirect page

This adds another register to the architecture, which might be unused... Duane was referring to the 6502 and PDP-8 "zero page" scheme but the beginning of the positive range of instruction addresses might be used already.

This creates a symmetry between the forward and backward directions and allows easy full-range jumps.

So far I also keep L+K=12, with K=3 then L=9, or 512 lines of L1 Icache to directly access. This should be large enough while still be fast.

Result:

Jump/branch/call offset can then reach : 4096 arbitrary instructions in the current page, or 8192 locations both in the next or previous page, as well as the extended (indirect) page.

MM=00 : current page  => addr=  ...uuuuuLLLLLLLLLKKK

MM=01 : next page     => addr=   PC+1|2LLLLLLLLLK0KK

MM=10 : pref page     => addr=   PC-1|2LLLLLLLLLK0KK

MM=11 : indirect page => addr=     indRLLLLLLLLLK0KK

It's now more complex, for sure. We're far from the Canonical MIPS structure but it's somewhat efficient without being too cumbersome.

The L field is shifted left when M!=0, that's an OR2 gate, and fanout is only 8 which is still somewhat tolerable. It's still less effort than an adder!

Now, I guess the indirect page register is going to be messy to design properly...

Update:

The MUX to shift the address is a nuisance, but there is one simple trick to avoid it: shuffle the bits a little to reduce the decoding effort! Here is the new version:

MM=00 : current page  => addr=  ...uuuuuLLLLLLLLLKKK

MM=01 : next page     => addr=   PC+1|2KLLLLLLLLL0KK

MM=10 : pref page     => addr=   PC-1|2KLLLLLLLLL0KK

MM=11 : indirect page => addr=     indRKLLLLLLLLL0KK

So the MSB of K gets moved above the MSB of L and replaced by 0.

The new issue now is that there is a new type of granularity for the pages : 4K instructions for the current, and doubled pages that are not aligned, so the PC's MSB incrementer is less straight-forward.

Is it worth all the efforts ?

Discussions