The last log introduced a modified, explicit Branch Target Buffer. It has evolved again since then and is taking on a life of its own...
As already explained, it changes the ISA, the compiler strategy, the coding practices, almost everything, but it can scale up in operating frequency, or with a growing CPU/memory speed mismatch, provided the external memory supports pipelined transactions. The new ISA exposes some of this machinery and lets the compiler handle part of the overlapping of prefetches.
So the jump system works with two phases :
- The program emits a prefetch operation: this associates a jump address with a fixed ID. The visible ID is 3 bits, so there can be up to 8 simultaneously ongoing prefetch operations. For each ID, there is one 8-instruction cache line that is stored close to the instruction decoder, for immediate access. It usually takes several cycles to fetch the desired line and store it in the BTB.
- The actual "jump" instruction selects one of these 8 lines (with the 3-bit ID in the instruction) and checks the (possibly complex) condition. If the condition is verified, the selected line is chosen as source of instructions for the next cycle, and the PC starts to prefetch the next line(s) in its double buffer.
You can see this system as exposing a register set dedicated to instruction addresses. Each register is a target, so let's call them T0 to T7. These registers can only be overwritten by a prefetch instruction, or a "loopentry" instruction for example : there is no fancy way to derail the CPU's operation.
Each line has one of 4 states (modelled in the small sketch after this list):
- absent/invalid
- fetching/checking
- failure/error/fault
- valid
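Here is a minimal C sketch of how these visible targets could be modelled; the names, field widths and helpers are my own illustration for this log, not the actual Y32 definition:

#include <stdint.h>

/* the 4 states listed above */
enum line_state { ABSENT, FETCHING, FAULT, VALID };

typedef struct {
    uint32_t        address;  /* 24-bit target address given by the prefetch instruction */
    uint32_t        insn[8];  /* the 8-instruction line, kept close to the decoder */
    enum line_state state;
} target_line;

target_line T[8];             /* the 8 visible targets T0..T7 */

/* phase 1 : "pf" binds an address to a target ID and starts fetching in the background */
void prefetch(uint8_t tid, uint32_t address)
{
    T[tid].address = address & 0x00FFFFFF;
    T[tid].state   = FETCHING;   /* becomes VALID (or FAULT) several cycles later */
}

/* phase 2 : the jump instruction checks its condition and the selected line;
   when this returns nonzero, the next cycle decodes from T[tid].insn */
int jump_select(uint8_t tid, int condition)
{
    return condition && T[tid].state == VALID;
}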
This system mirrors the already studied mechanism for accessing data memory, but the BTB is much more constrained : each data fetch address is explicitly computed and stored into a dedicated "Address register" (in Y32: A0 to A7), which triggers the TLB and cache checks, such that "a number of cycles later" the corresponding data is available in a coupled/associated "data" register (in Y32: D0 to D7). The #YASEP has used this split scheme for two decades now.
And now, a similar decoupling scheme is also provided in the Y32.
So we get 8 visible registers : they must be saved and restored through all kinds of situations such as function calls, IPC or traps... yet there is no way to read them back into the general register file. The only way to get them out and back in is through the control stack, which can save 2 entries in one cycle.
So the pair T0-T1 can be saved and restored in one cycle, for example : it seems easier to pair them as neighbours. A whole set of targets can be saved in 4 instructions, 4 cycles. It still consumes a half-line of cache...
All the pairs of targets can be freely allocated by the compiler, with some hints and heuristics: the short-lived and immediate targets would go in the high numbers (T4-T7) while targets for later would go in the lower numbers (T0-T3). This makes it easier to save the context across function calls, as the buffer would be defragmented and there are fewer chances of saving a register that doesn't need backup.
...
Another thing I didn't mention enough is that the control stack (its neighbour) also has hidden entries in the eBTB. Let's say 8 more lines (or even pairs, for the double-buffering) which are allocated on a rolling basis. The stack manages the allocation and prefetch itself. Usually, these hidden lines are a copy of the current instruction buffer and its double/prefetch buffer, so a prefetch is only needed when the stack is emptying.
...
But the v2.0 goes beyond that: it manages the call/return and others, without the need for explicit saving/restoring by discrete instructions. These spill/unspill instructions are still possible but they take precious space (hence time).
What call and IPC now do is accept a 2-bit immediate (on top of the 3-bit target, the condition flags, the base address...) to say how many pairs of targets to push/pop automatically on the control stack.
But each pop or push takes a whole cycle, right ? Call already takes some time to push the PC and the context/base address.
The solution (which might sound familiar to Duane for example) is a register window. In our case of 8 targets, 16 physical lines are actually implemented and addressed as a circular buffer. This means that a "base index" is added to the instruction's 3-bit Target ID to get the actual number of the physical entry.
What call/return do is simply add their 2-bit amount to the base index (shifted by 1) to make new entries available, while the old entries (now out of reach) can get saved in the background on the control stack, during the following cycles.
The granularity of 2 entries per cycle simplifies a few things, such as the computation of the effective index of the target/line, since the LSB is not changed. The instruction provides 3 bits but only 2 must be added to the current index, which is 3 bits as well but whose LSB is implicit. That's a matter of a few XORs at most, so not a critical performance hog, unlike a more classical register window such as SPARC's.
- Addressing a line from an instruction's target ID field is relatively simple: of the 3 bits, the LSB is preserved (pass-through) and the remaining 2 MSB (0-extended) are added to the 3-bit Target Base Index (see the sketch after this list). Carry out is not calculated, it's a simple wrap-around.
- call instruction : 0-extended 2-bit field is added to the 3-bit TBI.
- ret only pops the destination, then also re-executes the call instruction to see how many levels must be restored, as well as which base register to restore.
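To make the index arithmetic concrete, here is a tiny C sketch with the assumed widths (16 physical lines seen as 8 pairs, a 3-bit Target Base Index); it only illustrates the arithmetic described above, not the actual logic:

#include <stdint.h>

/* physical line selected by a 3-bit target ID, given the 3-bit Target Base Index */
uint8_t target_line_index(uint8_t tbi, uint8_t tid)
{
    uint8_t lsb  = tid & 1;                 /* LSB passes through : selects the line within a pair */
    uint8_t pair = (tbi + (tid >> 1)) & 7;  /* 2 MSB added to the TBI, wrap-around, no carry out */
    return (pair << 1) | lsb;               /* one of the 16 physical entries */
}

/* call/IPC slide the window by their 2-bit "amount" field (counted in pairs) */
uint8_t call_advance(uint8_t tbi, uint8_t amount)
{
    return (tbi + amount) & 7;
}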
Initially I was thinking that the 3-bit Target Base Index should be saved in the PC's MSB but it is not necessary if the window is correctly emptied/spilled on the control stack. This TBI could be any value as long as the data is appropriate on the lines. It's all relative so who cares now ?
The control stack has a more straightforward, direct addressing scheme, with 3 bits in the PC's MSB, next to the type field.
A typical function call would look like :
; main function:
    pf.t7 subfunction       ; (a 24-bit immediate and 3 bits for TID)
    ; .. fill with some useful work if available
    ; while target #7 is being prefetched
    call t7, condition, a6, 2
        ; t7 : target line 7
        ; condition : the call can be "always" or check some flags
        ; a6 : the context/base pointer for the current function, to be saved.
        ; 2 : save 2 pairs of target lines, for reuse later.
    ; ... the rest of the function

subfunction:
    ; ... do something
    ; all 8 of the target lines are available
    ; and can be overwritten/spilled/unspilled if needed.
    ret                     ; (optional condition)
That's ... quite short, right ? Of course I didn't show the passing of parameters, which are moved/prepared in the sequence just after the prefetch instruction.
The function may choose to spill 0, 2, 4 or 6 targets across the call. During traps and/or IPC, it should be possible to also spill all the 8 current targets.
The depth of 16 entries is just enough to make the system work OK. 32 entries would use a lot more room for a smaller speedup.
The other very cool thing about this "target window" system : you can repeatedly call a function and the intermediate targets will remain in the cache, in the window, speeding things up even more. There is a small stacked context, not just on the control stack but also in the eBTB. This increases cache hit rates and frees some memory bandwidth in repetitive cases, without programming effort...
Discussions
> the temptation for AXP-like chaining becomes strong and Intel might still have some patents on this...
By AXP, you mean the Alpha internal designs. HP inherited those patents from DEC. All those patents have expired. Maybe you mean techniques used by DEC's StrongARM, which Intel inherited. Those patents have expired too.
While looking that up, I stumbled across Intel's APX announcements from this spring. Lots of new IP there.
I didn't realize until now that Intel is pushing a major add-on to the x86 integer CPU model. It looks like adding regs and conditional ops to compete with ARM64 and RISC-V. But mainly, to create an x86 superset that they will not license to AMD, thus splitting the market, they hope. Most software devs won't build twice just for a 10% gain. But this could become important in the cloud server market. Someday, when enough high-end chips are supporting APX.
> By AXP, you mean the Alpha internal designs.
Yes.
I'll think these things through... There are other priorities for now, as Y32 is very experimental and a lot of things must be considered, way more than for #YGREC8.
APX : we'll see if/how Intel manages this, this time.
AVX10 : Looking forward to it because I want to play with some features of AVX512 :-)
Does the BTB hold two consecutive cache lines for every target, or just one? If just one, then the sequential fetcher needs to deliver the second line quickly when a branch to a target is taken. If the BTB held the address of the second line, that would squeeze out one add cycle when fetching the second line and the ones following.
How many cycles ahead of the branch should the prefetch instr be? How many instructions is that? Most basic blocks are short. How many blocks ahead should the prefetch instr be placed? When I've looked at using existing machines' data prefetch ops, the placement depended very much on the latencies of a particular implementation.
> Does the BTB hold two consecutive cache lines for every target, or just one? If just one, then the sequential fetcher needs to deliver the second line quickly when a branch to a target is taken. If the BTB held the address of the second line, that would squeeze out one add cycle when fetching the second line and the ones following.
The target width depends on the physical implementation.
Ideally the target's cache width should not be too large to save on transistors/room.
I believe 256 bits would be "reasonable" for Y32, and 512 bits would be required for the more ambitious FC1, but then the latency to fetch the whole line gets humongous.
An L1 block with 3 cycles of latency would be "typical"; the prefetch would store 1 line in the eBTB and the next one in L1... if the memory bus is not overloaded.
Storing the incremented address would be possible, though then the temptation for AXP-like chaining becomes strong and Intel might still have some patents on this...
.
> How many cycles ahead of the branch should the prefetch instr be? How many instructions is that?
As much as possible, so the prefetch gets more chances to use leftover/idle memory cycles.
Hence having at least 8 addressable buffers is much better than 4, so the prefetches can be pipelined and queued.
.
> the placement depended very much on the latencies of a particular implementation.
yes but this is also limited by the number of in-flight requests, bound by the size of the target ID (here : 3 bits).
Note that here, it's not data prefetch, but instruction prefetch. There are similarities but also a few key differences, such as : if you don't have instructions, your data is useless ;-)
If you support both kinds of branches: prefetch/select pair vs single branch instr, then the compiler could and should choose the single form for all very-unlikely jumps. That focuses the prefetch mechanism and IDs and code cache fetch cycles on those code paths that are likely needed most. Dunno about termination paths from inner loops.
In various ways, you are relying on compile-time prediction of likely branch directions and making dynamic predictions less needed or less useful. In that case, it is worthwhile to allocate some opcode bits to the static prediction: mostly taken, mostly not taken, and perhaps uncertain. It would be great if a running program's code can be dynamically tuned according to its observed branch misses. Using the program code as its branch prediction cache.
Any call to another module must be treated as destroying all 8 ID bindings. Calls to a leaf routine in the current compilation unit could maybe arrange to only use ID regs not used and destroyed by the callee, in the same way compilers can theoretically assign data registers cleverly across internal calls. I don't know of any compilers that actually do that.
Some calls are for announcing fatal program errors, as in array bounds checks or nil pointer checks. Those are best encoded as trap instructions, not calls, and the compiler should understand that flow never comes back. So it needn't rebuild the destroyed ID regs.
Pushing ID reg values somewhere doesn't make sense I think. Each such save/restore action means that the corresponding cache lines get copied somewhere and back, at twice the net cost of just doing a new prefetch op instead.
What cache latencies and branch latencies are you designing for?
If block A branches unconditionally to block B and prefetching is working well, are there any pipeline bubbles in the non-branch instructions of A then B? How about if it is a conditional branch that resolves based on a final ALU op of block A?
A compiler could arrange code so that the most-frequent paths are mostly fall-through. That maximizes the use of cache lines sequentially fetched without the BTB ID mechanism. The cases that can't be fall-through are loop repeats. The less-frequent leg of if/then/else can be moved elsewhere. This technique is called Trace Scheduling.
Hi Duane,
that's a lot of comments and I'm not sure I can answer every single aspect.
From a higher perspective, for 60 years now we have roughly 2 options :
* simple, slow, in-order core, cheap but hurt by memory latencies.
* complex, power-hungry, you-know-the-deal out-of-order cores that speculatively execute whatever they can just in case.
And then there is the example of Itanium, which had ambitions that created issues because it couldn't accurately "predict" memory latencies; but that was a problem for data access, not instruction fetch, AFAIK.
So there is a required compromise to be found somewhere. OOO is king because it helps support ISAs that were designed more than 30 years ago (ARM got a refresh, an exception). OOO compensates for code compactness and spends large amounts of transistors on things that compilers should be able to handle easily by now.
Explicit prefetching of branch targets is a different way to get more performance while keeping the circuits relatively simple, mostly predictable, fast and cheap. It trades program size for core size, which is a decent thing to do today, in line with what RISC did in the 80s (trade code size for speed, bypassing the microcode middleman).
.
> If you support both kinds of branches: prefetch/select pair vs single branch instr, then the compiler could and should choose the single form for all very-unlikely jumps. That focuses the prefetch mechanism and IDs and code cache fetch cycles on those code paths that are likely needed most.
That makes sense (I had to re-read it though), but there is a little issue there :
as it stands, the instruction set can only do a direct, unconditional jump. There is not enough room for meaningful conditions. Jump is a sibling/variation of the "prefetch" instruction which already has little headroom.
The use of the direct jump for "unlikely branches" makes sense but by itself, a direct jump can't "decide". This would require a predicated skip instruction in front of it. This would be equivalent to prefetching the unlikely target right before the branch is (possibly) taken.
.
> Dunno about termination paths from inner loops.
What do you mean by this term ?
> it is worthwhile to allocate some opcode bits to the static prediction: mostly taken, mostly not taken, and perhaps uncertain.
I like static prediction but :
- branch hints are nice when there is a need of prediction,
- prediction is required to reduce the average wasted time when the pipeline has bubbles.
- With a shallow pipeline, the number of bubbles is reduced, less time is wasted.
- When the target instruction is already immediately available (thanks to prefetching) and the condition is evaluated fast enough, almost no time is wasted (it depends on the aggressiveness of the design, at most one cycle is lost).
When the design is aimed at avoiding bubbles (at least caused by control flow), static hints are not relevant. I seem to remember that the POWER/PPC architecture has static hint bits but I don't know if they are even used.
But this shines a light on the evaluation of the condition, which adds its own latency if too complex.
.
> It would be great if a running program's code can be dynamically tuned according to its observed branch misses. Using the program code as its branch prediction cache.
By this, do you mean "using the branch hints in the opcodes as a branch history table" ?
I am familiar with this idea, though it creates its own set of issues, that I will not address here.
I don't mind reserving 4 or 5 bits in the branch instructions "for later use". Because unlike the prefetch instructions, there is still some room, and I have no idea in which direction a design might evolve and if people will depend on its 5697th generation one day.
.
to be continued.
> Any call to another module must be treated as destroying all 8 ID bindings.
Yes
the initial idea was to discard the buffer, indeed. What a waste but there was no other way.
Now there is another way, with the windowed registers : the past entries are not flushed but merely out of reach, the line contents remain in L1/L2 and the addresses are spilled on the control stack.
I haven't fully designed this, yet, because like function calls, the compiler/programmer should be able to select how many entries are saved.
.
> I don't know of any compilers that actually do that.
Inter-function optimisation is not easy. Inlining and flattening is a first approximation...
Now, when you see the insane amount of optimisation the x86-64 platform received, it's not impossible.
.
> Those are best encoded as trap instructions, not calls,
But how do you implement these traps ?
it is actually equivalent to an IPC to a HW-controlled routine, instead of an instruction.
.
> and the compiler should understand that flow never comes back. So it needn't rebuild the destroyed ID regs.
That's a bit more complex than that, if you want to make exceptions restartable :-) for example on TLB miss. Hence the reuse of the IPC mechanism for traps, so the program can gracefully recover.
.
> Pushing ID reg values somewhere doesn't make sense I think. Each such save/restore action means that the corresponding cache lines get copied somewhere and back, at twice the net cost of just doing a new prefetch op instead.
The "costs" may not be that high, and might be justified. It's the first line of caching, after all.
There are 3 things :
* the target line ID (0 to 7) (in the instruction word, and it gets extended to 16 entries with the base index)
* the target instruction's address (24 bits)
* the line contents (256 bits)
When the entries are "pushed" by a call or IPC, pairs of 24-bit addresses are "spilled" on the control stack (sequentially, in the background).
When the target "register window/circular buffer" is full, the line/contents is dropped (should be still in L1/L2) and replaced/overwritten by a new value, but it is recoverable since the address is still saved on the stack. The contents will be re-prefetched in advance before it's needed.
The idea is to reduce the occurrences of re-prefetches because the memory interface is already strained, particularly if cache lines are 256 bits (that's 4×64-bit words to transfer in atomic bursts, and 4 cycles is a looong time, enough for 12 instructions)
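A rough C model of this background spill, under the same assumptions as above (what gets pushed is the pair of 24-bit addresses, the 256-bit contents are simply dropped); this is my own sketch, not the RTL:

#include <stdint.h>

typedef struct {
    uint32_t address;     /* 24-bit target address : this is what gets spilled */
    uint8_t  line[32];    /* 256-bit contents : dropped, should still be in L1/L2 */
    int      valid;
} win_entry;

/* one pair of addresses is pushed per cycle, in the background */
void spill_pair(win_entry win[16], uint8_t pair, uint32_t cstack[], int *sp)
{
    cstack[(*sp)++] = win[2 * pair + 0].address;
    cstack[(*sp)++] = win[2 * pair + 1].address;
    win[2 * pair + 0].valid = 0;   /* contents dropped, recoverable by re-prefetch */
    win[2 * pair + 1].valid = 0;
}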
.
> What cache latencies and branch latencies are you designing for?
branches : 0 or 1 bubble cycle.
memory latency : the tolerable latency is roughly the cache line width (in instructions) divided by the ILP.
Each cache line contains about 3 cycles' worth of instructions, with an optimistic average ILP of 2.6 (8 / 2.6 ≈ 3 cycles).
L1 should have an access time of about 3 cycles (if possible pipelined) but it could go up to 5, since the PC buffer is double-buffered (2 lines) and there is one cycle to increment the address.
.
> If block A branches unconditionally to block B and prefetching is working well, are there any pipeline bubbles in the non-branch instructions of A then B?
I'm not sure why there would be bubbles.
Computation bubbles would be caused by computation-specific conditions/constraints and/or memory availability.
.
> How about if it is an conditional branch that resolves based on a final ALU op of block A?
That is currently the critical path indeed. This is a physically incompressible circuit.
I expect, right now, that it will not be possible to branch on the result of the current cycle's computation, like the Pentium did (it could "pair" a conditional branch instruction with an ALU op and jump in the same cycle, but that path is too long here).
I am skeptical that compilers will adapt to this prefetch-then-branch scheme.
It involves coloring each basic block with a distinct color (BTB ID) and then propagating that color assignment back far enough across all other preceding blocks to schedule a required prefetch in time.
Without having clashes of the same color being needed for two different successor blocks.
You might spend a day imagining what graph algorithms would give the result you want, while looking for cases that create a coloring clash. Assume it doesn't work, and then look for why.
If I need to prefetch ID=1 to branch to codeblock A, can block A immediately overwrite ID=1 with another prefetch while executing that block and its fall-thrus?
Does this prefetch scheme still work, if the number of IDs is reduced to just one, ie zero bits?
What compiler algorithm would give a correctly-running program, if the number of IDs is two, ie using one bit? I distrust any hand-waving that says that having 8 IDs somehow makes all these coloring issues disappear.
Also: it doesn't matter if the instructions all fit into 32 bits or whatever, if the instructions can't cover all needed programs, or if the compiler algorithms needed are failure-prone heuristics. Itanium failed from expecting /demanding too much new cleverness from its compilers.
Skepticism is normal. Itanium has been a serious ... problem. Yet, as you know, GCC and LLVM are huuuuge complex machineries and I don't know many people who dare touch them. I won't.
Hence the side-project to create a "toy language" to showcase the prefetch algorithms.
It's either that or spending another lifetime trying to shoot a moving target "for the sake of compatibility".
.
> Without having clashes of the same color being needed for two different successor blocks.
The coloring is not required to be perfect. It's better if it is but consider this : the point is to do the best possible, with the available constraints and resources, and there will always be code that will not comply with our heuristics :-)
.
> If I need to prefetch ID=1 to branch to codeblock A, can block A immediately overwrite ID=1 with another prefetch while executing that block and its fall-thrus?
yes, once the block is branched to, you can reuse the target ID, even immediately, since the contents of the target buffers are copied to the running PC buffers. Unless you want to reuse the address for later.
> Does this prefetch scheme still work, if the number of IDs is reduced to just one, ie zero bits?
That thinking is the basis of the optimisation method I consider : start with "no prefetch" and then move the prefetches back and back and back... until no ID is available.
I consider creating a "jump" instruction, without prefetch, and the compiler's optimisation will replace each instance with a prefetch-select pair. Other approaches are possible too. But I know that's how I would do it "by hand".
This is similar to the method I use for the register-mapped data memory : create "load" and "store" pseudo-instructions and then tell the compiler to split the address generation from the actual use. But I'm not a compiler guru, I simply translate my mental gymnastics into graph operations...
The other resource allocation trick is to start from the end(s) of the blocks.
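Not a real compiler pass, just a toy C sketch of that "hand method" on a straight-line list of ops; the op encoding, the hoist-to-the-top policy and the allocation of IDs from T7 downwards are all my own simplifications for illustration:

#include <stdio.h>

#define LEN 6
enum kind { OTHER, JUMP, PREFETCH, SELECT };
struct op { enum kind k; int addr, tid; };

int main(void)
{
    /* toy input : work, work, jump 0x1234, work, work, jump 0x5678 */
    struct op code[2 * LEN] = {
        {OTHER, 0, -1}, {OTHER, 0, -1}, {JUMP, 0x1234, -1},
        {OTHER, 0, -1}, {OTHER, 0, -1}, {JUMP, 0x5678, -1}
    };
    int n = LEN;
    int next_tid = 7;                      /* short-lived targets from the high numbers */

    for (int i = 0; i < n; i++) {
        if (code[i].k != JUMP || next_tid < 0) continue;
        code[i].k   = SELECT;              /* the branch now only selects a line */
        code[i].tid = next_tid;
        for (int j = n; j > 0; j--)        /* naive hoist : the prefetch goes to the very top, */
            code[j] = code[j - 1];         /* a real pass would stop where the ID becomes live */
        code[0].k    = PREFETCH;
        code[0].addr = code[i + 1].addr;   /* the select slid to i+1 during the shift */
        code[0].tid  = next_tid;
        n++; i++;                          /* account for the inserted op */
        next_tid--;
    }

    for (int i = 0; i < n; i++)
        printf("%2d: kind=%d addr=0x%04x tid=%d\n", i, code[i].k, code[i].addr, code[i].tid);
    return 0;
}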
Does the prefetch instr give a static prediction of whether the branch is likely or unlikely to be taken? What is the role of branch prediction schemes here?
Can this remove all branch overheads for loop repeats, calls, and returns?
How does the prefetch instr get its target info, when that info comes from some protected structure like function pointers, object pointers, link tables, call stack?
In most cases, the prefetch instr will identify the target via a PC-relative address. Is that trusted? Is the entire code stream trusted?
>Does the prefetch instr give a static prediction of whether the branch is likely or unlikely to be taken?
No,
1) there is not enough room in the instruction
2) not enough room in the stack width to save the "hint".
The "likeliness hint" is traditionally held in the branching instruction anyway.
Now, is this hint needed when the pipeline is so shallow ? Most of the latency now comes from
1) computing/evaluating the condition itself (could be 1, 2 or 3 cycles while it goes through the ALUs...)
2) fetching the target's contents (a lookup into L1 and possibly L2 might take several cycles, not even considering going off-chip)
so a branch predictor is almost moot at this point.
The core can continue executing a few instructions following the branch, and hold back their results while the branch is not confirmed or the target not yet loaded, but this is not required if the compiler does its job (of prefetching well in advance) because branches become almost immediate.
.
> What is the role of branch prediction schemes here?
Branch prediction is not yet considered: since branching to a loaded/available target is so fast (one cycle only), branch prediction wouldn't bring significantly more performance, unless the code is badly written and the memory is slow and the caches are "cold" (like in most usual programs today).
Some dumb speculative execution (fall-through if target unavailable) is possible later but that's not the lowest hanging fruit so far.
Actually : if I can avoid branch prediction, I'll do it ;-)
.
> Can this remove all branch overheads for loop repeats, calls, and returns?
* loop repeats : yes, the "loopentry" instruction copies the next PC buffers to the designated target line, so it's immediately available (a couple of cycles later at most) upon loopback.
* calls : explicit prefetch is required because call requires extra fields.
* returns : this is already handled by the control stack (which is directly linked to the eBTB's hidden entries for the storage)
So the answer would be : yes, it would reduce most overheads.
> How does the prefetch instr get its target info
What sort of info ?
the pf instruction has barely enough room for the 24-bit address and the 3-bit TID field. It doesn't matter whether the target is another function or code in the same function.
.
> when that info comes from some protected structure like function pointers, object pointers, link tables, call stack?
* Call stack : there is a type field already (so the entry is tagged as a return address, not mixed with other functions)
* link table : moot. Use IPC instead.
* object pointers : in a separate addressing space
* function pointer : prefetch the function to a given/ABI-defined target and call the other function, which will know which TID to call to get the extra data.
.
> In most cases, the prefetch instr will identify the target via a PC-relative address.
Not anymore, since 24-bit code pointers are now the norm and require no computation.
This shaves another cycle and a unit in the critical datapath.
.
> Is that trusted? Is the entire code stream trusted?
yes-and-no.
The threat model works this way :
* "modules" are monolithic, static units of code.
* no dynamic linking, no mixing of code of 2 different origins inside a single code addressing space. It was a dumb idea and the lzma/ssh injection attack nailed the coffin anyway.
* your module(s) may be called by unsafe or even adversarial modules so CYFA.
* you can't know if your module calls unsafe or adversarial modules so CYFA.
so what is "trusted" is... almost nothing :-) The trust granularity is each module and you make sure you have safe, proven, verified, trusted "root modules" (the kernel).
Do you remember Ada programming and the famous quote "Ada treats you like a criminal" ? With POSEVEN you can mess with data pointers like crazy if you want (and suffer the consequences, but that's a well-known, necessary evil and is treated separately, be patient) but you can't inject any arbitrary code address, and even if you could, it would only be relevant in a specific module (as each module has its own 24-bit code addressing space) to limit any potential harm. You must consider that you may call code that wants to pwn you, or be called by it. And if you're the buggy or naughty one, your options for damage must be nil : you can code yourself into seppuku but this must not affect anything else.
To make this possible : the module's code is *not* modified, adjusted, patched (or only barely) when loaded into memory by the kernel. The only "dynamic" feature is to associate a module by its name to an integer ID, through a specific/static module function. You keep this "handle" ID for the module as long as you need to call it. But this does not modify the calling code itself. "Fixing" the trampoline offsets might be useful but not strictly required, if some ABI is well defined.
Yes, more examples will be required soon ;-)
This is acting as a specialized L0i code cache, but described as a kind of BTB or odd register file. For it to react more quickly to branches to recently used code lines, this cache will be indexed and searched based on branch target address. But unlike normal caches, it will likely need to be fully associative, with all held cache lines' tags tested in parallel. Larger caches use N-way set-associativity where a given address can only be found in a single column. Those get by with N comparators instead of a full set of parallel comparators. But here, the compiler controls where a code line is physically held, based on the 3-bit ID it chose.
Yes the "concept" is now blurred : is this L0, BTB, register file... who cares of the name when it works as intended ? :-)
I think it is more efficient than adding a dedicated unit to compensate for this or for that (like the Pentium did back in the days) so I look at the overall structure and purpose and I merge "things that do the same".
And as you have noted, this "buffer" is only accessed through the ID that is assigned to each entry. It's almost the entry's (relative) number. There is almost no need to look the address up, I don't see why it would be necessary: if it ever became required, just make sure L1 has a copy of the eBTB's contents. So no question of associativity for now.
And if the scheme works well, F-CPU/FC1 would double the directly accessible eBTB size with a 4-bit ID :-P
Is the code-prefetch instr always needed, or can it be omitted in infrequent code paths?
When does the code prefetcher fetch additional sequential code lines? Do they show up in time to avoid pipeline bubbles, for say the second code line of a function, or the second code line of an inner loop?
If the inner parts of a big function need more BTB entries, how does it temporarily evict the entries used by the outer parts? By explicit pushes and pops on the control stack? Some would be holding static values known to the compiler, and so could be overwritten to the old value by a prefetch instr. But some would have dynamic values, like return addresses or case jump targets or functions called via pointers.
Are the targets of branches and calls always to the first instr of a cache line?
Hello Duane !
Thanks for the questions !
.
> Is the code-prefetch instr always needed, or can it be omitted in infrequent code paths?
I consider implementing a "bare jump" instruction, which saves one instruction slot but comes with the expected latency...
The 8 targets "should be enough" to cover most cases and latencies, on top of the return targets managed by the stack. So the "bare jump" should be rarely needed, but it can be handy in some cases anyway.
"bare call" is not possible due to the other required fields.
.
> When does the code prefetcher fetch additional sequential code lines?
As soon as possible, usually when the target has been selected : the address of the target is incremented (that's a 24-3=21-bit incrementer) and used immediately for the PC's double-buffer.
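As a plain illustration (my own sketch, assuming instruction-granular 24-bit addresses and 8-instruction lines), the increment only touches the upper 21 bits:

#include <stdint.h>

/* next sequential line after a taken branch : only the top 21 of the 24 address
   bits go through the incrementer, the 3 in-line bits are reset */
uint32_t next_line_address(uint32_t target_addr)
{
    uint32_t line = (target_addr >> 3) + 1;   /* the 21-bit incrementer */
    return (line << 3) & 0x00FFFFFF;          /* wraps inside the 24-bit code space */
}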
.
> Do they show up in time to avoid pipeline bubbles,
That's the point :-)
Ideally, a couple of lines (that is: pairs of 8-instruction bundles) should be prefetched because you can't be sure the target lies in the first instructions of a bundle. Jumping to the 7th instruction of the prefetched bundle will certainly be "inefficient", particularly considering that the ILP is now 3 : 8 instructions only "last" for 2.5 cycles. So prefetching more than 8 instructions would be good, but that takes a lot of gates. L1 would hold the next line, hopefully.
The advice is usually to align the target address to the first instruction(s) of a cache line.
.
> how does it temporarily evict the entries used by the output parts? By explicit pushes and pops on the control stack?
Yes, this is one way. The other way is to overwrite the "frequent/near" targets; the compiler should be able to manage this.
.
> and so could be overwritten to the old value by prefetch instr.
Yes. The "new" 24-bit addressing space makes is totally straight-forward, no more relative addressing. Almost Cray-like rusticity.
.
> some would have dynamic values,
I am looking at this seriously though this is not the most critical situation. Safety must prevent arbitrary addresses from being injected so
* a sort of lookup table is considered
* if the number of choices is small then a string of conditional/predicated instructions could work too.
I suppose the design will evolve on this side too, under your constant prodding and inquiring ;-)