
YGREC32

because F-CPU, YASEP and YGREC8 are not enough and I want to get started with the FC1 but it's too ambitious yet.

The YGREC32 is an experimental superscalar shallow-pipeline RISC microprocessor core, meant to issue up to 3 instructions per cycle thanks to its split register set. Instructions are fixed-width (32 bits) and each can only be decoded by one corresponding unit, though this is not LIW.

This is a successor of the YASEP toward superscalar execution, or a scaled-down version of the F-CPU FC1, so it's smaller but borrows many of the design features without their creep.

Both data and instruction memories are decoupled from the core: each request is managed through a 3-bit "handle" (Target ID or Address register number), that can be tied to a register or equivalent. This vastly reduces complexity and latency by allowing several memory accesses to be interleaved, without requiring OOO or speculative execution.

PRELIMINARY

May 8th, 2024: it's only getting started.

version 2024-08-18

Class for comparison : i586 (no MMX), i960/i860, MIPS R3000 or R5000, SPARC V7/8 or LEON, RISC-V, ARM Cortex-R...

  • Type : embedded safe/secure, 32-bit application processor for medium performance despite slow memory
  • 32 bits per instruction
  • 32-bit wide data words
  • 2 identical "globules" with 1 ALU, 16 registers and a dual-ported cache block each.
  • register mapped memory : 4 address registers and 4 data registers per glob.
  • CDI model: separate addressing spaces for Control (stack), Data and Instructions.
  • 24 bits for instruction pointers, that's 16M instructions or 64MB of code per module. 2^24 modules are possible simultaneously.
  • 2 very short parallel pipelines for data processing and a 3rd decoder for control/branch/stack instructions: ILP can go up to 3.
  • 8-entry "explicit Branch Target Buffer" with 8 more hidden/cached entries as well as 8 pairs of entries dedicated/linked to the control stack.
  • Multitasking suitable for Real Time/Industrial OS, light desktop or game console workloads.
  • Heavy computations are offloaded to suitable coprocessors.
  • Powerful tagged control stack
  • Some high-level single-cycle opcodes (and combinations thereof) provide basic control structures.
  • Resilient, safe and secure by design
  • Floating point ? Maybe later, we'll see.
  • Need 64 bits, more registers or SIMD ? Use its big brother the #F-CPU FC1 (tba)
  • Too overkill ? Use a microcontroller like the #YASEP Yet Another Small Embedded Processor (16/32 bits) or even the #YGREC8(8 bits).
  • Spoiler alert : it is not designed with Linux-friendliness in mind.

Rationale:

For now I'm only collecting and polishing the ideas. Several years ago I considered a streamlined YASEP with only 32-bit instructions but it would have broken too many things. The YASEP (either 16 or 32 bits) resides at a particular sweet spot but can't move significantly outside of it. OTOH a 32-bit mode for F-CPU would have been interesting but still too ambitious : F-CPU is a huge system, so even implementing a simpler subset implies already having the whole system well figured out.

So YGREC32 is not really a cleaned-up YASEP. The use of a dedicated control stack does not fit well in the YASEP which will remain a "microcontroller". The YGREC32 is an application processor for multitasking environments that will run user-supplied code, even potentially faulty. It is still suitable for real time but not heavy lifting. It could be simultaneous-multithreaded for even better efficiency. Yet YGREC32 binaries would be easily executed by FC1 with little adaptation since it's mostly a subset, with half the globules and smaller words. Upwards compatibility/translation of YASEP32 is also possible.

A redesigned, pure 32-bit processor is a clean slate where I can develop & experiment with several methods such as #A stack. It becomes the first architecture to explicitly implement and develop the #POSEVEN model. It will be a shaky ride but hopefully it will help further our goals.

 
-o-O-0-O-o-
 

Logs:
1. First sketch (and discussion)
2. Second sketch
3. eBTB v2.0
4. a first basic module.
5. more details.
6. Indirect branch
7. How fast can it run ?
8. Globule symmetry
9. Data Stack protections
10. Restructuring the instruction bundles
11. Perverting RV
12. aligned access
13. Instruction grouping

  • Instruction grouping

    Yann Guidon / YGDES, 10/03/2025 at 18:20

    Linus complains again at https://lkml.org/lkml/2025/10/1/1140

     - expose your pipeline details in the ISA
    
      Delayed branch slots or explicit instruction
      grouping is a great way to show that you eat
      crayons for breakfast before you start designing
      your hardware platform

    Delayed branches were abandoned more than thirty years ago; even RISC-V has rejected them.

    Now, I'm not certain what "explicit instruction grouping" refers to. VLIW/EPIC/Itanium ? Or MMX, or even the decoding front-end of the P6+ family ?

    My experience with MMX and P6+ has taught me a few sour lessons and I applied them here. I think I have found a good compromise between ISA exposure, performance, evolution/compatibility and orthogonality.

    Explicit instruction grouping is not the real problem. It is required to keep the HW lean, fast and manageable. Not every platform needs or wants a huge reordering buffer that wastes energy and space.

    The first key to YGREC32 and its family is symmetry: this is what makes it scalable, and the same principles apply to 1, 2 or 4-way superscalar. More parallelism does not make sense since average ILP usually plateaus at 2 or 3. YGREC32 is the ILP sweet spot and YGREC64 provides more bandwidth for heavier computation loads.

    The second key is that the grouping is implicit. The same program runs well without the overhead of packing/framing fields, not only because the symmetrical architecture can remap registers to different globs, but also because the instruction itself (the destination register number) directs the decoder towards the corresponding glob.

    So an implicit grouping of symmetrical instructions is my answer to Linus, preserving scalability, performance and efficiency.
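The implicit dispatch described above can be sketched in a few lines of Python. This is only an illustration, with the bit positions assumed from the draft encoding in the "more details" log below (MSB=1 selects the control pipeline, and the next bit addresses the globule):

```python
# Hypothetical dispatch sketch: the target unit is selected implicitly by
# bits of the instruction word itself, so no packing/framing fields are
# needed in the ISA. Bit positions are assumptions, not a final encoding.
def dispatch(insn: int) -> str:
    """Route a 32-bit instruction word to its decoder."""
    if insn >> 31:           # MSB=1 : control/branch/stack pipeline
        return "control"
    if (insn >> 30) & 1:     # next bit addresses the globule
        return "glob1"
    return "glob0"

# Three consecutive words aimed at three different units can issue together:
bundle = [0x12345678, 0x52345678, 0x92345678]
assert [dispatch(w) for w in bundle] == ["glob0", "glob1", "control"]
```

Because the register/destination fields already imply the glob, the same binary runs unchanged on a 1-, 2- or 4-way implementation.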

  • aligned access

    Yann Guidon / YGDES, 10/03/2025 at 16:00

    Linus delivered again !

    https://lkml.org/lkml/2025/10/1/1140

    Many interesting points but let's focus on this one:

     - only do aligned memory accesses
    
       Bonus point for not even faulting,
       and just loading and storing
       garbage instead.

    It's 2025.

    Why would anybody perform an unaligned access ?

    I mean by that : mis-aligning a variable. It's slow and inconvenient, particularly for out-of-page access. It still exists for packed data (the BMP headers....) and that's why there is the insert/extract (IE) unit. But mostly you want to avoid that and keep your code lean.

    There is the case of getting a bogus/random pointer that forces loading garbage, and indeed this is caught by an exception (just check the LSBs). But a decent language shouldn't let unchecked pointers thrive. That's why my architectures have two levels:

    • Known good aligned pointers to full-length words are just like register access.
    • Anything else goes through the LSU/Shifter/alignment unit. This can throw an exception.

    A further mechanism can be added:

    if a register read is performed on a data register where the address register has LSB set, it may trap.

    This is quite easy to implement, though it might have to ripple through the pipeline...
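A minimal sketch of the two-level access model, assuming 32-bit words (so "aligned" means the two address LSBs are clear); the path names are illustrative:

```python
# Two-level memory access routing: aligned full-word accesses behave like
# register accesses, everything else goes through the LSU/insert-extract
# unit, which is the place where an alignment trap can be raised.
def access_path(addr: int) -> str:
    if addr & 0b11 == 0:
        return "direct"   # known-good aligned pointer: fast path
    return "lsu"          # unaligned/packed data: slower, may trap

assert access_path(0x1000) == "direct"
assert access_path(0x1003) == "lsu"
```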

    However, in the case of #aStrA : Aligned Strings format with attributes , we want an unaligned pointer to work. Is it worth an additional opcode ?

  • Perverting RV

    Yann Guidon / YGDES, 02/25/2025 at 00:22

    Update : yeah, no, forget this brainfart.

    -----------------------------

    Let's talk about RISC-V.

    In the 80s and 90s its design principles and implementations would have been absolutely wonderful. But we're in 2025 and the RISC-V architecture is ... a big disappointment on many levels. I won't go into details why but the simple fact that I design YGREC32 will prove some of my points.

    The problem is : nobody expects YGREC32 or F-CPU, everybody "wants RISC-V", yet is unable to say why, apart from "that's what everybody else wants". And they don't want to pay an ARM or a LEG for a license. And there are tons of "free cores that implement it already". Just don't look at the details, or the favourable comparisons with x86 start to vanish.

    Now what can I do ?

    My Grogu is determined to design a RISC-V SoC and there is no way to stop him, the Force is so strong with him, but I fear he will corner himself sooner or later, trying to bow to "market pressure", and despite my attempts, I'm unable to help him on that outdated architecture. Damnit, I'm too old and cranky to even feel comfortable using a 6809 or even a 6309...

    But the "baseline" RV arch is designed for extensibility, leaving the 2 LSB cleared for "later use", and the base opcodes have MSB=00. Meanwhile, Y32 has 3 pipelines, each of which could take one of the remaining MSB combinations:

    • 00 : RV
    • 01 : Y32-control/stack
    • 10 : Y32 data pipeline / glob0
    • 11 : Y32 glob1
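This thought experiment can be expressed as a toy decoder (purely illustrative, not a committed encoding):

```python
# Toy pipeline selector for the "grafting" idea: the two instruction MSBs
# pick the pipeline, with 00 left to the RV32 base opcodes.
PIPES = {
    0b00: "RV",
    0b01: "Y32-control/stack",
    0b10: "Y32 glob0",
    0b11: "Y32 glob1",
}

def pipe(insn32: int) -> str:
    return PIPES[(insn32 >> 30) & 0b11]

assert pipe(0x00000013) == "RV"        # a base RV word (MSBs 00)
assert pipe(0xC0000000) == "Y32 glob1"
```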

    So with some work, it would be possible to "graft" Y32 over a RV32 core, maybe as an "accelerator" or something, or a "bi-mode" processor, and later jettison RV32, or even worse: reintegrate certain Y32 features back into the RV32 core. I can hear Dave Patterson howl at the moon from here.

  • Restructuring the instruction bundles

    Yann Guidon / YGDES, 12/07/2024 at 04:01

    I was very satisfied with my instruction-to-pipeline mapping, with one opcode for each globule followed by a control instruction. And today my ideas wandered somewhere else : what about the predicated instructions ?

    Due to the still high cost of dealing with branches, it would be awesome to be able to simply "selectively skip" a (small) number of instructions. This creates a whole lot of new issues, but it is required to reduce the overhead of small branches, given the still small number of branch target slots.

    First, how many instructions is it possible to skip ?

    I'd say : not more than a cache line. That's 8 instructions, or 3 cycles at full bandwidth.

    This can create some atomicity issues, particularly during decoding. And this also reshuffles the instruction bundle with the predicate in the first/leading position instead of the last.

    The point of predication is to conditionally "abort" instructions by preventing the result(s) from being written back to the register set. Otherwise, we enter "shadow register hell" and blow up the core's complexity. This also means that predicated instructions cannot reuse earlier results from the same predicate. This also implies that control instructions can't be predicated, and maybe no more than 4 ALU operations shadowed (2 cycles get us to the ALU result, which is just in time to decide whether the result is written back or not).

    OTOH this gives just enough time for the control circuits to fetch the relevant condition and propagate the decision.
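The write-back gating can be sketched very simply. This is a behavioural illustration only (register file size and calling convention are made up); the key point is that the instruction executes normally and only the commit is suppressed, so no shadow state is needed:

```python
# Predicated write-back sketch: the ALU result is computed unconditionally,
# but dropped at write-back when the predicate is false. The architectural
# register set is never provisionally modified, so no shadow registers.
def writeback(regs: list, dst: int, value: int, predicate: bool) -> None:
    if predicate:
        regs[dst] = value     # commit the result
    # else: result silently discarded, architectural state untouched

regs = [0] * 16
writeback(regs, 3, 42, predicate=True)
writeback(regs, 4, 99, predicate=False)
assert regs[3] == 42 and regs[4] == 0
```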

    .

    .

    BTW the predicated sequences work in a way similar to the "critical section" instruction so there is something to explore deeper here.

    .

    .

    .

  • Data Stack protections

    Yann Guidon / YGDES, 11/19/2024 at 03:34

    YGREC32 has several features that I described in recent articles; in particular it can "shield" the bottom of the stack from reads and/or writes. This means that a caller can prevent callees from extracting or altering its own state.

    There is a series of articles that cover the matter:

    The "torments of the monostack", or the "Single-Stack Syndrome"

    A history of stacks and their protection

    An(other) hardware stack for the two-stack model

    Beyond the function : unleash the full potential of the control stack !


    Recently I was thinking about what happens when a function returns but its data remain "on the stack" : the next call can scan these data and extract information.

    The usual approach is to flush the data on exit. This is wasteful, and new functions usually start by initialising/flushing their stack frame anyway. So this would be required only for "secure" functions, but that is another slippery slope : what needs to be secured ?

    So here is the proposal. YGREC32 already has the read_shield and write_shield ancillary registers, that trap when reading or writing below said addresses. Their values can be updated during call and return, thanks to the protected control stack. So why not make something similar but for the addresses above the stack frame ?

    The behaviour is different though :

    • reading above the given address will return 0 (or some canary value)
    • writing will work and replace the old value in the cache. This will also invalidate the corresponding cache line, except for the written word.

    This way :

    • writing to the top of the stack will not trigger a cache read event
    • reading above the stack top does not leak information from previous calls.
    • all function calls are safe and fast.

    Of course this requires cooperation with the data cache...
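The proposed behaviour can be modelled in a few lines. This is a behavioural sketch only (class and field names are hypothetical, and the cache interaction is abstracted away): reads above the shield return the canary instead of stale data, while writes always succeed and make the written word readable again.

```python
# Behavioural model of the proposed "top shield": leftover data from a
# previous call can never be read back, but the current frame can grow
# upward by simply writing, with no prior flush and no cache fill.
class StackTopShield:
    CANARY = 0                    # or some other canary value

    def __init__(self, shield_addr: int):
        self.shield = shield_addr # addresses >= shield are "above the frame"
        self.mem = {}             # backing store (stands in for the cache)
        self.valid = set()        # words rewritten since the shield was set

    def read(self, addr: int) -> int:
        if addr >= self.shield and addr not in self.valid:
            return self.CANARY    # no information leak from previous calls
        return self.mem.get(addr, 0)

    def write(self, addr: int, value: int) -> None:
        self.mem[addr] = value    # always works; in hardware this would also
        self.valid.add(addr)      # invalidate the line except the written word

s = StackTopShield(0x100)
s.mem[0x104] = 0xDEAD             # leftover from a previous call
assert s.read(0x104) == 0         # not leaked
s.write(0x104, 42)
assert s.read(0x104) == 42        # freshly written word is readable
```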

  • Globule symmetry

    Yann Guidon / YGDES, 10/27/2024 at 21:35

    It seems I'll have to change, adjust, adapt my model of the globules because of the memory access patterns.

    The "symmetrical globule" idea is great because it's simple, efficient, easy. The functions can also be merged during one cycle by issuing an instruction pair to perform operations that are more complex than what a single instruction could do alone : 2R2W or 3R1W instructions such as 

    • addition with carries,
    • multiplication,
    • division,
    • barrel shift/rot,
    • bit insert/extract,
    • MAC maybe...

    Each globule is dumb and designed for speed, which limits their functionalities a lot, but new features naturally appear when they are coupled. And this breaks the symmetry, in particular for scheduling and the instruction decoder/dispatcher.

    The original idea was very simple : instructions are grouped in 3 consecutive words, or fewer, that follow the pattern G1 G2 C or any substring of it, so overall we can have the following sequences:

    G1
    G2
    G1 G2
    G1 G2 C
    G1 C
    G2 C
    C

    (that's a total of 2³-1=7 combinations, and not 8 because the empty set does not exist).
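The count is easy to verify: the legal sequences are exactly the non-empty order-preserving subsequences of (G1, G2, C), which a quick enumeration confirms:

```python
# Enumerate the non-empty subsequences of the (G1, G2, C) slot pattern,
# keeping the original order: combinations() does exactly that.
from itertools import combinations

slots = ("G1", "G2", "C")
seqs = [" ".join(c) for n in (1, 2, 3) for c in combinations(slots, n)]

assert len(seqs) == 2**3 - 1          # 7: the empty bundle does not exist
assert "G1 G2 C" in seqs and "G2 C" in seqs and "G1 G2" in seqs
```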

    But the pairing breaks this nice little clean system. A pair would be a special case of G1 G2 where the first opcode signals a pair, which constrains the second opcode in the pair. Each globule provides its operands and then both can receive a result, often after another cycle (or more, depending on the operation).

    But due to requirements in orthogonality, you may want your operands and results to go almost anywhere, right ? So you may end up with these sequences:

    P1 Q2
    P1 Q2 C
    P2 Q1
    P2 Q1 C

    where P is the first opcode of a pair, and Q its "qompanion". And you see that the order of the globules can now be swapped! You still have a requirement to have certain operands and results in separate globules but inhibiting P2 Q1 would add too much stress on the register allocator, I think.

    This also increases the number of opcode types : we have G, C, P and Q now. So the opcode requires 2 bits.

    Also the operands of the Q opcodes are implicitly complementary to the leading P opcode, so there is no need to distinguish Q1 and Q2, it's possible to save a couple of bits but I doubt it's worth the effort. Let's keep both for now.

    ....

    The system starts to break down when confronted with the constraints of accessing memory through the A/D register pairs. It is not a new idea : the CDC6600 (and family) uses dedicated address and data registers as well; however, 3 address registers are used for writing and 5 for reading, so the semantics are clear:

    • Write to a write register, and any write to the corresponding data register will trigger a memory write cycle.
    • Write to a read register and the read cycle is immediately triggered.

    On the YASEP/Y8/Y32, the A/D pairs are not committed to a direction, which allows read-modify-write sequences. It's all fine with the YASEP (16-32) and Y8 because the memory is tightly coupled and usually has one level, so triggering a read cycle upon register write is not an issue. However Y32 has more latency (in fact it's meant to deal with several cycles), so if you only want to write to memory, updating the A register could trigger a cache miss and a long, painful fetch sequence...
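The two triggering policies can be contrasted with a small model (names and structure are hypothetical): "eager" starts the fetch as soon as the A register is written, as on YASEP/Y8, while "lazy" defers it until the D register is actually read, so a pure store never triggers a cache fill.

```python
# A/D register pair model with two fetch policies. The point of the lazy
# policy: writing A then writing D (a pure memory store) performs zero
# fetches, avoiding a useless cache miss on the Y32's longer memory path.
class ADPair:
    def __init__(self, memory: dict, eager: bool):
        self.mem, self.eager = memory, eager
        self.addr, self.data, self.fetches = 0, None, 0

    def write_a(self, addr: int) -> None:
        self.addr, self.data = addr, None
        if self.eager:                      # YASEP/Y8 style: fetch right away
            self.data = self.mem.get(addr, 0)
            self.fetches += 1

    def read_d(self) -> int:
        if self.data is None:               # lazy: fetch on first actual use
            self.data = self.mem.get(self.addr, 0)
            self.fetches += 1
        return self.data

    def write_d(self, value: int) -> None:
        self.mem[self.addr] = value         # a store needs no prior fetch
        self.data = value

mem = {0x10: 7}
lazy = ADPair(mem, eager=False)
lazy.write_a(0x10)
lazy.write_d(9)                             # write-only access: zero fetches
assert lazy.fetches == 0 and mem[0x10] == 9
```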

    .

    .

    .

    .

    ... tbc ...

  • How fast can it run ?

    Yann Guidon / YGDES, 10/27/2024 at 17:46

    I just came across this very recent measurement on the latest Intel 2xx series chiplet-based CPU. It's interesting because it gives a modern view of the best performances and latencies the bleeding edge can achieve.

    See the source there:

    The main memory is still slow, hovering around 100ns, nothing new here. The quoted 130ns is from a misconfigured motherboard and an update brings it to a less sluggish 83ns, which has immediate effects on the performance. Further architectural enhancements can only come from the usual methods here : larger buses, more channels, more banks, tighter controller integration...

    The interesting parts here are the latency ratios of the cache levels : L2 ≈ 5.8×L1, L3 ≈ 5.6×L2.

    Oh and that system can clock to 5.7GHz, so one L1 hit takes 4 clock cycles, and God knows how many instructions one such core can execute in that time.

    An interesting outlier is this datapoint : L1 copy is faster than read or write. Some mechanism is at work here, probably "locality" or "streaming", with a large bundle/buffer that groups words and aligns them or something.

    Note that 5.7GB/s at 5.7GHz amounts to one byte per cycle. That's decent but not much.
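A quick back-of-envelope check of the quoted figures (5.7 GHz clock, 4-cycle L1, the ~5.8× and ~5.6× inter-level ratios, ~83 ns DRAM after the BIOS update):

```python
# Sanity-check the latency figures quoted above, in nanoseconds.
clock_ghz = 5.7
l1_ns = 4 / clock_ghz          # 4 cycles at 5.7 GHz: about 0.70 ns
l2_ns = l1_ns * 5.8            # about 4.1 ns
l3_ns = l2_ns * 5.6            # about 22.8 ns
dram_ns = 83

assert abs(l1_ns - 0.70) < 0.01
assert dram_ns / l3_ns > 3     # DRAM remains several times slower than L3
```

So even with three cache levels, each level only buys roughly a 6× latency cushion, which is the ratio the rest of this log builds on.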

    .......

    The YGREC32 is not designed to work with the fastest silicon processes and most expensive foundries but the above numbers are rough estimates for the ratios that will be encountered in a "good enough" implementation. The point is that the architecture must be scalable and requires only little adjustment during shrinking.

    The first thing we can conclude is that without L3, yet still following the "6× per level" rule-of-thumb, the CPU can't run much faster than one GHz or so. More or less, it's the L3 that enables the latest CPUs to reach and keep running above (say) 1.5GHz.

    The other lesson is that memory latency is really, really important. An in-order CPU spends a lot of time waiting for data, leading to ever-increasing cache sizes and levels, as well as very deep out-of-order cores. Increasing the number of cores is another aspect, and Simultaneous MultiThreading helps with keeping the cores busy.

    The worst problems come from OOO and the "solution" for this, explored in Y32, is to explicitly prefetch. The data are already playing the prefetch game since the D/A register pairs implicitly work like "reservation stations". Y32 has 16 such stations for data and 8 for code. This is comfortable maybe up to 500MHz or so. Faster clocks will require SMT, which is easy with Y32 but more than 2 threads will significantly bloat the core's area, so more cores become necessary.

    OK, let's have two cores with 2× SMT for a decent speed. Now we hit two major walls : the FSB bandwidth wall gets hammered, and there is probably not enough parallelism in the workload to occupy the cores. In interactive workloads, the separate cores act as local caches for the threads, since there is less need to move data around, load into and spill from the caches. There is an incentive to keep as much data as possible onchip and swap only infrequently used data offchip. Speaking about swapping, this means that data blocks are more or less treated at the page level, or some such coarse granularity, to increase efficiency, but this also increases latency by preventing multiplexing.

    So overall, the clock range for Y32 would be from about 100 to 500MHz without extra architectural features. Out of the box, it could handle up to 16+8+2 concurrent memory transactions (data, instructions and stack) per thread. That's good but can only be exploited by good code, made by good compilers that can think beyond the old load-store paradigm.

    The limiting factor will be the number of "reservation stations" (D/A registers and eBTB entries). FC1 will double them but that might not be sufficient, yet there is only so much space in an instruction word. "Register windows" and remapping of addressable resources will become necessary to push beyond the gigahertz hill.

  • Indirect branch

    Yann Guidon / YGDES, 09/20/2024 at 05:17

    The current ISA draft is very incomplete... It is characterised by a very tight and strict (almost crippling) branch system focused on speed and safety. But this is insufficient in practice : a better compromise is needed, and "indirect" jumps/branches are required (or else coding can become miserable).

    An indirect branch can branch to one in a multitude of addresses that are stored in an array/table, to implement switch/case or other structures.

    To prevent arbitrary jump target injections, instruction addresses can only be located in instruction memory, which is considered "safe" (to a certain degree) because it is constant and immutable. The compiler is responsible for the validity of the contents of this space.

    Today's progress is the allocation of an opcode range, next to INV=0xFFFFFFFF, to hold indirect branch addresses. Instruction addresses are 24 bits long so the instruction prefix is 0xFE followed by the address. This prefix still decodes as INV though, because it is NOT an instruction but a pointer, so executing it must trap.

    Also this is NOT a JMP instruction, because the pointer should only be handled by a dedicated "indirect branch" opcode. However, JMP could be allocated as a neighbouring 0xFD opcode to merge some logic/function. So the opcode value works like a sort of "indirect branch target validation tag", similar to what is found in recent ARM architectures.

    There are still a number of issues to resolve though :

    • Range validation: how do you ensure that the address array is correctly indexed, with no out-of-range access ? I guess that the prefetch instruction must include a maximum index somehow, probably limited to 256 entries ?
    • Multicycle operation: the prefetch must first fetch the instruction memory for the pointer then do another fetch cycle to get the actual instruction. That's not RISCy...

    Food for thought...

  • more details.

    Yann Guidon / YGDES, 08/17/2024 at 01:44

    The Y32 architecture is mostly divided into four essential blocks, each with its dedicated memory buffers:

    1. Glob1 (with dCache1)
    2. Glob2 (with dCache2)
    3. eBTB (with iCache)
    4. control stack (with stack cache)

    A 5th block (with no memory access or array) processes data for multi-cycle and/or more than 2R1W operations (Mul/Div/Barrel shifter/...)

    Yet the YGREC32 is a 3-issue superscalar shallow-pipeline processor. This directly dictates the format of the instructions:

    • MSB=0 for "operations", that is: opcodes delivered to the globules to process data. Note: opcode 0x00000000 is NOP. The following bit addresses the globule, so there are 30 bits per instruction.
    • MSB=1 is the prefix of the "control" opcodes, such as pf/jmp/call/ret/IPC/spill/aspill/... This affects the control stack as well as the eBTB (sometimes simultaneously). Note: 0xFFFFFFFF is INV.
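A hypothetical classifier for this top-level split, with the two distinguished values NOP and INV at the extremes (a sketch, not the final decoder):

```python
# Top-level instruction classification from the draft encoding:
# 0x00000000 is NOP, 0xFFFFFFFF is INV, MSB=0 is a globule operation,
# MSB=1 is a control opcode (pf/jmp/call/ret/IPC/spill/...).
def classify(word: int) -> str:
    if word == 0x00000000:
        return "NOP"
    if word == 0xFFFFFFFF:
        return "INV"
    return "control" if word >> 31 else "operation"

assert classify(0x00000000) == "NOP"
assert classify(0xFFFFFFFF) == "INV"
assert classify(0x80000001) == "control"
assert classify(0x00000001) == "operation"
```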

    .

    The globules are very similar to the YGREC8 : a simple ALU (with full features : boolean, even umin/umax/smin/smax), 16 registers with 8 for memory access, and a 32-bit byte shuffle unit (aka Insert/Extract unit, for alignment and byteswap : see the #YASEP Yet Another Small Embedded Processor's architecture).
    It must be small and fast : add/sub, boolean and I/E must take only one cycle.

    Complex operations are performed by external units (in the 5th, optional/auxiliary block) that get read and write port access with simultaneous reading of both register sets to perform 2R2W without bloating the main datapath: multiply, divide, barrel shift/rotate... The pair of opcodes get "fused" and can become atomic, but could also be executed separately.

    Fused operations require an appropriate signaling in the opcode encoding.

    For context save/restore, it would be good to have another, complementary behaviour, where one opcode is sent to both globs (broadcast).

    It is good that control operations can be developed independently from the processing operations, though sharing the fields (operands and destinations) is still desired.

    .

    Both globs are identical, symmetrical, so there is only one to design, then mirror&stitch together.

    Each glob communicates with the other through the extra read port (the register sets are actually 3R1W), and are fed instructions from the eBTB. The only connection with the control stack is for SPILL and UNSPILL, using the existing read & write ports. Each glob provides a set of status flags, which are copies of the MSB and LSBs of the registers (plus a zero flag). Hence some of the possible/available conditions are

    • LSB0
    • LSB1
    • LSB2
    • MSB
    • Zero

    The multiple LSB conditions help with dynamic languages that encode type and other ancillary values in the pointers (including the #Aligned Strings format). MSB can be a proxy for carry.
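For illustration, here is how the per-glob status flags listed above could be computed; they are plain copies of register bits plus a zero detector (32-bit registers assumed, field names are mine):

```python
# Per-glob status flags: copies of the register's three LSBs and its MSB,
# plus a zero flag. No dedicated flags register, no hidden state.
def flags(reg: int) -> dict:
    reg &= 0xFFFFFFFF            # model a 32-bit register
    return {
        "LSB0": reg & 1,
        "LSB1": (reg >> 1) & 1,
        "LSB2": (reg >> 2) & 1,  # tag bits for dynamic languages
        "MSB":  (reg >> 31) & 1, # can serve as a carry/sign proxy
        "Zero": int(reg == 0),
    }

assert flags(0)["Zero"] == 1
assert flags(0b101)["LSB2"] == 1 and flags(0b101)["LSB1"] == 0
assert flags(-1)["MSB"] == 1     # -1 masks to 0xFFFFFFFF
```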

    Oh and yes, add with carry is a "fused"/broadcast operation that writes the carry in the secondary destination.

  • a first basic module.

    Yann Guidon / YGDES, 07/03/2024 at 10:32

    Here is a first attempt at writing a module that just prints "Hello world!".

    It shows the basic structure of a module as well as the IPC/IPE/IPR instructions that allow modules to call each other.

    ; HelloWorld.asm
    
    Section Trampoline
    
    ; init:
    Entry 0
      IPE
        ; nothing to initialise
      IPR
    
    ; run:
    Entry 1
      IPE
      ; no check of who's calling.
      jmp main
    
    Section Module
    
    main:
    ; get the number/ID of the module that writes
    ; on the console, by calling service "ModuleLookup"
    ; from module #1 
      set 1 R1
      set mod_str R2
      IPC ModuleLookup R1
    ; result:  module ID in R1
    
    ; write a string of characters on the console
    ; by calling the service "WriteString"
    ; in the module designated by R1
      set mesg_str R2
      IPC WriteString R1
    
      IPR ; end of program.
    
    Section PublicConstants
      mod_str:  as8 "console"
      mesg_str: as8 "Hello world!"

    I'm a bit concerned that the IPC instruction has a rather long latency but... Prefetching would make it even more complicated, I suppose.
