
Chasing performance

A project log for CPU running Basic

Celebrating 50 years of Tiny Basic by implementing a custom micro-coded 16/32-bit CPU that executes it directly (up to 100MHz)

zpekic 11/17/2025 at 06:21

I was really curious how the Basic CPU would perform in comparison with classic CPUs of the home computer era, and put some effort into optimizing the design with that goal in mind. Based on the benchmark tests, the goal has been only partially achieved. While in some cases the speedup factor of 4 to 8 looks noteworthy, my suspicion is that those Basic interpreters by default use software-implemented floating point for the benchmark test (I haven't explored them in depth), while the other group, showing a more modest gain of 1.5 to 2X, are "integer Basic" implementations and a fairer comparison.

CPU performance optimizations I used:

Clock frequency

A bit of "cheating" going on here - whole design is inside the FPGA which allows it to work at maximum available hardware clock frequency of 100MHz. Most importantly, the "core" RAM (Basic input buffer and program store) can be accessed in 1 or 2 clock cycles (so 10 or 20ns):

writeCore:    nWR = 0, if nBUSACK then repeat else return;

readCore:    nRD = 0, if nBUSACK then repeat else next;
        nRD = 0, MDR <= from_Bus, back;

This would clearly be impossible if the RAM were outside of the FPGA, even on the same board. Depending on the CPU clock and memory speed, one or more wait cycles would need to be added. The Anvyl board has a breadboard section, so in the future I may move the Basic core memory to a 62256-type device and experiment with, for example, how many wait cycles a 70ns vs. a 120ns memory chip needs.
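As a back-of-envelope check (a sketch of my own, not part of the design, assuming one base access cycle and ignoring FPGA I/O and board delays), the number of wait states for a given SRAM speed can be estimated like this:

```python
import math

def wait_states(t_access_ns, f_clk_mhz, base_cycles=1):
    """Extra clock cycles needed so a memory access of t_access_ns
    completes, given a base access of base_cycles at f_clk_mhz."""
    t_clk_ns = 1000.0 / f_clk_mhz
    total_cycles = math.ceil(t_access_ns / t_clk_ns)
    return max(total_cycles - base_cycles, 0)

print(wait_states(70, 100))   # 6 wait states for a 70ns part at 100 MHz
print(wait_states(120, 100))  # 11 wait states for a 120ns part at 100 MHz
```

At 100MHz this suggests 6 wait states for a 70ns chip and 11 for a 120ns chip; at the clock rates of classic home computers (a few MHz), both parts would need none.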

The other limiting factor is I/O speed. Currently, the maximum serial I/O speed is 38400 bps. Sending or receiving 1 byte over such a channel takes at least 10 bit times, during which the CPU has to wait for the ready signal.

outChar:    if CHAROUT_READY then next else repeat;            // sync with baudrate clock
        if CHAROUT_READY then return else repeat;

A FIFO queue on both input and output would help, but their implementation belongs to the Ser2Par and Par2Ser components, not the CPU (except for the trace serial output, but that one is only active up to 4kHz CPU frequency, so not much would be gained there).
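To put the serial bottleneck in numbers (a quick sketch, assuming the 10 bit times per character mentioned above):

```python
def cpu_cycles_per_char(baud, f_clk_hz=100_000_000, bits_per_char=10):
    """CPU clock cycles spent waiting for one character (start bit +
    8 data bits + stop bit) on the serial link."""
    return round(bits_per_char / baud * f_clk_hz)

print(cpu_cycles_per_char(38400))  # 26042 cycles of waiting per byte at 100 MHz
```

Roughly 26,000 clock cycles per byte at 100MHz, which is why any I/O-heavy Basic statement is dominated by the serial link rather than the CPU.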

Cycle overlap

The Basic CPU is a CISC processor, and pipelining is traditionally not a CISC strength. However, a very limited overlap between execute and fetch is exploited in a few instructions. Note that the fetch cycle has two entry points:

fetch:        traceString 51;                        // CR, LF, then trace Basic line number (in hex, for speed)
fetch1:        traceString 2;                             // trace IL_PC and future opcode
        IL_OP <= from_interpreter, IL_PC <= inc, traceSDepth;    // load opcode, advance IL_PC, indent by stack depth IL code if tracing is on
        T <= zero, alu <= reset0, if IL_A_VALID then fork else INTERNAL_ERR;    // jump to entry point implementing the opcode (or break if we went into the weeds) 

Most instructions finish their execute cycle and then branch back to "fetch" (no overlap). A few have the opportunity to execute the "traceString 51" operation in parallel with other operations, and when done can branch to "fetch1" instead - a 1 clock cycle overlap. This could be exploited more, but tracing at the beginning of the fetch cycle is a very convenient debugging tool, so this was an engineering compromise between two important goals (troubleshooting and performance).

////////////////////////////////////////////////////////////////////////////////
.map 0x12;                    // FV (Fetch Variable)
////////////////////////////////////////////////////////////////////////////////
traceString 36;                    // trace mnemonic
Vars <= indexFromExpStack, if STACK_IS_EMPTY then ESTACK_ERR;     // get index (variable name A-Z)
T <= from_vars, ExpStack <= pop1, traceString 51;        // T <= Vars(index)
ExpStack <= push_TWord, goto fetch1;                // push onto stack, go to 2nd fetch entry point as we overlapped 1 cycle

Parallel operations

The Basic CPU uses "horizontal microcode" with a fairly wide control word of 80 bits:

In theory, all 28 fields could independently specify an operation, so that all components within the CPU work in parallel. Obviously, the Basic CPU is far from this ideal. Execution control is by definition engaged in every microcycle, but on average only one or two others are. One of the better examples is this one:

directByte = ' ', writeCore(InlEnd, from_microcode), InlEnd <= inc;

The following operations are executed in the same clock cycle:

  1. MAR register is loaded with value from InlEnd register
  2. MDR register is loaded with immediate value carried in the directByte field
  3. InlEnd register is incremented
  4. Microprogram counter is loaded with the address of the "writeCore" routine, while the return stack entry pointed to by the microprogram stack pointer is loaded with the current address + 1 (a subroutine call can be done in 1 clock cycle due to the design of the control unit)

Four operations - a far cry from the 20+ possible. Interestingly, the MCC compiler could easily provide a static microcode effectiveness analysis by counting the average parallelization (not currently implemented). Even better would be a runtime analysis, in which microinstructions executed more often would carry more weight in the metric.
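The static analysis idea could be prototyped in a few lines; the field names below are made up for illustration and are not actual MCC fields:

```python
def avg_parallelism(microcode, defaults):
    """Average number of control-word fields per microinstruction that
    differ from their no-op default -- a crude static measure of how
    much of the horizontal microcode width is actually being used."""
    active_counts = [
        sum(1 for field, value in mi.items() if value != defaults[field])
        for mi in microcode
    ]
    return sum(active_counts) / len(active_counts)

# hypothetical fields and defaults, for illustration only
defaults = {"alu": "nop", "mdr": "hold", "seq": "next"}
sample = [
    {"alu": "add", "mdr": "hold", "seq": "next"},  # 1 active field
    {"alu": "nop", "mdr": "load", "seq": "call"},  # 2 active fields
]
print(avg_parallelism(sample, defaults))  # 1.5
```

The runtime variant would simply weight each microinstruction by its execution count instead of counting every one once.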

Algorithm optimizations

Copying 1 byte in the core memory takes 5 clock cycles:

        //--- copy Y bytes in Core memory from S to R ---
copyCore:    T <= from_S, if Y_ZERO then return;            // bail if no bytes left to copy
        readCore(T);
        T <= from_R;
        writeCore(T, same);
        //traceALU();
        alu <= copy_next, goto copyCore;    // Y--, R and S incremented / decremented based on direction

It is easy to see how this could be collapsed to 3: allow the MAR register to be loaded directly from the S and R registers, instead of going through T. It turns out this optimization would be costly (the addition of 1 control bit to the microcode control word width) and also unnecessary: the memory copy is only used when inserting lines into the program, which happens in command mode (usually human interaction), so it would not speed up a running Basic program.
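The trade-off can be quantified with a trivial cost model (my approximation: the per-byte cycle counts stated above, plus one final pass through the Y_ZERO bail-out check):

```python
def copy_cycles(n_bytes, per_byte=5):
    """Approximate clock cycles for copyCore: per_byte cycles for each
    byte copied, plus one final cycle for the Y_ZERO bail-out check."""
    return n_bytes * per_byte + 1

print(copy_cycles(100))              # 501 cycles as implemented
print(copy_cycles(100, per_byte=3))  # 301 cycles with direct S/R -> MAR loads
```

Saving ~40% on an operation that only runs during interactive line editing is not worth widening the 80-bit control word.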

However, there are some other places that impact execution speed, and I tried to optimize them:

(actual instruction times are longer due to stack push / pop overhead, tracing etc.) 

GOTO cache

This component is the single most significant performance accelerator. Depending on the structure of the program, runs are up to 2-3X faster with the GOTO cache on. The reason is easy to see. The structure of each Basic program line is as follows:

- Line number: 2 bytes (1 word in big endian format)

- Statement text, ASCII as entered: n bytes

- 0x0D terminator: 1 byte

In order to execute GOTO / GOSUB, the whole memory must be searched from the beginning: first reading and checking the word (for a target line number match), then skipping all bytes that follow until 0x0D (carriage return), continuing this way until the target line is found or the end of the program is reached.

A simple 1-way, 32-entry direct-mapped cache avoids this traversal by providing the address of the target statement's start, leading to a dramatic performance improvement.
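The linear search being avoided can be modeled in a few lines of Python (a sketch of the traversal described above, not the actual microcode):

```python
def find_line(core, target):
    """Linear scan of the program store for a line number.
    Layout per line (see above): 2-byte big-endian line number,
    statement bytes, 0x0D terminator. Returns the address of the
    line start, or None if the line does not exist."""
    i = 0
    while i + 2 <= len(core):
        lino = (core[i] << 8) | core[i + 1]  # big-endian line number word
        if lino == target:
            return i
        i += 2
        while i < len(core) and core[i] != 0x0D:
            i += 1                           # skip statement bytes
        i += 1                               # skip the 0x0D itself
    return None

prog = bytes([0, 10]) + b"PRINT" + bytes([0x0D]) \
     + bytes([0, 20]) + b"GOTO 10" + bytes([0x0D])
print(find_line(prog, 20))  # 8 (second line starts at offset 8)
```

Every GOTO / GOSUB without the cache pays this O(program size) scan, which is why a tight loop built from GOTOs benefits so much from caching.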

How does it work:

  1. The 32-bit cache_valid register is cleared when a Basic program starts to run (a condition detected by hardware by checking the before / after value of Lino, the line number register: if it goes from 0 to non-zero, the register is reset)
  2. Target line number for a GOTO is pulled from stack and lands in Lino register
  3. The last 5 bits of Lino are used to look up the valid bit in cache_valid (a simple 32-to-1 MUX) and the entry in the cache (as the address presented to the 32-word cache memory). For example, in the case of GOTO 100 (100 mod 32 = 4), bit 4 of cache_valid and entry 4 of the cache would be selected
  4. Each entry (frame) contains data and a tag. The tag is compared with the upper 11 bits of Lino (0b11 in the case of GOTO 100), and if they match the cache_hit condition arises. The following cases are possible:
    1. valid = 0, hit = 0: the entry is free; microcode searches for statement 100 and, if found, stores BP (the address found) and the upper 11 bits of Lino into the entry, and flips the cache_valid bit to 1. We pay the price of the search for statement 100 now, but not again (compulsory cache miss)
    2. valid = 0, hit = 1: same as above; the matching tag is junk data from a previous Basic program run
    3. valid = 1, hit = 0: the entry is already occupied by some other GOTO. In a simple cache like this (no associativity, no purging), if GOTO 900 were executed first, it would take the entry and sit there until the end of the program run (conflict miss)
    4. valid = 1, hit = 1: skip searching for the beginning of statement 100 and load BP with the cached data (cache hit)
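The four cases map directly onto a small software model (a sketch with names of my own, assuming the 5-bit index / 11-bit tag split described above):

```python
class GotoCache:
    """Model of the 32-entry direct-mapped GOTO cache: Lino(4:0) is the
    index, Lino(15:5) the tag, with one valid bit per entry."""
    def __init__(self):
        self.valid = 0            # 32-bit cache_valid register
        self.entries = [None] * 32  # each used entry holds (tag, bp)

    def lookup(self, lino):
        idx, tag = lino & 0x1F, (lino >> 5) & 0x7FF
        if (self.valid >> idx) & 1 and self.entries[idx][0] == tag:
            return self.entries[idx][1]   # valid=1, hit=1: cached BP
        return None                       # any miss case: caller must search

    def store(self, lino, bp):
        idx, tag = lino & 0x1F, (lino >> 5) & 0x7FF
        self.entries[idx] = (tag, bp)     # cache[Lino(4:0)] <= tag & BP
        self.valid |= 1 << idx            # cache_valid[Lino(4:0)] <= 1

cache = GotoCache()
cache.store(100, 0x1234)          # after the first (compulsory-miss) search
print(hex(cache.lookup(100)))     # hit: 0x1234
print(cache.lookup(900))          # None: 900 maps to the same entry 4 (conflict)
```

Note that 100 and 900 both map to index 4 (100 mod 32 = 900 mod 32 = 4) but have different tags, which is exactly the conflict-miss case 3 above.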

The microcode just needs to examine these two conditions and take the proper path:

////////////////////////////////////////////////////////////////////////////////
		.map 0x16;									// GO (GOTO)
////////////////////////////////////////////////////////////////////////////////
		traceString 45;								// Trace mnemonic
		IL_PC <= XQhere, if STACK_IS_EMPTY then ESTACK_ERR;	// IL_PC <= XQhere, check stack
		alu <= R_fromStack, ExpStack <= pop2;
		T <= from_R;
		Lino <= T, if R_IS_ZERO then NOPROG_ERR;			// once Lino is set, CACHE_HIT and CACHE_VALID are meaningful
		// traceLino;
		if CACHE_VALID then go_cvalid;
		// INVALID: the cache entry is free, find target and store in cache (and stay there until end of run)
		findLino(Prog_start);
		alu <= cache_store, goto fetch;	// cache[Lino(4:0)] <= Lino & BP, cache_valid[Lino(4:0)] <= 1 
			
go_cvalid:	T <= Cache_Data, if CACHE_HIT then next else go_cmiss;
		// VALID HIT: pick it up and use it
		BP <= T, goto fetch;

		// VALID MISS: other GOTO already used the entry, no attempt to kick it out, take the perf penalty
go_cmiss:	findLino(Prog_start);			
		goto fetch;		

Given the impact of the cache on performance, I may in the future improve it by implementing some well-known cache design choices (e.g. associativity or an eviction policy).

Implementing a proper cache would require measuring hits/misses during runs of various benchmark programs. Implementing such hardware is easy, but devising and running such programs isn't. Right now, I am happy with 3 blinkenlights:

- cache empty - green LED

- cache not empty - yellow LED

- cache full - red LED

-- GOTO cache
cache_empty <= '1' when (cache_valid = X"00000000") else '0';	-- all 32 entries free
cache_full <= '1' when (cache_valid = X"FFFFFFFF") else '0';	-- all 32 entries used
cache_entry <= cache(to_integer(unsigned(Lino_index)));			-- data/tag cache entry pointed to by Lino 
cache_hit <= '1' when (cache_tag = Lino_tag) else '0';
