Fusing code to improve code density has a high potential of making refactoring very difficult. In this project I took the freedom to fuse core code when I saw a size (sometimes also a speed) advantage, and where I didn't see possible future variation points. In some cases I un-fused code from the original STM8EF source to get desired variation points (e.g. vectored I/O). This was easy since much of the eForth source code is written in Forth, and fusing code is no problem as there are no fundamental changes to the design of eForth.
So, where is the point? It's a pun. I used the TRAP instruction to improve the code density of Forth user code :-)
Please let me explain by diving a bit into how the Forth VM, the virtual CPU that drive a Forth systems works. When Forth compiles a "word", it creates a thread of tokens for consecutive execution by the Inner Interpreter, the execution unit of the Forth VM.
The Inner Interpreter can be implemented in
hardware (a Forth CPU), or in software. There are established coding
techniques for general purpose CPUs, like the often used Direct
Threaded Code (DTC) where tokens are pointers to the so called "Code
Field Address" of words.
Tokens can represent any of the following:
- other Forth words like DUP, SWAP, or other user words, that get executed in sequence
- the DOLIT instruction to push an literal integer, following the token, to the Data Stack
- the EXIT instruction that ends a thread, to return to the calling thread, or to end execution
- the BRANCH instruction to move the instruction pointer to a different point in the same thread
- ?BRANCH, DONEXT, or DOLOOP which are similar to BRANCH but conditional (using data on the Data or Return Stack)
Words like DOLIT or ?BRANCHdiffer from ordinary Forth words in that they manipulate the Inner Interpreter, and don't just work with the stacks of the Forth VM, and therefor the implementation of this class of words dependents on the coding technique used for implementing the Forth VM. Only a small number of such words is necessary, and most of them are core words.
STM8EF uses Subroutine Threaded Code (STC). Here, the Inner Interpreter is implemented by the CALL instruction of the CPU, the instruction pointer is the program counter, and the CPU's return stack is used as the Return Stack of the Forth VM. Words that need to manipulate the instruction pointer, like DOLIT, do this by changing the CPU's PC value on the return stack (the one that the CALL pushes for use by RET).
The advantage of STC is speed and simplicity of implementation. Context switching for multi-tasking is also easy to achieve (I took advantage of this). The disadvantage is code size: compared to DTC a simple STC implementation needs at least one more byte for representing a token, the CALL instruction. And this is before we even use words like DOLIT and EXIT to make the Inner Interpreter change the sequence of execution.
Since we're using the CPU as the Inner Interpreter there are other some more analogies between the CPU and the Forth VM that we can exploit for optimization:
Token | Optimization measure for increasing code density | Trade-Off | Done |
any | Many CPUs have CALL instruction that uses some form of relative or segmented addressing for improving code density, and an optimizing compiler can take advantage. The STM8S CALLR instruction is used if the call distance is less than 128 bytes (2 instead of 3 bytes). | Some code generator complexity. | ✓ |
EXIT | EXIT can be coded as native RET (1 instead of 3 bytes) | None | ✓ |
BRANCH | Branch can be coded as native JP (3 instead of 5 bytes) | None | ✓ |
?BRANCH | That's very hard since manipulation of the Data Stack is required (DONEXT and DOLOOP even manipulate data on the Forth VM Return Stack ). In Machine Forth, Chuck Moore changed the rules of the game to make this easier. | Be like Chuck Moore. |
|
DOLIT |
Same as ?BRANCH, except what DOLIT does is a simple operation of the Forth VM. |
hmmm |
Edit: I worked a bit on replacing CALL by CALLR to make one more check mark in the "Done" column in the first table. I'll be playing with the code before releasing it.
In most cases, manipulating the Forth VM stacks with native code requires many CPU instructions. This also sets limits for hand-optimizing core assembly code.
If we only had more of such 1:1 matches of Forth VM instructions to CPU opcodes (like EXIT-RET and BRANCH-JP). I'm not like Chuck Moore (and I don't change easily) but ... the STM8S has one malleable instruction: TRAP.
TRAP is a "non-maskable software interrupt with highest priority". The purpose of such an instruction is to consistently manipulate the state of the CPU, e.g. in an RTOS: it interrupts the CPU's flow of execution, saves the context to the stack, runs the TRAP handler code, restores the context, and returns to where it left off. It can only be interrupted by RESET or by TRAP (which rarely ever makes sense).
I used TRAP to implement the DOLIT instruction: "CALL DOLIT MSB LSB" can now be coded as "TRAP MSB LSB", that's 3 bytes instead of 5 for representing literals!
Here is the code (you may need the page on the TRAP instruction in the STM8S programming manual to make sense of it):
; TRAP handler implementing DOLIT ; Push the inline literal following the TRAP instruction _TRAP_Handler: DECW X DECW X LDW (3,SP),X ; XH,XL EXGW X,Y LDW X,(8,SP) ; PC MSB/LSB LDW X,(X) LDW (Y),X LDW (5,SP),X ; YH,YL LDW X,(8,SP) INCW X INCW X LDW (8,SP),X IRET
In order to use it for compiled user code I only had to change the implementation of LITERAL (3 lines), and the implementation of DODOES (3 lines), and DOES> (1 line). After applying the new "DOLIT instruction" in the core code a couple of times it had paid for itself!
In applications where a lot of literals have to be used in the code (not variables or arrays but literal numbers) the new approach has a clear code size advantage.
Of course, where there is light, there also is shadow. Here is a comparison of the old and the new implementation:
CALL DOLIT | DOLIT with TRAP | |
Execution time | 22 CPU cycles | 50 CPU cycles |
Interrupt latency | 4 CPU cycles (CALL or RET instruction) | 50 CPU cycles w/ TRAP context switch |
Bytes on return stack | 22 (multi tasking + interrupt) | 9 (non interruptible) |
At 16MHz clock frequency, 50 CPU cycles translate to just about 3µs execution time. In most applications, the additional 2µs won't be noticeable. Since time critical code (e.g. interrupt driven video signal generation) might see an impact (even if it's coded in assembly, or C). I'll try to make this new feature optional. The result of the stack analysis came as a surprise since I already had planned for increasing the return stack headroom.
Discussions
Become a Hackaday.io Member
Create an account to leave a comment. Already have an account? Log In.