-
Published first version of my Verilog version of this project
06/15/2020 at 20:10 • 1 comment
If you are interested, I have created a Verilog version of the TI-99/4A. It can be synthesised with the open source IceStorm toolchain. It also supports the cool ULX3S board.
It is still very much a work in progress, but due to some interest I decided to release the code in its current state. Please see:
-
Added cache
01/19/2019 at 09:35 • 0 comments
Happy New Year!
I've been preoccupied with other stuff, but found a little time to work on my TI-99/4A clone. One of the things I've been wanting to do for a long time is to understand how cache memories work, so I created one for this system. My update at GitHub explains the details, but there is now a 1K byte combined code and data cache, and system performance is up by 22%.
Adding one cache is cool, but having more than one is better, so more coming - stay tuned. The cache is really the enabling piece in increasing concurrency within the CPU, since having caches allows multiple memory accesses to be processed simultaneously. The TMS9900 is a very memory intensive processor, as the architectural registers (the so-called workspace) are actually located in main memory. Thus it benefits from cache memories perhaps even more than many other systems.
Even with only the simple system-level cache that I created, there are many more opportunities to optimise the CPU's execution, as memory reads no longer dominate execution time as much as in the past.
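To illustrate the idea of a small combined cache sitting in front of main memory, here is a minimal Python model. It is only a sketch: the direct-mapped organisation, the one-word line size, the write-through policy and all names (`DirectMappedCache`, `read`, `write`) are assumptions for illustration. The log only states the 1K byte size, and the actual design is described in the GitHub update.

```python
# Hypothetical model of a 1K byte direct-mapped cache: 512 lines of one 16-bit word.
# The organisation is an assumption; only the 1K byte size comes from the log.

class DirectMappedCache:
    def __init__(self, lines=512):
        self.lines = lines
        self.tags = [None] * lines   # tag per line, None means invalid
        self.data = [0] * lines      # cached 16-bit word per line
        self.hits = 0
        self.misses = 0

    def _split(self, addr):
        word = (addr & 0xFFFF) >> 1  # byte address -> 16-bit word number
        return word % self.lines, word // self.lines, word

    def read(self, addr, memory):
        index, tag, word = self._split(addr)
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1         # fetch from main memory and replace the line
            self.tags[index] = tag
            self.data[index] = memory[word]
        return self.data[index]

    def write(self, addr, value, memory):
        # Write-through with allocation: memory and cache are both updated.
        index, tag, word = self._split(addr)
        memory[word] = value & 0xFFFF
        self.tags[index] = tag
        self.data[index] = value & 0xFFFF


memory = [0] * 32768                  # 64K bytes of main memory as 16-bit words
cache = DirectMappedCache()
cache.write(0x8300, 0x1234, memory)   # e.g. a workspace register in scratchpad RAM
first = cache.read(0x8300, memory)    # hit: the write allocated the line
second = cache.read(0x8700, memory)   # same index, different tag: miss and eviction
```

Because the workspace registers live in memory, repeated register accesses like the first read are exactly the kind of traffic such a cache would absorb.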
-
Two more FPGA CPU bug fixes
09/24/2018 at 19:46 • 0 comments
Already the second project update for the day! I added way more test cases to run through more instructions.
Testing now includes comparisons for the ANDI, CB, SB, AB, XOR, INC, DEC, SLA, SRA, SRC, MOV, MOVB, SOCB, SZCB and X instructions, in addition to the earlier tests for the A, S, SOC, SZC, DIV, MPY, C, NEG and SRL instructions.
These were good additions, as I found two more bugs with CPU flags: the CB instruction did not set parity correctly, and the ABS instruction did not set the overflow flag at all. I fixed those two. Interestingly, the CB instruction sets parity according to the source data byte instead of the ALU subtract output, so that needed special casing. I suspect that in the original TMS9900 there is only one parity generation circuit and it is sampled at a different time. I simply added a 2nd parity calculation.
After fixing these two bugs the problem I had has disappeared, so PRINT 1*-1 now returns -1. I suspect it must be the ABS bug fix that helped.
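The CB parity special case can be sketched in a few lines of Python. This is an illustrative model, not the VHDL from the core: the function names are hypothetical, and the operand order used for the "ALU subtract output" is an assumption made only to show that the two parity sources can disagree.

```python
def odd_parity(byte):
    """True when the 8-bit value has an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2 == 1


def cb_parity(source_byte, dest_byte):
    """Parity flag for CB: taken from the source byte alone."""
    return odd_parity(source_byte)


def parity_of_difference(source_byte, dest_byte):
    """Parity of the subtract output (assumed operand order), i.e. the wrong source for CB."""
    return odd_parity(source_byte - dest_byte)


# Comparing >01 with >01: the source byte has odd parity, the difference (0) does not.
from_source = cb_parity(0x01, 0x01)
from_alu = parity_of_difference(0x01, 0x01)
```

Any pair of equal bytes with odd parity exposes the difference, which is why a test with CB is enough to catch it.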
I guess these fixes mean I need more test cases, since I am sure there are more bugs.
-
Fixing divider and overflow flag issues
09/24/2018 at 18:21 • 1 comment
After remembering where I was in the project I started to look for bugs in my CPU. I know it does not fully work. For example, running this in TI BASIC I get:
PRINT 1*-1
1
So something is not working in the CPU. In order to work on this, I took advantage of my previous design, which combines a real TMS99105 CPU with my FPGA implementation of the TI-99/4A. Running the above on it yields the correct result, -1.
So I wrote a piece of test code and ran it both on the FPGA CPU and the TMS99105 while capturing the results (by dumping a section of memory on both systems to a file). Below is the comparison:
Left hand side is TMS99105, right hand side is my soft CPU, i.e. the FPGA TMS9900 core. Each instruction is tested 8 times with different data. The source code of the test follows the explanation below.
Each instruction test output takes 8 bytes or 4 words. The last word of the output is the flags (only top 6 bits preserved). The result words are R1, R2, R3 and flags. The instruction is always executed like SUB R1,R2. Thus R2 is the result and R1 shows the source operand.
With that, we can see that there is a difference with the second instruction under test. It is the subtract instruction.
The first subtraction works fine (SUB 1,2, i.e. 2-1) but the second has a difference in the flags (SUB >7FFF,1) where the soft CPU has >2800 while the real CPU has >2000.
The flag that is different is ST4, i.e. the overflow flag. In the other SUB instructions the overflow flag is also sometimes bogus. Here is a table of the eight cases:
SUB >1,>2 OK (note that with the TMS9900 this actually is the 2-1 operation)
SUB >7FFF,1 Bug: soft-cpu asserts overflow
SUB >8000,>7FFF Bug: soft-cpu does not assert overflow
SUB >7FFF,>8000 Bug: soft-cpu does not assert overflow
SUB >FFFF,>8000 OK
SUB >8000,>FFFF Bug: soft-cpu asserts overflow
SUB >8000,>8000 Bug: soft-cpu asserts overflow
SUB >0,>8000 OK
Looking at the data sheet carefully, there is a difference in how ST4 (overflow) is asserted for adds and subs: the first condition is inverted. I bet I don’t do this.
Adds: If MSB(SA) == MSB(DA) and MSB of result != MSB(DA)
Subs: If MSB(SA) != MSB(DA) and MSB of result != MSB(DA)
The only other difference with this data is at >0120, and here the result is wrong (but flags ok). Since R3 has changed, it must be a DIV or MPY instruction.
The first instruction's test output is at >0..>3F, the 2nd at >40..>7F, the 3rd at >80..>BF, the 4th at >C0..>FF and the 5th at >100..>13F. So it is the fifth instruction.
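The offset arithmetic above can be written down as a small Python helper. It is a sketch under stated assumptions: `locate` and the constant names are hypothetical, offsets are taken relative to the start of the result table, and the routine order is copied from the TEST_ROUTINES table in the test source.

```python
BYTES_PER_CASE = 8            # R1, R2, R3 and flags, one 16-bit word each
CASES_PER_INSTRUCTION = 8     # eight data sets per instruction
ROUTINES = ["A", "S", "SOC", "SZC", "DIV", "MPY", "C"]  # order of TEST_ROUTINES
WORDS = ["R1", "R2", "R3", "flags"]


def locate(offset):
    """Map a byte offset in the result table to (instruction, case number, word name)."""
    block = BYTES_PER_CASE * CASES_PER_INSTRUCTION      # 64 bytes per instruction
    instruction, rest = divmod(offset, block)
    case, byte = divmod(rest, BYTES_PER_CASE)
    return ROUTINES[instruction], case + 1, WORDS[byte // 2]


div_hit = locate(0x120)   # start of the dump row where the result differs
sub_hit = locate(0x4E)    # flags word of the second subtract case
```

With these assumptions, offset >0120 lands on the fifth data set of the DIV block, and >004E is the flags word of the second S case, matching the two differences described in this entry.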
And indeed it is the DIV instruction, as PNR reported. At least one of my test cases now catches the problem. It is the fifth test case, which from the list above is
DIV >FFFF,>8000 i.e. the 32-bit value >80000000 (R2 and R3, with R3 cleared) divided by >FFFF. The result should be >8000 as quotient and >8000 as remainder.
But my code gives >FFFE as quotient and >FFFE as remainder too.
Here is the TMS9900 assembler test code (story continues after the listing):
; EP 2018-09-23 - run through a sequence of instructions with data and write
; results to RAM. This is to enable comparing the FPGA CPU and TMS9900.
        LI   R5,>2000           ; point to result table
        LI   R7,TEST_ROUTINES   ; point to test routines
RUN_TEST
        MOV  *R7+,R8            ; address of routine to test
        LI   R6,TEST_DATA_SEQ
!       MOV  *R6+,R1            ; fetch test parameters
        MOV  *R6+,R2
        CLR  R3
        ; perform operation under test
        BL   *R8
        ; save results
        MOV  R1,*R5+
        MOV  R2,*R5+
        MOV  R3,*R5+
        STST R3
        ANDI R3,>FC00           ; only keep meaningful flags
        MOV  R3,*R5+
        CI   R6,TEST_DEND
        JNE  -!
        CI   R7,TEST_ROUT_END
        JNE  RUN_TEST
        ; write end marker to memory
        LI   R3,>1234
        MOV  R3,*R5+
        MOV  R3,*R5+
        MOV  R3,*R5+
        MOV  R3,*R5+

And here is the data:

TEST_DATA_SEQ                   ; Parameters to pass to various instructions
        DATA 1,2                ; First data set
        DATA >7FFF,1            ; 2nd
        DATA >8000,>7FFF
        DATA >7FFF,>8000
        DATA >FFFF,>8000        ; 5th
        DATA >8000,>FFFF
        DATA >8000,>8000        ; 7th
        DATA 0,>8000            ; 8th
TEST_DEND

TEST_ROUTINES
        DATA DO_ADD, DO_SUB
        DATA DO_SOC, DO_SZC
        DATA DO_DIV, DO_MPY
        DATA DO_COMP
TEST_ROUT_END

DO_ADD
        A    R1,R2
        RT
DO_SUB
        S    R1,R2
        RT
DO_SOC
        SOC  R1,R2
        RT
DO_SZC
        SZC  R1,R2
        RT
DO_DIV
        DIV  R1,R2
        RT
DO_MPY
        MPY  R1,R2
        RT
DO_COMP
        C    R1,R2
        RT
Once I had analysed the problems, I fixed both the divider and the overflow flag. The fixes for both problems are in the tms9900.vhd source code committed to the soft-cpu-tms9902 branch at GitHub.
The fix for the overflow flag was easy in principle: it just takes into account the difference in flag generation between adds and subs. It still took me a few iterations to get it working properly. The resulting VHDL code for overflow flag generation is:
    -- ST4 overflow
    alu_flag_overflow <=
        '1' when (ope = alu_compare or ope = alu_sub) and arg1(15) /= arg2(15) and alu_result(15) /= arg1(15) else
        '1' when (ope /= alu_sla and not (ope = alu_compare or ope = alu_sub)) and arg1(15) = arg2(15) and alu_result(15) /= arg1(15) else
        '1' when ope = alu_sla and alu_result(15) /= arg2(15) else -- sla condition: if MSB changes during shift
        '0';
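The data sheet rule quoted earlier in this entry can be cross-checked against the eight SUB cases with a short Python model. This is a sketch rather than a translation of the VHDL: the function names are hypothetical, `sa` is the source operand (R1 in the test) and `da` the destination operand (R2), and a subtract is modelled as `da - sa` as the test description states.

```python
def msb(value):
    """Most significant bit of a 16-bit word."""
    return (value >> 15) & 1


def overflow_add(sa, da):
    """Adds: operand MSBs equal, and result MSB differs from the destination MSB."""
    result = (da + sa) & 0xFFFF
    return msb(sa) == msb(da) and msb(result) != msb(da)


def overflow_sub(sa, da):
    """Subs: operand MSBs differ, and result MSB differs from the destination MSB."""
    result = (da - sa) & 0xFFFF
    return msb(sa) != msb(da) and msb(result) != msb(da)


# The eight (source, destination) pairs from TEST_DATA_SEQ.
SUB_CASES = [
    (0x0001, 0x0002), (0x7FFF, 0x0001), (0x8000, 0x7FFF), (0x7FFF, 0x8000),
    (0xFFFF, 0x8000), (0x8000, 0xFFFF), (0x8000, 0x8000), (0x0000, 0x8000),
]
sub_flags = [overflow_sub(sa, da) for sa, da in SUB_CASES]
```

Under this rule only the third and fourth cases overflow, which agrees with the table of soft CPU bugs: the cases marked as wrongly asserting overflow come out clear, and the two marked as missing it come out set.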
Fixing the division problem was a little more involved. Once I understood that I actually need a 17-bit subtraction, the divider started to work properly. I had not captured the problem in my earlier testing because I only manually tested a few cases and did not try divisors above 32767, i.e. where the 16-bit divisor has bit 15 set. Implementing this fix in a simple way required me to add a dedicated 17-bit subtract unit to the divider code, rather than relying on the ALU I already had in the design. Here is the key snippet of code (the story continues a bit more after the code):
    when do_div1 =>
        dividend(15 downto 0) <= rd_dat;    -- store the low word
        shift_count <= "10000";             -- 16
        cpu_state <= do_div2;
    when do_div2 =>
        dividend(31 downto 0) <= dividend(30 downto 0) & '0';   -- shift left
        -- perform 17-bit subtraction, picking up the bit shifted out too
        divider_sub <= std_logic_vector(unsigned(dividend(31 downto 15)) - unsigned('0' & reg_t));
        dec_shift_count := True;            -- decrement count
        cpu_state <= do_div3;
    when do_div3 =>
        if divider_sub(16)='0' then         -- successful subtract
            dividend(31 downto 16) <= divider_sub(15 downto 0);
            dividend(0) <= '1';
        end if;
        if shift_count /= "00000" then
            cpu_state <= do_div2;           -- loop back
        else
            cpu_state <= do_div4;
        end if;
    when do_div4 =>
        -- done with the division.
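The shift-and-subtract loop can be modelled in Python to see why 17 bits are needed. This is a sketch of the same idea, not a cycle-accurate model of the state machine: `divide_32_by_16` is a hypothetical name, and the model assumes the usual DIV precondition that the high word of the dividend is smaller than the divisor.

```python
def divide_32_by_16(dividend, divisor):
    """Restoring division of a 32-bit dividend by a 16-bit divisor.

    Returns (quotient, remainder). Assumes (dividend >> 16) < divisor.
    """
    d = dividend & 0xFFFFFFFF
    for _ in range(16):
        # Bits 31..15 before the shift are the 17-bit partial remainder after it,
        # including the bit that the 32-bit shift pushes out at the top.
        top17 = d >> 15
        d = (d << 1) & 0xFFFFFFFF
        sub = (top17 - divisor) & 0x1FFFF
        if (sub >> 16) == 0:                  # no borrow: successful subtract
            d = ((sub & 0xFFFF) << 16) | (d & 0xFFFF) | 1
    return d & 0xFFFF, d >> 16


# The failing test case: >80000000 divided by >FFFF.
quotient, remainder = divide_32_by_16(0x80000000, 0xFFFF)
```

In the very first iteration of this example the partial remainder is >10000, which does not fit in 16 bits. A 16-bit comparison would lose that top bit, and this can only happen when the divisor has bit 15 set.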
The good news is that after these fixes the instructions that I now test and compare with a real TMS99105 all work identically. The bad news is that the BASIC statement at the beginning of this blog entry still gives me the same result - so on with the hunt for further bugs!
-
Back with the project!
09/22/2018 at 20:19 • 2 comments
I wrote the following as my comments to the GitHub commit I just made (formatted better here). I should additionally say that there are four branches at GitHub; the master and soft-cpu-tms9902 branches are the ones I checked and/or worked with today.
Commit 2018-09-22:
- Worked on the project after a long while. This was pretty much trying to remember where I was in the project.
- I synthesized the master branch again and also worked on the soft-cpu-tms9902 branch. The master branch is the branch which supports the TMS99105 CPU on the daughterboard / shield that I designed two years ago. Still works.
- There was some actual progress on the aforementioned soft-cpu-tms9902 branch. I clarified the naming and processing of reset signals. Thanks to this, two bugs are now fixed:
  - Sound works (again?) now that the audio DAC is not constantly being reset.
  - The serloader component (handling communication from the host PC over USB serial port to the memory of the TMS9900 via DMA) was being reset by mistake while the CPU was held in reset. If the host PC put the CPU into reset, that reset would also reset the serloader, effectively preventing any further communication with the system. This of course sucks big time, as the main use case for putting the CPU into reset in the first place is to load software into the memory of the TMS9900 system without having the CPU mess around with it while it is half loaded.
- I know that my FPGA CPU core has bugs, and I found a repeatable one: running BASIC and doing a simple multiplication with PRINT 1*-1 always yields 1 (plus one) with my FPGA CPU, while an actual TI-99/4A or my TMS99105 FPGA system (i.e. the master branch using real CPU silicon) yields -1 as it should. So this bug can be observed with high level software... There we go. It is a miracle BASIC runs in the first place. The bug is probably related to CPU flag handling. **pnr** has also reported to me that my divider implementation does not work properly in all cases, so I need to check that too.
Good to be back with the project!
-
Support for the original keyboard
01/02/2018 at 20:08 • 3 comments
A quick addition of the day - this one was really easy to do, as interfacing to a normal TI keyboard from the FPGA is way easier than communicating with the PC's keyboard through USB and the server process.
The implementation quite literally only involved bringing out the keyboard row / column wires from my TMS9901 interface chip implementation inside the FPGA. There are no external active or passive components other than the keyboard switches, thanks to the internal pull-ups of the FPGA.
-
Stand-alone booting capability
12/30/2017 at 19:11 • 0 comments
An update at long last!
The next step for the design is to make the FPGA system stand-alone, i.e. able to boot and operate without a host PC. A USB connection will still be needed, but only to provide power. Today I implemented a new feature, where after reset the FPGA logic loads 256K of data from the SPI flash ROM to the SRAM of the system. That allows the system to get the TI-99/4A system ROMs and GROMs into the static RAM in the appropriate places. After the download, one of the DIP switches controls the CPU's automatic boot - if switch zero is set, the CPU in the FPGA will automatically boot and start executing the code that was transferred to SRAM.
The 256K of data is divided into three regions:
- The first 128K is written to SRAM from address zero upwards. The logic of the FPGA maps this area to the cartridge ROM slot of the TI-99/4A. This is a paged area of 8K pages. By default my scripts put the Extended Basic ROM code (16K) there.
- The next 64K is written to SRAM from address 0x80000 onwards (at address 512K). This is the area where GROM data is stored in my design. By default it holds 24K of system GROM first, followed by 32K of Extended Basic GROM code.
- The last 64K are written to SRAM at address 0xB0000. This is my ROM area. It is largely unused, but the first 8K (at address 0xB0000) are the disk support DSR space and another block of 8K (at address 0xBA000) is mapped to address zero of the TMS9900 core's address space, thus containing the normal console ROM code.
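The three regions above amount to a simple address translation, sketched here in Python. The names (`REGIONS`, `sram_address`) are hypothetical, and offsets are counted from the start of the 256K image; where that image sits inside the SPI flash is not stated in this entry, so it is left out.

```python
# (offset in the 256K image, length, SRAM destination), taken from the list above
REGIONS = [
    (0x00000, 0x20000, 0x00000),   # 128K cartridge ROM area, 8K pages
    (0x20000, 0x10000, 0x80000),   # 64K GROM area
    (0x30000, 0x10000, 0xB0000),   # 64K ROM area (DSR at 0xB0000, console ROM at 0xBA000)
]


def sram_address(image_offset):
    """Translate a byte offset in the 256K boot image to its SRAM address."""
    for start, length, destination in REGIONS:
        if start <= image_offset < start + length:
            return destination + (image_offset - start)
    raise ValueError("offset outside the 256K boot image")


console_rom = sram_address(0x3A000)   # console ROM block within the last 64K
grom_start = sram_address(0x20000)    # first byte of the GROM data
```

A table like this is also a convenient place to check that the regions add up to exactly 256K and do not overlap in SRAM.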
The Pepino board has 1M of static RAM overall. I had forgotten that the board actually has 16 megabytes of SPI flash storage, so there is plenty of potential here.
The design of the SPI flash interface is from Magnus Karlsson, the designer of the Pepino FPGA board. I used the code from his Mac Plus example, and modified it for my purposes. His code is written in Verilog while my code is in VHDL, so I wrote the standard VHDL component header to enable me to interface to the Verilog code from VHDL.
-
VDP character cell address masking feature
11/01/2017 at 21:20 • 0 comments
I pushed to GitHub an update to my TMS9918 VHDL core, adding support for the undocumented but somewhat widely used and known graphics mode 2 masking features. The lack of this feature was the reason the megademo (see my previous update) systematically failed to work properly on quite a few screens.
With these fixes the megademo works much better, but there are still some problems (including the fact that the demo gets stuck at a certain point after running successfully through quite a few demo phases - the CPU core continues to run, but it appears to be in some kind of a loop that it cannot escape). So as always, fixing some bugs means it's time to fix the next bugs...
The character masking feature appears in two places in the VHDL code, using the low bits of registers 4 and 3 as character cell masks. The example below illustrates the use of register 4 during character cell address calculation in graphics mode 2:
    -- Graphics mode 2. 768 unique characters are possible.
    -- Implement UNDOCUMENTED FEATURE: bits 1 and 0 of reg4 act as bit masks for the two
    -- MSBs of the 10 bit char code. This allows character set to be limited even in this mode.
    vram_out_addr <= reg4(2)                                -- MSB of the address
        & (char_addr(9 downto 8) and reg4(1 downto 0))      -- Character code with masks for bits 9 and 8
        & char_code & ypos(2 downto 0);                     -- 8 bit code and line in character
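The same address calculation can be expressed as a small Python function, which makes the effect of the mask bits easy to try out. It is a sketch that mirrors the concatenation in the VHDL above: `gm2_pattern_address` is a hypothetical name, and the arguments follow the signal names in the snippet (`char_addr` is the 10-bit character position, of which only bits 9 and 8 are used).

```python
def gm2_pattern_address(reg4, char_addr, char_code, ypos):
    """14-bit VRAM address: reg4(2) & masked char_addr(9..8) & char_code & ypos(2..0)."""
    top = (reg4 >> 2) & 1                          # MSB of the address
    third = (char_addr >> 8) & 0b11 & reg4         # bits 9..8 masked by reg4 bits 1..0
    return (top << 13) | (third << 11) | ((char_code & 0xFF) << 3) | (ypos & 0b111)


# Character >41, line 3, in the bottom third of the screen (char_addr bits 9..8 = 10).
unmasked = gm2_pattern_address(0x07, 0x200, 0x41, 3)   # mask bits 11: own pattern area
masked = gm2_pattern_address(0x04, 0x200, 0x41, 3)     # mask bits 00: falls back to the first area
```

With both mask bits cleared, all three thirds of the screen fetch their patterns from the same 2K area, which is how the character set can be limited even in this mode.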