GLXgears on a commodore 64

Project Logs


GLXgears with the Budge 3D routines
lion mclionhead • 08/14/2024 at 03:45 • 0 comments

The next trick with glxgears is there's also a camera transformation. The budge library doesn't support a camera transformation & it doesn't use a transformation matrix that can be multiplied. The 3 gears would have to be translated after rotating Z, then translated again after rotating X & Y. Despite a generic transform being available, there aren't enough clockcycles to really use it.

As discovered earlier, the rotation can't be baked into the models because the shaft polygons need all the steps. Maybe the teeth could be baked in while the shafts were computationally rotated.

Through some clever programming, you can fit a single gear into 176 points & 264 lines by reusing the points. The budge library would need to be expanded to support 16 bit line pointers & to support different tables for each model.

Noted ca65 sometimes says "Didn't use zeropage addressing". You have to put all the variables before the usages so the assembler knows to use zero page instructions.
Over a few days, most of the original Budge 3D library was removed. Only the transformation & line drawing remaned. The arguments were set manually instead of using tables. The scaling, crunch, & code arrays were gone. A few more self modify entry points allowed pointers to different models.

For the record, there was an attempt to color the gears by setting screen memory.

It was too ugly. Where the gears overlap, there's a non deterministic order which would require real Z buffering to look right. It could be mostly done by copying bitmap memory into 2 sprites but it would be complicated & lions survived without it.

For some reason, it gets real slow when the gears are edge on. It has to draw all the vertical lines & the joiner lines in their full length, but none of the horizontal lines. The same phenomenon happens when the top or side of the gears is facing the camera. Somehow the joiner lines really slow it down. The line drawing could be impacted more by length than number.

In its fastest orientation, Glxgears with the 1980 budge code ran at 2.22fps compared with 1.25fps for the 2023 edition.

In the slowest orientation, the budge code ran at 1.45fps while the 2023 edition ran at .91fps.

Both were using orthographic projection & limited to 256 columns. Kind of enlightening how methods known in 2023 produced slower results than methods known in 1980 & later forgotten.

-----------------------------------------------------------------------------------------------------------------------------------------------------

There are theories about how line drawing could be faster, but smarter minds spent 40 years optimizing it. Also, there's no way the budge line drawer could reach all 320 columns without becoming just as slow as the 2023 edition. The commodore can advance 7 of 8 rows with inc instead of a table lookup. It can advance 7 of 8 columns with bit shift instead of table lookup. To avoid more branches, it would have to be unrolled. If these loops were unrolled, it could probably reduce a 12 cycle address copy to a 5 cycle ror. Starting & ending the loop is problematic because the line could start & end anywhere in the 8 iterations. It could call a rolled version in certain conditions that occur for up to 14 iterations per line. If the average line starts & ends in between multiples of 8, it would be a loss. Most of glxgears is under 15 pixels with the start & end between multiples of 8 but not the cube demo.

Noted it's more expensive to advance rows than columns, so vertical lines are slower than horizontal lines. The row advance requires increasing a 16 bit address. You couldn't increment 1 byte & check the result for a multiple of 8 without making it slower.
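The row & column stepping described above can be modeled offline. This is a minimal Python sketch of the standard C64 hires bitmap layout (320x200, each 8x8 cell stored as 8 consecutive bytes), not the budge or lion line drawer. `BITMAP_BASE`, `pixel_address` & `row_steps` are hypothetical names made up for the illustration.

```python
# Sketch of C64 hires bitmap addressing. BITMAP_BASE is an assumed location.
BITMAP_BASE = 0x2000

def pixel_address(x, y):
    """Return (byte address, bit mask) for pixel x, y."""
    cell_row = y >> 3          # which row of 8x8 cells
    cell_col = x >> 3          # which column of 8x8 cells
    addr = BITMAP_BASE + cell_row * 320 + cell_col * 8 + (y & 7)
    mask = 0x80 >> (x & 7)     # leftmost pixel of a byte is the MSB
    return addr, mask

def row_steps(x):
    """Address deltas when stepping down 1 row at a time in column x."""
    return [pixel_address(x, y + 1)[0] - pixel_address(x, y)[0]
            for y in range(199)]

steps = row_steps(100)
# 7 of 8 steps are a plain +1 (an inc). The 8th jumps to the next cell row.
```

Stepping right works the same way on the mask: 7 of 8 steps are a shift right, & the 8th resets the mask to 0x80 & adds 8 to the address.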

The Bresenham algorithm is basically a counter which wraps around at every...
angle resolution in Budge 3D graphics library
lion mclionhead • 08/11/2024 at 19:29 • 0 comments

The mane motivation for using the budge library for glxgears64 was it didn't have a general purpose 3D transform. There could be a way to fake perspective by hard coding the size of the polygons.

The mane problem with the budge library was smoother rotation. GLXgears needs 128 rotation values to achieve a useful gear visualization. Keyboard input actually has only 64 rotation values.

The rotation tables actually have magnitude * sin baked into the same lookup value. To search the entire space of magnitude * sin, he creates hashes of the magnitude & angle. A hash of bits 0:3 of magnitude & angle searches 1 table. A hash of bits 4:7 of magnitude & angle searches another table. The 2 lookup values are added to get magnitude * sin. Since the angle only changes once per frame & magnitude changes once per vertex, the hash mangles the angle bits but not the magnitude bits.

Compressing the angle into 4 bits required reducing it to 15 possible values. It was 15 because presumably, 360 was evenly divisible by 15. The 15 values are used twice to cover a complete rotation. 1 addition operation saves a lot of memory compared to a single table with all 256 possible magnitudes & all 15 possible angles. It allows all 16 magnitudes of the least significant nibble to be reused with every value of the most significant nibble without creating unique table entries.
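The memory saving from splitting the magnitude into nibbles can be sketched in Python. This is a simplified model, not the actual RotTabLo/RotTabHi/RotIndex layout from the disassembly. It assumes an unsigned 8 bit magnitude & 15 angles spread evenly over half a rotation. All the names are hypothetical.

```python
import math

N_ANGLES = 15  # assumption: evenly spaced over half a rotation

def build_tables():
    """Build 1 table per nibble: 16 magnitudes for each angle."""
    lo_tab, hi_tab = [], []
    for k in range(N_ANGLES):
        s = math.sin(math.pi * k / N_ANGLES)
        lo_tab.append([round(lo * s) for lo in range(16)])
        hi_tab.append([round(hi * 16 * s) for hi in range(16)])
    return lo_tab, hi_tab

LO_TAB, HI_TAB = build_tables()

def mag_times_sin(magnitude, k):
    """Look up magnitude * sin(angle k) with 2 lookups & 1 addition."""
    return HI_TAB[k][magnitude >> 4] + LO_TAB[k][magnitude & 0x0f]

split_entries = 2 * 16 * N_ANGLES   # both nibble tables together
full_entries = 256 * N_ANGLES       # 1 table with every magnitude
```

Each table entry is rounded separately, so the sum can be off by up to 1 compared with the exact product. In this model that's the price of 480 entries instead of 3840.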

128 angle steps with 16 magnitudes would need a bigger hash with a 6 bit angle joining the original 4 bits of magnitude. 64 * 16 for RotTabLo & 64 * 16 for RotTabHi. The 4 RotIndex tables would be 256 bytes to address the larger lookup space so another 1024. That's 3072 bytes for the rotation/magnitude tables.

That arrived at a higher resolution angle table which produced roughly the same values for magnitude 0x7f but could not properly render the image anymore.

Things fell apart with magnitude 0xd0. Small negative magnitudes like 0xff looked similar but large negative magnitudes like 0x80 & 0xd0 were falling apart. Lions hoped negative magnitudes would work themselves out naturally in that nibble algorithm. Bits 0:3 must have always been magnitudes 0-15 but bits 4:7 were signed magnitudes from -128 to +112 in steps of 16.

This got high res rotation working & the framerate was unaffected. There is a lingering problem of negative magnitudes from -16-0. These give a different table for the low nibble, but somehow the number theory manages to solve negative magnitudes with only bits 4:7 being signed. -113 * cos(0) or 0x8f resolves to 0x80 from RotTabHi + 0x0f from RotTabLo.
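The number theory behind the 0x8f example is just two's complement: a signed byte is a signed high nibble times 16 plus an unsigned low nibble. A small Python check, with hypothetical helper names:

```python
def to_signed(byte):
    """Interpret an 8 bit value as two's complement."""
    return byte - 256 if byte >= 128 else byte

def split_signed(byte):
    """Split a byte into (signed high part, unsigned low part)."""
    lo = byte & 0x0f           # always 0..15
    hi = byte >> 4             # 0..15
    if hi >= 8:
        hi -= 16               # treat the high nibble as signed: -8..7
    return hi * 16, lo

hi_part, lo_part = split_signed(0x8f)   # the -113 case from above
```

Only the high part ever goes negative, which matches only bits 4:7 needing signed table entries.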

Helas, the video made it obvious that some table values were repeated, creating a stuttering. It's not so easy to see since the emulator has stuttering of its own. The total number of angles has to be 127 instead of 128. Still, these demos show a level of 3D performance on the commodore we never imagined 40 years ago.
Bill Budge's 3-D Graphics System for C64
lion mclionhead • 08/09/2024 at 05:59 • 0 comments

The lion kingdom set to the task of porting the Bill Budge 3D graphics system to the commodore, manely to try to improve the glxgears demo. It was apparently never released in source form. There's only a disassembly.

https://6502disassembly.com/a2-budge3d/MODULE.SHIP.CUBE.html

A character generator was added at some point, probably for debugging without an emulated printer.

The journey began with painstakingly pasting all the block hex codes into a converter to generate .byte tables.

He starts local variables with ]. Wherever there's a dfb, it's reserving a byte of program memory & setting it to the given value the same as .byte in cc65. Lions put those in zero page or made them literals.

A lot of variables are stored in zero page addresses which happen to also be unused in C64 assembly. He only draws in the 1st 256 columns, the same way lions did. The Apple II stored 7 pixels per byte with the MSB on the right. Each row was 40 horizontal bytes.

The pixel drawing is built into the line function where lions used a separate plot function. It only recomputes the row pointer when it has to so horizontal lines draw faster. Vertical lines still recompute the column pointer for every pixel.

It takes self modifying code to the point of overwriting opcodes. Despite this, he still ORs the row table with the current bitmap start address to get addition without clearing the carry bit. He could have self modified 3 uses of the row table. The biggest speed gain might come from XORing the image instead of clearing the entire screen. XOR was something unique about pinball construction set, along with neglecting the multicolor pixel codes.
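The XOR idea is that drawing the same pixels a 2nd time erases them, so only the pixels of the object get touched instead of the whole bitmap. A minimal Python sketch with a made up pixel list in place of a real line drawer:

```python
def xor_draw(bitmap, pixels):
    """Toggle each (address, mask) pixel in the bitmap."""
    for addr, mask in pixels:
        bitmap[addr] ^= mask

bitmap = bytearray(8000)              # 320x200 monochrome bitmap
bitmap[10] = 0x01                     # pre-existing background pixel
line = [(10, 0x80), (10, 0x40), (18, 0x20)]   # hypothetical line pixels

xor_draw(bitmap, line)                # 1st call draws
drawn = bytes(bitmap)
xor_draw(bitmap, line)                # same call erases
```

The background pixel survives the round trip. The catch is that pixels where 2 lines cross get toggled twice & end up cleared while the object is on screen.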

At the same time, the lion program used line optimizations made 15 years later.

The budge disassembly doesn't contain enough to run it. It's definitely not a simple porting job. It needs an applesoft BASIC program which creates a bunch of arrays of opcodes. Then it runs the opcodes. That part is described in

https://www.apple.asimov.net/documentation/applications/misc/Bill%20Budges%203-D%20Graphics%20System%20and%20Game%20Tool.pdf

You had to be pretty motivated to dig through that document in 1980. It's kind of like the instructions for a modern AI framework. As one who depended on the source code to learn how to use it, it's hard to imagine anyone got very far without source code in 1980.

After much contemplation, the decision was made to keep the original CRUNCH API & move the applesoft arrays to assembly language tables. As bloated as the CODE opcodes are, it seems to be a more abstracted way than manually calling the transform & draw commands. The trick is the applesoft arrays were uint16's while the assembly language arrays are bytes.

Set the scale or rotation above maximum & it'll skip that step to aid in debugging.

The next problem was .align in ca65 wasn't aligning the tables on 0x100 bytes because the linker put it at 0x080d. The linker & assembler don't talk to each other. Getting ca65 to properly align the tables took random tweaks to the .align tags & eventually just abandoning alignment for some tables. Some of the alignments burned a lot of memory to save a couple instructions & gained no speed.

All that arrived at a simple demo. Sadly, the source code for McFadden's video was not easily found so just made some simple rotation demos. The budge library rendered a cube at 9.1fps & the mclionhead library rendered a comparable cube at 6.6fps. That was memsetting the entire bitmap to clear instead of using XOR. XOR was slower in the budge library for an object as large as the cube.

A 37% speed improvement was a big deal. The only glaring defect was the rotation being limited to 28 steps. The rotation...
Bill Budging
lion mclionhead • 08/03/2024 at 21:38 • 0 comments

After reviewing Bill Budge's Apple II demo, was questioning the lion kingdom's choices in life. It was just a mane hair faster than glxgears. The Apple II only did 280x192. Obviously, he could bake all the 2D coordinates while glxgears had to recalculate all of them to be interactive. He accessed all 280 columns while lions had to limit it to 256 to get it as fast as it was. The slowest step for lions was the line drawing routine. The only other thing which might have sped that up is just blitting predrawn bitmaps into the changing regions of the screen. The giveaway was how only small parts of the screen were animated. That would have been the ultimate speedup but not been true 3D rendering in realtime.

Glxgears could go faster if it prerendered all the bitmaps, every time the user moved it. The memory usage would be problematic. It could encode each line as a series of precalculated OR operations.

The budge graphics library is available in source form. It might be smarter to port it to C64 than invest any more in the lion demo or it could be that orthographic projection with a few more table lookups is the answer. The manetainer said it really does all the line drawing & transformations in realtime.

https://6502disassembly.com/a2-budge3d/

Pondered going to VCF-west & trying to get someone to run glxgears on a real C64 but the cost would be $40 to get in + over $20 in driving + unknown parking. In light of the Budge revelation, it would totally not be worth it. There was nothing else in the VCF part worth viewing. The mane thing in the museum would be the wiring inside the cray. The lion kingdom might have more demos beyond glxgears, someday. Seeing things run on period hardware is still a borderline goal.

Another problem with running demos from an SD card is the SD2IEC drive would be much faster than historic reality, ignoring all the seeks. It really needs a pi1541 or real drive to be worth running some other demos. At minimum, the lion kingdom would bring a pi1541 to a convention & try it.
Complete model
lion mclionhead • 02/21/2023 at 08:17 • 0 comments

Banged out all the gears for the reproduction of the original demo. Multicolors are sadly not possible. The 160x200 multicolor mode would be illegible. Maybe the color map could be set based on bounding boxes from the gears, but the overlapping areas wouldn't work. The best trick might be drawing 2 bitmaps, 1 bitmap containing just the big gear & the other bitmap containing just the small gears since they usually don't overlap. It could toggle between bitmaps in a raster interrupt, but it would have to be a static display.

The cursor keys rotate it manually, to prove it's doing the 3D transformation.

It runs at around 2 frames every 3 seconds. It burned 12kb.

1 thing we couldn't do 40 years ago was make crosseyed stereo pairs. There's not enough resolution to get much depth. There was an intriguing possibility of printing crosseyed stereo pairs on a VIC 1525.

Sadly, the magic of learning to program the commodore 64 40 years ago wasn't rediscovered. To a modern lion, it's just another embedded system with its own goofy architecture. There were other aspects like sprites, character sets, & sound but the mane thing young lions couldn't master was fast bitmap drawing.
The gear function
lion mclionhead • 02/20/2023 at 20:49 • 0 comments
In order to make any more than a cube fit, the memory mapping gets gnarly. A big deal in the old days was locating programs around the bitmaps. The default output from ld65 starts the executable at $800. The bitmaps go from $2000 to $7b40. Then $8000 to $d000 is free.

You can dump the memory in VICE by entering the monitor (alt-h). Type 'm 0 ffff' to dump the entire memory. You can make a kind of debugging trace by setting an address & printing the address with 'm 2000'. Type 'r' to dump the registers.

CC65 has ways of creating memory holes but they're quite involved & specific to C. The easiest way in assembly was just reserving a hole & jumping past it:
```
    jmp mane
.res $7700
mane:
```
Another way is moving the bitmaps higher. There are all kinds of restrictions on where color memory & bitmap memory can be. The mane one is a bitmap can't span 2 VIC banks. The highest useful bitmap is $a000 so by moving the bitmaps to $5c00 to $c000 it had 20kb for the program. Anything higher overlaps the I/O registers or kernal. Cc65 automatically swaps out the BASIC ROM but it uses the kernal.
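The bank restriction can be checked with a few lines of Python. The VIC-II sees memory in 16kb banks & a bitmap has to start at offset $0000 or $2000 inside its bank. The sketch assumes the 2 bitmaps sit at $6000 & $a000 with $5c00 holding a screen matrix, which is a guess at the layout rather than something stated above. The function names are hypothetical.

```python
BITMAP_SIZE = 8000   # 320x200 / 8 bytes

def vic_bank(addr):
    """Return which 16kb VIC bank (0-3) an address falls in."""
    return addr >> 14

def valid_bitmap_base(addr):
    """A bitmap must start at offset $0000 or $2000 within its bank."""
    return (addr & 0x3fff) in (0x0000, 0x2000)

def fits_in_one_bank(addr):
    """True if the whole bitmap stays inside a single VIC bank."""
    return vic_bank(addr) == vic_bank(addr + BITMAP_SIZE - 1)

bitmaps = [0x6000, 0xa000]   # assumed bitmap locations
```

With those addresses, everything from $0800 up to $5c00 is left for the program, which is the roughly 20kb mentioned above.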

After freeing up enough memory, the procedural gear was drawing at roughly 2 frames per second.
Math optimization
lion mclionhead • 02/16/2023 at 10:37 • 0 comments

When trying to manually set rotation angles in the cube demo you have to calculate both cos & sin for the X rotation & again for the Y rotation. There were a few more optimizations to be had in eliminating division branches & combining the unsigned projection tables into 1 signed table. To increase the chance of the gears having enough resolution, the trig tables were increased to 256 entries with a range from -127 to 127.

The fastest way to draw a gear is going to be procedurally drawing 1 side of a gear with the Z rotation applied, then drawing 1 point on the other side to calculate a fixed offset between the 2 sides, then applying XY rotation with the existing code. Add the fixed offset to the one side to create the other side.

The cube demo can similarly be optimized by only computing 4 points & using those 4 points to compute fixed offsets to create the other 4 points. Technically a 4x4 transformation matrix does the same thing as adding a fixed offset to every point, but it also has a scaling step which is slow.
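The fixed offset trick works because rotation is linear: rotating p + d gives the same result as rotating p & d separately & adding them. A floating point Python sketch of the cube case, assuming an X rotation followed by a Y rotation. The real demo uses integer tables & the names here are hypothetical.

```python
import math

def rotate_xy(p, ax, ay):
    """Rotate point p about the X axis by ax, then the Y axis by ay."""
    x, y, z = p
    y, z = (y * math.cos(ax) - z * math.sin(ax),
            y * math.sin(ax) + z * math.cos(ax))
    x, z = (x * math.cos(ay) + z * math.sin(ay),
            -x * math.sin(ay) + z * math.cos(ay))
    return (x, y, z)

def cube_points(size, ax, ay):
    """Rotate 4 near points + 1 offset vector instead of all 8 points."""
    near = [(-size, -size, size), (size, -size, size),
            (size, size, size), (-size, size, size)]
    rotated_near = [rotate_xy(p, ax, ay) for p in near]
    # The far face is the near face shifted by (0, 0, -2 * size).
    offset = rotate_xy((0, 0, -2 * size), ax, ay)
    rotated_far = [tuple(a + b for a, b in zip(p, offset))
                   for p in rotated_near]
    return rotated_near + rotated_far
```

That's 5 rotations instead of 8 for the cube. A gear with 2 identical sides would go from 2N rotations to N + 1 by the same reasoning.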

The gears would be baked in polar coordinates at compile time, then rotated & converted to XY coordinates for each frame. A key optimization would be precalculating the polar to XY conversions for the 9 circles in the model. That would use 512 * 9 or 4608 bytes. By knowing cos is just sin with a phase offset of 64, this can be reduced to 320 bytes per circle or 2880 bytes.
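The 320 byte figure comes from extending a 256 entry sin table by a quarter turn so cos can read from the same table at angle + 64. A Python sketch of 1 such table, with a hypothetical radius & hypothetical function names:

```python
import math

STEPS = 256                      # angle steps per full rotation
QUARTER = STEPS // 4             # cos is sin shifted by 64 steps

def circle_table(radius):
    """radius * sin for 256 + 64 angle steps, each fitting a signed byte."""
    return [round(radius * math.sin(2 * math.pi * i / STEPS))
            for i in range(STEPS + QUARTER)]

def polar_to_xy(table, angle):
    """Convert an 8 bit angle on this circle to (x, y)."""
    angle &= 0xff
    return table[angle + QUARTER], table[angle]   # (cos, sin)

table = circle_table(100)        # radius 100 is just an example
total_bytes = 9 * len(table)     # 9 circles in the model
```

Reading past entry 255 instead of wrapping the index is what costs the extra 64 bytes per circle, but it saves a wraparound on every cos lookup.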

The thought occurred of how fast glxgears would run on an arduino if it used the same methods, but the point of that demo was manely to show the REGIS protocol drawing over a serial port.

The original cube demo hard coded which coordinates to use for all the line drawing commands. A more general gear routine needs to convert a batch of polar coordinates into 2D points. Another routine needs to draw lines from the set of points. The biggest gear contains 200 points.
Cube demo using cc65
lion mclionhead • 02/15/2023 at 07:06 • 0 comments
Ported the assembly language demo from https://retro64.altervista.org/blog/an-introduction-to-vector-based-graphics-the-commodore-64-rotating-simple-3d-objects/ to ca65.

Basic line drawing & pixel drawing kicked off this port. Thus ran the lion kingdom's 1st commodore 64 program in 35 years. Strangely more satisfying to rediscover commodore 64 programming & assembly language optimization 35 years later than it is to do something productive. It's the activity lions couldn't afford to do 35 years ago.

It might have been easier to find out what assembler the cube demo used. There are many wrinkles: the line routine only supporting 8 bit X but the plot routine supporting 16 bit X, big endian being used for local variables while little endian is used by the 6502 instructions, & unnecessary fetches instead of constants.

Fetching from memory was really slow. As many opcodes as possible should have hard coded literals. There's a table of instructions & cycle times on https://the-dreams.de/aay64.txt

You can get the byte codes for the instructions by passing -l to ca65.
```
0001E9r 1  B9 rr rr         lda ytablehigh_BMP0,y
```
This gives the bytes occupied by the instruction: the B9 opcode followed by 2 relocatable address bytes (rr rr) for ytablehigh_BMP0, which the linker fills in.

The original compiler obviously didn't support ifdef. Many functions come from https://codebase64.org/ which has a lot of numerical recipes in assembly. Even a lion who hasn't programmed the 6502 in 40 years still finds a lot of bugs & waste. After all those struggles 40 years ago, the conversion from coordinates to bitmap offsets was actually very simple.

The tricky bit is the math library. Lions won't pretend to know what's going on there. 10 year old lion was 7 years away from being exposed to even the basic trig functions so it was nowhere close to happening in those days.

The demo gains a lot of speed by only drawing & clearing a small part of the screen. By the time the 1st cube demo was animating, it was clearly going to be super slow by the time it was 3 gears.

The original cube demo after porting drew a small cube in the center.

Managed to maximize the cube size & draw the X border. There's no clipping support so different angles create different limits on the dimensions. The coordinates are signed. Unlike most compilers, ca65 can't automatically convert negative numbers to unsigned. You have to write 256 - the number.
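The 256 - the number rule is two's complement for a single byte. A quick Python check of what byte to write for a negative coordinate, using a hypothetical helper:

```python
def signed_byte(n):
    """Return the 8 bit two's complement encoding of n (-128..127)."""
    if not -128 <= n <= 127:
        raise ValueError("does not fit in a signed byte")
    return n & 0xff

minus_five = signed_byte(-5)     # same as writing 256 - 5
```

So a coordinate of -5 goes into a table as $fb.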

Compiled the various iterations of the cube demo into a vijeo. The only thing affecting the speed is the size of the area being cleared. The clear operation is a clockcycle buster. The line drawing & math is negligible in a polygon this size. The optimized 3D drawing goes a lot faster than lions remember 40 years ago but maximum size isn't fast enough to believe they didn't also use some aggressive optimizations.
It might be fastest to redraw the cube with a line erase function, but it wouldn't be fastest with a gear polygon.
Simple assembly language program with cc65
lion mclionhead • 02/13/2023 at 23:26 • 0 comments
Lions remember very little about C64 development. load "$",8 loads a directory listing into the program space. PRG files are programs. SEQ files are data. Assembly language programs tended to be stored in SEQ files. There was 1 PRG file for starting. load with ,8,1 was required for assembly language. The mane wrinkle was the many addressing modes enabled by the X Y index registers. You'd load 8 bit offsets into those & use LDA STA variants which add those to the address. There are no 16 bit registers, which lions tend to confuse with the 68HC11 which had a 16 bit X Y & accumulator.

Assembly language arguments in ca65 are:

$01 dereference a value in a hex address

1234 dereference a value in a decimal address

#$34 a literal in hex format

#%01010101 a literal in binary

#123 a literal in decimal format

< provides the low byte of a literal

> provides the high byte of a literal

Enclosing the argument in various parentheses, POINTER,X (POINTER) (POINTER),Y (POINTER,X), invokes different addressing modes described in http://www.emulator101.com/6502-addressing-modes.html.

POINTER,X Add X to pointer to get POINTER2. Dereference the value in POINTER2. POINTER is a 16 bit or 8 bit address. There are different opcodes for the 16 bit & 8 bit variant. The 8 bit variant wraps at 256 (zero page memory). This works with X or Y.

(POINTER) Read the address stored in POINTER to get POINTER2. Dereference the value in POINTER2. POINTER is a 16 bit address. This only works with JMP, the 6502's GOTO.

(POINTER,X) Add X to POINTER to get POINTER2. Read the address stored in POINTER2 to get POINTER3. Dereference the value in POINTER3. POINTER is limited to zero page memory. This works with X only.

(POINTER),Y Dereference the address stored in POINTER to get POINTER2. Add Y to POINTER2 to get POINTER3. Dereference the address in POINTER3. POINTER is limited to zero page memory. This works with Y only.

Double & triple pointers were the key to accessing large amounts of memory with 8 bit registers. They had special instructions for accessing the 1st 256 bytes of memory (zero page). You'd ideally have all your variables in that space. 16 bit POINTER,X was the only indexing mode lions could understand 40 years ago.
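The indexed modes above can be modeled by computing the effective address each one ends up reading or writing. This Python sketch uses a 64k bytearray as memory. The function names & the $fb/$fc pointer location are made up for the example.

```python
mem = bytearray(0x10000)

def read_zp_pointer(zp):
    """Read a 16 bit little endian pointer from zero page (wraps at 256)."""
    return mem[zp & 0xff] | (mem[(zp + 1) & 0xff] << 8)

def absolute_x(pointer, x):        # POINTER,X with a 16 bit POINTER
    return (pointer + x) & 0xffff

def zeropage_x(pointer, x):        # POINTER,X with an 8 bit POINTER
    return (pointer + x) & 0xff

def indexed_indirect(pointer, x):  # (POINTER,X)
    return read_zp_pointer(pointer + x)

def indirect_indexed(pointer, y):  # (POINTER),Y
    return (read_zp_pointer(pointer) + y) & 0xffff

mem[0xfb], mem[0xfc] = 0x00, 0x20  # zero page pointer to $2000
```

With that pointer in place, (POINTER),Y with Y = 5 lands on $2005, which is how a bitmap row can be walked with only an 8 bit register.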

A starting point is to call into the C library to print something.
```
.autoimport    on              ; imports _cprintf, pushax
.forceimport    __STARTUP__ ; imports STARTUP, INIT, ONCE
.export        _main           ; expose mane to the C library
.segment    "RODATA"
_Text:                      ; PETSCII text to print
    .byte    $C8,$45,$4C,$4C,$4F,$20,$57,$4F,$52,$4C,$44,$21,$00


.segment    "CODE"
.proc    _main: near
    lda     #<(_Text) ; low byte function argument
    ldx     #>(_Text) ; high byte function argument
    jsr     pushax ; put function arguments on stack
    ldy     #$02   ; size of function arguments (2 bytes)
    jsr     _cprintf ; C library function
    rts              ; return to the C startup code
.endproc
```
The trick with this is it uses PETSCII instead of ASCII.

The pushax function is a beast in libsrc/runtime/pushax.s

There's a command to assemble it.
```
ca65 -t c64 hello.s
```
Then link it with the standard C library.
```
ld65 -t c64 hello.o -o hello c64.lib
```
The executable doesn't end in .prg but is a PRG file anyway. It's an utterly gigantic 2108 bytes for what it does, probably because cprintf has to parse formatting codes. There is a simpler _puts function.
```
	lda     #<(string)
	ldx     #>(string)
	jsr     _puts
```
The trick with the C library is compiling C programs to figure out the function arguments. Cc65 generates a .s file with the assembly language function calls. The mane C function of note is cprintf.

For fast development on the emulator, the journey begins by creating a disk image
```
c1541 -format "disk,00" d64 disk.d64
```
Store the program in the disk image.
```
c1541 -attach...
```
Introducing cc65
lion mclionhead • 02/11/2023 at 23:21 • 0 comments

CC65 compiles on Linux with just a simple make command. bin/ca65 is the assembler. As expected, it can't compile the 3D demo.

The samples directory has some sample programs which compile with make. 'make disk' creates a .d64 image with all the programs, but requires vice & its c1541 tool to be installed. The cbm directory has more samples which must be compiled with separate make commands. They all compile to machine language. The most impressive one might be the plasma demo. We didn't have fullscreen animations like that in 1985.

It has its own graphics library tgi.h which draws 2D polygons in the 320x200 monochrome mode. The commodore 64 routines are in libsrc/c64/tgi/c64-hi.s. The platform independent bits are in libsrc/tgi. It's all assembly & contains examples of the assembler syntax for a library. There are no examples of a standalone program in assembly language.

As was conventional in the old days, there are no register names. They write the numbers for all the addresses.

It sort of makes sense to do it in C & use the TGI library. Some benchmarks should be done to compare cc65 with paw coded math routines. The SETPIXEL routine seems to be a lot slower than the 3D one. It may be the current fascination with assembly language is just the preference of gootubers to watch vijeos about assembly language.

Helas, this style of development is not much of a trip to the past. The commodore is now just another embedded system. Moreover, it can never be more than an emulation of hardware lions can't afford.

The big question is how much optimization can be done before it's not really 3D graphics anymore. There are 854 vertices in the glxgears model. If the animation was baked into 8 bit XY coordinates to be fed directly to the line drawer, it would fit 38 frames in all 64k of RAM. It possibly could be animated with only 15 frames or 25620 bytes. Hiding the hidden lines could save a lot of memory. The mane loss is the interactive rotation.
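The frame counts above follow from 2 bytes per vertex. Worked out in Python, with constant names that are just for the illustration:

```python
VERTICES = 854                   # vertices in the glxgears model
BYTES_PER_VERTEX = 2             # 8 bit X + 8 bit Y
RAM_BYTES = 64 * 1024

frame_bytes = VERTICES * BYTES_PER_VERTEX     # 1708 bytes per baked frame
max_frames = RAM_BYTES // frame_bytes         # frames that fit in 64k
fifteen_frames = 15 * frame_bytes             # the reduced animation
```

The 38 frame figure assumes all 64k is free, which it isn't once the bitmap & program are in there.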

The arduino port was manely to show how graphics could be sent over the serial port. The C64 port is manely to show how fast a C64 could draw a 3D graphics demo from 25 years ago. It's not really doing the purpose if it's baking it all in 2D.

It's never going to be a real 3D plotter anyway, but the mane differentiator from a 2D plotter is the need for interactive rotation. So the optimization is as much as still allows interactive rotation.