Close

Optimising CT reconstruction on 52 years old computer

marcelvanherkmarcelvanherk wrote 3 days ago • 3 min read • Like

As shown in the previous page, CT reconstruction works and I think it is quite fast even though the reconstructed slice is only 80x80 pixels as on the original EMI CT scanner. 

The core of the reconstruction algorithm is to take the array of projection data and stretch it out over rows of the CT slice. By modifying the shift and pitch of the stretch row by row the effect is a straight backprojection of the projection data, i.e., a single point becomes a line in a preset arbitary direction.

However, pixels of the output data will not land exactly on pixels of the input data (circles).

I.e., the pixel address is not an integer number. The Nova is 16 bits, so it is reasonable to use 32 bits fixed point numbers and use give the pitch (stepx) in 1/65536 of a pixel.

But the Nova has only four registers, two of which are this needed to maintain the subpixel address. If we have the step in register 2 and 3 and the subpixel address in register 0 and 1 we have used all registers!.

This is the code to step. Because Nova has no add with carry a SZC (Skip zero carry) is used to transfer the carry by incrementing register 1.

ADD 3 0 CZ SZC
INC 1 1
ADD 2 1

But now we need to get the projection pixel, and add it to a pixel in the slice row. There are not enough registers. I first dealt with this by using memory for most variables (function GSTRETCHADD):

  LDA 1 AC3 // get projection data
  LDA 2 IND autoinc.target
  ADD 2 1
  LDA 2 autoinc.target
  STA 1 AC2
  LDA 1 stepx
  LDA 2 stepx1
  ADD 1 0 CZ SZC
  INC 3 3
  ADD 2 3
  DSZ variable.counter
  JMP GSTRETCHADD1

But a register instruction takes only one cycle, while a memory instruction uses two cycles (3 if using indirection). This loop therefore uses 20 cycles. A variable with octal address 20 is used as its autoincrements when addressing indirect. But we need bith  read and write with a single increment so more instructions are needed.

We can improve on this as follows - in step 1 (macro GS1) use 5 cycles to calculate the subpixel address while having the step in register 2 and 3, and store the result:

ADD 3 0 CZ SZC
INC 1 1
ADD 2 1
STA 1 temp

and then this - in step 2 (macro GS2) use 8 cycles to add the inpout data pointed to by temp to the slice pixel pointed by AC2.

LDA 1 IND temp
LDA 3 AC2
ADD 1 3 
STA 3 AC2

This takes only 13 cycles. However, setting the required registers adds between steps 18 cycles. So 31 cycles in total. That is slower. The trick is therefore not to process one pixel at a time but doing 16 pixels at once by unrolling the loop. Then the 18 cycles overhead are shared by 16 pixels and the total loop requires a bit over 14 cycles per pixel. A speedup of 30%.

That was quite technical! If the assembly code looks strange to you, it is because it is assembled by my own written assembler in Lua. In lua_assembler\NovaReconstructor.lua you find the reconstruction code, file lua_assembler\telnet_nova.lua contains the assembler.

Like

Discussions