Fixing the LDIR instruction

In the past weeks, I've been working on a new pcb design. That's now almost done.

This weekend I looked at my TODO list, and I found, no surprise, the LDIR instruction.

In one of the previous logs I mentioned that there was an error in the LDIR instruction. It didn't behave correctly when an interrupt occurred.

What is the LDIR instruction ?

In the Z80 processor, this instruction can copy a whole section of memory in just a single instruction. It takes three 16-bit sized values as input:
- A source address in register pair HL
- A destination address in register pair DE
- A byte count in register pair BC

LDIR stands for Load, Increment, Repeat.

It will do the following actions:
- read a byte from the address at location HL
- write that byte at location DE
- increment the addresses in HL and DE
- decrement the byte counter in BC
- if the byte counter is not zero, repeat the actions
- if an interrupt occurs during this instruction, it will be handled and the actions will continue after interrupt is handled.

LDIR has a 2-byte opcode: ED B0

Optimizing the LDIR

The LDIR was already optimized, with 10 cycles per copied byte, but I had the strong feeling that something better was possible. I didn't try to find the problem in the old version, just started again from scratch.

Isetta has only a single hardware data pointer. But nothing keeps us from using the program counter as a data pointer ! For a normal instruction, that would cost way too many cycles, but for a repeated transfer instruction this will pay off. The LDIR microcode should just save the PC to a temporary location, and copy the HL register pair into the PC. Now the PC can be used to get the bytes out of memory. When all actions are completed, the PC is copied back to HL, and PC is restored from the temporary location. The existing, broken LDIR version also had this feature.
And as a bonus, the PC register can increment in the same cycle where it addresses a byte to be fetched.

The destination pointer (register pair D E) is copied to the hardware registers DPH and DPL. The byte, fetched with the address in PC, can now be stored at the address in DPH/DPL. There is a microinstruction that increments DPL, it can also detect a carry, but there is no possibility to increment the DPH register, because it can not be accessed after the change that is described HERE. But we could increment the D register and re-write DPH with the result. This will at least take 3 more cycles (microinstructions).

Then we must decrement the count in the BC registers. So first decrement C (2 cycles) and, when it drops below zero, decrement B (also 2 cycles).

The problems are in the counting, and in the increment of the upper byte (DPH),

But if we unroll the loop, in sections of 8 transfers, we don't have to count each byte. We simply subtract 8 from the LSB of the count, at the beginning of the unrolled loop. But when the count is lower than 8, we jump to a slow old-fashioned loop. This will decrement the MSB of the count when that is needed, and then go back to the unrolled loop. The slow loop will also check for zero to determine if the instruction has completed.

It would be nice if we never had to increment the upper byte (DPH)... What if we never increment the upper byte in the unrolled loop ? Before the unrolled loop starts, we could check if the lower byte (DPL) would overflow in that loop. That would happen if that lower byte is higher than 0xF7 (247), and in that case we do the old-fashioned loop, and when the upper byte has been incremented, we go to the unrolled loop again.

Now, the sections of the unrolled loop have only 3 cycles:
- A <- (PC++)
- (DPH/DPL) <- A
- inc DPL

Pseudo code of the fast copy loop:
- if LSB of the count < 8, goto slow loop
- if LSB of destination > 247, goto slow loop
- count = count - 8
- copy the DE registers (destination) into DPH/DPL
- destination = destination + 8;
- do the unrolled loop 8 times
- go back to fast copy loop

In this fast loop, there are 8 cycles of overhead, that brings the average number of cycles per byte transfer to 4.
Actually, the slow loop that occasionally must be executed during big transfers will bring the average to around 4.4.

Overhead at beginning and end of the instruction (saving PC, put HL in PC, saving A, restoring them) cost 27 cycles.

And here is the microcode

This listing is produced by the Javascript microcode generator. I added some comments.

---- page 0  ED B0 LDIR  ----
// start with checking the interrupt. If interrupt, go to addr 32 in page_z80_ix
001600 066224 0132C0  F:[to_ir <- lit32,p_z80_ix] T:[reg_v <- acc_a]
001602 012220 010022  F:[to_dpl <- 0] T:[to_t <- pcl]
001604 013259 013259  reg_tl <- acc_t
001606 01524D 01524D  pcl<-pch, to_pch <- reg_l
001608 010022 010022  to_t <- pcl
00160A 013258 013258  reg_th <- acc_t
00160C 01524C 01524C  pcl<-pch, to_pch <- reg_h
// obtain the value 0x98 (microcode can not use arbitrary immediate values)
00160E 0B0222 0B0222  to_t <- asl(lit8)
001610 0E0222 0E0222  to_t <- or(acc_t, lit8)
001612 0E0246 0E0246  to_t <- or(acc_t, reg_128)
// store the opcode (0x98) of the fast loop at temp location reg_vga_acc
001614 01325B 01325B  reg_vga_acc <- acc_t
001616 00625B 00625B  to_ir <- reg_vga_acc,p_z80_ed  // jump to opcode 0x98
001618 0132C0 0132C0  reg_v <- acc_a
-- 0/13 + 2 cycles

---- page 0  ED 98 LDIR-LOOP  ----
001300 014249 014249  to_a <- reg_c
001302 3C42A2 3C42A2  to_a <- sub(acc_a, lit8)
001304 610A4B 610A4B  to_t <- reg_e,tc_to_f
// if F=0, move lit32 (0x20) to IR to go to the slow loop
001306 006224 0D0222  F:[to_ir <- lit32,p_z80_ed] T:[to_t <- add(acc_t, lit8)]
001308 612A4B 612A4B  to_dpl <- reg_e,tc_to_f
00130A 01124A 006224  F:[to_dph <- reg_d] T:[to_ir <- lit32,p_z80_ed]
00130C 0132C9 0D0220  F:[reg_c <- acc_a] T:[to_t <- add(acc_t, 0)]
00130E 01324B 01324B  reg_e <- acc_t
001310 014008 014008  to_a <- (pc++)
001312 E13980 E13980  (dph|dpl) <- acc_a,irq_to_f
001314 2A2122 2A2122  to_dpl <- inc(dpl)
001316 014008 014008  to_a <- (pc++)
001318 E13980 E13980  (dph|dpl) <- acc_a,irq_to_f
00131A 2A2122 2A2122  to_dpl <- inc(dpl)
00131C 014008 014008  to_a <- (pc++)
00131E E13980 E13980  (dph|dpl) <- acc_a,irq_to_f
001320 2A2122 2A2122  to_dpl <- inc(dpl)
001322 014008 014008  to_a <- (pc++)
001324 E13980 E13980  (dph|dpl) <- acc_a,irq_to_f
001326 2A2122 2A2122  to_dpl <- inc(dpl)
001328 014008 014008  to_a <- (pc++)
00132A E13980 E13980  (dph|dpl) <- acc_a,irq_to_f
00132C 2A2122 2A2122  to_dpl <- inc(dpl)
00132E 014008 014008  to_a <- (pc++)
001330 E13980 E13980  (dph|dpl) <- acc_a,irq_to_f
001332 2A2122 2A2122  to_dpl <- inc(dpl)
001334 014008 014008  to_a <- (pc++)
001336 E13980 E13980  (dph|dpl) <- acc_a,irq_to_f
001338 2A2122 2A2122  to_dpl <- inc(dpl)
00133A 014008 014008  to_a <- (pc++)
// if F=1, interrupt not active, go back to the start of the fast loop
00133C 010220 00625B  F:[to_t <- 0] T:[to_ir <- reg_vga_acc,p_z80_ed]
00133E E13980 E13980  (dph|dpl) <- acc_a,irq_to_f
001340 010022 010022  to_t <- pcl
001342 01324D 01324D  reg_l <- acc_t
001344 010259 010259  to_t <- reg_tl
// Before going to interrupt code, sub_0(acc_t,1) will do PC = PC - 2
// So after the interrupt, this LDIR will be re-executed
001346 1C5221 1C5221  pcl<-pch, to_pch <- sub_0(acc_t, 1)
001348 010022 010022  to_t <- pcl
00134A 01324C 01324C  reg_h <- acc_t
// The dec_dtc will only do a decrement if the previous subtract crossed zero.
00134C DB5258 DB5258  pcl<-pch, to_pch <- dec_dtc(reg_th)
00134E 006246 006246  interrupt, nop
001350 E14AC0 E14AC0  to_a <- reg_v,irq_to_f
-- 0/41 + 2 cycles


---- page 0  ED 20 LDIR-SLOW  ----
000400 010249 010249  to_t <- reg_c
000402 0E2248 0E2248  to_dpl <- or(acc_t, reg_b)
000404 1B2122 1B2122  to_dpl <- dec(dpl)
000406 612ACB 612ACB  to_dpl <- reg_e,tc_to_f
// F=0: count is zero. Restore registers and go to next instruction
000408 010022 01124A  F:[to_t <- pcl] T:[to_dph <- reg_d]
00040A 01324D 014088  F:[reg_l <- acc_t] T:[to_a <- (pc++)]
00040C 015259 013180  F:[pcl<-pch, to_pch <- reg_tl] T:[(dph|dpl) <- acc_a]
00040E 010022 2A024B  F:[to_t <- pcl] T:[to_t <- inc(reg_e)]
000410 01324C 01324B  F:[reg_h <- acc_t] T:[reg_e <- acc_t]
// The inc_dtc will only do an increment if the previous increment produced a carry.
000412 015258 CA024A  F:[pcl<-pch, to_pch <- reg_th] T:[to_t <- inc_dtc(reg_d)]
000414 046008 01324A  F:[next, nop] T:[reg_d <- acc_t]
000416 E14AC0 1B0249  F:[to_a <- reg_v,irq_to_f] T:[to_t <- dec(reg_c)]
000418 013249 013249  reg_c <- acc_t
00041A DB0248 DB0248  to_t <- dec_dtc(reg_b)
00041C E13A48 E13A48  reg_b <- acc_t,irq_to_f
// If interrupt not active, jump back to fast loop 
00041E 010220 00625B  F:[to_t <- 0] T:[to_ir <- reg_vga_acc,p_z80_ed]
000420 010022 010022  to_t <- pcl
000422 01324D 01324D  reg_l <- acc_t
000424 010259 010259  to_t <- reg_tl
// Before going to interrupt code, sub_0(acc_t,1) will do PC = PC - 2
// So after the interrupt, this LDIR will be re-executed
000426 1C5221 1C5221  pcl<-pch, to_pch <- sub_0(acc_t, 1)
000428 010022 010022  to_t <- pcl
00042A 01324C 01324C  reg_h <- acc_t
00042C DB5258 DB5258  pcl<-pch, to_pch <- dec_dtc(reg_th)
00042E 006246 006246  interrupt, nop
000430 E14AC0 E14AC0  to_a <- reg_v,irq_to_f
-- 12/25 + 2 cycles

Maximum number of cycles per instruction

The fast copy loop will take 8 * 3 + 8 (overhead) = 24 cycles. It is actually longer, because it also tests the interrupt line, and when that is active, it restores the registers, and jumps to the interrupt code. (and before jumping to the interrupt code, it sets the PC two positions back, so the LDIR will be re-entered after the interrupt has been serviced).

But I said somewhere before, that each instruction has max 16 cycles !

Well, actually, the lower 4 bits of the instruction register are in a counter chip. So when the 16 cycles are executed, and no new instruction was fetched, execution continues in the microcode of the next opcode ! So, a maximum of 256 different cycles can be executed, and after that the same 256 cycles come again, until it jumps out of that sequence somewhere.

Of course, when this system is used, the next opcode should not be defined as a normal instruction.

For the LDIR instruction (opcode 0xB0 in the Z80_ED page), a few initializations will be done in the opcode 0xB0 microcode, but will then jump to the fast loop at address 0x98 in the same page. That is an opcode that is undefined by the Z80, and also the opcodes that follow it (0x99, 0x9A) are undefined. That is where our LDIR fast code lives. It can also do a jump to the slow loop, at 0x20. The slow loop also uses the next opcode, 0x21. Both 0x20 and 0x21 are undefined Z80 opcodes.

LDDR

I could do the same for the LDDR (load-decrement-repeat) instruction, but I did not do that, because it is used less often (only when source and destination range overlap in a certain way). Also, it would mean that PC had to decrement instead of increment. Decrementing the PC is not a hardware function, but it could be done by microcode, at two extra cycles per transfer. So if LDDR got the same treatment, the average time would be 6.4 cycles per transferred byte.

So the LDDR is a standard loop, a byte transfer costs 21 cycles.

Progress of new PCB

Changed bankswitching system

Discussions

Become a Hackaday.io Member