-
Why I Am and Am Not a Fan of STEbus.
11/05/2021 at 19:22 • 12 comments
First, the cons, which are deal-breakers for me.
1. 20 Address Bits. This might seem like an awful lot of address space; I mean, you have 1MiB of memory space with STEbus. However, with so few address bits, you cannot practically add any card that exposes a memory space approaching or exceeding that size. Consider: a video card running 800x600 at 24-bit color needs about 1.4MB for its frame buffer alone, more than the entire bus address space, forcing the processor to switch between memory banks when updating the frame buffer (see the sizing sketch after this list). For good performance and minimizing visual tear, this just plain sucks! Also, consider the case of a SCSI or ATA controller wanting to DMA data from a hard drive into memory. The processor will be able to address a lot more memory than the STEbus can, so what is the value of having a bus master on the STEbus if you can't address the processor's memory? The only thing that makes sense is a separate CPU card.
2. 12 Address Bits. When addressing I/O, you're limited to 4KiB of space shared across all 20 cards possible on the backplane. For the 8086 and its progeny, this is a serious limitation, as a number of peripherals first designed for the ISA bus readily use, or even hard-wire, 16-bit I/O addresses.
3. No plug-and-play capabilities whatsoever. You're obliged to select a board's I/O and memory base addresses via jumpers. You're obliged to select interrupts and DMA requests via jumpers. There are jumpers literally everywhere in this system. Compared to the Apple IIe bus, the STEbus offers a sharp regression in user experience and functionality.
4. DIN 41612 connectors. If all you care about is 10 insertions before they break, then this connector is fine, and you can pick them up relatively cheap on Digikey. However, while building my own video/FPGA development card for the RC2014, I must have removed and inserted cards well in excess of this figure. So, if you want a DIN connector that can handle more than this number of insertions, you need to start shelling out upwards of $7 per connector. That's $7 for each connector on your backplane, plus $7 for each mating connector on your expansion cards. This quickly gets expensive. My RC2014 has six cards in it. If DIN connectors were used, that'd total $14*6=$84 in connectors alone. That's almost 1/3rd the cost of the computer! There are significantly cheaper options.
5. Only three bus masters throughout the whole system. You get the "default" bus master, plus no more than two other bus masters across the whole backplane, even if your backplane has the full allotment of 20 slots!
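To put numbers to the frame-buffer complaint in item 1, here's the arithmetic as a quick Python sketch (the 800x600, 24-bit figures are from my example above; the rest is plain math):

```
# Does an 800x600 true-color frame buffer fit in STEbus's 20-bit memory space?
width, height = 800, 600
bytes_per_pixel = 3               # 24-bit packed color
framebuffer = width * height * bytes_per_pixel

bus_space = 2 ** 20               # 20 address bits = 1MiB, shared by ALL cards

print(f"frame buffer: {framebuffer / 2**20:.2f} MiB")  # ~1.37 MiB
print(f"bus space:    {bus_space / 2**20:.2f} MiB")    # 1.00 MiB: doesn't fit,
                                                       # so banking is forced
```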
To fix these deficiencies, I would do the following:
- Use at least 24 to 32 address lines on the connector.
- Unify the I/O and memory address spaces.
- Use geographical addressing. Drive REGSEL# for a particular slot when addressing that slot's register space.
- Make register space at least 64KB in size per slot.
- Get rid of ATNRQ# lines for interrupts, and replace them with IRQ#, IRQ_IN#, and IRQ_OUT#, the latter two providing a priority chain of arbitrary length.
- Get rid of ATNRQ# lines for DMA, and replace them with a generic bus-mastering protocol (DMA_REQ#, DMA_ACK_IN#, DMA_ACK_OUT#) analogous to the IRQ chain. A toy model of this kind of daisy chain appears after this list.
- Swap in sets of 40-pin SIP connectors in place of DIN connectors. A pair gives you 80 usable pins, and the connectors are far cheaper, more reliable, mechanically stable, and far easier to solder and maintain.
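To make the daisy-chain proposal concrete, here's a toy Python model of how the *_IN#/*_OUT# priority chain would arbitrate. This is my own sketch, modeled active-high for readability; it covers both the IRQ and DMA variants, since they share the same structure:

```
# Daisy-chain arbitration: a grant enters the chain at slot 0 and ripples
# down the backplane. Each slot either absorbs it (if requesting) or
# passes it along: grant_out = grant_in AND NOT request.

def arbitrate(requests):
    """Return the winning slot index given per-slot request flags,
    or None if nobody is requesting."""
    grant = True                       # chain head is always granted
    for slot, request in enumerate(requests):
        if grant and request:
            return slot                # this slot absorbs the grant
        grant = grant and not request  # pass it down the chain
    return None

# Slots 1 and 3 both request; slot 1 wins because it sits earlier in the
# chain. The chain is as long as the backplane -- no fixed three-master cap.
print(arbitrate([False, True, False, True]))  # -> 1
```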
STEbus does have some pros which I really like, though:
1. Fully asynchronous. The master drives ADRSTB#. Then it drives DATSTB#. Then it waits for the card to assert DATACK#. Then, the master negates DATSTB#, which triggers the card to release DATACK#. (Concurrently and unrelated, ADRSTB# may also be negated; or it may not, in the case of a read-modify-write cycle.) Only then can the next cycle begin. (A toy walkthrough of this handshake appears after the pros list.) Using chips like the 74HCT688, xCT138, and xCT74 for a timing chain, you can readily generate the acknowledgement. The circuit is literally no worse than DTACK# generation on a 68000.
The SYSCLK (16MHz) clock provided on the bus is intended only as a timing reference for those xCT74 DFFs, and even then, only if you choose to use it. It has no other influence on the operation of the bus.
2. Surprisingly Fast. If you look at the timing diagrams and parameters for STEbus, you'll see that this bus can achieve a theoretical maximum data rate of about 28MB/s. That's insanely fast! Unfortunately, that's also impossible in the real world; which is to say, anywhere outside of a NIST laboratory. However, more practical designs will be able to achieve 4MB/s to 6MB/s, which is still very impressive, considering that Zorro-II can push at most 3.5MB/s.
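For readers who haven't worked with a fully asynchronous bus before, here's the handshake from point 1 as a narrated sequence (a toy illustration of the edge ordering only, not a cycle-accurate STEbus model):

```
# The fully interlocked handshake from point 1. Each line is one signal
# edge; a new cycle may begin only after the last edge. The slave may take
# as long as it needs between edges 2 and 3 -- that delay, not any clock,
# is what paces the bus.

HANDSHAKE = (
    "1. master asserts ADRSTB#  (address valid on the bus)",
    "2. master asserts DATSTB#  (data phase begins)",
    "3. slave  asserts DATACK#  (whenever it's ready; no clock involved)",
    "4. master negates DATSTB#  (acknowledge observed)",
    "5. slave  negates DATACK#  (DATSTB# negation observed)",
    "6. master negates ADRSTB#  (may happen concurrently with 4-5;"
    " stays asserted during read-modify-write)",
)

for edge in HANDSHAKE:
    print(edge)
```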
These are absolutely features to keep in any expansion bus technology chosen for the ForthBox, I think.
The only thing is, because of how asynchronous interfaces work, everything behind the interface must be considered "slow." So, the ForthBox will probably end up having a "slow RAM" and "fast RAM" dichotomy, just as the Apple IIgs did, and the Commodore Amiga (which used "chip RAM" where the IIgs said "slow RAM"). Basically, fast RAM is local to its own CPU and inaccessible to other CPUs or I/O devices, almost like a cache memory, while slow RAM is global, seen by everything else in the system.
The end result of such a design would more closely resemble NuBus-on-a-Budget, versus STEbus's VME-on-a-Budget design approach.
ForthBox?
Oh, what is a ForthBox? I guess you'll need to wait and see.
And, will I use such a bus for the ForthBox? Not entirely sure, to be honest; but, it is on my short-list of considerations.
-
Parts received, but no boards yet. :(
08/02/2016 at 17:41 • 0 comments
Well, I got the parts I ordered from Digikey, but so far, no boards yet.
-
More EDA woes. You'd think this was simple stuff.
06/07/2016 at 17:44 • 0 comments
I'm having a great deal of difficulty resolving one final (known) bug in the PCB layout. And I cannot seem to fix it through any recommended method I know of.
The problem is that the ADJUST pin on the voltage regulator connects to one end of a potentiometer. This should be either pin '1' or pin '3' of the pot. The opposing end pin and the wiper pin should be grounded. So, either pins 1 and 2 are grounded, OR, pins 2 and 3 are grounded. Pins 1 and 3 should most definitely not be shorted.
And yet, while this is very clearly expressed in gschem, and the footprint for the pot was redrawn just to make absolutely sure everything is correct, PCB literally insists on shorting pins 1 and 3 of the pot, leaving pin 2 to do whatever it wants.
This is most infuriating, as you can imagine. After spending literally tens of hours trying to debug this, I was left literally screaming at the computer. I can manually route the traces, of course, but the netlist would be completely borked if I do, which renders the "find signal" (F or CTRL-F) function in PCB utterly useless.
I'm at wit's end. I don't know what to do.
-
A Bit of Hindsight: Part 2: Byte Lanes vs. Width Hierarchies
05/31/2016 at 18:35 • 0 comments
Backbone, as it's currently defined, is basically Wishbone exposed to the world. It's an almost purpose-built bus interface just for the Kestrel-3's hardware development as I work towards a single-board version of the computer. Its mission, and thus its criteria for success, are:
- It lets me explore different pieces of the Kestrel-3 in isolation from other components. With an SBC, this is not possible; I'd have to refab the entire board if I changed even just one circuit.
- It lets me explore bus architecture design. This is already a resounding success; I don't even have a board fabbed yet, and have already identified two things I would do differently next time I need a parallel bus. I've already documented one of these things in the previous log; this log is devoted to the second.
One characteristic of the Wishbone bus is that, per the specification, wide interfaces need to be qualified with one or more select signals; these select signals function the same as BEx in Intel CPUs, DSx in 68K CPUs, etc. SEL0, when asserted, means that valid data appears on DAT0-DAT7. SEL1 means data appears on DAT8-DAT15, and so on. (All assuming an 8-bit granular interface, of course.) This also implies that the address bus is split into two parts: ADR0..ADRx is literally hidden from the outside world, since it is combined with the desired transfer size to calculate the proper SEL line settings, while ADRx+1..ADRy is what actually gets exposed (where y is your highest address bit; typically 15, 31, or 63 for 16-, 32-, or 64-bit address spaces). More concretely, a 64-bit wide, 8-bit granular bus will not expose A0, A1, or A2, since these bits are used to determine which of SEL0 through SEL7 are asserted for bytes, which pair is asserted for half-word transfers, etc.
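Here's a quick sketch of how that works out for the 64-bit, 8-bit-granular case (my own Python illustration of the scheme; the function name is mine, not from the spec):

```
# Byte-lane select generation on a 64-bit, 8-bit-granular bus. The low 3
# address bits never leave the master: combined with the transfer size,
# they become the SEL mask, and only ADR3 and up are exposed on the bus.

def sel_lines(addr, size_bytes):
    """Return (exposed_address, sel_mask) for a naturally aligned
    transfer of 1, 2, 4, or 8 bytes."""
    assert size_bytes in (1, 2, 4, 8) and addr % size_bytes == 0
    lane = addr & 0b111                     # hidden bits ADR0..ADR2
    mask = ((1 << size_bytes) - 1) << lane  # one SEL bit per byte lane
    return addr >> 3, mask

print(sel_lines(0x1001, 1))  # byte:      SEL1 only     -> (0x200, 0b00000010)
print(sel_lines(0x1006, 2))  # half-word: SEL6 and SEL7 -> (0x200, 0b11000000)
print(sel_lines(0x1008, 8))  # full word: all eight     -> (0x201, 0b11111111)
```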
This is a great optimization if you're addressing memory. Memory is inherently amenable to this kind of row/column decomposition of the address space, so it makes perfect sense. The problem is that literally everything else you'd ever want to talk to on the bus is not so amenable.
Consider the KIA, which I introduced first for the Kestrel-2, which also used a Wishbone bus. Its registers are only 8 bits wide, and the core has only a single address input. You'd expect its registers to appear at KIA+0 and KIA+1; however, this is a mistake. Because A0 is not exposed to the world, it does not participate in address decoding. Instead, A1 is attached (the Kestrel-2 is a 16-bit CPU and bus system), which means its registers are actually located at KIA+0 and KIA+2. So what appears at KIA+1 and KIA+3? Nothing. If the KIA had writable control registers and you attempted to write to those locations, you'd run the real risk of loading garbage into those control registers, since the state of the byte lane those registers listen to would be completely undefined.
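A quick sketch of the resulting register map, assuming a hypothetical base address for the KIA:

```
# On the Kestrel-2's 16-bit Wishbone bus, A0 never leaves the CPU, so the
# KIA's single address input is wired to A1. Even offsets hit a register;
# odd offsets hit an undefined byte lane.

KIA = 0xFF00                          # hypothetical base address
for offset in range(4):
    if offset % 2 == 0:
        print(f"KIA+{offset}: register {offset >> 1}")
    else:
        print(f"KIA+{offset}: nothing (undefined byte lane)")
# KIA+0: register 0, KIA+1: nothing, KIA+2: register 1, KIA+3: nothing
```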
A much better approach is to use High Enables instead. Instead of a linear decomposition of the bus lanes (where a 64-bit bus has 8 lanes of 8 bits each), a logarithmic decomposition is used (a 64-bit bus has 1 32-bit high word, 1 16-bit high half-word, 1 8-bit high byte, and 1 low byte). Such a bus allows 8-bit devices to focus just on D0-D7 without concern for which byte lane they should attach to, 16-bit devices on D0-D15, and so forth.
It is also naturally supportive of upward compatibility. To illustrate, let's start with a simple nybble-wide bus.
A0-A3 D0-D3 WE STB ACK
Pretty simple; it allows us to read or write any nybble in a 16 nybble address space. We can expand the address space easily by just tacking on more address bits: this doesn't affect old hardware, since it just ignores the upper address bits.
A0-A7 D0-D3 WE STB ACK
But, if we now want to address bytes, we need to tack on another set of data bits. The CPU would tell the addressed peripheral that it wants to transfer a full byte by using a "Nybble High Enable" (NHE) control signal.
A0-A7 D0-D7 WE STB ACK NHE
We need to know if D0-D3 or if D0-D7 are valid. That's the purpose of NHE, and it behaves like so:
A0  NHE  D0-D3       D4-D7
0   0    Nybble A
0   1    Nybble A    Nybble A+1
1   0    Nybble A+1
1   1    Impossible condition.
If NHE is negated, then A0-A7 determines what value appears on D0-D3 just like the old 4-bit bus. But, if NHE is asserted, then A1-A7 (NOTE! A0 not involved!) determines which byte to read from or write to. A0 will always be zero, since that will make the address byte aligned. Accessing data with both NHE and A0 set would be an alignment violation.
This can be expanded upwards to support a 16-bit bus as well, and it can be done in a completely backward compatible manner:
A1  A0  BHE  NHE  D0-D3       D4-D7       D8-D15
0   0   0    0    Nybble A
0   1   0    0    Nybble A+1
1   0   0    0    Nybble A+2
1   1   0    0    Nybble A+3
0   0   0    1    Nybble A    Nybble A+1
0   1   0    1    Impossible condition.
1   0   0    1    Nybble A+2  Nybble A+3
1   1   0    1    Impossible condition.
-   -   1    0    Impossible condition.
0   0   1    1    Nybble A    Nybble A+1  Byte A+2
0   1   1    1    Impossible condition.
1   -   1    1    Impossible condition.
Trivia: why must BHE and NHE be asserted at the same time? Because all byte accesses are also nybble accesses. Likewise, all 16-bit word accesses are also byte and nybble accesses as well. NHE needs to be asserted because hardware unaware of BHE will not know to drive D4-D7 during a byte or word-sized transaction.
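The whole table, impossible conditions included, falls out of a few lines of logic. Here's a sketch in Python (the function name and error handling are mine):

```
# High-enable decoding for the 16-bit bus above. NHE says "D4-D7 carries
# data", BHE says "D8-D15 carries data", and BHE implies NHE because every
# word access is also a byte and nybble access.

def decode(addr, size_nybbles):
    """Return (A1, A0, BHE, NHE) for a 1-, 2-, or 4-nybble transfer at
    nybble address addr. Misaligned combinations are the table's
    'impossible conditions'."""
    if size_nybbles not in (1, 2, 4):
        raise ValueError("transfer must be 1, 2, or 4 nybbles")
    if addr % size_nybbles != 0:
        raise ValueError("alignment violation (impossible condition)")
    a1, a0 = (addr >> 1) & 1, addr & 1
    nhe = 1 if size_nybbles >= 2 else 0   # upper nybble lane in use
    bhe = 1 if size_nybbles == 4 else 0   # upper byte lane in use
    return a1, a0, bhe, nhe

print(decode(2, 1))   # nybble at A+2:  (1, 0, 0, 0)
print(decode(2, 2))   # byte at A+2:    (1, 0, 0, 1)
print(decode(0, 4))   # word at A:      (0, 0, 1, 1)
```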
And this keeps scaling up and up. I used nybbles to illustrate in a more or less convenient way, but in the real world, you'd typically use Byte Enables instead of Nybble Enables. If you rework the example above with bytes in place of nybbles (widening every data path by 4 bits), you'll notice we've described a 32-bit bus with the same number of total signals as a byte-lane type bus, but which retains full backward compatibility with a simple 8-bit bus.
Once you go beyond 32-bits, though, this is where the savings come in big. To widen the bus to 64 bits, you need one new high-enable, and another 32-bit data lane. Let me repeat that: you have a total of three high enables, not eight like you'd have with a typical laned bus. For a 128-bit bus, you'll add a 64-bit data lane, and one more high enable. If we compare bus data and lane select bits, we see the following trend (assuming a 64KB address space; add pins as needed):
Laned bus:
Data bits   8   16  32  64  128
Addr bits   16  15  14  13  12
SEL bits    0   2   4   8   16
Totals      24  33  50  85  156

High-enable bus:
Data bits   8   16  32  64  128
Addr bits   16  16  16  16  16
HE bits     0   1   2   3   4
Totals      24  33  50  83  148
In the worst-case, you're at parity with the number of signals you need to route, and in the best case, you have (potentially quite a bit of) a savings.
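The tables are easy to regenerate; here's the arithmetic as a short Python sketch (same assumptions as above: 64KB address space, 8-bit granularity):

```
# Pin counts for a byte-laned bus vs. a high-enable bus, 64KB address space.
for data in (8, 16, 32, 64, 128):
    lanes = data // 8
    hidden = (lanes - 1).bit_length()   # low address bits folded into SELs
    sel = lanes if lanes > 1 else 0     # laned bus: one SEL per byte lane
    he = hidden                         # high-enable bus: log2(lanes) enables
    laned = (16 - hidden) + data + sel
    he_bus = 16 + data + he
    print(f"{data:>3} data bits: laned = {laned:>3}, high-enable = {he_bus:>3}")
```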
In terms of compatibility, you can certainly make something like a packed KIA address layout work with a laned bus too; but, the target hardware has to be aware of the bus architecture for this to work right. In the worst case, you'd basically need a new hardware spin with each widening of the bus (except in those cases where the base address remains naturally aligned with the bus word size). In the best possible case, you need a "bus bridge" to perform lane management on behalf of the older peripheral hardware. You'll need to recover lower address bits based on received SEL lines, and that assumes no illegal bit patterns!
All in all, using a logarithmic bus decomposition with high-enables seems to offer a ton of advantages over a flatly decomposed lane-based bus. Probably about the only time a laned bus will demonstrate any superiority is in those cases where the bus controller write-combines non-adjacent transactions. Except for video controllers, I can't think of any time you'd want to do this. Maybe I'm wrong though.
EDIT: Looking at the tables above, it's clear to me now why the Wishbone B4 spec limits the port size to 64 bits.
-
A Bit of Hindsight: Part 1: Signal Routing
05/31/2016 at 17:11 • 0 comments
For my needs, it doesn't really matter how I lay out the address or data bus pins. When I synthesize a design to an FPGA, the signals can be routed to arbitrary pins through the UCF or PCF files. I was relying on this when I came up with the pin layout for the DIN connectors.
However, in retrospect, it was probably a mistake to put all data pins on row A, and all address pins on row C. Based on my experience routing the bus on the backplane, it would have been better to keep all the related signals together on the FPGA (minimizes internal routing resources), and interleave the data and address pins across rows A and C. So, instead of:
   Row A  Row C
1  D0     A0     |  pins assigned along the row.
2  D1     A1     |
3  D2     A2     |
4  D3     A3     V
I should have done this instead:
   Row A  Row C
1  D0     D1     --->  pins assigned across rows.
2  D2     D3
3  A0     A1
4  A2     A3
Electrically, they're identical; it's just that it makes routing buses to relevant pins on FPGAs easier, particularly if the FPGA is in a TQFP or similar package.
For BGA devices, I don't think it matters as much; breaking signals out of a 16x16 BGA (such as with an iCE40HX8K-CT256 device) is going to require no less than a 4-layer board and quite possibly more, just to route signals a few centimeters in any coherent direction and in any reasonable order. And, it's going to involve a lot of vias. A lot of vias.
The one nice thing about Backbone's current pinout is that it makes interfacing to microcontrollers-as-slaves that much easier. For example, perhaps I'll replace the KIA circuit in the FPGA with a microcontroller that acts as a USB-keyboard-in, standard-bytecode-out KIA-like replacement. Such a device is much easier to implement using a microcontroller than using FPGA resources. (Sounds like a job for the S16X4A again!)
-
DIN41612 routing back on course.
05/31/2016 at 16:58 • 0 comments
I discovered a number of settings in PCB that allow me to route all 96 pins of a DIN 41612 connector on a single side of a two-layer circuit board. I had to set my trace size to 6 mil, and reduce my annular ring size to somewhere in the vicinity of 10 mil. These are figures which OSHPark seems to support, so I don't think other PCB fabs will have issues either.
I have many of the paths routed already. I just need to find an optimal layout for the rest of the circuitry. I really wish I didn't need a 74LVT20 or 74LVT04. Capturing and responding to signals on a card-by-card basis really ruins the elegance of the overall design, and appreciably complects the routing of signals. Thankfully I have two layers to play with.
-
DIN 41612 too difficult to route.
05/30/2016 at 16:33 • 0 comments
When trying to break traces out from a DIN 41612 plot on a 2-layer PCB design, I found that it was possible only with great difficulty; it required a lot of surface area that otherwise had no other components. This represents a lack of efficiency, and drives the cost of the board up significantly. It also lengthens the individual traces to well beyond four inches, so additional termination circuitry would definitely be needed. Since this backplane is not intended for industrial use, I am not able to justify the cost of a 4-layer board to myself right now.
But, if I only have two rows of pins instead of three, I can route the bus very efficiently indeed. In fact, it can be done entirely on a single side of the PCB, leaving the other side free to be a ground pour.
So instead of a single DIN 41612 connector, I'm thinking I should use two or three co-linear 2x20 box headers instead. You know the kind: they were used for years to connect parallel ATA devices like hard drives to PCs. Because of their ubiquity, they're dirt cheap (two box headers still come to only about 66% the cost of a single DIN 41612 connector), although, if my math is right, the change increases the minimum length of a plug-in card from 3-ish inches to 4-ish inches. In other words, the average cost increase of a larger PCB is mostly offset by the lower cost of the connectors, and so it should be a wash, price-wise.
The only disadvantage that I can see is that I'm losing 16 pins, which means I will have no room whatsoever for upward expansion. Moreover, I'm losing a large number of +5V pins as well.
My plan is to break the bus up into two connectors, giving me a total of 80 pins to work with. Each row is segmented into four-pin groups: three signal pins and one ground pin. The grounds are staggered; this way, no signal is more than two pins away from a ground. This leaves a total of 60 signal pins.
In connector J1, you'll find an 8-bit subset of the Backbone bus. D0-D7, A1-A7 for register select purposes, and A56-A63 for I/O device decoding. As well, you'll find WE, SEL0, STB, ACK, CLK, RESET, and CDONE pins. These should be sufficient to, for instance, wire up a number of 65C22 or i8255 chips, or some other similarly simple 8-bit interface. Note that there's no need to monitor CYCA here, since if SEL0 is asserted, it will be because a cycle is in progress. What you won't be able to tell, though, is if the bus transaction is part of a read-modify-write transaction. But, honestly, that information is rarely useful except in multiprocessor configurations anyway. This results in the cheapest possible board configuration; a PCB can be even smaller than the original design, at just about 2" long on a side.
In connector J2, you'll find D8-D15, A8-A23, SEL1, and the remaining bus mastership pins. This lets you take full advantage of the 16-bit data path, the complete address space, and/or the ability to master the bus.