Overview of the firmware

The firmware running on the Cyclone IV FPGA of the Haasoscope Pro is the central brains of the whole oscilloscope. Its key jobs are to:

Control and monitor the behavior of the many ICs on the board such as DACs, amplifiers, the main ADC, etc. over SPI
Collect and synchronize data coming from the main ADC, using 56 LVDS pairs
Process the incoming ADC data, including triggering, downsampling (optionally ignoring some fraction of samples to decrease the effective sample rate), and buffering of the triggered data while waiting for readout
Accept and respond to commands over USB via the FT232H chip, including sending out the buffered ADC data over USB

The firmware is written and compiled using the Altera Quartus IDE. The "lite" version is free, and runs on Windows or Linux. To get started, just open the Haasoscope Pro firmware project file, "HaasoscopePro/adc board firmware/coincidence.qpf".

Here's an overview of the firmware structure, as seen in Quartus:

These are the basic building blocks:

PLL block, top left, provides clocks for other blocks
LVDS input blocks, below the PLL, handle LVDS inputs from the main ADC and output data to the main processor block
Data buffer block, below the LVDS blocks
SPI controller block, top right, handles SPI communication between the main processor block and the various chips on the board. The SPI input MUX, below the SPI block, selects which chip's SPI output is being listened to
FT232H USB block, below the MUX, buffers and handles USB IO to/from the main processor block
Main processor block, bottom right, is the main brains: receives and responds to commands from the USB block, handles main ADC data from the LVDS inputs, triggers on the ADC data, processes and buffers triggered ADC data

Let's now look at each block in a bit more detail.

PLL block

The PLL can take in either a 50 MHz clock from the board's 50 MHz crystal, or an external 50 MHz clock from LVDS (used for synchronizing multiple boards). Outputs tell the main processor block which of the two input clocks are available, which is actively being used, and whether the PLL is locked. An input, "clkswitch", is controlled by the main processor block and decides which clock input is active.

There are 5 clocks output from the PLL:

c0 runs at 8*50 MHz = 400 MHz and tells the LVDS input blocks when to sample inputs. The LVDS inputs are DDR (sampled on both clock edges), so each effectively runs at 800 MHz, near the limit of the Cyclone IV.
c1 runs at 8/5 * 50 MHz = 80 MHz and tells the main processor when to read from the LVDS input blocks. Each LVDS input block takes in 10 bits of serial data at 800 MHz and then outputs a single 10-bit wide block of data at 80 MHz. This de-serialization is key to allowing the FPGA to process the data, since the FPGA logic runs at a max clock rate of a few hundred MHz.
c2 is a 50 MHz output copy of the input clock, but de-jittered by the PLL. It drives the main processor logic which interfaces to the USB block, outputs to the LVDS clock out for syncing other boards, and outputs to the ADF4350 PLL for making the 1600 MHz differential clock that drives the main ADC.
c3 and c4 are also 400 MHz outputs that drive subsets of the LVDS inputs whose connections from the main ADC are either shorter or longer, respectively. The LVDS signals travel at about half the speed of light in the board, since the dielectric constant of FR4 is about 4 and the signal speed goes like 1/sqrt of the dielectric constant. So they travel at about 30 cm/ns / 2 = 15 cm/ns. The shortest LVDS connection is about 2 cm and the longest is about 6 cm, so there's a 4 cm difference, or 4 cm / 15 cm/ns = ~0.3 ns max time difference. At 800 MHz there's a new bit every 1.25 ns, so 0.3 ns is a significant fraction of the bit period, enough that some of the inputs may not be sampled correctly if all are driven from the same clock. We could have tried to make all the LVDS connections more similar in length, by meandering the tracks, but this is a bit tricky for differential tracks while maintaining impedance (in Eagle), and the routing is already extremely dense. Instead we just drive the short, standard, and long tracks each on their own LVDS clock, c3, c0, and c4, respectively, and let the LVDS input blocks synchronize them. The outputs of the LVDS blocks are all read on the same clock, c1.
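
The skew estimate above can be reproduced in a few lines of Python. This is just the arithmetic from the text, under the same rough assumptions (dielectric constant of about 4, trace lengths of about 2 cm and 6 cm). The variable names are made up for illustration.

```python
import math

C_CM_PER_NS = 30.0             # speed of light in vacuum, about 30 cm/ns
DIELECTRIC = 4.0               # rough FR4 dielectric constant, as assumed in the text
SHORT_CM, LONG_CM = 2.0, 6.0   # approximate shortest and longest LVDS trace lengths
BIT_RATE_MHZ = 800.0           # DDR sampling on a 400 MHz clock

# Signal speed scales as 1/sqrt(dielectric constant)
speed_cm_per_ns = C_CM_PER_NS / math.sqrt(DIELECTRIC)

# Worst-case arrival time difference between the shortest and longest trace
skew_ns = (LONG_CM - SHORT_CM) / speed_cm_per_ns

# Time between consecutive bits on one LVDS pair
bit_period_ns = 1000.0 / BIT_RATE_MHZ

print(f"speed      = {speed_cm_per_ns:.1f} cm/ns")
print(f"skew       = {skew_ns:.2f} ns")
print(f"bit period = {bit_period_ns:.2f} ns")
print(f"skew / bit period = {skew_ns / bit_period_ns:.0%}")
```

With these numbers the skew comes out to a bit under 0.3 ns, roughly a fifth of the 1.25 ns bit period, which is why separate, phase-adjustable clocks for the short and long tracks are worth having.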

There is an interface to the PLL, connected to the main processor block, for dynamically adjusting the phase of each output clock. This is used to adjust the phase of each of those LVDS clocks, so all the LVDS inputs can be synced.

LVDS input blocks

We've mostly already discussed the LVDS blocks. There are four 12-bit LVDS busses, so 48 LVDS data pairs coming from the main ADC, which together carry 3200 million 12-bit samples per second. There's also a clock and a strobe LVDS input for each of the four 12-bit LVDS busses, so 4*12+4*2=56 LVDS inputs total, each running at 800 MHz. I still find it amazing that an FPGA like this can process that much data, 44.8 Gb/s, for ~$75, using ~1W!
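
For anyone who wants to check the numbers, here is the same pair counting and bandwidth arithmetic as a small Python sketch. It only uses values quoted in this log, and the constant names are invented for the example.

```python
BUSSES = 4                # LVDS busses coming from the main ADC
DATA_PAIRS_PER_BUS = 12   # one LVDS pair per ADC data bit
EXTRA_PAIRS_PER_BUS = 2   # clock + strobe
RATE_MBPS = 800           # per pair: DDR on a 400 MHz clock

data_pairs = BUSSES * DATA_PAIRS_PER_BUS                           # 48
total_pairs = BUSSES * (DATA_PAIRS_PER_BUS + EXTRA_PAIRS_PER_BUS)  # 56
sample_rate_msps = BUSSES * RATE_MBPS                              # 3200 MS/s of 12-bit samples
total_gbps = total_pairs * RATE_MBPS / 1000                        # all 56 pairs
payload_gbps = data_pairs * RATE_MBPS / 1000                       # ADC data bits only

print(f"{total_pairs} pairs, {sample_rate_msps} MS/s, "
      f"{total_gbps:.1f} Gb/s total, {payload_gbps:.1f} Gb/s of ADC data")
```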

Data buffer block

The data buffer is an array of FPGA memory 1024 words long and 560 bits wide that stores the processed ADC data before it's read out. (560 bits = 12 bits + clock + strobe for each of the 4 LVDS busses from the ADC, times 10 samples per tick of c1, the deserialized LVDS readout clock.) There's no external RAM on the board. Interfacing to that would require another huge number of LVDS connections in order to write the data to RAM fast enough to keep up with the ADC.

The main processor block (see below) writes its processed data into the data buffer, which is a circular buffer. A word is written every clock cycle to a write location that is then incremented, and the write location wraps back around to the beginning when it reaches the end of the buffer. When a trigger occurs, the main processor stops writing into the buffer after a given number of clock ticks. This leaves B words in the buffer from before the trigger and A words from after the trigger, where A+B=1024 (each word holds 40 samples). The division between A and B is set by the "trigger position" command via the software.
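
To make the pre-trigger / post-trigger idea concrete, here's a toy Python model of a circular buffer that keeps writing until a fixed number of words after the trigger and then freezes. It's only a sketch of the behavior described above, not a translation of the firmware: the class and attribute names are invented, and the exact tick on which the real logic stops may differ.

```python
class TriggeredRingBuffer:
    """Toy circular buffer that freezes a fixed number of writes after a trigger."""

    def __init__(self, depth=1024, post_trigger=512):
        if not 0 <= post_trigger <= depth:
            raise ValueError("post_trigger must be between 0 and depth")
        self.depth = depth
        self.post_trigger = post_trigger  # "A": words written after the trigger word
        self.mem = [None] * depth
        self.write_addr = 0
        self.remaining = None             # stays None until a trigger has been seen
        self.frozen = False

    def write(self, word, triggered=False):
        """Call once per clock tick. Returns True once the buffer is frozen."""
        if self.frozen:
            return True
        self.mem[self.write_addr] = word
        self.write_addr = (self.write_addr + 1) % self.depth  # wrap at the end
        if self.remaining is None and triggered:
            self.remaining = self.post_trigger  # start the post-trigger countdown
        elif self.remaining is not None:
            self.remaining -= 1
        if self.remaining == 0:
            self.frozen = True
        return self.frozen

    def readout(self):
        """Return the stored words oldest-first."""
        return self.mem[self.write_addr:] + self.mem[:self.write_addr]


# Small example: 8-word buffer, 3 words kept after the trigger at tick 20
buf = TriggeredRingBuffer(depth=8, post_trigger=3)
for tick in range(100):
    if buf.write(tick, triggered=(tick == 20)):
        break
print(buf.readout())  # ticks 16 through 23, with the trigger word 20 followed by 3 more
```

Because the buffer is always being overwritten, the data from before the trigger is already sitting there when the trigger arrives. Changing `post_trigger` just slides the trigger point earlier or later in the window.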

SPI controller block and SPI input MUX

The SPI block listens to commands from the main processor block and creates SPI commands to send to target devices on the board. It's based on spi-master by Nandland. Then there's an input MUX to select which device we want to receive SPI data from. All higher level SPI sequencing (what to send, what to read, etc.) is handled in the main processor block.

FT232H USB interface block

The USB interface block communicates between the main processor block and the FT232H chip that talks to the software over USB. It is based on FPGA-ftdi245fifo by WangXuan95. Bytes received from the software go into a FIFO in the USB interface block, which then can be read by the main processor block. Bytes sent from the main processor block go into another FIFO in the USB interface block, which then are sent out over USB to the software. The USB block has its own 60 MHz clock, generated by the FT232H chip from an external 12 MHz crystal on the board, to handle the physical layer USB communications.

Main processor block

The main processor block is where most of the real logic and processing occurs. The file is "command_processor.v". There are three main sections.

The first section takes the input LVDS data from the main ADC and creates the "samples" that will be stored into the data buffer. These samples may need to come at a lower rate than the ADC provides them. For instance, the ADC is sampling at 1600 MHz on two channels, but we may want to see a longer length of time in our data buffer, i.e. we may want a longer time base, say 1 ms / division. To do that, we need to ignore almost all of the incoming samples and keep just 1 out of every N, with N chosen so that the data buffer fills after 10 ms (assuming 10 divisions). This is called downsampling. It gets a little complicated - actually super annoyingly complicated (I banged my head on the wall for many hours to get it right!) - because the samples are coming in 40 at a time in each clock tick, thanks to the LVDS 10x deserialization and the 4 LVDS busses, which are interleaved 4x (or 2x if we are in two-channel mode).
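
Here's a simplified Python sketch of why "keep 1 out of every N" gets awkward when samples arrive 40 at a time: the position of the next sample to keep has to be carried from one block to the next, and for large N most blocks contribute nothing at all. This is an illustration only. It ignores the bus interleaving and the repacking of kept samples into full buffer words that the firmware also has to handle, and the function name is made up.

```python
def downsample_blocks(blocks, n):
    """Keep every n-th sample of a stream that arrives in fixed-size blocks.

    Yields one (possibly empty) list of kept samples per input block.
    """
    skip = 0  # samples to skip at the start of the next block before keeping one
    for block in blocks:
        kept = list(block[skip::n])
        if skip < len(block):
            last_kept = skip + (len(kept) - 1) * n
            skip = last_kept + n - len(block)  # carry the remainder into the next block
        else:
            skip -= len(block)                 # nothing kept from this block
        yield kept


# Ten blocks of 40 samples, keeping 1 in 16
stream = list(range(400))
blocks = [stream[i:i + 40] for i in range(0, 400, 40)]
per_block = list(downsample_blocks(blocks, 16))
flat = [s for part in per_block for s in part]
print([len(part) for part in per_block])
print(flat == stream[::16])
```

Note that the number of samples kept per block is not constant (3, then 2, then 3, ...), which is part of what makes writing full-width words into the data buffer fiddly.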

The next section handles triggering. We have rising edge and falling edge triggers, as well as external triggers coming from another board or external input. More trigger types may be added, of course. When a trigger occurs, we have to record which of the 40 samples of the clock tick actually crossed the trigger threshold, and also at which position we were writing into the data buffer. We'll need to know these for the data readout later and for correctly positioning the data on the screen in the software. Once a trigger has occurred and the requested number of post-trigger clock ticks has been written, we freeze the writing into the data buffer and wait for readout.
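
As a rough picture of the rising edge case, the Python sketch below scans blocks of 40 samples for the first upward threshold crossing and reports both the buffer write position and the sample index within the block. The crossing rule used here (previous sample below threshold, current sample at or above it) and the assumption that one word is written per tick starting at address 0 are simplifications of my own for the example. The function names are invented too.

```python
def find_rising_edge(block, threshold, prev_sample=None):
    """Index within block of the first upward threshold crossing, or None."""
    last = prev_sample
    for i, sample in enumerate(block):
        if last is not None and last < threshold <= sample:
            return i
        last = sample
    return None


def scan_for_trigger(blocks, threshold, depth=1024):
    """Return (buffer write position, sample index in word) of the first trigger."""
    prev = None
    for tick, block in enumerate(blocks):
        idx = find_rising_edge(block, threshold, prev)
        if idx is not None:
            return tick % depth, idx
        prev = block[-1]  # needed to catch a crossing right at a block boundary
    return None


quiet = [0] * 40
edge = [0] * 39 + [5]
print(scan_for_trigger([quiet, edge, [10] * 40], threshold=5))  # (1, 39)
print(scan_for_trigger([quiet, [6] * 40], threshold=5))         # (1, 0)
```

The second example shows why the last sample of the previous tick has to be remembered: the crossing happens between two blocks.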

The last section runs on the 50 MHz clock, not the 80 MHz LVDS buffer read clock used for the above sections. It takes in commands from the USB interface block, possibly does some things, and then responds to the USB interface block by sending it data. The commands come in as 8-byte words. The first byte determines the type of command. For instance, if the first byte is 0 it sends back the data buffer, if it's 1 it sets channel and trigger types and sends back trigger info for the last event, if it's 2 it sends back the firmware version or other info, 3 is for an SPI command, etc. The other 7 bytes in the command can be used as parameters for the command, if needed.
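
The command format can be sketched in Python like this. Only the 8-byte length, the meaning of the first byte, and the four command types come from the description above. The function name, the dictionary, and the all-zero parameter bytes in the example are placeholders, since what each parameter byte means depends on the command and isn't covered here.

```python
COMMAND_LENGTH = 8  # commands arrive as 8-byte words

# Command types mentioned in the text. The descriptions are informal labels,
# not identifiers taken from the firmware.
COMMAND_TYPES = {
    0: "send back the data buffer",
    1: "set channel and trigger types, send back trigger info for the last event",
    2: "send back the firmware version or other info",
    3: "SPI command",
}


def decode_command(word):
    """Split an 8-byte command into (command type, description, 7 parameter bytes)."""
    if len(word) != COMMAND_LENGTH:
        raise ValueError(f"expected {COMMAND_LENGTH} bytes, got {len(word)}")
    command_type = word[0]
    params = bytes(word[1:])
    description = COMMAND_TYPES.get(command_type, "not covered in this sketch")
    return command_type, description, params


# Placeholder example: command type 2 with all-zero parameters
command_type, description, params = decode_command(bytes([2, 0, 0, 0, 0, 0, 0, 0]))
print(command_type, description, params.hex())
```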

Timing

Lastly, the firmware would be nothing without detailed timing constraints. They are in the file "coincidence.sdc". These tell the Quartus compiler what must happen by when. Quartus then attempts to "fit" a firmware which satisfies these constraints, by placing logic in the FPGA at locations such that routing delays are small enough so that things happen by when they need to.

First, the clocks are defined. Then clock groups are defined which tell Quartus which clocks are unrelated to each other. For instance, the clocks based on the 50 MHz crystal input and the 50 MHz external LVDS input clock are not related since only one set is active at any given time. The 60 MHz USB clock is also not related since it is used by a separate block only.

Next we ignore some routing paths which are known to be erroneously flagged by Quartus as warnings. It's important to let Quartus know these are not really problems so it doesn't ruin other routing trying to solve them.

Last, and maybe most importantly, we set the timing constraints for each external input and output. These say how early or late each IO pin can be compared to the others. LVDS inputs are of course critical to time accurately, as are the signals interfacing with the FT232H USB chip. SPI signals need to be less strongly constrained, etc. Only after properly defining these constraints does the resulting firmware behave as we like. And we can be pretty sure it will perform correctly by checking the Timing Report in Quartus, which tells us whether the constraints could be satisfied, and if not, where they failed and by how much.

Results

Quartus provides a nice summary of FPGA usage:

Total logic elements 11,564 / 28,848 ( 40 % )
Total registers 4107
Total pins 228 / 329 ( 69 % )
Total memory bits 592,000 / 608,256 ( 97 % )
Embedded Multiplier 9-bit elements 0 / 132 ( 0 % )
Total PLLs 1 / 4 ( 25 % )

We are only using about 40% of the available logic, though closing timing constraints gets harder once more of the FPGA is used, depending on what that logic needs to access. There are still about 100 IO pins free, so more IO could be added to the board design, though routing is starting to get tricky. We're using nearly all the FPGA memory for the data buffer (and a little for the USB FIFOs) - 1024*40 12-bit samples just fits! We're not using any of the DSP resources - they could be used for filtering? FFT in firmware? And finally, we have only used 1 of the 4 PLLs, so there are lots more clock resources that could be made use of. It's pretty amazing how much more could still be added to the design, given how much we're doing already. And this is only a medium-sized Cyclone IV from 2009! Imagine what the newly released Altera Agilex FPGAs will be capable of!
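
The "just fits" claim is easy to check against the Quartus numbers. Here is the arithmetic in Python, using only the buffer dimensions and memory totals quoted in this log (the variable names are mine).

```python
DEPTH_WORDS = 1024
SAMPLES_PER_WORD = 40        # 4 busses x 10 deserialized samples per tick
BITS_PER_SAMPLE = 12 + 2     # 12 data bits plus the clock and strobe bits

word_bits = SAMPLES_PER_WORD * BITS_PER_SAMPLE   # 560
buffer_bits = DEPTH_WORDS * word_bits            # 573,440

USED_BITS = 592_000          # "Total memory bits" used, from the Quartus summary
AVAILABLE_BITS = 608_256     # total available in this Cyclone IV

other_bits = USED_BITS - buffer_bits             # 18,560 left for everything else
print(f"buffer: {buffer_bits:,} bits = {buffer_bits / AVAILABLE_BITS:.1%} of the FPGA memory")
print(f"other users of memory: {other_bits:,} bits")
print(f"unused: {AVAILABLE_BITS - USED_BITS:,} bits")
```

So the data buffer alone takes about 94% of the memory, and doubling its depth to 2048 words would not fit.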
