
Overview of the firmware

A project log for Haasoscope Pro

A 2 GHz oscilloscope for everyone

haashaas 12/08/2024 at 18:21

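The firmware running on the Cyclone IV FPGA of the Haasoscope Pro is the central brain of the whole oscilloscope. Its key jobs, each covered below, are to receive the LVDS data from the ADC, buffer it and trigger on it, control the other devices on the board over SPI, and talk to the software over USB.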

The firmware is written and compiled using the Altera Quartus IDE. The "lite" version is free, and runs on Windows or Linux. To get started, just open the Haasoscope Pro firmware project file, "HaasoscopePro/adc board firmware/coincidence.qpf".

Here's an overview of the firmware structure, as seen in Quartus:

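These are the basic building blocks: the PLL, the LVDS inputs, the data buffer, the SPI controller and SPI input MUX, the FT232H USB interface, and the main processor.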

Let's now look at each block in a bit more detail.

PLL block

The PLL can take in either a 50 MHz clock from the board's 50 MHz crystal, or an external 50 MHz clock from LVDS (used for synchronizing multiple boards). Outputs tell the main processor block which of the two input clocks are available, which is actively being used, and whether the PLL is locked. An input, "clkswitch", is controlled by the main processor block and decides which clock input is active. 

There are 5 clocks output from the PLL:

There is an interface to the PLL, connected to the main processor block, for dynamically adjusting the phase of each output clock. This is used to adjust the phase of each of the clocks used by the LVDS inputs, so that all the LVDS inputs can be synced.

LVDS input blocks

We've mostly already discussed the LVDS blocks. There are four 12-bit LVDS input busses, so 48 LVDS pairs coming from the main ADC, which is 3200 MHz of 12-bit data. There's also a clock and a strobe LVDS input for each of the four 12-bit LVDS busses, so 4*12+4*2=56 LVDS inputs total, each running at 800 MHz. I still find it amazing that an FPGA like this can process that much data, 44.8 Gb/s, for ~$75, using ~1W!
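The numbers above are easy to double-check with a few lines of Python. This is only a back-of-the-envelope sketch: the constant names are made up for illustration, the 10x deserialization factor is the one described later in this log, and it assumes each bus delivers one 12-bit sample per 800 MHz bit time.

```python
# Quick sanity check of the LVDS figures quoted in the text.
# All names here are illustrative, not taken from the firmware.
BUSES = 4                 # four LVDS busses from the ADC
DATA_PAIRS_PER_BUS = 12   # 12-bit samples, one LVDS pair per bit
EXTRA_PAIRS_PER_BUS = 2   # one clock pair + one strobe pair
PAIR_RATE_MBPS = 800      # each LVDS pair runs at 800 Mb/s
DESERIALIZATION = 10      # 10x deserialization in the LVDS blocks

data_pairs = BUSES * DATA_PAIRS_PER_BUS                    # 48
total_pairs = data_pairs + BUSES * EXTRA_PAIRS_PER_BUS     # 56
total_gbps = total_pairs * PAIR_RATE_MBPS / 1000           # 44.8
sample_rate_mhz = BUSES * PAIR_RATE_MBPS                   # 3200 (assumes 1 sample per bus per bit time)
readout_clock_mhz = PAIR_RATE_MBPS / DESERIALIZATION       # 80
samples_per_tick = BUSES * DESERIALIZATION                 # 40

print(f"{total_pairs} LVDS pairs, {total_gbps} Gb/s in total")
print(f"{sample_rate_mhz} MS/s, {samples_per_tick} samples per {readout_clock_mhz:.0f} MHz tick")
```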

Data buffer block

The data buffer is an array of FPGA memory, 1024 words long and 560 bits wide, that stores the processed ADC data before it's read out. (560 bits = 12 bits + clock + strobe for each of the 4 LVDS busses from the ADC, times 10 samples per tick of the c1 deserialized LVDS readout clock.) There's no external RAM on the board: interfacing to it would require another huge number of LVDS connections in order to write the data to RAM fast enough to keep up with the ADC.

The main processor block (see below) writes its processed data into the data buffer, which is used as a circular buffer. A word is written every clock cycle at a write location that is then incremented, and the write location wraps back around to the beginning when it reaches the end of the buffer. When a trigger occurs, the main processor stops writing into the buffer after a given number of clock ticks. This leaves B words in the buffer from before the trigger and A words from after it, where A+B=1024 (each word holds one clock tick's worth of samples). The division between A and B is set by the "trigger position" command from the software.
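To make the circular buffer and the post-trigger countdown concrete, here is a small software model. It is only a sketch of the idea, not the firmware logic: the class and attribute names are hypothetical, and it counts the word in which the trigger occurred separately from the "after" words, which the real firmware may handle differently.

```python
class CircularCapture:
    """Toy model of a circular capture buffer with a post-trigger countdown."""

    def __init__(self, depth=1024, post_trigger_words=512):
        self.depth = depth
        self.post = post_trigger_words   # words to keep writing after the trigger word
        self.mem = [None] * depth
        self.write_addr = 0
        self.trigger_addr = None         # where the trigger word was written
        self.remaining = None            # None until a trigger has been seen
        self.frozen = False

    def clock(self, word, trigger=False):
        """One clock tick: write a word unless the buffer is frozen."""
        if self.frozen:
            return
        self.mem[self.write_addr] = word
        if self.remaining is None:
            if trigger:
                self.trigger_addr = self.write_addr
                self.remaining = self.post
        else:
            self.remaining -= 1
        # Increment the write location and wrap around at the end.
        self.write_addr = (self.write_addr + 1) % self.depth
        if self.remaining == 0:
            self.frozen = True           # stop writing and wait for readout

    def readout(self):
        """Return the stored words, oldest first."""
        return self.mem[self.write_addr:] + self.mem[:self.write_addr]


# Tiny example: 8-word buffer, keep 3 words after the trigger word.
cap = CircularCapture(depth=8, post_trigger_words=3)
for i in range(20):
    cap.clock(i, trigger=(i == 10))
print(cap.readout())  # [6, 7, 8, 9, 10, 11, 12, 13]
```

In this model the trigger word (10) ends up with four older words before it and three newer words after it, and everything written after the freeze is ignored. Changing `post_trigger_words` slides the trigger earlier or later in the captured window.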

SPI controller block and SPI input MUX

The SPI block listens to commands from the main processor block and creates SPI commands to send to target devices on the board. It's based on spi-master by Nandland. Then there's an input MUX to select which device we want to receive SPI data from. All higher-level SPI sequencing (what to send, what to read, etc.) is handled in the main processor block.

FT232H USB interface block

The USB interface block communicates between the main processor block and the FT232H chip that talks to the software over USB. It is based on FPGA-ftdi245fifo by WangXuan95. Bytes received from the software go into a FIFO in the USB interface block, which then can be read by the main processor block. Bytes sent from the main processor block go into another FIFO in the USB interface block, which then are sent out over USB to the software. The USB block has its own 60 MHz clock, generated by the FT232H chip from an external 12 MHz crystal on the board, to handle the physical layer USB communications.

Main processor block

The main processor block is where most of the real logic and processing occurs. The file is "command_processor.v". There are three main sections. 

The first section takes the input LVDS data from the main ADC and creates the "samples" that will be stored into the data buffer. These samples may need to be slower than those being provided by the ADC. For instance, the ADC is sampling at 1600 MHz on two channels, but we may want to see a longer length of time in our data buffer, i.e. we may want a longer time base, say 1 ms / division. To do that, we need to ignore almost all of the incoming samples, but just keep 1 out of every N samples... enough to fill the data buffer after 10 ms (assuming 10 divisions). This is called downsampling. It gets a little complicated - actually super annoyingly complicated (I banged my head on the wall for many hours to get it right!) - because the samples are coming in 40 at a time in each clock tick, thanks to the LVDS 10x deserialization and the 4x LVDS busses which are interleaved (or 2x if we are in two-channel mode). 
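The core of the difficulty is that "keep 1 out of every N" has to work across clock ticks, since N usually doesn't divide evenly into the 40 samples of a tick. Here is a minimal Python sketch of that bookkeeping, under some assumptions: the function name is made up, it does plain decimation only, and it ignores the channel interleaving and the repacking of kept samples into full buffer words that the real firmware has to deal with.

```python
def downsample_stream(ticks, n):
    """Keep every n-th sample from a stream that arrives in chunks.

    `ticks` is an iterable of lists (one list of samples per clock tick).
    `skip` is carried from one tick to the next so the spacing between
    kept samples stays exactly n, even when n does not divide the chunk size.
    """
    skip = 0  # how many samples to skip at the start of the next chunk
    for chunk in ticks:
        kept = []
        i = skip
        while i < len(chunk):
            kept.append(chunk[i])
            i += n
        skip = i - len(chunk)  # leftover distance spills into the next tick
        yield kept


# Example: 400 samples arriving 40 per tick, keeping 1 out of every 64.
samples = list(range(400))
ticks = [samples[i:i + 40] for i in range(0, len(samples), 40)]
kept = [s for chunk in downsample_stream(ticks, 64) for s in chunk]
print(kept)  # [0, 64, 128, 192, 256, 320, 384]
```

Note that some ticks contribute no samples at all (when N is larger than 40) while others contribute several (when N is small), which is part of what makes the hardware version awkward.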

The next section handles triggering. We have rising edge and falling edge triggers, as well as external triggers coming from another board or external input. More triggers may be added of course. When a trigger occurs, we have to record which of the 40 samples of the clock tick actually crossed the trigger threshold, and also at which position we were writing into the data buffer. We'll need to know these for the data readout later and for correctly positioning the data on the screen in the software. Once a trigger has occurred and the requested number of post-trigger clock ticks has been written, we freeze the writing into the data buffer and wait for readout.
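As an illustration of "which of the 40 samples crossed the threshold", here is a simple rising-edge search in Python. This is a sequential sketch under assumptions, not the firmware: the function name is hypothetical, the firmware would presumably do these comparisons in parallel within one clock tick, and any extra trigger conditions it applies are not modeled.

```python
def find_rising_edge(prev_last, chunk, threshold):
    """Return the index in `chunk` of the first upward threshold crossing.

    `prev_last` is the last sample of the previous clock tick (or None if
    there is none), so that a crossing right at the start of the chunk is
    not missed. Returns None if no sample in the chunk crosses upward.
    """
    prev = prev_last
    for i, sample in enumerate(chunk):
        if prev is not None and prev < threshold <= sample:
            return i
        prev = sample
    return None


print(find_rising_edge(0, [1, 2, 5, 9], threshold=4))   # 2
print(find_rising_edge(3, [7, 8], threshold=5))         # 0 (crossed between ticks)
print(find_rising_edge(9, [9, 8, 7], threshold=5))      # None (never below threshold)
```

A falling-edge trigger is the same search with the comparison reversed. The returned index, together with the buffer write position at that tick, is the kind of information the software would need to line the waveform up on screen.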

The last section runs on the 50 MHz clock, not the 80 MHz LVDS buffer read clock used for the above sections. It takes in commands from the USB interface block, possibly does some things, and then responds to the USB interface block by sending it data. The commands come in as 8-byte words. The first byte determines the type of command. For instance, if the first byte is 0 it sends back the data buffer, if it's 1 it sets channel and trigger types and sends back trigger info for the last event, if it's 2 it sends back the firmware version or other info, 3 is for an SPI command, etc. The other 7 bytes in the command can be used as parameters for the command, if needed.
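Seen from the software side, the command format could be sketched like this in Python. The meanings of opcodes 0 to 3 are the ones listed above; the names in the table and the function itself are made up for illustration, and how the 7 parameter bytes are interpreted depends on the command and is not modeled here.

```python
# Opcode meanings follow the description in the text; the string names
# are hypothetical labels chosen for this sketch.
OPCODES = {
    0: "send_data_buffer",
    1: "set_channel_and_trigger",
    2: "firmware_version_or_info",
    3: "spi_command",
}


def parse_command(word: bytes):
    """Split an 8-byte command into (command name, 7 parameter bytes)."""
    if len(word) != 8:
        raise ValueError("commands are 8 bytes long")
    opcode, params = word[0], word[1:]
    return OPCODES.get(opcode, "unknown"), params


name, params = parse_command(bytes([3, 1, 2, 3, 4, 5, 6, 7]))
print(name, list(params))  # spi_command [1, 2, 3, 4, 5, 6, 7]
```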

Timing

Lastly, the firmware would be nothing without detailed timing constraints. They are in the file "coincidence.sdc". These tell the Quartus compiler what must happen by when. Quartus then attempts to "fit" a firmware which satisfies these constraints, by placing logic in the FPGA at locations such that routing delays are small enough so that things happen by when they need to. 

First, the clocks are defined. Then clock groups are defined which tell Quartus which clocks are unrelated to each other. For instance, the clocks based on the 50 MHz crystal input and the 50 MHz external LVDS input clock are not related since only one set is active at any given time. The 60 MHz USB clock is also not related since it is used by a separate block only. 

Next we ignore some routing paths which are known to be erroneously flagged by Quartus as warnings. It's important to let Quartus know these are not really problems so it doesn't ruin other routing trying to solve them.

Last, and maybe most importantly, we set the timing constraints for each external input and output. These say how early or late each IO pin can be compared to the others. LVDS inputs are of course critical to time accurately, as are the signals interfacing with the FT232H USB chip. SPI signals can be less tightly constrained, etc. Only after properly defining these constraints does the resulting firmware behave as we'd like. And we can be pretty sure it will perform correctly by checking the Timing Report in Quartus, which tells us whether the constraints could be satisfied, and if not, where they failed and by how much.

Results

Quartus provides a nice summary of FPGA usage:

Total logic elements 11,564 / 28,848 ( 40 % )
Total registers 4107
Total pins 228 / 329 ( 69 % )
Total memory bits 592,000 / 608,256 ( 97 % )
Embedded Multiplier 9-bit elements 0 / 132 ( 0 % )
Total PLLs 1 / 4 ( 25 % )
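As a quick cross-check, the memory figure can be compared against the data buffer dimensions given earlier (1024 words of 560 bits). This is just arithmetic on numbers from this log; the variable names are made up, and attributing the leftover bits to the USB FIFOs and other small memories is an assumption.

```python
# Cross-check of the Quartus memory summary against the data buffer size.
BUFFER_WORDS = 1024
WORD_BITS = 560               # (12 data + clock + strobe) * 4 busses * 10
SAMPLES_PER_WORD = 40
SAMPLE_BITS = 12
MEMORY_BITS_USED = 592_000    # from the Quartus summary
MEMORY_BITS_TOTAL = 608_256   # from the Quartus summary

buffer_bits = BUFFER_WORDS * WORD_BITS                               # 573,440
pure_sample_bits = BUFFER_WORDS * SAMPLES_PER_WORD * SAMPLE_BITS     # 491,520
other_bits = MEMORY_BITS_USED - buffer_bits                          # 18,560 (FIFOs etc., presumably)

print(f"data buffer: {buffer_bits:,} bits "
      f"({100 * buffer_bits / MEMORY_BITS_TOTAL:.1f}% of FPGA memory)")
print(f"of which pure 12-bit samples: {pure_sample_bits:,} bits")
print(f"other memory in use: {other_bits:,} bits")
print(f"total used: {100 * MEMORY_BITS_USED / MEMORY_BITS_TOTAL:.1f}%")
```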

We are only using about 40% of the available logic, though closing timing gets harder once more of the FPGA is used, depending on what that logic needs to access. There are still over 100 IO pins free, so more IO could be added to the board design, though routing is starting to get tricky. We're using nearly all the FPGA memory for the data buffer (and a little for the USB FIFOs) - 1024*40 12-bit samples just fits! We're not using any of the DSP resources - they could be used for filtering? FFT in firmware? And finally, we do have lots more clock resources that could be made use of. It's pretty amazing how much more could still be added to the design, given how much we're doing already. And this is only a medium-sized Cyclone IV from 2009! Imagine what the newly released Altera Agilex FPGAs will be capable of!
