Clunky McCluster

I know that Pi clusters have been done, redone and overdone, so there is nothing new under the sun. YouTube is already full of videos that show projects with 2 to 100s of RPis doing various stuff in different ways. But for the #PEAC Pisano with End-Around Carry algorithm project, I need a big cluster. Like, 10K cores or so. I'm not sure I can get them all with Pis (ha ha ha in advance) but I can slowly build up something that could eventually get me there, one day. It's a re-purposable project following an evolutionary path that can spin off a few cool and eventually marketable designs.

Now, consider that a basic ping to a RPi takes almost 1ms, while toggling a GPIO pin takes about one microsecond. For trivially parallel algorithms this does not matter much, but not everything is that simple in practice...

The SPI interface can move data with low latency at almost 30Mbps and the SMI port has even more power (that I still must learn to harness).
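
To put these figures side by side, here is a back-of-the-envelope sketch in Python. It only does arithmetic on the numbers quoted above (1ms ping, 1µs GPIO toggle, 30Mbps SPI); the function names are made up for illustration and the result ignores all protocol and driver overhead, so take it as an upper bound and not as a measurement.

```python
# Figures quoted in the text, expressed as integers to avoid rounding surprises.
PING_LATENCY_US = 1000        # "almost 1ms" for a basic ping
GPIO_TOGGLE_US = 1            # "about one microsecond" per pin toggle
SPI_RATE_BPS = 30_000_000     # "almost 30Mbps"

def spi_transfer_time_us(payload_bytes, rate_bps=SPI_RATE_BPS):
    """Raw wire time in microseconds for a payload (no overhead counted)."""
    return payload_bytes * 8 * 1_000_000 / rate_bps

def bytes_per_ping(rate_bps=SPI_RATE_BPS, ping_us=PING_LATENCY_US):
    """How many bytes SPI could move during one ping latency."""
    return rate_bps * ping_us // (8 * 1_000_000)

if __name__ == "__main__":
    print("GPIO toggles per ping:", PING_LATENCY_US // GPIO_TOGGLE_US)
    print("SPI bytes per ping   :", bytes_per_ping())
    print("64-byte message (us) :", round(spi_transfer_time_us(64), 1))
```

In other words, a short message could in principle cross the SPI link long before an Ethernet ping has even returned, which is the whole point of the custom interconnect.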

How could that help with a cluster? It can reduce latency both physically, as shown above, and through a leaner program, because dealing with IO pins takes a few lines of inline C at most: no need to call external libraries or the OS. The synchronisation primitives are then implemented in external HW. Many topologies and organisations are possible, and any single one that I hard-wire will fail to fit some particular problem/project, so I simply put an FPGA there.

The intended part is the affordable ProASIC3 A3P250 in PQ208 package, which provides 150 IO pins and 8 internal reconfigurable FIFOs/memory blocks that can then be used as mailboxes, transfer queues, glue logic... The PQ208 footprint can accommodate higher-density parts (up to A3P1000, and A3PE3000 if you are rich) if more memory, processing, logic etc. are needed. Or use another FPGA brand/family if you prefer.

4 RPis is a good, square quantity, first and foremost because cheap Ethernet hubs have 5 ports, and you need the 5th port to connect the cluster to something else. This also allocates 2 FIFOs per Pi in the FPGA, which must keep some pins free for extensions (like connecting to other FPGAs in a token ring, maybe).

Maybe I'll find a way to boot, load and configure the program through the interco system, as well as control/sequence the power up/down. It's not a KVM but it gets closer to the comfort that you'll find in half-decent Beowulf clusters (I learned a few things while working for a cluster company around 2002).

Maybe a simplified, cheaper, smaller version for the Pi Zero or Compute Module could be derived from this later but for now I want the comfort of Ethernet. The 40-pin GPIO header is the focus of this project.

-o-O-0-O-o-

Logs:
1. Getting started...
2. FPGA selection
3. Bare metal
4. Where am I ?
5. Inventory
.
.

Project Logs

Inventory
Yann Guidon / YGDES • 07/29/2021 at 00:38 • 0 comments
Some old stuff, some new stuff...
I received more stuff and here is the situation:
- Raspberry Pi:
  - 4× 3B+ (making one high performance quad)
  - 2× 3B
  - 2× 2B (together with the 3B, that makes another quad with slightly less performance)
  - 8× 1B+ (that's 2 quads with low performance)
- Hubs :
  - 4× 5 ports
  - 4× 8 ports
- RJ45: I have found 4× very short patch cables. I need more but don't want to order them, so I crimp them myself, cut to length: I received a bag of RJ45 plugs and I can finally use the old broken cables that have accumulated over all these years...
- SD cards : Found several from old projects, and bought 12 during the sales: 6×8GB and 6×16GB. These should be enough for 3 quads. Usually I keep my images limited to 4GB; more is possible but I prefer when the storage is managed properly and centralised. Ideally the SD cards are mounted read-only to prevent wear-out.
- To ease SD card duplication, I ordered a few USB adapters. It seems that my internal port has gone kaputt anyway...
- Temperature management : By default I should stick a heat spreader on every Pi's CPU, just to keep the temperature reasonable and prevent the automatic clock throttling from kicking in. An extra undervolted 12V fan will increase airflow without being too noisy.
So far it seems I can shoot for a cluster of 4 quads, or 16 Pis, or 64 cores if I slowly replace the old single-core boards with newer ones. I hope that the overall throughput will help me with my current projects... The nice thing is that it is scalable and features can be added progressively. Another order of magnitude can then be gained once the GPU is harnessed. For now I need brute-force trivial jobs and it could evolve later, for example with distributed memory, so short and fast messages will become critical.
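
As a quick sanity check of those numbers, here is a small Python tally of the inventory listed above. The board counts come from the list; the per-model core counts are the usual ones for these boards, and the helper names are purely illustrative.

```python
# Boards on hand, as listed in the inventory above.
INVENTORY = {"3B+": 4, "3B": 2, "2B": 2, "1B+": 8}
# CPU cores per model (3B+, 3B and 2B are quad-core, 1B+ is single-core).
CORES_PER_BOARD = {"3B+": 4, "3B": 4, "2B": 4, "1B+": 1}
PIS_PER_QUAD = 4

def total_boards(inventory):
    return sum(inventory.values())

def total_cores(inventory):
    return sum(n * CORES_PER_BOARD[model] for model, n in inventory.items())

def cores_after_upgrade(inventory, cores_per_new_board=4):
    """Core count once every board is replaced by a quad-core one."""
    return total_boards(inventory) * cores_per_new_board

if __name__ == "__main__":
    print("boards:", total_boards(INVENTORY))
    print("quads :", total_boards(INVENTORY) // PIS_PER_QUAD)
    print("cores today        :", total_cores(INVENTORY))
    print("cores after upgrade:", cores_after_upgrade(INVENTORY))
```

So the 64 cores are only reached once the eight single-core 1B+ boards are swapped out; until then the same 16 boards give 40 cores.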
Where am I ?
Yann Guidon / YGDES • 07/29/2021 at 00:02 • 7 comments

One of the features that this project can provide (if implemented) is helping each node find its static address.
Usually one would either use DHCP or static IPs for the Ethernet port. DHCP makes it tedious to know who is who and requires a master node to handle probing the network, while static IPs require manual management, at least modifying each config file during the SD card duplication. Tedious.
The FPGA provides 4 ports and is intended to communicate with other quads, so each FPGA has its own address, which can be set up with jumpers/DIL switches/hex encoding wheels/whatever. And each Pi port is static so it can have its own fixed address.
Let's say there are 4 Pi ports, so 2 hardwired bits, and the FPGA can fetch its own address from a few GPIO pins (to a 74HC165 or 4017 for example, or even charlieplexing? SPI requires 4 wires, I²C only two but is more complex). This port address can then be read/fetched from the respective Pi which will then execute the program related to its address. Let's be reasonable and say we have 6 bits for the quad address, that makes a byte for each Pi which can then configure its own static IP address. That would be the best of both worlds :-)
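
Here is a minimal Python sketch of that idea: 6 bits of quad address plus 2 bits of port number make one byte, which each Pi could turn into its static IP. The bit order (quad address in the upper bits), the `10.0.1.x` base address and the function names are my assumptions for illustration only; nothing is decided yet, and the actual read of the byte from the FPGA is not shown.

```python
import ipaddress

QUAD_BITS = 6   # quad address, e.g. set with DIL switches on the FPGA board
PORT_BITS = 2   # hardwired Pi port number within the quad (0..3)

# Assumed base address; with a /16 netmask all 256 node ids are valid hosts.
BASE_ADDRESS = ipaddress.IPv4Address("10.0.1.0")

def node_id(quad_addr, port):
    """Pack quad address and port into one byte (quad in the upper bits)."""
    if not 0 <= quad_addr < (1 << QUAD_BITS):
        raise ValueError("quad address out of range")
    if not 0 <= port < (1 << PORT_BITS):
        raise ValueError("port out of range")
    return (quad_addr << PORT_BITS) | port

def split_node_id(byte):
    """Inverse of node_id(): recover (quad_addr, port) from the byte."""
    return byte >> PORT_BITS, byte & ((1 << PORT_BITS) - 1)

def static_ip(quad_addr, port):
    """Derive the node's static IP from its hardware address."""
    return BASE_ADDRESS + node_id(quad_addr, port)

if __name__ == "__main__":
    print(static_ip(5, 2))
```

With this layout the same SD card image can be duplicated for every node: the boot script would only need to read its byte and configure the interface accordingly.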
Another aspect is that, if several quads need to be connected, each quad must know its own address to send and receive packets. No routing is possible without addresses.
Bare metal
Yann Guidon / YGDES • 07/27/2021 at 21:37 • 0 comments

For now I focus on the Pi3B+ to get quick results with the least amount of effort and without breaking the already sad-looking bank. Pi4s draw too much power and heat too much but could be used, as well as many other compatible devices. In fact the key is simply the form factor and in particular the 40-pin header. So "model A"s could be used as well, but I stick to the Bs because they have an Ethernet port that makes management very practical.

Now if I use As, the management will go through the WiFi link, which saves cables and a physical hub, but requires a dedicated WiFi manager: things have only been moved around... But going further, Ethernet and WiFi will not be a great communication medium. The FPGA is already there to help the applications interconnect, so what about... exposing the link to the Linux system more broadly? Write a block or character device driver?

Going further, Linux will have little to do beyond managing the hardware health, configuring stuff, allocating memory and storage... No need for a full-blown distribution, right? So I'm looking at "bare metal" systems such that the SBC can boot fast, avoid wearing out the µSD card, ... I can start with a basic OS image and remove tons of things, then add a custom application for exposing the interco. Unfortunately, because of systemd, long gone is the time when you could simply add "init=/bin/bash" to the kernel's command line :-(

I even wonder if/how I could let the whole system boot off the interco, thus saving the µSD card altogether. I know people have already made their own bootware, such as Tristan for the Pi3, but this normally goes on flash storage.

But for now and during development, I stick to the standard OS...
FPGA selection
Yann Guidon / YGDES • 07/27/2021 at 21:12 • 0 comments
For this project, I enlist the ProASIC3 family because
1. I know it very well,
2. It's suited for the task (no need for ultra-high-performance features, the speed is right),
3. The price is OK (look them up on eBay: A3P250 is around $15),
4. The PQ208 package is reasonably easy to solder at home (so the end user could swap parts or hack it further),
5. The SW is free (as in free beer), works on Linux and Windows, not as crazy to install as others I've tried, and offers a choice of locks for the licence (it's not crazy constraining). Just make sure you get a compatible FlashPro JTAG probe, or a suitable equivalent.
6. I have stock.
Now let's look at the product table from the official site:

A3P125 is the smallest in QFP208 and is capable of minimal functions, though one detail matters. Not only are just 133 GPIO available, but there are only 2 I/O banks, read: only 2 independent voltage zones. The others have 4 banks whose voltages can vary independently, meaning: you can hotswap. Wouldn't it be nice if you could shut down, remove or add a node while the cluster is operating?

So the "minimum specification" for my potato cluster (youtube reference) is the A3P250PQG208. It has 3072 LUT3 gates (serving either as logic or DFF) which is enough for normal interco management, and 8 small dual-port SRAM blocks that are easily configured as FIFO (when enabling the dedicated circuit). I have pushed that type of chip easily above 60MHz with real designs and synthesis around 100MHz is possible with some care. This is the range of frequency where the Pi's GPIO pins can operate rather reliably so it's a great match.

The file A3P_QFP208_pinout.txt shows the pinout differences between various chip densities so a single PCB can accommodate most of them. If I can go far enough, the pin layout files will be public of course.
From there on, if the A3P250 is too tight for you, you can look at the A3P400 (50% more resources) and the A3P600 (24 SRAM blocks and 7K LUT3) for when your routing protocols get crazy and you need more buffering (that provides a depth of maybe 4 or 5 FIFOs or 2KB per Pi, which is getting overkill unless you have a lousy protocol).
The top of the line is the A3P1000 with its 32 SRAM blocks and 11K tiles. I don't know what you'd want to do with that, unless you want to integrate a softcore CPU and/or more sophisticated interfaces instead of a basic message-passing link between quads. At least you have the choice.
Beyond that you'll find the A3PE1500 and A3PE3000. They're just massive and expensive. I doubt anybody would use that so I don't check the pinout compatibility. The A3PE1500 however has 6 independent GPIO zones so that could add another benefit (better hotplug support), but at a high cost.
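
To compare the buffering each part leaves per Pi, here is a rough Python sketch. The SRAM block counts are the ones quoted above; the 512 usable bytes per block is my assumption (a 4608-bit block configured as 512×9 with 8 data bits), and `reserved_blocks` is a hypothetical knob for blocks kept aside for the inter-quad links.

```python
# SRAM block counts quoted in this log.
SRAM_BLOCKS = {"A3P250": 8, "A3P600": 24, "A3P1000": 32}
PIS_PER_QUAD = 4
# Assumption: each block is used as a 512-deep, byte-wide FIFO.
BYTES_PER_BLOCK = 512

def fifo_budget(part, reserved_blocks=0):
    """Return (blocks per Pi, buffer bytes per Pi) for a given part."""
    available = SRAM_BLOCKS[part] - reserved_blocks
    if available < 0:
        raise ValueError("more blocks reserved than the part provides")
    per_pi = available // PIS_PER_QUAD
    return per_pi, per_pi * BYTES_PER_BLOCK

if __name__ == "__main__":
    for part in SRAM_BLOCKS:
        print(part, fifo_budget(part))
    # Keeping 8 blocks of an A3P600 for the quad-to-quad links:
    print("A3P600 with 8 reserved:", fifo_budget("A3P600", reserved_blocks=8))
```

The A3P250 case gives the 2 FIFOs per Pi of the baseline design, and the A3P600 with a few blocks set aside lands on the "4 or 5 FIFOs or 2KB per Pi" ballpark.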
Getting started...
Yann Guidon / YGDES • 07/22/2021 at 01:41 • 4 comments
This project starts because I need to crunch a lot of numbers in parallel. One of the available methods is to reuse, and then upgrade, my collection of Raspberry Pi left from past projects.

OTOH this particular cluster project relies on the 40-pin GPIO connector which appeared a while back, at the end of the 1st generation. Luckily, my inventory contains 9pc RPi B+ v1.2 from 2014 (aww, 7 years old now...) and this is enough to get started !

Performance-wise, we'd need 8 boards of this single-core ARM at 700MHz to reach the throughput of one Pi3B+, which is quad-core and clocked at twice the speed, or half as fast as my i7 laptop. A cluster of 2 quads with the old boards is then a mock-up, a demonstrator and a prototype, where I will later replace/upgrade the boards with faster versions. The old v1B+ serve to test and weather the bugs and shorts before the more expensive, faster boards enter duty. At this moment, I wonder if the Pi3A+ would do the trick: still fast but cheaper, smaller, and Ethernet can then be replaced by onboard WiFi.
I also have a pair of Pi 2B (quad 900MHz), and some Pi 3B+ (quad 1400MHz) should arrive soon. With some basic thermal management measures, I'll try to overclock them a bit.
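
The "8 boards" figure is just cores × clock, which can be sketched in a few lines of Python. The core counts and clocks are the ones given above; this naive metric deliberately ignores architecture, cache and memory differences between generations, so the real ratio may well differ.

```python
# (cores, clock in MHz) for the boards mentioned in this log.
BOARDS = {
    "1B+": (1, 700),
    "2B": (4, 900),
    "3B+": (4, 1400),
}

def core_mhz(model):
    """Naive aggregate throughput: number of cores times clock frequency."""
    cores, mhz = BOARDS[model]
    return cores * mhz

def equivalent_boards(target, base):
    """How many `base` boards match one `target` board on this naive metric."""
    return core_mhz(target) / core_mhz(base)

if __name__ == "__main__":
    print("1B+ per 3B+:", equivalent_boards("3B+", "1B+"))
    print("1B+ per 2B :", round(equivalent_boards("2B", "1B+"), 2))
```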

With all those boards, several quad-clusters can be implemented so I can work on interconnecting the quads. With the planned upgrades, making the cluster heterogeneous, I must not only consider many independent clock domains, but also speeds...

The inventory also covers the necessary accessories :
- Ethernet : at least 3 hubs with 5 ports, more might hide here or there. I'll need many short patch cables as well, not sure how many I have left.
- microSD cards : it's sales season so I'm looking around, for 4 or 8GB ones. I found 6 so far.
- Power supplies : moot :-) Power comes through the GPIO port.
- Female sockets : I found the appropriate 2×20 right angled female header and must wait a few weeks for delivery.
- The rest : 5V sources, A3PxxxPQ208, fans and bare proto PCB are in stock. They form the core of the project, around which I add features...
When the proto PCB is validated, I can then open EAGLE to lay out the pre-series.


A3P-Q208-pinouts.zip : Compilation of the manufacturer's pinouts for the A3P and A3PE families in quad 208 pins package. (Zip Archive, 132.83 kB, 07/27/2021 at 22:23)
A3P_QFP208_pinout.txt : Comparative pinout table for A3P125, A3P250, A3P400, A3P600 and A3P1000 in PQFP208 package. (plain text, 4.23 kB, 07/27/2021 at 21:15)
