-
2x2 MaxPool in ESP32 S3 Assembly
05/30/2024 at 23:10

    xor a8, a8, a8
    xor a9, a9, a9
    movi a7, {image_height}
    movi a6, {image_width}
    slli a12, a6, 4
    or a13, a12, a12
    addi a13, a13, -16
max_col:
    movi a6, {image_width}
    or a10, a8, a8
    add a8, a12, a8
    or a11, a9, a9
    addx2 a9, a12, a9
max_block:
    ee.vld.128.ip q0, a11, 16
    ee.vld.128.xp q1, a11, a13
    ee.vmax.s8.ld.incp q2, a11, q5, q0, q1
    ee.vmax.s8.ld.incp q3, a11, q6, q5, q2
    sub a11, a11, a12
    ee.vmax.s8 q7, q6, q3
    st.qr q7, a10, 0
    addi a10, a10, 16
    addi a6, a6, -2
    bnez a6, max_block
end_max_block:
    addi a7, a7, -2
    bnez a7, max_col
end_max_col:
I managed to get my ESP32 S3 Emulator to a level where it can run a lot of the SIMD instructions. I'm implementing functionality as I need it, adding instructions only when I actually use them in the assembly I'm writing. This keeps the process manageable, because writing the emulator is mind-numbing, carpal-tunnel-inducing torture as it is...
Using HWC format with a number of channels that is a multiple of 16 helps a lot with alignment. With 16-channel data, each pixel is exactly one EE.VLD.128.IP instruction. The max pool uses these loads together with EE.VMAX.S8.LD.INCP, which computes the element-wise maximum of two vectors of 16 signed 8-bit integers while loading the next data at the same time.
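As a sanity check, and as something to run the emulator against, I like having a plain Python reference of what the routine should compute. Here is a rough NumPy sketch of the same 2x2 max pool over an int8 HWC array; the names, shapes and the helper itself are just an illustration, not the real firmware code:

import numpy as np

def maxpool2x2_hwc(img: np.ndarray) -> np.ndarray:
    # img: (H, W, C) int8, with H and W even and C a multiple of 16.
    h, w, c = img.shape
    out = np.empty((h // 2, w // 2, c), dtype=np.int8)
    for y in range(0, h, 2):
        for x in range(0, w, 2):
            # With 16 channels, every pixel below is one 128-bit load on the S3,
            # and each 2x2 block is four loads reduced by three ee.vmax operations.
            block = img[y:y + 2, x:x + 2, :].reshape(4, c)
            out[y // 2, x // 2, :] = block.max(axis=0)
    return out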
-
Biting the bullet...
05/24/2024 at 23:01

Writing assembly for the ESP32 S3 is fun, don't get me wrong. But it is also very frustrating when you have to debug it: rebuild, wait for the upload, monitor, crash, repeat. I know you can do emulation and gdb, but to be fair, I haven't ever used gdb before and I don't feel like learning to use it at this time.
At the same time, I'm still learning all the ins and outs of the instructions available on the Xtensa CPU. This basically boils down to reading and rereading the ISA over and over until it sticks.
I'm all for getting things done fast, with the least amount of resources spent, because I'm a prototyper making proof-of-concepts. But you do have to be smart about it. I noticed that as my program grows in complexity, so does the time spent debugging the assembly, and with it the overhead of building and flashing.
So I decided to spend a considerable amount of time writing an ESP32 S3 emulator in Python. It must be able to read some assembly, execute it, and show me the state of the registers. I can then import it in a notebook and basically have an interactive session with the emulator. Once the assembly does what it is supposed to do, the emulator could output it for the real assembler, or maybe even assemble it itself.
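Roughly the shape I have in mind, as a bare-bones sketch (a simplified register file, only a couple of instructions, no branching; the real thing grows as I need it):

class Esp32s3Emulator:
    def __init__(self):
        self.regs = {f"a{i}": 0 for i in range(16)}           # AR register file
        self.qregs = {f"q{i}": bytes(16) for i in range(8)}   # 128-bit SIMD registers

    def run(self, source: str):
        for line in source.splitlines():
            line = line.split("#")[0].strip()       # strip comments and whitespace
            if not line or line.endswith(":"):
                continue                            # labels are ignored in this sketch
            self.execute(line)

    def execute(self, line: str):
        op, *operands = line.replace(",", " ").split()
        # e.g. "ee.vld.128.ip" dispatches to op_ee_vld_128_ip; anything I haven't
        # implemented yet simply raises, which tells me what to read up on next.
        getattr(self, "op_" + op.replace(".", "_"))(*operands)

    def op_movi(self, rd, imm):
        self.regs[rd] = int(imm)

    def op_addi(self, rd, rs, imm):
        self.regs[rd] = self.regs[rs] + int(imm)

Something like emu = Esp32s3Emulator(); emu.run("movi a7, 240\naddi a7, a7, -2") then leaves 238 in emu.regs["a7"], which is exactly the kind of poking around I want to do from a notebook.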
This will give me 2 things:
- I will learn the details of every instruction in the ISA and get a very detailed overview of the CPU's inner workings.
- I can develop and debug much quicker
My prediction is that the time spent writing this emulator will be easy to earn back, considering the amount of assembly code I want to write for this and subsequent projects.
The code will be available here. Keep in mind that its purpose is to serve the 2 goals stated above, not to be a complete or perfect emulator.
-
4x4 Convolutions
05/17/2024 at 00:58

We're so used to using 3x3 convolutions we don't often think about switching it up, and why would we? The 3x3 convolution is very efficient, and 2 of them back to back with a nonlinearity between them usually outperform an equivalent 5x5 convolution. So why would you use 4x4 convolutions instead?
Technically, a 4x4 kernel can be constructed by padding a 3x3 kernel with zeros, which means it can serve as a drop-in replacement. When you look at the SIMD instruction set available on the ESP32 S3 you quickly see that you are best off working with 16 bytes at a time, and that you avoid headaches by keeping your data accesses aligned to 16-byte boundaries. So for the first layers in my network I replaced the 3-to-16-channel 3x3 convolution with a 4-channel (RGB + pad) 4x4 one, and retrained the network. The bigger convolution requires a bit more resources during training and improves the accuracy of the network slightly. But now I can load the kernel weights for a single output channel and multiply-accumulate them with a block of the source image in only 10 instructions (using NHWC format):
    ee.ld.accx.ip %[bias],0
    ee.vld.128.ip q0,%[in],128
    ee.vld.128.ip q4,%[weight],16
    ee.vld.128.ip q1,%[in],128
    ee.vmulas.s8.accx.ld.ip q5,%[weight],16,q0,q4
    ee.vld.128.ip q2,%[in],128
    ee.vmulas.s8.accx.ld.ip q6,%[weight],16,q1,q5
    ee.vld.128.ip q3,%[in],128
    ee.vmulas.s8.accx.ld.ip q7,%[weight],16,q2,q6
    ee.vmulas.s8.accx q3,q7
Imagine all the shifting and masking needed in order to make this a 3x3x3 convolution.
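For reference, in scalar terms those 10 instructions compute nothing more exotic than this (a NumPy sketch for one output pixel of one output channel; all names are placeholders of mine):

import numpy as np

def conv4x4x4_single_output(in_block: np.ndarray, weight: np.ndarray, bias: int) -> int:
    # in_block: (4, 4, 4) int8 NHWC slice of the image; weight: (4, 4, 4) int8 kernel.
    # EE.LD.ACCX.IP preloads the accumulator from the bias pointer; each
    # EE.VMULAS.S8.ACCX variant then adds 16 int8*int8 products to it.
    acc = int(bias)
    acc += int(np.sum(in_block.astype(np.int32) * weight.astype(np.int32)))
    return acc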
Once you reach a point where the number of channels is a multiple of 16 you're out of the woods, as long as you're using NHWC :)
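To back up the drop-in-replacement claim from the start of this post: zero-padding a 3x3 kernel to 4x4 gives identical outputs wherever the receptive fields line up, which is easy to check in PyTorch (shapes here are only an example):

import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 32, 32)   # 4 input channels (RGB + pad)
x[:, 3] = 0                     # the pad channel carries no information
w3 = torch.randn(16, 4, 3, 3)   # a "trained" 3x3 kernel
w4 = F.pad(w3, (0, 1, 0, 1))    # pad right/bottom with zeros to get 4x4
y3 = F.conv2d(x, w3)            # (1, 16, 30, 30)
y4 = F.conv2d(x, w4)            # (1, 16, 29, 29)
# The padded taps multiply zeros, so the overlapping region is identical.
assert torch.allclose(y3[..., :29, :29], y4, atol=1e-5)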
-
Quantization: PyTorch vs ESP32 S3
05/13/2024 at 19:13

I'm working on a custom model, and I'm using PyTorch to train it. Most of the layers are custom so I can't just export to some standard format and hope for the best. I'm going to duplicate the layers' logic in C on the ESP32, then use PyTorch to quantize my model weights.
I would like to try the ESP-DL library from Espressif, but unfortunately they use a different quantization scheme than PyTorch and claim you can't use your own model with their API. This is not entirely true: there is just no easy way to use your model with their quantization scheme, but you certainly can make it work.
The key thing to understand is how both quantization schemes work. PyTorch uses a zero-point and a scale:
f32 = (i8 - zero_point) * scale
while ESP-DL uses an exponent:
f32 = i8 * (2 ** exponent)
which they claim is not compatible.
We can make this work though, if we force PyTorch to use a zero-point with value 0 and a scale that is always 2 to the power of a (signed) integer. For example, a scale of 0.0078125 is exactly 2 ** -7, so the corresponding ESP-DL exponent would be -7.
Getting a zero-point of 0 is easy: we just have to set the qconfig to use a symmetric quantization scheme. The scale is a little bit harder, but no rocket science either: we can subclass a suitable observer so that it produces qparams with a scale rounded to
scale = 2 ** round( log2( scale ))
Like so:
import torch
import torch.ao.quantization as Q

class ESP32MovingAverageMinMaxObserver(Q.MovingAverageMinMaxObserver):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def _calculate_qparams(self, min_val: torch.Tensor, max_val: torch.Tensor):
        # Let the stock observer compute the symmetric qparams first,
        # then snap the scale to the nearest power of two.
        s, z = super()._calculate_qparams(min_val, max_val)
        assert (z == 0).all()
        s = 2 ** s.log2().round().clamp(-128, 127)
        return s, z
Then, when it is time to export the weights, we also export the exponent to use in ESP-DL by simply taking the log2 of the weight tensor's scale.
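That last step is trivial; assuming the scale really is a power of two (which the observer above enforces), something like this hypothetical helper does it, where the scale comes from the weight observer or from q_scale() on the quantized weight tensor:

import math

def espdl_exponent(scale: float) -> int:
    # With zero_point forced to 0 and the scale snapped to a power of two,
    # ESP-DL's exponent is exactly log2(scale): f32 = i8 * 2 ** exponent
    return int(round(math.log2(scale)))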