    xor a8, a8, a8
    xor a9, a9, a9
    movi a7, {image_height}
    movi a6, {image_width}
    slli a12, a6, 4
    or a13, a12, a12
    addi a13, a13, -16
max_col:
    movi a6, {image_width}
    or a10, a8, a8
    add a8, a12, a8
    or a11, a9, a9
    addx2 a9, a12, a9
max_block:
    ee.vld.128.ip q0, a11, 16
    ee.vld.128.xp q1, a11, a13
    ee.vmax.s8.ld.incp q2, a11, q5, q0, q1
    ee.vmax.s8.ld.incp q3, a11, q6, q5, q2
    sub a11, a11, a12
    ee.vmax.s8 q7, q6, q3
    st.qr q7, a10, 0
    addi a10, a10, 16
    addi a6, a6, -2
    bnez a6, max_block
end_max_block:
    addi a7, a7, -2
    bnez a7, max_col
end_max_col:
I managed to get my ESP32-S3 emulator to the point where it can run many of the SIMD instructions. I implement functionality as I need it, adding an instruction only when I see a reason to use it in the assembly code I'm writing. This keeps the process a bit more manageable, because writing the emulator is mind-numbing, carpal-tunnel-inducing torture as it is...
Using HWC format with a channel count that is a multiple of 16 helps with alignment. With 16 channels of 8-bit data, each pixel is loaded by exactly one EE.VLD.128.IP instruction. The max pool uses these loads in conjunction with EE.VMAX.S8.LD.INCP, which calculates the lane-wise maximum of two vectors of 16 signed 8-bit integers while loading new data.
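For checking the emulator output, a scalar reference can be handy. The sketch below is not the code from this project: it assumes the assembly performs a 2x2, stride-2 max pool (which the loop counters stepping by 2 suggest, but which is not stated above), and the names `vmax_s8` and `max_pool_2x2_hwc` are made up for illustration. Each pixel is treated as one vector of `channels` values, mirroring one 128-bit load per pixel when `channels` is 16.

```python
def vmax_s8(a, b):
    """Lane-wise maximum of two equal-length vectors of signed 8-bit values."""
    return [x if x > y else y for x, y in zip(a, b)]


def max_pool_2x2_hwc(src, height, width, channels):
    """Scalar reference for a 2x2, stride-2 max pool over a flat HWC buffer.

    Assumption: window and stride are both 2; height and width are even.
    """
    assert height % 2 == 0 and width % 2 == 0
    assert len(src) == height * width * channels
    row_stride = width * channels  # elements per image row
    out = []
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            top = y * row_stride + x * channels
            bottom = top + row_stride
            # Four neighbouring pixels, each a vector of `channels` lanes.
            p00 = src[top:top + channels]
            p01 = src[top + channels:top + 2 * channels]
            p10 = src[bottom:bottom + channels]
            p11 = src[bottom + channels:bottom + 2 * channels]
            # Three lane-wise max operations reduce the 2x2 window.
            out.extend(vmax_s8(vmax_s8(p00, p01), vmax_s8(p10, p11)))
    return out


# A 2x2 image with 16 channels collapses to a single 16-lane output pixel.
pixels = [[-128] * 16, [5] * 16, [-3] * 16, [127] + [0] * 15]
flat = [v for p in pixels for v in p]
pooled = max_pool_2x2_hwc(flat, 2, 2, 16)
```

Here `pooled` holds 127 in channel 0 and 5 in the other fifteen channels. Comparing such a reference against the buffer written by the emulated assembly is one way to catch mistakes in newly implemented instructions.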