-
No more ALSA seq requirement
02/06/2022 at 03:52 • 0 comments
I have been trying to get the code working on a production Jetson Xavier NX, but it has a custom board, and rebuilding / signing the kernel is a minefield I would rather not cross on my weekend off. I would rather bypass seq altogether.
So I tried to get JACK to detect raw MIDI devices... No luck.
So now I have written a rawmidi interface myself. It is pretty basic and has some potential issues if you want to do advanced routing, but for now it should remove the requirement for the kernel rebuild.
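For reference, the core of a rawmidi reader is tiny. A minimal sketch of the idea, not the project's exact code; "hw:1,0,0" is a placeholder device name (list yours with "amidi -l"):

#include <alsa/asoundlib.h>
#include <stdio.h>

int main(void) {
    snd_rawmidi_t *in = NULL;
    // Open input only (no output handle), blocking mode, straight to the
    // kernel rawmidi device: no seq module involved.
    if (snd_rawmidi_open(&in, NULL, "hw:1,0,0", 0) < 0) {
        fprintf(stderr, "cannot open rawmidi device\n");
        return 1;
    }
    unsigned char buf[256];
    for (;;) {
        // Blocks until MIDI bytes arrive on the port.
        ssize_t n = snd_rawmidi_read(in, buf, sizeof buf);
        if (n < 0)
            break;
        for (ssize_t i = 0; i < n; ++i)
            printf("%02x ", buf[i]);  // raw status + data bytes
        fflush(stdout);
    }
    snd_rawmidi_close(in);
    return 0;
}

From here it is a matter of parsing the status bytes yourself, which is where the "advanced routing" caveats come from.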
(Still need to test on the actual Jetson; working on my laptop right now.)
UPDATE: Now tested and functioning on the Jetson Xavier NX.
PS: When testing audio applications, always test with headphones and wear them around your neck. Then monitor the signal by putting one of the cups on your ear, DJ-style, just in case a bug is causing noise. I once had a bug that produced so much noise the headphones were vibrating. Protect your hearing!
-
Jetson Nano Test
01/25/2022 at 20:53 • 0 comments
Finished testing on the Nano, works great!
When running with a single instance, the convolution time is 1.9ms for 2 input channels. When running 2 instances in a single process the convolution time doubles, and this seems to be the maximum I can get out of the Nano. As soon as I start 3 instances, the screen starts flickering and the audio sounds like a buzzer. Yikes...
So, to sum up: 4 input channels / 4 output channels should be doable! I tested with a (JACK) buffer size of 512, which is as low as the TR-6S will go.
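As a sanity check on the budget (assuming a 48 kHz sample rate, which I haven't stated explicitly): a 512-sample buffer spans 512 / 48000 ≈ 10.7 ms, so two instances at ~1.9 ms each fit comfortably. The breakdown at three instances, with the screen flickering, points to contention for the Nano's GPU (which also drives the display) rather than to the raw time budget.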
-
Testing the limits
01/25/2022 at 20:15 • 0 comments
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000 wi...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   61C    P0    N/A /  N/A |   1882MiB /  4040MiB |     69%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
There are still some optimizations I could do, but they would require a bit of effort. I tried to see how many instances I could run in parallel before I would start to get xruns. Interestingly, on my laptop at least, if I run all instances in the same process, the whole thing breaks down at 4 instances and I get a lot of noise. The GPU seems capped at 50%.
If I instead run 2 instances per process and start multiple processes, I can go up to 5 processes, 10 instances = 20 (!) parallel convolution reverbs (two channels per instance). GPU usage then tops out at 70%. When starting 4 instances over 2 processes, the GPU usage is only 25%, compared to 50% in a single process. (Each instance has its own stream(s), so I am not sure how this is happening.)
Could it be that there are multiple engines that can only be used when the actual CUDA contexts are different, and that these run in parallel, so the usage is halved?
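For context, each instance sets up something like the following (a sketch under my assumptions, not verbatim project code): a dedicated CUDA stream per instance, with the cuFFT plan bound to it.

#include <cuda_runtime.h>
#include <cufft.h>

struct ReverbInstance {
    cudaStream_t stream;
    cufftHandle  plan;
};

ReverbInstance createInstance(int fftSize) {
    ReverbInstance inst{};
    cudaStreamCreate(&inst.stream);                  // per-instance stream
    cufftPlan1d(&inst.plan, fftSize, CUFFT_R2C, 1);  // per-instance FFT plan
    cufftSetStream(inst.plan, inst.stream);          // FFTs launch on that stream
    return inst;
}

Separate processes mean separate CUDA contexts; whatever the scheduler does with those, it is clearly not the same as what it does with streams inside a single context.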
So I'm quite pleased with the performance so far, and don't see a good reason to start optimizing. Let's see how it performs on the Jetson.
-
Concept
01/23/2022 at 20:17 • 0 comments
I started working with NVIDIA Jetson a few years ago, when the Nano had just come out. I learned CUDA and loved it. Soon I outgrew the Nano, and after I got my first Xavier NX I never looked back. The Xavier NX was in turn shelved when I moved to the AGX Xavier.
What a waste, though: they are wonderful little boards in their own right; I just need to find a nice job for them to do. I’ve been wondering if CUDA could be of any help for real-time audio processing. I figured that the buffer sizes would most likely be too small to offset the overhead of copying the data to and from the GPU, and even if they weren’t, you’d have so little data that your grid size would probably be 1.
On top of that, many filters simulate (to some extent) electronics or physics, where the new state depends on the previous state, so they compute serially in the time domain. These algorithms are not easily parallelized. They are generally known as Infinite Impulse Response (IIR) filters. The only way to make proper use of the GPU for these filters is to have a whole stack of them, all independent of each other, and run them all in parallel, each one still serial internally. I played with the idea of simulating all the strings in a piano, or a symphonic orchestra with each violin processed in parallel.
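To make the serial dependency concrete, here is a toy one-pole low-pass filter (an illustration, not code from this project):

// One-pole low-pass IIR: y[n] = a*x[n] + (1-a)*y[n-1].
// Every output sample depends on the previous output, so a single
// filter cannot be split across GPU threads in the time dimension;
// parallelism has to come from running many independent filters.
void onePoleLowpass(const float* x, float* y, int n, float a) {
    float state = 0.0f;
    for (int i = 0; i < n; ++i) {
        state = a * x[i] + (1.0f - a) * state;  // depends on previous state
        y[i] = state;
    }
}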
The counterpart of IIR is FIR: Finite Impulse Response filters. FIR filtering usually entails transforming a signal from the time domain to the frequency domain using Fast Fourier Transforms, doing some magic math, and transforming the result back. Multiplication in the frequency domain gives you convolution in the time domain, and vice versa. This can be used in audio effects to simulate anything from amplifiers to the stereo field of a large church: you convolve your plain audio with the recorded response (echo) of an impulse (a clap, snip, or tick) that was played inside a large hall or through a guitar amplifier, for instance. The resulting audio then sounds as if it was recorded right there in that hall or with that amplifier.
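Concretely: to convolve a signal x with an impulse response h, zero-pad both to at least len(x) + len(h) - 1 samples, then

    y = IFFT( FFT(x) · FFT(h) )

The zero-padding makes the FFT's circular convolution match the linear convolution you actually want.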
Now, remember that GPUs are very good at certain AI algorithms, for instance CNNs, or Convolutional Neural Nets. Yup, that is the same convolution mentioned above, just generally 2D for images. GPUs love doing convolution, so it should come as no surprise that NVIDIA has a CUDA library just for that: cuFFT. The way Fourier transforms can be parallelized makes them work equally well for one large 1D transform or a bunch of small ones (like scanlines). If we get a buffer with 256 samples and want to add about 1.5 sec of echo, the convolution would have a size of 65536, equivalent to a 256x256 image. A convolution can be compared to a single layer of a model, and I’m expecting to do about 200 buffers x 3 convolutions. How many filters could we run in parallel, and what latency can we expect? Initial tests on the Jetson Nano suggest 1 or 2 filters in parallel with a latency of 1.5 milliseconds. Looks promising!!
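For the curious, here is a rough sketch of what FFT convolution with cuFFT looks like. This is a single-block toy version under my own assumptions (buffers already zero-padded on the device, no error checking); a real-time reverb would use partitioned convolution so each audio buffer only costs a small FFT:

#include <cuda_runtime.h>
#include <cufft.h>

// Pointwise complex multiply: the frequency-domain half of the convolution
// theorem. Scaling by 1/fftSize compensates for cuFFT's unnormalized
// inverse transform.
__global__ void pointwiseMultiply(cufftComplex* s, const cufftComplex* h,
                                  int nBins, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nBins) {
        cufftComplex a = s[i], b = h[i];
        s[i].x = (a.x * b.x - a.y * b.y) * scale;
        s[i].y = (a.x * b.y + a.y * b.x) * scale;
    }
}

// d_signal and d_ir are device buffers of fftSize floats, zero-padded past
// len(signal) + len(ir) - 1 so circular convolution equals linear.
void fftConvolve(float* d_signal, float* d_ir, float* d_out, int fftSize) {
    int nBins = fftSize / 2 + 1;  // bins produced by a real-to-complex FFT
    cufftComplex *d_S, *d_H;
    cudaMalloc(&d_S, nBins * sizeof(cufftComplex));
    cudaMalloc(&d_H, nBins * sizeof(cufftComplex));

    cufftHandle fwd, inv;
    cufftPlan1d(&fwd, fftSize, CUFFT_R2C, 1);
    cufftPlan1d(&inv, fftSize, CUFFT_C2R, 1);

    cufftExecR2C(fwd, d_signal, d_S);  // signal -> frequency domain
    cufftExecR2C(fwd, d_ir, d_H);      // impulse response -> frequency domain

    int threads = 256;
    int blocks = (nBins + threads - 1) / threads;
    pointwiseMultiply<<<blocks, threads>>>(d_S, d_H, nBins, 1.0f / fftSize);

    cufftExecC2R(inv, d_S, d_out);     // back to the time domain

    cufftDestroy(fwd); cufftDestroy(inv);
    cudaFree(d_S); cudaFree(d_H);
}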