+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000 wi...   On  | 00000000:01:00.0 Off |                  N/A |
| N/A   61C    P0    N/A /  N/A |   1882MiB /  4040MiB |     69%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
There are still some optimizations I could do, but they would require a bit of effort. I tried to see how many instances I could run in parallel before I would start to get xruns. Interestingly, on my laptop at least, if I run all instances in the same process, the whole thing breaks down at 4 instances and I get a lot of noise. The GPU seems capped at 50%.
If, however, I run 2 instances per process and start multiple processes, I can go up to 5 processes and 10 instances, i.e. 20 (!) parallel convolution reverbs. GPU usage then tops out at 70%. When starting 4 instances over 2 processes, the GPU usage is only 25%, compared to 50% in a single process. (Each instance has its own stream(s), so I'm not sure how this is happening.)
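For reference, here is a minimal sketch of what "each instance has its own stream(s)" could look like with the CUDA runtime and cuFFT. The names (ReverbInstance, createInstance, processBlock) are placeholders of mine and only stand in for the actual engine; the partitioned multiply/accumulate with the impulse response is omitted.

#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical per-instance state: each reverb owns a stream and a cuFFT plan
// bound to that stream, so work from different instances can be enqueued
// independently and overlap on the device.
struct ReverbInstance {
    cudaStream_t  stream;
    cufftHandle   plan;
    cufftComplex *d_spectrum;   // scratch spectrum for one partition
};

ReverbInstance createInstance(int fftSize) {
    ReverbInstance inst{};
    cudaStreamCreate(&inst.stream);
    cufftPlan1d(&inst.plan, fftSize, CUFFT_R2C, 1);
    cufftSetStream(inst.plan, inst.stream);        // all FFT work goes to this stream
    cudaMalloc(&inst.d_spectrum, sizeof(cufftComplex) * (fftSize / 2 + 1));
    return inst;
}

// One audio block: forward FFT (the pointwise multiply with the IR partitions
// and the inverse FFT are omitted), then wait only on this instance's stream.
void processBlock(ReverbInstance &inst, float *d_block) {
    cufftExecR2C(inst.plan, d_block, inst.d_spectrum);
    // ... multiply/accumulate with IR spectra, inverse FFT, copy out ...
    cudaStreamSynchronize(inst.stream);
}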
Could it be that there are multiple engines on the GPU that can only be used from different CUDA contexts, and that those run in parallel, so the per-context usage is halved?
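One way to poke at that hypothesis without spawning separate processes would be to give each instance its own driver-API context in its own host thread; something along these lines (runReverbInstance is again a placeholder for the real per-instance setup, not code from this project):

#include <cuda.h>      // CUDA driver API
#include <thread>
#include <vector>

// Placeholder for the real per-instance setup (streams, plans, audio loop).
void runReverbInstance(CUcontext ctx) {
    cuCtxSetCurrent(ctx);          // bind this thread to its own context
    // ... create streams / cuFFT plans and process audio blocks here ...
}

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    const int kInstances = 4;                // same count as the breakdown case
    std::vector<CUcontext>   contexts(kInstances);
    std::vector<std::thread> workers;

    for (int i = 0; i < kInstances; ++i) {
        cuCtxCreate(&contexts[i], 0, dev);   // one context per instance
        workers.emplace_back(runReverbInstance, contexts[i]);
    }
    for (auto &w : workers) w.join();
    for (auto &c : contexts) cuCtxDestroy(c);
    return 0;
}

Note that without MPS, separate contexts on the same GPU are normally time-sliced rather than truly overlapped, so a test like this might also help tell whether the utilization numbers reflect real concurrency or just per-context accounting.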
So I'm quite pleased with the performance so far, and don't see a good reason to start optimizing. Let's see how it performs on the Jetson.