Decided to benchmark body25 without decoding the outputs & got a variety of frame rates vs. network size.
256x256 5.5fps
224x224 6.5fps
192x192 8fps
160x160 9fps
256x144 9fps
128x128 12fps
448x256 is what the Asus GL502V has run at 12fps since 2018. Lions believe 9fps to be the lowest useful frame rate for tracking anything. The frame rate jumps between 160x160 & 128x128 for some reason. Considering 224x128 with the original caffe model hit 5fps, it's still doing a lot better. Memory usage was about half of the caffe model. It only needs 2GB.
This was using 1280x720 video. Capturing 640x360 stepped up the frame rates by 5% because of the GUI refreshes. Sadly, it's not going to do the job for rep counting, but it should be enough for camera tracking.
Following the logic in openpose, the magic happens in PoseExtractorCaffe::forwardPass.
The model fires in spNets.at(i)->forwardPass
The output appears in spCaffeNetOutputBlobs. The sizes can be queried with
spCaffeNetOutputBlobs[0]->shape(1) = 78
spCaffeNetOutputBlobs[0]->shape(2) = 18
spCaffeNetOutputBlobs[0]->shape(3) = 32
The output dimensions are transposed from 32x18 to 18x32.
At this point, it became clear that openpose uses a 16x9 aspect ratio for the network input instead of 1x1 like trt_pose. Providing it -1x128 causes it to make a 224x128 input. Providing it -1x256 causes it to make a 448x256 input. That could explain why it's more robust than trt_pose, but it doesn't explain why body25 did better with 16x9 than 4x3.
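The 224 & 448 widths fall out of the video aspect ratio if the width is rounded to the nearest multiple of 16. The following is a minimal sketch of that arithmetic, not openpose's actual code. The function name net_input_width & the multiple of 16 are assumptions which happen to reproduce the 2 sizes above for 1280x720 video.

```python
def net_input_width(net_height, image_width, image_height, multiple=16):
    """Guess the width chosen for a -1xHEIGHT network resolution.

    Assumption: the width follows the image aspect ratio & is rounded
    to the nearest multiple of `multiple`.
    """
    exact = net_height * image_width / image_height
    # int(x + 0.5) rounds half up, avoiding Python's banker's rounding
    return multiple * int(exact / multiple + 0.5)


# 1280x720 video, as used in the benchmark
width_128 = net_input_width(128, 1280, 720)
width_256 = net_input_width(256, 1280, 720)
print(width_128, width_256)
```

With 1280x720 video this gives 224 for -1x128 & 448 for -1x256, so neither size is exactly 16x9.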
Openpose processes the output strictly in CUDA while trt_pose does it in the CPU. The CUDA functions sometimes call back into C++.
Openpose does an extra step which trt_pose doesn't. It's called "Resize heat maps + merge different scales".
spResizeAndMergeCaffe->Forward transfers the output from spCaffeNetOutputBlobs 18x32 to spHeatMapsBlob 78x144x256
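Going from 18x32 to 144x256 is an 8x upscale of every one of the 78 channels. The sketch below shows only the shape change for 1 channel on the CPU. It swaps in nearest neighbor replication for brevity, whereas the openpose CUDA kernel interpolates. upsample_nearest is a hypothetical helper, not an openpose function.

```python
def upsample_nearest(heat_map, factor):
    """Enlarge a 2D list of floats by an integer factor.

    Nearest neighbor replication stands in for the interpolation
    the real kernel does. It only illustrates the shape change.
    """
    out = []
    for row in heat_map:
        # repeat every value `factor` times horizontally
        wide = [value for value in row for _ in range(factor)]
        # repeat the widened row `factor` times vertically
        out.extend(list(wide) for _ in range(factor))
    return out


# 1 channel of the network output: 18 rows x 32 columns
small = [[float(x + y) for x in range(32)] for y in range(18)]
big = upsample_nearest(small, 8)
print(len(big), len(big[0]))
```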
TRT_pose & openpose both continue with non maximum suppression. trt_pose calls find_peaks_out_nchw & openpose calls spNmsCaffe->Forward. Trt_pose does it in the CPU. Openpose does it in the GPU.
spNmsCaffe->Forward transfers spHeatMapsBlob 78x144x256 to spPeaksBlob 25x128x3
find_peaks_out_nchw transfers the CMAP 42x56x56 to refined_peaks (size 3600) peak_counts (size 18) & peaks (size 3600)
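Both libraries are doing the same thing in this step: scanning each heat map for local maxima above a threshold & storing x, y & score for each. Here is a minimal CPU sketch of the idea for 1 body part. It isn't the code from either library. The threshold of 0.05, the 8 neighbor test & the limit of 127 peaks are assumptions, the limit being a guess at why spPeaksBlob has 128 rows per part.

```python
def find_peaks(heat_map, threshold=0.05, max_peaks=127):
    """Return (x, y, score) for each local maximum above threshold."""
    height = len(heat_map)
    width = len(heat_map[0])
    peaks = []
    # skip the border so every candidate has all 8 neighbors
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            score = heat_map[y][x]
            if score <= threshold:
                continue
            neighbors = [heat_map[y + dy][x + dx]
                         for dy in (-1, 0, 1)
                         for dx in (-1, 0, 1)
                         if (dx, dy) != (0, 0)]
            # keep the pixel only if it beats every neighbor
            if all(score > n for n in neighbors):
                peaks.append((x, y, score))
                if len(peaks) == max_peaks:
                    return peaks
    return peaks


# toy 5x6 heat map with 1 strong peak & 1 below the threshold
heat_map = [[0.0] * 6 for _ in range(5)]
heat_map[2][3] = 0.9
heat_map[1][1] = 0.02
peaks = find_peaks(heat_map)
print(peaks)
```

Running this once per body part channel on the resized heat maps would give a table shaped like the one NMS produces, minus any sub-pixel refinement.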
Finally openpose connects the body parts by calling spBodyPartConnectorCaffe->Forward which indirects to connectBodyPartsGpu in CUDA.
This transfers spHeatMapsBlob & spPeaksBlob to mPoseKeypoints & mPoseScores but doesn't use a PAF table anywhere. TRT_pose does 3 more steps in the CPU with a PAF table.
At this point, it seems simpler to port PoseExtractorCaffe::forwardPass to tensorrt & keep its CUDA functions intact.