The gesture-detection algorithms that decide, based on the player's movements, at which point in time a sound is triggered are central to the lalelu_drums project. My goal is to find one or more algorithms that give the player a natural feeling of drumming.
So far, the gesture-detection algorithms in lalelu_drums have all been model-based: they apply a heuristic combination of rolling averages, vector algebra, derivatives, thresholds and hysteresis to the time series of estimated poses, eventually generating trigger signals that are then sent, e.g., to a MIDI sound generator.
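To illustrate the idea, here is a minimal sketch of what such a model-based detector could look like, assuming the vertical wrist coordinate as input. The function name, frame rate, and threshold values are illustrative assumptions, not the actual lalelu_drums code:

```python
import numpy as np

def detect_triggers(wrist_y, fs=30.0, ema_alpha=0.3,
                    velocity_threshold=-0.8, hysteresis=0.2):
    """Scan a time series of vertical wrist positions and emit trigger
    indices whenever a fast downward stroke crosses the velocity
    threshold. Hysteresis arms/disarms the detector so a single stroke
    produces only one trigger. All parameters are placeholder values."""
    wrist_y = np.asarray(wrist_y, dtype=float)
    triggers = []
    smoothed = wrist_y[0]
    prev_smoothed = smoothed
    armed = True
    for i, y in enumerate(wrist_y):
        # rolling (exponential) average to suppress pose-estimation jitter
        smoothed = ema_alpha * y + (1.0 - ema_alpha) * smoothed
        # finite-difference derivative in units of position per second
        velocity = (smoothed - prev_smoothed) * fs
        prev_smoothed = smoothed
        if armed and velocity < velocity_threshold:
            triggers.append(i)   # downward stroke detected
            armed = False        # disarm until the stroke has ended
        elif not armed and velocity > velocity_threshold + hysteresis:
            armed = True         # re-arm for the next stroke
    return triggers
```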
In this log entry I present, for the first time, a data-driven gesture-detection algorithm. It is based on training data generated in the following way.
A short, repetitive rhythmic pattern is played to the user, and she performs gestures in sync with the sounds of the pattern. The user's movements are recorded, yielding a dataset that contains both the time series of estimated poses and the ground-truth trigger times.
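A rough sketch of how such a training pair could be assembled, assuming frame timestamps and the known onset times of the played pattern are available. The function name and the tolerance value are assumptions for illustration:

```python
import numpy as np

def build_training_pair(pose_frames, frame_times, pattern_onsets,
                        tolerance=0.05):
    """Align recorded pose frames with the known onset times of the
    repetitive pattern: frames within `tolerance` seconds of an onset
    get target 1.0 (trigger), all other frames get 0.0."""
    X = np.asarray(pose_frames, dtype=np.float32)   # (n_frames, n_coords)
    t = np.asarray(frame_times, dtype=np.float64)   # (n_frames,)
    y = np.zeros(len(t), dtype=np.float32)
    for onset in pattern_onsets:
        y[np.abs(t - onset) < tolerance] = 1.0
    return X, y
```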
From these datasets, a 1D-convolutional network is trained that outputs a trigger signal based on the real-time pose-estimation data. In the following example, the network takes the coordinates of the left wrist and the left elbow as input. A second network of the same structure processes the coordinates of the right wrist and the right elbow.
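A minimal PyTorch sketch of such a per-hand network, taking the x/y coordinates of one wrist and one elbow (4 input channels) over a window of frames. The layer sizes and kernel width are placeholders and do not reproduce the 219k-parameter network mentioned below:

```python
import torch
import torch.nn as nn

class TriggerNet(nn.Module):
    """1D-convolutional network mapping a window of keypoint coordinates
    (wrist + elbow, x/y each -> 4 input channels) to per-frame trigger logits."""
    def __init__(self, in_channels=4, hidden=64, kernel_size=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, x):
        # x: (batch, 4, n_frames) -> (batch, n_frames) trigger logits
        return self.net(x).squeeze(1)

# one network per hand, fed with that hand's wrist and elbow coordinates
left_net = TriggerNet()
right_net = TriggerNet()
```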
The video shows the networks in action. In addition to the two networks, a classic position threshold is used to choose between two different sounds for each wrist.
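The sound selection could look as simple as the following sketch; the threshold and MIDI note numbers are placeholders (note that in image coordinates a smaller y value means the hand is higher up):

```python
def select_midi_note(wrist_y, threshold=0.5, high_note=38, low_note=45):
    """Pick one of two drum sounds depending on where the wrist is when
    the network fires: above the threshold line -> one note, below -> the
    other. Note numbers are illustrative placeholders."""
    return high_note if wrist_y < threshold else low_note
```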
In the example, each network has 219k parameters. Inference (PyTorch) runs on the CPU, since the GPU is occupied by the pose-estimation network (TensorFlow). The inference time for both networks together is below 2 ms.
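For reference, CPU latency of a single network can be estimated along these lines; the window length and run count are arbitrary assumptions, and the numbers will of course differ from the measurement quoted above:

```python
import time
import torch
import torch.nn as nn

def measure_cpu_latency(net: nn.Module, window_len: int = 32, n_runs: int = 200) -> float:
    """Average forward-pass time in milliseconds on the CPU, batch size 1,
    using a dummy wrist/elbow window as input."""
    net.eval()
    x = torch.randn(1, 4, window_len)   # 4 channels: wrist x/y, elbow x/y
    with torch.no_grad():
        for _ in range(10):             # warm-up
            net(x)
        start = time.perf_counter()
        for _ in range(n_runs):
            net(x)
        elapsed = time.perf_counter() - start
    return elapsed / n_runs * 1e3       # milliseconds per inference

# e.g. measure_cpu_latency(TriggerNet()) with the sketch from above
```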