OK, more like enduring joint pain for my work, but that sounds lame.
The glove's machine-learning model starts with an RNN (I'm using an LSTM for now) to pull discrete patterns out of the stream of incoming coordinates. I suspect I'll need several hidden layers after that, because letters are made of downstrokes, circles, and other mini-shapes, and it takes a lot of abstraction to tell them apart.
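In code, the starting point looks roughly like this (a Keras sketch; the layer sizes, sequence length, and feature count are placeholders I picked for illustration, not the project's final numbers):

```python
import tensorflow as tf

TIMESTEPS = 50      # coordinate samples per gesture (placeholder)
FEATURES = 5        # e.g. one value per finger (placeholder)
NUM_CLASSES = 50    # letters, numbers, and punctuation marks

model = tf.keras.Sequential([
    # The recurrent layer pulls patterns out of the incoming coordinate stream.
    tf.keras.layers.LSTM(64, input_shape=(TIMESTEPS, FEATURES)),
    # Dense layers stack up the abstraction: strokes -> sub-shapes -> letters.
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # shows how many parameters this thing gets to play with
```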
Why does this matter? Recurrent and deep networks are easy to overtrain, and that means my wrist is gonna hurt.
Overtraining is when the machine-learning model aces a test by writing the answers on its hand. The network has internalized the training data, and instead of recognizing an incoming gesture, it just outputs which training sample it looks like. It does great with the training data, but face-plants in the real world. It gets new data, looks at the answers on its hand, doesn't see any close matches, and takes an embarrassingly poor guess.
LSTMs and deep networks have a huge number of parameters and can encode a lot of data within themselves. In other words, the model has a really big hand and it can't help writing answers on it. There's only one solution, and it's...
More data.
Each training run needs to consist of many samples, far more than the model can internalize. If the model tries to make a cheat sheet, it quickly runs out of paper and is forced to actually do the work. With 100 nodes in the densest layer, I estimate that I need about 100 samples for each of the 50 letters, numbers, and punctuation marks to do the job.
That's 5,000 times I need to waggle my hand for science.
The problem is, the network will still inevitably overtrain. I need this glove to work in all kinds of circumstances - when I'm walking, sitting, jogging, have my hands full, can't move much, or need to make big gestures onstage. So, I need...
Even more data.
The machine-learning model, mad lad that it is, tells me to kiss its ass and memorizes the answers anyway. But I'm a step ahead! I give it a second test with totally different questions on it.
The model blows a wet raspberry and memorizes the answers to that test too. So, when it comes back to class, I give it a third totally different test.
This repeats, the model memorizing tests and me making new tests, until the model starts failing. Its strategy is stretched too thin - the model is just not big enough to encode all those questions and answers, so the only way it can get good grades is to learn patterns, not answers.
This is called k-fold cross-validation - you split your data into k subsets, train on all but one, and grade the model on the subset that was held out, rotating until every subset has had a turn as the test. The same model has to ace many different tests, and it never gets asked a question it already saw on the practice test.
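In scikit-learn terms, the exam schedule looks something like the sketch below. Here build_model() is a stand-in for whatever returns a fresh, untrained copy of the network above, and X, y are the recorded gestures and their labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, build_model, n_splits=10):
    """Train and grade a fresh model on each fold; return the mean accuracy."""
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for fold, (train_idx, test_idx) in enumerate(kfold.split(X, y)):
        model = build_model()   # brand-new model, no peeking at earlier folds
        model.fit(X[train_idx], y[train_idx], epochs=20, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        print(f"fold {fold}: accuracy {acc:.3f}")
        scores.append(acc)
    return float(np.mean(scores))
```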
A rule of thumb is that 10-fold cross-validation is the point of diminishing returns. That means I need about ten times the data required for a single training run to keep the model from memorizing answers.
That means I need to flap my meat mittens 50,000 times to collect an appropriate training set.
So say I do that. The machine-learning model eventually passes all ten tests, and I rejoice. But that little bastard is wearing an obnoxious grin, almost as if to say, "Stupid monkey! I memorized all ten tests! Eat my shorts!" There's only one way to know for sure, and it is...
I got a fever, and the only prescription is more data.
That's right. As that pissant mass of floating-point operations snickers to itself, I look it dead in the eye, reach into my back pocket, and whip out a double-super-top-secret eleventh test.
This set of data, the sanity check, is special. Unlike the training set, the model never sees this data while training. It has no way to memorize the answers, because it never sees these questions, or its grade on them, until training is over. I see the grade, and if the model flunks the final exam, I drag it out back and put it out of its misery (reset and start over). If it passes, the job is done, and it gets loaded onto the glove and sent into the real world.
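A rough sketch of that final exam, reusing the hypothetical build_model() and cross_validate() helpers from above; X_work/y_work are the cross-validation samples, and X_final/y_final are the sanity-check samples the model never gets to see until now:

```python
# All the tuning and cross-validation happens on the working set only.
mean_cv_accuracy = cross_validate(X_work, y_work, build_model)

# One last model, trained on everything it is allowed to see...
model = build_model()
model.fit(X_work, y_work, epochs=20, verbose=0)

# ...and one last exam it has never laid eyes on. I see this grade; it doesn't.
_, final_accuracy = model.evaluate(X_final, y_final, verbose=0)
print("final exam:", final_accuracy)  # flunking means reset and start over
```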
The sanity check is the final line of defense, and the final metric of how effectively it does the thing. Say I want my glove to be 95% accurate. That means it must score a 95 or higher on the final, or backyard pew pew. I need to put more than 100 questions on the test, because I myself might screw up and make a test that has, say, 50% fewer walking-around samples than real-world usage.
So, I think it's prudent to collect 200 more samples of each gesture, for a total of 60,000 samples, to know for sure that this thing is rock-solid.
This is gonna hurt.
Oh yeah. 1,200 samples of each gesture is enough to give me RSI, not to mention the soul-eroding boredom. I estimate that I can collect about 100 samples per minute, so the full dataset will take ten solid hours of nothing but nonstop hand-waving.
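For the record, the back-of-the-envelope budget:

```python
classes = 50            # letters, numbers, punctuation marks
per_class_train = 100   # samples per class for one training run
folds = 10              # 10-fold cross-validation
per_class_final = 200   # extra samples per class for the sanity check
rate = 100              # samples collected per minute (my estimate)

training_samples = classes * per_class_train * folds   # 50,000
final_samples = classes * per_class_final              # 10,000
total = training_samples + final_samples               # 60,000

print(total, "samples /", rate, "per minute =", total / rate / 60, "hours")  # 10.0 hours
```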
Pitter-patter, let's get at 'er.