Motivation
This is potentially a long story, so I'll begin at the beginning. What prompted this project was the need to keep Layla out of Foxy's dry food. The project gained no real momentum until I stumbled upon a cat face detector article while researching object detection models. At that point I decided I would attempt to build an app to control a cat bowl.
Development Process
Platform and Tools
The first decision was that the app would run under Android on a smartphone. The reasoning for this was:
- The model would require a reasonably powerful platform to run at a speed that could handle video input of at least 10 fps
- Google TensorFlow Lite was a viable target format for the machine vision model and it was readily supported under Android
- Android offered the "Camera2" interface, which could efficiently supply video frames to an app
- I had several old Android smartphones available for testing and development
- Google Android Studio was a viable and free IDE (Integrated Development Environment) that could easily be set up and used (unlike Apple's)
- Google's preferred languages for Android development have since evolved, but at the time Java was the preferred language, and I was somewhat familiar with it
Cat Face Detection
Once the platform and tools were decided, the next task was to find a model that could detect cat faces, was fast enough for a video feed, and would output bounding boxes around each face in a video frame. A pre-trained Haar Cascade model for cat faces was found: haarcascade_frontalcatface.xml. This model runs in an OpenCV library that supports a Cascade Classifier; the one I used was org.opencv.objdetect.CascadeClassifier. Tests with this model went well.
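The app itself uses the Java OpenCV binding named above, but the call pattern is easiest to show in Python. A minimal sketch, assuming the cascade XML file is on disk; the file names and tuning values here are placeholders, not the app's settings:

```python
# Cat face detection with the pre-trained Haar Cascade (Python equivalent
# of the app's org.opencv.objdetect.CascadeClassifier usage).
import cv2

detector = cv2.CascadeClassifier("haarcascade_frontalcatface.xml")

frame = cv2.imread("frame.jpg")                 # stand-in for one video frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # cascades operate on grayscale

# Returns one (x, y, w, h) bounding box per detected cat face.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3,
                                  minSize=(60, 60))
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```

The scaleFactor/minNeighbors/minSize values trade detection rate against false positives and speed.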
Cat Face Recognition
Next was to find a suitable face recognition model. The most compact and readily available model at the time was MobileFaceNet, which was found in TFLite format. This model was, of course, trained on human faces. Nothing pre-trained on cat faces was found at the time, and training a recognition model from scratch takes a huge amount of resources and labor. So, I reasoned, a cat face isn't all that different from a human face: two eyes, a nose, two ears. To improve the chances of it working I decided to do additional training of the model with cat faces. That's where the fun really started. The MobileFaceNet model found was already in TensorFlow Lite, which is a compact format for deployment on mobile devices. The catch is that a TFLite model can't be trained further. So the model had to be reconstructed as a Keras model for training, then converted back to a TFLite model once training was done.
It took a considerable amount of time to replicate the structure of the MobileFaceNet model with a Keras model. The structure had to match exactly so that all training parameters from the MobileFaceNet model could be loaded into the Keras model as the starting point for further training. This work was all done in Google CoLab in Python. CoLab can be used for free with limitations on the amount of CPU consumed. Fortunately, the additional training with cat faces fit within those limits.
Training images were collected from the internet. The images had to be a portrait-style format of just the cat's face (ears included). A total of about 5000 cat faces were collected for the training. The images were divided into training and test groups. Training was done with APN (Anchor/Positive/Negative) labelled triplets: an anchor image of a cat, a positive (another image of the same cat), and a negative (an image of a different cat). This dataset is extremely small by model training standards, but it still took a considerable amount of manual effort to compile.
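To make the triplet idea concrete, here is a minimal sketch of the standard triplet loss, assuming the reconstructed Keras model (called `embedder` here, a placeholder name) maps a face image to a 128-value fingerprint. The margin value is an illustrative assumption, not the project's actual setting:

```python
# Standard triplet loss: pull same-cat fingerprints together and push
# different-cat fingerprints apart by at least `margin`.
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)  # same cat
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)  # different cat
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))

# Per batch of labelled triplets (a, p, n are image batches):
#   loss = triplet_loss(embedder(a), embedder(p), embedder(n))
```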
Using Google CoLab the model was further trained with the cat face data and ultimately achieved reasonably good accuracy. The trained Keras model was then converted back to a TensorFlow Lite model for deployment.
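The conversion back is a standard TensorFlow 2 call. A sketch, with placeholder file names:

```python
# Convert the retrained Keras model back to TensorFlow Lite for the app.
import tensorflow as tf

keras_model = tf.keras.models.load_model("mobilefacenet_cats.h5")  # placeholder name

converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
tflite_bytes = converter.convert()

with open("mobilefacenet_cats.tflite", "wb") as f:  # placeholder name
    f.write(tflite_bytes)
```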
It's important to note that the recognition model doesn't output a "yes/no" result. It wasn't trained to recognize specific cats; the training simply improved its ability to distinguish between cats. The actual output of the recognition model is a vector of 128 values. You can think of this as a fingerprint of the input image. Because of the model training, the more alike two cat faces are, the smaller the "distance" will be between their respective fingerprints. Distance in this case is measured by taking the difference between corresponding elements of the two fingerprint vectors, squaring each difference, summing all the squares, and finally taking the square root of the sum; in other words, the Euclidean (L2) distance between the two vectors.
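In code form the distance calculation is just a few lines (NumPy shown for clarity; the app does the equivalent arithmetic in Java):

```python
# Euclidean (L2) distance between two 128-value fingerprints.
import numpy as np

def fingerprint_distance(a: np.ndarray, b: np.ndarray) -> float:
    diff = a - b                                # element-wise differences
    return float(np.sqrt(np.sum(diff * diff)))  # sqrt of the sum of squares

# Equivalent one-liner: float(np.linalg.norm(a - b))
```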
The App
This section provides an overview of the app. Refer to the User Manual pdf (in files section) if you want more detail.
Basic User Setup
In Basic Setup the user specifies the minimum information required for the app to operate:
- Specify the default state of the bowl (open or closed). The bowl will always return to this state after a specified timeout
- Identify each cat by name and specify whether the bowl should open or close when that cat is recognized.
- Other basic settings include whether to speak the cat's name when recognized and which camera (front or back) to use for input.
Advanced Setup
Since this is a prototype there are numerous parameters that can be tweaked to affect the models. An end user should not mess with these parameters.
Initial Startup
At initial startup the app knows about the cats but doesn't know what they look like. In this state whenever a cat face is detected its snapshot is placed in an Unclassified Images log. This log is later used for training. The app will always place images of unrecognized cats into this log.
Model Training
Training is done by associating a snapshot in the Unclassified Images log with a specific cat. The selected images are then stored in a log maintained for each cat. At first nothing is recognized, but as the user continues to associate images to specific cats the model will begin to recognize the cats and will get better at it as more images are added. Ultimately, each cat will have a collection of images (and their fingerprints) that represent that cat.
Multiple images are required mainly because of different poses the cat may present when approaching the bowl. Doing it this way avoids having the app expend a lot of resources (and time) manipulating input frames (such as image rotations) to find a match. Instead, it simply scans each list of cat images and calculates the distance between the current input image and each stored image. The shortest distance below a pre-set threshold wins and the corresponding cat is recognized.
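A sketch of that matching scan, with placeholder names and an illustrative threshold (the app's actual threshold is a tunable parameter):

```python
# Scan every stored fingerprint for every cat; the shortest distance below
# the threshold wins, otherwise the face is unrecognized.
import numpy as np

THRESHOLD = 0.8  # illustrative value only

def recognize(current, cats):
    """cats maps a cat's name to its list of stored fingerprints."""
    best_name, best_dist = None, THRESHOLD
    for name, fingerprints in cats.items():
        for stored in fingerprints:
            d = float(np.linalg.norm(current - stored))
            if d < best_dist:
                best_name, best_dist = name, d
    return best_name  # None means "no cat recognized"
```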
Working strictly from snapshots taken by the app keeps the context of each snapshot (lighting, background, etc.) consistent. This helps to improve the reliability of face recognition.
Monitoring Progress
The app keeps a log of face recognition events so the user can confirm recognition is working properly. An interface is provided to review the log and make corrections if needed. Each log entry shows a snapshot along with the name of the cat recognized. This information can be used to make corrections to the training if needed.
Web Interface
All user interface tasks can be done directly on the phone via the app, but this is not really convenient since the phone is mounted in a holder above the bowl and a landscape orientation has proven to be best. To get around this limitation, the app also provides a web interface that can be accessed from any browser.
The Devilish Details
As with many projects, the devil is usually in the details. Over a period of several months there were a few problems to deal with.
- Post Recognition: the app has recognized a cat and opened (or closed) the bowl. Great! But you don't really want the app to continually re-recognize the same cat at a rapid interval, especially if you have "speak names" enabled. Recognition of the same cat is inhibited until a set amount of time (1-2 minutes) has passed, but if a different cat arrives during the timeout period it is recognized immediately (a sketch of this inhibit logic follows this list).
- Bowl Closing: as mentioned earlier, the bowl has a timer that will cause it to return to its default state. That's fine, unless the cat is still eating out of the bowl; the cat can't be recognized again if it's face down in the bowl eating. To get around this, the app uses motion detection to extend the return-to-default timer.
- "Look Up!" Teaser: Some cats will approach the bowl head down since they're looking at the food in the bowl. If motion is detected at the bowl, but no cat has been recognized, then the app will attempt to get the cat to look up at the phone. Images will move across the display and some noises are emitted.
- Opening Feeder to Fill: During initial use the feeder was open by default. I did this because Foxy was afraid of the feeder opening as she approached, and it also meant there was no issue with adding food. Once Foxy became comfortable with the feeder, the default state was changed to closed. It then became clear that, short of dragging Foxy over to the bowl to open it, a way to manually open the bowl was needed. The bowl had a number of buttons and a switch that were no longer used, so the original power switch was wired to the Arduino controller and code was added to force the bowl open when the switch is open.
- Smarty Cats: There is one loophole in the design that a determined cat might master. A cat allowed to eat out of the bowl finishes and walks away; the bowl remains open for the remainder of the default timeout period. If a cat not allowed to eat approaches immediately after the other cat leaves, the bowl is still open. If they approach with their head up at any point they should be recognized and the bowl will close, but if they stay head down during the entire approach they can get into the bowl and their motion will keep it open. Detection can occur a fair distance from the bowl, so it will depend on the positioning of the bowl and the approach to it. Keeping the default timeout fairly short also helps avoid this situation. This has not happened with our cats.
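As promised above, here is a sketch of the post-recognition inhibit logic from the first bullet. The structure follows the behavior described; the names and the cooldown value are assumptions:

```python
# Inhibit repeat recognition of the same cat for a cooldown period, while
# still acting immediately when a different cat shows up.
import time

COOLDOWN_S = 90  # the app uses 1-2 minutes; this value is illustrative
last_cat, last_time = None, 0.0

def should_act(cat):
    global last_cat, last_time
    now = time.monotonic()
    if cat == last_cat and (now - last_time) < COOLDOWN_S:
        return False                  # same cat within cooldown: do nothing
    last_cat, last_time = cat, now    # new cat, or cooldown expired: act
    return True
```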
The Bowl
A commercial bowl (Sure Petcare) was heavily modified for use in this project. The bowl's original function was simply to open when motion was detected. For this project the bowl needed to accept open/close commands via Bluetooth so the phone app could control it. To accomplish this, all of the original electronics in the bowl, except for the motor, were removed and replaced with a custom Arduino Nano controller. The bowl is a project unto itself. Software development was done using Microsoft Visual Studio 2019 with the Arduino IDE for Visual Studio extension.
More to come....