A multitude of use cases needs AI combined with depth information for meaningful deployment in real-world scenarios. Let's dive deep into each of the solutions below to explore the possibilities of Depth AI across domains. The technical milestones and progress of each solution unfold in the "Project Log" section.
Consolidated Project Demo
Solution #1: ADAS - Collision Avoidance System on Indian Cars
India accounts for only 1% of the world's vehicles. However, the World Bank reports that 11% of global road deaths happen in India, exposing the dire need to enhance road safety. Developing countries like India pose their own unique set of challenges: chaotic traffic, outdated vehicles, a lack of pedestrian lanes and zebra crossings, animals crossing the road, and the like. Needless to say, most vehicles don't have advanced driver-assist features, nor can their owners afford to upgrade the car for better safety.
Against this backdrop, this solution aims to augment even the least expensive cars in India with an ultra-cheap ADAS Level 0, i.e. collision avoidance and smart surround-view. Modern cars with a forward-collision warning (FCW) system or autonomous emergency braking (AEB) are very expensive, but we can augment such functions on old cars, at a low cost.
The idea is to use a battery-powered Pi connected to a LIDAR, Pi Cam, LED SHIM, and NCS 2, mounted on the car bonnet, to perceive frontal objects with their depth and direction. This not only enables a forward-collision warning system but also driver-assist alerts about traffic signs and about pedestrians walking along the roadside or crossing the road.
To build a non-blocking system flow, a modular architecture is employed in which each independent node depends on a different hardware component: the "object detection" node uses the Movidius NCS 2 for inference, the "distance estimation" node takes LIDAR data as input, and the "Alert" node signals the Pimoroni Blinkt! and the speaker. The nodes communicate via MQTT messages on their respective topics.
Architecture Diagram
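For illustration, here is a minimal sketch of how two such nodes could exchange messages using paho-mqtt. The broker address, topic name, and payload fields are assumptions for the sketch, not necessarily the exact ones used in this project.

```python
# Minimal sketch of the pub-sub wiring between nodes (hypothetical topic and fields).
import json
import paho.mqtt.client as mqtt

BROKER = "192.168.1.10"                 # assumed address of the local MQTT broker
TOPIC_DETECTIONS = "adas/detections"    # assumed topic name

# Publisher side (object detection node): one detection per message.
def publish_detection(client, label, bbox, confidence):
    payload = json.dumps({"label": label, "bbox": bbox, "conf": confidence})
    client.publish(TOPIC_DETECTIONS, payload)

# Subscriber side (distance estimation / alert node).
def on_message(client, userdata, msg):
    detection = json.loads(msg.payload)
    print("Received:", detection["label"], detection["bbox"])

pub = mqtt.Client()
pub.connect(BROKER)

sub = mqtt.Client()
sub.on_message = on_message
sub.connect(BROKER)
sub.subscribe(TOPIC_DETECTIONS)
sub.loop_start()                        # non-blocking network loop

publish_detection(pub, "pedestrian", [120, 80, 260, 310], 0.87)
```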
The time synchronization module takes care of the "data relevance factor" for sensor fusion. In ADAS, the location of objects detected by 'Node 1' may change as the objects move, so the distance estimate for a bounding box can become invalid after 2-3 seconds (while the message may still be sitting in the MQTT queue). To synchronize, the current time (60 * minutes + seconds) is appended to each message so that lagged messages can be ignored at the receiving end.
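A sketch of that timestamp scheme is shown below; the field name and the 2-second freshness window are illustrative.

```python
# Sketch: stamp outgoing messages and drop stale ones at the receiving end.
import time

STALE_AFTER_SECONDS = 2                      # assumed freshness window

def stamp(message: dict) -> dict:
    """Append current time = 60*minutes + seconds before publishing."""
    t = time.localtime()
    message["ts"] = 60 * t.tm_min + t.tm_sec
    return message

def is_stale(message: dict) -> bool:
    """Ignore messages that lagged in the MQTT queue."""
    t = time.localtime()
    now = 60 * t.tm_min + t.tm_sec
    age = (now - message["ts"]) % 3600       # handle the hour wrap-around
    return age > STALE_AFTER_SECONDS
```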
Watch the gadget plying Indian roads, giving driver-assist alerts after perceiving surrounding objects along with their depth and direction.
ADAS-CAS Gadget warns the driver with sound and light indicator
The speaker can be wired internally to make the announcements more audible to the driver. This ADAS device can also be connected to the CAN bus to let it accelerate, steer, or apply the brakes. The RPi doesn't have a built-in CAN bus, but its GPIO includes an SPI bus, which is supported by a number of CAN controllers such as the MCP2515. So autonomous emergency braking (AEB) can be layered on top of the collision-avoidance system (CAS) by connecting this device to the CAN bus.
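As a rough sketch of that CAN hookup, the snippet below sends a hypothetical brake-request frame with python-can, assuming the MCP2515 is wired to the SPI bus and exposed as the SocketCAN interface can0 (for example, via the mcp2515 device-tree overlay). The arbitration ID and payload are placeholders; real IDs depend on the vehicle's CAN protocol.

```python
# Sketch only: sending a brake-request frame over CAN from the RPi.
import can

bus = can.interface.Bus(channel="can0", bustype="socketcan")

def send_brake_request(level: int):
    msg = can.Message(arbitration_id=0x123,       # hypothetical ID
                      data=[0x01, level & 0xFF],  # hypothetical payload
                      is_extended_id=False)
    bus.send(msg)

send_brake_request(40)
```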
The source code of this solution is available here
Solution #2: Indoor Robot Localization with SLAM
Robots are replacing humans at establishments like hotels and restaurants. But how do they navigate indoors without spatial information about their surroundings? If you already have a map, you know where you are. When there is no map beforehand, you need to not only generate the map (mapping) but also understand your location on that map (localization). Imagine how hard that can be even for a human, and then how much harder for a robot!
SLAM (Simultaneous Localization And Mapping) algorithms use LiDAR and IMU data to locate the robot in real time while simultaneously generating a coherent map of surrounding landmarks such as buildings, trees, rocks, and other world features. The problem has been approximately solved using methods like the Particle Filter, Extended Kalman Filter (EKF), covariance intersection, and GraphSLAM. SLAM enables accurate mapping where GPS localization is unavailable, such as indoor spaces.
Understandably, SLAM is the core algorithm used in robot navigation, autonomous cars, robotic mapping, virtual reality, augmented reality, and so on. If we can do robot localization on an RPi, it becomes easy to build a driving or walking robot that can ply indoors autonomously.
To do SLAM, I have integrated the RPi with an RPLidar A1 M8, running on a 5V 3A battery. SLAM generates a 2D LIDAR point-cloud map of the surroundings to assist navigation or to generate a floor map. Finally, the LIDAR point-cloud map is visualized on a remote machine using MQTT.
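The core of that SLAM loop can be sketched as follows with BreezySLAM and the rplidar driver; the serial port, map size, and single-threaded loop are simplifying assumptions (the actual code runs the LIDAR reads in a separate thread, as noted in the References).

```python
# Minimal SLAM loop sketch: RPLidar A1 scans feed BreezySLAM's RMHC_SLAM.
from breezyslam.algorithms import RMHC_SLAM
from breezyslam.sensors import RPLidarA1
from rplidar import RPLidar

MAP_SIZE_PIXELS = 500
MAP_SIZE_METERS = 10

lidar = RPLidar("/dev/ttyUSB0")                        # assumed device path
slam = RMHC_SLAM(RPLidarA1(), MAP_SIZE_PIXELS, MAP_SIZE_METERS)
mapbytes = bytearray(MAP_SIZE_PIXELS * MAP_SIZE_PIXELS)

for scan in lidar.iter_scans():
    # Each scan is a list of (quality, angle_degrees, distance_mm) tuples.
    distances = [item[2] for item in scan]
    angles = [item[1] for item in scan]
    slam.update(distances, scan_angles_degrees=angles)

    x_mm, y_mm, theta_degrees = slam.getpos()          # robot pose estimate
    slam.getmap(mapbytes)                              # occupancy map bytes
    # ...publish the pose and map over MQTT / save to the SD card here
```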
Now let's see SLAM in action, deployed on a Raspberry Pi with RPLidar A1 M8, running on a 5V 3A battery. As you can see in the video, the portable unit is taken across different rooms of my house and a real-time trajectory is transmitted to the MQTT server and also stored on the SD card of RPi.
The RPLidar A1 is integrated into an RPi which has BreezySLAM deployed on it
You can see the visualization of the LIDAR point-cloud map and the estimated robot trajectory using PyRoboViz below.
Note that the dense linear point clusters in the LIDAR map represent stable obstacles like walls. Hence, we can use algorithms like the Hough Transform to find the best-fit lines through these clusters and generate floor maps.
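A sketch of that wall extraction is shown below, assuming the occupancy map has been saved as a grayscale image (the filename and all thresholds are illustrative).

```python
# Sketch: fit wall segments on the SLAM occupancy map with the probabilistic
# Hough Transform. Thresholds and the input filename are illustrative.
import cv2
import numpy as np

grid = cv2.imread("slam_map.png", cv2.IMREAD_GRAYSCALE)  # saved occupancy map

# Keep only strongly occupied cells (obstacles are the darker pixels here).
occupied = cv2.inRange(grid, 0, 100)

lines = cv2.HoughLinesP(occupied, rho=1, theta=np.pi / 180,
                        threshold=60, minLineLength=40, maxLineGap=10)
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        print("wall segment:", (x1, y1), "->", (x2, y2))
```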
The source code of this solution is available here
Solution #3: Touch-less Facial Attendance System
Touch-free interaction with public devices has become imperative post-Covid. No wonder facial-recognition-based entry and attendance gadgets are in much demand to replace attendance registers and biometric access control. These embedded devices can be used in big companies, apartment complexes, and institutions, or even to take class attendance.
Face recognition is used to identify the person, while depth information is required to open the door only for those standing near it. Deep-learning-based face recognition gives better accuracy than Haar Cascade, LBPH, HOG, or SVM approaches. Depth can be estimated by LIDAR or by DepthAI platforms like the Luxonis OAK-D, which combine depth perception, object detection, and tracking in one SoM. An HC-SR04 ultrasonic sensor can also be used to detect persons near the door.
Four major steps are involved in this project (a minimal sketch of steps 2 and 3 follows the list):
1. Localize and identify face (Raspberry Pi + Pi Cam + OpenVINO + Movidius)
2. Publish Identity to the server (Use MQTT Pub-Sub for IoT communication)
3. Persist identity and register attendance (MySQL or AWS)
4. Alert security if the person is unidentified (SMS), else open the door.
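Here is a minimal sketch of steps 2 and 3: the edge device publishes the recognized identity over MQTT, and a server-side process registers the attendance in MySQL. The broker address, topic, table schema, and credentials are placeholders.

```python
# Sketch of steps 2 and 3 (hypothetical topic, table, and credentials).
import json
import time
import paho.mqtt.client as mqtt
import mysql.connector

# Step 2: edge device publishes the recognized identity.
client = mqtt.Client()
client.connect("192.168.1.10")                       # assumed broker address
client.publish("attendance/identity",
               json.dumps({"name": "alice", "ts": int(time.time())}))

# Step 3: server-side subscriber persists the attendance record.
db = mysql.connector.connect(host="localhost", user="attendance",
                             password="secret", database="office")
cursor = db.cursor()
cursor.execute(
    "INSERT INTO attendance (name, seen_at) VALUES (%s, FROM_UNIXTIME(%s))",
    ("alice", int(time.time())))
db.commit()
```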
To enhance the alert mechanism, footage of unidentified persons is pushed to security via FFmpeg, and SMS messages about unidentified guests are pushed to the security staff's mobile via Twilio integration.
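The SMS leg of that alert could look like the following; the account SID, auth token, and phone numbers are placeholders.

```python
# Sketch of the Twilio SMS alert to security (all credentials are placeholders).
from twilio.rest import Client

client = Client("ACXXXXXXXXXXXXXXXX", "your_auth_token")
client.messages.create(
    body="Unidentified person at the main door. Footage is being streamed.",
    from_="+15551230000",            # Twilio number (placeholder)
    to="+15559870000")               # security staff's mobile (placeholder)
```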
You can see the system in action, detecting the registered faces and alerting the unrecognized ones to security.
Note that the performance is around 20 FPS with no face in the frame and 10-12 FPS with faces, on a Raspberry Pi 4 with an Intel Movidius NCS 2.
This project can be adapted to multiple scenarios driven by a person, an event, or custom objects. For example, a customized video doorbell can identify a person, event, or vehicle waiting at the door and invoke an SMS, send a video or audio message, or even give an automated response.
The source code of this solution is available here
Solution #4: Smart Cam with Gesture Alarm for Security
Notwithstanding notable advancements in technology, developing economies are still trapped in the clutches of patriarchal evils such as molestation, rape, and crime against women in general. Women are often not allowed to stay back in their workplaces during late hours, nor are they considered safe alone even during the daytime, especially in the developing world. It has become imperative to enable the other half of the population to be safer and more productive. Why not use advancements in technology to arm them with more power?
Imagine a lady sitting alone in a clinic, shop, or office, or isolated elsewhere, who needs urgent rescue. She may not have the opportunity or the liberty to make a call. The surveillance camera should be smart enough to detect her gestures, whether made with a hand or an object, as an SOS signal. Based on the object's depth, a scaling factor can be computed to magnify the gesture and reduce false negatives.
Let's evaluate our choices of whether to use image processing, deep learning, or arithmetic algorithms to analyze the incoming video frames from an SoC with a camera and trigger an alert.
There are three main modules in this project:
a) Localize the object used to trigger the event.
b) Analyze the motion of the object to recognize the signal.
c) Deploy on an SoC to install the gadget in the environment.
This is a tricky problem: we need a cheap solution for easy adoption, but not at the cost of accuracy, since we are dealing with emergencies. Similarly, it is easier to detect a custom object, but it is more useful to signal an emergency with your own body. Watch the project in action.
If a circular motion is detected, messages are pushed to the people concerned and an alarm is triggered. The event trigger is demonstrated by flashing a red light on a Pimoroni Blinkt!, signaled using MQTT messages, and by an SMS to a mobile using Twilio integration.
The above is an efficient and practical solution for gesture detection on edge devices. Note that you can use any object in place of the tennis ball and any gesture other than circular motion; you just need to tweak the mathematical formula accordingly.
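The essence of that formula can be sketched with basic vector algebra: take the tracked object's centroid in successive frames, form displacement vectors, and check that the z-component of consecutive cross products keeps the same sign while the swept angle accumulates to roughly a full turn. The window length and thresholds below are illustrative, not the exact values used in the project.

```python
# Sketch of the vector-algebra gesture check for circular motion.
import numpy as np

def is_circular(points, min_sweep_degrees=270):
    """points: list of (x, y) centroids of the tracked object, oldest first."""
    pts = np.asarray(points, dtype=float)
    if len(pts) < 8:
        return False
    v = np.diff(pts, axis=0)                                  # displacement vectors
    cross = v[:-1, 0] * v[1:, 1] - v[:-1, 1] * v[1:, 0]       # z of cross products
    if not (np.all(cross > 0) or np.all(cross < 0)):
        return False                                          # turning direction flips
    # Accumulate the turning angle between consecutive displacement vectors.
    dot = (v[:-1] * v[1:]).sum(axis=1)
    norms = np.linalg.norm(v[:-1], axis=1) * np.linalg.norm(v[1:], axis=1)
    angles = np.degrees(np.arccos(np.clip(dot / np.maximum(norms, 1e-6), -1, 1)))
    return angles.sum() >= min_sweep_degrees

# Example: twelve centroids roughly on a circle trigger the alarm condition.
theta = np.linspace(0, 2 * np.pi, 12, endpoint=False)
print(is_circular(list(zip(100 + 40 * np.cos(theta), 100 + 40 * np.sin(theta)))))
```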
Alternatively, we can train a hand-gesture classification model and convert it to OpenVINO IR format to deploy on the RPi. Another trained hand-gesture classification model deployed on a Jetson Nano gave nearly 5 FPS, but that is a more expensive solution. To conclude, the vector-algebra model is an efficient, cheap, and generic solution for detecting gestures.
The source code of this solution is available here
Solution #5: Indoor Navigation Assist for Blind & Elderly
It is quite challenging for the blind or elderly to move around, even inside their own house. We can use SLAM techniques demonstrated in Solution #2 to perceive the indoor boundaries in order to give navigational directions.
However, they also need to know more about the things around them in the room. To do this, we may have to custom-train a model on the potential objects and deploy it on an SoC or SoM. You can even deploy the solution on a mobile phone to make it most handy. Training can be done with images taken in the same house to maximize detection accuracy.
The hand-held gadget can be made half the size by connecting the NCS 2 directly to the Pi and using a USB sound card instead of an external speaker. Once deployed on a mobile, however, the solution becomes even handier.
More classes can be trained for detection using your own mobile and announced via the mobile speaker. In brief, custom object detection on edge devices, along with Solution #2, forms the basis of indoor navigational assistance for the elderly. To augment distance alerts, it's ideal to use Depth AI platforms like the Luxonis OAK-D, as LIDAR is not so handy.
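As a small sketch of the announcement step, the helper below turns a detection (label, bounding box, and distance estimate, assumed to come from the custom-trained detector and the depth source) into a spoken alert via the espeak command-line tool.

```python
# Sketch: announce a detected object with rough direction and distance.
import subprocess

FRAME_WIDTH = 640                        # assumed camera frame width in pixels

def direction_of(bbox):
    """Rough direction from the bounding-box centre: left / ahead / right."""
    x_center = (bbox[0] + bbox[2]) / 2
    if x_center < FRAME_WIDTH / 3:
        return "to your left"
    if x_center > 2 * FRAME_WIDTH / 3:
        return "to your right"
    return "ahead of you"

def announce(label, bbox, distance_m):
    text = f"{label} {direction_of(bbox)}, about {distance_m:.1f} meters away"
    subprocess.run(["espeak", text])     # speaks through the speaker / USB sound card

announce("chair", (80, 200, 220, 400), 1.4)
```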
The source code of this project is available here
Solution #6: Mathematical Hacks for Edge Cases
We can employ clever mathematical hacks to assist or partially replace AI/ML models to make it feasible to run the solution on the Edge. Let me explicate the idea with two useful Edge use-cases.
Soln #6A: Security Barrier Cam using Shape Context
This idea is not limited to the security barrier use case. In an ample number of machine vision problems, we need the robot or gadget to be able to read text and behave based on it. But how does a robot read from images?
We can use deep learning techniques like EAST, CRAFT, and PyTesseract to do OCR. But can we do it more efficiently? Yes: around 15-20x faster, using a mathematical descriptor known as Shape Context, which uses log-polar histograms to encode relative shape information.
This technique can do OCR efficiently in machine vision problems. Take the case of a security barrier cam, where the device has to identify a vehicle from its number plate as it approaches, in order to decide whether or not to open the barrier. The proximity of the vehicle can be sensed using an ultrasonic sensor or from the image itself, using a Depth AI device such as the Luxonis OAK-D.
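To make the idea concrete, here is a minimal numpy sketch of the descriptor itself: for each sampled contour point, build a log-polar histogram of where the other points lie. The bin counts and normalization are illustrative choices, and the matching stage (e.g. a chi-square cost with point assignment) is omitted.

```python
# Minimal sketch of the Shape Context descriptor: a log-polar histogram of
# the positions of all other contour points, relative to each point.
import numpy as np

def shape_context(points, n_radial_bins=5, n_angle_bins=12):
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]              # pairwise vectors
    dist = np.linalg.norm(diff, axis=2)
    angle = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # Normalize distances by their mean for scale invariance, then take the log.
    mean_dist = dist[dist > 0].mean()
    with np.errstate(divide="ignore"):
        log_r = np.log(dist / mean_dist)

    r_edges = np.linspace(np.log(0.125), np.log(2.0), n_radial_bins + 1)
    a_edges = np.linspace(0, 2 * np.pi, n_angle_bins + 1)

    descriptors = np.zeros((n, n_radial_bins, n_angle_bins))
    for i in range(n):
        mask = np.arange(n) != i                           # skip the point itself
        hist, _, _ = np.histogram2d(log_r[i, mask], angle[i, mask],
                                    bins=[r_edges, a_edges])
        descriptors[i] = hist / max(mask.sum(), 1)         # normalized histogram
    return descriptors

# Example: descriptors for a tiny synthetic contour.
contour = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 1.5)]
print(shape_context(contour).shape)                        # (5, 5, 12)
```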
Note that the solution is particularly suitable for hardware deployments that need to read from images efficiently. See a real-world implementation of the security barrier cam use case powered by Shape Context.
Every solution has pros and cons, though. While this solution is very fast, it can be error-prone if the lighting is poor. Given proper conditions, however, you can even build a multilingual OCR on the Edge using Shape Context. The appealing features of this approach are its simplicity, speed, and language-agnosticism.
The source code of this project is available here
Soln #6B: Monocular Depth - Social Distance Tracker
Monocular depth estimation is the idea of estimating scene depth from a single image. Deep-learning-based techniques can exploit subtle visual cues in the image along with prior knowledge.
Inspired by the paper 'Towards Robust Monocular Depth Estimation' [16], a "midasnet" model can be trained to produce a disparity map for a given input image. This is an interesting idea, as it lets us solve Depth AI use cases using existing monocular camera installations, or where stereo cams are not feasible. This opens up avenues to augment existing camera installations with smart features without any additional cost.
Since the MYRIAD device doesn't support some layers of the monocular depth estimation model, the model needs to be hosted on a remote server behind an API. However, as no hardware change is required, this is an ultra-cheap way to add depth smartness to existing vision installations on the Edge.
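As a sketch, the remote side could look like the snippet below, using the MiDaS small model from PyTorch Hub as a stand-in for the trained midasnet and a minimal Flask endpoint; the endpoint path and response format are illustrative.

```python
# Sketch: hosting monocular depth estimation behind a small HTTP API.
import cv2
import numpy as np
import torch
from flask import Flask, jsonify, request

app = Flask(__name__)
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")   # stand-in model
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

@app.route("/depth", methods=["POST"])
def depth():
    # The edge device POSTs a JPEG frame; the server returns a coarse disparity grid.
    data = np.frombuffer(request.data, dtype=np.uint8)
    img = cv2.cvtColor(cv2.imdecode(data, cv2.IMREAD_COLOR), cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        pred = midas(transform(img))
        disparity = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False).squeeze().numpy()
    # Downsample so the JSON response stays small for the edge device.
    return jsonify(disparity=cv2.resize(disparity, (32, 24)).tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```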
Power Considerations and Hardware Alternatives
To curtail personal cost, I decided to use the hardware already at hand to build multiple solutions. LIDAR was the only device I had that could cater to all the above solutions by augmenting them with depth estimates. Moreover, the aim here was to demonstrate the power of Depth AI with as many use cases as possible.
However, for a specific use case, proximity sensors may be enough, or Depth AI hardware like the Luxonis OAK-D may be even better, as it can perceive depth and run neural inference at low power. The above solutions can easily be ported to such SoCs to minimize power consumption. From experiments, the OpenMV Cam, Luxonis OAK-D, and Pico4ML were found to be better alternatives for individual Depth AI use cases. A detailed analysis of power consumption and hardware alternatives for deploying the above solutions is given in the "Project Log" section.
Use Case Specific Power Considerations
- All the above solutions run on portable, battery-operated devices (10,000 mAh).
- We have replaced ML with math models to minimize power consumption. This optimization helped fetch 50+ hours of uptime for the Gesture Cam and Security Barrier solutions.
- For integration purposes, the ADAS gadget can also be connected to the car battery, without any noticeable impact on load (5 W). But this is not required, as the device can run for around 10 hours on a single charge.
- We have also seen the indoor navigational assistance solution deployed on an Android mobile. The mobile device can last 12-24 hours or more, depending on its battery.
- We can utilize the low power mode on devices like OpenMV Cam, controlled by proximity sensors, to further minimize power consumption.
- In terms of power efficiency and ease of use, Depth AI platforms like the OpenMV Cam, Luxonis OAK-D, and Pico4ML are better device choices, especially for specific use cases like ADAS-CAS or indoor navigation.
- In the Gesture Cam use case, the Pico4ML achieves the lowest power usage, while the OpenMV Cam balances power efficiency and real-world usability. For security applications too, the OpenMV Cam is a suitable choice, as it supports frame differencing in hardware.
Summary: Industry-Grade Problems & Solutions
Many more industry-grade problems can be solved using combinations of the solutions demonstrated above. We can measure driver performance based on feedback from the ADAS-CAS device; this is useful in e-scooter and micro-mobility services, to keep people from riding rented e-scooters recklessly. Further, workers in big industrial plants can send commands or request help using the mathematical gesture models integrated with Twilio.
More generally, we can use custom object detection to detect any object and automate tasks. For instance, a door-access solution can open the door only for in-house pets, or refuse entry to people not wearing a face mask. Depth-based collision avoidance is useful to alert workers on a worksite. A very useful industrial solution could be detecting people not wearing helmets at worksites and warning them for their safety.
There are scores of use cases where a machine vision solution needs to read letters and numbers to perform a task; we have solved that efficiently with the idea of Shape Context. You can use SLAM-based indoor navigation to deploy autonomous buggies in closed environments like airports, warehouses, or industrial plants. SLAM can be combined with indoor navigational assistance and gesture recognition to build a personal-assistance gadget that helps the blind and elderly move around the house. It is left to your creativity to solve more meaningful industrial use cases using the Depth AI solutions demonstrated above.
In this project, we have seen the potential of Depth AI, its application across multiple domains, and some power-efficient tweaks and hardware variations. A unique contribution of this project is the emphasis on mathematical models that assist and augment AI/ML models so they can run efficiently in resource- and power-constrained environments.
References
1. Breezy Implementation of SLAM: https://github.com/simondlevy/BreezySLAM
Note: I contacted the author, Prof. Simon D. Levy, CSE Dept., Washington and Lee University, regarding a LIDAR scan data error in his code during visualization. I fixed the code by separating the LIDAR scan code into a separate thread and implementing inter-thread communication, and he acknowledged the fix.
2. LIDAR Point Cloud Visualization: https://github.com/simondlevy/PyRoboViz
3. Udacity Computer Vision Nanodegree: https://www.udacity.com/course/computer-vision-nanodegree--nd891
4. LIDAR-Camera Sensor Fusion High Level: https://www.thinkautonomous.ai/blog/?p=lidar-and-camera-sensor-fusion-in-self-driving-cars
5. OpenVINO Face Recognition Module: https://docs.openvinotoolkit.org/latest/omz_demos_face_recognition_demo_python.html
6. LIDAR data scan code stub by Adafruit: https://learn.adafruit.com/remote-iot-environmental-sensor/code
7. Book: University Physics I - Classical Mechanics. The Cross Product and Rotational Quantities. https://phys.libretexts.org.
8. Camera Calibration and Intrinsic Matrix Estimation: https://www.cc.gatech.edu/classes/AY2016/cs4476_fall/results/proj3/html/agartia3/index.html
9. Visual Fusion For Autonomous Cars Course at PyImageSearch University: https://www.pyimagesearch.com/pyimagesearch-university/
10. LIDAR Distance Estimation: https://en.wikipedia.org/wiki/Lidar
11. Anand P. V, Karthik K. "Multilingual Text Inversion Detection using Shape Context" IEEE Region 10 Symposium (TENSYMP) 2021
12. Belongie, Serge, Jitendra Malik, and Jan Puzicha. "Shape matching and object recognition using shape contexts." IEEE transactions on pattern analysis and machine intelligence 24.4 (2002): 509-522.
13. RPLIDAR A1 M8 Hardware Specification: https://www.generationrobots.com/media/rplidar-a1m8-360-degree-laser-scanner-development-kit-datasheet-1.pdf
14. Model Training, Data Cleansing & Augmentation: www.roboflow.com
15. Power Consumption: Raspberry Pi, Pi Models & Pi Cam. https://raspi.tv/2019/how-much-power-does-the-pi4b-use-power-measurements and https://www.pidramble.com/wiki/benchmarks/power-consumption
16. Ranftl, René, et al. "Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer." arXiv:1907.01341 (2019).
Note: The Indian traffic sign model has been trained using the Traffic dataset offered by Datacluster Labs, India.