-
Building an Autonomous Person Following System with Computer Vision
08/15/2025 at 14:47 • 0 comments

When your robot becomes your shadow – implementing intelligent person following with object detection and ROS 2
Imagine a robot that follows you around like a loyal companion, maintaining the perfect distance whether you're walking through a warehouse, giving a facility tour, or need hands-free assistance. While gesture and pose control are great for direct commands, sometimes you want your robot to simply tag along autonomously. That's exactly what we've built into GestureBot – a standalone person following system that transforms any detected person into a moving target for smooth, intelligent pursuit.
The Appeal of Autonomous Following
Person following robots aren't just cool demos – they solve real problems. Consider a hospital robot carrying supplies that needs to follow a nurse through rounds, a security robot accompanying a guard on patrol, or a service robot helping someone navigate a large facility. In these scenarios, constant manual control becomes tedious and impractical.
The key insight is that following behavior should be completely autonomous once activated. No gestures, no poses, no commands – just intelligent tracking that maintains appropriate distance while handling the inevitable challenges of real-world environments: people walking behind obstacles, multiple individuals in the scene, varying lighting conditions, and the need for smooth, non-jerky motion that won't startle or annoy.
Leveraging Existing Object Detection Infrastructure
Rather than building a specialized person tracking system from scratch, we cleverly repurpose GestureBot's existing object detection capabilities. The system already runs MediaPipe's EfficientDet model at 5 FPS, detecting 80 different object classes including people with confidence scores and precise bounding boxes.
This architectural decision provides several advantages: proven stability, existing performance optimizations, and the ability to simultaneously track people and obstacles. The object detection system publishes to /vision/objects, providing a stream of detected people that our following controller can consume.

```
# Object detection provides person detections like this:
DetectedObject {
  class_name: "person"
  confidence: 0.76
  bbox_x: 145        # Top-left corner
  bbox_y: 89
  bbox_width: 312    # Bounding box dimensions
  bbox_height: 387
}
```

The person following controller subscribes to this stream and implements sophisticated logic to select, track, and follow the most appropriate person in the scene.
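To make that data flow concrete, here is a minimal sketch of how a follower node might consume the stream. The message package and type names (`gesturebot_msgs`, `DetectedObjectArray` with an `objects` field) are assumptions for illustration; the actual interfaces live in the GestureBot codebase.

```python
import rclpy
from rclpy.node import Node

# Assumed message package/type for illustration only
from gesturebot_msgs.msg import DetectedObjectArray


class PersonStreamListener(Node):
    """Minimal consumer of /vision/objects that keeps only person detections."""

    def __init__(self):
        super().__init__('person_stream_listener')
        self.subscription = self.create_subscription(
            DetectedObjectArray, '/vision/objects', self.on_objects, 10)

    def on_objects(self, msg):
        # Filter the 80-class detection stream down to people
        people = [d for d in msg.objects if d.class_name == 'person']
        if people:
            self.get_logger().info(f'{len(people)} person(s) in view')


def main():
    rclpy.init()
    node = PersonStreamListener()
    rclpy.spin(node)
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```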
Smart Person Selection: More Than Just "Pick the Biggest"
When multiple people appear in the camera view, the system needs to intelligently choose who to follow. Our selection algorithm uses a weighted scoring system that considers three key factors:
Size Score (40% weight): Larger bounding boxes typically indicate closer people or those more prominently positioned in the scene. This naturally biases toward the person most likely intended as the target.
Center Score (30% weight): People closer to the image center are preferred, following the reasonable assumption that users position themselves centrally when activating following mode.
Confidence Score (30% weight): Higher detection confidence indicates more reliable tracking, reducing the chance of following false positives or poorly detected individuals.
```python
def select_initial_target(self, people):
    scored_people = []
    for person in people:
        # Normalize bounding box to 0-1 coordinates
        size_score = (person.bbox_width * person.bbox_height) / (640 * 480)
        center_x = (person.bbox_x + person.bbox_width / 2) / 640
        center_score = 1.0 - abs(center_x - 0.5) * 2
        total_score = (size_score * 0.4 +
                       center_score * 0.3 +
                       person.confidence * 0.3)
        scored_people.append((person, total_score))
    return max(scored_people, key=lambda x: x[1])[0]
```

Once a target is selected, the system maintains tracking continuity by matching people across frames based on position prediction, preventing erratic switching between similar individuals.
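The matching step itself isn't shown in the post; a minimal sketch of position-based continuity might look like the following, where the match threshold (0.15 in normalized image coordinates) is an illustrative value rather than the tuned one.

```python
def match_existing_target(self, people, predicted_center_x, max_offset=0.15):
    """Prefer the detection closest to where the tracked person is expected to be.

    predicted_center_x and max_offset are normalized [0, 1] image coordinates;
    the threshold here is illustrative, not the tuned project value.
    """
    best_person, best_offset = None, max_offset
    for person in people:
        center_x = (person.bbox_x + person.bbox_width / 2) / 640
        offset = abs(center_x - predicted_center_x)
        if offset < best_offset:
            best_person, best_offset = person, offset
    # Returning None signals that the target was lost this frame
    return best_person
```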
Distance Estimation: Computer Vision Meets Physics
Estimating distance from a single camera image is a classic computer vision challenge. Without stereo vision or depth sensors, we rely on the relationship between object size and distance – larger bounding boxes generally indicate closer people.
After extensive calibration with real-world measurements, we developed an empirical mapping between normalized bounding box area and estimated distance:
```python
def estimate_distance(self, person):
    # Convert pixel coordinates to normalized area
    bbox_area = (person.bbox_width * person.bbox_height) / (640 * 480)

    # Empirical distance mapping calibrated for average person height
    if bbox_area > 0.55:      # 55%+ of image = very close
        return 1.0            # ~1 meter
    elif bbox_area > 0.35:    # 35-55% = near target distance
        return 1.5            # ~1.5 meters (target)
    elif bbox_area > 0.20:    # 20-35% = medium distance
        return 2.2            # ~2.2 meters
    # ... additional ranges up to 7 meters
```

This approach works surprisingly well for typical indoor environments, though it assumes average human height and doesn't account for unusual poses. To improve stability, we apply a 3-point moving average that smooths out frame-to-frame variations in bounding box size.
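The 3-point moving average can be as simple as the sketch below; the class and method names are illustrative, not taken from the project code.

```python
from collections import deque


class DistanceSmoother:
    """Smooth frame-to-frame distance estimates with a short moving average."""

    def __init__(self, window=3):
        self.samples = deque(maxlen=window)

    def update(self, raw_distance):
        self.samples.append(raw_distance)
        return sum(self.samples) / len(self.samples)
```

Feeding each raw estimate through `update()` damps single-frame jumps in bounding-box size without adding noticeable lag at 5 FPS.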
Following Behavior: The Art of Robotic Companionship
The core following logic implements two simultaneous control loops: distance maintenance and person centering. The distance controller calculates the error between current estimated distance and the target distance (1.5 meters), applying proportional control to generate forward/backward velocity commands.
The centering controller keeps the person positioned in the center of the camera view by calculating the horizontal offset and generating appropriate angular velocity commands. This dual-axis control creates natural following behavior that maintains both proper distance and orientation.
```python
def calculate_following_command(self, person, estimated_distance):
    # Distance control (linear velocity)
    distance_error = estimated_distance - self.target_distance  # 1.5m target
    linear_velocity = distance_error * 0.8                      # Control gain
    linear_velocity = max(-0.3, min(0.3, linear_velocity))      # Clamp limits

    # Centering control (angular velocity)
    center_x = (person.bbox_x + person.bbox_width / 2) / 640
    center_error = center_x - 0.5                               # 0.5 = image center
    angular_velocity = -center_error * 1.5                      # Control gain
    angular_velocity = max(-0.8, min(0.8, angular_velocity))    # Clamp limits

    return {'linear_x': linear_velocity, 'angular_z': angular_velocity}
```

Smooth Motion Through Advanced Control
Raw velocity commands would create jerky, uncomfortable robot motion. Our solution implements a sophisticated velocity smoothing system running at 25 Hz – much faster than the 10 Hz control calculation rate. This high-frequency loop applies acceleration limiting to gradually ramp velocities up and down.
The acceleration limits (1.0 m/s² linear, 2.0 rad/s² angular) are carefully tuned for the Raspberry Pi 5's processing capabilities while ensuring smooth motion. A typical acceleration from rest to 0.3 m/s takes about 0.3 seconds over 7-8 control cycles, creating natural-feeling motion that won't startle users or cause mechanical stress.
Critical to the system's success is preventing rapid target switching. The control system holds velocity commands for a minimum of 500ms, preventing oscillations that would occur if the robot constantly changed direction based on minor detection variations.
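A condensed sketch of how those pieces fit together is shown below. The class and helper names are illustrative; the 25 Hz rate, the 1.0 m/s² linear limit, and the 500 ms hold come from the description above, and only the linear axis is shown for brevity.

```python
import time


class VelocitySmoother:
    """25 Hz ramp toward the latest target velocity, with a minimum hold time."""

    def __init__(self, max_accel=1.0, min_hold_s=0.5, rate_hz=25.0):
        self.max_accel = max_accel        # m/s^2, linear axis only in this sketch
        self.min_hold_s = min_hold_s      # suppress rapid target switching
        self.dt = 1.0 / rate_hz
        self.current = 0.0
        self.target = 0.0
        self.last_target_change = 0.0

    def set_target(self, new_target):
        # Ignore new commands that arrive before the hold period expires
        now = time.monotonic()
        if now - self.last_target_change >= self.min_hold_s:
            self.target = new_target
            self.last_target_change = now

    def step(self):
        # Move toward the target by at most max_accel * dt per 25 Hz cycle
        max_change = self.max_accel * self.dt
        error = self.target - self.current
        self.current += max(-max_change, min(max_change, error))
        return self.current
```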
Safety First: Multiple Protection Layers
Autonomous following requires robust safety systems. Our implementation includes several protection mechanisms:
Distance Limits: The robot won't approach closer than 0.8 meters (safety zone) and stops following if the person exceeds 5.0 meters (lost target). These limits prevent uncomfortable crowding and endless pursuit of distant figures.
Timeout Protection: If no person is detected for 3 seconds, the system automatically deactivates following mode and stops the robot. This handles cases where the target leaves the camera view or detection fails.
Emergency Override: The system monitors an /emergency_stop topic, immediately halting motion if any other system component detects a problem.

Backward Motion Safety: When a person gets too close, the robot smoothly backs away rather than stopping abruptly, maintaining comfortable interpersonal distance.
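Taken together, these layers amount to a small gate in front of every velocity command. A sketch under the stated limits (0.8 m safety zone, 5.0 m lost-target, 3 s timeout) might look like this; the function name, attribute names, and the gentle -0.1 m/s backing speed are illustrative.

```python
def apply_safety_limits(self, command, estimated_distance,
                        seconds_since_detection, emergency_stop_active):
    """Return a safe velocity command, or a full stop if any protection triggers."""
    stop = {'linear_x': 0.0, 'angular_z': 0.0}

    # Emergency override and detection timeout always win
    if emergency_stop_active or seconds_since_detection > 3.0:
        return stop

    # Lost target: beyond 5.0 m we stop following rather than chase
    if estimated_distance > 5.0:
        return stop

    # Safety zone: inside 0.8 m only allow gentle backward motion
    if estimated_distance < 0.8:
        return {'linear_x': min(command['linear_x'], -0.1), 'angular_z': 0.0}

    return command
```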
ROS 2 Architecture: Clean and Modular
The person following system integrates seamlessly with ROS 2's publish-subscribe architecture. The person_following_controller node subscribes to /vision/objects from the existing object detection system and publishes velocity commands to /cmd_vel for the robot base.

Activation and deactivation happen through a simple service interface:

```bash
# Activate person following mode
ros2 service call /follow_mode/activate std_srvs/srv/SetBool "data: true"

# Deactivate following mode
ros2 service call /follow_mode/activate std_srvs/srv/SetBool "data: false"

# Monitor following behavior
ros2 topic echo /cmd_vel
```

This modular design means the following system can work with any robot platform that accepts standard ROS 2 velocity commands, making it broadly applicable beyond the original GestureBot hardware.
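On the controller side, that interface is a standard SetBool service. A minimal sketch of the server is shown below; the handler name, the `follow_active` flag, and the `publish_stop()` helper are assumptions for illustration.

```python
from std_srvs.srv import SetBool

# Inside the person_following_controller node's __init__ (sketch):
#   self.follow_active = False
#   self.activate_srv = self.create_service(
#       SetBool, '/follow_mode/activate', self.handle_activate)


def handle_activate(self, request, response):
    """Toggle following mode; stop the robot whenever following is disabled."""
    self.follow_active = request.data
    if not self.follow_active:
        self.publish_stop()  # assumed helper that publishes a zero Twist
    response.success = True
    response.message = 'following enabled' if request.data else 'following disabled'
    return response
```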
Real-World Performance: Lessons from Testing
In practice, the system performs remarkably well in typical indoor environments. Response time from person detection to robot motion is under 200ms, creating immediate feedback that feels natural. The robot maintains the target 1.5-meter distance with ±0.3-meter accuracy under normal conditions.
The biggest challenges come from environmental factors: people walking behind furniture create temporary occlusions that the system handles by maintaining last-known velocity until the person reappears. Multiple people in the scene occasionally cause target switching, though our scoring algorithm minimizes this issue.
Lighting variations affect object detection confidence but rarely cause complete tracking failure. The system gracefully degrades by reducing following speed when confidence drops, rather than stopping entirely.
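That degradation can be as simple as scaling the commanded speed by detection confidence; the function below is a sketch with illustrative thresholds, not the project's tuned values.

```python
def scale_speed_by_confidence(linear_velocity, confidence,
                              nominal_conf=0.6, floor=0.4):
    """Reduce following speed as detection confidence drops, without stopping outright."""
    factor = max(floor, min(1.0, confidence / nominal_conf))
    return linear_velocity * factor
```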
Hardware Performance: Pi 5 Delivers
Running on a Raspberry Pi 5 with 8GB RAM, the complete system (object detection + person following + ROS 2 navigation) consumes approximately 70-75% CPU at steady state. This leaves headroom for additional processing while maintaining stable performance.
The Pi Camera Module 3 provides sufficient image quality for reliable person detection at 640x480 resolution. Higher resolutions improve detection accuracy but reduce frame rate – the current configuration strikes an optimal balance for real-time following behavior.
Power consumption remains reasonable at ~8W total system power, making battery operation feasible for mobile applications.
Comparison with Other Control Methods
Person following complements rather than replaces gesture and pose control. Each method has its ideal use cases:
Gesture Control: Best for precise, intentional commands when the user wants direct robot control

Pose Control: Ideal for hands-free operation when gestures aren't practical

Person Following: Perfect for autonomous companionship when continuous manual control would be tedious
The beauty of the modular architecture is that users can switch between modes as needed, or even combine them – imagine a robot that follows you autonomously but responds to gesture overrides for specific actions.
Future Directions: Beyond Basic Following
The current implementation provides a solid foundation for more sophisticated behaviors. Future enhancements could include:
- Outdoor Operation: Adapting distance estimation for varying lighting and longer ranges
- Multi-Person Scenarios: Following groups or switching between designated individuals
- Predictive Tracking: Using motion prediction to handle temporary occlusions more gracefully
- Sensor Fusion: Integrating lidar or depth cameras for more accurate distance measurement
- Social Awareness: Adjusting following distance based on environmental context and social norms
-
4-Pose Navigation with MediaPipe
08/15/2025 at 14:41 • 0 comments

When hand gestures aren't enough, your whole body becomes the remote control
We've all been there – trying to control a robot with hand gestures while your hands are full, wearing gloves, or when lighting conditions make finger detection unreliable. What if your robot could understand your intentions through simple body poses instead? That's exactly what we've implemented in the latest iteration of GestureBot, a Raspberry Pi 5-powered robot that now responds to four distinct body poses for intuitive navigation control.
Why Body Poses Beat Hand Gestures
While hand gesture recognition is impressive, it has practical limitations. Gestures require clear hand visibility, specific lighting conditions, and can be ambiguous when multiple people are present. Body poses, on the other hand, are larger, more distinctive, and work reliably even when hands are obscured or busy with other tasks.
Consider a warehouse worker guiding a robot while carrying boxes, or a surgeon directing a medical robot while maintaining sterile conditions. Full-body pose detection opens up robotics applications where traditional gesture control falls short.
The Technical Foundation: MediaPipe Pose Detection
At the heart of our system lies Google's MediaPipe Pose Landmarker, which provides real-time detection of 33 body landmarks covering the entire human skeleton – from head to toe. Running on a Raspberry Pi 5 with 8GB RAM and a Pi Camera Module 3, we achieve stable 3-7 FPS pose detection at 640x480 resolution.
The MediaPipe model excels at tracking key body points including shoulders, elbows, wrists, hips, and the torso center. What makes this particularly powerful for robotics is the consistency of landmark detection even with partial occlusion or varying lighting conditions.
```python
# Core MediaPipe configuration optimized for Pi 5
pose_landmarker_options = {
    'base_options': BaseOptions(model_asset_path='pose_landmarker.task'),
    'running_mode': VisionRunningMode.LIVE_STREAM,
    'num_poses': 2,  # Track up to 2 people
    'min_pose_detection_confidence': 0.5,
    'min_pose_presence_confidence': 0.5,
    'min_tracking_confidence': 0.5
}
```

Simplicity Through Four Poses
After experimenting with complex pose vocabularies, we settled on four reliable poses that provide comprehensive robot control:
🙌 Arms Raised (Forward Motion): Both arms extended upward above shoulder level triggers forward movement at 0.3 m/s. This pose is unmistakable and feels natural for "go forward."
👈 Pointing Left (Turn Left): Left arm extended horizontally while right arm remains down commands a left turn at 0.8 rad/s. The asymmetry makes this pose highly distinctive.
👉 Pointing Right (Turn Right): Mirror of the left turn – right arm extended horizontally triggers rightward rotation.
🤸 T-Pose (Emergency Stop): Both arms extended horizontally creates the universal "stop" signal, immediately halting all robot motion.
The pose classification algorithm analyzes shoulder and wrist positions relative to the torso center, using angle calculations and position thresholds to distinguish between poses:
```python
def classify_pose(self, landmarks):
    # Extract key landmarks
    left_shoulder = landmarks[11]
    right_shoulder = landmarks[12]
    left_wrist = landmarks[15]
    right_wrist = landmarks[16]

    # Calculate arm angles relative to shoulders
    left_arm_angle = self.calculate_arm_angle(left_shoulder, left_wrist)
    right_arm_angle = self.calculate_arm_angle(right_shoulder, right_wrist)

    # Classify based on arm positions
    if left_arm_angle > 60 and right_arm_angle > 60:
        return "arms_raised"
    elif abs(left_arm_angle) < 30 and abs(right_arm_angle) < 30:
        return "t_pose"
    # ... additional classification logic
```

ROS 2 Integration: From Pose to Motion
The system architecture follows a clean pipeline: pose detection → classification → navigation commands → smooth motion control. Built on ROS 2 Jazzy, the implementation uses three main components:
Pose Detection Node: Processes camera frames through MediaPipe and publishes 33-point landmark data and classified pose actions to the /vision/poses topic.

Pose Navigation Bridge: Subscribes to pose classifications and converts them to velocity commands published on /cmd_vel. This node implements the critical safety and smoothing logic.

Velocity Smoothing System: Perhaps the most important component for real-world deployment, this 25 Hz control loop applies acceleration limiting to prevent jerky robot motion that could cause instability or discomfort.

```bash
# Launch the complete 4-pose navigation system
ros2 launch gesturebot pose_detection.launch.py
ros2 launch gesturebot pose_navigation_bridge.launch.py

# View real-time pose detection with skeleton overlay
ros2 launch gesturebot image_viewer.launch.py \
  image_topics:='["/vision/pose/annotated"]'
```

The navigation bridge includes multiple safety layers: pose confidence thresholds (0.7 minimum), timeout protection (2-second auto-stop), and velocity limits that prevent dangerous accelerations. If no valid pose is detected for two seconds, the robot automatically stops – a crucial safety feature for real-world deployment.
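A compressed sketch of the bridge's core callback and timeout check is shown below. The pose names, 0.7 confidence threshold, 2-second timeout, and velocity values follow the description above; the class structure and the injected `set_target_velocity` helper are assumptions for illustration.

```python
POSE_VELOCITIES = {
    'arms_raised':    (0.3, 0.0),    # forward at 0.3 m/s
    'pointing_left':  (0.0, 0.8),    # turn left at 0.8 rad/s
    'pointing_right': (0.0, -0.8),   # turn right at 0.8 rad/s
    't_pose':         (0.0, 0.0),    # emergency stop
}


class PoseNavigationBridgeSketch:
    """Core of the bridge: map classified poses to velocities, stop on timeout."""

    POSE_TIMEOUT_S = 2.0
    MIN_CONFIDENCE = 0.7

    def __init__(self, set_target_velocity):
        # set_target_velocity feeds the 25 Hz smoother (injected helper)
        self.set_target_velocity = set_target_velocity
        self.last_pose_time = 0.0

    def on_pose(self, pose_name, confidence, now):
        # Ignore low-confidence or unknown classifications entirely
        if confidence < self.MIN_CONFIDENCE or pose_name not in POSE_VELOCITIES:
            return
        self.last_pose_time = now
        self.set_target_velocity(*POSE_VELOCITIES[pose_name])

    def on_timer(self, now):
        # Auto-stop if no valid pose has been seen within the timeout window
        if now - self.last_pose_time > self.POSE_TIMEOUT_S:
            self.set_target_velocity(0.0, 0.0)
```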
Hardware: Raspberry Pi 5 Proves Its Worth
The Raspberry Pi 5 represents a significant leap in embedded AI capability. With its ARM Cortex-A76 quad-core processor and 8GB RAM, it handles MediaPipe pose detection while simultaneously running ROS 2 navigation, camera processing, and system monitoring. The Pi Camera Module 3's 12MP sensor with autofocus provides the image quality needed for reliable landmark detection.
Power consumption remains reasonable at ~8W total system power, making this suitable for battery-powered mobile robots. We've found that active cooling is beneficial for sustained operation, but not strictly necessary for typical use cases.
Real-World Performance and Applications
In practice, the 4-pose system feels remarkably natural. The poses are intuitive enough that new users can control the robot within minutes without training. Response time from pose detection to robot motion is under 200ms, providing immediate feedback that makes the interaction feel responsive.
The system shines in scenarios where traditional interfaces fail:
- Hands-free operation: Control robots while carrying objects or wearing protective equipment
- Distance control: Operate robots from across a room where gesture details would be invisible
- Multi-user environments: Body poses are less likely to trigger false positives from background activity
- Industrial applications: Robust operation in challenging lighting or environmental conditions
We've tested the system with users of varying heights and body types, finding consistent performance across different demographics. The pose classification algorithms adapt well to individual differences in arm length and posture.
The Code: Open Source and Ready to Deploy
The entire implementation is open source and built with reproducibility in mind. The modular ROS 2 architecture means you can easily integrate pose control into existing robot platforms or extend the system with additional poses.
Key configuration parameters are exposed through launch files, allowing fine-tuning for different robot platforms:
```yaml
pose_navigation_bridge:
  ros__parameters:
    pose_confidence_threshold: 0.7
    max_linear_velocity: 0.3    # m/s
    max_angular_velocity: 0.8   # rad/s
    pose_timeout: 2.0           # seconds
    motion_smoothing_enabled: true
```

The GestureBot project continues to evolve, with pose detection joining gesture recognition and autonomous person following as part of a comprehensive vision-based robotics platform. Each modality has its place, and together they're building toward more adaptable and intuitive robot companions.
-
Gesture-Based Navigation
08/14/2025 at 15:54 • 0 comments

Gesture-controlled robotics represents a compelling intersection of computer vision, human-robot interaction, and real-time motion control. I developed GestureBot as a comprehensive system that translates hand gestures into precise robot movements, addressing the unique challenges of responsive detection, mechanical stability, and modular architecture design.
The project tackles several technical challenges inherent in gesture-controlled navigation: achieving sub-second response times while maintaining detection stability, preventing mechanical instability in tall robot form factors through acceleration limiting, and creating a modular architecture that supports future multi-modal integration. My implementation demonstrates how MediaPipe's gesture recognition capabilities can be effectively integrated with ROS2 navigation systems to create a responsive, stable, and extensible robot control platform.
System Architecture
I designed GestureBot with a modular architecture that separates gesture detection from motion control, enabling flexible deployment and future expansion. The system consists of two primary components connected through ROS2 topics:
Core Components
Gesture Recognition Module: Handles camera input and MediaPipe-based gesture detection, publishing stable gesture results to /vision/gestures. This module operates independently and can function without the motion control system for testing and development.

Navigation Bridge Module: Subscribes to gesture detection results and converts them into smooth robot motion commands published to /cmd_vel. This separation allows the navigation bridge to potentially receive input from multiple detection sources in future implementations.

Data Flow Architecture
```
Camera Input → MediaPipe Processing → Gesture Stability Filtering → /vision/gestures
                                                                          ↓
   /cmd_vel ← Acceleration Limiting ← Velocity Smoothing ← Motion Mapping ←──┘
```

The modular design enables independent operation of components. I can run gesture detection without motion control for development, or use external gesture sources with the navigation bridge. This architecture prepares the system for Phase 4 multi-modal integration where object detection and pose estimation will feed into the same navigation bridge.
Launch File Structure
I implemented separate launch files for each component:
- gesture_recognition.launch.py: Camera and gesture detection only
- gesture_navigation_bridge.launch.py: Motion control and navigation logic
- Future: multi_modal_navigation.launch.py: Integrated multi-modal system
This separation provides deployment flexibility and simplifies parameter management for different robot configurations.
Technical Implementation
MediaPipe Integration for Hand Gesture Detection
I integrated MediaPipe's gesture recognition model using a controller-based architecture that handles the MediaPipe lifecycle independently from ROS2 infrastructure. The implementation uses MediaPipe's LIVE_STREAM mode with asynchronous processing for optimal performance:
```python
class GestureRecognitionController:
    def __init__(self, model_path: str, confidence_threshold: float,
                 max_hands: int, result_callback):
        self.model_path = model_path
        self.confidence_threshold = confidence_threshold
        self.max_hands = max_hands
        self.result_callback = result_callback

        # Initialize MediaPipe gesture recognizer
        base_options = python.BaseOptions(model_asset_path=self.model_path)
        options = vision.GestureRecognizerOptions(
            base_options=base_options,
            running_mode=vision.RunningMode.LIVE_STREAM,
            result_callback=self._mediapipe_callback,
            min_hand_detection_confidence=self.confidence_threshold,
            min_hand_presence_confidence=self.confidence_threshold,
            min_tracking_confidence=self.confidence_threshold,
            num_hands=self.max_hands
        )
        self.recognizer = vision.GestureRecognizer.create_from_options(options)
```

The controller processes camera frames asynchronously and extracts gesture classifications, hand landmarks, and handedness information. I implemented robust handedness extraction that handles MediaPipe's data structure variations:
```python
def extract_handedness(self, handedness_list, hand_index: int) -> str:
    """Extract handedness from MediaPipe results using standard category_name format."""
    if not handedness_list or hand_index >= len(handedness_list):
        return 'Unknown'
    try:
        handedness_data = handedness_list[hand_index]
        if hasattr(handedness_data, '__len__') and len(handedness_data) > 0:
            if hasattr(handedness_data[0], 'category_name'):
                return handedness_data[0].category_name
    except (IndexError, AttributeError):
        pass
    return 'Unknown'
```

Gesture-to-Motion Mapping System
I implemented a comprehensive mapping system that translates 8 distinct hand gestures into specific robot movements:
```python
GESTURE_MOTION_MAP = {
    'Thumb_Up':    {'linear_x': 0.3,  'angular_z': 0.0},   # Move forward
    'Thumb_Down':  {'linear_x': -0.2, 'angular_z': 0.0},   # Move backward
    'Open_Palm':   {'linear_x': 0.0,  'angular_z': 0.0},   # Emergency stop
    'Pointing_Up': {'linear_x': 0.3,  'angular_z': 0.0},   # Move forward (alternative)
    'Victory':     {'linear_x': 0.0,  'angular_z': 0.8},   # Turn left
    'ILoveYou':    {'linear_x': 0.0,  'angular_z': -0.8},  # Turn right
    'Closed_Fist': {'linear_x': 0.0,  'angular_z': 0.0},   # Emergency stop
    'None':        {'linear_x': 0.0,  'angular_z': 0.0}    # No gesture detected
}
```

The mapping system includes safety considerations with multiple emergency stop gestures (Open_Palm and Closed_Fist) that bypass acceleration limiting for immediate response. Forward and backward movements use different maximum velocities, with backward motion limited to 0.2 m/s for safety.
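The bypass for the two stop gestures can be handled before the smoother ever sees the command. The sketch below shows the idea; the routing function and the `publish_immediate_stop()` helper are assumptions for illustration.

```python
EMERGENCY_STOP_GESTURES = {'Open_Palm', 'Closed_Fist'}


def handle_gesture(self, gesture_name):
    """Route a stable gesture either straight to a hard stop or through smoothing."""
    motion = GESTURE_MOTION_MAP.get(gesture_name, GESTURE_MOTION_MAP['None'])

    if gesture_name in EMERGENCY_STOP_GESTURES:
        # Skip acceleration limiting: publish zero velocity immediately
        self.publish_immediate_stop()          # assumed helper
        self.target_velocity = dict(motion)    # keep smoother state consistent
        return

    # Normal path: the 25 Hz smoother ramps toward this target
    self.target_velocity = dict(motion)
```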
Acceleration Limiting for Mechanical Stability
Tall robots with high centers of mass require careful acceleration management to prevent wobbling and tipping. I implemented a comprehensive acceleration limiting system that operates at 25 Hz to provide smooth velocity transitions:
```python
def apply_acceleration_limit(self, current_vel: float, target_vel: float,
                             max_accel: float, dt: float) -> float:
    """Apply acceleration limiting to prevent abrupt velocity changes."""
    velocity_diff = target_vel - current_vel
    max_change = max_accel * dt

    if abs(velocity_diff) <= max_change:
        return target_vel  # Can reach target this step
    else:
        # Limit the change to maximum allowed acceleration
        return current_vel + (max_change if velocity_diff > 0 else -max_change)
```

The system uses conservative acceleration limits tuned for tall robot stability:
- Linear acceleration: 0.25 m/s² (balanced responsiveness and stability)
- Angular acceleration: 0.5 rad/s² (smooth turning without destabilization)
- Emergency deceleration: 1.2 m/s² (faster stopping for safety)
High-Frequency Velocity Smoothing
I implemented a 25 Hz velocity smoothing loop that continuously interpolates between current and target velocities. This high-frequency control prevents the jerky motion that can cause mechanical instability:
```python
def update_smoothed_velocity(self) -> None:
    """High-frequency velocity smoothing with acceleration limiting."""
    current_time = time.time()
    dt = current_time - self.last_velocity_update
    self.last_velocity_update = current_time

    # Skip if dt is too large (system lag) or too small
    if dt > 0.1 or dt < 0.001:
        return

    # Apply acceleration limiting to linear velocity
    self.current_velocity['linear_x'] = self.apply_acceleration_limit(
        self.current_velocity['linear_x'],
        self.target_velocity['linear_x'],
        max_linear_accel, dt
    )

    # Apply acceleration limiting to angular velocity
    self.current_velocity['angular_z'] = self.apply_acceleration_limit(
        self.current_velocity['angular_z'],
        self.target_velocity['angular_z'],
        max_angular_accel, dt
    )

    # Create and publish smoothed Twist message
    twist = Twist()
    twist.linear.x = self.current_velocity['linear_x']
    twist.angular.z = self.current_velocity['angular_z']
    self.cmd_vel_publisher.publish(twist)
```

Performance Optimizations
Gesture Stability Filtering
I developed a multi-layered stability filtering system that balances responsiveness with detection reliability. The system combines three filtering mechanisms:
Time-based stability: Requires gestures to be detected consistently for a minimum duration (0.1 seconds for maximum responsiveness).
Consistency checking: Validates that the same gesture appears in consecutive detections (single detection sufficient for immediate response).
Transition delay: Enforces minimum time between different gesture changes (0.05 seconds for fastest viable switching).
```python
def check_gesture_stability(self, gesture_name: str, confidence: float,
                            timestamp: float) -> bool:
    """Enhanced stability checking with consistency and transition delay."""
    # Add current detection to history
    self.gesture_detection_history.append({
        'gesture': gesture_name,
        'confidence': confidence,
        'timestamp': timestamp
    })

    # Check consistency - same gesture detected N consecutive times
    if not self._check_gesture_consistency(gesture_name):
        return False

    # Check transition delay - minimum time between different gestures
    if not self._check_transition_delay(gesture_name, timestamp):
        return False

    # Check time-based stability - existing method
    if not self.is_gesture_stable(gesture_name, timestamp):
        return False

    return True
```

These parameters were optimized through testing to achieve sub-second response times while maintaining detection stability. The system achieves gesture-to-motion latency of 0.3-0.5 seconds under optimal conditions.
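The two helper checks referenced above aren't shown in the post. Sketches consistent with the listed parameters (single-detection consistency, 0.05-second transition delay) might look like this; the `last_stable_gesture` and `last_gesture_change_time` attributes are assumptions for illustration.

```python
def _check_gesture_consistency(self, gesture_name: str, required_count: int = 1) -> bool:
    """Require the same gesture in the last N detections (N=1 for fastest response)."""
    recent = list(self.gesture_detection_history)[-required_count:]
    return len(recent) == required_count and all(
        entry['gesture'] == gesture_name for entry in recent)


def _check_transition_delay(self, gesture_name: str, timestamp: float,
                            min_delay: float = 0.05) -> bool:
    """Block switching to a different gesture until the transition delay has passed."""
    if gesture_name == self.last_stable_gesture:
        return True
    return (timestamp - self.last_gesture_change_time) >= min_delay
```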
Smart Logging System
I implemented intelligent logging that reduces noise while preserving essential debugging information. The system only logs when velocity commands change significantly or represent meaningful motion:
```python
def log_velocity_change(self, twist: Twist) -> None:
    """Smart logging that only logs when velocity actually changes or is significant."""
    current_linear = twist.linear.x
    current_angular = twist.angular.z
    last_linear = self.last_published_velocity['linear_x']
    last_angular = self.last_published_velocity['angular_z']

    # Check if velocity is zero
    is_zero_velocity = abs(current_linear) < 0.001 and abs(current_angular) < 0.001
    was_zero_velocity = abs(last_linear) < 0.001 and abs(last_angular) < 0.001

    # Check if velocity has changed significantly
    velocity_changed = (abs(current_linear - last_linear) > 0.01 or
                        abs(current_angular - last_angular) > 0.01)

    # Log conditions: non-zero velocities, significant changes, or transitions to stop
    should_log = (not is_zero_velocity or
                  (velocity_changed and not was_zero_velocity) or
                  (velocity_changed and not self.zero_velocity_logged))

    if should_log:
        self.get_logger().info(
            f'Velocity: linear: {current_linear:.3f}, angular: {current_angular:.3f}')
```

This approach reduces log volume by approximately 80% while maintaining visibility into actual motion commands and system state changes.
Cross-Workspace ROS2 Integration
I designed the system to support cross-workspace integration, enabling gesture control of robots with existing navigation stacks. The modular architecture publishes standard /cmd_vel messages that any ROS2 navigation system can consume:

```bash
# GestureBot workspace (publishes /cmd_vel)
cd ~/GestureBot/gesturebot_ws
source ~/GestureBot/gesturebot_env/bin/activate
source install/setup.bash
ros2 launch gesturebot gesture_recognition.launch.py
ros2 launch gesturebot gesture_navigation_bridge.launch.py

# Robot base workspace (subscribes to /cmd_vel)
cd ~/Robot/robot_ws
source ~/Robot/robot_env/bin/activate
source install/setup.bash
ros2 run robot_base base_controller_node
```

This architecture enables gesture control of any ROS2-compatible robot without modifying existing navigation code.
Results and Performance Metrics
Responsiveness Improvements
Through systematic optimization, I achieved significant improvements in gesture-to-motion response times:
Before optimization:
- Gesture-to-motion latency: 5-10 seconds
- Motion start time: 2-4 seconds
- Gesture transition rate: 8.94 transitions/second (excessive noise)
After optimization:
- Gesture-to-motion latency: 0.3-0.5 seconds (90% improvement)
- Motion start time: 0.5-1.0 seconds (75% improvement)
- Gesture transition rate: 6.03 transitions/second (32% reduction in noise)
Stability Achievements
The acceleration limiting system successfully prevents mechanical instability in tall robot configurations:
Acceleration compliance:
- Linear acceleration: ≤0.25 m/s² (100% compliance)
- Angular acceleration: ≤0.5 rad/s² (100% compliance)
- Smooth transitions: >95% of velocity changes within limits
Motion characteristics:
- Time to reach maximum linear velocity (0.3 m/s): 1.2 seconds
- Time to reach maximum angular velocity (0.8 rad/s): 1.6 seconds
- Emergency stop response: <0.1 seconds (bypasses acceleration limiting)
Detection Performance
The gesture recognition system demonstrates robust performance across various conditions:
Detection accuracy:
- Gesture recognition confidence: >0.7 for stable detections
- Handedness detection: 100% accuracy when hands are clearly visible
- False positive rate: <5% with stability filtering enabled
System resource usage:
- CPU utilization: 15-20% for gesture recognition, 2-5% for navigation bridge
- Memory usage: 200-300MB total system footprint
- Detection rate: 3-4 Hz for stable gestures, 8-15 Hz raw MediaPipe output
-
Building a 33-Point Human Skeleton Tracker
08/14/2025 at 15:20 • 0 comments

Most robotics vision systems treat humans as simple bounding boxes – "person detected, avoid obstacle." But humans are dynamic, expressive, and predictable if you know how to read body language. A person leaning forward might be about to walk into the robot's path. Someone pointing could be giving directional commands. Arms raised might signal "stop."
I needed a system that could:
- Track 33 distinct body landmarks in real-time
- Handle multiple people simultaneously (up to 2 poses)
- Run headless on embedded hardware without X11 dependencies
- Integrate cleanly with my existing ROS 2 navigation stack
- Provide visual feedback for development and debugging
The Core Architecture
The heart of my implementation is a modular ROS 2 node that wraps MediaPipe's PoseLandmarker model. I chose a composition pattern to keep the ROS infrastructure separate from the MediaPipe processing logic:
```python
class PoseDetectionNode(MediaPipeBaseNode, MediaPipeCallbackMixin):
    def __init__(self, **kwargs):
        MediaPipeCallbackMixin.__init__(self)
        super().__init__(
            node_name='pose_detection_node',
            **kwargs
        )

        # Initialize the pose detection controller
        self.controller = PoseDetectionController(
            model_path=self.model_path,
            confidence_threshold=self.confidence_threshold,
            max_poses=self.max_poses,
            logger=self.get_logger()
        )
```

The PoseDetectionController handles all MediaPipe-specific operations:

```python
class PoseDetectionController:
    def __init__(self, model_path: str, confidence_threshold: float,
                 max_poses: int, logger):
        self.logger = logger

        # Configure MediaPipe options
        base_options = python.BaseOptions(model_asset_path=model_path)
        options = vision.PoseLandmarkerOptions(
            base_options=base_options,
            running_mode=vision.RunningMode.LIVE_STREAM,
            num_poses=max_poses,
            min_pose_detection_confidence=confidence_threshold,
            min_pose_presence_confidence=confidence_threshold,
            min_tracking_confidence=confidence_threshold,
            result_callback=self._pose_callback
        )
        self._landmarker = vision.PoseLandmarker.create_from_options(options)
```

The 33-Point Pose Model
MediaPipe's pose model detects 33 landmarks covering the entire human body:
- Face: Nose, eyes, ears (5 points)
- Torso: Shoulders, hips, center points (6 points)
- Arms: Shoulders, elbows, wrists (6 points)
- Hands: Wrist, thumb, fingers (10 points)
- Legs: Hips, knees, ankles, feet (6 points)
Each landmark provides normalized (x, y) coordinates plus a visibility score, giving rich information about human pose and orientation.
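For example, reading the shoulder landmarks and checking their visibility before using them for classification could look like the sketch below. The indices follow MediaPipe's pose landmark numbering; the 0.5 visibility cutoff is an illustrative assumption.

```python
LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12


def shoulders_visible(landmarks, min_visibility=0.5):
    """Return the two shoulder points if both are reliably visible, else None."""
    left, right = landmarks[LEFT_SHOULDER], landmarks[RIGHT_SHOULDER]
    if left.visibility < min_visibility or right.visibility < min_visibility:
        return None
    # x and y are normalized to [0, 1] relative to image width and height
    return (left.x, left.y), (right.x, right.y)
```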
Handling MediaPipe's Async Processing
One of the trickier aspects was properly handling MediaPipe's asynchronous LIVE_STREAM mode. The pose detection happens in a separate thread, with results delivered via callback:
```python
def _pose_callback(self, result: vision.PoseLandmarkerResult,
                   output_image: mp.Image, timestamp_ms: int):
    """Handle pose detection results from MediaPipe."""
    try:
        # Convert MediaPipe timestamp to ROS time
        ros_timestamp = self._convert_timestamp(timestamp_ms)

        # Process pose landmarks
        pose_msg = PoseLandmarks()
        pose_msg.header.stamp = ros_timestamp
        pose_msg.header.frame_id = 'camera_frame'

        if result.pose_landmarks:
            pose_msg.num_poses = len(result.pose_landmarks)

            # Handle MediaPipe's pose landmark structure variations
            for pose_landmarks in result.pose_landmarks:
                try:
                    # MediaPipe structure can vary between versions
                    if hasattr(pose_landmarks, '__iter__') and not hasattr(pose_landmarks, 'landmark'):
                        landmarks = pose_landmarks           # Direct list access
                    else:
                        landmarks = pose_landmarks.landmark  # Attribute access

                    for landmark in landmarks:
                        point = Point()
                        point.x = landmark.x
                        point.y = landmark.y
                        point.z = landmark.z
                        pose_msg.landmarks.append(point)
                except Exception as e:
                    self.logger.warn(f'Pose landmark processing error: {e}')
                    continue

        # Publish results
        self.pose_publisher.publish(pose_msg)

    except Exception as e:
        self.logger.error(f'Pose callback error: {e}')
```

Performance Reality Check: 3-7 FPS on Pi 5
Let me be honest about performance – this isn't going to run at 30 FPS on a Raspberry Pi 5. Through extensive testing, I measured:
- Actual Performance: 3-7 FPS @ 640x480 resolution
- Processing Time: ~150-300ms per frame
- Memory Usage: ~200MB additional RAM
- CPU Load: ~40-60% of one core during active processing
But here's the thing – for robotics applications, this is actually sufficient. Human movement is relatively slow compared to computer vision processing. A robot doesn't need to track every micro-movement; it needs to understand general pose, direction, and intent.
The key insight was optimizing for stability over speed. I'd rather have consistent 5 FPS processing than erratic 15 FPS with dropped frames and errors.
Debugging MediaPipe Integration Challenges
The most frustrating part of this project was dealing with MediaPipe's pose landmark data structures. Different versions of MediaPipe return pose landmarks in slightly different formats, and the documentation doesn't clearly explain the variations.
I spent hours debugging errors like:
```
AttributeError: 'list' object has no attribute 'landmark'
```

The solution was implementing robust structure detection:

```python
# Handle both possible MediaPipe pose landmark structures
if hasattr(pose_landmarks, '__iter__') and not hasattr(pose_landmarks, 'landmark'):
    # pose_landmarks is already a list of landmarks (newer format)
    landmarks = pose_landmarks
else:
    # pose_landmarks has .landmark attribute (older format)
    landmarks = pose_landmarks.landmark
```

When annotated image publishing is enabled, the callback also renders the skeleton overlay and publishes it for visualization:

```python
if self.annotated_image_publisher and result.pose_landmarks:
    annotated_frame = self._create_annotated_image(output_image.numpy_view(), result)
    annotated_msg = self.cv_bridge.cv2_to_imgmsg(annotated_frame, encoding='bgr8')
    annotated_msg.header.stamp = self.get_clock().now().to_msg()
    annotated_msg.header.frame_id = 'camera_frame'
    self.annotated_image_publisher.publish(annotated_msg)
```

I can develop and debug with full visualization, then deploy the same code headless on the robot.
Multi-Modal Vision Integration
The real power comes from combining pose detection with other vision modalities. My unified image viewer can display multiple vision streams simultaneously:
```bash
# View all vision systems together
ros2 launch gesturebot image_viewer.launch.py \
  image_topics:='["/vision/objects/annotated", "/vision/gestures/annotated", "/vision/pose/annotated"]' \
  topic_window_names:='{"\/vision\/objects\/annotated": "Objects", "\/vision\/gestures\/annotated": "Gestures", "\/vision\/pose\/annotated": "Poses"}'
```

This gives me object detection (what's there), gesture recognition (what they're doing), and pose detection (how they're positioned) all in one integrated system.
Practical Robotics Applications
Human-Aware Navigation
Instead of treating humans as static obstacles, my robot can now:
- Predict movement direction from body orientation
- Maintain appropriate social distances based on pose
- Recognize when someone is trying to interact vs. just passing by
Gesture Command Integration
Pose data enhances gesture recognition by providing context:
- A pointing gesture combined with body orientation gives directional commands
- Raised arms with forward-leaning pose might indicate "stop" vs. "hello"
- Multiple people can give conflicting gestures – pose helps determine who to follow
Safety and Interaction
The 33-point skeleton provides rich safety information:
- Detect if someone has fallen (unusual pose angles)
- Recognize aggressive vs. friendly body language
- Identify when someone is carrying objects that might affect navigation
-
Implementing Real-Time Gesture Recognition for Robot Control
08/13/2025 at 03:40 • 0 comments

System Overview and Architecture
The GestureBot gesture recognition system is built on a modular ROS 2 architecture that combines MediaPipe's powerful computer vision capabilities with efficient real-time processing optimized for embedded systems. The core system processes camera input at 15 FPS, detects hand gestures with 21-point landmark tracking, and translates recognized gestures into navigation commands for autonomous robot control.
Key Performance Metrics
Through extensive testing and optimization, I've achieved the following performance characteristics on Raspberry Pi 5:
- Processing Rate: 15 FPS @ 640x480 resolution
- Gesture Recognition Latency: <100ms from detection to command
- Hand Landmark Accuracy: 21-point skeleton with sub-pixel precision
- System Resource Usage: <25% CPU utilization during active processing
- Memory Footprint: ~150MB including MediaPipe models and ROS 2 overhead
MediaPipe Integration with ROS 2
The foundation of the gesture recognition system is a robust integration between MediaPipe's gesture recognition capabilities and ROS 2's distributed computing framework. I implemented this using a callback-based architecture that maximizes performance while maintaining system responsiveness.
Core Architecture Components
The system consists of several key components working in concert:
GestureRecognitionNode: The primary ROS 2 node that inherits from
MediaPipeBaseNode, providing standardized MediaPipe integration patterns across the vision system.MediaPipe Gesture Recognizer: Utilizes the pre-trained
gesture_recognizer.taskmodel for real-time gesture classification with confidence scoring.Unified Image Viewer: A multi-topic display system that can simultaneously show gesture recognition results, hand landmarks, and performance metrics.
Callback-Based Processing Pipeline
I implemented the MediaPipe integration using an asynchronous callback pattern that ensures optimal performance:
```python
def initialize_mediapipe(self) -> bool:
    """Initialize MediaPipe gesture recognizer with callback processing."""
    try:
        # Configure gesture recognizer options
        options = mp_vis.GestureRecognizerOptions(
            base_options=mp_py.BaseOptions(model_asset_path=self.model_path),
            running_mode=mp_vis.RunningMode.LIVE_STREAM,
            result_callback=self._process_callback_results,
            num_hands=self.max_hands,
            min_hand_detection_confidence=self.confidence_threshold,
            min_hand_presence_confidence=self.confidence_threshold,
            min_tracking_confidence=self.confidence_threshold
        )
        self.gesture_recognizer = mp_vis.GestureRecognizer.create_from_options(options)
        return True
    except Exception as e:
        self.get_logger().error(f"Failed to initialize MediaPipe: {e}")
        return False
```

This approach uses MediaPipe's LIVE_STREAM mode with detect_async() for optimal performance, avoiding blocking operations in the main processing thread.

Hand Landmark Detection System
The gesture recognition system implements comprehensive 21-point hand landmark tracking, providing detailed skeletal information for each detected hand. This landmark data serves dual purposes: gesture classification input and visual feedback for system debugging.
21-Point Hand Skeleton
MediaPipe's hand landmark model provides 21 key points representing the complete hand structure:
- Wrist (0): Base reference point
- Thumb (1-4): Complete thumb chain from base to tip
- Index Finger (5-8): Four points from base to fingertip
- Middle Finger (9-12): Complete middle finger structure
- Ring Finger (13-16): Four-point ring finger chain
- Pinky (17-20): Complete pinky finger structure
Landmark Visualization Implementation
I implemented a comprehensive visualization system that draws both individual landmarks and connecting skeleton lines:
```python
def draw_hand_landmarks(self, image: np.ndarray, hand_landmarks_list) -> None:
    """Draw complete hand landmarks with skeleton connections."""
    try:
        if not hand_landmarks_list:
            return

        height, width = image.shape[:2]

        for hand_index, hand_landmarks in enumerate(hand_landmarks_list):
            # Draw landmark connections (skeleton)
            mp_drawing.draw_landmarks(
                image,
                hand_landmarks,
                mp_hands.HAND_CONNECTIONS,
                mp_drawing.DrawingSpec(color=(0, 255, 0), thickness=2, circle_radius=2),
                mp_drawing.DrawingSpec(color=(255, 0, 0), thickness=2)
            )

            # Highlight key landmarks (wrist and fingertips)
            for i, landmark in enumerate(hand_landmarks.landmark):
                x = int(landmark.x * width)
                y = int(landmark.y * height)
                if i in [0, 4, 8, 12, 16, 20]:  # Wrist and fingertips
                    cv2.circle(image, (x, y), 8, (0, 0, 255), -1)      # Red circles
                    cv2.circle(image, (x, y), 10, (255, 255, 255), 2)  # White border

    except Exception as e:
        self.get_logger().error(f"Error drawing landmarks: {e}")
```

Gesture-to-Navigation Command Mapping
The system implements a comprehensive mapping between recognized gestures and robot navigation commands, designed for intuitive and safe robot control.
Supported Gestures and Commands
I've implemented support for 9 distinct gestures, each mapped to specific navigation behaviors:
| Gesture | Navigation Command | Robot Action | Use Case |
|---------|--------------------|--------------|----------|
| 👍 Thumbs Up | start_navigation | Begin autonomous navigation | Start mission |
| 👎 Thumbs Down | stop_navigation | Stop all movement | End current task |
| ✋ Open Palm | pause_navigation | Pause current navigation | Temporary halt |
| 👆 Pointing Up | move_forward | Move forward briefly | Manual positioning |
| 👈 Pointing Left | turn_left | Turn left | Direction control |
| 👉 Pointing Right | turn_right | Turn right | Direction control |
| ✌️ Peace Sign | follow_person | Enter person-following mode | Interactive following |
| ✊ Fist | emergency_stop | Immediate emergency stop | Safety override |
| 👋 Wave | return_home | Return to home position | Mission completion |

Gesture Stability and Confidence Thresholds
To ensure reliable operation and prevent false triggers, I implemented a multi-layered validation system:
Confidence Threshold: Minimum 0.5 confidence score required for gesture recognition (configurable via parameters).
Gesture Stability: Gestures must be maintained for a minimum duration (default: 0.5 seconds) before triggering navigation commands.
Command Debouncing: Prevents rapid-fire command execution by implementing cooldown periods between identical commands.
```python
def process_gesture_results(self, results, timestamp) -> Optional[Dict]:
    """Process gesture recognition results with stability validation."""
    if not results.gestures:
        self.current_gesture = None
        self.gesture_start_time = None
        return None

    # Get highest confidence gesture
    gesture = results.gestures[0][0]
    gesture_name = gesture.category_name
    confidence = gesture.score

    # Apply confidence threshold
    if confidence < self.confidence_threshold:
        return None

    # Check gesture stability
    current_time = time.time()
    if self.current_gesture != gesture_name:
        self.current_gesture = gesture_name
        self.gesture_start_time = current_time
        return None  # Wait for stability

    # Validate stability duration
    if current_time - self.gesture_start_time < self.gesture_stability_threshold:
        return None

    # Generate navigation command
    nav_command = self.GESTURE_COMMANDS.get(gesture_name, '')

    return {
        'gesture_name': gesture_name,
        'confidence': confidence,
        'nav_command': nav_command,
        'handedness': results.handedness[0][0].category_name if results.handedness else 'unknown',
        'timestamp': timestamp
    }
```

Unified Image Viewer Architecture
One of the key architectural improvements I implemented is the unified image viewer system, which replaces multiple separate image viewer nodes with a single, efficient multi-topic display system.
Multi-Topic Display Capabilities
The unified image viewer can simultaneously display multiple vision streams in separate windows, each with independent performance tracking and custom labeling:
```bash
# Display both gesture recognition and object detection simultaneously
ros2 launch gesturebot image_viewer.launch.py \
  image_topics:='["/vision/gestures/annotated", "/vision/objects/annotated"]' \
  topic_window_names:='{"\/vision\/gestures\/annotated": "Gestures", "\/vision\/objects\/annotated": "Objects"}'
```

Performance Optimizations
The unified architecture provides significant performance improvements:
- Memory Efficiency: ~15MB per window vs ~25MB for separate viewers
- CPU Optimization: <5% total CPU usage at 10 FPS display rate
- Resource Sharing: Single OpenCV event loop handles multiple windows
- QoS Compatibility: BEST_EFFORT reliability matches vision system publishers
FPS Overlay Implementation
I implemented a sophisticated FPS overlay system that provides per-topic performance monitoring positioned in the lower-right corner of each display window:
```python
def add_fps_overlay(self, image: np.ndarray, topic: str) -> np.ndarray:
    """Add FPS overlay to the image for a specific topic in the lower-right corner."""
    fps_text = f"Display FPS: {self.current_fps_values[topic]:.1f}"
    topic_text = f"Topic: {topic.split('/')[-1]}"

    # Get image dimensions and calculate lower-right position
    img_height, img_width = image.shape[:2]

    # Text properties (standardized across vision system)
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 0.5
    color = (0, 255, 0)  # Green
    thickness = 1

    # Measure both text lines to size the background box
    (fps_w, fps_h), _ = cv2.getTextSize(fps_text, font, font_scale, thickness)
    (topic_w, topic_h), _ = cv2.getTextSize(topic_text, font, font_scale, thickness)
    text_width = max(fps_w, topic_w)

    # Calculate positioning with proper padding
    padding = 10
    rect_x2 = img_width - padding
    rect_y2 = img_height - padding
    rect_x1 = rect_x2 - text_width - padding
    rect_y1 = rect_y2 - (fps_h + topic_h + 3 * padding)
    text_x = rect_x1 + padding
    fps_text_y = rect_y1 + fps_h + padding
    topic_text_y = fps_text_y + topic_h + padding

    # Draw background rectangle and text
    cv2.rectangle(image, (rect_x1, rect_y1), (rect_x2, rect_y2), (0, 0, 0), -1)
    cv2.putText(image, fps_text, (text_x, fps_text_y), font, font_scale, color, thickness)
    cv2.putText(image, topic_text, (text_x, topic_text_y), font, font_scale, color, thickness)

    return image
```

Performance Optimization for Raspberry Pi 5
Achieving real-time performance on embedded hardware required careful optimization across multiple system layers.
Camera Configuration Optimization
I optimized the camera pipeline specifically for gesture recognition performance:
```yaml
# Optimal camera settings for gesture recognition
camera_node:
  ros__parameters:
    format: "BGR888"                       # Optimal for MediaPipe processing
    width: 640
    height: 480
    FrameDurationLimits: [66667, 66667]    # 15 FPS = 66.67ms
    ExposureTime: 20000                    # 20ms for fast capture
    AnalogueGain: 1.0
    DigitalGain: 1.0
    buffer_queue_size: 2                   # Reduced buffer for lower latency
```

The node also monitors system resources and adapts the processing rate when CPU load climbs:

```python
def monitor_system_performance(self):
    """Monitor system resources and adapt processing accordingly."""
    cpu_usage = psutil.cpu_percent(interval=1)
    memory_usage = psutil.virtual_memory().percent

    if cpu_usage > 75:
        # Reduce processing rate under high load
        self.processing_config.max_fps = 10.0
        self.get_logger().warn("High CPU usage detected, reducing processing rate")
    elif cpu_usage < 50:
        # Restore full processing rate when resources available
        self.processing_config.max_fps = 15.0
```

Configuration and Usage
Launch File Configuration
The gesture recognition system provides comprehensive configuration through launch file parameters:
```bash
# Complete gesture recognition system launch
ros2 launch gesturebot gesture_recognition.launch.py \
  camera_format:=BGR888 \
  confidence_threshold:=0.5 \
  max_hands:=2 \
  gesture_stability_threshold:=0.5 \
  publish_annotated_images:=true \
  show_landmark_indices:=false \
  buffer_logging_enabled:=false \
  enable_performance_tracking:=true
```

Key Parameters Explained
camera_format: BGR888 provides optimal performance for MediaPipe processing, avoiding unnecessary color space conversions.
confidence_threshold: Minimum confidence score (0.0-1.0) required for gesture recognition. Higher values reduce false positives but may miss valid gestures.
max_hands: Maximum number of hands to track simultaneously (1-2). Higher values increase computational load.
gesture_stability_threshold: Minimum duration (seconds) a gesture must be maintained before triggering commands. Prevents accidental activations.
publish_annotated_images: Enables/disables annotated image publishing. Defaults to true for visual feedback but can be disabled to save resources.
Integration with Navigation System
The gesture recognition system integrates seamlessly with ROS 2 navigation stacks through standardized message interfaces:
```python
# Gesture command publisher
self.gesture_command_publisher = self.create_publisher(
    GestureCommand, '/gesture_commands', 10
)


# Navigation command mapping
def publish_navigation_command(self, gesture_info):
    """Publish navigation command based on recognized gesture."""
    command_msg = GestureCommand()
    command_msg.gesture_name = gesture_info['gesture_name']
    command_msg.confidence = gesture_info['confidence']
    command_msg.navigation_command = gesture_info['nav_command']
    command_msg.timestamp = self.get_clock().now().to_msg()

    self.gesture_command_publisher.publish(command_msg)
```

Visual System Features and Consistency
I implemented a comprehensive visual feedback system with consistent styling across all vision components.
Font Standardization
All text overlays throughout the GestureBot vision system use standardized font properties for visual consistency:
```python
# Standardized font properties across all vision nodes
STANDARD_FONT = cv2.FONT_HERSHEY_SIMPLEX
STANDARD_FONT_SCALE = 0.5
STANDARD_THICKNESS = 1
```

This standardization applies to:
- FPS overlays in the unified image viewer
- Gesture name and confidence annotations
- Hand landmark indices (when enabled)
- Bounding box labels
Annotated Image Publishing
The system publishes richly annotated images that include:
- Hand Landmarks: Complete 21-point skeleton with connecting lines
- Gesture Labels: Current gesture name with confidence percentage
- Navigation Commands: Associated navigation command for each gesture
- Performance Metrics: Real-time FPS and processing statistics
```python
def create_annotated_image(self, frame, gesture_info):
    """Create comprehensive annotated image with all visual elements."""
    annotated_frame = frame.copy()

    # Draw hand landmarks
    if self._current_hand_landmarks:
        self.draw_hand_landmarks(annotated_frame, self._current_hand_landmarks)

    # Add gesture text with standardized font
    gesture_text = f"{gesture_info['gesture_name']} ({gesture_info['confidence']:.2f})"
    if gesture_info.get('nav_command'):
        gesture_text += f" -> {gesture_info['nav_command']}"

    self.draw_text_with_background(
        annotated_frame, gesture_text, (10, 30),
        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1
    )

    return annotated_frame
```

Performance Optimization
Issue: Poor FPS performance or high CPU usage

Solutions:
- Reduce processing rate: Lower max_fps parameter
- Optimize camera settings: Use BGR888 format, reduce resolution if needed
- Disable unnecessary features: Turn off landmark indices, reduce max_hands
- Monitor system resources: Use htop and ROS 2 performance topics
```bash
# Monitor gesture recognition performance
ros2 topic echo /vision/gestures/performance

# Adjust parameters for better performance
ros2 param set /gesture_recognition_node max_fps 10.0
ros2 param set /gesture_recognition_node max_hands 1
```

-
Manual Object Detection Annotations in GestureBot
08/11/2025 at 05:04 • 0 comments

When developing the GestureBot vision system, I encountered a common challenge in robotics computer vision: balancing performance with visualization flexibility. While MediaPipe provides excellent object detection capabilities, its built-in annotation system proved limiting for our specific visualization requirements. This post details how I implemented a custom manual annotation system using OpenCV primitives while maintaining MediaPipe's high-performance LIVE_STREAM processing mode.

Problem Statement: Why Move Beyond MediaPipe's Built-in Annotations
MediaPipe's object detection framework excels at inference performance, but its visualization capabilities presented several limitations for our robotics application:
MediaPipe Annotation Limitations
- Limited customization: Fixed annotation styles with minimal configuration options
- Inconsistent output: LIVE_STREAM mode doesn't always provide reliable output_image results
- Performance overhead: Built-in annotations add processing latency in the inference pipeline
- Inflexible styling: No control over color schemes, font sizes, or confidence display formats
Our Requirements
For GestureBot's vision system, I needed:
- Color-coded confidence levels for quick visual assessment
- Percentage-based confidence display for precise evaluation
- Consistent annotation rendering regardless of detection confidence
- Minimal performance impact on the real-time processing pipeline
- Full control over visual styling to match our robotics interface
Technical Implementation: Manual Annotation Architecture
The solution involved decoupling MediaPipe's inference engine from the visualization layer, creating a custom annotation system that operates on the original RGB frames.
System Architecture
# High-level flow
RGB Frame → MediaPipe Detection (LIVE_STREAM) → Manual Annotation → ROS Publishing

The key insight was to preserve MediaPipe's asynchronous detect_async() processing while applying custom annotations to the original RGB frames, rather than relying on MediaPipe's output_image.

Core Implementation: Manual Annotation Method
def draw_manual_annotations(self, image: np.ndarray, detections) -> np.ndarray:
    """
    Manually draw bounding boxes, labels, and confidence scores using OpenCV.

    Args:
        image: RGB image array (H, W, 3)
        detections: MediaPipe detection results

    Returns:
        Annotated RGB image array
    """
    if not detections:
        return image.copy()

    annotated_image = image.copy()
    height, width = image.shape[:2]

    for detection in detections:
        # Get bounding box coordinates
        bbox = detection.bounding_box
        x_min = int(bbox.origin_x)
        y_min = int(bbox.origin_y)
        x_max = int(bbox.origin_x + bbox.width)
        y_max = int(bbox.origin_y + bbox.height)

        # Ensure coordinates are within image bounds
        x_min = max(0, min(x_min, width - 1))
        y_min = max(0, min(y_min, height - 1))
        x_max = max(0, min(x_max, width - 1))
        y_max = max(0, min(y_max, height - 1))

        # Get the best category (highest confidence)
        if detection.categories:
            best_category = max(detection.categories, key=lambda c: c.score if c.score else 0)
            class_name = best_category.category_name or 'unknown'
            confidence = best_category.score or 0.0

            # Color-coded boxes based on confidence levels
            if confidence >= 0.7:
                color = (0, 255, 0)    # Green for high confidence (RGB)
            elif confidence >= 0.5:
                color = (255, 255, 0)  # Yellow for medium confidence (RGB)
            else:
                color = (255, 0, 0)    # Red for low confidence (RGB)

            # Draw bounding box rectangle
            cv2.rectangle(annotated_image, (x_min, y_min), (x_max, y_max), color, 2)

            # Prepare label text with confidence percentage
            confidence_percent = int(confidence * 100)
            label = f"{class_name}: {confidence_percent}%"

            # Calculate text size for background rectangle
            font = cv2.FONT_HERSHEY_SIMPLEX
            font_scale = 0.6
            thickness = 2
            (text_width, text_height), baseline = cv2.getTextSize(label, font, font_scale, thickness)

            # Position text above bounding box, or below if not enough space
            text_x = x_min
            text_y = y_min - 10 if y_min - 10 > text_height else y_max + text_height + 10

            # Draw background rectangle for text (filled)
            cv2.rectangle(
                annotated_image,
                (text_x, text_y - text_height - baseline),
                (text_x + text_width, text_y + baseline),
                color,
                -1  # Filled rectangle
            )

            # Draw text label in black for good contrast
            cv2.putText(
                annotated_image,
                label,
                (text_x, text_y),
                font,
                font_scale,
                (0, 0, 0),  # Black text (RGB)
                thickness,
                cv2.LINE_AA
            )

    return annotated_image

Integration with MediaPipe Pipeline
The manual annotation system integrates seamlessly with MediaPipe's asynchronous processing:
def process_frame(self, frame: np.ndarray, timestamp: float) -> Optional[Dict]:
    """Process frame for object detection."""
    # ... MediaPipe processing ...

    if detections:
        result_dict = {
            'detections': detections,
            'timestamp': timestamp,
            'processing_time': (time.time() - timestamp) * 1000,
            'rgb_frame': rgb_frame  # Include original RGB frame for manual annotation
        }
        return result_dict

def publish_results(self, results: Dict, timestamp: float) -> None:
    """Publish object detection results and optionally annotated images."""
    # ... detection publishing ...

    # Extract the original RGB frame and detections for manual annotation
    rgb_frame = results['rgb_frame']
    detections = results['detections']

    # Apply manual annotations using OpenCV drawing primitives
    annotated_rgb = self.draw_manual_annotations(rgb_frame, detections)

    # Convert RGB to BGR for ROS publishing
    cv_image = cv2.cvtColor(annotated_rgb, cv2.COLOR_RGB2BGR)

Visual Features: Color-Coded Confidence System
The manual annotation system implements a three-tier confidence visualization scheme designed for rapid assessment in robotics applications:
Confidence Color Mapping
- 🟢 Green (≥70%): High confidence detections suitable for autonomous decision-making
- 🟡 Yellow (≥50%): Medium confidence detections requiring validation
- 🔴 Red (<50%): Low confidence detections for debugging purposes
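The thresholds above could be factored into a small helper; this is a hypothetical refactor for illustration, not code from the node, which inlines the logic inside draw_manual_annotations.

# Hypothetical helper factoring out the confidence-to-color thresholds (RGB colors).
def confidence_color(confidence: float) -> tuple:
    if confidence >= 0.7:
        return (0, 255, 0)      # Green: high confidence
    if confidence >= 0.5:
        return (255, 255, 0)    # Yellow: medium confidence
    return (255, 0, 0)          # Red: low confidence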
Typography and Positioning
- Font: cv2.FONT_HERSHEY_SIMPLEX for clear readability
- Background: Filled rectangles matching bounding box colors for contrast
- Text Color: Black for optimal contrast against colored backgrounds
- Positioning: Adaptive placement above or below bounding boxes based on available space
Percentage Display Format
Confidence scores are displayed as integers (e.g., "person: 76%") rather than decimals, providing immediate visual feedback without cognitive overhead during real-time operation.
-
Optimizing Object Detection Pipeline Performance: A 68.7% Improvement Through Systematic Bottleneck Analysis
08/11/2025 at 01:44 • 0 comments

Introduction
I recently completed a comprehensive performance optimization project for a ROS 2-based object detection pipeline using MediaPipe and OpenCV. The system processes camera frames for real-time object detection in robotics applications, but initial performance analysis revealed significant bottlenecks that were limiting throughput and consuming excessive CPU resources.
The object detection pipeline consists of three main stages:
- Preprocessing: Camera frame format conversion (BGR→RGB) for MediaPipe compatibility
- MediaPipe Inference: Object detection using TensorFlow Lite models
- Postprocessing: Result conversion and ROS message publishing, including optional annotated image generation
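For orientation, here is a minimal sketch of how those three stages fit together in a per-frame callback, with per-stage timing in the style used throughout this post. The stage boundaries match the description above, but the helper names (convert_to_rgb, run_inference, publish_detections) are placeholders rather than the node's actual methods.

import time

def handle_frame(frame):
    """Sketch of the three-stage pipeline with per-stage timing (placeholder helpers)."""
    timings = {}

    # Stage 1: Preprocessing (BGR -> RGB conversion for MediaPipe)
    t0 = time.perf_counter()
    rgb_frame = convert_to_rgb(frame)            # placeholder
    timings['preprocessing_ms'] = (time.perf_counter() - t0) * 1000

    # Stage 2: MediaPipe inference (TensorFlow Lite object detection)
    t0 = time.perf_counter()
    detections = run_inference(rgb_frame)        # placeholder
    timings['mediapipe_ms'] = (time.perf_counter() - t0) * 1000

    # Stage 3: Postprocessing (ROS message conversion and publishing)
    t0 = time.perf_counter()
    publish_detections(detections)               # placeholder
    timings['postprocessing_ms'] = (time.perf_counter() - t0) * 1000

    return timings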
Through systematic measurement and targeted optimization, I achieved a 68.7% reduction in total pipeline processing time, from 8.65ms to 2.71ms per frame, while maintaining full functionality and improving system stability.
Baseline Performance Analysis
Measurement Infrastructure
Before implementing any optimizations, I established a comprehensive performance measurement system to ensure accurate, statistically reliable data collection. The measurement infrastructure includes:
PipelineTimer Class: High-precision timing using time.perf_counter() for microsecond-level accuracy:

class PipelineTimer:
    def __init__(self):
        self.stage_times = {}
        self.start_time = None

    def start_stage(self, stage_name: str):
        self.stage_times[stage_name] = time.perf_counter()

    def end_stage(self, stage_name: str) -> float:
        if stage_name in self.stage_times:
            duration = time.perf_counter() - self.stage_times[stage_name]
            return duration * 1000  # Convert to milliseconds
        return 0.0

PerformanceStats Class: Aggregates timing data over 5-second periods and publishes metrics to ROS topics:
class PerformanceStats:
    def __init__(self):
        self.period_start_time = time.perf_counter()
        self.frames_processed = 0
        self.total_preprocessing_time = 0.0
        self.total_mediapipe_time = 0.0
        self.total_postprocessing_time = 0.0
        self.period_duration = 5.0  # seconds

Statistical Methodology: I used 30-second test periods with multiple measurement intervals to ensure statistical confidence. Each test collected 5-6 data points, allowing calculation of mean performance metrics and variance analysis.
Baseline Performance Results
Using YUYV camera format with full annotated image processing enabled, the baseline performance measurements revealed:
| Metric | Average Time | Percentage of Pipeline |
| --- | --- | --- |
| Total Pipeline Time | 8.65ms | 100% |
| Preprocessing Time | 1.22ms | 14% |
| MediaPipe Inference | 2.13ms | 25% |
| Postprocessing Time | 5.30ms | 61% |
| Effective FPS | 2.28 | - |

The baseline analysis immediately identified postprocessing as the primary bottleneck, consuming 61% of total pipeline time. This stage includes MediaPipe result conversion, RGB→BGR color conversion, and ROS Image message creation for annotated output.
Optimization #1: Conditional Annotated Image Processing
Problem Analysis
The postprocessing bottleneck was caused by unconditional generation of annotated images, even when no ROS subscribers were listening to the /vision/objects/annotated topic. This resulted in expensive memory operations and color conversions being performed unnecessarily.

Implementation
I implemented a subscriber count check to conditionally skip annotated image processing when no subscribers are present:
def publish_results(self, results: Dict, timestamp: float) -> None:
    """Publish object detection results and optionally annotated images."""
    try:
        # Always publish detection results
        msg = MessageConverter.detection_results_to_ros(results, timestamp)
        self.detections_publisher.publish(msg)

        # Conditional annotated image publishing
        if (self.annotated_image_publisher is not None and
                'output_image' in results and
                results['output_image'] is not None):

            # Optimization: Skip expensive postprocessing if no subscribers
            subscriber_count = self.annotated_image_publisher.get_subscription_count()
            if subscriber_count == 0:
                self.log_buffered_event(
                    'ANNOTATED_PROCESSING_SKIPPED',
                    'Skipping annotated image processing - no subscribers',
                    subscriber_count=subscriber_count
                )
                return

            # Continue with annotated image processing only when needed
            # ... (expensive postprocessing operations)

Performance Impact
The conditional processing optimization delivered significant improvements:
| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Total Pipeline Time | 8.65ms | 5.19ms | 🚀 40.0% FASTER |
| Postprocessing Time | 5.30ms | 1.52ms | 🚀 71.3% FASTER |
| CPU Efficiency | High overhead | Reduced by ~40% | Significant |

This optimization eliminates 3.46ms of processing time per frame when annotated images are not needed, which is the common case in production robotics applications where object detection results are consumed programmatically rather than visually.
Optimization #2: RGB888 Camera Format
Problem Analysis
After eliminating the postprocessing bottleneck, MediaPipe inference became the primary performance constraint. Analysis revealed that the BGR→RGB color conversion in preprocessing, combined with MediaPipe's internal processing of converted frames, was creating inefficiencies.
Implementation
I switched the camera configuration from YUYV to RGB888 format, allowing direct RGB input to MediaPipe:
def process_frame(self, frame: np.ndarray, timestamp: float) -> Optional[Dict]:
    """Process frame for object detection."""
    try:
        # Optimization: Check if frame is already in RGB format (camera_format=RGB888)
        if frame.shape[2] == 3:  # Ensure it's a 3-channel image
            # Direct RGB input from camera eliminates BGR→RGB conversion
            rgb_frame = frame
            self.log_buffered_event(
                'PREPROCESSING_OPTIMIZED',
                'Using direct RGB input - skipping BGR→RGB conversion',
                frame_shape=str(frame.shape)
            )
        else:
            # Fallback: Convert BGR to RGB for MediaPipe
            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        # Create MediaPipe image with direct RGB data
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)

Camera Launch Configuration:

ros2 launch gesturebot object_detection.launch.py camera_format:=RGB888

Technical Benefits
The RGB888 format optimization provides multiple performance improvements:
- Eliminates BGR→RGB Conversion: Removes the expensive cv2.cvtColor() operation in preprocessing
- MediaPipe Efficiency: Direct RGB input is processed more efficiently by MediaPipe's internal algorithms
- Memory Bandwidth Reduction: Fewer memory copy operations and intermediate buffer allocations
- Cache Efficiency: Better memory access patterns with consistent RGB format throughout the pipeline
Performance Impact
The RGB888 optimization delivered substantial additional improvements:
| Metric | After Opt #1 | After Opt #2 | Additional Improvement |
| --- | --- | --- | --- |
| Total Pipeline Time | 5.19ms | 2.71ms | 🚀 47.8% FASTER |
| MediaPipe Time | 2.47ms | 0.81ms | 🚀 67.2% FASTER |
| Preprocessing Time | 1.30ms | 0.83ms | 🚀 36.2% FASTER |

The MediaPipe inference time reduction of 67.2% was particularly significant, transforming it from the primary bottleneck to a well-optimized component.
Performance Measurement Infrastructure
Buffered Logging System
To collect detailed performance data without impacting measurements, I implemented a configurable buffered logging system:
class BufferedLogger:
    def __init__(self, enabled: bool = True, max_size: int = 200):
        self.enabled = enabled
        self.buffer = []
        self.max_size = max_size
        self.mode = 'circular' if enabled else 'disabled'

    def log_event(self, event_type: str, message: str, **kwargs):
        if not self.enabled:
            return

        event = {
            'timestamp': time.perf_counter(),
            'event_type': event_type,
            'message': message,
            **kwargs
        }

        if len(self.buffer) >= self.max_size:
            self.buffer.pop(0)  # Remove oldest entry
        self.buffer.append(event)

Real-Time Performance Publishing
The system publishes aggregated performance metrics every 5 seconds to ROS topics for real-time monitoring:
def publish_performance_stats(self):
    """Publish performance statistics to ROS topic."""
    if not self.enable_performance_tracking:
        return

    current_time = time.perf_counter()
    period_duration = current_time - self.stats.period_start_time

    if period_duration >= self.stats.period_duration:
        # Calculate averages
        avg_preprocessing = self.stats.total_preprocessing_time / self.stats.frames_processed
        avg_mediapipe = self.stats.total_mediapipe_time / self.stats.frames_processed
        avg_postprocessing = self.stats.total_postprocessing_time / self.stats.frames_processed
        avg_total = avg_preprocessing + avg_mediapipe + avg_postprocessing

        # Create and publish performance message
        perf_msg = PerformanceMetrics()
        perf_msg.header.stamp = self.get_clock().now().to_msg()
        perf_msg.current_fps = self.stats.frames_processed / period_duration
        perf_msg.avg_preprocessing_time = avg_preprocessing
        perf_msg.avg_mediapipe_time = avg_mediapipe
        perf_msg.avg_postprocessing_time = avg_postprocessing
        perf_msg.avg_total_pipeline_time = avg_total

        self.performance_publisher.publish(perf_msg)

Statistical Validation
I used consistent testing methodology to ensure reliable measurements:
- Test Duration: 30-second measurement periods
- Data Points: 5-6 measurement intervals per test
- Consistency Validation: Multiple test runs to verify reproducibility
- Variance Analysis: Standard deviation calculations to assess measurement stability
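As a rough illustration of that methodology, the snippet below shows one way the per-interval averages could be reduced to a mean and standard deviation. The interval values are made-up placeholders for demonstration, not measurements from the tests reported in this post.

import statistics

# Hypothetical per-interval pipeline averages (ms) from one 30-second test run;
# placeholder values only, not the actual measurements above.
interval_pipeline_times_ms = [2.68, 2.74, 2.70, 2.73, 2.69, 2.72]

mean_ms = statistics.mean(interval_pipeline_times_ms)
stdev_ms = statistics.stdev(interval_pipeline_times_ms)

# A small relative spread suggests the measurement is stable across intervals
relative_spread = stdev_ms / mean_ms
print(f"mean={mean_ms:.2f} ms, stdev={stdev_ms:.3f} ms, spread={relative_spread:.1%}")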
Results Summary
Cumulative Performance Improvement
The systematic optimization approach achieved exceptional results:
| Configuration | Total Pipeline Time | Improvement from Baseline | Cumulative Improvement |
| --- | --- | --- | --- |
| Baseline (YUYV + Full Processing) | 8.65ms | - | - |
| Optimization #1 (Conditional Processing) | 5.19ms | 40.0% faster | 40.0% |
| Final Optimized (RGB888 + Conditional) | 2.71ms | 68.7% faster | 68.7% |

Stage-by-Stage Transformation
| Pipeline Stage | Baseline | After Opt #1 | Final Optimized | Total Improvement |
| --- | --- | --- | --- | --- |
| Preprocessing | 1.22ms (14%) | 1.30ms (25%) | 0.83ms (31%) | 32.0% faster |
| MediaPipe | 2.13ms (25%) | 2.47ms (48%) | 0.81ms (30%) | 62.0% faster |
| Postprocessing | 5.30ms (61%) | 1.52ms (29%) | 1.06ms (39%) | 80.0% faster |

Mathematical Validation
The cumulative improvement follows the expected multiplicative formula:
Total Improvement = 1 - (1 - Opt1) × (1 - Opt2)
                  = 1 - (1 - 0.40) × (1 - 0.478)
                  = 1 - (0.60 × 0.522)
                  = 68.7% ✓

Technical Insights
Bottleneck Migration Pattern
The optimization process revealed an important pattern of bottleneck migration:
- Initial State: Postprocessing dominated (61% of pipeline time)
- After Optimization #1: MediaPipe became the primary bottleneck (48% of pipeline time)
- Final State: Balanced pipeline with no dominant bottleneck (30-39% distribution)
This demonstrates the importance of iterative optimization and continuous measurement, as eliminating one bottleneck often reveals the next performance constraint.
Optimization Sequencing Strategy
The success of this optimization effort validates several key principles:
- Measure First: Comprehensive performance measurement infrastructure enabled data-driven decisions
- Target the Largest Bottleneck: Addressing postprocessing first provided the highest initial impact
- Iterative Approach: Sequential optimization revealed secondary bottlenecks that weren't initially apparent
- Validate Each Step: Statistical validation ensured that improvements were real and reproducible
Camera Format Selection Impact
The RGB888 camera format optimization provided insights into the importance of data format consistency throughout processing pipelines:
- Format Conversion Overhead: BGR→RGB conversion consumed significant CPU cycles
- MediaPipe Efficiency: Direct RGB input dramatically improved inference performance
- Memory Access Patterns: Consistent format reduced cache misses and memory bandwidth requirements
Performance Measurement Best Practices
The measurement infrastructure development highlighted several critical practices:
- Non-Intrusive Logging: Buffered logging prevents measurement artifacts
- Aggregated Metrics: 5-second averaging provides stable, meaningful performance data
- Statistical Rigor: Multiple measurement periods enable confidence interval calculation
- Real-Time Monitoring: ROS topic publishing allows live performance observation
Conclusion
This optimization project demonstrates the effectiveness of systematic, measurement-driven performance improvement. By implementing comprehensive performance tracking, identifying bottlenecks through data analysis, and applying targeted optimizations, I achieved a 68.7% reduction in object detection pipeline processing time.
The key success factors were:
- Robust Measurement Infrastructure: Accurate, non-intrusive performance tracking enabled data-driven optimization decisions
- Bottleneck-Focused Approach: Targeting the largest performance constraints first maximized improvement impact
- Format Optimization: Aligning data formats throughout the pipeline eliminated unnecessary conversions
- Conditional Processing: Smart resource management reduced computational overhead when full processing isn't needed
The optimized pipeline now processes frames in 2.71ms compared to the original 8.65ms, providing substantial headroom for additional robotics processing tasks while maintaining full object detection functionality. This work demonstrates that significant performance improvements are achievable through systematic analysis and targeted optimization, even in complex multi-stage processing pipelines.
-
Refactoring Buffered Logging in ROS 2 Vision Pipelines
08/10/2025 at 20:51 • 0 comments

When building real-time computer vision systems with ROS 2, diagnostic logging becomes critical for debugging complex processing pipelines. However, poorly designed logging systems can create more confusion than clarity. I recently refactored the buffered logging system in my GestureBot object detection node, transforming a confusing, duplicated implementation into a clean, reusable architecture that other robotics developers can learn from.
The Problem: Misleading Abstractions and Technical Debt
The original buffered logging system suffered from several fundamental issues that made it difficult to use and maintain:
Confusing Terminology: The system used "production mode" and "debug mode" labels that didn't reflect actual behavior. "Production mode" suggested it was only for deployment, while "debug mode" implied it was only for development. In reality, both modes had legitimate use cases across different scenarios.
Inconsistent Timer Behavior: The system used a 120-second timer for "debug mode" and a 10-second timer for "production mode." This inconsistency made it difficult to predict when diagnostic information would be available.
Code Duplication: The BufferedLogger class was implemented directly in object_detection_node.py, making it impossible for other vision nodes (gesture detection, face detection) to reuse the same logging infrastructure.

Unclear Parameters: Launch file parameters like enable_debug_buffer obscured what the system actually did, requiring developers to read implementation details to understand behavior.

Solution Architecture: Behavior-Based Design
I redesigned the system around three core principles: clear behavioral naming, consistent timing, and reusable architecture.
1. Renamed Modes to Reflect Actual Behavior
The new system uses descriptive names that immediately communicate what each mode does:
# Before: Confusing mode names
'mode': 'debug' if self.debug_mode else 'production'

# After: Behavior-based naming
'mode': 'unlimited' if self.unlimited_mode else 'circular'

Circular Mode: Uses a fixed-size circular buffer (200 entries) with automatic dropping when full. Ideal for continuous monitoring with bounded memory usage.
Unlimited Mode: Allows unlimited buffer growth with timer-only flushing. Perfect for comprehensive diagnostic sessions where you need complete event history.
Disabled Mode: No buffering overhead, only critical errors logged directly. Optimal for production deployments where performance is paramount.
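To make the three behaviors concrete, here is a minimal sketch of how a constructor might map them onto buffer allocation. It mirrors the deque-based strategies described later in this post, but it is an illustrative stand-in, not the project's actual BufferedLogger implementation.

from collections import deque

class BufferedLoggerSketch:
    """Illustrative only: maps circular / unlimited / disabled modes to buffer types."""

    def __init__(self, buffer_size: int = 200, unlimited_mode: bool = False, enabled: bool = True):
        if not enabled:
            self.mode = 'disabled'
            self.buffer = None                       # No buffering overhead at all
        elif unlimited_mode:
            self.mode = 'unlimited'
            self.buffer = deque()                    # Unbounded; flushed only by the timer
        else:
            self.mode = 'circular'
            self.buffer = deque(maxlen=buffer_size)  # Oldest entries dropped automatically

    def log_event(self, event_type: str, message: str, **metadata):
        if self.buffer is None:
            return  # Disabled mode: skip buffering entirely
        self.buffer.append({'type': event_type, 'message': message, **metadata})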
2. Standardized Timer Intervals
I unified the timer interval to 10 seconds across all modes, eliminating the arbitrary distinction between 120-second and 10-second intervals:
# Before: Inconsistent timing
flush_interval = 120.0 if enable_debug_buffer else 10.0

# After: Consistent 10-second intervals
self.buffer_flush_timer = self.create_timer(10.0, self._flush_buffered_logger)

This change provides more responsive feedback while maintaining reasonable performance characteristics.
3. Moved to Base Class Architecture
The most significant architectural improvement was moving BufferedLogger from the specific object detection node to the base MediaPipeBaseNode class:

# vision_core/base_node.py
class MediaPipeBaseNode(Node, ABC):
    def __init__(self, node_name: str, feature_name: str,
                 config: ProcessingConfig,
                 enable_buffered_logging: bool = True,
                 unlimited_buffer_mode: bool = False):
        super().__init__(node_name)

        # Initialize buffered logging for all MediaPipe nodes
        self.buffered_logger = BufferedLogger(
            buffer_size=200,
            logger=self.get_logger(),
            unlimited_mode=unlimited_buffer_mode,
            enabled=enable_buffered_logging
        )

This inheritance-based approach means any new vision node automatically gains sophisticated logging capabilities without code duplication.
4. Updated Launch File Parameters
The launch file parameters now clearly communicate their purpose:
# Before: Unclear parameter names
declare_enable_debug_buffer = DeclareLaunchArgument(
    'enable_debug_buffer',
    default_value='false',
    description='Enable comprehensive debug buffer logging...'
)

# After: Behavior-focused parameters
declare_unlimited_buffer_mode = DeclareLaunchArgument(
    'unlimited_buffer_mode',
    default_value='false',
    description='Enable unlimited buffer mode (timer-only flushing for comprehensive diagnostics)...'
)

Implementation Details and Performance Characteristics
Memory Management Strategy
Each mode implements a different memory management strategy optimized for its use case:
Circular Mode uses Python's deque(maxlen=200) for automatic memory bounds:

# Automatic memory management
self.buffer = deque(maxlen=self.buffer_size)  # Auto-drops old entries

Unlimited Mode uses an unbounded deque() for comprehensive logging:

# Unlimited growth for complete diagnostics
self.buffer = deque()  # No maxlen restriction

Disabled Mode eliminates buffer allocation entirely:

# Zero memory overhead
self.buffer = None

Event Logging Interface
The refactored system provides a clean interface for logging diagnostic events:
# Simple event logging with metadata
self.log_buffered_event(
    'IMAGE_PROCESSING_START',
    'Starting annotated image processing',
    frame_size=(640, 480),
    processing_time_ms=15.2
)

Integration with ROS 2 Ecosystem
The refactored system integrates cleanly with ROS 2 parameter management:

# Runtime parameter inspection
ros2 service call /object_detection_node/get_parameters rcl_interfaces/srv/GetParameters \
  "{names: ['unlimited_buffer_mode', 'buffer_logging_enabled']}"

# Response shows current configuration
# unlimited_buffer_mode: False, buffer_logging_enabled: True
-
Simplifying MediaPipe Vision Processing
08/10/2025 at 20:46 • 0 comments

In my recent work on the GestureBot vision system, I made several architectural improvements that significantly simplified the codebase while maintaining performance. Here's what I learned about building robust MediaPipe-based vision pipelines in ROS 2.
The Problem: Over-Engineering for Simplicity
Initially, I implemented a complex architecture with ComposableNodes, thread pools, and async processing patterns. The system created a new thread for every camera frame and used intricate callback checking mechanisms. While this seemed like a performance optimization, it introduced unnecessary complexity:
# Old approach - complex threading
threading.Thread(
    target=self._process_frame_async,
    args=(cv_image, timestamp),
    daemon=True
).start()

# Complex callback checking after submission
if self.processing_lock.acquire(blocking=False):
    # Process and check callback results...

I refactored the entire system to use a straightforward synchronous approach that separates concerns cleanly:
1. Converted from ComposableNode to Regular Node Architecture
Before:
camera_container = ComposableNodeContainer(
    name='object_detection_camera_container',
    package='rclcpp_components',
    executable='component_container',
    composable_node_descriptions=[
        ComposableNode(package='camera_ros', plugin='camera::CameraNode')
    ]
)

After:
camera_node = Node(
    package='camera_ros',
    executable='camera_node',
    name='camera_node',
    namespace='camera'
)

Why this works better: Since my object detection node runs in Python and can't be part of the same composable container anyway, using regular nodes eliminates complexity without sacrificing performance.
2. Separated Processing Contexts
I redesigned the processing flow to have two distinct, non-blocking contexts:
def image_callback(self, msg: Image) -> None:
    """Simple synchronous image processing callback."""
    cv_image = self.cv_bridge.imgmsg_to_cv2(msg, 'bgr8')
    timestamp = time.time()

    # Process frame synchronously - no threading complexity
    results = self.process_frame(cv_image, timestamp)
    if results is not None:
        self.publish_results(results, timestamp)

Key insight: Instead of checking MediaPipe callbacks after submission, I let MediaPipe's callback system handle result publishing directly. This eliminates the need for complex synchronization between submission and result retrieval.
3. Fixed MediaPipe Message Conversion Robustness
MediaPipe sometimes returns None values for bounding box coordinates and confidence scores. I added comprehensive None-value handling:

# Handle None values explicitly
origin_x = getattr(bbox, 'origin_x', None)
msg.bbox_x = int(origin_x) if origin_x is not None else 0

# Robust confidence assignment with multiple fallback approaches
if score_val is not None:
    confidence_val = float(score_val)
else:
    confidence_val = 0.0

try:
    msg.confidence = confidence_val
except:
    object.__setattr__(msg, 'confidence', confidence_val)

This eliminated the persistent "<function DetectedObject.confidence at 0x...> returned a result with an exception set" errors that were blocking the system.

4. Added Shared Memory Transport for Performance
While simplifying the architecture, I maintained performance by enabling shared memory transport. This provides most of the performance benefits of ComposableNodes without the architectural complexity.
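The post doesn't show the exact mechanism used, so treat the following as one possible approach rather than the project's configuration: with the default Fast DDS RMW, shared-memory transport can be steered through an XML profile referenced from the launch file via the FASTRTPS_DEFAULT_PROFILES_FILE environment variable. The profile path below is a placeholder.

from launch import LaunchDescription
from launch.actions import SetEnvironmentVariable
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        # Point the Fast DDS middleware at an XML profile that enables its
        # shared-memory transport (profile path is a placeholder).
        SetEnvironmentVariable('FASTRTPS_DEFAULT_PROFILES_FILE',
                               '/path/to/shm_profile.xml'),
        Node(
            package='camera_ros',
            executable='camera_node',
            name='camera_node',
            namespace='camera'
        ),
    ])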
5. Cleaned Up Topic Namespace
I consolidated all camera-related topics under a clean /camera/ namespace:

remappings=[
    ('~/image_raw', '/camera/image_raw'),
    ('~/camera_info', '/camera/camera_info'),
]

This eliminates duplicate topics like /camera_node/image_raw and /camera/image_raw that were causing confusion.

Results: Better Performance Through Simplicity
The refactored system achieves:
- Eliminated threading overhead: No more thread creation per frame
- Cleaner error handling: Robust None-value processing prevents crashes
- Simplified debugging: Linear execution flow is easy to trace
- Maintained performance: Shared memory transport provides efficient image transfer
- Clean topic structure: Single source of truth for camera data
-
Mechanical Design and Hardware Integration Notes
08/08/2025 at 17:43 • 0 commentsHardware Platform
- Base: iRobot Create 2 (Roomba)
- Structural frame: 3/4" Schedule 40 PVC pipe
- Custom parts: 3D‑printed base bracket and upper “blue” electronics enclosure
- Sensors and compute (upper assembly): Raspberry Pi 5 + active cooler, camera, MPU6050 IMU, top‑mounted LiDAR, 3S LiPo, 5 V regulator, wiring harnesses
Why iRobot Create 2 (Roomba) as the base
I chose the Create 2 because:
- It is a proven, rugged differential‑drive platform with integrated motor drivers, encoders, bump sensors, cliff sensors, and a charge dock interface.
- The Open Interface (OI) provides documented serial control for motion and telemetry, which simplifies bring‑up and reduces the number of custom PCBs I need to maintain.
- The chassis carries batteries low in the body, giving a naturally low center of mass that helps with the tall mast structure.
- Replacement parts and batteries are widely available; consumables (wheels, brushes) are inexpensive.
In short, it gives me reliable locomotion and power infrastructure so I can focus engineering time on perception and interaction.
Structural Framework: 3/4" Schedule 40 PVC
I built the superstructure as a four‑post mast using standard 3/4" Schedule 40 PVC with printed sockets at the base and a printed upper enclosure that captures the posts.
PVC framework rationale
- Cost‑effectiveness: PVC pipe and fittings cost a fraction of aluminum extrusion and require no specialty tooling. I can build and iterate for a few dollars per meter.
- Structural rigidity: For a ~1–1.2 m mast, four 3/4" PVC uprights provide adequate bending stiffness when posts are constrained at both ends; adding a single mid‑height brace eliminates noticeable sway.
- Lightweight: Low mass keeps the center of gravity near the Roomba deck, improving tip resistance during sudden stops or dock approaches.
- Modularity: I cut posts to length and swap elbows/tees to reconfigure sensor height in minutes. Printed collars give me mounting points exactly where I need them.
- Easy iteration: I can drill, ream, and solvent‑bond or simply screw into PVC without worrying about galvanic corrosion or thread wear in thin‑wall aluminum.
Practical tip: I lightly ream the pipe OD and size printed sockets with +0.3 to +0.5 mm clearance, then use two self‑tapping screws per joint. This holds under vibration and still allows disassembly.
Base Bracket (3D‑printed)
The base bracket is a circular plate that sits on the Roomba’s top deck and presents four vertical sockets for the PVC posts.
Design choices:
- I align the sockets on a square bolt circle to match the upper enclosure’s posts; this prevents torsion and keeps the mast square.
- The bracket uses the Create 2’s existing screw bosses for anchoring (no drilling in the shell). I embed heat‑set inserts in the print so I can torque fasteners without crushing plastic.
- Filleted ribs radiate from each socket into the center ring to distribute mast loads and survive side hits.
Material and print settings:
- PETG or ABS at 30–40% gyroid infill, four perimeters, 0.24–0.28 mm layer height. PETG gives enough ductility to absorb bumps without cracking.
Upper Assembly (“Blue Enclosure”)
The upper enclosure is a printed housing that integrates compute, power, and sensors while acting as the frame’s top plate. It also provides an easy surface for future sensors and user interfaces.
What I integrated
- Raspberry Pi 5 (8 GB) with the official active cooler
- 5 V buck regulator (≥ 5 A) from the 3S LiPo rail
- IMX219 camera module (front‑facing), recessed window
- MPU6050 IMU (mounted near the enclosure’s CG to reduce rotational noise)
- Top‑mounted LiDAR (clear 360° FOV, minimal occlusion from the mast)
- 3‑cell LiPo battery with inline fuse and master switch
- Cable glands and internal harnesses
Thermal management
- I treated the Pi 5 cooler as a forced‑air inlet and provided exhaust vents on the opposite wall. Short, straight flow paths are more effective than decorative perforations.
- Mounting bosses standoff the Pi to keep airflow under the board and let the radiator breathe.
- During sustained computer vision tasks, case temps stayed below ~72 °C with the fan at default curve.
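For reference, this is roughly how such a temperature check can be done on the Pi: the sketch below polls the SoC thermal zone that Linux exposes under /sys/class/thermal. It is a generic monitoring snippet, not tied to any GestureBot code.

import time

THERMAL_ZONE = '/sys/class/thermal/thermal_zone0/temp'  # SoC temperature in millidegrees C

def read_soc_temp_c():
    """Read the SoC temperature from the Linux thermal zone interface."""
    with open(THERMAL_ZONE) as f:
        return int(f.read().strip()) / 1000.0

if __name__ == '__main__':
    # Log the temperature once per second while a vision workload runs
    while True:
        print(f"SoC temperature: {read_soc_temp_c():.1f} °C")
        time.sleep(1.0)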
Cable management and signal routing
- I routed all power and signal lines inside the PVC uprights. Two strategies worked best:
- Slotted posts: A 6 mm slot near each end allows the cable to enter/exit without visible loops.
- Printed clip rings: Snap‑on rings with zip‑tie slots prevent cable rattle.
- Power architecture:
- 3S LiPo → 5 V buck (Pi 5 + camera + small peripherals)
- LiDAR on a dedicated 5 V rail with its own inline fuse to prevent brown‑outs on Pi load spikes
- Roomba Interface:
- I use a USB‑to‑TTL serial adapter (3.3 V logic) to the Create 2 Open Interface, routed through the mast. A small isolation board and a common ground tie keep noise out of the IMU.
- UART wiring is strain‑relieved inside the enclosure; the connector is service‑looped for easy removal.
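As a bring-up sanity check on that serial link, something like the pyserial sketch below can be used. The port name is a placeholder, and the 115200 baud default and the Start (128) / Safe (131) / Sensors (142) opcodes come from iRobot's published Open Interface spec, not from this project's code.

import time
import serial  # pyserial

# Placeholder device path for the USB-to-TTL adapter
PORT = '/dev/ttyUSB0'

with serial.Serial(PORT, baudrate=115200, timeout=1.0) as oi:
    oi.write(bytes([128]))   # Start: open the Open Interface (Passive mode)
    time.sleep(0.1)
    oi.write(bytes([131]))   # Safe: allow drive commands with cliff/bump safety intact
    time.sleep(0.1)
    # Request sensor packet 21 (charging state) as a simple round-trip check
    oi.write(bytes([142, 21]))
    print('charging state byte:', oi.read(1))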
Serviceability
- The lid comes off with four machine screws. All boards mount to threaded brass inserts; nothing self‑tapped into plastic.
- The battery is strapped to a removable tray with a finger‑pull, so I can swap packs without touching the rest of the wiring.
Component Selection Rationale
- Raspberry Pi 5: Enough CPU for real‑time perception pipelines; strong community support; camera and LiDAR libraries are mature on Ubuntu.
- IMX219 camera: Small, easy to mount flush in the enclosure, validated at 30 fps with the libcamera stack.
- MPU6050: Commodity IMU with acceptable drift for mobile base stabilization and motion gating; easy to filter at 200 Hz on the Pi.
- Top‑mounted LiDAR: Clear line of sight above people and furniture; the mast keeps occlusions outside of the primary scan plane.
- 3S LiPo + buck: Keeps high current off the Roomba’s internal 5 V; isolates compute from base brown‑outs and gives me a clean power budget.
Mechanical Integration Details
- Fasteners: Almost everything is M3 or M4 to simplify spares. Steel washers are used where printed parts interface with PVC to prevent local crushing.
- Tolerances: I leave 0.2–0.3 mm clearance for printed‑to‑printed fits and 0.4–0.5 mm for printed‑to‑PVC sockets; this range consistently assembles without post‑processing on a 0.4 mm nozzle.
- Vibration: A single cross‑brace mid‑mast reduces LiDAR ringing and camera shake noticeably. If I add heavier sensors later, I’ll upgrade to a truss‑style brace that bolts to captured nuts in the posts.
Why PVC Instead of Aluminum Extrusion
| Criterion | PVC (3/4" Sch 40) | 2020/2040 Aluminum Extrusion |
| --- | --- | --- |
| Cost | Very low | Moderate to high |
| Stiffness/weight for this height | Adequate with bracing | Higher |
| Tooling | Hand saw, drill, small screws | Chop saw, tapping, brackets |
| Modularity | Cut‑to‑length, printed adapters | Excellent with slot hardware |
| Iteration speed | Very fast | Moderate |
| Aesthetics | Utility‑grade | Professional |

For a research robot that changes every week, the speed and cost advantages of PVC outweigh the stiffness and finish benefits of extrusion. When the design freezes, I can translate the printed sockets to aluminum adapters if needed.
Assembly Summary
- Print base bracket, post sockets, and upper enclosure components with embedded heat‑set inserts.
- Cut four PVC posts to length; drill cable entry/exit slots.
- Mount the base bracket to the Create 2 using existing bosses; verify level.
- Route the harnesses through posts, then seat posts into the base bracket and temporarily pin with screws.
- Install the upper enclosure, capture the posts, and secure all joints.
- Fit the Pi 5, regulator, IMU, camera, LiDAR, and battery; complete wiring with fuses and the master switch.
- Bring up power rails, verify voltages, and perform sensor smoke tests before closing the lid.
Integration Challenges and Solutions
- EMI into IMU at motor start: I added a small LC filter and routed the IMU cable away from the main battery run; problem disappeared.
- LiDAR cable strain: A printed strain‑relief block under the top lid ensures the connector cannot back out with mast flex.
- Tip stability during docking: Adding a small cross‑brace reduced mast oscillations that could trigger the Roomba’s safety bumpers.
Design Goals vs. Outcomes
- Cost: Achieved. PVC + printed parts kept the mechanical BOM inexpensive.
- Functionality: Achieved. Clear FOV for the camera and LiDAR, with clean sensor cabling.
- Modularity: Achieved. I can reconfigure height and add shelves in under an hour.
- Ease of assembly/maintenance: Achieved. All fasteners are accessible; boards mount to inserts; battery swaps are fast.
Bill of Materials (mechanical/electro‑mechanical, major items)
- iRobot Create 2 (base)
- 3/4" Schedule 40 PVC pipe + self‑tapping screws
- 3D‑printed base bracket, post sockets, upper enclosure, strain‑reliefs
- Raspberry Pi 5 + official active cooler
- 5 V high‑current buck regulator (≥ 5 A)
- IMX219 camera module + acrylic window
- MPU6050 IMU breakout
- 2D LiDAR (top mount)
- 3S LiPo battery, inline fuse, master switch
- USB‑to‑TTL serial adapter (3.3 V) for Roomba OI
- Cable glands, wire, ferrules, heat‑shrink
Vipin M