-
Building an Autonomous Person Following System with Computer Vision
08/15/2025 at 14:47 • 0 comments

When your robot becomes your shadow – implementing intelligent person following with object detection and ROS 2
Imagine a robot that follows you around like a loyal companion, maintaining the perfect distance whether you're walking through a warehouse, giving a facility tour, or need hands-free assistance. While gesture and pose control are great for direct commands, sometimes you want your robot to simply tag along autonomously. That's exactly what we've built into GestureBot – a standalone person following system that transforms any detected person into a moving target for smooth, intelligent pursuit.
The Appeal of Autonomous Following
Person following robots aren't just cool demos – they solve real problems. Consider a hospital robot carrying supplies that needs to follow a nurse through rounds, a security robot accompanying a guard on patrol, or a service robot helping someone navigate a large facility. In these scenarios, constant manual control becomes tedious and impractical.
The key insight is that following behavior should be completely autonomous once activated. No gestures, no poses, no commands – just intelligent tracking that maintains appropriate distance while handling the inevitable challenges of real-world environments: people walking behind obstacles, multiple individuals in the scene, varying lighting conditions, and the need for smooth, non-jerky motion that won't startle or annoy.
Leveraging Existing Object Detection Infrastructure
Rather than building a specialized person tracking system from scratch, we cleverly repurpose GestureBot's existing object detection capabilities. The system already runs MediaPipe's EfficientDet model at 5 FPS, detecting 80 different object classes including people with confidence scores and precise bounding boxes.
This architectural decision provides several advantages: proven stability, existing performance optimizations, and the ability to simultaneously track people and obstacles. The object detection system publishes to /vision/objects, providing a stream of detected people that our following controller can consume.

```
# Object detection provides person detections like this:
DetectedObject {
  class_name: "person"
  confidence: 0.76
  bbox_x: 145        # Top-left corner
  bbox_y: 89
  bbox_width: 312    # Bounding box dimensions
  bbox_height: 387
}
```

The person following controller subscribes to this stream and implements sophisticated logic to select, track, and follow the most appropriate person in the scene.
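To make that data flow concrete, here is a minimal sketch of how a follower node might consume the stream. The message package and type names (`gesturebot_msgs`, `DetectedObjectArray` with an `objects` field) are assumptions for illustration; the actual interfaces live in the GestureBot codebase.

```python
import rclpy
from rclpy.node import Node

# Assumed message package/type for illustration only
from gesturebot_msgs.msg import DetectedObjectArray


class PersonStreamListener(Node):
    """Minimal consumer of /vision/objects that keeps only person detections."""

    def __init__(self):
        super().__init__('person_stream_listener')
        self.subscription = self.create_subscription(
            DetectedObjectArray, '/vision/objects', self.on_objects, 10)

    def on_objects(self, msg):
        # Filter the 80-class detection stream down to people
        people = [d for d in msg.objects if d.class_name == 'person']
        if people:
            self.get_logger().info(f'{len(people)} person(s) in view')


def main():
    rclpy.init()
    node = PersonStreamListener()
    rclpy.spin(node)
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```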
Smart Person Selection: More Than Just "Pick the Biggest"
When multiple people appear in the camera view, the system needs to intelligently choose who to follow. Our selection algorithm uses a weighted scoring system that considers three key factors:
Size Score (40% weight): Larger bounding boxes typically indicate closer people or those more prominently positioned in the scene. This naturally biases toward the person most likely intended as the target.
Center Score (30% weight): People closer to the image center are preferred, following the reasonable assumption that users position themselves centrally when activating following mode.
Confidence Score (30% weight): Higher detection confidence indicates more reliable tracking, reducing the chance of following false positives or poorly detected individuals.
```python
def select_initial_target(self, people):
    scored_people = []
    for person in people:
        # Normalize bounding box to 0-1 coordinates
        size_score = (person.bbox_width * person.bbox_height) / (640 * 480)
        center_x = (person.bbox_x + person.bbox_width / 2) / 640
        center_score = 1.0 - abs(center_x - 0.5) * 2
        total_score = (size_score * 0.4 +
                       center_score * 0.3 +
                       person.confidence * 0.3)
        scored_people.append((person, total_score))
    return max(scored_people, key=lambda x: x[1])[0]
```

Once a target is selected, the system maintains tracking continuity by matching people across frames based on position prediction, preventing erratic switching between similar individuals.
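The matching step itself isn't shown in the post; a minimal sketch of position-based continuity might look like the following, where the match threshold (0.15 in normalized image coordinates) is an illustrative value rather than the tuned one.

```python
def match_existing_target(self, people, predicted_center_x, max_offset=0.15):
    """Prefer the detection closest to where the tracked person is expected to be.

    predicted_center_x and max_offset are normalized [0, 1] image coordinates;
    the threshold here is illustrative, not the tuned project value.
    """
    best_person, best_offset = None, max_offset
    for person in people:
        center_x = (person.bbox_x + person.bbox_width / 2) / 640
        offset = abs(center_x - predicted_center_x)
        if offset < best_offset:
            best_person, best_offset = person, offset
    # Returning None signals that the target was lost this frame
    return best_person
```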
Distance Estimation: Computer Vision Meets Physics
Estimating distance from a single camera image is a classic computer vision challenge. Without stereo vision or depth sensors, we rely on the relationship between object size and distance – larger bounding boxes generally indicate closer people.
After extensive calibration with real-world measurements, we developed an empirical mapping between normalized bounding box area and estimated distance:
```python
def estimate_distance(self, person):
    # Convert pixel coordinates to normalized area
    bbox_area = (person.bbox_width * person.bbox_height) / (640 * 480)

    # Empirical distance mapping calibrated for average person height
    if bbox_area > 0.55:      # 55%+ of image = very close
        return 1.0            # ~1 meter
    elif bbox_area > 0.35:    # 35-55% = near target distance
        return 1.5            # ~1.5 meters (target)
    elif bbox_area > 0.20:    # 20-35% = medium distance
        return 2.2            # ~2.2 meters
    # ... additional ranges up to 7 meters
```

This approach works surprisingly well for typical indoor environments, though it assumes average human height and doesn't account for unusual poses. To improve stability, we apply a 3-point moving average that smooths out frame-to-frame variations in bounding box size.
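The 3-point moving average can be as simple as the sketch below; the class and method names are illustrative, not taken from the project code.

```python
from collections import deque


class DistanceSmoother:
    """Smooth frame-to-frame distance estimates with a short moving average."""

    def __init__(self, window=3):
        self.samples = deque(maxlen=window)

    def update(self, raw_distance):
        self.samples.append(raw_distance)
        return sum(self.samples) / len(self.samples)
```

Feeding each raw estimate through `update()` damps single-frame jumps in bounding-box size without adding noticeable lag at 5 FPS.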
Following Behavior: The Art of Robotic Companionship
The core following logic implements two simultaneous control loops: distance maintenance and person centering. The distance controller calculates the error between current estimated distance and the target distance (1.5 meters), applying proportional control to generate forward/backward velocity commands.
The centering controller keeps the person positioned in the center of the camera view by calculating the horizontal offset and generating appropriate angular velocity commands. This dual-axis control creates natural following behavior that maintains both proper distance and orientation.
```python
def calculate_following_command(self, person, estimated_distance):
    # Distance control (linear velocity)
    distance_error = estimated_distance - self.target_distance  # 1.5m target
    linear_velocity = distance_error * 0.8                      # Control gain
    linear_velocity = max(-0.3, min(0.3, linear_velocity))      # Clamp limits

    # Centering control (angular velocity)
    center_x = (person.bbox_x + person.bbox_width / 2) / 640
    center_error = center_x - 0.5                               # 0.5 = image center
    angular_velocity = -center_error * 1.5                      # Control gain
    angular_velocity = max(-0.8, min(0.8, angular_velocity))    # Clamp limits

    return {'linear_x': linear_velocity, 'angular_z': angular_velocity}
```

Smooth Motion Through Advanced Control
Raw velocity commands would create jerky, uncomfortable robot motion. Our solution implements a sophisticated velocity smoothing system running at 25 Hz – much faster than the 10 Hz control calculation rate. This high-frequency loop applies acceleration limiting to gradually ramp velocities up and down.
The acceleration limits (1.0 m/s² linear, 2.0 rad/s² angular) are carefully tuned for the Raspberry Pi 5's processing capabilities while ensuring smooth motion. A typical acceleration from rest to 0.3 m/s takes about 0.3 seconds over 7-8 control cycles, creating natural-feeling motion that won't startle users or cause mechanical stress.
Critical to the system's success is preventing rapid target switching. The control system holds velocity commands for a minimum of 500ms, preventing oscillations that would occur if the robot constantly changed direction based on minor detection variations.
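A condensed sketch of how those pieces fit together is shown below. The class and helper names are illustrative; the 25 Hz rate, the 1.0 m/s² linear limit, and the 500 ms hold come from the description above, and only the linear axis is shown for brevity.

```python
import time


class VelocitySmoother:
    """25 Hz ramp toward the latest target velocity, with a minimum hold time."""

    def __init__(self, max_accel=1.0, min_hold_s=0.5, rate_hz=25.0):
        self.max_accel = max_accel        # m/s^2, linear axis only in this sketch
        self.min_hold_s = min_hold_s      # suppress rapid target switching
        self.dt = 1.0 / rate_hz
        self.current = 0.0
        self.target = 0.0
        self.last_target_change = 0.0

    def set_target(self, new_target):
        # Ignore new commands that arrive before the hold period expires
        now = time.monotonic()
        if now - self.last_target_change >= self.min_hold_s:
            self.target = new_target
            self.last_target_change = now

    def step(self):
        # Move toward the target by at most max_accel * dt per 25 Hz cycle
        max_change = self.max_accel * self.dt
        error = self.target - self.current
        self.current += max(-max_change, min(max_change, error))
        return self.current
```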
Safety First: Multiple Protection Layers
Autonomous following requires robust safety systems. Our implementation includes several protection mechanisms:
Distance Limits: The robot won't approach closer than 0.8 meters (safety zone) and stops following if the person exceeds 5.0 meters (lost target). These limits prevent uncomfortable crowding and endless pursuit of distant figures.
Timeout Protection: If no person is detected for 3 seconds, the system automatically deactivates following mode and stops the robot. This handles cases where the target leaves the camera view or detection fails.
Emergency Override: The system monitors an /emergency_stop topic, immediately halting motion if any other system component detects a problem.

Backward Motion Safety: When a person gets too close, the robot smoothly backs away rather than stopping abruptly, maintaining comfortable interpersonal distance.
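Taken together, these layers amount to a small gate in front of every velocity command. A sketch under the stated limits (0.8 m safety zone, 5.0 m lost-target, 3 s timeout) might look like this; the function name, attribute names, and the gentle -0.1 m/s backing speed are illustrative.

```python
def apply_safety_limits(self, command, estimated_distance,
                        seconds_since_detection, emergency_stop_active):
    """Return a safe velocity command, or a full stop if any protection triggers."""
    stop = {'linear_x': 0.0, 'angular_z': 0.0}

    # Emergency override and detection timeout always win
    if emergency_stop_active or seconds_since_detection > 3.0:
        return stop

    # Lost target: beyond 5.0 m we stop following rather than chase
    if estimated_distance > 5.0:
        return stop

    # Safety zone: inside 0.8 m only allow gentle backward motion
    if estimated_distance < 0.8:
        return {'linear_x': min(command['linear_x'], -0.1), 'angular_z': 0.0}

    return command
```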
ROS 2 Architecture: Clean and Modular
The person following system integrates seamlessly with ROS 2's publish-subscribe architecture. The person_following_controller node subscribes to /vision/objects from the existing object detection system and publishes velocity commands to /cmd_vel for the robot base.

Activation and deactivation happen through a simple service interface:

```bash
# Activate person following mode
ros2 service call /follow_mode/activate std_srvs/srv/SetBool "data: true"

# Deactivate following mode
ros2 service call /follow_mode/activate std_srvs/srv/SetBool "data: false"

# Monitor following behavior
ros2 topic echo /cmd_vel
```

This modular design means the following system can work with any robot platform that accepts standard ROS 2 velocity commands, making it broadly applicable beyond the original GestureBot hardware.
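On the controller side, that interface is a standard SetBool service. A minimal sketch of the server is shown below; the handler name, the `follow_active` flag, and the `publish_stop()` helper are assumptions for illustration.

```python
from std_srvs.srv import SetBool

# Inside the person_following_controller node's __init__ (sketch):
#   self.follow_active = False
#   self.activate_srv = self.create_service(
#       SetBool, '/follow_mode/activate', self.handle_activate)


def handle_activate(self, request, response):
    """Toggle following mode; stop the robot whenever following is disabled."""
    self.follow_active = request.data
    if not self.follow_active:
        self.publish_stop()  # assumed helper that publishes a zero Twist
    response.success = True
    response.message = 'following enabled' if request.data else 'following disabled'
    return response
```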
Real-World Performance: Lessons from Testing
In practice, the system performs remarkably well in typical indoor environments. Response time from person detection to robot motion is under 200ms, creating immediate feedback that feels natural. The robot maintains the target 1.5-meter distance with ±0.3-meter accuracy under normal conditions.
The biggest challenges come from environmental factors: people walking behind furniture create temporary occlusions that the system handles by maintaining last-known velocity until the person reappears. Multiple people in the scene occasionally cause target switching, though our scoring algorithm minimizes this issue.
Lighting variations affect object detection confidence but rarely cause complete tracking failure. The system gracefully degrades by reducing following speed when confidence drops, rather than stopping entirely.
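That degradation can be as simple as scaling the commanded speed by detection confidence; the function below is a sketch with illustrative thresholds, not the project's tuned values.

```python
def scale_speed_by_confidence(linear_velocity, confidence,
                              nominal_conf=0.6, floor=0.4):
    """Reduce following speed as detection confidence drops, without stopping outright."""
    factor = max(floor, min(1.0, confidence / nominal_conf))
    return linear_velocity * factor
```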
Hardware Performance: Pi 5 Delivers
Running on a Raspberry Pi 5 with 8GB RAM, the complete system (object detection + person following + ROS 2 navigation) consumes approximately 70-75% CPU at steady state. This leaves headroom for additional processing while maintaining stable performance.
The Pi Camera Module 3 provides sufficient image quality for reliable person detection at 640x480 resolution. Higher resolutions improve detection accuracy but reduce frame rate – the current configuration strikes an optimal balance for real-time following behavior.
Power consumption remains reasonable at ~8W total system power, making battery operation feasible for mobile applications.
Comparison with Other Control Methods
Person following complements rather than replaces gesture and pose control. Each method has its ideal use cases:
Gesture Control: Best for precise, intentional commands when the user wants direct robot control

Pose Control: Ideal for hands-free operation when gestures aren't practical

Person Following: Perfect for autonomous companionship when continuous manual control would be tedious
The beauty of the modular architecture is that users can switch between modes as needed, or even combine them – imagine a robot that follows you autonomously but responds to gesture overrides for specific actions.
Future Directions: Beyond Basic Following
The current implementation provides a solid foundation for more sophisticated behaviors. Future enhancements could include:
- Outdoor Operation: Adapting distance estimation for varying lighting and longer ranges
- Multi-Person Scenarios: Following groups or switching between designated individuals
- Predictive Tracking: Using motion prediction to handle temporary occlusions more gracefully
- Sensor Fusion: Integrating lidar or depth cameras for more accurate distance measurement
- Social Awareness: Adjusting following distance based on environmental context and social norms
-
4-Pose Navigation with MediaPipe
08/15/2025 at 14:41 • 0 comments

When hand gestures aren't enough, your whole body becomes the remote control
We've all been there – trying to control a robot with hand gestures while your hands are full, wearing gloves, or when lighting conditions make finger detection unreliable. What if your robot could understand your intentions through simple body poses instead? That's exactly what we've implemented in the latest iteration of GestureBot, a Raspberry Pi 5-powered robot that now responds to four distinct body poses for intuitive navigation control.
Why Body Poses Beat Hand Gestures
While hand gesture recognition is impressive, it has practical limitations. Gestures require clear hand visibility, specific lighting conditions, and can be ambiguous when multiple people are present. Body poses, on the other hand, are larger, more distinctive, and work reliably even when hands are obscured or busy with other tasks.
Consider a warehouse worker guiding a robot while carrying boxes, or a surgeon directing a medical robot while maintaining sterile conditions. Full-body pose detection opens up robotics applications where traditional gesture control falls short.
The Technical Foundation: MediaPipe Pose Detection
At the heart of our system lies Google's MediaPipe Pose Landmarker, which provides real-time detection of 33 body landmarks covering the entire human skeleton – from head to toe. Running on a Raspberry Pi 5 with 8GB RAM and a Pi Camera Module 3, we achieve stable 3-7 FPS pose detection at 640x480 resolution.
The MediaPipe model excels at tracking key body points including shoulders, elbows, wrists, hips, and the torso center. What makes this particularly powerful for robotics is the consistency of landmark detection even with partial occlusion or varying lighting conditions.
```python
# Core MediaPipe configuration optimized for Pi 5
pose_landmarker_options = {
    'base_options': BaseOptions(model_asset_path='pose_landmarker.task'),
    'running_mode': VisionRunningMode.LIVE_STREAM,
    'num_poses': 2,  # Track up to 2 people
    'min_pose_detection_confidence': 0.5,
    'min_pose_presence_confidence': 0.5,
    'min_tracking_confidence': 0.5
}
```

Simplicity Through Four Poses
After experimenting with complex pose vocabularies, we settled on four reliable poses that provide comprehensive robot control:
🙌 Arms Raised (Forward Motion): Both arms extended upward above shoulder level triggers forward movement at 0.3 m/s. This pose is unmistakable and feels natural for "go forward."
👈 Pointing Left (Turn Left): Left arm extended horizontally while right arm remains down commands a left turn at 0.8 rad/s. The asymmetry makes this pose highly distinctive.
👉 Pointing Right (Turn Right): Mirror of the left turn – right arm extended horizontally triggers rightward rotation.
🤸 T-Pose (Emergency Stop): Both arms extended horizontally creates the universal "stop" signal, immediately halting all robot motion.
The pose classification algorithm analyzes shoulder and wrist positions relative to the torso center, using angle calculations and position thresholds to distinguish between poses:
```python
def classify_pose(self, landmarks):
    # Extract key landmarks
    left_shoulder = landmarks[11]
    right_shoulder = landmarks[12]
    left_wrist = landmarks[15]
    right_wrist = landmarks[16]

    # Calculate arm angles relative to shoulders
    left_arm_angle = self.calculate_arm_angle(left_shoulder, left_wrist)
    right_arm_angle = self.calculate_arm_angle(right_shoulder, right_wrist)

    # Classify based on arm positions
    if left_arm_angle > 60 and right_arm_angle > 60:
        return "arms_raised"
    elif abs(left_arm_angle) < 30 and abs(right_arm_angle) < 30:
        return "t_pose"
    # ... additional classification logic
```

ROS 2 Integration: From Pose to Motion
The system architecture follows a clean pipeline: pose detection → classification → navigation commands → smooth motion control. Built on ROS 2 Jazzy, the implementation uses three main components:
Pose Detection Node: Processes camera frames through MediaPipe and publishes 33-point landmark data and classified pose actions to the /vision/poses topic.

Pose Navigation Bridge: Subscribes to pose classifications and converts them to velocity commands published on /cmd_vel. This node implements the critical safety and smoothing logic.

Velocity Smoothing System: Perhaps the most important component for real-world deployment, this 25 Hz control loop applies acceleration limiting to prevent jerky robot motion that could cause instability or discomfort.

```bash
# Launch the complete 4-pose navigation system
ros2 launch gesturebot pose_detection.launch.py
ros2 launch gesturebot pose_navigation_bridge.launch.py

# View real-time pose detection with skeleton overlay
ros2 launch gesturebot image_viewer.launch.py \
  image_topics:='["/vision/pose/annotated"]'
```

The navigation bridge includes multiple safety layers: pose confidence thresholds (0.7 minimum), timeout protection (2-second auto-stop), and velocity limits that prevent dangerous accelerations. If no valid pose is detected for two seconds, the robot automatically stops – a crucial safety feature for real-world deployment.
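A compressed sketch of the bridge's core callback and timeout check is shown below. The pose names, 0.7 confidence threshold, 2-second timeout, and velocity values follow the description above; the class structure and the injected `set_target_velocity` helper are assumptions for illustration.

```python
POSE_VELOCITIES = {
    'arms_raised':    (0.3, 0.0),    # forward at 0.3 m/s
    'pointing_left':  (0.0, 0.8),    # turn left at 0.8 rad/s
    'pointing_right': (0.0, -0.8),   # turn right at 0.8 rad/s
    't_pose':         (0.0, 0.0),    # emergency stop
}


class PoseNavigationBridgeSketch:
    """Core of the bridge: map classified poses to velocities, stop on timeout."""

    POSE_TIMEOUT_S = 2.0
    MIN_CONFIDENCE = 0.7

    def __init__(self, set_target_velocity):
        # set_target_velocity feeds the 25 Hz smoother (injected helper)
        self.set_target_velocity = set_target_velocity
        self.last_pose_time = 0.0

    def on_pose(self, pose_name, confidence, now):
        # Ignore low-confidence or unknown classifications entirely
        if confidence < self.MIN_CONFIDENCE or pose_name not in POSE_VELOCITIES:
            return
        self.last_pose_time = now
        self.set_target_velocity(*POSE_VELOCITIES[pose_name])

    def on_timer(self, now):
        # Auto-stop if no valid pose has been seen within the timeout window
        if now - self.last_pose_time > self.POSE_TIMEOUT_S:
            self.set_target_velocity(0.0, 0.0)
```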
Hardware: Raspberry Pi 5 Proves Its Worth
The Raspberry Pi 5 represents a significant leap in embedded AI capability. With its ARM Cortex-A76 quad-core processor and 8GB RAM, it handles MediaPipe pose detection while simultaneously running ROS 2 navigation, camera processing, and system monitoring. The Pi Camera Module 3's 12MP sensor with autofocus provides the image quality needed for reliable landmark detection.
Power consumption remains reasonable at ~8W total system power, making this suitable for battery-powered mobile robots. We've found that active cooling is beneficial for sustained operation, but not strictly necessary for typical use cases.
Real-World Performance and Applications
In practice, the 4-pose system feels remarkably natural. The poses are intuitive enough that new users can control the robot within minutes without training. Response time from pose detection to robot motion is under 200ms, providing immediate feedback that makes the interaction feel responsive.
The system shines in scenarios where traditional interfaces fail:
- Hands-free operation: Control robots while carrying objects or wearing protective equipment
- Distance control: Operate robots from across a room where gesture details would be invisible
- Multi-user environments: Body poses are less likely to trigger false positives from background activity
- Industrial applications: Robust operation in challenging lighting or environmental conditions
We've tested the system with users of varying heights and body types, finding consistent performance across different demographics. The pose classification algorithms adapt well to individual differences in arm length and posture.
The Code: Open Source and Ready to Deploy
The entire implementation is open source and built with reproducibility in mind. The modular ROS 2 architecture means you can easily integrate pose control into existing robot platforms or extend the system with additional poses.
Key configuration parameters are exposed through launch files, allowing fine-tuning for different robot platforms:
```yaml
pose_navigation_bridge:
  ros__parameters:
    pose_confidence_threshold: 0.7
    max_linear_velocity: 0.3    # m/s
    max_angular_velocity: 0.8   # rad/s
    pose_timeout: 2.0           # seconds
    motion_smoothing_enabled: true
```

The GestureBot project continues to evolve, with pose detection joining gesture recognition and autonomous person following as part of a comprehensive vision-based robotics platform. Each modality has its place, and together they're building toward more adaptable and intuitive robot companions.
-
Gesture-Based Navigation
08/14/2025 at 15:54 • 0 comments

Gesture-controlled robotics represents a compelling intersection of computer vision, human-robot interaction, and real-time motion control. I developed GestureBot as a comprehensive system that translates hand gestures into precise robot movements, addressing the unique challenges of responsive detection, mechanical stability, and modular architecture design.
The project tackles several technical challenges inherent in gesture-controlled navigation: achieving sub-second response times while maintaining detection stability, preventing mechanical instability in tall robot form factors through acceleration limiting, and creating a modular architecture that supports future multi-modal integration. My implementation demonstrates how MediaPipe's gesture recognition capabilities can be effectively integrated with ROS2 navigation systems to create a responsive, stable, and extensible robot control platform.
System Architecture
I designed GestureBot with a modular architecture that separates gesture detection from motion control, enabling flexible deployment and future expansion. The system consists of two primary components connected through ROS2 topics:
Core Components
Gesture Recognition Module: Handles camera input and MediaPipe-based gesture detection, publishing stable gesture results to /vision/gestures. This module operates independently and can function without the motion control system for testing and development.

Navigation Bridge Module: Subscribes to gesture detection results and converts them into smooth robot motion commands published to /cmd_vel. This separation allows the navigation bridge to potentially receive input from multiple detection sources in future implementations.

Data Flow Architecture
```
Camera Input → MediaPipe Processing → Gesture Stability Filtering → /vision/gestures
                                                                          ↓
   /cmd_vel ← Acceleration Limiting ← Velocity Smoothing ← Motion Mapping ←──┘
```

The modular design enables independent operation of components. I can run gesture detection without motion control for development, or use external gesture sources with the navigation bridge. This architecture prepares the system for Phase 4 multi-modal integration where object detection and pose estimation will feed into the same navigation bridge.
Launch File Structure
I implemented separate launch files for each component:
- gesture_recognition.launch.py: Camera and gesture detection only
- gesture_navigation_bridge.launch.py: Motion control and navigation logic
- Future: multi_modal_navigation.launch.py: Integrated multi-modal system
This separation provides deployment flexibility and simplifies parameter management for different robot configurations.
Technical Implementation
MediaPipe Integration for Hand Gesture Detection
I integrated MediaPipe's gesture recognition model using a controller-based architecture that handles the MediaPipe lifecycle independently from ROS2 infrastructure. The implementation uses MediaPipe's LIVE_STREAM mode with asynchronous processing for optimal performance:
```python
class GestureRecognitionController:
    def __init__(self, model_path: str, confidence_threshold: float,
                 max_hands: int, result_callback):
        self.model_path = model_path
        self.confidence_threshold = confidence_threshold
        self.max_hands = max_hands
        self.result_callback = result_callback

        # Initialize MediaPipe gesture recognizer
        base_options = python.BaseOptions(model_asset_path=self.model_path)
        options = vision.GestureRecognizerOptions(
            base_options=base_options,
            running_mode=vision.RunningMode.LIVE_STREAM,
            result_callback=self._mediapipe_callback,
            min_hand_detection_confidence=self.confidence_threshold,
            min_hand_presence_confidence=self.confidence_threshold,
            min_tracking_confidence=self.confidence_threshold,
            num_hands=self.max_hands
        )
        self.recognizer = vision.GestureRecognizer.create_from_options(options)
```

The controller processes camera frames asynchronously and extracts gesture classifications, hand landmarks, and handedness information. I implemented robust handedness extraction that handles MediaPipe's data structure variations:
```python
def extract_handedness(self, handedness_list, hand_index: int) -> str:
    """Extract handedness from MediaPipe results using standard category_name format."""
    if not handedness_list or hand_index >= len(handedness_list):
        return 'Unknown'
    try:
        handedness_data = handedness_list[hand_index]
        if hasattr(handedness_data, '__len__') and len(handedness_data) > 0:
            if hasattr(handedness_data[0], 'category_name'):
                return handedness_data[0].category_name
    except (IndexError, AttributeError):
        pass
    return 'Unknown'
```

Gesture-to-Motion Mapping System
I implemented a comprehensive mapping system that translates 8 distinct hand gestures into specific robot movements:
```python
GESTURE_MOTION_MAP = {
    'Thumb_Up':    {'linear_x': 0.3,  'angular_z': 0.0},   # Move forward
    'Thumb_Down':  {'linear_x': -0.2, 'angular_z': 0.0},   # Move backward
    'Open_Palm':   {'linear_x': 0.0,  'angular_z': 0.0},   # Emergency stop
    'Pointing_Up': {'linear_x': 0.3,  'angular_z': 0.0},   # Move forward (alternative)
    'Victory':     {'linear_x': 0.0,  'angular_z': 0.8},   # Turn left
    'ILoveYou':    {'linear_x': 0.0,  'angular_z': -0.8},  # Turn right
    'Closed_Fist': {'linear_x': 0.0,  'angular_z': 0.0},   # Emergency stop
    'None':        {'linear_x': 0.0,  'angular_z': 0.0}    # No gesture detected
}
```

The mapping system includes safety considerations with multiple emergency stop gestures (Open_Palm and Closed_Fist) that bypass acceleration limiting for immediate response. Forward and backward movements use different maximum velocities, with backward motion limited to 0.2 m/s for safety.
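The bypass for the two stop gestures can be handled before the smoother ever sees the command. The sketch below shows the idea; the routing function and the `publish_immediate_stop()` helper are assumptions for illustration.

```python
EMERGENCY_STOP_GESTURES = {'Open_Palm', 'Closed_Fist'}


def handle_gesture(self, gesture_name):
    """Route a stable gesture either straight to a hard stop or through smoothing."""
    motion = GESTURE_MOTION_MAP.get(gesture_name, GESTURE_MOTION_MAP['None'])

    if gesture_name in EMERGENCY_STOP_GESTURES:
        # Skip acceleration limiting: publish zero velocity immediately
        self.publish_immediate_stop()          # assumed helper
        self.target_velocity = dict(motion)    # keep smoother state consistent
        return

    # Normal path: the 25 Hz smoother ramps toward this target
    self.target_velocity = dict(motion)
```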
Acceleration Limiting for Mechanical Stability
Tall robots with high centers of mass require careful acceleration management to prevent wobbling and tipping. I implemented a comprehensive acceleration limiting system that operates at 25 Hz to provide smooth velocity transitions:
```python
def apply_acceleration_limit(self, current_vel: float, target_vel: float,
                             max_accel: float, dt: float) -> float:
    """Apply acceleration limiting to prevent abrupt velocity changes."""
    velocity_diff = target_vel - current_vel
    max_change = max_accel * dt

    if abs(velocity_diff) <= max_change:
        return target_vel  # Can reach target this step
    else:
        # Limit the change to maximum allowed acceleration
        return current_vel + (max_change if velocity_diff > 0 else -max_change)
```

The system uses conservative acceleration limits tuned for tall robot stability:
- Linear acceleration: 0.25 m/s² (balanced responsiveness and stability)
- Angular acceleration: 0.5 rad/s² (smooth turning without destabilization)
- Emergency deceleration: 1.2 m/s² (faster stopping for safety)
High-Frequency Velocity Smoothing
I implemented a 25 Hz velocity smoothing loop that continuously interpolates between current and target velocities. This high-frequency control prevents the jerky motion that can cause mechanical instability:
```python
def update_smoothed_velocity(self) -> None:
    """High-frequency velocity smoothing with acceleration limiting."""
    current_time = time.time()
    dt = current_time - self.last_velocity_update
    self.last_velocity_update = current_time

    # Skip if dt is too large (system lag) or too small
    if dt > 0.1 or dt < 0.001:
        return

    # Apply acceleration limiting to linear velocity
    self.current_velocity['linear_x'] = self.apply_acceleration_limit(
        self.current_velocity['linear_x'],
        self.target_velocity['linear_x'],
        max_linear_accel, dt
    )

    # Apply acceleration limiting to angular velocity
    self.current_velocity['angular_z'] = self.apply_acceleration_limit(
        self.current_velocity['angular_z'],
        self.target_velocity['angular_z'],
        max_angular_accel, dt
    )

    # Create and publish smoothed Twist message
    twist = Twist()
    twist.linear.x = self.current_velocity['linear_x']
    twist.angular.z = self.current_velocity['angular_z']
    self.cmd_vel_publisher.publish(twist)
```

Performance Optimizations
Gesture Stability Filtering
I developed a multi-layered stability filtering system that balances responsiveness with detection reliability. The system combines three filtering mechanisms:
Time-based stability: Requires gestures to be detected consistently for a minimum duration (0.1 seconds for maximum responsiveness).
Consistency checking: Validates that the same gesture appears in consecutive detections (single detection sufficient for immediate response).
Transition delay: Enforces minimum time between different gesture changes (0.05 seconds for fastest viable switching).
```python
def check_gesture_stability(self, gesture_name: str, confidence: float,
                            timestamp: float) -> bool:
    """Enhanced stability checking with consistency and transition delay."""
    # Add current detection to history
    self.gesture_detection_history.append({
        'gesture': gesture_name,
        'confidence': confidence,
        'timestamp': timestamp
    })

    # Check consistency - same gesture detected N consecutive times
    if not self._check_gesture_consistency(gesture_name):
        return False

    # Check transition delay - minimum time between different gestures
    if not self._check_transition_delay(gesture_name, timestamp):
        return False

    # Check time-based stability - existing method
    if not self.is_gesture_stable(gesture_name, timestamp):
        return False

    return True
```

These parameters were optimized through testing to achieve sub-second response times while maintaining detection stability. The system achieves gesture-to-motion latency of 0.3-0.5 seconds under optimal conditions.
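The two helper checks referenced above aren't shown in the post. Sketches consistent with the listed parameters (single-detection consistency, 0.05-second transition delay) might look like this; the `last_stable_gesture` and `last_gesture_change_time` attributes are assumptions for illustration.

```python
def _check_gesture_consistency(self, gesture_name: str, required_count: int = 1) -> bool:
    """Require the same gesture in the last N detections (N=1 for fastest response)."""
    recent = list(self.gesture_detection_history)[-required_count:]
    return len(recent) == required_count and all(
        entry['gesture'] == gesture_name for entry in recent)


def _check_transition_delay(self, gesture_name: str, timestamp: float,
                            min_delay: float = 0.05) -> bool:
    """Block switching to a different gesture until the transition delay has passed."""
    if gesture_name == self.last_stable_gesture:
        return True
    return (timestamp - self.last_gesture_change_time) >= min_delay
```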
Smart Logging System
I implemented intelligent logging that reduces noise while preserving essential debugging information. The system only logs when velocity commands change significantly or represent meaningful motion:
```python
def log_velocity_change(self, twist: Twist) -> None:
    """Smart logging that only logs when velocity actually changes or is significant."""
    current_linear = twist.linear.x
    current_angular = twist.angular.z
    last_linear = self.last_published_velocity['linear_x']
    last_angular = self.last_published_velocity['angular_z']

    # Check if velocity is zero
    is_zero_velocity = abs(current_linear) < 0.001 and abs(current_angular) < 0.001
    was_zero_velocity = abs(last_linear) < 0.001 and abs(last_angular) < 0.001

    # Check if velocity has changed significantly
    velocity_changed = (abs(current_linear - last_linear) > 0.01 or
                        abs(current_angular - last_angular) > 0.01)

    # Log conditions: non-zero velocities, significant changes, or transitions to stop
    should_log = (not is_zero_velocity or
                  (velocity_changed and not was_zero_velocity) or
                  (velocity_changed and not self.zero_velocity_logged))

    if should_log:
        self.get_logger().info(
            f'Velocity: linear: {current_linear:.3f}, angular: {current_angular:.3f}')
```

This approach reduces log volume by approximately 80% while maintaining visibility into actual motion commands and system state changes.
Cross-Workspace ROS2 Integration
I designed the system to support cross-workspace integration, enabling gesture control of robots with existing navigation stacks. The modular architecture publishes standard /cmd_vel messages that any ROS2 navigation system can consume:

```bash
# GestureBot workspace (publishes /cmd_vel)
cd ~/GestureBot/gesturebot_ws
source ~/GestureBot/gesturebot_env/bin/activate
source install/setup.bash
ros2 launch gesturebot gesture_recognition.launch.py
ros2 launch gesturebot gesture_navigation_bridge.launch.py

# Robot base workspace (subscribes to /cmd_vel)
cd ~/Robot/robot_ws
source ~/Robot/robot_env/bin/activate
source install/setup.bash
ros2 run robot_base base_controller_node
```

This architecture enables gesture control of any ROS2-compatible robot without modifying existing navigation code.
Results and Performance Metrics
Responsiveness Improvements
Through systematic optimization, I achieved significant improvements in gesture-to-motion response times:
Before optimization:
- Gesture-to-motion latency: 5-10 seconds
- Motion start time: 2-4 seconds
- Gesture transition rate: 8.94 transitions/second (excessive noise)
After optimization:
- Gesture-to-motion latency: 0.3-0.5 seconds (90% improvement)
- Motion start time: 0.5-1.0 seconds (75% improvement)
- Gesture transition rate: 6.03 transitions/second (32% reduction in noise)
Stability Achievements
The acceleration limiting system successfully prevents mechanical instability in tall robot configurations:
Acceleration compliance:
- Linear acceleration: ≤0.25 m/s² (100% compliance)
- Angular acceleration: ≤0.5 rad/s² (100% compliance)
- Smooth transitions: >95% of velocity changes within limits
Motion characteristics:
- Time to reach maximum linear velocity (0.3 m/s): 1.2 seconds
- Time to reach maximum angular velocity (0.8 rad/s): 1.6 seconds
- Emergency stop response: <0.1 seconds (bypasses acceleration limiting)
Detection Performance
The gesture recognition system demonstrates robust performance across various conditions:
Detection accuracy:
- Gesture recognition confidence: >0.7 for stable detections
- Handedness detection: 100% accuracy when hands are clearly visible
- False positive rate: <5% with stability filtering enabled
System resource usage:
- CPU utilization: 15-20% for gesture recognition, 2-5% for navigation bridge
- Memory usage: 200-300MB total system footprint
- Detection rate: 3-4 Hz for stable gestures, 8-15 Hz raw MediaPipe output
-
Building a 33-Point Human Skeleton Tracker
08/14/2025 at 15:20 • 0 comments

Most robotics vision systems treat humans as simple bounding boxes – "person detected, avoid obstacle." But humans are dynamic, expressive, and predictable if you know how to read body language. A person leaning forward might be about to walk into the robot's path. Someone pointing could be giving directional commands. Arms raised might signal "stop."
I needed a system that could:
- Track 33 distinct body landmarks in real-time
- Handle multiple people simultaneously (up to 2 poses)
- Run headless on embedded hardware without X11 dependencies
- Integrate cleanly with my existing ROS 2 navigation stack
- Provide visual feedback for development and debugging
The Core Architecture
The heart of my implementation is a modular ROS 2 node that wraps MediaPipe's PoseLandmarker model. I chose a composition pattern to keep the ROS infrastructure separate from the MediaPipe processing logic:
```python
class PoseDetectionNode(MediaPipeBaseNode, MediaPipeCallbackMixin):
    def __init__(self, **kwargs):
        MediaPipeCallbackMixin.__init__(self)
        super().__init__(
            node_name='pose_detection_node',
            **kwargs
        )

        # Initialize the pose detection controller
        self.controller = PoseDetectionController(
            model_path=self.model_path,
            confidence_threshold=self.confidence_threshold,
            max_poses=self.max_poses,
            logger=self.get_logger()
        )
```

The PoseDetectionController handles all MediaPipe-specific operations:

```python
class PoseDetectionController:
    def __init__(self, model_path: str, confidence_threshold: float,
                 max_poses: int, logger):
        self.logger = logger

        # Configure MediaPipe options
        base_options = python.BaseOptions(model_asset_path=model_path)
        options = vision.PoseLandmarkerOptions(
            base_options=base_options,
            running_mode=vision.RunningMode.LIVE_STREAM,
            num_poses=max_poses,
            min_pose_detection_confidence=confidence_threshold,
            min_pose_presence_confidence=confidence_threshold,
            min_tracking_confidence=confidence_threshold,
            result_callback=self._pose_callback
        )
        self._landmarker = vision.PoseLandmarker.create_from_options(options)
```

The 33-Point Pose Model
MediaPipe's pose model detects 33 landmarks covering the entire human body:
- Face: Nose, eyes, ears (5 points)
- Torso: Shoulders, hips, center points (6 points)
- Arms: Shoulders, elbows, wrists (6 points)
- Hands: Wrist, thumb, fingers (10 points)
- Legs: Hips, knees, ankles, feet (6 points)
Each landmark provides normalized (x, y) coordinates plus a visibility score, giving rich information about human pose and orientation.
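For example, reading the shoulder landmarks and checking their visibility before using them for classification could look like the sketch below. The indices follow MediaPipe's pose landmark numbering; the 0.5 visibility cutoff is an illustrative assumption.

```python
LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12


def shoulders_visible(landmarks, min_visibility=0.5):
    """Return the two shoulder points if both are reliably visible, else None."""
    left, right = landmarks[LEFT_SHOULDER], landmarks[RIGHT_SHOULDER]
    if left.visibility < min_visibility or right.visibility < min_visibility:
        return None
    # x and y are normalized to [0, 1] relative to image width and height
    return (left.x, left.y), (right.x, right.y)
```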
Handling MediaPipe's Async Processing
One of the trickier aspects was properly handling MediaPipe's asynchronous LIVE_STREAM mode. The pose detection happens in a separate thread, with results delivered via callback:
```python
def _pose_callback(self, result: vision.PoseLandmarkerResult,
                   output_image: mp.Image, timestamp_ms: int):
    """Handle pose detection results from MediaPipe."""
    try:
        # Convert MediaPipe timestamp to ROS time
        ros_timestamp = self._convert_timestamp(timestamp_ms)

        # Process pose landmarks
        pose_msg = PoseLandmarks()
        pose_msg.header.stamp = ros_timestamp
        pose_msg.header.frame_id = 'camera_frame'

        if result.pose_landmarks:
            pose_msg.num_poses = len(result.pose_landmarks)

            # Handle MediaPipe's pose landmark structure variations
            for pose_landmarks in result.pose_landmarks:
                try:
                    # MediaPipe structure can vary between versions
                    if hasattr(pose_landmarks, '__iter__') and not hasattr(pose_landmarks, 'landmark'):
                        landmarks = pose_landmarks           # Direct list access
                    else:
                        landmarks = pose_landmarks.landmark  # Attribute access

                    for landmark in landmarks:
                        point = Point()
                        point.x = landmark.x
                        point.y = landmark.y
                        point.z = landmark.z
                        pose_msg.landmarks.append(point)
                except Exception as e:
                    self.logger.warn(f'Pose landmark processing error: {e}')
                    continue

        # Publish results
        self.pose_publisher.publish(pose_msg)

    except Exception as e:
        self.logger.error(f'Pose callback error: {e}')
```

Performance Reality Check: 3-7 FPS on Pi 5
Let me be honest about performance – this isn't going to run at 30 FPS on a Raspberry Pi 5. Through extensive testing, I measured:
- Actual Performance: 3-7 FPS @ 640x480 resolution
- Processing Time: ~150-300ms per frame
- Memory Usage: ~200MB additional RAM
- CPU Load: ~40-60% of one core during active processing
But here's the thing – for robotics applications, this is actually sufficient. Human movement is relatively slow compared to computer vision processing. A robot doesn't need to track every micro-movement; it needs to understand general pose, direction, and intent.
The key insight was optimizing for stability over speed. I'd rather have consistent 5 FPS processing than erratic 15 FPS with dropped frames and errors.
Debugging MediaPipe Integration Challenges
The most frustrating part of this project was dealing with MediaPipe's pose landmark data structures. Different versions of MediaPipe return pose landmarks in slightly different formats, and the documentation doesn't clearly explain the variations.
I spent hours debugging errors like:
```
AttributeError: 'list' object has no attribute 'landmark'
```

The solution was implementing robust structure detection:

```python
# Handle both possible MediaPipe pose landmark structures
if hasattr(pose_landmarks, '__iter__') and not hasattr(pose_landmarks, 'landmark'):
    # pose_landmarks is already a list of landmarks (newer format)
    landmarks = pose_landmarks
else:
    # pose_landmarks has .landmark attribute (older format)
    landmarks = pose_landmarks.landmark
```

When annotated image publishing is enabled, the callback also renders the skeleton overlay and publishes it for visualization:

```python
if self.annotated_image_publisher and result.pose_landmarks:
    annotated_frame = self._create_annotated_image(output_image.numpy_view(), result)
    annotated_msg = self.cv_bridge.cv2_to_imgmsg(annotated_frame, encoding='bgr8')
    annotated_msg.header.stamp = self.get_clock().now().to_msg()
    annotated_msg.header.frame_id = 'camera_frame'
    self.annotated_image_publisher.publish(annotated_msg)
```

I can develop and debug with full visualization, then deploy the same code headless on the robot.
Multi-Modal Vision Integration
The real power comes from combining pose detection with other vision modalities. My unified image viewer can display multiple vision streams simultaneously:
```bash
# View all vision systems together
ros2 launch gesturebot image_viewer.launch.py \
  image_topics:='["/vision/objects/annotated", "/vision/gestures/annotated", "/vision/pose/annotated"]' \
  topic_window_names:='{"\/vision\/objects\/annotated": "Objects", "\/vision\/gestures\/annotated": "Gestures", "\/vision\/pose\/annotated": "Poses"}'
```

This gives me object detection (what's there), gesture recognition (what they're doing), and pose detection (how they're positioned) all in one integrated system.
Practical Robotics Applications
Human-Aware Navigation
Instead of treating humans as static obstacles, my robot can now:
- Predict movement direction from body orientation
- Maintain appropriate social distances based on pose
- Recognize when someone is trying to interact vs. just passing by
Gesture Command Integration
Pose data enhances gesture recognition by providing context:
- A pointing gesture combined with body orientation gives directional commands
- Raised arms with forward-leaning pose might indicate "stop" vs. "hello"
- Multiple people can give conflicting gestures – pose helps determine who to follow
Safety and Interaction
The 33-point skeleton provides rich safety information:
- Detect if someone has fallen (unusual pose angles)
- Recognize aggressive vs. friendly body language
- Identify when someone is carrying objects that might affect navigation
-
Implementing Real-Time Gesture Recognition for Robot Control
08/13/2025 at 03:40 • 0 comments

System Overview and Architecture
The GestureBot gesture recognition system is built on a modular ROS 2 architecture that combines MediaPipe's powerful computer vision capabilities with efficient real-time processing optimized for embedded systems. The core system processes camera input at 15 FPS, detects hand gestures with 21-point landmark tracking, and translates recognized gestures into navigation commands for autonomous robot control.
Key Performance Metrics
Through extensive testing and optimization, I've achieved the following performance characteristics on Raspberry Pi 5:
- Processing Rate: 15 FPS @ 640x480 resolution
- Gesture Recognition Latency: <100ms from detection to command
- Hand Landmark Accuracy: 21-point skeleton with sub-pixel precision
- System Resource Usage: <25% CPU utilization during active processing
- Memory Footprint: ~150MB including MediaPipe models and ROS 2 overhead
MediaPipe Integration with ROS 2
The foundation of the gesture recognition system is a robust integration between MediaPipe's gesture recognition capabilities and ROS 2's distributed computing framework. I implemented this using a callback-based architecture that maximizes performance while maintaining system responsiveness.
Core Architecture Components
The system consists of several key components working in concert:
GestureRecognitionNode: The primary ROS 2 node that inherits from
MediaPipeBaseNode, providing standardized MediaPipe integration patterns across the vision system.MediaPipe Gesture Recognizer: Utilizes the pre-trained
gesture_recognizer.taskmodel for real-time gesture classification with confidence scoring.Unified Image Viewer: A multi-topic display system that can simultaneously show gesture recognition results, hand landmarks, and performance metrics.
Callback-Based Processing Pipeline
I implemented the MediaPipe integration using an asynchronous callback pattern that ensures optimal performance:
```python
def initialize_mediapipe(self) -> bool:
    """Initialize MediaPipe gesture recognizer with callback processing."""
    try:
        # Configure gesture recognizer options
        options = mp_vis.GestureRecognizerOptions(
            base_options=mp_py.BaseOptions(model_asset_path=self.model_path),
            running_mode=mp_vis.RunningMode.LIVE_STREAM,
            result_callback=self._process_callback_results,
            num_hands=self.max_hands,
            min_hand_detection_confidence=self.confidence_threshold,
            min_hand_presence_confidence=self.confidence_threshold,
            min_tracking_confidence=self.confidence_threshold
        )
        self.gesture_recognizer = mp_vis.GestureRecognizer.create_from_options(options)
        return True
    except Exception as e:
        self.get_logger().error(f"Failed to initialize MediaPipe: {e}")
        return False
```

This approach uses MediaPipe's LIVE_STREAM mode with detect_async() for optimal performance, avoiding blocking operations in the main processing thread.

Hand Landmark Detection System
The gesture recognition system implements comprehensive 21-point hand landmark tracking, providing detailed skeletal information for each detected hand. This landmark data serves dual purposes: gesture classification input and visual feedback for system debugging.
21-Point Hand Skeleton
MediaPipe's hand landmark model provides 21 key points representing the complete hand structure:
- Wrist (0): Base reference point
- Thumb (1-4): Complete thumb chain from base to tip
- Index Finger (5-8): Four points from base to fingertip
- Middle Finger (9-12): Complete middle finger structure
- Ring Finger (13-16): Four-point ring finger chain
- Pinky (17-20): Complete pinky finger structure
Landmark Visualization Implementation
I implemented a comprehensive visualization system that draws both individual landmarks and connecting skeleton lines:
```python
def draw_hand_landmarks(self, image: np.ndarray, hand_landmarks_list) -> None:
    """Draw complete hand landmarks with skeleton connections."""
    try:
        if not hand_landmarks_list:
            return

        height, width = image.shape[:2]

        for hand_index, hand_landmarks in enumerate(hand_landmarks_list):
            # Draw landmark connections (skeleton)
            mp_drawing.draw_landmarks(
                image,
                hand_landmarks,
                mp_hands.HAND_CONNECTIONS,
                mp_drawing.DrawingSpec(color=(0, 255, 0), thickness=2, circle_radius=2),
                mp_drawing.DrawingSpec(color=(255, 0, 0), thickness=2)
            )

            # Highlight key landmarks (wrist and fingertips)
            for i, landmark in enumerate(hand_landmarks.landmark):
                x = int(landmark.x * width)
                y = int(landmark.y * height)
                if i in [0, 4, 8, 12, 16, 20]:  # Wrist and fingertips
                    cv2.circle(image, (x, y), 8, (0, 0, 255), -1)      # Red circles
                    cv2.circle(image, (x, y), 10, (255, 255, 255), 2)  # White border

    except Exception as e:
        self.get_logger().error(f"Error drawing landmarks: {e}")
```

Gesture-to-Navigation Command Mapping
The system implements a comprehensive mapping between recognized gestures and robot navigation commands, designed for intuitive and safe robot control.
Supported Gestures and Commands
I've implemented support for 9 distinct gestures, each mapped to specific navigation behaviors:
| Gesture | Navigation Command | Robot Action | Use Case |
|---------|--------------------|--------------|----------|
| 👍 Thumbs Up | start_navigation | Begin autonomous navigation | Start mission |
| 👎 Thumbs Down | stop_navigation | Stop all movement | End current task |
| ✋ Open Palm | pause_navigation | Pause current navigation | Temporary halt |
| 👆 Pointing Up | move_forward | Move forward briefly | Manual positioning |
| 👈 Pointing Left | turn_left | Turn left | Direction control |
| 👉 Pointing Right | turn_right | Turn right | Direction control |
| ✌️ Peace Sign | follow_person | Enter person-following mode | Interactive following |
| ✊ Fist | emergency_stop | Immediate emergency stop | Safety override |
| 👋 Wave | return_home | Return to home position | Mission completion |

Gesture Stability and Confidence Thresholds
To ensure reliable operation and prevent false triggers, I implemented a multi-layered validation system:
Confidence Threshold: Minimum 0.5 confidence score required for gesture recognition (configurable via parameters).
Gesture Stability: Gestures must be maintained for a minimum duration (default: 0.5 seconds) before triggering navigation commands.
Command Debouncing: Prevents rapid-fire command execution by implementing cooldown periods between identical commands.
```python
def process_gesture_results(self, results, timestamp) -> Optional[Dict]:
    """Process gesture recognition results with stability validation."""
    if not results.gestures:
        self.current_gesture = None
        self.gesture_start_time = None
        return None

    # Get highest confidence gesture
    gesture = results.gestures[0][0]
    gesture_name = gesture.category_name
    confidence = gesture.score

    # Apply confidence threshold
    if confidence < self.confidence_threshold:
        return None

    # Check gesture stability
    current_time = time.time()
    if self.current_gesture != gesture_name:
        self.current_gesture = gesture_name
        self.gesture_start_time = current_time
        return None  # Wait for stability

    # Validate stability duration
    if current_time - self.gesture_start_time < self.gesture_stability_threshold:
        return None

    # Generate navigation command
    nav_command = self.GESTURE_COMMANDS.get(gesture_name, '')

    return {
        'gesture_name': gesture_name,
        'confidence': confidence,
        'nav_command': nav_command,
        'handedness': results.handedness[0][0].category_name if results.handedness else 'unknown',
        'timestamp': timestamp
    }
```

Unified Image Viewer Architecture
One of the key architectural improvements I implemented is the unified image viewer system, which replaces multiple separate image viewer nodes with a single, efficient multi-topic display system.
Multi-Topic Display Capabilities
The unified image viewer can simultaneously display multiple vision streams in separate windows, each with independent performance tracking and custom labeling:
```bash
# Display both gesture recognition and object detection simultaneously
ros2 launch gesturebot image_viewer.launch.py \
  image_topics:='["/vision/gestures/annotated", "/vision/objects/annotated"]' \
  topic_window_names:='{"\/vision\/gestures\/annotated": "Gestures", "\/vision\/objects\/annotated": "Objects"}'
```

Performance Optimizations
The unified architecture provides significant performance improvements:
- Memory Efficiency: ~15MB per window vs ~25MB for separate viewers
- CPU Optimization: <5% total CPU usage at 10 FPS display rate
- Resource Sharing: Single OpenCV event loop handles multiple windows
- QoS Compatibility: BEST_EFFORT reliability matches vision system publishers
FPS Overlay Implementation
I implemented a sophisticated FPS overlay system that provides per-topic performance monitoring positioned in the lower-right corner of each display window:
```python
def add_fps_overlay(self, image: np.ndarray, topic: str) -> np.ndarray:
    """Add FPS overlay to the image for a specific topic in the lower-right corner."""
    fps_text = f"Display FPS: {self.current_fps_values[topic]:.1f}"
    topic_text = f"Topic: {topic.split('/')[-1]}"

    # Get image dimensions and calculate lower-right position
    img_height, img_width = image.shape[:2]

    # Text properties (standardized across vision system)
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 0.5
    color = (0, 255, 0)  # Green
    thickness = 1

    # Measure both text lines to size the background box
    (fps_w, fps_h), _ = cv2.getTextSize(fps_text, font, font_scale, thickness)
    (topic_w, topic_h), _ = cv2.getTextSize(topic_text, font, font_scale, thickness)
    text_width = max(fps_w, topic_w)

    # Calculate positioning with proper padding
    padding = 10
    rect_x2 = img_width - padding
    rect_y2 = img_height - padding
    rect_x1 = rect_x2 - text_width - padding
    rect_y1 = rect_y2 - (fps_h + topic_h + 3 * padding)
    text_x = rect_x1 + padding
    fps_text_y = rect_y1 + fps_h + padding
    topic_text_y = fps_text_y + topic_h + padding

    # Draw background rectangle and text
    cv2.rectangle(image, (rect_x1, rect_y1), (rect_x2, rect_y2), (0, 0, 0), -1)
    cv2.putText(image, fps_text, (text_x, fps_text_y), font, font_scale, color, thickness)
    cv2.putText(image, topic_text, (text_x, topic_text_y), font, font_scale, color, thickness)

    return image
```

Performance Optimization for Raspberry Pi 5
Achieving real-time performance on embedded hardware required careful optimization across multiple system layers.
Camera Configuration Optimization
I optimized the camera pipeline specifically for gesture recognition performance:
```yaml
# Optimal camera settings for gesture recognition
camera_node:
  ros__parameters:
    format: "BGR888"                       # Optimal for MediaPipe processing
    width: 640
    height: 480
    FrameDurationLimits: [66667, 66667]    # 15 FPS = 66.67ms
    ExposureTime: 20000                    # 20ms for fast capture
    AnalogueGain: 1.0
    DigitalGain: 1.0
    buffer_queue_size: 2                   # Reduced buffer for lower latency
```

The node also monitors system resources and adapts the processing rate when CPU load climbs:

```python
def monitor_system_performance(self):
    """Monitor system resources and adapt processing accordingly."""
    cpu_usage = psutil.cpu_percent(interval=1)
    memory_usage = psutil.virtual_memory().percent

    if cpu_usage > 75:
        # Reduce processing rate under high load
        self.processing_config.max_fps = 10.0
        self.get_logger().warn("High CPU usage detected, reducing processing rate")
    elif cpu_usage < 50:
        # Restore full processing rate when resources available
        self.processing_config.max_fps = 15.0
```

Configuration and Usage
Launch File Configuration
The gesture recognition system provides comprehensive configuration through launch file parameters:
```bash
# Complete gesture recognition system launch
ros2 launch gesturebot gesture_recognition.launch.py \
  camera_format:=BGR888 \
  confidence_threshold:=0.5 \
  max_hands:=2 \
  gesture_stability_threshold:=0.5 \
  publish_annotated_images:=true \
  show_landmark_indices:=false \
  buffer_logging_enabled:=false \
  enable_performance_tracking:=true
```

Key Parameters Explained
camera_format: BGR888 provides optimal performance for MediaPipe processing, avoiding unnecessary color space conversions.
confidence_threshold: Minimum confidence score (0.0-1.0) required for gesture recognition. Higher values reduce false positives but may miss valid gestures.
max_hands: Maximum number of hands to track simultaneously (1-2). Higher values increase computational load.
gesture_stability_threshold: Minimum duration (seconds) a gesture must be maintained before triggering commands. Prevents accidental activations.
publish_annotated_images: Enables/disables annotated image publishing. Defaults to true for visual feedback but can be disabled to save resources.
Integration with Navigation System
The gesture recognition system integrates seamlessly with ROS 2 navigation stacks through standardized message interfaces:
```python
# Gesture command publisher
self.gesture_command_publisher = self.create_publisher(
    GestureCommand, '/gesture_commands', 10
)


# Navigation command mapping
def publish_navigation_command(self, gesture_info):
    """Publish navigation command based on recognized gesture."""
    command_msg = GestureCommand()
    command_msg.gesture_name = gesture_info['gesture_name']
    command_msg.confidence = gesture_info['confidence']
    command_msg.navigation_command = gesture_info['nav_command']
    command_msg.timestamp = self.get_clock().now().to_msg()

    self.gesture_command_publisher.publish(command_msg)
```

Visual System Features and Consistency
I implemented a comprehensive visual feedback system with consistent styling across all vision components.
Font Standardization
All text overlays throughout the GestureBot vision system use standardized font properties for visual consistency:
```python
# Standardized font properties across all vision nodes
STANDARD_FONT = cv2.FONT_HERSHEY_SIMPLEX
STANDARD_FONT_SCALE = 0.5
STANDARD_THICKNESS = 1
```

This standardization applies to:
- FPS overlays in the unified image viewer
- Gesture name and confidence annotations
- Hand landmark indices (when enabled)
- Bounding box labels
Annotated Image Publishing
The system publishes richly annotated images that include:
- Hand Landmarks: Complete 21-point skeleton with connecting lines
- Gesture Labels: Current gesture name with confidence percentage
- Navigation Commands: Associated navigation command for each gesture
- Performance Metrics: Real-time FPS and processing statistics
```python
def create_annotated_image(self, frame, gesture_info):
    """Create comprehensive annotated image with all visual elements."""
    annotated_frame = frame.copy()

    # Draw hand landmarks
    if self._current_hand_landmarks:
        self.draw_hand_landmarks(annotated_frame, self._current_hand_landmarks)

    # Add gesture text with standardized font
    gesture_text = f"{gesture_info['gesture_name']} ({gesture_info['confidence']:.2f})"
    if gesture_info.get('nav_command'):
        gesture_text += f" -> {gesture_info['nav_command']}"

    self.draw_text_with_background(
        annotated_frame, gesture_text, (10, 30),
        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1
    )

    return annotated_frame
```

Performance Optimization
Issue: Poor FPS performance or high CPU usage

Solutions:
- Reduce processing rate: Lower max_fps parameter
- Optimize camera settings: Use BGR888 format, reduce resolution if needed
- Disable unnecessary features: Turn off landmark indices, reduce max_hands
- Monitor system resources: Use htop and ROS 2 performance topics
```bash
# Monitor gesture recognition performance
ros2 topic echo /vision/gestures/performance

# Adjust parameters for better performance
ros2 param set /gesture_recognition_node max_fps 10.0
ros2 param set /gesture_recognition_node max_hands 1
```

-
Manual Object Detection Annotations in GestureBot
08/11/2025 at 05:04 • 0 comments

When developing the GestureBot vision system, I encountered a common challenge in robotics computer vision: balancing performance with visualization flexibility. While MediaPipe provides excellent object detection capabilities, its built-in annotation system proved limiting for our specific visualization requirements. This post details how I implemented a custom manual annotation system using OpenCV primitives while maintaining MediaPipe's high-performance LIVE_STREAM processing mode.

Problem Statement: Why Move Beyond MediaPipe's Built-in Annotations
MediaPipe's object detection framework excels at inference performance, but its visualization capabilities presented several limitations for our robotics application:
MediaPipe Annotation Limitations
- Limited customization: Fixed annotation styles with minimal configuration options
- Inconsistent output: LIVE_STREAM mode doesn't always provide reliable output_image results
- Performance overhead: Built-in annotations add processing latency in the inference pipeline
- Inflexible styling: No control over color schemes, font sizes, or confidence display formats
Our Requirements
For GestureBot's vision system, I needed:
- Color-coded confidence levels for quick visual assessment
- Percentage-based confidence display for precise evaluation
- Consistent annotation rendering regardless of detection confidence
- Minimal performance impact on the real-time processing pipeline
- Full control over visual styling to match our robotics interface
Technical Implementation: Manual Annotation Architecture
The solution involved decoupling MediaPipe's inference engine from the visualization layer, creating a custom annotation system that operates on the original RGB frames.
System Architecture
# High-level flow
RGB Frame → MediaPipe Detection (LIVE_STREAM) → Manual Annotation → ROS Publishing

The key insight was to preserve MediaPipe's asynchronous detect_async() processing while applying custom annotations to the original RGB frames, rather than relying on MediaPipe's output_image.

Core Implementation: Manual Annotation Method
def draw_manual_annotations(self, image: np.ndarray, detections) -> np.ndarray:
    """
    Manually draw bounding boxes, labels, and confidence scores using OpenCV.

    Args:
        image: RGB image array (H, W, 3)
        detections: MediaPipe detection results

    Returns:
        Annotated RGB image array
    """
    if not detections:
        return image.copy()

    annotated_image = image.copy()
    height, width = image.shape[:2]

    for detection in detections:
        # Get bounding box coordinates
        bbox = detection.bounding_box
        x_min = int(bbox.origin_x)
        y_min = int(bbox.origin_y)
        x_max = int(bbox.origin_x + bbox.width)
        y_max = int(bbox.origin_y + bbox.height)

        # Ensure coordinates are within image bounds
        x_min = max(0, min(x_min, width - 1))
        y_min = max(0, min(y_min, height - 1))
        x_max = max(0, min(x_max, width - 1))
        y_max = max(0, min(y_max, height - 1))

        # Get the best category (highest confidence)
        if detection.categories:
            best_category = max(detection.categories, key=lambda c: c.score if c.score else 0)
            class_name = best_category.category_name or 'unknown'
            confidence = best_category.score or 0.0

            # Color-coded boxes based on confidence levels
            if confidence >= 0.7:
                color = (0, 255, 0)    # Green for high confidence (RGB)
            elif confidence >= 0.5:
                color = (255, 255, 0)  # Yellow for medium confidence (RGB)
            else:
                color = (255, 0, 0)    # Red for low confidence (RGB)

            # Draw bounding box rectangle
            cv2.rectangle(annotated_image, (x_min, y_min), (x_max, y_max), color, 2)

            # Prepare label text with confidence percentage
            confidence_percent = int(confidence * 100)
            label = f"{class_name}: {confidence_percent}%"

            # Calculate text size for background rectangle
            font = cv2.FONT_HERSHEY_SIMPLEX
            font_scale = 0.6
            thickness = 2
            (text_width, text_height), baseline = cv2.getTextSize(label, font, font_scale, thickness)

            # Position text above bounding box, or below if not enough space
            text_x = x_min
            text_y = y_min - 10 if y_min - 10 > text_height else y_max + text_height + 10

            # Draw background rectangle for text (filled)
            cv2.rectangle(
                annotated_image,
                (text_x, text_y - text_height - baseline),
                (text_x + text_width, text_y + baseline),
                color,
                -1  # Filled rectangle
            )

            # Draw text label in black for good contrast
            cv2.putText(
                annotated_image,
                label,
                (text_x, text_y),
                font,
                font_scale,
                (0, 0, 0),  # Black text (RGB)
                thickness,
                cv2.LINE_AA
            )

    return annotated_image

Integration with MediaPipe Pipeline
The manual annotation system integrates seamlessly with MediaPipe's asynchronous processing:
def process_frame(self, frame: np.ndarray, timestamp: float) -> Optional[Dict]:
    """Process frame for object detection."""
    # ... MediaPipe processing ...

    if detections:
        result_dict = {
            'detections': detections,
            'timestamp': timestamp,
            'processing_time': (time.time() - timestamp) * 1000,
            'rgb_frame': rgb_frame  # Include original RGB frame for manual annotation
        }
        return result_dict

def publish_results(self, results: Dict, timestamp: float) -> None:
    """Publish object detection results and optionally annotated images."""
    # ... detection publishing ...

    # Extract the original RGB frame and detections for manual annotation
    rgb_frame = results['rgb_frame']
    detections = results['detections']

    # Apply manual annotations using OpenCV drawing primitives
    annotated_rgb = self.draw_manual_annotations(rgb_frame, detections)

    # Convert RGB to BGR for ROS publishing
    cv_image = cv2.cvtColor(annotated_rgb, cv2.COLOR_RGB2BGR)

Visual Features: Color-Coded Confidence System
The manual annotation system implements a three-tier confidence visualization scheme designed for rapid assessment in robotics applications:
Confidence Color Mapping
- 🟢 Green (≥70%): High confidence detections suitable for autonomous decision-making
- 🟡 Yellow (≥50%): Medium confidence detections requiring validation
- 🔴 Red (<50%): Low confidence detections for debugging purposes
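The thresholds above could be factored into a small helper; this is a hypothetical refactor for illustration, not code from the node, which inlines the logic inside draw_manual_annotations.

# Hypothetical helper factoring out the confidence-to-color thresholds (RGB colors).
def confidence_color(confidence: float) -> tuple:
    if confidence >= 0.7:
        return (0, 255, 0)      # Green: high confidence
    if confidence >= 0.5:
        return (255, 255, 0)    # Yellow: medium confidence
    return (255, 0, 0)          # Red: low confidence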
Typography and Positioning
- Font: cv2.FONT_HERSHEY_SIMPLEX for clear readability
- Background: Filled rectangles matching bounding box colors for contrast
- Text Color: Black for optimal contrast against colored backgrounds
- Positioning: Adaptive placement above or below bounding boxes based on available space
Percentage Display Format
Confidence scores are displayed as integers (e.g., "person: 76%") rather than decimals, providing immediate visual feedback without cognitive overhead during real-time operation.
-
Optimizing Object Detection Pipeline Performance: A 68.7% Improvement Through Systematic Bottleneck Analysis
08/11/2025 at 01:44 • 0 comments

Introduction
I recently completed a comprehensive performance optimization project for a ROS 2-based object detection pipeline using MediaPipe and OpenCV. The system processes camera frames for real-time object detection in robotics applications, but initial performance analysis revealed significant bottlenecks that were limiting throughput and consuming excessive CPU resources.
The object detection pipeline consists of three main stages:
- Preprocessing: Camera frame format conversion (BGR→RGB) for MediaPipe compatibility
- MediaPipe Inference: Object detection using TensorFlow Lite models
- Postprocessing: Result conversion and ROS message publishing, including optional annotated image generation
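For orientation, here is a minimal sketch of how those three stages fit together in a per-frame callback, with per-stage timing in the style used throughout this post. The stage boundaries match the description above, but the helper names (convert_to_rgb, run_inference, publish_detections) are placeholders rather than the node's actual methods.

import time

def handle_frame(frame):
    """Sketch of the three-stage pipeline with per-stage timing (placeholder helpers)."""
    timings = {}

    # Stage 1: Preprocessing (BGR -> RGB conversion for MediaPipe)
    t0 = time.perf_counter()
    rgb_frame = convert_to_rgb(frame)            # placeholder
    timings['preprocessing_ms'] = (time.perf_counter() - t0) * 1000

    # Stage 2: MediaPipe inference (TensorFlow Lite object detection)
    t0 = time.perf_counter()
    detections = run_inference(rgb_frame)        # placeholder
    timings['mediapipe_ms'] = (time.perf_counter() - t0) * 1000

    # Stage 3: Postprocessing (ROS message conversion and publishing)
    t0 = time.perf_counter()
    publish_detections(detections)               # placeholder
    timings['postprocessing_ms'] = (time.perf_counter() - t0) * 1000

    return timings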
Through systematic measurement and targeted optimization, I achieved a 68.7% reduction in total pipeline processing time, from 8.65ms to 2.71ms per frame, while maintaining full functionality and improving system stability.
Baseline Performance Analysis
Measurement Infrastructure
Before implementing any optimizations, I established a comprehensive performance measurement system to ensure accurate, statistically reliable data collection. The measurement infrastructure includes:
PipelineTimer Class: High-precision timing using time.perf_counter() for microsecond-level accuracy:

class PipelineTimer:
    def __init__(self):
        self.stage_times = {}
        self.start_time = None

    def start_stage(self, stage_name: str):
        self.stage_times[stage_name] = time.perf_counter()

    def end_stage(self, stage_name: str) -> float:
        if stage_name in self.stage_times:
            duration = time.perf_counter() - self.stage_times[stage_name]
            return duration * 1000  # Convert to milliseconds
        return 0.0

PerformanceStats Class: Aggregates timing data over 5-second periods and publishes metrics to ROS topics:
class PerformanceStats:
    def __init__(self):
        self.period_start_time = time.perf_counter()
        self.frames_processed = 0
        self.total_preprocessing_time = 0.0
        self.total_mediapipe_time = 0.0
        self.total_postprocessing_time = 0.0
        self.period_duration = 5.0  # seconds

Statistical Methodology: I used 30-second test periods with multiple measurement intervals to ensure statistical confidence. Each test collected 5-6 data points, allowing calculation of mean performance metrics and variance analysis.
Baseline Performance Results
Using YUYV camera format with full annotated image processing enabled, the baseline performance measurements revealed:
| Metric | Average Time | Percentage of Pipeline |
| --- | --- | --- |
| Total Pipeline Time | 8.65ms | 100% |
| Preprocessing Time | 1.22ms | 14% |
| MediaPipe Inference | 2.13ms | 25% |
| Postprocessing Time | 5.30ms | 61% |
| Effective FPS | 2.28 | - |

The baseline analysis immediately identified postprocessing as the primary bottleneck, consuming 61% of total pipeline time. This stage includes MediaPipe result conversion, RGB→BGR color conversion, and ROS Image message creation for annotated output.
Optimization #1: Conditional Annotated Image Processing
Problem Analysis
The postprocessing bottleneck was caused by unconditional generation of annotated images, even when no ROS subscribers were listening to the /vision/objects/annotated topic. This resulted in expensive memory operations and color conversions being performed unnecessarily.

Implementation
I implemented a subscriber count check to conditionally skip annotated image processing when no subscribers are present:
def publish_results(self, results: Dict, timestamp: float) -> None:
    """Publish object detection results and optionally annotated images."""
    try:
        # Always publish detection results
        msg = MessageConverter.detection_results_to_ros(results, timestamp)
        self.detections_publisher.publish(msg)

        # Conditional annotated image publishing
        if (self.annotated_image_publisher is not None and
                'output_image' in results and
                results['output_image'] is not None):

            # Optimization: Skip expensive postprocessing if no subscribers
            subscriber_count = self.annotated_image_publisher.get_subscription_count()
            if subscriber_count == 0:
                self.log_buffered_event(
                    'ANNOTATED_PROCESSING_SKIPPED',
                    'Skipping annotated image processing - no subscribers',
                    subscriber_count=subscriber_count
                )
                return

            # Continue with annotated image processing only when needed
            # ... (expensive postprocessing operations)

Performance Impact
The conditional processing optimization delivered significant improvements:
| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Total Pipeline Time | 8.65ms | 5.19ms | 🚀 40.0% FASTER |
| Postprocessing Time | 5.30ms | 1.52ms | 🚀 71.3% FASTER |
| CPU Efficiency | High overhead | Reduced by ~40% | Significant |

This optimization eliminates 3.46ms of processing time per frame when annotated images are not needed, which is the common case in production robotics applications where object detection results are consumed programmatically rather than visually.
Optimization #2: RGB888 Camera Format
Problem Analysis
After eliminating the postprocessing bottleneck, MediaPipe inference became the primary performance constraint. Analysis revealed that the BGR→RGB color conversion in preprocessing, combined with MediaPipe's internal processing of converted frames, was creating inefficiencies.
Implementation
I switched the camera configuration from YUYV to RGB888 format, allowing direct RGB input to MediaPipe:
def process_frame(self, frame: np.ndarray, timestamp: float) -> Optional[Dict]:
    """Process frame for object detection."""
    try:
        # Optimization: Check if frame is already in RGB format (camera_format=RGB888)
        if frame.shape[2] == 3:  # Ensure it's a 3-channel image
            # Direct RGB input from camera eliminates BGR→RGB conversion
            rgb_frame = frame
            self.log_buffered_event(
                'PREPROCESSING_OPTIMIZED',
                'Using direct RGB input - skipping BGR→RGB conversion',
                frame_shape=str(frame.shape)
            )
        else:
            # Fallback: Convert BGR to RGB for MediaPipe
            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        # Create MediaPipe image with direct RGB data
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)

Camera Launch Configuration:

ros2 launch gesturebot object_detection.launch.py camera_format:=RGB888

Technical Benefits
The RGB888 format optimization provides multiple performance improvements:
- Eliminates BGR→RGB Conversion: Removes the expensive cv2.cvtColor() operation in preprocessing
- MediaPipe Efficiency: Direct RGB input is processed more efficiently by MediaPipe's internal algorithms
- Memory Bandwidth Reduction: Fewer memory copy operations and intermediate buffer allocations
- Cache Efficiency: Better memory access patterns with consistent RGB format throughout the pipeline
Performance Impact
The RGB888 optimization delivered substantial additional improvements:
| Metric | After Opt #1 | After Opt #2 | Additional Improvement |
| --- | --- | --- | --- |
| Total Pipeline Time | 5.19ms | 2.71ms | 🚀 47.8% FASTER |
| MediaPipe Time | 2.47ms | 0.81ms | 🚀 67.2% FASTER |
| Preprocessing Time | 1.30ms | 0.83ms | 🚀 36.2% FASTER |

The MediaPipe inference time reduction of 67.2% was particularly significant, transforming it from the primary bottleneck to a well-optimized component.
Performance Measurement Infrastructure
Buffered Logging System
To collect detailed performance data without impacting measurements, I implemented a configurable buffered logging system:
class BufferedLogger:
    def __init__(self, enabled: bool = True, max_size: int = 200):
        self.enabled = enabled
        self.buffer = []
        self.max_size = max_size
        self.mode = 'circular' if enabled else 'disabled'

    def log_event(self, event_type: str, message: str, **kwargs):
        if not self.enabled:
            return

        event = {
            'timestamp': time.perf_counter(),
            'event_type': event_type,
            'message': message,
            **kwargs
        }

        if len(self.buffer) >= self.max_size:
            self.buffer.pop(0)  # Remove oldest entry
        self.buffer.append(event)

Real-Time Performance Publishing
The system publishes aggregated performance metrics every 5 seconds to ROS topics for real-time monitoring:
def publish_performance_stats(self):
    """Publish performance statistics to ROS topic."""
    if not self.enable_performance_tracking:
        return

    current_time = time.perf_counter()
    period_duration = current_time - self.stats.period_start_time

    if period_duration >= self.stats.period_duration:
        # Calculate averages
        avg_preprocessing = self.stats.total_preprocessing_time / self.stats.frames_processed
        avg_mediapipe = self.stats.total_mediapipe_time / self.stats.frames_processed
        avg_postprocessing = self.stats.total_postprocessing_time / self.stats.frames_processed
        avg_total = avg_preprocessing + avg_mediapipe + avg_postprocessing

        # Create and publish performance message
        perf_msg = PerformanceMetrics()
        perf_msg.header.stamp = self.get_clock().now().to_msg()
        perf_msg.current_fps = self.stats.frames_processed / period_duration
        perf_msg.avg_preprocessing_time = avg_preprocessing
        perf_msg.avg_mediapipe_time = avg_mediapipe
        perf_msg.avg_postprocessing_time = avg_postprocessing
        perf_msg.avg_total_pipeline_time = avg_total

        self.performance_publisher.publish(perf_msg)

Statistical Validation
I used consistent testing methodology to ensure reliable measurements:
- Test Duration: 30-second measurement periods
- Data Points: 5-6 measurement intervals per test
- Consistency Validation: Multiple test runs to verify reproducibility
- Variance Analysis: Standard deviation calculations to assess measurement stability
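As a rough illustration of that methodology, the snippet below shows one way the per-interval averages could be reduced to a mean and standard deviation. The interval values are made-up placeholders for demonstration, not measurements from the tests reported in this post.

import statistics

# Hypothetical per-interval pipeline averages (ms) from one 30-second test run;
# placeholder values only, not the actual measurements above.
interval_pipeline_times_ms = [2.68, 2.74, 2.70, 2.73, 2.69, 2.72]

mean_ms = statistics.mean(interval_pipeline_times_ms)
stdev_ms = statistics.stdev(interval_pipeline_times_ms)

# A small relative spread suggests the measurement is stable across intervals
relative_spread = stdev_ms / mean_ms
print(f"mean={mean_ms:.2f} ms, stdev={stdev_ms:.3f} ms, spread={relative_spread:.1%}")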
Results Summary
Cumulative Performance Improvement
The systematic optimization approach achieved exceptional results:
| Configuration | Total Pipeline Time | Improvement from Baseline | Cumulative Improvement |
| --- | --- | --- | --- |
| Baseline (YUYV + Full Processing) | 8.65ms | - | - |
| Optimization #1 (Conditional Processing) | 5.19ms | 40.0% faster | 40.0% |
| Final Optimized (RGB888 + Conditional) | 2.71ms | 68.7% faster | 68.7% |

Stage-by-Stage Transformation
| Pipeline Stage | Baseline | After Opt #1 | Final Optimized | Total Improvement |
| --- | --- | --- | --- | --- |
| Preprocessing | 1.22ms (14%) | 1.30ms (25%) | 0.83ms (31%) | 32.0% faster |
| MediaPipe | 2.13ms (25%) | 2.47ms (48%) | 0.81ms (30%) | 62.0% faster |
| Postprocessing | 5.30ms (61%) | 1.52ms (29%) | 1.06ms (39%) | 80.0% faster |

Mathematical Validation
The cumulative improvement follows the expected multiplicative formula:
Total Improvement = 1 - (1 - Opt1) × (1 - Opt2)
                  = 1 - (1 - 0.40) × (1 - 0.478)
                  = 1 - (0.60 × 0.522)
                  = 68.7% ✓

Technical Insights
Bottleneck Migration Pattern
The optimization process revealed an important pattern of bottleneck migration:
- Initial State: Postprocessing dominated (61% of pipeline time)
- After Optimization #1: MediaPipe became the primary bottleneck (48% of pipeline time)
- Final State: Balanced pipeline with no dominant bottleneck (30-39% distribution)
This demonstrates the importance of iterative optimization and continuous measurement, as eliminating one bottleneck often reveals the next performance constraint.
Optimization Sequencing Strategy
The success of this optimization effort validates several key principles:
- Measure First: Comprehensive performance measurement infrastructure enabled data-driven decisions
- Target the Largest Bottleneck: Addressing postprocessing first provided the highest initial impact
- Iterative Approach: Sequential optimization revealed secondary bottlenecks that weren't initially apparent
- Validate Each Step: Statistical validation ensured that improvements were real and reproducible
Camera Format Selection Impact
The RGB888 camera format optimization provided insights into the importance of data format consistency throughout processing pipelines:
- Format Conversion Overhead: BGR→RGB conversion consumed significant CPU cycles
- MediaPipe Efficiency: Direct RGB input dramatically improved inference performance
- Memory Access Patterns: Consistent format reduced cache misses and memory bandwidth requirements
Performance Measurement Best Practices
The measurement infrastructure development highlighted several critical practices:
- Non-Intrusive Logging: Buffered logging prevents measurement artifacts
- Aggregated Metrics: 5-second averaging provides stable, meaningful performance data
- Statistical Rigor: Multiple measurement periods enable confidence interval calculation
- Real-Time Monitoring: ROS topic publishing allows live performance observation
Conclusion
This optimization project demonstrates the effectiveness of systematic, measurement-driven performance improvement. By implementing comprehensive performance tracking, identifying bottlenecks through data analysis, and applying targeted optimizations, I achieved a 68.7% reduction in object detection pipeline processing time.
The key success factors were:
- Robust Measurement Infrastructure: Accurate, non-intrusive performance tracking enabled data-driven optimization decisions
- Bottleneck-Focused Approach: Targeting the largest performance constraints first maximized improvement impact
- Format Optimization: Aligning data formats throughout the pipeline eliminated unnecessary conversions
- Conditional Processing: Smart resource management reduced computational overhead when full processing isn't needed
The optimized pipeline now processes frames in 2.71ms compared to the original 8.65ms, providing substantial headroom for additional robotics processing tasks while maintaining full object detection functionality. This work demonstrates that significant performance improvements are achievable through systematic analysis and targeted optimization, even in complex multi-stage processing pipelines.
-
Refactoring Buffered Logging in ROS 2 Vision Pipelines
08/10/2025 at 20:51 • 0 comments

When building real-time computer vision systems with ROS 2, diagnostic logging becomes critical for debugging complex processing pipelines. However, poorly designed logging systems can create more confusion than clarity. I recently refactored the buffered logging system in my GestureBot object detection node, transforming a confusing, duplicated implementation into a clean, reusable architecture that other robotics developers can learn from.
The Problem: Misleading Abstractions and Technical Debt
The original buffered logging system suffered from several fundamental issues that made it difficult to use and maintain:
Confusing Terminology: The system used "production mode" and "debug mode" labels that didn't reflect actual behavior. "Production mode" suggested it was only for deployment, while "debug mode" implied it was only for development. In reality, both modes had legitimate use cases across different scenarios.
Inconsistent Timer Behavior: The system used a 120-second timer for "debug mode" and a 10-second timer for "production mode." This inconsistency made it difficult to predict when diagnostic information would be available.
Code Duplication: The BufferedLogger class was implemented directly in object_detection_node.py, making it impossible for other vision nodes (gesture detection, face detection) to reuse the same logging infrastructure.

Unclear Parameters: Launch file parameters like enable_debug_buffer obscured what the system actually did, requiring developers to read implementation details to understand behavior.

Solution Architecture: Behavior-Based Design
I redesigned the system around three core principles: clear behavioral naming, consistent timing, and reusable architecture.
1. Renamed Modes to Reflect Actual Behavior
The new system uses descriptive names that immediately communicate what each mode does:
# Before: Confusing mode names
'mode': 'debug' if self.debug_mode else 'production'

# After: Behavior-based naming
'mode': 'unlimited' if self.unlimited_mode else 'circular'

Circular Mode: Uses a fixed-size circular buffer (200 entries) with automatic dropping when full. Ideal for continuous monitoring with bounded memory usage.
Unlimited Mode: Allows unlimited buffer growth with timer-only flushing. Perfect for comprehensive diagnostic sessions where you need complete event history.
Disabled Mode: No buffering overhead, only critical errors logged directly. Optimal for production deployments where performance is paramount.
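To make the three behaviors concrete, here is a minimal sketch of how a constructor might map them onto buffer allocation. It mirrors the deque-based strategies described later in this post, but it is an illustrative stand-in, not the project's actual BufferedLogger implementation.

from collections import deque

class BufferedLoggerSketch:
    """Illustrative only: maps circular / unlimited / disabled modes to buffer types."""

    def __init__(self, buffer_size: int = 200, unlimited_mode: bool = False, enabled: bool = True):
        if not enabled:
            self.mode = 'disabled'
            self.buffer = None                       # No buffering overhead at all
        elif unlimited_mode:
            self.mode = 'unlimited'
            self.buffer = deque()                    # Unbounded; flushed only by the timer
        else:
            self.mode = 'circular'
            self.buffer = deque(maxlen=buffer_size)  # Oldest entries dropped automatically

    def log_event(self, event_type: str, message: str, **metadata):
        if self.buffer is None:
            return  # Disabled mode: skip buffering entirely
        self.buffer.append({'type': event_type, 'message': message, **metadata})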
2. Standardized Timer Intervals
I unified the timer interval to 10 seconds across all modes, eliminating the arbitrary distinction between 120-second and 10-second intervals:
# Before: Inconsistent timing
flush_interval = 120.0 if enable_debug_buffer else 10.0

# After: Consistent 10-second intervals
self.buffer_flush_timer = self.create_timer(10.0, self._flush_buffered_logger)

This change provides more responsive feedback while maintaining reasonable performance characteristics.
3. Moved to Base Class Architecture
The most significant architectural improvement was moving BufferedLogger from the specific object detection node to the base MediaPipeBaseNode class:

# vision_core/base_node.py
class MediaPipeBaseNode(Node, ABC):
    def __init__(self, node_name: str, feature_name: str,
                 config: ProcessingConfig,
                 enable_buffered_logging: bool = True,
                 unlimited_buffer_mode: bool = False):
        super().__init__(node_name)

        # Initialize buffered logging for all MediaPipe nodes
        self.buffered_logger = BufferedLogger(
            buffer_size=200,
            logger=self.get_logger(),
            unlimited_mode=unlimited_buffer_mode,
            enabled=enable_buffered_logging
        )

This inheritance-based approach means any new vision node automatically gains sophisticated logging capabilities without code duplication.
4. Updated Launch File Parameters
The launch file parameters now clearly communicate their purpose:
# Before: Unclear parameter names
declare_enable_debug_buffer = DeclareLaunchArgument(
    'enable_debug_buffer',
    default_value='false',
    description='Enable comprehensive debug buffer logging...'
)

# After: Behavior-focused parameters
declare_unlimited_buffer_mode = DeclareLaunchArgument(
    'unlimited_buffer_mode',
    default_value='false',
    description='Enable unlimited buffer mode (timer-only flushing for comprehensive diagnostics)...'
)

Implementation Details and Performance Characteristics
Memory Management Strategy
Each mode implements a different memory management strategy optimized for its use case:
Circular Mode uses Python's deque(maxlen=200) for automatic memory bounds:

# Automatic memory management
self.buffer = deque(maxlen=self.buffer_size)  # Auto-drops old entries

Unlimited Mode uses an unbounded deque() for comprehensive logging:

# Unlimited growth for complete diagnostics
self.buffer = deque()  # No maxlen restriction

Disabled Mode eliminates buffer allocation entirely:

# Zero memory overhead
self.buffer = None

Event Logging Interface
The refactored system provides a clean interface for logging diagnostic events:
# Simple event logging with metadata
self.log_buffered_event(
    'IMAGE_PROCESSING_START',
    'Starting annotated image processing',
    frame_size=(640, 480),
    processing_time_ms=15.2
)

Integration with ROS 2 Ecosystem
The refactored system integrates cleanly with ROS 2 parameter management:

# Runtime parameter inspection
ros2 service call /object_detection_node/get_parameters rcl_interfaces/srv/GetParameters \
  "{names: ['unlimited_buffer_mode', 'buffer_logging_enabled']}"

# Response shows current configuration
# unlimited_buffer_mode: False, buffer_logging_enabled: True
-
Simplifying MediaPipe Vision Processing
08/10/2025 at 20:46 • 0 comments

In my recent work on the GestureBot vision system, I made several architectural improvements that significantly simplified the codebase while maintaining performance. Here's what I learned about building robust MediaPipe-based vision pipelines in ROS 2.
The Problem: Over-Engineering for Simplicity
Initially, I implemented a complex architecture with ComposableNodes, thread pools, and async processing patterns. The system created a new thread for every camera frame and used intricate callback checking mechanisms. While this seemed like a performance optimization, it introduced unnecessary complexity:
# Old approach - complex threading
threading.Thread(
    target=self._process_frame_async,
    args=(cv_image, timestamp),
    daemon=True
).start()

# Complex callback checking after submission
if self.processing_lock.acquire(blocking=False):
    # Process and check callback results...

I refactored the entire system to use a straightforward synchronous approach that separates concerns cleanly:
1. Converted from ComposableNode to Regular Node Architecture
Before:
camera_container = ComposableNodeContainer(
    name='object_detection_camera_container',
    package='rclcpp_components',
    executable='component_container',
    composable_node_descriptions=[
        ComposableNode(package='camera_ros', plugin='camera::CameraNode')
    ]
)

After:
camera_node = Node(
    package='camera_ros',
    executable='camera_node',
    name='camera_node',
    namespace='camera'
)

Why this works better: Since my object detection node runs in Python and can't be part of the same composable container anyway, using regular nodes eliminates complexity without sacrificing performance.
2. Separated Processing Contexts
I redesigned the processing flow to have two distinct, non-blocking contexts:
def image_callback(self, msg: Image) -> None:
    """Simple synchronous image processing callback."""
    cv_image = self.cv_bridge.imgmsg_to_cv2(msg, 'bgr8')
    timestamp = time.time()

    # Process frame synchronously - no threading complexity
    results = self.process_frame(cv_image, timestamp)
    if results is not None:
        self.publish_results(results, timestamp)

Key insight: Instead of checking MediaPipe callbacks after submission, I let MediaPipe's callback system handle result publishing directly. This eliminates the need for complex synchronization between submission and result retrieval.
3. Fixed MediaPipe Message Conversion Robustness
MediaPipe sometimes returns None values for bounding box coordinates and confidence scores. I added comprehensive None-value handling:

# Handle None values explicitly
origin_x = getattr(bbox, 'origin_x', None)
msg.bbox_x = int(origin_x) if origin_x is not None else 0

# Robust confidence assignment with multiple fallback approaches
if score_val is not None:
    confidence_val = float(score_val)
else:
    confidence_val = 0.0

try:
    msg.confidence = confidence_val
except:
    object.__setattr__(msg, 'confidence', confidence_val)

This eliminated the persistent "<function DetectedObject.confidence at 0x...> returned a result with an exception set" errors that were blocking the system.

4. Added Shared Memory Transport for Performance
While simplifying the architecture, I maintained performance by enabling shared memory transport. This provides most of the performance benefits of ComposableNodes without the architectural complexity.
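The post doesn't show the exact mechanism used, so treat the following as one possible approach rather than the project's configuration: with the default Fast DDS RMW, shared-memory transport can be steered through an XML profile referenced from the launch file via the FASTRTPS_DEFAULT_PROFILES_FILE environment variable. The profile path below is a placeholder.

from launch import LaunchDescription
from launch.actions import SetEnvironmentVariable
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        # Point the Fast DDS middleware at an XML profile that enables its
        # shared-memory transport (profile path is a placeholder).
        SetEnvironmentVariable('FASTRTPS_DEFAULT_PROFILES_FILE',
                               '/path/to/shm_profile.xml'),
        Node(
            package='camera_ros',
            executable='camera_node',
            name='camera_node',
            namespace='camera'
        ),
    ])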
5. Cleaned Up Topic Namespace
I consolidated all camera-related topics under a clean /camera/ namespace:

remappings=[
    ('~/image_raw', '/camera/image_raw'),
    ('~/camera_info', '/camera/camera_info'),
]

This eliminates duplicate topics like /camera_node/image_raw and /camera/image_raw that were causing confusion.

Results: Better Performance Through Simplicity
The refactored system achieves:
- Eliminated threading overhead: No more thread creation per frame
- Cleaner error handling: Robust None-value processing prevents crashes
- Simplified debugging: Linear execution flow is easy to trace
- Maintained performance: Shared memory transport provides efficient image transfer
- Clean topic structure: Single source of truth for camera data
-
Mechanical Design and Hardware Integration Notes
08/08/2025 at 17:43 • 0 commentsHardware Platform
- Base: iRobot Create 2 (Roomba)
- Structural frame: 3/4" Schedule 40 PVC pipe
- Custom parts: 3D‑printed base bracket and upper “blue” electronics enclosure
- Sensors and compute (upper assembly): Raspberry Pi 5 + active cooler, camera, MPU6050 IMU, top‑mounted LiDAR, 3S LiPo, 5 V regulator, wiring harnesses
Why iRobot Create 2 (Roomba) as the base
I chose the Create 2 because:
- It is a proven, rugged differential‑drive platform with integrated motor drivers, encoders, bump sensors, cliff sensors, and a charge dock interface.
- The Open Interface (OI) provides documented serial control for motion and telemetry, which simplifies bring‑up and reduces the number of custom PCBs I need to maintain.
- The chassis carries batteries low in the body, giving a naturally low center of mass that helps with the tall mast structure.
- Replacement parts and batteries are widely available; consumables (wheels, brushes) are inexpensive.
In short, it gives me reliable locomotion and power infrastructure so I can focus engineering time on perception and interaction.
Structural Framework: 3/4" Schedule 40 PVC
I built the superstructure as a four‑post mast using standard 3/4" Schedule 40 PVC with printed sockets at the base and a printed upper enclosure that captures the posts.
PVC framework rationale
- Cost‑effectiveness: PVC pipe and fittings cost a fraction of aluminum extrusion and require no specialty tooling. I can build and iterate for a few dollars per meter.
- Structural rigidity: For a ~1–1.2 m mast, four 3/4" PVC uprights provide adequate bending stiffness when posts are constrained at both ends; adding a single mid‑height brace eliminates noticeable sway.
- Lightweight: Low mass keeps the center of gravity near the Roomba deck, improving tip resistance during sudden stops or dock approaches.
- Modularity: I cut posts to length and swap elbows/tees to reconfigure sensor height in minutes. Printed collars give me mounting points exactly where I need them.
- Easy iteration: I can drill, ream, and solvent‑bond or simply screw into PVC without worrying about galvanic corrosion or thread wear in thin‑wall aluminum.
Practical tip: I lightly ream the pipe OD and size printed sockets with +0.3 to +0.5 mm clearance, then use two self‑tapping screws per joint. This holds under vibration and still allows disassembly.
Base Bracket (3D‑printed)
The base bracket is a circular plate that sits on the Roomba’s top deck and presents four vertical sockets for the PVC posts.
Design choices:
- I align the sockets on a square bolt circle to match the upper enclosure’s posts; this prevents torsion and keeps the mast square.
- The bracket uses the Create 2’s existing screw bosses for anchoring (no drilling in the shell). I embed heat‑set inserts in the print so I can torque fasteners without crushing plastic.
- Filleted ribs radiate from each socket into the center ring to distribute mast loads and survive side hits.
Material and print settings:
- PETG or ABS at 30–40% gyroid infill, four perimeters, 0.24–0.28 mm layer height. PETG gives enough ductility to absorb bumps without cracking.
Upper Assembly (“Blue Enclosure”)
The upper enclosure is a printed housing that integrates compute, power, and sensors while acting as the frame’s top plate. It also provides an easy surface for future sensors and user interfaces.
What I integrated
- Raspberry Pi 5 (8 GB) with the official active cooler
- 5 V buck regulator (≥ 5 A) from the 3S LiPo rail
- IMX219 camera module (front‑facing), recessed window
- MPU6050 IMU (mounted near the enclosure’s CG to reduce rotational noise)
- Top‑mounted LiDAR (clear 360° FOV, minimal occlusion from the mast)
- 3‑cell LiPo battery with inline fuse and master switch
- Cable glands and internal harnesses
Thermal management
- I treated the Pi 5 cooler as a forced‑air inlet and provided exhaust vents on the opposite wall. Short, straight flow paths are more effective than decorative perforations.
- Mounting bosses standoff the Pi to keep airflow under the board and let the radiator breathe.
- During sustained computer vision tasks, case temps stayed below ~72 °C with the fan at default curve.
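For reference, this is roughly how such a temperature check can be done on the Pi: the sketch below polls the SoC thermal zone that Linux exposes under /sys/class/thermal. It is a generic monitoring snippet, not tied to any GestureBot code.

import time

THERMAL_ZONE = '/sys/class/thermal/thermal_zone0/temp'  # SoC temperature in millidegrees C

def read_soc_temp_c():
    """Read the SoC temperature from the Linux thermal zone interface."""
    with open(THERMAL_ZONE) as f:
        return int(f.read().strip()) / 1000.0

if __name__ == '__main__':
    # Log the temperature once per second while a vision workload runs
    while True:
        print(f"SoC temperature: {read_soc_temp_c():.1f} °C")
        time.sleep(1.0)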
Cable management and signal routing
- I routed all power and signal lines inside the PVC uprights. Two strategies worked best:
- Slotted posts: A 6 mm slot near each end allows the cable to enter/exit without visible loops.
- Printed clip rings: Snap‑on rings with zip‑tie slots prevent cable rattle.
- Power architecture:
- 3S LiPo → 5 V buck (Pi 5 + camera + small peripherals)
- LiDAR on a dedicated 5 V rail with its own inline fuse to prevent brown‑outs on Pi load spikes
- Roomba Interface:
- I use a USB‑to‑TTL serial adapter (3.3 V logic) to the Create 2 Open Interface, routed through the mast. A small isolation board and a common ground tie keep noise out of the IMU.
- UART wiring is strain‑relieved inside the enclosure; the connector is service‑looped for easy removal.
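As a bring-up sanity check on that serial link, something like the pyserial sketch below can be used. The port name is a placeholder, and the 115200 baud default and the Start (128) / Safe (131) / Sensors (142) opcodes come from iRobot's published Open Interface spec, not from this project's code.

import time
import serial  # pyserial

# Placeholder device path for the USB-to-TTL adapter
PORT = '/dev/ttyUSB0'

with serial.Serial(PORT, baudrate=115200, timeout=1.0) as oi:
    oi.write(bytes([128]))   # Start: open the Open Interface (Passive mode)
    time.sleep(0.1)
    oi.write(bytes([131]))   # Safe: allow drive commands with cliff/bump safety intact
    time.sleep(0.1)
    # Request sensor packet 21 (charging state) as a simple round-trip check
    oi.write(bytes([142, 21]))
    print('charging state byte:', oi.read(1))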
Serviceability
- The lid comes off with four machine screws. All boards mount to threaded brass inserts; nothing self‑tapped into plastic.
- The battery is strapped to a removable tray with a finger‑pull, so I can swap packs without touching the rest of the wiring.
Component Selection Rationale
- Raspberry Pi 5: Enough CPU for real‑time perception pipelines; strong community support; camera and LiDAR libraries are mature on Ubuntu.
- IMX219 camera: Small, easy to mount flush in the enclosure, validated at 30 fps with the libcamera stack.
- MPU6050: Commodity IMU with acceptable drift for mobile base stabilization and motion gating; easy to filter at 200 Hz on the Pi.
- Top‑mounted LiDAR: Clear line of sight above people and furniture; the mast keeps occlusions outside of the primary scan plane.
- 3S LiPo + buck: Keeps high current off the Roomba’s internal 5 V; isolates compute from base brown‑outs and gives me a clean power budget.
Mechanical Integration Details
- Fasteners: Almost everything is M3 or M4 to simplify spares. Steel washers are used where printed parts interface with PVC to prevent local crushing.
- Tolerances: I leave 0.2–0.3 mm clearance for printed‑to‑printed fits and 0.4–0.5 mm for printed‑to‑PVC sockets; this range consistently assembles without post‑processing on a 0.4 mm nozzle.
- Vibration: A single cross‑brace mid‑mast reduces LiDAR ringing and camera shake noticeably. If I add heavier sensors later, I’ll upgrade to a truss‑style brace that bolts to captured nuts in the posts.
Why PVC Instead of Aluminum Extrusion
| Criterion | PVC (3/4" Sch 40) | 2020/2040 Aluminum Extrusion |
| --- | --- | --- |
| Cost | Very low | Moderate to high |
| Stiffness/weight for this height | Adequate with bracing | Higher |
| Tooling | Hand saw, drill, small screws | Chop saw, tapping, brackets |
| Modularity | Cut‑to‑length, printed adapters | Excellent with slot hardware |
| Iteration speed | Very fast | Moderate |
| Aesthetics | Utility‑grade | Professional |

For a research robot that changes every week, the speed and cost advantages of PVC outweigh the stiffness and finish benefits of extrusion. When the design freezes, I can translate the printed sockets to aluminum adapters if needed.
Assembly Summary
- Print base bracket, post sockets, and upper enclosure components with embedded heat‑set inserts.
- Cut four PVC posts to length; drill cable entry/exit slots.
- Mount the base bracket to the Create 2 using existing bosses; verify level.
- Route the harnesses through posts, then seat posts into the base bracket and temporarily pin with screws.
- Install the upper enclosure, capture the posts, and secure all joints.
- Fit the Pi 5, regulator, IMU, camera, LiDAR, and battery; complete wiring with fuses and the master switch.
- Bring up power rails, verify voltages, and perform sensor smoke tests before closing the lid.
Integration Challenges and Solutions
- EMI into IMU at motor start: I added a small LC filter and routed the IMU cable away from the main battery run; problem disappeared.
- LiDAR cable strain: A printed strain‑relief block under the top lid ensures the connector cannot back out with mast flex.
- Tip stability during docking: Adding a small cross‑brace reduced mast oscillations that could trigger the Roomba’s safety bumpers.
Design Goals vs. Outcomes
- Cost: Achieved. PVC + printed parts kept the mechanical BOM inexpensive.
- Functionality: Achieved. Clear FOV for the camera and LiDAR, with clean sensor cabling.
- Modularity: Achieved. I can reconfigure height and add shelves in under an hour.
- Ease of assembly/maintenance: Achieved. All fasteners are accessible; boards mount to inserts; battery swaps are fast.
Bill of Materials (mechanical/electro‑mechanical, major items)
- iRobot Create 2 (base)
- 3/4" Schedule 40 PVC pipe + self‑tapping screws
- 3D‑printed base bracket, post sockets, upper enclosure, strain‑reliefs
- Raspberry Pi 5 + official active cooler
- 5 V high‑current buck regulator (≥ 5 A)
- IMX219 camera module + acrylic window
- MPU6050 IMU breakout
- 2D LiDAR (top mount)
- 3S LiPo battery, inline fuse, master switch
- USB‑to‑TTL serial adapter (3.3 V) for Roomba OI
- Cable glands, wire, ferrules, heat‑shrink
Vipin M