
Simplifying MediaPipe Vision Processing

A project log for GestureBot - Computer Vision Robot

A mobile robot that responds to human gestures and facial expressions, using real-time pose estimation and gesture recognition for intuitive HRI

Vipin M (vipin-m), 08/10/2025 at 20:46

In my recent work on the GestureBot vision system, I made several architectural improvements that significantly simplified the codebase while maintaining performance. Here's what I learned about building robust MediaPipe-based vision pipelines in ROS 2.

The Problem: Over-Engineering for Simplicity

Initially, I implemented a complex architecture with ComposableNodes, thread pools, and async processing patterns. The system created a new thread for every camera frame and used intricate callback checking mechanisms. While this seemed like a performance optimization, it introduced unnecessary complexity:

# Old approach - complex threading
threading.Thread(
    target=self._process_frame_async,
    args=(cv_image, timestamp),
    daemon=True
).start()

# Complex callback checking after submission
if self.processing_lock.acquire(blocking=False):
    # Process and check callback results...

I refactored the entire system to use a straightforward synchronous approach that separates concerns cleanly:

1. Converted from ComposableNode to Regular Node Architecture

Before:

camera_container = ComposableNodeContainer(
    name='object_detection_camera_container',
    package='rclcpp_components',
    executable='component_container',
    composable_node_descriptions=[
        ComposableNode(package='camera_ros', plugin='camera::CameraNode')
    ]
)

After:

camera_node = Node(
    package='camera_ros',
    executable='camera_node',
    name='camera_node',
    namespace='camera'
)

Why this works better: Since my object detection node runs in Python and can't be part of the same composable container anyway, using regular nodes eliminates complexity without sacrificing performance.

2. Separated Processing Contexts

I redesigned the processing flow to have two distinct, non-blocking contexts:

def image_callback(self, msg: Image) -> None:
    """Simple synchronous image processing callback."""
    cv_image = self.cv_bridge.imgmsg_to_cv2(msg, 'bgr8')
    timestamp = time.time()
    
    # Process frame synchronously - no threading complexity
    results = self.process_frame(cv_image, timestamp)
    
    if results is not None:
        self.publish_results(results, timestamp)

Key insight: Instead of checking MediaPipe callbacks after submission, I let MediaPipe's callback system handle result publishing directly. This eliminates the need for complex synchronization between submission and result retrieval.
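To make the callback-driven flow concrete, here is a minimal sketch of how submission and publishing can live in two separate places. It is not the GestureBot code: `StubLiveStreamDetector` and `VisionNodeSketch` are hypothetical stand-ins so the example runs without ROS 2 or MediaPipe installed. The stub only imitates the shape of MediaPipe's live-stream interface, where `detect_async(image, timestamp_ms)` returns immediately and results arrive later through a `result_callback(result, output_image, timestamp_ms)`.

```python
import time
from typing import Any, Callable, List, Tuple


class StubLiveStreamDetector:
    """Hypothetical stand-in for a MediaPipe task running in live-stream mode."""

    def __init__(self, result_callback: Callable[[Any, Any, int], None]) -> None:
        self._result_callback = result_callback

    def detect_async(self, frame: Any, timestamp_ms: int) -> None:
        # A real detector would run inference in the background and call back
        # later; this stub calls back immediately with an empty result.
        result = {"detections": []}
        self._result_callback(result, frame, timestamp_ms)


class VisionNodeSketch:
    """Illustrative node: one context submits frames, the other publishes."""

    def __init__(self) -> None:
        # Stands in for a ROS 2 publisher; collects (timestamp_ms, result) pairs.
        self.published: List[Tuple[int, Any]] = []
        self.detector = StubLiveStreamDetector(self._on_result)

    def image_callback(self, frame: Any) -> None:
        # Context 1: submit the frame and return. No locks, no polling.
        # Live-stream mode expects monotonically increasing timestamps.
        timestamp_ms = int(time.monotonic() * 1000)
        self.detector.detect_async(frame, timestamp_ms)

    def _on_result(self, result: Any, output_image: Any, timestamp_ms: int) -> None:
        # Context 2: the detector's callback publishes the result directly.
        self.published.append((timestamp_ms, result))


node = VisionNodeSketch()
node.image_callback(frame=object())
print(len(node.published))
```

Because the image callback never waits on or inspects results, there is no shared state to guard between the two contexts.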

3. Fixed MediaPipe Message Conversion Robustness

MediaPipe sometimes returns None values for bounding box coordinates and confidence scores. I added comprehensive None-value handling:

# Handle None values explicitly
origin_x = getattr(bbox, 'origin_x', None)
msg.bbox_x = int(origin_x) if origin_x is not None else 0

# Robust confidence assignment with multiple fallback approaches
if score_val is not None:
    confidence_val = float(score_val)
else:
    confidence_val = 0.0

try:
    msg.confidence = confidence_val
except Exception:
    # Fallback assignment path if the normal attribute assignment raises
    object.__setattr__(msg, 'confidence', confidence_val)

This eliminated the persistent "<function DetectedObject.confidence at 0x...> returned a result with an exception set" errors that were blocking the system.
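The same None handling can be factored into small helpers so every field goes through one conversion path. This is a sketch under assumptions, not the project's actual conversion code: `to_int`, `to_float`, and `bbox_to_fields` are hypothetical names, and the `bbox_y`, `bbox_width`, and `bbox_height` keys are assumed by analogy with the `bbox_x` field shown above. A `SimpleNamespace` stands in for a MediaPipe bounding box.

```python
from types import SimpleNamespace
from typing import Any, Dict


def to_int(value: Any, default: int = 0) -> int:
    """Convert to int, falling back to a default for None or invalid input."""
    try:
        return int(value)
    except (TypeError, ValueError, OverflowError):
        return default


def to_float(value: Any, default: float = 0.0) -> float:
    """Convert to float, falling back to a default for None or invalid input."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def bbox_to_fields(bbox: Any) -> Dict[str, int]:
    """Read bounding box attributes defensively; missing or None becomes 0."""
    return {
        "bbox_x": to_int(getattr(bbox, "origin_x", None)),
        "bbox_y": to_int(getattr(bbox, "origin_y", None)),
        "bbox_width": to_int(getattr(bbox, "width", None)),
        "bbox_height": to_int(getattr(bbox, "height", None)),
    }


# Example: origin_y is None and height is missing entirely.
partial_bbox = SimpleNamespace(origin_x=12.7, origin_y=None, width=40)
print(bbox_to_fields(partial_bbox))
print(to_float(None))
```

Casting explicitly to `int` and `float` before assignment also keeps the values in the plain Python types that message fields expect.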

4. Added Shared Memory Transport for Performance

While simplifying the architecture, I maintained performance by enabling shared memory transport. This provides most of the performance benefits of ComposableNodes without the architectural complexity.

5. Cleaned Up Topic Namespace

I consolidated all camera-related topics under a clean /camera/ namespace:

remappings=[
    ('~/image_raw', '/camera/image_raw'),
    ('~/camera_info', '/camera/camera_info'),
]

This eliminates duplicate topics like /camera_node/image_raw and /camera/image_raw that were causing confusion.
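It helps to see why the duplicates appear in the first place. In ROS 2, a private name such as `~/image_raw` expands to the node's namespace plus the node's name, so the resulting topic depends on both. The function below is a simplified illustration of that expansion rule, written for this post rather than taken from any ROS 2 library; `expand_topic_name` is a hypothetical helper that handles only `~`, relative, and absolute names.

```python
def expand_topic_name(name: str, node_name: str, namespace: str = "/") -> str:
    """Simplified ROS 2 name expansion for '~', relative, and absolute names."""
    stripped = namespace.strip("/")
    ns_prefix = f"/{stripped}" if stripped else ""

    if name.startswith("/"):
        # Absolute names are used as-is.
        return name
    if name == "~" or name.startswith("~/"):
        # Private names expand to <namespace>/<node_name>/<rest>.
        return f"{ns_prefix}/{node_name}{name[1:]}"
    # Relative names expand to <namespace>/<name>.
    return f"{ns_prefix}/{name}"


# Without a namespace, the private topic lands under the node name.
print(expand_topic_name("~/image_raw", "camera_node"))
# With namespace='camera', the private topic is nested one level deeper.
print(expand_topic_name("~/image_raw", "camera_node", "camera"))
# The remapping target is absolute, so it stays exactly as written.
print(expand_topic_name("/camera/image_raw", "camera_node", "camera"))
```

Remapping the private names to absolute `/camera/...` targets pins the topics to one location regardless of how the node is named or namespaced.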

Results: Better Performance Through Simplicity

The refactored system achieves:
