Simplifying MediaPipe Vision Processing

In my recent work on the GestureBot vision system, I made several architectural improvements that significantly simplified the codebase while maintaining performance. Here's what I learned about building robust MediaPipe-based vision pipelines in ROS 2.

The Problem: Over-Engineering for Simplicity

Initially, I implemented a complex architecture with ComposableNodes, thread pools, and async processing patterns. The system created a new thread for every camera frame and used intricate callback checking mechanisms. While this seemed like a performance optimization, it introduced unnecessary complexity:

# Old approach - complex threading
threading.Thread(
    target=self._process_frame_async,
    args=(cv_image, timestamp),
    daemon=True
).start()

# Complex callback checking after submission
if self.processing_lock.acquire(blocking=False):
    # Process and check callback results...

I refactored the entire system to use a straightforward synchronous approach that separates concerns cleanly:

1. Converted from ComposableNode to Regular Node Architecture

Before:

camera_container = ComposableNodeContainer(
    name='object_detection_camera_container',
    package='rclcpp_components',
    executable='component_container',
    composable_node_descriptions=[
        ComposableNode(package='camera_ros', plugin='camera::CameraNode')
    ]
)

After:

camera_node = Node(
    package='camera_ros',
    executable='camera_node',
    name='camera_node',
    namespace='camera'
)

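Why this works better: my object detection node is written in Python, so it can never be loaded into the rclcpp component container alongside the camera. Composition therefore gave no in-process path between the camera and the detector, and using regular nodes removes the container complexity without sacrificing performance.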

2. Separated Processing Contexts

I redesigned the processing flow around two distinct contexts, frame submission and result publishing, neither of which waits on the other:

def image_callback(self, msg: Image) -> None:
    """Simple synchronous image processing callback."""
    cv_image = self.cv_bridge.imgmsg_to_cv2(msg, 'bgr8')
    timestamp = time.time()
    
    # Process frame synchronously - no threading complexity
    results = self.process_frame(cv_image, timestamp)
    
    if results is not None:
        self.publish_results(results, timestamp)

Key insight: Instead of checking MediaPipe callbacks after submission, I let MediaPipe's callback system handle result publishing directly. This eliminates the need for complex synchronization between submission and result retrieval.
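One way to read this split can be sketched without ROS or MediaPipe installed. In the minimal sketch below, `FakeDetector` is a hypothetical stand-in for a detector that reports results through a callback, `published` stands in for a ROS publisher, and `MonotonicMsClock` is an illustrative helper; none of these names come from the GestureBot code. The clock is there because MediaPipe's live-stream mode expects integer millisecond timestamps that keep increasing, which a wall-clock float from `time.time()` does not guarantee.

```python
import time
from typing import Callable, List, Tuple

Result = List[str]  # placeholder for a real detection result type


class FakeDetector:
    """Hypothetical stand-in for a detector that reports results via callback."""

    def __init__(self, result_callback: Callable[[Result, int], None]) -> None:
        self._result_callback = result_callback

    def detect_async(self, image: object, timestamp_ms: int) -> None:
        # A real detector would return immediately and invoke the callback
        # later from its own thread; here it is invoked inline for clarity.
        self._result_callback(["object"], timestamp_ms)


class MonotonicMsClock:
    """Produces strictly increasing integer millisecond timestamps."""

    def __init__(self) -> None:
        self._last_ms = -1

    def next_ms(self) -> int:
        now_ms = int(time.monotonic() * 1000)
        # Two frames inside the same millisecond still get distinct stamps.
        self._last_ms = max(now_ms, self._last_ms + 1)
        return self._last_ms


class VisionPipeline:
    def __init__(self) -> None:
        self.published: List[Tuple[int, Result]] = []  # stand-in for a publisher
        self._clock = MonotonicMsClock()
        self._detector = FakeDetector(self._on_result)

    def image_callback(self, image: object) -> None:
        """Context 1: submit the frame and return without waiting on results."""
        self._detector.detect_async(image, self._clock.next_ms())

    def _on_result(self, result: Result, timestamp_ms: int) -> None:
        """Context 2: publish as soon as the detector reports a result."""
        self.published.append((timestamp_ms, result))


pipeline = VisionPipeline()
for frame in range(3):
    pipeline.image_callback(frame)
stamps = [ts for ts, _ in pipeline.published]
```

The image callback never inspects detector state, and the result handler never touches incoming frames, so there is nothing for a lock to protect between them.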

3. Fixed MediaPipe Message Conversion Robustness

MediaPipe sometimes returns None values for bounding box coordinates and confidence scores. I added comprehensive None-value handling:

# Handle None values explicitly
origin_x = getattr(bbox, 'origin_x', None)
msg.bbox_x = int(origin_x) if origin_x is not None else 0

# Robust confidence assignment with multiple fallback approaches
if score_val is not None:
    confidence_val = float(score_val)
else:
    confidence_val = 0.0

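try:
    msg.confidence = confidence_val
except Exception:
    # Catch Exception rather than using a bare except, so that
    # KeyboardInterrupt and SystemExit still propagate
    object.__setattr__(msg, 'confidence', confidence_val)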

This eliminated the persistent "<function DetectedObject.confidence at 0x...> returned a result with an exception set" errors that were blocking the system.
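The same guard applies to every numeric field, so it can be factored into small helpers. The sketch below is a minimal illustration rather than the project's actual code: `FakeBoundingBox`, `int_or_default`, `float_or_default`, and `bbox_fields` are hypothetical names, and the `origin_y`, `width`, and `height` fields are assumed by analogy with the `origin_x` field shown above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FakeBoundingBox:
    """Hypothetical stand-in for a detector bounding box with optional fields."""
    origin_x: Optional[int] = None
    origin_y: Optional[int] = None
    width: Optional[int] = None
    height: Optional[int] = None


def int_or_default(value: object, default: int = 0) -> int:
    """Return int(value), or default when value is None or not convertible."""
    if value is None:
        return default
    try:
        return int(value)
    except (TypeError, ValueError, OverflowError):
        # ValueError covers NaN, OverflowError covers infinity
        return default


def float_or_default(value: object, default: float = 0.0) -> float:
    """Return a native Python float, or default when value is unusable."""
    if value is None:
        return default
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def bbox_fields(bbox: object) -> Dict[str, int]:
    """Read the four box fields, substituting 0 for anything missing or None."""
    names = ("origin_x", "origin_y", "width", "height")
    return {name: int_or_default(getattr(bbox, name, None)) for name in names}


partial = FakeBoundingBox(origin_x=12, width=30)
fields = bbox_fields(partial)
confidence = float_or_default(None)
```

Converting to native `int` and `float` before assignment keeps the message setters from seeing `None` or an unexpected numeric type in the first place.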

4. Added Shared Memory Transport for Performance

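While simplifying the architecture, I maintained performance by enabling shared memory transport, so images pass between the camera and detection processes on the same machine without going through the network stack. This provides most of the performance benefits of ComposableNodes without the architectural complexity.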
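To get a feel for why the transport matters, it helps to work out how much raw image data crosses the process boundary. The arithmetic below assumes a 640x480 stream at 30 fps, which is an illustrative figure rather than GestureBot's measured configuration; the three bytes per pixel follow from the `bgr8` encoding used in the image callback. The function names are hypothetical.

```python
def raw_image_bytes(width: int, height: int, channels: int = 3,
                    bytes_per_channel: int = 1) -> int:
    """Size of one uncompressed frame in bytes (bgr8 = 3 channels x 1 byte)."""
    return width * height * channels * bytes_per_channel


def stream_rate_mb_per_s(frame_bytes: int, fps: float) -> float:
    """Sustained data rate in decimal megabytes per second."""
    return frame_bytes * fps / 1_000_000


# Assumed resolution and frame rate, for illustration only
frame_bytes = raw_image_bytes(640, 480)
rate = stream_rate_mb_per_s(frame_bytes, 30)
```

Under these assumptions each frame is 921,600 bytes and the stream carries about 27.6 MB/s, all of which would otherwise be serialized and copied on every hop between processes.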

5. Cleaned Up Topic Namespace

I consolidated all camera-related topics under a clean /camera/ namespace:

remappings=[
    ('~/image_raw', '/camera/image_raw'),
    ('~/camera_info', '/camera/camera_info'),
]

This eliminates duplicate topics like /camera_node/image_raw and /camera/image_raw that were causing confusion.
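The duplicates come from how ROS 2 expands the private `~` prefix: it is replaced by the node's namespace plus its name. The helper below is a simplified model of that rule for illustration only (`expand_private_name` is a hypothetical function, not part of rclpy or launch_ros), and it shows the default names that the remappings replace.

```python
def expand_private_name(name: str, node_name: str, namespace: str = "/") -> str:
    """Resolve a topic name roughly the way ROS 2 does (simplified model)."""
    stripped = namespace.strip("/")
    ns = f"/{stripped}" if stripped else ""
    node_fqn = f"{ns}/{node_name}"

    if name == "~":
        return node_fqn
    if name.startswith("~/"):
        # Private name: '~' becomes namespace + node name
        return node_fqn + name[1:]
    if name.startswith("~"):
        raise ValueError("'~' must be followed by '/' or stand alone")
    if name.startswith("/"):
        # Fully qualified name: used as-is
        return name
    # Relative name: resolved against the namespace only
    return f"{ns}/{name}"


# Node in the root namespace: the source of /camera_node/image_raw
root_default = expand_private_name("~/image_raw", "camera_node")

# Same node placed in the 'camera' namespace, before any remapping
namespaced_default = expand_private_name("~/image_raw", "camera_node", "camera")

# The remapping target is already fully qualified
remapped = expand_private_name("/camera/image_raw", "camera_node", "camera")
```

Remapping the private names to fully qualified `/camera/...` targets pins the topics in one place regardless of what the node is called or which namespace it is launched in.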

Results: Better Performance Through Simplicity

The refactored system delivers:

No threading overhead: threads are no longer created per frame
Cleaner error handling: robust None-value processing prevents crashes
Simpler debugging: linear execution flow is easy to trace
Maintained performance: shared memory transport provides efficient image transfer
Clean topic structure: a single source of truth for camera data
