Most robotics vision systems treat humans as simple bounding boxes – "person detected, avoid obstacle." But humans are dynamic, expressive, and predictable if you know how to read body language. A person leaning forward might be about to walk into the robot's path. Someone pointing could be giving directional commands. Arms raised might signal "stop."
I needed a system that could:
- Track 33 distinct body landmarks in real-time
- Handle multiple people simultaneously (up to 2 poses)
- Run headless on embedded hardware without X11 dependencies
- Integrate cleanly with my existing ROS 2 navigation stack
- Provide visual feedback for development and debugging
The Core Architecture
The heart of my implementation is a modular ROS 2 node that wraps MediaPipe's PoseLandmarker model. I chose a composition pattern to keep the ROS infrastructure separate from the MediaPipe processing logic:
class PoseDetectionNode(MediaPipeBaseNode, MediaPipeCallbackMixin):
    def __init__(self, **kwargs):
        MediaPipeCallbackMixin.__init__(self)
        super().__init__(
            node_name='pose_detection_node',
            **kwargs
        )

        # Initialize the pose detection controller
        self.controller = PoseDetectionController(
            model_path=self.model_path,
            confidence_threshold=self.confidence_threshold,
            max_poses=self.max_poses,
            logger=self.get_logger()
        )

The PoseDetectionController handles all MediaPipe-specific operations:
class PoseDetectionController:
    def __init__(self, model_path: str, confidence_threshold: float, max_poses: int, logger):
        self.logger = logger

        # Configure MediaPipe options
        base_options = python.BaseOptions(model_asset_path=model_path)
        options = vision.PoseLandmarkerOptions(
            base_options=base_options,
            running_mode=vision.RunningMode.LIVE_STREAM,
            num_poses=max_poses,
            min_pose_detection_confidence=confidence_threshold,
            min_pose_presence_confidence=confidence_threshold,
            min_tracking_confidence=confidence_threshold,
            result_callback=self._pose_callback
        )
        self._landmarker = vision.PoseLandmarker.create_from_options(options)

The 33-Point Pose Model
MediaPipe's pose model detects 33 landmarks covering the entire human body:
- Face: Nose, eyes (inner, center, outer on each side), ears, mouth corners (11 points)
- Arms: Shoulders, elbows, wrists (6 points)
- Hands: Pinky, index, and thumb knuckles on each hand (6 points)
- Legs: Hips, knees, ankles (6 points)
- Feet: Heels and foot index (toe) points (4 points)
Each landmark provides normalized (x, y) image coordinates, a relative depth estimate (z), and a visibility score, giving rich information about human pose and orientation.
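Downstream code addresses landmarks by index, since each detected pose arrives as a flat list of 33 points. A minimal sketch of pulling out named joints and honoring the visibility score (the index values follow MediaPipe's documented numbering; the helper name and the 0.5 threshold are my own):

LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12  # MediaPipe's documented landmark indices

def shoulder_midpoint(landmarks, min_visibility=0.5):
    """Return the normalized (x, y) shoulder midpoint for one pose's 33 landmarks,
    or None if either shoulder is too occluded to trust."""
    left, right = landmarks[LEFT_SHOULDER], landmarks[RIGHT_SHOULDER]
    if left.visibility < min_visibility or right.visibility < min_visibility:
        return None
    return ((left.x + right.x) / 2.0, (left.y + right.y) / 2.0)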
Handling MediaPipe's Async Processing
One of the trickier aspects was properly handling MediaPipe's asynchronous LIVE_STREAM mode. The pose detection happens in a separate thread, with results delivered via callback:
def _pose_callback(self, result: vision.PoseLandmarkerResult,
                   output_image: mp.Image, timestamp_ms: int):
    """Handle pose detection results from MediaPipe."""
    try:
        # Convert MediaPipe timestamp to ROS time
        ros_timestamp = self._convert_timestamp(timestamp_ms)

        # Process pose landmarks
        pose_msg = PoseLandmarks()
        pose_msg.header.stamp = ros_timestamp
        pose_msg.header.frame_id = 'camera_frame'

        if result.pose_landmarks:
            pose_msg.num_poses = len(result.pose_landmarks)

            # Handle MediaPipe's pose landmark structure variations
            for pose_landmarks in result.pose_landmarks:
                try:
                    # MediaPipe structure can vary between versions
                    if hasattr(pose_landmarks, '__iter__') and not hasattr(pose_landmarks, 'landmark'):
                        landmarks = pose_landmarks  # Direct list access
                    else:
                        landmarks = pose_landmarks.landmark  # Attribute access

                    for landmark in landmarks:
                        point = Point()
                        point.x = landmark.x
                        point.y = landmark.y
                        point.z = landmark.z
                        pose_msg.landmarks.append(point)
                except Exception as e:
                    self.logger.warn(f'Pose landmark processing error: {e}')
                    continue

        # Publish results
        self.pose_publisher.publish(pose_msg)

    except Exception as e:
        self.logger.error(f'Pose callback error: {e}')

Performance Reality Check: 3-7 FPS on Pi 5
Let me be honest about performance – this isn't going to run at 30 FPS on a Raspberry Pi 5. Through extensive testing, I measured:
- Actual Performance: 3-7 FPS @ 640x480 resolution
- Processing Time: ~150-300ms per frame
- Memory Usage: ~200MB additional RAM
- CPU Load: ~40-60% of one core during active processing
But here's the thing – for robotics applications, this is actually sufficient. Human movement is slow relative to even a few frames per second of processing. A robot doesn't need to track every micro-movement; it needs to understand general pose, direction, and intent.
The key insight was optimizing for stability over speed. I'd rather have consistent 5 FPS processing than erratic 15 FPS with dropped frames and errors.
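One way to get that stability is to drop frames before they reach MediaPipe rather than letting them queue up behind the landmarker. A minimal sketch of timestamp-based throttling in front of detect_async(), assuming a plain OpenCV capture loop (the 200 ms budget is illustrative, not a measured optimum):

import time

import cv2
import mediapipe as mp

TARGET_PERIOD_MS = 200  # roughly 5 FPS; an assumed budget, tune for your hardware

def run_camera_loop(landmarker, camera_index=0):
    """Feed frames to a LIVE_STREAM PoseLandmarker no faster than the target rate."""
    cap = cv2.VideoCapture(camera_index)
    last_sent_ms = 0
    while cap.isOpened():
        ok, frame_bgr = cap.read()
        if not ok:
            break
        now_ms = int(time.monotonic() * 1000)
        if now_ms - last_sent_ms < TARGET_PERIOD_MS:
            continue  # drop this frame instead of queuing it
        last_sent_ms = now_ms
        # MediaPipe expects RGB; LIVE_STREAM timestamps must increase monotonically
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame_rgb)
        landmarker.detect_async(mp_image, now_ms)
    cap.release()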
Debugging MediaPipe Integration Challenges
The most frustrating part of this project was dealing with MediaPipe's pose landmark data structures. Different versions of MediaPipe return pose landmarks in slightly different formats, and the documentation doesn't clearly explain the variations.
I spent hours debugging errors like:
AttributeError: 'list' object has no attribute 'landmark'

The solution was implementing robust structure detection:
# Handle both possible MediaPipe pose landmark structures
if hasattr(pose_landmarks, '__iter__') and not hasattr(pose_landmarks, 'landmark'):
    # pose_landmarks is already a list of landmarks (newer format)
    landmarks = pose_landmarks
else:
    # pose_landmarks has .landmark attribute (older format)
    landmarks = pose_landmarks.landmark

Visualization follows the same defensive pattern and is entirely optional: the annotated image is only built and published when the publisher exists, so the node runs happily headless:
if self.annotated_image_publisher and result.pose_landmarks:
    annotated_frame = self._create_annotated_image(output_image.numpy_view(), result)
    annotated_msg = self.cv_bridge.cv2_to_imgmsg(annotated_frame, encoding='bgr8')
    annotated_msg.header.stamp = self.get_clock().now().to_msg()
    annotated_msg.header.frame_id = 'camera_frame'
    self.annotated_image_publisher.publish(annotated_msg)

I can develop and debug with full visualization, then deploy the same code headless on the robot.
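The _create_annotated_image() helper isn't shown above; a sketch of one way to implement it with MediaPipe's bundled drawing utilities, assuming the Tasks-API result where each pose is a plain list of landmarks:

import cv2
import mediapipe as mp
from mediapipe.framework.formats import landmark_pb2

def create_annotated_image(rgb_frame, result):
    """Draw every detected pose skeleton onto a copy of the frame (returned as BGR)."""
    annotated = rgb_frame.copy()
    for pose_landmarks in result.pose_landmarks:
        # Wrap the Tasks-API landmarks in the proto type draw_landmarks() expects
        proto = landmark_pb2.NormalizedLandmarkList()
        proto.landmark.extend(
            landmark_pb2.NormalizedLandmark(x=lm.x, y=lm.y, z=lm.z) for lm in pose_landmarks
        )
        mp.solutions.drawing_utils.draw_landmarks(
            annotated,
            proto,
            mp.solutions.pose.POSE_CONNECTIONS,
            mp.solutions.drawing_styles.get_default_pose_landmarks_style(),
        )
    # The node publishes bgr8, so convert from MediaPipe's RGB view
    return cv2.cvtColor(annotated, cv2.COLOR_RGB2BGR)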
Multi-Modal Vision Integration
The real power comes from combining pose detection with other vision modalities. My unified image viewer can display multiple vision streams simultaneously:
# View all vision systems together
ros2 launch gesturebot image_viewer.launch.py \
image_topics:='["/vision/objects/annotated", "/vision/gestures/annotated", "/vision/pose/annotated"]' \
topic_window_names:='{"\/vision\/objects\/annotated": "Objects", "\/vision\/gestures\/annotated": "Gestures", "\/vision\/pose\/annotated": "Poses"}'This gives me object detection (what's there), gesture recognition (what they're doing), and pose detection (how they're positioned) all in one integrated system.
Practical Robotics Applications
Human-Aware Navigation
Instead of treating humans as static obstacles, my robot can now:
- Predict movement direction from body orientation (see the sketch after this list)
- Maintain appropriate social distances based on pose
- Recognize when someone is trying to interact vs. just passing by
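Body orientation comes almost for free from the shoulder landmarks. A rough heuristic of my own (not code from the project): combine the shoulders' horizontal spread with MediaPipe's relative depth to estimate a torso yaw, which is a usable proxy for where a walking person is headed.

import math

LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12

def estimate_torso_yaw(landmarks):
    """Rough torso yaw from one pose's 33 landmarks.

    Roughly 0 when the person squarely faces the camera and near +/-pi when they
    face away, changing continuously as they turn. MediaPipe's z is only a
    relative depth (smaller = closer to the camera), so treat this as a coarse
    heuristic rather than a calibrated angle.
    """
    left, right = landmarks[LEFT_SHOULDER], landmarks[RIGHT_SHOULDER]
    dx = left.x - right.x   # facing the camera puts the left shoulder at larger image x
    dz = left.z - right.z   # which shoulder sits closer to the camera
    return math.atan2(dz, dx)

Feeding even a coarse yaw like this into the planner lets the robot bias its predicted human trajectory along the torso heading instead of treating the person as a static obstacle.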
Gesture Command Integration
Pose data enhances gesture recognition by providing context:
- A pointing gesture combined with body orientation gives directional commands (sketched after this list)
- Raised arms with forward-leaning pose might indicate "stop" vs. "hello"
- Multiple people can give conflicting gestures – pose helps determine who to follow
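As an example of that kind of contextual check, here is a rough heuristic of my own for turning a pointing arm into a coarse left/right command; the landmark indices are MediaPipe's, the extension threshold is arbitrary:

LEFT_ELBOW, LEFT_WRIST = 13, 15
RIGHT_ELBOW, RIGHT_WRIST = 14, 16

def pointing_direction(landmarks, min_extension=0.08):
    """Classify a pointing arm as 'left', 'right', or None (no clear point).

    Takes whichever forearm (elbow -> wrist) is more extended horizontally in
    image space and uses its sign. Coordinates are normalized, so the threshold
    is a fraction of image width.
    """
    best = None
    for elbow_i, wrist_i in ((LEFT_ELBOW, LEFT_WRIST), (RIGHT_ELBOW, RIGHT_WRIST)):
        dx = landmarks[wrist_i].x - landmarks[elbow_i].x
        if abs(dx) >= min_extension and (best is None or abs(dx) > abs(best)):
            best = dx
    if best is None:
        return None
    # Directions are in image/camera terms; map them into the robot or person
    # frame according to your camera mounting.
    return 'right' if best > 0 else 'left'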
Safety and Interaction
The 33-point skeleton provides rich safety information:
- Detect if someone has fallen (unusual pose angles; see the sketch after this list)
- Recognize aggressive vs. friendly body language
- Identify when someone is carrying objects that might affect navigation
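Fall detection is a good example of how far simple geometry goes. A heuristic sketch of my own, not the project's code: compare the shoulder and hip midpoints and flag poses whose torso line is more horizontal than vertical in the image.

LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12
LEFT_HIP, RIGHT_HIP = 23, 24

def torso_looks_horizontal(landmarks, ratio=1.5):
    """Flag poses whose shoulder-to-hip line is more horizontal than vertical.

    Image y grows downward, so a standing torso has a large |dy| and small |dx|,
    while a person lying on the ground is the opposite. 'ratio' sets how decisive
    the horizontal spread must be before flagging.
    """
    mid_shoulder_x = (landmarks[LEFT_SHOULDER].x + landmarks[RIGHT_SHOULDER].x) / 2.0
    mid_shoulder_y = (landmarks[LEFT_SHOULDER].y + landmarks[RIGHT_SHOULDER].y) / 2.0
    mid_hip_x = (landmarks[LEFT_HIP].x + landmarks[RIGHT_HIP].x) / 2.0
    mid_hip_y = (landmarks[LEFT_HIP].y + landmarks[RIGHT_HIP].y) / 2.0
    dx = abs(mid_shoulder_x - mid_hip_x)
    dy = abs(mid_shoulder_y - mid_hip_y)
    return dx > ratio * dy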