
Building a 33-Point Human Skeleton Tracker

A project log for GestureBot - Computer Vision Robot

A mobile robot that responds to human gestures and facial expressions using real-time pose estimation, gesture recognition, and intuitive HRI

Vipin M • 08/14/2025 at 15:20

Most robotics vision systems treat humans as simple bounding boxes – "person detected, avoid obstacle." But humans are dynamic, expressive, and predictable if you know how to read body language. A person leaning forward might be about to walk into the robot's path. Someone pointing could be giving directional commands. Arms raised might signal "stop."

I needed a system that could track full-body pose in real time, run on the Raspberry Pi 5 that drives the robot, and publish its results into the same ROS 2 graph as my object detection and gesture recognition nodes.

The Core Architecture

The heart of my implementation is a modular ROS 2 node that wraps MediaPipe's PoseLandmarker model. I chose a composition pattern to keep the ROS infrastructure separate from the MediaPipe processing logic:

class PoseDetectionNode(MediaPipeBaseNode, MediaPipeCallbackMixin):
    def __init__(self, **kwargs):
        MediaPipeCallbackMixin.__init__(self)
        super().__init__(
            node_name='pose_detection_node',
            **kwargs
        )
        
        # Initialize the pose detection controller
        self.controller = PoseDetectionController(
            model_path=self.model_path,
            confidence_threshold=self.confidence_threshold,
            max_poses=self.max_poses,
            logger=self.get_logger()
        )
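
The base node owns the camera subscription and parameter handling, so the pose node itself only has to convert each incoming frame and hand it to the controller. The glue looks roughly like this (a sketch rather than the exact code; the callback name, the cv_bridge conversion, and the controller's detect_async method are assumptions here):

def image_callback(self, msg):
    """Receive a sensor_msgs/Image, convert it, and hand it off to MediaPipe."""
    # cv_bridge turns the ROS image into a plain OpenCV/NumPy array.
    frame = self.cv_bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')

    # The controller owns the MediaPipe landmarker and its async submission.
    self.controller.detect_async(frame)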

The PoseDetectionController handles all MediaPipe-specific operations: 

from mediapipe.tasks import python
from mediapipe.tasks.python import vision

class PoseDetectionController:
    def __init__(self, model_path: str, confidence_threshold: float, max_poses: int, logger):
        self.logger = logger
        
        # Configure MediaPipe options
        base_options = python.BaseOptions(model_asset_path=model_path)
        options = vision.PoseLandmarkerOptions(
            base_options=base_options,
            running_mode=vision.RunningMode.LIVE_STREAM,
            num_poses=max_poses,
            min_pose_detection_confidence=confidence_threshold,
            min_pose_presence_confidence=confidence_threshold,
            min_tracking_confidence=confidence_threshold,
            result_callback=self._pose_callback
        )
        
        self._landmarker = vision.PoseLandmarker.create_from_options(options)

The 33-Point Pose Model

MediaPipe's pose model detects 33 landmarks covering the entire human body: the face (nose, eyes, ears, and mouth corners), the arms (shoulders, elbows, wrists, plus pinky, index, and thumb points on each hand), and the legs (hips, knees, ankles, heels, and foot tips).

Each landmark provides normalized (x, y) image coordinates, a relative depth estimate (z), and a visibility score, giving rich information about human pose and orientation.
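
Because the landmark ordering is fixed, individual body parts can be pulled straight out of the flat list by index. A minimal sketch of that, using the published index constants and an arbitrary visibility cutoff (the helper itself is illustrative, not part of the node):

# Indices follow MediaPipe's fixed 33-landmark ordering.
NOSE, LEFT_SHOULDER, RIGHT_SHOULDER = 0, 11, 12
LEFT_HIP, RIGHT_HIP = 23, 24

def shoulder_midpoint(landmarks, min_visibility=0.5):
    """Return the normalized (x, y) midpoint of the shoulders, or None if occluded."""
    left, right = landmarks[LEFT_SHOULDER], landmarks[RIGHT_SHOULDER]
    if left.visibility < min_visibility or right.visibility < min_visibility:
        return None  # one or both shoulders are not reliably visible
    return ((left.x + right.x) / 2.0, (left.y + right.y) / 2.0)

# Usage: landmarks = result.pose_landmarks[0]  (first detected person)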

Handling MediaPipe's Async Processing

One of the trickier aspects was properly handling MediaPipe's asynchronous LIVE_STREAM mode. The pose detection happens in a separate thread, with results delivered via callback:

def _pose_callback(self, result: vision.PoseLandmarkerResult, 
                   output_image: mp.Image, timestamp_ms: int):
    """Handle pose detection results from MediaPipe."""
    try:
        # Convert MediaPipe timestamp to ROS time
        ros_timestamp = self._convert_timestamp(timestamp_ms)
        
        # Process pose landmarks
        pose_msg = PoseLandmarks()
        pose_msg.header.stamp = ros_timestamp
        pose_msg.header.frame_id = 'camera_frame'
        
        if result.pose_landmarks:
            pose_msg.num_poses = len(result.pose_landmarks)
            
            # Handle MediaPipe's pose landmark structure variations
            for pose_landmarks in result.pose_landmarks:
                try:
                    # MediaPipe structure can vary between versions
                    if hasattr(pose_landmarks, '__iter__') and not hasattr(pose_landmarks, 'landmark'):
                        landmarks = pose_landmarks  # Direct list access
                    else:
                        landmarks = pose_landmarks.landmark  # Attribute access
                        
                    for landmark in landmarks:
                        point = Point()
                        point.x = landmark.x
                        point.y = landmark.y  
                        point.z = landmark.z
                        pose_msg.landmarks.append(point)
                        
                except Exception as e:
                    self.logger.warn(f'Pose landmark processing error: {e}')
                    continue
        
        # Publish results
        self.pose_publisher.publish(pose_msg)
        
    except Exception as e:
        self.logger.error(f'Pose callback error: {e}')
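
The other half of the async contract is feeding frames in. In LIVE_STREAM mode each frame is wrapped in an mp.Image and submitted with a monotonically increasing timestamp; detect_async returns immediately and the result arrives later through the callback above. A sketch of that submission path on the controller (the method name and the BGR-to-RGB step are assumptions; the detect_async call itself is MediaPipe's API):

import time
import cv2
import mediapipe as mp

def detect_async(self, bgr_frame):
    """Submit a frame for asynchronous pose detection (assumed controller method)."""
    # MediaPipe expects RGB data wrapped in an mp.Image.
    rgb_frame = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB)
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)

    # Timestamps must be monotonically increasing in LIVE_STREAM mode.
    timestamp_ms = int(time.monotonic() * 1000)
    self._landmarker.detect_async(mp_image, timestamp_ms)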

Performance Reality Check: 3-7 FPS on Pi 5

Let me be honest about performance – this isn't going to run at 30 FPS on a Raspberry Pi 5. Through extensive testing, I consistently measured around 3–7 FPS for full-body pose detection on the Pi 5.

But here's the thing – for robotics applications, this is actually sufficient. Human movement is relatively slow compared to computer vision processing. A robot doesn't need to track every micro-movement; it needs to understand general pose, direction, and intent.

The key insight was optimizing for stability over speed. I'd rather have consistent 5 FPS processing than erratic 15 FPS with dropped frames and errors.
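
One concrete way to get that stability is to never feed MediaPipe faster than it can finish. A simple throttle that drops frames instead of queueing them is enough; this is a sketch of the idea rather than the exact code running on the robot (the target rate and class name are assumptions):

import time

class FrameThrottler:
    """Drop frames so the detector sees a steady, sustainable rate."""

    def __init__(self, target_fps=5.0):
        self._min_interval = 1.0 / target_fps
        self._last_submit = 0.0

    def should_process(self):
        now = time.monotonic()
        if now - self._last_submit < self._min_interval:
            return False  # too soon: drop this frame instead of queueing it
        self._last_submit = now
        return True

# In the image callback:
#     if throttler.should_process():
#         controller.detect_async(frame)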

Debugging MediaPipe Integration Challenges

The most frustrating part of this project was dealing with MediaPipe's pose landmark data structures. Different versions of MediaPipe return pose landmarks in slightly different formats, and the documentation doesn't clearly explain the variations.

I spent hours debugging errors like:

AttributeError: 'list' object has no attribute 'landmark'

The solution was implementing robust structure detection:

# Handle both possible MediaPipe pose landmark structures
if hasattr(pose_landmarks, '__iter__') and not hasattr(pose_landmarks, 'landmark'):
    # pose_landmarks is already a list of landmarks (newer format)
    landmarks = pose_landmarks
else:
    # pose_landmarks has .landmark attribute (older format)  
    landmarks = pose_landmarks.landmark

Beyond the structure fix, visual debugging was essential. When an annotated image publisher is configured, the node draws the detected skeleton onto each frame and publishes it as a standard ROS image:

if self.annotated_image_publisher and result.pose_landmarks:
    annotated_frame = self._create_annotated_image(output_image.numpy_view(), result)
    annotated_msg = self.cv_bridge.cv2_to_imgmsg(annotated_frame, encoding='bgr8')
    annotated_msg.header.stamp = self.get_clock().now().to_msg()
    annotated_msg.header.frame_id = 'camera_frame'
    self.annotated_image_publisher.publish(annotated_msg)
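
The _create_annotated_image helper isn't shown above; it follows the drawing pattern from MediaPipe's own PoseLandmarker examples. A sketch, with the final RGB-to-BGR conversion assumed so it matches the bgr8 encoding used when publishing:

import cv2
import numpy as np
from mediapipe import solutions
from mediapipe.framework.formats import landmark_pb2

def _create_annotated_image(self, rgb_image, result):
    """Draw every detected skeleton onto a copy of the frame and return it as BGR."""
    annotated = np.copy(rgb_image)
    for pose_landmarks in result.pose_landmarks:
        # drawing_utils expects the protobuf landmark list, not the Tasks API objects.
        proto = landmark_pb2.NormalizedLandmarkList()
        proto.landmark.extend(
            landmark_pb2.NormalizedLandmark(x=lm.x, y=lm.y, z=lm.z)
            for lm in pose_landmarks
        )
        solutions.drawing_utils.draw_landmarks(
            annotated,
            proto,
            solutions.pose.POSE_CONNECTIONS,
            solutions.drawing_styles.get_default_pose_landmarks_style(),
        )
    return cv2.cvtColor(annotated, cv2.COLOR_RGB2BGR)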

I can develop and debug with full visualization, then deploy the same code headless on the robot.

Multi-Modal Vision Integration

The real power comes from combining pose detection with other vision modalities. My unified image viewer can display multiple vision streams simultaneously:

# View all vision systems together
ros2 launch gesturebot image_viewer.launch.py \
    image_topics:='["/vision/objects/annotated", "/vision/gestures/annotated", "/vision/pose/annotated"]' \
    topic_window_names:='{"\/vision\/objects\/annotated": "Objects", "\/vision\/gestures\/annotated": "Gestures", "\/vision\/pose\/annotated": "Poses"}'

This gives me object detection (what's there), gesture recognition (what they're doing), and pose detection (how they're positioned) all in one integrated system.

Practical Robotics Applications

Human-Aware Navigation

Instead of treating humans as static obstacles, my robot can now read body orientation and lean, anticipate where a person is about to move, and adjust its path before they step into it.

Gesture Command Integration

Pose data enhances gesture recognition by providing context: a pointing arm reads as a directional command, and arms raised overhead read as an unambiguous stop signal, because the skeleton shows where those arms sit relative to the rest of the body.

Safety and Interaction

The 33-point skeleton provides rich safety information: the robot knows not just that a person is present, but how they are oriented, how they are posed, and whether they are signalling it to stop.
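
To make that concrete, here is a sketch of two simple heuristics built on the landmark list: raised arms as a stop command, and MediaPipe's hip-relative z values as a rough lean check. The thresholds are assumptions to tune on the real robot, not values from the deployed code:

NOSE, LEFT_WRIST, RIGHT_WRIST = 0, 15, 16

def arms_raised(landmarks):
    """True if both wrists are above the nose (normalized y grows downward)."""
    return (landmarks[LEFT_WRIST].y < landmarks[NOSE].y and
            landmarks[RIGHT_WRIST].y < landmarks[NOSE].y)

def leaning_toward_camera(landmarks, z_threshold=-0.3):
    """Rough lean check: z is measured relative to the hip midpoint,
    so a strongly negative nose z means the head is well in front of the hips."""
    return landmarks[NOSE].z < z_threshold

# In the robot's behavior loop (helper names assumed):
#     if arms_raised(landmarks):
#         stop_robot()
#     elif leaning_toward_camera(landmarks):
#         slow_down_and_yield()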
