Most robotics vision systems treat humans as simple bounding boxes – "person detected, avoid obstacle." But humans are dynamic, expressive, and predictable if you know how to read body language. A person leaning forward might be about to walk into the robot's path. Someone pointing could be giving directional commands. Arms raised might signal "stop."
I needed a system that could:
- Track 33 distinct body landmarks in real-time
- Handle multiple people simultaneously (up to 2 poses)
- Run headless on embedded hardware without X11 dependencies
- Integrate cleanly with my existing ROS 2 navigation stack
- Provide visual feedback for development and debugging
The Core Architecture
The heart of my implementation is a modular ROS 2 node that wraps MediaPipe's PoseLandmarker model. I chose a composition pattern to keep the ROS infrastructure separate from the MediaPipe processing logic:
class PoseDetectionNode(MediaPipeBaseNode, MediaPipeCallbackMixin):
    def __init__(self, **kwargs):
        MediaPipeCallbackMixin.__init__(self)
        super().__init__(
            node_name='pose_detection_node',
            **kwargs
        )

        # Initialize the pose detection controller
        self.controller = PoseDetectionController(
            model_path=self.model_path,
            confidence_threshold=self.confidence_threshold,
            max_poses=self.max_poses,
            logger=self.get_logger()
        )

The PoseDetectionController handles all MediaPipe-specific operations:
class PoseDetectionController:
    def __init__(self, model_path: str, confidence_threshold: float, max_poses: int, logger):
        self.logger = logger

        # Configure MediaPipe options
        base_options = python.BaseOptions(model_asset_path=model_path)
        options = vision.PoseLandmarkerOptions(
            base_options=base_options,
            running_mode=vision.RunningMode.LIVE_STREAM,
            num_poses=max_poses,
            min_pose_detection_confidence=confidence_threshold,
            min_pose_presence_confidence=confidence_threshold,
            min_tracking_confidence=confidence_threshold,
            result_callback=self._pose_callback
        )
        self._landmarker = vision.PoseLandmarker.create_from_options(options)

The 33-Point Pose Model
MediaPipe's pose model detects 33 landmarks covering the entire human body:
- Face: Nose, eyes (inner, center, outer on each side), ears, mouth corners (11 points)
- Arms: Shoulders, elbows, wrists (6 points)
- Hands: Pinky, index, and thumb knuckles on each hand (6 points)
- Legs: Hips, knees, ankles (6 points)
- Feet: Heels and foot index (toe) points (4 points)
Each landmark provides normalized (x, y) image coordinates, a relative depth estimate (z), and a visibility score, giving rich information about human pose and orientation.
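Downstream code addresses landmarks by index, since each detected pose arrives as a flat list of 33 points. A minimal sketch of pulling out named joints and honoring the visibility score (the index values follow MediaPipe's documented numbering; the helper name and the 0.5 threshold are my own):

LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12  # MediaPipe's documented landmark indices

def shoulder_midpoint(landmarks, min_visibility=0.5):
    """Return the normalized (x, y) shoulder midpoint for one pose's 33 landmarks,
    or None if either shoulder is too occluded to trust."""
    left, right = landmarks[LEFT_SHOULDER], landmarks[RIGHT_SHOULDER]
    if left.visibility < min_visibility or right.visibility < min_visibility:
        return None
    return ((left.x + right.x) / 2.0, (left.y + right.y) / 2.0)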
Handling MediaPipe's Async Processing
One of the trickier aspects was properly handling MediaPipe's asynchronous LIVE_STREAM mode. The pose detection happens in a separate thread, with results delivered via callback:
def _pose_callback(self, result: vision.PoseLandmarkerResult,
                   output_image: mp.Image, timestamp_ms: int):
    """Handle pose detection results from MediaPipe."""
    try:
        # Convert MediaPipe timestamp to ROS time
        ros_timestamp = self._convert_timestamp(timestamp_ms)

        # Process pose landmarks
        pose_msg = PoseLandmarks()
        pose_msg.header.stamp = ros_timestamp
        pose_msg.header.frame_id = 'camera_frame'

        if result.pose_landmarks:
            pose_msg.num_poses = len(result.pose_landmarks)

            # Handle MediaPipe's pose landmark structure variations
            for pose_landmarks in result.pose_landmarks:
                try:
                    # MediaPipe structure can vary between versions
                    if hasattr(pose_landmarks, '__iter__') and not hasattr(pose_landmarks, 'landmark'):
                        landmarks = pose_landmarks  # Direct list access
                    else:
                        landmarks = pose_landmarks.landmark  # Attribute access

                    for landmark in landmarks:
                        point = Point()
                        point.x = landmark.x
                        point.y = landmark.y
                        point.z = landmark.z
                        pose_msg.landmarks.append(point)
                except Exception as e:
                    self.logger.warn(f'Pose landmark processing error: {e}')
                    continue

        # Publish results
        self.pose_publisher.publish(pose_msg)

    except Exception as e:
        self.logger.error(f'Pose callback error: {e}')

Performance Reality Check: 3-7 FPS on Pi 5
Let me be honest about performance – this isn't going to run at 30 FPS on a Raspberry Pi 5. Through extensive testing, I measured:
- Actual Performance: 3-7 FPS @ 640x480 resolution
- Processing Time: ~150-300ms per frame
- Memory Usage: ~200MB additional RAM
- CPU Load: ~40-60% of one core during active processing
But here's the thing – for robotics applications, this is actually sufficient. Human movement is slow relative to even a few frames per second of processing. A robot doesn't need to track every micro-movement; it needs to understand general pose, direction, and intent.
The key insight was optimizing for stability over speed. I'd rather have consistent 5 FPS processing than erratic 15 FPS with dropped frames and errors.
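One way to get that stability is to drop frames before they reach MediaPipe rather than letting them queue up behind the landmarker. A minimal sketch of timestamp-based throttling in front of detect_async(), assuming a plain OpenCV capture loop (the 200 ms budget is illustrative, not a measured optimum):

import time

import cv2
import mediapipe as mp

TARGET_PERIOD_MS = 200  # roughly 5 FPS; an assumed budget, tune for your hardware

def run_camera_loop(landmarker, camera_index=0):
    """Feed frames to a LIVE_STREAM PoseLandmarker no faster than the target rate."""
    cap = cv2.VideoCapture(camera_index)
    last_sent_ms = 0
    while cap.isOpened():
        ok, frame_bgr = cap.read()
        if not ok:
            break
        now_ms = int(time.monotonic() * 1000)
        if now_ms - last_sent_ms < TARGET_PERIOD_MS:
            continue  # drop this frame instead of queuing it
        last_sent_ms = now_ms
        # MediaPipe expects RGB; LIVE_STREAM timestamps must increase monotonically
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame_rgb)
        landmarker.detect_async(mp_image, now_ms)
    cap.release()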
Debugging MediaPipe Integration Challenges
The most frustrating part of this project was dealing with MediaPipe's pose landmark data structures. Different versions of MediaPipe return pose landmarks in slightly different formats, and the documentation doesn't clearly explain the variations.
I spent hours debugging errors like:
AttributeError: 'list' object has no attribute 'landmark'

The solution was implementing robust structure detection:
# Handle both possible MediaPipe pose landmark structures
if hasattr(pose_landmarks, '__iter__') and not hasattr(pose_landmarks, 'landmark'):
    # pose_landmarks is already a list of landmarks (newer format)
    landmarks = pose_landmarks
else:
    # pose_landmarks has .landmark attribute (older format)
    landmarks = pose_landmarks.landmark

Visualization follows the same defensive pattern and is entirely optional: the annotated image is only built and published when the publisher exists, so the node runs happily headless:
if self.annotated_image_publisher and result.pose_landmarks:
    annotated_frame = self._create_annotated_image(output_image.numpy_view(), result)
    annotated_msg = self.cv_bridge.cv2_to_imgmsg(annotated_frame, encoding='bgr8')
    annotated_msg.header.stamp = self.get_clock().now().to_msg()
    annotated_msg.header.frame_id = 'camera_frame'
    self.annotated_image_publisher.publish(annotated_msg)

I can develop and debug with full visualization, then deploy the same code headless on the robot.
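The _create_annotated_image() helper isn't shown above; a sketch of one way to implement it with MediaPipe's bundled drawing utilities, assuming the Tasks-API result where each pose is a plain list of landmarks:

import cv2
import mediapipe as mp
from mediapipe.framework.formats import landmark_pb2

def create_annotated_image(rgb_frame, result):
    """Draw every detected pose skeleton onto a copy of the frame (returned as BGR)."""
    annotated = rgb_frame.copy()
    for pose_landmarks in result.pose_landmarks:
        # Wrap the Tasks-API landmarks in the proto type draw_landmarks() expects
        proto = landmark_pb2.NormalizedLandmarkList()
        proto.landmark.extend(
            landmark_pb2.NormalizedLandmark(x=lm.x, y=lm.y, z=lm.z) for lm in pose_landmarks
        )
        mp.solutions.drawing_utils.draw_landmarks(
            annotated,
            proto,
            mp.solutions.pose.POSE_CONNECTIONS,
            mp.solutions.drawing_styles.get_default_pose_landmarks_style(),
        )
    # The node publishes bgr8, so convert from MediaPipe's RGB view
    return cv2.cvtColor(annotated, cv2.COLOR_RGB2BGR)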
Multi-Modal Vision Integration
The real power comes from combining pose detection with other vision modalities. My unified image viewer can display multiple vision streams simultaneously:
# View all vision systems together
ros2 launch gesturebot image_viewer.launch.py \
image_topics:='["/vision/objects/annotated", "/vision/gestures/annotated", "/vision/pose/annotated"]' \
topic_window_names:='{"\/vision\/objects\/annotated": "Objects", "\/vision\/gestures\/annotated": "Gestures", "\/vision\/pose\/annotated": "Poses"}'This gives me object detection (what's there), gesture recognition (what they're doing), and pose detection (how they're positioned) all in one integrated system.
Practical Robotics Applications
Human-Aware Navigation
Instead of treating humans as static obstacles, my robot can now:
- Predict movement direction from body orientation (see the sketch after this list)
- Maintain appropriate social distances based on pose
- Recognize when someone is trying to interact vs. just passing by
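Body orientation comes almost for free from the shoulder landmarks. A rough heuristic of my own (not code from the project): combine the shoulders' horizontal spread with MediaPipe's relative depth to estimate a torso yaw, which is a usable proxy for where a walking person is headed.

import math

LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12

def estimate_torso_yaw(landmarks):
    """Rough torso yaw from one pose's 33 landmarks.

    Roughly 0 when the person squarely faces the camera and near +/-pi when they
    face away, changing continuously as they turn. MediaPipe's z is only a
    relative depth (smaller = closer to the camera), so treat this as a coarse
    heuristic rather than a calibrated angle.
    """
    left, right = landmarks[LEFT_SHOULDER], landmarks[RIGHT_SHOULDER]
    dx = left.x - right.x   # facing the camera puts the left shoulder at larger image x
    dz = left.z - right.z   # which shoulder sits closer to the camera
    return math.atan2(dz, dx)

Feeding even a coarse yaw like this into the planner lets the robot bias its predicted human trajectory along the torso heading instead of treating the person as a static obstacle.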
Gesture Command Integration
Pose data enhances gesture recognition by providing context:
- A pointing gesture combined with body orientation gives directional commands (sketched after this list)
- Raised arms with forward-leaning pose might indicate "stop" vs. "hello"
- Multiple people can give conflicting gestures – pose helps determine who to follow
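As an example of that kind of contextual check, here is a rough heuristic of my own for turning a pointing arm into a coarse left/right command; the landmark indices are MediaPipe's, the extension threshold is arbitrary:

LEFT_ELBOW, LEFT_WRIST = 13, 15
RIGHT_ELBOW, RIGHT_WRIST = 14, 16

def pointing_direction(landmarks, min_extension=0.08):
    """Classify a pointing arm as 'left', 'right', or None (no clear point).

    Takes whichever forearm (elbow -> wrist) is more extended horizontally in
    image space and uses its sign. Coordinates are normalized, so the threshold
    is a fraction of image width.
    """
    best = None
    for elbow_i, wrist_i in ((LEFT_ELBOW, LEFT_WRIST), (RIGHT_ELBOW, RIGHT_WRIST)):
        dx = landmarks[wrist_i].x - landmarks[elbow_i].x
        if abs(dx) >= min_extension and (best is None or abs(dx) > abs(best)):
            best = dx
    if best is None:
        return None
    # Directions are in image/camera terms; map them into the robot or person
    # frame according to your camera mounting.
    return 'right' if best > 0 else 'left'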
Safety and Interaction
The 33-point skeleton provides rich safety information:
- Detect if someone has fallen (unusual pose angles; see the sketch after this list)
- Recognize aggressive vs. friendly body language
- Identify when someone is carrying objects that might affect navigation
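Fall detection is a good example of how far simple geometry goes. A heuristic sketch of my own, not the project's code: compare the shoulder and hip midpoints and flag poses whose torso line is more horizontal than vertical in the image.

LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12
LEFT_HIP, RIGHT_HIP = 23, 24

def torso_looks_horizontal(landmarks, ratio=1.5):
    """Flag poses whose shoulder-to-hip line is more horizontal than vertical.

    Image y grows downward, so a standing torso has a large |dy| and small |dx|,
    while a person lying on the ground is the opposite. 'ratio' sets how decisive
    the horizontal spread must be before flagging.
    """
    mid_shoulder_x = (landmarks[LEFT_SHOULDER].x + landmarks[RIGHT_SHOULDER].x) / 2.0
    mid_shoulder_y = (landmarks[LEFT_SHOULDER].y + landmarks[RIGHT_SHOULDER].y) / 2.0
    mid_hip_x = (landmarks[LEFT_HIP].x + landmarks[RIGHT_HIP].x) / 2.0
    mid_hip_y = (landmarks[LEFT_HIP].y + landmarks[RIGHT_HIP].y) / 2.0
    dx = abs(mid_shoulder_x - mid_hip_x)
    dy = abs(mid_shoulder_y - mid_hip_y)
    return dx > ratio * dy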