Building an Autonomous Person Following System with Computer Vision

A project log for GestureBot - Computer Vision Robot

A mobile robot that responds to human gestures and facial expressions using real-time pose estimation and gesture recognition for intuitive HRI

Vipin M · 08/15/2025 at 14:47

When your robot becomes your shadow – implementing intelligent person following with object detection and ROS 2

Imagine a robot that follows you around like a loyal companion, maintaining the perfect distance whether you're walking through a warehouse, giving a facility tour, or need hands-free assistance. While gesture and pose control are great for direct commands, sometimes you want your robot to simply tag along autonomously. That's exactly what we've built into GestureBot – a standalone person following system that transforms any detected person into a moving target for smooth, intelligent pursuit.

The Appeal of Autonomous Following

Person following robots aren't just cool demos – they solve real problems. Consider a hospital robot carrying supplies that needs to follow a nurse through rounds, a security robot accompanying a guard on patrol, or a service robot helping someone navigate a large facility. In these scenarios, constant manual control becomes tedious and impractical.

The key insight is that following behavior should be completely autonomous once activated. No gestures, no poses, no commands – just intelligent tracking that maintains appropriate distance while handling the inevitable challenges of real-world environments: people walking behind obstacles, multiple individuals in the scene, varying lighting conditions, and the need for smooth, non-jerky motion that won't startle or annoy.

Leveraging Existing Object Detection Infrastructure

Rather than building a specialized person tracking system from scratch, we cleverly repurpose GestureBot's existing object detection capabilities. The system already runs MediaPipe's EfficientDet model at 5 FPS, detecting 80 different object classes including people with confidence scores and precise bounding boxes.

This architectural decision provides several advantages: proven stability, existing performance optimizations, and the ability to simultaneously track people and obstacles. The object detection system publishes to /vision/objects, providing a stream of detected people that our following controller can consume.

# Object detection provides person detections like this:
DetectedObject {
    class_name: "person"
    confidence: 0.76
    bbox_x: 145        # Top-left corner
    bbox_y: 89
    bbox_width: 312    # Bounding box dimensions
    bbox_height: 387
}

The person following controller subscribes to this stream and implements the logic to select, track, and follow the most appropriate person in the scene.

Smart Person Selection: More Than Just "Pick the Biggest"

When multiple people appear in the camera view, the system needs to intelligently choose who to follow. Our selection algorithm uses a weighted scoring system that considers three key factors:

Size Score (40% weight): Larger bounding boxes typically indicate closer people or those more prominently positioned in the scene. This naturally biases toward the person most likely intended as the target.

Center Score (30% weight): People closer to the image center are preferred, following the reasonable assumption that users position themselves centrally when activating following mode.

Confidence Score (30% weight): Higher detection confidence indicates more reliable tracking, reducing the chance of following false positives or poorly detected individuals.

def select_initial_target(self, people):
    scored_people = []
    for person in people:
        # Normalize bounding box to 0-1 coordinates
        size_score = (person.bbox_width * person.bbox_height) / (640 * 480)
        center_x = (person.bbox_x + person.bbox_width/2) / 640
        center_score = 1.0 - abs(center_x - 0.5) * 2
        
        total_score = (size_score * 0.4 + 
                      center_score * 0.3 + 
                      person.confidence * 0.3)
        scored_people.append((person, total_score))
    
    return max(scored_people, key=lambda x: x[1])[0]

Once a target is selected, the system maintains tracking continuity by matching people across frames based on position prediction, preventing erratic switching between similar individuals.
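As an illustration of that continuity logic, here is a minimal sketch that matches the current detections against the target's last known position. The helper name, the match_threshold value, and the use of plain nearest-position matching (rather than full motion prediction) are simplifying assumptions, not the exact GestureBot code.

def match_target(self, people, last_center, match_threshold=0.15):
    """Pick the detection closest to the target's last known (normalized) center."""
    best_person, best_dist = None, float('inf')
    for person in people:
        center_x = (person.bbox_x + person.bbox_width / 2) / 640
        center_y = (person.bbox_y + person.bbox_height / 2) / 480
        # Euclidean distance from the last known normalized position
        dist = ((center_x - last_center[0]) ** 2 +
                (center_y - last_center[1]) ** 2) ** 0.5
        if dist < best_dist:
            best_person, best_dist = person, dist
    # Only accept the match if it stayed close to the previous position;
    # otherwise treat the target as temporarily lost rather than switching
    return best_person if best_dist < match_threshold else None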

Distance Estimation: Computer Vision Meets Physics

Estimating distance from a single camera image is a classic computer vision challenge. Without stereo vision or depth sensors, we rely on the relationship between object size and distance – larger bounding boxes generally indicate closer people.

After extensive calibration with real-world measurements, we developed an empirical mapping between normalized bounding box area and estimated distance:

def estimate_distance(self, person):
    # Convert pixel dimensions to normalized area (fraction of the 640x480 frame)
    bbox_area = (person.bbox_width * person.bbox_height) / (640 * 480)
    
    # Empirical distance mapping calibrated for average person height
    if bbox_area > 0.55:    # 55%+ of image = very close
        return 1.0          # ~1 meter
    elif bbox_area > 0.35:  # 35-55% = near target distance  
        return 1.5          # ~1.5 meters (target)
    elif bbox_area > 0.20:  # 20-35% = medium distance
        return 2.2          # ~2.2 meters
    # ... additional ranges up to 7 meters
This approach works surprisingly well for typical indoor environments, though it assumes average human height and doesn't account for unusual poses. To improve stability, we apply a 3-point moving average that smooths out frame-to-frame variations in bounding box size.
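A minimal sketch of that 3-point moving average might look like the following; the class name and where it plugs into the pipeline are assumptions for illustration.

from collections import deque

class DistanceSmoother:
    def __init__(self, window=3):
        # Keep only the most recent `window` distance estimates
        self.samples = deque(maxlen=window)

    def update(self, raw_distance):
        self.samples.append(raw_distance)
        # Averaging the last few estimates smooths bbox jitter between frames
        return sum(self.samples) / len(self.samples)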

Following Behavior: The Art of Robotic Companionship

The core following logic implements two simultaneous control loops: distance maintenance and person centering. The distance controller calculates the error between current estimated distance and the target distance (1.5 meters), applying proportional control to generate forward/backward velocity commands.

The centering controller keeps the person positioned in the center of the camera view by calculating the horizontal offset and generating appropriate angular velocity commands. This dual-axis control creates natural following behavior that maintains both proper distance and orientation.

def calculate_following_command(self, person, estimated_distance):
    # Distance control (linear velocity)
    distance_error = estimated_distance - self.target_distance  # 1.5m target
    linear_velocity = distance_error * 0.8  # Control gain
    linear_velocity = max(-0.3, min(0.3, linear_velocity))  # Clamp limits
    
    # Centering control (angular velocity)  
    center_x = (person.bbox_x + person.bbox_width/2) / 640
    center_error = center_x - 0.5  # 0.5 = image center
    angular_velocity = -center_error * 1.5  # Control gain
    angular_velocity = max(-0.8, min(0.8, angular_velocity))  # Clamp limits
    
    return {'linear_x': linear_velocity, 'angular_z': angular_velocity} 

Smooth Motion Through Advanced Control

Raw velocity commands would create jerky, uncomfortable robot motion. Our solution implements a sophisticated velocity smoothing system running at 25 Hz – much faster than the 10 Hz control calculation rate. This high-frequency loop applies acceleration limiting to gradually ramp velocities up and down.

The acceleration limits (1.0 m/s² linear, 2.0 rad/s² angular) are carefully tuned for the Raspberry Pi 5's processing capabilities while ensuring smooth motion. A typical acceleration from rest to 0.3 m/s takes about 0.3 seconds over 7-8 control cycles, creating natural-feeling motion that won't startle users or cause mechanical stress.

Critical to the system's success is preventing rapid target switching. The control system holds velocity commands for a minimum of 500ms, preventing oscillations that would occur if the robot constantly changed direction based on minor detection variations.

Safety First: Multiple Protection Layers

Autonomous following requires robust safety systems. Our implementation includes several protection mechanisms:

Distance Limits: The robot won't approach closer than 0.8 meters (safety zone) and stops following if the person exceeds 5.0 meters (lost target). These limits prevent uncomfortable crowding and endless pursuit of distant figures.

Timeout Protection: If no person is detected for 3 seconds, the system automatically deactivates following mode and stops the robot. This handles cases where the target leaves the camera view or detection fails.

Emergency Override: The system monitors an /emergency_stop topic, immediately halting motion if any other system component detects a problem.

Backward Motion Safety: When a person gets too close, the robot smoothly backs away rather than stopping abruptly, maintaining comfortable interpersonal distance.
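Putting those layers together, a sketch of a safety gate applied to each outgoing command could look like this. The thresholds are the ones quoted above; the function structure and state names are assumptions.

def apply_safety_checks(self, cmd, estimated_distance, time_since_detection):
    # Emergency stop overrides everything
    if self.emergency_stop:
        return {'linear_x': 0.0, 'angular_z': 0.0}
    # Lost-target timeout: stop and deactivate after 3 s without a detection
    if time_since_detection > 3.0:
        self.following_active = False
        return {'linear_x': 0.0, 'angular_z': 0.0}
    # Target too far away (> 5.0 m): stop following rather than chase
    if estimated_distance > 5.0:
        return {'linear_x': 0.0, 'angular_z': 0.0}
    # Inside the 0.8 m safety zone: block forward motion but allow backing away
    if estimated_distance < 0.8:
        cmd['linear_x'] = min(cmd['linear_x'], 0.0)
    return cmd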

ROS 2 Architecture: Clean and Modular

The person following system integrates seamlessly with ROS 2's publish-subscribe architecture. The person_following_controller node subscribes to /vision/objects from the existing object detection system and publishes velocity commands to /cmd_vel for the robot base.
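A minimal rclpy skeleton of such a node might look like the following. The DetectedObjectArray message type and its objects field are placeholders for whatever GestureBot actually publishes on /vision/objects; only the topic names and node role come from the text above.

import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist
# Placeholder: the real detection message type is project-specific
from gesturebot_msgs.msg import DetectedObjectArray

class PersonFollowingController(Node):
    def __init__(self):
        super().__init__('person_following_controller')
        # Consume person detections from the existing object detection node
        self.sub = self.create_subscription(
            DetectedObjectArray, '/vision/objects', self.on_detections, 10)
        # Publish velocity commands to the robot base
        self.cmd_pub = self.create_publisher(Twist, '/cmd_vel', 10)

    def on_detections(self, msg):
        people = [d for d in msg.objects if d.class_name == 'person']
        if not people:
            return
        # ... select target, estimate distance, compute and publish a Twist

def main():
    rclpy.init()
    rclpy.spin(PersonFollowingController())

if __name__ == '__main__':
    main()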

Activation and deactivation happen through a simple service interface:

# Activate person following mode
ros2 service call /follow_mode/activate std_srvs/srv/SetBool "data: true"

# Deactivate following mode  
ros2 service call /follow_mode/activate std_srvs/srv/SetBool "data: false"

# Monitor following behavior
ros2 topic echo /cmd_vel

This modular design means the following system can work with any robot platform that accepts standard ROS 2 velocity commands, making it broadly applicable beyond the original GestureBot hardware.

Real-World Performance: Lessons from Testing

In practice, the system performs remarkably well in typical indoor environments. Response time from person detection to robot motion is under 200ms, creating immediate feedback that feels natural. The robot maintains the target 1.5-meter distance with ±0.3-meter accuracy under normal conditions.

The biggest challenges come from environmental factors: people walking behind furniture create temporary occlusions that the system handles by maintaining last-known velocity until the person reappears. Multiple people in the scene occasionally cause target switching, though our scoring algorithm minimizes this issue.

Lighting variations affect object detection confidence but rarely cause complete tracking failure. The system gracefully degrades by reducing following speed when confidence drops, rather than stopping entirely.
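A simple sketch of that confidence-based speed scaling; the thresholds and the linear ramp are illustrative, not the project's tuned values.

def scale_for_confidence(self, cmd, confidence, min_conf=0.4, full_conf=0.7):
    """Reduce speed as detection confidence drops instead of stopping outright."""
    if confidence >= full_conf:
        return cmd
    # Scale linearly between min_conf (stop) and full_conf (full speed)
    factor = max(0.0, (confidence - min_conf) / (full_conf - min_conf))
    cmd['linear_x'] *= factor
    cmd['angular_z'] *= factor
    return cmd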

Hardware Performance: Pi 5 Delivers

Running on a Raspberry Pi 5 with 8GB RAM, the complete system (object detection + person following + ROS 2 navigation) consumes approximately 70-75% CPU at steady state. This leaves headroom for additional processing while maintaining stable performance.

The Pi Camera Module 3 provides sufficient image quality for reliable person detection at 640x480 resolution. Higher resolutions improve detection accuracy but reduce frame rate – the current configuration strikes an optimal balance for real-time following behavior.

Power consumption remains reasonable at ~8W total system power, making battery operation feasible for mobile applications.

Comparison with Other Control Methods

Person following complements rather than replaces gesture and pose control. Each method has its ideal use cases:

Gesture Control: Best for precise, intentional commands when the user wants direct robot control

Pose Control: Ideal for hands-free operation when gestures aren't practical

Person Following: Perfect for autonomous companionship when continuous manual control would be tedious

The beauty of the modular architecture is that users can switch between modes as needed, or even combine them – imagine a robot that follows you autonomously but responds to gesture overrides for specific actions.

Future Directions: Beyond Basic Following

The current implementation provides a solid foundation for more sophisticated behaviors. Future enhancements could include:

Outdoor Operation: Adapting distance estimation for varying lighting and longer ranges

Multi-Person Scenarios: Following groups or switching between designated individuals

Predictive Tracking: Using motion prediction to handle temporary occlusions more gracefully

Sensor Fusion: Integrating lidar or depth cameras for more accurate distance measurement

Social Awareness: Adjusting following distance based on environmental context and social norms
