
Manual Object Detection Annotations in GestureBot

A project log for GestureBot - Computer Vision Robot

A mobile robot that responds to human gestures and facial expressions using real-time pose estimation, gesture recognition, and intuitive HRI

Vipin M • 08/11/2025 at 05:04 • 0 Comments

When developing the GestureBot vision system, I encountered a common challenge in robotics computer vision: balancing performance with visualization flexibility. While MediaPipe provides excellent object detection capabilities, its built-in annotation system proved limiting for our specific visualization requirements. This post details how I implemented a custom manual annotation system using OpenCV primitives while maintaining MediaPipe's high-performance LIVE_STREAM processing mode.

Problem Statement: Why Move Beyond MediaPipe's Built-in Annotations

MediaPipe's object detection framework excels at inference performance, but its visualization capabilities presented several limitations for our robotics application:

MediaPipe Annotation Limitations

- Annotations are drawn onto MediaPipe's output_image rather than the original RGB frame that the rest of the pipeline works with
- Limited control over bounding box colors, label text, and how confidence scores are rendered

Our Requirements

For GestureBot's vision system, I needed:

- Custom annotations drawn directly on the original RGB frames
- Color-coded bounding boxes that convey detection confidence at a glance
- Readable class labels with confidence shown as a percentage
- No impact on MediaPipe's asynchronous LIVE_STREAM inference performance
- Annotated frames ready for conversion to BGR and publishing as ROS image messages

Technical Implementation: Manual Annotation Architecture

The solution involved decoupling MediaPipe's inference engine from the visualization layer, creating a custom annotation system that operates on the original RGB frames.

System Architecture

# High-level flow
RGB Frame → MediaPipe Detection (LIVE_STREAM) → Manual Annotation → ROS Publishing

The key insight was to preserve MediaPipe's asynchronous detect_async() processing while applying custom annotations to the original RGB frames, rather than relying on MediaPipe's output_image.
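For context, here is a minimal sketch of how a LIVE_STREAM detector is typically wired up with MediaPipe's Tasks API. The model path, thresholds, and callback name below are illustrative assumptions, not GestureBot's exact configuration.

# Sketch of a LIVE_STREAM object detector setup (values are illustrative)
import time
import numpy as np
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision as mp_vision

def on_detection_result(result, output_image, timestamp_ms):
    # result.detections is what feeds draw_manual_annotations();
    # output_image is ignored in favor of the original RGB frame
    pass

options = mp_vision.ObjectDetectorOptions(
    base_options=mp_python.BaseOptions(model_asset_path='efficientdet_lite0.tflite'),
    running_mode=mp_vision.RunningMode.LIVE_STREAM,
    score_threshold=0.3,
    max_results=5,
    result_callback=on_detection_result,
)
detector = mp_vision.ObjectDetector.create_from_options(options)

# Per frame: wrap the RGB array and hand it to the asynchronous detector
rgb_frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a camera frame
detector.detect_async(
    mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame),
    int(time.time() * 1000),
)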

Core Implementation: Manual Annotation Method

def draw_manual_annotations(self, image: np.ndarray, detections) -> np.ndarray:
    """
    Manually draw bounding boxes, labels, and confidence scores using OpenCV.
    
    Args:
        image: RGB image array (H, W, 3)
        detections: MediaPipe detection results
        
    Returns:
        Annotated RGB image array
    """
    if not detections:
        return image.copy()
        
    annotated_image = image.copy()
    height, width = image.shape[:2]
    
    for detection in detections:
        # Get bounding box coordinates
        bbox = detection.bounding_box
        x_min = int(bbox.origin_x)
        y_min = int(bbox.origin_y)
        x_max = int(bbox.origin_x + bbox.width)
        y_max = int(bbox.origin_y + bbox.height)
        
        # Ensure coordinates are within image bounds
        x_min = max(0, min(x_min, width - 1))
        y_min = max(0, min(y_min, height - 1))
        x_max = max(0, min(x_max, width - 1))
        y_max = max(0, min(y_max, height - 1))
        
        # Get the best category (highest confidence)
        if detection.categories:
            best_category = max(detection.categories, key=lambda c: c.score if c.score else 0)
            class_name = best_category.category_name or 'unknown'
            confidence = best_category.score or 0.0
            
            # Color-coded boxes based on confidence levels
            if confidence >= 0.7:
                color = (0, 255, 0)  # Green for high confidence (RGB)
            elif confidence >= 0.5:
                color = (255, 255, 0)  # Yellow for medium confidence (RGB)
            else:
                color = (255, 0, 0)  # Red for low confidence (RGB)
            
            # Draw bounding box rectangle
            cv2.rectangle(annotated_image, (x_min, y_min), (x_max, y_max), color, 2)
            
            # Prepare label text with confidence percentage
            confidence_percent = int(confidence * 100)
            label = f"{class_name}: {confidence_percent}%"
            
            # Calculate text size for background rectangle
            font = cv2.FONT_HERSHEY_SIMPLEX
            font_scale = 0.6
            thickness = 2
            (text_width, text_height), baseline = cv2.getTextSize(label, font, font_scale, thickness)
            
            # Position text above bounding box, or below if not enough space
            text_x = x_min
            text_y = y_min - 10 if y_min - 10 > text_height else y_max + text_height + 10
            
            # Draw background rectangle for text (filled)
            cv2.rectangle(
                annotated_image,
                (text_x, text_y - text_height - baseline),
                (text_x + text_width, text_y + baseline),
                color,
                -1  # Filled rectangle
            )
            
            # Draw text label in black for good contrast
            cv2.putText(
                annotated_image,
                label,
                (text_x, text_y),
                font,
                font_scale,
                (0, 0, 0),  # Black text (RGB)
                thickness,
                cv2.LINE_AA
            )
    
    return annotated_image 
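To sanity-check the drawing logic independently of the camera and model, the method can be exercised with stand-in objects that mimic MediaPipe's detection structure. The SimpleNamespace stand-ins and the node variable below are purely illustrative; in the real pipeline the detections come straight from the detector callback.

# Quick standalone test of draw_manual_annotations (stand-ins are illustrative)
import numpy as np
from types import SimpleNamespace

fake_detection = SimpleNamespace(
    bounding_box=SimpleNamespace(origin_x=120, origin_y=80, width=200, height=240),
    categories=[SimpleNamespace(category_name='person', score=0.76)],
)

blank_frame = np.zeros((480, 640, 3), dtype=np.uint8)  # black RGB test frame
# node stands in for whatever object exposes draw_manual_annotations()
annotated = node.draw_manual_annotations(blank_frame, [fake_detection])
# annotated now shows a green box labeled "person: 76%"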

Integration with MediaPipe Pipeline

The manual annotation system integrates seamlessly with MediaPipe's asynchronous processing:

def process_frame(self, frame: np.ndarray, timestamp: float) -> Optional[Dict]:
    """Process frame for object detection."""
    # ... MediaPipe processing ...
    
    if detections:
        result_dict = {
            'detections': detections,
            'timestamp': timestamp,
            'processing_time': (time.time() - timestamp) * 1000,
            'rgb_frame': rgb_frame  # Include original RGB frame for manual annotation
        }
        return result_dict

def publish_results(self, results: Dict, timestamp: float) -> None:
    """Publish object detection results and optionally annotated images."""
    # ... detection publishing ...
    
    # Extract the original RGB frame and detections for manual annotation
    rgb_frame = results['rgb_frame']
    detections = results['detections']
    
    # Apply manual annotations using OpenCV drawing primitives
    annotated_rgb = self.draw_manual_annotations(rgb_frame, detections)
    
    # Convert RGB to BGR for ROS publishing
    cv_image = cv2.cvtColor(annotated_rgb, cv2.COLOR_RGB2BGR)
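The snippet stops at the color conversion; the remaining step is wrapping the BGR image in a ROS Image message and publishing it. Here is a minimal sketch of that step, assuming a ROS 2 rclpy node with a cv_bridge CvBridge instance (self.bridge) and an image publisher (self.annotated_image_pub) created in the node's constructor; both names are assumptions rather than GestureBot's actual attribute names.

    # Continuing publish_results(): wrap the annotated BGR frame in a
    # sensor_msgs/Image and publish it (self.bridge is a cv_bridge.CvBridge,
    # self.annotated_image_pub is an rclpy publisher; both are assumed here)
    image_msg = self.bridge.cv2_to_imgmsg(cv_image, encoding='bgr8')
    image_msg.header.stamp = self.get_clock().now().to_msg()
    image_msg.header.frame_id = 'camera_frame'  # placeholder frame id
    self.annotated_image_pub.publish(image_msg)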

Visual Features: Color-Coded Confidence System

The manual annotation system implements a three-tier confidence visualization scheme designed for rapid assessment in robotics applications:

Confidence Color Mapping

- Green (0, 255, 0): high confidence, score ≥ 0.7
- Yellow (255, 255, 0): medium confidence, 0.5 ≤ score < 0.7
- Red (255, 0, 0): low confidence, score < 0.5

Typography and Positioning

Labels are rendered with OpenCV's FONT_HERSHEY_SIMPLEX at a 0.6 scale, drawn in black over a filled background rectangle in the box color for contrast. Each label sits just above its bounding box and drops below the box when the detection touches the top of the frame.

Percentage Display Format

Confidence scores are displayed as integers (e.g., "person: 76%") rather than decimals, providing immediate visual feedback without cognitive overhead during real-time operation.
