When developing the GestureBot vision system, I encountered a common challenge in robotics computer vision: balancing performance with visualization flexibility. While MediaPipe provides excellent object detection capabilities, its built-in annotation system proved limiting for our specific visualization requirements. This post details how I implemented a custom manual annotation system using OpenCV primitives while maintaining MediaPipe's high-performance LIVE_STREAM processing mode.
Problem Statement: Why Move Beyond MediaPipe's Built-in Annotations
MediaPipe's object detection framework excels at inference performance, but its visualization capabilities presented several limitations for our robotics application:
MediaPipe Annotation Limitations
- Limited customization: Fixed annotation styles with minimal configuration options
- Inconsistent output: LIVE_STREAM mode doesn't always provide reliable output_image results
- Performance overhead: Built-in annotations add processing latency in the inference pipeline
- Inflexible styling: No control over color schemes, font sizes, or confidence display formats
Our Requirements
For GestureBot's vision system, I needed:
- Color-coded confidence levels for quick visual assessment
- Percentage-based confidence display for precise evaluation
- Consistent annotation rendering regardless of detection confidence
- Minimal performance impact on the real-time processing pipeline
- Full control over visual styling to match our robotics interface
Technical Implementation: Manual Annotation Architecture
The solution involved decoupling MediaPipe's inference engine from the visualization layer, creating a custom annotation system that operates on the original RGB frames.
System Architecture
# High-level flow
RGB Frame → MediaPipe Detection (LIVE_STREAM) → Manual Annotation → ROS Publishing

The key insight was to preserve MediaPipe's asynchronous detect_async() processing while applying custom annotations to the original RGB frames, rather than relying on MediaPipe's output_image.
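To make the decoupling concrete, here is a minimal sketch of how a LIVE_STREAM detector can be wired so the original RGB frame stays available for annotation. The class name, model path, and frame-bookkeeping details are illustrative assumptions, not the exact GestureBot implementation.

import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

class LiveStreamDetector:
    """Sketch: keep the original RGB frame per timestamp so annotations are
    drawn on it instead of MediaPipe's output_image."""

    def __init__(self, model_path: str = 'efficientdet_lite0.tflite'):
        options = vision.ObjectDetectorOptions(
            base_options=mp_tasks.BaseOptions(model_asset_path=model_path),
            running_mode=vision.RunningMode.LIVE_STREAM,
            score_threshold=0.3,
            result_callback=self._on_result,  # invoked asynchronously per frame
        )
        self.detector = vision.ObjectDetector.create_from_options(options)
        self.pending_frames = {}  # timestamp_ms -> original RGB frame

    def submit(self, rgb_frame, timestamp_ms: int) -> None:
        # Stash the original frame; detect_async() returns immediately
        self.pending_frames[timestamp_ms] = rgb_frame
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)
        self.detector.detect_async(mp_image, timestamp_ms)

    def _on_result(self, result, output_image, timestamp_ms: int) -> None:
        rgb_frame = self.pending_frames.pop(timestamp_ms, None)
        if rgb_frame is None:
            return
        # Hand the original frame + detections to the annotation/publish path
        self.latest_result = (rgb_frame, result.detections, timestamp_ms)

The important detail is that the callback pairs detections with the frame that produced them via the timestamp, so the annotation step never depends on output_image being populated.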
Core Implementation: Manual Annotation Method
def draw_manual_annotations(self, image: np.ndarray, detections) -> np.ndarray:
    """
    Manually draw bounding boxes, labels, and confidence scores using OpenCV.

    Args:
        image: RGB image array (H, W, 3)
        detections: MediaPipe detection results

    Returns:
        Annotated RGB image array
    """
    if not detections:
        return image.copy()

    annotated_image = image.copy()
    height, width = image.shape[:2]

    for detection in detections:
        # Get bounding box coordinates
        bbox = detection.bounding_box
        x_min = int(bbox.origin_x)
        y_min = int(bbox.origin_y)
        x_max = int(bbox.origin_x + bbox.width)
        y_max = int(bbox.origin_y + bbox.height)

        # Ensure coordinates are within image bounds
        x_min = max(0, min(x_min, width - 1))
        y_min = max(0, min(y_min, height - 1))
        x_max = max(0, min(x_max, width - 1))
        y_max = max(0, min(y_max, height - 1))

        # Get the best category (highest confidence); skip detections without one
        if not detection.categories:
            continue
        best_category = max(detection.categories, key=lambda c: c.score if c.score else 0)
        class_name = best_category.category_name or 'unknown'
        confidence = best_category.score or 0.0

        # Color-coded boxes based on confidence levels
        if confidence >= 0.7:
            color = (0, 255, 0)    # Green for high confidence (RGB)
        elif confidence >= 0.5:
            color = (255, 255, 0)  # Yellow for medium confidence (RGB)
        else:
            color = (255, 0, 0)    # Red for low confidence (RGB)

        # Draw bounding box rectangle
        cv2.rectangle(annotated_image, (x_min, y_min), (x_max, y_max), color, 2)

        # Prepare label text with confidence percentage
        confidence_percent = int(confidence * 100)
        label = f"{class_name}: {confidence_percent}%"

        # Calculate text size for background rectangle
        font = cv2.FONT_HERSHEY_SIMPLEX
        font_scale = 0.6
        thickness = 2
        (text_width, text_height), baseline = cv2.getTextSize(label, font, font_scale, thickness)

        # Position text above bounding box, or below if not enough space
        text_x = x_min
        text_y = y_min - 10 if y_min - 10 > text_height else y_max + text_height + 10

        # Draw background rectangle for text (filled)
        cv2.rectangle(
            annotated_image,
            (text_x, text_y - text_height - baseline),
            (text_x + text_width, text_y + baseline),
            color,
            -1  # Filled rectangle
        )

        # Draw text label in black for good contrast
        cv2.putText(
            annotated_image,
            label,
            (text_x, text_y),
            font,
            font_scale,
            (0, 0, 0),  # Black text (RGB)
            thickness,
            cv2.LINE_AA
        )

    return annotated_image

Integration with MediaPipe Pipeline
The manual annotation system integrates seamlessly with MediaPipe's asynchronous processing:
def process_frame(self, frame: np.ndarray, timestamp: float) -> Optional[Dict]:
    """Process frame for object detection."""
    # ... MediaPipe processing ...

    if detections:
        result_dict = {
            'detections': detections,
            'timestamp': timestamp,
            'processing_time': (time.time() - timestamp) * 1000,
            'rgb_frame': rgb_frame  # Include original RGB frame for manual annotation
        }
        return result_dict

def publish_results(self, results: Dict, timestamp: float) -> None:
    """Publish object detection results and optionally annotated images."""
    # ... detection publishing ...

    # Extract the original RGB frame and detections for manual annotation
    rgb_frame = results['rgb_frame']
    detections = results['detections']

    # Apply manual annotations using OpenCV drawing primitives
    annotated_rgb = self.draw_manual_annotations(rgb_frame, detections)

    # Convert RGB to BGR for ROS publishing
    cv_image = cv2.cvtColor(annotated_rgb, cv2.COLOR_RGB2BGR)

Visual Features: Color-Coded Confidence System
The manual annotation system implements a three-tier confidence visualization scheme designed for rapid assessment in robotics applications:
Confidence Color Mapping
- 🟢 Green (≥70%): High confidence detections suitable for autonomous decision-making
- 🟡 Yellow (≥50%): Medium confidence detections requiring validation
- 🔴 Red (<50%): Low confidence detections for debugging purposes
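The thresholds above live inline in draw_manual_annotations(); if the same tiers are needed elsewhere in the stack (for example, gating which detections a behavior is allowed to act on), they could be factored into a small helper like the sketch below. The function name is mine, not part of the existing code.

def confidence_color(score: float) -> tuple:
    """Map a detection score to an RGB box color using the three-tier scheme."""
    if score >= 0.7:
        return (0, 255, 0)    # green: high confidence
    if score >= 0.5:
        return (255, 255, 0)  # yellow: medium confidence
    return (255, 0, 0)        # red: low confidence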
Typography and Positioning
- Font: cv2.FONT_HERSHEY_SIMPLEX for clear readability
- Background: Filled rectangles matching bounding box colors for contrast
- Text Color: Black for optimal contrast against colored backgrounds
- Positioning: Adaptive placement above or below bounding boxes based on available space
Percentage Display Format
Confidence scores are displayed as integers (e.g., "person: 76%") rather than decimals, providing immediate visual feedback without cognitive overhead during real-time operation.
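As a final step, the BGR frame produced in publish_results() goes out as a ROS image message. Below is a minimal sketch of that last hop using cv_bridge in a ROS 2 node; the topic name and attribute names are assumptions for illustration, not the exact GestureBot code.

from cv_bridge import CvBridge
from sensor_msgs.msg import Image

# In the node's __init__ (illustrative names):
#   self.bridge = CvBridge()
#   self.annotated_pub = self.create_publisher(Image, '/vision/annotated_image', 10)

def publish_annotated(self, cv_image, header) -> None:
    """Sketch: wrap the annotated BGR frame in a sensor_msgs/Image and publish it."""
    msg = self.bridge.cv2_to_imgmsg(cv_image, encoding='bgr8')
    msg.header = header  # reuse the source frame's timestamp and frame_id
    self.annotated_pub.publish(msg)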