Introduction
I recently completed a comprehensive performance optimization project for a ROS 2-based object detection pipeline using MediaPipe and OpenCV. The system processes camera frames for real-time object detection in robotics applications, but initial performance analysis revealed significant bottlenecks that were limiting throughput and consuming excessive CPU resources.
The object detection pipeline consists of three main stages:
- Preprocessing: Camera frame format conversion (BGR→RGB) for MediaPipe compatibility
- MediaPipe Inference: Object detection using TensorFlow Lite models
- Postprocessing: Result conversion and ROS message publishing, including optional annotated image generation
Through systematic measurement and targeted optimization, I achieved a 68.7% reduction in total pipeline processing time, from 8.65ms to 2.71ms per frame, while maintaining full functionality and improving system stability.
Baseline Performance Analysis
Measurement Infrastructure
Before implementing any optimizations, I established a comprehensive performance measurement system to ensure accurate, statistically reliable data collection. The measurement infrastructure includes:
PipelineTimer Class: High-precision timing using time.perf_counter() for microsecond-level accuracy:
```python
import time


class PipelineTimer:
    def __init__(self):
        self.stage_times = {}
        self.start_time = None

    def start_stage(self, stage_name: str):
        self.stage_times[stage_name] = time.perf_counter()

    def end_stage(self, stage_name: str) -> float:
        if stage_name in self.stage_times:
            duration = time.perf_counter() - self.stage_times[stage_name]
            return duration * 1000  # Convert to milliseconds
        return 0.0
```
PerformanceStats Class: Aggregates timing data over 5-second periods and publishes metrics to ROS topics:
```python
class PerformanceStats:
    def __init__(self):
        self.period_start_time = time.perf_counter()
        self.frames_processed = 0
        self.total_preprocessing_time = 0.0
        self.total_mediapipe_time = 0.0
        self.total_postprocessing_time = 0.0
        self.period_duration = 5.0  # seconds
```
Statistical Methodology: I used 30-second test periods with multiple measurement intervals to ensure statistical confidence. Each test collected 5-6 data points, allowing calculation of mean performance metrics and variance analysis.
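To show how these two helpers fit together, here is a minimal sketch of a per-frame callback that accumulates stage timings; the preprocessing, inference, and publishing helpers are illustrative placeholders, not the node's actual method names:
```python
# Illustrative wiring of PipelineTimer and PerformanceStats in a frame callback.
# preprocess(), detect(), and publish() are hypothetical stand-ins.
timer = PipelineTimer()
stats = PerformanceStats()

def on_frame(frame):
    timer.start_stage('preprocessing')
    rgb_frame = preprocess(frame)
    stats.total_preprocessing_time += timer.end_stage('preprocessing')

    timer.start_stage('mediapipe')
    results = detect(rgb_frame)
    stats.total_mediapipe_time += timer.end_stage('mediapipe')

    timer.start_stage('postprocessing')
    publish(results)
    stats.total_postprocessing_time += timer.end_stage('postprocessing')

    stats.frames_processed += 1
```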
Baseline Performance Results
Using YUYV camera format with full annotated image processing enabled, the baseline performance measurements revealed:
| Metric | Average Time | % of Total |
|---|---|---|
| Total Pipeline Time | 8.65ms | 100% |
| Preprocessing Time | 1.22ms | 14% |
| MediaPipe Inference | 2.13ms | 25% |
| Postprocessing Time | 5.30ms | 61% |
| Effective FPS | 2.28 | - |
The baseline analysis immediately identified postprocessing as the primary bottleneck, consuming 61% of total pipeline time. This stage includes MediaPipe result conversion, RGB→BGR color conversion, and ROS Image message creation for annotated output.
Optimization #1: Conditional Annotated Image Processing
Problem Analysis
The postprocessing bottleneck was caused by unconditional generation of annotated images, even when no ROS subscribers were listening to the /vision/objects/annotated topic. This resulted in expensive memory operations and color conversions being performed unnecessarily.
Implementation
I implemented a subscriber count check to conditionally skip annotated image processing when no subscribers are present:
```python
def publish_results(self, results: Dict, timestamp: float) -> None:
    """Publish object detection results and optionally annotated images."""
    try:
        # Always publish detection results
        msg = MessageConverter.detection_results_to_ros(results, timestamp)
        self.detections_publisher.publish(msg)

        # Conditional annotated image publishing
        if (self.annotated_image_publisher is not None and
                'output_image' in results and
                results['output_image'] is not None):

            # Optimization: Skip expensive postprocessing if no subscribers
            subscriber_count = self.annotated_image_publisher.get_subscription_count()
            if subscriber_count == 0:
                self.log_buffered_event(
                    'ANNOTATED_PROCESSING_SKIPPED',
                    'Skipping annotated image processing - no subscribers',
                    subscriber_count=subscriber_count
                )
                return

            # Continue with annotated image processing only when needed
            # ... (expensive postprocessing operations)
    except Exception as exc:
        # Error handling elided from the original excerpt; shown here so the snippet is complete
        self.get_logger().error(f'publish_results failed: {exc}')
```
Performance Impact
The conditional processing optimization delivered significant improvements:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Total Pipeline Time | 8.65ms | 5.19ms | 🚀 40.0% FASTER |
| Postprocessing Time | 5.30ms | 1.52ms | 🚀 71.3% FASTER |
| CPU Efficiency | High overhead | Reduced by 40% | Significant |
This optimization eliminates 3.46ms of processing time per frame when annotated images are not needed, which is the common case in production robotics applications where object detection results are consumed programmatically rather than visually.
Optimization #2: RGB888 Camera Format
Problem Analysis
After eliminating the postprocessing bottleneck, MediaPipe inference became the primary performance constraint. Analysis revealed that the BGR→RGB color conversion in preprocessing, combined with MediaPipe's internal processing of converted frames, was creating inefficiencies.
Implementation
I switched the camera configuration from YUYV to RGB888 format, allowing direct RGB input to MediaPipe:
```python
def process_frame(self, frame: np.ndarray, timestamp: float) -> Optional[Dict]:
    """Process frame for object detection."""
    try:
        # Optimization: Check if frame is already in RGB format (camera_format=RGB888)
        # Note: the channel count alone cannot distinguish BGR from RGB; this path
        # relies on the camera being launched with camera_format:=RGB888.
        if frame.shape[2] == 3:  # Ensure it's a 3-channel image
            # Direct RGB input from camera eliminates BGR→RGB conversion
            rgb_frame = frame
            self.log_buffered_event(
                'PREPROCESSING_OPTIMIZED',
                'Using direct RGB input - skipping BGR→RGB conversion',
                frame_shape=str(frame.shape)
            )
        else:
            # Fallback: Convert BGR to RGB for MediaPipe
            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        # Create MediaPipe image with direct RGB data
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)
        # ... (MediaPipe inference and result packaging continue here)
    except Exception as exc:
        # Error handling elided from the original excerpt; shown here so the snippet is complete
        self.get_logger().error(f'process_frame failed: {exc}')
        return None
```
Camera Launch Configuration:
```bash
ros2 launch gesturebot object_detection.launch.py camera_format:=RGB888
```
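For reference, a camera_format argument like the one above can be wired through a ROS 2 Python launch file roughly as follows; the actual gesturebot launch file layout and executable name are not shown in this post, so treat them as assumptions:
```python
# Hypothetical sketch of a launch file exposing camera_format as an argument
# and forwarding it to the detection node as a parameter.
from launch import LaunchDescription
from launch.actions import DeclareLaunchArgument
from launch.substitutions import LaunchConfiguration
from launch_ros.actions import Node


def generate_launch_description():
    return LaunchDescription([
        DeclareLaunchArgument(
            'camera_format',
            default_value='RGB888',
            description='Pixel format requested from the camera driver',
        ),
        Node(
            package='gesturebot',                # package name from the launch command above
            executable='object_detection_node',  # hypothetical executable name
            parameters=[{'camera_format': LaunchConfiguration('camera_format')}],
        ),
    ])
```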
Technical Benefits
The RGB888 format optimization provides multiple performance improvements:
- Eliminates BGR→RGB Conversion: Removes the expensive cv2.cvtColor() operation from preprocessing
- MediaPipe Efficiency: Direct RGB input is processed more efficiently by MediaPipe's internal algorithms
- Memory Bandwidth Reduction: Fewer memory copy operations and intermediate buffer allocations
- Cache Efficiency: Better memory access patterns with consistent RGB format throughout the pipeline
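To get a rough sense of the first point, the cost of the eliminated conversion can be measured in isolation with a few lines of Python; the 640×480 frame size is an assumption, so substitute the actual camera resolution:
```python
# Standalone timing of the BGR→RGB conversion that the RGB888 format removes.
import time

import cv2
import numpy as np

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # assumed resolution

cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # warm-up call
n = 1000
start = time.perf_counter()
for _ in range(n):
    cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
per_frame_ms = (time.perf_counter() - start) * 1000 / n
print(f"cv2.cvtColor BGR→RGB: {per_frame_ms:.3f} ms per frame")
```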
Performance Impact
The RGB888 optimization delivered substantial additional improvements:
| Metric | After Opt #1 | After Opt #2 | Additional Improvement |
|---|---|---|---|
| Total Pipeline Time | 5.19ms | 2.71ms | 🚀 47.8% FASTER |
| MediaPipe Time | 2.47ms | 0.81ms | 🚀 67.2% FASTER |
| Preprocessing Time | 1.30ms | 0.83ms | 🚀 36.2% FASTER |
The MediaPipe inference time reduction of 67.2% was particularly significant, transforming it from the primary bottleneck to a well-optimized component.
Performance Measurement Infrastructure
Buffered Logging System
To collect detailed performance data without impacting measurements, I implemented a configurable buffered logging system:
```python
class BufferedLogger:
    def __init__(self, enabled: bool = True, max_size: int = 200):
        self.enabled = enabled
        self.buffer = []
        self.max_size = max_size
        self.mode = 'circular' if enabled else 'disabled'

    def log_event(self, event_type: str, message: str, **kwargs):
        if not self.enabled:
            return
        event = {
            'timestamp': time.perf_counter(),
            'event_type': event_type,
            'message': message,
            **kwargs
        }
        if len(self.buffer) >= self.max_size:
            self.buffer.pop(0)  # Remove oldest entry
        self.buffer.append(event)
```
Real-Time Performance Publishing
The system publishes aggregated performance metrics every 5 seconds to ROS topics for real-time monitoring:
```python
def publish_performance_stats(self):
    """Publish performance statistics to ROS topic."""
    if not self.enable_performance_tracking:
        return

    current_time = time.perf_counter()
    period_duration = current_time - self.stats.period_start_time

    # frames_processed > 0 guards against division by zero on idle periods
    if period_duration >= self.stats.period_duration and self.stats.frames_processed > 0:
        # Calculate averages
        avg_preprocessing = self.stats.total_preprocessing_time / self.stats.frames_processed
        avg_mediapipe = self.stats.total_mediapipe_time / self.stats.frames_processed
        avg_postprocessing = self.stats.total_postprocessing_time / self.stats.frames_processed
        avg_total = avg_preprocessing + avg_mediapipe + avg_postprocessing

        # Create and publish performance message
        perf_msg = PerformanceMetrics()
        perf_msg.header.stamp = self.get_clock().now().to_msg()
        perf_msg.current_fps = self.stats.frames_processed / period_duration
        perf_msg.avg_preprocessing_time = avg_preprocessing
        perf_msg.avg_mediapipe_time = avg_mediapipe
        perf_msg.avg_postprocessing_time = avg_postprocessing
        perf_msg.avg_total_pipeline_time = avg_total

        self.performance_publisher.publish(perf_msg)
```
Statistical Validation
I used consistent testing methodology to ensure reliable measurements:
- Test Duration: 30-second measurement periods
- Data Points: 5-6 measurement intervals per test
- Consistency Validation: Multiple test runs to verify reproducibility
- Variance Analysis: Standard deviation calculations to assess measurement stability
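As a sketch of that variance analysis, the per-interval pipeline means collected from the performance topic can be summarized with the standard library; the sample values below are placeholders, not measured data:
```python
# Summarize the 5-second interval means gathered during a 30-second run.
import statistics

pipeline_means_ms = [2.68, 2.73, 2.70, 2.74, 2.69, 2.72]  # placeholder interval values

mean = statistics.mean(pipeline_means_ms)
stdev = statistics.stdev(pipeline_means_ms)
# Rough 95% confidence interval for the mean (normal approximation)
half_width = 1.96 * stdev / len(pipeline_means_ms) ** 0.5
print(f"mean = {mean:.2f} ms, stdev = {stdev:.3f} ms, "
      f"95% CI ≈ [{mean - half_width:.2f}, {mean + half_width:.2f}] ms")
```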
Results Summary
Cumulative Performance Improvement
The systematic optimization approach achieved exceptional results:
| Configuration | Total Pipeline Time | Improvement from Baseline | Cumulative Improvement |
|---|---|---|---|
| Baseline (YUYV + Full Processing) | 8.65ms | - | - |
| Optimization #1 (Conditional Processing) | 5.19ms | 40.0% faster | 40.0% |
| Final Optimized (RGB888 + Conditional) | 2.71ms | 68.7% faster | 68.7% |
Stage-by-Stage Transformation
| Pipeline Stage | Baseline | After Opt #1 | Final Optimized | Total Improvement |
|---|---|---|---|---|
| Preprocessing | 1.22ms (14%) | 1.30ms (25%) | 0.83ms (31%) | 32.0% faster |
| MediaPipe | 2.13ms (25%) | 2.47ms (48%) | 0.81ms (30%) | 62.0% faster |
| Postprocessing | 5.30ms (61%) | 1.52ms (29%) | 1.06ms (39%) | 80.0% faster |
Mathematical Validation
The cumulative improvement follows the expected multiplicative formula:
```
Total Improvement = 1 - (1 - Opt1) × (1 - Opt2)
                  = 1 - (1 - 0.400) × (1 - 0.478)
                  = 1 - (0.600 × 0.522)
                  = 0.687 ≈ 68.7% ✓
```
Technical Insights
Bottleneck Migration Pattern
The optimization process revealed an important pattern of bottleneck migration:
- Initial State: Postprocessing dominated (61% of pipeline time)
- After Optimization #1: MediaPipe became the primary bottleneck (48% of pipeline time)
- Final State: Balanced pipeline with no dominant bottleneck (30-39% distribution)
This demonstrates the importance of iterative optimization and continuous measurement, as eliminating one bottleneck often reveals the next performance constraint.
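The share figures quoted above follow directly from the measured stage times and reported pipeline totals in the tables; a few lines of arithmetic reproduce them:
```python
# Recompute each stage's share of the pipeline at the three measurement points.
stage_times_ms = {
    'preprocessing':  (1.22, 1.30, 0.83),
    'mediapipe':      (2.13, 2.47, 0.81),
    'postprocessing': (5.30, 1.52, 1.06),
}
labels = ('baseline', 'after opt #1', 'final')
totals_ms = (8.65, 5.19, 2.71)  # reported total pipeline times

for stage, times in stage_times_ms.items():
    shares = ', '.join(f'{label}: {t / total:.0%}'
                       for label, t, total in zip(labels, times, totals_ms))
    print(f'{stage:>14}  {shares}')
```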
Optimization Sequencing Strategy
The success of this optimization effort validates several key principles:
- Measure First: Comprehensive performance measurement infrastructure enabled data-driven decisions
- Target the Largest Bottleneck: Addressing postprocessing first provided the highest initial impact
- Iterative Approach: Sequential optimization revealed secondary bottlenecks that weren't initially apparent
- Validate Each Step: Statistical validation ensured that improvements were real and reproducible
Camera Format Selection Impact
The RGB888 camera format optimization provided insights into the importance of data format consistency throughout processing pipelines:
- Format Conversion Overhead: BGR→RGB conversion consumed significant CPU cycles
- MediaPipe Efficiency: Direct RGB input dramatically improved inference performance
- Memory Access Patterns: Consistent format reduced cache misses and memory bandwidth requirements
Performance Measurement Best Practices
The measurement infrastructure development highlighted several critical practices:
- Non-Intrusive Logging: Buffered logging prevents measurement artifacts
- Aggregated Metrics: 5-second averaging provides stable, meaningful performance data
- Statistical Rigor: Multiple measurement periods enable confidence interval calculation
- Real-Time Monitoring: ROS topic publishing allows live performance observation
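One way to keep the tracking non-intrusive in practice is to expose these features as node parameters so they cost nothing when disabled; the parameter names below mirror the earlier snippets, but the exact wiring is an assumption rather than the node's actual code:
```python
# Sketch: toggling the measurement features via ROS 2 node parameters.
from rclpy.node import Node


class ObjectDetectionNode(Node):  # hypothetical node class
    def __init__(self):
        super().__init__('object_detection_node')
        self.declare_parameter('enable_performance_tracking', True)
        self.declare_parameter('enable_buffered_logging', True)
        self.declare_parameter('log_buffer_size', 200)

        self.enable_performance_tracking = (
            self.get_parameter('enable_performance_tracking').value)
        self.buffered_logger = BufferedLogger(
            enabled=self.get_parameter('enable_buffered_logging').value,
            max_size=self.get_parameter('log_buffer_size').value,
        )
```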
Conclusion
This optimization project demonstrates the effectiveness of systematic, measurement-driven performance improvement. By implementing comprehensive performance tracking, identifying bottlenecks through data analysis, and applying targeted optimizations, I achieved a 68.7% reduction in object detection pipeline processing time.
The key success factors were:
- Robust Measurement Infrastructure: Accurate, non-intrusive performance tracking enabled data-driven optimization decisions
- Bottleneck-Focused Approach: Targeting the largest performance constraints first maximized improvement impact
- Format Optimization: Aligning data formats throughout the pipeline eliminated unnecessary conversions
- Conditional Processing: Smart resource management reduced computational overhead when full processing isn't needed
The optimized pipeline now processes frames in 2.71ms compared to the original 8.65ms, providing substantial headroom for additional robotics processing tasks while maintaining full object detection functionality. This work demonstrates that significant performance improvements are achievable through systematic analysis and targeted optimization, even in complex multi-stage processing pipelines.
Vipin M