
Optimizing Object Detection Pipeline Performance: A 68.7% Improvement Through Systematic Bottleneck Analysis

A project log for GestureBot - Computer Vision Robot

A mobile robot that responds to human gestures, facial expressions using real-time pose estimation and gesture recognition & intuitive HRI

Vipin M (vipin-m) · 08/11/2025 at 01:44

Introduction

I recently completed a comprehensive performance optimization project for a ROS 2-based object detection pipeline using MediaPipe and OpenCV. The system processes camera frames for real-time object detection in robotics applications, but initial performance analysis revealed significant bottlenecks that were limiting throughput and consuming excessive CPU resources.

The object detection pipeline consists of three main stages: preprocessing, MediaPipe inference, and postprocessing.

Through systematic measurement and targeted optimization, I achieved a 68.7% reduction in total pipeline processing time, from 8.65ms to 2.71ms per frame, while maintaining full functionality and improving system stability.

Baseline Performance Analysis

Measurement Infrastructure

Before implementing any optimizations, I established a comprehensive performance measurement system to ensure accurate, statistically reliable data collection. The measurement infrastructure includes:

PipelineTimer Class: High-precision timing using time.perf_counter() for microsecond-level accuracy:

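import time


class PipelineTimer:
    """Times named pipeline stages with time.perf_counter()."""

    def __init__(self):
        self.stage_times = {}  # stage name -> start timestamp in seconds
        self.start_time = None

    def start_stage(self, stage_name: str):
        # Record the start timestamp for this stage
        self.stage_times[stage_name] = time.perf_counter()

    def end_stage(self, stage_name: str) -> float:
        # Return elapsed milliseconds, or 0.0 if the stage was never started
        if stage_name in self.stage_times:
            duration = time.perf_counter() - self.stage_times[stage_name]
            return duration * 1000  # Convert to milliseconds
        return 0.0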
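
One way to use a timer like this is to wrap each stage in a context manager, so the stop call cannot be forgotten when a stage raises. The sketch below is only an illustration of that idea, not the project's code; `timed_stage` and the `timings` dict are hypothetical names.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(timings: dict, stage_name: str):
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the stage raises, so a failed frame is still timed.
        timings[stage_name] = (time.perf_counter() - start) * 1000.0


timings = {}
with timed_stage(timings, "preprocessing"):
    total = sum(range(10_000))  # stand-in for real preprocessing work

print(f"preprocessing took {timings['preprocessing']:.3f} ms")
```

The explicit `start_stage`/`end_stage` pair in the original class gives the same numbers; the context manager only changes how the calls are written.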

PerformanceStats Class: Aggregates timing data over 5-second periods and publishes metrics to ROS topics:

class PerformanceStats:
    def __init__(self):
        self.period_start_time = time.perf_counter()
        self.frames_processed = 0
        self.total_preprocessing_time = 0.0
        self.total_mediapipe_time = 0.0
        self.total_postprocessing_time = 0.0
        self.period_duration = 5.0  # seconds

Statistical Methodology: I used 30-second test periods with multiple measurement intervals to ensure statistical confidence. Each test collected 5-6 data points, allowing calculation of mean performance metrics and variance analysis.
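
As a rough illustration of the mean and variance step, the per-period averages from one test run could be summarized with Python's `statistics` module. The six values below are placeholders chosen for easy arithmetic, not measurements from this project.

```python
import statistics

# Placeholder per-period averages in milliseconds (illustrative, NOT measured)
period_averages_ms = [8.0, 9.0, 8.5, 8.5, 8.5, 8.5]

mean_ms = statistics.mean(period_averages_ms)
variance_ms2 = statistics.variance(period_averages_ms)  # sample variance (n - 1)
stdev_ms = statistics.stdev(period_averages_ms)

print(f"mean={mean_ms:.2f} ms, variance={variance_ms2:.3f} ms^2, stdev={stdev_ms:.3f} ms")
```

With only 5-6 data points per run, the sample variance (dividing by n - 1) is the safer choice over the population variance.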

Baseline Performance Results

Using YUYV camera format with full annotated image processing enabled, the baseline performance measurements revealed:

| Metric | Average Time | Share of Total |
| --- | --- | --- |
| Total Pipeline Time | 8.65ms | 100% |
| Preprocessing Time | 1.22ms | 14% |
| MediaPipe Inference | 2.13ms | 25% |
| Postprocessing Time | 5.30ms | 61% |
| Effective FPS | 2.28 | - |

The baseline analysis immediately identified postprocessing as the primary bottleneck, consuming 61% of total pipeline time. This stage includes MediaPipe result conversion, RGB→BGR color conversion, and ROS Image message creation for annotated output.
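
The percentage shares can be reproduced from the baseline stage times. This is a small sketch using the numbers reported above; the dictionary keys are just labels chosen for the example.

```python
# Baseline stage times in milliseconds, taken from the measurements above
baseline_ms = {"preprocessing": 1.22, "mediapipe": 2.13, "postprocessing": 5.30}

total_ms = sum(baseline_ms.values())
shares = {stage: 100.0 * t / total_ms for stage, t in baseline_ms.items()}
bottleneck = max(baseline_ms, key=baseline_ms.get)

for stage, pct in shares.items():
    print(f"{stage:>15}: {baseline_ms[stage]:.2f} ms ({pct:.0f}%)")
print(f"largest stage: {bottleneck}")
```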

Optimization #1: Conditional Annotated Image Processing

Problem Analysis

The postprocessing bottleneck was caused by unconditional generation of annotated images, even when no ROS subscribers were listening to the /vision/objects/annotated topic. This resulted in expensive memory operations and color conversions being performed unnecessarily.

Implementation

I implemented a subscriber count check to conditionally skip annotated image processing when no subscribers are present:

def publish_results(self, results: Dict, timestamp: float) -> None:
    """Publish object detection results and optionally annotated images."""
    try:
        # Always publish detection results
        msg = MessageConverter.detection_results_to_ros(results, timestamp)
        self.detections_publisher.publish(msg)

        # Conditional annotated image publishing
        if (self.annotated_image_publisher is not None and
            'output_image' in results and
            results['output_image'] is not None):
            
            # Optimization: Skip expensive postprocessing if no subscribers
            subscriber_count = self.annotated_image_publisher.get_subscription_count()
            
            if subscriber_count == 0:
                self.log_buffered_event(
                    'ANNOTATED_PROCESSING_SKIPPED',
                    'Skipping annotated image processing - no subscribers',
                    subscriber_count=subscriber_count
                )
                return
            
            # Continue with annotated image processing only when needed
            # ... (expensive postprocessing operations)
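
The gating pattern can be tried without a ROS installation. In the sketch below, `FakePublisher` is a hypothetical stand-in that only mimics the two publisher calls used above (`get_subscription_count()` and `publish()`), and `publish_if_subscribed` is an illustrative helper, not part of the project.

```python
from typing import Callable


class FakePublisher:
    """Stand-in for a ROS 2 publisher; mimics only the two calls used here."""

    def __init__(self, subscribers: int = 0):
        self.subscribers = subscribers
        self.published = []

    def get_subscription_count(self) -> int:
        return self.subscribers

    def publish(self, msg) -> None:
        self.published.append(msg)


def publish_if_subscribed(publisher, render: Callable[[], object]) -> bool:
    """Call the expensive render function only when someone is listening."""
    if publisher.get_subscription_count() == 0:
        return False  # skip rendering entirely
    publisher.publish(render())
    return True


render_calls = []


def render_annotated():
    render_calls.append(1)  # track how often the costly path runs
    return "annotated-frame"


idle = FakePublisher(subscribers=0)
watched = FakePublisher(subscribers=1)
publish_if_subscribed(idle, render_annotated)
publish_if_subscribed(watched, render_annotated)
print(f"render ran {len(render_calls)} time(s) for 2 publish attempts")
```

Passing the renderer as a callable keeps the expensive work lazy: nothing is drawn or converted unless the subscriber check passes.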

Performance Impact

The conditional processing optimization delivered significant improvements:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Total Pipeline Time | 8.65ms | 5.19ms | 🚀 40.0% FASTER |
| Postprocessing Time | 5.30ms | 1.52ms | 🚀 71.3% FASTER |
| CPU Efficiency | High overhead | Reduced by 40% | Significant |

This optimization eliminates 3.46ms of processing time per frame when annotated images are not needed, which is the common case in production robotics applications where object detection results are consumed programmatically rather than visually.
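
The "faster" percentages in the table are relative reductions in time per frame. A minimal helper (the name `percent_reduction` is made up for this example) reproduces them from the reported before and after values:

```python
def percent_reduction(before_ms: float, after_ms: float) -> float:
    """Relative reduction in processing time, as a percentage of the original."""
    return 100.0 * (before_ms - after_ms) / before_ms


saved_ms = 8.65 - 5.19
print(f"saved per frame: {saved_ms:.2f} ms")
print(f"total pipeline:  {percent_reduction(8.65, 5.19):.1f}% faster")
print(f"postprocessing:  {percent_reduction(5.30, 1.52):.1f}% faster")
```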

Optimization #2: RGB888 Camera Format

Problem Analysis

After eliminating the postprocessing bottleneck, MediaPipe inference became the primary performance constraint. Analysis revealed that the BGR→RGB color conversion in preprocessing, combined with MediaPipe's internal processing of converted frames, was creating inefficiencies.

Implementation

I switched the camera configuration from YUYV to RGB888 format, allowing direct RGB input to MediaPipe:

def process_frame(self, frame: np.ndarray, timestamp: float) -> Optional[Dict]:
    """Process frame for object detection."""
    try:
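        # Optimization: with camera_format=RGB888 the frame already arrives as RGB.
        # Note: the channel count cannot distinguish RGB from BGR, so this check
        # relies on the camera being configured to deliver RGB888.
        if frame.shape[2] == 3:  # Ensure it's a 3-channel image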
            # Direct RGB input from camera eliminates BGR→RGB conversion
            rgb_frame = frame
            self.log_buffered_event(
                'PREPROCESSING_OPTIMIZED',
                'Using direct RGB input - skipping BGR→RGB conversion',
                frame_shape=str(frame.shape)
            )
        else:
            # Fallback: Convert BGR to RGB for MediaPipe
            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        
        # Create MediaPipe image with direct RGB data
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame) 

Camera Launch Configuration:

ros2 launch gesturebot object_detection.launch.py camera_format:=RGB888
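
Because RGB and BGR frames both have three channels, the channel count alone cannot tell them apart; the decision has to come from the configured camera format. The sketch below illustrates that idea on raw packed bytes in pure Python (no NumPy or OpenCV), so it shows the logic rather than a fast implementation. The function names and the `"BGR888"` label are assumptions made for the example.

```python
def bgr_to_rgb(buf: bytes) -> bytes:
    """Swap the first and third byte of every packed 3-byte pixel."""
    if len(buf) % 3 != 0:
        raise ValueError("expected packed 3-byte pixels")
    out = bytearray(buf)
    # Every pixel is touched, which is why the conversion costs time per frame.
    out[0::3], out[2::3] = buf[2::3], buf[0::3]
    return bytes(out)


def to_rgb(buf: bytes, pixel_format: str) -> bytes:
    """Return RGB data, converting only when the source format requires it."""
    if pixel_format == "RGB888":
        return buf  # already RGB: no copy, no conversion
    if pixel_format == "BGR888":
        return bgr_to_rgb(buf)
    raise ValueError(f"unsupported pixel format: {pixel_format}")


two_pixels_bgr = bytes([1, 2, 3, 4, 5, 6])
print(list(to_rgb(two_pixels_bgr, "BGR888")))
```

Keying on an explicit format string also makes an unexpected format fail loudly instead of silently producing swapped colors.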

Technical Benefits

The RGB888 format optimization provides multiple performance improvements:

  1. Eliminates BGR→RGB Conversion: Removes expensive cv2.cvtColor() operation in preprocessing
  2. MediaPipe Efficiency: Direct RGB input is processed more efficiently by MediaPipe's internal algorithms
  3. Memory Bandwidth Reduction: Fewer memory copy operations and intermediate buffer allocations
  4. Cache Efficiency: Better memory access patterns with consistent RGB format throughout the pipeline

Performance Impact

The RGB888 optimization delivered substantial additional improvements:

| Metric | After Opt #1 | After Opt #2 | Additional Improvement |
| --- | --- | --- | --- |
| Total Pipeline Time | 5.19ms | 2.71ms | 🚀 47.8% FASTER |
| MediaPipe Time | 2.47ms | 0.81ms | 🚀 67.2% FASTER |
| Preprocessing Time | 1.30ms | 0.83ms | 🚀 36.2% FASTER |

The MediaPipe inference time reduction of 67.2% was particularly significant, transforming it from the primary bottleneck to a well-optimized component.

Performance Measurement Infrastructure

Buffered Logging System

To collect detailed performance data without impacting measurements, I implemented a configurable buffered logging system:

class BufferedLogger:
    def __init__(self, enabled: bool = True, max_size: int = 200):
        self.enabled = enabled
        self.buffer = []
        self.max_size = max_size
        self.mode = 'circular' if enabled else 'disabled'
    
    def log_event(self, event_type: str, message: str, **kwargs):
        if not self.enabled:
            return
        
        event = {
            'timestamp': time.perf_counter(),
            'event_type': event_type,
            'message': message,
            **kwargs
        }
        
        if len(self.buffer) >= self.max_size:
            self.buffer.pop(0)  # Remove oldest entry
        self.buffer.append(event)
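
The same circular behavior could also be expressed with `collections.deque(maxlen=...)`, which drops the oldest entry automatically and avoids the O(n) cost of `list.pop(0)`. This is an alternative sketch, not the project's logger; `RingLog` is a hypothetical name.

```python
import time
from collections import deque


class RingLog:
    """Fixed-size event log; the oldest entries are discarded automatically."""

    def __init__(self, max_size: int = 200):
        self.buffer = deque(maxlen=max_size)

    def log_event(self, event_type: str, message: str, **kwargs) -> None:
        self.buffer.append({
            "timestamp": time.perf_counter(),
            "event_type": event_type,
            "message": message,
            **kwargs,
        })


log = RingLog(max_size=3)
for i in range(5):
    log.log_event("FRAME", f"frame {i}", index=i)

# Only the three most recent events remain.
print([event["index"] for event in log.buffer])
```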

Real-Time Performance Publishing

The system publishes aggregated performance metrics every 5 seconds to ROS topics for real-time monitoring:

def publish_performance_stats(self):
    """Publish performance statistics to ROS topic."""
    if not self.enable_performance_tracking:
        return
    
    current_time = time.perf_counter()
    period_duration = current_time - self.stats.period_start_time
    
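    if period_duration >= self.stats.period_duration:
        # Avoid division by zero when no frames arrived during the period
        if self.stats.frames_processed == 0:
            return

        # Calculate averages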
        avg_preprocessing = self.stats.total_preprocessing_time / self.stats.frames_processed
        avg_mediapipe = self.stats.total_mediapipe_time / self.stats.frames_processed
        avg_postprocessing = self.stats.total_postprocessing_time / self.stats.frames_processed
        avg_total = avg_preprocessing + avg_mediapipe + avg_postprocessing
        
        # Create and publish performance message
        perf_msg = PerformanceMetrics()
        perf_msg.header.stamp = self.get_clock().now().to_msg()
        perf_msg.current_fps = self.stats.frames_processed / period_duration
        perf_msg.avg_preprocessing_time = avg_preprocessing
        perf_msg.avg_mediapipe_time = avg_mediapipe
        perf_msg.avg_postprocessing_time = avg_postprocessing
        perf_msg.avg_total_pipeline_time = avg_total
        
        self.performance_publisher.publish(perf_msg)
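
The excerpt above does not show how the counters are cleared between periods. One plausible arrangement, sketched here under that assumption, is to reset the accumulators each time a report is produced. `PeriodAggregator` and its injectable `clock` are hypothetical, introduced so the example runs without ROS and without waiting five seconds.

```python
import time


class PeriodAggregator:
    """Accumulates per-frame stage times and reports averages once per period."""

    STAGES = ("preprocessing", "mediapipe", "postprocessing")

    def __init__(self, period_s: float = 5.0, clock=time.perf_counter):
        self.period_s = period_s
        self.clock = clock
        self._reset()

    def _reset(self) -> None:
        self.period_start = self.clock()
        self.frames = 0
        self.totals = {stage: 0.0 for stage in self.STAGES}

    def add_frame(self, pre_ms: float, mp_ms: float, post_ms: float) -> None:
        self.frames += 1
        for stage, value in zip(self.STAGES, (pre_ms, mp_ms, post_ms)):
            self.totals[stage] += value

    def maybe_report(self):
        """Return averaged metrics if the period has elapsed, else None."""
        elapsed = self.clock() - self.period_start
        if elapsed < self.period_s or self.frames == 0:
            return None  # too early, or nothing to average
        report = {stage: total / self.frames for stage, total in self.totals.items()}
        report["fps"] = self.frames / elapsed
        self._reset()  # start a fresh period so averages are not cumulative
        return report


fake_now = [0.0]  # controllable clock for the demo
agg = PeriodAggregator(period_s=5.0, clock=lambda: fake_now[0])
agg.add_frame(1.0, 2.0, 3.0)
agg.add_frame(3.0, 2.0, 1.0)
early = agg.maybe_report()  # period not over yet
fake_now[0] = 5.0
report = agg.maybe_report()
print(early, report)
```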

Statistical Validation

I used a consistent testing methodology to ensure reliable measurements: 30-second test periods, each yielding 5-6 data points from the 5-second aggregation windows.

Results Summary

Cumulative Performance Improvement

The systematic optimization approach achieved exceptional results:

| Configuration | Total Pipeline Time | Improvement from Baseline | Cumulative Improvement |
| --- | --- | --- | --- |
| Baseline (YUYV + Full Processing) | 8.65ms | - | - |
| Optimization #1 (Conditional Processing) | 5.19ms | 40.0% faster | 40.0% |
| Final Optimized (RGB888 + Conditional) | 2.71ms | 68.7% faster | 68.7% |

Stage-by-Stage Transformation

| Pipeline Stage | Baseline | After Opt #1 | Final Optimized | Total Improvement |
| --- | --- | --- | --- | --- |
| Preprocessing | 1.22ms (14%) | 1.30ms (25%) | 0.83ms (31%) | 32.0% faster |
| MediaPipe | 2.13ms (25%) | 2.47ms (48%) | 0.81ms (30%) | 62.0% faster |
| Postprocessing | 5.30ms (61%) | 1.52ms (29%) | 1.06ms (39%) | 80.0% faster |

Mathematical Validation

The cumulative improvement follows the expected multiplicative formula:

Total Improvement = 1 - (1 - Opt1) × (1 - Opt2)
                  = 1 - (1 - 0.400) × (1 - 0.478)
                  = 1 - (0.600 × 0.522) = 1 - 0.313 = 0.687 (68.7%) ✓
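
The formula is multiplicative because each step's percentage is measured against the previous configuration, so the remaining time fractions multiply. A short check using the three reported totals (variable names are chosen for the example):

```python
# Reported total pipeline times in milliseconds
t_baseline, t_opt1, t_final = 8.65, 5.19, 2.71

step1 = 1 - t_opt1 / t_baseline  # reduction from optimization #1
step2 = 1 - t_final / t_opt1     # reduction from optimization #2, relative to #1

composed = 1 - (1 - step1) * (1 - step2)
direct = 1 - t_final / t_baseline

print(f"step 1: {100 * step1:.1f}%, step 2: {100 * step2:.1f}%")
print(f"composed: {100 * composed:.1f}%, direct: {100 * direct:.1f}%")
```

Simply adding the two step percentages would overstate the result, since the second step applies to an already shortened pipeline.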

Technical Insights

Bottleneck Migration Pattern

The optimization process revealed an important pattern of bottleneck migration:

  1. Initial State: Postprocessing dominated (61% of pipeline time)
  2. After Optimization #1: MediaPipe became the primary bottleneck (48% of pipeline time)
  3. Final State: Balanced pipeline with no dominant bottleneck (30-39% distribution)

This demonstrates the importance of iterative optimization and continuous measurement, as eliminating one bottleneck often reveals the next performance constraint.
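
The migration can be traced directly from the stage-by-stage table. The sketch below picks the slowest stage in each configuration and uses the ratio of slowest to fastest stage as a rough balance indicator; that ratio is an illustrative metric chosen for this example, not one used in the original measurements.

```python
# Stage times in milliseconds, copied from the stage-by-stage table
stage_times_ms = {
    "Baseline": {"preprocessing": 1.22, "mediapipe": 2.13, "postprocessing": 5.30},
    "After Opt #1": {"preprocessing": 1.30, "mediapipe": 2.47, "postprocessing": 1.52},
    "Final Optimized": {"preprocessing": 0.83, "mediapipe": 0.81, "postprocessing": 1.06},
}

slowest = {}
spread = {}
for config, stages in stage_times_ms.items():
    slowest[config] = max(stages, key=stages.get)
    # Ratio near 1.0 means the stages cost about the same.
    spread[config] = max(stages.values()) / min(stages.values())
    print(f"{config:>16}: slowest={slowest[config]}, slowest/fastest={spread[config]:.2f}x")
```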

Optimization Sequencing Strategy

The success of this optimization effort validates several key principles:

Camera Format Selection Impact

The RGB888 camera format optimization provided insights into the importance of data format consistency throughout processing pipelines:

Performance Measurement Best Practices

The measurement infrastructure development highlighted several critical practices:

Conclusion

This optimization project demonstrates the effectiveness of systematic, measurement-driven performance improvement. By implementing comprehensive performance tracking, identifying bottlenecks through data analysis, and applying targeted optimizations, I achieved a 68.7% reduction in object detection pipeline processing time.

The key success factors were:

  1. Robust Measurement Infrastructure: Accurate, non-intrusive performance tracking enabled data-driven optimization decisions
  2. Bottleneck-Focused Approach: Targeting the largest performance constraints first maximized improvement impact
  3. Format Optimization: Aligning data formats throughout the pipeline eliminated unnecessary conversions
  4. Conditional Processing: Smart resource management reduced computational overhead when full processing isn't needed

The optimized pipeline now processes frames in 2.71ms compared to the original 8.65ms, providing substantial headroom for additional robotics processing tasks while maintaining full object detection functionality. This work demonstrates that significant performance improvements are achievable through systematic analysis and targeted optimization, even in complex multi-stage processing pipelines.
