Project Overview
I worked on debugging inference issues with a fine-tuned NVIDIA GR00T N1.5 model for controlling an SO-100 robotic arm. The model was trained successfully and uploaded to HuggingFace Hub, but the PhosphoBot browser interface was failing during AI control activation.
This project log documents the debugging process, root cause analysis, and the implementation of an alternative local inference solution.
Hardware Setup
- Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (IDs 0 and 2) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Model: phospho-app/gr00t-example_dataset-h9g75u7gak (fine-tuned GR00T N1.5)
The Problem
When clicking “AI Control” in the PhosphoBot browser interface, the system reported:
Exception: No robot connected. Exiting AI control loop.
The robot was physically connected and visible in the UI, cameras were streaming successfully, and the model had been trained and uploaded to HuggingFace Hub. The issue appeared to be in the inference pipeline.
Debugging Process
Issue 1: Joint Count Mismatch
I added debug logging to understand the failure and discovered:
Connected joints: 6, Config joints: 1
Root Cause: The code was reading the model configuration incorrectly:
```python
# Incorrect code
number_of_joints_in_config = len(
    config.embodiment.statistics.action.action_space.values()
)
```
This was counting dictionary keys (max, min, mean, std, q01, q99) instead of joint dimensions.
Model Config Structure:
```json
{
  "action_space": {
    "action_space": 6
  }
}
```
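The mismatch can be reproduced in isolation (the dict below is a stand-in for the config fragment above):

```python
# Hypothetical config fragment mirroring the nested structure shown above.
action_space = {"action_space": 6}

# Buggy read: len() counts the dict's entries, not the joint dimension.
wrong_count = len(action_space.values())    # -> 1, hence "Config joints: 1"

# Correct read: unwrap the nested key to recover the joint count.
right_count = action_space["action_space"]  # -> 6
```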
Solution: Handle the nested dictionary structure correctly:
```python
# Fixed code
action_space = config.embodiment.statistics.action.action_space

# Case 1: action_space is a dict whose 'action_space' key holds the joint count
if isinstance(action_space, dict) and 'action_space' in action_space:
    number_of_joints_in_config = action_space['action_space']
# Case 2: action_space exposes 'max'/'min' arrays; their length is the joint count
elif hasattr(action_space, 'max') and isinstance(action_space.max, list):
    number_of_joints_in_config = len(action_space.max)
# Additional fallback cases...
```
Issue 2: Device Mismatch on Modal Server
After fixing the joint count, a new error appeared:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Root Cause:
- Model inference happens on Modal GPU server (remote)
- Some model components loaded on CPU, others on GPU
- Issue occurs in VLLN (Vision-Language Layer Norm) component
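While the server side is out of reach, this class of failure is easy to detect when you do have the model object. A minimal diagnostic sketch (the toy `nn.Sequential` stands in for the real GR00T modules):

```python
import torch.nn as nn

def param_devices(model: nn.Module) -> set:
    """Report every device hosting a parameter or buffer of the model.

    More than one entry reproduces the "Expected all tensors to be on
    the same device" failure mode described above.
    """
    devs = {str(p.device) for p in model.parameters()}
    devs |= {str(b.device) for b in model.buffers()}
    return devs

# Toy model; on a GPU box you would call model.to("cuda") and expect {'cuda:0'}.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4), nn.Linear(4, 2))
devices = param_devices(model)
```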
Attempted Fix: Added retry logic with exponential backoff to handle transient device issues:
```python
max_retries = 3
retry_delay = 1.0  # seconds

for retry_attempt in range(max_retries):
    try:
        actions = self(inputs)
        break  # Success
    except RuntimeError as e:
        if "Expected all tensors to be on the same device" in str(e):
            if retry_attempt < max_retries - 1:
                logger.warning(
                    f"Device mismatch error on attempt "
                    f"{retry_attempt + 1}/{max_retries}. Retrying in {retry_delay}s..."
                )
                await asyncio.sleep(retry_delay)
                retry_delay *= 2  # Exponential backoff
```
Status: This helped with transient issues but didn’t solve the root cause, which is on the Modal server side and not fixable from the client.
Alternative Solution: Local Inference
Since the PhosphoBot Modal server had device mismatch issues, I implemented a local inference solution using official Isaac-GR00T scripts.
Architecture: Client-Server Model
Based on NVIDIA’s official tutorial, the solution uses a two-terminal approach:
Terminal 1: Inference Server
- Loads GR00T model on local GPU
- Runs inference on observations
- Returns action predictions
- Uses ZMQ protocol for fast communication
Terminal 2: Robot Client
- Connects to SO-100 robot via USB
- Captures camera images
- Sends observations to server
- Executes returned actions
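The exchange between the two terminals can be sketched as a request/reply loop. Plain TCP sockets stand in for ZMQ here so the example is dependency-free, and the message schema (`num_joints`, `joints`) is illustrative, not the actual Isaac-GR00T wire format:

```python
import json
import socket
import threading

def handle_one(srv: socket.socket) -> None:
    """Toy inference server: accept one observation, reply with a dummy action."""
    conn, _ = srv.accept()
    obs = json.loads(conn.recv(4096))
    # A real server would run GR00T inference on the observation here.
    conn.sendall(json.dumps({"joints": [0.0] * obs["num_joints"]}).encode())
    conn.close()

def request_action(port: int, num_joints: int = 6) -> dict:
    """Toy robot client: send an observation, block for the predicted action."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        conn.sendall(json.dumps({"num_joints": num_joints}).encode())
        return json.loads(conn.recv(4096))

# Bind first so the client can never race the listener.
srv = socket.create_server(("127.0.0.1", 0))
port = srv.getsockname()[1]
server = threading.Thread(target=handle_one, args=(srv,))
server.start()
action = request_action(port)
server.join()
srv.close()
```

The REQ/REP pattern keeps the client strictly lock-step with the server: one observation out, one action batch back, which matches the control-loop structure below.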
Implementation
Server Script (start_groot_server.sh):
```bash
#!/bin/bash
cd /home/vipin/Isaac-GR00T
conda activate gr00t
python scripts/inference_service.py \
    --server \
    --model_path "phospho-app/gr00t-example_dataset-h9g75u7gak" \
    --embodiment-tag "new_embodiment" \
    --data-config "so100_dualcam" \
    --denoising-steps 4 \
    --port 5555
```
Client Script (gr00t_inference_local.py):
- Based on the official examples/SO-100/eval_lerobot.py
- Connects to the SO-100 robot using LeRobot
- Initializes cameras (IDs 0 and 2)
- Connects to GR00T inference server
- Runs inference loop at 30 FPS
- Handles action horizon queuing
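The action-horizon queuing step can be sketched as follows. `predict()` is a hypothetical stand-in for a call to the inference server, which returns a horizon of several future actions per inference call:

```python
from collections import deque

def predict(step: int, horizon: int = 4) -> list:
    """Hypothetical stand-in for server inference: returns `horizon`
    future 6-joint actions starting at `step`."""
    return [[float(step + k)] * 6 for k in range(horizon)]

def control_loop(num_steps: int = 10) -> list:
    """Refill a FIFO of actions from predict() only when it runs dry,
    so the 30 FPS loop does not pay inference latency on every tick."""
    queue: deque = deque()
    executed = []
    for step in range(num_steps):
        if not queue:
            queue.extend(predict(step))   # one inference per horizon
        executed.append(queue.popleft())  # execute the next queued action
    return executed

actions = control_loop()
```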
Configuration:
```python
# Robot
ROBOT_TYPE = "so100_follower"
ROBOT_PORT = "/dev/ttyACM0"
ROBOT_ID = "so-100"

# Cameras
CAMERA_CONFIGS = {
    "front": {"type": "opencv", "index_or_path": 0, "width": 640, "height": 480, "fps": 30},
    "wrist": {"type": "opencv", "index_or_path": 2, "width": 640, "height": 480, "fps": 30},
}

# Task
LANG_INSTRUCTION = "pick up the striped box and put it into the white plate"
```
Advantages of Local Inference
- No Modal server dependency: Runs entirely locally
- Full control over device placement: All tensors on GPU
- Easier debugging: Direct access to model and logs
- Lower latency: No network round-trip to Modal
- Official NVIDIA approach: Based on their tutorial
- More reliable: Better for production deployment
Disk Space Management
During the debugging process, I ran into disk space issues (a 96GB disk at 100% capacity) and performed a cleanup.
Actions taken:
- Cleaned the conda package cache: conda clean --all --yes (freed 2.3GB)
- Removed the HuggingFace XET cache: rm -rf ~/.cache/huggingface/xet (freed 6.0GB)
- Removed old LeRobot dataset versions (freed 1.3GB)
Result: Freed ~10GB total, bringing disk usage down to 89%.
Technical Notes
Model Configuration Debugging
- Always add debug logging before making assumptions
- Check data structures - don’t assume dictionary structure
- Handle multiple cases - model configs can vary
- Verify on both sides - client and server must agree on config
PhosphoBot + Modal Limitations
- Modal server is a black box - can’t fix server-side issues
- Device placement errors on remote server are hard to debug
- Network latency adds overhead
- Dependency on external service
Direct Inference Requirements
- Local GPU with sufficient VRAM (16GB for GR00T N1.5)
- Isaac-GR00T repository installed
- LeRobot for robot control
- Proper conda environment setup
Current Status
- Joint count mismatch fixed - correctly reads 6 joints from model config
- Debug logging added for comprehensive troubleshooting
- Device mismatch on Modal has retry logic but root cause remains on server
- Alternative local inference solution implemented and ready for testing
- Disk space cleaned - freed 10GB
Usage Commands
Start inference server:
```bash
cd /home/vipin/phosphobot
./start_groot_server.sh
```
Run robot client (separate terminal):
```bash
cd /home/vipin/phosphobot
conda activate gr00t
python gr00t_inference_local.py
```
Summary
This debugging session involved systematic troubleshooting of AI model inference issues, from configuration parsing problems to device placement errors on remote servers. The solution involved implementing a local inference architecture that provides better control and reliability for robotic manipulation tasks.
The local approach eliminates dependencies on external services and provides the foundation for more robust robotic AI applications.
Next: Testing the local inference system with actual robot manipulation tasks.
Model: NVIDIA GR00T N1.5 (3B parameters)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Software: Isaac-GR00T, LeRobot, PhosphoBot
Vipin M