Project Overview
I worked on debugging inference issues with a fine-tuned NVIDIA GR00T N1.5 model for controlling an SO-100 robotic arm. The model was trained successfully and uploaded to HuggingFace Hub, but the PhosphoBot browser interface was failing during AI control activation.
This project log documents the debugging process, root cause analysis, and the implementation of an alternative local inference solution.
Hardware Setup
- Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (IDs 0 and 2) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Model: phospho-app/gr00t-example_dataset-h9g75u7gak (fine-tuned GR00T N1.5)
The Problem
When clicking “AI Control” in the PhosphoBot browser interface, the system reported:
Exception: No robot connected. Exiting AI control loop.
The robot was physically connected and visible in the UI, cameras were streaming successfully, and the model had been trained and uploaded to HuggingFace Hub. The issue appeared to be in the inference pipeline.
Debugging Process
Issue 1: Joint Count Mismatch
I added debug logging to understand the failure and discovered:
Connected joints: 6, Config joints: 1
Root Cause: The code was reading the model configuration incorrectly:
```python
# Incorrect code
number_of_joints_in_config = len(
    config.embodiment.statistics.action.action_space.values()
)
```
This was counting dictionary keys (max, min, mean, std, q01, q99) instead of joint dimensions.
Model Config Structure:
```json
{
  "action_space": {
    "action_space": 6
  }
}
```
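The mismatch can be reproduced in isolation (the dict below is a stand-in for the config fragment above):

```python
# Hypothetical config fragment mirroring the nested structure shown above.
action_space = {"action_space": 6}

# Buggy read: len() counts the dict's entries, not the joint dimension.
wrong_count = len(action_space.values())    # -> 1, hence "Config joints: 1"

# Correct read: unwrap the nested key to recover the joint count.
right_count = action_space["action_space"]  # -> 6
```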
Solution: Handle the nested dictionary structure correctly:
```python
# Fixed code
action_space = config.embodiment.statistics.action.action_space

# Case 1: action_space is a dict whose 'action_space' key holds the joint count
if isinstance(action_space, dict) and 'action_space' in action_space:
    number_of_joints_in_config = action_space['action_space']
# Case 2: action_space exposes 'max'/'min' arrays; their length is the joint count
elif hasattr(action_space, 'max') and isinstance(action_space.max, list):
    number_of_joints_in_config = len(action_space.max)
# Additional fallback cases...
```
Issue 2: Device Mismatch on Modal Server
After fixing the joint count, a new error appeared:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Root Cause:
- Model inference happens on Modal GPU server (remote)
- Some model components loaded on CPU, others on GPU
- Issue occurs in VLLN (Vision-Language Layer Norm) component
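While the server side is out of reach, this class of failure is easy to detect when you do have the model object. A minimal diagnostic sketch (the toy `nn.Sequential` stands in for the real GR00T modules):

```python
import torch.nn as nn

def param_devices(model: nn.Module) -> set:
    """Report every device hosting a parameter or buffer of the model.

    More than one entry reproduces the "Expected all tensors to be on
    the same device" failure mode described above.
    """
    devs = {str(p.device) for p in model.parameters()}
    devs |= {str(b.device) for b in model.buffers()}
    return devs

# Toy model; on a GPU box you would call model.to("cuda") and expect {'cuda:0'}.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4), nn.Linear(4, 2))
devices = param_devices(model)
```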
Attempted Fix: Added retry logic with exponential backoff to handle transient device issues:
```python
max_retries = 3
retry_delay = 1.0  # seconds

for retry_attempt in range(max_retries):
    try:
        actions = self(inputs)
        break  # Success
    except RuntimeError as e:
        if "Expected all tensors to be on the same device" in str(e):
            if retry_attempt < max_retries - 1:
                logger.warning(
                    f"Device mismatch error on attempt "
                    f"{retry_attempt + 1}/{max_retries}. Retrying in {retry_delay}s..."
                )
                await asyncio.sleep(retry_delay)
                retry_delay *= 2  # Exponential backoff
```
Status: This helped with transient issues but didn’t solve the root cause, which is on the Modal server side and not fixable from the client.
Alternative Solution: Local Inference
Since the PhosphoBot Modal server had device mismatch issues, I implemented a local inference solution using official Isaac-GR00T scripts.
Architecture: Client-Server Model
Based on NVIDIA’s official tutorial, the solution uses a two-terminal approach:
Terminal 1: Inference Server
- Loads GR00T model on local GPU
- Runs inference on observations
- Returns action predictions
- Uses ZMQ protocol for fast communication
Terminal 2: Robot Client
- Connects to SO-100 robot via USB
- Captures camera images
- Sends observations to server
- Executes returned actions
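The exchange between the two terminals can be sketched as a request/reply loop. Plain TCP sockets stand in for ZMQ here so the example is dependency-free, and the message schema (`num_joints`, `joints`) is illustrative, not the actual Isaac-GR00T wire format:

```python
import json
import socket
import threading

def handle_one(srv: socket.socket) -> None:
    """Toy inference server: accept one observation, reply with a dummy action."""
    conn, _ = srv.accept()
    obs = json.loads(conn.recv(4096))
    # A real server would run GR00T inference on the observation here.
    conn.sendall(json.dumps({"joints": [0.0] * obs["num_joints"]}).encode())
    conn.close()

def request_action(port: int, num_joints: int = 6) -> dict:
    """Toy robot client: send an observation, block for the predicted action."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        conn.sendall(json.dumps({"num_joints": num_joints}).encode())
        return json.loads(conn.recv(4096))

# Bind first so the client can never race the listener.
srv = socket.create_server(("127.0.0.1", 0))
port = srv.getsockname()[1]
server = threading.Thread(target=handle_one, args=(srv,))
server.start()
action = request_action(port)
server.join()
srv.close()
```

The REQ/REP pattern keeps the client strictly lock-step with the server: one observation out, one action batch back, which matches the control-loop structure below.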
Implementation
Server Script (start_groot_server.sh):
```bash
#!/bin/bash
cd /home/vipin/Isaac-GR00T
conda activate gr00t
python scripts/inference_service.py \
    --server \
    --model_path "phospho-app/gr00t-example_dataset-h9g75u7gak" \
    --embodiment-tag "new_embodiment" \
    --data-config "so100_dualcam" \
    --denoising-steps 4 \
    --port 5555
```
Client Script (gr00t_inference_local.py):
- Based on the official examples/SO-100/eval_lerobot.py
- Connects to the SO-100 robot using LeRobot
- Initializes cameras (IDs 0 and 2)
- Connects to GR00T inference server
- Runs inference loop at 30 FPS
- Handles action horizon queuing
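The action-horizon queuing step can be sketched as follows. `predict()` is a hypothetical stand-in for a call to the inference server, which returns a horizon of several future actions per inference call:

```python
from collections import deque

def predict(step: int, horizon: int = 4) -> list:
    """Hypothetical stand-in for server inference: returns `horizon`
    future 6-joint actions starting at `step`."""
    return [[float(step + k)] * 6 for k in range(horizon)]

def control_loop(num_steps: int = 10) -> list:
    """Refill a FIFO of actions from predict() only when it runs dry,
    so the 30 FPS loop does not pay inference latency on every tick."""
    queue: deque = deque()
    executed = []
    for step in range(num_steps):
        if not queue:
            queue.extend(predict(step))   # one inference per horizon
        executed.append(queue.popleft())  # execute the next queued action
    return executed

actions = control_loop()
```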
Configuration:
```python
# Robot
ROBOT_TYPE = "so100_follower"
ROBOT_PORT = "/dev/ttyACM0"
ROBOT_ID = "so-100"

# Cameras
CAMERA_CONFIGS = {
    "front": {"type": "opencv", "index_or_path": 0, "width": 640, "height": 480, "fps": 30},
    "wrist": {"type": "opencv", "index_or_path": 2, "width": 640, "height": 480, "fps": 30},
}

# Task
LANG_INSTRUCTION = "pick up the striped box and put it into the white plate"
```
Advantages of Local Inference
- No Modal server dependency: Runs entirely locally
- Full control over device placement: All tensors on GPU
- Easier debugging: Direct access to model and logs
- Lower latency: No network round-trip to Modal
- Official NVIDIA approach: Based on their tutorial
- More reliable: Better for production deployment
Disk Space Management
During the debugging process, I ran into disk space issues (a 96GB disk at 100% capacity) and performed a cleanup.
Actions taken:
- Cleaned the conda package cache: conda clean --all --yes (freed 2.3GB)
- Removed the HuggingFace XET cache: rm -rf ~/.cache/huggingface/xet (freed 6.0GB)
- Removed old LeRobot dataset versions (freed 1.3GB)
Result: Freed ~10GB total, bringing disk usage down to 89%.
Technical Notes
Model Configuration Debugging
- Always add debug logging before making assumptions
- Check data structures - don’t assume dictionary structure
- Handle multiple cases - model configs can vary
- Verify on both sides - client and server must agree on config
PhosphoBot + Modal Limitations
- Modal server is a black box - can’t fix server-side issues
- Device placement errors on remote server are hard to debug
- Network latency adds overhead
- Dependency on external service
Direct Inference Requirements
- Local GPU with sufficient VRAM (16GB for GR00T N1.5)
- Isaac-GR00T repository installed
- LeRobot for robot control
- Proper conda environment setup
Current Status
- Joint count mismatch fixed - correctly reads 6 joints from model config
- Debug logging added for comprehensive troubleshooting
- Device mismatch on Modal has retry logic but root cause remains on server
- Alternative local inference solution implemented and ready for testing
- Disk space cleaned - freed 10GB
Usage Commands
Start inference server:
```bash
cd /home/vipin/phosphobot
./start_groot_server.sh
```
Run robot client (separate terminal):
```bash
cd /home/vipin/phosphobot
conda activate gr00t
python gr00t_inference_local.py
```
Summary
This debugging session involved systematic troubleshooting of AI model inference issues, from configuration parsing problems to device placement errors on remote servers. The solution involved implementing a local inference architecture that provides better control and reliability for robotic manipulation tasks.
The local approach eliminates dependencies on external services and provides the foundation for more robust robotic AI applications.
Next: Testing the local inference system with actual robot manipulation tasks.
Model: NVIDIA GR00T N1.5 (3B parameters)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Software: Isaac-GR00T, LeRobot, PhosphoBot
Vipin M