
Debugging GR00T N1.5 Inference in Phosphobot

A project log for ChefMate - AI LeRobot Arm

Robotic arm workflow with the NVIDIA GR00T N1.5 model. Dataset recording, fine-tuning, debugging, and deployment for pick-and-place tasks.

Vipin M • 10/05/2025 at 15:55 • 0 Comments

Project Overview

I worked on debugging inference issues with a fine-tuned NVIDIA GR00T N1.5 model for controlling an SO-100 robotic arm. The model was trained successfully and uploaded to HuggingFace Hub, but the PhosphoBot browser interface was failing during AI control activation.

This project log documents the debugging process, root cause analysis, and the implementation of an alternative local inference solution.

Hardware Setup

The Problem

When clicking “AI Control” in the PhosphoBot browser interface, the system reported:

Exception: No robot connected. Exiting AI control loop.

The robot was physically connected and visible in the UI, cameras were streaming successfully, and the model had been trained and uploaded to HuggingFace Hub. The issue appeared to be in the inference pipeline.

Debugging Process

Issue 1: Joint Count Mismatch

Added debug logging to understand the failure and discovered:

Connected joints: 6, Config joints: 1

Root Cause: The code was reading the model configuration incorrectly:

# Incorrect code
number_of_joints_in_config = len(
    config.embodiment.statistics.action.action_space.values()
)

This was counting dictionary entries (max, min, mean, std, q01, q99) instead of joint dimensions.

Model Config Structure:

{
  "action_space": {
    "action_space": 6
  }
}
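The mismatch is easy to reproduce in isolation. A minimal sketch, using a hypothetical dict that mirrors the nested structure above:

```python
# Hypothetical config fragment mirroring the nested structure above
action_space = {"action_space": 6}

# Buggy parsing: len() over the dict's values counts entries (1),
# not joint dimensions -- hence "Config joints: 1" in the logs
buggy_count = len(action_space.values())

# Correct parsing: read the nested value, which is the joint count
fixed_count = action_space["action_space"]

print(buggy_count, fixed_count)  # 1 6
```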

Solution: Handle the nested dictionary structure correctly:

# Fixed code
action_space = config.embodiment.statistics.action.action_space

# Case 1: action_space is a dict with an 'action_space' key containing the number
if isinstance(action_space, dict) and 'action_space' in action_space:
    number_of_joints_in_config = action_space['action_space']
# Case 2: action_space has 'max' or 'min' arrays
elif hasattr(action_space, 'max') and isinstance(action_space.max, list):
    number_of_joints_in_config = len(action_space.max)
# Additional fallback cases...
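With the joint count parsed correctly, the check that originally surfaced the mismatch can be written as a simple guard before starting the control loop. A hedged sketch (the function name and message format are illustrative, not PhosphoBot's actual API):

```python
# Illustrative sanity check: compare physically connected joints against
# the model config before starting AI control, mirroring the debug log
# "Connected joints: 6, Config joints: 1" that exposed this bug.
def check_joint_count(connected_joints: int, config_joints: int) -> None:
    if connected_joints != config_joints:
        raise ValueError(
            f"Connected joints: {connected_joints}, "
            f"Config joints: {config_joints} -- refusing to start AI control"
        )
```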

Issue 2: Device Mismatch on Modal Server

After fixing the joint count, a new error appeared:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Root Cause:

Attempted Fix: Added retry logic with exponential backoff to handle transient device issues:

max_retries = 3
retry_delay = 1.0  # seconds

for retry_attempt in range(max_retries):
    try:
        actions = self(inputs)
        break  # Success
    except RuntimeError as e:
        if "Expected all tensors to be on the same device" in str(e):
            if retry_attempt < max_retries - 1:
                logger.warning(
                    f"Device mismatch error on attempt "
                    f"{retry_attempt + 1}/{max_retries}. "
                    f"Retrying in {retry_delay}s..."
                )
                await asyncio.sleep(retry_delay)
                retry_delay *= 2  # Exponential backoff
            else:
                raise  # Out of retries
        else:
            raise  # Unrelated error, don't retry

Status: This helped with transient issues but didn’t solve the root cause, which is on the Modal server side and not fixable from the client.
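For reference, the usual fix for this class of error is to normalize every input tensor onto the model's device before the forward pass. Since the actual fix here lives on the Modal server, the sketch below is illustrative only; `inputs` is a hypothetical dict of model inputs:

```python
import torch

# Hedged sketch: move every tensor in an input batch onto one device
# before calling the model, leaving non-tensor values untouched.
def move_to_device(inputs: dict, device: str) -> dict:
    return {
        key: value.to(device) if isinstance(value, torch.Tensor) else value
        for key, value in inputs.items()
    }
```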

Alternative Solution: Local Inference

Since the PhosphoBot Modal server had device mismatch issues, I implemented a local inference solution using official Isaac-GR00T scripts.

Architecture: Client-Server Model

Based on NVIDIA’s official tutorial, the solution uses a two-terminal approach:

Terminal 1: Inference Server

Terminal 2: Robot Client
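Each control step, the client packages camera frames and joint states into an observation and sends it to the server, which returns an action chunk. A hedged sketch of that payload; the exact keys come from the model's modality config, and the names below (video.front, state.single_arm, etc.) are assumptions based on the so100_dualcam data config:

```python
import numpy as np

# Hypothetical observation payload builder for a dual-camera SO-100 setup.
# Batch dimension of 1 is added via np.newaxis; key names are assumptions.
def build_observation(front_img, wrist_img, joints, instruction):
    return {
        "video.front": front_img[np.newaxis],           # (1, H, W, 3) uint8
        "video.wrist": wrist_img[np.newaxis],
        "state.single_arm": joints[np.newaxis, :5],     # first 5 joints
        "state.gripper": joints[np.newaxis, 5:],        # gripper joint
        "annotation.human.task_description": [instruction],
    }
```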

Implementation

Server Script (start_groot_server.sh):

#!/bin/bash
cd /home/vipin/Isaac-GR00T
conda activate gr00t

python scripts/inference_service.py \
    --server \
    --model_path "phospho-app/gr00t-example_dataset-h9g75u7gak" \
    --embodiment-tag "new_embodiment" \
    --data-config "so100_dualcam" \
    --denoising-steps 4 \
    --port 5555

Client Script (gr00t_inference_local.py):

Configuration:

# Robot
ROBOT_TYPE = "so100_follower"
ROBOT_PORT = "/dev/ttyACM0"
ROBOT_ID = "so-100"

# Cameras
CAMERA_CONFIGS = {
    "front": {"type": "opencv", "index_or_path": 0, "width": 640, "height": 480, "fps": 30},
    "wrist": {"type": "opencv", "index_or_path": 2, "width": 640, "height": 480, "fps": 30},
}

# Task
LANG_INSTRUCTION = "pick up the striped box and put it into the white plate"
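The client's main loop ties this configuration together: one server query returns a chunk of actions, which are then executed at a fixed rate. A hedged sketch of that pattern; `get_action_chunk` and `send_joints` are placeholders for the real server and robot calls:

```python
import time

CONTROL_HZ = 30  # matches the 30 fps camera streams above

# Illustrative control loop: fetch one action chunk, then execute each
# step while sleeping off the remainder of the control period.
def run_control_loop(get_action_chunk, send_joints, max_steps=16):
    period = 1.0 / CONTROL_HZ
    actions = get_action_chunk()   # one inference -> chunk of actions
    executed = 0
    for action in actions[:max_steps]:
        start = time.perf_counter()
        send_joints(action)
        executed += 1
        time.sleep(max(0.0, period - (time.perf_counter() - start)))
    return executed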

Advantages of Local Inference

Disk Space Management

During the debugging process, I encountered disk space issues (the 96 GB disk was at 100% capacity) and performed a cleanup.

Actions taken:

Result: Freed ~10GB total, bringing usage down to 89%.
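Before deleting anything, it helps to see what is actually eating the disk. A generic sketch of the usual diagnostic commands, not the exact cleanup performed here:

```shell
# Overall usage of the root filesystem
df -h /

# Largest top-level directories under $HOME, biggest first
du -h --max-depth=1 "$HOME" 2>/dev/null | sort -hr | head -10
```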

Technical Notes

Model Configuration Debugging

  1. Always add debug logging before making assumptions
  2. Check data structures - don’t assume dictionary structure
  3. Handle multiple cases - model configs can vary
  4. Verify on both sides - client and server must agree on config

PhosphoBot + Modal Limitations

Direct Inference Requirements

Current Status

Usage Commands

Start inference server:

cd /home/vipin/phosphobot
./start_groot_server.sh

Run robot client (separate terminal):

cd /home/vipin/phosphobot
conda activate gr00t
python gr00t_inference_local.py

Summary

This debugging session involved systematic troubleshooting of AI model inference issues, from configuration parsing problems to device placement errors on remote servers. The solution involved implementing a local inference architecture that provides better control and reliability for robotic manipulation tasks.

The local approach eliminates dependencies on external services and provides the foundation for more robust robotic AI applications.

Next: Testing the local inference system with actual robot manipulation tasks.

Model: NVIDIA GR00T N1.5 (3B parameters)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Software: Isaac-GR00T, LeRobot, PhosphoBot
