
Debugging Robot “Twitching” in GR00T N1.5 Deployment

A project log for ChefMate - AI LeRobot Arm

Robotic arm workflow with the NVIDIA GR00T N1.5 model: dataset recording, fine-tuning, debugging, and deployment for pick-and-place tasks.

Vipin M · 10/05/2025 at 16:13

Project Overview

I worked on debugging a puzzling issue where a fine-tuned NVIDIA GR00T N1.5 model was causing an SO-100 robotic arm to “twitch” instead of performing pick-and-place tasks. The robot would make tiny oscillating movements around the same position, with the gripper staying completely unresponsive.

This project log documents the systematic debugging process that revealed the root cause: an undertrained model that needed significantly more training steps to learn the complete task sequence.

Hardware Setup

The arm is an SO-100, driven from a workstation with an RTX 4080 Super, using the Isaac-GR00T and LeRobot frameworks for training and deployment (full specs are listed at the end of this log).

The Problem: Robot Twitching

Initial Symptoms

When deploying the trained GR00T model:

  - The arm made tiny oscillating movements around the same position instead of moving toward the target
  - The gripper remained completely unresponsive
  - No part of the pick-and-place sequence was ever completed

The model had been trained for 2000 steps and showed good loss convergence, but the physical deployment was completely unsuccessful.

Debugging Approach

Step 1: Enhanced Logging Implementation

Added comprehensive logging to both the inference server and robot client to understand what data was being exchanged.

Server-Side Logging (service.py): logs each request's endpoint, inference time, and per-key statistics of the returned actions (code shown under Technical Implementation Details below).

Client-Side Logging (eval_lerobot.py): logs the current robot state at each step and the first action sent to the robot (code likewise shown below).

Example Output:

[Request #1] Endpoint: get_action
  Inference time: 75.23ms
  Response keys: ['action.single_arm', 'action.gripper']
    action.single_arm: shape=(16, 5), min=-45.23, max=67.89, mean=12.34
    action.gripper: shape=(16, 1), min=-0.30, max=0.50, mean=0.15
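
Each response is an action chunk: the (16, 5) array holds 16 future timesteps for the 5 arm joints, and the (16, 1) array is the matching gripper channel.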

[CLIENT] First action to send to robot:
    shoulder_pan.pos: -12.34

Step 2: Diagnostic Tools Development

Created several diagnostic scripts to isolate the issue:

Joint Testing Tool (test_joint.py): commands individual joints through small, known motions to verify that the motors and driver stack respond correctly, independent of the model (a sketch of the idea follows below).

Robot State Monitor (monitor_robot_state.py): continuously reads and prints joint positions so that commanded actions can be compared against the robot's actual state in real time.
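
To illustrate the joint-test idea, here is a minimal sketch. The robot interface (send_action, get_observation) is a hypothetical stand-in for whatever driver API the arm exposes, not the actual LeRobot API:

import time

def sweep_joint(robot, joint: str, center: float, amplitude: float = 5.0, steps: int = 20):
    """Drive one joint through a small triangle wave around `center` and
    report how closely the measured position tracks each command."""
    worst_error = 0.0
    for i in range(steps):
        phase = i / (steps - 1)
        # Triangle wave: ramp up for the first half of the sweep, back down after.
        offset = amplitude * (2 * phase if phase < 0.5 else 2 * (1 - phase))
        target = center + offset
        robot.send_action({f"{joint}.pos": target})
        time.sleep(0.1)  # let the servo settle before reading back
        measured = robot.get_observation()[f"{joint}.pos"]
        worst_error = max(worst_error, abs(measured - target))
        print(f"{joint}: target={target:7.2f}  measured={measured:7.2f}")
    print(f"worst tracking error on {joint}: {worst_error:.2f}")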

Step 3: Dataset Visualization

Uploaded the dataset to Hugging Face Hub and used Rerun visualization to inspect the recorded episodes:

# Upload dataset for analysis
python scripts/so100_groot/upload_to_huggingface.py \
    --local-dir ~/.cache/huggingface/lerobot/rubbotix/striped-block \
    --repo-id sparkmt/so100-striped-block

# Visualize episodes
./scripts/so100_groot/visualize_episodes.sh 0

This revealed the difference between State (robot’s actual position) and Action (commanded target position), which was crucial for diagnosis.
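
The same State-vs-Action comparison can also be done numerically. A minimal sketch, assuming the standard LeRobot parquet layout with 'observation.state' and 'action' columns (the episode path is illustrative):

import numpy as np
import pandas as pd

# Illustrative path; adjust to the actual dataset layout on disk.
episode = pd.read_parquet("striped-block/data/chunk-000/episode_000000.parquet")

state = np.stack(episode["observation.state"].to_numpy())  # where the robot actually was
action = np.stack(episode["action"].to_numpy())            # where it was told to go

gap = np.abs(action - state)
print("mean |action - state| per joint:", gap.mean(axis=0).round(2))
print("max  |action - state| per joint:", gap.max(axis=0).round(2))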

Critical Discovery: The Root Cause

Key Finding from Logs

The robot was making very small, uncertain movements instead of decisive actions. The logs confirmed why: the model's output actions had consistently tiny magnitudes across all joints, a sign of high uncertainty.
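
A check like this can be automated on the logged action chunks. A minimal sketch, assuming the chunks are available as NumPy arrays (the 5% threshold is an illustrative heuristic, not a calibrated value):

import numpy as np

def looks_undertrained(action_chunk: np.ndarray, joint_range: float, threshold: float = 0.05) -> bool:
    """Flag an action chunk whose magnitudes are tiny relative to the joint
    range seen in training -- a common signature of an uncertain policy."""
    return float(np.abs(action_chunk).mean()) < threshold * joint_range

# Example: a (16, 5) chunk of near-zero commands for joints spanning ~90 degrees.
chunk = np.random.uniform(-0.5, 0.5, size=(16, 5))
print(looks_undertrained(chunk, joint_range=90.0))  # True: mean |a| ~0.25, far below 4.5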

The Root Cause: Undertrained Model

Analysis revealed that the model was severely undertrained at 2000 steps.

Evidence:

  1. Tiny action magnitudes: Model outputting very small actions due to high uncertainty
  2. Lack of task structure understanding: Model hadn’t learned the full sequence (approach → grasp → lift → move → release)
  3. Closed-loop instability: Small errors accumulating, causing the robot to end up in states the model never saw during training (the toy simulation below shows how this compounds)
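
To make the third point concrete, here is a toy simulation (not robot data) of how uncorrected per-step errors build up when the policy's corrective actions are too timid:

import numpy as np

rng = np.random.default_rng(0)

def mean_final_error(correction_gain: float, steps: int = 200, trials: int = 100) -> float:
    """Average final tracking error of a 1-D joint disturbed by noise each step.
    A confident policy (gain near 1) cancels the error each step; a timid,
    undertrained policy (tiny actions, gain near 0) lets it accumulate."""
    errors = []
    for _ in range(trials):
        position = 0.0
        for _ in range(steps):
            position += rng.normal(0, 0.2)           # per-step disturbance
            position -= correction_gain * position   # policy's corrective action
        errors.append(abs(position))
    return float(np.mean(errors))

print(f"confident policy (gain 0.9):  mean final |error| = {mean_final_error(0.9):.2f}")
print(f"timid policy     (gain 0.05): mean final |error| = {mean_final_error(0.05):.2f}")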

The Solution: Extended Training

Training Requirements Analysis

| Task Complexity | Minimum Steps | Recommended Steps |
| --- | --- | --- |
| Simple reaching | 1,000-2,000 | 5,000 |
| Pick and place | 5,000-10,000 | 10,000-20,000 |
| Complex manipulation | 10,000-20,000 | 20,000-50,000 |

The pick-and-place task required 10,000-20,000 steps, not the 2000 steps initially used.

Training Configuration Update

Updated the training script to resume from checkpoint-2000 and continue to 10,000 steps:

# Resume training configuration
RESUME_TRAINING="true"
MAX_STEPS=10000  # Increased from 2000
BATCH_SIZE=16
GRADIENT_ACCUMULATION_STEPS=8
LORA_RANK=32
LORA_ALPHA=64
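
With gradient accumulation, each optimizer update effectively sees BATCH_SIZE × GRADIENT_ACCUMULATION_STEPS = 16 × 8 = 128 samples.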

Automatic Checkpoint Detection:

if [ "$RESUME_TRAINING" = "true" ]; then    if ls "$OUTPUT_DIR"/checkpoint-* 1> /dev/null 2>&1; then        LATEST_CHECKPOINT=$(ls -td "$OUTPUT_DIR"/checkpoint-* | head -1)        echo "Resuming from latest checkpoint: ${LATEST_CHECKPOINT}"        TRAIN_CMD="$TRAIN_CMD --resume"    fi
fi

Why 2000 Steps Was Insufficient

1. Model Hadn’t Learned Task Structure

At 2000 steps the model had not yet internalized the full task sequence (approach → grasp → lift → move → release); its outputs reflected an averaged, incomplete policy rather than distinct phases of the task.

2. Action Magnitude Learning

From deployment logs at 2000 steps, the model was outputting very small actions because it had not yet learned confident action distributions; an undertrained policy hedges toward small, averaged motions, which shows up physically as twitching.

3. Closed-Loop Instability

Because those timid actions never corrected the small tracking errors, the errors accumulated step by step, pushing the robot into states the model had never seen during training and making its outputs even less reliable.

Technical Implementation Details

Enhanced Logging Code

Server-side logging addition:

logger.info(f"[Request #{request_counter}] Endpoint: {endpoint}")
logger.info(f"  Data keys: {list(data.keys())}")
logger.info(f"  Inference time: {inference_time:.2f}ms")
for key, value in response.items():
    if isinstance(value, np.ndarray):
        logger.info(f"    {key}: shape={value.shape}, min={value.min():.2f}, max={value.max():.2f}, mean={value.mean():.2f}")

Client-side logging addition:

logger.info(f"[STEP {step_count}] Getting observation...")
logger.info(f"  Current robot state:")
for key, value in current_state.items():
    logger.info(f"    {key}: {value:.2f}")
logger.info(f"[CLIENT] First action to send to robot:")
for key, value in first_action.items():
    logger.info(f"    {key}: {value:.2f}")

Dataset Upload and Visualization

Created tools for dataset management and analysis:

# Upload script for Hugging Face Hub
import os
from huggingface_hub import HfApi

def upload_dataset(local_dir, repo_id):
    # Validate dataset structure before uploading
    required_files = ['meta/info.json', 'meta/stats.json', 'meta/tasks.parquet']
    for file in required_files:
        if not os.path.exists(os.path.join(local_dir, file)):
            raise FileNotFoundError(f"Required file {file} not found")
    # Create repository and upload
    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="dataset")
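
Called with the same arguments as the shell command shown earlier:

upload_dataset(
    os.path.expanduser("~/.cache/huggingface/lerobot/rubbotix/striped-block"),
    "sparkmt/so100-striped-block",
)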

Results and Validation

Training Progress

After resuming training from 2000 to 10,000 steps, the evaluation MSE dropped from ~24 to ~6.3, a 74% improvement.
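
For reference, a minimal sketch of how such an action MSE can be computed offline against held-out episodes (the predict function and episode iterator are placeholders, not the actual training harness):

import numpy as np

def action_mse(predict, episodes):
    # `predict` maps an observation to a predicted action vector;
    # `episodes` yields (observation, recorded_action) pairs from held-out data.
    errors = [np.square(predict(obs) - act).mean() for obs, act in episodes]
    return float(np.mean(errors))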

Deployment Results

With the extended training at 10,000 steps, the twitching disappeared and the robot executed the complete pick-and-place sequence (approach → open → grasp → lift → move → release).

Performance Assessment

The model demonstrates partial success: it completes the full task sequence, but execution remains slow and imprecise, with uneven performance across different joints.

Technical Insights

1. Training Duration is Critical

Good loss convergence is not proof of a usable policy: at 2000 steps the loss looked healthy, yet the model had not learned the task. Pick-and-place manipulation needed on the order of 10,000-20,000 steps.

2. Logging is Essential for Debugging

Per-request statistics on both the inference server and the robot client (action shapes, min/max/mean) exposed the tiny action magnitudes and ruled out hardware or integration faults.

3. Visualization Tools are Invaluable

Rerun playback of the recorded episodes made the State-vs-Action distinction visible and confirmed that the dataset itself was sound.

Current Status

The 10,000-step model executes the full pick-and-place sequence on the physical arm, but execution is still slow and imprecise; additional training episodes are being collected for further iterations.

Summary

This debugging session demonstrated that what appeared to be a complex hardware or software integration issue was actually a fundamental training problem. The “twitching” behavior was caused by an undertrained model that hadn’t learned the complete task structure.

The systematic debugging approach using enhanced logging, diagnostic tools, and dataset visualization was crucial for identifying the root cause. The solution required extending training from 2000 to 10,000 steps, resulting in a 74% improvement in MSE (from ~24 to ~6.3) and enabling the robot to execute the complete pick-and-place sequence.

While the model now performs the full task (approach → open → grasp → lift → move → release), execution remains slow and imprecise, with uneven performance across different joints. This suggests that further data collection and training iterations will be needed to achieve reliable, smooth manipulation.

The project demonstrates the iterative nature of robotic AI development and the importance of adequate training duration for manipulation tasks. The debugging infrastructure and systematic approach provide a foundation for continued improvement.

Next: Collecting additional training episodes and exploring Isaac Sim integration for synthetic data generation.

Model: NVIDIA GR00T N1.5 (3B parameters)
Training Method: LoRA fine-tuning (extended to 10,000 steps)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Framework: Isaac-GR00T + LeRobot
