
Fine-Tuning GR00T N1.5 for SO-100 Robot Arm Manipulation

A project log for ChefMate - AI LeRobot Arm

Robotic arm workflow with the NVIDIA GR00T N1.5 model: dataset recording, fine-tuning, debugging, and deployment for pick-and-place tasks.

Vipin M, 10/05/2025 at 15:59

Project Overview

I worked on fine-tuning NVIDIA’s GR00T N1.5 model for controlling an SO-100 robotic arm. The project involved dataset preparation, memory optimization for 16GB VRAM constraints, model training with LoRA techniques, and deployment setup for real-world robot control.

The goal was to train the model to perform pick-and-place manipulation tasks using the instruction “pick up the striped box and put it into the white plate” with dual-camera visual input.

Hardware Setup

Dataset Preparation and Debugging

Issue 1: Blank Visualization Plots

The dataset visualization script displayed blank canvases for state/action plots.

Root Cause: The script had hardcoded humanoid robot keys (left_arm, right_arm, left_hand, right_hand), while the SO-100 dataset uses different keys (single_arm, gripper).

Solution: Modified the visualization function to auto-detect keys from the dataset:

# Before: hardcoded humanoid keys
shared_keys = ["left_arm", "right_arm", "left_hand", "right_hand"]

# After: auto-detect from dataset
if shared_keys is None:
    shared_keys = [key.replace("state.", "") for key in state_dict.keys()]
    print(f"Auto-detected keys to plot: {shared_keys}")

Issue 2: Camera Mapping Discrepancy

The visualization showed the wrist camera perspective when it should have shown the scene camera.

Investigation: Checked the dataset’s modality.json mappings and discovered that during data collection, the camera naming was swapped:

Solution: Corrected the mappings in modality.json:

"video": {
    "front": {"original_key": "observation.images.secondary_0"},  // Scene camera
    "wrist": {"original_key": "observation.images.main"}          // Wrist camera
}

Verification: Created a diagnostic script that confirmed the mapping correction by comparing raw video frames with dataset loader output.
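The core of that diagnostic is a per-frame comparison. A minimal sketch of the idea, with the actual frame loading (OpenCV / dataset loader) omitted and the tolerance chosen arbitrarily:

```python
import numpy as np

def frames_match(raw_frame, loader_frame, tol=2.0):
    """Compare a raw video frame with the dataset loader's output.

    Frames are HxWx3 uint8 arrays; a small mean absolute difference
    tolerates codec noise. If the wrist/front mapping were still
    swapped, the difference would be large.
    """
    diff = np.abs(raw_frame.astype(np.float32) - loader_frame.astype(np.float32))
    return diff.mean() <= tol
```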

Issue 3: Missing Video Metadata

Dataset loading failed due to missing video metadata fields.

Solution: Added the required fields to info.json:

info['features'][key]['info']['video.channels'] = 3
info['features'][key]['info']['video.height'] = 720
info['features'][key]['info']['video.width'] = 1280
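A small helper can apply that fix across all camera streams at once. This is a sketch assuming the LeRobot-style layout where camera features are keyed observation.images.* (the function name and usage paths are mine, not from the repo):

```python
import json

def patch_video_metadata(info, height=720, width=1280):
    """Add the video metadata fields the dataset loader expects.

    Operates on a parsed info.json dict; camera keys are assumed to
    start with "observation.images." (true for this dataset layout).
    """
    for key, feat in info["features"].items():
        if key.startswith("observation.images."):
            feat.setdefault("info", {}).update(
                {"video.channels": 3, "video.height": height, "video.width": width}
            )
    return info

# Usage (path illustrative):
# with open("meta/info.json") as f:
#     info = json.load(f)
# patch_video_metadata(info)
# with open("meta/info.json", "w") as f:
#     json.dump(info, f, indent=2)
```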

Memory Optimization Challenge

The Problem: CUDA Out of Memory

Initial training attempts all failed with out-of-memory errors, even with very small batch sizes:

Attempt   Batch Size   Gradient Accum   Result
-------   ----------   --------------   ------
1         64           2                OOM at step 0
2         32           4                OOM at step 0
3         16           8                OOM at step 0
4         8            16               OOM at step 0
5         4            32               OOM at step 0
6         2            64               OOM during optimizer step

Analysis: The base model has 3B parameters, plus a 550M parameter diffusion model. The Adam optimizer requires 2x memory for momentum and variance states, exceeding the 16GB VRAM limit.
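A back-of-envelope estimate makes the failure unsurprising. Assuming bf16 weights and gradients plus Adam's two fp32 moment buffers (these precisions are assumptions, not measured values), and ignoring activations entirely:

```python
# Rough VRAM estimate for full fine-tuning: weights + gradients +
# Adam's two fp32 moment tensors, excluding activations.
GB = 1024 ** 3

def full_finetune_vram_gb(n_params, weight_bytes=2, grad_bytes=2, adam_state_bytes=8):
    return n_params * (weight_bytes + grad_bytes + adam_state_bytes) / GB

print(f"{full_finetune_vram_gb(3e9):.1f} GB")  # far above a 16 GB card
```

Even under these optimistic assumptions, a 3B parameter model needs roughly twice the available 16GB before a single activation is stored, which is why shrinking the batch size never helped.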

Solution: LoRA Fine-Tuning

Implemented Low-Rank Adaptation (LoRA) to reduce trainable parameters:

LoRA Configuration:

--lora-rank 32          # Size of low-rank adaptation matrices
--lora-alpha 64         # Scaling factor (typically 2x rank)
--lora-dropout 0.1      # Regularization
--no-tune_diffusion_model  # Freeze 550M parameter diffusion model
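These flags map onto the standard LoRA formulation: a frozen weight matrix W gets a trainable low-rank update scaled by alpha/rank. A from-scratch NumPy sketch of the idea (dimensions and initialization are illustrative, not the Isaac-GR00T internals):

```python
import numpy as np

# Minimal LoRA sketch: y = x W^T + (alpha/r) * x A^T B^T,
# where only A and B are trained.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 1024, 1024, 32, 64

W = rng.standard_normal((d_out, d_in))     # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, rank r
B = np.zeros((d_out, r))                   # trainable, zero-init so the
                                           # update starts as a no-op

def lora_forward(x):
    return x @ W.T + (alpha / r) * x @ A.T @ B.T

trainable = A.size + B.size
print(f"trainable fraction of this layer: {trainable / (trainable + W.size):.1%}")
```

With rank 32 against a 1024-wide layer, the trainable parameters are only a few percent of the frozen ones, which is where the memory savings come from: gradients and Adam states are only needed for A and B.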

Memory Savings:

Training Configuration and Results

Final Training Setup

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/example_dataset/ \
    --num-gpus 1 \
    --output-dir ./so100-checkpoints \
    --max-steps 5000 \
    --data-config so100_dualcam \
    --batch-size 16 \
    --gradient-accumulation-steps 8 \
    --learning-rate 0.0001 \
    --no-tune_diffusion_model \
    --lora-rank 32 \
    --lora-alpha 64 \
    --lora-dropout 0.1
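With batch size 16 and 8 accumulation steps, the effective batch size is 128. A toy NumPy check that accumulated micro-batch gradients match the full-batch gradient (illustrative of the mechanism, not the GR00T training loop):

```python
import numpy as np

# Gradient accumulation: 8 micro-batches of 16 give the same gradient
# as one batch of 128, so peak memory stays low while the effective
# batch size is 128. Demonstrated here on an MSE loss.
rng = np.random.default_rng(0)
w = rng.standard_normal(4)
X, y = rng.standard_normal((128, 4)), rng.standard_normal(128)

def grad_mse(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(X)

accum = sum(grad_mse(w, X[i:i + 16], y[i:i + 16]) for i in range(0, 128, 16)) / 8
full = grad_mse(w, X, y)
assert np.allclose(accum, full)
```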

Key Parameters:

Training Progress

Loss Trajectory:

Step    Loss    Change from Previous
----    ----    --------------------
  500   0.080   (baseline)
1,000   0.050   -37.5% (strong improvement)
1,500   0.040   -20.0% (good improvement)
2,000   0.040    0.0%  (plateau started)
2,340   0.038   -5.0%  (minimal improvement)

Convergence Analysis: The loss plateaued around steps 1500-2000, with minimal improvement over the final 840 steps. Training was stopped early to avoid overfitting on the small dataset.
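The per-checkpoint changes in the table follow directly from the logged losses:

```python
def loss_change(prev, last):
    """Relative change in loss between checkpoints (negative = improvement)."""
    return (last - prev) / prev

losses = [0.080, 0.050, 0.040, 0.040, 0.038]
for prev, last in zip(losses, losses[1:]):
    print(f"{loss_change(prev, last):+.1%}")
# prints -37.5%, -20.0%, +0.0%, -5.0%, matching the table
```

The stopping decision above amounts to halting once this relative change stays within a few percent; the exact threshold was a judgment call, not an automated rule.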

Training Metrics:

Model Evaluation

Open-Loop Evaluation Results

Evaluated the trained model using the official evaluation script:

python scripts/eval_policy.py --plot \
    --embodiment-tag new_embodiment \
    --model-path ./so100-checkpoints/checkpoint-2000 \
    --data-config so100_dualcam \
    --dataset-path ./demo_data/example_dataset/

Result: Unnormalized MSE of 0.017463

Performance Assessment:

The model predictions closely match ground truth actions, indicating readiness for real robot deployment.
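"Unnormalized" here means the error is computed in raw action units (joint angles and gripper state) rather than in the normalized space the model trains in. A rough stand-in for the metric (the exact eval_policy.py computation may differ):

```python
import numpy as np

def unnormalized_mse(pred_actions, gt_actions):
    """Mean squared error between predicted and ground-truth actions,
    in raw (denormalized) action units."""
    pred, gt = np.asarray(pred_actions), np.asarray(gt_actions)
    return float(np.mean((pred - gt) ** 2))
```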

Deployment Setup

Device Configuration

Verified robot and camera device mappings:

/dev/follower -> ttyACM4  # Robot motor bus
/dev/wrist -> video0      # Wrist/gripper camera
/dev/scene -> video2      # Scene/front camera
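Before launching the client, a quick script can confirm the symlinks resolve as expected (a sketch; the symlinks themselves come from udev rules not shown here, and the helper is mine):

```python
import os

# Expected persistent device mapping, matching the table above.
EXPECTED = {
    "/dev/follower": "ttyACM4",  # robot motor bus
    "/dev/wrist": "video0",      # wrist/gripper camera
    "/dev/scene": "video2",      # scene/front camera
}

def check_devices(expected=EXPECTED):
    """Return a list of problems; empty means all symlinks resolve correctly."""
    problems = []
    for link, target in expected.items():
        if not os.path.islink(link):
            problems.append(f"{link}: missing symlink")
        elif os.path.basename(os.path.realpath(link)) != target:
            problems.append(f"{link}: points at {os.path.realpath(link)}")
    return problems
```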

Client-Server Architecture

Terminal 1 - Inference Server:

python scripts/inference_service.py --server \
    --model-path ./so100-checkpoints/checkpoint-2000 \
    --embodiment-tag new_embodiment \
    --data-config so100_dualcam \
    --denoising-steps 4 \
    --port 5555

Terminal 2 - Robot Client:

python ./examples/SO-100/eval_lerobot.py \
    --robot.type=so100_follower \
    --robot.port=/dev/follower \
    --robot.id=my_so100_arm \
    --robot.cameras="{ wrist: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 2, width: 640, height: 480, fps: 30}}" \
    --policy_host=localhost \
    --lang_instruction="pick up the striped box and put it into the white plate"

Issues Fixed

  1. Import Error: Fixed incorrect import path in the robot client script
  2. Missing Dependency: Installed feetech-servo-sdk for robot communication

Technical Notes

Memory Optimization Strategies

Successful approaches:

Unsuccessful approaches:

Training Efficiency Insights

Dataset Quality Factors

Current Status

Summary

This project successfully fine-tuned a 3B parameter vision-language-action model for robotic manipulation within 16GB VRAM constraints. The key breakthrough was using LoRA fine-tuning to reduce memory requirements while maintaining training effectiveness.

The trained model achieved excellent evaluation metrics (MSE: 0.017) and is ready for real-world deployment. The systematic approach to dataset debugging, memory optimization, and deployment setup provides a foundation for future robotic AI projects.

Next: Testing the trained model on physical robot manipulation tasks.

Model: NVIDIA GR00T N1.5 (3B parameters)
Training Method: LoRA fine-tuning (rank 32)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Framework: Isaac-GR00T + LeRobot
