Project Overview
I worked on fine-tuning NVIDIA’s GR00T N1.5 model for controlling an SO-100 robotic arm. The project involved dataset preparation, memory optimization for 16GB VRAM constraints, model training with LoRA techniques, and deployment setup for real-world robot control.
The goal was to train the model to perform pick-and-place manipulation tasks using the instruction “pick up the striped box and put it into the white plate” with dual-camera visual input.
Hardware Setup
- Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Dataset: 20 episodes, 5,197 frames of manipulation demonstrations
- Model: NVIDIA GR00T N1.5 (3B parameters)
Dataset Preparation and Debugging
Issue 1: Blank Visualization Plots
The dataset visualization script displayed blank canvases for state/action plots.
Root Cause: The script had hardcoded humanoid robot keys (left_arm, right_arm, left_hand, right_hand) while the SO-100 dataset uses different keys (single_arm, gripper).
Solution: Modified the visualization function to auto-detect keys from the dataset:
```python
# Before: hardcoded humanoid keys
shared_keys = ["left_arm", "right_arm", "left_hand", "right_hand"]

# After: auto-detect from dataset
if shared_keys is None:
    shared_keys = [key.replace("state.", "") for key in state_dict.keys()]
    print(f"Auto-detected keys to plot: {shared_keys}")
```
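For the SO-100 dataset the auto-detection resolves to the two keys mentioned above. A quick standalone illustration (not the visualization script itself, just the key-stripping logic):

```python
# Illustrative only: the SO-100 state dict carries "state.single_arm" and
# "state.gripper" instead of the humanoid keys the script expected.
state_dict = {
    "state.single_arm": None,  # placeholder for the 5 arm joint values
    "state.gripper": None,     # placeholder for the gripper value
}

shared_keys = [key.replace("state.", "") for key in state_dict.keys()]
print(shared_keys)  # -> ['single_arm', 'gripper']
```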
Issue 2: Camera Mapping Discrepancy
The visualization showed the wrist camera perspective when it should have shown the scene camera.
Investigation: Checked the dataset’s modality.json mappings and discovered that during data collection, the camera naming was swapped:
- observation.images.main was actually the wrist/gripper camera
- observation.images.secondary_0 was actually the scene camera
Solution: Corrected the mappings in modality.json:
"video": { "front": {"original_key": "observation.images.secondary_0"}, // Scene camera "wrist": {"original_key": "observation.images.main"} // Wrist camera
}
Verification: Created a diagnostic script that confirmed the mapping correction by comparing raw video frames with dataset loader output.
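The diagnostic script isn't reproduced in this log. A minimal sketch of the idea is below (paths are hypothetical and follow a typical LeRobot layout, so adjust them to your dataset): it dumps the first frame of each raw video so the wrist and scene views can be confirmed by eye. The actual script additionally compared these frames against the dataset loader's output.

```python
# Sketch of the camera-mapping check (hypothetical paths -- adjust to your
# dataset layout). Saves the first frame of each raw video for visual
# confirmation of which file is the wrist view and which is the scene view.
import cv2

videos = {
    "observation.images.main": "demo_data/example_dataset/videos/chunk-000/observation.images.main/episode_000000.mp4",
    "observation.images.secondary_0": "demo_data/example_dataset/videos/chunk-000/observation.images.secondary_0/episode_000000.mp4",
}

for key, path in videos.items():
    cap = cv2.VideoCapture(path)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        print(f"{key}: could not read {path}")
        continue
    out = f"first_frame_{key.split('.')[-1]}.png"
    cv2.imwrite(out, frame)
    print(f"{key}: saved {out} -- inspect to confirm wrist vs. scene view")
```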
Issue 3: Missing Video Metadata
Dataset loading failed due to missing video metadata fields.
Solution: Added the required fields to info.json:
```python
info['features'][key]['info']['video.channels'] = 3
info['features'][key]['info']['video.height'] = 720
info['features'][key]['info']['video.width'] = 1280
```
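In practice this was applied to every video feature in the dataset's info.json. A small sketch of that patch (the meta/info.json location assumes a standard LeRobot layout; the height/width values are the ones used in this project):

```python
# Sketch: add the video metadata fields the loader expects to every image
# feature in info.json. Path assumes a standard LeRobot dataset layout.
import json

info_path = "demo_data/example_dataset/meta/info.json"
with open(info_path) as f:
    info = json.load(f)

for key, feature in info["features"].items():
    if key.startswith("observation.images."):
        feature.setdefault("info", {})
        feature["info"]["video.channels"] = 3
        feature["info"]["video.height"] = 720
        feature["info"]["video.width"] = 1280

with open(info_path, "w") as f:
    json.dump(info, f, indent=4)
```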
Memory Optimization Challenge
The Problem: CUDA Out of Memory
Initial training attempts all failed with out-of-memory errors, even with very small batch sizes:
| Attempt | Batch Size | Gradient Accum | Result |
|---|---|---|---|
| 1 | 64 | 2 | OOM at step 0 |
| 2 | 32 | 4 | OOM at step 0 |
| 3 | 16 | 8 | OOM at step 0 |
| 4 | 8 | 16 | OOM at step 0 |
| 5 | 4 | 32 | OOM at step 0 |
| 6 | 2 | 64 | OOM during optimizer step |
Analysis: The base model has 3B parameters, plus a 550M-parameter diffusion model. The Adam optimizer keeps two additional states (momentum and variance) per trainable parameter, so full fine-tuning pushes the memory footprint well past the 16GB VRAM limit regardless of batch size.
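A rough back-of-envelope estimate makes the failure obvious. The precision assumptions below (bf16 weights/gradients, fp32 Adam moment buffers) are mine, not measured numbers:

```python
# Rough VRAM estimate for full fine-tuning (assumptions: bf16 weights and
# gradients, fp32 Adam moment buffers; activations excluded entirely).
params = 3.0e9 + 0.55e9          # base model + diffusion model

weights_gb = params * 2 / 1e9    # bf16: 2 bytes per parameter
grads_gb   = params * 2 / 1e9    # bf16 gradients
adam_m_gb  = params * 4 / 1e9    # fp32 first moment
adam_v_gb  = params * 4 / 1e9    # fp32 second moment

total_gb = weights_gb + grads_gb + adam_m_gb + adam_v_gb
print(f"~{total_gb:.0f} GB before activations vs. 16 GB available")
# -> roughly 43 GB, which is why even batch_size=2 dies at the optimizer step
```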
Solution: LoRA Fine-Tuning
Implemented Low-Rank Adaptation (LoRA) to reduce trainable parameters:
LoRA Configuration:
```bash
--lora-rank 32             # Size of low-rank adaptation matrices
--lora-alpha 64            # Scaling factor (typically 2x rank)
--lora-dropout 0.1         # Regularization
--no-tune_diffusion_model  # Freeze 550M parameter diffusion model
```
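For readers more familiar with the Hugging Face peft API, these flags correspond roughly to a LoraConfig like the one below. This is illustrative only: the target modules and the exact wiring inside gr00t_finetune.py are assumptions on my part.

```python
# Illustrative mapping of the CLI flags onto a Hugging Face `peft` LoraConfig.
# target_modules is an assumption -- the real projection names depend on the
# GR00T backbone, so check the fine-tuning script for the actual wiring.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,            # --lora-rank: rank of the low-rank update matrices
    lora_alpha=64,   # --lora-alpha: scaling factor (typically 2x rank)
    lora_dropout=0.1,  # --lora-dropout: regularization on the LoRA path
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
```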
Memory Savings:
- Full fine-tuning: ~200M trainable parameters
- LoRA fine-tuning: ~10M trainable parameters (20x reduction)
- Result: Fits in 16GB VRAM with batch_size=16
Training Configuration and Results
Final Training Setup
```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/example_dataset/ \
    --num-gpus 1 \
    --output-dir ./so100-checkpoints \
    --max-steps 5000 \
    --data-config so100_dualcam \
    --batch-size 16 \
    --gradient-accumulation-steps 8 \
    --learning-rate 0.0001 \
    --no-tune_diffusion_model \
    --lora-rank 32 \
    --lora-alpha 64 \
    --lora-dropout 0.1
```
Key Parameters:
- Effective batch size: 128 (16 × 8 gradient accumulation)
- Training steps: 5000 planned, stopped early at 2340
- Checkpoints saved every 1000 steps
Training Progress
Loss Trajectory:
| Step | Loss | Change from Previous |
|---|---|---|
| 500 | 0.080 | (baseline) |
| 1,000 | 0.050 | -37.5% (strong improvement) |
| 1,500 | 0.040 | -20.0% (good improvement) |
| 2,000 | 0.040 | 0.0% (plateau started) |
| 2,340 | 0.038 | -5.0% (minimal improvement) |
Convergence Analysis: The loss plateaued around step 1500-2000, with minimal improvement in the last 840 steps. Training was stopped early to avoid overtraining on the small dataset.
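The stopping decision boils down to a simple rule of thumb over the logged losses; a small sketch using the values from the table above:

```python
# Simple plateau check: stop once the relative improvement between logged
# checkpoints drops below ~5%. Loss values are taken from the table above.
losses = {500: 0.080, 1000: 0.050, 1500: 0.040, 2000: 0.040, 2340: 0.038}

steps = sorted(losses)
for prev, curr in zip(steps, steps[1:]):
    rel_improvement = (losses[prev] - losses[curr]) / losses[prev]
    print(f"step {curr}: {rel_improvement:.1%} improvement over step {prev}")
    if rel_improvement < 0.05:
        print(f"-> plateau detected at step {curr}; stopping is reasonable")
        break
```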
Training Metrics:
- Training speed: ~12.76 seconds/step
- GPU memory usage: 7.07 GB before training
- Best checkpoint: checkpoint-2000 with stable performance
Model Evaluation
Open-Loop Evaluation Results
Evaluated the trained model using the official evaluation script:
```bash
python scripts/eval_policy.py --plot \
    --embodiment-tag new_embodiment \
    --model-path ./so100-checkpoints/checkpoint-2000 \
    --data-config so100_dualcam \
    --dataset-path ./demo_data/example_dataset/
```
Result: Unnormalized MSE of 0.017463
Performance Assessment:
- Excellent: MSE < 10
- Good: MSE 10-30
- Moderate: MSE 30-60
- Poor: MSE > 60
- Our Result: 0.017 (Outstanding performance)
The model predictions closely match ground truth actions, indicating readiness for real robot deployment.
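Conceptually the metric is just the mean squared error between predicted and ground-truth action chunks in raw (unnormalized) joint units. A sketch, not the actual eval_policy.py implementation:

```python
# Conceptual version of the open-loop metric: MSE between predicted actions
# and the recorded ground-truth actions in raw joint units.
import numpy as np

def unnormalized_mse(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Both arrays have shape (num_steps, action_dim) in raw joint units."""
    return float(np.mean((predicted - ground_truth) ** 2))

# Toy example with synthetic data, just to show the call:
rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 6))
pred = gt + rng.normal(scale=0.1, size=gt.shape)
print(unnormalized_mse(pred, gt))  # small value when predictions track ground truth
```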
Deployment Setup
Device Configuration
Verified robot and camera device mappings:
```
/dev/follower -> ttyACM4   # Robot motor bus
/dev/wrist    -> video0    # Wrist/gripper camera
/dev/scene    -> video2    # Scene/front camera
```
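Before launching the client it's worth confirming that indices 0 and 2 really are the wrist and scene cameras. A quick OpenCV check (indices taken from the mapping above):

```python
# Sanity check that /dev/video0 and /dev/video2 map to the expected cameras
# before launching the robot client.
import cv2

for name, index in {"wrist": 0, "scene": 2}.items():
    cap = cv2.VideoCapture(index)
    ok, frame = cap.read()
    cap.release()
    if ok:
        cv2.imwrite(f"check_{name}_camera.png", frame)
        print(f"{name}: index {index} OK, saved check_{name}_camera.png")
    else:
        print(f"{name}: could not read from index {index}")
```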
Client-Server Architecture
Terminal 1 - Inference Server:
```bash
python scripts/inference_service.py --server \
    --model-path ./so100-checkpoints/checkpoint-2000 \
    --embodiment-tag new_embodiment \
    --data-config so100_dualcam \
    --denoising-steps 4 \
    --port 5555
```
Terminal 2 - Robot Client:
```bash
python ./examples/SO-100/eval_lerobot.py \
    --robot.type=so100_follower \
    --robot.port=/dev/follower \
    --robot.id=my_so100_arm \
    --robot.cameras="{ wrist: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 2, width: 640, height: 480, fps: 30}}" \
    --policy_host=localhost \
    --lang_instruction="pick up the striped box and put it into the white plate"
```
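The two processes talk over port 5555 in a request/response loop: the client sends the latest observation, the server returns an action chunk. The sketch below only illustrates that pattern with a dummy ZeroMQ pair; it is not the GR00T wire protocol, and the real serialization of images, state, and actions is handled by inference_service.py and eval_lerobot.py.

```python
# Pattern-only sketch of the client/server split (NOT the GR00T protocol).
# A dummy server thread answers one request so the snippet runs standalone.
import threading
import zmq

def dummy_server(port: int = 5555) -> None:
    sock = zmq.Context.instance().socket(zmq.REP)
    sock.bind(f"tcp://*:{port}")
    _obs = sock.recv_json()                 # observation from the robot client
    sock.send_json({"action": [0.0] * 6})   # placeholder 6-DOF action chunk
    sock.close()

threading.Thread(target=dummy_server, daemon=True).start()

client = zmq.Context.instance().socket(zmq.REQ)
client.connect("tcp://localhost:5555")
client.send_json({"instruction": "pick up the striped box and put it into the white plate"})
print(client.recv_json())                   # -> {'action': [0.0, ...]}
client.close()
```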
Issues Fixed
- Import Error: Fixed incorrect import path in the robot client script
- Missing Dependency: Installed feetech-servo-sdk for robot communication
Technical Notes
Memory Optimization Strategies
Successful approaches:
- LoRA fine-tuning: Reduced trainable parameters by 20x
- Freezing diffusion model: Saved 550M parameters from training
- Gradient accumulation: Maintained effective batch size without memory overhead (see the sketch after this list)
- Memory fragmentation fix: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
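The gradient-accumulation trick is worth spelling out. A generic PyTorch sketch (not the GR00T training loop) of how 8 micro-batches of 16 combine into one effective batch of 128:

```python
# Generic gradient-accumulation loop: gradients from 8 micro-batches are
# summed before a single optimizer step, so only one micro-batch of
# activations lives in memory at a time.
import torch

accum_steps = 8
model = torch.nn.Linear(10, 6)                 # stand-in for the policy
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step, (x, y) in enumerate([(torch.randn(16, 10), torch.randn(16, 6))] * 32):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()            # scale so the sum matches one big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```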
Unsuccessful approaches:
- Simply reducing batch size (still OOM with batch_size=2)
- Training full model even with frozen diffusion model
Training Efficiency Insights
- Small datasets (20 episodes) converge quickly (~2000 steps)
- Monitor loss curves and stop when plateau is reached
- Training loss of 0.04 with MSE 0.017 indicates excellent learning
- Avoid overtraining on small datasets
Dataset Quality Factors
- Camera mapping must match between training and deployment
- Robot calibration affects action space consistency
- 20 episodes sufficient for single task, more needed for multi-task
- Always visualize dataset before training
Current Status
- Dataset prepared and validated
- Model trained and converged (checkpoint-2000, loss: 0.04)
- Open-loop evaluation passed (MSE: 0.017)
- Inference server configured
- Robot client script ready
- Pending: Complete robot calibration and test real-world deployment
Summary
This project successfully fine-tuned a 3B parameter vision-language-action model for robotic manipulation within 16GB VRAM constraints. The key breakthrough was using LoRA fine-tuning to reduce memory requirements while maintaining training effectiveness.
The trained model achieved excellent evaluation metrics (MSE: 0.017) and is ready for real-world deployment. The systematic approach to dataset debugging, memory optimization, and deployment setup provides a foundation for future robotic AI projects.
Next: Testing the trained model on physical robot manipulation tasks.
Model: NVIDIA GR00T N1.5 (3B parameters)
Training Method: LoRA fine-tuning (rank 32)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Framework: Isaac-GR00T + LeRobot
Vipin M