-
Debugging Language Conditioning in GR00T Multitask Training
10/20/2025 at 16:15
When your robot ignores "do not pick up the cheese" and picks it up anyway: A journey through frozen VLM backbones and the limits of action-only fine-tuning
Project Overview
This project log documents the discovery and resolution of a critical language conditioning failure in NVIDIA’s GR00T N1.5 vision-language-action (VLA) model during multitask training. The investigation revealed fundamental limitations in training configurations that freeze the vision-language backbone, and provides solutions for enabling proper language-conditioned robotic manipulation.
The debugging process involved systematic testing with dual-ingredient scenarios (cheese and bread), analysis of model architecture and data flow, and identification of the root cause: frozen Eagle VLM backbone preventing task-specific language-vision association learning.
This work is part of the LeIsaac project - building a multi-ingredient sandwich assembly robot using Isaac Sim, Isaac Lab, and VLA models with an SO-ARM 101 robotic arm.
Hardware and Software Stack
- Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (wrist + scene) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Model: NVIDIA GR00T N1.5-3B (Vision-Language-Action model)
- Framework: Isaac-GR00T + LeRobot v3.0
- Training: LoRA fine-tuning on custom datasets
- Task: Multitask pick-and-place (cheese vs bread)
The Challenge: Multitask Language Conditioning
Why Multitask Learning?
The sandwich assembly task requires the robot to manipulate multiple ingredients based on language instructions:
- “Pick up the cheese and place it in the white plate”
- “Pick up the bread and place it in the white plate”
- “Stack the cheese on the bread”
This requires the model to:
- Understand language instructions - differentiate “cheese” vs “bread”
- Ground language to vision - recognize which object is cheese vs bread
- Execute task-specific actions - different manipulation strategies per ingredient
Training Setup
Datasets:
- Cheese dataset: 50 episodes, 14,212 frames, task: “Pick slice of yellow cheese and place it in the white plate”
- Bread dataset: 50 episodes, 13,483 frames, task: “Pick slice of bread and place it in the white plate”
Training configuration:
```bash
python scripts/gr00t_finetune.py \
  --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
  --num-gpus 1 \
  --max-steps 10000 \
  --data-config so100_dualcam \
  --batch-size 16 \
  --lora-rank 32 \
  --balance-dataset-weights \
  --balance-trajectory-weights
```
The `LeRobotMixtureDataset` automatically balances sampling across both datasets during training.
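The balancing logic itself lives inside Isaac-GR00T; purely as an illustration of the idea (not the actual `LeRobotMixtureDataset` implementation), unequal datasets can be balanced by weighting each sample inversely to its dataset's size:

```python
# Illustration only: balance two datasets of unequal size by weighting each
# sample inversely to its dataset length, so both contribute roughly equally
# per batch. This shows the idea behind dataset balancing, not the GR00T code.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def balanced_loader(cheese_ds, bread_ds, batch_size=16):
    mixed = ConcatDataset([cheese_ds, bread_ds])
    weights = torch.cat([
        torch.full((len(cheese_ds),), 1.0 / len(cheese_ds)),
        torch.full((len(bread_ds),), 1.0 / len(bread_ds)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)
```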
Phase 1: Problem Discovery
Initial Testing
After training a multitask model for 3,000 steps, I deployed it on the physical robot and tested with different language instructions:
Test 1: "pick up the yellow cheese and put it into the white plate"
- Result: ✅ Robot picks up cheese

Test 2: "pick up the bread and put it into the white plate"
- Result: ❌ Robot picks up cheese (ignores instruction!)

Test 3: "do not pick up the cheese"
- Result: ❌ Robot picks up cheese (completely ignores negation!)
Critical observation: The model’s behavior was 100% determined by visual state, with 0% influence from language instruction.
Hypothesis: Visual State Machine
The robot appeared to be using a simple position-based heuristic:
```
IF (object detected in plate):
    STOP (task complete)
ELSE IF (object detected in holder):
    GRASP object → MOVE to plate → RELEASE
ELSE:
    SEARCH randomly
```
This suggested the model learned visual patterns rather than language-conditioned behavior.
Phase 2: First Fix Attempt - The Diffusion Model Flag
Discovery of `--no-tune_diffusion_model`

Investigating the training script revealed a suspicious flag:
```bash
TRAIN_CMD="python scripts/gr00t_finetune.py \
    --dataset-path ${DATASET_PATHS} \
    --no-tune_diffusion_model \    # ← SUSPICIOUS!
    --lora-rank 32 \
    ..."
```
Analysis of what this flag does:
GR00T N1.5 architecture:
```
├── Vision Tower (SigLIP) ..................... ❌ FROZEN (tune_visual=False)
├── Language Model (Qwen2.5-3B) ............... ❌ FROZEN (tune_llm=False)
└── Action Head
    ├── Projector (Linear layers) ............. ✅ TRAINABLE (LoRA rank 32)
    └── Diffusion Model (DiT) ................. ❌ FROZEN (--no-tune_diffusion_model)
```
With `--no-tune_diffusion_model`, only the tiny projector layer was trainable!

Training logs confirmed:
```
tune_diffusion_model: False
trainable params: 6,553,600 || all params: 2,730,717,120 || trainable%: 0.2400
Tune action head projector: True
Tune action head diffusion model: False   ← PROBLEM!
```
The Fix
Removed the flag from `03_train_model.sh`:
```bash
# REMOVED: --no-tune_diffusion_model
TRAIN_CMD="python scripts/gr00t_finetune.py \
    --dataset-path ${DATASET_PATHS} \
    --lora-rank 32 \
    ..."
```
New training logs:
```
tune_diffusion_model: True
trainable params: 6,553,600 || all params: 2,730,717,120 || trainable%: 0.2400
Tune action head projector: True
Tune action head diffusion model: True   ← FIXED!
```
Expectations: With the diffusion model now trainable, the model should learn to map language instructions to different action sequences.
Reality: Language conditioning still failed! 😱
Phase 3: Systematic Testing and Behavior Analysis
Test Matrix
I conducted systematic testing with single and dual ingredient scenarios:
| Scenario | Cheese Location | Bread Location | Robot Action | Language Effect |
|---|---|---|---|---|
| 1 | Holder | Holder | Randomly picks one | ❌ Ignores instruction |
| 2 | Holder | None | Picks cheese | ❌ Ignores instruction |
| 3 | None | Holder | Picks bread | ❌ Ignores instruction |
| 4 | Plate | Holder | Stops | ❌ Ignores instruction |
| 5 | Holder | Plate | Stops | ❌ Ignores instruction |
| 6 | Plate | Plate | Stops | ❌ Ignores instruction |
| 7 | None | None | Random search | ❌ Ignores instruction |
| 8 | Plate | None | Stops | ❌ Ignores instruction |

Key finding: The robot's behavior was entirely determined by object positions, regardless of language instruction.
Behavior Pattern
The model learned a position-based state machine:
State 1: Nothing in plate → Pick any object from holder
State 2: Something in plate → Stop (task complete)
State 3: Nothing anywhere → Search randomly

Critical test: Manually moved the cheese to the plate (without the robot), then gave the instruction "pick up the bread and put it into the white plate":
- Expected: Robot picks up bread
- Actual: Robot stops (considers task complete because cheese is in plate)
This confirmed the model was using visual heuristics (“if object in plate → task done”) rather than understanding language instructions.
Phase 4: Root Cause Analysis - The Frozen Backbone Problem
How GR00T Processes Language
After diving into the codebase, I discovered how language flows through the model:
Step 1: Input Preparation (`transforms.py`)
```python
# Language text
lang = "Pick slice of yellow cheese and place it in the white plate"

# Images from cameras
images = [scene_camera_frame, wrist_camera_frame]  # Shape: [V, T, C, H, W]
```

Step 2: Eagle VLM Processing (`transforms.py:_apply_vlm_processing`)
```python
# Create conversation format (Eagle processes vision + language together!)
eagle_conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": scene_img},
            {"type": "image", "image": wrist_img},
            {"type": "text", "text": lang}
        ]
    }
]

# Tokenize and process
text_list = eagle_processor.apply_chat_template(eagle_conversation)
image_inputs = eagle_processor.process_vision_info(eagle_conversation)
```

Step 3: Eagle Model Forward (`eagle_backbone.py`)
```python
# Eagle model processes BOTH vision and language together
eagle_output = self.eagle_model(
    input_ids=tokenized_text,        # Language tokens
    pixel_values=processed_images,   # Vision features
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Extract joint vision-language embeddings
vl_embeddings = eagle_output.hidden_states[select_layer]  # Shape: [B, seq_len, 2048]
vl_embeddings = self.eagle_linear(vl_embeddings)          # Project to 1536 dim
```

Step 4: Action Head Uses VL Embeddings (`flow_matching_action_head.py`)
```python
# Action head receives joint vision-language embeddings
vl_embs = backbone_output.backbone_features  # From Eagle

# Diffusion model uses these embeddings as conditioning
model_output = self.model(
    hidden_states=sa_embs,              # State + action embeddings
    encoder_hidden_states=vl_embs,      # Vision-language conditioning ← KEY!
    encoder_attention_mask=vl_attn_mask,
    timestep=t_discretized
)
```
The Fundamental Problem
Eagle VLM is completely frozen (`tune_llm=False`, `tune_visual=False`):
- Eagle cannot learn new language-vision associations
  - Pre-trained on general VLM tasks (image captioning, VQA, etc.)
  - Never seen "pick cheese" vs "pick bread" during pre-training
  - Cannot learn to differentiate these similar instructions
- Eagle produces nearly identical embeddings
```python
# Hypothesis: Eagle's frozen embeddings
emb_cheese = eagle("pick cheese", [scene_img, wrist_img])
emb_bread = eagle("pick bread", [scene_img, wrist_img])

# Cosine similarity likely very high (>0.95)
# Because both are "pick X and place in plate" structure
```
- Diffusion model has no signal to differentiate
  - Diffusion model learns: `embeddings → actions`
  - If `emb_cheese ≈ emb_bread`, then `actions_cheese ≈ actions_bread`
  - Model falls back to visual heuristics: "if object in holder → pick it"
Why Diffusion Model Training Alone is Insufficient
Even with the diffusion model trainable:
- ✅ Diffusion model can learn better action prediction
- ✅ Diffusion model can learn smoother trajectories
- ❌ Diffusion model CANNOT differentiate between similar language instructions
- ❌ Diffusion model CANNOT learn task-specific language conditioning
The bottleneck: Frozen Eagle provides nearly identical embeddings for different instructions, so the diffusion model has no signal to learn different behaviors.
Phase 5: Solution and Next Steps
Required Fix: Enable Backbone Training
Minimum requirement - Enable LLM fine-tuning:
```bash
python scripts/gr00t_finetune.py \
  --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
  --tune-llm \                   # ✅ ENABLE THIS
  --tune-visual False \          # Keep frozen to save VRAM
  --tune-projector True \
  --tune-diffusion-model True \
  --lora-rank 32
```
Why this works:
- LLM can learn to differentiate “cheese” vs “bread” tokens
- LLM creates task-specific language embeddings
- Diffusion model learns to map these distinct embeddings to different actions
- VRAM: ~12-16GB (may fit on RTX 4080 Super with reduced batch size)
Best solution - Enable both LLM and vision fine-tuning:
```bash
python scripts/gr00t_finetune.py \
  --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
  --tune-llm \               # ✅ ENABLE THIS
  --tune-visual \            # ✅ ENABLE THIS
  --tune-projector True \
  --tune-diffusion-model True \
  --lora-rank 16 \           # Reduce from 32 to save VRAM
  --lora-alpha 32            # Reduce from 64 (2x rank)
```
Why this is best:
- Vision tower learns to recognize cheese vs bread visually
- LLM learns to understand task instructions
- Combined: Model can ground language to visual objects
- VRAM: ~16-20GB (may require batch size reduction)
VRAM Testing Script
Created `test_vram_requirements.sh` to systematically test configurations:
```bash
cd ~/lerobot/scripts/so100_groot
./test_vram_requirements.sh
```
This script tests 5 configurations:
- Baseline (frozen backbone) - ~8GB
- LLM only (LoRA 16) - ~12GB
- LLM only (LoRA 32) - ~14GB
- LLM + Vision (LoRA 16) - ~18GB
- LLM + Vision (LoRA 32) - ~22GB
And recommends the best configuration that fits on the GPU.
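The script itself is project-specific, but the underlying measurement is simple: launch each candidate configuration briefly, poll `nvidia-smi`, and record the peak memory. A rough sketch of that loop (the training command strings are placeholders, not the script's exact flags):

```python
# Sketch of the measurement behind a VRAM test script: run each candidate
# training command for a short window while polling nvidia-smi, report the peak.
# The command strings below are placeholders, not the project's exact flags.
import subprocess
import time

def peak_vram_mb(train_cmd, probe_seconds=120):
    proc = subprocess.Popen(train_cmd, shell=True)
    peak = 0
    deadline = time.time() + probe_seconds
    while time.time() < deadline and proc.poll() is None:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
        )
        peak = max(peak, int(out.decode().split()[0]))
        time.sleep(2)
    proc.terminate()
    return peak

configs = {
    "llm_only_lora16": "python scripts/gr00t_finetune.py --tune-llm --lora-rank 16 --max-steps 50",
}
for name, cmd in configs.items():
    print(f"{name}: ~{peak_vram_mb(cmd)} MB peak")
```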
Trade-offs and Considerations
| Configuration | VRAM | Training Speed | Language Conditioning | Visual Grounding |
|---|---|---|---|---|
| Frozen backbone | ~8GB | ~5-7 sec/step | ❌ Broken | ❌ No |
| LLM only | ~12-16GB | ~8-12 sec/step | ✅ Works | ⚠️ Limited |
| LLM + Vision | ~16-20GB | ~12-18 sec/step | ✅ Works | ✅ Yes |

Lessons Learned
1. Frozen VLM Backbones Cannot Learn New Tasks
Key insight: When using a vision-language model as a backbone, freezing it prevents learning task-specific language-vision associations. This is fundamentally different from freezing a vision-only backbone (like ResNet or ViT).
Why: VLMs create joint embeddings from vision and language. If the VLM is frozen, it can only use pre-trained associations, which likely don’t include my specific tasks.
2. Action-Only Fine-tuning Has Limits
What works: Fine-tuning only the action head (projector + diffusion model) can work for:
- Single-task learning
- Tasks similar to pre-training data
- Visual-only conditioning
What doesn’t work: Action-only fine-tuning cannot enable:
- Language conditioning for novel tasks
- Differentiation between similar language instructions
- Grounding of new language concepts to vision
3. Debugging Requires Understanding Data Flow
Critical: Understanding how data flows through the model is essential for debugging:
- How is language tokenized?
- Where are vision and language combined?
- What embeddings does the action head receive?
- Which components are frozen vs trainable?
Without this understanding, it’s easy to misdiagnose problems (e.g., thinking diffusion model training alone would fix language conditioning).
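The last question is the cheapest to answer directly from the loaded model, before spending any GPU hours. A generic PyTorch sketch, assuming `policy` is the loaded GR00T policy (a `torch.nn.Module`):

```python
# Quick audit of which sub-modules are actually trainable before launching a run.
# `policy` is assumed to be the loaded GR00T policy (a torch.nn.Module).
from collections import defaultdict
import torch.nn as nn

def trainable_summary(policy: nn.Module):
    totals = defaultdict(lambda: [0, 0])  # top-level module -> [trainable, total]
    for name, param in policy.named_parameters():
        top = name.split(".")[0]          # e.g. "backbone", "action_head"
        totals[top][1] += param.numel()
        if param.requires_grad:
            totals[top][0] += param.numel()
    for top, (trainable, total) in totals.items():
        print(f"{top:20s} {trainable:>14,d} / {total:>14,d} trainable")
```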
4. Systematic Testing Reveals Patterns
Approach: Testing with a matrix of scenarios (single ingredient, dual ingredient, different positions) revealed the position-based state machine pattern.
Value: This systematic approach provided clear evidence that language had 0% effect, which motivated deeper investigation into the architecture.
5. VRAM is the Limiting Factor
Reality: The best configuration (LLM + Vision fine-tuning) requires ~20GB VRAM, which exceeds most consumer GPUs.
Solutions:
- Reduce LoRA rank (32 → 16)
- Reduce batch size (16 → 4)
- Use gradient checkpointing (see the sketch after this list)
- Train on cloud GPUs
- Accept LLM-only training as compromise
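Gradient checkpointing and the batch-size/LoRA reductions are one-liners in most Hugging Face-style training setups. A generic sketch of the levers (the GR00T fine-tune script exposes equivalent knobs through its own CLI flags, so treat this as illustrative only):

```python
# Generic illustration of the memory-saving levers using Hugging Face
# transformers + PEFT; not the project's exact configuration.
from transformers import TrainingArguments
from peft import LoraConfig

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=4,   # reduced from 16
    gradient_accumulation_steps=4,   # keeps the effective batch size at 16
    gradient_checkpointing=True,     # trade extra compute for activation memory
    bf16=True,
)

lora_config = LoraConfig(r=16, lora_alpha=32)  # halved rank and alpha
```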
Next Steps
- Test VRAM requirements on RTX 4080 Super to determine feasible configuration
- Retrain model with LLM fine-tuning enabled (minimum) or LLM + Vision (if VRAM allows)
- Validate language conditioning with systematic testing:
- Different instructions → different behaviors
- Negation has effect
- Model responds to language changes
- Document results and update training scripts with proper defaults
- Consider alternative approaches if VRAM is insufficient:
- Train separate models per task
- Use task selector at inference time
- Explore model quantization or pruning
Conclusion
This debugging journey revealed a fundamental limitation in the training configuration: frozen VLM backbones cannot learn task-specific language conditioning. While the initial fix (enabling diffusion model training) was necessary, it was insufficient because the bottleneck was in the frozen Eagle backbone producing nearly identical embeddings for different instructions.
The solution requires enabling at least LLM fine-tuning, and ideally both LLM and vision fine-tuning, to allow the model to learn task-specific language-vision associations. This comes with VRAM trade-offs that must be carefully managed on consumer GPUs.
The systematic testing approach and deep dive into model architecture were essential for identifying the root cause and developing an effective solution. This experience highlights the importance of understanding not just what to train, but how the model processes and combines different modalities.
-
Building a Sandwich Assembly Simulation for Robotic Manipulation
10/16/2025 at 20:30
From USD scene creation to MimicGen integration: A complete implementation of multi-ingredient manipulation in Isaac Sim
Project Overview
This project implements a complete sandwich assembly simulation environment in Isaac Lab, designed to train robotic manipulation policies for multi-step food preparation tasks. The development involved creating a custom USD scene with proper physics configuration, implementing MimicGen integration for data augmentation, and solving critical challenges with rigid body hierarchies and API compatibility.
This project log documents the systematic development process, from initial scene setup through the final dynamic ingredient selection feature, including all the debugging challenges and solutions encountered along the way.
The simulation successfully supports teleoperation, demonstration recording, MimicGen annotation, and automated data generation for training vision-language-action (VLA) models on sandwich assembly tasks.
Hardware and Software Stack
- Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (wrist + front) at 640x480, 30fps
- Front camera calibrated for Nexigo N60 webcam (78° FOV)
- Wrist camera for close-up manipulation view
- GPU: RTX 4080 Super with 16GB VRAM
- Simulation: Isaac Sim 5.0 + Isaac Lab framework
- Framework: LeIsaac (custom robotics framework built on Isaac Lab)
- Task: Multi-ingredient sandwich assembly with 4 ingredients (2× bread, cheese, patty)
The Challenge: Multi-Ingredient Manipulation
Why Sandwich Assembly?
Sandwich assembly represents a complex manipulation task that requires:
- Sequential manipulation: Multiple pick-and-place operations
- Spatial reasoning: Proper stacking order and alignment
- Object diversity: Different ingredient types with varying properties
- Real-world relevance: Applicable to food preparation and assembly tasks
- VLA training: Language-conditioned manipulation (“pick up the cheese”, “place the patty”)
Technical Requirements
- USD Scene: Custom kitchen environment with proper physics
- Multiple Objects: 4 ingredients + plate + holder, each with correct physics properties
- MimicGen Integration: Subtask annotation and data augmentation
- Dynamic Configuration: Support for different ingredient types
- Camera Setup: Optimal viewing angles for manipulation
Phase 1: USD Scene Creation
Initial Scene Setup (Commit 1c52342)
Development began with creating the basic task structure and scene configuration. The initial implementation involved:
Created Files:
- `assemble_sandwich_env_cfg.py` - Main environment configuration
- `assemble_sandwich_mimic_env_cfg.py` - MimicGen variant with subtask configs
- `README.md` - Complete documentation (614 lines)
Key Configuration:
```python
# Scene loading with automatic USD parsing
parse_usd_and_create_subassets(KITCHEN_WITH_SANDWICH_USD_PATH, self)
```
This function automatically detects all rigid body objects in the USD scene and creates corresponding configuration objects, eliminating manual object registration.
Scene Simplification Challenge
Problem: The original kitchen scene was 37.9 MB with complex fixtures (cabinets, appliances, decorative elements) that slowed simulation and cluttered the workspace.
Solution: Documented a systematic simplification workflow:
- Remove unnecessary kitchen fixtures
- Create simple table workspace (1.2m × 0.8m × 0.85m)
- Reduce file size to ~5-10 MB (75% reduction)
- Optimize for robot simulation performance
Table Layout Design:
```
┌─────────────────────────────────┐
│ [Ingredients Holder]      [🍽️]  │  ← Left: Holder, Right: Plate
│   ┌─┬─┬─┬─┐              Plate  │
│   │🍞│🍞│🥩│🧀│                  │  ← Slots: bread, bread, patty, cheese
│   └─┴─┴─┴─┘                     │
│                                 │
│          Assembly Area          │
└─────────────────────────────────┘
```
Physics Configuration (Commit 6a7e4b5)
The USD scene structure was created with proper physics APIs for all objects:
Dynamic Objects (movable ingredients):
- `bread_slice_1`, `bread_slice_2`, `cheese_slice`, `patty`
- Physics properties: `RigidBodyAPI`, `CollisionAPI`, `MassAPI`
- Configuration: `physics:kinematicEnabled = false` (affected by gravity)

Static Objects (fixed environment):
- `plate`, `ingredients_holder`, `table`
- Physics properties: `RigidBodyAPI`, `CollisionAPI`
- Configuration: `physics:kinematicEnabled = true` (no movement)
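For reference, the same tagging can also be done programmatically with the USD Python API instead of hand-editing the scene file; a sketch of how that might look (prim paths and the mass value are illustrative assumptions):

```python
# Sketch: apply the rigid-body, collision, and mass APIs to scene prims with
# the pxr USD Python bindings. Prim paths and the mass value are illustrative.
from pxr import Usd, UsdPhysics

stage = Usd.Stage.Open("scene.usda")

def make_dynamic(path, mass_kg=0.05):
    prim = stage.GetPrimAtPath(path)
    rb = UsdPhysics.RigidBodyAPI.Apply(prim)
    rb.CreateKinematicEnabledAttr(False)                 # affected by gravity
    UsdPhysics.CollisionAPI.Apply(prim)
    UsdPhysics.MassAPI.Apply(prim).CreateMassAttr(mass_kg)

def make_static(path):
    prim = stage.GetPrimAtPath(path)
    UsdPhysics.RigidBodyAPI.Apply(prim).CreateKinematicEnabledAttr(True)  # no movement
    UsdPhysics.CollisionAPI.Apply(prim)

for name in ["bread_slice_1", "bread_slice_2", "cheese_slice", "patty"]:
    make_dynamic(f"/Root/{name}")
for name in ["plate", "ingredients_holder", "Scene/table"]:
    make_static(f"/Root/{name}")

stage.Save()
```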
Critical Learning: USD payload overrides can replace physics APIs from payload files. The solution required adding physics and collision APIs to both `def` and `over` declarations in `scene.usda` to ensure proper physics behavior.
Phase 2: The Rigid Body Hierarchy Crisis
The Critical Error
Problem: When attempting teleoperation, a fatal error occurred:
[Error] Multiple rigid bodies in hierarchy detected
Root Cause: Objects were nested inside the table hierarchy, creating parent-child rigid body relationships that Isaac Sim’s physics engine cannot handle.
Investigation Process:
- Examined working reference scene (kitchen_with_orange)
- Investigation revealed correct USD hierarchy pattern
- Analysis showed that manipulable objects must be direct children of `/Root`, NOT nested inside Scene or table
Incorrect Hierarchy:
```
/Root
└── Scene
    └── table
        ├── bread_slice_1   ❌ Nested rigid body
        ├── cheese_slice    ❌ Nested rigid body
        └── patty           ❌ Nested rigid body
```
Correct Hierarchy:
```
/Root
├── Scene
│   └── table
├── bread_slice_1   ✅ Direct child of /Root
├── cheese_slice    ✅ Direct child of /Root
└── patty           ✅ Direct child of /Root
```
The Fix (Commit 6b64f15)
The USD hierarchy was flattened by moving all manipulable objects out of the table to be direct children of `/Root`. This critical architectural fix enabled proper physics simulation.
Validation: After flattening, teleoperation worked correctly with no rigid body hierarchy errors.
Phase 3: Camera Configuration (Commits d5328b3, 5df9d64)
Camera Positioning Strategy
The implementation includes a three-camera system optimized for sandwich assembly:
1. Wrist Camera (Close-up manipulation):
```python
offset=TiledCameraCfg.OffsetCfg(
    pos=(0.02, 0.08, -0.03),          # Slightly forward and up
    rot=(-0.35, -0.93, -0.05, 0.08),  # Angled down toward table
)
```
2. Front Camera (Workspace overview, calibrated for Nexigo N60 webcam):
```python
offset=TiledCameraCfg.OffsetCfg(
    pos=(-0.2, -0.8, 0.7),       # Higher and angled for table overview
    rot=(0.2, -0.98, 0.0, 0.0),  # Looking down at workspace
),
spawn=sim_utils.PinholeCameraCfg(
    focal_length=24.0,           # Nexigo N60 equivalent
    horizontal_aperture=36.0,    # ~78° FOV to match webcam
    focus_distance=400.0,        # Optimal for table distance
)
```
3. Viewer Camera (Development/debug):
```python
self.viewer.eye = (1.5, -1.5, 1.8)    # Elevated diagonal view
self.viewer.lookat = (0.0, 0.0, 0.9)  # Looking at table center
```
Key Insight: The front camera was specifically calibrated to match the Nexigo N60 webcam specifications (78° FOV, 640x480 resolution, 30 FPS) to ensure sim-to-real consistency for VLA training.
Phase 4: MimicGen Integration
Automatic Subtask Annotation (Commit 3e3ce0b)
Automatic subtask detection was implemented using observation functions:
Created `mdp/observations.py`:
```python
def ingredient_grasped(
    env: ManagerBasedRLEnv,
    robot_cfg: SceneEntityCfg,
    bread_slice_1_cfg: SceneEntityCfg,
    cheese_slice_cfg: SceneEntityCfg,
    patty_cfg: SceneEntityCfg,
    bread_slice_2_cfg: SceneEntityCfg,
    gripper_open_threshold: float = 0.03,
) -> torch.Tensor:
    """Detect when any ingredient is grasped by checking gripper closure and proximity."""
    robot: Articulation = env.scene[robot_cfg.name]
    gripper_pos = robot.data.joint_pos[:, -1]
    gripper_closed = gripper_pos < gripper_open_threshold

    # Check proximity to each ingredient
    for ingredient_name, ingredient_cfg in [
        ("bread_slice_1", bread_slice_1_cfg),
        ("cheese_slice", cheese_slice_cfg),
        ("patty", patty_cfg),
        ("bread_slice_2", bread_slice_2_cfg),
    ]:
        if ingredient_name in env.scene:
            ingredient: RigidObject = env.scene[ingredient_name]
            distance = torch.norm(
                robot.data.body_pos_w[:, -1, :3] - ingredient.data.root_pos_w[:, :3],
                dim=-1
            )
            is_grasped = gripper_closed & (distance < 0.05)
            if is_grasped.any():
                return is_grasped
    return torch.zeros(env.num_envs, dtype=torch.bool, device=env.device)
```
This function automatically detects when the robot grasps an ingredient by checking:
- Gripper closure (joint position < threshold)
- Proximity to ingredient (distance < 5cm)
Success Termination Criteria (Commit faf0a22)
Problem: The MimicGen annotation script requires a success termination term, but the environment only had a `time_out` termination.

Solution: Created `mdp/terminations.py` with comprehensive success criteria:
```python
def task_done(
    env: ManagerBasedRLEnv,
    plate_cfg: SceneEntityCfg,
    bread_slice_1_cfg: SceneEntityCfg,
    cheese_slice_cfg: SceneEntityCfg,
    patty_cfg: SceneEntityCfg,
    bread_slice_2_cfg: SceneEntityCfg,
    height_threshold: float = 0.02,
    xy_threshold: float = 0.05,
    test_mode: bool = False,
) -> torch.Tensor:
    """Determine if the sandwich assembly task is complete."""
    if test_mode:
        return torch.ones(env.num_envs, dtype=torch.bool, device=env.device)

    # Check XY alignment (all ingredients within 5cm of plate center)
    # Check vertical stacking order (bread → cheese → patty → bread)
    # Check stability (low velocity for all ingredients)
    # ...
```
Success Criteria:
- XY Alignment: All ingredients within 5cm of plate center
- Vertical Stacking: Correct order (bread → cheese → patty → bread)
- Stability: Low velocity for all ingredients
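The checks elided in the snippet above follow directly from these criteria. A hedged sketch of how they might look, reusing the thresholds and Isaac Lab data fields from the signature (the stacking test is simplified to relative heights rather than contact reasoning):

```python
# Hedged sketch of the elided success checks, assuming the entity handles and
# thresholds from the task_done signature above.
import torch

def sandwich_assembled(env, plate, ingredients, xy_threshold=0.05, height_threshold=0.02):
    # `ingredients` is ordered bottom bread -> cheese -> patty -> top bread
    plate_pos = plate.data.root_pos_w                       # [num_envs, 3]
    done = torch.ones(env.num_envs, dtype=torch.bool, device=env.device)

    prev_height = plate_pos[:, 2]
    for obj in ingredients:
        pos = obj.data.root_pos_w
        xy_ok = torch.norm(pos[:, :2] - plate_pos[:, :2], dim=-1) < xy_threshold
        stacked = pos[:, 2] > prev_height + height_threshold
        stable = torch.norm(obj.data.root_lin_vel_w, dim=-1) < 0.05
        done &= xy_ok & stacked & stable
        prev_height = pos[:, 2]
    return done
```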
Test Mode for Incomplete Demonstrations (Commit 6366699)
Problem: During development, testing the annotation pipeline required support for incomplete demonstrations (just moving the arm without completing the task).
Solution: Added a `--force_completion` command-line argument:
```python
parser.add_argument(
    "--force_completion",
    action="store_true",
    default=False,
    help="Force task completion for incomplete demonstrations."
)

# Modify success_term parameters based on flag
if args_cli.force_completion:
    success_term.params["test_mode"] = True
    print("[INFO] Force completion enabled: success termination will accept incomplete demonstrations.")
```
This allows switching between:
- Testing mode: Accept any demonstration (useful for pipeline testing)
- Production mode: Enforce strict success criteria (for real training data)
Phase 5: Dynamic Ingredient Selection (Commit 24bb349)
The KeyError Challenge
Problem: MimicGen data generation failed with:
KeyError: 'ingredient'
Root Cause: The subtask configuration used `object_ref="ingredient"` (a generic placeholder), but the USD scene contains specific objects: `bread_slice_1`, `bread_slice_2`, `cheese_slice`, `patty`.

Investigation: The `get_object_poses()` method returns a dictionary with keys from actual scene object names, not generic placeholders. MimicGen tried to access `object_poses["ingredient"]`, which didn't exist.
The Solution: Dynamic Ingredient Selection
A command-line argument was implemented to dynamically specify which ingredient to track:
```python
parser.add_argument(
    "--ingredient_type",
    type=str,
    default=None,
    choices=["bread_slice_1", "bread_slice_2", "cheese_slice", "patty"],
    help="Specify the ingredient type for sandwich assembly task."
)

# Dynamically set ingredient type
if args_cli.ingredient_type is not None and "AssembleSandwich" in env_name:
    ingredient_display_names = {
        "bread_slice_1": "bread slice",
        "bread_slice_2": "bread slice",
        "cheese_slice": "cheese slice",
        "patty": "patty",
    }
    ingredient_name = ingredient_display_names.get(args_cli.ingredient_type)

    # Update object_ref for the first subtask (grasp ingredient)
    env_cfg.subtask_configs["so101_follower"][0].object_ref = args_cli.ingredient_type

    # Update descriptions
    env_cfg.subtask_configs["so101_follower"][0].description = f"Grasp {ingredient_name} from cartridge"
    env_cfg.subtask_configs["so101_follower"][1].description = f"Place {ingredient_name} on plate"

    print(f"[INFO] Ingredient type set to: {args_cli.ingredient_type} ({ingredient_name})")
```
Benefits:
- No Manual Editing: Single command-line flag instead of code changes
- Type Safety: Choices parameter prevents typos
- Clear Intent: Command shows exactly which ingredient is being processed
- Workflow Efficiency: Easy switching between ingredient types
Usage Example:
```bash
# Generate bread demonstrations
python.sh scripts/mimic/generate_dataset.py \
    --task=LeIsaac-SO101-AssembleSandwich-Mimic-v0 \
    --input_file=./datasets/annotated_bread_ingredient.hdf5 \
    --output_file=./datasets/generated_bread_ingredient.hdf5 \
    --ingredient_type=bread_slice_1 \
    --generation_num_trials=20

# Generate patty demonstrations
python.sh scripts/mimic/generate_dataset.py \
    --ingredient_type=patty \
    ...
```
Phase 6: API Compatibility Fix (Commit 6b64f15)
The Se3Keyboard Error
Problem: Annotation script failed with:
TypeError: Se3Keyboard.__init__() got an unexpected keyword argument 'pos_sensitivity'
Root Cause: The leisaac repository's `annotate_demos.py` diverged from the official Isaac Lab version. Isaac Lab changed the Se3Keyboard API in May 2025 to use a configuration object pattern, but the script was created in August 2025 with the old API.

Solution: Updated to use `Se3KeyboardCfg`:
```python
# Old (incorrect):
device = Se3Keyboard(pos_sensitivity=0.05, rot_sensitivity=0.05)

# New (correct):
from omni.isaac.lab.devices import Se3Keyboard, Se3KeyboardCfg
device = Se3Keyboard(Se3KeyboardCfg(pos_sensitivity=0.05, rot_sensitivity=0.05))
```
This follows Isaac Lab’s configuration pattern where device classes accept a single
cfgparameter of a corresponding configuration dataclass type.Technical Insights and Lessons Learned
1. USD Hierarchy is Critical
The rigid body hierarchy issue was the most critical architectural challenge. Key learning: Manipulable objects must be direct children of the root prim, not nested inside other rigid bodies.
2. Physics API Overrides
USD payload overrides can replace physics APIs from payload files. Always add physics and collision APIs to both `def` and `over` declarations to ensure proper behavior.
3. MimicGen Configuration Requirements
- Intermediate subtasks: Can have termination signals and offset ranges
- Final subtask: Must have
subtask_term_signal=Noneandsubtask_term_offset_range=(0, 0) - Object references: Must match actual USD scene object names
4. Camera Calibration for Sim-to-Real
Calibrating simulation cameras to match real hardware (Nexigo N60 webcam) is essential for VLA training. Key parameters:
- Field of view (78° diagonal)
- Resolution (640x480)
- Frame rate (30 FPS)
- Focus distance (400mm for table workspace)
5. Flexible Testing Workflows
The `--force_completion` flag enables rapid iteration during development by accepting incomplete demonstrations, while maintaining strict criteria for production data collection.
Results and Portfolio Value
Completed Features
✅ Custom USD scene with proper physics configuration
✅ Multi-ingredient manipulation support (4 ingredients)
✅ Dual camera system calibrated for real hardware
✅ MimicGen integration with automatic subtask detection
✅ Success termination criteria for task completion
✅ Dynamic ingredient selection for data augmentation
✅ Comprehensive documentation (614-line README)
Demonstration Workflow
- Record demonstrations for each ingredient type (bread, cheese, patty)
- Annotate subtasks manually or automatically
- Generate augmented data using MimicGen (1 demo → 20+ variations)
- Train VLA policies with language conditioning
- Deploy to real robot with sim-to-real transfer
Portfolio Highlights
This project demonstrates:
- USD scene design with complex physics requirements
- Debugging systematic issues (rigid body hierarchy, API compatibility)
- MimicGen integration for data augmentation
- Flexible configuration for different use cases
- Real-world applicability to food preparation and assembly tasks
Future Enhancements
- Language Conditioning: Integrate with VLA models for language-conditioned manipulation
- Sim-to-Real Transfer: Deploy trained policies to real SO-101 robot
- Advanced Physics: Add deformable objects (lettuce, tomato)
- Multi-Robot: Extend to bi-manual manipulation
- Benchmarking: Compare MimicGen vs manual demonstrations
Conclusion
Building the sandwich assembly simulation required solving challenges across multiple domains: USD scene design, physics configuration, MimicGen integration, and API compatibility. The systematic debugging approach and comprehensive documentation make this a strong portfolio piece demonstrating end-to-end robotics simulation development.
The dynamic ingredient selection feature and flexible testing workflows show attention to developer experience and practical usability. The project successfully bridges simulation and real-world robotics, providing a foundation for training manipulation policies on complex multi-step tasks.
-
MimicGen Data Augmentation Pipeline for Robotic Manipulation
10/08/2025 at 22:35
Project Overview
I implemented a complete MimicGen data augmentation pipeline to generate multiple training demonstrations from a single recorded episode. The goal was to overcome the data scarcity problem in robotic manipulation by automatically creating diverse variations of expert demonstrations.
This project log documents the systematic implementation of the 4-step MimicGen workflow, from converting demonstrations to IK actions through generating 10x augmented data, and the debugging challenges encountered along the way.
The pipeline successfully transformed 1 original demonstration into 10 augmented demonstrations with a 71.4% generation success rate, providing rich training data for imitation learning policies.
Hardware Setup
- Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (scene + wrist) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Simulation: Isaac Sim 5.0 + Isaac Lab framework
- Dataset: Single “lift_cube” demonstration → 10 augmented demonstrations
- Task: “Pick up 1.5cm cube and lift it 5cm above robot base”
The Problem: Data Scarcity in Robotic Learning
Initial Challenge
Robotic manipulation policies require large amounts of diverse training data, but collecting demonstrations is:
- Time-consuming: Each episode requires manual teleoperation
- Limited diversity: Human demonstrations tend to be similar
- Expensive: Requires expert operators and robot time
- Insufficient for generalization: Single demonstrations don’t capture task variations
Traditional approach: Record 50-100 demonstrations manually.
MimicGen approach: Record 1 demonstration → Generate 10+ variations automatically.
MimicGen Pipeline Overview
The 4-Step Workflow
- Convert to IK Actions: Transform joint-space actions (6D) to end-effector actions (8D)
- Annotate Subtasks: Automatically detect subtask boundaries using termination signals
- Generate Augmented Data: Create variations by recombining subtask segments
- Convert to Joint Actions: Transform back to joint-space for training
Task Structure: Lift Cube
Subtask 1: `pick_cube` - Approach and grasp the cube
Subtask 2: `lift_cube` - Lift cube above threshold height

Key Requirements:
- Cube dimensions: 1.5cm × 1.5cm × 1.5cm
- Lift threshold: 5cm above robot base
- Success condition: Cube height > base height + 0.05m
Debugging Approach
Step 1: Environment Configuration Issues
Problem: MimicGen annotation failed with “The final task was not completed” error.
Root Cause Analysis:
- Missing `lift_cube` observation function in environment
- Height threshold too strict for actual cube size
Solution: Added lift_cube observation function:
```python
def lift_cube(
    env: ManagerBasedRLEnv,
    cube_cfg: SceneEntityCfg = SceneEntityCfg("cube"),
    robot_cfg: SceneEntityCfg = SceneEntityCfg("robot"),
    robot_base_name: str = "base",
    height_threshold: float = 0.05,
) -> torch.Tensor:
    """Check if the cube is lifted above the robot base."""
    cube: RigidObject = env.scene[cube_cfg.name]
    robot: Articulation = env.scene[robot_cfg.name]

    cube_height = cube.data.root_pos_w[:, 2]
    base_index = robot.data.body_names.index(robot_base_name)
    robot_base_height = robot.data.body_pos_w[:, base_index, 2]

    above_base = cube_height - robot_base_height > height_threshold
    return above_base
```
Critical Discovery: The default height threshold (0.20m) was too strict for the actual cube size.
Investigation Process:
- Examined cube model file:
/assets/scenes/table_with_cube/cube/model.xml - Found actual dimensions: 0.015077m × 0.015077m × 0.015077m (1.5cm cube)
- Calculated appropriate threshold: 0.05m (3.3× cube height)
Configuration Update:
```python
# Updated threshold in both environments
height_threshold: float = 0.05  # Changed from 0.20m
```
Step 3: MimicGen Configuration Requirements
Problem: Assertion error during generation: “assert subtask_configs[-1].subtask_term_offset_range[0] == 0”
Root Cause: Final subtask had incorrect offset range configuration.
MimicGen Requirements:
- Intermediate subtasks: Can have termination signals and offset ranges
- Final subtask: Must have `subtask_term_signal=None` and `subtask_term_offset_range=(0, 0)`
Solution:
```python
# Final subtask configuration
subtask_configs.append(
    SubTaskConfig(
        object_ref="cube",
        subtask_term_signal=None,          # No termination signal for final subtask
        subtask_term_offset_range=(0, 0),  # Required by MimicGen
        description="Lift cube",
        next_subtask_description=None,
    )
)
```
Critical Discovery: Environment Compatibility
The AttributeError Challenge
Problem:
```
AttributeError: 'ManagerBasedRLLeIsaacMimicEnv' object has no attribute 'scene'
```
Root Cause: The MimicGen environment has a different internal structure than the regular environment.
Solution: Added compatibility handling in termination functions:
```python
# Handle both regular env and mimic env
if hasattr(env, 'scene'):
    num_envs = env.num_envs
    device = env.device
    cube: RigidObject = env.scene[cube_cfg.name]
    robot: Articulation = env.scene[robot_cfg.name]
else:
    # For mimic environments, try alternative access patterns
    num_envs = getattr(env, '_num_envs', 1)
    device = getattr(env, '_device', torch.device('cuda'))
    scene = getattr(env, '_scene', None) or getattr(env, 'env', None)
    if scene is None:
        return torch.tensor([True], dtype=torch.bool, device=device)
    cube: RigidObject = scene[cube_cfg.name]
    robot: Articulation = scene[robot_cfg.name]
```
The Solution: Complete Pipeline Implementation
Step 1: Convert to IK Actions
```bash
/home/vipin/IsaacSim/_build/linux-x86_64/release/python.sh scripts/mimic/eef_action_process.py \
    --input_file=./datasets/so101_lift_cube.hdf5 \
    --output_file=./datasets/processed_so101_lift_cube.hdf5 \
    --to_ik \
    --device=cuda \
    --headless
```
Result: 6D joint actions → 8D end-effector actions (7 pose + 1 gripper)
Step 2: Annotate Subtasks
```bash
/home/vipin/IsaacSim/_build/linux-x86_64/release/python.sh scripts/mimic/annotate_dataset.py \
    --input_file=./datasets/processed_so101_lift_cube.hdf5 \
    --output_file=./datasets/annotated_so101_lift_cube.hdf5 \
    --task=LeIsaac-SO101-LiftCube-Mimic-v0 \
    --device=cuda \
    --headless
```
Result: Added subtask termination signals for automatic segmentation.
Step 3: Generate Augmented Data
```bash
/home/vipin/IsaacSim/_build/linux-x86_64/release/python.sh scripts/mimic/generate_dataset.py \
    --input_file=./datasets/annotated_so101_lift_cube.hdf5 \
    --output_file=./datasets/generated_so101_lift_cube.hdf5 \
    --task=LeIsaac-SO101-LiftCube-Mimic-v0 \
    --num_demos=10 \
    --device=cuda \
    --headless
```
Result: 1 demonstration → 10 augmented demonstrations (71.4% success rate)
Step 4: Convert Back to Joint Actions
```bash
/home/vipin/IsaacSim/_build/linux-x86_64/release/python.sh scripts/mimic/eef_action_process.py \
    --input_file=./datasets/generated_so101_lift_cube.hdf5 \
    --output_file=./datasets/final_so101_lift_cube.hdf5 \
    --to_joint \
    --device=cuda \
    --headless
```
Result: 8D end-effector actions → 6D joint actions ready for training.
Technical Implementation Details
Dataset Pipeline Summary
| Stage | File | Size | Episodes | Action Dim | Description |
|---|---|---|---|---|---|
| Original | `so101_lift_cube.hdf5` | 239.5 MB | 1 | 6D | Recorded demonstration |
| IK Converted | `processed_so101_lift_cube.hdf5` | 74.2 MB | 1 | 8D | End-effector actions |
| Annotated | `annotated_so101_lift_cube.hdf5` | 74.4 MB | 1 | 8D | With subtask signals |
| Generated | `generated_so101_lift_cube.hdf5` | 732.5 MB | 10 | 8D | Augmented data |
| Final | `final_so101_lift_cube.hdf5` | 732.5 MB | 10 | 6D | Ready for training |

LeRobot Conversion
```bash
conda activate lerobot
python scripts/convert/isaaclab2lerobot.py
```
Configuration:
```python
repo_id = 'sparkmt/so101_lift_cube_mimicgen'
robot_type = 'so101_follower'
fps = 30
task = 'Lift cube from table using MimicGen augmented data'
```
Result: 21 MB LeRobot v3.0 dataset with AV1-compressed videos.
Results and Validation
Data Augmentation Success
- Input: 1 original demonstration
- Output: 10 augmented demonstrations
- Success Rate: 71.4% (10 successful / 14 total attempts)
- Data Multiplication: 10× increase in training data
- Total Dataset: 11 demonstrations (1 original + 10 generated)
Generation Statistics
MimicGen Generation Results:
- Total attempts: 14
- Successful generations: 10
- Failed generations: 4
- Success rate: 71.4%
- Average generation time: ~2 minutes per demo
Dataset Quality Metrics
- Action Diversity: Generated demonstrations show variations in:
- Object positions and orientations
- Robot approach trajectories
- Timing and velocity profiles
- Grasp configurations
- Task Consistency: All generated demos maintain task structure
- Physical Validity: All actions respect robot constraints
LeRobot Dataset Structure
```
sparkmt/so101_lift_cube_mimicgen/
├── data/   (372K) - Parquet files with actions/states
├── videos/ (20M)  - AV1-compressed dual camera videos
├── meta/   (84K)  - Dataset metadata and statistics
└── images/ (12K)  - Sample images
```
Compression Efficiency: 732.5 MB HDF5 → 21 MB LeRobot (~97% size reduction)
Technical Insights
1. Subtask Configuration is Critical
- Intermediate subtasks: Require termination signals for segmentation
- Final subtask: Must have
subtask_term_signal=Noneandsubtask_term_offset_range=(0, 0) - Height thresholds: Must match actual object dimensions, not arbitrary values
2. Environment Compatibility Matters
- MimicGen environments have different internal structure
- Always check for attribute existence before accessing
- Provide fallback mechanisms for different environment types
3. Action Space Consistency
- MimicGen requires IK actions (8D) for generation
- Training requires joint actions (6D)
- Conversion steps are essential for pipeline success
4. Success Rate Expectations
- 71.4% success rate is considered good for MimicGen
- Failed generations often due to:
- Collision detection
- Unreachable configurations
- Timing constraints
Current Status
- ✅ Complete MimicGen pipeline implemented
- ✅ 10× data augmentation achieved (1 → 10 demonstrations)
- ✅ LeRobot v3.0 dataset created and optimized
- ✅ All debugging challenges resolved
- ✅ Comprehensive documentation and reproducible workflow
- 🔄 Ready for imitation learning policy training
Summary
This project successfully implemented a complete MimicGen data augmentation pipeline, transforming a single demonstration into 10 diverse training examples. The systematic debugging approach revealed critical requirements for MimicGen configuration, including proper subtask termination signals, accurate height thresholds, and environment compatibility handling.
The pipeline achieved a 71.4% generation success rate and produced a high-quality dataset with 10× data augmentation. The final LeRobot dataset provides rich training data with dual camera observations and diverse manipulation trajectories, all compressed efficiently using modern video codecs.
Key technical contributions include environment compatibility fixes, proper subtask configuration, and automated conversion between action spaces. The debugging infrastructure and systematic approach provide a foundation for scaling to more complex manipulation tasks.
The project demonstrates the power of automated data augmentation for robotic learning, reducing the manual data collection burden while increasing training data diversity and quality.
Next: Training imitation learning policies on the augmented dataset and comparing performance against single-demonstration baselines.
Framework: Isaac Sim 5.0 + Isaac Lab + MimicGen
Data Pipeline: HDF5 → LeRobot v3.0 (21 MB, AV1-compressed)
Hardware: SO-101 Robot Arm, RTX 4080 Super
Success Metrics: 10× data augmentation, 71.4% generation success rate
-
Building a Real-to-Sim Digital Twin for SO-101 Robot Arm in Isaac Sim
10/06/2025 at 06:50
Project Overview
I worked on implementing a real-to-sim digital twin system for an SO-101 robotic arm using NVIDIA Isaac Sim 4.5.0. The goal was to create a virtual replica that mirrors the physical robot’s movements in real-time, enabling simultaneous control of both real and virtual arms through a leader-follower teleoperation setup.
This project log documents the complete setup process, debugging challenges, and the implementation of a robust ROS2-based communication pipeline between physical hardware and Isaac Sim.
Hardware Setup
- Robot: SO-100/SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Control System: Leader-follower teleoperation setup
- Device Mappings:
- `/dev/leader` - Leader arm for human teleoperation
- `/dev/follower` - Follower arm (physical robot being controlled)
- `/dev/wrist` - Wrist-mounted camera (video0)
- `/dev/scene` - Scene overview camera (video2)
- GPU: RTX 4080 Super with 16GB VRAM
- Software: Isaac Sim 4.5.0, ROS2 Humble, CycloneDDS
The Challenge
The objective was to create a digital twin where:
- Leader arm movements control both physical follower arm AND virtual follower arm
- Real-time synchronization with minimal latency
- Proper joint state feedback from physical to virtual robot
- Seamless integration with existing teleoperation workflow
Architecture Overview
The system uses a three-component architecture:
Component 1: Teleoperation System
- Reads leader arm positions from
/dev/leader - Controls physical follower arm via existing teleoperation
Component 2: Joint State Bridge
- Reads actual follower arm positions from
/dev/follower - Publishes joint states to ROS2 topics
- Bridges physical robot data to Isaac Sim
Component 3: Isaac Sim Digital Twin
- Subscribes to joint state commands
- Renders virtual robot matching physical movements
- Provides visual feedback and simulation capabilities
Debugging Process
Issue 1: GLIBCXX Library Version Conflicts
When attempting to run ROS2 nodes from the Isaac Sim conda environment, I encountered:
```
ImportError: /home/vipin/miniconda3/envs/isaacsim/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/_rclpy_pybind11.cpython-310-x86_64-linux-gnu.so)
```
Root Cause: The conda environment's `libstdc++` (version 3.4.26) was older than what the system ROS2 required (3.4.30).

Investigation:
```bash
# Conda environment library
strings /home/vipin/miniconda3/envs/isaacsim/lib/libstdc++.so.6 | grep GLIBCXX | tail -1
# Output: GLIBCXX_3.4.26

# System library
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX | tail -1
# Output: GLIBCXX_3.4.30
```
Solution: Use Isaac Sim’s internal ROS2 libraries instead of system installation:
```bash
# Configure Isaac Sim with internal ROS2 libraries
export isaac_sim_package_path=$(dirname $(which isaacsim))/../lib/python3.10/site-packages/isaacsim
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$isaac_sim_package_path/exts/isaacsim.ros2.bridge/humble/lib
export PYTHONPATH=$isaac_sim_package_path/exts/isaacsim.ros2.bridge/humble:$PYTHONPATH
```
This approach avoided library conflicts while maintaining Isaac Sim stability.
Issue 2: Network Topic Interference
During initial testing, I discovered unexpected joint states:
```bash
ros2 topic echo /joint_states --once
# Output showed wheel joints, gripper extension, head swivel - not arm joints!
```
Root Cause: Another machine on the network was publishing robot topics to the same ROS_DOMAIN_ID.
Solution: Network isolation using unique domain ID:
```bash
export ROS_DOMAIN_ID=42   # Isolated domain
ros2 topic list           # Clean output: only local topics
```
Issue 3: Joint Name Mismatch
The most critical debugging challenge was joint name inconsistency. Isaac Sim’s ArticulationController was throwing warnings:
[Warning] [omni.graph.core.plugin] /so101_new_calib/ROS_JointStates/ArticulationController: [/so101_new_calib/ROS_JointStates] OmniGraph Warning: 'joint_1'
Investigation: Compared published vs expected joint names:
```bash
# Physical arm publishing
ros2 topic echo /isaac_joint_command --once
name: [joint_1, joint_2, joint_3, joint_4, joint_5, joint_6]

# Isaac Sim expecting
ros2 topic echo /isaac_joint_states --once
name: [Rotation, Pitch, Elbow, Wrist_Pitch, Wrist_Roll, Jaw]
```
Root Cause: Joint state reader was using generic names instead of Isaac Sim’s actual joint names.
Solution: Updated joint state reader configuration:
```python
# Fixed joint names to match Isaac Sim robot
self.joint_names = [
    'Rotation',     # Base rotation
    'Pitch',        # Shoulder pitch
    'Elbow',        # Elbow
    'Wrist_Pitch',  # Wrist pitch
    'Wrist_Roll',   # Wrist roll
    'Jaw'           # Gripper
]
```
Issue 4: Device Port Conflicts
Initially attempted to read from `/dev/leader`, but this caused conflicts:
```python
# Problematic - conflicts with teleoperation
self.serial_port = serial.Serial('/dev/leader', 1000000, timeout=0.1)
```
Root Cause: The teleoperation system was already reading from `/dev/leader`. Multiple processes accessing the same serial port caused communication failures.

Solution: Read from the follower arm instead:
```python
# Correct approach - read actual follower positions
self.serial_port = serial.Serial('/dev/follower', 1000000, timeout=0.1)
```
This provides the actual controlled arm positions for digital twin synchronization.
Implementation Details
Joint State Reader Architecture
The core component reads physical robot joint positions and publishes them for Isaac Sim:
```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import JointState
import serial

class JointStateReader(Node):
    def __init__(self):
        super().__init__('joint_state_reader')

        # Publisher for Isaac Sim joint commands
        self.joint_pub = self.create_publisher(JointState, '/isaac_joint_command', 10)

        # Connect to follower arm for position feedback
        self.serial_port = serial.Serial('/dev/follower', 1000000, timeout=0.1)

        # Timer for 20Hz publishing rate
        self.timer = self.create_timer(0.05, self.read_and_publish)
```
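The 20Hz timer callback itself only has to pack the latest follower positions into a `JointState` message. A sketch, with the Feetech servo serial protocol hidden behind a hypothetical `read_follower_positions()` helper:

```python
# Sketch of the 20Hz callback on JointStateReader. read_follower_positions()
# is a hypothetical helper that hides the servo serial protocol and returns
# six values ordered to match self.joint_names.
def read_and_publish(self):
    msg = JointState()
    msg.header.stamp = self.get_clock().now().to_msg()
    msg.name = self.joint_names
    msg.position = [float(p) for p in self.read_follower_positions()]
    self.joint_pub.publish(msg)
```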
ROS2 Topic Architecture
The system uses a clean topic structure:
- `/joint_states` - Standard ROS2 joint states (from physical robot)
- `/isaac_joint_command` - Commands sent to Isaac Sim virtual robot
- `/isaac_joint_states` - Feedback from Isaac Sim virtual robot
- `/robot/cmd_pose` - IK teleoperation commands (future expansion)
Isaac Sim Configuration
Isaac Sim loads with internal ROS2 libraries:
```
[23.352s] [ext: isaacsim.ros2.bridge-4.1.15] startup
[23.370s] Using backup internal ROS2 humble distro
[23.396s] Attempting to load system rclpy
[23.396s] Could not import system rclpy: No module named 'rclpy'
[23.396s] Attempting to load internal rclpy
[23.405s] rclpy loaded
```
This confirms successful loading of Isaac Sim’s bundled ROS2 libraries, avoiding system conflicts.
Current Status
✅ Real-to-sim synchronization working: Virtual arm mirrors physical follower arm movements in real-time
✅ Joint state publishing: 20Hz update rate from physical robot to Isaac Sim
✅ Network isolation: Clean ROS2 topic namespace without interference
✅ Library conflicts resolved: Isaac Sim uses internal ROS2 libraries
✅ Proper device mapping: Reading from `/dev/follower` without conflicts
Demonstration
[VIDEO PLACEHOLDER: Leader arm controlling real + virtual follower for pick-and-place task]
The video demonstrates:
- Operating the leader arm to control movements
- Real follower arm and virtual Isaac Sim arm moving in perfect synchronization
- Pick and place task: picking up a striped object and placing it into a white tray
- Real-time response with minimal latency between physical and virtual robots
Advantages of This Approach
- True Digital Twin: Virtual robot reflects actual physical robot state, not just commands
- Debugging Capability: Visual feedback in Isaac Sim helps debug physical robot issues
- Simulation Validation: Test control algorithms in simulation with real robot data
- Training Data Generation: Record synchronized real/virtual data for ML training
- Safety Monitoring: Virtual representation provides additional safety oversight
- Scalability: Can extend to multiple robots or complex multi-robot scenarios
Technical Notes
ROS2 Bridge Verification
Always verify topic subscription counts to confirm Isaac Sim connectivity:
```bash
ros2 topic info /isaac_joint_command
# Publisher count: 1 (joint state reader)
# Subscription count: 1 (Isaac Sim when playing)
```
Joint Name Debugging Strategy
- Check what Isaac Sim publishes:
ros2 topic echo /isaac_joint_states --once - Match physical robot publisher to those exact names
- Verify ArticulationController warnings disappear
Library Conflict Prevention
- Never upgrade conda environment’s
libstdc++- it will break Isaac Sim - Use Isaac Sim’s internal ROS2 libraries for compatibility
- Test with a simple `ros2 topic list` before complex operations
Usage Commands
Start Isaac Sim (with ROS2 bridge):
```bash
conda activate isaacsim
isaacsim   # ROS2 bridge loads automatically
```
Run joint state reader:
```bash
source /opt/ros/humble/setup.bash
cd /home/vipin/so-arm101-ros2-bridge
/usr/bin/python3.10 src/jointstatereader/jointstatereader/joint_state_reader.py
```
Monitor synchronization:
```bash
source /opt/ros/humble/setup.bash
ros2 topic hz /isaac_joint_command   # Should show ~20Hz
```
Summary
This project successfully implemented a real-to-sim digital twin system for robotic manipulation, overcoming significant technical challenges in library compatibility, network isolation, and joint name mapping. The solution provides a robust foundation for advanced robotics applications including simulation-based training, safety monitoring, and algorithm validation.
The key insight was recognizing that true digital twin functionality requires reading the actual robot state (from `/dev/follower`) rather than just command signals, providing an accurate virtual representation of physical robot behavior.
Next: Implementing bidirectional control for sim-to-real policy deployment and testing advanced manipulation tasks with synchronized real/virtual feedback.
Hardware: SO-100/SO-101 Robot Arm, RTX 4080 Super
Software: Isaac Sim 4.5.0, ROS2 Humble, CycloneDDS
Repository: so-arm101-ros2-bridge
-
Debugging Robot “Twitching” in GR00T N1.5 Deployment
10/05/2025 at 16:13
Project Overview
I worked on debugging a puzzling issue where a fine-tuned NVIDIA GR00T N1.5 model was causing an SO-100 robotic arm to “twitch” instead of performing pick-and-place tasks. The robot would make tiny oscillating movements around the same position, with the gripper staying completely unresponsive.
This project log documents the systematic debugging process that revealed the root cause: an undertrained model that needed significantly more training steps to learn the complete task sequence.
Hardware Setup
- Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (scene + wrist) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Dataset: 20 episodes of pick-and-place demonstrations
- Task: “Pick up the striped block and put it into the white plate”
The Problem: Robot Twitching
Initial Symptoms
When deploying the trained GR00T model:
- Robot connected successfully
- Model inference server running correctly
- Robot made tiny oscillating movements around the same position
- Robot was not executing the intended pick-and-place task
The model had been trained for 2000 steps and showed good loss convergence, but the physical deployment was completely unsuccessful.
Debugging Approach
Step 1: Enhanced Logging Implementation
Added comprehensive logging to both the inference server and robot client to understand what data was being exchanged.
Server-Side Logging (`service.py`):
- Request counter for each inference call
- Input data keys and shapes
- Inference time in milliseconds
- Output action statistics (min/max/mean values)
Client-Side Logging (`eval_lerobot.py`):
- Step counter and observation keys
- Current robot state (all 6 joints)
- Received action chunks from server
- First action being sent to robot
Example Output:
```
[Request #1] Endpoint: get_action
  Inference time: 75.23ms
  Response keys: ['action.single_arm', 'action.gripper']
  action.single_arm: shape=(16, 5), min=-45.23, max=67.89, mean=12.34
  action.gripper: shape=(16, 1), min=-0.30, max=0.50, mean=0.15

[CLIENT] First action to send to robot:
  shoulder_pan.pos: -12.34
```
Step 2: Diagnostic Tools Development
Created several diagnostic scripts to isolate the issue:
Joint Testing Tool (`test_joint.py`):
- Tests individual joint control to verify hardware functionality
- Takes joint number (1-6) and value (-100 to 100) as input
- Helps isolate hardware vs. software issues
Robot State Monitor (`monitor_robot_state.py`):
- Real-time monitoring of robot joint positions
- Verifies encoder readings match values sent to server
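For completeness, the joint-testing tool is little more than an argument parser around a single hardware write; a rough sketch of its shape (the `send_joint_command` helper is a stand-in for the project's SO-100 driver call, not a real API):

```python
# Rough sketch of test_joint.py: drive a single joint to a target value to
# separate hardware faults from policy problems. send_joint_command() is a
# stub standing in for the project's SO-100 serial driver call.
import argparse

def send_joint_command(joint_id: int, value: float) -> None:
    print(f"[stub] joint {joint_id} -> {value}")  # real script writes to the servo bus here

def main():
    parser = argparse.ArgumentParser(description="Drive one joint to a target value.")
    parser.add_argument("joint", type=int, choices=range(1, 7), help="Joint number (1-6)")
    parser.add_argument("value", type=float, help="Target value in [-100, 100]")
    args = parser.parse_args()
    send_joint_command(args.joint, max(-100.0, min(100.0, args.value)))

if __name__ == "__main__":
    main()
```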
Step 3: Dataset Visualization
Uploaded the dataset to Hugging Face Hub and used Rerun visualization to inspect the recorded episodes:
```bash
# Upload dataset for analysis
python scripts/so100_groot/upload_to_huggingface.py \
    --local-dir ~/.cache/huggingface/lerobot/rubbotix/striped-block \
    --repo-id sparkmt/so100-striped-block

# Visualize episodes
./scripts/so100_groot/visualize_episodes.sh 0
```
This revealed the difference between State (robot’s actual position) and Action (commanded target position), which was crucial for diagnosis.
Critical Discovery: The Root Cause
Key Finding from Logs
The robot was making very small, uncertain movements instead of decisive actions. The logging revealed that the model was outputting actions with very small magnitudes, indicating high uncertainty.
The Root Cause: Undertrained Model
Analysis revealed that the model was severely undertrained at 2000 steps.
Evidence:
- Tiny action magnitudes: Model outputting very small actions due to high uncertainty
- Lack of task structure understanding: Model hadn’t learned the full sequence (approach → grasp → lift → move → release)
- Closed-loop instability: Small errors accumulating, causing the robot to end up in states the model never saw during training
The Solution: Extended Training
Training Requirements Analysis
Task Complexity        Minimum Steps     Recommended Steps
Simple reaching        1,000-2,000       5,000
Pick and place         5,000-10,000      10,000-20,000
Complex manipulation   10,000-20,000     20,000-50,000

The pick-and-place task required 10,000-20,000 steps, not the 2000 steps initially used.
Training Configuration Update
Updated the training script to resume from checkpoint-2000 and continue to 10,000 steps:
# Resume training configuration
RESUME_TRAINING="true"
MAX_STEPS=10000                  # Increased from 2000
BATCH_SIZE=16
GRADIENT_ACCUMULATION_STEPS=8
LORA_RANK=32
LORA_ALPHA=64
Automatic Checkpoint Detection:
if [ "$RESUME_TRAINING" = "true" ]; then if ls "$OUTPUT_DIR"/checkpoint-* 1> /dev/null 2>&1; then LATEST_CHECKPOINT=$(ls -td "$OUTPUT_DIR"/checkpoint-* | head -1) echo "Resuming from latest checkpoint: ${LATEST_CHECKPOINT}" TRAIN_CMD="$TRAIN_CMD --resume" fi fiWhy 2000 Steps Was Insufficient
1. Model Hadn’t Learned Task Structure
- At 2000 steps: Learning basic correlations between observations and actions
- Missing: Full sequence understanding of the manipulation task
2. Action Magnitude Learning
From deployment logs at 2000 steps, the model was outputting very small actions because:
- Hadn’t learned correct action scale
- Being overly cautious due to high uncertainty
- Loss function hadn’t fully converged
3. Closed-Loop Instability
- Small errors accumulate: Undertrained model makes uncertain movements
- Compounding problem: Robot ends up in states model never saw during training
- Result: Model gets “confused” and twitches in place
Technical Implementation Details
Enhanced Logging Code
Server-side logging addition:
logger.info(f"[Request #{request_counter}] Endpoint: {endpoint}") logger.info(f" Data keys: {list(data.keys())}") logger.info(f" Inference time: {inference_time:.2f}ms") for key, value in response.items(): if isinstance(value, np.ndarray): logger.info(f" {key}: shape={value.shape}, min={value.min():.2f}, max={value.max():.2f}, mean={value.mean():.2f}")Client-side logging addition:
logger.info(f"[STEP {step_count}] Getting observation...") logger.info(f" Current robot state:") for key, value in current_state.items(): logger.info(f" {key}: {value:.2f}") logger.info(f"[CLIENT] First action to send to robot:") for key, value in first_action.items(): logger.info(f" {key}: {value:.2f}")Dataset Upload and Visualization
Created tools for dataset management and analysis:
# Upload script for Hugging Face Hub
import os

from huggingface_hub import HfApi

def upload_dataset(local_dir, repo_id):
    # Validate dataset structure
    required_files = ['meta/info.json', 'meta/stats.json', 'meta/tasks.parquet']
    for file in required_files:
        if not os.path.exists(os.path.join(local_dir, file)):
            raise FileNotFoundError(f"Required file {file} not found")

    # Create repository and upload
    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="dataset")
Training Progress
After resuming training from 2000 to 10,000 steps:
- Significant MSE improvement: From ~24 at 2,000 steps to ~6.3 at 10,000 steps
- Loss continued to decrease: Model learned more complex patterns
- Action magnitudes increased: Actions became more decisive
- Task structure emerged: Model learned the complete manipulation sequence
Deployment Results
With the extended training at 10,000 steps:
- Task execution achieved: Robot now performs the complete sequence (approach → open → grasp → lift → move → release)
- Mixed joint performance: Some joints (1, 2, 3, and 5) showed accurate predictions matching ground truth, while others (joints 0 and 4) had less precise control
- Execution challenges: Task completion takes 3+ minutes with multiple retries due to shaky movements
- No more twitching: Robot executes purposeful movements instead of oscillating in place
Performance Assessment
The model demonstrates partial success:
- ✅ Complete task sequence understanding
- ✅ Elimination of twitching behavior
- ⚠️ Uneven accuracy across different joints
- ⚠️ Execution speed and precision need improvement
- ⏳ Further iteration required for reliable performance
Technical Insights
1. Training Duration is Critical
- 2000 steps = memorized patterns (MSE ~24, twitching behavior)
- 10,000 steps = learned task structure (MSE ~6.3, complete sequence execution)
- Manipulation tasks require significantly more training than simple reaching
- Even at 10,000 steps, performance varies across joints, suggesting more training may be beneficial
2. Logging is Essential for Debugging
- Without detailed logs, impossible to diagnose model-robot mismatch
- Action statistics (min/max/mean) reveal model confidence levels
- State vs. action comparison shows tracking performance
3. Visualization Tools are Invaluable
- Dataset visualization revealed data quality and action ranges
- State vs. Action plots diagnosed tracking issues
- Essential for understanding model behavior
Current Status
- Extended training completed (2000 → 10,000 steps)
- MSE improved from ~24 to ~6.3 (74% improvement)
- Robot deployment shows partial success with complete task sequence execution
- Performance varies across joints with some showing accurate control while others need improvement
- Comprehensive debugging infrastructure in place
- Dataset published to Hugging Face Hub: sparkmt/so100-striped-block
Summary
This debugging session demonstrated that what appeared to be a complex hardware or software integration issue was actually a fundamental training problem. The “twitching” behavior was caused by an undertrained model that hadn’t learned the complete task structure.
The systematic debugging approach using enhanced logging, diagnostic tools, and dataset visualization was crucial for identifying the root cause. The solution required extending training from 2000 to 10,000 steps, resulting in a 74% improvement in MSE (from ~24 to ~6.3) and enabling the robot to execute the complete pick-and-place sequence.
While the model now performs the full task (approach → open → grasp → lift → move → release), execution remains slow and imprecise, with uneven performance across different joints. This suggests that further data collection and training iterations will be needed to achieve reliable, smooth manipulation.
The project demonstrates the iterative nature of robotic AI development and the importance of adequate training duration for manipulation tasks. The debugging infrastructure and systematic approach provide a foundation for continued improvement.
Next: Collecting additional training episodes and exploring Isaac Sim integration for synthetic data generation.
Model: NVIDIA GR00T N1.5 (3B parameters)
Training Method: LoRA fine-tuning (extended to 10,000 steps)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Framework: Isaac-GR00T + LeRobot -
Fine-Tuning GR00T N1.5 for SO-100 Robot Arm Manipulation
10/05/2025 at 15:59 • 0 commentsProject Overview
I worked on fine-tuning NVIDIA’s GR00T N1.5 model for controlling an SO-100 robotic arm. The project involved dataset preparation, memory optimization for 16GB VRAM constraints, model training with LoRA techniques, and deployment setup for real-world robot control.
The goal was to train the model to perform pick-and-place manipulation tasks using the instruction “pick up the striped box and put it into the white plate” with dual-camera visual input.
Hardware Setup
- Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Dataset: 20 episodes, 5,197 frames of manipulation demonstrations
- Model: NVIDIA GR00T N1.5 (3B parameters)
Dataset Preparation and Debugging
Issue 1: Blank Visualization Plots
The dataset visualization script displayed blank canvases for state/action plots.
Root Cause: The script had hardcoded humanoid robot keys (left_arm, right_arm, left_hand, right_hand) while the SO-100 dataset uses different keys (single_arm, gripper).
Solution: Modified the visualization function to auto-detect keys from the dataset:
# Before: hardcoded humanoid keys
shared_keys = ["left_arm", "right_arm", "left_hand", "right_hand"]

# After: auto-detect from dataset
if shared_keys is None:
    shared_keys = [key.replace("state.", "") for key in state_dict.keys()]
    print(f"Auto-detected keys to plot: {shared_keys}")
Issue 2: Camera Mapping Discrepancy
The visualization showed the wrist camera perspective when it should have shown the scene camera.
Investigation: Checked the dataset's modality.json mappings and discovered that during data collection, the camera naming was swapped:
- observation.images.main was actually the wrist/gripper camera
- observation.images.secondary_0 was actually the scene camera
Solution: Corrected the mappings in modality.json:
"video": {
    "front": {"original_key": "observation.images.secondary_0"},  // Scene camera
    "wrist": {"original_key": "observation.images.main"}          // Wrist camera
}
Verification: Created a diagnostic script that confirmed the mapping correction by comparing raw video frames with dataset loader output.
Issue 3: Missing Video Metadata
Dataset loading failed due to missing video metadata fields.
Solution: Added the required fields to info.json:
info['features'][key]['info']['video.channels'] = 3
info['features'][key]['info']['video.height'] = 720
info['features'][key]['info']['video.width'] = 1280
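The fields were patched by hand here; a short script along these lines (a sketch with an assumed dataset path) makes the fix repeatable across datasets:
# Sketch: add the missing video metadata fields to every video feature in
# info.json. The dataset path is an assumption for illustration.
import json

info_path = "demo_data/example_dataset/meta/info.json"

with open(info_path) as f:
    info = json.load(f)

for key, feature in info["features"].items():
    if key.startswith("observation.images."):   # only patch video features
        feature.setdefault("info", {})
        feature["info"]["video.channels"] = 3
        feature["info"]["video.height"] = 720
        feature["info"]["video.width"] = 1280

with open(info_path, "w") as f:
    json.dump(info, f, indent=2)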
Memory Optimization Challenge
The Problem: CUDA Out of Memory
Initial training attempts all failed with out-of-memory errors, even with very small batch sizes:
Attempt   Batch Size   Gradient Accum   Result
1         64           2                OOM at step 0
2         32           4                OOM at step 0
3         16           8                OOM at step 0
4         8            16               OOM at step 0
5         4            32               OOM at step 0
6         2            64               OOM during optimizer step

Analysis: The base model has 3B parameters, plus a 550M parameter diffusion model. The Adam optimizer requires 2x memory for momentum and variance states, exceeding the 16GB VRAM limit.
Solution: LoRA Fine-Tuning
Implemented Low-Rank Adaptation (LoRA) to reduce trainable parameters:
LoRA Configuration:
--lora-rank 32             # Size of low-rank adaptation matrices
--lora-alpha 64            # Scaling factor (typically 2x rank)
--lora-dropout 0.1         # Regularization
--no-tune_diffusion_model  # Freeze 550M parameter diffusion model
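Isaac-GR00T applies LoRA internally when these flags are passed; purely as an illustration of the mechanism, the same rank/alpha/dropout settings can be reproduced with Hugging Face PEFT on a stand-in backbone (the GPT-2 model and c_attn target module below are placeholders, not GR00T's actual modules):
# Illustration only: LoRA via Hugging Face PEFT on a small stand-in model,
# using the same rank/alpha/dropout as the training flags above.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in backbone

lora_cfg = LoraConfig(
    r=32,                       # --lora-rank 32
    lora_alpha=64,              # --lora-alpha 64 (2x rank)
    lora_dropout=0.1,           # --lora-dropout 0.1
    target_modules=["c_attn"],  # GPT-2 attention projections (placeholder)
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # shows trainable vs total parameters
The print_trainable_parameters() output makes the parameter reduction visible directly, which is the same effect the trainable-parameter numbers below describe for GR00T.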
Memory Savings:
- Full fine-tuning: ~200M trainable parameters
- LoRA fine-tuning: ~10M trainable parameters (20x reduction)
- Result: Fits in 16GB VRAM with batch_size=16
Training Configuration and Results
Final Training Setup
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/example_dataset/ \
    --num-gpus 1 \
    --output-dir ./so100-checkpoints \
    --max-steps 5000 \
    --data-config so100_dualcam \
    --batch-size 16 \
    --gradient-accumulation-steps 8 \
    --learning-rate 0.0001 \
    --no-tune_diffusion_model \
    --lora-rank 32 \
    --lora-alpha 64 \
    --lora-dropout 0.1
Key Parameters:
- Effective batch size: 128 (16 × 8 gradient accumulation)
- Training steps: 5000 planned, stopped early at 2340
- Checkpoints saved every 1000 steps
Training Progress
Loss Trajectory:
Step     Loss     Change from Previous
-----    -----    --------------------
500      0.080    (baseline)
1,000    0.050    -37.5% (strong improvement)
1,500    0.040    -20.0% (good improvement)
2,000    0.040    0.0% (plateau started)
2,340    0.038    -5.0% (minimal improvement)
Convergence Analysis: The loss plateaued around step 1500-2000, with minimal improvement in the last 840 steps. Training was stopped early to avoid overtraining on the small dataset.
Training Metrics:
- Training speed: ~12.76 seconds/step
- GPU memory usage: 7.07 GB before training
- Best checkpoint: checkpoint-2000 with stable performance
Model Evaluation
Open-Loop Evaluation Results
Evaluated the trained model using the official evaluation script:
python scripts/eval_policy.py --plot \
    --embodiment-tag new_embodiment \
    --model-path ./so100-checkpoints/checkpoint-2000 \
    --data-config so100_dualcam \
    --dataset-path ./demo_data/example_dataset/
Result: Unnormalized MSE of 0.017463
Performance Assessment:
- Excellent: MSE < 10
- Good: MSE 10-30
- Moderate: MSE 30-60
- Poor: MSE > 60
- Our Result: 0.017 (Outstanding performance)
The model predictions closely match ground truth actions, indicating readiness for real robot deployment.
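The evaluation script reports a single unnormalized MSE over all joints; the same metric can be broken down per joint, which becomes useful once some joints track better than others. A sketch of the computation (not eval_policy.py's actual internals):
import numpy as np

def unnormalized_mse(pred: np.ndarray, gt: np.ndarray):
    """MSE between predicted and ground-truth actions, overall and per joint.

    pred and gt are (num_steps, num_joints) arrays in raw (unnormalized)
    action units; this mirrors the metric's definition, not the script's code.
    """
    err = (pred - gt) ** 2
    return float(err.mean()), err.mean(axis=0)   # overall, per-joint

# Example with fake data for a 6-DOF arm
pred = np.random.randn(500, 6)
gt = pred + 0.1 * np.random.randn(500, 6)        # small errors -> small MSE
overall, per_joint = unnormalized_mse(pred, gt)
print(f"overall MSE: {overall:.6f}")
print("per-joint MSE:", np.round(per_joint, 6))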
Deployment Setup
Device Configuration
Verified robot and camera device mappings:
/dev/follower -> ttyACM4   # Robot motor bus
/dev/wrist    -> video0    # Wrist/gripper camera
/dev/scene    -> video2    # Scene/front camera
Client-Server Architecture
Terminal 1 - Inference Server:
python scripts/inference_service.py --server \
    --model-path ./so100-checkpoints/checkpoint-2000 \
    --embodiment-tag new_embodiment \
    --data-config so100_dualcam \
    --denoising-steps 4 \
    --port 5555
Terminal 2 - Robot Client:
python ./examples/SO-100/eval_lerobot.py \
    --robot.type=so100_follower \
    --robot.port=/dev/follower \
    --robot.id=my_so100_arm \
    --robot.cameras="{ wrist: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 2, width: 640, height: 480, fps: 30}}" \
    --policy_host=localhost \
    --lang_instruction="pick up the striped box and put it into the white plate"
Issues Fixed
- Import Error: Fixed incorrect import path in the robot client script
- Missing Dependency: Installed feetech-servo-sdk for robot communication
Technical Notes
Memory Optimization Strategies
Successful approaches:
- LoRA fine-tuning: Reduced trainable parameters by 20x
- Freezing diffusion model: Saved 550M parameters from training
- Gradient accumulation: Maintained effective batch size without memory overhead
- Memory fragmentation fix: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Unsuccessful approaches:
- Simply reducing batch size (still OOM with batch_size=2)
- Training full model even with frozen diffusion model
Training Efficiency Insights
- Small datasets (20 episodes) converge quickly (~2000 steps)
- Monitor loss curves and stop when plateau is reached
- Training loss of 0.04 with MSE 0.017 indicates excellent learning
- Avoid overtraining on small datasets
Dataset Quality Factors
- Camera mapping must match between training and deployment
- Robot calibration affects action space consistency
- 20 episodes sufficient for single task, more needed for multi-task
- Always visualize dataset before training
Current Status
- Dataset prepared and validated
- Model trained and converged (checkpoint-2000, loss: 0.04)
- Open-loop evaluation passed (MSE: 0.017)
- Inference server configured
- Robot client script ready
- Pending: Complete robot calibration and test real-world deployment
Summary
This project successfully fine-tuned a 3B parameter vision-language-action model for robotic manipulation within 16GB VRAM constraints. The key breakthrough was using LoRA fine-tuning to reduce memory requirements while maintaining training effectiveness.
The trained model achieved excellent evaluation metrics (MSE: 0.017) and is ready for real-world deployment. The systematic approach to dataset debugging, memory optimization, and deployment setup provides a foundation for future robotic AI projects.
Next: Testing the trained model on physical robot manipulation tasks.
Model: NVIDIA GR00T N1.5 (3B parameters)
Training Method: LoRA fine-tuning (rank 32)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Framework: Isaac-GR00T + LeRobot -
Debugging GR00T N1.5 Inference in Phosphobot
10/05/2025 at 15:55 • 0 commentsProject Overview
I worked on debugging inference issues with a fine-tuned NVIDIA GR00T N1.5 model for controlling an SO-100 robotic arm. The model was trained successfully and uploaded to HuggingFace Hub, but the PhosphoBot browser interface was failing during AI control activation.
This project log documents the debugging process, root cause analysis, and the implementation of an alternative local inference solution.
Hardware Setup
- Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (IDs 0 and 2) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Model: phospho-app/gr00t-example_dataset-h9g75u7gak (fine-tuned GR00T N1.5)
The Problem
When clicking “AI Control” in the PhosphoBot browser interface, the system reported:
Exception: No robot connected. Exiting AI control loop.
The robot was physically connected and visible in the UI, cameras were streaming successfully, and the model had been trained and uploaded to HuggingFace Hub. The issue appeared to be in the inference pipeline.
Debugging Process
Issue 1: Joint Count Mismatch
Added debug logging to understand the failure and discovered:
Connected joints: 6, Config joints: 1
Root Cause: The code was reading the model configuration incorrectly:
# Incorrect code
number_of_joints_in_config = len(
    config.embodiment.statistics.action.action_space.values()
)
This was counting dictionary keys (max, min, mean, std, q01, q99) instead of joint dimensions.
Model Config Structure:
{ "action_space": { "action_space": 6 } }Solution: Handle the nested dictionary structure correctly:
# Fixed code
action_space = config.embodiment.statistics.action.action_space

# Case 1: action_space is a dict with 'action_space' key containing the number
if isinstance(action_space, dict) and 'action_space' in action_space:
    number_of_joints_in_config = action_space['action_space']
# Case 2: action_space has 'max' or 'min' arrays
elif hasattr(action_space, 'max') and isinstance(action_space.max, list):
    number_of_joints_in_config = len(action_space.max)
# Additional fallback cases...
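A standalone check against the nested fragment above shows why the original code reported a single joint while the fix reads six (the dict here is just that fragment, not the full model config):
# The broken parser counts dict entries (1 here); the fixed parser reads the
# joint count (6), matching the "Connected joints: 6, Config joints: 1" log.
action_space = {"action_space": 6}

broken = len(action_space.values())                   # -> 1

if isinstance(action_space, dict) and "action_space" in action_space:
    fixed = action_space["action_space"]              # -> 6
else:
    fixed = len(getattr(action_space, "max", []))     # array-style fallback

print(f"broken parser: {broken} joint(s), fixed parser: {fixed} joints")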
Issue 2: Device Mismatch on Modal Server
After fixing the joint count, a new error appeared:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Root Cause:
- Model inference happens on Modal GPU server (remote)
- Some model components loaded on CPU, others on GPU
- Issue occurs in VLLN (Vision-Language Layer Norm) component
Attempted Fix: Added retry logic with exponential backoff to handle transient device issues:
max_retries = 3
retry_delay = 1.0  # seconds

for retry_attempt in range(max_retries):
    try:
        actions = self(inputs)
        break  # Success
    except RuntimeError as e:
        if "Expected all tensors to be on the same device" in str(e):
            if retry_attempt < max_retries - 1:
                logger.warning(f"Device mismatch error on attempt {retry_attempt + 1}/{max_retries}. Retrying in {retry_delay}s...")
                await asyncio.sleep(retry_delay)
                retry_delay *= 2  # Exponential backoff
Status: This helped with transient issues but didn't solve the root cause, which is on the Modal server side and not fixable from the client.
Alternative Solution: Local Inference
Since the PhosphoBot Modal server had device mismatch issues, I implemented a local inference solution using official Isaac-GR00T scripts.
Architecture: Client-Server Model
Based on NVIDIA’s official tutorial, the solution uses a two-terminal approach:
Terminal 1: Inference Server
- Loads GR00T model on local GPU
- Runs inference on observations
- Returns action predictions
- Uses ZMQ protocol for fast communication
Terminal 2: Robot Client
- Connects to SO-100 robot via USB
- Captures camera images
- Sends observations to server
- Executes returned actions
Implementation
Server Script (start_groot_server.sh):
#!/bin/bash
cd /home/vipin/Isaac-GR00T
conda activate gr00t

python scripts/inference_service.py \
    --server \
    --model_path "phospho-app/gr00t-example_dataset-h9g75u7gak" \
    --embodiment-tag "new_embodiment" \
    --data-config "so100_dualcam" \
    --denoising-steps 4 \
    --port 5555
Client Script (gr00t_inference_local.py):
- Based on official examples/SO-100/eval_lerobot.py
- Connects to SO-100 robot using LeRobot
- Initializes cameras (IDs 0 and 2)
- Connects to GR00T inference server
- Runs inference loop at 30 FPS
- Handles action horizon queuing
Configuration:
# Robot
ROBOT_TYPE = "so100_follower"
ROBOT_PORT = "/dev/ttyACM0"
ROBOT_ID = "so-100"

# Cameras
CAMERA_CONFIGS = {
    "front": {"type": "opencv", "index_or_path": 0, "width": 640, "height": 480, "fps": 30},
    "wrist": {"type": "opencv", "index_or_path": 2, "width": 640, "height": 480, "fps": 30},
}

# Task
LANG_INSTRUCTION = "pick up the striped box and put it into the white plate"
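The actual message format and client class are defined by Isaac-GR00T's inference service; the sketch below only illustrates the request/response pattern over ZMQ, with pyzmq's send_pyobj/recv_pyobj standing in for the repo's own serialization and the observation keys used as placeholders:
# Illustration of the ZMQ request/reply pattern only, not the repo's API.
import time
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)          # request/reply pattern
socket.connect("tcp://localhost:5555")    # same port the inference server uses

def get_action(observation: dict) -> dict:
    """Send one observation dict and block until the action chunk comes back."""
    socket.send_pyobj(observation)
    return socket.recv_pyobj()

# Skeleton of a ~30 FPS control loop (placeholder observations)
for _ in range(3):
    obs = {"video.front": None, "video.wrist": None, "state.single_arm": None}
    actions = get_action(obs)
    # execute the first action of the returned chunk on the robot here
    time.sleep(1 / 30)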
- No Modal server dependency: Runs entirely locally
- Full control over device placement: All tensors on GPU
- Easier debugging: Direct access to model and logs
- Lower latency: No network round-trip to Modal
- Official NVIDIA approach: Based on their tutorial
- More reliable: Better for production deployment
Disk Space Management
During the debugging process, encountered disk space issues (96GB disk at 100% capacity). Performed cleanup:
Actions taken:
- Cleaned conda package cache: conda clean --all --yes (freed 2.3GB)
- Removed HuggingFace XET cache: rm -rf ~/.cache/huggingface/xet (freed 6.0GB)
- Removed old LeRobot dataset versions (freed 1.3GB)
Result: Freed ~10GB total, bringing disk usage down to roughly 89%.
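The math behind that figure is easy to sanity-check with the standard library (freeing ~10GB on a ~96GB disk at 100% leaves usage around 89%):
# Quick disk-usage check with the standard library only.
import shutil

total, used, free = shutil.disk_usage("/")
gib = 1024 ** 3
print(f"total: {total / gib:.1f} GiB, used: {used / gib:.1f} GiB "
      f"({used / total:.0%}), free: {free / gib:.1f} GiB")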
Technical Notes
Model Configuration Debugging
- Always add debug logging before making assumptions
- Check data structures - don’t assume dictionary structure
- Handle multiple cases - model configs can vary
- Verify on both sides - client and server must agree on config
PhosphoBot + Modal Limitations
- Modal server is a black box - can’t fix server-side issues
- Device placement errors on remote server are hard to debug
- Network latency adds overhead
- Dependency on external service
Direct Inference Requirements
- Local GPU with sufficient VRAM (16GB for GR00T N1.5)
- Isaac-GR00T repository installed
- LeRobot for robot control
- Proper conda environment setup
Current Status
- Joint count mismatch fixed - correctly reads 6 joints from model config
- Debug logging added for comprehensive troubleshooting
- Device mismatch on Modal has retry logic but root cause remains on server
- Alternative local inference solution implemented and ready for testing
- Disk space cleaned - freed 10GB
Usage Commands
Start inference server:
cd /home/vipin/phosphobot
./start_groot_server.sh
Run robot client (separate terminal):
cd /home/vipin/phosphobot
conda activate gr00t
python gr00t_inference_local.py
Summary
This debugging session involved systematic troubleshooting of AI model inference issues, from configuration parsing problems to device placement errors on remote servers. The solution involved implementing a local inference architecture that provides better control and reliability for robotic manipulation tasks.
The local approach eliminates dependencies on external services and provides the foundation for more robust robotic AI applications.
Next: Testing the local inference system with actual robot manipulation tasks.
Model: NVIDIA GR00T N1.5 (3B parameters)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Software: Isaac-GR00T, LeRobot, PhosphoBot -
Debugging Dual-Camera Vision System for SO-101 Robotic Manipulation Platform
10/05/2025 at 15:51 • 0 commentsProject Overview
I worked on debugging a dual-camera vision system for my SO-101 robotic manipulation platform. The cameras were experiencing intermittent streaming failures that initially appeared to be software compatibility issues, but turned out to be caused by a faulty USB extension cable.
This project log documents the troubleshooting process, technical solutions implemented, and lessons learned while setting up the vision system for robotic data collection.
Hardware Setup
The SO-101 platform consists of:
- Dual robotic arms: Leader and follower configuration with USB serial communication
- Dual camera system:
- Wrist-mounted camera (640x480 @ 30fps) for end-effector view
- Scene camera (NexiGo N60 FHD, 1920x1080 capable) for workspace overview
- Target performance: Stable 30 FPS streaming for robotics data collection
The Problem: Intermittent Camera Failures
The scene camera was experiencing frustrating intermittent failures:
- Random “No such device” errors during streaming
- Inconsistent connection behavior
- Performance degradation over time
- Apparent timing-related issues
Initial symptoms pointed to software compatibility problems, leading me down a complex debugging path.
Solution 1: Persistent Device Management with udev Rules
First, I tackled device management by implementing comprehensive udev rules for consistent device naming:
# /etc/udev/rules.d/99-lerobot-so101.rules
SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="7523", ATTRS{serial}=="54A90068534", SYMLINK+="leader", MODE="0666"
SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="7523", ATTRS{serial}=="54A900685B4", SYMLINK+="follower", MODE="0666"
SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1bcf", ATTRS{idProduct}=="2c99", ATTR{index}=="0", SYMLINK+="wrist", MODE="0666"
SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1bcf", ATTRS{idProduct}=="2b95", ATTR{index}=="0", SYMLINK+="scene", MODE="0666"
Key insight: Using ATTR{index}=="0" prevents conflicts with video device metadata nodes, ensuring symlinks point to actual video devices.
I developed an improved OpenCV camera implementation with better resource management:
import cv2

class CorrectedOpenCVCamera:
    def __init__(self, camera_index, fps=30, width=640, height=480):
        self.camera_index = camera_index
        self.fps = fps
        self.width = width
        self.height = height
        self.cap = None

    def connect(self):
        self.cap = cv2.VideoCapture(self.camera_index)
        if not self.cap.isOpened():
            raise RuntimeError(f"Failed to open camera {self.camera_index}")
        # Set properties for optimal performance
        self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, self.width)
        self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, self.height)
        self.cap.set(cv2.CAP_PROP_FPS, self.fps)
After implementing various software approaches including:
- MJPG format forcing (reduced performance from 30fps to 15fps)
- Artificial timing delays (caused more failures)
- Complex configuration workarounds
The actual root cause was identified through systematic testing: a poor-quality USB extension cable.
Debugging Approach
- Systematic isolation: Tested each camera individually
- Performance measurement: FPS monitoring under realistic conditions (see the sketch after this list)
- Hardware verification: Checked physical connections
- Root cause analysis: Eliminated software assumptions
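A minimal version of that FPS measurement (a sketch using plain OpenCV rather than the LeRobot camera wrappers; index 2 corresponds to /dev/scene in this setup):
# Minimal FPS measurement for one camera using plain OpenCV.
import time
import cv2

def measure_fps(camera_index: int, num_frames: int = 300) -> float:
    cap = cv2.VideoCapture(camera_index)
    if not cap.isOpened():
        raise RuntimeError(f"Failed to open camera {camera_index}")
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    cap.set(cv2.CAP_PROP_FPS, 30)

    start = time.monotonic()
    grabbed = 0
    for _ in range(num_frames):
        ok, _frame = cap.read()
        if ok:
            grabbed += 1
    elapsed = time.monotonic() - start
    cap.release()
    return grabbed / elapsed

print(f"scene camera: {measure_fps(2):.1f} FPS")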
Results
After removing the faulty USB extension cable:
- Wrist camera: 33.5 FPS sustained
- Scene camera: 30+ FPS sustained
- Both cameras: Work with default OpenCV settings
- System stability: No configuration workarounds needed
Technical Notes
1. Hardware vs Software Issues
Physical connection issues can create symptoms that mimic software problems. Checking hardware connections early in the debugging process can save time.
2. USB Cable Quality
Poor quality USB extension cables can cause:
- Signal degradation
- Power delivery issues
- Bandwidth limitations
- Intermittent connection failures
3. Camera Configuration
- Forcing MJPG format: Not necessary, can reduce performance
- Artificial delays: Can cause more failures
- Default OpenCV settings: Often work well without modification
4. Persistent Device Management
udev rules with device-specific attributes provide consistent operation across reboots and reconnections.
Current System Status
The system is now operational:
- All 4 USB devices with persistent symlinks (/dev/leader, /dev/follower, /dev/wrist, /dev/scene)
- Stable 30 FPS streaming on both cameras
- LeRobot integration working
- Ready for robotic manipulation data collection
Tools and Commands
# Test camera detection
lerobot-find-cameras opencv

# Capture test photos
python capture_camera.py --device /dev/wrist
python capture_camera.py --device /dev/scene

# Reload udev rules after changes
sudo udevadm control --reload-rules && sudo udevadm trigger
Future Improvements
- Cable management: Implement proper routing to prevent disconnections
- Performance monitoring: Add logging for early detection of degradation
- Integration testing: Validate complete workflow with arms and cameras together
Summary
This debugging session highlighted the importance of checking simple things first when facing technical issues. A faulty USB cable was causing what appeared to be complex software compatibility problems.
The systematic approach resulted in a stable vision system for robotic manipulation data collection.
Next: Implementing NVIDIA GR00T 1.5 model integration for manipulation policies using this camera system.
Project Repository: LeRobot SO-101 Platform
Hardware: SO-101 Robotic Arms, NexiGo N60 FHD Camera
Software: LeRobot, OpenCV, Python, udev
Vipin M