-
Debugging Language Conditioning in GR00T Multitask Training
10/20/2025 at 16:15
When your robot ignores "do not pick up the cheese" and picks it up anyway: A journey through frozen VLM backbones and the limits of action-only fine-tuning
Project Overview
This project log documents the discovery and resolution of a critical language conditioning failure in NVIDIA’s GR00T N1.5 vision-language-action (VLA) model during multitask training. The investigation revealed fundamental limitations in training configurations that freeze the vision-language backbone, and provides solutions for enabling proper language-conditioned robotic manipulation.
The debugging process involved systematic testing with dual-ingredient scenarios (cheese and bread), analysis of model architecture and data flow, and identification of the root cause: frozen Eagle VLM backbone preventing task-specific language-vision association learning.
This work is part of the LeIsaac project - building a multi-ingredient sandwich assembly robot using Isaac Sim, Isaac Lab, and VLA models with an SO-ARM 101 robotic arm.
Hardware and Software Stack
- Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (wrist + scene) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Model: NVIDIA GR00T N1.5-3B (Vision-Language-Action model)
- Framework: Isaac-GR00T + LeRobot v3.0
- Training: LoRA fine-tuning on custom datasets
- Task: Multitask pick-and-place (cheese vs bread)
The Challenge: Multitask Language Conditioning
Why Multitask Learning?
The sandwich assembly task requires the robot to manipulate multiple ingredients based on language instructions:
- “Pick up the cheese and place it in the white plate”
- “Pick up the bread and place it in the white plate”
- “Stack the cheese on the bread”
This requires the model to:
- Understand language instructions - differentiate “cheese” vs “bread”
- Ground language to vision - recognize which object is cheese vs bread
- Execute task-specific actions - different manipulation strategies per ingredient
Training Setup
Datasets:
- Cheese dataset: 50 episodes, 14,212 frames, task: “Pick slice of yellow cheese and place it in the white plate”
- Bread dataset: 50 episodes, 13,483 frames, task: “Pick slice of bread and place it in the white plate”
Training configuration:
```bash
python scripts/gr00t_finetune.py \
  --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
  --num-gpus 1 \
  --max-steps 10000 \
  --data-config so100_dualcam \
  --batch-size 16 \
  --lora-rank 32 \
  --balance-dataset-weights \
  --balance-trajectory-weights
```
The `LeRobotMixtureDataset` automatically balances sampling across both datasets during training.
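The balancing logic itself lives inside Isaac-GR00T; purely as an illustration of the idea (not the actual `LeRobotMixtureDataset` implementation), unequal datasets can be balanced by weighting each sample inversely to its dataset's size:

```python
# Illustration only: balance two datasets of unequal size by weighting each
# sample inversely to its dataset length, so both contribute roughly equally
# per batch. This shows the idea behind dataset balancing, not the GR00T code.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def balanced_loader(cheese_ds, bread_ds, batch_size=16):
    mixed = ConcatDataset([cheese_ds, bread_ds])
    weights = torch.cat([
        torch.full((len(cheese_ds),), 1.0 / len(cheese_ds)),
        torch.full((len(bread_ds),), 1.0 / len(bread_ds)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)
```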
Phase 1: Problem Discovery
Initial Testing
After training a multitask model for 3,000 steps, I deployed it on the physical robot and tested with different language instructions:
Test 1: "pick up the yellow cheese and put it into the white plate"
- Result: ✅ Robot picks up cheese

Test 2: "pick up the bread and put it into the white plate"
- Result: ❌ Robot picks up cheese (ignores instruction!)

Test 3: "do not pick up the cheese"
- Result: ❌ Robot picks up cheese (completely ignores negation!)
Critical observation: The model’s behavior was 100% determined by visual state, with 0% influence from language instruction.
Hypothesis: Visual State Machine
The robot appeared to be using a simple position-based heuristic:
```
IF (object detected in plate):
    STOP (task complete)
ELSE IF (object detected in holder):
    GRASP object → MOVE to plate → RELEASE
ELSE:
    SEARCH randomly
```
This suggested the model learned visual patterns rather than language-conditioned behavior.
Phase 2: First Fix Attempt - The Diffusion Model Flag
Discovery of `--no-tune_diffusion_model`

Investigating the training script revealed a suspicious flag:
```bash
TRAIN_CMD="python scripts/gr00t_finetune.py \
    --dataset-path ${DATASET_PATHS} \
    --no-tune_diffusion_model \    # ← SUSPICIOUS!
    --lora-rank 32 \
    ..."
```
Analysis of what this flag does:
GR00T N1.5 architecture:
```
├── Vision Tower (SigLIP) ..................... ❌ FROZEN (tune_visual=False)
├── Language Model (Qwen2.5-3B) ............... ❌ FROZEN (tune_llm=False)
└── Action Head
    ├── Projector (Linear layers) ............. ✅ TRAINABLE (LoRA rank 32)
    └── Diffusion Model (DiT) ................. ❌ FROZEN (--no-tune_diffusion_model)
```
With `--no-tune_diffusion_model`, only the tiny projector layer was trainable!

Training logs confirmed:
```
tune_diffusion_model: False
trainable params: 6,553,600 || all params: 2,730,717,120 || trainable%: 0.2400
Tune action head projector: True
Tune action head diffusion model: False   ← PROBLEM!
```
The Fix
Removed the flag from `03_train_model.sh`:
```bash
# REMOVED: --no-tune_diffusion_model
TRAIN_CMD="python scripts/gr00t_finetune.py \
    --dataset-path ${DATASET_PATHS} \
    --lora-rank 32 \
    ..."
```
New training logs:
```
tune_diffusion_model: True
trainable params: 6,553,600 || all params: 2,730,717,120 || trainable%: 0.2400
Tune action head projector: True
Tune action head diffusion model: True   ← FIXED!
```
Expectations: With the diffusion model now trainable, the model should learn to map language instructions to different action sequences.
Reality: Language conditioning still failed! 😱
Phase 3: Systematic Testing and Behavior Analysis
Test Matrix
I conducted systematic testing with single and dual ingredient scenarios:
| Scenario | Cheese Location | Bread Location | Robot Action | Language Effect |
|---|---|---|---|---|
| 1 | Holder | Holder | Randomly picks one | ❌ Ignores instruction |
| 2 | Holder | None | Picks cheese | ❌ Ignores instruction |
| 3 | None | Holder | Picks bread | ❌ Ignores instruction |
| 4 | Plate | Holder | Stops | ❌ Ignores instruction |
| 5 | Holder | Plate | Stops | ❌ Ignores instruction |
| 6 | Plate | Plate | Stops | ❌ Ignores instruction |
| 7 | None | None | Random search | ❌ Ignores instruction |
| 8 | Plate | None | Stops | ❌ Ignores instruction |

Key finding: The robot's behavior was entirely determined by object positions, regardless of language instruction.
Behavior Pattern
The model learned a position-based state machine:
State 1: Nothing in plate → Pick any object from holder
State 2: Something in plate → Stop (task complete)
State 3: Nothing anywhere → Search randomly

Critical test: Manually moved the cheese to the plate (without the robot), then gave the instruction "pick up the bread and put it into the white plate":
- Expected: Robot picks up bread
- Actual: Robot stops (considers task complete because cheese is in plate)
This confirmed the model was using visual heuristics (“if object in plate → task done”) rather than understanding language instructions.
Phase 4: Root Cause Analysis - The Frozen Backbone Problem
How GR00T Processes Language
After diving into the codebase, I discovered how language flows through the model:
Step 1: Input Preparation (`transforms.py`)
```python
# Language text
lang = "Pick slice of yellow cheese and place it in the white plate"

# Images from cameras
images = [scene_camera_frame, wrist_camera_frame]  # Shape: [V, T, C, H, W]
```

Step 2: Eagle VLM Processing (`transforms.py:_apply_vlm_processing`)
```python
# Create conversation format (Eagle processes vision + language together!)
eagle_conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": scene_img},
            {"type": "image", "image": wrist_img},
            {"type": "text", "text": lang}
        ]
    }
]

# Tokenize and process
text_list = eagle_processor.apply_chat_template(eagle_conversation)
image_inputs = eagle_processor.process_vision_info(eagle_conversation)
```

Step 3: Eagle Model Forward (`eagle_backbone.py`)
```python
# Eagle model processes BOTH vision and language together
eagle_output = self.eagle_model(
    input_ids=tokenized_text,        # Language tokens
    pixel_values=processed_images,   # Vision features
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Extract joint vision-language embeddings
vl_embeddings = eagle_output.hidden_states[select_layer]  # Shape: [B, seq_len, 2048]
vl_embeddings = self.eagle_linear(vl_embeddings)          # Project to 1536 dim
```

Step 4: Action Head Uses VL Embeddings (`flow_matching_action_head.py`)
```python
# Action head receives joint vision-language embeddings
vl_embs = backbone_output.backbone_features  # From Eagle

# Diffusion model uses these embeddings as conditioning
model_output = self.model(
    hidden_states=sa_embs,              # State + action embeddings
    encoder_hidden_states=vl_embs,      # Vision-language conditioning ← KEY!
    encoder_attention_mask=vl_attn_mask,
    timestep=t_discretized
)
```
The Fundamental Problem
Eagle VLM is completely frozen (`tune_llm=False`, `tune_visual=False`):
- Eagle cannot learn new language-vision associations
  - Pre-trained on general VLM tasks (image captioning, VQA, etc.)
  - Never seen "pick cheese" vs "pick bread" during pre-training
  - Cannot learn to differentiate these similar instructions
- Eagle produces nearly identical embeddings
```python
# Hypothesis: Eagle's frozen embeddings
emb_cheese = eagle("pick cheese", [scene_img, wrist_img])
emb_bread = eagle("pick bread", [scene_img, wrist_img])

# Cosine similarity likely very high (>0.95)
# Because both are "pick X and place in plate" structure
```
- Diffusion model has no signal to differentiate
  - Diffusion model learns: `embeddings → actions`
  - If `emb_cheese ≈ emb_bread`, then `actions_cheese ≈ actions_bread`
  - Model falls back to visual heuristics: "if object in holder → pick it"
Why Diffusion Model Training Alone is Insufficient
Even with the diffusion model trainable:
- ✅ Diffusion model can learn better action prediction
- ✅ Diffusion model can learn smoother trajectories
- ❌ Diffusion model CANNOT differentiate between similar language instructions
- ❌ Diffusion model CANNOT learn task-specific language conditioning
The bottleneck: Frozen Eagle provides nearly identical embeddings for different instructions, so the diffusion model has no signal to learn different behaviors.
Phase 5: Solution and Next Steps
Required Fix: Enable Backbone Training
Minimum requirement - Enable LLM fine-tuning:
```bash
python scripts/gr00t_finetune.py \
  --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
  --tune-llm \                   # ✅ ENABLE THIS
  --tune-visual False \          # Keep frozen to save VRAM
  --tune-projector True \
  --tune-diffusion-model True \
  --lora-rank 32
```
Why this works:
- LLM can learn to differentiate “cheese” vs “bread” tokens
- LLM creates task-specific language embeddings
- Diffusion model learns to map these distinct embeddings to different actions
- VRAM: ~12-16GB (may fit on RTX 4080 Super with reduced batch size)
Best solution - Enable both LLM and vision fine-tuning:
```bash
python scripts/gr00t_finetune.py \
  --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
  --tune-llm \               # ✅ ENABLE THIS
  --tune-visual \            # ✅ ENABLE THIS
  --tune-projector True \
  --tune-diffusion-model True \
  --lora-rank 16 \           # Reduce from 32 to save VRAM
  --lora-alpha 32            # Reduce from 64 (2x rank)
```
Why this is best:
- Vision tower learns to recognize cheese vs bread visually
- LLM learns to understand task instructions
- Combined: Model can ground language to visual objects
- VRAM: ~16-20GB (may require batch size reduction)
VRAM Testing Script
Created `test_vram_requirements.sh` to systematically test configurations:
```bash
cd ~/lerobot/scripts/so100_groot
./test_vram_requirements.sh
```
This script tests 5 configurations:
- Baseline (frozen backbone) - ~8GB
- LLM only (LoRA 16) - ~12GB
- LLM only (LoRA 32) - ~14GB
- LLM + Vision (LoRA 16) - ~18GB
- LLM + Vision (LoRA 32) - ~22GB
And recommends the best configuration that fits on the GPU.
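The script itself is project-specific, but the underlying measurement is simple: launch each candidate configuration briefly, poll `nvidia-smi`, and record the peak memory. A rough sketch of that loop (the training command strings are placeholders, not the script's exact flags):

```python
# Sketch of the measurement behind a VRAM test script: run each candidate
# training command for a short window while polling nvidia-smi, report the peak.
# The command strings below are placeholders, not the project's exact flags.
import subprocess
import time

def peak_vram_mb(train_cmd, probe_seconds=120):
    proc = subprocess.Popen(train_cmd, shell=True)
    peak = 0
    deadline = time.time() + probe_seconds
    while time.time() < deadline and proc.poll() is None:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
        )
        peak = max(peak, int(out.decode().split()[0]))
        time.sleep(2)
    proc.terminate()
    return peak

configs = {
    "llm_only_lora16": "python scripts/gr00t_finetune.py --tune-llm --lora-rank 16 --max-steps 50",
}
for name, cmd in configs.items():
    print(f"{name}: ~{peak_vram_mb(cmd)} MB peak")
```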
Trade-offs and Considerations
| Configuration | VRAM | Training Speed | Language Conditioning | Visual Grounding |
|---|---|---|---|---|
| Frozen backbone | ~8GB | ~5-7 sec/step | ❌ Broken | ❌ No |
| LLM only | ~12-16GB | ~8-12 sec/step | ✅ Works | ⚠️ Limited |
| LLM + Vision | ~16-20GB | ~12-18 sec/step | ✅ Works | ✅ Yes |

Lessons Learned
1. Frozen VLM Backbones Cannot Learn New Tasks
Key insight: When using a vision-language model as a backbone, freezing it prevents learning task-specific language-vision associations. This is fundamentally different from freezing a vision-only backbone (like ResNet or ViT).
Why: VLMs create joint embeddings from vision and language. If the VLM is frozen, it can only use pre-trained associations, which likely don’t include my specific tasks.
2. Action-Only Fine-tuning Has Limits
What works: Fine-tuning only the action head (projector + diffusion model) can work for:
- Single-task learning
- Tasks similar to pre-training data
- Visual-only conditioning
What doesn’t work: Action-only fine-tuning cannot enable:
- Language conditioning for novel tasks
- Differentiation between similar language instructions
- Grounding of new language concepts to vision
3. Debugging Requires Understanding Data Flow
Critical: Understanding how data flows through the model is essential for debugging:
- How is language tokenized?
- Where are vision and language combined?
- What embeddings does the action head receive?
- Which components are frozen vs trainable?
Without this understanding, it’s easy to misdiagnose problems (e.g., thinking diffusion model training alone would fix language conditioning).
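The last question is the cheapest to answer directly from the loaded model, before spending any GPU hours. A generic PyTorch sketch, assuming `policy` is the loaded GR00T policy (a `torch.nn.Module`):

```python
# Quick audit of which sub-modules are actually trainable before launching a run.
# `policy` is assumed to be the loaded GR00T policy (a torch.nn.Module).
from collections import defaultdict
import torch.nn as nn

def trainable_summary(policy: nn.Module):
    totals = defaultdict(lambda: [0, 0])  # top-level module -> [trainable, total]
    for name, param in policy.named_parameters():
        top = name.split(".")[0]          # e.g. "backbone", "action_head"
        totals[top][1] += param.numel()
        if param.requires_grad:
            totals[top][0] += param.numel()
    for top, (trainable, total) in totals.items():
        print(f"{top:20s} {trainable:>14,d} / {total:>14,d} trainable")
```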
4. Systematic Testing Reveals Patterns
Approach: Testing with a matrix of scenarios (single ingredient, dual ingredient, different positions) revealed the position-based state machine pattern.
Value: This systematic approach provided clear evidence that language had 0% effect, which motivated deeper investigation into the architecture.
5. VRAM is the Limiting Factor
Reality: The best configuration (LLM + Vision fine-tuning) requires ~20GB VRAM, which exceeds most consumer GPUs.
Solutions:
- Reduce LoRA rank (32 → 16)
- Reduce batch size (16 → 4)
- Use gradient checkpointing (see the sketch after this list)
- Train on cloud GPUs
- Accept LLM-only training as compromise
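Gradient checkpointing and the batch-size/LoRA reductions are one-liners in most Hugging Face-style training setups. A generic sketch of the levers (the GR00T fine-tune script exposes equivalent knobs through its own CLI flags, so treat this as illustrative only):

```python
# Generic illustration of the memory-saving levers using Hugging Face
# transformers + PEFT; not the project's exact configuration.
from transformers import TrainingArguments
from peft import LoraConfig

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=4,   # reduced from 16
    gradient_accumulation_steps=4,   # keeps the effective batch size at 16
    gradient_checkpointing=True,     # trade extra compute for activation memory
    bf16=True,
)

lora_config = LoraConfig(r=16, lora_alpha=32)  # halved rank and alpha
```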
Next Steps
- Test VRAM requirements on RTX 4080 Super to determine feasible configuration
- Retrain model with LLM fine-tuning enabled (minimum) or LLM + Vision (if VRAM allows)
- Validate language conditioning with systematic testing:
- Different instructions → different behaviors
- Negation has effect
- Model responds to language changes
- Document results and update training scripts with proper defaults
- Consider alternative approaches if VRAM is insufficient:
- Train separate models per task
- Use task selector at inference time
- Explore model quantization or pruning
Conclusion
This debugging journey revealed a fundamental limitation in the training configuration: frozen VLM backbones cannot learn task-specific language conditioning. While the initial fix (enabling diffusion model training) was necessary, it was insufficient because the bottleneck was in the frozen Eagle backbone producing nearly identical embeddings for different instructions.
The solution requires enabling at least LLM fine-tuning, and ideally both LLM and vision fine-tuning, to allow the model to learn task-specific language-vision associations. This comes with VRAM trade-offs that must be carefully managed on consumer GPUs.
The systematic testing approach and deep dive into model architecture were essential for identifying the root cause and developing an effective solution. This experience highlights the importance of understanding not just what to train, but how the model processes and combines different modalities.
-
Building a Sandwich Assembly Simulation for Robotic Manipulation
10/16/2025 at 20:30
From USD scene creation to MimicGen integration: A complete implementation of multi-ingredient manipulation in Isaac Sim
Project Overview
This project implements a complete sandwich assembly simulation environment in Isaac Lab, designed to train robotic manipulation policies for multi-step food preparation tasks. The development involved creating a custom USD scene with proper physics configuration, implementing MimicGen integration for data augmentation, and solving critical challenges with rigid body hierarchies and API compatibility.
This project log documents the systematic development process, from initial scene setup through the final dynamic ingredient selection feature, including all the debugging challenges and solutions encountered along the way.
The simulation successfully supports teleoperation, demonstration recording, MimicGen annotation, and automated data generation for training vision-language-action (VLA) models on sandwich assembly tasks.
Hardware and Software Stack
- Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (wrist + front) at 640x480, 30fps
- Front camera calibrated for Nexigo N60 webcam (78° FOV)
- Wrist camera for close-up manipulation view
- GPU: RTX 4080 Super with 16GB VRAM
- Simulation: Isaac Sim 5.0 + Isaac Lab framework
- Framework: LeIsaac (custom robotics framework built on Isaac Lab)
- Task: Multi-ingredient sandwich assembly with 4 ingredients (2× bread, cheese, patty)
The Challenge: Multi-Ingredient Manipulation
Why Sandwich Assembly?
Sandwich assembly represents a complex manipulation task that requires:
- Sequential manipulation: Multiple pick-and-place operations
- Spatial reasoning: Proper stacking order and alignment
- Object diversity: Different ingredient types with varying properties
- Real-world relevance: Applicable to food preparation and assembly tasks
- VLA training: Language-conditioned manipulation (“pick up the cheese”, “place the patty”)
Technical Requirements
- USD Scene: Custom kitchen environment with proper physics
- Multiple Objects: 4 ingredients + plate + holder, each with correct physics properties
- MimicGen Integration: Subtask annotation and data augmentation
- Dynamic Configuration: Support for different ingredient types
- Camera Setup: Optimal viewing angles for manipulation
Phase 1: USD Scene Creation
Initial Scene Setup (Commit 1c52342)
Development began with creating the basic task structure and scene configuration. The initial implementation involved:
Created Files:
- `assemble_sandwich_env_cfg.py` - Main environment configuration
- `assemble_sandwich_mimic_env_cfg.py` - MimicGen variant with subtask configs
- `README.md` - Complete documentation (614 lines)
Key Configuration:
```python
# Scene loading with automatic USD parsing
parse_usd_and_create_subassets(KITCHEN_WITH_SANDWICH_USD_PATH, self)
```
This function automatically detects all rigid body objects in the USD scene and creates corresponding configuration objects, eliminating manual object registration.
Scene Simplification Challenge
Problem: The original kitchen scene was 37.9 MB with complex fixtures (cabinets, appliances, decorative elements) that slowed simulation and cluttered the workspace.
Solution: Documented a systematic simplification workflow:
- Remove unnecessary kitchen fixtures
- Create simple table workspace (1.2m × 0.8m × 0.85m)
- Reduce file size to ~5-10 MB (75% reduction)
- Optimize for robot simulation performance
Table Layout Design:
```
┌─────────────────────────────────┐
│ [Ingredients Holder]      [🍽️]  │  ← Left: Holder, Right: Plate
│   ┌─┬─┬─┬─┐              Plate  │
│   │🍞│🍞│🥩│🧀│                  │  ← Slots: bread, bread, patty, cheese
│   └─┴─┴─┴─┘                     │
│                                 │
│          Assembly Area          │
└─────────────────────────────────┘
```
Physics Configuration (Commit 6a7e4b5)
The USD scene structure was created with proper physics APIs for all objects:
Dynamic Objects (movable ingredients):
- `bread_slice_1`, `bread_slice_2`, `cheese_slice`, `patty`
- Physics properties: `RigidBodyAPI`, `CollisionAPI`, `MassAPI`
- Configuration: `physics:kinematicEnabled = false` (affected by gravity)

Static Objects (fixed environment):
- `plate`, `ingredients_holder`, `table`
- Physics properties: `RigidBodyAPI`, `CollisionAPI`
- Configuration: `physics:kinematicEnabled = true` (no movement)
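For reference, the same tagging can also be done programmatically with the USD Python API instead of hand-editing the scene file; a sketch of how that might look (prim paths and the mass value are illustrative assumptions):

```python
# Sketch: apply the rigid-body, collision, and mass APIs to scene prims with
# the pxr USD Python bindings. Prim paths and the mass value are illustrative.
from pxr import Usd, UsdPhysics

stage = Usd.Stage.Open("scene.usda")

def make_dynamic(path, mass_kg=0.05):
    prim = stage.GetPrimAtPath(path)
    rb = UsdPhysics.RigidBodyAPI.Apply(prim)
    rb.CreateKinematicEnabledAttr(False)                 # affected by gravity
    UsdPhysics.CollisionAPI.Apply(prim)
    UsdPhysics.MassAPI.Apply(prim).CreateMassAttr(mass_kg)

def make_static(path):
    prim = stage.GetPrimAtPath(path)
    UsdPhysics.RigidBodyAPI.Apply(prim).CreateKinematicEnabledAttr(True)  # no movement
    UsdPhysics.CollisionAPI.Apply(prim)

for name in ["bread_slice_1", "bread_slice_2", "cheese_slice", "patty"]:
    make_dynamic(f"/Root/{name}")
for name in ["plate", "ingredients_holder", "Scene/table"]:
    make_static(f"/Root/{name}")

stage.Save()
```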
Critical Learning: USD payload overrides can replace physics APIs from payload files. The solution required adding physics and collision APIs to both `def` and `over` declarations in `scene.usda` to ensure proper physics behavior.
Phase 2: The Rigid Body Hierarchy Crisis
The Critical Error
Problem: When attempting teleoperation, a fatal error occurred:
[Error] Multiple rigid bodies in hierarchy detected
Root Cause: Objects were nested inside the table hierarchy, creating parent-child rigid body relationships that Isaac Sim’s physics engine cannot handle.
Investigation Process:
- Examined working reference scene (kitchen_with_orange)
- Investigation revealed correct USD hierarchy pattern
- Analysis showed that manipulable objects must be direct children of `/Root`, NOT nested inside Scene or table
Incorrect Hierarchy:
```
/Root
└── Scene
    └── table
        ├── bread_slice_1   ❌ Nested rigid body
        ├── cheese_slice    ❌ Nested rigid body
        └── patty           ❌ Nested rigid body
```
Correct Hierarchy:
```
/Root
├── Scene
│   └── table
├── bread_slice_1   ✅ Direct child of /Root
├── cheese_slice    ✅ Direct child of /Root
└── patty           ✅ Direct child of /Root
```
The Fix (Commit 6b64f15)
The USD hierarchy was flattened by moving all manipulable objects out of the table to be direct children of `/Root`. This critical architectural fix enabled proper physics simulation.
Validation: After flattening, teleoperation worked correctly with no rigid body hierarchy errors.
Phase 3: Camera Configuration (Commits d5328b3, 5df9d64)
Camera Positioning Strategy
The implementation includes a three-camera system optimized for sandwich assembly:
1. Wrist Camera (Close-up manipulation):
```python
offset=TiledCameraCfg.OffsetCfg(
    pos=(0.02, 0.08, -0.03),          # Slightly forward and up
    rot=(-0.35, -0.93, -0.05, 0.08),  # Angled down toward table
)
```
2. Front Camera (Workspace overview, calibrated for Nexigo N60 webcam):
```python
offset=TiledCameraCfg.OffsetCfg(
    pos=(-0.2, -0.8, 0.7),       # Higher and angled for table overview
    rot=(0.2, -0.98, 0.0, 0.0),  # Looking down at workspace
),
spawn=sim_utils.PinholeCameraCfg(
    focal_length=24.0,           # Nexigo N60 equivalent
    horizontal_aperture=36.0,    # ~78° FOV to match webcam
    focus_distance=400.0,        # Optimal for table distance
)
```
3. Viewer Camera (Development/debug):
```python
self.viewer.eye = (1.5, -1.5, 1.8)    # Elevated diagonal view
self.viewer.lookat = (0.0, 0.0, 0.9)  # Looking at table center
```
Key Insight: The front camera was specifically calibrated to match the Nexigo N60 webcam specifications (78° FOV, 640x480 resolution, 30 FPS) to ensure sim-to-real consistency for VLA training.
Phase 4: MimicGen Integration
Automatic Subtask Annotation (Commit 3e3ce0b)
Automatic subtask detection was implemented using observation functions:
Created `mdp/observations.py`:
```python
def ingredient_grasped(
    env: ManagerBasedRLEnv,
    robot_cfg: SceneEntityCfg,
    bread_slice_1_cfg: SceneEntityCfg,
    cheese_slice_cfg: SceneEntityCfg,
    patty_cfg: SceneEntityCfg,
    bread_slice_2_cfg: SceneEntityCfg,
    gripper_open_threshold: float = 0.03,
) -> torch.Tensor:
    """Detect when any ingredient is grasped by checking gripper closure and proximity."""
    robot: Articulation = env.scene[robot_cfg.name]
    gripper_pos = robot.data.joint_pos[:, -1]
    gripper_closed = gripper_pos < gripper_open_threshold

    # Check proximity to each ingredient
    for ingredient_name, ingredient_cfg in [
        ("bread_slice_1", bread_slice_1_cfg),
        ("cheese_slice", cheese_slice_cfg),
        ("patty", patty_cfg),
        ("bread_slice_2", bread_slice_2_cfg),
    ]:
        if ingredient_name in env.scene:
            ingredient: RigidObject = env.scene[ingredient_name]
            distance = torch.norm(
                robot.data.body_pos_w[:, -1, :3] - ingredient.data.root_pos_w[:, :3],
                dim=-1
            )
            is_grasped = gripper_closed & (distance < 0.05)
            if is_grasped.any():
                return is_grasped
    return torch.zeros(env.num_envs, dtype=torch.bool, device=env.device)
```
This function automatically detects when the robot grasps an ingredient by checking:
- Gripper closure (joint position < threshold)
- Proximity to ingredient (distance < 5cm)
Success Termination Criteria (Commit faf0a22)
Problem: The MimicGen annotation script requires a success termination term, but the environment only had a `time_out` termination.

Solution: Created `mdp/terminations.py` with comprehensive success criteria:
```python
def task_done(
    env: ManagerBasedRLEnv,
    plate_cfg: SceneEntityCfg,
    bread_slice_1_cfg: SceneEntityCfg,
    cheese_slice_cfg: SceneEntityCfg,
    patty_cfg: SceneEntityCfg,
    bread_slice_2_cfg: SceneEntityCfg,
    height_threshold: float = 0.02,
    xy_threshold: float = 0.05,
    test_mode: bool = False,
) -> torch.Tensor:
    """Determine if the sandwich assembly task is complete."""
    if test_mode:
        return torch.ones(env.num_envs, dtype=torch.bool, device=env.device)

    # Check XY alignment (all ingredients within 5cm of plate center)
    # Check vertical stacking order (bread → cheese → patty → bread)
    # Check stability (low velocity for all ingredients)
    # ...
```
Success Criteria:
- XY Alignment: All ingredients within 5cm of plate center
- Vertical Stacking: Correct order (bread → cheese → patty → bread)
- Stability: Low velocity for all ingredients
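The checks elided in the snippet above follow directly from these criteria. A hedged sketch of how they might look, reusing the thresholds and Isaac Lab data fields from the signature (the stacking test is simplified to relative heights rather than contact reasoning):

```python
# Hedged sketch of the elided success checks, assuming the entity handles and
# thresholds from the task_done signature above.
import torch

def sandwich_assembled(env, plate, ingredients, xy_threshold=0.05, height_threshold=0.02):
    # `ingredients` is ordered bottom bread -> cheese -> patty -> top bread
    plate_pos = plate.data.root_pos_w                       # [num_envs, 3]
    done = torch.ones(env.num_envs, dtype=torch.bool, device=env.device)

    prev_height = plate_pos[:, 2]
    for obj in ingredients:
        pos = obj.data.root_pos_w
        xy_ok = torch.norm(pos[:, :2] - plate_pos[:, :2], dim=-1) < xy_threshold
        stacked = pos[:, 2] > prev_height + height_threshold
        stable = torch.norm(obj.data.root_lin_vel_w, dim=-1) < 0.05
        done &= xy_ok & stacked & stable
        prev_height = pos[:, 2]
    return done
```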
Test Mode for Incomplete Demonstrations (Commit 6366699)
Problem: During development, testing the annotation pipeline required support for incomplete demonstrations (just moving the arm without completing the task).
Solution: Added a `--force_completion` command-line argument:
```python
parser.add_argument(
    "--force_completion",
    action="store_true",
    default=False,
    help="Force task completion for incomplete demonstrations."
)

# Modify success_term parameters based on flag
if args_cli.force_completion:
    success_term.params["test_mode"] = True
    print("[INFO] Force completion enabled: success termination will accept incomplete demonstrations.")
```
This allows switching between:
- Testing mode: Accept any demonstration (useful for pipeline testing)
- Production mode: Enforce strict success criteria (for real training data)
Phase 5: Dynamic Ingredient Selection (Commit 24bb349)
The KeyError Challenge
Problem: MimicGen data generation failed with:
KeyError: 'ingredient'
Root Cause: The subtask configuration used `object_ref="ingredient"` (a generic placeholder), but the USD scene contains specific objects: `bread_slice_1`, `bread_slice_2`, `cheese_slice`, `patty`.

Investigation: The `get_object_poses()` method returns a dictionary with keys from actual scene object names, not generic placeholders. MimicGen tried to access `object_poses["ingredient"]`, which didn't exist.
The Solution: Dynamic Ingredient Selection
A command-line argument was implemented to dynamically specify which ingredient to track:
```python
parser.add_argument(
    "--ingredient_type",
    type=str,
    default=None,
    choices=["bread_slice_1", "bread_slice_2", "cheese_slice", "patty"],
    help="Specify the ingredient type for sandwich assembly task."
)

# Dynamically set ingredient type
if args_cli.ingredient_type is not None and "AssembleSandwich" in env_name:
    ingredient_display_names = {
        "bread_slice_1": "bread slice",
        "bread_slice_2": "bread slice",
        "cheese_slice": "cheese slice",
        "patty": "patty",
    }
    ingredient_name = ingredient_display_names.get(args_cli.ingredient_type)

    # Update object_ref for the first subtask (grasp ingredient)
    env_cfg.subtask_configs["so101_follower"][0].object_ref = args_cli.ingredient_type

    # Update descriptions
    env_cfg.subtask_configs["so101_follower"][0].description = f"Grasp {ingredient_name} from cartridge"
    env_cfg.subtask_configs["so101_follower"][1].description = f"Place {ingredient_name} on plate"

    print(f"[INFO] Ingredient type set to: {args_cli.ingredient_type} ({ingredient_name})")
```
Benefits:
- No Manual Editing: Single command-line flag instead of code changes
- Type Safety: Choices parameter prevents typos
- Clear Intent: Command shows exactly which ingredient is being processed
- Workflow Efficiency: Easy switching between ingredient types
Usage Example:
```bash
# Generate bread demonstrations
python.sh scripts/mimic/generate_dataset.py \
    --task=LeIsaac-SO101-AssembleSandwich-Mimic-v0 \
    --input_file=./datasets/annotated_bread_ingredient.hdf5 \
    --output_file=./datasets/generated_bread_ingredient.hdf5 \
    --ingredient_type=bread_slice_1 \
    --generation_num_trials=20

# Generate patty demonstrations
python.sh scripts/mimic/generate_dataset.py \
    --ingredient_type=patty \
    ...
```
Phase 6: API Compatibility Fix (Commit 6b64f15)
The Se3Keyboard Error
Problem: Annotation script failed with:
TypeError: Se3Keyboard.__init__() got an unexpected keyword argument 'pos_sensitivity'
Root Cause: The leisaac repository's `annotate_demos.py` diverged from the official Isaac Lab version. Isaac Lab changed the Se3Keyboard API in May 2025 to use a configuration object pattern, but the script was created in August 2025 with the old API.

Solution: Updated to use `Se3KeyboardCfg`:
```python
# Old (incorrect):
device = Se3Keyboard(pos_sensitivity=0.05, rot_sensitivity=0.05)

# New (correct):
from omni.isaac.lab.devices import Se3Keyboard, Se3KeyboardCfg
device = Se3Keyboard(Se3KeyboardCfg(pos_sensitivity=0.05, rot_sensitivity=0.05))
```
This follows Isaac Lab’s configuration pattern where device classes accept a single
cfgparameter of a corresponding configuration dataclass type.Technical Insights and Lessons Learned
1. USD Hierarchy is Critical
The rigid body hierarchy issue was the most critical architectural challenge. Key learning: Manipulable objects must be direct children of the root prim, not nested inside other rigid bodies.
2. Physics API Overrides
USD payload overrides can replace physics APIs from payload files. Always add physics and collision APIs to both `def` and `over` declarations to ensure proper behavior.
3. MimicGen Configuration Requirements
- Intermediate subtasks: Can have termination signals and offset ranges
- Final subtask: Must have
subtask_term_signal=Noneandsubtask_term_offset_range=(0, 0) - Object references: Must match actual USD scene object names
4. Camera Calibration for Sim-to-Real
Calibrating simulation cameras to match real hardware (Nexigo N60 webcam) is essential for VLA training. Key parameters:
- Field of view (78° diagonal)
- Resolution (640x480)
- Frame rate (30 FPS)
- Focus distance (400mm for table workspace)
5. Flexible Testing Workflows
The `--force_completion` flag enables rapid iteration during development by accepting incomplete demonstrations, while maintaining strict criteria for production data collection.
Results and Portfolio Value
Completed Features
✅ Custom USD scene with proper physics configuration
✅ Multi-ingredient manipulation support (4 ingredients)
✅ Dual camera system calibrated for real hardware
✅ MimicGen integration with automatic subtask detection
✅ Success termination criteria for task completion
✅ Dynamic ingredient selection for data augmentation
✅ Comprehensive documentation (614-line README)
Demonstration Workflow
- Record demonstrations for each ingredient type (bread, cheese, patty)
- Annotate subtasks manually or automatically
- Generate augmented data using MimicGen (1 demo → 20+ variations)
- Train VLA policies with language conditioning
- Deploy to real robot with sim-to-real transfer
Portfolio Highlights
This project demonstrates:
- USD scene design with complex physics requirements
- Debugging systematic issues (rigid body hierarchy, API compatibility)
- MimicGen integration for data augmentation
- Flexible configuration for different use cases
- Real-world applicability to food preparation and assembly tasks
Future Enhancements
- Language Conditioning: Integrate with VLA models for language-conditioned manipulation
- Sim-to-Real Transfer: Deploy trained policies to real SO-101 robot
- Advanced Physics: Add deformable objects (lettuce, tomato)
- Multi-Robot: Extend to bi-manual manipulation
- Benchmarking: Compare MimicGen vs manual demonstrations
Conclusion
Building the sandwich assembly simulation required solving challenges across multiple domains: USD scene design, physics configuration, MimicGen integration, and API compatibility. The systematic debugging approach and comprehensive documentation make this a strong portfolio piece demonstrating end-to-end robotics simulation development.
The dynamic ingredient selection feature and flexible testing workflows show attention to developer experience and practical usability. The project successfully bridges simulation and real-world robotics, providing a foundation for training manipulation policies on complex multi-step tasks.
-
MimicGen Data Augmentation Pipeline for Robotic Manipulation
10/08/2025 at 22:35
Project Overview
I implemented a complete MimicGen data augmentation pipeline to generate multiple training demonstrations from a single recorded episode. The goal was to overcome the data scarcity problem in robotic manipulation by automatically creating diverse variations of expert demonstrations.
This project log documents the systematic implementation of the 4-step MimicGen workflow, from converting demonstrations to IK actions through generating 10x augmented data, and the debugging challenges encountered along the way.
The pipeline successfully transformed 1 original demonstration into 10 augmented demonstrations with a 71.4% generation success rate, providing rich training data for imitation learning policies.
Hardware Setup
- Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (scene + wrist) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Simulation: Isaac Sim 5.0 + Isaac Lab framework
- Dataset: Single “lift_cube” demonstration → 10 augmented demonstrations
- Task: “Pick up 1.5cm cube and lift it 5cm above robot base”
The Problem: Data Scarcity in Robotic Learning
Initial Challenge
Robotic manipulation policies require large amounts of diverse training data, but collecting demonstrations is:
- Time-consuming: Each episode requires manual teleoperation
- Limited diversity: Human demonstrations tend to be similar
- Expensive: Requires expert operators and robot time
- Insufficient for generalization: Single demonstrations don’t capture task variations
Traditional approach: Record 50-100 demonstrations manually.
MimicGen approach: Record 1 demonstration → Generate 10+ variations automatically.
MimicGen Pipeline Overview
The 4-Step Workflow
- Convert to IK Actions: Transform joint-space actions (6D) to end-effector actions (8D)
- Annotate Subtasks: Automatically detect subtask boundaries using termination signals
- Generate Augmented Data: Create variations by recombining subtask segments
- Convert to Joint Actions: Transform back to joint-space for training
Task Structure: Lift Cube
Subtask 1: `pick_cube` - Approach and grasp the cube
Subtask 2: `lift_cube` - Lift cube above threshold height

Key Requirements:
- Cube dimensions: 1.5cm × 1.5cm × 1.5cm
- Lift threshold: 5cm above robot base
- Success condition: Cube height > base height + 0.05m
Debugging Approach
Step 1: Environment Configuration Issues
Problem: MimicGen annotation failed with “The final task was not completed” error.
Root Cause Analysis:
- Missing `lift_cube` observation function in environment
- Height threshold too strict for actual cube size
Solution: Added lift_cube observation function:
```python
def lift_cube(
    env: ManagerBasedRLEnv,
    cube_cfg: SceneEntityCfg = SceneEntityCfg("cube"),
    robot_cfg: SceneEntityCfg = SceneEntityCfg("robot"),
    robot_base_name: str = "base",
    height_threshold: float = 0.05,
) -> torch.Tensor:
    """Check if the cube is lifted above the robot base."""
    cube: RigidObject = env.scene[cube_cfg.name]
    robot: Articulation = env.scene[robot_cfg.name]

    cube_height = cube.data.root_pos_w[:, 2]
    base_index = robot.data.body_names.index(robot_base_name)
    robot_base_height = robot.data.body_pos_w[:, base_index, 2]

    above_base = cube_height - robot_base_height > height_threshold
    return above_base
```
Critical Discovery: The default height threshold (0.20m) was too strict for the actual cube size.
Investigation Process:
- Examined cube model file:
/assets/scenes/table_with_cube/cube/model.xml - Found actual dimensions: 0.015077m × 0.015077m × 0.015077m (1.5cm cube)
- Calculated appropriate threshold: 0.05m (3.3× cube height)
Configuration Update:
```python
# Updated threshold in both environments
height_threshold: float = 0.05  # Changed from 0.20m
```
Step 3: MimicGen Configuration Requirements
Problem: Assertion error during generation: “assert subtask_configs[-1].subtask_term_offset_range[0] == 0”
Root Cause: Final subtask had incorrect offset range configuration.
MimicGen Requirements:
- Intermediate subtasks: Can have termination signals and offset ranges
- Final subtask: Must have `subtask_term_signal=None` and `subtask_term_offset_range=(0, 0)`
Solution:
```python
# Final subtask configuration
subtask_configs.append(
    SubTaskConfig(
        object_ref="cube",
        subtask_term_signal=None,          # No termination signal for final subtask
        subtask_term_offset_range=(0, 0),  # Required by MimicGen
        description="Lift cube",
        next_subtask_description=None,
    )
)
```
Critical Discovery: Environment Compatibility
The AttributeError Challenge
Problem:
```
AttributeError: 'ManagerBasedRLLeIsaacMimicEnv' object has no attribute 'scene'
```
Root Cause: The MimicGen environment has a different internal structure than the regular environment.
Solution: Added compatibility handling in termination functions:
```python
# Handle both regular env and mimic env
if hasattr(env, 'scene'):
    num_envs = env.num_envs
    device = env.device
    cube: RigidObject = env.scene[cube_cfg.name]
    robot: Articulation = env.scene[robot_cfg.name]
else:
    # For mimic environments, try alternative access patterns
    num_envs = getattr(env, '_num_envs', 1)
    device = getattr(env, '_device', torch.device('cuda'))
    scene = getattr(env, '_scene', None) or getattr(env, 'env', None)
    if scene is None:
        return torch.tensor([True], dtype=torch.bool, device=device)
    cube: RigidObject = scene[cube_cfg.name]
    robot: Articulation = scene[robot_cfg.name]
```
The Solution: Complete Pipeline Implementation
Step 1: Convert to IK Actions
```bash
/home/vipin/IsaacSim/_build/linux-x86_64/release/python.sh scripts/mimic/eef_action_process.py \
    --input_file=./datasets/so101_lift_cube.hdf5 \
    --output_file=./datasets/processed_so101_lift_cube.hdf5 \
    --to_ik \
    --device=cuda \
    --headless
```
Result: 6D joint actions → 8D end-effector actions (7 pose + 1 gripper)
Step 2: Annotate Subtasks
```bash
/home/vipin/IsaacSim/_build/linux-x86_64/release/python.sh scripts/mimic/annotate_dataset.py \
    --input_file=./datasets/processed_so101_lift_cube.hdf5 \
    --output_file=./datasets/annotated_so101_lift_cube.hdf5 \
    --task=LeIsaac-SO101-LiftCube-Mimic-v0 \
    --device=cuda \
    --headless
```
Result: Added subtask termination signals for automatic segmentation.
Step 3: Generate Augmented Data
```bash
/home/vipin/IsaacSim/_build/linux-x86_64/release/python.sh scripts/mimic/generate_dataset.py \
    --input_file=./datasets/annotated_so101_lift_cube.hdf5 \
    --output_file=./datasets/generated_so101_lift_cube.hdf5 \
    --task=LeIsaac-SO101-LiftCube-Mimic-v0 \
    --num_demos=10 \
    --device=cuda \
    --headless
```
Result: 1 demonstration → 10 augmented demonstrations (71.4% success rate)
Step 4: Convert Back to Joint Actions
```bash
/home/vipin/IsaacSim/_build/linux-x86_64/release/python.sh scripts/mimic/eef_action_process.py \
    --input_file=./datasets/generated_so101_lift_cube.hdf5 \
    --output_file=./datasets/final_so101_lift_cube.hdf5 \
    --to_joint \
    --device=cuda \
    --headless
```
Result: 8D end-effector actions → 6D joint actions ready for training.
Technical Implementation Details
Dataset Pipeline Summary
| Stage | File | Size | Episodes | Action Dim | Description |
|---|---|---|---|---|---|
| Original | `so101_lift_cube.hdf5` | 239.5 MB | 1 | 6D | Recorded demonstration |
| IK Converted | `processed_so101_lift_cube.hdf5` | 74.2 MB | 1 | 8D | End-effector actions |
| Annotated | `annotated_so101_lift_cube.hdf5` | 74.4 MB | 1 | 8D | With subtask signals |
| Generated | `generated_so101_lift_cube.hdf5` | 732.5 MB | 10 | 8D | Augmented data |
| Final | `final_so101_lift_cube.hdf5` | 732.5 MB | 10 | 6D | Ready for training |

LeRobot Conversion
```bash
conda activate lerobot
python scripts/convert/isaaclab2lerobot.py
```
Configuration:
```python
repo_id = 'sparkmt/so101_lift_cube_mimicgen'
robot_type = 'so101_follower'
fps = 30
task = 'Lift cube from table using MimicGen augmented data'
```
Result: 21 MB LeRobot v3.0 dataset with AV1-compressed videos.
Results and Validation
Data Augmentation Success
- Input: 1 original demonstration
- Output: 10 augmented demonstrations
- Success Rate: 71.4% (10 successful / 14 total attempts)
- Data Multiplication: 10× increase in training data
- Total Dataset: 11 demonstrations (1 original + 10 generated)
Generation Statistics
MimicGen Generation Results:
- Total attempts: 14
- Successful generations: 10
- Failed generations: 4
- Success rate: 71.4%
- Average generation time: ~2 minutes per demo
Dataset Quality Metrics
- Action Diversity: Generated demonstrations show variations in:
- Object positions and orientations
- Robot approach trajectories
- Timing and velocity profiles
- Grasp configurations
- Task Consistency: All generated demos maintain task structure
- Physical Validity: All actions respect robot constraints
LeRobot Dataset Structure
```
sparkmt/so101_lift_cube_mimicgen/
├── data/   (372K) - Parquet files with actions/states
├── videos/ (20M)  - AV1-compressed dual camera videos
├── meta/   (84K)  - Dataset metadata and statistics
└── images/ (12K)  - Sample images
```
Compression Efficiency: 732.5 MB HDF5 → 21 MB LeRobot (~97% size reduction)
Technical Insights
1. Subtask Configuration is Critical
- Intermediate subtasks: Require termination signals for segmentation
- Final subtask: Must have
subtask_term_signal=Noneandsubtask_term_offset_range=(0, 0) - Height thresholds: Must match actual object dimensions, not arbitrary values
2. Environment Compatibility Matters
- MimicGen environments have different internal structure
- Always check for attribute existence before accessing
- Provide fallback mechanisms for different environment types
3. Action Space Consistency
- MimicGen requires IK actions (8D) for generation
- Training requires joint actions (6D)
- Conversion steps are essential for pipeline success
4. Success Rate Expectations
- 71.4% success rate is considered good for MimicGen
- Failed generations often due to:
- Collision detection
- Unreachable configurations
- Timing constraints
Current Status
- ✅ Complete MimicGen pipeline implemented
- ✅ 10× data augmentation achieved (1 → 10 demonstrations)
- ✅ LeRobot v3.0 dataset created and optimized
- ✅ All debugging challenges resolved
- ✅ Comprehensive documentation and reproducible workflow
- 🔄 Ready for imitation learning policy training
Summary
This project successfully implemented a complete MimicGen data augmentation pipeline, transforming a single demonstration into 10 diverse training examples. The systematic debugging approach revealed critical requirements for MimicGen configuration, including proper subtask termination signals, accurate height thresholds, and environment compatibility handling.
The pipeline achieved a 71.4% generation success rate and produced a high-quality dataset with 10× data augmentation. The final LeRobot dataset provides rich training data with dual camera observations and diverse manipulation trajectories, all compressed efficiently using modern video codecs.
Key technical contributions include environment compatibility fixes, proper subtask configuration, and automated conversion between action spaces. The debugging infrastructure and systematic approach provide a foundation for scaling to more complex manipulation tasks.
The project demonstrates the power of automated data augmentation for robotic learning, reducing the manual data collection burden while increasing training data diversity and quality.
Next: Training imitation learning policies on the augmented dataset and comparing performance against single-demonstration baselines.
Framework: Isaac Sim 5.0 + Isaac Lab + MimicGen
Data Pipeline: HDF5 → LeRobot v3.0 (21 MB, AV1-compressed)
Hardware: SO-101 Robot Arm, RTX 4080 Super
Success Metrics: 10× data augmentation, 71.4% generation success rate
-
Building a Real-to-Sim Digital Twin for SO-101 Robot Arm in Isaac Sim
10/06/2025 at 06:50
Project Overview
I worked on implementing a real-to-sim digital twin system for an SO-101 robotic arm using NVIDIA Isaac Sim 4.5.0. The goal was to create a virtual replica that mirrors the physical robot’s movements in real-time, enabling simultaneous control of both real and virtual arms through a leader-follower teleoperation setup.
This project log documents the complete setup process, debugging challenges, and the implementation of a robust ROS2-based communication pipeline between physical hardware and Isaac Sim.
Hardware Setup
- Robot: SO-100/SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Control System: Leader-follower teleoperation setup
- Device Mappings:
- `/dev/leader` - Leader arm for human teleoperation
- `/dev/follower` - Follower arm (physical robot being controlled)
- `/dev/wrist` - Wrist-mounted camera (video0)
- `/dev/scene` - Scene overview camera (video2)
- GPU: RTX 4080 Super with 16GB VRAM
- Software: Isaac Sim 4.5.0, ROS2 Humble, CycloneDDS
The Challenge
The objective was to create a digital twin where:
- Leader arm movements control both physical follower arm AND virtual follower arm
- Real-time synchronization with minimal latency
- Proper joint state feedback from physical to virtual robot
- Seamless integration with existing teleoperation workflow
Architecture Overview
The system uses a three-component architecture:
Component 1: Teleoperation System
- Reads leader arm positions from
/dev/leader - Controls physical follower arm via existing teleoperation
Component 2: Joint State Bridge
- Reads actual follower arm positions from
/dev/follower - Publishes joint states to ROS2 topics
- Bridges physical robot data to Isaac Sim
Component 3: Isaac Sim Digital Twin
- Subscribes to joint state commands
- Renders virtual robot matching physical movements
- Provides visual feedback and simulation capabilities
Debugging Process
Issue 1: GLIBCXX Library Version Conflicts
When attempting to run ROS2 nodes from the Isaac Sim conda environment, I encountered:
```
ImportError: /home/vipin/miniconda3/envs/isaacsim/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/_rclpy_pybind11.cpython-310-x86_64-linux-gnu.so)
```
Root Cause: The conda environment's `libstdc++` (version 3.4.26) was older than what the system ROS2 required (3.4.30).

Investigation:
```bash
# Conda environment library
strings /home/vipin/miniconda3/envs/isaacsim/lib/libstdc++.so.6 | grep GLIBCXX | tail -1
# Output: GLIBCXX_3.4.26

# System library
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX | tail -1
# Output: GLIBCXX_3.4.30
```
Solution: Use Isaac Sim’s internal ROS2 libraries instead of system installation:
```bash
# Configure Isaac Sim with internal ROS2 libraries
export isaac_sim_package_path=$(dirname $(which isaacsim))/../lib/python3.10/site-packages/isaacsim
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$isaac_sim_package_path/exts/isaacsim.ros2.bridge/humble/lib
export PYTHONPATH=$isaac_sim_package_path/exts/isaacsim.ros2.bridge/humble:$PYTHONPATH
```
This approach avoided library conflicts while maintaining Isaac Sim stability.
Issue 2: Network Topic Interference
During initial testing, I discovered unexpected joint states:
```bash
ros2 topic echo /joint_states --once
# Output showed wheel joints, gripper extension, head swivel - not arm joints!
```
Root Cause: Another machine on the network was publishing robot topics to the same ROS_DOMAIN_ID.
Solution: Network isolation using unique domain ID:
```bash
export ROS_DOMAIN_ID=42   # Isolated domain
ros2 topic list           # Clean output: only local topics
```
Issue 3: Joint Name Mismatch
The most critical debugging challenge was joint name inconsistency. Isaac Sim’s ArticulationController was throwing warnings:
[Warning] [omni.graph.core.plugin] /so101_new_calib/ROS_JointStates/ArticulationController: [/so101_new_calib/ROS_JointStates] OmniGraph Warning: 'joint_1'
Investigation: Compared published vs expected joint names:
```bash
# Physical arm publishing
ros2 topic echo /isaac_joint_command --once
name: [joint_1, joint_2, joint_3, joint_4, joint_5, joint_6]

# Isaac Sim expecting
ros2 topic echo /isaac_joint_states --once
name: [Rotation, Pitch, Elbow, Wrist_Pitch, Wrist_Roll, Jaw]
```
Root Cause: Joint state reader was using generic names instead of Isaac Sim’s actual joint names.
Solution: Updated joint state reader configuration:
```python
# Fixed joint names to match Isaac Sim robot
self.joint_names = [
    'Rotation',     # Base rotation
    'Pitch',        # Shoulder pitch
    'Elbow',        # Elbow
    'Wrist_Pitch',  # Wrist pitch
    'Wrist_Roll',   # Wrist roll
    'Jaw'           # Gripper
]
```
Issue 4: Device Port Conflicts
Initially attempted to read from `/dev/leader`, but this caused conflicts:
```python
# Problematic - conflicts with teleoperation
self.serial_port = serial.Serial('/dev/leader', 1000000, timeout=0.1)
```
Root Cause: The teleoperation system was already reading from `/dev/leader`. Multiple processes accessing the same serial port caused communication failures.

Solution: Read from the follower arm instead:
```python
# Correct approach - read actual follower positions
self.serial_port = serial.Serial('/dev/follower', 1000000, timeout=0.1)
```
This provides the actual controlled arm positions for digital twin synchronization.
Implementation Details
Joint State Reader Architecture
The core component reads physical robot joint positions and publishes them for Isaac Sim:
```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import JointState
import serial

class JointStateReader(Node):
    def __init__(self):
        super().__init__('joint_state_reader')

        # Publisher for Isaac Sim joint commands
        self.joint_pub = self.create_publisher(JointState, '/isaac_joint_command', 10)

        # Connect to follower arm for position feedback
        self.serial_port = serial.Serial('/dev/follower', 1000000, timeout=0.1)

        # Timer for 20Hz publishing rate
        self.timer = self.create_timer(0.05, self.read_and_publish)
```
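The 20Hz timer callback itself only has to pack the latest follower positions into a `JointState` message. A sketch, with the Feetech servo serial protocol hidden behind a hypothetical `read_follower_positions()` helper:

```python
# Sketch of the 20Hz callback on JointStateReader. read_follower_positions()
# is a hypothetical helper that hides the servo serial protocol and returns
# six values ordered to match self.joint_names.
def read_and_publish(self):
    msg = JointState()
    msg.header.stamp = self.get_clock().now().to_msg()
    msg.name = self.joint_names
    msg.position = [float(p) for p in self.read_follower_positions()]
    self.joint_pub.publish(msg)
```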
ROS2 Topic Architecture
The system uses a clean topic structure:
- `/joint_states` - Standard ROS2 joint states (from physical robot)
- `/isaac_joint_command` - Commands sent to Isaac Sim virtual robot
- `/isaac_joint_states` - Feedback from Isaac Sim virtual robot
- `/robot/cmd_pose` - IK teleoperation commands (future expansion)
Isaac Sim Configuration
Isaac Sim loads with internal ROS2 libraries:
```
[23.352s] [ext: isaacsim.ros2.bridge-4.1.15] startup
[23.370s] Using backup internal ROS2 humble distro
[23.396s] Attempting to load system rclpy
[23.396s] Could not import system rclpy: No module named 'rclpy'
[23.396s] Attempting to load internal rclpy
[23.405s] rclpy loaded
```
This confirms successful loading of Isaac Sim’s bundled ROS2 libraries, avoiding system conflicts.
Current Status
✅ Real-to-sim synchronization working: Virtual arm mirrors physical follower arm movements in real-time
✅ Joint state publishing: 20Hz update rate from physical robot to Isaac Sim
✅ Network isolation: Clean ROS2 topic namespace without interference
✅ Library conflicts resolved: Isaac Sim uses internal ROS2 libraries
✅ Proper device mapping: Reading from `/dev/follower` without conflicts
Demonstration
[VIDEO PLACEHOLDER: Leader arm controlling real + virtual follower for pick-and-place task]
The video demonstrates:
- Operating the leader arm to control movements
- Real follower arm and virtual Isaac Sim arm moving in perfect synchronization
- Pick and place task: picking up a striped object and placing it into a white tray
- Real-time response with minimal latency between physical and virtual robots
Advantages of This Approach
- True Digital Twin: Virtual robot reflects actual physical robot state, not just commands
- Debugging Capability: Visual feedback in Isaac Sim helps debug physical robot issues
- Simulation Validation: Test control algorithms in simulation with real robot data
- Training Data Generation: Record synchronized real/virtual data for ML training
- Safety Monitoring: Virtual representation provides additional safety oversight
- Scalability: Can extend to multiple robots or complex multi-robot scenarios
Technical Notes
ROS2 Bridge Verification
Always verify topic subscription counts to confirm Isaac Sim connectivity:
```bash
ros2 topic info /isaac_joint_command
# Publisher count: 1 (joint state reader)
# Subscription count: 1 (Isaac Sim when playing)
```
Joint Name Debugging Strategy
- Check what Isaac Sim publishes:
ros2 topic echo /isaac_joint_states --once - Match physical robot publisher to those exact names
- Verify ArticulationController warnings disappear
Library Conflict Prevention
- Never upgrade conda environment’s
libstdc++- it will break Isaac Sim - Use Isaac Sim’s internal ROS2 libraries for compatibility
- Test with a simple `ros2 topic list` before complex operations
Usage Commands
Start Isaac Sim (with ROS2 bridge):
```bash
conda activate isaacsim
isaacsim   # ROS2 bridge loads automatically
```
Run joint state reader:
```bash
source /opt/ros/humble/setup.bash
cd /home/vipin/so-arm101-ros2-bridge
/usr/bin/python3.10 src/jointstatereader/jointstatereader/joint_state_reader.py
```
Monitor synchronization:
```bash
source /opt/ros/humble/setup.bash
ros2 topic hz /isaac_joint_command   # Should show ~20Hz
```
Summary
This project successfully implemented a real-to-sim digital twin system for robotic manipulation, overcoming significant technical challenges in library compatibility, network isolation, and joint name mapping. The solution provides a robust foundation for advanced robotics applications including simulation-based training, safety monitoring, and algorithm validation.
The key insight was recognizing that true digital twin functionality requires reading the actual robot state (from `/dev/follower`) rather than just command signals, providing an accurate virtual representation of physical robot behavior.
Next: Implementing bidirectional control for sim-to-real policy deployment and testing advanced manipulation tasks with synchronized real/virtual feedback.
Hardware: SO-100/SO-101 Robot Arm, RTX 4080 Super
Software: Isaac Sim 4.5.0, ROS2 Humble, CycloneDDS
Repository: so-arm101-ros2-bridge
-
Debugging Robot “Twitching” in GR00T N1.5 Deployment
10/05/2025 at 16:13
Project Overview
I worked on debugging a puzzling issue where a fine-tuned NVIDIA GR00T N1.5 model was causing an SO-100 robotic arm to “twitch” instead of performing pick-and-place tasks. The robot would make tiny oscillating movements around the same position, with the gripper staying completely unresponsive.
This project log documents the systematic debugging process that revealed the root cause: an undertrained model that needed significantly more training steps to learn the complete task sequence.
Hardware Setup
- Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (scene + wrist) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Dataset: 20 episodes of pick-and-place demonstrations
- Task: “Pick up the striped block and put it into the white plate”
The Problem: Robot Twitching
Initial Symptoms
When deploying the trained GR00T model:
- Robot connected successfully
- Model inference server running correctly
- Robot made tiny oscillating movements around the same position
- Robot was not executing the intended pick-and-place task
The model had been trained for 2000 steps and showed good loss convergence, but the physical deployment was completely unsuccessful.
Debugging Approach
Step 1: Enhanced Logging Implementation
Added comprehensive logging to both the inference server and robot client to understand what data was being exchanged.
Server-Side Logging (`service.py`):
- Request counter for each inference call
- Input data keys and shapes
- Inference time in milliseconds
- Output action statistics (min/max/mean values)
Client-Side Logging (`eval_lerobot.py`):
- Step counter and observation keys
- Current robot state (all 6 joints)
- Received action chunks from server
- First action being sent to robot
Example Output:
```
[Request #1] Endpoint: get_action
  Inference time: 75.23ms
  Response keys: ['action.single_arm', 'action.gripper']
  action.single_arm: shape=(16, 5), min=-45.23, max=67.89, mean=12.34
  action.gripper: shape=(16, 1), min=-0.30, max=0.50, mean=0.15

[CLIENT] First action to send to robot:
  shoulder_pan.pos: -12.34
```
Step 2: Diagnostic Tools Development
Created several diagnostic scripts to isolate the issue:
Joint Testing Tool (`test_joint.py`):
- Tests individual joint control to verify hardware functionality
- Takes joint number (1-6) and value (-100 to 100) as input
- Helps isolate hardware vs. software issues
Robot State Monitor (`monitor_robot_state.py`):
- Real-time monitoring of robot joint positions
- Verifies encoder readings match values sent to server
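For completeness, the joint-testing tool is little more than an argument parser around a single hardware write; a rough sketch of its shape (the `send_joint_command` helper is a stand-in for the project's SO-100 driver call, not a real API):

```python
# Rough sketch of test_joint.py: drive a single joint to a target value to
# separate hardware faults from policy problems. send_joint_command() is a
# stub standing in for the project's SO-100 serial driver call.
import argparse

def send_joint_command(joint_id: int, value: float) -> None:
    print(f"[stub] joint {joint_id} -> {value}")  # real script writes to the servo bus here

def main():
    parser = argparse.ArgumentParser(description="Drive one joint to a target value.")
    parser.add_argument("joint", type=int, choices=range(1, 7), help="Joint number (1-6)")
    parser.add_argument("value", type=float, help="Target value in [-100, 100]")
    args = parser.parse_args()
    send_joint_command(args.joint, max(-100.0, min(100.0, args.value)))

if __name__ == "__main__":
    main()
```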
Step 3: Dataset Visualization
Uploaded the dataset to Hugging Face Hub and used Rerun visualization to inspect the recorded episodes:
```bash
# Upload dataset for analysis
python scripts/so100_groot/upload_to_huggingface.py \
    --local-dir ~/.cache/huggingface/lerobot/rubbotix/striped-block \
    --repo-id sparkmt/so100-striped-block

# Visualize episodes
./scripts/so100_groot/visualize_episodes.sh 0
```
This revealed the difference between State (robot’s actual position) and Action (commanded target position), which was crucial for diagnosis.
Critical Discovery: The Root Cause
Key Finding from Logs
The robot was making very small, uncertain movements instead of decisive actions. The logging revealed that the model was outputting actions with very small magnitudes, indicating high uncertainty.
The Root Cause: Undertrained Model
Analysis revealed that the model was severely undertrained at 2000 steps.
Evidence:
- Tiny action magnitudes: Model outputting very small actions due to high uncertainty
- Lack of task structure understanding: Model hadn’t learned the full sequence (approach → grasp → lift → move → release)
- Closed-loop instability: Small errors accumulating, causing the robot to end up in states the model never saw during training
The Solution: Extended Training
Training Requirements Analysis
Task Complexity        Minimum Steps     Recommended Steps
Simple reaching        1,000-2,000       5,000
Pick and place         5,000-10,000      10,000-20,000
Complex manipulation   10,000-20,000     20,000-50,000

The pick-and-place task required 10,000-20,000 steps, not the 2000 steps initially used.
Training Configuration Update
Updated the training script to resume from checkpoint-2000 and continue to 10,000 steps:
# Resume training configuration
RESUME_TRAINING="true"
MAX_STEPS=10000                  # Increased from 2000
BATCH_SIZE=16
GRADIENT_ACCUMULATION_STEPS=8
LORA_RANK=32
LORA_ALPHA=64
Automatic Checkpoint Detection:
if [ "$RESUME_TRAINING" = "true" ]; then if ls "$OUTPUT_DIR"/checkpoint-* 1> /dev/null 2>&1; then LATEST_CHECKPOINT=$(ls -td "$OUTPUT_DIR"/checkpoint-* | head -1) echo "Resuming from latest checkpoint: ${LATEST_CHECKPOINT}" TRAIN_CMD="$TRAIN_CMD --resume" fi fiWhy 2000 Steps Was Insufficient
1. Model Hadn’t Learned Task Structure
- At 2000 steps: Learning basic correlations between observations and actions
- Missing: Full sequence understanding of the manipulation task
2. Action Magnitude Learning
From deployment logs at 2000 steps, the model was outputting very small actions because:
- Hadn’t learned correct action scale
- Being overly cautious due to high uncertainty
- Loss function hadn’t fully converged
3. Closed-Loop Instability
- Small errors accumulate: Undertrained model makes uncertain movements
- Compounding problem: Robot ends up in states model never saw during training
- Result: Model gets “confused” and twitches in place
Technical Implementation Details
Enhanced Logging Code
Server-side logging addition:
logger.info(f"[Request #{request_counter}] Endpoint: {endpoint}") logger.info(f" Data keys: {list(data.keys())}") logger.info(f" Inference time: {inference_time:.2f}ms") for key, value in response.items(): if isinstance(value, np.ndarray): logger.info(f" {key}: shape={value.shape}, min={value.min():.2f}, max={value.max():.2f}, mean={value.mean():.2f}")Client-side logging addition:
logger.info(f"[STEP {step_count}] Getting observation...") logger.info(f" Current robot state:") for key, value in current_state.items(): logger.info(f" {key}: {value:.2f}") logger.info(f"[CLIENT] First action to send to robot:") for key, value in first_action.items(): logger.info(f" {key}: {value:.2f}")Dataset Upload and Visualization
Created tools for dataset management and analysis:
# Upload script for Hugging Face Hub
import os

from huggingface_hub import HfApi

def upload_dataset(local_dir, repo_id):
    # Validate dataset structure
    required_files = ['meta/info.json', 'meta/stats.json', 'meta/tasks.parquet']
    for file in required_files:
        if not os.path.exists(os.path.join(local_dir, file)):
            raise FileNotFoundError(f"Required file {file} not found")

    # Create repository and upload
    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="dataset")
Training Progress
After resuming training from 2000 to 10,000 steps:
- Significant MSE improvement: From ~24 at 2,000 steps to ~6.3 at 10,000 steps
- Loss continued to decrease: Model learned more complex patterns
- Action magnitudes increased: Actions became more decisive
- Task structure emerged: Model learned the complete manipulation sequence
Deployment Results
With the extended training at 10,000 steps:
- Task execution achieved: Robot now performs the complete sequence (approach → open → grasp → lift → move → release)
- Mixed joint performance: Some joints (1, 2, 3, and 5) showed accurate predictions matching ground truth, while others (joints 0 and 4) had less precise control
- Execution challenges: Task completion takes 3+ minutes with multiple retries due to shaky movements
- No more twitching: Robot executes purposeful movements instead of oscillating in place
Performance Assessment
The model demonstrates partial success:
- ✅ Complete task sequence understanding
- ✅ Elimination of twitching behavior
- ⚠️ Uneven accuracy across different joints
- ⚠️ Execution speed and precision need improvement
- ⏳ Further iteration required for reliable performance
Technical Insights
1. Training Duration is Critical
- 2000 steps = memorized patterns (MSE ~24, twitching behavior)
- 10,000 steps = learned task structure (MSE ~6.3, complete sequence execution)
- Manipulation tasks require significantly more training than simple reaching
- Even at 10,000 steps, performance varies across joints, suggesting more training may be beneficial
2. Logging is Essential for Debugging
- Without detailed logs, impossible to diagnose model-robot mismatch
- Action statistics (min/max/mean) reveal model confidence levels
- State vs. action comparison shows tracking performance
3. Visualization Tools are Invaluable
- Dataset visualization revealed data quality and action ranges
- State vs. Action plots diagnosed tracking issues
- Essential for understanding model behavior
Current Status
- Extended training completed (2000 → 10,000 steps)
- MSE improved from ~24 to ~6.3 (74% improvement)
- Robot deployment shows partial success with complete task sequence execution
- Performance varies across joints with some showing accurate control while others need improvement
- Comprehensive debugging infrastructure in place
- Dataset published to Hugging Face Hub: sparkmt/so100-striped-block
Summary
This debugging session demonstrated that what appeared to be a complex hardware or software integration issue was actually a fundamental training problem. The “twitching” behavior was caused by an undertrained model that hadn’t learned the complete task structure.
The systematic debugging approach using enhanced logging, diagnostic tools, and dataset visualization was crucial for identifying the root cause. The solution required extending training from 2000 to 10,000 steps, resulting in a 74% improvement in MSE (from ~24 to ~6.3) and enabling the robot to execute the complete pick-and-place sequence.
While the model now performs the full task (approach → open → grasp → lift → move → release), execution remains slow and imprecise, with uneven performance across different joints. This suggests that further data collection and training iterations will be needed to achieve reliable, smooth manipulation.
The project demonstrates the iterative nature of robotic AI development and the importance of adequate training duration for manipulation tasks. The debugging infrastructure and systematic approach provide a foundation for continued improvement.
Next: Collecting additional training episodes and exploring Isaac Sim integration for synthetic data generation.
Model: NVIDIA GR00T N1.5 (3B parameters)
Training Method: LoRA fine-tuning (extended to 10,000 steps)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Framework: Isaac-GR00T + LeRobot -
Fine-Tuning GR00T N1.5 for SO-100 Robot Arm Manipulation
10/05/2025 at 15:59 • 0 commentsProject Overview
I worked on fine-tuning NVIDIA’s GR00T N1.5 model for controlling an SO-100 robotic arm. The project involved dataset preparation, memory optimization for 16GB VRAM constraints, model training with LoRA techniques, and deployment setup for real-world robot control.
The goal was to train the model to perform pick-and-place manipulation tasks using the instruction “pick up the striped box and put it into the white plate” with dual-camera visual input.
Hardware Setup
- Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Dataset: 20 episodes, 5,197 frames of manipulation demonstrations
- Model: NVIDIA GR00T N1.5 (3B parameters)
Dataset Preparation and Debugging
Issue 1: Blank Visualization Plots
The dataset visualization script displayed blank canvases for state/action plots.
Root Cause: The script had hardcoded humanoid robot keys (left_arm, right_arm, left_hand, right_hand) while the SO-100 dataset uses different keys (single_arm, gripper).
Solution: Modified the visualization function to auto-detect keys from the dataset:
# Before: hardcoded humanoid keys
shared_keys = ["left_arm", "right_arm", "left_hand", "right_hand"]

# After: auto-detect from dataset
if shared_keys is None:
    shared_keys = [key.replace("state.", "") for key in state_dict.keys()]
    print(f"Auto-detected keys to plot: {shared_keys}")
Issue 2: Camera Mapping Discrepancy
The visualization showed the wrist camera perspective when it should have shown the scene camera.
Investigation: Checked the dataset's modality.json mappings and discovered that during data collection, the camera naming was swapped:
- observation.images.main was actually the wrist/gripper camera
- observation.images.secondary_0 was actually the scene camera
Solution: Corrected the mappings in modality.json:
"video": {
    "front": {"original_key": "observation.images.secondary_0"},  // Scene camera
    "wrist": {"original_key": "observation.images.main"}          // Wrist camera
}
Verification: Created a diagnostic script that confirmed the mapping correction by comparing raw video frames with dataset loader output.
Issue 3: Missing Video Metadata
Dataset loading failed due to missing video metadata fields.
Solution: Added the required fields to info.json:
info['features'][key]['info']['video.channels'] = 3
info['features'][key]['info']['video.height'] = 720
info['features'][key]['info']['video.width'] = 1280
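The fields were patched by hand here; a short script along these lines (a sketch with an assumed dataset path) makes the fix repeatable across datasets:
# Sketch: add the missing video metadata fields to every video feature in
# info.json. The dataset path is an assumption for illustration.
import json

info_path = "demo_data/example_dataset/meta/info.json"

with open(info_path) as f:
    info = json.load(f)

for key, feature in info["features"].items():
    if key.startswith("observation.images."):   # only patch video features
        feature.setdefault("info", {})
        feature["info"]["video.channels"] = 3
        feature["info"]["video.height"] = 720
        feature["info"]["video.width"] = 1280

with open(info_path, "w") as f:
    json.dump(info, f, indent=2)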
Memory Optimization Challenge
The Problem: CUDA Out of Memory
Initial training attempts all failed with out-of-memory errors, even with very small batch sizes:
Attempt   Batch Size   Gradient Accum   Result
1         64           2                OOM at step 0
2         32           4                OOM at step 0
3         16           8                OOM at step 0
4         8            16               OOM at step 0
5         4            32               OOM at step 0
6         2            64               OOM during optimizer step

Analysis: The base model has 3B parameters, plus a 550M parameter diffusion model. The Adam optimizer requires 2x memory for momentum and variance states, exceeding the 16GB VRAM limit.
Solution: LoRA Fine-Tuning
Implemented Low-Rank Adaptation (LoRA) to reduce trainable parameters:
LoRA Configuration:
--lora-rank 32             # Size of low-rank adaptation matrices
--lora-alpha 64            # Scaling factor (typically 2x rank)
--lora-dropout 0.1         # Regularization
--no-tune_diffusion_model  # Freeze 550M parameter diffusion model
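Isaac-GR00T applies LoRA internally when these flags are passed; purely as an illustration of the mechanism, the same rank/alpha/dropout settings can be reproduced with Hugging Face PEFT on a stand-in backbone (the GPT-2 model and c_attn target module below are placeholders, not GR00T's actual modules):
# Illustration only: LoRA via Hugging Face PEFT on a small stand-in model,
# using the same rank/alpha/dropout as the training flags above.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in backbone

lora_cfg = LoraConfig(
    r=32,                       # --lora-rank 32
    lora_alpha=64,              # --lora-alpha 64 (2x rank)
    lora_dropout=0.1,           # --lora-dropout 0.1
    target_modules=["c_attn"],  # GPT-2 attention projections (placeholder)
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # shows trainable vs total parameters
The print_trainable_parameters() output makes the parameter reduction visible directly, which is the same effect the trainable-parameter numbers below describe for GR00T.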
Memory Savings:
- Full fine-tuning: ~200M trainable parameters
- LoRA fine-tuning: ~10M trainable parameters (20x reduction)
- Result: Fits in 16GB VRAM with batch_size=16
Training Configuration and Results
Final Training Setup
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/example_dataset/ \
    --num-gpus 1 \
    --output-dir ./so100-checkpoints \
    --max-steps 5000 \
    --data-config so100_dualcam \
    --batch-size 16 \
    --gradient-accumulation-steps 8 \
    --learning-rate 0.0001 \
    --no-tune_diffusion_model \
    --lora-rank 32 \
    --lora-alpha 64 \
    --lora-dropout 0.1
Key Parameters:
- Effective batch size: 128 (16 × 8 gradient accumulation)
- Training steps: 5000 planned, stopped early at 2340
- Checkpoints saved every 1000 steps
Training Progress
Loss Trajectory:
Step     Loss     Change from Previous
-----    -----    --------------------
500      0.080    (baseline)
1,000    0.050    -37.5% (strong improvement)
1,500    0.040    -20.0% (good improvement)
2,000    0.040    0.0% (plateau started)
2,340    0.038    -5.0% (minimal improvement)
Convergence Analysis: The loss plateaued around step 1500-2000, with minimal improvement in the last 840 steps. Training was stopped early to avoid overtraining on the small dataset.
Training Metrics:
- Training speed: ~12.76 seconds/step
- GPU memory usage: 7.07 GB before training
- Best checkpoint: checkpoint-2000 with stable performance
Model Evaluation
Open-Loop Evaluation Results
Evaluated the trained model using the official evaluation script:
python scripts/eval_policy.py --plot \
    --embodiment-tag new_embodiment \
    --model-path ./so100-checkpoints/checkpoint-2000 \
    --data-config so100_dualcam \
    --dataset-path ./demo_data/example_dataset/
Result: Unnormalized MSE of 0.017463
Performance Assessment:
- Excellent: MSE < 10
- Good: MSE 10-30
- Moderate: MSE 30-60
- Poor: MSE > 60
- Our Result: 0.017 (Outstanding performance)
The model predictions closely match ground truth actions, indicating readiness for real robot deployment.
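The evaluation script reports a single unnormalized MSE over all joints; the same metric can be broken down per joint, which becomes useful once some joints track better than others. A sketch of the computation (not eval_policy.py's actual internals):
import numpy as np

def unnormalized_mse(pred: np.ndarray, gt: np.ndarray):
    """MSE between predicted and ground-truth actions, overall and per joint.

    pred and gt are (num_steps, num_joints) arrays in raw (unnormalized)
    action units; this mirrors the metric's definition, not the script's code.
    """
    err = (pred - gt) ** 2
    return float(err.mean()), err.mean(axis=0)   # overall, per-joint

# Example with fake data for a 6-DOF arm
pred = np.random.randn(500, 6)
gt = pred + 0.1 * np.random.randn(500, 6)        # small errors -> small MSE
overall, per_joint = unnormalized_mse(pred, gt)
print(f"overall MSE: {overall:.6f}")
print("per-joint MSE:", np.round(per_joint, 6))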
Deployment Setup
Device Configuration
Verified robot and camera device mappings:
/dev/follower -> ttyACM4   # Robot motor bus
/dev/wrist    -> video0    # Wrist/gripper camera
/dev/scene    -> video2    # Scene/front camera
Client-Server Architecture
Terminal 1 - Inference Server:
python scripts/inference_service.py --server \
    --model-path ./so100-checkpoints/checkpoint-2000 \
    --embodiment-tag new_embodiment \
    --data-config so100_dualcam \
    --denoising-steps 4 \
    --port 5555
Terminal 2 - Robot Client:
python ./examples/SO-100/eval_lerobot.py \
    --robot.type=so100_follower \
    --robot.port=/dev/follower \
    --robot.id=my_so100_arm \
    --robot.cameras="{ wrist: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 2, width: 640, height: 480, fps: 30}}" \
    --policy_host=localhost \
    --lang_instruction="pick up the striped box and put it into the white plate"
Issues Fixed
- Import Error: Fixed incorrect import path in the robot client script
- Missing Dependency: Installed feetech-servo-sdk for robot communication
Technical Notes
Memory Optimization Strategies
Successful approaches:
- LoRA fine-tuning: Reduced trainable parameters by 20x
- Freezing diffusion model: Saved 550M parameters from training
- Gradient accumulation: Maintained effective batch size without memory overhead
- Memory fragmentation fix: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Unsuccessful approaches:
- Simply reducing batch size (still OOM with batch_size=2)
- Training full model even with frozen diffusion model
Training Efficiency Insights
- Small datasets (20 episodes) converge quickly (~2000 steps)
- Monitor loss curves and stop when plateau is reached
- Training loss of 0.04 with MSE 0.017 indicates excellent learning
- Avoid overtraining on small datasets
Dataset Quality Factors
- Camera mapping must match between training and deployment
- Robot calibration affects action space consistency
- 20 episodes sufficient for single task, more needed for multi-task
- Always visualize dataset before training
Current Status
- Dataset prepared and validated
- Model trained and converged (checkpoint-2000, loss: 0.04)
- Open-loop evaluation passed (MSE: 0.017)
- Inference server configured
- Robot client script ready
- Pending: Complete robot calibration and test real-world deployment
Summary
This project successfully fine-tuned a 3B parameter vision-language-action model for robotic manipulation within 16GB VRAM constraints. The key breakthrough was using LoRA fine-tuning to reduce memory requirements while maintaining training effectiveness.
The trained model achieved excellent evaluation metrics (MSE: 0.017) and is ready for real-world deployment. The systematic approach to dataset debugging, memory optimization, and deployment setup provides a foundation for future robotic AI projects.
Next: Testing the trained model on physical robot manipulation tasks.
Model: NVIDIA GR00T N1.5 (3B parameters)
Training Method: LoRA fine-tuning (rank 32)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Framework: Isaac-GR00T + LeRobot -
Debugging GR00T N1.5 Inference in Phosphobot
10/05/2025 at 15:55 • 0 commentsProject Overview
I worked on debugging inference issues with a fine-tuned NVIDIA GR00T N1.5 model for controlling an SO-100 robotic arm. The model was trained successfully and uploaded to HuggingFace Hub, but the PhosphoBot browser interface was failing during AI control activation.
This project log documents the debugging process, root cause analysis, and the implementation of an alternative local inference solution.
Hardware Setup
- Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (IDs 0 and 2) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Model: phospho-app/gr00t-example_dataset-h9g75u7gak (fine-tuned GR00T N1.5)
The Problem
When clicking “AI Control” in the PhosphoBot browser interface, the system reported:
Exception: No robot connected. Exiting AI control loop.
The robot was physically connected and visible in the UI, cameras were streaming successfully, and the model had been trained and uploaded to HuggingFace Hub. The issue appeared to be in the inference pipeline.
Debugging Process
Issue 1: Joint Count Mismatch
Added debug logging to understand the failure and discovered:
Connected joints: 6, Config joints: 1
Root Cause: The code was reading the model configuration incorrectly:
# Incorrect code
number_of_joints_in_config = len(
    config.embodiment.statistics.action.action_space.values()
)
This was counting dictionary keys (max, min, mean, std, q01, q99) instead of joint dimensions.
Model Config Structure:
{ "action_space": { "action_space": 6 } }Solution: Handle the nested dictionary structure correctly:
# Fixed code
action_space = config.embodiment.statistics.action.action_space

# Case 1: action_space is a dict with 'action_space' key containing the number
if isinstance(action_space, dict) and 'action_space' in action_space:
    number_of_joints_in_config = action_space['action_space']
# Case 2: action_space has 'max' or 'min' arrays
elif hasattr(action_space, 'max') and isinstance(action_space.max, list):
    number_of_joints_in_config = len(action_space.max)
# Additional fallback cases...
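A standalone check against the nested fragment above shows why the original code reported a single joint while the fix reads six (the dict here is just that fragment, not the full model config):
# The broken parser counts dict entries (1 here); the fixed parser reads the
# joint count (6), matching the "Connected joints: 6, Config joints: 1" log.
action_space = {"action_space": 6}

broken = len(action_space.values())                   # -> 1

if isinstance(action_space, dict) and "action_space" in action_space:
    fixed = action_space["action_space"]              # -> 6
else:
    fixed = len(getattr(action_space, "max", []))     # array-style fallback

print(f"broken parser: {broken} joint(s), fixed parser: {fixed} joints")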
Issue 2: Device Mismatch on Modal Server
After fixing the joint count, a new error appeared:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Root Cause:
- Model inference happens on Modal GPU server (remote)
- Some model components loaded on CPU, others on GPU
- Issue occurs in VLLN (Vision-Language Layer Norm) component
Attempted Fix: Added retry logic with exponential backoff to handle transient device issues:
max_retries = 3
retry_delay = 1.0  # seconds

for retry_attempt in range(max_retries):
    try:
        actions = self(inputs)
        break  # Success
    except RuntimeError as e:
        if "Expected all tensors to be on the same device" in str(e):
            if retry_attempt < max_retries - 1:
                logger.warning(f"Device mismatch error on attempt {retry_attempt + 1}/{max_retries}. Retrying in {retry_delay}s...")
                await asyncio.sleep(retry_delay)
                retry_delay *= 2  # Exponential backoff
Status: This helped with transient issues but didn't solve the root cause, which is on the Modal server side and not fixable from the client.
Alternative Solution: Local Inference
Since the PhosphoBot Modal server had device mismatch issues, I implemented a local inference solution using official Isaac-GR00T scripts.
Architecture: Client-Server Model
Based on NVIDIA’s official tutorial, the solution uses a two-terminal approach:
Terminal 1: Inference Server
- Loads GR00T model on local GPU
- Runs inference on observations
- Returns action predictions
- Uses ZMQ protocol for fast communication
Terminal 2: Robot Client
- Connects to SO-100 robot via USB
- Captures camera images
- Sends observations to server
- Executes returned actions
Implementation
Server Script (start_groot_server.sh):
#!/bin/bash
cd /home/vipin/Isaac-GR00T
conda activate gr00t

python scripts/inference_service.py \
    --server \
    --model_path "phospho-app/gr00t-example_dataset-h9g75u7gak" \
    --embodiment-tag "new_embodiment" \
    --data-config "so100_dualcam" \
    --denoising-steps 4 \
    --port 5555
Client Script (gr00t_inference_local.py):
- Based on official examples/SO-100/eval_lerobot.py
- Connects to SO-100 robot using LeRobot
- Initializes cameras (IDs 0 and 2)
- Connects to GR00T inference server
- Runs inference loop at 30 FPS
- Handles action horizon queuing
Configuration:
# Robot
ROBOT_TYPE = "so100_follower"
ROBOT_PORT = "/dev/ttyACM0"
ROBOT_ID = "so-100"

# Cameras
CAMERA_CONFIGS = {
    "front": {"type": "opencv", "index_or_path": 0, "width": 640, "height": 480, "fps": 30},
    "wrist": {"type": "opencv", "index_or_path": 2, "width": 640, "height": 480, "fps": 30},
}

# Task
LANG_INSTRUCTION = "pick up the striped box and put it into the white plate"
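The actual message format and client class are defined by Isaac-GR00T's inference service; the sketch below only illustrates the request/response pattern over ZMQ, with pyzmq's send_pyobj/recv_pyobj standing in for the repo's own serialization and the observation keys used as placeholders:
# Illustration of the ZMQ request/reply pattern only, not the repo's API.
import time
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)          # request/reply pattern
socket.connect("tcp://localhost:5555")    # same port the inference server uses

def get_action(observation: dict) -> dict:
    """Send one observation dict and block until the action chunk comes back."""
    socket.send_pyobj(observation)
    return socket.recv_pyobj()

# Skeleton of a ~30 FPS control loop (placeholder observations)
for _ in range(3):
    obs = {"video.front": None, "video.wrist": None, "state.single_arm": None}
    actions = get_action(obs)
    # execute the first action of the returned chunk on the robot here
    time.sleep(1 / 30)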
- No Modal server dependency: Runs entirely locally
- Full control over device placement: All tensors on GPU
- Easier debugging: Direct access to model and logs
- Lower latency: No network round-trip to Modal
- Official NVIDIA approach: Based on their tutorial
- More reliable: Better for production deployment
Disk Space Management
During the debugging process, encountered disk space issues (96GB disk at 100% capacity). Performed cleanup:
Actions taken:
- Cleaned conda package cache: conda clean --all --yes (freed 2.3GB)
- Removed HuggingFace XET cache: rm -rf ~/.cache/huggingface/xet (freed 6.0GB)
- Removed old LeRobot dataset versions (freed 1.3GB)
Result: Freed ~10GB total, bringing disk usage down to roughly 89%.
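The math behind that figure is easy to sanity-check with the standard library (freeing ~10GB on a ~96GB disk at 100% leaves usage around 89%):
# Quick disk-usage check with the standard library only.
import shutil

total, used, free = shutil.disk_usage("/")
gib = 1024 ** 3
print(f"total: {total / gib:.1f} GiB, used: {used / gib:.1f} GiB "
      f"({used / total:.0%}), free: {free / gib:.1f} GiB")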
Technical Notes
Model Configuration Debugging
- Always add debug logging before making assumptions
- Check data structures - don’t assume dictionary structure
- Handle multiple cases - model configs can vary
- Verify on both sides - client and server must agree on config
PhosphoBot + Modal Limitations
- Modal server is a black box - can’t fix server-side issues
- Device placement errors on remote server are hard to debug
- Network latency adds overhead
- Dependency on external service
Direct Inference Requirements
- Local GPU with sufficient VRAM (16GB for GR00T N1.5)
- Isaac-GR00T repository installed
- LeRobot for robot control
- Proper conda environment setup
Current Status
- Joint count mismatch fixed - correctly reads 6 joints from model config
- Debug logging added for comprehensive troubleshooting
- Device mismatch on Modal has retry logic but root cause remains on server
- Alternative local inference solution implemented and ready for testing
- Disk space cleaned - freed 10GB
Usage Commands
Start inference server:
cd /home/vipin/phosphobot
./start_groot_server.sh
Run robot client (separate terminal):
cd /home/vipin/phosphobot
conda activate gr00t
python gr00t_inference_local.py
Summary
This debugging session involved systematic troubleshooting of AI model inference issues, from configuration parsing problems to device placement errors on remote servers. The solution involved implementing a local inference architecture that provides better control and reliability for robotic manipulation tasks.
The local approach eliminates dependencies on external services and provides the foundation for more robust robotic AI applications.
Next: Testing the local inference system with actual robot manipulation tasks.
Model: NVIDIA GR00T N1.5 (3B parameters)
Hardware: SO-100 Robot Arm, RTX 4080 Super
Software: Isaac-GR00T, LeRobot, PhosphoBot -
Debugging Dual-Camera Vision System for SO-101 Robotic Manipulation Platform
10/05/2025 at 15:51 • 0 commentsProject Overview
I worked on debugging a dual-camera vision system for my SO-101 robotic manipulation platform. The cameras were experiencing intermittent streaming failures that initially appeared to be software compatibility issues, but turned out to be caused by a faulty USB extension cable.
This project log documents the troubleshooting process, technical solutions implemented, and lessons learned while setting up the vision system for robotic data collection.
Hardware Setup
The SO-101 platform consists of:
- Dual robotic arms: Leader and follower configuration with USB serial communication
- Dual camera system:
- Wrist-mounted camera (640x480 @ 30fps) for end-effector view
- Scene camera (NexiGo N60 FHD, 1920x1080 capable) for workspace overview
- Target performance: Stable 30 FPS streaming for robotics data collection
The Problem: Intermittent Camera Failures
The scene camera was experiencing frustrating intermittent failures:
- Random “No such device” errors during streaming
- Inconsistent connection behavior
- Performance degradation over time
- Apparent timing-related issues
Initial symptoms pointed to software compatibility problems, leading me down a complex debugging path.
Solution 1: Persistent Device Management with udev Rules
First, I tackled device management by implementing comprehensive udev rules for consistent device naming:
# /etc/udev/rules.d/99-lerobot-so101.rules
SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="7523", ATTRS{serial}=="54A90068534", SYMLINK+="leader", MODE="0666"
SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="7523", ATTRS{serial}=="54A900685B4", SYMLINK+="follower", MODE="0666"
SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1bcf", ATTRS{idProduct}=="2c99", ATTR{index}=="0", SYMLINK+="wrist", MODE="0666"
SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1bcf", ATTRS{idProduct}=="2b95", ATTR{index}=="0", SYMLINK+="scene", MODE="0666"
Key insight: Using ATTR{index}=="0" prevents conflicts with video device metadata nodes, ensuring symlinks point to actual video devices.
I developed an improved OpenCV camera implementation with better resource management:
import cv2

class CorrectedOpenCVCamera:
    def __init__(self, camera_index, fps=30, width=640, height=480):
        self.camera_index = camera_index
        self.fps = fps
        self.width = width
        self.height = height
        self.cap = None

    def connect(self):
        self.cap = cv2.VideoCapture(self.camera_index)
        if not self.cap.isOpened():
            raise RuntimeError(f"Failed to open camera {self.camera_index}")
        # Set properties for optimal performance
        self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, self.width)
        self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, self.height)
        self.cap.set(cv2.CAP_PROP_FPS, self.fps)
After implementing various software approaches including:
- MJPG format forcing (reduced performance from 30fps to 15fps)
- Artificial timing delays (caused more failures)
- Complex configuration workarounds
The actual root cause was identified through systematic testing: a poor-quality USB extension cable.
Debugging Approach
- Systematic isolation: Tested each camera individually
- Performance measurement: FPS monitoring under realistic conditions (see the sketch after this list)
- Hardware verification: Checked physical connections
- Root cause analysis: Eliminated software assumptions
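A minimal version of that FPS measurement (a sketch using plain OpenCV rather than the LeRobot camera wrappers; index 2 corresponds to /dev/scene in this setup):
# Minimal FPS measurement for one camera using plain OpenCV.
import time
import cv2

def measure_fps(camera_index: int, num_frames: int = 300) -> float:
    cap = cv2.VideoCapture(camera_index)
    if not cap.isOpened():
        raise RuntimeError(f"Failed to open camera {camera_index}")
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    cap.set(cv2.CAP_PROP_FPS, 30)

    start = time.monotonic()
    grabbed = 0
    for _ in range(num_frames):
        ok, _frame = cap.read()
        if ok:
            grabbed += 1
    elapsed = time.monotonic() - start
    cap.release()
    return grabbed / elapsed

print(f"scene camera: {measure_fps(2):.1f} FPS")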
Results
After removing the faulty USB extension cable:
- Wrist camera: 33.5 FPS sustained
- Scene camera: 30+ FPS sustained
- Both cameras: Work with default OpenCV settings
- System stability: No configuration workarounds needed
Technical Notes
1. Hardware vs Software Issues
Physical connection issues can create symptoms that mimic software problems. Checking hardware connections early in the debugging process can save time.
2. USB Cable Quality
Poor quality USB extension cables can cause:
- Signal degradation
- Power delivery issues
- Bandwidth limitations
- Intermittent connection failures
3. Camera Configuration
- Forcing MJPG format: Not necessary, can reduce performance
- Artificial delays: Can cause more failures
- Default OpenCV settings: Often work well without modification
4. Persistent Device Management
udev rules with device-specific attributes provide consistent operation across reboots and reconnections.
Current System Status
The system is now operational:
- All 4 USB devices with persistent symlinks (/dev/leader, /dev/follower, /dev/wrist, /dev/scene)
- Stable 30 FPS streaming on both cameras
- LeRobot integration working
- Ready for robotic manipulation data collection
Tools and Commands
# Test camera detection
lerobot-find-cameras opencv

# Capture test photos
python capture_camera.py --device /dev/wrist
python capture_camera.py --device /dev/scene

# Reload udev rules after changes
sudo udevadm control --reload-rules && sudo udevadm trigger
Future Improvements
- Cable management: Implement proper routing to prevent disconnections
- Performance monitoring: Add logging for early detection of degradation
- Integration testing: Validate complete workflow with arms and cameras together
Summary
This debugging session highlighted the importance of checking simple things first when facing technical issues. A faulty USB cable was causing what appeared to be complex software compatibility problems.
The systematic approach resulted in a stable vision system for robotic manipulation data collection.
Next: Implementing NVIDIA GR00T 1.5 model integration for manipulation policies using this camera system.
Project Repository: LeRobot SO-101 Platform
Hardware: SO-101 Robotic Arms, NexiGo N60 FHD Camera
Software: LeRobot, OpenCV, Python, udev
Vipin M