From USD scene creation to MimicGen integration: A complete implementation of multi-ingredient manipulation in Isaac Sim
Project Overview
This project implements a complete sandwich assembly simulation environment in Isaac Lab, designed to train robotic manipulation policies for multi-step food preparation tasks. The development involved creating a custom USD scene with proper physics configuration, implementing MimicGen integration for data augmentation, and solving critical challenges with rigid body hierarchies and API compatibility.
This project log documents the systematic development process, from initial scene setup through the final dynamic ingredient selection feature, including all the debugging challenges and solutions encountered along the way.
The simulation successfully supports teleoperation, demonstration recording, MimicGen annotation, and automated data generation for training vision-language-action (VLA) models on sandwich assembly tasks.
Hardware and Software Stack
- Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (wrist + front) at 640x480, 30fps
- Front camera calibrated for Nexigo N60 webcam (78° FOV)
- Wrist camera for close-up manipulation view
- GPU: RTX 4080 Super with 16GB VRAM
- Simulation: Isaac Sim 5.0 + Isaac Lab framework
- Framework: LeIsaac (custom robotics framework built on Isaac Lab)
- Task: Multi-ingredient sandwich assembly with 4 ingredients (2× bread, cheese, patty)
The Challenge: Multi-Ingredient Manipulation
Why Sandwich Assembly?
Sandwich assembly represents a complex manipulation task that requires:
- Sequential manipulation: Multiple pick-and-place operations
- Spatial reasoning: Proper stacking order and alignment
- Object diversity: Different ingredient types with varying properties
- Real-world relevance: Applicable to food preparation and assembly tasks
- VLA training: Language-conditioned manipulation (“pick up the cheese”, “place the patty”)
Technical Requirements
- USD Scene: Custom kitchen environment with proper physics
- Multiple Objects: 4 ingredients + plate + holder, each with correct physics properties
- MimicGen Integration: Subtask annotation and data augmentation
- Dynamic Configuration: Support for different ingredient types
- Camera Setup: Optimal viewing angles for manipulation
Phase 1: USD Scene Creation
Initial Scene Setup (Commit 1c52342)
Development began with creating the basic task structure and scene configuration. The initial implementation involved:
Created Files:
- assemble_sandwich_env_cfg.py - Main environment configuration
- assemble_sandwich_mimic_env_cfg.py - MimicGen variant with subtask configs
- README.md - Complete documentation (614 lines)
Key Configuration:
# Scene loading with automatic USD parsing
parse_usd_and_create_subassets(KITCHEN_WITH_SANDWICH_USD_PATH, self)
This function automatically detects all rigid body objects in the USD scene and creates corresponding configuration objects, eliminating manual object registration.
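For comparison, this is roughly what the automatic parsing saves you from writing by hand. The sketch below is illustrative only: the import path, prim path pattern, and initial pose assume Isaac Lab's asset configuration API and are not taken from the project's code.

# Hypothetical manual equivalent of what parse_usd_and_create_subassets() generates.
from isaaclab.assets import RigidObjectCfg  # "omni.isaac.lab.assets" in older releases

bread_slice_1 = RigidObjectCfg(
    prim_path="{ENV_REGEX_NS}/bread_slice_1",  # prim already authored in the USD scene
    spawn=None,                                # spawn nothing; reuse the existing prim
    init_state=RigidObjectCfg.InitialStateCfg(pos=(0.0, 0.0, 0.9)),  # placeholder pose
)
# ...and likewise for bread_slice_2, cheese_slice, patty, plate, ingredients_holder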
Scene Simplification Challenge
Problem: The original kitchen scene was 37.9 MB with complex fixtures (cabinets, appliances, decorative elements) that slowed simulation and cluttered the workspace.
Solution: Documented a systematic simplification workflow:
- Remove unnecessary kitchen fixtures
- Create simple table workspace (1.2m × 0.8m × 0.85m)
- Reduce file size to ~5-10 MB (75% reduction)
- Optimize for robot simulation performance
Table Layout Design:
┌─────────────────────────────────┐
│  [Ingredients Holder]    [🍽️]  │  ← Left: Holder, Right: Plate
│  ┌─┬─┬─┬─┐              Plate  │
│  │🍞│🍞│🥩│🧀│                  │  ← Slots: bread, bread, patty, cheese
│  └─┴─┴─┴─┘                     │
│                                │
│         Assembly Area          │
└─────────────────────────────────┘
Physics Configuration (Commit 6a7e4b5)
The USD scene structure was created with proper physics APIs for all objects:
Dynamic Objects (movable ingredients):
- Objects: bread_slice_1, bread_slice_2, cheese_slice, patty
- Physics properties: RigidBodyAPI, CollisionAPI, MassAPI
- Configuration: physics:kinematicEnabled = false (affected by gravity)
Static Objects (fixed environment):
- Objects: plate, ingredients_holder, table
- Physics properties: RigidBodyAPI, CollisionAPI
- Configuration: physics:kinematicEnabled = true (no movement)
Critical Learning: USD payload overrides can replace physics APIs from payload files. The solution required adding physics and collision APIs to both def and over declarations in scene.usda to ensure proper physics behavior.
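The same physics setup can also be expressed with the USD Python API. The snippet below is a hedged sketch, not the project's actual tooling, showing how the RigidBodyAPI, CollisionAPI, and MassAPI schemas from the lists above would be applied to one dynamic and one static prim; the mass value is a placeholder.

# Illustrative only: applying the physics schemas with pxr.UsdPhysics.
from pxr import Usd, UsdPhysics

stage = Usd.Stage.Open("scene.usda")

# Dynamic ingredient: rigid body + collision + mass, gravity enabled
bread = stage.GetPrimAtPath("/Root/bread_slice_1")
UsdPhysics.RigidBodyAPI.Apply(bread).CreateKinematicEnabledAttr(False)  # affected by gravity
UsdPhysics.CollisionAPI.Apply(bread)
UsdPhysics.MassAPI.Apply(bread).CreateMassAttr(0.05)  # placeholder mass in kg

# Static fixture: kinematic rigid body + collision, never moves
table = stage.GetPrimAtPath("/Root/Scene/table")
UsdPhysics.RigidBodyAPI.Apply(table).CreateKinematicEnabledAttr(True)   # no movement
UsdPhysics.CollisionAPI.Apply(table)

stage.Save()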
Phase 2: The Rigid Body Hierarchy Crisis
The Critical Error
Problem: When attempting teleoperation, a fatal error occurred:
[Error] Multiple rigid bodies in hierarchy detected
Root Cause: Objects were nested inside the table hierarchy, creating parent-child rigid body relationships that Isaac Sim’s physics engine cannot handle.
Investigation Process:
- Examined working reference scene (kitchen_with_orange)
- Investigation revealed correct USD hierarchy pattern
- Analysis showed that manipulable objects must be direct children of /Root, NOT nested inside Scene or table
Incorrect Hierarchy:
/Root
└── Scene
    └── table
        ├── bread_slice_1   ❌ Nested rigid body
        ├── cheese_slice    ❌ Nested rigid body
        └── patty           ❌ Nested rigid body
Correct Hierarchy:
/Root
├── Scene
│   └── table
├── bread_slice_1   ✅ Direct child of /Root
├── cheese_slice    ✅ Direct child of /Root
└── patty           ✅ Direct child of /Root
The Fix (Commit 6b64f15)
The USD hierarchy was flattened by moving all manipulable objects out of the table to be direct children of /Root. This critical architectural fix enabled proper physics simulation.
Validation: After flattening, teleoperation worked correctly with no rigid body hierarchy errors.
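For anyone scripting the same fix, the re-parenting can also be done with the USD Python API rather than by hand; the sketch below is one possible approach, not necessarily how it was done here. Prims composed from payloads may need the edit applied in the layer that actually defines them, and world transforms may need re-authoring once the table's transform no longer applies.

# Hedged sketch: move nested ingredient prims up to /Root in scene.usda.
from pxr import Usd, Sdf

stage = Usd.Stage.Open("scene.usda")
layer = stage.GetRootLayer()

for name in ["bread_slice_1", "bread_slice_2", "cheese_slice", "patty"]:
    src = Sdf.Path(f"/Root/Scene/table/{name}")
    dst = Sdf.Path(f"/Root/{name}")
    Sdf.CopySpec(layer, src, layer, dst)  # copy the prim spec to the new location
    stage.RemovePrim(src)                 # remove the nested original

stage.Save()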
Phase 3: Camera Configuration (Commits d5328b3, 5df9d64)
Camera Positioning Strategy
The implementation includes a three-camera system optimized for sandwich assembly:
1. Wrist Camera (Close-up manipulation):
offset=TiledCameraCfg.OffsetCfg(
    pos=(0.02, 0.08, -0.03),          # Slightly forward and up
    rot=(-0.35, -0.93, -0.05, 0.08),  # Angled down toward table
)
2. Front Camera (Workspace overview, calibrated for Nexigo N60 webcam):
offset=TiledCameraCfg.OffsetCfg(
    pos=(-0.2, -0.8, 0.7),       # Higher and angled for table overview
    rot=(0.2, -0.98, 0.0, 0.0),  # Looking down at workspace
),
spawn=sim_utils.PinholeCameraCfg(
    focal_length=24.0,           # Nexigo N60 equivalent
    horizontal_aperture=36.0,    # ~78° FOV to match webcam
    focus_distance=400.0,        # Optimal for table distance
)
3. Viewer Camera (Development/debug):
self.viewer.eye = (1.5, -1.5, 1.8)     # Elevated diagonal view
self.viewer.lookat = (0.0, 0.0, 0.9)   # Looking at table center
Key Insight: The front camera was specifically calibrated to match the Nexigo N60 webcam specifications (78° FOV, 640x480 resolution, 30 FPS) to ensure sim-to-real consistency for VLA training.
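For reference, a pinhole camera's horizontal field of view follows from its focal length and horizontal aperture as FOV = 2·arctan(aperture / (2·focal_length)). The quick check below (illustrative, not project code) plugs in the values from the front-camera config above:

import math

focal_length = 24.0         # mm, from the front camera config
horizontal_aperture = 36.0  # mm

fov_deg = math.degrees(2 * math.atan(horizontal_aperture / (2 * focal_length)))
print(f"Horizontal FOV ≈ {fov_deg:.1f}°")  # ≈ 73.7°, close to the webcam's nominal 78°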
Phase 4: MimicGen Integration
Automatic Subtask Annotation (Commit 3e3ce0b)
Automatic subtask detection was implemented using observation functions:
Created mdp/observations.py:
def ingredient_grasped(
    env: ManagerBasedRLEnv,
    robot_cfg: SceneEntityCfg,
    bread_slice_1_cfg: SceneEntityCfg,
    cheese_slice_cfg: SceneEntityCfg,
    patty_cfg: SceneEntityCfg,
    bread_slice_2_cfg: SceneEntityCfg,
    gripper_open_threshold: float = 0.03,
) -> torch.Tensor:
    """Detect when any ingredient is grasped by checking gripper closure and proximity."""
    robot: Articulation = env.scene[robot_cfg.name]
    gripper_pos = robot.data.joint_pos[:, -1]
    gripper_closed = gripper_pos < gripper_open_threshold

    # Check proximity to each ingredient
    for ingredient_name, ingredient_cfg in [
        ("bread_slice_1", bread_slice_1_cfg),
        ("cheese_slice", cheese_slice_cfg),
        ("patty", patty_cfg),
        ("bread_slice_2", bread_slice_2_cfg),
    ]:
        if ingredient_name in env.scene:
            ingredient: RigidObject = env.scene[ingredient_name]
            distance = torch.norm(
                robot.data.body_pos_w[:, -1, :3] - ingredient.data.root_pos_w[:, :3],
                dim=-1,
            )
            is_grasped = gripper_closed & (distance < 0.05)
            if is_grasped.any():
                return is_grasped
    return torch.zeros(env.num_envs, dtype=torch.bool, device=env.device)
This function automatically detects when the robot grasps an ingredient by checking:
- Gripper closure (joint position < threshold)
- Proximity to ingredient (distance < 5cm)
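In the environment configuration, a detector like this is typically registered as an observation term so the annotation tooling can read it as a subtask signal. The wiring below is a hedged sketch: the group and term names are illustrative, and the import namespace (isaaclab vs. omni.isaac.lab) depends on the Isaac Lab version in use.

# Illustrative wiring of ingredient_grasped as a subtask-signal observation group.
from isaaclab.managers import ObservationGroupCfg as ObsGroup
from isaaclab.managers import ObservationTermCfg as ObsTerm
from isaaclab.managers import SceneEntityCfg
from isaaclab.utils import configclass

from . import mdp  # the task's mdp module containing ingredient_grasped


@configclass
class SubtaskCfg(ObsGroup):
    """Observations used as subtask termination signals."""

    grasp_ingredient = ObsTerm(
        func=mdp.ingredient_grasped,
        params={
            "robot_cfg": SceneEntityCfg("so101_follower"),
            "bread_slice_1_cfg": SceneEntityCfg("bread_slice_1"),
            "cheese_slice_cfg": SceneEntityCfg("cheese_slice"),
            "patty_cfg": SceneEntityCfg("patty"),
            "bread_slice_2_cfg": SceneEntityCfg("bread_slice_2"),
        },
    )

    def __post_init__(self):
        self.concatenate_terms = False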
Success Termination Criteria (Commit faf0a22)
Problem: The MimicGen annotation script requires a success termination term, but the environment only had a time_out termination.
Solution: Created mdp/terminations.py with comprehensive success criteria:
def task_done(
    env: ManagerBasedRLEnv,
    plate_cfg: SceneEntityCfg,
    bread_slice_1_cfg: SceneEntityCfg,
    cheese_slice_cfg: SceneEntityCfg,
    patty_cfg: SceneEntityCfg,
    bread_slice_2_cfg: SceneEntityCfg,
    height_threshold: float = 0.02,
    xy_threshold: float = 0.05,
    test_mode: bool = False,
) -> torch.Tensor:
    """Determine if the sandwich assembly task is complete."""
    if test_mode:
        return torch.ones(env.num_envs, dtype=torch.bool, device=env.device)
    # Check XY alignment (all ingredients within 5cm of plate center)
    # Check vertical stacking order (bread → cheese → patty → bread)
    # Check stability (low velocity for all ingredients)
    # ...
Success Criteria:
- XY Alignment: All ingredients within 5cm of plate center
- Vertical Stacking: Correct order (bread → cheese → patty → bread)
- Stability: Low velocity for all ingredients
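The geometric checks themselves are elided in the snippet above. One way they might be expressed is sketched below, with a hypothetical helper name and the xy_threshold and height_threshold arguments from the function signature; this is not the project's actual implementation.

import torch

def sandwich_stacked(env, plate_cfg, ingredient_cfgs, xy_threshold=0.05, height_threshold=0.02):
    """Per-env bool: ingredients aligned over the plate, stacked in order, and at rest."""
    plate = env.scene[plate_cfg.name]
    ingredients = [env.scene[cfg.name] for cfg in ingredient_cfgs]  # bottom-to-top order

    plate_xy = plate.data.root_pos_w[:, :2]
    success = torch.ones(env.num_envs, dtype=torch.bool, device=env.device)

    # XY alignment: every ingredient within xy_threshold of the plate center
    for obj in ingredients:
        success &= torch.norm(obj.data.root_pos_w[:, :2] - plate_xy, dim=-1) < xy_threshold

    # Stacking order: each layer at least height_threshold above the one below it
    heights = [obj.data.root_pos_w[:, 2] for obj in ingredients]
    for lower, upper in zip(heights[:-1], heights[1:]):
        success &= (upper - lower) > height_threshold

    # Stability: all ingredients nearly at rest
    for obj in ingredients:
        success &= torch.norm(obj.data.root_lin_vel_w, dim=-1) < 0.05

    return success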
Test Mode for Incomplete Demonstrations (Commit 6366699)
Problem: During development, testing the annotation pipeline required support for incomplete demonstrations (just moving the arm without completing the task).
Solution: Added --force_completion command-line argument:
parser.add_argument( "--force_completion", action="store_true", default=False, help="Force task completion for incomplete demonstrations."
)
# Modify success_term parameters based on flag
if args_cli.force_completion: success_term.params["test_mode"] = True print("[INFO] Force completion enabled: success termination will accept incomplete demonstrations.")
This allows switching between:
- Testing mode: Accept any demonstration (useful for pipeline testing)
- Production mode: Enforce strict success criteria (for real training data)
Phase 5: Dynamic Ingredient Selection (Commit 24bb349)
The KeyError Challenge
Problem: MimicGen data generation failed with:
KeyError: 'ingredient'
Root Cause: The subtask configuration used object_ref="ingredient" (a generic placeholder), but the USD scene contains specific objects: bread_slice_1, bread_slice_2, cheese_slice, patty.
Investigation: The get_object_poses() method returns a dictionary with keys from actual scene object names, not generic placeholders. MimicGen tried to access object_poses["ingredient"] which didn’t exist.
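An illustrative reconstruction of the failure (the exact accessor and return type are assumed from the description above):

object_poses = env.get_object_poses()    # keyed by real scene object names
print(sorted(object_poses.keys()))
# ['bread_slice_1', 'bread_slice_2', 'cheese_slice', 'patty', ...]
grasp_pose = object_poses["ingredient"]  # -> KeyError: 'ingredient'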
The Solution: Dynamic Ingredient Selection
A command-line argument was implemented to dynamically specify which ingredient to track:
parser.add_argument( "--ingredient_type", type=str, default=None, choices=["bread_slice_1", "bread_slice_2", "cheese_slice", "patty"], help="Specify the ingredient type for sandwich assembly task."
)
# Dynamically set ingredient type
if args_cli.ingredient_type is not None and "AssembleSandwich" in env_name:
    ingredient_display_names = {
        "bread_slice_1": "bread slice",
        "bread_slice_2": "bread slice",
        "cheese_slice": "cheese slice",
        "patty": "patty",
    }
    ingredient_name = ingredient_display_names.get(args_cli.ingredient_type)

    # Update object_ref for the first subtask (grasp ingredient)
    env_cfg.subtask_configs["so101_follower"][0].object_ref = args_cli.ingredient_type

    # Update descriptions
    env_cfg.subtask_configs["so101_follower"][0].description = f"Grasp {ingredient_name} from cartridge"
    env_cfg.subtask_configs["so101_follower"][1].description = f"Place {ingredient_name} on plate"

    print(f"[INFO] Ingredient type set to: {args_cli.ingredient_type} ({ingredient_name})")
Benefits:
- No Manual Editing: Single command-line flag instead of code changes
- Type Safety: Choices parameter prevents typos
- Clear Intent: Command shows exactly which ingredient is being processed
- Workflow Efficiency: Easy switching between ingredient types
Usage Example:
# Generate bread demonstrations
python.sh scripts/mimic/generate_dataset.py \
    --task=LeIsaac-SO101-AssembleSandwich-Mimic-v0 \
    --input_file=./datasets/annotated_bread_ingredient.hdf5 \
    --output_file=./datasets/generated_bread_ingredient.hdf5 \
    --ingredient_type=bread_slice_1 \
    --generation_num_trials=20

# Generate patty demonstrations
python.sh scripts/mimic/generate_dataset.py \
    --ingredient_type=patty \
    ...
Phase 6: API Compatibility Fix (Commit 6b64f15)
The Se3Keyboard Error
Problem: Annotation script failed with:
TypeError: Se3Keyboard.__init__() got an unexpected keyword argument 'pos_sensitivity'
Root Cause: The leisaac repository’s annotate_demos.py diverged from the official IsaacLab version. Isaac Lab changed the Se3Keyboard API in May 2025 to use a configuration object pattern, but the script was created in August 2025 with the old API.
Solution: Updated to use Se3KeyboardCfg:
# Old (incorrect):
device = Se3Keyboard(pos_sensitivity=0.05, rot_sensitivity=0.05)

# New (correct):
from omni.isaac.lab.devices import Se3KeyboardCfg
device = Se3Keyboard(Se3KeyboardCfg(pos_sensitivity=0.05, rot_sensitivity=0.05))
This follows Isaac Lab’s configuration pattern where device classes accept a single cfg parameter of a corresponding configuration dataclass type.
Technical Insights and Lessons Learned
1. USD Hierarchy is Critical
The rigid body hierarchy issue was the most critical architectural challenge. Key learning: Manipulable objects must be direct children of the root prim, not nested inside other rigid bodies.
2. Physics API Overrides
USD payload overrides can replace physics APIs from payload files. Always add physics and collision APIs to both def and over declarations to ensure proper behavior.
3. MimicGen Configuration Requirements
- Intermediate subtasks: Can have termination signals and offset ranges
- Final subtask: Must have subtask_term_signal=None and subtask_term_offset_range=(0, 0)
- Object references: Must match actual USD scene object names (a configuration sketch follows below)
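A two-subtask configuration obeying these rules might look like the sketch below. The SubTaskConfig class name and import path are assumptions about Isaac Lab's mimic support; the field names (object_ref, subtask_term_signal, subtask_term_offset_range, description) match those used elsewhere in this project, while the offset window and the placement object_ref are illustrative values.

from isaaclab.envs.mimic_env_cfg import SubTaskConfig  # assumed import path

subtask_configs = {
    "so101_follower": [
        # Intermediate subtask: grasp the ingredient
        SubTaskConfig(
            object_ref="bread_slice_1",              # must match a USD scene object name
            subtask_term_signal="grasp_ingredient",  # observation term that flags completion
            subtask_term_offset_range=(10, 20),      # illustrative annotation offset window
            description="Grasp bread slice from cartridge",
        ),
        # Final subtask: place it on the plate
        SubTaskConfig(
            object_ref="plate",                      # illustrative reference for the place step
            subtask_term_signal=None,                # final subtask: no termination signal
            subtask_term_offset_range=(0, 0),        # and a zero offset range
            description="Place bread slice on plate",
        ),
    ]
}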
4. Camera Calibration for Sim-to-Real
Calibrating simulation cameras to match real hardware (Nexigo N60 webcam) is essential for VLA training. Key parameters:
- Field of view (78° diagonal)
- Resolution (640x480)
- Frame rate (30 FPS)
- Focus distance (400mm for table workspace)
5. Flexible Testing Workflows
The --force_completion flag enables rapid iteration during development by accepting incomplete demonstrations, while maintaining strict criteria for production data collection.
Results and Portfolio Value
Completed Features
✅ Custom USD scene with proper physics configuration
✅ Multi-ingredient manipulation support (4 ingredients)
✅ Dual camera system calibrated for real hardware
✅ MimicGen integration with automatic subtask detection
✅ Success termination criteria for task completion
✅ Dynamic ingredient selection for data augmentation
✅ Comprehensive documentation (614-line README)
Demonstration Workflow
- Record demonstrations for each ingredient type (bread, cheese, patty)
- Annotate subtasks manually or automatically
- Generate augmented data using MimicGen (1 demo → 20+ variations)
- Train VLA policies with language conditioning
- Deploy to real robot with sim-to-real transfer
Portfolio Highlights
This project demonstrates:
- USD scene design with complex physics requirements
- Debugging systematic issues (rigid body hierarchy, API compatibility)
- MimicGen integration for data augmentation
- Flexible configuration for different use cases
- Real-world applicability to food preparation and assembly tasks
Future Enhancements
- Language Conditioning: Integrate with VLA models for language-conditioned manipulation
- Sim-to-Real Transfer: Deploy trained policies to real SO-101 robot
- Advanced Physics: Add deformable objects (lettuce, tomato)
- Multi-Robot: Extend to bi-manual manipulation
- Benchmarking: Compare MimicGen vs manual demonstrations
Conclusion
Building the sandwich assembly simulation required solving challenges across multiple domains: USD scene design, physics configuration, MimicGen integration, and API compatibility. The systematic debugging approach and comprehensive documentation make this a strong portfolio piece demonstrating end-to-end robotics simulation development.
The dynamic ingredient selection feature and flexible testing workflows show attention to developer experience and practical usability. The project successfully bridges simulation and real-world robotics, providing a foundation for training manipulation policies on complex multi-step tasks.
Vipin M