From USD scene creation to MimicGen integration: A complete implementation of multi-ingredient manipulation in Isaac Sim
Project Overview
This project implements a complete sandwich assembly simulation environment in Isaac Lab, designed to train robotic manipulation policies for multi-step food preparation tasks. The development involved creating a custom USD scene with proper physics configuration, implementing MimicGen integration for data augmentation, and solving critical challenges with rigid body hierarchies and API compatibility.
This project log documents the systematic development process, from initial scene setup through the final dynamic ingredient selection feature, including all the debugging challenges and solutions encountered along the way.
The simulation successfully supports teleoperation, demonstration recording, MimicGen annotation, and automated data generation for training vision-language-action (VLA) models on sandwich assembly tasks.
Hardware and Software Stack
- Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (wrist + front) at 640x480, 30fps
- Front camera calibrated for Nexigo N60 webcam (78° FOV)
- Wrist camera for close-up manipulation view
- GPU: RTX 4080 Super with 16GB VRAM
- Simulation: Isaac Sim 5.0 + Isaac Lab framework
- Framework: LeIsaac (custom robotics framework built on Isaac Lab)
- Task: Multi-ingredient sandwich assembly with 4 ingredients (2× bread, cheese, patty)
The Challenge: Multi-Ingredient Manipulation
Why Sandwich Assembly?
Sandwich assembly represents a complex manipulation task that requires:
- Sequential manipulation: Multiple pick-and-place operations
- Spatial reasoning: Proper stacking order and alignment
- Object diversity: Different ingredient types with varying properties
- Real-world relevance: Applicable to food preparation and assembly tasks
- VLA training: Language-conditioned manipulation (“pick up the cheese”, “place the patty”)
Technical Requirements
- USD Scene: Custom kitchen environment with proper physics
- Multiple Objects: 4 ingredients + plate + holder, each with correct physics properties
- MimicGen Integration: Subtask annotation and data augmentation
- Dynamic Configuration: Support for different ingredient types
- Camera Setup: Optimal viewing angles for manipulation
Phase 1: USD Scene Creation
Initial Scene Setup (Commit 1c52342)
Development began with creating the basic task structure and scene configuration. The initial implementation involved:
Created Files:
- assemble_sandwich_env_cfg.py - Main environment configuration
- assemble_sandwich_mimic_env_cfg.py - MimicGen variant with subtask configs
- README.md - Complete documentation (614 lines)
Key Configuration:
# Scene loading with automatic USD parsing
parse_usd_and_create_subassets(KITCHEN_WITH_SANDWICH_USD_PATH, self)
This function automatically detects all rigid body objects in the USD scene and creates corresponding configuration objects, eliminating manual object registration.
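For comparison, this is roughly what the automatic parsing saves you from writing by hand. The sketch below is illustrative only: the import path, prim path pattern, and initial pose assume Isaac Lab's asset configuration API and are not taken from the project's code.

# Hypothetical manual equivalent of what parse_usd_and_create_subassets() generates.
from isaaclab.assets import RigidObjectCfg  # "omni.isaac.lab.assets" in older releases

bread_slice_1 = RigidObjectCfg(
    prim_path="{ENV_REGEX_NS}/bread_slice_1",  # prim already authored in the USD scene
    spawn=None,                                # spawn nothing; reuse the existing prim
    init_state=RigidObjectCfg.InitialStateCfg(pos=(0.0, 0.0, 0.9)),  # placeholder pose
)
# ...and likewise for bread_slice_2, cheese_slice, patty, plate, ingredients_holder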
Scene Simplification Challenge
Problem: The original kitchen scene was 37.9 MB with complex fixtures (cabinets, appliances, decorative elements) that slowed simulation and cluttered the workspace.
Solution: Documented a systematic simplification workflow:
- Remove unnecessary kitchen fixtures
- Create simple table workspace (1.2m × 0.8m × 0.85m)
- Reduce file size to ~5-10 MB (75% reduction)
- Optimize for robot simulation performance
Table Layout Design:
┌─────────────────────────────────┐
│  [Ingredients Holder]    [🍽️]  │  ← Left: Holder, Right: Plate
│  ┌─┬─┬─┬─┐              Plate  │
│  │🍞│🍞│🥩│🧀│                  │  ← Slots: bread, bread, patty, cheese
│  └─┴─┴─┴─┘                     │
│                                │
│         Assembly Area          │
└─────────────────────────────────┘
Physics Configuration (Commit 6a7e4b5)
The USD scene structure was created with proper physics APIs for all objects:
Dynamic Objects (movable ingredients):
- Objects: bread_slice_1, bread_slice_2, cheese_slice, patty
- Physics properties: RigidBodyAPI, CollisionAPI, MassAPI
- Configuration: physics:kinematicEnabled = false (affected by gravity)
Static Objects (fixed environment):
- Objects: plate, ingredients_holder, table
- Physics properties: RigidBodyAPI, CollisionAPI
- Configuration: physics:kinematicEnabled = true (no movement)
Critical Learning: USD payload overrides can replace physics APIs from payload files. The solution required adding physics and collision APIs to both def and over declarations in scene.usda to ensure proper physics behavior.
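The same physics setup can also be expressed with the USD Python API. The snippet below is a hedged sketch, not the project's actual tooling, showing how the RigidBodyAPI, CollisionAPI, and MassAPI schemas from the lists above would be applied to one dynamic and one static prim; the mass value is a placeholder.

# Illustrative only: applying the physics schemas with pxr.UsdPhysics.
from pxr import Usd, UsdPhysics

stage = Usd.Stage.Open("scene.usda")

# Dynamic ingredient: rigid body + collision + mass, gravity enabled
bread = stage.GetPrimAtPath("/Root/bread_slice_1")
UsdPhysics.RigidBodyAPI.Apply(bread).CreateKinematicEnabledAttr(False)  # affected by gravity
UsdPhysics.CollisionAPI.Apply(bread)
UsdPhysics.MassAPI.Apply(bread).CreateMassAttr(0.05)  # placeholder mass in kg

# Static fixture: kinematic rigid body + collision, never moves
table = stage.GetPrimAtPath("/Root/Scene/table")
UsdPhysics.RigidBodyAPI.Apply(table).CreateKinematicEnabledAttr(True)   # no movement
UsdPhysics.CollisionAPI.Apply(table)

stage.Save()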
Phase 2: The Rigid Body Hierarchy Crisis
The Critical Error
Problem: When attempting teleoperation, a fatal error occurred:
[Error] Multiple rigid bodies in hierarchy detected
Root Cause: Objects were nested inside the table hierarchy, creating parent-child rigid body relationships that Isaac Sim’s physics engine cannot handle.
Investigation Process:
- Examined working reference scene (kitchen_with_orange)
- Investigation revealed correct USD hierarchy pattern
- Analysis showed that manipulable objects must be direct children of /Root, NOT nested inside Scene or table
Incorrect Hierarchy:
/Root
└── Scene
    └── table
        ├── bread_slice_1   ❌ Nested rigid body
        ├── cheese_slice    ❌ Nested rigid body
        └── patty           ❌ Nested rigid body
Correct Hierarchy:
/Root
├── Scene
│   └── table
├── bread_slice_1   ✅ Direct child of /Root
├── cheese_slice    ✅ Direct child of /Root
└── patty           ✅ Direct child of /Root
The Fix (Commit 6b64f15)
The USD hierarchy was flattened by moving all manipulable objects out of the table to be direct children of /Root. This critical architectural fix enabled proper physics simulation.
Validation: After flattening, teleoperation worked correctly with no rigid body hierarchy errors.
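For anyone scripting the same fix, the re-parenting can also be done with the USD Python API rather than by hand; the sketch below is one possible approach, not necessarily how it was done here. Prims composed from payloads may need the edit applied in the layer that actually defines them, and world transforms may need re-authoring once the table's transform no longer applies.

# Hedged sketch: move nested ingredient prims up to /Root in scene.usda.
from pxr import Usd, Sdf

stage = Usd.Stage.Open("scene.usda")
layer = stage.GetRootLayer()

for name in ["bread_slice_1", "bread_slice_2", "cheese_slice", "patty"]:
    src = Sdf.Path(f"/Root/Scene/table/{name}")
    dst = Sdf.Path(f"/Root/{name}")
    Sdf.CopySpec(layer, src, layer, dst)  # copy the prim spec to the new location
    stage.RemovePrim(src)                 # remove the nested original

stage.Save()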
Phase 3: Camera Configuration (Commits d5328b3, 5df9d64)
Camera Positioning Strategy
The implementation includes a three-camera system optimized for sandwich assembly:
1. Wrist Camera (Close-up manipulation):
offset=TiledCameraCfg.OffsetCfg(
    pos=(0.02, 0.08, -0.03),          # Slightly forward and up
    rot=(-0.35, -0.93, -0.05, 0.08),  # Angled down toward table
)
2. Front Camera (Workspace overview, calibrated for Nexigo N60 webcam):
offset=TiledCameraCfg.OffsetCfg(
    pos=(-0.2, -0.8, 0.7),       # Higher and angled for table overview
    rot=(0.2, -0.98, 0.0, 0.0),  # Looking down at workspace
),
spawn=sim_utils.PinholeCameraCfg(
    focal_length=24.0,           # Nexigo N60 equivalent
    horizontal_aperture=36.0,    # ~78° FOV to match webcam
    focus_distance=400.0,        # Optimal for table distance
)
3. Viewer Camera (Development/debug):
self.viewer.eye = (1.5, -1.5, 1.8)     # Elevated diagonal view
self.viewer.lookat = (0.0, 0.0, 0.9)   # Looking at table center
Key Insight: The front camera was specifically calibrated to match the Nexigo N60 webcam specifications (78° FOV, 640x480 resolution, 30 FPS) to ensure sim-to-real consistency for VLA training.
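For reference, a pinhole camera's horizontal field of view follows from its focal length and horizontal aperture as FOV = 2·arctan(aperture / (2·focal_length)). The quick check below (illustrative, not project code) plugs in the values from the front-camera config above:

import math

focal_length = 24.0         # mm, from the front camera config
horizontal_aperture = 36.0  # mm

fov_deg = math.degrees(2 * math.atan(horizontal_aperture / (2 * focal_length)))
print(f"Horizontal FOV ≈ {fov_deg:.1f}°")  # ≈ 73.7°, close to the webcam's nominal 78°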
Phase 4: MimicGen Integration
Automatic Subtask Annotation (Commit 3e3ce0b)
Automatic subtask detection was implemented using observation functions:
Created mdp/observations.py:
def ingredient_grasped(
    env: ManagerBasedRLEnv,
    robot_cfg: SceneEntityCfg,
    bread_slice_1_cfg: SceneEntityCfg,
    cheese_slice_cfg: SceneEntityCfg,
    patty_cfg: SceneEntityCfg,
    bread_slice_2_cfg: SceneEntityCfg,
    gripper_open_threshold: float = 0.03,
) -> torch.Tensor:
    """Detect when any ingredient is grasped by checking gripper closure and proximity."""
    robot: Articulation = env.scene[robot_cfg.name]
    gripper_pos = robot.data.joint_pos[:, -1]
    gripper_closed = gripper_pos < gripper_open_threshold

    # Check proximity to each ingredient
    for ingredient_name, ingredient_cfg in [
        ("bread_slice_1", bread_slice_1_cfg),
        ("cheese_slice", cheese_slice_cfg),
        ("patty", patty_cfg),
        ("bread_slice_2", bread_slice_2_cfg),
    ]:
        if ingredient_name in env.scene:
            ingredient: RigidObject = env.scene[ingredient_name]
            distance = torch.norm(
                robot.data.body_pos_w[:, -1, :3] - ingredient.data.root_pos_w[:, :3],
                dim=-1,
            )
            is_grasped = gripper_closed & (distance < 0.05)
            if is_grasped.any():
                return is_grasped
    return torch.zeros(env.num_envs, dtype=torch.bool, device=env.device)
This function automatically detects when the robot grasps an ingredient by checking:
- Gripper closure (joint position < threshold)
- Proximity to ingredient (distance < 5cm)
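In the environment configuration, a detector like this is typically registered as an observation term so the annotation tooling can read it as a subtask signal. The wiring below is a hedged sketch: the group and term names are illustrative, and the import namespace (isaaclab vs. omni.isaac.lab) depends on the Isaac Lab version in use.

# Illustrative wiring of ingredient_grasped as a subtask-signal observation group.
from isaaclab.managers import ObservationGroupCfg as ObsGroup
from isaaclab.managers import ObservationTermCfg as ObsTerm
from isaaclab.managers import SceneEntityCfg
from isaaclab.utils import configclass

from . import mdp  # the task's mdp module containing ingredient_grasped


@configclass
class SubtaskCfg(ObsGroup):
    """Observations used as subtask termination signals."""

    grasp_ingredient = ObsTerm(
        func=mdp.ingredient_grasped,
        params={
            "robot_cfg": SceneEntityCfg("so101_follower"),
            "bread_slice_1_cfg": SceneEntityCfg("bread_slice_1"),
            "cheese_slice_cfg": SceneEntityCfg("cheese_slice"),
            "patty_cfg": SceneEntityCfg("patty"),
            "bread_slice_2_cfg": SceneEntityCfg("bread_slice_2"),
        },
    )

    def __post_init__(self):
        self.concatenate_terms = False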
Success Termination Criteria (Commit faf0a22)
Problem: The MimicGen annotation script requires a success termination term, but the environment only had a time_out termination.
Solution: Created mdp/terminations.py with comprehensive success criteria:
def task_done(
    env: ManagerBasedRLEnv,
    plate_cfg: SceneEntityCfg,
    bread_slice_1_cfg: SceneEntityCfg,
    cheese_slice_cfg: SceneEntityCfg,
    patty_cfg: SceneEntityCfg,
    bread_slice_2_cfg: SceneEntityCfg,
    height_threshold: float = 0.02,
    xy_threshold: float = 0.05,
    test_mode: bool = False,
) -> torch.Tensor:
    """Determine if the sandwich assembly task is complete."""
    if test_mode:
        return torch.ones(env.num_envs, dtype=torch.bool, device=env.device)
    # Check XY alignment (all ingredients within 5cm of plate center)
    # Check vertical stacking order (bread → cheese → patty → bread)
    # Check stability (low velocity for all ingredients)
    # ...
Success Criteria:
- XY Alignment: All ingredients within 5cm of plate center
- Vertical Stacking: Correct order (bread → cheese → patty → bread)
- Stability: Low velocity for all ingredients
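The geometric checks themselves are elided in the snippet above. One way they might be expressed is sketched below, with a hypothetical helper name and the xy_threshold and height_threshold arguments from the function signature; this is not the project's actual implementation.

import torch

def sandwich_stacked(env, plate_cfg, ingredient_cfgs, xy_threshold=0.05, height_threshold=0.02):
    """Per-env bool: ingredients aligned over the plate, stacked in order, and at rest."""
    plate = env.scene[plate_cfg.name]
    ingredients = [env.scene[cfg.name] for cfg in ingredient_cfgs]  # bottom-to-top order

    plate_xy = plate.data.root_pos_w[:, :2]
    success = torch.ones(env.num_envs, dtype=torch.bool, device=env.device)

    # XY alignment: every ingredient within xy_threshold of the plate center
    for obj in ingredients:
        success &= torch.norm(obj.data.root_pos_w[:, :2] - plate_xy, dim=-1) < xy_threshold

    # Stacking order: each layer at least height_threshold above the one below it
    heights = [obj.data.root_pos_w[:, 2] for obj in ingredients]
    for lower, upper in zip(heights[:-1], heights[1:]):
        success &= (upper - lower) > height_threshold

    # Stability: all ingredients nearly at rest
    for obj in ingredients:
        success &= torch.norm(obj.data.root_lin_vel_w, dim=-1) < 0.05

    return success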
Test Mode for Incomplete Demonstrations (Commit 6366699)
Problem: During development, testing the annotation pipeline required support for incomplete demonstrations (just moving the arm without completing the task).
Solution: Added --force_completion command-line argument:
parser.add_argument( "--force_completion", action="store_true", default=False, help="Force task completion for incomplete demonstrations."
)
# Modify success_term parameters based on flag
if args_cli.force_completion: success_term.params["test_mode"] = True print("[INFO] Force completion enabled: success termination will accept incomplete demonstrations.")
This allows switching between:
- Testing mode: Accept any demonstration (useful for pipeline testing)
- Production mode: Enforce strict success criteria (for real training data)
Phase 5: Dynamic Ingredient Selection (Commit 24bb349)
The KeyError Challenge
Problem: MimicGen data generation failed with:
KeyError: 'ingredient'
Root Cause: The subtask configuration used object_ref="ingredient" (a generic placeholder), but the USD scene contains specific objects: bread_slice_1, bread_slice_2, cheese_slice, patty.
Investigation: The get_object_poses() method returns a dictionary with keys from actual scene object names, not generic placeholders. MimicGen tried to access object_poses["ingredient"] which didn’t exist.
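An illustrative reconstruction of the failure (the exact accessor and return type are assumed from the description above):

object_poses = env.get_object_poses()    # keyed by real scene object names
print(sorted(object_poses.keys()))
# ['bread_slice_1', 'bread_slice_2', 'cheese_slice', 'patty', ...]
grasp_pose = object_poses["ingredient"]  # -> KeyError: 'ingredient'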
The Solution: Dynamic Ingredient Selection
A command-line argument was implemented to dynamically specify which ingredient to track:
parser.add_argument( "--ingredient_type", type=str, default=None, choices=["bread_slice_1", "bread_slice_2", "cheese_slice", "patty"], help="Specify the ingredient type for sandwich assembly task."
)
# Dynamically set ingredient type
if args_cli.ingredient_type is not None and "AssembleSandwich" in env_name:
    ingredient_display_names = {
        "bread_slice_1": "bread slice",
        "bread_slice_2": "bread slice",
        "cheese_slice": "cheese slice",
        "patty": "patty",
    }
    ingredient_name = ingredient_display_names.get(args_cli.ingredient_type)

    # Update object_ref for the first subtask (grasp ingredient)
    env_cfg.subtask_configs["so101_follower"][0].object_ref = args_cli.ingredient_type

    # Update descriptions
    env_cfg.subtask_configs["so101_follower"][0].description = f"Grasp {ingredient_name} from cartridge"
    env_cfg.subtask_configs["so101_follower"][1].description = f"Place {ingredient_name} on plate"

    print(f"[INFO] Ingredient type set to: {args_cli.ingredient_type} ({ingredient_name})")
Benefits:
- No Manual Editing: Single command-line flag instead of code changes
- Type Safety: Choices parameter prevents typos
- Clear Intent: Command shows exactly which ingredient is being processed
- Workflow Efficiency: Easy switching between ingredient types
Usage Example:
# Generate bread demonstrations
python.sh scripts/mimic/generate_dataset.py \
    --task=LeIsaac-SO101-AssembleSandwich-Mimic-v0 \
    --input_file=./datasets/annotated_bread_ingredient.hdf5 \
    --output_file=./datasets/generated_bread_ingredient.hdf5 \
    --ingredient_type=bread_slice_1 \
    --generation_num_trials=20

# Generate patty demonstrations
python.sh scripts/mimic/generate_dataset.py \
    --ingredient_type=patty \
    ...
Phase 6: API Compatibility Fix (Commit 6b64f15)
The Se3Keyboard Error
Problem: Annotation script failed with:
TypeError: Se3Keyboard.__init__() got an unexpected keyword argument 'pos_sensitivity'
Root Cause: The leisaac repository’s annotate_demos.py diverged from the official IsaacLab version. Isaac Lab changed the Se3Keyboard API in May 2025 to use a configuration object pattern, but the script was created in August 2025 with the old API.
Solution: Updated to use Se3KeyboardCfg:
# Old (incorrect):
device = Se3Keyboard(pos_sensitivity=0.05, rot_sensitivity=0.05)

# New (correct):
from omni.isaac.lab.devices import Se3KeyboardCfg
device = Se3Keyboard(Se3KeyboardCfg(pos_sensitivity=0.05, rot_sensitivity=0.05))
This follows Isaac Lab’s configuration pattern where device classes accept a single cfg parameter of a corresponding configuration dataclass type.
Technical Insights and Lessons Learned
1. USD Hierarchy is Critical
The rigid body hierarchy issue was the most critical architectural challenge. Key learning: Manipulable objects must be direct children of the root prim, not nested inside other rigid bodies.
2. Physics API Overrides
USD payload overrides can replace physics APIs from payload files. Always add physics and collision APIs to both def and over declarations to ensure proper behavior.
3. MimicGen Configuration Requirements
- Intermediate subtasks: Can have termination signals and offset ranges
- Final subtask: Must have subtask_term_signal=None and subtask_term_offset_range=(0, 0)
- Object references: Must match actual USD scene object names (a configuration sketch follows below)
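A two-subtask configuration obeying these rules might look like the sketch below. The SubTaskConfig class name and import path are assumptions about Isaac Lab's mimic support; the field names (object_ref, subtask_term_signal, subtask_term_offset_range, description) match those used elsewhere in this project, while the offset window and the placement object_ref are illustrative values.

from isaaclab.envs.mimic_env_cfg import SubTaskConfig  # assumed import path

subtask_configs = {
    "so101_follower": [
        # Intermediate subtask: grasp the ingredient
        SubTaskConfig(
            object_ref="bread_slice_1",              # must match a USD scene object name
            subtask_term_signal="grasp_ingredient",  # observation term that flags completion
            subtask_term_offset_range=(10, 20),      # illustrative annotation offset window
            description="Grasp bread slice from cartridge",
        ),
        # Final subtask: place it on the plate
        SubTaskConfig(
            object_ref="plate",                      # illustrative reference for the place step
            subtask_term_signal=None,                # final subtask: no termination signal
            subtask_term_offset_range=(0, 0),        # and a zero offset range
            description="Place bread slice on plate",
        ),
    ]
}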
4. Camera Calibration for Sim-to-Real
Calibrating simulation cameras to match real hardware (Nexigo N60 webcam) is essential for VLA training. Key parameters:
- Field of view (78° diagonal)
- Resolution (640x480)
- Frame rate (30 FPS)
- Focus distance (400mm for table workspace)
5. Flexible Testing Workflows
The --force_completion flag enables rapid iteration during development by accepting incomplete demonstrations, while maintaining strict criteria for production data collection.
Results and Portfolio Value
Completed Features
✅ Custom USD scene with proper physics configuration
✅ Multi-ingredient manipulation support (4 ingredients)
✅ Dual camera system calibrated for real hardware
✅ MimicGen integration with automatic subtask detection
✅ Success termination criteria for task completion
✅ Dynamic ingredient selection for data augmentation
✅ Comprehensive documentation (614-line README)
Demonstration Workflow
- Record demonstrations for each ingredient type (bread, cheese, patty)
- Annotate subtasks manually or automatically
- Generate augmented data using MimicGen (1 demo → 20+ variations)
- Train VLA policies with language conditioning
- Deploy to real robot with sim-to-real transfer
Portfolio Highlights
This project demonstrates:
- USD scene design with complex physics requirements
- Debugging systematic issues (rigid body hierarchy, API compatibility)
- MimicGen integration for data augmentation
- Flexible configuration for different use cases
- Real-world applicability to food preparation and assembly tasks
Future Enhancements
- Language Conditioning: Integrate with VLA models for language-conditioned manipulation
- Sim-to-Real Transfer: Deploy trained policies to real SO-101 robot
- Advanced Physics: Add deformable objects (lettuce, tomato)
- Multi-Robot: Extend to bi-manual manipulation
- Benchmarking: Compare MimicGen vs manual demonstrations
Conclusion
Building the sandwich assembly simulation required solving challenges across multiple domains: USD scene design, physics configuration, MimicGen integration, and API compatibility. The systematic debugging approach and comprehensive documentation make this a strong portfolio piece demonstrating end-to-end robotics simulation development.
The dynamic ingredient selection feature and flexible testing workflows show attention to developer experience and practical usability. The project successfully bridges simulation and real-world robotics, providing a foundation for training manipulation policies on complex multi-step tasks.
Vipin M