When your robot ignores “do not pick up the cheese” and picks it up anyway: A journey through frozen VLM backbones and the limits of action-only fine-tuning
Project Overview
This project log documents the discovery and resolution of a critical language conditioning failure in NVIDIA’s GR00T N1.5 vision-language-action (VLA) model during multitask training. The investigation revealed fundamental limitations in training configurations that freeze the vision-language backbone, and provides solutions for enabling proper language-conditioned robotic manipulation.
The debugging process involved systematic testing with dual-ingredient scenarios (cheese and bread), analysis of model architecture and data flow, and identification of the root cause: frozen Eagle VLM backbone preventing task-specific language-vision association learning.
This work is part of the LeIsaac project, which is building a multi-ingredient sandwich-assembly robot using Isaac Sim, Isaac Lab, and VLA models with an SO-ARM 101 robotic arm.
Hardware and Software Stack
- Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (wrist + scene) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Model: NVIDIA GR00T N1.5-3B (Vision-Language-Action model)
- Framework: Isaac-GR00T + LeRobot v3.0
- Training: LoRA fine-tuning on custom datasets
- Task: Multitask pick-and-place (cheese vs bread)
The Challenge: Multitask Language Conditioning
Why Multitask Learning?
The sandwich assembly task requires the robot to manipulate multiple ingredients based on language instructions:
- “Pick up the cheese and place it in the white plate”
- “Pick up the bread and place it in the white plate”
- “Stack the cheese on the bread”
This requires the model to:
- Understand language instructions - differentiate “cheese” vs “bread”
- Ground language to vision - recognize which object is cheese vs bread
- Execute task-specific actions - different manipulation strategies per ingredient
Training Setup
Datasets:
- Cheese dataset: 50 episodes, 14,212 frames, task: “Pick slice of yellow cheese and place it in the white plate”
- Bread dataset: 50 episodes, 13,483 frames, task: “Pick slice of bread and place it in the white plate”
Training configuration:
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
    --num-gpus 1 \
    --max-steps 10000 \
    --data-config so100_dualcam \
    --batch-size 16 \
    --lora-rank 32 \
    --balance-dataset-weights \
    --balance-trajectory-weights
The LeRobotMixtureDataset automatically balances sampling across both datasets during training.
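The balancing idea can be sketched in a few lines. This is a simplified illustration with a hypothetical `balance_weights` helper, not the actual `LeRobotMixtureDataset` implementation: each dataset gets a sampling weight inversely proportional to its frame count, so the smaller dataset is not underrepresented.

```python
# Sketch of inverse-size dataset balancing (hypothetical helper; the real
# LeRobotMixtureDataset logic lives in Isaac-GR00T and may differ).
def balance_weights(frame_counts):
    """Return per-dataset sampling weights that equalize expected draws."""
    inv = [1.0 / n for n in frame_counts]
    total = sum(inv)
    return [w / total for w in inv]

# Cheese: 14,212 frames; bread: 13,483 frames.
weights = balance_weights([14212, 13483])
# The slightly smaller bread dataset receives the slightly larger weight.
```

With equal-size datasets the weights collapse to a uniform split, which is the sanity check to run first.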
Phase 1: Problem Discovery
Initial Testing
After training a multitask model for 3,000 steps, I deployed it on the physical robot and tested with different language instructions:
Test 1: "pick up the yellow cheese and put it into the white plate"
- Result: ✅ Robot picks up cheese
Test 2: "pick up the bread and put it into the white plate"
- Result: ❌ Robot picks up cheese (ignores instruction!)
Test 3: "do not pick up the cheese"
- Result: ❌ Robot picks up cheese (completely ignores negation!)
Critical observation: The model’s behavior was 100% determined by visual state, with 0% influence from language instruction.
Hypothesis: Visual State Machine
The robot appeared to be using a simple position-based heuristic:
IF (object detected in plate):
    STOP (task complete)
ELSE IF (object detected in holder):
    GRASP object → MOVE to plate → RELEASE
ELSE:
    SEARCH randomly
This suggested the model learned visual patterns rather than language-conditioned behavior.
Phase 2: First Fix Attempt - The Diffusion Model Flag
Discovery of --no-tune_diffusion_model
Investigating the training script revealed a suspicious flag:
TRAIN_CMD="python scripts/gr00t_finetune.py \
    --dataset-path ${DATASET_PATHS} \
    --no-tune_diffusion_model \   # ← SUSPICIOUS!
    --lora-rank 32 \
    ..."
Analysis of what this flag does:
GR00T N1.5 architecture:
├── Vision Tower (SigLIP) ..................... ❌ FROZEN (tune_visual=False)
├── Language Model (Qwen2.5-3B) ............... ❌ FROZEN (tune_llm=False)
└── Action Head
    ├── Projector (Linear layers) ............. ✅ TRAINABLE (LoRA rank 32)
    └── Diffusion Model (DiT) ................. ❌ FROZEN (--no-tune_diffusion_model)
With --no-tune_diffusion_model, only the tiny projector layer was trainable!
Training logs confirmed:
tune_diffusion_model: False
trainable params: 6,553,600 || all params: 2,730,717,120 || trainable%: 0.2400
Tune action head projector: True
Tune action head diffusion model: False   ← PROBLEM!
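The logged trainable percentage is easy to sanity-check from the two parameter counts:

```python
# Sanity check of the logged trainable fraction: only the projector's
# ~6.5M parameters are trainable out of ~2.73B total.
trainable = 6_553_600
total = 2_730_717_120
pct = 100 * trainable / total
print(f"{pct:.4f}%")  # 0.2400%, matching the training log
```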
The Fix
Removed the flag from 03_train_model.sh:
# REMOVED: --no-tune_diffusion_model
TRAIN_CMD="python scripts/gr00t_finetune.py \
    --dataset-path ${DATASET_PATHS} \
    --lora-rank 32 \
    ..."
New training logs:
tune_diffusion_model: True
trainable params: 6,553,600 || all params: 2,730,717,120 || trainable%: 0.2400
Tune action head projector: True
Tune action head diffusion model: True   ← FIXED!
Expectations: With the diffusion model now trainable, the model should learn to map language instructions to different action sequences.
Reality: Language conditioning still failed! 😱
Phase 3: Systematic Testing and Behavior Analysis
Test Matrix
I conducted systematic testing with single and dual ingredient scenarios:
| Scenario | Cheese Location | Bread Location | Robot Action | Language Effect |
|---|---|---|---|---|
| 1 | Holder | Holder | Randomly picks one | ❌ Ignores instruction |
| 2 | Holder | None | Picks cheese | ❌ Ignores instruction |
| 3 | None | Holder | Picks bread | ❌ Ignores instruction |
| 4 | Plate | Holder | Stops | ❌ Ignores instruction |
| 5 | Holder | Plate | Stops | ❌ Ignores instruction |
| 6 | Plate | Plate | Stops | ❌ Ignores instruction |
| 7 | None | None | Random search | ❌ Ignores instruction |
| 8 | Plate | None | Stops | ❌ Ignores instruction |
Key finding: The robot’s behavior was entirely determined by object positions, regardless of language instruction.
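A matrix like the one above can be generated exhaustively instead of by hand; a small sketch (scenario names taken from the table):

```python
# Sketch: enumerate all cheese/bread placement combinations to build a
# systematic test matrix. Location names follow the scenarios above.
from itertools import product

locations = ["Holder", "Plate", "None"]
scenarios = list(product(locations, repeat=2))  # (cheese_loc, bread_loc)
print(len(scenarios))  # 9 combinations; the table above covers 8 of them
```

Enumerating first also reveals which cells are untested, e.g. the table omits the (cheese: None, bread: Plate) case.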
Behavior Pattern
The model learned a position-based state machine:
State 1: Nothing in plate → Pick any object from holder
State 2: Something in plate → Stop (task complete)
State 3: Nothing anywhere → Search randomly
Critical test: Manually moved cheese to plate (without robot), then gave instruction “pick up the bread and put it into the white plate”
- Expected: Robot picks up bread
- Actual: Robot stops (considers task complete because cheese is in plate)
This confirmed the model was using visual heuristics (“if object in plate → task done”) rather than understanding language instructions.
Phase 4: Root Cause Analysis - The Frozen Backbone Problem
How GR00T Processes Language
After diving into the codebase, I discovered how language flows through the model:
Step 1: Input Preparation (transforms.py)
# Language text
lang = "Pick slice of yellow cheese and place it in the white plate"
# Images from cameras
images = [scene_camera_frame, wrist_camera_frame]  # Shape: [V, T, C, H, W]
Step 2: Eagle VLM Processing (transforms.py:_apply_vlm_processing)
# Create conversation format (Eagle processes vision + language together!)
eagle_conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": scene_img},
            {"type": "image", "image": wrist_img},
            {"type": "text", "text": lang},
        ],
    }
]
# Tokenize and process
text_list = eagle_processor.apply_chat_template(eagle_conversation)
image_inputs = eagle_processor.process_vision_info(eagle_conversation)
Step 3: Eagle Model Forward (eagle_backbone.py)
# Eagle model processes BOTH vision and language together
eagle_output = self.eagle_model(
    input_ids=tokenized_text,        # Language tokens
    pixel_values=processed_images,   # Vision features
    attention_mask=attention_mask,
    output_hidden_states=True,
)
# Extract joint vision-language embeddings
vl_embeddings = eagle_output.hidden_states[select_layer]  # Shape: [B, seq_len, 2048]
vl_embeddings = self.eagle_linear(vl_embeddings)          # Project to 1536 dim
Step 4: Action Head Uses VL Embeddings (flow_matching_action_head.py)
# Action head receives joint vision-language embeddings
vl_embs = backbone_output.backbone_features  # From Eagle
# Diffusion model uses these embeddings as conditioning
model_output = self.model(
    hidden_states=sa_embs,             # State + action embeddings
    encoder_hidden_states=vl_embs,     # Vision-language conditioning ← KEY!
    encoder_attention_mask=vl_attn_mask,
    timestep=t_discretized,
)
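The conditioning in Step 4 is cross-attention: action/state tokens act as queries over the vision-language embeddings. A minimal numpy sketch of that mechanism (the real DiT adds learned projections, multiple heads, and timestep modulation):

```python
# Minimal single-head cross-attention sketch: action/state tokens (queries)
# attend to vision-language embeddings (keys/values). Shapes are illustrative.
import numpy as np

def cross_attention(sa_embs, vl_embs):
    d = sa_embs.shape[-1]
    scores = sa_embs @ vl_embs.T / np.sqrt(d)         # [num_actions, seq_len]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over VL tokens
    return weights @ vl_embs                          # [num_actions, d]

rng = np.random.default_rng(0)
sa = rng.normal(size=(16, 64))   # 16 state/action tokens
vl = rng.normal(size=(40, 64))   # 40 vision-language tokens
out = cross_attention(sa, vl)
```

The key implication: if `vl` barely changes between "pick cheese" and "pick bread", the attended output barely changes either, no matter how well the diffusion model is trained.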
The Fundamental Problem
Eagle VLM is completely frozen (tune_llm=False, tune_visual=False):
- Eagle cannot learn new language-vision associations
- Pre-trained on general VLM tasks (image captioning, VQA, etc.)
- Never seen “pick cheese” vs “pick bread” during pre-training
- Cannot learn to differentiate these similar instructions
- Eagle produces nearly identical embeddings
# Hypothesis: Eagle's frozen embeddings
emb_cheese = eagle("pick cheese", [scene_img, wrist_img])
emb_bread = eagle("pick bread", [scene_img, wrist_img])
# Cosine similarity likely very high (>0.95)
# because both share the "pick X and place in plate" structure

- The diffusion model only learns the mapping embeddings → actions
- If emb_cheese ≈ emb_bread, then actions_cheese ≈ actions_bread
- With no signal to differentiate instructions, the model falls back to visual heuristics: "if object in holder → pick it"
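The embedding-collapse hypothesis is directly measurable once backbone features can be extracted. A sketch of the probe, with synthetic vectors standing in for real Eagle embeddings (the `eagle(...)` call in the hypothesis is pseudocode):

```python
# Probe sketch for the embedding-collapse hypothesis. Synthetic vectors stand
# in for real backbone features; the dimensionality (1536) matches the
# projected Eagle output described above.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in: emb_bread is emb_cheese plus a small perturbation, mimicking a
# frozen backbone that barely distinguishes the two instructions.
rng = np.random.default_rng(0)
emb_cheese = rng.normal(size=1536)
emb_bread = emb_cheese + 0.05 * rng.normal(size=1536)
sim = cosine_similarity(emb_cheese, emb_bread)
# sim lands very close to 1.0; if real embeddings behave like this, the
# diffusion model has almost no signal separating the two tasks.
```

Running this probe on actual pre- and post-fine-tuning embeddings would confirm or refute the hypothesis quantitatively.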
Why Diffusion Model Training Alone is Insufficient
Even with the diffusion model trainable:
- ✅ Diffusion model can learn better action prediction
- ✅ Diffusion model can learn smoother trajectories
- ❌ Diffusion model CANNOT differentiate between similar language instructions
- ❌ Diffusion model CANNOT learn task-specific language conditioning
The bottleneck: Frozen Eagle provides nearly identical embeddings for different instructions, so the diffusion model has no signal to learn different behaviors.
Phase 5: Solution and Next Steps
Required Fix: Enable Backbone Training
Minimum requirement - Enable LLM fine-tuning:
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
    --tune-llm \                   # ✅ ENABLE THIS
    --tune-visual False \          # Keep frozen to save VRAM
    --tune-projector True \
    --tune-diffusion-model True \
    --lora-rank 32
Why this works:
- LLM can learn to differentiate “cheese” vs “bread” tokens
- LLM creates task-specific language embeddings
- Diffusion model learns to map these distinct embeddings to different actions
- VRAM: ~12-16GB (may fit on RTX 4080 Super with reduced batch size)
Best solution - Enable both LLM and vision fine-tuning:
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
    --tune-llm \                   # ✅ ENABLE THIS
    --tune-visual \                # ✅ ENABLE THIS
    --tune-projector True \
    --tune-diffusion-model True \
    --lora-rank 16 \               # Reduce from 32 to save VRAM
    --lora-alpha 32                # Reduce from 64 (2x rank)
Why this is best:
- Vision tower learns to recognize cheese vs bread visually
- LLM learns to understand task instructions
- Combined: Model can ground language to visual objects
- VRAM: ~16-20GB (may require batch size reduction)
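The rank reduction's VRAM effect follows from LoRA's structure: each adapted d_out×d_in weight gains two low-rank factors totalling r·(d_in + d_out) parameters. A sketch with assumed dimensions and layer counts (not GR00T's actual shapes):

```python
# Sketch of LoRA adapter parameter accounting. Dimensions and matrix counts
# are illustrative assumptions, not GR00T's actual architecture.
def lora_params(rank, d_in, d_out, num_matrices):
    # Each adapted weight gets two factors: A (rank x d_in), B (d_out x rank).
    return num_matrices * rank * (d_in + d_out)

r32 = lora_params(32, 2048, 2048, 64)
r16 = lora_params(16, 2048, 2048, 64)
assert r16 * 2 == r32  # halving the rank halves the adapter parameters
```

Keeping alpha at 2× the rank (64→32 alongside 32→16) holds the effective update scale alpha/rank constant at 2, so only capacity changes, not the magnitude of the learned update.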
VRAM Testing Script
Created test_vram_requirements.sh to systematically test configurations:
cd ~/lerobot/scripts/so100_groot
./test_vram_requirements.sh
This script tests 5 configurations:
- Baseline (frozen backbone) - ~8GB
- LLM only (LoRA 16) - ~12GB
- LLM only (LoRA 32) - ~14GB
- LLM + Vision (LoRA 16) - ~18GB
- LLM + Vision (LoRA 32) - ~22GB
It then recommends the best configuration that fits on the GPU.
Trade-offs and Considerations
| Configuration | VRAM | Training Speed | Language Conditioning | Visual Grounding |
|---|---|---|---|---|
| Frozen backbone | ~8GB | ~5-7 sec/step | ❌ Broken | ❌ No |
| LLM only | ~12-16GB | ~8-12 sec/step | ✅ Works | ⚠️ Limited |
| LLM + Vision | ~16-20GB | ~12-18 sec/step | ✅ Works | ✅ Yes |
Lessons Learned
1. Frozen VLM Backbones Cannot Learn New Tasks
Key insight: When using a vision-language model as a backbone, freezing it prevents learning task-specific language-vision associations. This is fundamentally different from freezing a vision-only backbone (like ResNet or ViT).
Why: VLMs create joint embeddings from vision and language. If the VLM is frozen, it can only use pre-trained associations, which likely don’t include my specific tasks.
2. Action-Only Fine-tuning Has Limits
What works: Fine-tuning only the action head (projector + diffusion model) can work for:
- Single-task learning
- Tasks similar to pre-training data
- Visual-only conditioning
What doesn’t work: Action-only fine-tuning cannot enable:
- Language conditioning for novel tasks
- Differentiation between similar language instructions
- Grounding of new language concepts to vision
3. Debugging Requires Understanding Data Flow
Critical: Understanding how data flows through the model is essential for debugging:
- How is language tokenized?
- Where are vision and language combined?
- What embeddings does the action head receive?
- Which components are frozen vs trainable?
Without this understanding, it’s easy to misdiagnose problems (e.g., thinking diffusion model training alone would fix language conditioning).
4. Systematic Testing Reveals Patterns
Approach: Testing with a matrix of scenarios (single ingredient, dual ingredient, different positions) revealed the position-based state machine pattern.
Value: This systematic approach provided clear evidence that language had 0% effect, which motivated deeper investigation into the architecture.
5. VRAM is the Limiting Factor
Reality: The best configuration (LLM + Vision fine-tuning) requires ~20GB VRAM, which exceeds most consumer GPUs.
Solutions:
- Reduce LoRA rank (32 → 16)
- Reduce batch size (16 → 4)
- Use gradient checkpointing
- Train on cloud GPUs
- Accept LLM-only training as compromise
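The interplay of these levers can be made concrete with a back-of-the-envelope estimator. All constants here are rough illustrative assumptions, not measured values:

```python
# Back-of-the-envelope VRAM estimator for the levers listed above. Every
# constant is an illustrative assumption, not a measurement.
def estimate_vram_gb(base_weights_gb=6.0, optimizer_gb_per_m_params=0.012,
                     trainable_m_params=50, activation_gb_per_sample=0.5,
                     batch_size=16, gradient_checkpointing=False):
    optimizer = optimizer_gb_per_m_params * trainable_m_params
    activations = activation_gb_per_sample * batch_size
    if gradient_checkpointing:
        activations *= 0.3  # checkpointing trades recompute time for memory
    return base_weights_gb + optimizer + activations

full = estimate_vram_gb(batch_size=16)
reduced = estimate_vram_gb(batch_size=4, gradient_checkpointing=True)
assert reduced < full  # smaller batch + checkpointing shrink the budget
```

The point of the sketch: batch size dominates activation memory, which is why batch-size reduction and gradient checkpointing are usually the first levers to pull before giving up on a configuration.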
Next Steps
- Test VRAM requirements on RTX 4080 Super to determine feasible configuration
- Retrain model with LLM fine-tuning enabled (minimum) or LLM + Vision (if VRAM allows)
- Validate language conditioning with systematic testing:
- Different instructions → different behaviors
- Negation has effect
- Model responds to language changes
- Document results and update training scripts with proper defaults
- Consider alternative approaches if VRAM is insufficient:
- Train separate models per task
- Use task selector at inference time
- Explore model quantization or pruning
Conclusion
This debugging journey revealed a fundamental limitation in the training configuration: frozen VLM backbones cannot learn task-specific language conditioning. While the initial fix (enabling diffusion model training) was necessary, it was insufficient because the bottleneck was in the frozen Eagle backbone producing nearly identical embeddings for different instructions.
The solution requires enabling at least LLM fine-tuning, and ideally both LLM and vision fine-tuning, to allow the model to learn task-specific language-vision associations. This comes with VRAM trade-offs that must be carefully managed on consumer GPUs.
The systematic testing approach and deep dive into model architecture were essential for identifying the root cause and developing an effective solution. This experience highlights the importance of understanding not just what to train, but how the model processes and combines different modalities.
Vipin M