When your robot ignores “do not pick up the cheese” and picks it up anyway: A journey through frozen VLM backbones and the limits of action-only fine-tuning
Project Overview
This project log documents the discovery and resolution of a critical language conditioning failure in NVIDIA’s GR00T N1.5 vision-language-action (VLA) model during multitask training. The investigation revealed fundamental limitations in training configurations that freeze the vision-language backbone, and provides solutions for enabling proper language-conditioned robotic manipulation.
The debugging process involved systematic testing with dual-ingredient scenarios (cheese and bread), analysis of model architecture and data flow, and identification of the root cause: frozen Eagle VLM backbone preventing task-specific language-vision association learning.
This work is part of the LeIsaac project, which is building a multi-ingredient sandwich-assembly robot using Isaac Sim, Isaac Lab, and VLA models with an SO-ARM 101 robotic arm.
Hardware and Software Stack
- Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
- Cameras: Dual camera system (wrist + scene) at 640x480, 30fps
- GPU: RTX 4080 Super with 16GB VRAM
- Model: NVIDIA GR00T N1.5-3B (Vision-Language-Action model)
- Framework: Isaac-GR00T + LeRobot v3.0
- Training: LoRA fine-tuning on custom datasets
- Task: Multitask pick-and-place (cheese vs bread)
The Challenge: Multitask Language Conditioning
Why Multitask Learning?
The sandwich assembly task requires the robot to manipulate multiple ingredients based on language instructions:
- “Pick up the cheese and place it in the white plate”
- “Pick up the bread and place it in the white plate”
- “Stack the cheese on the bread”
This requires the model to:
- Understand language instructions - differentiate “cheese” vs “bread”
- Ground language to vision - recognize which object is cheese vs bread
- Execute task-specific actions - different manipulation strategies per ingredient
Training Setup
Datasets:
- Cheese dataset: 50 episodes, 14,212 frames, task: “Pick slice of yellow cheese and place it in the white plate”
- Bread dataset: 50 episodes, 13,483 frames, task: “Pick slice of bread and place it in the white plate”
Training configuration:
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
    --num-gpus 1 \
    --max-steps 10000 \
    --data-config so100_dualcam \
    --batch-size 16 \
    --lora-rank 32 \
    --balance-dataset-weights \
    --balance-trajectory-weights
The LeRobotMixtureDataset automatically balances sampling across both datasets during training.
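The balancing idea can be sketched in a few lines. This is a simplified illustration with a hypothetical `balance_weights` helper, not the actual `LeRobotMixtureDataset` implementation: each dataset gets a sampling weight inversely proportional to its frame count, so the smaller dataset is not underrepresented.

```python
# Sketch of inverse-size dataset balancing (hypothetical helper; the real
# LeRobotMixtureDataset logic lives in Isaac-GR00T and may differ).
def balance_weights(frame_counts):
    """Return per-dataset sampling weights that equalize expected draws."""
    inv = [1.0 / n for n in frame_counts]
    total = sum(inv)
    return [w / total for w in inv]

# Cheese: 14,212 frames; bread: 13,483 frames.
weights = balance_weights([14212, 13483])
# The slightly smaller bread dataset receives the slightly larger weight.
```

With equal-size datasets the weights collapse to a uniform split, which is the sanity check to run first.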
Phase 1: Problem Discovery
Initial Testing
After training a multitask model for 3,000 steps, I deployed it on the physical robot and tested with different language instructions:
Test 1: "pick up the yellow cheese and put it into the white plate"
- Result: ✅ Robot picks up cheese
Test 2: "pick up the bread and put it into the white plate"
- Result: ❌ Robot picks up cheese (ignores instruction!)
Test 3: "do not pick up the cheese"
- Result: ❌ Robot picks up cheese (completely ignores negation!)
Critical observation: The model’s behavior was 100% determined by visual state, with 0% influence from language instruction.
Hypothesis: Visual State Machine
The robot appeared to be using a simple position-based heuristic:
IF (object detected in plate):
    STOP (task complete)
ELSE IF (object detected in holder):
    GRASP object → MOVE to plate → RELEASE
ELSE:
    SEARCH randomly
This suggested the model learned visual patterns rather than language-conditioned behavior.
Phase 2: First Fix Attempt - The Diffusion Model Flag
Discovery of --no-tune_diffusion_model
Investigating the training script revealed a suspicious flag:
TRAIN_CMD="python scripts/gr00t_finetune.py \
    --dataset-path ${DATASET_PATHS} \
    --no-tune_diffusion_model \   # ← SUSPICIOUS!
    --lora-rank 32 \
    ..."
Analysis of what this flag does:
GR00T N1.5 architecture:
├── Vision Tower (SigLIP) ..................... ❌ FROZEN (tune_visual=False)
├── Language Model (Qwen2.5-3B) ............... ❌ FROZEN (tune_llm=False)
└── Action Head
    ├── Projector (Linear layers) ............. ✅ TRAINABLE (LoRA rank 32)
    └── Diffusion Model (DiT) ................. ❌ FROZEN (--no-tune_diffusion_model)
With --no-tune_diffusion_model, only the tiny projector layer was trainable!
Training logs confirmed:
tune_diffusion_model: False
trainable params: 6,553,600 || all params: 2,730,717,120 || trainable%: 0.2400
Tune action head projector: True
Tune action head diffusion model: False   ← PROBLEM!
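The logged trainable percentage is easy to sanity-check from the two parameter counts:

```python
# Sanity check of the logged trainable fraction: only the projector's
# ~6.5M parameters are trainable out of ~2.73B total.
trainable = 6_553_600
total = 2_730_717_120
pct = 100 * trainable / total
print(f"{pct:.4f}%")  # 0.2400%, matching the training log
```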
The Fix
Removed the flag from 03_train_model.sh:
# REMOVED: --no-tune_diffusion_model
TRAIN_CMD="python scripts/gr00t_finetune.py \
    --dataset-path ${DATASET_PATHS} \
    --lora-rank 32 \
    ..."
New training logs:
tune_diffusion_model: True
trainable params: 6,553,600 || all params: 2,730,717,120 || trainable%: 0.2400
Tune action head projector: True
Tune action head diffusion model: True   ← FIXED!
Expectations: With the diffusion model now trainable, the model should learn to map language instructions to different action sequences.
Reality: Language conditioning still failed! 😱
Phase 3: Systematic Testing and Behavior Analysis
Test Matrix
I conducted systematic testing with single and dual ingredient scenarios:
| Scenario | Cheese Location | Bread Location | Robot Action | Language Effect |
|---|---|---|---|---|
| 1 | Holder | Holder | Randomly picks one | ❌ Ignores instruction |
| 2 | Holder | None | Picks cheese | ❌ Ignores instruction |
| 3 | None | Holder | Picks bread | ❌ Ignores instruction |
| 4 | Plate | Holder | Stops | ❌ Ignores instruction |
| 5 | Holder | Plate | Stops | ❌ Ignores instruction |
| 6 | Plate | Plate | Stops | ❌ Ignores instruction |
| 7 | None | None | Random search | ❌ Ignores instruction |
| 8 | Plate | None | Stops | ❌ Ignores instruction |
Key finding: The robot’s behavior was entirely determined by object positions, regardless of language instruction.
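A matrix like the one above can be generated exhaustively instead of by hand; a small sketch (scenario names taken from the table):

```python
# Sketch: enumerate all cheese/bread placement combinations to build a
# systematic test matrix. Location names follow the scenarios above.
from itertools import product

locations = ["Holder", "Plate", "None"]
scenarios = list(product(locations, repeat=2))  # (cheese_loc, bread_loc)
print(len(scenarios))  # 9 combinations; the table above covers 8 of them
```

Enumerating first also reveals which cells are untested, e.g. the table omits the (cheese: None, bread: Plate) case.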
Behavior Pattern
The model learned a position-based state machine:
State 1: Nothing in plate → Pick any object from holder
State 2: Something in plate → Stop (task complete)
State 3: Nothing anywhere → Search randomly
Critical test: Manually moved cheese to plate (without robot), then gave instruction “pick up the bread and put it into the white plate”
- Expected: Robot picks up bread
- Actual: Robot stops (considers task complete because cheese is in plate)
This confirmed the model was using visual heuristics (“if object in plate → task done”) rather than understanding language instructions.
Phase 4: Root Cause Analysis - The Frozen Backbone Problem
How GR00T Processes Language
After diving into the codebase, I discovered how language flows through the model:
Step 1: Input Preparation (transforms.py)
# Language text
lang = "Pick slice of yellow cheese and place it in the white plate"
# Images from cameras
images = [scene_camera_frame, wrist_camera_frame]  # Shape: [V, T, C, H, W]
Step 2: Eagle VLM Processing (transforms.py:_apply_vlm_processing)
# Create conversation format (Eagle processes vision + language together!)
eagle_conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": scene_img},
            {"type": "image", "image": wrist_img},
            {"type": "text", "text": lang},
        ],
    }
]
# Tokenize and process
text_list = eagle_processor.apply_chat_template(eagle_conversation)
image_inputs = eagle_processor.process_vision_info(eagle_conversation)
Step 3: Eagle Model Forward (eagle_backbone.py)
# Eagle model processes BOTH vision and language together
eagle_output = self.eagle_model(
    input_ids=tokenized_text,        # Language tokens
    pixel_values=processed_images,   # Vision features
    attention_mask=attention_mask,
    output_hidden_states=True,
)
# Extract joint vision-language embeddings
vl_embeddings = eagle_output.hidden_states[select_layer]  # Shape: [B, seq_len, 2048]
vl_embeddings = self.eagle_linear(vl_embeddings)          # Project to 1536 dim
Step 4: Action Head Uses VL Embeddings (flow_matching_action_head.py)
# Action head receives joint vision-language embeddings
vl_embs = backbone_output.backbone_features  # From Eagle
# Diffusion model uses these embeddings as conditioning
model_output = self.model(
    hidden_states=sa_embs,             # State + action embeddings
    encoder_hidden_states=vl_embs,     # Vision-language conditioning ← KEY!
    encoder_attention_mask=vl_attn_mask,
    timestep=t_discretized,
)
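The conditioning in Step 4 is cross-attention: action/state tokens act as queries over the vision-language embeddings. A minimal numpy sketch of that mechanism (the real DiT adds learned projections, multiple heads, and timestep modulation):

```python
# Minimal single-head cross-attention sketch: action/state tokens (queries)
# attend to vision-language embeddings (keys/values). Shapes are illustrative.
import numpy as np

def cross_attention(sa_embs, vl_embs):
    d = sa_embs.shape[-1]
    scores = sa_embs @ vl_embs.T / np.sqrt(d)         # [num_actions, seq_len]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over VL tokens
    return weights @ vl_embs                          # [num_actions, d]

rng = np.random.default_rng(0)
sa = rng.normal(size=(16, 64))   # 16 state/action tokens
vl = rng.normal(size=(40, 64))   # 40 vision-language tokens
out = cross_attention(sa, vl)
```

The key implication: if `vl` barely changes between "pick cheese" and "pick bread", the attended output barely changes either, no matter how well the diffusion model is trained.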
The Fundamental Problem
Eagle VLM is completely frozen (tune_llm=False, tune_visual=False):
- Eagle cannot learn new language-vision associations
- Pre-trained on general VLM tasks (image captioning, VQA, etc.)
- Never seen “pick cheese” vs “pick bread” during pre-training
- Cannot learn to differentiate these similar instructions
- Eagle produces nearly identical embeddings
# Hypothesis: Eagle's frozen embeddings
emb_cheese = eagle("pick cheese", [scene_img, wrist_img])
emb_bread = eagle("pick bread", [scene_img, wrist_img])
# Cosine similarity likely very high (>0.95)
# because both share the "pick X and place in plate" structure

- The diffusion model only learns the mapping embeddings → actions
- If emb_cheese ≈ emb_bread, then actions_cheese ≈ actions_bread
- With no signal to differentiate instructions, the model falls back to visual heuristics: "if object in holder → pick it"
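The embedding-collapse hypothesis is directly measurable once backbone features can be extracted. A sketch of the probe, with synthetic vectors standing in for real Eagle embeddings (the `eagle(...)` call in the hypothesis is pseudocode):

```python
# Probe sketch for the embedding-collapse hypothesis. Synthetic vectors stand
# in for real backbone features; the dimensionality (1536) matches the
# projected Eagle output described above.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in: emb_bread is emb_cheese plus a small perturbation, mimicking a
# frozen backbone that barely distinguishes the two instructions.
rng = np.random.default_rng(0)
emb_cheese = rng.normal(size=1536)
emb_bread = emb_cheese + 0.05 * rng.normal(size=1536)
sim = cosine_similarity(emb_cheese, emb_bread)
# sim lands very close to 1.0; if real embeddings behave like this, the
# diffusion model has almost no signal separating the two tasks.
```

Running this probe on actual pre- and post-fine-tuning embeddings would confirm or refute the hypothesis quantitatively.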
Why Diffusion Model Training Alone is Insufficient
Even with the diffusion model trainable:
- ✅ Diffusion model can learn better action prediction
- ✅ Diffusion model can learn smoother trajectories
- ❌ Diffusion model CANNOT differentiate between similar language instructions
- ❌ Diffusion model CANNOT learn task-specific language conditioning
The bottleneck: Frozen Eagle provides nearly identical embeddings for different instructions, so the diffusion model has no signal to learn different behaviors.
Phase 5: Solution and Next Steps
Required Fix: Enable Backbone Training
Minimum requirement - Enable LLM fine-tuning:
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
    --tune-llm \                   # ✅ ENABLE THIS
    --tune-visual False \          # Keep frozen to save VRAM
    --tune-projector True \
    --tune-diffusion-model True \
    --lora-rank 32
Why this works:
- LLM can learn to differentiate “cheese” vs “bread” tokens
- LLM creates task-specific language embeddings
- Diffusion model learns to map these distinct embeddings to different actions
- VRAM: ~12-16GB (may fit on RTX 4080 Super with reduced batch size)
Best solution - Enable both LLM and vision fine-tuning:
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
    --tune-llm \                   # ✅ ENABLE THIS
    --tune-visual \                # ✅ ENABLE THIS
    --tune-projector True \
    --tune-diffusion-model True \
    --lora-rank 16 \               # Reduce from 32 to save VRAM
    --lora-alpha 32                # Reduce from 64 (2x rank)
Why this is best:
- Vision tower learns to recognize cheese vs bread visually
- LLM learns to understand task instructions
- Combined: Model can ground language to visual objects
- VRAM: ~16-20GB (may require batch size reduction)
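The rank reduction's VRAM effect follows from LoRA's structure: each adapted d_out×d_in weight gains two low-rank factors totalling r·(d_in + d_out) parameters. A sketch with assumed dimensions and layer counts (not GR00T's actual shapes):

```python
# Sketch of LoRA adapter parameter accounting. Dimensions and matrix counts
# are illustrative assumptions, not GR00T's actual architecture.
def lora_params(rank, d_in, d_out, num_matrices):
    # Each adapted weight gets two factors: A (rank x d_in), B (d_out x rank).
    return num_matrices * rank * (d_in + d_out)

r32 = lora_params(32, 2048, 2048, 64)
r16 = lora_params(16, 2048, 2048, 64)
assert r16 * 2 == r32  # halving the rank halves the adapter parameters
```

Keeping alpha at 2× the rank (64→32 alongside 32→16) holds the effective update scale alpha/rank constant at 2, so only capacity changes, not the magnitude of the learned update.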
VRAM Testing Script
Created test_vram_requirements.sh to systematically test configurations:
cd ~/lerobot/scripts/so100_groot
./test_vram_requirements.sh
This script tests 5 configurations:
- Baseline (frozen backbone) - ~8GB
- LLM only (LoRA 16) - ~12GB
- LLM only (LoRA 32) - ~14GB
- LLM + Vision (LoRA 16) - ~18GB
- LLM + Vision (LoRA 32) - ~22GB
It then recommends the best configuration that fits on the GPU.
Trade-offs and Considerations
| Configuration | VRAM | Training Speed | Language Conditioning | Visual Grounding |
|---|---|---|---|---|
| Frozen backbone | ~8GB | ~5-7 sec/step | ❌ Broken | ❌ No |
| LLM only | ~12-16GB | ~8-12 sec/step | ✅ Works | ⚠️ Limited |
| LLM + Vision | ~16-20GB | ~12-18 sec/step | ✅ Works | ✅ Yes |
Lessons Learned
1. Frozen VLM Backbones Cannot Learn New Tasks
Key insight: When using a vision-language model as a backbone, freezing it prevents learning task-specific language-vision associations. This is fundamentally different from freezing a vision-only backbone (like ResNet or ViT).
Why: VLMs create joint embeddings from vision and language. If the VLM is frozen, it can only use pre-trained associations, which likely don’t include my specific tasks.
2. Action-Only Fine-tuning Has Limits
What works: Fine-tuning only the action head (projector + diffusion model) can work for:
- Single-task learning
- Tasks similar to pre-training data
- Visual-only conditioning
What doesn’t work: Action-only fine-tuning cannot enable:
- Language conditioning for novel tasks
- Differentiation between similar language instructions
- Grounding of new language concepts to vision
3. Debugging Requires Understanding Data Flow
Critical: Understanding how data flows through the model is essential for debugging:
- How is language tokenized?
- Where are vision and language combined?
- What embeddings does the action head receive?
- Which components are frozen vs trainable?
Without this understanding, it’s easy to misdiagnose problems (e.g., thinking diffusion model training alone would fix language conditioning).
4. Systematic Testing Reveals Patterns
Approach: Testing with a matrix of scenarios (single ingredient, dual ingredient, different positions) revealed the position-based state machine pattern.
Value: This systematic approach provided clear evidence that language had 0% effect, which motivated deeper investigation into the architecture.
5. VRAM is the Limiting Factor
Reality: The best configuration (LLM + Vision fine-tuning) requires ~20GB VRAM, which exceeds most consumer GPUs.
Solutions:
- Reduce LoRA rank (32 → 16)
- Reduce batch size (16 → 4)
- Use gradient checkpointing
- Train on cloud GPUs
- Accept LLM-only training as compromise
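The interplay of these levers can be made concrete with a back-of-the-envelope estimator. All constants here are rough illustrative assumptions, not measured values:

```python
# Back-of-the-envelope VRAM estimator for the levers listed above. Every
# constant is an illustrative assumption, not a measurement.
def estimate_vram_gb(base_weights_gb=6.0, optimizer_gb_per_m_params=0.012,
                     trainable_m_params=50, activation_gb_per_sample=0.5,
                     batch_size=16, gradient_checkpointing=False):
    optimizer = optimizer_gb_per_m_params * trainable_m_params
    activations = activation_gb_per_sample * batch_size
    if gradient_checkpointing:
        activations *= 0.3  # checkpointing trades recompute time for memory
    return base_weights_gb + optimizer + activations

full = estimate_vram_gb(batch_size=16)
reduced = estimate_vram_gb(batch_size=4, gradient_checkpointing=True)
assert reduced < full  # smaller batch + checkpointing shrink the budget
```

The point of the sketch: batch size dominates activation memory, which is why batch-size reduction and gradient checkpointing are usually the first levers to pull before giving up on a configuration.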
Next Steps
- Test VRAM requirements on RTX 4080 Super to determine feasible configuration
- Retrain model with LLM fine-tuning enabled (minimum) or LLM + Vision (if VRAM allows)
- Validate language conditioning with systematic testing:
- Different instructions → different behaviors
- Negation has effect
- Model responds to language changes
- Document results and update training scripts with proper defaults
- Consider alternative approaches if VRAM is insufficient:
- Train separate models per task
- Use task selector at inference time
- Explore model quantization or pruning
Conclusion
This debugging journey revealed a fundamental limitation in the training configuration: frozen VLM backbones cannot learn task-specific language conditioning. While the initial fix (enabling diffusion model training) was necessary, it was insufficient because the bottleneck was in the frozen Eagle backbone producing nearly identical embeddings for different instructions.
The solution requires enabling at least LLM fine-tuning, and ideally both LLM and vision fine-tuning, to allow the model to learn task-specific language-vision associations. This comes with VRAM trade-offs that must be carefully managed on consumer GPUs.
The systematic testing approach and deep dive into model architecture were essential for identifying the root cause and developing an effective solution. This experience highlights the importance of understanding not just what to train, but how the model processes and combines different modalities.
Vipin M