
Debugging Language Conditioning in GR00T Multitask Training

A project log for ChefMate: Kitchen Robot using VLA, ACT, Diffusion

Robotic arm workflow with the NVIDIA GR00T N1.5 model. Dataset recording, fine-tuning, debugging, and deployment for pick-and-place tasks

Vipin M · 10/20/2025 at 16:15 · 0 Comments

When your robot ignores “do not pick up the cheese” and picks it up anyway: A journey through frozen VLM backbones and the limits of action-only fine-tuning

Project Overview

This project log documents the discovery and resolution of a critical language conditioning failure in NVIDIA’s GR00T N1.5 vision-language-action (VLA) model during multitask training. The investigation revealed fundamental limitations in training configurations that freeze the vision-language backbone, and provides solutions for enabling proper language-conditioned robotic manipulation.

The debugging process involved systematic testing with dual-ingredient scenarios (cheese and bread), analysis of model architecture and data flow, and identification of the root cause: frozen Eagle VLM backbone preventing task-specific language-vision association learning.

This work is part of the LeIsaac project - building a multi-ingredient sandwich assembly robot using Isaac Sim, Isaac Lab, and VLA models with an SO-ARM 101 robotic arm.

Hardware and Software Stack

The Challenge: Multitask Language Conditioning

Why Multitask Learning?

The sandwich assembly task requires the robot to manipulate multiple ingredients based on language instructions:

This requires the model to:

  1. Understand language instructions - differentiate “cheese” vs “bread”
  2. Ground language to vision - recognize which object is cheese vs bread
  3. Execute task-specific actions - different manipulation strategies per ingredient

Training Setup

Datasets:

Training configuration:

python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
    --num-gpus 1 \
    --max-steps 10000 \
    --data-config so100_dualcam \
    --batch-size 16 \
    --lora-rank 32 \
    --balance-dataset-weights \
    --balance-trajectory-weights

The LeRobotMixtureDataset automatically balances sampling across both datasets during training.
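The balancing idea can be sketched in plain Python (the sizes below are made-up, not the real trajectory counts): weight each sample inversely to its dataset's size, so a small dataset is not drowned out by a large one.

```python
import random

# Hypothetical dataset sizes for illustration (3:1 imbalance)
dataset_sizes = {"cheese": 120, "bread": 40}

# Balanced sampling: per-sample weight is inversely proportional
# to the size of the dataset the sample comes from.
weights = {name: 1.0 / size for name, size in dataset_sizes.items()}

samples = [(name, i) for name, size in dataset_sizes.items() for i in range(size)]
sample_weights = [weights[name] for name, _ in samples]

random.seed(0)
drawn = random.choices(samples, weights=sample_weights, k=10_000)
counts = {name: sum(1 for n, _ in drawn if n == name) for name in dataset_sizes}
print(counts)  # roughly 50/50 despite the 3:1 size imbalance
```

The same principle extends to trajectory-level balancing (`--balance-trajectory-weights`): weight by inverse trajectory length so long demonstrations don't dominate.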

Phase 1: Problem Discovery

Initial Testing

After training a multitask model for 3,000 steps, I deployed it on the physical robot and tested with different language instructions:

Test 1: "pick up the yellow cheese and put it into the white plate"

Test 2: "pick up the bread and put it into the white plate"

Test 3: "do not pick up the cheese"

Critical observation: The model’s behavior was 100% determined by visual state, with 0% influence from language instruction.

Hypothesis: Visual State Machine

The robot appeared to be using a simple position-based heuristic:

IF (object detected in plate):
    STOP (task complete)
ELSE IF (object detected in holder):
    GRASP object → MOVE to plate → RELEASE
ELSE:
    SEARCH randomly

This suggested the model learned visual patterns rather than language-conditioned behavior.

Phase 2: First Fix Attempt - The Diffusion Model Flag

Discovery of --no-tune_diffusion_model

Investigating the training script revealed a suspicious flag:

TRAIN_CMD="python scripts/gr00t_finetune.py \
    --dataset-path ${DATASET_PATHS} \
    --no-tune_diffusion_model \  # ← SUSPICIOUS!
    --lora-rank 32 \
    ..."

Analysis of what this flag does:

GR00T N1.5 architecture:

├── Vision Tower (SigLIP) ..................... ❌ FROZEN (tune_visual=False)
├── Language Model (Qwen2.5-3B) ............... ❌ FROZEN (tune_llm=False)
└── Action Head
    ├── Projector (Linear layers) ............ ✅ TRAINABLE (LoRA rank 32)
    └── Diffusion Model (DiT) ................ ❌ FROZEN (--no-tune_diffusion_model)

With --no-tune_diffusion_model, only the tiny projector layer was trainable!

Training logs confirmed:

tune_diffusion_model: False
trainable params: 6,553,600 || all params: 2,730,717,120 || trainable%: 0.2400
Tune action head projector: True
Tune action head diffusion model: False  ← PROBLEM!
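The reported percentage is consistent with the two counts in the log, which confirms that only the ~6.5M-parameter projector was learning out of a 2.7B-parameter model:

```python
# Numbers taken directly from the training log above
trainable = 6_553_600
total = 2_730_717_120

pct = 100 * trainable / total
print(f"trainable: {pct:.4f}%")  # matches the logged 0.2400
```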

The Fix

Removed the flag from 03_train_model.sh:

# REMOVED: --no-tune_diffusion_model
TRAIN_CMD="python scripts/gr00t_finetune.py \
    --dataset-path ${DATASET_PATHS} \
    --lora-rank 32 \
    ..."

New training logs:

tune_diffusion_model: True
trainable params: 6,553,600 || all params: 2,730,717,120 || trainable%: 0.2400
Tune action head projector: True
Tune action head diffusion model: True  ← FIXED!

Expectations: With the diffusion model now trainable, the model should learn to map language instructions to different action sequences.

Reality: Language conditioning still failed! 😱

Phase 3: Systematic Testing and Behavior Analysis

Test Matrix

I conducted systematic testing with single and dual ingredient scenarios:

| Scenario | Cheese Location | Bread Location | Robot Action | Language Effect |
|---|---|---|---|---|
| 1 | Holder | Holder | Randomly picks one | ❌ Ignores instruction |
| 2 | Holder | None | Picks cheese | ❌ Ignores instruction |
| 3 | None | Holder | Picks bread | ❌ Ignores instruction |
| 4 | Plate | Holder | Stops | ❌ Ignores instruction |
| 5 | Holder | Plate | Stops | ❌ Ignores instruction |
| 6 | Plate | Plate | Stops | ❌ Ignores instruction |
| 7 | None | None | Random search | ❌ Ignores instruction |
| 8 | Plate | None | Stops | ❌ Ignores instruction |

Key finding: The robot’s behavior was entirely determined by object positions, regardless of language instruction.

Behavior Pattern

The model learned a position-based state machine:

State 1: Nothing in plate → Pick any object from holder
State 2: Something in plate → Stop (task complete)
State 3: Nothing anywhere → Search randomly

Critical test: I manually moved the cheese to the plate (without the robot), then gave the instruction "pick up the bread and put it into the white plate". The robot stopped immediately, treating the task as complete (scenario 4 above), instead of picking up the bread.

This confirmed the model was using visual heuristics ("if object in plate → task done") rather than understanding language instructions.
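The entire test matrix can be reproduced by a policy that never reads the instruction at all, which is the clearest way to state the failure:

```python
def observed_policy(cheese_pos, bread_pos, instruction):
    """Reconstruction of the behavior in the test matrix: the action
    depends only on object positions; `instruction` is never consulted."""
    positions = (cheese_pos, bread_pos)
    if "plate" in positions:
        return "stop"              # something already in plate → "task done"
    if "holder" in positions:
        return "pick_from_holder"  # grab whichever object is available
    return "search"

# Scenario 5: cheese in holder, bread in plate → stops,
# even when told to pick up the bread.
print(observed_policy("holder", "plate", "pick up the bread"))  # → stop
```

Any function of this shape scores 8/8 on the matrix above, which is why no amount of instruction rewording changed the outcome.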

Phase 4: Root Cause Analysis - The Frozen Backbone Problem

How GR00T Processes Language

After diving into the codebase, I discovered how language flows through the model:

Step 1: Input Preparation (transforms.py)

# Language text
lang = "Pick slice of yellow cheese and place it in the white plate"

# Images from cameras
images = [scene_camera_frame, wrist_camera_frame]  # Shape: [V, T, C, H, W]

Step 2: Eagle VLM Processing (transforms.py:_apply_vlm_processing)

# Create conversation format (Eagle processes vision + language together!)
eagle_conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": scene_img},
            {"type": "image", "image": wrist_img},
            {"type": "text", "text": lang}
        ]
    }
]

# Tokenize and process
text_list = eagle_processor.apply_chat_template(eagle_conversation)
image_inputs = eagle_processor.process_vision_info(eagle_conversation)

Step 3: Eagle Model Forward (eagle_backbone.py)

# Eagle model processes BOTH vision and language together
eagle_output = self.eagle_model(
    input_ids=tokenized_text,       # Language tokens
    pixel_values=processed_images,  # Vision features
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Extract joint vision-language embeddings
vl_embeddings = eagle_output.hidden_states[select_layer]  # Shape: [B, seq_len, 2048]
vl_embeddings = self.eagle_linear(vl_embeddings)          # Project to 1536 dim

Step 4: Action Head Uses VL Embeddings (flow_matching_action_head.py)

# Action head receives joint vision-language embeddings
vl_embs = backbone_output.backbone_features  # From Eagle

# Diffusion model uses these embeddings as conditioning
model_output = self.model(
    hidden_states=sa_embs,           # State + action embeddings
    encoder_hidden_states=vl_embs,   # Vision-language conditioning ← KEY!
    encoder_attention_mask=vl_attn_mask,
    timestep=t_discretized
)

The Fundamental Problem

Eagle VLM is completely frozen (tune_llm=False, tune_visual=False):

  1. Eagle cannot learn new language-vision associations
    • Pre-trained on general VLM tasks (image captioning, VQA, etc.)
    • Never seen “pick cheese” vs “pick bread” during pre-training
    • Cannot learn to differentiate these similar instructions
  2. Eagle produces nearly identical embeddings
    # Hypothesis: Eagle's frozen embeddings
    emb_cheese = eagle("pick cheese", [scene_img, wrist_img])
    emb_bread = eagle("pick bread", [scene_img, wrist_img])
    
    # Cosine similarity likely very high (>0.95)
    # Because both are "pick X and place in plate" structure
    
  3. Diffusion model has no signal to differentiate
    • Diffusion model learns: embeddings → actions
    • If emb_cheese ≈ emb_bread, then actions_cheese ≈ actions_bread
    • Model falls back to visual heuristics: “if object in holder → pick it”
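The hypothesis in point 2 is directly testable by mean-pooling the `backbone_features` for each instruction (same images) and measuring cosine similarity. I haven't run this on the real embeddings; the stand-in vectors below just illustrate the measurement and why a small instruction-dependent delta on a shared visual embedding yields near-1.0 similarity:

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two pooled embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

random.seed(0)
# Stand-in for the shared visual content of the scene (dim 1536, as in Step 3)
base = [random.gauss(0, 1) for _ in range(1536)]
# Small instruction-dependent perturbations, as hypothesized for frozen Eagle
emb_cheese = [x + 0.05 * random.gauss(0, 1) for x in base]
emb_bread = [x + 0.05 * random.gauss(0, 1) for x in base]

sim = cosine_similarity(emb_cheese, emb_bread)
print(f"cosine similarity: {sim:.3f}")  # near 1.0 → almost no conditioning signal
```

If the real pooled embeddings show similarity this high, the diffusion model cannot distinguish the two instructions no matter how long it trains.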

Why Diffusion Model Training Alone is Insufficient

Even with the diffusion model trainable:

The bottleneck: Frozen Eagle provides nearly identical embeddings for different instructions, so the diffusion model has no signal to learn different behaviors.

Phase 5: Solution and Next Steps

Required Fix: Enable Backbone Training

Minimum requirement - Enable LLM fine-tuning:

python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
    --tune-llm \              # ✅ ENABLE THIS
    --tune-visual False \     # Keep frozen to save VRAM
    --tune-projector True \
    --tune-diffusion-model True \
    --lora-rank 32

Why this works:

Best solution - Enable both LLM and vision fine-tuning:

python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
    --tune-llm \              # ✅ ENABLE THIS
    --tune-visual \           # ✅ ENABLE THIS
    --tune-projector True \
    --tune-diffusion-model True \
    --lora-rank 16 \          # Reduce from 32 to save VRAM
    --lora-alpha 32           # Reduce from 64 (2x rank)

Why this is best:

VRAM Testing Script

Created test_vram_requirements.sh to systematically test configurations:

cd ~/lerobot/scripts/so100_groot
./test_vram_requirements.sh

This script tests 5 configurations:

  1. Baseline (frozen backbone) - ~8GB
  2. LLM only (LoRA 16) - ~12GB
  3. LLM only (LoRA 32) - ~14GB
  4. LLM + Vision (LoRA 16) - ~18GB
  5. LLM + Vision (LoRA 32) - ~22GB

And recommends the best configuration that fits on the GPU.
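The recommendation step amounts to picking the most capable configuration whose estimated footprint fits within available VRAM, with some headroom. The sketch below uses the rough estimates above (the author's ballpark figures, not measurements) and is an assumed reconstruction of the selection logic, not the actual script:

```python
# Estimated peak VRAM (GB) per configuration, from the rough numbers above,
# ordered from least to most capable.
CONFIGS = [
    ("baseline (frozen backbone)", 8),
    ("LLM only, LoRA 16", 12),
    ("LLM only, LoRA 32", 14),
    ("LLM + Vision, LoRA 16", 18),
    ("LLM + Vision, LoRA 32", 22),
]

def best_config(available_gb: float, headroom_gb: float = 1.0) -> str:
    """Pick the most capable configuration that fits, leaving headroom
    for activation spikes and CUDA allocator fragmentation."""
    fitting = [name for name, gb in CONFIGS if gb + headroom_gb <= available_gb]
    return fitting[-1] if fitting else "none (reduce batch size or quantize)"

# An RTX 4080 Super has 16 GB, so LLM-only training is the realistic ceiling.
print(best_config(16.0))  # → LLM only, LoRA 32
```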

Trade-offs and Considerations

| Configuration | VRAM | Training Speed | Language Conditioning | Visual Grounding |
|---|---|---|---|---|
| Frozen backbone | ~8GB | ~5-7 sec/step | ❌ Broken | ❌ No |
| LLM only | ~12-16GB | ~8-12 sec/step | ✅ Works | ⚠️ Limited |
| LLM + Vision | ~16-20GB | ~12-18 sec/step | ✅ Works | ✅ Yes |

Lessons Learned

1. Frozen VLM Backbones Cannot Learn New Tasks

Key insight: When using a vision-language model as a backbone, freezing it prevents learning task-specific language-vision associations. This is fundamentally different from freezing a vision-only backbone (like ResNet or ViT).

Why: VLMs create joint embeddings from vision and language. If the VLM is frozen, it can only use pre-trained associations, which likely don’t include my specific tasks.

2. Action-Only Fine-tuning Has Limits

What works: Fine-tuning only the action head (projector + diffusion model) can work for:

What doesn’t work: Action-only fine-tuning cannot enable:

3. Debugging Requires Understanding Data Flow

Critical: Understanding how data flows through the model is essential for debugging:

  1. How is language tokenized?
  2. Where are vision and language combined?
  3. What embeddings does the action head receive?
  4. Which components are frozen vs trainable?

Without this understanding, it’s easy to misdiagnose problems (e.g., thinking diffusion model training alone would fix language conditioning).
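Question 4 in particular can be answered in a few lines. A generic audit helper is sketched below: it consumes `(name, numel, requires_grad)` tuples, such as you might derive from PyTorch's `model.named_parameters()`, and groups trainable counts by top-level module. The toy parameter list is illustrative, not the real GR00T parameter breakdown:

```python
from collections import defaultdict

def summarize_trainable(params):
    """Group (name, numel, requires_grad) tuples by top-level module and
    report trainable vs total counts — the check that exposes frozen parts."""
    by_module = defaultdict(lambda: [0, 0])  # module → [trainable, total]
    for name, numel, requires_grad in params:
        module = name.split(".")[0]
        by_module[module][1] += numel
        if requires_grad:
            by_module[module][0] += numel
    return {m: (t, tot) for m, (t, tot) in by_module.items()}

# Toy parameter list mirroring the failure mode: only the projector trains.
params = [
    ("eagle_model.vision.weight", 400_000_000, False),
    ("eagle_model.llm.weight", 2_300_000_000, False),
    ("action_head.projector.weight", 6_553_600, True),
    ("action_head.diffusion.weight", 24_000_000, False),
]
for module, (trainable, total) in summarize_trainable(params).items():
    print(f"{module}: {trainable:,} / {total:,} trainable")
```

Running a check like this before launching a multi-day training job would have surfaced the frozen backbone on day one.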

4. Systematic Testing Reveals Patterns

Approach: Testing with a matrix of scenarios (single ingredient, dual ingredient, different positions) revealed the position-based state machine pattern.

Value: This systematic approach provided clear evidence that language had 0% effect, which motivated deeper investigation into the architecture.

5. VRAM is the Limiting Factor

Reality: The best configuration (LLM + Vision fine-tuning) requires ~20GB VRAM, which exceeds most consumer GPUs.

Solutions:

Next Steps

  1. Test VRAM requirements on RTX 4080 Super to determine feasible configuration
  2. Retrain model with LLM fine-tuning enabled (minimum) or LLM + Vision (if VRAM allows)
  3. Validate language conditioning with systematic testing:
    • Different instructions → different behaviors
    • Negation has effect
    • Model responds to language changes
  4. Document results and update training scripts with proper defaults
  5. Consider alternative approaches if VRAM is insufficient:
    • Train separate models per task
    • Use task selector at inference time
    • Explore model quantization or pruning

Conclusion

This debugging journey revealed a fundamental limitation in the training configuration: frozen VLM backbones cannot learn task-specific language conditioning. While the initial fix (enabling diffusion model training) was necessary, it was insufficient because the bottleneck was in the frozen Eagle backbone producing nearly identical embeddings for different instructions.

The solution requires enabling at least LLM fine-tuning, and ideally both LLM and vision fine-tuning, to allow the model to learn task-specific language-vision associations. This comes with VRAM trade-offs that must be carefully managed on consumer GPUs.

The systematic testing approach and deep dive into model architecture were essential for identifying the root cause and developing an effective solution. This experience highlights the importance of understanding not just what to train, but how the model processes and combines different modalities.
