Vision-Language-Action Models Explained: How VLAs Power Modern Robots
Vision-language-action models are the robot equivalent of GPT-4 — massive, pre-trained neural networks that can be fine-tuned to perform a wide range of physical tasks. Understanding what VLAs are, how they work, and when to use them is now essential knowledge for any serious robotics practitioner.
What Is a Vision-Language-Action Model?
A vision-language-action model (VLA) is a neural network that takes visual observations (camera images) and natural language instructions as input, and outputs robot actions — joint velocities, end-effector poses, or gripper commands. The "vision-language" part refers to the pre-trained backbone: these models inherit their visual and semantic understanding from large-scale internet pre-training on image-text pairs, much like CLIP or a vision-language model (VLM). The "action" part is the fine-tuning head trained on robot demonstration data.
The core insight is that pre-training on internet data gives the robot backbone a rich representation of the physical world — what objects are, how they relate spatially, and what language means — before it has ever seen a robot demonstration. Fine-tuning then adapts this representation to the robot's embodiment and target tasks. Because the backbone already understands "pick up the blue cup" or "open the drawer on the left," the model can generalize to novel objects and task phrasings with far fewer demonstrations than a policy trained from scratch.
VLA Architecture: The Three Components
Every VLA shares the same three-stage architecture, though implementations differ significantly in how each stage is realized:
1. Vision Encoder: Converts camera images into a sequence of visual tokens. Most VLAs use a pre-trained vision transformer (ViT) — OpenVLA uses SigLIP ViT-L/14 (304M params), RT-2 uses ViT-G (1.8B params). The vision encoder is the component that transfers most directly from internet pre-training, because the visual features learned from web images (object recognition, spatial layout, texture understanding) are directly useful for robot perception. Image resolution matters: 224x224 is standard but limits fine-grained perception; some models (pi0) use 384x384 or higher for tasks requiring sub-centimeter visual precision.
2. Language-Conditioned Backbone: A large language model that processes the visual tokens alongside the language instruction tokens. This is where semantic reasoning happens — the backbone determines that "pick up the red apple" requires identifying the red apple in the visual tokens and planning an appropriate action sequence. OpenVLA uses LLaMA-2 7B; RT-2's largest variant builds on PaLI-X 55B (a smaller variant uses PaLM-E 12B); pi0 uses a 3B backbone. The backbone size determines both the model's reasoning capability and its inference cost.
3. Action Head: Converts the backbone's output into robot-specific actions. This is the critical design choice that differentiates VLA architectures:
- Tokenized actions (RT-2, OpenVLA): Discretize the continuous action space into bins (typically 256 bins per dimension) and predict action tokens using the language model's token prediction head. Simple to implement because it reuses the language model's existing output machinery. Limitation: discretization introduces quantization error (~0.5mm at 256 bins over a 12cm workspace) and multi-step prediction requires autoregressive generation, which is slow.
- Continuous action heads (Octo): A small head predicts continuous action values directly; Octo implements this as a lightweight diffusion head that denoises sampled actions. Lower latency than autoregressive tokenized decoding, but may lose some of the language model's generalization benefits.
- Flow-matching action head (pi0): Model the action distribution as a continuous flow and sample from it using a diffusion-like denoising process. Produces smooth, continuous action trajectories. Best for dexterous tasks but adds 50-100ms to inference time.
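To make the tokenized scheme and its quantization error concrete, here is a minimal pure-Python sketch of bin-based action discretization; the bounds and dimensions are illustrative, not any model's actual configuration:

```python
N_BINS = 256

def tokenize_action(action, low, high, n_bins=N_BINS):
    """Discretize each continuous action dimension into a bin index
    (the RT-2/OpenVLA-style tokenized action scheme)."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)              # clip to the action bounds
        frac = (a - lo) / (hi - lo)          # normalize to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens

def detokenize_action(tokens, low, high, n_bins=N_BINS):
    """Map bin indices back to bin centers; incurs quantization error
    of at most half a bin width per dimension."""
    return [lo + (t + 0.5) / n_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]

# Over a 12 cm workspace, one bin is 120 mm / 256 ~ 0.47 mm wide,
# which is the ~0.5 mm quantization figure quoted above.
low, high = [0.0] * 3, [0.12] * 3            # hypothetical bounds, meters
action = [0.031, 0.074, 0.119]
recovered = detokenize_action(tokenize_action(action, low, high), low, high)
```

The round-trip error per dimension is bounded by half a bin width, which is why tokenized heads are adequate for coarse pick-place but a liability for sub-millimeter insertion.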
Key Models Comparison
| Model | Parameters | Action Head | Training Data | GPU for Fine-Tune | Inference Latency | Access |
|---|---|---|---|---|---|---|
| RT-2 | 55B | Tokenized | Google internal + web | TPU v4 pod (not public) | ~1-2s per step | Closed (Google) |
| RT-2-X | 55B | Tokenized | Open X-Embodiment | TPU v4 pod (not public) | ~1-2s per step | Closed (Google) |
| OpenVLA | 7B | Tokenized | Open X-Embodiment (970K eps) | 1x A100 80GB | ~200ms per step | Open source (HuggingFace) |
| Octo | 93M | Continuous (diffusion) | Open X-Embodiment (800K eps) | 1x RTX 4090 24GB | ~50ms per step | Open source (HuggingFace) |
| pi0 | 3B | Flow-matching | Proprietary (10K+ hours) | Enterprise access only | ~100ms per step | Commercial (Physical Intelligence) |
RT-2 and RT-2-X: The Google DeepMind Baselines
RT-2 (Robotics Transformer 2), released by Google DeepMind in 2023, was the first demonstration that scaling a vision-language model to robot control produced qualitatively new capabilities. RT-2 co-fine-tuned a PaLI-X vision-language model on web data and robot trajectories simultaneously, producing a policy that could follow novel instructions, reason about object properties, and generalize to objects it had never seen in robot demonstrations — only on the internet.
RT-2 showed that VLAs could perform chain-of-thought reasoning: asked to pick up "something you can use to clean a spill," the model identified a sponge from the scene without ever having been explicitly told to associate sponges with cleaning. This emergent capability — semantic generalization beyond the training distribution — is what makes VLAs qualitatively different from classic imitation learning policies.
RT-2-X extended RT-2 by training on the Open X-Embodiment dataset (demonstrations from 22 robot embodiments), showing that cross-embodiment training improves generalization. A policy trained on data from multiple robots performs better on each individual robot than a policy trained only on that robot's data — analogous to how multilingual language models outperform monolingual ones.
OpenVLA: The Open-Source Starting Point
OpenVLA, released by Stanford and Berkeley researchers in 2024, democratized VLA fine-tuning by building on the open-source Prismatic VLM (itself based on LLaMA) and training on the Open X-Embodiment dataset — a 970k-episode collection of robot demonstrations from 22 different embodiments. OpenVLA is the starting point most research teams use today because it is fully open-source, well-documented, and achieves strong performance on standard manipulation benchmarks.
Fine-tuning OpenVLA on a custom task requires as few as 50-200 demonstrations, a dataset formatted with HuggingFace LeRobot conventions, and a single 80GB A100 or H100 GPU for a training run of several hours. The resulting policy is surprisingly capable of generalizing to scene variations and novel object positions not seen in training, courtesy of the pre-trained visual backbone.
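To give a feel for what a LeRobot-style training frame contains, here is a minimal sketch of one timestep as a flat feature dict. The dotted key names follow LeRobot's general convention, but the exact schema varies by LeRobot version, so treat the field names here as indicative rather than authoritative:

```python
def make_frame(image, state, action, task, episode_idx, frame_idx, fps=50):
    """One timestep of a demonstration episode, as a flat feature dict.
    Hypothetical shapes: a 7-DoF arm with a single wrist camera."""
    return {
        "observation.images.wrist": image,   # HxWx3 uint8 camera frame
        "observation.state": state,          # 7 joint positions (proprioception)
        "action": action,                    # 7-dim commanded action
        "task": task,                        # language instruction the VLA trains on
        "episode_index": episode_idx,        # which demonstration this frame belongs to
        "frame_index": frame_idx,            # position within the episode
        "timestamp": frame_idx / fps,        # seconds, at the capture rate
    }

frame = make_frame("<image>", [0.0] * 7, [0.0] * 7,
                   "pick up the blue cup", episode_idx=0, frame_idx=12)
```

The `task` string is the piece teams most often forget to record; without it, the language-conditioning pathway of the VLA has nothing to train on.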
SVRC's data collection service produces datasets in LeRobot-compatible format, ready for OpenVLA fine-tuning out of the box.
Octo: The Lightweight Alternative
Octo (Ghosh et al., 2024) takes a different approach: instead of scaling up a large language model, build a purpose-built transformer architecture that is small enough to run on commodity hardware. At 93M parameters, Octo is 75x smaller than OpenVLA and runs inference at 50ms on an RTX 4090 — fast enough for real-time control at 20 Hz.
Octo uses a diffusion action head rather than tokenized actions, which produces smoother trajectories. It supports both language conditioning and goal-image conditioning (show the robot a picture of the desired end state instead of describing it in language). For teams without A100 access or who need real-time inference on edge hardware, Octo is the practical choice. Fine-tuning Octo on 200 custom demonstrations takes approximately 2 hours on a single RTX 4090.
The tradeoff: Octo's smaller backbone means less semantic understanding. It cannot perform the chain-of-thought reasoning that RT-2 can. For tasks where the instruction is simple ("pick up the cup") and generalization to novel objects is not critical, Octo's performance matches OpenVLA at a fraction of the compute cost. For tasks requiring complex language understanding or reasoning about object properties, OpenVLA or larger models are necessary.
pi0: Physical Intelligence's Generalist Policy
pi0, from Physical Intelligence, represents the commercial frontier of VLA development. Unlike OpenVLA, which emits discrete action tokens through its language model head, pi0 uses a flow-matching action head that produces continuous, smooth action trajectories — more suited to dexterous tasks than discrete tokenized actions. pi0 was trained on a proprietary dataset of over 10,000 hours of robot demonstrations across dozens of tasks and hardware platforms.
What distinguishes pi0 architecturally is the separation between the "slow" language-conditioned reasoning pathway and the "fast" reactive motor control pathway. This mirrors insights from cognitive science about dual-process control systems. The slow pathway processes the task instruction and current scene to produce a high-level plan; the fast pathway generates low-latency motor commands. The result is a policy that can handle both long-horizon reasoning and high-frequency reactive control — opening the door to tasks like folding laundry, where both are required simultaneously.
Access to pi0 for commercial deployment is available through Physical Intelligence's enterprise program. For teams exploring pi0-style architectures, SVRC's benchmarks include evaluations of flow-matching policies on standard manipulation suites, giving you a reference point for expected performance before committing to a training run.
Training Data Requirements
VLA performance scales with both pre-training data diversity and task-specific fine-tuning data quality. The empirical scaling relationships observed across published VLA results:
| Scenario | Fine-Tune Demos | Expected Success Rate | Notes |
|---|---|---|---|
| Zero-shot (no fine-tuning) | 0 | 10-30% | Works only if task is in pre-training distribution |
| Minimal fine-tuning | 10-50 | 40-60% | Sufficient for simple tasks with pre-trained backbone |
| Standard fine-tuning | 50-200 | 70-85% | Sweet spot for most tasks — best ROI on data collection |
| Heavy fine-tuning | 200-1000 | 85-95% | Diminishing returns above 500; variation matters more |
| Specialist fine-tuning | 1000+ | 90-98% | Industrial deployment quality; may overfit to task |
Critical data quality requirements: demonstrations must include the language instruction as metadata, camera viewpoints should match the deployment setup, and demonstrations should cover the full range of object poses and scene configurations the policy will encounter. A dataset of 200 demonstrations with diverse configurations outperforms 500 demonstrations with repetitive configurations.
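The diversity requirement can be enforced at collection time rather than hoped for afterward. A minimal sketch that spreads object start poses over a jittered grid so a small demo budget still covers the workspace; the bounds are hypothetical tabletop coordinates:

```python
import random

def sample_diverse_poses(n, x_range=(0.20, 0.60), y_range=(-0.25, 0.25), seed=0):
    """Jittered-grid sampling of object start poses (x, y, yaw) in meters.
    Grid placement guarantees coverage; per-cell jitter avoids a policy
    memorizing exact grid positions. Ranges are illustrative."""
    rng = random.Random(seed)
    cols = max(1, round(n ** 0.5))
    rows = -(-n // cols)                       # ceiling division
    dx = (x_range[1] - x_range[0]) / cols
    dy = (y_range[1] - y_range[0]) / rows
    poses = []
    for i in range(n):
        cx = x_range[0] + (i % cols) * dx      # this sample's grid cell origin
        cy = y_range[0] + (i // cols) * dy
        poses.append((
            cx + rng.uniform(0, dx),           # x jittered within the cell
            cy + rng.uniform(0, dy),           # y jittered within the cell
            rng.uniform(-3.14159, 3.14159),    # random object yaw
        ))
    return poses

poses = sample_diverse_poses(200)              # one start pose per demonstration
```

Printing each pose to the operator's screen before a demonstration is a cheap way to turn "200 diverse demos" from an aspiration into a checklist.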
Inference Requirements and Deployment
VLA deployment is a compute problem. The inference requirements vary dramatically across models:
- Octo (93M params): Runs on RTX 4090 (24GB VRAM) at 20 Hz. Can be deployed on Jetson AGX Orin at 5-8 Hz with TensorRT optimization. The smallest VLA that maintains cross-embodiment generalization.
- OpenVLA (7B params): Requires A100 or H100 (80GB VRAM) for full-precision inference at 5 Hz. With 4-bit quantization (GPTQ or AWQ), fits on RTX 4090 at 3-4 Hz. LoRA fine-tuning reduces memory to ~40GB.
- RT-2 (55B params): Requires multiple A100/H100 GPUs with tensor parallelism. Not practical for single-robot deployment; originally ran on Google's internal TPU infrastructure. This is why OpenVLA exists — to provide RT-2-class capabilities at deployable scale.
- pi0 (3B params): Optimized for deployment. Runs on a single A100 or equivalent at 10 Hz. Physical Intelligence provides an inference server that handles batched requests for multi-robot deployments.
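The memory and control-rate numbers above follow from simple arithmetic, sketched here as a back-of-envelope calculator (weight-only memory, ignoring activations and KV cache, so real requirements are somewhat higher):

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight memory in GB for a model of n_params parameters
    stored at the given precision. Ignores activations and KV cache."""
    return n_params * bits_per_param / 8 / 1e9

def max_control_hz(latency_s):
    """Upper bound on closed-loop control frequency when every step
    blocks on one inference call."""
    return 1.0 / latency_s

# OpenVLA's 7B weights: ~14 GB at fp16, ~3.5 GB at 4-bit, which is why
# 4-bit quantization brings it within an RTX 4090's 24 GB budget.
fp16_gb = weight_memory_gb(7e9, 16)
int4_gb = weight_memory_gb(7e9, 4)

# Octo at 50 ms per step supports at most 20 Hz closed-loop control.
octo_hz = max_control_hz(0.050)
```

The same arithmetic explains the RT-2 row: 55B parameters at fp16 is roughly 110 GB of weights alone, which no single GPU holds.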
For most research teams, the practical choice is between Octo (cheap, fast, less capable) and OpenVLA (expensive, slower, more capable). Start with Octo to validate your task and data pipeline, then upgrade to OpenVLA if you need better language understanding or novel-object generalization.
Zero-Shot vs. Fine-Tuning Performance
A key question for practitioners: how much does fine-tuning matter, and when is zero-shot performance sufficient?
Zero-shot works when: the task is common (pick-place, drawer open/close), the objects are everyday items that appear in the pre-training data, the camera viewpoint is similar to standard research setups (third-person overhead or eye-in-hand), and the language instruction is simple and unambiguous.
Fine-tuning is necessary when: the task involves custom hardware or unusual kinematics, the objects are specialized (lab equipment, industrial parts), the camera configuration is non-standard, the task requires precision beyond what the pre-trained model achieves, or the deployment environment has unique visual characteristics (specific lighting, background).
In practice, almost all production deployments require fine-tuning. Zero-shot performance provides a useful sanity check — if the VLA cannot perform the task at all in zero-shot mode, either the task is very far from the training distribution (requiring a lot of fine-tuning data) or the task may be better served by a non-VLA approach.
Current Limitations
- Inference latency: Even the fastest VLAs (Octo at 50ms) are slower than classic policies (ACT at 10ms, Diffusion Policy at 15ms with DDIM). For tasks requiring >30 Hz control (force-sensitive assembly, reactive grasping), VLAs are too slow without dedicated hardware acceleration.
- Contact-phase performance: VLAs excel at the approach phase (identifying objects, planning trajectories) but struggle with the contact phase (force regulation, insertion, in-hand manipulation). The language backbone provides no useful prior for contact dynamics. Hybrid approaches — VLA for approach, force/tactile policy for contact — are the current best practice.
- Hallucination: Like language models, VLAs can "hallucinate" — predicting confident actions in situations where the correct action is uncertain. This manifests as the robot executing a grasp motion when no graspable object is present, or following an instruction that is physically impossible. Safety layers (force limits, workspace bounds) are essential.
- Training cost: Pre-training a VLA from scratch requires on the order of a million robot demonstrations and thousands of GPU-hours. Even fine-tuning OpenVLA costs $50-200 in cloud compute per training run. For teams iterating rapidly on task definitions, this cost adds up.
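A safety layer of the kind mentioned above can be as simple as clamping every predicted command before it reaches the controller. A minimal sketch, with hypothetical workspace bounds and force limits that you would set from your robot's actual reachable, collision-free volume:

```python
def safe_action(target_xyz, grip_force, workspace_min, workspace_max, force_limit):
    """Clamp a VLA-predicted end-effector target into a workspace box and
    cap the commanded grip force. A hallucinated grasp outside the table
    gets pulled back to the boundary instead of executed."""
    clamped = tuple(
        min(max(v, lo), hi)
        for v, lo, hi in zip(target_xyz, workspace_min, workspace_max)
    )
    return clamped, min(grip_force, force_limit)

# Example: the model hallucinates a grasp 1.5 m away and below the table.
pose, force = safe_action(
    target_xyz=(1.5, 0.0, -0.2),     # meters; z below the table surface
    grip_force=80.0,                 # newtons, above the safe limit
    workspace_min=(0.2, -0.3, 0.01),
    workspace_max=(0.6, 0.3, 0.4),
    force_limit=40.0,
)
# pose is clamped to (0.6, 0.0, 0.01); force is capped at 40.0 N
```

Real deployments layer this with velocity limits and an external e-stop, but even this box clamp converts a hallucinated action from a crash into a harmless boundary touch.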
How VLAs Differ from Classic Imitation Learning Policies
Classic IL policies — ACT, Diffusion Policy, BC-Z — learn entirely from robot demonstration data. Their visual representations are learned from scratch or from a narrow pre-trained encoder (like R3M or MVP). They generalize well within their training distribution but struggle with novel objects, lighting changes, or task instructions that rephrase the goal. They also require more demonstrations to achieve a given performance level because they lack the semantic prior that pre-training provides.
VLAs trade compute for generalization. A classic ACT policy on a GPU costs pennies per inference; a VLA inference step on a 7B-parameter model costs orders of magnitude more. For tasks that need to generalize broadly across environments and instructions, VLAs win. For a narrowly defined, repetitive industrial task where you have 1,000+ demonstrations and can tune the environment, a classic policy often achieves better speed and reliability at lower cost. The practical decision framework: if your task requires generalization, start with a VLA backbone. If it is narrow and high-throughput, optimize a classic policy.
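The decision framework above can be condensed into a toy helper. The thresholds are illustrative (taken from the rules of thumb in this article), not prescriptive:

```python
def choose_policy_class(needs_generalization, demos_available, control_hz):
    """Toy encoding of the VLA-vs-classic decision framework.
    Thresholds mirror this article's rules of thumb, not hard limits."""
    if control_hz > 30:
        # Force-sensitive or highly reactive tasks outrun current VLA latency.
        return "classic policy (ACT / Diffusion Policy): VLAs are too slow"
    if needs_generalization:
        return "VLA backbone: start with Octo, upgrade to OpenVLA if needed"
    if demos_available >= 1000:
        return "classic policy: tune for speed and reliability"
    return "VLA backbone: pre-training offsets the small demo budget"

print(choose_policy_class(needs_generalization=True,
                          demos_available=100, control_hz=10))
```

Like any three-line heuristic, this ignores budget, hardware, and safety constraints; treat it as a first filter, not the final call.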
Fine-Tuning VLAs with SVRC Data
SVRC provides end-to-end support for VLA fine-tuning projects. Our teleoperation infrastructure captures demonstrations in RLDS/LeRobot format with synchronized multi-camera video, proprioceptive state, and action labels at 50Hz. Our dataset pipelines include episode quality filtering (removing failed attempts and hesitations), camera calibration metadata, and task instruction annotation — all the metadata VLAs need for effective fine-tuning.
For teams that need custom data at scale, our managed collection service at the Mountain View facility can produce hundreds of demonstrations per day with trained operators across a library of manipulation tasks. We also offer consultation on task design — defining the scope, variation axes, and success criteria for a dataset that will actually train a generalizable policy. Pilot projects start at $2,500 for 200 demonstrations; full campaigns at $8,000 for 1,000+ demonstrations. Contact our team to discuss your VLA fine-tuning project, or explore our existing dataset catalog through the SVRC platform.