Definition
Diffusion Policy adapts denoising diffusion probabilistic models — the same generative framework behind image generators like Stable Diffusion — to the problem of robot action generation. Instead of predicting a single deterministic action, the model iteratively refines a sample of pure Gaussian noise into a coherent action trajectory through a learned denoising process. This allows the policy to represent the full distribution of valid behaviors for a given observation.
The approach was introduced by Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song in 2023. It addresses a fundamental limitation of standard behavior cloning: when demonstration data contains multiple valid strategies for the same situation (e.g., grasping an object from the left or the right), a regression-based policy averages them and produces an invalid middle action. Diffusion Policy can represent all modes of the distribution and sample from any of them.
Diffusion Policy typically operates on action chunks (sequences of 8-32 future actions) rather than single steps, combining the benefits of action chunking with multimodal generative modeling. It has achieved state-of-the-art results on contact-rich tasks including cloth folding, tool use, bimanual assembly, and precision insertion.
How It Works
The core mechanism borrows from denoising diffusion probabilistic models (DDPMs). During training, the model learns to reverse a gradual noising process. Given a ground-truth action sequence a from a demonstration, Gaussian noise is added at varying levels (timesteps t = 1...T). The neural network learns to predict the noise that was added, conditioned on the current observation (camera images, proprioceptive state) and the noisy action.
At inference, the process starts from pure noise and iteratively denoises over T steps, each step moving the sample closer to a valid action trajectory. The observation acts as the conditioning signal: the same noise seed with different observations produces different action sequences appropriate to each situation. The stochastic nature of the initial noise means multiple runs can produce different valid trajectories, reflecting the multimodal distribution in the training data.
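The training objective described above can be sketched in a few lines of NumPy. Everything here is illustrative: the linear beta schedule, the 8-step 7-DoF chunk shape, and the zero-output stand-in for the noise-prediction network are assumptions for the sketch, not LeRobot's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule, as in the original DDPM formulation (illustrative values).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def q_sample(actions, t, noise):
    """Forward noising: produce a noisy version of a clean action chunk at step t."""
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * noise

# One training example: a clean 8-step, 7-DoF action chunk from a demonstration.
clean_chunk = rng.uniform(-1, 1, size=(8, 7))
t = rng.integers(0, T)
noise = rng.standard_normal(clean_chunk.shape)
noisy_chunk = q_sample(clean_chunk, t, noise)

# The network eps_theta(noisy_chunk, t, observation) is trained to predict `noise`;
# the loss is a plain MSE (a zero placeholder stands in for the network output here).
predicted_noise = np.zeros_like(noise)
loss = np.mean((predicted_noise - noise) ** 2)
```

At training time this sampling of `t` and `noise` is repeated for every minibatch element, so the network sees all noise levels conditioned on the paired observation.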
Architecturally, two variants dominate. The CNN-based variant uses a 1D temporal convolutional U-Net to process the action sequence, treating time as the spatial dimension. The Transformer-based variant uses cross-attention between noisy action tokens and observation tokens. The CNN variant is faster; the Transformer variant scales better to high-dimensional action spaces.
Key Variants
- DDPM (Denoising Diffusion Probabilistic Model) — The original formulation with 50-100 denoising steps at inference. Produces high-quality trajectories but is relatively slow (~50-100ms per action chunk on a GPU).
- DDIM (Denoising Diffusion Implicit Model) — A deterministic sampling schedule that reduces denoising to 10-20 steps with minimal quality loss. The most common choice in practice, cutting inference time to ~10-20ms.
- Consistency Policy — Distills the multi-step diffusion process into a single-step or few-step generator. Achieves near-real-time inference (1-3 steps) while maintaining multimodal behavior, making it suitable for high-frequency control loops.
- 3D Diffusion Policy (DP3) — Extends diffusion policy to operate on 3D point cloud observations rather than 2D images, improving spatial reasoning for tasks involving depth, occlusion, and precise placement.
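To make the DDIM variant above concrete, here is a minimal NumPy sketch of deterministic (eta = 0) sampling over a 10-step subsequence of a 100-step training schedule. The `eps_model` placeholder stands in for the trained, observation-conditioned noise predictor; the schedule values and chunk shape are illustrative assumptions.

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)
a_bar = np.cumprod(1.0 - betas)

def eps_model(x, t):
    """Placeholder for the trained, observation-conditioned noise predictor."""
    return np.zeros_like(x)

def ddim_sample(num_steps=10, chunk_shape=(8, 7), seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(chunk_shape)            # start from pure Gaussian noise
    # Evenly spaced subset of the training timesteps, descending to 0.
    ts = np.linspace(T - 1, 0, num_steps).astype(int)
    for i, t in enumerate(ts):
        eps = eps_model(x, t)
        # Predicted clean chunk, inverted from the forward-noising equation.
        x0_hat = (x - np.sqrt(1.0 - a_bar[t]) * eps) / np.sqrt(a_bar[t])
        if i + 1 == num_steps:
            x = x0_hat                              # final step returns the clean estimate
        else:
            t_prev = ts[i + 1]
            # Deterministic DDIM update (eta = 0): no fresh noise is injected.
            x = np.sqrt(a_bar[t_prev]) * x0_hat + np.sqrt(1.0 - a_bar[t_prev]) * eps
    return x

chunk = ddim_sample()
```

Because no fresh noise enters the loop, the same initial seed and observation always yield the same trajectory, which is what makes the 10-20 step schedules practical and repeatable.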
Comparison with Alternatives
Diffusion Policy vs. ACT: ACT uses a CVAE to handle multimodality but effectively commits to a single mode once the latent is sampled. Diffusion Policy can express richer, more complex distributions because each denoising run can converge to a different mode of the demonstrated behavior. However, ACT is 10-50x faster at inference due to its single forward pass. For unimodal tasks (one clear strategy), ACT is often preferable; for multimodal tasks, Diffusion Policy excels.
Diffusion Policy vs. plain Behavior Cloning: Standard BC with MSE loss averages over modes, producing invalid actions when the data is multimodal. Diffusion Policy avoids this entirely. The tradeoff is greater implementation complexity, higher compute cost, and larger data requirements.
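The mode-averaging failure described here is easy to demonstrate numerically. A toy sketch, where the left/right grasp values are hypothetical stand-ins for two demonstrated strategies:

```python
import numpy as np

# Two equally valid demonstrated actions for the same observation:
# grasp from the left (x = -1) or grasp from the right (x = +1).
demo_actions = np.array([-1.0, +1.0])

# An MSE-trained regressor converges to the conditional mean...
mse_optimal = demo_actions.mean()   # 0.0: reaches for the gap between the objects

# ...whereas a policy that samples from the full distribution returns
# one valid mode per rollout, never the invalid average.
rng = np.random.default_rng(0)
sampled = rng.choice(demo_actions)  # either -1.0 or +1.0
```

The same averaging pathology appears per-dimension in real action chunks, which is why it matters most for tasks with several equally good strategies.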
Diffusion Policy vs. Reinforcement Learning: RL can discover novel strategies but requires reward functions and extensive environment interaction. Diffusion Policy learns purely from demonstrations, making it practical when reward engineering is difficult or simulation is unavailable.
Practical Requirements
Data: Diffusion Policy generally requires more demonstrations than ACT — typically 100-500 demonstrations for manipulation tasks. The need for more data stems from the model's greater expressiveness: it needs enough samples to accurately capture the multimodal distribution. Data diversity (varied object positions, orientations, strategies) is more important than volume.
Compute: Training takes 4-12 hours on a single GPU (RTX 3090/4090) for typical datasets. The DDPM variant requires 50-100 denoising steps at inference, running at ~10-20 Hz; DDIM with 10 steps reaches ~50 Hz; Consistency Policy can exceed 100 Hz. For most robot control loops (10-50 Hz), DDIM is the practical choice.
Hardware: Like ACT, Diffusion Policy works with position-controlled robot arms. It has been demonstrated on Franka Emika Panda, ALOHA systems, UR5, and various low-cost arms. The higher inference latency compared to ACT means control loop frequency matters: for arms running at 50 Hz, DDIM or Consistency Policy variants are needed.
Code Example: Training Diffusion Policy with LeRobot
# Install LeRobot
pip install lerobot

# Train a Diffusion Policy on your dataset
python lerobot/scripts/train.py \
    --policy.type=diffusion \
    --dataset.repo_id=your_hf_username/your_dataset \
    --training.num_epochs=2000 \
    --policy.n_action_steps=8 \
    --policy.num_inference_steps=10 \
    --output_dir=outputs/diffusion_policy_fold

# Load and evaluate a trained checkpoint
python lerobot/scripts/eval.py \
    --policy.path=outputs/diffusion_policy_fold/checkpoints/last/pretrained_model \
    --env.type=real_world
Training Tips and Common Pitfalls
- Noise schedule matters — The number of diffusion timesteps (T) controls the tradeoff between sample quality and inference speed. T=100 with DDPM or T=10 with DDIM are common starting points. Too few steps produce blurry, averaged actions; too many waste compute without quality gains.
- Action normalization is critical — Normalize actions to [-1, 1] before training. The diffusion process assumes Gaussian noise, and unnormalized actions with different scales across joints cause the model to allocate denoising capacity unevenly.
- Observation history length — Including 2-5 frames of observation history (images + proprioception) helps the model infer velocity and task phase. A single frame often produces ambiguous actions because the model cannot distinguish approaching from retreating motions.
- EMA (Exponential Moving Average) — Use an EMA of model weights for evaluation (decay 0.995-0.9999). This stabilizes inference behavior and consistently outperforms the raw training checkpoint.
- Beware of action space mismatch — Diffusion Policy works best with continuous action spaces. If your robot uses discrete gripper commands (open/close), treat the gripper as a separate binary output head rather than including it in the diffusion process.
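A minimal sketch of the per-dimension min-max normalization recommended above. The function names and example values are illustrative, not LeRobot's API:

```python
import numpy as np

def fit_normalizer(actions):
    """Per-dimension min/max over the dataset; actions has shape (N, action_dim)."""
    return actions.min(axis=0), actions.max(axis=0)

def normalize(actions, lo, hi):
    """Map each action dimension to [-1, 1], matching the diffusion noise scale."""
    return 2.0 * (actions - lo) / (hi - lo) - 1.0

def denormalize(actions, lo, hi):
    """Invert the mapping before sending commands to the robot."""
    return (actions + 1.0) * (hi - lo) / 2.0 + lo

# Example: two dimensions on very different scales
# (a joint angle in radians vs. a gripper width in meters).
data = np.array([[0.0, 0.00],
                 [3.1, 0.08],
                 [1.5, 0.04]])
lo, hi = fit_normalizer(data)
norm = normalize(data, lo, hi)
assert np.allclose(denormalize(norm, lo, hi), data)   # round-trip is exact
```

Without this step, the dimension with the largest numeric range dominates the noise-prediction loss, which is the uneven capacity allocation the tip above warns about.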
Diffusion Policy at SVRC
SVRC's Mountain View and Allston labs support the full Diffusion Policy workflow:
- Data collection — Teleoperation stations (ALOHA, VR, leader-follower) with multi-camera setups collect the diverse, multi-strategy demonstration data that Diffusion Policy requires. Our operators are trained to deliberately vary their strategies across demonstrations, maximizing the multimodal training signal.
- Training infrastructure — GPU workstations (RTX 4090, A100) with pre-configured LeRobot, diffusion-policy, and lerobot-plus environments. Train a Diffusion Policy checkpoint in 4-12 hours.
- Hardware fleet — Evaluate trained policies on OpenArm 101, DK1, and ALOHA bimanual systems. Our cells include calibrated cameras, F/T sensors, and Paxini tactile sensors for multi-modal policy input.
See Also
- Data Services — High-quality demonstration collection for Diffusion Policy training
- Data Platform — Dataset management in LeRobot and HuggingFace formats
- Robot Leasing — Access manipulation hardware for policy evaluation
- Hardware Catalog — OpenArm 101, DK1, and sensor accessories
- Repair and Maintenance — Keep your robot fleet running during long data collection campaigns
Key Papers
- Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., & Song, S. (2023). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." RSS 2023. The foundational paper introducing diffusion models for robot action generation.
- Prasad, A., Lin, K., Wu, J., Zhou, L., & Bohg, J. (2024). "Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation." RSS 2024. Distills Diffusion Policy into 1-3 step inference for real-time control.
- Ze, Y. et al. (2024). "3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations." RSS 2024. Extends Diffusion Policy to 3D point cloud observations for improved spatial reasoning.
Related Terms
- Action Chunking (ACT) — Deterministic alternative using CVAE-based transformer for action sequence prediction
- Behavior Cloning — The baseline supervised approach Diffusion Policy improves upon
- Imitation Learning — The broader paradigm of learning robot policies from demonstrations
- Policy Learning — General methods for learning state-to-action mappings
- Embodied AI — The field of AI systems that perceive and act in the physical world
Apply This at SVRC
Silicon Valley Robotics Center offers GPU-equipped training infrastructure and teleoperation stations for collecting the high-quality demonstration data that Diffusion Policy requires. Our data services team specializes in collecting diverse, multi-strategy demonstrations that leverage Diffusion Policy's multimodal strengths.