Compose Your Policies!
Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition

Xiao Li⁴, Ping Luo¹, Andrew F. Luo¹‡
¹The University of Hong Kong   ²Beijing Innovation Center of Humanoid Robotics   ³Shanghai AI Lab
⁴Shanghai Jiao Tong University   ⁵The Hong Kong University of Science and Technology
*Equal contribution   ‡Corresponding Author

Abstract

Diffusion-based models for robotic control, including vision-language-action (VLA) and vision-action (VA) policies, have demonstrated significant capabilities. Yet their advancement is constrained by the high cost of acquiring large-scale interaction datasets. This work introduces an alternative paradigm for enhancing policy performance without additional model training: composing existing policies at test time. Perhaps surprisingly, we demonstrate that the composed policy can exceed the performance of either parent policy. Our contribution is threefold. (I) First, we establish a theoretical foundation showing that the convex composition of distributional scores from multiple diffusion models can yield a superior one-step functional objective compared to any individual score. A Grönwall-type bound is then used to show that this single-step improvement propagates through entire generation trajectories, leading to systemic performance gains. (II) Second, motivated by these results, we propose General Policy Composition (GPC), a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies via a convex combination and test-time search. GPC is versatile, allowing for the plug-and-play composition of heterogeneous policies, including VA and VLA models, as well as those based on diffusion or flow matching, irrespective of their input visual modalities. (III) Third, we provide extensive empirical validation. Experiments on the Robomimic, PushT, and RoboTwin benchmarks, alongside real-world robotic evaluations, confirm that GPC consistently improves performance and adaptability across a diverse set of tasks. Further analysis of alternative composition operators and weighting strategies offers insights into the mechanisms underlying the success of GPC. These results establish GPC as a simple yet effective method for improving control performance by leveraging existing policies.

Motivation

Motivation Illustration: VLA and VA policies are powerful but rely on costly large-scale training. Theoretically, we show that combining the distributional scores of pre-trained policies can yield a single-step error reduction that propagates stably along the overall trajectory, enabling General Policy Composition (GPC), a training-free method that boosts policy performance via convex score combination. GPC is versatile, allowing plug-and-play composition of heterogeneous policies, including VA and VLA models, as well as those based on diffusion or flow matching, irrespective of their input visual modalities.

Mathematical Foundation

Overview

Overview of GPC

Overview of the Mathematical Foundation: We provide a high-level description of the mathematical foundation of the General Policy Composition framework, outlining how function-level (single-step improvement) and system-level (overall-trajectory improvement) analyses together motivate our approach.

Proposition 1: Function-level Single-step Improvement

Function-level Proposition 1

Proposition 1 establishes that a proper convex combination of distributional scores can yield a smaller single-step error than either individual score estimator.
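
To make the statement concrete, here is an informal sketch in generic notation (the symbols below, such as \(s_{\theta_1}\), \(s_{\theta_2}\), \(s^\ast\), and the weight \(w\), are placeholders we introduce for illustration and may differ from the paper's):

\[
\hat{s}_w(x, t) = w\, s_{\theta_1}(x, t) + (1 - w)\, s_{\theta_2}(x, t), \qquad w \in [0, 1],
\]
\[
\mathcal{E}(w) = \mathbb{E}\,\big\| \hat{s}_w(x, t) - s^\ast(x, t) \big\|^2 .
\]

Because \(\mathcal{E}(w)\) is a convex quadratic in \(w\), its minimizer \(w^\ast\) over \([0, 1]\) satisfies \(\mathcal{E}(w^\ast) \le \min\{\mathcal{E}(0), \mathcal{E}(1)\}\): the best convex combination is never worse than either individual score estimator, and is strictly better whenever the minimum is attained in the interior.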

Proposition 2: System-level Stability

System-level Proposition 2 and Corollary

Proposition 2 leverages a Grönwall-type inequality to show that the terminal error grows at most exponentially with the Lipschitz constants, and is directly bounded by the integrated score error.
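
To sketch the argument (again in placeholder notation: \(f^\ast\) is the exact velocity/score field, assumed \(L\)-Lipschitz in the state; \(\hat{f}\) is its learned estimate with instantaneous error at most \(\delta(t)\); and \(e(t) = \|x_t - x_t^\ast\|\) is the deviation between the generated and reference trajectories):

\[
\dot{e}(t) \le L\, e(t) + \delta(t)
\quad \Longrightarrow \quad
e(T) \le e^{L T} e(0) + \int_0^T e^{L (T - s)}\, \delta(s)\, \mathrm{d}s ,
\]

so the terminal error is controlled by the integrated score error, amplified by at most an exponential factor in the Lipschitz constant.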

Corollary: Tighter Terminal-state Bound

Corollary Figure

Corollary 1. Once function-level improvement is established by obtaining an optimal weight (Prop. 1), stability ensures this advantage propagates along the trajectory (Prop. 2), making convex score composition provably superior to relying on individual scores.
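
Chaining the two results (a schematic sketch in the same placeholder notation, assuming zero initial error and a weight chosen so that the per-step score error of the composition, \(\delta_{w^\ast}(s)\), never exceeds the smaller of the individual errors \(\delta_i(s)\)):

\[
e_{w^\ast}(T) \le \int_0^T e^{L (T - s)}\, \delta_{w^\ast}(s)\, \mathrm{d}s
\le \int_0^T e^{L (T - s)}\, \min_i \delta_i(s)\, \mathrm{d}s
\le \min_i \int_0^T e^{L (T - s)}\, \delta_i(s)\, \mathrm{d}s ,
\]

i.e. the terminal-state bound for the composed policy is at least as tight as the bound obtained from either individual score.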

Method

Overview

Overview of General Policy Composition (GPC)

Overview of our proposed General Policy Composition. By combining distributional scores from pre-trained diffusion-based or flow-based policies under different conditions (e.g., visual modalities and network backbones), GPC generates expressive and adaptable action trajectories through convex score combination without additional training.
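
To make the test-time procedure concrete, below is a minimal, framework-agnostic sketch of convex score composition inside an Euler-style integration loop. The interface (the gpc_sample function, its predict_a/predict_b callables, and the toy predictors) is hypothetical and not the authors' released code; it only illustrates mixing two policies' score/velocity predictions with a convex weight at every step.

import numpy as np

def gpc_sample(predict_a, predict_b, obs_a, obs_b, w=0.5,
               action_shape=(16, 7), num_steps=20, seed=0):
    """Sketch of training-free convex score composition (GPC-style).

    predict_a / predict_b: callables (noisy_action, t, obs) -> velocity/score
        estimate with the same shape as noisy_action.
    obs_a / obs_b: each policy's own conditioning (e.g. image vs. point cloud).
    w: convex weight on policy A (1 - w goes to policy B).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(action_shape)        # start from Gaussian noise
    ts = np.linspace(0.0, 1.0, num_steps + 1)    # flow-matching time: noise -> data
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v_a = predict_a(x, t_cur, obs_a)         # policy A's estimate
        v_b = predict_b(x, t_cur, obs_b)         # policy B's estimate
        v = w * v_a + (1.0 - w) * v_b            # convex score/velocity combination
        x = x + (t_next - t_cur) * v             # Euler integration step
    return x                                     # composed action trajectory

# Toy usage with stand-in predictors (real policies would be pre-trained networks):
if __name__ == "__main__":
    fake_a = lambda x, t, obs: -x                # velocity field attracted to 0
    fake_b = lambda x, t, obs: 1.0 - x           # velocity field attracted to 1
    actions = gpc_sample(fake_a, fake_b, obs_a=None, obs_b=None, w=0.7)
    print(actions.mean())                        # drifts toward a point between 0 and 1

In practice, predict_a and predict_b would wrap pre-trained networks (e.g., an image-conditioned and a point-cloud-conditioned policy), each receiving its own observation, while the action trajectory being denoised is shared between them.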

Experiments

Main Results

Task: Robomimic and PushT

Experiment Settings: There are two variations, and we run 200 trials for each variation: (1) Base Diffusion-based and Flow-based Policies (e.g., DP, Flow Policy, π0). The policies are pre-trained using their original codebases. (2) Composed Policies via GPC. Policies are combined via GPC at test time.

Score Metric: We calculate the average success rate.

Table I: Experiment results on Robomimic and PushT.
The table reports success rate (%) ↑. GPC yields a noticeable average improvement over the base policies.

Method | Generative Mode | Model Type | Robomimic Can | Robomimic Lift | Robomimic Square | PushT | Avg.
Base Policies
Diffusion Policy (DP) | Diffusion | VA | 34.50 | 98.50 | 2.00 | 21.75 | 39.19
Mamba Policy (MP) | Diffusion | VA | 5.00 | 98.50 | 3.00 | 12.06 | 29.64
Flow Policy (FP) | Flow Matching | VA | 95.00 | 13.00 | 77.50 | 54.25 | 59.94
Florence Policy-D | Diffusion | VLA | 61.50 | 97.00 | 46.50 | 40.00 | 61.25
Florence Policy-F | Flow Matching | VLA | 89.00 | 98.50 | 88.50 | 39.38 | 78.84
π0 | Flow Matching | VLA | 96.50 | 99.00 | 92.50 | 57.69 | 86.42
Composed Policies via Convex Score Combination
DP + MP | Diffusion | VA & VA | 34.50 | 99.50 | 8.00 | 23.63 | 41.41 (+2.22%)
Florence Policy-D + DP | Diffusion | VLA & VA | 62.50 | 100.00 | 61.50 | 43.06 | 66.76 (+5.51%)
Florence Policy-D + MP | Diffusion | VLA & VA | 63.00 | 100.00 | 54.50 | 40.88 | 64.60 (+3.35%)
Florence Policy-F + FP | Flow Matching | VLA & VA | 98.50 | 98.50 | 92.50 | 56.06 | 86.39 (+7.55%)
π0 + FP | Flow Matching | VLA & VA | 99.50 | 100.00 | 94.00 | 62.25 | 88.94 (+2.52%)

Task: RoboTwin 2.0

Experiment Settings: There are two variations, and we run 100 trials for each variation: (1) Base Diffusion-based Policies (e.g., DP, DP3, RDT). (2) Composed Policies via GPC.

Score Metric: We calculate the average success rate.

Table II: Experiment results on RoboTwin with 6 diverse bimanual manipulation tasks.
GPC achieves a clear improvement of up to 7% in success rate.

Method | Model Type | Hanging Mug | Open Laptop | Place Burger Fries | Put Object Cabinet | Stack Bowls Three | Turn Switch | Average
Base Policies
DPimg | VA | 0.10 | 0.74 | 0.49 | 0.56 | 0.52 | 0.38 | 0.46
DPpcd | VA | 0.21 | 0.93 | 0.72 | 0.71 | 0.64 | 0.71 | 0.65
RDT | VLA | 0.13 | 0.69 | 0.46 | 0.32 | 0.47 | 0.30 | 0.40
Composed Policies via Convex Score Combination
DPimg + DPpcd | VA & VA | 0.23 | 0.93 | 0.78 | 0.82 | 0.71 | 0.71 | 0.70 (+5%)
RDT + DPimg | VLA & VA | 0.18 | 0.80 | 0.57 | 0.59 | 0.66 | 0.38 | 0.53 (+7%)
RDT + DPpcd | VLA & VA | 0.36 | 0.94 | 0.83 | 0.78 | 0.73 | 0.71 | 0.72 (+7%)


Visualization of Real-World Experiments

Experiment Settings: There are two variations, and we run 20 trials for each policy for evaluation: (1) Base Diffusion-based Policies (e.g., DP, DP3). (2) Composed Policies via GPC.

Task: Clean Table

DP fails ❌

DP3 fails ❌

GPC succeeds ✅

Task: Place Bottles

DP fails ❌

DP3 succeeds ✅

GPC succeeds ✅

Task: Hang Mug

DP succeeds ✅

DP3 fails ❌

GPC succeeds ✅

Task: Punch Holes

DP fails ❌

DP3 fails ❌

GPC succeeds ✅


Influence of Weight Configurations on GPC Performance

To analyse how different weight configurations influence the performance of GPC across various scenarios, we evaluate GPC on multiple tasks under the weight settings listed below.

Table III: Experiment results of our method under different composition configurations.
These results highlight GPC's versatility and the importance of weight tuning across policies.

Scenario | Task | DPimg | DPpcd | Weight of DPimg in GPC: 0.1* | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | Δ
Both Policies Perform Well | Empty Cup Place | 0.42 | 0.62 | 0.70 | 0.86 | 0.84 | 0.86 | 0.84 | 0.84 | 0.76 | 0.68 | 0.61 | +24%
Both Policies Perform Well | Dual Bottles Pick (Hard) | 0.49 | 0.64 | 0.69 | 0.63 | 0.71 | 0.66 | 0.64 | 0.65 | 0.63 | 0.56 | 0.58 | +7%
Both Policies Perform Well | Shoe Place | 0.37 | 0.36 | 0.47 | 0.52 | 0.56 | 0.59 | 0.60 | 0.59 | 0.59 | 0.53 | 0.41 | +23%
Both Policies Perform Poorly | Dual Shoes Place | 0.08 | 0.23 | 0.19 | 0.17 | 0.19 | 0.20 | 0.20 | 0.17 | 0.16 | 0.14 | 0.09 | +0%
Both Policies Perform Poorly | Pick Apple Messy | 0.05 | 0.26 | 0.25 | 0.17 | 0.21 | 0.15 | 0.13 | 0.08 | 0.08 | 0.06 | 0.08 | +0%
Policy A > Policy B | Dual Bottles Pick (Easy) | 0.77 | 0.36 | 0.52 | 0.64 | 0.70 | 0.75 | 0.82 | 0.81 | 0.80 | 0.85 | 0.80 | +8%
Policy A < Policy B | Block Hammer Beat | 0.00 | 0.76 | 0.61 | 0.30 | 0.18 | 0.15 | 0.12 | 0.07 | 0.00 | 0.00 | 0.00 | +0%

* The weights {0.1, …, 0.9} denote the weight w₁ assigned to DPimg, with w₂ = 1 − w₁ assigned to DPpcd; the corresponding noise estimate of GPC is ε̂_M* = w₁·ε_DPimg + w₂·ε_DPpcd. When w₁ = 0.0 or 1.0, GPC degenerates into DPpcd or DPimg, respectively.
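
A rough sketch of how such a weight sweep can be automated at test time is given below (the search_composition_weight helper and the toy evaluator are hypothetical, not the authors' tooling; a real evaluator would roll out the composed policy for a batch of trials at each candidate weight):

import numpy as np

def search_composition_weight(evaluate_success_rate, weights=np.linspace(0.1, 0.9, 9)):
    """Grid-search the convex weight w1 of DP_img (w2 = 1 - w1 goes to DP_pcd).

    evaluate_success_rate: callable w1 -> estimated success rate of the
        composed policy when rolled out with that weight.
    Returns the best weight, its score, and the full score table.
    """
    scores = {float(w): float(evaluate_success_rate(w)) for w in weights}
    best_w = max(scores, key=scores.get)
    return best_w, scores[best_w], scores

# Toy stand-in: a smooth curve peaking near w1 = 0.4, loosely mimicking the
# "Empty Cup Place" row of Table III; a real evaluator would run rollouts.
if __name__ == "__main__":
    toy_eval = lambda w1: 0.86 - 2.0 * (w1 - 0.4) ** 2
    best_w, best_sr, _ = search_composition_weight(toy_eval)
    print(f"best w1 = {best_w:.1f}, estimated success rate = {best_sr:.2f}")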

Several findings are summarized:

Finding 1: When both policies have moderate accuracy (e.g., >30%), GPC often achieves higher accuracy than either base policy under appropriate weight configurations. This improvement suggests that composing diffusion scores captures a more generalized distribution that reduces reliance on specific conditions, consistent with the theoretical advantages of compositional models.

Finding 2: When one policy has significantly lower accuracy, GPC struggles to surpass the accuracy of the better-performing base policy. This suggests that low-accuracy scores from weaker modalities can significantly degrade the joint distribution, diminishing the overall performance of the composed policy.

Finding 3: The improvement of GPC is consistently maximized when the better-performing base policy receives the larger weight. This highlights the necessity of assigning higher weights to the better-performing distribution to maximize the effectiveness of GPC, guiding the composed policy toward consensus.

These findings highlight GPC's versatility in leveraging the strengths of different conditions and the importance of appropriately tuning weights to each policy's performance.


Analysis of GPC's Superiority via Visualization

Visual Analysis of GPC under Different Compositions

Visual Analysis of GPC under Different Compositions. GPC generalizes across (a) modalities and (b) architectures, with appropriate weighting yielding accurate action distributions and higher success rates than the individual policies.

BibTeX

@article{cao2025gpc,
      title={Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition}, 
      author={Jiahang Cao and Yize Huang and Hanzhong Guo and Rui Zhang and Mu Nan and Weijian Mai and Jiaxu Wang and Hao Cheng and Jingkai Sun and Gang Han and Wen Zhao and Qiang Zhang and Yijie Guo and Qihao Zheng and Chunfeng Song and Xiao Li and Ping Luo and Andrew F. Luo},
      journal={arXiv preprint arXiv:2510.01068},
      year={2025},
      url={https://arxiv.org/abs/2510.01068}, 
}