Unrolling the Action Manifold

Visuomotor Policy Learning via Recursive Cascades

A 19M-parameter policy inspired by human hierarchical cognitive control that frames action generation as recursive temporal infilling directly within explicit physical space.

The idea in one breath

Sketch globally, refine recursively.

Robot policies usually trade speed for stability: autoregressive models are slow and drift over time, while diffusion models are accurate but expensive. RCP borrows from how the brain plans movement: it first sketches a coarse trajectory from a few anchor points, then recursively fills in the gaps with one small, weight-shared Transformer. Because the in-between segments are independent, RCP decodes them in parallel, staying fast and stable with just 19M parameters.

Abstract

Current generative visuomotor policies often face a trade-off between inference speed and execution stability. Autoregressive architectures are constrained by sequential decoding latency and compounding errors, whereas diffusion models demand computationally expensive iterative denoising. We introduce the Recursive Cascade Policy (RCP), an architecture inspired by human hierarchical cognitive control that frames action generation as recursive temporal infilling directly within explicit physical action space. A single weight-shared Transformer first predicts sparse boundary anchors to sketch a global trajectory, then recursively populates the temporal gaps. By conditioning each step on its immediate geometric anchors and a hierarchically propagated latent, RCP renders intermediate trajectory segments conditionally independent. This structural prior restricts the attention mechanism to a constant-length context, decoupling attention complexity from the total action horizon and unlocking highly efficient parallel decoding. With only 19M parameters, less than one-quarter the size of ACT and Diffusion Policy, RCP achieves superior performance across diverse simulation and real-world tasks. By grounding generation in local geometric boundaries rather than unbounded histories, RCP maintains robust temporal coherence, demonstrating recursive cascades as an efficient and scalable approach for robotic policies.
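
One way to read the conditional-independence claim is as a factorization of the action distribution. The notation below is assumed for illustration and does not come from the source: $a_{1:T}$ is the dense action chunk, $o$ the observation tokens, $A^{(0)}$ the Level-0 anchors, $S^{(\ell)}_k$ the waypoints infilled in segment $k$ at level $\ell$, $a^{(\ell-1)}_k$ and $a^{(\ell-1)}_{k+1}$ the two anchors bounding that segment, $z^{(\ell-1)}_k$ the latent propagated from the parent level, and $\theta$ the weights shared across all levels.

```latex
p_\theta\!\left(a_{1:T} \mid o\right)
  = p_\theta\!\left(A^{(0)} \mid o\right)
    \prod_{\ell=1}^{N} \prod_{k=1}^{K_\ell}
    p_\theta\!\left(S^{(\ell)}_{k} \,\middle|\,
      a^{(\ell-1)}_{k},\; a^{(\ell-1)}_{k+1},\; z^{(\ell-1)}_{k},\; o\right)
```

Under this reading, each factor conditions on a fixed number of tokens regardless of the horizon $T$, and the $K_\ell$ factors within one level share no variables, so they can be evaluated in a single batched pass.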

Results at a glance

Small model, high-precision control.

  • 19M parameters: 4–5× smaller than ACT / DP
  • 93% ManiSkill 3 avg, with 10× less data
  • 43% RoboTwin 2.0 avg: beats π₀ (38.2%)
  • 58% real-world avg: +19% over baselines
  • throughput @16-chunk: ~88× faster than DP

How RCP works

One shared generator, applied recursively.

A single 6M-parameter generator is reused at every level of a hierarchical action tree, from a sparse global sketch down to dense, executable motion.

RCP architecture: observation encoder feeds a weight-shared recursive generator that instantiates a hierarchical action tree from Level 0 to Level N.
Observation encoding (left) conditions a weight-shared Recursive Shared Generator that instantiates actions across increasing temporal resolution (right): Level 0 sets global intent; deeper levels recursively infill motor detail.
  1. Observe
     A ResNet encodes the camera image; proprioception is tokenized; self-attention fuses them into unified observation tokens.

  2. Sketch · Level 0
     The shared generator predicts a sparse set of boundary anchors: a global trajectory sketch of the whole motion.

  3. Recurse
     Each adjacent anchor pair becomes new start/end tokens; the same generator infills finer waypoints between them. Repeat to Level N.

  4. Parallel decode
     Same-level segments are conditionally independent, so they're generated in one batched pass: O(L) latency, constant-length attention.
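
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_generator` is a hypothetical stand-in that linearly interpolates between anchors (the real generator is a learned, weight-shared Transformer), each segment receives one new waypoint per level (`n_inner=1` is an assumed branching factor), and the observation tokens and the propagated latent are omitted entirely.

```python
import numpy as np


def toy_generator(start, end, n_inner):
    """Hypothetical stand-in for the shared generator.

    start, end: (batch, action_dim) boundary anchors, one row per segment.
    Returns (batch, n_inner, action_dim) waypoints strictly between them.
    Linear interpolation only; a learned model would go here.
    """
    # Fractions strictly inside (0, 1); n_inner=1 gives [0.5].
    t = np.linspace(0.0, 1.0, n_inner + 2)[1:-1]
    return start[:, None, :] + t[None, :, None] * (end - start)[:, None, :]


def recursive_cascade(anchors, n_levels, n_inner=1, generator=toy_generator):
    """Expand sparse Level-0 anchors into a dense trajectory.

    anchors: (K, action_dim) array of Level-0 anchors.
    Each level issues ONE batched generator call covering every segment,
    so the number of sequential calls equals n_levels.
    """
    traj = np.asarray(anchors, dtype=float)
    dim = traj.shape[1]
    for _ in range(n_levels):
        starts, ends = traj[:-1], traj[1:]        # all adjacent anchor pairs
        inner = generator(starts, ends, n_inner)  # (K-1, n_inner, dim)
        # Interleave: each segment start followed by its infilled waypoints.
        blocks = np.concatenate([starts[:, None, :], inner], axis=1)
        traj = np.concatenate([blocks.reshape(-1, dim), traj[-1:]], axis=0)
    return traj


# Three Level-0 anchors in a 2D action space (made-up values).
level0 = np.array([[0.0, 0.0], [4.0, 2.0], [8.0, 0.0]])
dense = recursive_cascade(level0, n_levels=2)
print(dense.shape)
```

With 3 Level-0 anchors and 2 levels, the sketch produces 9 waypoints from only two generator calls, one per level, and the original anchors survive unchanged at indices 0, 4, and 8. Each call sees only a segment's two boundary anchors, which is the sense in which the context length stays constant as the horizon grows.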

Looking inside the cascade

What the recursion actually learns.

Hierarchical action instantiation

A predicted trajectory unrolled across recursive levels. The model first places a few sparse global anchors (rose) that outline the macroscopic motion, then recursively infills dense local actions (slate) in the gaps between them, making the gist-to-detail refinement explicit in physical space.

2D trajectory: sparse global anchors expand into a dense local path.

Latent action space (UMAP)

A UMAP projection of the generator's latents. Sparse global-intent nodes flow along a consistent refinement direction into the dense manifold of local actions (inset): direct evidence that RCP progressively grounds high-level intent into precise, executable motion.

UMAP of latent actions showing global intents refining into dense local manifolds.
Experiments

Better control, across sim and the real world.

RoboTwin 2.0: bimanual manipulation (success rate)

Task                Diffusion Policy  ACT   π₀    RDT   RCP (ours)
beat block hammer   .42               .56   .43   .77   .81
hanging mug         .08               .07   .11   .23   .31
place dual shoes    .08               .09   .15   .04   .20
place phone stand   .13               .02   .35   .15   .42
place bread basket  .14               .06   .17   .10   .33
put object cabinet  .42               .15   .68   .33   .40
place fan           .03               .01   .20   .12   .26
place cont. plate   .41               .72   .88   .78   .81
move can pot        .39               .22   .58   .25   .54
pick div. bottles   .05               .07   .27   .02   .24
Avg                 .22               .20   .38   .28   .43
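
The Avg figures can be recomputed from the per-task numbers. The short script below copies the ten per-task success rates for each method from the RoboTwin 2.0 results and takes an unweighted mean; the dictionary layout and the ASCII key `pi0` (for π₀) are choices made for this sketch only.

```python
# Per-task success rates copied from the RoboTwin 2.0 results (10 tasks each),
# in the order: beat block hammer, hanging mug, place dual shoes,
# place phone stand, place bread basket, put object cabinet, place fan,
# place cont. plate, move can pot, pick div. bottles.
scores = {
    "Diffusion Policy": [.42, .08, .08, .13, .14, .42, .03, .41, .39, .05],
    "ACT":              [.56, .07, .09, .02, .06, .15, .01, .72, .22, .07],
    "pi0":              [.43, .11, .15, .35, .17, .68, .20, .88, .58, .27],
    "RDT":              [.77, .23, .04, .15, .10, .33, .12, .78, .25, .02],
    "RCP (ours)":       [.81, .31, .20, .42, .33, .40, .26, .81, .54, .24],
}

# Unweighted mean over tasks, kept to three decimals.
averages = {name: round(sum(v) / len(v), 3) for name, v in scores.items()}
best = max(averages, key=averages.get)

for name, avg in averages.items():
    print(f"{name:18s}{avg:.3f}")
print("best:", best)
```

The recomputed means are 0.215, 0.197, 0.382, 0.279, and 0.432, which round to the reported Avg values; the 0.382 for π₀ matches the 38.2% quoted in the highlights.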

ManiSkill 3 (success rate)

Method            Demos  PickCube  PushCube  StackCube  Avg
BC                100    .00       .00       .00        .00
BC                1000   .03       .81       .00        .28
ACT               100    .28       .30       .33        .30
ACT               1000   .98       .89       .80        .89
Diffusion Policy  100    .76       .41       .61        .59
Diffusion Policy  1000   1.00      .86       .81        .89
DP + VGGT         1000   .96       .91       .65        .84
PAR               1000   .73       1.00      .48        .74
OpenVLA           1000   .08       .08       .08        .08
Octo              1000   .00       .00       .00        .00
RDT               1000   .77       1.00      .74        .84
RCP (ours)        100    .79       1.00      .99        .93

Comprehensive evaluation

Five panels: (a) success vs multi-scale baselines · (b) throughput scaling with chunk size · (c) horizon-scaling robustness · (d) image vs point-cloud modality · (e) capacity scaling (12.8M → 19.8M → 70.3M).

Real-world: TienKung 2.0 dual-arm humanoid

Real-world success-rate bar chart across four tasks with per-method numbers; RCP highest overall.
Per-task success rates on the dual-arm humanoid. RCP dominates monolithic (ACT, DP) and multi-scale (CARP) baselines: 50% Stack Bowl, 65% Transfer Tape, 35% Weigh Apple, 80% Open Drawer, 58% average.
Real-world rollouts

Watch RCP vs. the baselines.

Same task, same robot, with both successes and failures shown. Clips are from the TienKung 2.0 dual-arm humanoid.

Qualitative results

Simulation rollouts.

RCP executing the benchmark tasks in RoboTwin 2.0 and ManiSkill 3, the same tasks scored in the tables above.