Hierarchical Policy Learning via Causal Spectral Decomposition

Abstract

We identify a semantic decomposition in robot action sequences, separating task-level motion intent from execution-level refinements. By analyzing actions in the spectral domain using the discrete cosine transform (DCT), we observe that low-frequency components capture global motion trajectories, while high-frequency components encode precise timing, alignment, and contact behaviors. Motivated by this structure, we propose Causal Spectral Policy (CSP), which models action generation as a causal coarse-to-fine process: coarse motion is predicted from observation and language, and fine corrections are generated conditionally on the realized trajectory. Across simulation and real-world evaluations, CSP consistently outperforms strong baselines on precision-sensitive manipulation tasks. Additionally, we propose human-inspired teleoperation noise injection as a data augmentation method under which our approach demonstrates strong robustness to noisy demonstrations.

Architecture

We model action generation as a coarse-to-fine process across temporal scales. For a coarse-to-fine factorization to be effective, the action representation must make temporal structure at different resolutions explicit. Following this motivation, we represent actions in the spectral domain using the discrete cosine transform (DCT). We train two separate predictors to model the low- and high-frequency spectral components and run them sequentially at inference time. This factorization reflects the structure of manipulation behaviors: global intent is driven by the language task command, while execution-level refinement is driven by the coarse trajectory.
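
The factorization described above can be written out as follows. This is a sketch under assumed notation, not notation taken from the paper: a_0, ..., a_{N-1} is one action dimension of a chunk of length N, c_k are its orthonormal DCT-II coefficients, \kappa is the low/high cutoff index, o is the observation, and \ell is the language command. Conditioning the fine stage on o in addition to the coarse coefficients is an assumption; the text only states that fine corrections are conditioned on the coarse trajectory.

```latex
% Orthonormal DCT-II of one action dimension (assumed convention)
c_k = s_k \sum_{n=0}^{N-1} a_n \cos\!\left[\frac{\pi}{N}\left(n + \tfrac{1}{2}\right) k\right],
\qquad s_0 = \sqrt{1/N}, \quad s_k = \sqrt{2/N} \ \text{for } k \ge 1

% Split at the (assumed) cutoff index \kappa
c^{\mathrm{low}} = (c_0, \dots, c_{\kappa-1}), \qquad
c^{\mathrm{high}} = (c_\kappa, \dots, c_{N-1})

% Coarse-to-fine factorization: coarse first, fine conditioned on coarse
p(c \mid o, \ell) = p\!\left(c^{\mathrm{low}} \mid o, \ell\right)\,
                    p\!\left(c^{\mathrm{high}} \mid c^{\mathrm{low}}, o\right)
```

Because the transform is orthonormal, the executed action chunk is recovered by applying the inverse DCT to the concatenation of the two predicted coefficient groups.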

Spectral Action Frequencies Contain Semantic Meaning

To evaluate the semantic structure of frequency-domain actions, we collect a single demonstration of placing a dart onto a target. We progressively zero out high-frequency coefficients beyond a cutoff λ and replay the reconstructed trajectories, each with different frequency content.

λ = 40

λ = 48

λ = 50

λ = 56

As more high-frequency content is removed, the robot still follows the coarse trajectory towards the target region but exhibits increasing misalignment with the intended target pose. With enough ablation, as at λ = 56, the robot fails to make contact with the target at all.
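
The ablation above can be reproduced in miniature on a synthetic trajectory. The sketch below is not the authors' code: it assumes an orthonormal DCT-II, a made-up 1-D trajectory of length 64, and a hypothetical parameter `keep` meaning "number of low-frequency coefficients retained", which may not match the page's convention for λ. It only illustrates that removing more high-frequency coefficients preserves the overall reach while increasing the reconstruction error.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out

def idct(c):
    """Inverse of dct(): the transpose of the orthonormal DCT-II matrix."""
    n = len(c)
    out = []
    for i in range(n):
        out.append(sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c[k]
                       * math.cos(math.pi * (i + 0.5) * k / n)
                       for k in range(n)))
    return out

def truncate(x, keep):
    """Zero every DCT coefficient with index >= keep, then reconstruct."""
    coeffs = dct(x)
    return idct([v if k < keep else 0.0 for k, v in enumerate(coeffs)])

def rms_error(x, y):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Synthetic trajectory (made up): a smooth reach plus a late fine adjustment.
N = 64
traj = [n / (N - 1)
        + (0.05 * math.sin(2 * math.pi * 10 * n / N) if n >= 48 else 0.0)
        for n in range(N)]

# Reconstruction error for progressively stronger truncation.
errors = {keep: rms_error(traj, truncate(traj, keep)) for keep in (64, 32, 16, 8)}
for keep, err in errors.items():
    print(f"keep={keep:2d}  rms_error={err:.5f}")
```

With an orthonormal transform, the squared error equals the energy of the dropped coefficients, so the error can only grow as `keep` shrinks.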

Evaluation and Results

We evaluate CSP against baselines in simulation with different action chunk sizes. Success rates (%) on LIBERO and MimicGen-style tasks are averaged across 2 seeds. While longer chunk lengths are preferred for multimodal consistency, current methods only work with short chunk lengths on precision tasks. The results indicate that CSP better preserves performance as the prediction look-ahead increases.

Custom Baselines

Frequency Autoregressive

Action Binning

Ablations

Frequency Diffusion

No Hierarchy

Performance on Real Robot

Real-robot experiments were conducted on a Franka Emika Panda equipped with a fixed and a wrist-mounted Logitech RGB camera. Across all tasks, CSP consistently improves execution accuracy, achieving the highest success rate on all precision-critical tasks. These results provide further evidence that coarse-to-fine spectral decomposition is effective beyond simulation, improving performance on precision-sensitive and noisy manipulation tasks.

Videos: four Baseline and CSP rollout pairs.

Robustness to Noise

To evaluate robustness to suboptimal demonstrations, we inject structured noise that approximates human teleoperation behavior into the training demonstrations. In the video gallery below, the left video in each pair is the baseline and the right has 'large' high-frequency structured noise applied to the actions.

We inject noise at varying magnitudes, as shown in the PCA visualization of action trajectories under different noise levels below. Small noise causes limited dispersion, while large noise leads to substantially greater variation.

PCA of Actions with Different Noise Magnitudes
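
The page does not specify the noise model, so the following is only an illustrative sketch of one way to add band-limited, high-frequency perturbations to an action sequence. Everything here is an assumption: the function names `high_freq_noise` and `inject_noise`, the cosine-mode construction, the cutoff of 8 on a length-32 chunk, and the `sigma` values standing in for "small" and "large" noise.

```python
import math
import random

def high_freq_noise(length, cutoff, sigma, rng):
    """Sample noise composed only of DCT cosine modes with index >= cutoff.

    Each mode gets an independent Gaussian amplitude with std `sigma`
    (an assumed noise model, not the one used on this page).
    """
    modes = list(range(cutoff, length))
    amps = [rng.gauss(0.0, sigma) for _ in modes]
    scale = math.sqrt(2.0 / length)
    return [
        scale * sum(a * math.cos(math.pi * (n + 0.5) * k / length)
                    for k, a in zip(modes, amps))
        for n in range(length)
    ]

def inject_noise(actions, cutoff, sigma, seed=0):
    """Return a copy of `actions` with high-frequency noise added."""
    rng = random.Random(seed)
    noise = high_freq_noise(len(actions), cutoff, sigma, rng)
    return [a + e for a, e in zip(actions, noise)]

# Made-up clean action sequence: a linear reach over one 32-step chunk.
clean = [n / 31 for n in range(32)]
small = inject_noise(clean, cutoff=8, sigma=0.01)  # hypothetical "small" level
large = inject_noise(clean, cutoff=8, sigma=0.05)  # hypothetical "large" level
```

Because only modes with index 1 or higher are perturbed, the added noise sums to zero over the chunk, so the mean action is unchanged while its fine temporal structure is disturbed.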

Additional Experiments

To investigate CSP's structural coarse-to-fine dependency, we run additional ablations that isolate each design decision. Removing language from the coarse trajectory, removing coarse conditioning in fine-stage generation, and reversing the dependency to fine-to-coarse all degrade performance. This indicates that there is a directional causal link between the coarse and fine components.

Low/High Frequency Split Sweep

To understand the importance of the cutoff frequency between low- and high-frequency components, we sweep this hyperparameter over a range from 2 to 30 for a chunk size of 32. Across all tasks, performance peaks at a task-specific cutoff between 6 and 10. If the cutoff is set lower, the coarse trajectory may not be expressive enough to represent the action chunk. If it is set higher, less information is retained for precise high-frequency alignment through the causal link between model trunks.

Data Collection Method KL Divergence

We collect 20 teleop demos per task using keyboard and SpaceMouse and compare action distributions via KL divergence in action and frequency space (with multiple chunk sizes). Synthetic noise consistently reduces divergence to teleop data (mean Action KL 10.57→7.93) and better matches temporal patterns (Freq KL 0.61→0.50, 0.70→0.57, 0.94→0.75), indicating improved alignment with human motion variability.
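
For readers unfamiliar with the comparison, the sketch below shows one common way to estimate a KL divergence between two sets of scalar samples using smoothed histograms. It is an assumption-laden illustration, not the procedure used for the numbers above: the estimator, bin count, smoothing constant, KL direction, and the sample values are all made up for the example.

```python
import math

def histogram(samples, lo, hi, bins, eps=1e-6):
    """Normalized histogram with additive smoothing so no bin is exactly zero."""
    counts = [eps] * bins
    width = (hi - lo) / bins
    for s in samples:
        # Clamp out-of-range samples into the first or last bin.
        idx = min(bins - 1, max(0, int((s - lo) / width)))
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions defined on the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Made-up scalar action values for illustration only.
teleop = [0.0, 0.1, 0.1, 0.2, 0.3, 0.3, 0.4, 0.6, 0.7, 0.9]
scripted = [0.0, 0.0, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.3, 0.3]

p = histogram(teleop, 0.0, 1.0, bins=5)
q = histogram(scripted, 0.0, 1.0, bins=5)
kl = kl_divergence(p, q)
print(f"KL(teleop || scripted) = {kl:.3f}")
```

The same estimator could be applied to DCT coefficients of action chunks instead of raw action values to obtain a frequency-space comparison.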

BibTeX


BibTeX pending anonymous review