Decoupled Q-Chunking

UC Berkeley
teaser figure

TL;DR

Decoupled Q-chunking improves upon Q-chunking by decoupling the chunk size of the policy from that of the critic. Policies with short action chunks are easier to learn and critics with long action chunks speed up value learning.

Table of Contents

> Part A | Theory — Why decoupling?

> Part B | Practical Algorithm — DQC

> Part C | Experiments

Background: A primer on n-step return and Q-chunking

Our work focuses on Q-learning, a popular off-policy, value-based family of reinforcement learning (RL) algorithms that learn a state-action value function, typically via temporal difference (TD) backup. The most standard choice is the 1-step TD backup:

Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma Q(s_{t+1}, a_{t+1} \sim \pi(\cdot \mid s_{t+1}))

where the value of a state-action pair is updated towards the immediate reward plus the discounted value estimate one step into the future.
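As a concrete illustration, here is a minimal numpy sketch of computing this 1-step TD target. The tabular `Q` and greedy `policy` are hypothetical stand-ins for illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def td1_target(r, s_next, q_fn, policy, gamma=0.99):
    """One-step TD target: r + gamma * Q(s', a' ~ pi(.|s'))."""
    a_next = policy(s_next)
    return r + gamma * q_fn(s_next, a_next)

# Toy stand-ins: a tabular Q over 5 states / 2 actions, greedy policy.
Q = rng.normal(size=(5, 2))              # Q[s, a]
q_fn = lambda s, a: Q[s, a]
policy = lambda s: int(np.argmax(Q[s]))  # greedy w.r.t. Q

target = td1_target(r=1.0, s_next=3, q_fn=q_fn, policy=policy)
```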

For long-horizon sparse-reward tasks, it is often beneficial to use n-step return backup instead to speed up value learning:

Q(s_t, a_t) \leftarrow \sum_{t'=t}^{t+n-1} \gamma^{t'-t}r(s_{t'}, a_{t'}) + \gamma^n Q(s_{t+n}, a_{t+n} \sim \pi(\cdot \mid s_{t+n}))

However, n-step return backup uses off-policy trajectories to estimate the return, which can be overly pessimistic when the data distribution is sub-optimal.
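The n-step target above can be sketched in the same toy numpy setting (the sparse reward, constant `Q`, and `policy` below are hypothetical stand-ins chosen for illustration):

```python
import numpy as np

def nstep_target(rewards, s_n, q_fn, policy, gamma=0.99):
    """n-step TD target: sum_{k=0}^{n-1} gamma^k r_{t+k}
    plus gamma^n Q(s_{t+n}, a_{t+n} ~ pi)."""
    n = len(rewards)
    ret = sum(gamma**k * r for k, r in enumerate(rewards))
    return ret + gamma**n * q_fn(s_n, policy(s_n))

# Toy stand-ins: sparse reward arrives only at the last of n = 3 steps.
Q = np.ones((5, 2))
q_fn = lambda s, a: Q[s, a]
policy = lambda s: 0
target = nstep_target(rewards=[0.0, 0.0, 1.0], s_n=4,
                      q_fn=q_fn, policy=policy)
```

Because the reward three steps ahead enters the target directly, a sparse reward propagates backwards n times faster than with the 1-step backup.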

More recently, Q-chunking (QC) (Li et al., 2025) was proposed as an alternative that avoids this off-policy bias for n-step return backup. This is achieved by grouping short sequences of actions as chunks and learning a chunked critic to estimate the value of carrying out a 'chunk' of actions in the n-step horizon rather than a single action:

Q(s_t, {\color[rgb]{0.7929,0.1602,0.4804}a_{t:t+h}}) \leftarrow \sum_{t'=t}^{t+h-1} \gamma^{t'-t}r(s_{t'}, a_{t'}) + \gamma^h Q(s_{t+h}, {\color[rgb]{0.7929,0.1602,0.4804}a_{t+h:t+2h}} \sim \pi(\cdot \mid s_{t+h}))

The chunked critic allows QC to directly capture the value of the action chunks, which does not suffer from the off-policy bias.
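The only structural change from the n-step backup is that the critic now takes an action chunk as input and bootstraps with the value of the next chunk. A minimal numpy sketch, with a constant chunked critic and policy as hypothetical stand-ins:

```python
import numpy as np

def chunked_td_target(rewards, s_h, chunk_q, chunk_policy, gamma=0.99):
    """Q-chunking target: h dataset rewards, then bootstrap with the
    value of the *next* action chunk sampled from the chunked policy."""
    h = len(rewards)
    ret = sum(gamma**k * r for k, r in enumerate(rewards))
    next_chunk = chunk_policy(s_h)          # a_{t+h : t+2h}
    return ret + gamma**h * chunk_q(s_h, next_chunk)

# Hypothetical stand-ins: chunk size h = 4, constant critic value 2.0.
chunk_q = lambda s, chunk: 2.0
chunk_policy = lambda s: np.zeros(4)
target = chunked_td_target([0.0, 0.0, 0.0, 1.0], s_h=7,
                           chunk_q=chunk_q, chunk_policy=chunk_policy)
```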

Part A: Why decoupling?

In Q-chunking (QC), the chunked critic is learned with n-step return backups, but there is a problem—the learned value function may not converge to the correct value due to a subtle discrepancy:

What QC should be doing...

\begin{align}
Q(s_t, a_{t:t+h}) \leftarrow \mathbb{E}_{\color[rgb]{0.7929,0.1602,0.4804} s_{t'+1} \sim T(\cdot \mid s_{t'}, a_{t'})}\left[\sum_{t'=t}^{t+h-1} \gamma^{t'-t}r(s_{t'}, a_{t'}) + \gamma^h Q(s_{t+h}, \pi(s_{t+h}))\right]
\end{align}

What QC is actually doing...

\begin{align}
Q(s_t, a_{t:t+h}) \leftarrow \mathbb{E}_{\color[rgb]{0,0.46,0.7265} s_{t+1:t+h+1}  \sim P_\mathcal{D}(\cdot \mid s_t, a_{t:t+h}) }\left[\sum_{t'=t}^{t+h-1} \gamma^{t'-t}r(s_{t'}, a_{t'}) + \gamma^h Q(s_{t+h}, \pi(s_{t+h}))\right]
\end{align}

As long as the data collection policy reacts to observation/state feedback in the middle of an action chunk, the resulting state and reward distribution will deviate from that of executing the action chunk open-loop. To characterize this discrepancy, we introduce a notion of data consistency:


Open-loop consistency (OLC)

\begin{align}
D_{\mathrm{TV}}(\underbrace{T(s_{t+h'} \mid s_t, a_{t:t+h'})}_{\text{take $a_{t:t+h'}$ open-loop}} \mid\mid P_{\mathcal{D}}(s_{t+h'} \mid s_t, a_{t:t+h})) \leq \varepsilon_h, \forall h' \in \{1, 2, \cdots, h\}
\end{align}
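To make this definition concrete, here is a toy numpy sketch of the total-variation gap between an open-loop next-state distribution and the one observed in the dataset (the two distributions below are hypothetical, chosen only to illustrate the computation):

```python
import numpy as np

def tv_distance(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Open-loop: executing the chunk blindly lands in state 0 or 1 w.p. [0.9, 0.1].
open_loop = [0.9, 0.1]
# Dataset: the reactive data-collection policy deviated mid-chunk on some rollouts.
dataset = [0.7, 0.3]

eps_h = tv_distance(open_loop, dataset)
```

Here the data violates OLC by ε_h = 0.2; OLC requires this gap to stay below ε_h for every prefix length h' of the chunk.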

OLC measures how much the 'open-loop' state-distribution can deviate from the 'reactive' state-distribution induced by the data collection policy. With this definition, we can formally show that QC yields a near-optimal action chunking policy when the data is open-loop consistent:


Result 1: QC attains a near-optimal action chunking policy under OLC.

\begin{align}
\|V^\star - V_{\mathrm{QC}}\|_\infty \leq \Theta\left(\frac{\varepsilon_h}{(1-\gamma)(1-\gamma^h)}\right)
\end{align}

This bound is tight, meaning that the worst-case QC sub-optimality scales linearly with ε_h. When the data is not open-loop consistent, there exists an MDP where the action chunking policy learned from the chunked critic is arbitrarily bad. In general, the degree of open-loop consistency (ε_h) is orthogonal to the sub-optimality of the data (which can hurt n-step return backup). So as long as the data distribution is sub-optimal while being open-loop consistent, critic chunking is preferred over n-step return backup:

Result 2: QC is better than n-step return backup when

\begin{align}
&\underbrace{Q^\star(s_t, a_t) -  \mathbb{E}_{P_{\mathcal{D}}(s_{t+1:t+h+1}, a_{t+1:t+h} \mid s_t, a_t)}\left[\sum_{t'=t}^{t+h-1} \gamma^{t'-t}r(s_{t'}, a_{t'})+ \gamma^h V^\star(s_{t+h})\right]}_{\text{sub-optimality of the data distribution}} > O\left(\frac{\varepsilon_h}{1-\gamma^h}\right), \\
&\text{where } Q^\star, V^\star \text{ are the value functions of the optimal policy $\pi^\star$}.
\end{align}

Up to now, we have characterized the condition under which Q-chunking should be preferred over n-step return backup. However, an action chunking policy is still fundamentally limited in reactivity, as it must execute all actions in each chunk open-loop. A common trick that roboticists use with action chunking is to replan more frequently than the chunk size, so that only a partial action chunk is executed. But can we still provide theoretical guarantees with this 'decoupling' setup? The answer is yes, at least in the special case where only the first action of each chunk is taken (i.e., closed-loop execution of the learned action chunking policy).

Result 3: closed-loop execution of QC policy is also near-optimal under OLC

\begin{align}
\|V^\star - V_{\mathrm{DQC}}\|_\infty \leq O\left(\frac{\varepsilon_h}{(1-\gamma)^2(1-\gamma^h)}\right)
\end{align}

The intuition here is simple—if the action chunking policy is near-optimal, then (perhaps not surprisingly) the first action in each action chunk cannot be too sub-optimal. But this result is not as satisfying as the QC result above, because in the worst case, the closed-loop execution performance can degrade by up to a factor of 1/(1-γ) (the effective horizon).


In our paper, we also characterize a new set of conditions under which such closed-loop execution is guaranteed to be close to the optimal closed-loop policy even when the data is not open-loop consistent. This is where decoupling the execution length of the action chunk from the critic chunk length truly shines.

Bounded Optimality Variability (BOV)

\begin{align}
&\text{Let } \mathcal{D} \text{ be a mixture of } \{\mathcal{D}_1, \mathcal{D}_2, \cdots, \mathcal{D}_M\} \text{ with } \mathcal{D}^\star \text{ being one of } \mathcal{D}_i. \text{ Both of the following conditions hold:}\\
& \\
& \text{1. Local condition:} \max_{\color[rgb]{0,0.4804,0.4609}\mathrm{supp}( P_{\mathcal{D}_i}(\cdot \mid s_t, a_t))}\left[R_{t:t+h} +\gamma^h V^\star(s_{t+h})\right] - \min_{\color[rgb]{0,0.4804,0.4609}\mathrm{supp}(P_{\mathcal{D}_i}(\cdot \mid s_t, a_t))}\left[R_{t:t+h} +\gamma^h V^\star(s_{t+h})\right] \leq \vartheta^L_h, \forall i \in \{1, 2, \cdots, M\}, \\
& \text{2. Global condition:}  \max_{\color[rgb]{0,0.4804,0.4609}\mathrm{supp}(P_{\mathcal{D}}(\cdot \mid s_t, a_{t:t+h}))}\left[R_{t:t+h} +\gamma^h V^\star(s_{t+h})\right] - \min_{\color[rgb]{0,0.4804,0.4609}\mathrm{supp}(P_{\mathcal{D}}(\cdot \mid s_t, a_{t:t+h}))}\left[R_{t:t+h} +\gamma^h V^\star(s_{t+h})\right] \leq \vartheta^G_h.
\end{align}

Intuitively, BOV assumes the data is a mixture of multiple sources (e.g., expert data, scripted policy), where each source exhibits small variability in terms of the optimality of the h-step returns (local), and the overall variability (across mixture components) conditioned on action chunk is also small. Under BOV, we can show that the closed-loop execution of the action chunking policy is guaranteed to be near-optimal, regardless of how open-loop inconsistent the data is.

Result 4: DQC attains a near-optimal closed-loop policy under BOV


\begin{align}
\|V^\star - V_{\mathrm{DQC}}\| \leq \Theta\left( \frac{\vartheta^L_h}{1-\gamma} + \frac{\vartheta^G_h + \gamma^h\min(\vartheta^L_h,\vartheta^G_h)}{(1-\gamma)(1-\gamma^h)}\right)
\end{align}

This bound is also tight, meaning that both the global and local optimality conditions are necessary to guarantee the near-optimality of the closed-loop execution of the action chunking policy. More importantly, the optimality gap does not depend on any measure of open-loop consistency (ε_h), whereas a lack of OLC can make QC arbitrarily bad.


In summary, when the data is OLC but sub-optimal, QC is preferred over n-step return backup. The closed-loop execution of the action chunking policy (DQC) enjoys theoretical guarantees under both the OLC and BOV conditions, whereas QC is only provably near-optimal under OLC, making DQC conceptually the more robust choice. This inspired us to develop a practical algorithm that effectively leverages this insight, which we describe next.


Part B: A Practical Algorithm - DQC

One naïve thing we could do is to simply train a QC agent and then execute the learned action chunking policy closed-loop. But this does not actually perform well in practice (as we show below), likely due to the challenge of learning a good action chunking policy with large chunk sizes.

So now the question is—how can we somehow extract a short chunk policy from a long chunk critic directly to side-step the challenge of learning action chunking policies with large chunk sizes?



Our solution is simple: (1) distill a partial-chunk critic from the full-chunk critic, and then (2) extract the policy from this distilled critic! The goal of the distilled critic is to match the value of the full critic given that the remainder of the action chunk is picked optimally:

\begin{align}
&Q^P(s_t, a_{t:t+h_a}) \approx Q(s_t, [a_{t:t+h_a}, a^\star_{t+h_a:t+h}]), \text{where } [a_{t:t+h_a}, a^\star_{t+h_a:t+h}]\text{ concatenates two partial action chunks.}
\end{align}

To achieve this, we use the expectile loss (Kostrikov et al., 2022) such that the distilled critic converges to an upper expectile of the full critic values, approximating the maximization. The overall value learning procedure is summarized below:

DQC Pseudocode

\begin{align}
& \text{Let } h \text{ be the {\color[rgb]{0.7929,0.1602,0.4804}critic chunk size} and } h_a \text{ be the {\color[rgb]{0,0.46,0.7265}policy chunk size}}.  \phantom{\sum_{k = 0}^{h-1} \gamma^k r_{t+k} } \\
  & \text{1. Sample trajectory segment from the dataset }(s_{t:t+h+1}, a_{t:t+h}, r_{t:t+h}) \sim D. \phantom{\sum_{k = 0}^{h-1} \gamma^k r_{t+k} } \\
  & \text{2. {\color[rgb]{0.7929,0.1602,0.4804}[Original full critic: \textbf{long chunk}]} Optimize }Q \text{ with } L(Q) = \left(Q(s_t, a_{t:t+h}) - \sum_{k = 0}^{h-1} \gamma^k r_{t+k} - \gamma^h \bar Q^P(s_{t+h}, a_{t+h:t+h+h_a} \sim \pi(s_{t+h}))\right)^2. \\
  & \text{3. {\color[rgb]{0,0.46,0.7265}[Distilled partial critic: \textbf{short chunk}]} Optimize }Q^P \text{ with }  L(Q^P)=\mathrm{ExpectileLoss}(\bar Q(s_t, a_{t:t+h}) - Q^P(s_t, a_{t:t+h_a})).\phantom{\sum_{k = 0}^{h-1} \gamma^k r_{t+k} } \\
  & \text{4. Optimize }\pi(a_{t:t+h_a} \mid s_t) \text{ with } L(\pi) = \mathbb{E}_{a_{t:t+h_a} \sim \pi(\cdot \mid s_t)}\left[-Q^P(s_t,  a_{t:t+h_a})\right]. \text{ Repeat.}\phantom{\sum_{k = 0}^{h-1} \gamma^k r_{t+k} } 
  \end{align}
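The expectile loss in step 3 can be sketched as follows. This is a minimal numpy illustration (the function name and the asymmetry parameter `tau` are our own stand-ins, not the paper's code):

```python
import numpy as np

def expectile_loss(diff, tau=0.9):
    """Asymmetric squared loss on diff = Q_full_target - Q_partial.
    tau > 0.5 penalizes under-estimation more heavily, so Q^P converges
    to an upper expectile of the full-critic values, approximating a max."""
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float((weight * diff**2).mean())

# Over- and under-estimation of the same magnitude are weighted unequally.
loss = expectile_loss(np.array([1.0, -1.0]), tau=0.9)
```

With tau = 0.5 this reduces to the ordinary mean squared error; as tau approaches 1, the expectile approaches the maximum over the full-chunk completions seen in the data.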

In summary, DQC trains a policy to predict a partial action chunk by hill-climbing the value of a partial critic that is distilled from the original full-chunk critic via an implicit maximization loss. This lets the policy enjoy the value-learning speedup of Q-chunking without explicitly predicting the full action chunk. This design not only mitigates the challenge of learning an action chunking policy but also allows for better reactivity, enjoying the theoretical guarantees discussed in Part A. Next, we put DQC to the test on challenging goal-conditioned RL tasks.
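Steps 2-4 of the pseudocode can be sketched end-to-end as three scalar loss terms. This is a simplified numpy illustration, not the released implementation: all callables below are hypothetical stand-ins for networks and there is no autodiff or target-network machinery.

```python
import numpy as np

def dqc_losses(batch, q_full, q_partial, q_full_tgt, q_partial_tgt,
               policy_sample, gamma=0.99, h=8, h_a=2, tau=0.9):
    """Mirror steps 2-4: full-critic TD loss, expectile distillation loss,
    and policy loss against the distilled partial critic."""
    s, acts, rews, s_h = batch['s'], batch['a'], batch['r'], batch['s_h']
    # Step 2: full critic (long chunk) bootstraps through the partial target critic.
    ret = sum(gamma**k * rews[k] for k in range(h))
    next_partial = policy_sample(s_h)                 # a_{t+h : t+h+h_a}
    td_target = ret + gamma**h * q_partial_tgt(s_h, next_partial)
    loss_q = (q_full(s, acts) - td_target) ** 2
    # Step 3: distill the partial critic (short chunk) with an expectile loss.
    diff = q_full_tgt(s, acts) - q_partial(s, acts[:h_a])
    loss_qp = np.where(diff > 0, tau, 1.0 - tau) * diff ** 2
    # Step 4: policy hill-climbs the distilled partial critic.
    loss_pi = -q_partial(s, policy_sample(s))
    return loss_q, loss_qp, loss_pi

# Constant stand-ins make the three terms easy to check by hand.
batch = {'s': 0, 'a': np.zeros(8), 'r': np.zeros(8), 's_h': 1}
lq, lqp, lpi = dqc_losses(
    batch,
    q_full=lambda s, a: 1.0, q_partial=lambda s, a: 0.5,
    q_full_tgt=lambda s, a: 1.0, q_partial_tgt=lambda s, a: 0.5,
    policy_sample=lambda s: np.zeros(2))
```

In a real agent each loss would be minimized by gradient descent on its own network's parameters, with target networks `q_full_tgt` / `q_partial_tgt` updated slowly.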


Part C: Experiments

We evaluate our method on the six hardest OGBench environments. These environments are long-horizon and extremely difficult to solve with standard 1-step TD, making them an ideal testbed for our method. Below are some example videos of what it takes to solve some of the most challenging tasks in these environments (taken from Seohong's blogpost).

puzzle-4x6

cube-octuple


For our experiments, we compare against a few ablation baselines:
(1) OS: one-step TD.
(2) NS: n-step TD.
(3) QC (Li et al., 2025): learning a single chunked critic and directly extracting a policy of the same chunk size from it; every action chunk is executed open-loop.
(4) DQC-naïve: same as QC, but only a partial chunk is executed open-loop. This is a naïve attempt at decoupling the policy and critic chunk sizes without the distilled critic.

In practice, we also use a separate value function to approximate the Q-target in the n-step TD backup, and best-of-N policy extraction, similar to how it is done in IDQL (Hansen-Estruch et al., 2023) and SHARSA (Park et al., 2025). See our paper for more details.


Our method (DQC) consistently performs on par with or better than the baselines across all environments, as shown below.

Main result plot

Our method also outperforms the previous SOTA method, SHARSA (Park et al., 2025), across all environments except on cube-octuple where they are similar. The aggregated results for both our ablation baselines and prior works are shown below (10 seeds with 95% CI).

Aggregated bar plot (full)

That is all for now! We have released our code and all the experiment data for our main results at github.com/colinqiyangli/dqc. If you are interested in learning more, come check out our full paper on arXiv!

BibTeX

@article{li2025dqc,
  author  = {Qiyang Li and Seohong Park and Sergey Levine},
  title   = {Decoupled Q-chunking},
  journal = {arXiv preprint arXiv:2512.10926},
  year    = {2025},
  url     = {http://arxiv.org/abs/2512.10926},
}