Our work focuses on Q-learning, a popular off-policy, value-based family of reinforcement learning (RL) algorithms
that learn a state-action value function, typically via temporal difference (TD) backups. The most standard choice is the 1-step TD backup, where the value of a state-action pair is updated towards the immediate reward plus the discounted estimated value one step into the future.
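To make this concrete, here is a minimal sketch of a 1-step TD target (the function name and the toy numbers are ours, not from the paper):

```python
import numpy as np

def td1_target(reward, next_q_values, gamma=0.99):
    """1-step TD target: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * np.max(next_q_values)

# Toy example: reward 1.0, next-state Q-values [0.0, 2.0], gamma 0.5
print(td1_target(1.0, np.array([0.0, 2.0]), gamma=0.5))  # -> 2.0
```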
For long-horizon sparse-reward tasks, it is often beneficial to use n-step return backups.
However, n-step return backups use off-policy trajectories to estimate the return, which can be overly pessimistic when the data distribution is sub-optimal.
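For intuition, an n-step return target can be computed by folding the trajectory's rewards backwards onto a bootstrapped value (a sketch with our own names, not the paper's code):

```python
def nstep_target(rewards, bootstrap_value, gamma=0.99):
    """n-step TD target: sum_{k<n} gamma^k * r_k + gamma^n * V(s_n).
    The n rewards come from a (possibly off-policy) trajectory,
    which is what makes this backup sensitive to data quality."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# Two rewards of 1.0, bootstrap value 10.0, gamma = 0.5:
# 1 + 0.5 * (1 + 0.5 * 10) = 4.0
print(nstep_target([1.0, 1.0], 10.0, gamma=0.5))  # -> 4.0
```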
More recently, Q-chunking (QC) proposed learning a chunked critic over sequences of actions (action chunks).
The chunked critic allows QC to directly capture the value of action chunks, which does not suffer from this off-policy bias.
In QC, the chunked critic is learned with n-step return backups, but there is a problem: the learned value function may not converge
to the correct value due to a subtle discrepancy in how the data was collected.
As long as the data collection policy reacts to
observation/state feedback in the middle of an action chunk, the
resulting state and reward distribution will deviate from that of executing
the action chunk open-loop.
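Concretely, the chunked critic's backup can be sketched as follows (our own names and toy numbers; the key difference from a plain n-step backup is that the chunk's actions are inputs to the critic rather than off-policy choices the target must correct for):

```python
def chunked_td_target(chunk_rewards, next_chunk_value, gamma=0.99):
    """Backup for a chunked critic Q(s_t, a_{t:t+h}):
    discounted rewards over the chunk plus gamma^h times the
    estimated value of the next chunk."""
    h = len(chunk_rewards)
    chunk_return = sum(gamma**k * r for k, r in enumerate(chunk_rewards))
    return chunk_return + gamma**h * next_chunk_value

# Chunk of length 2 with rewards [1.0, 0.0], next-chunk value 4.0,
# gamma = 0.5: 1 + 0 + 0.25 * 4 = 2.0
print(chunked_td_target([1.0, 0.0], 4.0, gamma=0.5))  # -> 2.0
```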
This bound is tight, meaning that the worst-case QC sub-optimality scales linearly with ε. When the data is not open-loop consistent, there exists an MDP where the action chunking policy learned from the chunked critic is arbitrarily bad. In general, the degree of open-loop consistency (ε) is orthogonal to the sub-optimality of the data (which can hurt n-step return backup). So as long as the data distribution is sub-optimal while being open-loop consistent, critic chunking is preferred over n-step return backup:
Up to now, we have characterized the condition under which Q-chunking should be preferred over n-step return backup,
but the action chunking policy is still fundamentally limited in reactivity, as it must execute all actions in each chunk.
The intuition here is simple: if the action chunking policy is near-optimal, then (perhaps not surprisingly) the first action in each action chunk cannot be too sub-optimal. But this result is not as satisfying as the QC result above, because in the worst case, the closed-loop execution performance can degrade by up to a factor of 1/(1-γ) (the effective horizon).
In our paper, we also characterized a new set of conditions under which such closed-loop execution is guaranteed to be close to the optimal closed-loop policy even when the data is not open-loop consistent. This is where the decoupling of the execution length of the action chunk from the critic chunking length truly shines.
Intuitively,
This bound is also tight, meaning that both the global and local optimality conditions are necessary to guarantee the near-optimality of the closed-loop execution of the action chunking policy. More importantly, the optimality gap does not depend on any measure of open-loop consistency (ε), the very quantity that can make QC arbitrarily bad when the data is not OLC.
In summary, when the data is OLC and sub-optimal, QC is preferred over n-step return backup. The closed-loop execution of the action chunking policy (DQC) enjoys theoretical guarantees under both the OLC and BOV conditions, whereas QC is only provably near-optimal under OLC, making DQC a conceptually more robust choice. This inspired us to develop a practical algorithm to effectively leverage this insight, which we describe next.
One naïve thing we could do is to simply train a QC agent and then execute the learned action chunking policy in closed loop.
But this actually does not perform well in practice (as we will show below), likely due to
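Closed-loop execution of a chunk policy simply means re-querying the policy before the full chunk is consumed. Here is a minimal sketch (the policy and environment are toy stand-ins of our own, not the paper's interfaces):

```python
def closed_loop_rollout(policy, step, obs, exec_len, num_replans):
    """Execute only the first `exec_len` actions of each predicted
    chunk, then re-query the policy with the latest observation."""
    for _ in range(num_replans):
        chunk = policy(obs)              # full predicted action chunk
        for action in chunk[:exec_len]:  # execute only a prefix
            obs = step(obs, action)
    return obs

# Toy check: integer "observations", actions add to the observation.
final = closed_loop_rollout(
    policy=lambda o: [1, 1, 1, 1],  # always predicts a chunk of +1s
    step=lambda o, a: o + a,
    obs=0, exec_len=2, num_replans=3,
)
print(final)  # -> 6
```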
the challenge of learning a good action chunking policy with large chunk sizes.
So now the question is: how can we extract a short-chunk policy from a critic trained with the full chunk size?
Our solution is simple: (1) distill a partial-chunk-size critic from the full-chunk-size critic, and then (2) extract the policy from this distilled critic! The goal of the distilled critic is to match the value of the full critic, assuming the second half of the action chunk is picked optimally:
To achieve this, we use the expectile loss (Kostrikov et al., 2022) so that the distilled critic converges to an upper expectile of the full critic's values, approximating the maximization. The overall value learning procedure is summarized below:
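For reference, the expectile loss itself is just an asymmetric L2 loss (the function name and toy numbers below are ours):

```python
import numpy as np

def expectile_loss(pred, targets, tau=0.9):
    """Asymmetric L2 loss: under-estimation errors (target > pred)
    are weighted by tau, over-estimation errors by (1 - tau).
    For tau > 0.5 the minimizer is an upper expectile of the target
    distribution, approximating a max as tau -> 1."""
    diff = targets - pred
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff**2))

# With targets {0, 10} and tau = 0.9, the loss prefers predictions
# near the top of the distribution (the minimizer is 9.0 here):
assert expectile_loss(9.0, np.array([0.0, 10.0])) < \
       expectile_loss(5.0, np.array([0.0, 10.0]))
```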
In summary, DQC trains a policy to predict a partial action chunk that maximizes the distilled critic.
We evaluated our method on the six hardest OGBench environments. These environments are long-horizon and extremely difficult to solve with standard 1-step TD, making them an ideal testbed for our method. Below are some example videos of what it takes to solve some of the most challenging tasks in these environments (taken from Seohong's blog post).
For our experiments, we compare against a few ablation baselines as well as prior methods.
Our method (DQC) consistently performs on par with or better than the baselines across all environments, as shown below.
Our method also outperforms the previous SOTA method, SHARSA (Park et al., 2025), across all environments except on cube-octuple where they are similar. The aggregated results for both our ablation baselines and prior works are shown below (10 seeds with 95% CI).
That is all for now! We have released our code and all the experiment data for our main results at github.com/colinqiyangli/dqc. If you are interested in learning more, come check out our full paper on arXiv!
@article{li2025dqc,
  author  = {Qiyang Li and Seohong Park and Sergey Levine},
  title   = {Decoupled Q-chunking},
  journal = {arXiv preprint arXiv:2512.10926},
  year    = {2025},
  url     = {http://arxiv.org/abs/2512.10926},
}