Recurrent-Depth VLA:

Implicit Test-Time Compute Scaling of
Vision–Language–Action Models via
Latent Iterative Reasoning


1Stanford University 2Technical University of Munich
3University of Washington 4Allen Institute for AI

*Equal Contribution †Equal Advising

Under Review



Abstract

Current Vision-Language-Action (VLA) models utilize fixed computational depth, processing simple adjustments and complex multi-step manipulations with the same amount of compute. While Chain-of-Thought (CoT) prompting enables variable compute, it scales memory linearly and struggles with continuous action spaces. We introduce Recurrent-Depth VLA (RD-VLA), an architecture that achieves computational adaptivity through latent iterative refinement instead of explicit token generation. RD-VLA employs a recurrent action head with weight-tied layers, enabling arbitrary depth with a constant memory footprint. We train the model using truncated backpropagation through time (TBPTT), allowing for efficient supervision of the refinement process. At inference, an adaptive stopping criterion based on latent convergence enables the model to dynamically allocate compute per sample. Our experiments on complex manipulation tasks demonstrate that recurrent depth is critical for success: tasks that fail completely (0% success) with single-iteration inference reach over 90% success with four iterations, while simpler tasks saturate quickly. RD-VLA provides a scalable path for test-time compute in robotics: it replaces discrete, token-based reasoning with latent reasoning, maintains a constant memory footprint regardless of depth, and requires no special data collection, bypassing the data and memory overhead of CoT.

Summary


We introduce Recurrent-Depth VLA. (Left) Previous reasoning VLAs (e.g., ThinkAct, MolmoAct) generate explicit reasoning tokens in output space, requiring expensive autoregressive decoding. (Center) Our approach performs iterative refinement entirely in latent representation space, bypassing token generation overhead. (Right) RD-VLA achieves comparable performance to autoregressive reasoning baselines on LIBERO-10 while being substantially faster due to the efficiency of latent reasoning with adaptive compute.

Method Overview


We introduce a framework that decouples computational depth from the fixed architectural constraints of pretrained vision-language backbones. While standard VLA architectures typically utilize fixed-depth heads, RD-VLA shifts the computational burden to a weight-tied recurrent transformer core operating within a continuous latent manifold. Following the Huginn approach, we partition the architecture into a functional triplet: the Prelude, the Recurrent Core, and the Coda (Fig. 2). The Prelude and Coda serve as non-recurrent interface layers that map representations into and out of a dedicated latent manifold optimized for iterative reasoning.
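The triplet structure can be illustrated with a minimal sketch. This is not the actual model: layer widths, activations, and the residual update below are illustrative placeholders, and the real Prelude, Core, and Coda are transformer blocks. The point the sketch makes is that the Core's weights are reused at every iteration, so depth grows without adding parameters or memory.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # latent width (illustrative)

# One weight matrix per stage. The core's weights are *shared* across
# all iterations (weight tying), so recurrence depth is arbitrary
# while the parameter count stays constant.
W_prelude = rng.standard_normal((D, D)) * 0.1
W_core = rng.standard_normal((D, D)) * 0.1
W_coda = rng.standard_normal((D, D)) * 0.1

def prelude(x):
    # Map backbone features into the latent reasoning manifold.
    return np.tanh(x @ W_prelude)

def core(z):
    # Weight-tied recurrent block: the same W_core at every depth,
    # applied as a residual update to the latent state.
    return np.tanh(z @ W_core) + z

def coda(z):
    # Decode the refined latent into an action prediction.
    return z @ W_coda

def forward(x, num_iterations):
    z = prelude(x)
    for _ in range(num_iterations):
        z = core(z)  # more iterations = more depth, same weights
    return coda(z)

x = rng.standard_normal(D)
a_shallow = forward(x, 4)
a_deep = forward(x, 8)  # deeper reasoning, identical parameter count
```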

Adaptive Computation and Execution


Adaptive Computation Leveraging the convergence behavior of the recurrent core, we implement an adaptive computation mechanism at inference. Rather than specifying a fixed iteration count, we utilize the model's own internal convergence as a proxy for reasoning certainty. We define a stopping criterion based on the Kullback-Leibler (KL) divergence between the action distributions of consecutive iterations. Approximating KL via Mean Squared Error (MSE) in the action space, the inference loop terminates at step k* when:

$$ ||\mathbf{a}_k - \mathbf{a}_{k-1}||^2_2 < \delta. $$

where $\mathbf{a}_k$ is the predicted action chunk at step $k$ and $\delta$ is a convergence threshold (e.g., $10^{-3}$). This allows the model to self-regulate: terminating instantly for trivial movements while allocating extended compute for complex situations.
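The stopping rule above can be sketched as a simple inference loop. Here `refine_step` and `decode` stand in for the recurrent core and coda; the cap `max_iters` and the toy contraction used in the usage example are assumptions for illustration, not values from the paper.

```python
import numpy as np

def adaptive_inference(z, refine_step, decode, delta=1e-3, max_iters=32):
    """Iterate the recurrent core until consecutive action chunks agree
    to within delta in squared L2 norm (the MSE proxy for KL divergence).

    z           -- latent state after the prelude (assumed given)
    refine_step -- one application of the weight-tied core
    decode      -- the coda, mapping a latent to an action chunk
    Returns (action, k_star): the converged action and the stop step.
    """
    prev_action = decode(z)
    for k in range(1, max_iters + 1):
        z = refine_step(z)
        action = decode(z)
        if np.sum((action - prev_action) ** 2) < delta:
            return action, k  # converged: stop early, save compute
        prev_action = action
    return prev_action, max_iters  # fall back to the iteration cap

# Toy usage: a contracting refinement converges after a few steps.
action, k_star = adaptive_inference(np.ones(4), lambda z: 0.5 * z, lambda z: z)
```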

Adaptive Execution Adaptive computation determines how long to recur; adaptive execution determines how many actions to execute. We observe that instances requiring deep recurrence (k* > 8) often correspond to states of high uncertainty. In these regimes, executing a long horizon of actions is dangerous, as small errors in the initial plan compound over time. We propose two strategies to couple reasoning depth with action execution: (1) Threshold-Based Adaptive Execution and (2) Linear Decay Execution.
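The two strategies can be sketched as follows. The exact functional forms, thresholds, and chunk lengths below are assumptions for illustration; the paper names the strategies but this section does not give their formulas.

```python
def threshold_execution(k_star, chunk_len=8, k_thresh=8, short_len=2):
    """Threshold-Based Adaptive Execution (sketch): execute the full
    action chunk when convergence was fast; replan sooner (execute only
    a few actions) when deep recurrence signals high uncertainty."""
    return short_len if k_star > k_thresh else chunk_len

def linear_decay_execution(k_star, chunk_len=8, k_max=32, min_len=1):
    """Linear Decay Execution (sketch): smoothly shrink the executed
    horizon as reasoning depth grows, instead of a hard cutoff."""
    frac = min(k_star, k_max) / k_max
    return max(min_len, round(chunk_len * (1.0 - frac)))
```

Both couple a cheap, already-available signal (the stop step k*) to the control loop, so no extra uncertainty estimator is needed.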

Experiments and Visualization

Please find more quantitative results in our paper!

LIBERO-Spatial

Pick up the black bowl next to the cookie box and place it on the plate
Pick up the black bowl on the ramekin and place it on the plate
Pick up the black bowl in the top drawer of the wooden cabinet and place it on the plate
Pick up the black bowl from table center and place it on the plate
Pick up the black bowl between the plate and the ramekin and place it on the plate
Pick up the black bowl on the wooden cabinet and place it on the plate

LIBERO-Object

Pick up the alphabet soup and place it in the basket
Pick up the cream cheese and place it in the basket
Pick up the salad dressing and place it in the basket
Pick up the bbq sauce and place it in the basket
Pick up the ketchup and place it in the basket
Pick up the tomato sauce and place it in the basket

LIBERO-Goal

Open the middle drawer of the cabinet
Put the bowl on the stove
Put the wine bottle on top of the cabinet
Open the top drawer and put the bowl inside
Put the bowl on top of the cabinet
Push the plate to the front of the stove

LIBERO-Long

Turn on the stove and put the moka pot on it
Put the black bowl in the bottom drawer of the cabinet and close it
Put both moka pots on the stove
Put both the alphabet soup and the cream cheese box in the basket
Put both the alphabet soup and the tomato sauce in the basket
Put both the cream cheese box and the butter in the basket

CALVIN ABC->D

Move slider left
Rotate red block left
Turn off led
Lift pink block table
Place in slider
Lift red block slider
Stack block
Move slider left
Turn off led
Open drawer
Push red block left
Lift pink block slider
Stack block
Turn off led
Open drawer
Move slider left
Lift red block table
Place in slider
Turn on lightbulb
Close drawer
Push blue block right
Open drawer
Lift pink block slider
Place in slider
Push into drawer
Move slider right
Turn on led
Lift pink block slider
Place in drawer
Lift pink block drawer

Adaptive Computation Examples

Short-horizon Task

Long-horizon Task

Long-horizon Task (Failed)

Real-world Experiments

Task
RD-VLA
Baseline
Speed


RD-VLA

Baseline

BibTeX

@misc{tur2026rdvla,
      title={Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning}, 
      author={Yalcin Tur and Jalal Naghiyev and Haoquan Fang and Wei-Chuan Tsai and Jiafei Duan and Dieter Fox and Ranjay Krishna},
      year={2026},
      eprint={2602.07845},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.07845}, 
}