Recurrent-Depth VLA:

Implicit Test-Time Compute Scaling of
Vision–Language–Action Models via
Latent Iterative Reasoning


1Stanford University 2Technical University of Munich
3University of Washington 4Allen Institute for AI

*Equal Contribution †Equal Advising

Under Review



Abstract

Current Vision-Language-Action (VLA) models utilize fixed computational depth, processing simple adjustments and complex multi-step manipulations with the same amount of compute. While Chain-of-Thought (CoT) prompting enables variable compute, it scales memory linearly and struggles with continuous action spaces. We introduce Recurrent-Depth VLA (RD-VLA), an architecture that achieves computational adaptivity through latent iterative refinement instead of explicit token generation. RD-VLA employs a recurrent action head with weight-tied layers, enabling arbitrary depth with a constant memory footprint. We train the model using truncated backpropagation through time (TBPTT), allowing for efficient supervision of the refinement process. At inference, an adaptive stopping criterion based on latent convergence enables the model to dynamically allocate compute per sample. Our experiments on complex manipulation tasks demonstrate that recurrent depth is critical for success: tasks that fail completely (0% success) with single-iteration inference reach over 90% success with four iterations, while simpler tasks saturate quickly. RD-VLA provides a scalable path for test-time compute in robotics: it replaces discrete, token-based reasoning with latent reasoning, maintains a constant memory footprint regardless of depth, and requires no special data collection, bypassing the data and memory overhead of CoT.

Summary


We introduce Recurrent-Depth VLA. (Left) Previous reasoning VLAs (e.g., ThinkAct, MolmoAct) generate explicit reasoning tokens in output space, requiring expensive autoregressive decoding. (Center) Our approach performs iterative refinement entirely in latent representation space, bypassing token generation overhead. (Right) RD-VLA achieves comparable performance to autoregressive reasoning baselines on LIBERO-10 while being substantially faster due to the efficiency of latent reasoning with adaptive compute.

Method Overview


We introduce a framework that decouples computational depth from the fixed architectural constraints of pretrained vision-language backbones. While standard VLA architectures typically utilize fixed-depth heads, RD-VLA shifts the computational burden to a weight-tied recurrent transformer core operating within a continuous latent manifold. Following the Huginn approach, we partition the architecture into a functional triplet: the Prelude, the Recurrent Core, and the Coda (Fig. 2). The Prelude and Coda serve as non-recurrent interface layers that map representations into and out of a dedicated latent manifold optimized for iterative reasoning.
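The triplet structure can be illustrated with a minimal sketch. This is not the actual model: layer widths, activations, and the residual update below are illustrative placeholders, and the real Prelude, Core, and Coda are transformer blocks. The point the sketch makes is that the Core's weights are reused at every iteration, so depth grows without adding parameters or memory.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # latent width (illustrative)

# One weight matrix per stage. The core's weights are *shared* across
# all iterations (weight tying), so recurrence depth is arbitrary
# while the parameter count stays constant.
W_prelude = rng.standard_normal((D, D)) * 0.1
W_core = rng.standard_normal((D, D)) * 0.1
W_coda = rng.standard_normal((D, D)) * 0.1

def prelude(x):
    # Map backbone features into the latent reasoning manifold.
    return np.tanh(x @ W_prelude)

def core(z):
    # Weight-tied recurrent block: the same W_core at every depth,
    # applied as a residual update to the latent state.
    return np.tanh(z @ W_core) + z

def coda(z):
    # Decode the refined latent into an action prediction.
    return z @ W_coda

def forward(x, num_iterations):
    z = prelude(x)
    for _ in range(num_iterations):
        z = core(z)  # more iterations = more depth, same weights
    return coda(z)

x = rng.standard_normal(D)
a_shallow = forward(x, 4)
a_deep = forward(x, 8)  # deeper reasoning, identical parameter count
```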

Adaptive Computation and Execution


Adaptive Computation Leveraging the convergence behavior of the recurrent core, we implement an adaptive computation mechanism at inference. Rather than specifying a fixed iteration count, we utilize the model's own internal convergence as a proxy for reasoning certainty. We define a stopping criterion based on the Kullback-Leibler (KL) divergence between the action distributions of consecutive iterations. Approximating KL via Mean Squared Error (MSE) in the action space, the inference loop terminates at step k* when:

$$ ||\mathbf{a}_k - \mathbf{a}_{k-1}||^2_2 < \delta. $$

where $\mathbf{a}_k$ is the predicted action chunk at step $k$ and $\delta$ is a convergence threshold (e.g., $10^{-3}$). This allows the model to self-regulate: terminating instantly for trivial movements while allocating extended compute for complex situations.
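The stopping rule above can be sketched as a simple inference loop. Here `refine_step` and `decode` stand in for the recurrent core and coda; the cap `max_iters` and the toy contraction used in the usage example are assumptions for illustration, not values from the paper.

```python
import numpy as np

def adaptive_inference(z, refine_step, decode, delta=1e-3, max_iters=32):
    """Iterate the recurrent core until consecutive action chunks agree
    to within delta in squared L2 norm (the MSE proxy for KL divergence).

    z           -- latent state after the prelude (assumed given)
    refine_step -- one application of the weight-tied core
    decode      -- the coda, mapping a latent to an action chunk
    Returns (action, k_star): the converged action and the stop step.
    """
    prev_action = decode(z)
    for k in range(1, max_iters + 1):
        z = refine_step(z)
        action = decode(z)
        if np.sum((action - prev_action) ** 2) < delta:
            return action, k  # converged: stop early, save compute
        prev_action = action
    return prev_action, max_iters  # fall back to the iteration cap

# Toy usage: a contracting refinement converges after a few steps.
action, k_star = adaptive_inference(np.ones(4), lambda z: 0.5 * z, lambda z: z)
```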

Adaptive Execution Adaptive computation determines how long to recur; adaptive execution determines how many actions to execute. We observe that instances requiring deep recurrence (k* > 8) often correspond to states of high uncertainty. In these regimes, executing a long horizon of actions is dangerous, as small errors in the initial plan compound over time. We propose two strategies to couple reasoning depth with action execution: (1) Threshold-Based Adaptive Execution and (2) Linear Decay Execution.
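The two strategies can be sketched as follows. The exact functional forms, thresholds, and chunk lengths below are assumptions for illustration; the paper names the strategies but this section does not give their formulas.

```python
def threshold_execution(k_star, chunk_len=8, k_thresh=8, short_len=2):
    """Threshold-Based Adaptive Execution (sketch): execute the full
    action chunk when convergence was fast; replan sooner (execute only
    a few actions) when deep recurrence signals high uncertainty."""
    return short_len if k_star > k_thresh else chunk_len

def linear_decay_execution(k_star, chunk_len=8, k_max=32, min_len=1):
    """Linear Decay Execution (sketch): smoothly shrink the executed
    horizon as reasoning depth grows, instead of a hard cutoff."""
    frac = min(k_star, k_max) / k_max
    return max(min_len, round(chunk_len * (1.0 - frac)))
```

Both couple a cheap, already-available signal (the stop step k*) to the control loop, so no extra uncertainty estimator is needed.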

Experiments and Visualization

Please find more quantitative results in our paper!

LIBERO-Spatial

Pick up the black bowl next to the cookie box and place it on the plate
Pick up the black bowl on the ramekin and place it on the plate
Pick up the black bowl in the top drawer of the wooden cabinet and place it on the plate
Pick up the black bowl from table center and place it on the plate
Pick up the black bowl between the plate and the ramekin and place it on the plate
Pick up the black bowl on the wooden cabinet and place it on the plate

LIBERO-Object

Pick up the alphabet soup and place it in the basket
Pick up the cream cheese and place it in the basket
Pick up the salad dressing and place it in the basket
Pick up the bbq sauce and place it in the basket
Pick up the ketchup and place it in the basket
Pick up the tomato sauce and place it in the basket

LIBERO-Goal

Open the middle drawer of the cabinet
Put the bowl on the stove
Put the wine bottle on top of the cabinet
Open the top drawer and put the bowl inside
Put the bowl on top of the cabinet
Push the plate to the front of the stove

LIBERO-Long

Turn on the stove and put the moka pot on it
Put the black bowl in the bottom drawer of the cabinet and close it
Put both moka pots on the stove
Put both the alphabet soup and the cream cheese box in the basket
Put both the alphabet soup and the tomato sauce in the basket
Put both the cream cheese box and the butter in the basket

CALVIN ABC->D

Move slider left
Rotate red block left
Turn off led
Lift pink block table
Place in slider
Lift red block slider
Stack block
Move slider left
Turn off led
Open drawer
Push red block left
Lift pink block slider
Stack block
Turn off led
Open drawer
Move slider left
Lift red block table
Place in slider
Turn on lightbulb
Close drawer
Push blue block right
Open drawer
Lift pink block slider
Place in slider
Push into drawer
Move slider right
Turn on led
Lift pink block slider
Place in drawer
Lift pink block drawer

Adaptive Computation Examples

Short-horizon Task

Long-horizon Task

Long-horizon Task (Failed)

Real-world Experiments

Task
RD-VLA
Baseline
Speed


RD-VLA

Baseline

BibTeX

@misc{tur2026rdvla,
      title={Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning}, 
      author={Yalcin Tur and Jalal Naghiyev and Haoquan Fang and Wei-Chuan Tsai and Jiafei Duan and Dieter Fox and Ranjay Krishna},
      year={2026},
      eprint={2602.07845},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.07845}, 
}