Time Is Not in the Pixels

The first task we scoped for a robot deployment was not dexterous — it was temporal. The robot needed to wait for water to boil before pouring it. We could get the arm to grasp the kettle. We could not get any model to understand that the water was not ready yet, or to wait before doing the simplest pick and place. This led us to a question we could not find an answer to anywhere in the literature: do any of the current foundation models for robotics have a representation of duration at all?

The models get better at the parts of robotics that have always been hard: dexterous manipulation, tool use, language conditioning, and ultimately long-horizon sequencing. They do not get better at the parts humans find trivially easy — knowing when the microwave is done before opening the door, staying interruptible during a long pour, remembering when something happened and not just that it did. The question is whether temporal reasoning deserves a first-class representation alongside vision and proprioception — one that lets models reason about duration, not just the sequencing of events.

Duration Blindness

We instructed five vision-language-action models (VLAs), namely π0.5, π0-FAST, GR00T N1.6, MemoryVLA, and V-JEPA 2 AC, to “wait N seconds, then pick up the block” and measured time to first meaningful action (TTFMA): the first moment the robot produces an action above a calibrated noise threshold. We ran 255 episodes per model.
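As a rough illustration of the metric, TTFMA can be computed from a logged episode as sketched below. The L2-norm magnitude, the threshold value, and the example episode are assumptions made for this sketch; the calibration procedure actually used is not specified here.

```python
from typing import Optional, Sequence


def ttfma(
    timestamps_s: Sequence[float],
    actions: Sequence[Sequence[float]],
    noise_threshold: float,
) -> Optional[float]:
    """Time (relative to episode start) of the first action whose L2 norm
    exceeds the noise threshold, or None if no action ever does."""
    for t, action in zip(timestamps_s, actions):
        magnitude = sum(a * a for a in action) ** 0.5
        if magnitude > noise_threshold:
            return t - timestamps_s[0]
    return None


# Hypothetical episode logged at 20 Hz: three near-zero actions, then motion.
times = [0.00, 0.05, 0.10, 0.15, 0.20]
acts = [[0.001, 0.0], [0.002, -0.001], [0.001, 0.001], [0.30, 0.10], [0.40, 0.10]]
first_move = ttfma(times, acts, noise_threshold=0.05)   # assumed threshold
never_moves = ttfma(times[:3], acts[:3], noise_threshold=0.05)
```

A policy that honored “wait N seconds” would yield a TTFMA close to N; one that ignores the instruction yields roughly one inference step regardless of N.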

Fig. I The Invariance
Time to first meaningful action · 255 episodes / model
[Figure: observed TTFMA (log scale, 0.001 s to 10 s) against commanded wait duration (1 s, 3 s, 5 s, 10 s). Baseline (no wait): 0.080 s. VLA-Wait (language, red): 0.067 s, 0.059 s, 0.059 s, 0.071 s, invariant to commanded duration. Hierarchical (NOOP injection, blue): 1.00 s, 3.00 s, 5.01 s, 10.00 s, scaling 1 : 1.]

Fig. I The red bar — action timing under the instruction “wait N seconds, then pick up the block” — does not change. The blue bar, which injects N seconds of NOOP externally, scales 1 : 1 and confirms the measurement pipeline works. The gap between them is duration blindness.

The red bar does not move. Whether the instruction says one second or ten, every model begins acting within the first inference step. The blue bar — the same task with NOOP actions injected externally for N seconds — scales linearly, confirming the pipeline works. The model ignores the temporal content of the instruction entirely. On LIBERO, the temporal prefix degrades task success from 46.7% to 3.9%. Asking the model to wait actively breaks the policy.

The training pipeline explains why. Teleoperation data systematically filters out pauses, action chunking compresses idle time, and success metrics grade final world state rather than temporal fidelity. As a result, LIBERO-fine-tuned checkpoints contain zero NOOP outputs: the action vocabulary does not include “do nothing.”

The natural objection: this is a training data problem. Add synthetic waiting demonstrations — static frames labeled NOOP — and the model will learn. Consider what that data looks like: N seconds of identical frames, each labeled NOOP, followed by one labeled ACT. Every frame is drawn from the same distribution. The observation at t = 1 s is statistically indistinguishable from t = 9 s. The model has no basis for learning which NOOP is the last one.

Fig. II Why Training Data Cannot Help
[Figure: synthetic waiting data for the instruction “wait 10 seconds, then pick up the block.” The model sees 10 identical observations and must predict NOOP 9 times, then ACT once, with no basis for choosing when. Information content: P(X | t = 1 s) = P(X | t = 9 s) = P(X). The transition point is not in the pixels; more data does not change this.]

Fig. II The absurdity of synthetic waiting data. Every frame during the wait is drawn from the same distribution. The model receives the same observation at t = 1 s and t = 9 s. It must predict NOOP for both, then ACT for t = 10 s — but nothing in the input distinguishes the transition point. Training on this data teaches the model to output NOOP always or ACT always, never to time the switch.

This is not a sample efficiency problem. More data does not help when the data itself is uninformative. The signal is not faint, or hard to decode, or buried under noise — it is absent from the inputs entirely.
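The argument can be made concrete with a toy sketch. It assumes a memoryless predictor that sees only the current observation, and it stands in for the frames with a single constant token; both are simplifications for illustration, not a description of any VLA's training setup.

```python
from collections import Counter, defaultdict

# Hypothetical synthetic episode: ten identical frames, nine NOOP labels, one ACT.
FRAME = "static_scene"  # stand-in for an observation that never changes
episode = [(FRAME, "NOOP")] * 9 + [(FRAME, "ACT")]


def fit_frequency_policy(samples):
    """Estimate P(label | observation) by counting. For a predictor that
    sees only the current observation, this empirical conditional is the
    log-loss-optimal fit to the data."""
    counts = defaultdict(Counter)
    for obs, label in samples:
        counts[obs][label] += 1
    return {
        obs: {label: n / sum(c.values()) for label, n in c.items()}
        for obs, c in counts.items()
    }


policy = fit_frequency_policy(episode)

# The prediction depends only on the (identical) observation, so the policy
# emits the same answer at every step of the wait, including the last one.
predictions = [max(policy[obs], key=policy[obs].get) for obs, _ in episode]
p_act = policy[FRAME]["ACT"]
```

Every step receives the same prediction, so the fitted policy never switches from NOOP to ACT. Adding more episodes of the same kind changes the estimated base rate, not the fact that it is constant across the wait.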

Why the Signal Cannot Be Recovered

The impossibility is formal.

Theorem. Let $X_1, X_2, \ldots, X_T \overset{\text{i.i.d.}}{\sim} P$, and let $T$ denote the (unobserved) elapsed step index. Then for every value $x$:

$$P(X_t = x \mid T = t) \;=\; P(X_t = x) \quad \forall\, t$$

$$\implies\; P(T = t \mid X_t = x) \;=\; P(T = t) \quad (\text{posterior} = \text{prior})$$

$$\implies\; I(T;\, X_t) \;=\; 0$$

Corollary. For any measurable $g$, $Z_t = g(X_t) \;\implies\; I(T;\, Z_t) \,\leq\, I(T;\, X_t) = 0$ (data processing inequality).

No learned representation recovers information absent from its inputs.

No architecture, no matter how large, how deep, or how expensively trained, recovers information absent from its inputs. A static scene produces i.i.d. frames, and no function of the current frame can estimate how many frames preceded it; that requires a counter. Scaling does not change this.
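The theorem can be checked numerically on a toy case. The sketch below computes I(T; X) exactly from a joint table, once for frames drawn i.i.d. from a fixed distribution and once for a quantity that accumulates over time. The three-symbol frame distribution, the uniform prior over ten steps, and the lazy random walk are all assumptions chosen for illustration.

```python
import math


def mutual_information(joint):
    """I(T; X) in bits from a joint table joint[t][x] that sums to 1."""
    p_t = [sum(row) for row in joint]
    n_x = len(joint[0])
    p_x = [sum(row[x] for row in joint) for x in range(n_x)]
    mi = 0.0
    for t, row in enumerate(joint):
        for x, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log2(p / (p_t[t] * p_x[x]))
    return mi


def lazy_walk_dists(t_max):
    """Position distributions after 1..t_max steps of a walk with steps
    -1, 0, +1 (probabilities 1/4, 1/2, 1/4). Index k means displacement
    k - t_max, so the walk starts at index t_max."""
    size = 2 * t_max + 1
    dist = [0.0] * size
    dist[t_max] = 1.0
    out = []
    for _ in range(t_max):
        nxt = [0.0] * size
        for k, p in enumerate(dist):
            if p == 0.0:
                continue
            nxt[k] += 0.5 * p
            if k > 0:
                nxt[k - 1] += 0.25 * p
            if k + 1 < size:
                nxt[k + 1] += 0.25 * p
        out.append(nxt)
        dist = nxt
    return out


T_MAX = 10                       # elapsed steps 1..T_MAX, uniform prior (assumed)
prior = 1.0 / T_MAX

# Static scene: every frame is drawn from the same P(X), whatever t is.
frame_dist = [0.2, 0.5, 0.3]     # hypothetical 3-symbol frame distribution
camera_joint = [[prior * p for p in frame_dist] for _ in range(T_MAX)]

# Accumulating quantity: its distribution widens with t.
walk_joint = [[prior * p for p in dist] for dist in lazy_walk_dists(T_MAX)]

mi_camera = mutual_information(camera_joint)   # zero up to rounding error
mi_walk = mutual_information(walk_joint)       # strictly positive
```

The i.i.d. case factorizes as P(T)P(X), so its mutual information vanishes by construction. The walk carries a positive but modest amount of information about elapsed steps, because its spread depends on t.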

We probed the VLM backbone to see whether temporal magnitudes survive into the embedding space. VLAs typically split inference between a vision-language backbone that processes the instruction and an expert tower that generates actions — the question is whether the backbone encodes “5 seconds” differently from “10 seconds” and whether that difference reaches the expert.

Fig. III The Encoder’s Categorical Collapse
Cosine distance between temporal-prompt embeddings · PaliGemma VLM · d_cos ≈ 10⁻⁵
[Figure: five prompts (“a moment,” “5 seconds,” “5 minutes,” “5 hours,” “999 999 s”) placed on a semantic ground-truth axis running from 0.1 s to 10⁶ s, then passed through the PaliGemma VLM encoder (18 layers, 3 B parameters, pretrained). On the normal cosine-distance scale from 0 to 1 all five prompts coincide; magnified ×100 000, they separate over a cosine-distance span of ≈ 4 × 10⁻⁵, statistically real but practically invisible.]

Fig. III The temporal meaning of a prompt ranges over seven orders of magnitude, from “a moment” to 999 999 seconds. The VLM encoder places all of them within a cosine distance of 4 × 10⁻⁵ of each other. The ordering is preserved (Mantel r = 0.77) but the distances are so small that the expert tower treats them as interchangeable. Language is parsed, not grounded.

The backbone preserves a faint numerical ordering (Mantel r = 0.77) but at cosine distances of 10⁻⁵ — magnitudes so small that the expert tower treats all temporal prompts as interchangeable. Zhou, Masmanidis, and Buonomano (2022) showed why this matters structurally: recurrent networks develop smooth duration scaling only when the temporal signal is sustained throughout the interval, not when it arrives as a transient stimulus at t = 0. VLA language instructions are transient by construction — the model receives “wait 5 seconds” once, then runs on observations. The neuroscience predicts categorical timing at best.
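For readers unfamiliar with the two statistics quoted above, the following sketch shows how a cosine-distance span and a Mantel correlation can be computed. The embeddings are made-up toy vectors, not PaliGemma outputs, and comparing against log-duration gaps is an assumption; the permutation test that usually accompanies a Mantel statistic is omitted.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def log_gap(a, b):
    """Distance between two durations in orders of magnitude."""
    return abs(math.log10(a) - math.log10(b))


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mantel_r(items_a, items_b, dist_a, dist_b):
    """Mantel statistic: Pearson r between matching pairwise distances."""
    pairs = list(combinations(range(len(items_a)), 2))
    da = [dist_a(items_a[i], items_a[j]) for i, j in pairs]
    db = [dist_b(items_b[i], items_b[j]) for i, j in pairs]
    return pearson(da, db)


# Numeric durations from the prompts, in seconds ("a moment" has no numeric
# value and is left out of this toy comparison).
durations_s = [5.0, 300.0, 18000.0, 999999.0]

# Hypothetical embeddings: a large shared component plus a tiny ordered offset.
embeddings = [[1.0, 1.0, 1.0, 0.001 * k] for k in range(4)]

r = mantel_r(durations_s, embeddings, log_gap, cosine_distance)
span = max(
    cosine_distance(embeddings[i], embeddings[j])
    for i, j in combinations(range(4), 2)
)
```

The toy vectors reproduce the qualitative pattern described above: the ordering correlates strongly with the true durations while the whole set spans a cosine distance far too small for a downstream module to exploit.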

Fig. IV Transient Stimulus, Categorical Timing
Input signal structure determines what timing strategy the network can learn · Zhou, Masmanidis & Buonomano, PLOS Comp. Biol. 2022
[Figure: two panels, each showing the input instruction signal over time (0 to T) and the learned output wait against commanded duration. Continuous context (signal present throughout the interval): smooth temporal scaling. Transient stimulus (signal present briefly, then gone, e.g. “wait 3 seconds”): categorical output, short or long.]

Fig. IV Zhou, Masmanidis & Buonomano (2022): sustained input → smooth temporal scaling. Transient input → categorical timing only (“short” or “long,” no interpolation). VLA language instructions are transient stimuli.

Three Kinds of Time

Current VLAs conflate three independent temporal quantities.

Position is the frame index — “step 7 of 80.” Every model has this via positional encoding. It says where you are in a sequence, nothing about what happened or how long it took.

Progress is a monotonically non-decreasing function of the task’s semantic state: concretely, the accumulated change in the model’s internal representation from frame to frame. It advances quickly during state transitions (reach → grasp, transport → place) and slowly during steady execution. A trajectory might spend 60% of its frames in transport but accrue only 10% of its task-relevant progress there.

Duration is elapsed wall-clock time since episode start. Uniform, always advancing, semantics-free. A robot running at 30 Hz during a grasp and 5 Hz during a wait produces the same positional increment per frame but vastly different durations per step.

Fig. V Three Kinds of Time
[Figure: task timeline (reach, grasp, transport, wait, place, retract) with three tracks. Position (“frame 7 of 80”): uniform spacing, every VLA has this (exists). Progress (“just finished the grasp”): event-driven, non-uniform, stalls during idle (missing). Duration (“3.2 seconds have elapsed,” axis 0 s to 9.6 s): always advancing, semantics-free, unrecoverable from pixels (missing).]

Fig. V Three independent temporal quantities. Position advances uniformly — every VLA has it. Progress advances non-uniformly — fast during events, stalled during steady state. Duration advances uniformly but independently of both. Current VLAs encode position. They encode neither progress nor duration.

Each missing quantity produces a distinct failure. Without progress, the model allocates equal attention to every frame — transport and grasp are interchangeable. Without duration, the model cannot distinguish a one-second wait from a ten-second wait. And when progress stalls during a static scene, even an event-driven temporal signal flatlines. The three quantities require different solutions because they fail for different reasons.

These are genuinely independent. Position 50 could be early in a long task or late in a short one. Progress 0.8 could arrive at step 20 or step 200. Duration 5 seconds could mean a great deal happened or nothing at all. Collapsing them into a single positional encoding discards two of the three signals.
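A minimal sketch of how the three quantities separate on a single trajectory is given below. The function name `three_clocks`, the one-dimensional latents, and the timestamps are hypothetical; progress is taken to be cumulative latent change normalized to end at 1, which is one reading of the definition above rather than an established formula.

```python
import math


def three_clocks(timestamps_s, latents):
    """Return per-frame (position, progress, duration) for one trajectory.

    position: frame index
    progress: cumulative latent change, normalized to end at 1.0
    duration: wall-clock seconds since the first frame
    """
    steps = [0.0] + [math.dist(a, b) for a, b in zip(latents, latents[1:])]
    total = sum(steps) or 1.0      # avoid dividing by zero if nothing changes
    progress, running = [], 0.0
    for s in steps:
        running += s
        progress.append(running / total)
    return [
        (i, p, t - timestamps_s[0])
        for i, (p, t) in enumerate(zip(progress, timestamps_s))
    ]


# Hypothetical trajectory: about 30 Hz during a grasp, then 5 Hz during a wait.
timestamps = [0.0, 0.033, 0.066, 0.1, 0.3, 0.5, 0.7]
# The latent state stops changing after frame 3 (the scene is static).
latents = [[0.0], [0.4], [0.8], [1.0], [1.0], [1.0], [1.0]]
clocks = three_clocks(timestamps, latents)

pos_a, prog_a, dur_a = clocks[3]   # end of the grasp
pos_b, prog_b, dur_b = clocks[6]   # end of the wait
```

Between frame 3 and frame 6 the position advances by three, the progress does not advance at all, and the duration advances by 0.6 s, which is the sense in which the three quantities are independent.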

Towards an Arrow of Time

Most real scenes are not perfectly static — water roils before it boils, adhesive changes sheen as it cures, a microwave’s turntable rotates. Humans reason from vision too; we read the timer, we do not intuit 30 seconds directly.

V-JEPA’s 3D spatiotemporal representations encode motion patterns that vary with elapsed time. The question worth formalizing is whether a model can learn an internal arrow of time purely from the latent dynamics of pixel movements — a representation where the direction and magnitude of elapsed time emerge from learned structure rather than an injected signal.

But there is a more immediate observation. The i.i.d. impossibility holds for cameras. It does not hold for all sensors. A robot arm’s IMU gyroscope exhibits bias drift — a random walk whose variance grows linearly with time. Its force-torque sensor has thermal noise that shifts with temperature. These are not flaws to be filtered out. They are features.

Fig. VI The Sensor Arrow
[Figure: static scene, camera vs. sensors, over a static interval in which nothing is visually happening. Camera observations: Var = σ² (constant), I(T ; X) = 0, no temporal signal. Gyroscope bias drift: Var = σ²·t (grows), I(T ; X) > 0, an arrow of time. Bottom panel: mutual information with elapsed time from 0 s to 10 s; camera stays at zero, gyro grows with time.]

Fig. VI During a static scene, camera observations are i.i.d. and carry zero temporal information. Gyroscope bias drift is a random walk whose variance grows as σ²·t — a statistical arrow of time that persists when nothing is visually happening.

A model that tracks the running variance of its own sensor drift has access to a signal that a camera-only model provably does not. This has been validated on synthetic random walks but not yet on real robot sensor data — the direction is theoretical, not demonstrated.
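A synthetic version of this idea can be sketched as follows. It assumes an idealized Gaussian random-walk bias with a known per-step standard deviation and a known zero starting bias, and it pools many independent channels. None of these assumptions has been checked against real IMU data, and the channel count is chosen only to make the estimate stable.

```python
import random


def simulate_bias_drift(n_steps, sigma, rng):
    """One random-walk bias trajectory: b_t = b_{t-1} + N(0, sigma^2)."""
    bias, path = 0.0, []
    for _ in range(n_steps):
        bias += rng.gauss(0.0, sigma)
        path.append(bias)
    return path


def estimate_elapsed_steps(biases_now, sigma):
    """Since Var(b_t) = sigma^2 * t, the mean squared bias divided by
    sigma^2 is an unbiased estimate of the elapsed step count t."""
    mean_sq = sum(b * b for b in biases_now) / len(biases_now)
    return mean_sq / (sigma * sigma)


rng = random.Random(0)     # fixed seed so the sketch is repeatable
SIGMA = 0.01               # hypothetical per-step drift standard deviation
N_CHANNELS = 2000          # hypothetical number of independent channels
paths = [simulate_bias_drift(100, SIGMA, rng) for _ in range(N_CHANNELS)]

est_at_10 = estimate_elapsed_steps([p[9] for p in paths], SIGMA)
est_at_100 = estimate_elapsed_steps([p[99] for p in paths], SIGMA)
```

With thousands of channels the estimate tracks elapsed steps closely. With a single channel the same estimator has a relative standard deviation of about √2, so on a real robot with only a handful of drifting sensors the signal would be far noisier than this sketch suggests.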

Fig. VII Two Regimes, One Clock
[Figure: alternating active and static (waiting) phases with three traces. Visual change ‖z_t − z_{t−1}‖ (semantic surprise, visual representation shift) is high during active phases. Sensor entropy Δσ²(t) (motor noise drift) keeps accumulating during static phases. The combined temporal signal τ is monotone and never stops: τ̇ > 0 always.]

Fig. VII Two regimes of temporal signal. During active manipulation, visual representations shift rapidly — a progress signal emerges from what is changing. During static intervals, sensor entropy continues to accumulate. A system that combines both should never lose track of time.

Where to Build It

Several current architectures already contain scalar conditioning pathways that could carry a temporal signal; they are simply aimed at the wrong axis. DiT-based VLAs condition every transformer layer’s gain and bias on a single scalar via AdaLN-Zero (GR00T) or AdaRMS (π0.5). This scalar is t_denoise, the flow-matching denoising step, which resets every inference call. The mechanism is exactly right. The axis is wrong.

Fig. VIII AdaLN-Zero’s Axis
AdaLN-Zero conditioning in DiT-based VLAs · condition(t_denoise) ≠ condition(t_task)
[Figure: two time axes feeding a DiT action head (flow matching conditioned on t_d) that outputs an action chunk. Denoising time t_d runs from 0 to 1, resets each call, and is the axis AdaLN-Zero conditions on. Task time t_task during a static-scene wait runs from 0 s to 20 s and has no current pathway into the DiT.]

Fig. VIII DiT-based VLAs already have a scalar temporal conditioning mechanism, AdaLN-Zero, built into them. It conditions on denoising time t_d ∈ [0, 1], which resets every inference call, rather than task time t_task. During a static scene the same chunk is produced at t_task = 1 s and t_task = 10 s.
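The difference between the two axes can be illustrated with a stripped-down adaptive layer norm. This is a sketch of the general scale-and-shift modulation only: the weights are random rather than learned, the residual gate and zero initialization of AdaLN-Zero are omitted, and concatenating a task-time embedding is a hypothetical injection route, not a description of how GR00T or π0.5 is implemented.

```python
import math
import random

HIDDEN, EMB = 4, 8


def sinusoidal_embedding(t, dim=EMB):
    """Embed a scalar as sin/cos features at several frequencies."""
    half = dim // 2
    freqs = [1.0 / (100.0 ** (k / half)) for k in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def linear(weights, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln(x, cond, w_scale, w_shift):
    """Adaptive layer norm: LN(x) * (1 + scale(cond)) + shift(cond)."""
    scale, shift = linear(w_scale, cond), linear(w_shift, cond)
    return [h * (1.0 + s) + b for h, s, b in zip(layer_norm(x), scale, shift)]


def cond_vector(t_denoise, t_task, use_task_time):
    """Denoising-step embedding, optionally joined by a task-time embedding."""
    task = sinusoidal_embedding(t_task) if use_task_time else [0.0] * EMB
    return sinusoidal_embedding(t_denoise) + task


rng = random.Random(0)
# Hypothetical fixed weights standing in for a trained modulation layer.
w_scale = [[rng.uniform(-0.5, 0.5) for _ in range(2 * EMB)] for _ in range(HIDDEN)]
w_shift = [[rng.uniform(-0.5, 0.5) for _ in range(2 * EMB)] for _ in range(HIDDEN)]
x = [0.3, -1.2, 0.8, 0.1]   # one token of hidden state in the action head

# Current axis: only t_denoise reaches the layer, so task time cannot matter.
same_1s = adaln(x, cond_vector(0.5, 1.0, False), w_scale, w_shift)
same_10s = adaln(x, cond_vector(0.5, 10.0, False), w_scale, w_shift)
# Hypothetical extension: elapsed task time enters through the same pathway.
with_1s = adaln(x, cond_vector(0.5, 1.0, True), w_scale, w_shift)
with_10s = adaln(x, cond_vector(0.5, 10.0, True), w_scale, w_shift)
```

With only the denoising step as input, the modulated activations are identical at 1 s and 10 s of task time. Once task time is embedded alongside it, the same layer produces different activations, which is the sense in which the mechanism already exists and only the axis is missing.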

We ran ten experiments across injection modes, optimization strategies, and training paradigms. The clock signal is learnable: when forced to be the only optimization lever, temporal gates grow and reduce training loss. But the clock alone does not produce duration-specific behavior. The bottleneck is twofold: the language encoder collapses “5 seconds” and “10 seconds” into near-identical embeddings (cosine similarity > 0.999), and independent chunk training provides zero temporal gradient for 97.5% of training data. The clock is a watch without an appointment card: it tells the model what time it is, but the model cannot read what time to act from the prompt.

The question is not whether to give the robot a clock; every deployed robot will have one. The deeper question is whether temporal grounding should be bolted onto the side of a vision-language system or emerge from a physical process already present in the robot’s own sensors. The entropic signal from motor noise suggests a middle path: not an external clock, but an existing physical process used as an informational pathway to overcome the categorical limitations of today’s video backbones.

The strongest version of our claim — that architecture, not training data, is the bottleneck — remains a falsifiable prediction. A Diffusion Policy trained on wait-augmented LIBERO data should have TTFMA ratio below 0.2: it will learn the average wait duration but will not modulate based on the commanded duration, because the observations during the wait are i.i.d. regardless of the label. If the ratio exceeds 0.5, the architectural necessity claim is refuted and the problem is one of data, not structure. We have stated this prediction precisely so that it can be tested.

Duration is the first structural gap we have characterized. It will not be the last.