The Opacity Problem

Watch one large language model (LLM) agent work and you get a wall of text. Watch a thousand of them and you get a machine: a handful of states the agent cycles through on almost every task, a search, edit, execute loop it never declares and the system prompt never specifies. When a run fails, that is where it broke. The catch is that nobody can see it.

This article recovers that machine. From nothing but the agent’s own execution traces, with no labels and no task descriptions, we extract a compact finite-state machine, or FSM, of 7 to 43 states. That machine does the two things an operator actually needs: predict what the agent will do next, and catch a failing run before it wastes the compute.

The whole article follows one thread. Heterogeneous agent traces pass through one deterministic abstraction $\phi$, pile up into a prefix tree, and collapse in a single classical merge into one automaton. That one object, not four bespoke pipelines, is then read four different ways: as workflow memory, a next-step predictor, a failure detector, and a runtime monitor.

How the machine is built, and how it is used

Data and execution flow. On the left, construction (offline, once per agent): each trace message is mapped to a symbol by φ, the symbol sequences accumulate into a prefix tree, and one state-merge collapses it into a single automaton. On the right, the four runtime consumers and what each reads from that same machine. Hover a consumer to trace it back to the automaton.

LLM-based agents are now deployed across demanding domains: resolving GitHub issues (Yang et al., 2024, 2025), navigating websites (Zhou et al., 2024), operating desktops (Xie et al., 2024), managing customer service interactions (Yao et al., 2024), and orchestrating multi-agent pipelines. Following the ReAct pattern, they interleave chain-of-thought reasoning with tool calls, and every run produces an execution trace of tool calls, natural language, and environment feedback. The behavioral structure governing that trace stays implicit.

A coding agent cycles through search, edit, execute. A customer service agent alternates between database queries and user communication. This structure emerges from the interaction between the system prompt, the available tools, and the task distribution, yet it is nowhere written down.

Understanding this latent structure matters the moment you deploy. Safety auditing needs to verify that an agent visits the right states and avoids attack chains. Debugging needs to locate the bottleneck states where agents get stuck. Production monitoring needs to flag behavioral drift before it costs anything.

Yet current agent analysis works at the level of a single trace. It offers no structural model of the behavior that links one run to the next.

A trace, read symbol by symbol, drives a finite-state machine

Pick a dataset, then scrub the tape (drag or scroll) or let it play. The read-head advances through the agent's activity tape (left) and lights up the current FSM state and the transition it takes (right). A successful run threads to submit; a stuck run loops and never reaches it. These are representative traces; the full set of twelve datasets with their real extracted FSMs is in the live dashboard.

The Inverse Problem

We frame behavioral recovery as an inverse problem: given a corpus of execution traces, reconstruct a finite-state machine that explains the observed behavior. This is grammatical inference (Gold, 1967; Oncina & García, 1992), but the classical setting assumes both positive and negative examples. Agent traces give us positive examples only, the runs that happened, with no labeled counter-examples. For a class as rich as the regular languages, identifying the target language in the limit from positive examples alone is impossible (Gold, 1967).

What rescues the problem is a property specific to agents. Unlike arbitrary regular languages, agent behavior is generated by a bounded set of tools and actions, so traces draw from a small activity alphabet, 6 to 42 symbols across our twelve datasets.

A small alphabet is the whole reason this works. It makes the compact automaton both small and, as later chapters show, statistically dense enough to predict from. The only modeling choice in the entire pipeline is how a message becomes a symbol.

Twelve Datasets, Eight Domains

We evaluate on twelve public datasets spanning eight domains: coding, web navigation, desktop GUI, desktop OS, mobile GUI, customer service, multi-agent coordination, and safety. The trace counts below are the trajectories we use in our experiments, from 184 to 8,337 each, not the size of each source corpus: for the largest sources we draw a fixed slice rather than the whole set.

Dataset	Domain	Traces used	Actions	States	Fitness
SWE-smith	Coding	500	9	10	1.000
SWE-agent	Coding	2,000	24	25	0.999
Mind2Web	Web	500	7	8	1.000
WebArena	Web	8,337	24	25	1.000
AgentNet	Desktop GUI	5,000	24	25	1.000
GUI-Odyssey	Mobile GUI	7,735	6	7	1.000
Who & When	Multi-agent	184	8	9	1.000
tau2-bench airline	Customer service	800	17	18	1.000
tau2-bench retail	Customer service	1,824	18	19	1.000
tau2-bench telecom	Customer service	1,824	42	43	1.000
ATBench	Safety	1,000	14	15	1.000
OSWorld	Desktop OS	2,166	26	27	0.997

Every dataset replays held-out traces at fitness of at least 0.997.

The three largest source datasets are subsampled to a fixed slice: SWE-agent uses 2,000 of the 80,036 available trajectories, Mind2Web 500 of 2,350, and AgentNet 5,000 from the OpenCUA Ubuntu subset. The other nine datasets are used in full.

Every figure in this article is a fixed snapshot; every dataset and every case is live in the interactive dashboard, which renders all twelve datasets straight from the experiment outputs: the FSMs, the failure predictor, the runtime monitor, and more.

Gold, E. M. (1967). Language Identification in the Limit. Information and Control, 10(5), 447–474.
Oncina, J., & García, P. (1992). Inferring Regular Languages in Polynomial Updated Time. Pattern Recognition and Image Analysis, 49–61.
Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., & Yu, T. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. The Thirty-Eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=tN61DTr4Ed
Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., & Press, O. (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. Advances in Neural Information Processing Systems (NeurIPS).
Yang, J., Lieret, K., Jimenez, C. E., Wettig, A., Khandpur, K., Zhang, Y., Hui, B., Press, O., Schmidt, L., & Yang, D. (2025). SWE-smith: Scaling Data for Software Engineering Agents. Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Datasets & Benchmarks Track.
Yao, S., Narasimhan, K., & others. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv Preprint arXiv:2406.12045.
Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Bisk, Y., Fried, D., Alon, U., & Neubig, G. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. Proceedings of the International Conference on Learning Representations (ICLR).

From Traces to Symbols

An agent execution trace is a sequence of messages $\tau = (m_1, m_2, \ldots, m_T)$, where each message has a role (system, user, assistant, tool) and content. The first step is to map each message to a symbolic activity from a finite alphabet.

Activity Extraction

An activity extraction function $\phi: m_t \mapsto a_t \in \mathcal{A}$ maps each message to a symbol. We apply three rules in priority order:

Tool calls: if a message contains a tool_call field, the activity is the function name (for example bash, search_flight, click).
Action tags: if the content contains [ACTION] description, the activity is the action label.
Command extraction: for agents that act through code blocks, we extract the first command token and map it to a semantic category (edit, search, navigate, execute).

If no rule matches, the activity defaults to role:content_type (for example assistant:text).

The extraction is entirely deterministic and format-specific, with no LLM calls. The whole process completes in milliseconds.
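The three rules and the default can be sketched as a single function. This is a minimal illustration, not the extraction code used for the experiments: the message layout (a dict with `role`, `content`, and an optional `tool_call` holding a `name`), the `COMMAND_CATEGORIES` table, and the fixed `text` content type in the fallback are all assumptions made for the example.

```python
import re

# Hypothetical mapping from a first command token to a semantic category.
COMMAND_CATEGORIES = {
    "sed": "edit", "edit": "edit",
    "grep": "search", "find": "search",
    "cd": "navigate", "ls": "navigate",
    "python": "execute", "pytest": "execute",
}

FENCE = "`" * 3  # a code-block fence, built this way to keep the sketch readable
ACTION_TAG = re.compile(r"\[ACTION\]\s*(\w+)")
CODE_BLOCK = re.compile(re.escape(FENCE) + r"\w*\n\s*(\S+)")


def extract_activity(message: dict) -> str:
    """Map one trace message to an activity symbol (the role played by phi)."""
    content = message.get("content") or ""

    # Rule 1: a tool call wins; the activity is the function name.
    tool_call = message.get("tool_call")
    if tool_call:
        return tool_call["name"]

    # Rule 2: an explicit [ACTION] tag; the first word after it is the label.
    tag = ACTION_TAG.search(content)
    if tag:
        return tag.group(1)

    # Rule 3: first command token of a code block, mapped to a category.
    block = CODE_BLOCK.search(content)
    if block and block.group(1) in COMMAND_CATEGORIES:
        return COMMAND_CATEGORIES[block.group(1)]

    # Default: role:content_type (content type fixed to "text" in this sketch).
    return f"{message.get('role', 'unknown')}:text"
```

Because the rules are tried in order and each returns on the first match, the same message always yields the same symbol, which is the determinism the pipeline relies on.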

An Example

Consider a coding agent trace from SWE-agent with 47 messages. The raw trace contains system prompts, file contents, error messages, and tool invocations. After extraction, the activity sequence is:

init → user → search → user → edit → user → execute → user → edit → user → submit

From 47 messages and thousands of tokens, we get 11 symbols drawn from an alphabet of 24 possible activities. This is the sequence the FSM will model.

Trace Explorer

Step through an agent trace to see how raw messages map to symbolic activities. Each message is classified by the extraction rules into one of the agent's activity symbols.

Replay any real trace symbol by symbol in the trace view.

Why This Works

These alphabets are small by construction: 6 to 42 symbols across the twelve datasets, against the tens of thousands of tokens in a natural-language vocabulary. That is what keeps FSM extraction tractable.

Even a 42-tool telecom customer service agent (tau2-bench telecom) needs only 43 states to capture its behavioral structure. The alphabet is bounded because the agent’s capabilities are bounded: it can only call the tools it has been given.

$\phi$ is the one design decision in the pipeline, so we stress-test it. Fitness stays above 0.999 across all four granularities, from role-only (two to four symbols) to full tool-level. Failure prediction stays within 0.03 AUROC (area under the ROC curve) on any dataset. The rules above are one valid setting, not the only one.

Building the Machine

Given a corpus of activity sequences, FSM construction is two steps and nothing else: build a prefix tree, then merge structurally equivalent states. There are no thresholds, no number of clusters, no learning rate.

Step 1: Prefix Tree

Insert all activity sequences into a trie. Each unique prefix becomes a distinct state. The prefix tree has perfect training fitness (it replays every training trace exactly), but it can have tens of thousands of states.

For SWE-agent (2,000 traces), the prefix tree has 59,510 states. Most are visited once and represent memorized suffixes rather than reusable transition patterns.
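A minimal sketch of the prefix-tree step, assuming each trace has already been reduced to a list of activity symbols. The list-of-dicts representation and the function name are choices made for this example only.

```python
def build_prefix_tree(sequences):
    """Return the trie as a list of dicts: children[state][symbol] -> state id."""
    children = [{}]  # state 0 is the root (empty prefix)
    for seq in sequences:
        state = 0
        for symbol in seq:
            if symbol not in children[state]:
                children.append({})  # a new prefix gets a new state
                children[state][symbol] = len(children) - 1
            state = children[state][symbol]
    return children


traces = [
    ["init", "search", "edit"],
    ["init", "search", "submit"],
    ["init", "edit"],
]
tree = build_prefix_tree(traces)
print(len(tree))  # one state per distinct prefix, including the empty one
```

Each symbol is touched once, so construction is linear in the total trace length, and every distinct prefix costs one state. That is why the tree grows with the corpus rather than with the alphabet.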

Step 2: Structural Merging

Two states $q$ and $q'$ are structurally equivalent if, for every activity $a \in \mathcal{A}$: (i) $\delta(q, a)$ is defined exactly when $\delta(q', a)$ is defined, and (ii) the targets are themselves equivalent. Here $\delta$ is the transition function. This recursion is computed bottom-up in a single pass.

Structural equivalence is exactly the Myhill-Nerode equivalence on the observed prefix language. By the Myhill-Nerode theorem the quotient is the unique minimal deterministic finite automaton (DFA), so no smaller automaton can reproduce the observed behavior (Hopcroft et al., 2006).
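The recursive definition can be sketched with bottom-up signatures on the prefix tree. This is an illustration of the equivalence as stated, under two assumptions: every state is treated as accepting (the prefix language), and the trie uses the list-of-dicts layout built by the hypothetical helper below. It is a toy and does not attempt to reproduce the state counts reported in this article.

```python
def build_prefix_tree(sequences):
    """Trie as a list of dicts: children[state][symbol] -> state id."""
    children = [{}]
    for seq in sequences:
        state = 0
        for symbol in seq:
            if symbol not in children[state]:
                children.append({})
                children[state][symbol] = len(children) - 1
            state = children[state][symbol]
    return children


def structural_classes(children):
    """Give each trie state a class id; equal ids mean structurally equivalent."""
    class_of = [None] * len(children)
    ids = {}
    # Children are always created after their parents, so a reverse scan
    # visits every target before its source (bottom-up).
    for state in range(len(children) - 1, -1, -1):
        signature = frozenset(
            (symbol, class_of[target]) for symbol, target in children[state].items()
        )
        class_of[state] = ids.setdefault(signature, len(ids))
    return class_of


tree = build_prefix_tree([["a", "b"], ["c", "b"]])
classes = structural_classes(tree)
print(len(tree), "trie states ->", len(set(classes)), "classes")
```

In the toy input the two leaves share the empty signature, and the states reached by `a` and by `c` share the signature "one `b` edge into a leaf", so five trie states fall into three classes.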

The SWE-agent prefix tree with 59,510 states collapses to just 25 states, a 2,380$\times$ compression, and the resulting FSM still replays held-out traces at 0.999 fitness.

Prefix Tree to FSM Collapse

Watch a prefix tree collapse into a compact FSM through structural merging. States with identical outgoing transition patterns are merged iteratively until no further merge is possible.

The Resulting FSM

The FSM $\mathcal{M} = (Q, \mathcal{A}, \delta, q_0)$ encodes the agent’s behavioral topology. Recurring patterns become loops, and the state count tracks the number of distinct behavioral modes.

In the tau2-bench retail and telecom customer service agents, a tool-call loop (assistant:tool_call to tool:text) dominates execution, with the conversational path through assistant:text as a separate branch. In a coding agent, the search, edit, execute cycle accounts for most of the trace.

Explorable FSM Graphs

Select a dataset to explore its extracted FSM. Node size reflects visit frequency; edge thickness reflects transition frequency. Drag nodes to rearrange, scroll to zoom, drag the background to pan, and double-click to reset the view.

Open the FSM explorer for any of the twelve datasets in the live dashboard.

Theoretical Properties

The construction guarantees three properties.

Fitness preservation. Structural merging preserves training fitness: if a trace is accepted by the prefix tree, it is accepted by the merged FSM, because merging only adds out-edges (each state carries the union of its merged transitions).

Compactness. The merged FSM is a compact directly-follows automaton: one state per activity, deterministic, accepting every observed trace. We recover this, not the generating automaton, which is impossible to identify from positive examples alone (Gold, 1967), but it is enough for faithful replay and prediction.
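The object described here, one state per activity plus an initial state, matches the dataset table, where every state count equals the action count plus one. A minimal sketch of such a directly-follows automaton and of replay fitness follows. It builds the automaton straight from the traces rather than through the merge, and it takes fitness to be the fraction of traces replayed end to end; both are assumptions of this example, as are the `START` label and the function names.

```python
START = "<start>"  # hypothetical name for the initial state


def directly_follows_fsm(sequences):
    """delta[state] = set of activities observed directly after that state."""
    delta = {START: set()}
    for seq in sequences:
        state = START
        for symbol in seq:
            delta.setdefault(state, set()).add(symbol)
            delta.setdefault(symbol, set())  # make sure the target state exists
            state = symbol  # one state per activity
    return delta


def accepts(delta, seq):
    """Replay a trace; fail on the first transition never seen in training."""
    state = START
    for symbol in seq:
        if symbol not in delta.get(state, set()):
            return False
        state = symbol
    return True


def fitness(delta, sequences):
    """Fraction of traces that replay end to end."""
    return sum(accepts(delta, s) for s in sequences) / len(sequences)


train = [
    ["init", "search", "edit", "submit"],
    ["init", "edit", "execute", "edit", "submit"],
]
fsm = directly_follows_fsm(train)
print(len(fsm), fitness(fsm, train))
```

The sketch also shows why such a machine generalizes: the unseen trace `init, search, edit, execute, edit, submit` replays because every adjacent pair was observed, while `init, submit` is rejected.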

Linear runtime. Prefix tree construction is $O(\sum_i T_i)$, and structural merging is a partition refinement in $O(|Q_\mathcal{P}| \cdot |\mathcal{A}|)$. In practice all twelve datasets complete in under one second on a single CPU core.

The construction itself is classical (Daciuk et al., 2000). What is new is that bounded agent alphabets make the resulting compact automaton small enough, and dense enough per state, to be useful for the prediction and monitoring tasks that follow.

Daciuk, J., Mihov, S., Watson, B. W., & Watson, R. E. (2000). Incremental Construction of Minimal Acyclic Finite-State Automata. Computational Linguistics, 26(1), 3–16.
Gold, E. M. (1967). Language Identification in the Limit. Information and Control, 10(5), 447–474.
Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2006). Introduction to Automata Theory, Languages, and Computation (3rd ed.). Pearson.

Why the Machine Stays Small

Minimality alone does not make a machine useful. A minimal automaton can still be huge if the language is complex, and a huge automaton spreads its observations thinly, leaving every per-state statistic noisy. What makes the FSM a usable substrate is that it is small, stable, and converges fast, so each state pools enough traces to estimate from. That property, not the exact count, is what the prediction and monitoring chapters depend on.

It converges on a few percent of the data

Structure stabilizes almost immediately. Across five random train/test splits the extracted state count is identical every time, zero variance, so the topology is a property of the agent, not of which traces you happened to sample. Replay fitness plateaus just as fast: within the first 1 to 10% of the training traces every dataset clears 0.95 fitness, and SWE-smith holds 0.9996 from the first 1%.

Three different methods agree

The compactness is not an artifact of our particular merge rule. Three fundamentally different algorithms converge to nearly the same state count on every dataset:

Structural merging (ours), a structural partition.
Alergia (Carrasco & Oncina, 1994), a statistical merge, lands within 1.0 to 6.0$\times$ of our state count.
Hidden Markov model (HMM) (Rabiner, 1989), a probabilistic latent-state model, matches the count, though its states are not interpretable.

A structural, a statistical, and a probabilistic method agreeing rules out an algorithmic coincidence.

Carrasco, R. C., & Oncina, J. (1994). Learning Stochastic Regular Grammars by Means of a State Merging Method. International Colloquium on Grammatical Inference, 139–152.
Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257–286.

Compression and Comparison

We compare against nine baselines from automata learning (Carrasco & Oncina, 1994; Oncina & García, 1992), HMMs (Rabiner, 1989), process mining, and agent workflow extraction (Wang et al., 2024). All receive only the same positive training sequences, no failure labels.

How Much Smaller

Our FSMs achieve 15 to 3,036$\times$ compression over RPNI while replaying held-out traces at fitness of at least 0.997. The ratio grows with trace length and branching: 15$\times$ on WebArena (short web traces, where RPNI succeeds) up to 3,036$\times$ on GUI-Odyssey (long, repetitive mobile-GUI traces, where RPNI’s prefix tree explodes to 21,255 states against our 7).
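The compression figures are plain state-count ratios. As a quick check, using only numbers quoted in this article (the function name is ours):

```python
def compression(baseline_states: int, fsm_states: int) -> float:
    """Ratio of baseline state count to FSM state count."""
    return baseline_states / fsm_states


# GUI-Odyssey: RPNI's 21,255 states against our 7.
gui_odyssey = compression(21_255, 7)
# SWE-agent: the 59,510-state prefix tree against the 25-state FSM.
swe_agent_prefix_tree = compression(59_510, 25)

print(round(gui_odyssey), round(swe_agent_prefix_tree))
```

Note that the two ratios use different baselines: the first is against RPNI, the second against the raw prefix tree.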

Compression Ratios

Each dataset's state count on a log scale: ours (dark slate) versus RPNI (light slate). The distance between the two dots is the compression, from 15x on WebArena to 3,036x on GUI-Odyssey, all at replay fitness of at least 0.997.

Compare all eight methods interactively in the baselines view, or watch structure stabilize in the convergence view.

Convergence and Stability

Replay fitness reaches its plateau well before the training set is exhausted. On SWE-agent it is already at 0.985 within 1% of the training traces and settles at 0.996 by 10%, while the state count keeps inching up as rare command patterns appear. Structured tool-call domains converge fastest: SWE-smith holds 0.9996 from the first 1% of data. Open web and delegation traces take longer, with Mind2Web needing 5% and Who&When 10% of their traces to clear 0.95 fitness.

Convergence Curves

Replay fitness as training traces accumulate, for the four datasets with incremental-convergence runs. The dashed marker shows where each first reaches 0.95 fitness, from 1% of the data on the coding datasets to 10% on Who&When.

Baselines at a Glance

Method Comparison

State count and fitness across five methods and twelve datasets. Our FSM holds the smallest state count among non-degenerate methods while keeping the highest fitness.

RPNI without negative examples keeps large portions of the prefix tree (382 to 63,897 states) at degraded fitness.
Alergia, the strongest competitor, matches our fitness but uses 1.0 to 6.0$\times$ more states.
HMM matches our state count but produces non-interpretable latent states.
EDSM (evidence-driven state merging) without negatives collapses to a trivial 1-state acceptor.
k-Tails needs a hyperparameter and produces 1.4 to 10$\times$ more states than ours at $k=1$, with state counts exploding past $k=2$.
Process mining miners reach high fitness but precision 0.00 to 0.80, the “flower model” problem where every activity is reachable from every state.

Precision

The FSM is more than a vocabulary. It rejects every random trace, and at least 99.9% of permuted traces that keep the activity set but scramble the order. Even single-symbol mutations, a substitution or an insertion or an adjacent swap, are blocked 77 to 100% of the time. RPNI, with its thousands of states, accepts 75% of those same permuted traces on WebArena.
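A permutation test of this kind can be sketched as follows. The automaton here is the simple directly-follows construction (one state per activity), and the shuffling protocol, the seed, and the choice to skip shuffles that leave a trace unchanged are assumptions of this example, not a description of the evaluation behind the numbers above.

```python
import random

START = "<start>"  # hypothetical name for the initial state


def directly_follows_fsm(sequences):
    """delta[state] = set of activities observed directly after that state."""
    delta = {START: set()}
    for seq in sequences:
        state = START
        for symbol in seq:
            delta.setdefault(state, set()).add(symbol)
            delta.setdefault(symbol, set())
            state = symbol
    return delta


def accepts(delta, seq):
    state = START
    for symbol in seq:
        if symbol not in delta.get(state, set()):
            return False
        state = symbol
    return True


def permutation_rejection_rate(delta, traces, seed=0):
    """Shuffle each trace (same activities, scrambled order) and count rejections."""
    rng = random.Random(seed)
    rejected = total = 0
    for trace in traces:
        shuffled = list(trace)
        rng.shuffle(shuffled)
        if shuffled == list(trace):
            continue  # the shuffle left the order unchanged; not a real permutation
        total += 1
        rejected += not accepts(delta, shuffled)
    return rejected / total if total else float("nan")


traces = [["init", "search", "edit", "execute", "submit"]] * 50
fsm = directly_follows_fsm(traces)
rate = permutation_rejection_rate(fsm, traces)
```

In this toy corpus the automaton is a single chain, so any reordering of a trace breaks at least one transition and is rejected. Real corpora have loops and branches, which is where partial acceptance of permuted traces can appear.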

Carrasco, R. C., & Oncina, J. (1994). Learning Stochastic Regular Grammars by Means of a State Merging Method. International Colloquium on Grammatical Inference, 139–152.
Oncina, J., & García, P. (1992). Inferring Regular Languages in Polynomial Updated Time. Pattern Recognition and Image Analysis, 49–61.
Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257–286.
Wang, Z. Z., Mao, J., Fried, D., & Neubig, G. (2024). Agent Workflow Memory. arXiv Preprint arXiv:2409.07429.

Prediction: Next Step and Failure

The same FSM state answers both questions an operator asks: what the agent will do next, and whether this run is heading for failure. The estimand is one object, the per-state transition distribution, and compactness is what makes it reliable.

Next-Step Prediction

At each step the predictor estimates $P(a_t \mid \text{context})$, scored by cross-entropy in bits via 5 $\times$ 5-fold cross-validation (CV). Conditioning on FSM state alone, with no learning beyond an order-1 Markov model, accounts for 83 to 99% of the total cross-entropy improvement on each dataset. The average drops to 0.93 bits, a 62% cut from the unigram baseline at 2.44 bits.

The cleanest test holds the predictor fixed and adds FSM state as a feature. Under absolute discounting, conditioning on FSM state lowers cross-entropy by 0.155 bits on average (0.580 vs 0.735) and helps on every one of six datasets, from 0.016 bits on SWE-agent to 0.364 bits on Mind2Web, whose branching web-action vocabulary gains most from knowing where in the workflow it is. Combine FSM state with a small learned model and you get the best predictor in the study at 0.73 bits.
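To make the comparison concrete, here is a minimal sketch of a state-conditioned predictor with absolute discounting, scored in bits against a unigram baseline. It assumes the FSM state is simply the previous activity (one state per activity), interpolates with an add-one-smoothed unigram, and uses a discount of 0.5; those choices, and all function names, are illustrative rather than the configuration behind the reported numbers.

```python
import math
from collections import Counter, defaultdict

START = "<start>"  # hypothetical name for the initial state


def train_counts(sequences):
    """Count activities overall and per preceding state."""
    unigram, by_state = Counter(), defaultdict(Counter)
    for seq in sequences:
        state = START
        for a in seq:
            unigram[a] += 1
            by_state[state][a] += 1
            state = a
    return unigram, by_state


def unigram_prob(unigram, a, vocab_size, alpha=1.0):
    """Add-alpha smoothed unigram probability."""
    return (unigram[a] + alpha) / (sum(unigram.values()) + alpha * vocab_size)


def discounted_prob(by_state, unigram, state, a, vocab_size, d=0.5):
    """Absolute discounting: subtract d from each seen count, back off to unigram."""
    backoff = unigram_prob(unigram, a, vocab_size)
    counts = by_state.get(state)
    if not counts:
        return backoff
    total = sum(counts.values())
    reserved = d * len(counts) / total  # probability mass freed by discounting
    return max(counts[a] - d, 0.0) / total + reserved * backoff


def cross_entropy_bits(sequences, prob):
    """Average negative log2 probability per symbol; prob(state, action) -> float."""
    bits, n = 0.0, 0
    for seq in sequences:
        state = START
        for a in seq:
            bits -= math.log2(prob(state, a))
            n += 1
            state = a
    return bits / n


data = [["init", "search", "edit", "execute", "submit"]] * 20
unigram, by_state = train_counts(data)
V = len(unigram)
h_unigram = cross_entropy_bits(data, lambda s, a: unigram_prob(unigram, a, V))
h_state = cross_entropy_bits(data, lambda s, a: discounted_prob(by_state, unigram, s, a, V))
```

On this deliberately rigid toy corpus the unigram needs about $\log_2 5$ bits per step while the state-conditioned model needs almost none, because each state has a single observed successor. Real traces sit between those extremes.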

Workflow Memory

The gain generalizes to the agent’s own LLM. Feeding the current FSM state as context for choosing the next action beats Agent Workflow Memory (AWM) (Wang et al., 2024) on all eight ground-truth datasets, with six gaps significant at $p < 10^{-8}$.

Dataset	N	AWM	Ours	Δ
WebArena	4,800	65.5	81.2	+15.7
SWE-smith	300	74.7	100.0	+25.3
SWE-agent	1,200	67.7	70.5	+2.8
tau2-bench telecom	1,095	28.5	45.6	+17.1
tau2-bench retail	1,095	52.9	65.1	+12.2
tau2-bench airline	480	56.5	57.3	+0.8
ATBench	600	47.8	62.5	+14.7
OSWorld	1,286	55.0	70.7	+15.7

AWM extracts workflows from successful traces only, so on low-success datasets it has little to say. That is exactly where the gap is widest.

Handing the LLM the full FSM, every state and transition, actually loses to AWM (52.2% vs 52.9% on tau2-bench retail). Dumping the whole graph drowns the next-step signal. The format that wins at 65.1% is minimal: next-action probabilities plus a few top continuations from the current state, with no structure dump. Finding the right minimal context is part of the contribution, the same way AWM’s linear-workflow format is part of its.

Context given to the LLM (tau2-bench retail)	Top-1 %
No memory, just the trace so far	27.6
Linear workflows from successful runs (AWM)	52.9
Full machine: current state, every transition, the whole graph	52.2
Full machine, plus multi-step continuations	49.2
Full machine, from successful traces only	50.3
Minimal: next-action probabilities and a few likely continuations	65.1
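A sketch of what the minimal format in the last row could look like when rendered from transition counts. The exact wording, the `top_k` and `horizon` values, and the greedy way continuations are unrolled are all hypothetical; the article specifies only the ingredients (next-action probabilities and a few likely continuations from the current state).

```python
from collections import Counter

START = "<start>"  # hypothetical name for the initial state


def transition_counts(sequences):
    """by_state[state][action] = how often action directly followed state."""
    by_state = {}
    for seq in sequences:
        state = START
        for a in seq:
            by_state.setdefault(state, Counter())[a] += 1
            state = a
    return by_state


def minimal_context(by_state, state, top_k=3, horizon=3):
    """Render next-action probabilities and a few greedy continuations as text."""
    counts = by_state.get(state, Counter())
    total = sum(counts.values())
    top = counts.most_common(top_k)

    lines = [f"Current state: {state}", "Next-action probabilities:"]
    for action, count in top:
        lines.append(f"  {action}: {count / total:.2f}")

    lines.append("Likely continuations:")
    for action, _count in top:
        path, current = [action], action
        for _step in range(horizon - 1):
            successors = by_state.get(current)
            if not successors:
                break  # terminal state: nothing was ever observed after it
            current = successors.most_common(1)[0][0]
            path.append(current)
        lines.append("  " + " -> ".join(path))
    return "\n".join(lines)


runs = [["search", "edit", "submit"]] * 3 + [["search", "edit", "execute", "edit", "submit"]]
context = minimal_context(transition_counts(runs), "edit")
print(context)
```

The output mentions only the current state and its immediate neighborhood, which is the contrast with the full-machine rows in the table.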

Step through the workflow-memory comparison per dataset in the memory view.

Predicting Failure

Replay a trace through the FSM and read off per-state behavioral features (visit frequency, message-length statistics, error rate, early/late entropy drift) plus five cross-entropy anomaly features. A single gradient-boosted classifier on a fixed 80/20 split reaches held-out AUROC up to 0.94.

Raw fitness is useless here (AUROC near 0.50): successful and failed traces both replay perfectly. The signal is in the per-state decomposition and in surprise: failing traces take low-probability transitions under the FSM.
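Two of the feature families, per-state visit frequency and transition surprise, can be sketched from symbol sequences alone. Message-length statistics, error rate, and entropy drift need the raw messages and are left out. The probability floor for unseen transitions and the feature names are assumptions of this example, and no classifier is trained here.

```python
import math
from collections import Counter

START = "<start>"  # hypothetical name for the initial state


def transition_counts(sequences):
    """by_state[state][action] = how often action directly followed state."""
    by_state = {}
    for seq in sequences:
        state = START
        for a in seq:
            by_state.setdefault(state, Counter())[a] += 1
            state = a
    return by_state


def trace_features(seq, by_state, states, floor=1e-6):
    """Visit frequency per state plus mean and max surprisal (bits) of the trace."""
    visits = Counter(seq)
    surprisals = []
    state = START
    for a in seq:
        counts = by_state.get(state, Counter())
        total = sum(counts.values())
        p = counts[a] / total if total else 0.0
        surprisals.append(-math.log2(max(p, floor)))  # floor avoids log2(0)
        state = a
    features = {f"visit:{s}": visits[s] / len(seq) for s in states}
    features["mean_surprisal_bits"] = sum(surprisals) / len(surprisals)
    features["max_surprisal_bits"] = max(surprisals)
    return features


train = [["init", "search", "edit", "submit"]] * 9 + [["init", "edit", "edit", "submit"]]
by_state = transition_counts(train)
states = ["init", "search", "edit", "submit"]
typical = trace_features(["init", "search", "edit", "submit"], by_state, states)
atypical = trace_features(["init", "edit", "edit", "submit"], by_state, states)
```

Both traces replay without error, so raw fitness cannot separate them, but the second takes two rare transitions and its mean surprisal is far higher. A feature vector like this is what a downstream classifier would consume.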

Failure prediction scales with machine size: more states give a finer map of where a run can go wrong. The 43-state telecom agent tops out at 0.941; WebArena (0.903) and AgentNet (0.890) follow; SWE-agent, with 25 states, reaches 0.799. ATBench, the only safety-labeled benchmark, reaches 0.894 (0.864 ± 0.024 under repeated CV). Across all eight real-trace datasets the CV standard deviation stays in 0.012 to 0.031, so these are not single-split artifacts.

Failure Prediction

Held-out AUROC across the nine labeled datasets, with the top predictive feature for each. Larger FSMs (more states, more tools) predict better.

Inspect per-state feature importances and failure modes in the failure view.

The predictors are interpretable. On SWE-agent the single strongest feature is whether the trace reaches the submit state: 94.8% of successes get there, only 55.7% of failures do. And this is not a length proxy: structural features score 0.790 against 0.659 for trace length alone. Successful runs touch only 9 of 25 states along a focused search, edit, submit path, while failures spread across all 25 (Jaccard overlap 0.206).

Runtime Monitor

Deployed online, a two-rule monitor fires when the cycle-rate exceeds 0.778 and the unique-state count clears a warm-up floor. On all four evaluated datasets it reaches rank-AUROC 0.66 at the 25% trace checkpoint, against 0.5 for a flag-everything baseline by construction. On SWE-agent it fires at 32% of trace completion, stopping the run before the remaining two-thirds of its compute is spent (precision 85.9%, recall 95.5%). By the halfway checkpoint, FSM features alone already recover 92% of the full-trace signal.
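A minimal sketch of such a two-rule monitor. The 0.778 threshold is the one quoted above; everything else is an assumption of this example: cycle-rate is taken to be the fraction of steps so far that re-enter an already visited state, and the warm-up floor is set to three unique states.

```python
def monitor(trace, cycle_threshold=0.778, min_unique_states=3):
    """Return the 1-based step at which the monitor fires, or None if it never does."""
    seen = set()
    revisits = 0
    for step, state in enumerate(trace, start=1):
        if state in seen:
            revisits += 1
        seen.add(state)
        cycle_rate = revisits / step
        # Rule 1: the run is mostly revisiting states it has already been in.
        # Rule 2: it has seen enough distinct states to be past warm-up.
        if cycle_rate > cycle_threshold and len(seen) >= min_unique_states:
            return step
    return None


stuck = ["search", "edit", "execute"] + ["edit", "execute"] * 10
healthy = ["search", "edit", "execute", "submit"]
fired_at = monitor(stuck)
```

Each step costs one set lookup and one comparison, in line with the replay-only design: no model is evaluated in the loop.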

When 84% of runs fail, flagging everything scores a high F1 by default (0.914, versus the monitor’s 0.904). The point of a monitor is not whether to flag but when. Rank-AUROC measures exactly that early-warning utility, which a base-rate-inflated F1 hides. The pipeline is FSM replay only, 0.006 ms per step, with no ML model in the loop.
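The two F1 values can be checked by hand, assuming the 84% failure rate and the precision and recall quoted for SWE-agent describe the same evaluation. With precision $P$ and recall $R$,

$$F_1 = \frac{2PR}{P + R}.$$

Flagging every run gives $R = 1$ and $P$ equal to the failure rate, so

$$F_1^{\text{flag-all}} = \frac{2 \cdot 0.84}{0.84 + 1} = \frac{1.68}{1.84} \approx 0.913,$$

which agrees with the reported 0.914 up to rounding of the base rate. For the monitor,

$$F_1^{\text{monitor}} = \frac{2 \cdot 0.859 \cdot 0.955}{0.859 + 0.955} = \frac{1.6407}{1.814} \approx 0.904.$$

The two scores are nearly identical even though only one of the two policies carries any timing information, which is why F1 is the wrong lens here.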

Watch the monitor flag a failing run in real time in the monitor view.

Wang, Z. Z., Mao, J., Fried, D., & Neubig, G. (2024). Agent Workflow Memory. arXiv Preprint arXiv:2409.07429.

Discussion

When Does It Work?

The four tasks in this article, workflow memory, next-step prediction, failure detection, and runtime monitoring, look different, yet one object served all of them. When a system’s action vocabulary is bounded, its behavioral topology is bounded too, and a compact, stable automaton is the natural summary. Beating four bespoke pipelines with it is a consequence, not a design goal.

The topology is also model-invariant: a single FSM achieves perfect fitness across four large language models on the same task, so it is shaped by the system, the tools and prompts and task distribution, more than by the model driving it. It stays stable across extraction granularities too, shifting failure-prediction AUROC by less than 0.03 over four levels of $\phi$.

A 25-state machine can be read and checked by a person; the 59,510-state prefix tree it came from cannot. That auditability is a direct dividend of minimality.

Limitations

The FSM accepts the observed prefix language, not the agent’s true generating language: like any trace-replay method, it cannot tell an illegitimate trace that stays within the observed transition patterns from a legitimate one. The extraction function $\phi$ needs a small amount of per-domain knowledge, and fully automatic discovery of it is future work. Failure prediction degrades on simpler machines: AUROC falls to 0.799 on SWE-agent and 0.70 on the 10-state SWE-smith, smaller task spaces with less structure to exploit. Extending the workflow-memory comparison beyond AWM to other memory-injection methods is future work.

For agents with much larger action spaces or weaker sequential structure, the construction stays minimal but stops being compact, and the per-state observation density that drives every result above would degrade with it.

Broader Impact

Compact FSM representations make agent behavioral structure inspectable, which supports safety auditing. The same analysis could be misused to find exploitable behavioral patterns, so deployment should restrict FSM analysis to authorized auditing.

Conclusion

A finite-state machine, built in milliseconds from positive examples with one classical merge, does the work of four bespoke learned pipelines. The same 7-to-43-state object remembers workflows (beating AWM on all eight datasets), predicts the next action (62% lower cross-entropy than a unigram), predicts failure (held-out AUROC up to 0.94), and stops bad runs early at 32% of trace completion. It uses 15 to 3,036$\times$ fewer states than RPNI, and its state count is identical across every random split.

Automata from Agent Traces
