The Opacity Problem
Watch one large language model (LLM) agent work and you get a wall of text. Watch a thousand of them and you get a machine: a handful of states the agent cycles through on almost every task, a search, edit, execute loop it never declares and the system prompt never specifies. When a run fails, that is where it broke. The catch is that nobody can see it.
This article recovers that machine. From nothing but the agent’s own execution traces, with no labels and no task descriptions, we extract a compact finite-state machine, or FSM, of 7 to 43 states. That machine does the two things an operator actually needs: predict what the agent will do next, and catch a failing run before it wastes the compute.
The whole article follows one thread. Heterogeneous agent traces pass through one deterministic abstraction , pile up into a prefix tree, and collapse in a single classical merge into one automaton. That one object, not four bespoke pipelines, is then read four different ways: as workflow memory, a next-step predictor, a failure detector, and a runtime monitor.
LLM-based agents are now deployed across demanding domains: resolving GitHub issues (Yang et al., 2024, 2025), navigating websites (Zhou et al., 2024), operating desktops (Xie et al., 2024), managing customer service interactions (Yao et al., 2024), and orchestrating multi-agent pipelines. Following the ReAct pattern, they interleave chain-of-thought reasoning with tool calls, and every run produces an execution trace of tool calls, natural language, and environment feedback. The behavioral structure governing that trace stays implicit.
A coding agent cycles through search, edit, execute. A customer service agent alternates between database queries and user communication. This structure emerges from the interaction between the system prompt, the available tools, and the task distribution, yet it is nowhere written down.
Understanding this latent structure matters the moment you deploy. Safety auditing needs to verify that an agent visits the right states and avoids attack chains. Debugging needs to locate the bottleneck states where agents get stuck. Production monitoring needs to flag behavioral drift before it costs anything.
Yet current agent analysis works at the level of a single trace. It offers no structural model of the behavior that links one run to the next.
The Inverse Problem
We frame behavioral recovery as an inverse problem: given a corpus of execution traces, reconstruct a finite-state machine that explains the observed behavior. This is grammatical inference (Gold, 1967; Oncina & Garcı́a, 1992), but the classical setting assumes both positive and negative examples. Agent traces give us positive examples only, the runs that happened, with no labeled counter-examples. Identifying the target language from positive examples alone is impossible in the limit (Gold, 1967).
What rescues the problem is a property specific to agents. Unlike arbitrary regular languages, agent behavior is generated by a bounded set of tools and actions, so traces draw from a small activity alphabet, 6 to 42 symbols across our twelve datasets.
A small alphabet is the whole reason this works. It makes the compact automaton both small and, as later chapters show, statistically dense enough to predict from. The only modeling choice in the entire pipeline is how a message becomes a symbol.
Twelve Datasets, Eight Domains
We evaluate on twelve public datasets spanning coding, web navigation, desktop GUI, mobile GUI, customer service, and multi-agent coordination. The trace counts below are the trajectories we use in our experiments, from 184 to 8,337 each, not the size of each source corpus: for the largest sources we draw a fixed slice rather than the whole set.
| Dataset | Domain | Traces used | Actions | States | Fitness |
|---|---|---|---|---|---|
| SWE-smith | Coding | 500 | 9 | 10 | 1.000 |
| SWE-agent | Coding | 2,000 | 24 | 25 | 0.999 |
| Mind2Web | Web | 500 | 7 | 8 | 1.000 |
| WebArena | Web | 8,337 | 24 | 25 | 1.000 |
| AgentNet | Desktop GUI | 5,000 | 24 | 25 | 1.000 |
| GUI-Odyssey | Mobile GUI | 7,735 | 6 | 7 | 1.000 |
| Who & When | Multi-agent | 184 | 8 | 9 | 1.000 |
| tau2-bench airline | Customer service | 800 | 17 | 18 | 1.000 |
| tau2-bench retail | Customer service | 1,824 | 18 | 19 | 1.000 |
| tau2-bench telecom | Customer service | 1,824 | 42 | 43 | 1.000 |
| ATBench | Safety | 1,000 | 14 | 15 | 1.000 |
| OSWorld | Desktop OS | 2,166 | 26 | 27 | 0.997 |
Every dataset replays held-out traces at fitness of at least 0.997.
The three largest source datasets are subsampled to a fixed slice: SWE-agent uses 2,000 of the 80,036 available trajectories, Mind2Web 500 of 2,350, and AgentNet 5,000 from the OpenCUA Ubuntu subset. The other nine datasets are used in full.
Every figure in this article is a fixed snapshot; every dataset and every case is live in the interactive dashboard, which renders all twelve datasets straight from the experiment outputs — the FSMs, the failure predictor, the runtime monitor, and more.
- Gold, E. M. (1967). Language Identification in the Limit. Information and Control, 10(5), 447–474. back: 1, 2
- Oncina, J., & Garcı́a, P. (1992). Inferring Regular Languages in Polynomial Updated Time. Pattern Recognition and Image Analysis, 49–61.
- Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., & Yu, T. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. The Thirty-Eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=tN61DTr4Ed
- Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., & Press, O. (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. Advances in Neural Information Processing Systems (NeurIPS).
- Yang, J., Lieret, K., Jimenez, C. E., Wettig, A., Khandpur, K., Zhang, Y., Hui, B., Press, O., Schmidt, L., & Yang, D. (2025). SWE-smith: Scaling Data for Software Engineering Agents. Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Datasets & Benchmarks Track.
- Yao, S., Narasimhan, K., & others. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv Preprint arXiv:2406.12045.
- Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Bisk, Y., Fried, D., Alon, U., & Neubig, G. (2024). WebArena: A Realistic Web Environment for Building Autonomous Agents. Proceedings of the International Conference on Learning Representations (ICLR).
From Traces to Symbols
An agent execution trace is a sequence of messages , where each message has a role (system, user, assistant, tool) and content. The first step is to map each message to a symbolic activity from a finite alphabet.
Activity Extraction
An activity extraction function maps each message to a symbol. We apply three rules in priority order:
- Tool calls: if a message contains a
tool_callfield, the activity is the function name (for examplebash,search_flight,click). - Action tags: if the content contains
[ACTION] description, the activity is the action label. - Command extraction: for agents that act through code blocks, we extract the first command token and map it to a semantic category (
edit,search,navigate,execute).
If no rule matches, the activity defaults to role:content_type (for example assistant:text).
The extraction is entirely deterministic and format-specific, with no LLM calls. The whole process completes in milliseconds.
An Example
Consider a coding agent trace from SWE-agent with 47 messages. The raw trace contains system prompts, file contents, error messages, and tool invocations. After extraction, the activity sequence is:
init → user → search → user → edit → user → execute → user → edit → user → submit
From 47 messages and thousands of tokens, we get 11 symbols drawn from an alphabet of 24 possible activities. This is the sequence the FSM will model.
Replay any real trace symbol by symbol in the trace view.
Why This Works
These alphabets are small by construction: 6 to 42 symbols across the twelve datasets, against the tens of thousands of natural language. That is what keeps FSM extraction tractable.
Even a 42-tool telecom customer service agent (tau2-bench telecom) needs only 43 states to capture its behavioral structure. The alphabet is bounded because the agent’s capabilities are bounded: it can only call the tools it has been given.
is the one design decision in the pipeline, so we stress-test it. Fitness stays above 0.999 across all four granularities, from role-only (two to four symbols) to full tool-level. Failure prediction stays within 0.03 AUROC (area under the ROC curve) on any dataset. The rules above are one valid setting, not the only one.
Building the Machine
Given a corpus of activity sequences, FSM construction is two steps and nothing else: build a prefix tree, then merge structurally equivalent states. There are no thresholds, no number of clusters, no learning rate.
Step 1: Prefix Tree
Insert all activity sequences into a trie. Each unique prefix becomes a distinct state. The prefix tree has perfect training fitness, it replays every training trace exactly, but it can have tens of thousands of states.
For SWE-agent (2,000 traces), the prefix tree has 59,510 states. Most are visited once and represent memorized suffixes rather than reusable transition patterns.
Step 2: Structural Merging
Two states are structurally equivalent if for every activity : (i) is defined exactly when is defined, and (ii) the targets are themselves equivalent. This recursion is computed bottom-up in a single pass.
Structural equivalence is exactly the Myhill-Nerode equivalence on the observed prefix language. By the Myhill-Nerode theorem the quotient is the unique minimal deterministic finite automaton (DFA), so no smaller automaton can reproduce the observed behavior (Hopcroft et al., 2006).
The SWE-agent prefix tree with 59,510 states collapses to just 25 states, a 2,380 compression, and the resulting FSM still replays held-out traces at 0.999 fitness.
The Resulting FSM
The FSM encodes the agent’s behavioral topology. Recurring patterns become loops, and the state count tracks the number of distinct behavioral modes.
In the tau2-bench retail and telecom customer service agents, a tool-call loop (assistant:tool_call to tool:text) dominates execution, with the conversational path through assistant:text as a separate branch. In a coding agent, the search, edit, execute cycle accounts for most of the trace.
Open the FSM explorer for any of the twelve datasets in the live dashboard.
Theoretical Properties
The construction guarantees three properties.
Fitness preservation. Structural merging preserves training fitness: if a trace is accepted by the prefix tree, it is accepted by the merged FSM, because merging only adds out-edges (each state carries the union of its merged transitions).
Compactness. The merged FSM is a compact directly-follows automaton: one state per activity, deterministic, accepting every observed trace. We recover this, not the generating automaton, which is impossible to identify from positive examples alone (Gold, 1967), but it is enough for faithful replay and prediction.
Linear runtime. Prefix tree construction is , and structural merging is a partition refinement in . In practice all twelve datasets complete in under one second on a single CPU core.
The construction itself is classical (Daciuk et al., 2000). What is new is that bounded agent alphabets make the resulting compact automaton small enough, and dense enough per state, to be useful for the prediction and monitoring tasks that follow.
- Daciuk, J., Mihov, S., Watson, B. W., & Watson, R. E. (2000). Incremental Construction of Minimal Acyclic Finite-State Automata. Computational Linguistics, 26(1), 3–16.
- Gold, E. M. (1967). Language Identification in the Limit. Information and Control, 10(5), 447–474.
- Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2006). Introduction to Automata Theory, Languages, and Computation (3rd ed.). Pearson.
Why the Machine Stays Small
A small state count alone does not make a machine useful. An automaton can still be huge if the language is complex, and a huge automaton spreads its observations thinly, leaving every per-state statistic noisy. What makes the FSM a usable substrate is that it is small, stable, and converges fast, so each state pools enough traces to estimate from. That property, not the exact count, is what the prediction and monitoring chapters depend on.
It converges on a few percent of the data
Structure stabilizes almost immediately. Across five random train/test splits the extracted state count is identical every time, zero variance, so the topology is a property of the agent, not of which traces you happened to sample. Replay fitness plateaus just as fast: within the first 1 to 10% of the training traces every dataset clears 0.95 fitness, and SWE-smith holds 0.9996 from the first 1%.
Three different methods agree
The compactness is not an artifact of our particular merge rule. Three fundamentally different algorithms converge to nearly the same state count on every dataset:
- Structural merging (ours), a structural partition.
- Alergia (Carrasco & Oncina, 1994), a statistical merge, lands within 1.0 to 6.0.
- hidden Markov model (HMM) (Rabiner, 1989), a probabilistic latent-state model, matches the count, though its states are not interpretable.
A structural, a statistical, and a probabilistic method agreeing rules out an algorithmic coincidence.
- Carrasco, R. C., & Oncina, J. (1994). Learning Stochastic Regular Grammars by Means of a State Merging Method. International Colloquium on Grammatical Inference, 139–152.
- Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257–286.
Compression and Comparison
We compare against nine baselines from automata learning (Carrasco & Oncina, 1994; Oncina & Garcı́a, 1992), HMMs (Rabiner, 1989), process mining, and agent workflow extraction (Wang et al., 2024). All receive only the same positive training sequences, no failure labels.
How Much Smaller
Our FSMs achieve 15 to 3,036 compression over RPNI while replaying held-out traces at fitness of at least 0.997. The ratio grows with trace length and branching: 15 on WebArena (short web traces, where RPNI succeeds) up to 3,036 on GUI-Odyssey (long, repetitive mobile-GUI traces, where RPNI’s prefix tree explodes to 21,255 states against our 7).
Compare all eight methods interactively in the baselines view, or watch structure stabilize in the convergence view.
Convergence and Stability
Replay fitness reaches its plateau well before the training set is exhausted. On SWE-agent it is already at 0.985 within 1% of the training traces and settles at 0.996 by 10%, while the state count keeps inching up as rare command patterns appear. Structured tool-call domains converge fastest: SWE-smith holds 0.9996 from the first 1% of data. Open web and delegation traces take longer, with Mind2Web needing 5% and Who&When 10% of their traces to clear 0.95 fitness.
Baselines at a Glance
- RPNI without negative examples keeps large portions of the prefix tree (382 to 63,897 states) at degraded fitness.
- Alergia, the strongest competitor, matches our fitness but uses 1.0 to 6.0 more states.
- HMM matches our state count but produces non-interpretable latent states.
- EDSM (evidence-driven state merging) without negatives collapses to a trivial 1-state acceptor.
- k-Tails needs a hyperparameter and produces 1.4 to 10 more states than ours at , with state counts exploding past .
- Process mining miners reach high fitness but precision 0.00 to 0.80, the “flower model” problem where every activity is reachable from every state.
Precision
The FSM is more than a vocabulary. It rejects every random trace, and at least 99.9% of permuted traces that keep the activity set but scramble the order. Even single-symbol mutations, a substitution or an insertion or an adjacent swap, are blocked 77 to 100% of the time. RPNI, with its thousands of states, accepts 75% of those same permuted traces on WebArena.
- Carrasco, R. C., & Oncina, J. (1994). Learning Stochastic Regular Grammars by Means of a State Merging Method. International Colloquium on Grammatical Inference, 139–152.
- Oncina, J., & Garcı́a, P. (1992). Inferring Regular Languages in Polynomial Updated Time. Pattern Recognition and Image Analysis, 49–61.
- Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257–286.
- Wang, Z. Z., Mao, J., Fried, D., & Neubig, G. (2024). Agent Workflow Memory. arXiv Preprint arXiv:2409.07429.
Prediction: Next Step and Failure
The same FSM state answers both questions an operator asks: what the agent will do next, and whether this run is heading for failure. The estimand is one object, the per-state transition distribution, and compactness is what makes it reliable.
Next-Step Prediction
At each step the predictor estimates , scored by cross-entropy in bits via 55-fold cross-validation (CV). Conditioning on FSM state alone, with no learning beyond an order-1 Markov model, accounts for 83 to 99% of the total cross-entropy improvement on each dataset. The average drops to 0.93 bits, a 62% cut from the unigram baseline at 2.44 bits.
The cleanest test holds the predictor fixed and adds FSM state as a feature. Under absolute discounting, FSM state conditioning adds +0.155 bits on average (0.580 vs 0.735) and helps on every one of six datasets, from +0.016 on SWE-agent to +0.364 on Mind2Web, whose branching web-action vocabulary gains most from knowing where in the workflow it is. Combine FSM state with a small learned model and you get the best predictor in the study at 0.73 bits.
Workflow Memory
The gain generalizes to the agent’s own LLM. Feeding the current FSM state as context for choosing the next action beats Agent Workflow Memory (AWM) (Wang et al., 2024) on all eight ground-truth datasets, with six gaps significant at .
| Dataset | N | AWM | Ours | Δ |
|---|---|---|---|---|
| WebArena | 4,800 | 65.5 | 81.2 | +15.7 |
| SWE-smith | 300 | 74.7 | 100.0 | +25.3 |
| SWE-agent | 1,200 | 67.7 | 70.5 | +2.8 |
| tau2-bench telecom | 1,095 | 28.5 | 45.6 | +17.1 |
| tau2-bench retail | 1,095 | 52.9 | 65.1 | +12.2 |
| tau2-bench airline | 480 | 56.5 | 57.3 | +0.8 |
| ATBench | 600 | 47.8 | 62.5 | +14.7 |
| OSWorld | 1,286 | 55.0 | 70.7 | +15.7 |
AWM extracts workflows from successful traces only, so on low-success datasets it has little to say. That is exactly where the gap is widest.
Handing the LLM the full FSM, every state and transition, actually loses to AWM (52.2% vs 52.9% on tau2-bench retail). Dumping the whole graph drowns the next-step signal. The format that wins at 65.1% is minimal: next-action probabilities plus a few top continuations from the current state, with no structure dump. Finding the right minimal context is part of the contribution, the same way AWM’s linear-workflow format is part of its.
| Context given to the LLM (tau2-bench retail) | Top-1 % |
|---|---|
| No memory, just the trace so far | 27.6 |
| Linear workflows from successful runs (AWM) | 52.9 |
| Full machine: current state, every transition, the whole graph | 52.2 |
| Full machine, plus multi-step continuations | 49.2 |
| Full machine, from successful traces only | 50.3 |
| Minimal: next-action probabilities and a few likely continuations | 65.1 |
Step through the workflow-memory comparison per dataset in the memory view.
Predicting Failure
Replay a trace through the FSM and read off per-state behavioral features (visit frequency, message-length statistics, error rate, early/late entropy drift) plus five cross-entropy anomaly features. A single gradient-boosted classifier on a fixed 80/20 split reaches held-out AUROC up to 0.94.
Raw fitness is useless here (AUROC near 0.50): successful and failed traces both replay perfectly. The signal is in the per-state decomposition and in surprise, failing traces take low-probability transitions under the FSM.
Failure prediction scales with machine size: more states give a finer map of where a run can go wrong. The 43-state telecom agent tops out at 0.941; WebArena (0.903) and AgentNet (0.890) follow; SWE-agent, with 25 states, reaches 0.799. ATBench, the only safety-labeled benchmark, reaches 0.894 (0.864 ± 0.024 under repeated CV). Across all eight real-trace datasets the CV standard deviation stays in 0.012 to 0.031, so these are not single-split artifacts.
Inspect per-state feature importances and failure modes in the failure view.
The predictors are interpretable. On SWE-agent the single strongest feature is whether the trace reaches the submit state: 94.8% of successes get there, only 55.7% of failures do. And this is not a length proxy, structural features score 0.790 against 0.659 for trace length alone. Successful runs touch only 9 of 25 states along a focused search, edit, submit path, while failures spread across all 25 (Jaccard overlap 0.206).
Runtime Monitor
Deployed online, a two-rule monitor fires when the cycle-rate exceeds 0.778 and the unique-state count clears a warm-up floor. On all four evaluated datasets it reaches rank-AUROC 0.66 at the 25% trace checkpoint, against 0.5 for a flag-everything baseline by construction. On SWE-agent it fires at 32% of trace completion, stopping the run before two-thirds of its remaining compute is spent (precision 85.9%, recall 95.5%). By the halfway checkpoint, FSM features alone already recover 92% of the full-trace signal.
When 84% of runs fail, flagging everything scores a high F1 by default (0.914, versus the monitor’s 0.904). The point of a monitor is not whether to flag but when. Rank-AUROC measures exactly that early-warning utility, which a base-rate-inflated F1 hides. The pipeline is FSM replay only, 0.006 ms per step, with no ML model in the loop.
Watch the monitor flag a failing run in real time in the monitor view.
Discussion
When Does It Work?
The four tasks in this article, workflow memory, next-step prediction, failure detection, and runtime monitoring, look different, yet one object served all of them. When a system’s action vocabulary is bounded, its behavioral topology is bounded too, and a compact, stable automaton is the natural summary. Beating four bespoke pipelines with it is a consequence, not a design goal.
The topology is also model-invariant: a single FSM achieves perfect fitness across four large language models on the same task, so it is shaped by the system, the tools and prompts and task distribution, more than by the model driving it. It stays stable across extraction granularities too, shifting failure-prediction AUROC by less than 0.03 over four levels of .
A 25-state machine can be read and checked by a person; the 59,510-state prefix tree it came from cannot. That auditability is a direct dividend of minimality.
Limitations
The FSM accepts the observed prefix language, not the agent’s true generating language: like any trace-replay method, it cannot tell a trace that stays within the observed transition patterns from a legitimate one. The extraction function needs a small amount of per-domain knowledge, and fully automatic discovery of it is future work. Failure prediction degrades on simpler machines: AUROC falls to 0.799 on SWE-agent and 0.70 on the 10-state SWE-smith, smaller task spaces with less structure to exploit. Extending the workflow-memory comparison beyond AWM to other memory-injection methods is future work.
For agents with much larger action spaces or weaker sequential structure, the construction stays minimal but stops being compact, and the per-state observation density that drives every result above would degrade with it.
Broader Impact
Compact FSM representations make agent behavioral structure inspectable, which supports safety auditing. The same analysis could be misused to find exploitable behavioral patterns, so deployment should restrict FSM analysis to authorized auditing.
Conclusion
A finite-state machine, built in milliseconds from positive examples with one classical merge, does the work of four bespoke learned pipelines. The same 7-to-43-state object remembers workflows (beating AWM on all eight datasets), predicts the next action (62% lower cross-entropy than a unigram), predicts failure (held-out AUROC up to 0.94), and stops bad runs early at 32% of trace completion. It uses 15 to 3,036 fewer states than RPNI, and its state count is identical across every random split.