Agentic AI Is A Stack, Not A Chatbot

What Is This?

Haggai Roitman's The Hitchhiker's Guide to Agentic AI: From Foundations to Systems is a book-length practitioner reference for building autonomous AI systems, posted to arXiv in 2026.

The useful lesson is not that it contains every agent buzzword. The useful lesson is the structure:

agentic AI is a full stack, not a chat interface

The book moves from transformer foundations and GPU systems, through RLHF, DPO, GRPO, reward modelling, reasoning models, and evaluation, to RAG, memory, harnesses, MCP, A2A, multi-agent systems, frameworks, UI, and deployment.

That breadth is the point. Roitman's core thesis is that building great AI systems requires understanding the entire pipeline, not just one layer.

For Jamie, this is a clean map of why agent work keeps turning into systems work.

Why Does It Matter?

Most people talk about agents as if the central question is:

which model is smart enough to do the task?

That question matters, but it is incomplete. Agentic systems fail at the joins:

the model has the wrong context;
the tool is badly described;
memory retrieves stale state;
the harness lets loops run too long;
the eval measures a final answer but not the trajectory;
the UI hides uncertainty or blocks correction;
the deployment stack cannot resume long-running work;
the system has no audit trail when something goes wrong.

The better question is:

which layer is actually limiting the agent?

A stronger model may not fix a weak harness. More tools may make tool choice worse. More context may create context rot. More autonomy may only amplify missing verification.

The Stack Model

Use this seven-layer map.

1. Model substrate
2. Training and alignment
3. Reasoning and test-time compute
4. Retrieval and memory
5. Harness and tools
6. Evaluation and environments
7. Product, UI, deployment, and operations

An agent is what happens when these layers are assembled into a loop that can perceive state, choose actions, call tools, observe results, remember useful information, and stop when the job is done.
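The loop described above can be sketched in a few lines. This is a minimal illustration, not code from Roitman's book: `AgentState`, `run_agent`, and `toy_policy` are hypothetical names, and the policy function stands in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only.
Action = Tuple[str, str]  # (tool name or "stop", argument)


@dataclass
class AgentState:
    goal: str
    observations: List[str] = field(default_factory=list)  # remembered results
    done: bool = False


def run_agent(
    state: AgentState,
    choose_action: Callable[[AgentState], Action],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> AgentState:
    """Perceive state, choose an action, call a tool, observe, remember, stop."""
    for _ in range(max_steps):  # hard step budget so the loop cannot run forever
        name, arg = choose_action(state)   # choose (stands in for the model)
        if name == "stop":
            state.done = True
            break
        result = tools[name](arg)          # call the tool
        state.observations.append(result)  # observe and remember
    return state


def toy_policy(state: AgentState) -> Action:
    # Stand-in for a model call: look something up once, then stop.
    if not state.observations:
        return ("lookup", state.goal)
    return ("stop", "")


final = run_agent(
    AgentState(goal="capital of France"),
    toy_policy,
    tools={"lookup": lambda q: "Paris" if "France" in q else "unknown"},
)
```

Even in this toy form, most of the code is harness rather than model: the step budget, the tool table, and the state object are all outside the policy.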

Layer 1: Model Substrate

The base model is still the engine.

Roitman's book starts with tokenization, transformer architecture, attention, decoding, LoRA, MoE, model compression, speculative decoding, and inference systems such as vLLM. That may look like background material, but it changes how an agent builder thinks.

If inference is slow, the agent's loop becomes expensive. If context is large but poorly managed, the model still misses the right fact. If decoding is unconstrained, structured tool calls become brittle. If the model was not trained for instruction following or tool use, the harness must compensate.

The substrate sets the capability envelope.
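The point about brittle tool calls can be made concrete. When decoding is unconstrained, the harness has to treat model output as untrusted text and check it before acting. The sketch below assumes a hypothetical JSON call format (`tool` plus `args`) and a hypothetical `get_weather` tool; real systems often use constrained decoding or a schema library instead.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Hypothetical tool signature table: tool name -> required argument names and types.
TOOL_SPECS: Dict[str, Dict[str, type]] = {
    "get_weather": {"city": str},
}


def parse_tool_call(raw: str) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (call, None) if the raw model text is a valid call, else (None, error)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "tool call must be a JSON object"
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in TOOL_SPECS:
        return None, f"unknown tool: {tool!r}"
    args = call.get("args")
    if not isinstance(args, dict):
        return None, "args must be a JSON object"
    for name, expected in TOOL_SPECS[tool].items():
        if not isinstance(args.get(name), expected):
            return None, f"argument {name!r} must be {expected.__name__}"
    return call, None


ok_call, ok_error = parse_tool_call('{"tool": "get_weather", "args": {"city": "Oslo"}}')
bad_call, bad_error = parse_tool_call('{"tool": "get_weather", "args": {"city": 3}}')
```

Returning an error string instead of raising lets the harness feed the problem back to the model for another attempt.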

Layer 2: Training And Alignment

Agent behaviour is shaped before the harness ever sees it.

The book spends major space on RLHF, PPO, DPO, GRPO, reward models, preference optimization, SFT, and trajectory-level agentic RL. The practical point is simple:

agents inherit the incentives and failure modes of their training process

A model optimized for pleasing responses is not automatically optimized for long-horizon tool work. A reasoning model that performs well on math may still fail at messy state management. A tool-use model can still pursue the wrong subgoal if the reward or instruction frame is wrong.

This is why agent evaluation cannot stop at final-answer quality.

Layer 3: Reasoning And Test-Time Compute

Reasoning models introduced a new lever: spend more computation at inference time.

Roitman's guide covers chain-of-thought, self-consistency, tree-of-thoughts, graph-of-thoughts, MCTS, process reward models, and test-time scaling.

For agents, this matters because long-horizon tasks need more than a single next-token guess. They need planning, branching, correction, and sometimes search.

But test-time compute is not free. More thinking can mean more cost, more latency, and more opportunities to drift. The design question is:

where should the system spend extra reasoning tokens, and where should it call a tool or ask a human?
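One way to frame that trade-off in code is self-consistency with an escalation threshold: sample several answers, take the majority, and hand off when agreement is low. This is a sketch under assumptions. The `sample` callable stands in for a model call, and the `min_agreement` value is illustrative, not a recommended setting.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def self_consistent_answer(
    sample: Callable[[str], str],
    question: str,
    n_samples: int = 5,
    min_agreement: float = 0.6,
) -> Tuple[str, float, bool]:
    """Sample several answers, majority-vote, and flag low agreement for escalation."""
    answers = [sample(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    agreement = votes / n_samples
    escalate = agreement < min_agreement  # hand off to a tool or a human
    return winner, agreement, escalate


def make_sampler(outputs: Iterable[str]) -> Callable[[str], str]:
    # Deterministic stand-in for a stochastic model: replays canned outputs.
    it = iter(outputs)
    return lambda _question: next(it)


confident = self_consistent_answer(make_sampler(["42", "42", "41", "42", "42"]), "6 * 7?")
unsure = self_consistent_answer(make_sampler(["a", "b", "c", "a", "d"]), "hard question")
```

Each extra sample is an extra model call, so `n_samples` is a direct cost and latency dial.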

Layer 4: Retrieval And Memory

RAG is not just search bolted onto a chatbot. In an agent, retrieval and memory define what the system can see.

Roitman separates several memory types: working memory, episodic memory, semantic memory, procedural memory, shared memory, and graph-style memory. The first five map cleanly onto practical agent work:

working memory: the current task state;
episodic memory: what happened in prior runs;
semantic memory: stable facts and concepts;
procedural memory: skills, playbooks, and tool-use routines;
shared memory: what multiple agents coordinate through.

The key lesson:

memory is not storage; memory is selective retrieval under constraints

Bad memory makes an agent worse. It can retrieve stale facts, mix user preferences with project state, or bury the one relevant correction under a heap of irrelevant history.
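A minimal sketch of "selective retrieval under constraints" follows. It assumes a hypothetical `MemoryItem` record with a source and a time-to-live, and it uses keyword overlap as a stand-in for whatever ranking a real system uses, such as embedding similarity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryItem:
    text: str
    source: str        # provenance: where this memory came from
    created_at: float  # seconds on any consistent clock
    ttl: float         # seconds until the item is considered stale


def retrieve(items: List[MemoryItem], query: str, now: float, budget: int = 2) -> List[MemoryItem]:
    """Drop expired items, rank by keyword overlap, and return at most `budget` items."""
    terms = set(query.lower().split())
    fresh = [m for m in items if now - m.created_at <= m.ttl]
    scored = [(len(terms & set(m.text.lower().split())), m) for m in fresh]
    relevant = [(score, m) for score, m in scored if score > 0]
    # Highest overlap first; newer items win ties.
    relevant.sort(key=lambda pair: (pair[0], pair[1].created_at), reverse=True)
    return [m for _, m in relevant[:budget]]


items = [
    MemoryItem("deploy target is staging", "run-12", created_at=0, ttl=100),
    MemoryItem("deploy target is production", "user-correction", created_at=900, ttl=1000),
    MemoryItem("user prefers dark mode", "profile", created_at=900, ttl=1000),
]
hits = retrieve(items, "what is the deploy target", now=1000)
```

Here the stale "staging" fact is dropped by expiry and the unrelated preference is dropped by relevance, so only the recent correction reaches the model.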

Layer 5: Harness And Tools

This is the layer most non-builders underestimate.

Roitman defines the agent harness as the runtime infrastructure that wraps an LLM and turns it from a stateless text-completion engine into a stateful, goal-directed agent. The harness handles execution, memory, routing, observability, state, tool calls, sandboxing, and error recovery.

The model reasons. The harness gives it a body.

model = reasoner
harness = operating system

Anthropic's engineering guidance makes a similar distinction: agents are systems where models direct their own tool use in loops, while workflows orchestrate models and tools through predefined code paths. Both can be useful; the mistake is adding autonomy when a simpler workflow would be more reliable.

A good harness does not just expose tools. It constrains them:

clear schemas;
scoped permissions;
typed outputs;
retry limits;
sandboxing;
confirmation gates;
trace logging;
recovery paths.

That is why agent building feels closer to systems engineering than prompt writing.
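Several of the constraints listed above fit in one small wrapper. The sketch below is illustrative only: `GuardedTool` and `Harness` are hypothetical names, the argument check is a stand-in for a real schema, and sandboxing is out of scope.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set


@dataclass
class GuardedTool:
    name: str
    func: Callable[..., Any]
    required_args: Set[str]           # minimal stand-in for a schema
    needs_confirmation: bool = False  # gate for sensitive actions
    max_retries: int = 2


@dataclass
class Harness:
    allowed: Set[str]                               # scoped permissions
    confirm: Callable[[str, Dict[str, Any]], bool]  # human approval hook
    trace: List[Dict[str, Any]] = field(default_factory=list)

    def call(self, tool: GuardedTool, args: Dict[str, Any]) -> Dict[str, Any]:
        record: Dict[str, Any] = {"tool": tool.name, "args": args, "attempts": 0}
        self.trace.append(record)  # every call is logged, including refusals
        if tool.name not in self.allowed:
            record["status"] = "denied"
        elif set(args) != tool.required_args:
            record["status"] = "bad_args"
        elif tool.needs_confirmation and not self.confirm(tool.name, args):
            record["status"] = "rejected"
        else:
            # One initial attempt plus up to max_retries retries.
            for attempt in range(1, tool.max_retries + 2):
                record["attempts"] = attempt
                try:
                    record["result"] = tool.func(**args)
                    record["status"] = "ok"
                    record.pop("error", None)
                    break
                except Exception as exc:  # recovery path: retry, then report
                    record["status"] = "failed"
                    record["error"] = str(exc)
        return record


calls = {"n": 0}


def flaky_read(path: str) -> str:
    # Fails on the first call only, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] == 1:
        raise OSError("temporary failure")
    return f"contents of {path}"


harness = Harness(allowed={"read_file"}, confirm=lambda name, args: False)
read_file = GuardedTool("read_file", flaky_read, {"path"})
delete_file = GuardedTool("delete_file", lambda path: None, {"path"}, needs_confirmation=True)

first = harness.call(read_file, {"path": "notes.txt"})
second = harness.call(delete_file, {"path": "notes.txt"})
```

The model never calls `flaky_read` directly. It only ever sees the record the harness returns, which is also what lands in the trace.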

Layer 6: Evaluation And Environments

Conversation evals are not enough for agents.

A chat model can be evaluated on a prompt-response pair. An agent has a trajectory:

goal -> plan -> tool call -> observation -> state update -> next action -> result

The failure can happen anywhere in that chain.

Roitman's guide treats agentic environments and benchmarks as first-class because agents must act in worlds. SWE-bench, WebArena-style tasks, browser environments, code sandboxes, and custom task environments all try to measure behaviour across steps, not just text quality.

For Jamie's world, the practical eval question is:

did the agent produce a verified outcome with an inspectable path?

Not:

did the final message sound competent?
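A trajectory check along these lines might look like the sketch below. The step kinds and the four checks are assumptions chosen for illustration; a real eval would use whatever events the harness actually records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    kind: str    # e.g. "plan", "tool_call", "observation", "verify", "result"
    detail: str  # for a tool_call, the tool name


def evaluate_trajectory(steps: List[Step], allowed_tools: List[str], max_steps: int = 10) -> Dict[str, bool]:
    """Score the path, not just the final message."""
    kinds = [s.kind for s in steps]
    tool_calls = [s.detail for s in steps if s.kind == "tool_call"]
    # Index of the last "result" step, or -1 if the run never produced one.
    last_result = max((i for i, k in enumerate(kinds) if k == "result"), default=-1)
    return {
        "within_budget": len(steps) <= max_steps,
        "only_allowed_tools": all(t in allowed_tools for t in tool_calls),
        "gathered_evidence": "observation" in kinds,
        # A verification step must appear before the final result.
        "verified_before_result": last_result >= 0 and "verify" in kinds[:last_result],
    }


good_run = [
    Step("plan", "fix failing test"),
    Step("tool_call", "run_tests"),
    Step("observation", "1 failure in test_parse"),
    Step("verify", "re-ran tests, all pass"),
    Step("result", "patched parser"),
]
bad_run = [Step("plan", "fix failing test"), Step("result", "should be fixed now")]

good_report = evaluate_trajectory(good_run, allowed_tools=["run_tests"])
bad_report = evaluate_trajectory(bad_run, allowed_tools=["run_tests"])
```

Both runs could end with an equally confident final message; only the trace separates them.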

Layer 7: Product, UI, Deployment, And Operations

The last layer decides whether the agent is usable.

Roitman's book covers agent development frameworks, agentic UI, observability, async execution, deployment, cost management, state resumption, and long-running tasks. These are not polish. They are part of the system.

A user needs to see what the agent is doing, interrupt when needed, approve sensitive actions, inspect state, understand errors, and roll back mistakes.
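State resumption is the easiest of these to show in miniature. The sketch below checkpoints results to a JSON file after every step so a restarted run skips completed work. The file layout and function names are hypothetical, and a production system would also need atomic writes and idempotent steps.

```python
import json
import os
import tempfile
from typing import Callable, List


def run_with_checkpoints(steps: List[str], do_step: Callable[[str], str], path: str) -> List[str]:
    """Run named steps in order, saving results after each so a restart can resume."""
    results: List[str] = []
    if os.path.exists(path):
        with open(path) as f:
            results = json.load(f)  # resume: skip steps already completed
    for step in steps[len(results):]:
        results.append(do_step(step))
        with open(path, "w") as f:
            json.dump(results, f)   # checkpoint after every step
    return results


path = os.path.join(tempfile.mkdtemp(), "run.json")
executed: List[str] = []
crash_once = {"armed": True}


def do_step(step: str) -> str:
    # Simulates a worker that dies once, partway through the run.
    if step == "deploy" and crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("worker restarted")
    executed.append(step)
    return f"{step}: done"


plan = ["fetch", "build", "deploy"]
try:
    run_with_checkpoints(plan, do_step, path)
except RuntimeError:
    pass  # simulated interruption
results = run_with_checkpoints(plan, do_step, path)
```

The second call picks up at "deploy" without repeating "fetch" or "build", which is the property a long-running agent needs from its deployment layer.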

This is where protocols matter:

MCP standardizes how applications expose tools, resources, and prompts to models.
A2A-style agent communication points toward agents discovering each other, exchanging tasks, streaming progress, and coordinating long-running work.
Frameworks such as LangGraph and the OpenAI Agents SDK give builders runtimes for stateful loops, tool calls, handoffs, and tracing.

The larger point is protocolization:

agent systems mature when their boundaries become explicit interfaces
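As an illustration of an explicit interface, the sketch below describes one tool in the general shape MCP uses for tool definitions: a name, a description, and a JSON Schema under `inputSchema`. The `search_notes` tool is hypothetical, and the checker is a tiny hand-rolled stand-in rather than a real JSON Schema validator; the MCP specification remains the authority on exact fields.

```python
from typing import Any, Dict, List

# A tool description with a name, a human-readable description, and a
# JSON Schema for the arguments. The tool itself is hypothetical.
SEARCH_NOTES: Dict[str, Any] = {
    "name": "search_notes",
    "description": "Search the user's notes and return matching titles.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer"},
        },
        "required": ["query"],
    },
}

# Only the few JSON Schema types this sketch understands.
_JSON_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict}


def check_arguments(tool: Dict[str, Any], arguments: Dict[str, Any]) -> List[str]:
    """Check `required` and simple `type` keywords only; returns a list of problems."""
    schema = tool["inputSchema"]
    problems = [f"missing: {name}" for name in schema.get("required", []) if name not in arguments]
    for name, value in arguments.items():
        prop = schema["properties"].get(name)
        if prop is None:
            problems.append(f"unexpected: {name}")
        elif not isinstance(value, _JSON_TYPES[prop["type"]]):
            problems.append(f"wrong type: {name}")
    return problems


valid = check_arguments(SEARCH_NOTES, {"query": "agents", "limit": 5})
invalid = check_arguments(SEARCH_NOTES, {"limit": "5"})
```

Because the contract is data rather than prose, both the model-facing description and the harness-side check can be derived from the same object.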

Validation Surface

Primary validation: Roitman's arXiv book provides the full-stack taxonomy and practitioner synthesis.
Independent validation: Anthropic's Building Effective Agents frames agents around tool-using loops, workflows, evaluator-optimizer patterns, and the need to choose the simplest reliable pattern.
Protocol validation: MCP's official documentation validates the tool/resource/prompt integration layer; OpenAI Agents SDK and LangGraph documentation validate the runtime/tracing/handoff layer.
Limitation / counter-source: Much of the agentic stack is still moving quickly. A book-length guide can become stale at protocol and framework edges; use it as a map, not as a frozen implementation manual.
What remains uncertain: Which standards will dominate, how much agentic RL will matter outside a few high-value domains, and which evals best predict real production reliability.

What This Does Not Prove

This article does not prove that Roitman's stack is the final taxonomy of agentic AI.

Limits:

the arXiv source is a practitioner synthesis, not a benchmark result;
framework and protocol details will age quickly;
the book is broad by design, so individual chapters should be checked against primary docs before implementation;
agentic RL and multi-agent protocols are still less production-stable than simpler workflow patterns;
the strongest practical claim is architectural, not empirical: agent reliability depends on many layers besides model capability.

The safe conclusion is:

use the stack as a map for design and diagnosis, not as an authority to copy blindly

Why Smart People Get This Wrong

They confuse model progress with system progress

A better model can hide a bad system for longer. It does not remove the need for state, tools, evals, observability, and rollback.

They add agents where workflows would work

If the task path is known, a workflow may be safer and cheaper than an autonomous loop. Autonomy is for uncertainty, not for theatre.

They treat memory as a feature, not a liability

Memory can personalize, but it can also contaminate. The question is not whether the agent remembers. It is whether it remembers the right thing with provenance and expiry.

They eval the answer, not the trajectory

Agent quality lives in the path: which tools were called, what evidence was gathered, what state changed, and what verification happened before the final answer.

They ignore the UI

Human oversight requires an interface. If the user cannot see, pause, approve, or undo, the system is not responsibly agentic.

How To Use This

Use the stack as a debugging map.

When an agent fails, ask:

Model: was the base capability missing?
Training: was the model optimized for the wrong behaviour?
Reasoning: did it need planning/search, or did it overthink?
Retrieval/memory: did it see the right facts?
Harness/tools: were tools clear, scoped, and recoverable?
Evaluation: did the test measure the real task?
Product/deployment: could the user supervise, resume, and trust the run?

This prevents the lazy answer:

use a better model

Sometimes the right fix is a stricter tool schema, a smaller and more focused context, a deterministic check, a human approval gate, or a better task environment.
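The debugging map can be kept as a small checklist in code. This sketch is one possible encoding, with hypothetical function names: it reports which layers were flagged, which were never checked, and whether "use a better model" is actually supported by the findings.

```python
from typing import Dict, List, Tuple

# The seven layers and their diagnostic questions, in stack order.
CHECKLIST: List[Tuple[str, str]] = [
    ("Model", "was the base capability missing?"),
    ("Training", "was the model optimized for the wrong behaviour?"),
    ("Reasoning", "did it need planning/search, or did it overthink?"),
    ("Retrieval/memory", "did it see the right facts?"),
    ("Harness/tools", "were tools clear, scoped, and recoverable?"),
    ("Evaluation", "did the test measure the real task?"),
    ("Product/deployment", "could the user supervise, resume, and trust the run?"),
]


def diagnose(findings: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """findings maps a layer to True if a problem was found there.

    Returns (suspect layers, layers not yet checked), both in stack order.
    """
    suspects = [layer for layer, _ in CHECKLIST if findings.get(layer)]
    unchecked = [layer for layer, _ in CHECKLIST if layer not in findings]
    return suspects, unchecked


def model_upgrade_justified(findings: Dict[str, bool]) -> bool:
    # Only blame the model once every other layer has been checked and cleared.
    suspects, unchecked = diagnose(findings)
    return suspects == ["Model"] and not unchecked


partial = {"Model": True, "Harness/tools": False}
suspects, unchecked = diagnose(partial)
```

With only two layers examined, `model_upgrade_justified(partial)` is false: five layers are still unchecked, so swapping the model would be a guess.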

Practical Takeaways For Jamie

Hermes is an agent OS, not a chatbot. The leverage is in memory, tools, scheduling, verification, and human checkpoints.
Jme-Loop should stay steward-first. A steward routes work through the stack; it does not pretend model autonomy alone is enough.
Treat protocols as leverage points. MCP, A2A-like boundaries, and skill/tool schemas are how agent systems become composable.
Evaluation is the moat. The useful question is whether a trace proves the work, not whether the answer sounds right.
Use the stack to diagnose failure. Do not blame the model until the harness, context, tool surface, and eval have been checked.

Key Terms

Agentic AI: AI system that can pursue goals through multi-step action, tool use, state updates, and feedback.
Agent harness: runtime layer that wraps a model with tools, memory, orchestration, state, safety, and observability.
RAG: retrieval-augmented generation; retrieving external information before or during generation.
Agentic RAG: retrieval where the agent can decide when, where, and how to search across sources.
MCP: Model Context Protocol; a standard for exposing tools, resources, and prompts to AI applications.
A2A: Agent-to-agent communication pattern/protocol family for agents exchanging tasks and progress.
Trajectory: the sequence of states, tool calls, observations, and outputs produced during an agent run.
Evaluation environment: a structured world or task setup in which agent actions can be observed and scored.

Recall Questions

Why is agentic AI better understood as a stack than as a chatbot?
What does the agent harness do that the model itself does not?
Why can memory make agents worse if it is not selective and provenance-aware?
Why are trajectory evals different from prompt-response evals?
When should Jamie prefer a workflow over a more autonomous agent?

Best Resources To Learn More

Use Roitman's book as the broad map of the agentic AI stack.
Read Anthropic's Building Effective Agents for a concise engineering guide to workflows, agents, and design patterns.
Read the MCP docs when thinking about tool and resource integration.
Read LangGraph or OpenAI Agents SDK docs when thinking about runtimes, handoffs, tracing, and production agent loops.
Cross-link this article with the library's loop-engineering and context-engineering pieces.

Sources

Haggai Roitman, The Hitchhiker's Guide to Agentic AI: From Foundations to Systems, arXiv:2606.24937 (2026). https://arxiv.org/abs/2606.24937
Anthropic Engineering, "Building effective agents." https://www.anthropic.com/engineering/building-effective-agents
Anthropic Engineering, "Effective context engineering for AI agents." https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Model Context Protocol documentation. https://modelcontextprotocol.io/docs/getting-started/intro
OpenAI Agents SDK documentation. https://openai.github.io/openai-agents-python/
LangGraph documentation. https://langchain-ai.github.io/langgraph/
Related library article: /library/loop-engineering-agent-systems.
Related reference card: /research/_reference-cards/context-engineering-ai-agents.md.

