Loop Engineering: The Agent Is Only the Inner Loop

What Is This?

Sydney Runkle's "The Art of Loop Engineering" makes a useful distinction: an agent is not a magic worker. It is the inner loop inside a larger operating system.

The simplest agent loop is:

model receives context -> model calls tools -> tools change the world -> model observes result -> repeat until done

That loop matters, but a production agent needs more than one loop. The real craft is designing the loops around the agent:

agent loop -> verification loop -> event loop -> hill-climbing loop

Each loop answers a different question:

Agent loop: can the model do the work?
Verification loop: did it do the work correctly?
Event loop: when should the work run without a human prompting it?
Hill-climbing loop: how does the system learn from traces and get better?

This is the difference between using agents as clever chatbots and building agents as durable machinery.

Why Does It Matter?

Most people over-focus on the model. They ask which model is smartest, which prompt works, or which tool call failed. Those questions matter, but they sit too close to the inner loop.

Useful agent systems are usually won one layer out.

A weak model in the right harness can become useful. A strong model in a bad harness still fails: it gets the wrong context, uses the wrong tools, repeats mistakes, runs at the wrong time, and never learns from previous runs.

The core lesson:

model capability is raw power
loop design turns raw power into reliable work

This matters for Jamie because it is the operating model behind Hermes, Jme-Loop, research automation, coding agents, and any future AI-enabled company. The advantage is not just "use agents". The advantage is owning the loops that make agents reliable, scheduled, observable, and improvable.

The Four Loops

1. The agent loop

The agent loop is the base unit.

LangChain's agent docs define an agent as a model that repeatedly calls tools until the task is complete. The harness around that model supplies the prompt, tools, middleware, memory, and runtime context.

In practice:

task -> context -> model -> tool call -> observation -> next action -> final output

For a documentation agent, that might mean:

receiving a docs improvement request,
cloning or reading a repo,
searching files,
editing markdown,
running checks,
opening a pull request.

The agent loop is where work happens. But by itself it does not guarantee good work.
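The cycle above can be sketched in a few lines of Python. This is a minimal illustration, not LangChain's implementation: scripted_model is a hypothetical stand-in for a real model call, and the two tools are hypothetical stubs that return strings instead of touching a repo.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    tool: Optional[str]  # None signals that the model considers the task done
    argument: str = ""


def scripted_model(context: list[str]) -> Action:
    """Hypothetical stand-in for a model call: chooses by how much it has observed."""
    observed = sum(1 for line in context if line.startswith("observation:"))
    if observed == 0:
        return Action("search_files", "README")
    if observed == 1:
        return Action("edit_markdown", "fix typo")
    return Action(None, "opened pull request")


# Hypothetical tools; real ones would read and edit a repository.
TOOLS: dict[str, Callable[[str], str]] = {
    "search_files": lambda arg: f"found {arg}.md",
    "edit_markdown": lambda arg: f"applied edit: {arg}",
}


def run_agent(task: str,
              model: Callable[[list[str]], Action] = scripted_model,
              max_steps: int = 10) -> tuple[str, list[str]]:
    context = [f"task: {task}"]
    for _ in range(max_steps):
        action = model(context)                       # model receives context
        if action.tool is None:                       # repeat until done
            return action.argument, context
        result = TOOLS[action.tool](action.argument)  # tool changes the world
        context.append(f"observation: {result}")      # model observes the result
    raise RuntimeError("agent did not finish within max_steps")


final_output, context = run_agent("improve the docs")
print(final_output)
```

The max_steps cap is a design choice worth keeping in any real harness: without it, a model that never signals completion would loop forever.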

2. The verification loop

The verification loop wraps the agent with a grader.

agent attempts task -> grader checks output -> feedback returns to agent -> agent revises

The grader can be deterministic or model-based.

Deterministic graders are best when the rules are objective:

tests pass,
links resolve,
JSON schema validates,
no forbidden files changed,
output contains required sections.

Model-based graders help when judgment is fuzzier:

is the answer complete?
is the tone right?
did the agent satisfy the user's actual request?
is the diff scoped to the stated problem?

LangChain's RubricMiddleware is one implementation of this pattern: a deep agent produces output, a grader reviews it against a rubric, and feedback is injected back into the run when revision is needed.

The tradeoff is simple: verification increases cost and latency. That is acceptable when quality matters more than speed.
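A minimal sketch of the attempt, grade, revise cycle, assuming a deterministic grader that checks for required sections. The names toy_agent, grade, and REQUIRED_SECTIONS are hypothetical, and this is not the RubricMiddleware API; it only shows the shape of the loop.

```python
from typing import Callable, Optional

REQUIRED_SECTIONS = ("Summary", "Steps")  # hypothetical acceptance criteria


def grade(output: str) -> Optional[str]:
    """Deterministic grader: return None on pass, or feedback text on failure."""
    missing = [name for name in REQUIRED_SECTIONS if name not in output]
    if missing:
        return "missing sections: " + ", ".join(missing)
    return None


def toy_agent(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an agent run: it adds 'Steps' only after feedback."""
    if feedback is None:
        return "Summary: update the install guide"
    return "Summary: update the install guide\nSteps: 1. edit 2. test"


def run_with_verification(task: str,
                          agent: Callable[[str, Optional[str]], str] = toy_agent,
                          grader: Callable[[str], Optional[str]] = grade,
                          max_attempts: int = 3) -> tuple[str, int]:
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        output = agent(task, feedback)   # agent attempts (or revises)
        feedback = grader(output)        # grader checks the output
        if feedback is None:
            return output, attempt       # passed: stop paying for more attempts
    raise RuntimeError(f"still failing after {max_attempts} attempts: {feedback}")


output, attempts = run_with_verification("docs fix")
print(attempts)
```

The max_attempts bound makes the cost and latency tradeoff explicit: each extra attempt is another agent run plus another grading pass.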

3. The event loop

The event loop connects the agent to the outside world.

A manually invoked agent is still a tool. An event-driven agent becomes part of a system.

event happens -> agent runs -> output changes a system -> next event waits

Events can be:

a cron schedule,
a webhook,
a Slack message,
a GitHub issue,
a new document,
a daily heartbeat,
a completed background task.

This is where agents stop depending on a human typing the next prompt. LangSmith Deployment supports scheduled runs with cron jobs and webhook-triggered graph runs. OpenClaw's heartbeat model uses a similar idea: periodic agent turns can surface what needs attention without forcing the human to keep polling the system.

The event loop gives agents leverage. It lets them run at the right time, not only when someone remembers to invoke them.
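The trigger layer can be sketched as a queue of events routed to handlers. This is an in-memory illustration only: in a real deployment the queue would be fed by a scheduler or a webhook endpoint, and Event, ROUTES, and the two handler functions are hypothetical names, not part of LangSmith or OpenClaw.

```python
import queue
from dataclasses import dataclass


@dataclass
class Event:
    source: str   # e.g. "cron", "webhook", "github_issue"
    payload: str


def docs_agent(payload: str) -> str:
    return f"docs agent handled: {payload}"


def triage_agent(payload: str) -> str:
    return f"triage agent handled: {payload}"


# Routing table: which agent runs for which kind of event.
ROUTES = {"cron": docs_agent, "github_issue": triage_agent}


def drain(events: queue.Queue) -> list[str]:
    """Run the matching agent for every queued event, then return the results."""
    results = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return results
        handler = ROUTES.get(event.source)
        if handler is None:
            results.append(f"ignored event from {event.source}")
        else:
            results.append(handler(event.payload))


events: queue.Queue = queue.Queue()
events.put(Event("cron", "nightly link check"))
events.put(Event("github_issue", "broken install step"))
events.put(Event("slack", "hello"))
results = drain(events)
print(results)
```

Unrouted events are recorded rather than dropped silently, so a missing route shows up in the output instead of looking like a quiet system.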

4. The hill-climbing loop

The hill-climbing loop is the most important one.

Every agent run creates a trace: what it saw, what it tried, which tools it called, where it failed, what the grader said, and what the final result looked like.

A hill-climbing loop uses those traces to improve the harness.

production traces -> failure analysis -> prompt/tool/eval change -> reviewed update -> better future runs

This is different from simply retrying the same task. The return arrow does not go back to the top of the run. It reaches inside the system and changes the harness.

Potential targets for improvement:

prompts,
tool descriptions,
tool availability,
memory and retrieval,
grader rubrics,
retry policies,
routing rules,
human approval gates,
eval datasets,
eventually fine-tuning data.

LangSmith Engine is a commercial expression of this pattern: it analyzes production traces, clusters related failures, recommends fixes, and can generate prompt or code changes. The product is not the point. The loop is the point.
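Independent of any product, the trace-to-change flow can be sketched as: count recurring failures, map each to a candidate harness change, and mark the change as pending review. Everything here is an assumption for illustration, including the Trace fields, the failure tags, and the PROPOSALS table; a real system would derive tags from failure analysis rather than hand-labeling.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trace:
    run_id: str
    passed: bool
    failure_tag: str = ""  # hypothetical label assigned during failure analysis


# Hypothetical mapping from a recurring failure to a harness change.
PROPOSALS = {
    "wrong_context": "tighten retrieval filters",
    "forbidden_file_changed": "add a diff scope check to the grader",
}


def propose_changes(traces: list[Trace], min_count: int = 2) -> list[dict]:
    """Turn recurring failures into harness changes that still need review."""
    counts = Counter(t.failure_tag for t in traces if not t.passed)
    proposals = []
    for tag, count in counts.most_common():
        if count >= min_count:  # one-off failures are not yet a pattern
            proposals.append({
                "failure": tag,
                "count": count,
                "change": PROPOSALS.get(tag, "needs manual analysis"),
                "status": "pending_review",  # nothing ships without review
            })
    return proposals


traces = [
    Trace("r1", True),
    Trace("r2", False, "wrong_context"),
    Trace("r3", False, "wrong_context"),
    Trace("r4", False, "forbidden_file_changed"),
    Trace("r5", False, "wrong_context"),
]
proposals = propose_changes(traces)
print(proposals)
```

The output is a list of proposed changes to the harness, not a retried task, which is the distinction this loop rests on.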

Human Oversight Belongs At Every Level

Loop engineering is not the same as removing humans.

The better frame is: automate repeatable checks and preserve human judgment where taste, context, risk, or responsibility matter.

Human checkpoints can sit in every loop:

Agent loop: require approval before sensitive tool calls such as payments, destructive file operations, database writes, or messages to real people.
Verification loop: use a human as the grader when the quality bar is subjective or high-stakes.
Event loop: require approval before an automated run publishes, emails, deploys, or escalates.
Hill-climbing loop: require review before a proposed harness change goes live.

LangChain's human-in-the-loop docs treat this as a first-class runtime primitive: a tool call can pause, then a human can approve, edit, reject, or respond.

The principle:

automate execution where rules are clear
preserve human judgment where consequences are real
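An approval gate in the agent loop can be sketched as a wrapper around tool calls. This is a simplified assumption-laden illustration, not LangChain's human-in-the-loop API: it models only approve and reject (not edit or respond), and SENSITIVE_TOOLS, call_tool, and the approve callback are hypothetical names.

```python
from typing import Callable

# Hypothetical names for tools that must not run without a human decision.
SENSITIVE_TOOLS = {"send_payment", "delete_files", "write_database", "send_message"}


def call_tool(name: str,
              argument: str,
              tools: dict[str, Callable[[str], str]],
              approve: Callable[[str, str], bool]) -> str:
    """Run a tool, pausing for a human decision when the tool is sensitive."""
    if name in SENSITIVE_TOOLS and not approve(name, argument):
        return f"rejected by reviewer: {name}"
    return tools[name](argument)


tools = {
    "search_files": lambda arg: f"found {arg}",
    "delete_files": lambda arg: f"deleted {arg}",
}


def deny_all(name: str, argument: str) -> bool:
    """Stand-in reviewer that rejects every sensitive call."""
    return False


print(call_tool("search_files", "README", tools, deny_all))
print(call_tool("delete_files", "build/", tools, deny_all))
```

Clear-rule tools run without interruption, while the reviewer is consulted only for the calls where consequences are real.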

The Stack As A Mental Model

A useful way to read any agent system is to ask which loop is missing.

If the agent cannot act

The missing layer is the agent loop.

It needs tools, context, permissions, and a harness that lets it affect the world.

If the agent acts but makes too many mistakes

The missing layer is the verification loop.

It needs tests, graders, rubrics, schemas, or scoped acceptance criteria.

If the agent works only when manually prompted

The missing layer is the event loop.

It needs schedules, webhooks, inbox triggers, issue triggers, heartbeats, or routing.

If the same failures recur

The missing layer is the hill-climbing loop.

It needs traces, failure clustering, improvement proposals, eval updates, and a path for harness changes to ship.

This diagnosis is more useful than asking, "Which model should we use?"

Why Smart People Get This Wrong

They confuse a demo with a system

A demo proves that an agent can complete one task once. A system proves that the agent can handle a class of tasks repeatedly, with recovery paths when it fails.

They keep the human as the scheduler

If the human has to remember every next prompt, the human is still the operating system. The agent is only a better command line.

They build tools but not graders

Tools let the agent act. Graders let the system know whether action was good enough.

They collect traces but do not learn from them

Observability without improvement is a dashboard. The hill-climbing loop requires traces to change prompts, tools, evals, or policy.

They over-automate the wrong edge

Some work should stay human-reviewed. The goal is not maximum autonomy everywhere. The goal is reliable autonomy where the system has enough context, clear success criteria, and acceptable downside.

How To Use This

1. Name the loop before changing the system

When an agent fails, classify the failure:

Did it lack a tool or context? Agent loop.
Did it produce unchecked bad work? Verification loop.
Did it fail to run at the right time? Event loop.
Did it repeat a known mistake? Hill-climbing loop.

The fix should match the loop.
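The four questions above amount to a small lookup from symptom to loop. A minimal sketch, with hypothetical flag names chosen to mirror the questions:

```python
from enum import Enum


class Loop(Enum):
    AGENT = "agent loop"
    VERIFICATION = "verification loop"
    EVENT = "event loop"
    HILL_CLIMBING = "hill-climbing loop"


def classify_failure(*,
                     lacked_tool_or_context: bool = False,
                     unchecked_bad_output: bool = False,
                     ran_at_wrong_time: bool = False,
                     repeated_known_mistake: bool = False) -> list[Loop]:
    """Return every loop implicated by the observed symptoms, in stack order."""
    symptoms = [
        (lacked_tool_or_context, Loop.AGENT),
        (unchecked_bad_output, Loop.VERIFICATION),
        (ran_at_wrong_time, Loop.EVENT),
        (repeated_known_mistake, Loop.HILL_CLIMBING),
    ]
    return [loop for present, loop in symptoms if present]


diagnosis = classify_failure(unchecked_bad_output=True, repeated_known_mistake=True)
print([loop.value for loop in diagnosis])
```

Returning a list rather than a single value reflects that one incident can implicate more than one loop, for example bad output that was also a repeat of a known mistake.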

2. Add deterministic checks first

Before using an LLM judge, ask what can be checked cheaply and objectively:

tests,
linters,
schema validation,
link checks,
diff scope checks,
file existence checks,
permission checks.

Deterministic checks are boring. That is why they work.
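One way to enforce this ordering is to run the cheap checks first and call the model-based judge only when all of them pass. The sketch below assumes the agent's output is a JSON object with title and body keys; the check functions, REQUIRED_KEYS, and fake_judge are hypothetical, and fake_judge stands in for a real LLM call.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = ("title", "body")  # hypothetical schema for the agent's output


def check_valid_json(output: str) -> Optional[str]:
    try:
        json.loads(output)
    except json.JSONDecodeError as error:
        return f"invalid JSON: {error.msg}"
    return None


def check_required_keys(output: str) -> Optional[str]:
    data = json.loads(output)  # safe: runs only after check_valid_json passed
    if not isinstance(data, dict):
        return "expected a JSON object"
    missing = [key for key in REQUIRED_KEYS if key not in data]
    return "missing keys: " + ", ".join(missing) if missing else None


DETERMINISTIC_CHECKS = (check_valid_json, check_required_keys)


def review(output: str, judge: Callable[[str], str]) -> str:
    """Run cheap checks in order; call the judge only if every check passes."""
    for check in DETERMINISTIC_CHECKS:
        problem = check(output)
        if problem is not None:
            return f"deterministic: {problem}"  # the judge is never called
    return f"judge: {judge(output)}"


judge_calls: list[str] = []


def fake_judge(output: str) -> str:
    """Stand-in for an LLM judge; records each call so its cost is visible."""
    judge_calls.append(output)
    return "looks complete"


print(review("not json", fake_judge))
print(review('{"title": "x"}', fake_judge))
print(review('{"title": "x", "body": "y"}', fake_judge))
print(len(judge_calls))
```

Only the third output reaches the judge, so the expensive call is spent on work that has already cleared the objective bar.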

3. Treat traces as training material for the harness

After a run, do not only ask whether the output was good. Ask what should change so the next run is more likely to be good:

Was the prompt missing a constraint?
Was a tool too broad or too dangerous?
Did retrieval bring the wrong context?
Did the rubric miss a failure?
Did the agent need a human approval gate?

4. Keep humans at the judgment bottleneck, not the prompting bottleneck

The goal is to remove the human from repetitive prompting, not from responsibility.

A good loop lets the human review the right things:

exceptions,
irreversible actions,
taste calls,
strategic changes,
proposed improvements to the system itself.

Practical Takeaways For Jamie

The agent is only the inner loop. The leverage comes from the loops around it.
Reliability comes from verification. Production agents need graders, tests, rubrics, and scoped acceptance criteria.
Leverage comes from events. Agents become useful infrastructure when they run on schedules, webhooks, inboxes, issues, and heartbeats.
Compounding comes from hill climbing. Traces should improve the harness, not just explain past failures.
Human judgment should move up the stack. Jamie should approve high-stakes decisions and harness changes, not babysit every prompt.

Key Terms

Agent loop: the model-tool-observation cycle that performs work.
Harness: the prompt, tools, context, middleware, memory, routing, and runtime wrapped around a model.
Verification loop: a grader or test layer that checks the agent's output and sends feedback for revision.
Event loop: the trigger layer that runs agents from schedules, webhooks, messages, or system events.
Hill-climbing loop: the improvement layer that uses traces and evals to update the harness.
Trace: a record of an agent run: context, tool calls, intermediate steps, grader feedback, and final output.
Human-in-the-loop: a runtime checkpoint where a human approves, edits, rejects, or supplies judgment before the system continues.

Recall Questions

What is the difference between the agent loop and the verification loop?
Why is an event-driven agent different from a manually invoked agent?
What does the hill-climbing loop improve: the task output or the harness?
Which failures should be handled by deterministic checks before LLM judges?
Where should human oversight sit in a mature agent system?

Best Resources To Learn More

Start with Sydney Runkle's article for the four-loop framing.
Read swyx / Latent Space on "loopcraft" for the broader shift from prompting agents to designing loops that prompt agents.
Read the LangChain agent docs for the model + harness framing.
Read the rubric and human-in-the-loop docs for concrete verification and approval primitives.
Read the LangSmith Engine page for the trace-to-fix version of the hill-climbing loop.

Sources

Sydney Runkle. "The Art of Loop Engineering." X Article, 16 Jun 2026. https://x.com/sydneyrunkle/status/2066928783534289358
Latent.Space / AINews. "Loopcraft: The Art of Stacking Loops." 12 Jun 2026. https://www.latent.space/p/ainews-loopcraft-the-art-of-stacking
LangChain Docs. "Agents." https://docs.langchain.com/oss/python/langchain/agents
LangChain Docs. "Grading rubrics." https://docs.langchain.com/oss/python/deepagents/rubric
LangChain Docs. "Human-in-the-loop." https://docs.langchain.com/oss/python/deepagents/human-in-the-loop
LangSmith Docs. "Use Cron Jobs." https://docs.langchain.com/langsmith/cron-jobs
LangSmith Docs. "Use Webhooks." https://docs.langchain.com/langsmith/use-webhooks
LangSmith. "Engine." https://www.langchain.com/langsmith/engine
OpenClaw Docs. "Heartbeat." https://docs.openclaw.ai/gateway/heartbeat

