/* global React */

function Chapter10() {
  return (
    <section className="chapter" id="ch-10" data-screen-label="10 Production">
      <div className="chapter-header">
        <div className="eyebrow">Chapter 10 · Production</div>
        <h1 className="chapter-title">Cost, eval, retries, guardrails — what changes when real users show up.</h1>
        <p className="chapter-lede">
          Toy agents pass demos. Production agents survive a Tuesday. The skills are different. This chapter is the
          condensed version of what you'll learn the hard way otherwise.
        </p>
      </div>

      <SectionTitle num="10.1">Cost: the silent killer</SectionTitle>
      <p>
        On every loop iteration the entire message history is shipped back to the API. By turn 8, you might be sending
        15K tokens of context to make a 30-token decision. Three knobs:
      </p>
      <ul>
        <li><strong>Trim history</strong> — drop oldest tool messages once they're no longer relevant.</li>
        <li><strong>Summarize</strong> — collapse N old turns into one assistant note.</li>
        <li><strong>Tier models</strong> — use <code>gpt-4o-mini</code> for tool selection, <code>gpt-4o</code> only for the final synthesis (sketched below).</li>
      </ul>
      <CodeBlock file="trim_messages.py">{`from langchain_core.messages import trim_messages

trimmer = trim_messages(
    max_tokens=4000,
    strategy="last",          # keep last N tokens
    token_counter=model,
    include_system=True,      # never drop the system message
    allow_partial=False,
    start_on="human",         # ensure history starts on a human turn
)

chain = trimmer | prompt | model_with_tools`}</CodeBlock>
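      <p>
        The third knob is mostly a routing decision. A minimal sketch, reusing the <code>tools</code> list from
        earlier chapters; the two-model split and the hand-off prompt are illustrative, not prescriptive:
      </p>
      <CodeBlock file="tiered_models.py">{`from langchain_openai import ChatOpenAI

# Cheap, fast model drives the tool-calling loop...
loop_model = ChatOpenAI(model="gpt-4o-mini").bind_tools(tools)

# ...the expensive model only writes the final answer from gathered context.
synthesis_model = ChatOpenAI(model="gpt-4o")

def answer(question: str, gathered_context: str) -> str:
    # gathered_context = whatever the tool loop (run with loop_model) collected
    return synthesis_model.invoke(
        "Answer the question using only this context. "
        "Question: " + question + " Context: " + gathered_context
    ).content`}</CodeBlock>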

      <SectionTitle num="10.2">Eval — beyond "looks fine on my machine"</SectionTitle>
      <p>You need a dataset and a metric. Bare minimum:</p>
      <ol>
        <li><strong>Dataset</strong> — 50–200 real (or realistic) inputs with expected outputs or pass/fail criteria.</li>
        <li><strong>Metric</strong> — exact match, embedding similarity, or LLM-as-judge for fuzzy outputs.</li>
        <li><strong>Runner</strong> — replay the dataset against every change, score it.</li>
      </ol>
      <CodeBlock file="evaluate.py">{`from langsmith import Client
from langsmith.evaluation import evaluate

def correctness(run, example):
    expected = example.outputs["answer"]
    actual = run.outputs["output"]
    score = llm_judge(expected, actual)   # your own LLM-as-judge helper, returns 0..1
    return {"key": "correctness", "score": score}

evaluate(
    lambda inputs: executor.invoke(inputs),
    data="weather_agent_v1",        # dataset name in LangSmith
    evaluators=[correctness],
    experiment_prefix="prompt-tweak-attempt-3",
)`}</CodeBlock>
      <Callout kind="tip" title="Test the loop, not just the LLM">
        It's tempting to write unit tests against a prompt + canned input. But agent failures are loop failures —
        wrong tool, wrong order, infinite retry. Run end-to-end on real prompts; assert on the <em>final</em> answer
        and the <em>tool trajectory</em>.
      </Callout>
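      <p>
        One way to do that: build the executor with <code>return_intermediate_steps=True</code> and assert on the
        sequence of tool names it produces. A minimal sketch; the question and the expected tool names are
        placeholders for your own agent:
      </p>
      <CodeBlock file="test_trajectory.py">{`# Assumes executor was built with return_intermediate_steps=True
def test_umbrella_question_calls_tools_in_order():
    result = executor.invoke({"input": "Do I need an umbrella in Oslo tomorrow?"})

    # Trajectory: which tools ran, in which order
    trajectory = [action.tool for action, _observation in result["intermediate_steps"]]
    assert trajectory == ["search_city", "get_forecast"]   # hypothetical tool names

    # Final answer: loose assertion, not an exact-string match
    assert "Oslo" in result["output"]`}</CodeBlock>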

      <SectionTitle num="10.3">Retries and timeouts</SectionTitle>
      <CodeBlock file="resilience.py">{`# Model-level: retry on rate limit / 5xx
model = ChatOpenAI(model="gpt-4o-mini", max_retries=3, timeout=30)

# Chain-level: wrap with .with_retry()
chain = (prompt | model).with_retry(
    retry_if_exception_type=(TimeoutError, ConnectionError),
    wait_exponential_jitter=True,
    stop_after_attempt=3,
)

# Executor-level: cap iterations and wall clock
executor = AgentExecutor(
    agent=agent, tools=tools,
    max_iterations=10,
    max_execution_time=60,        # seconds
)`}</CodeBlock>

      <SectionTitle num="10.4">Guardrails — input and output</SectionTitle>
      <p>Three layers, each catching a different class of failure; a minimal sketch of the input and output layers follows the lists:</p>
      <h4 className="mini-title">Input guardrails (before the LLM)</h4>
      <ul>
        <li>Reject prompts over a token cap.</li>
        <li>Block known prompt-injection patterns ("ignore previous instructions").</li>
        <li>Strip secrets / PII before sending to a third-party API.</li>
      </ul>
      <h4 className="mini-title">Tool guardrails (around tool calls)</h4>
      <ul>
        <li>Whitelist arguments — never let the LLM pass <code>DROP TABLE</code> to your SQL tool.</li>
        <li>Wrap dangerous tools (send_email, charge_card) in human-approval steps.</li>
        <li>Rate-limit per session.</li>
      </ul>
      <h4 className="mini-title">Output guardrails (after the LLM)</h4>
      <ul>
        <li>Validate JSON outputs with Pydantic.</li>
        <li>Re-ask if the model violates a content policy.</li>
        <li>Run a moderation classifier before returning to the user.</li>
      </ul>
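      <p>
        A minimal sketch of the input and output layers, reusing the <code>model</code> from earlier; the injection
        patterns and the <code>Report</code> schema are illustrative placeholders, not a complete defense:
      </p>
      <CodeBlock file="guardrails.py">{`import re

from pydantic import BaseModel, ValidationError

INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"reveal the system prompt"]

def check_input(text: str, max_tokens: int = 2000) -> str:
    # Input guardrail: token cap plus a crude injection screen
    if model.get_num_tokens(text) > max_tokens:
        raise ValueError("input too long")
    if any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS):
        raise ValueError("possible prompt injection")
    return text

class Report(BaseModel):
    # Output schema the agent must satisfy
    city: str
    temperature_c: float

def check_output(raw_json: str) -> Report:
    # Output guardrail: validate, re-ask the model once on failure
    try:
        return Report.model_validate_json(raw_json)
    except ValidationError as err:
        retry = model.invoke(
            "Return ONLY valid JSON matching this schema: "
            + str(Report.model_json_schema())
            + " Your previous attempt failed with: " + str(err)
        )
        return Report.model_validate_json(retry.content)`}</CodeBlock>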

      <Callout kind="warning" title="Human-in-the-loop is not optional for destructive tools">
        If the agent can spend money, send mail to real people, write to a production DB, or run shell commands, you
        need a confirmation step. LangGraph's <code>interrupt()</code> primitive is built for this. Don't let a 4¢
        token mistake become a $40K incident.
      </Callout>
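      <p>
        A minimal sketch of that pattern with LangGraph; the email node, state shape, and thread id are invented for
        illustration, and the exact <code>interrupt()</code> API can shift between versions:
      </p>
      <CodeBlock file="human_approval.py">{`from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END
from langgraph.types import Command, interrupt

class State(TypedDict):
    to: str
    body: str
    sent: bool

def send_email(state: State) -> dict:
    # Pause the graph and surface this payload to whoever supervises the run
    approved = interrupt({"to": state["to"], "body": state["body"]})
    if not approved:
        return {"sent": False}
    # email_client.send(state["to"], state["body"])   # the irreversible part
    return {"sent": True}

builder = StateGraph(State)
builder.add_node("send_email", send_email)
builder.add_edge(START, "send_email")
builder.add_edge("send_email", END)
graph = builder.compile(checkpointer=MemorySaver())   # interrupt() needs a checkpointer

config = {"configurable": {"thread_id": "run-42"}}
graph.invoke({"to": "ops@example.com", "body": "hi"}, config)   # pauses at interrupt()
graph.invoke(Command(resume=True), config)   # human approved: node re-runs and sends`}</CodeBlock>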

      <SectionTitle num="10.5">Deployment shape</SectionTitle>
      <p>For 95% of cases, the deployed agent looks like one of these:</p>
      <ul>
        <li><strong>HTTP endpoint</strong> — FastAPI / Flask wrapping <code>executor.ainvoke</code>. Stream via SSE (sketched below).</li>
        <li><strong>Background worker</strong> — agents that run for minutes belong on a queue (Celery, Temporal, LangGraph Cloud).</li>
        <li><strong>LangServe</strong> — opinionated FastAPI integration that exposes any Runnable, with automatic playground.</li>
      </ul>
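      <p>
        A minimal sketch of the first shape, wrapping <code>executor.ainvoke</code> in FastAPI; the request model and
        route are arbitrary, and SSE streaming is left as an exercise:
      </p>
      <CodeBlock file="app.py">{`from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    input: str
    user_id: str          # so a trace can be pulled per user later

@app.post("/ask")
async def ask(req: AskRequest):
    result = await executor.ainvoke(
        {"input": req.input},
        config={"metadata": {"user_id": req.user_id}},   # attached to the trace
    )
    return {"output": result["output"]}`}</CodeBlock>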

      <SectionTitle num="10.6">A short, honest checklist before launch</SectionTitle>
      <ul>
        <li>☐ <strong>max_iterations</strong> set, with sane wall-clock timeout</li>
        <li>☐ Every tool wraps its body in try/except and returns errors as data</li>
        <li>☐ LangSmith (or OpenTelemetry) wired up; you can pull a trace by user_id</li>
        <li>☐ Eval dataset of 50+ realistic inputs, scored on every release</li>
        <li>☐ Token / cost dashboard per user and per route</li>
        <li>☐ Human-approval gate on any irreversible tool</li>
        <li>☐ Prompt and tool catalog versioned in git, not edited in production</li>
        <li>☐ A "kill switch" that disables a misbehaving tool without a deploy</li>
      </ul>
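      <p>
        The kill switch does not need to be clever. A minimal sketch that rebuilds the tool list per request; the flag
        store here is a <code>DISABLED_TOOLS</code> environment variable, which you would swap for your real
        feature-flag service:
      </p>
      <CodeBlock file="kill_switch.py">{`import os

from langchain.agents import AgentExecutor, create_tool_calling_agent

ALL_TOOLS = {t.name: t for t in tools}

def active_tools() -> list:
    # Kill switch: tool names listed in DISABLED_TOOLS are dropped at request
    # time, so pulling a misbehaving tool does not require a deploy.
    disabled = set(os.environ.get("DISABLED_TOOLS", "").split(","))
    return [t for name, t in ALL_TOOLS.items() if name not in disabled]

# Rebuild agent + executor per request (or on a timer) so the flag takes effect
enabled = active_tools()
agent = create_tool_calling_agent(model, enabled, prompt)
executor = AgentExecutor(agent=agent, tools=enabled, max_iterations=10)`}</CodeBlock>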

      <div className="divider"></div>

      <SectionTitle num="">You made it</SectionTitle>
      <p>
        That's the full surface — primitives, composition, tools, agents, executor, MCP, observability, production. From
        here, three good places to go next:
      </p>
      <ul>
        <li><strong>LangGraph</strong> — the modern way to build agents. Same intuition, graph-based runtime, real persistence and human-in-the-loop primitives.</li>
        <li><strong>Build your own MCP server</strong> — the fastest way to learn the protocol is to write 20 lines of <code>FastMCP</code> and connect it from Claude Desktop.</li>
        <li><strong>Read three real agent codebases</strong> — open-source projects like <em>aider</em>, <em>continue.dev</em>, or LangChain's own templates. Every one of them is a variation on the loop you now understand.</li>
      </ul>

      <Callout kind="intuition" title="The summary, in one breath">
        Agent = LLM + tools + loop. Messages are the wire. LCEL composes Runnables. Tools are typed Python functions
        the LLM picks via structured output. The executor runs the loop. MCP is tool calling over a process boundary.
        Tracing is non-optional. Everything else is detail.
      </Callout>
    </section>
  );
}

window.Chapter10 = Chapter10;
