/* global React */

function Chapter10() {
  return (
    <section className="chapter" id="ch-10" data-screen-label="10 Production">
      <div className="chapter-header">
        <div className="eyebrow">Chapter 10 · Production</div>
        <h1 className="chapter-title">Cost, eval, retries, guardrails — what changes when real users show up.</h1>
        <p className="chapter-lede">
          Toy agents pass demos. Production agents survive a Tuesday. The skills are different. This chapter is the
          condensed version of what you'll learn the hard way otherwise.
        </p>
      </div>

      <SectionTitle num="10.1">Cost: the silent killer</SectionTitle>
      <p>
        On every loop iteration the entire message history is shipped back to the API. By turn 8, you might be sending
        15K tokens of context to make a 30-token decision. Three knobs:
      </p>
      <ul>
        <li><strong>Trim history</strong> — drop oldest tool messages once they're no longer relevant.</li>
        <li><strong>Summarize</strong> — collapse N old turns into one assistant note.</li>
        <li><strong>Tier models</strong> — use <code>gpt-4o-mini</code> for tool selection, <code>gpt-4o</code> only for the final synthesis (sketched below).</li>
      </ul>
      <CodeBlock file="trim_messages.py">{`from langchain_core.messages import trim_messages

trimmer = trim_messages(
    max_tokens=4000,
    strategy="last",          # keep last N tokens
    token_counter=model,
    include_system=True,      # never drop the system message
    allow_partial=False,
    start_on="human",         # ensure history starts on a human turn
)

chain = trimmer | prompt | model_with_tools`}</CodeBlock>
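      <p>
        The third knob is mostly a routing decision. A minimal sketch, reusing the <code>tools</code> list from
        earlier chapters; the two-model split and the hand-off prompt are illustrative, not prescriptive:
      </p>
      <CodeBlock file="tiered_models.py">{`from langchain_openai import ChatOpenAI

# Cheap, fast model drives the tool-calling loop...
loop_model = ChatOpenAI(model="gpt-4o-mini").bind_tools(tools)

# ...the expensive model only writes the final answer from gathered context.
synthesis_model = ChatOpenAI(model="gpt-4o")

def answer(question: str, gathered_context: str) -> str:
    # gathered_context = whatever the tool loop (run with loop_model) collected
    return synthesis_model.invoke(
        "Answer the question using only this context. "
        "Question: " + question + " Context: " + gathered_context
    ).content`}</CodeBlock>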

      <SectionTitle num="10.2">Eval — beyond "looks fine on my machine"</SectionTitle>
      <p>You need a dataset and a metric. Bare minimum:</p>
      <ol>
        <li><strong>Dataset</strong> — 50–200 real (or realistic) inputs with expected outputs or pass/fail criteria.</li>
        <li><strong>Metric</strong> — exact match, embedding similarity, or LLM-as-judge for fuzzy outputs.</li>
        <li><strong>Runner</strong> — replay the dataset against every change, score it.</li>
      </ol>
      <CodeBlock file="evaluate.py">{`from langsmith import Client
from langsmith.evaluation import evaluate

def correctness(run, example):
    expected = example.outputs["answer"]
    actual = run.outputs["output"]
    score = llm_judge(expected, actual)   # your own LLM-as-judge helper, returns 0..1
    return {"key": "correctness", "score": score}

evaluate(
    lambda inputs: executor.invoke(inputs),
    data="weather_agent_v1",        # dataset name in LangSmith
    evaluators=[correctness],
    experiment_prefix="prompt-tweak-attempt-3",
)`}</CodeBlock>
      <Callout kind="tip" title="Test the loop, not just the LLM">
        It's tempting to write unit tests against a prompt + canned input. But agent failures are loop failures —
        wrong tool, wrong order, infinite retry. Run end-to-end on real prompts; assert on the <em>final</em> answer
        and the <em>tool trajectory</em>.
      </Callout>
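      <p>
        One way to do that: build the executor with <code>return_intermediate_steps=True</code> and assert on the
        sequence of tool names it produces. A minimal sketch; the question and the expected tool names are
        placeholders for your own agent:
      </p>
      <CodeBlock file="test_trajectory.py">{`# Assumes executor was built with return_intermediate_steps=True
def test_umbrella_question_calls_tools_in_order():
    result = executor.invoke({"input": "Do I need an umbrella in Oslo tomorrow?"})

    # Trajectory: which tools ran, in which order
    trajectory = [action.tool for action, _observation in result["intermediate_steps"]]
    assert trajectory == ["search_city", "get_forecast"]   # hypothetical tool names

    # Final answer: loose assertion, not an exact-string match
    assert "Oslo" in result["output"]`}</CodeBlock>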

      <SectionTitle num="10.3">Retries and timeouts</SectionTitle>
      <CodeBlock file="resilience.py">{`# Model-level: retry on rate limit / 5xx
model = ChatOpenAI(model="gpt-4o-mini", max_retries=3, timeout=30)

# Chain-level: wrap with .with_retry()
chain = (prompt | model).with_retry(
    retry_if_exception_type=(TimeoutError, ConnectionError),
    wait_exponential_jitter=True,
    stop_after_attempt=3,
)

# Executor-level: cap iterations and wall clock
executor = AgentExecutor(
    agent=agent, tools=tools,
    max_iterations=10,
    max_execution_time=60,        # seconds
)`}</CodeBlock>

      <SectionTitle num="10.4">Guardrails — input and output</SectionTitle>
      <p>Three layers, each catching a different class of failure; a minimal sketch of the input and output layers follows the lists:</p>
      <h4 className="mini-title">Input guardrails (before the LLM)</h4>
      <ul>
        <li>Reject prompts over a token cap.</li>
        <li>Block known prompt-injection patterns ("ignore previous instructions").</li>
        <li>Strip secrets / PII before sending to a third-party API.</li>
      </ul>
      <h4 className="mini-title">Tool guardrails (around tool calls)</h4>
      <ul>
        <li>Whitelist arguments — never let the LLM pass <code>DROP TABLE</code> to your SQL tool.</li>
        <li>Wrap dangerous tools (send_email, charge_card) in human-approval steps.</li>
        <li>Rate-limit per session.</li>
      </ul>
      <h4 className="mini-title">Output guardrails (after the LLM)</h4>
      <ul>
        <li>Validate JSON outputs with Pydantic.</li>
        <li>Re-ask if the model violates a content policy.</li>
        <li>Run a moderation classifier before returning to the user.</li>
      </ul>
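      <p>
        A minimal sketch of the input and output layers, reusing the <code>model</code> from earlier; the injection
        patterns and the <code>Report</code> schema are illustrative placeholders, not a complete defense:
      </p>
      <CodeBlock file="guardrails.py">{`import re

from pydantic import BaseModel, ValidationError

INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"reveal the system prompt"]

def check_input(text: str, max_tokens: int = 2000) -> str:
    # Input guardrail: token cap plus a crude injection screen
    if model.get_num_tokens(text) > max_tokens:
        raise ValueError("input too long")
    if any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS):
        raise ValueError("possible prompt injection")
    return text

class Report(BaseModel):
    # Output schema the agent must satisfy
    city: str
    temperature_c: float

def check_output(raw_json: str) -> Report:
    # Output guardrail: validate, re-ask the model once on failure
    try:
        return Report.model_validate_json(raw_json)
    except ValidationError as err:
        retry = model.invoke(
            "Return ONLY valid JSON matching this schema: "
            + str(Report.model_json_schema())
            + " Your previous attempt failed with: " + str(err)
        )
        return Report.model_validate_json(retry.content)`}</CodeBlock>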

      <Callout kind="warning" title="Human-in-the-loop is not optional for destructive tools">
        If the agent can spend money, send mail to real people, write to a production DB, or run shell commands, you
        need a confirmation step. LangGraph's <code>interrupt()</code> primitive is built for this. Don't let a 4¢
        token mistake become a $40K incident.
      </Callout>
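      <p>
        A minimal sketch of that pattern with LangGraph; the email node, state shape, and thread id are invented for
        illustration, and the exact <code>interrupt()</code> API can shift between versions:
      </p>
      <CodeBlock file="human_approval.py">{`from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END
from langgraph.types import Command, interrupt

class State(TypedDict):
    to: str
    body: str
    sent: bool

def send_email(state: State) -> dict:
    # Pause the graph and surface this payload to whoever supervises the run
    approved = interrupt({"to": state["to"], "body": state["body"]})
    if not approved:
        return {"sent": False}
    # email_client.send(state["to"], state["body"])   # the irreversible part
    return {"sent": True}

builder = StateGraph(State)
builder.add_node("send_email", send_email)
builder.add_edge(START, "send_email")
builder.add_edge("send_email", END)
graph = builder.compile(checkpointer=MemorySaver())   # interrupt() needs a checkpointer

config = {"configurable": {"thread_id": "run-42"}}
graph.invoke({"to": "ops@example.com", "body": "hi"}, config)   # pauses at interrupt()
graph.invoke(Command(resume=True), config)   # human approved: node re-runs and sends`}</CodeBlock>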

      <SectionTitle num="10.5">Deployment shape</SectionTitle>
      <p>For 95% of cases, the deployed agent looks like one of these:</p>
      <ul>
        <li><strong>HTTP endpoint</strong> — FastAPI / Flask wrapping <code>executor.ainvoke</code>. Stream via SSE (sketched below).</li>
        <li><strong>Background worker</strong> — agents that run for minutes belong on a queue (Celery, Temporal, LangGraph Cloud).</li>
        <li><strong>LangServe</strong> — opinionated FastAPI integration that exposes any Runnable, with automatic playground.</li>
      </ul>
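      <p>
        A minimal sketch of the first shape, wrapping <code>executor.ainvoke</code> in FastAPI; the request model and
        route are arbitrary, and SSE streaming is left as an exercise:
      </p>
      <CodeBlock file="app.py">{`from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    input: str
    user_id: str          # so a trace can be pulled per user later

@app.post("/ask")
async def ask(req: AskRequest):
    result = await executor.ainvoke(
        {"input": req.input},
        config={"metadata": {"user_id": req.user_id}},   # attached to the trace
    )
    return {"output": result["output"]}`}</CodeBlock>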

      <SectionTitle num="10.6">A short, honest checklist before launch</SectionTitle>
      <ul>
        <li>☐ <strong>max_iterations</strong> set, with sane wall-clock timeout</li>
        <li>☐ Every tool wraps its body in try/except and returns errors as data</li>
        <li>☐ LangSmith (or OpenTelemetry) wired up; you can pull a trace by user_id</li>
        <li>☐ Eval dataset of 50+ realistic inputs, scored on every release</li>
        <li>☐ Token / cost dashboard per user and per route</li>
        <li>☐ Human-approval gate on any irreversible tool</li>
        <li>☐ Prompt and tool catalog versioned in git, not edited in production</li>
        <li>☐ A "kill switch" that disables a misbehaving tool without a deploy</li>
      </ul>
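      <p>
        The kill switch does not need to be clever. A minimal sketch that rebuilds the tool list per request; the flag
        store here is a <code>DISABLED_TOOLS</code> environment variable, which you would swap for your real
        feature-flag service:
      </p>
      <CodeBlock file="kill_switch.py">{`import os

from langchain.agents import AgentExecutor, create_tool_calling_agent

ALL_TOOLS = {t.name: t for t in tools}

def active_tools() -> list:
    # Kill switch: tool names listed in DISABLED_TOOLS are dropped at request
    # time, so pulling a misbehaving tool does not require a deploy.
    disabled = set(os.environ.get("DISABLED_TOOLS", "").split(","))
    return [t for name, t in ALL_TOOLS.items() if name not in disabled]

# Rebuild agent + executor per request (or on a timer) so the flag takes effect
enabled = active_tools()
agent = create_tool_calling_agent(model, enabled, prompt)
executor = AgentExecutor(agent=agent, tools=enabled, max_iterations=10)`}</CodeBlock>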

      <div className="divider"></div>

      <SectionTitle num="">You made it</SectionTitle>
      <p>
        That's the full surface — primitives, composition, tools, agents, executor, MCP, observability, production. From
        here, three good places to go next:
      </p>
      <ul>
        <li><strong>LangGraph</strong> — the modern way to build agents. Same intuition, graph-based runtime, real persistence and human-in-the-loop primitives.</li>
        <li><strong>Build your own MCP server</strong> — the fastest way to learn the protocol is to write 20 lines of <code>FastMCP</code> and connect it from Claude Desktop.</li>
        <li><strong>Read three real agent codebases</strong> — open-source projects like <em>aider</em>, <em>continue.dev</em>, or LangChain's own templates. Every one of them is a variation on the loop you now understand.</li>
      </ul>

      <Callout kind="intuition" title="The summary, in one breath">
        Agent = LLM + tools + loop. Messages are the wire. LCEL composes Runnables. Tools are typed Python functions
        the LLM picks via structured output. The executor runs the loop. MCP is tool calling over a process boundary.
        Tracing is non-optional. Everything else is detail.
      </Callout>
    </section>
  );
}

window.Chapter10 = Chapter10;
