Designing a Chatbot System in 2026

In 2016, building a chatbot meant training Dialogflow to recognize every variation of “hello” a user might type. You wrote if-else trees mapping intents to actions. User says “book a flight,” you extract entities (destination, date), call an API, return a response. The NLU engine was the bottleneck. If users phrased something outside your training data, the bot broke.

Ten years later, the problem flipped. Modern LLMs understand intent without training examples. They can call tools, reason about which API to invoke, and handle edge cases you never anticipated. The new bottleneck is not understanding what the user wants. It is fetching the right data, executing the right actions, maintaining context across turns, and coordinating multiple specialized agents.

If you designed a chatbot system in 2016, the architecture looked like this: an NLU engine, an action server, Redis queues for message ordering, and a chat log for persistence. The complexity lived in intent classification and in scaling message processing.

In 2026, the architecture looks different.

What changed

The core shift is from rule-based intent mapping to tool-calling models. OpenAI’s function calling, Anthropic’s tool use, Google’s function declarations all follow the same pattern. You give the model a set of tools it can invoke. The model decides which tool to call based on conversation context. You execute the tool and return results. The model uses those results to respond to the user.
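The pattern reduces to a loop: call the model, execute whatever tool it requests, feed the result back, repeat until it answers. A minimal sketch, with the model stubbed out (`fake_model` and the `get_weather` tool are illustrative, not any vendor's real API):

```python
# Hypothetical tool the model can invoke.
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21}

TOOLS = {"get_weather": get_weather}

def fake_model(messages):
    """Stand-in for an LLM call. A real model would return either a
    tool-call request or a final answer based on the conversation."""
    last = messages[-1]
    if last["role"] == "user":
        return {"type": "tool_call", "name": "get_weather",
                "arguments": {"city": "Paris"}}
    return {"type": "final", "text": f"It is {last['content']['temp_c']}°C."}

def run_turn(user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    while True:
        reply = fake_model(messages)
        if reply["type"] == "final":
            return reply["text"]
        # Execute the requested tool and feed the result back to the model.
        result = TOOLS[reply["name"]](**reply["arguments"])
        messages.append({"role": "tool", "content": result})
```

The loop is the whole architecture in miniature: everything that follows (memory, MCP, subagents) is machinery that decides what goes into `messages` and what sits behind `TOOLS`.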

This sounds simple until you scale it. Real systems need:

  1. Memory: Conversation history, user preferences, past actions
  2. Tool orchestration: Deciding which tool to call, handling failures, retries
  3. Context management: What information to keep in the prompt, what to retrieve on demand
  4. Agent coordination: Multiple specialized agents working together
  5. Data fetching: Retrieving relevant information from databases, APIs, vector stores
  6. Action execution: Actually doing things (sending emails, booking tickets, updating records)

The architecture pieces that handle these are different from 2016.

Modern chatbot architecture

graph TB
    User[User Input] --> Gateway[API Gateway]
    Gateway --> Router[Agent Router]

    Router --> Agent[Primary Agent/LLM]

    Agent --> Tools[Tool Registry]
    Tools --> MCP[MCP Servers]
    Tools --> DirectAPI[Direct APIs]

    Agent --> Memory[Memory System]
    Memory --> ShortTerm[(Conversation Buffer)]
    Memory --> LongTerm[(Vector Store)]

    Agent --> SubAgents[Specialized Agents]
    SubAgents --> DataAgent[Data Retrieval Agent]
    SubAgents --> ActionAgent[Action Execution Agent]
    SubAgents --> AnalysisAgent[Analysis Agent]

    MCP --> FileSystem[Filesystem MCP]
    MCP --> Database[Database MCP]
    MCP --> External[External Service MCP]

    Agent --> Context[Context Manager]
    Context --> Retrieval[(RAG System)]

    Agent --> Response[Response Generator]
    Response --> User

    ActionAgent --> AuditLog[(Audit Log)]
    DataAgent --> Cache[(Redis Cache)]

The flow is no longer linear. A user message hits the primary agent, which can:

  • Call tools directly through MCP servers
  • Delegate to specialized subagents
  • Retrieve context from memory or RAG systems
  • Execute actions and store results
  • Coordinate multiple operations before responding

Tool calling and MCP

OpenAI’s Assistants API, Claude’s tool use, and Google’s Vertex AI function calling all expose tools to the model. The model gets a JSON schema describing available functions. When it needs to use one, it outputs a structured request. Your code executes the function and returns results.
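Concretely, a tool definition is a JSON schema plus metadata, and the model's request is structured data rather than prose. A sketch in the shape Anthropic's tool use expects (OpenAI's format is similar, with `parameters` in place of `input_schema`; the weather tool is illustrative):

```python
import json

# Tool definition: name, description, and a JSON schema for arguments.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}

# When the model decides to use the tool, it emits a structured request.
# Your code parses it, runs the function, and returns the result in a
# tool-result message.
model_output = json.loads('{"name": "get_weather", "input": {"city": "Berlin"}}')
args = model_output["input"]
```

The schema does double duty: it tells the model what the tool does and lets your code validate arguments before executing anything.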

Model Context Protocol (MCP) standardizes this. Instead of writing custom integrations for every data source, you run MCP servers. Each server exposes tools, resources, and prompts. The agent connects to MCP servers and calls their tools.

Claude Code uses this heavily. It connects to MCP servers for filesystem access, database queries, web browsing, IDE integration. When you ask Claude Code to read a file, it calls the filesystem MCP server’s read tool. When it needs to search code, it uses the IDE MCP server.

graph LR
    Agent[LLM Agent] --> MCP1[Filesystem MCP]
    Agent --> MCP2[Database MCP]
    Agent --> MCP3[Web MCP]
    Agent --> MCP4[Custom API MCP]

    MCP1 --> FS[(File System)]
    MCP2 --> DB[(PostgreSQL)]
    MCP3 --> Web[Web APIs]
    MCP4 --> Custom[Internal Services]
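On the wire, MCP carries JSON-RPC 2.0 messages, commonly over stdio. A toy handler for the two requests at the heart of the protocol, tools/list and tools/call (error handling, capability negotiation, and resources are omitted, and the read_file tool is illustrative):

```python
import json

# Registry of tools this toy server exposes.
TOOLS = {
    "read_file": {
        "description": "Read a file from disk.",
        "handler": lambda args: {"content": open(args["path"]).read()},
    },
}

def handle(request_json: str) -> str:
    """Dispatch one JSON-RPC request and return the JSON-RPC response."""
    req = json.loads(request_json)
    if req["method"] == "tools/list":
        result = {"tools": [{"name": n, "description": t["description"]}
                            for n, t in TOOLS.items()]}
    elif req["method"] == "tools/call":
        tool = TOOLS[req["params"]["name"]]
        result = tool["handler"](req["params"]["arguments"])
    else:
        result = {"error": "unknown method"}
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})
```

A real server would use the official MCP SDK rather than hand-rolling this, but the shape is the point: the agent discovers tools at runtime, then invokes them by name with structured arguments.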

Google’s commerce toolkit follows a similar pattern. It exposes shopping APIs (product search, inventory check, price comparison) as function declarations. The model decides when to search products versus check inventory versus compare prices. The toolkit handles the API calls.

The advantage is extensibility. Adding a new capability means writing an MCP server or adding a function declaration. The model learns to use it without retraining.

Memory systems

A 2016 chatbot tracked state in Redis: current intent, extracted entities, conversation step. Modern systems need richer memory.

OpenAI’s Assistants API maintains thread history automatically. Every message persists. The assistant has access to the full conversation when responding. For long conversations, it truncates early messages but keeps recent context.

Claude Code uses a different approach. It maintains conversation context through summarization. Old messages get compressed into summaries. Recent messages stay in full. This keeps context windows manageable while preserving important information.

The memory layer typically splits into:

Short-term memory: Recent conversation turns, current task context, active tool results. Lives in the prompt or a conversation buffer.

Long-term memory: User preferences, past interactions, learned patterns. Stored in a vector database. Retrieved when relevant.

Working memory: Intermediate results from tool calls, agent-to-agent communication, temporary state. Often in Redis or in-memory cache.
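A minimal sketch of the three layers, with the vector store and cache stubbed as plain dicts (retrieval here is a crude substring match; a real system would embed and index):

```python
from collections import deque

class Memory:
    """Sketch of the three memory layers: a bounded conversation
    buffer, a long-term store, and working memory for tool results."""

    def __init__(self, buffer_size: int = 10):
        self.short_term = deque(maxlen=buffer_size)  # recent turns
        self.long_term = {}    # stand-in for a vector store
        self.working = {}      # stand-in for Redis

    def add_turn(self, role: str, text: str):
        self.short_term.append({"role": role, "text": text})

    def remember(self, key: str, value: str):
        self.long_term[key] = value   # real systems embed and index

    def stash(self, key: str, value):
        self.working[key] = value     # intermediate tool results

    def build_context(self, query: str) -> list:
        # Combine recent turns with anything retrieved for this query.
        retrieved = [v for k, v in self.long_term.items() if k in query]
        return list(self.short_term) + [{"role": "memory", "text": t}
                                        for t in retrieved]
```

The useful property is that `build_context` is the single place where "what does the agent see this turn" gets decided, which makes token budgets auditable.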

graph TB
    Input[User Message] --> STM[Short-term Memory]
    STM --> Context[Context Window]

    Input --> Retrieval[Retrieval System]
    Retrieval --> LTM[(Long-term Memory/Vector DB)]
    LTM --> Context

    Context --> Agent[LLM Agent]

    Agent --> Tools[Tool Execution]
    Tools --> WM[(Working Memory/Redis)]
    WM --> Agent

    Agent --> Update[Memory Update]
    Update --> STM
    Update --> LTM

The challenge is deciding what to remember. Too much context and you hit token limits. Too little and the agent lacks information to make good decisions. Retrieval-Augmented Generation (RAG) helps by fetching relevant information on demand instead of keeping everything in context.
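The retrieval step itself is ranking by similarity. A sketch with a bag-of-words vector standing in for a learned embedding, so the scoring logic is visible (real systems use embedding models and a vector database):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a learned embedding: word counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query; return the top k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = ["refund policy for cancelled flights",
        "how to reset your password",
        "baggage allowance for economy flights"]
```

Swap `embed` for a real embedding model and `sorted` for an approximate-nearest-neighbor index and this is the retrieval half of a RAG pipeline.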

Agent coordination

Complex tasks require multiple agents. Claude Code demonstrates this. When you ask it to fix a bug, it might:

  1. Use an Explore agent to search the codebase
  2. Use a Bash agent to run tests
  3. Use the main agent to write the fix
  4. Use a test-runner agent to verify the fix

Each agent is specialized. The Explore agent has tools for code search and navigation. The Bash agent executes shell commands. The test-runner knows how to interpret test output.

The coordination pattern is hierarchical. The primary agent decomposes tasks and delegates to subagents. Subagents complete their work and return results. The primary agent synthesizes results and responds to the user.
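The delegation shape can be sketched with each subagent reduced to a plain function; in a real system each would be its own LLM call with its own tools and prompt, and the primary agent would ask the model to produce the plan rather than hardcoding it (all names here are illustrative):

```python
def explore_agent(task: str) -> str:
    # Would search the codebase with code-navigation tools.
    return f"found 2 files relevant to '{task}'"

def bash_agent(task: str) -> str:
    # Would execute shell commands and capture output.
    return "3 tests failed in test_auth.py"

SUBAGENTS = {"search": explore_agent, "test": bash_agent}

def primary_agent(request: str) -> str:
    # Decompose the request into subtasks (hardcoded here), delegate,
    # then synthesize the results into one response.
    plan = [("search", request), ("test", request)]
    results = [SUBAGENTS[kind](task) for kind, task in plan]
    return " | ".join(results)
```

The key design property: subagents return results, not control. Only the primary agent talks to the user, which keeps the conversation coherent.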

graph TB
    User[User Request] --> Primary[Primary Agent]

    Primary --> Task1[Subtask: Search Code]
    Primary --> Task2[Subtask: Run Tests]
    Primary --> Task3[Subtask: Write Fix]

    Task1 --> Explore[Explore Agent]
    Task2 --> Bash[Bash Agent]
    Task3 --> Code[Code Agent]

    Explore --> Result1[Search Results]
    Bash --> Result2[Test Output]
    Code --> Result3[Fixed Code]

    Result1 --> Primary
    Result2 --> Primary
    Result3 --> Primary

    Primary --> Response[Synthesized Response]
    Response --> User

OpenAI’s approach is different. The Assistants API doesn’t have built-in agent coordination. You build it yourself by chaining assistant calls or using frameworks like LangChain or AutoGen.

Google’s Vertex AI supports workflows where you define a graph of function calls. The model traverses the graph, calling functions and passing results to subsequent steps.

The tradeoff is control versus simplicity. Hierarchical agents are easier to debug but require orchestration logic. Graph-based workflows are declarative but harder to modify dynamically.

Context engineering

The prompt is now your interface. What you put in the context window determines what the agent can do.

Claude Code’s prompts include tool definitions, codebase context, conversation history, and task instructions. When you open a file, it stays in context. When you run a command, the output gets added. The agent sees everything it might need.

The engineering challenge is token efficiency. A 200k token context window sounds large until you load a codebase. You need strategies:

Lazy loading: Only load files when the agent needs them. Don’t preload everything.

Summarization: Compress old conversation turns. Keep full text only for recent messages.

Retrieval: Store large datasets externally. Fetch relevant chunks on demand using vector search.

Chunking: Break large documents into pieces. Load only relevant sections.
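The chunking strategy is the simplest to show. A sketch using overlapping word windows, so a passage split across a boundary still appears whole in one chunk (sizes are illustrative; production systems often chunk by tokens, sentences, or document structure instead):

```python
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `size`, each overlapping the
    previous by `overlap` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```

Each chunk then gets embedded and indexed; at query time only the top-scoring chunks are loaded into context.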

graph LR
    Agent[LLM Agent] --> Context[Context Window 200k tokens]

    Context --> System[System Prompt: 5k]
    Context --> Recent[Recent Messages: 10k]
    Context --> Summary[Conversation Summary: 3k]
    Context --> Tools[Tool Definitions: 8k]
    Context --> Retrieved[Retrieved Context: 20k]
    Context --> Code[Loaded Code: 30k]
    Context --> Available[Available: 124k]

    External[(Vector Store)] --> Retrieved
    Files[(Codebase)] --> Code

Google’s approach with Gemini includes a 1M token context window. This reduces the need for retrieval in some cases. You can load entire codebases or long documents. But it creates new problems. How do you help the model find relevant information in a million tokens? Retrieval strategies still matter.

What you actually build

A production chatbot in 2026 combines these pieces:

  1. API Gateway: Entry point, authentication, rate limiting
  2. Agent Router: Routes requests to the right primary agent based on intent or domain
  3. Primary Agent: Orchestrates the conversation, calls tools, manages memory
  4. Tool Registry: Catalog of available tools, MCP servers, function declarations
  5. Memory System: Short-term, long-term, and working memory layers
  6. Specialized Agents: Domain-specific agents for data retrieval, actions, analysis
  7. Context Manager: Handles RAG, summarization, chunking
  8. Audit Log: Records all tool calls, actions, and agent decisions

The complexity is not in understanding user intent. It is in coordinating all these pieces efficiently.

What to try

If you are building a chatbot today, start by mapping your data sources and actions. What information does the agent need access to? What actions can it perform? Build MCP servers or function declarations for each.

Test tool calling in isolation. Write a simple agent that calls one tool, gets a result, and responds. Verify the model picks the right tool and handles failures gracefully.
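Handling failures gracefully usually means retrying with backoff and, when retries run out, returning a structured error the model can reason about rather than raising into the agent loop. A minimal wrapper (the flaky tool and limits are illustrative):

```python
import time

def call_tool_with_retry(tool, args: dict, retries: int = 3,
                         delay: float = 0.5):
    """Retry a flaky tool call with exponential backoff; on final
    failure, return the error as data instead of raising."""
    for attempt in range(retries):
        try:
            return {"ok": True, "result": tool(**args)}
        except Exception as e:
            if attempt == retries - 1:
                return {"ok": False, "error": str(e)}
            time.sleep(delay * (2 ** attempt))

# Illustrative flaky tool: fails twice, then succeeds.
calls = {"n": 0}
def flaky_search(query: str):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("search backend timed out")
    return ["result for " + query]
```

Returning `{"ok": False, "error": ...}` matters: the model can apologize, try a different tool, or ask the user, none of which it can do if your loop crashes.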

Add memory incrementally. Start with conversation history in context. Add vector retrieval when you need it. Measure how much context the agent actually uses.

Then experiment with agent delegation. Identify a task that requires specialized knowledge or tools. Create a subagent that handles just that task. Have your primary agent delegate to it.

Finally, monitor token usage and latency. Every piece you add increases both. The engineering challenge is keeping the system responsive while giving the agent enough context to be useful.