Building Agents That Actually Help: What a Grocery Assistant Teaches Us About Agentic AI

The first wave of AI chatbots taught everyone the same lesson: a model that can talk is interesting, but a model that can do useful work is the product.

That distinction matters. A chatbot can answer, summarize, and improvise. An agent can look at the state of the world, decide what information it needs, call tools, inspect the results, and take the next step. It can search a catalog, compare a recipe against a pantry, check a basket, add the missing ingredients, and return a response the interface can render without guessing what the model meant.

We are moving from a world where conversation is the experience to one where conversation is the control plane.

A good example is the video below: an AI agent that helps customers with their groceries.

This grocery assistant is a small example of that shift, but the pattern behind it is much bigger than grocery. The same architecture can drive a support assistant, a field-service assistant, an internal operations copilot, a travel planner, a financial workflow assistant, or a B2B sales agent. The domain changes. The tools change. The policies change. The loop stays familiar.

The agent receives a user goal. It receives context. It decides whether it has enough information. It calls a tool when it needs ground truth. It observes the result. It may call another tool. Eventually it returns a concise answer plus structured data and, when appropriate, executed actions.

That loop is where most of the value is.

The Agent Is Not The Chat Window

It is tempting to look at an assistant and think the UI is the product. In this grocery assistant example, the frontend matters because it gives the user a place to browse recipes, inspect products, see the cart, and talk naturally. But the frontend is mostly the surface area. The intelligence lives behind it.

The system is built around a commerce-focused grocery assistant. The shopper can ask things like:

  • "What can I make with chicken and broccoli?"
  •  "Show me recipes under $25."
  • "Do I have everything I need for this recipe?"
  • "Add the missing ingredients to my cart."
  • "What can I make with what is already in my basket?"

Those sound like ordinary chat messages, but they are not ordinary retrieval questions. They require state, business rules, and action.

If a user asks what they can cook from the basket, the assistant cannot answer from model memory. It needs the cart. It needs to map cart products back to ingredient concepts. It needs to rank recipes by missing ingredients and estimated cost. If the user asks to add missing ingredients, the assistant should not write a cheerful sentence claiming it did so. It must call the cart mutation path, get the updated canonical cart, and return the actual result.

That is the difference between a demo chatbot and an agentic system you can trust.

What Makes It Agentic

There are a lot of definitions of "agent" in the industry. Some teams mean a fully autonomous system that works for minutes or hours. Some mean a deterministic workflow with one LLM step inside it. Anthropic makes a useful distinction between workflows, where code defines the path, and agents, where the model dynamically chooses tools and steps based on the task. The most practical systems often sit between those two poles.

This grocery assistant uses an agentic loop without pretending the model should own everything. The model is responsible for language understanding, tool selection, and final phrasing. The backend owns the business logic. That split is intentional.

The model can decide, "I need to search recipes," or "I need the cart," or "I should resolve this ingredient name before filtering recipes." But it does not get to invent products, prices, recipe ingredients, product identifiers, or cart state. It can ask for those things through tools. The tools return structured data. The model then uses that data to respond.

In research terms, this resembles the pattern popularized by ReAct: interleave reasoning and acting so the model can plan, call tools, observe results, and adjust. In engineering terms, it is a bounded state machine around an LLM.

That framing is important because it keeps the architecture honest. The system is not "an LLM connected to a database." It is a controlled runtime where the LLM has access to carefully designed capabilities.

The Agentic Loop

At a high level, each chat turn follows a loop like this:

1. The frontend sends the latest user message, recent conversation turns, route context, pantry hints, and cart context.

2. The application runtime builds a compact context block for the model.

3. The orchestrator sends the message, system prompt, and available tool schemas to a tool-calling model interface.

4. The model either returns a final answer or requests one or more tools.

5. The tool executor validates the requested tool, calls deterministic backend services or adapters, and stores visible artifacts.

6. Tool results are sent back to the model as observations.

7. The loop continues until the model returns a final answer, a safety limit is reached, or a repeated tool cycle is detected.

8. The runtime returns structured JSON: assistant text, recipe cards, product cards, cart state, applied actions, follow-up prompts, and optional tool trace.
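
In code, that loop compresses to something small. Here is a minimal sketch in Python; build_context, call_model, execute_tool, tool_observation, fallback_answer, and TOOL_SCHEMAS are hypothetical stand-ins for the real runtime's services, not names from the actual codebase:

```python
# Minimal agentic-loop sketch. Every helper here (build_context, call_model,
# execute_tool, tool_observation, fallback_answer, TOOL_SCHEMAS) is a
# hypothetical stand-in for the real runtime's services.

MAX_ROUNDS = 8  # safety limit on model/tool rounds

def handle_turn(user_message, session):
    # Step 2: build a compact context block for the model.
    messages = [{"role": "user", "content": build_context(user_message, session)}]
    seen = set()  # (tool, args) pairs, to detect repeated identical cycles

    for _ in range(MAX_ROUNDS):
        # Step 3: send messages plus tool schemas to a tool-calling model.
        reply = call_model(messages, tools=TOOL_SCHEMAS)

        # Step 4: a final answer ends the loop.
        if not reply.tool_calls:
            return {"text": reply.text, "artifacts": session.artifacts}

        for call in reply.tool_calls:
            key = (call.name, repr(call.arguments))
            if key in seen:                      # same tool, same args, again
                return fallback_answer(session)  # stop the repeated cycle
            seen.add(key)
            # Steps 5-6: deterministic execution, observation back to the model.
            result = execute_tool(call, session)
            messages.append(tool_observation(call, result))

    return fallback_answer(session)  # safety limit reached
```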

The cloud provider, hosting model, and LLM vendor are implementation details. The important part is the shape of the loop: context in, tool decision, deterministic execution, observation back to the model, and structured response out.

In a grocery assistant, model-facing tools should be written in domain language. Examples include:

  • search_products
  • get_product
  • match_ingredient_names
  • search_recipes
  • get_recipe
  • analyze_recipe_availability
  • recommend_related_items
  • get_cart
  • update_cart
  • add_recipe_ingredients_to_cart
  • find_recipes_from_cart

These names matter. The model should reason in grocery concepts, not in vendor API names. It should not know whether product truth came from static files, commercetools, or a future commerce platform. The adapter can change. The tool contract should not.
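
To make that concrete, here is what a model-facing schema for search_recipes could look like. The field names and constraints are assumptions for illustration, not the actual contract:

```python
# Hypothetical model-facing schema for search_recipes. Field names and
# limits are illustrative, not the real tool contract.
SEARCH_RECIPES_TOOL = {
    "name": "search_recipes",
    "description": (
        "Search recipes by free-text query and optional filters. Returns at "
        "most five compact recipe summaries. Read-only: never touches the cart."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Shopper language, e.g. 'quick chicken dinner'",
            },
            "max_cost": {"type": "number", "description": "Budget ceiling in dollars"},
            "max_prep_minutes": {"type": "integer"},
            "pantry_ingredients": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Canonical ingredient IDs the shopper already has",
            },
        },
        "required": ["query"],
    },
}
```

Notice that the description speaks in grocery concepts and states policy (read-only, capped results) where the model can see it.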

That one decision is what makes the system generalizable. In a support assistant, the tools might be search_orders, create_return_label, lookup_policy, and escalate_ticket. In a healthcare intake assistant, they might be check_eligibility, find_provider, and summarize_visit_history. In a field-service agent, they might be diagnose_asset, schedule_visit, and reserve_part.

The agent loop is the same. Only the domain tools and policies change.

Why Tool Design Matters More Than Prompt Cleverness

The most common mistake in agent projects is trying to make the prompt carry the entire system. Prompts are important, but prompts are not a substitute for product architecture.

Good tools do three things.

First, they expose meaningful domain actions. A grocery assistant needs recipe search, ingredient resolution, cart inspection, and cart mutation. It does not need raw database queries or raw commercetools endpoints in the model context.

Second, they return compact structured data. The model should not receive a full catalog or a giant recipe dump. It should receive the best few results, already shaped for the task. In this grocery assistant example, visible recipe and product artifacts are capped so the model cannot write prose about items the frontend will not render.

Third, they encode policy in the right layer. The tool description can tell the model that match_ingredient_names is resolution-only. The backend enforces that by not merging those results into visible product cards. The prompt can say "do not mutate the cart unless the latest user message clearly asks for it." The cart service still owns the actual mutation semantics and version checks.

This is why agent tools should be treated as product APIs, not helper functions. They are the interface between a probabilistic planner and deterministic software.
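
The compactness requirement in particular can be enforced mechanically at the tool boundary. A sketch, assuming internal recipe records with fields like id, title, and missing_ingredients:

```python
# Cap and project results before the model sees them, so it cannot write
# prose about items the frontend will not render. Field names are assumed.
MAX_VISIBLE_RECIPES = 5

def shape_recipe_results(recipes):
    visible = recipes[:MAX_VISIBLE_RECIPES]  # hard cap on visible artifacts
    return [
        {
            "recipe_id": r["id"],
            "title": r["title"],
            "estimated_cost": r["estimated_cost"],
            "missing_ingredient_count": len(r["missing_ingredients"]),
        }
        # project away long fields: steps, images, nutrition, full ingredient text
        for r in visible
    ]
```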

Modern tool-calling APIs across providers follow the same basic shape: define tools with schemas, let the model request a call, execute the tool in application code, and send the result back to the model. AWS Bedrock documents this pattern for Converse; OpenAI describes the same idea as function or tool calling. The provider syntax changes, but the architectural concern is the same: the model proposes, the application executes.
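
For illustration, one propose/execute round with Bedrock's Converse API looks roughly like this in Python with boto3. The model ID is a placeholder and execute_tool stands in for the deterministic backend:

```python
import boto3

client = boto3.client("bedrock-runtime")

messages = [{"role": "user", "content": [{"text": "What can I make with chicken?"}]}]
tool_config = {"tools": [{"toolSpec": {
    "name": "search_recipes",
    "description": "Search recipes by free-text query.",
    "inputSchema": {"json": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    }},
}}]}

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    messages=messages,
    toolConfig=tool_config,
)

# The model proposes; the application executes.
if response["stopReason"] == "tool_use":
    messages.append(response["output"]["message"])  # keep the model's tool request
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            tool_use = block["toolUse"]
            result = execute_tool(tool_use["name"], tool_use["input"])  # app code
            messages.append({"role": "user", "content": [{"toolResult": {
                "toolUseId": tool_use["toolUseId"],
                "content": [{"json": result}],
            }}]})
    # ...then call client.converse again with the updated messages.
```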

The Model Is Not The Source Of Truth

The strongest design choice in this grocery assistant is also the least flashy one: the LLM is not the source of truth.

For commerce, that is non-negotiable.

Prices cannot come from a model. Product IDs cannot come from a model. Cart totals cannot come from a model. Availability cannot come from a model. Recipe-to-product mapping cannot be improvised in prose. If the assistant is going to help someone shop, it needs grounded data and deterministic execution.

The architecture splits responsibility clearly:

  • In commerce-backed mode, commercetools owns live products, prices, images, and carts.
  • The local recipe layer owns recipes, ingredients, aliases, and recipe-to-product mappings.
  • Backend services own recipe ranking, missing ingredient analysis, cart rules, version handling, and response shaping.
  • The LLM layer owns interpretation, tool choice, and final phrasing.
  • The frontend renders structured outputs instead of scraping assistant text.

That separation is why the assistant can say useful things without becoming dangerous. It can be conversational without being authoritative over the wrong things.

This also explains why it does not need vector-store RAG for its core flows. RAG is useful when the source material is unstructured or too large to fit into deterministic APIs: help-center articles, policy documents, long manuals, internal notes. But recipes, ingredient mappings, products, and carts are structured operational data. They are better exposed through tools.

RAG answers "what does the documentation say?" Tools answer "what is true right now, and what action should we take?"

For this system, tool grounding is the better fit.

A Grocery Assistant Is A Surprisingly Good Test Case

Grocery looks simple until you try to automate it.

A shopper rarely thinks in product keys. They say "onion," "something quick," "a dinner with chicken," "I already have milk," or "add what I am missing." The assistant has to translate that language into canonical ingredients, products, recipes, and cart actions.

It also has to navigate ambiguity. "What can I make with broccoli?" could mean recipes containing broccoli. "What can I make with what is in my cart?" means inspect the cart first, derive ingredient IDs, then rank recipes by missing items. "Add missing ingredients" only makes sense if there is a selected or recently discussed recipe. "Show my cart" is read-only and should not trigger a mutation.

This is exactly where agents become useful. A fixed search box asks the user to know the data model. An agent lets the user express intent, then uses tools to bridge the gap.

In this example, that bridge is concrete:

  • Ingredient aliases help turn shopper language into canonical grocery concepts (a sketch follows this list).
  • Recipe search can use preferences such as budget, cuisine, prep time, and pantry ingredients.
  • Recipe availability compares required ingredients against pantry and cart state.
  • Recipe-to-product mapping connects recipe ingredients to shoppable products.
  • Cart mutations return applied actions and a canonical cart.
  • Follow-up prompts are generated from structured artifacts, not invented by the UI.
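
The alias step is the easiest of those to picture. A toy version of that resolution, with an invented alias table and canonical IDs:

```python
# Toy ingredient-alias resolution. The alias table and canonical IDs are
# invented for illustration; the real mapping lives in the recipe layer.
INGREDIENT_ALIASES = {
    "scallion": "ing-green-onion",
    "green onion": "ing-green-onion",
    "spring onion": "ing-green-onion",
    "cilantro": "ing-coriander-leaf",
}

def match_ingredient_names(names):
    """Resolve shopper language to canonical ingredient IDs. Resolution-only:
    results are never merged into visible product cards."""
    resolved, unmatched = {}, []
    for name in names:
        key = name.strip().lower()
        if key in INGREDIENT_ALIASES:
            resolved[name] = INGREDIENT_ALIASES[key]
        else:
            unmatched.append(name)
    return {"resolved": resolved, "unmatched": unmatched}
```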

The experience feels conversational, but the outcome is operational.

That is the core value proposition for agentic commerce. The assistant reduces the distance between intent and execution. It does not just help a shopper find a product. It helps them complete a job: plan dinner, understand what is missing, and build the basket.

Retail is already moving in this direction. Walmart has been public about Sparky as a shopping companion that helps customers plan, compare, and purchase. Uber Eats has rolled out grocery cart assistance that turns text and image prompts into checkout-ready baskets. The market signal is clear: shopping assistants are becoming less like search bars and more like task engines.

The Hidden Work: State, Safety, And Recovery

Agent demos usually show the happy path. Real systems are mostly about the unhappy paths.

What happens if the cart version is stale? What happens if the runtime restarts and in-memory state is gone? What happens if the model calls the same tool in the same way twice? What happens if the model provider is unavailable? What happens if the user asks a read-only cart question and the model is tempted to call a mutation tool?

These concerns should be handled by the system around the model, not by hoping the model behaves perfectly.

Cart state is canonical and versioned. Mutations use the current cart version. If a version conflict occurs, the system refreshes the cart and retries once. In static mode, the frontend carries a cart snapshot so the backend can recover continuity after runtime resets. The Bedrock tool loop is capped at a bounded number of rounds, and repeated identical tool cycles are detected and stopped.
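
That conflict-and-retry behavior is small but load-bearing. A sketch, assuming a hypothetical cart adapter that raises VersionConflict on stale versions:

```python
class VersionConflict(Exception):
    """Raised by the (hypothetical) cart adapter when the version is stale."""

def apply_cart_actions(cart_service, cart_id, actions):
    # Mutations always use the current version of the canonical cart.
    cart = cart_service.get_cart(cart_id)
    try:
        return cart_service.mutate(cart_id, cart["version"], actions)
    except VersionConflict:
        # Refresh to the latest version and retry exactly once.
        cart = cart_service.get_cart(cart_id)
        return cart_service.mutate(cart_id, cart["version"], actions)
```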

These details are not flashy, but they are the difference between a neat prototype and a system you can demo with confidence.

The same principles apply in more sensitive domains. If an action is irreversible, expensive, regulated, or reputationally risky, the loop should pause for human approval. Human-in-the-loop patterns are now a standard part of serious agent systems: the agent proposes an action, the user or operator reviews it, and execution resumes only after approval, rejection, or edit.

That approval boundary cannot be left as a polite instruction in the prompt. The system has to enforce it. A payment tool, account-change tool, legal-notice tool, or production-deployment tool should simply refuse to execute unless there is an explicit authorization token, approval record, or workflow state proving that a human approved the action. The model can ask for approval. It cannot decide that approval happened.
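
Enforced in code rather than in the prompt, that boundary can be as blunt as refusing to run. A sketch with a hypothetical approval store and tool registry:

```python
class ApprovalRequired(Exception):
    pass

# Hypothetical set of gated tools; the registry maps tool names to functions.
HIGH_RISK_TOOLS = {"issue_refund", "change_account", "deploy_to_production"}

def execute_tool(call, session, approvals, tool_registry):
    if call.name in HIGH_RISK_TOOLS:
        record = approvals.lookup(session.id, call.name, call.arguments)
        if record is None or not record.approved:
            # The model can ask for approval; it cannot assert it happened.
            raise ApprovalRequired(f"{call.name} requires explicit human approval")
    return tool_registry[call.name](**call.arguments)
```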

For grocery, adding an item to a basket is low-risk and reversible. For payments, account changes, medical decisions, legal notices, or production infrastructure, the approval boundary becomes part of the product.

Observability Is Part Of The Product

Agents are harder to debug than normal request-response software because the model decides the path at runtime. That makes observability non-optional.

This grocery assistant example emits structured events around chat start and end, model request and response rounds, tool start and end, tool errors, token usage, latency, and estimated model cost when model pricing is configured. The chat response can optionally include a tool trace in debug mode.
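
A minimal version of that tool-level event stream, assuming a JSON-lines sink; the real system also tracks tokens and estimated model cost:

```python
import json
import time
import uuid

# Minimal structured-event emitter around a tool call. Event fields are
# assumptions for illustration.
def traced_tool_call(tool_name, arguments, fn, sink):
    trace_id = str(uuid.uuid4())
    sink.write(json.dumps({"event": "tool_start", "trace_id": trace_id,
                           "tool": tool_name, "arguments": arguments}) + "\n")
    start = time.time()
    status = "error"
    try:
        result = fn(**arguments)
        status = "ok"
        return result
    finally:
        sink.write(json.dumps({"event": "tool_end", "trace_id": trace_id,
                               "tool": tool_name, "status": status,
                               "latency_ms": round((time.time() - start) * 1000)}) + "\n")
```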

That matters for engineering, but it also matters for product strategy.

You cannot improve an agent you cannot inspect. You need to know which tools are called, which calls fail, how many rounds a task takes, where latency comes from, how much a conversation costs, and which prompts lead to unsafe or unhelpful behavior. Over time, those traces become the raw material for evaluation.

The best agent teams build test sets from real failures. They replay conversations. They check tool choice. They check whether the answer stayed grounded. They measure tool count, latency, cost, and task completion. They treat prompts and tool descriptions as versioned product surfaces, not disposable text.
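
Those replayable evaluations can start embarrassingly simple: recorded prompts with the tool behavior you expect. A sketch, where run_agent_turn is a hypothetical entry point into the real loop:

```python
# Tiny replay-style eval: check tool choice and read-only guarantees.
# run_agent_turn and its tool_trace field are hypothetical.
EVAL_CASES = [
    {"prompt": "What can I make with what is in my cart?",
     "expected_first_tool": "get_cart"},
    {"prompt": "Show my cart",
     "expected_first_tool": "get_cart",
     "forbidden_tools": {"update_cart", "add_recipe_ingredients_to_cart"}},
]

def run_evals():
    failures = []
    for case in EVAL_CASES:
        trace = run_agent_turn(case["prompt"]).tool_trace  # ordered tool names
        if not trace or trace[0] != case["expected_first_tool"]:
            failures.append((case["prompt"], "wrong first tool", trace))
        if set(trace) & case.get("forbidden_tools", set()):
            failures.append((case["prompt"], "called a forbidden tool", trace))
    return failures
```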

This is where the field is maturing. The next generation of agent quality will not come only from bigger models. It will come from better contracts, better evals, better traces, and better domain tools.

Put simply: better implementation.

Why The Architecture Is General

The grocery assistant is specific, but the architecture is portable.

Most useful assistants need the same ingredients:

  • A clear domain boundary.
  • A compact context object.
  • A model-facing tool registry.
  • Deterministic services behind the tools.
  • Structured outputs for the frontend.
  • A state model for sessions, carts, cases, orders, or workflows.
  • Guardrails for read-only versus mutating actions.
  • Observability and replayable evaluations.
  • A fallback strategy when upstream AI or integration services are unavailable.

That is why this example is not just about groceries. It is a reference pattern for building assistants that sit on top of real business systems.

The assistant can be customized at several layers. Change the prompt and tool descriptions to change how it behaves. Change the tool registry to change what it can do. Change adapters to swap static data for a live commerce platform. Change deterministic services to encode a client's business rules. Change the frontend surfaces to show the right structured artifacts for that domain.

That is why capable agents are not generic magic. They become capable because they are customized against a real business process and constrained by real software boundaries.

The model stays replaceable because the application contract is explicit. The current implementation can use a specific model runtime, but the architecture is not tied to one vendor. Tool calling, structured outputs, memory, guardrails, and stateful orchestration are common building blocks across modern agent stacks. Frameworks like LangGraph emphasize durable execution and human-in-the-loop workflows. Protocols like the Model Context Protocol (MCP) are pushing the ecosystem toward standardized tool and context connections. The useful pattern is bigger than any one SDK.

That is the point. A good agent architecture should make the model more useful without making the whole product fragile.

The Product Lesson

The real promise of AI agents is not that users can chat with software. It is that users can stop translating their goals into software's preferred shape.

In grocery, the user's goal is not "filter product catalog by category equals dairy." It is "I want tacos tonight and I only have 20 minutes." It is "use what is already in my cart." It is "make this recipe shoppable." It is "do I need anything else?"

In support, the goal is not "open form 17B." It is "my order arrived damaged." In operations, it is not "query these three systems and join the results." It is "why is this shipment late and what can we do next?" In sales, it is not "search CRM notes." It is "prepare me for this account meeting."

Agents create value when they absorb that translation burden.

But they only create durable value when they are grounded. A capable model without tools is eloquent. A capable model with the wrong tools is risky. A capable model with good tools, state, policies, deterministic execution, and observability becomes a new kind of interface to the business.

That is what this grocery assistant is meant to show.

Not a chatbot bolted onto a store.

A domain-specific agent that understands a shopper's goal, uses the right systems, respects the source of truth, and turns conversation into action.

References And Further Reading

Blog Posts