How I Built a Real AI Classifier (And Why It Wasn't an Agent)

A case study in finding the right solution shape — classifier over agent, specificity over generalization, observability from day one.

When I started thinking about the wrong-number email problem at Meal Mentor, I assumed the solution was an AI agent. That assumption was wrong. What I actually built was a classifier, a workflow, and a feedback loop — and discovering that gap between what I thought I needed and what actually solved the problem is the more interesting story.

The Problem

Meal Mentor runs on getmealplans.com. There is a separate, unrelated company in Europe also called Meal Mentor. Their customers routinely find our support email and write to us — angry, confused, sometimes describing billing disputes in euros, referencing recipes we've never heard of, asking about products we don't sell.

About a dozen of these emails arrive every day. They're disruptive to support, impossible to ignore, and easy to mishandle if you're not paying close attention. A support person who doesn't catch the context sends a confusing reply. A support person who does catch it has to manually look up the contact information for the other company and compose a redirect.

This is exactly the kind of repetitive, rule-bound work that shouldn't require a human every time.

What I Originally Planned vs. What I Built

I originally thought this problem needed an agent — something that could read, reason, and respond autonomously. What I discovered through building is that the problem is simpler and more tractable than that. The signals are strong and the routing logic is deterministic once you know the class. That's a classifier problem, not an agent problem.

The distinction matters. An agent implies ongoing reasoning and decision-making. A classifier answers a narrower question: given an input, which bucket does it belong in, and how confident are we? Once I reframed it that way, the solution became much cleaner.

The Classification Design

There are two types of signals available for every incoming email.

The first is structured data. We can do a real-time lookup against our customer database to check whether the sender's email address is on file. We also know our product pricing exactly: every Meal Mentor customer in the US pays either $14.99 or $149. An email that mentions a different currency, or any amount other than those two, is a strong signal that the sender is not our customer.
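As a rough illustration of the pricing signal, here is a minimal sketch. The two prices come from the text above; the regular expressions, the function name price_signals, and the shape of the returned dictionary are assumptions made for this example, not the production code.

```python
import re

# The two US prices stated in the article; "149.00" is an assumed spelling variant.
KNOWN_PRICES = {"14.99", "149", "149.00"}

# Hypothetical patterns: currency markers that a US customer would not use,
# and dollar amounts such as "$14.99" or "$ 149".
FOREIGN_CURRENCY = re.compile(r"€|£|\b(?:EUR|GBP|euros?)\b", re.IGNORECASE)
DOLLAR_AMOUNT = re.compile(r"\$\s?(\d+(?:\.\d{2})?)")


def price_signals(body: str) -> dict:
    """Return coarse pricing signals found in an email body."""
    amounts = DOLLAR_AMOUNT.findall(body)
    return {
        # Any non-dollar currency marker suggests the other company.
        "foreign_currency": bool(FOREIGN_CURRENCY.search(body)),
        # A dollar amount that matches neither known price.
        "unknown_price": any(a not in KNOWN_PRICES for a in amounts),
        # A dollar amount that matches one of our prices.
        "known_price": any(a in KNOWN_PRICES for a in amounts),
    }
```

Signals like these are probably best treated as inputs to the classifier rather than as a verdict on their own, since a real customer can always mistype an amount.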

The second is semantic content. Customers of the European Meal Mentor often reference things that simply don't exist in our product — specific recipes, meal plans, account details, and product features we've never offered. Our actual customers reference our actual content. A classifier trained on these distinctions can reliably tell them apart.

The database lookup alone isn't sufficient. Many people have multiple email addresses and might contact us from an address we don't have on file. So I built the system to use the database check as a fast path where it works and fall back to semantic classification where it doesn't.
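The fast-path-then-fallback shape described above can be sketched as follows. The two callables are hypothetical stand-ins: is_known_customer for the database lookup and semantic_classifier for the model call. Their names and signatures are assumptions for illustration only.

```python
from typing import Callable, Tuple


def classify_sender(
    sender: str,
    body: str,
    is_known_customer: Callable[[str], bool],
    semantic_classifier: Callable[[str], Tuple[str, float]],
) -> Tuple[str, float]:
    """Return (label, confidence) for one email."""
    # Fast path: an address on file settles the question without a model call.
    if is_known_customer(sender.strip().lower()):
        return ("customer", 1.0)
    # An unknown address proves nothing (customers write from other
    # addresses), so fall back to classifying the content itself.
    return semantic_classifier(body)
```

The important property is that a database miss never produces a not-customer label by itself; it only hands the decision to the semantic step.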

The Tech Stack

The classifier is built with DSPy. I chose DSPy specifically because I didn't want to build something fragile — a raw prompt tied to a specific model that breaks the moment I want to change the model or the model provider updates their behavior. DSPy provides a structured, robust framework for building AI pipelines that are model-portable. The current model is Gemini Flash, chosen for speed and cost at this scale.

The system is deployed on GCP Agent Engine with infrastructure defined in Terraform and deployed through a CI/CD pipeline. For a small internal tool, this might seem like overkill. It isn't. Treating this as real infrastructure from the start means I can iterate on the classifier without manually managing deployments, and I have a reliable audit trail of what changed and when.

Three Classes, Not Two

The classifier produces one of three outputs for each email: customer, not-customer, or needs-review.

Confident not-customer emails trigger an automated reply. The reply explains what we believe has happened, provides links and contact information for the European Meal Mentor, and does this quickly enough that the person can get to the right place without waiting for a human.

Confident customer emails go to the normal support workflow.

Uncertain emails — cases where the classifier's confidence falls below the threshold — get tagged in Gmail and surface in a queue for our support person to review. This is not a failure mode. This is the system working correctly. The right response to uncertainty is a human review, not a guess.
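The three-way routing can be expressed in a few lines. This is a sketch under stated assumptions: the threshold value 0.85 is hypothetical (the actual threshold is not given here), and the Prediction type and route function are names invented for the example.

```python
from dataclasses import dataclass

# Hypothetical threshold; the real value is a tuning decision.
REVIEW_THRESHOLD = 0.85


@dataclass
class Prediction:
    label: str         # "customer" or "not-customer"
    confidence: float  # 0.0 to 1.0


def route(pred: Prediction, threshold: float = REVIEW_THRESHOLD) -> str:
    """Map a two-class prediction onto the three routing outcomes."""
    # Below the threshold, neither label is trusted: a human decides.
    if pred.confidence < threshold:
        return "needs-review"
    # At or above the threshold, act on the predicted label.
    return pred.label
```

For example, route(Prediction("customer", 0.6)) returns "needs-review", while route(Prediction("not-customer", 0.95)) returns "not-customer" and would trigger the automated redirect.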

Observability From the Start

Every scheduled run sends a summary email to our support team and to me. The summary shows how many emails were processed, how they were classified, and which ones went to the review queue.
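A summary like this needs very little code. The sketch below assumes each run yields (subject, label) pairs; the function name build_summary and the text layout are illustrative choices, not the format actually sent.

```python
from collections import Counter
from typing import List, Tuple

LABELS = ("customer", "not-customer", "needs-review")


def build_summary(results: List[Tuple[str, str]]) -> str:
    """Build the plain-text body of a per-run summary email."""
    counts = Counter(label for _, label in results)
    lines = [f"Processed: {len(results)}"]
    # Always list every class, including those with zero emails.
    for label in LABELS:
        lines.append(f"{label}: {counts[label]}")
    # Name the emails waiting for a human so they are easy to find.
    review = [subject for subject, label in results if label == "needs-review"]
    if review:
        lines.append("Review queue:")
        lines.extend(f"  - {subject}" for subject in review)
    return "\n".join(lines)
```

Listing zero counts explicitly is deliberate: a run where not-customer suddenly drops to zero is exactly the kind of shift the summary is meant to expose.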

This was not an afterthought. I built it in from the beginning because a classifier running unmonitored is a liability. The summary email is a lightweight feedback loop that lets us catch systematic errors — a new type of wrong-number email that the classifier hasn't seen before, a change in the European company's product language that shifts the signal distribution, a pricing change on our end.

What's Next

The current gap is training data. The classifier was initialized on a relatively small labeled set and improves with every email that gets reviewed and confirmed by the support team. The next iteration is a small tagging application that surfaces unreviewed emails, makes it easy to label them customer or not-customer, and feeds that data back into the classifier's training loop.

This closes the feedback cycle. The system gets better as it runs, and the improvement is driven by real data from real operational decisions rather than by me manually curating examples.
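One plausible shape for the data the tagging application would produce is a single JSON record per reviewed email, which can be appended to a file and later loaded as labeled examples. The field names and the label_record helper are assumptions for illustration; the tagging application does not exist yet.

```python
import json

# Reviewers resolve every email to one of the two real classes;
# "needs-review" is a routing state, not a training label.
VALID_LABELS = {"customer", "not-customer"}


def label_record(message_id: str, body: str, label: str) -> str:
    """Serialize one human-confirmed label as a JSON line."""
    if label not in VALID_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    record = {"id": message_id, "body": body, "label": label}
    return json.dumps(record, sort_keys=True)
```

Rejecting anything other than the two final labels keeps uncertain cases out of the training set until a person has actually decided them.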

Why I'm Writing About This

This is not a large system. It solves one specific problem for one specific business. But that specificity is the point.

Effective automation is not about building a general-purpose system that handles 70% of cases and puts humans in the gaps. It's about understanding a problem specifically enough to automate it completely — and then building in the right escalation paths for the genuinely hard cases. The cases that need a human aren't failures. They're the system correctly recognizing the boundary of what it can handle.

The original assumption that this needed an agent was wrong. The right answer was smaller, simpler, and more durable. Finding that answer through iteration, rather than overbuilding toward the first assumption, is the skill.