One of the most common requests we get is some version of "we're drowning in email." A support inbox, a sales inbox, or a shared operations address gets 100-300 messages per day, and someone on the team has to read every single one to decide what to do with it.

We recently built an AI agent that handles exactly this problem. It reads incoming emails, classifies them by topic and urgency, routes them to the right person or queue, and drafts responses for the straightforward ones. Here's how it works, what went wrong, and what the team's day looks like now.

What the workflow looked like before

The company receives around 200 emails per day through a shared support inbox. These range from password reset requests to urgent billing disputes to partnership inquiries to spam. One person spent roughly three hours every morning reading, tagging, and forwarding each email to the right team member.

Three hours of human time, every single day, just to sort mail. That's 15 hours a week of pure triage, not even counting the time spent actually responding.

How the AI agent works

The system has four stages, running as a Rails background job that processes new emails every two minutes:

Stage 1: Intake. New emails are pulled via IMAP (or a webhook from the email provider). Each email is stored with its sender, subject, body text, and any attachments. We strip HTML, signatures, and quoted replies to get the actual message content.
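The cleanup step matters more than it sounds. Here's a minimal sketch of what stripping HTML, signatures, and quoted replies can look like; the method name and patterns are ours, and production parsing needs provider-specific handling:

```ruby
# Illustrative cleanup for raw email bodies. The regexes are simplified
# stand-ins for the real parsing rules.
require "cgi"

QUOTED_REPLY = /^On .+ wrote:$.*/m  # "On <date>, <name> wrote:" and everything after
SIGNATURE    = /^--\s*$.*/m         # standard "-- " signature delimiter and everything after

def clean_email_body(raw)
  text = raw.gsub(/<br\s*\/?>/i, "\n")  # preserve line breaks from HTML
            .gsub(/<[^>]+>/, "")        # strip remaining tags
  text = CGI.unescapeHTML(text)
  text = text.sub(QUOTED_REPLY, "")     # drop the quoted reply chain
  text = text.sub(SIGNATURE, "")        # drop the signature block
  text.strip
end
```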

Stage 2: Classification. The cleaned email text is sent to GPT-4 with a carefully written system prompt. The model classifies each email across two dimensions: topic (billing, technical support, partnership, general inquiry, spam) and urgency (high, medium, low). The prompt includes examples of each category drawn from actual emails the team had previously tagged.
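A sketch of the prompt structure and the response parsing around that API call (the category lists are from this post; the helper names and fallback behavior are assumptions, and the real system prompt also embeds the few-shot examples):

```ruby
require "json"

TOPICS    = %w[billing technical_support partnership general_inquiry spam]
URGENCIES = %w[high medium low]

# Hypothetical prompt builder; the production prompt also includes
# example emails the team had previously tagged.
def classification_prompt(email_text)
  <<~PROMPT
    Classify the email below along two dimensions and answer in JSON.
    "topic" must be one of: #{TOPICS.join(", ")}.
    "urgency" must be one of: #{URGENCIES.join(", ")}.

    Email:
    #{email_text}

    Answer with only: {"topic": "...", "urgency": "..."}
  PROMPT
end

# Parse the model's reply, falling back to a safe default so one
# malformed response never crashes the pipeline.
def parse_classification(response_text)
  data = JSON.parse(response_text)
  {
    topic:   TOPICS.include?(data["topic"]) ? data["topic"] : "general_inquiry",
    urgency: URGENCIES.include?(data["urgency"]) ? data["urgency"] : "medium"
  }
rescue JSON::ParserError
  { topic: "general_inquiry", urgency: "medium" }
end
```

The fallback to "general_inquiry / medium" means a bad model response degrades to a human-reviewed default instead of a crashed job.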

Stage 3: Routing. Based on the classification, the email is automatically assigned to the right queue and team member. Billing goes to the finance team. Technical issues go to the support queue. High-urgency items get flagged in Slack immediately. Spam gets archived.
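The routing rules above boil down to a small lookup table. A sketch, with placeholder queue names:

```ruby
# Illustrative routing table; queue names and the slack_on_high flags
# are placeholders, not the production configuration.
ROUTES = {
  "billing"           => { queue: "finance", slack_on_high: true },
  "technical_support" => { queue: "support", slack_on_high: true },
  "partnership"       => { queue: "bizdev",  slack_on_high: false },
  "general_inquiry"   => { queue: "support", slack_on_high: true },
  "spam"              => { queue: "archive", slack_on_high: false }
}.freeze

def route(classification)
  rule = ROUTES.fetch(classification[:topic], ROUTES["general_inquiry"])
  actions = [[:assign, rule[:queue]]]
  actions << [:slack_alert] if classification[:urgency] == "high" && rule[:slack_on_high]
  actions
end
```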

Stage 4: Draft response. For common request types (password resets, invoice copies, status updates), the agent drafts a response using templates seeded with the customer's actual data from the CRM. The team member reviews and sends with one click, or edits as needed.
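Template seeding can be as simple as named substitution. A sketch, where the template text and CRM field names are invented for illustration:

```ruby
# Hypothetical response template; field names mimic CRM data pulled
# at draft time, not the real HubSpot schema.
TEMPLATES = {
  "invoice_copy" => "Hi %{first_name},\n\nAttached is a copy of invoice %{invoice_number}. " \
                    "Let us know if anything looks off.\n\nBest,\n%{agent_name}"
}.freeze

def draft_reply(template_key, crm_fields)
  format(TEMPLATES.fetch(template_key), crm_fields)
end
```

Because the team reviews before sending, a template that's wrong for a given customer costs one edit, not one bad email.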

The edge cases that almost broke it

The first version worked well on clear-cut emails. But real inboxes are messy. Here's what we had to handle:

Multi-topic emails. A customer writes: "Hey, my invoice is wrong AND I can't log in." The agent initially classified this as "billing" and routed it to finance, who couldn't help with the login issue. We fixed this by allowing the agent to assign multiple tags and route to the primary category while CC'ing the secondary team.
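The primary-plus-CC fix looks roughly like this (queue names are placeholders):

```ruby
# Sketch of multi-tag routing: assign to the primary topic's queue,
# CC the teams behind any secondary topics.
QUEUES = {
  "billing" => "finance", "technical_support" => "support",
  "partnership" => "bizdev", "general_inquiry" => "support"
}.freeze

def route_multi(topics)
  primary, *secondary = topics  # the model lists the primary tag first
  { assign: QUEUES[primary], cc: secondary.map { |t| QUEUES[t] }.uniq }
end
```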

Emotional tone. An email that says "I've been waiting THREE WEEKS for a response" is technically a general inquiry, but it needs to be treated as high-urgency. We added a frustration detector to the classification prompt that bumps urgency when the language signals an unhappy customer.
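The "detector" is an instruction in the prompt, not a separate model. This is our reconstruction of the kind of rule that gets appended; the production wording will differ:

```ruby
# Hypothetical wording of the frustration rule added to the
# classification prompt.
FRUSTRATION_RULE = <<~RULE
  If the sender's language signals frustration (all-caps complaints,
  mentions of long waits, repeated follow-ups, threats to cancel),
  raise "urgency" one level above what the topic alone would suggest.
RULE
```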

Reply chains. When customers reply to their own thread with new information, the agent would sometimes re-classify based only on the latest reply, losing context. We now include the last three messages in the thread when classifying.
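Building that context window is straightforward. A sketch, with an illustrative message struct:

```ruby
# Include the last three messages of the thread, oldest first, so the
# classifier sees a reply in context instead of in isolation.
Message = Struct.new(:sender, :body)

def thread_context(messages, window: 3)
  messages.last(window)
          .map { |m| "#{m.sender}: #{m.body}" }
          .join("\n---\n")
end
```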

Internal emails. Team members occasionally forward emails to the shared inbox with notes like "can you handle this?" The agent would classify the forwarded email, not the internal request. We added a rule to detect internal sender domains and route those differently.
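The internal-sender rule is a simple domain check (the domain list below is a placeholder):

```ruby
# Detect internal senders so forwarded "can you handle this?" notes are
# routed as internal requests instead of being classified as the
# forwarded email.
INTERNAL_DOMAINS = %w[example.com].freeze  # placeholder for the company's domains

def internal_sender?(from_address)
  domain = from_address.split("@").last.to_s.downcase
  INTERNAL_DOMAINS.include?(domain)
end
```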

What the results look like

After two weeks of tuning, the agent handles the full triage automatically. The numbers:

  • Classification accuracy: 94%. Out of 200 daily emails, about 12 get mis-categorized. The team corrects these with one click, and those corrections feed back into the prompt examples.
  • First-response time: dropped from 4 hours to under 30 minutes. The draft responses for common requests go out almost immediately after review.
  • Human triage time: dropped from 3 hours/day to 20 minutes. The team now reviews flagged items and edge cases instead of reading every email.
  • Spam filtering: 99%+ accuracy. The agent catches spam that the email provider's built-in filter misses.

What the tech stack looks like

The whole system runs on a standard Rails application. Email processing is handled by Sidekiq background jobs. The AI classification calls go to OpenAI's API. Customer data lookups hit the CRM via API (in this case, HubSpot). Notifications go through Slack's webhook API.

There's no exotic infrastructure. No vector databases. No fine-tuned models. It's a well-structured prompt, good error handling, and careful logging so we can spot when the agent makes mistakes and fix the prompt. If you're considering workflow automation like this, the tech stack matters less than the prompt engineering and feedback loops.

When this approach makes sense

Email triage is one of the workflows we recommend automating first for growing companies. It works best when you have a high-volume inbox with predictable categories. If you get 20 emails a day, a human can handle the triage in 15 minutes and the ROI doesn't justify building an agent. If you get 100+, and the categories are relatively stable, the agent pays for itself within the first month.

The other key factor is whether draft responses are useful. If every email requires a completely custom reply, the agent's value is limited to classification and routing. But if 60-70% of emails fall into known categories with standard responses, the draft feature is where the real time savings happen.

What we'd do differently next time

We'd build the feedback loop from day one. In the first version, we had no easy way for the team to flag mis-classifications. We added a "wrong category" button after the first week, and the correction data has been the single most valuable thing for improving accuracy. Start with the feedback mechanism, then launch the automation.
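The shape of that loop is simple enough to sketch. This in-memory version is illustrative; production would persist corrections via ActiveRecord and inject them when the prompt is built:

```ruby
# Sketch of the feedback loop: store one-click corrections and surface
# the most recent ones as few-shot examples for the classification prompt.
class CorrectionLog
  def initialize
    @corrections = []
  end

  def record(email_text, wrong:, right:)
    @corrections << { text: email_text, wrong: wrong, right: right }
  end

  # Recent corrections make the best prompt examples: they cover exactly
  # the cases the current prompt gets wrong.
  def few_shot_examples(limit = 5)
    @corrections.last(limit).map do |c|
      "Email: #{c[:text]}\nCorrect topic: #{c[:right]}"
    end
  end
end
```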

We'd also invest more upfront in the email parsing step. Stripping HTML, removing signatures, and handling forwarded messages cleanly makes a massive difference in classification accuracy. Garbage in, garbage out — and email formatting is surprisingly garbage-heavy. This is the kind of edge case handling that makes the difference between a Zapier-level automation and a custom integration that actually holds up in production.