Support chatbots claim deflection rates north of 70%. We ran a fixed set of 16 standardized issues — drawn from real ticket categories — through 47 deployed SaaS chatbots and scored whether each was genuinely resolved or merely ended. The gap between the two numbers is the report. Across the sample, median claimed deflection was 71% and median measured genuine-resolution was 28%. The difference is 43 points of customers quietly giving up — and the cost of that giving-up shows up downstream in churn, not in the support P&L where the bot looks profitable.
What we tested
Sixteen issues, stratified across four difficulty tiers — simple FAQ ("how do I reset my password"), multi-step procedural ("how do I migrate a workspace"), ambiguous ("our integration stopped working but I'm not sure why"), and edge case ("we have a specific compliance requirement around data residency"). Issue text was identical across vendors with only product-name substitution. Two blind scorers rated each session as resolved / unresolved / escalated against a written rubric, with inter-rater agreement of 0.86. The full issue set and rubric are in the appendix.
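The report quotes inter-rater agreement as 0.86 without naming the statistic. As a minimal sketch, assuming the figure is a chance-corrected measure such as Cohen's kappa (an assumption on our part, not something stated above), agreement between two scorers over the three rubric labels could be computed like this. The function name and the toy ratings are illustrative only.

```python
from collections import Counter

# The three labels come from the rubric described above.
LABELS = ("resolved", "unresolved", "escalated")


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two scorers (Cohen's kappa)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both scorers must rate the same non-empty set of sessions")
    n = len(ratings_a)
    # Share of sessions where the two scorers gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each scorer's label frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in LABELS) / (n * n)
    if expected == 1.0:
        return 1.0  # both scorers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy ratings for six sessions (made up for illustration, not study data).
scorer_a = ["resolved", "resolved", "unresolved", "escalated", "unresolved", "resolved"]
scorer_b = ["resolved", "unresolved", "unresolved", "escalated", "unresolved", "resolved"]
kappa = cohens_kappa(scorer_a, scorer_b)
```

Raw percent agreement would overstate consistency when one label dominates, which is why a chance-corrected statistic is the usual choice for a rubric like this one.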
This is a deliberately fair test. The issues are real, they're answerable from the vendor's public docs, and the rubric scores whether the customer's intent was met: did they get what they came for, not whether the bot returned a polite-sounding answer.
The headline gap
Across all 47 vendors and all 16 issues, simple FAQ resolved at 64% (against claimed 78%) and edge cases resolved at 8% (against claimed 58%). The chart below shows the full distribution. The pattern is structural, not vendor-specific: every chatbot we tested resolves simple lookups well and collapses on anything that requires multi-turn reasoning, clarifying questions, or escalation judgement.
The category sells deflection as if it scales uniformly across issue types. The data says it doesn't. The bots are competent at exactly the issue tier customers could have answered themselves with five seconds of doc-searching — and incompetent at the issue tier where the customer actually needed help.
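The gap figures above are plain subtraction, and it can help to see them side by side. The sketch below uses only the percentages quoted in this report; the variable and function names are ours.

```python
# Percentages quoted in the report. Only the two tiers with published
# claimed and measured figures are included.
MEDIAN_CLAIMED = 71
MEDIAN_MEASURED = 28

TIERS = {
    "simple FAQ": {"claimed": 78, "measured": 64},
    "edge case": {"claimed": 58, "measured": 8},
}


def gap_points(claimed, measured):
    """Gap in percentage points between claimed deflection and measured resolution."""
    return claimed - measured


headline_gap = gap_points(MEDIAN_CLAIMED, MEDIAN_MEASURED)
tier_gaps = {name: gap_points(t["claimed"], t["measured"]) for name, t in TIERS.items()}

for name, gap in tier_gaps.items():
    print(f"{name}: {gap} points")
print(f"headline (medians): {headline_gap} points")
```

The simple-FAQ gap is 14 points and the edge-case gap is 50, which is the non-uniformity the section describes.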
Where chatbots succeed and where they fail
The failure mode isn't AI quality. It's the medium. Text chat is a serial, low-bandwidth channel: one turn per question, no nonverbal signal, no parallel disambiguation. The kinds of issues chatbots fail on are the kinds that humans need clarifying conversation to resolve, and clarifying conversation in text takes 5–8 turns where speech takes 1–2. By turn 5 the customer is gone.
This is why "better models" haven't moved the deflection ceiling much. The constraint isn't reasoning, it's the medium delivering the reasoning. Voice resolves what chat escalates not because the underlying AI is smarter — often it's the same AI — but because the customer can express the problem in 8 seconds of speech and get a useful response before they would have given up on typing it.
Deflection that ends the conversation isn't the same as resolution that ends the problem.
The CSAT post-mortem
The 43-point gap doesn't show up in the standard CSAT survey — most surveys aren't sent on bot-only sessions, or are sent in a form the unhappy customer skips. It shows up downstream: in the secondary-ticket rate (the same customer comes back with the same issue, 7–14 days later), in churn surveys ("support couldn't help"), and in escalation-quality complaints (the human picks up a re-explained problem from turn 0). Each is a delayed bill for a deflection that the support P&L already booked as a win.
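The secondary-ticket rate is the easiest of these signals to compute from an existing ticket log. A minimal sketch, assuming a hypothetical `Contact` record with the fields shown (your helpdesk export will name them differently) and reading "7–14 days later" as an inclusive window:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Contact:
    # Hypothetical schema for illustration; not taken from any specific helpdesk.
    customer_id: str
    issue_category: str
    opened: date
    bot_deflected: bool  # True if the session ended in the bot with no human handoff


def secondary_ticket_rate(contacts, min_days=7, max_days=14):
    """Share of bot-deflected contacts followed by the same customer raising
    the same issue category between min_days and max_days later."""
    deflected = [c for c in contacts if c.bot_deflected]
    if not deflected:
        return 0.0

    def has_follow_up(first):
        return any(
            other is not first
            and other.customer_id == first.customer_id
            and other.issue_category == first.issue_category
            and min_days <= (other.opened - first.opened).days <= max_days
            for other in contacts
        )

    return sum(has_follow_up(c) for c in deflected) / len(deflected)


# Toy log (made up): only customer "a" returns with the same issue inside the window.
log = [
    Contact("a", "integration", date(2024, 1, 1), True),
    Contact("a", "integration", date(2024, 1, 10), False),
    Contact("b", "billing", date(2024, 1, 1), True),
    Contact("b", "password", date(2024, 1, 9), False),
    Contact("c", "migration", date(2024, 1, 1), True),
    Contact("c", "migration", date(2024, 1, 3), False),
]
rate = secondary_ticket_rate(log)
```

Matching on issue category is a coarse proxy for "the same issue"; a stricter version would compare ticket content, which is beyond this sketch.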
Voice on the same issues
We re-ran the same 16-issue set through three voice-AI configurations on the same vendors' knowledge bases (where we could access them). Voice resolved 71% of simple FAQ, 58% of multi-step, 44% of ambiguous, and 22% of edge cases, roughly 3-4× chat's measured resolution on the harder tiers. The sample is small (n = 3 vendor systems × 16 issues = 48 sessions) and the comparison isn't perfectly controlled, so the multiple should be read as directional. We're publishing the methodology in full so the follow-up cohort can replicate it.
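For readers who want to check the voice-to-chat multiple, the sketch below computes it for the tiers where this report quotes both numbers. Chat's per-tier figures appear here only for simple FAQ and edge cases, so the multi-step and ambiguous ratios cannot be reproduced from the text alone. The names are ours.

```python
# Measured resolution rates (percent) as quoted in the report.
CHAT_MEASURED = {"simple FAQ": 64, "edge case": 8}
VOICE_MEASURED = {"simple FAQ": 71, "multi-step": 58, "ambiguous": 44, "edge case": 22}


def voice_to_chat_ratio(voice, chat):
    """Ratio of voice to chat resolution for tiers present in both tables."""
    return {
        tier: voice[tier] / chat[tier]
        for tier in voice
        if tier in chat and chat[tier] > 0
    }


ratios = voice_to_chat_ratio(VOICE_MEASURED, CHAT_MEASURED)
voice_sessions = 3 * 16  # 3 vendor systems x 16 issues

for tier, ratio in ratios.items():
    print(f"{tier}: {ratio:.2f}x")
```

On the two tiers that can be checked, voice comes out at about 1.1× chat on simple FAQ and 2.75× on edge cases, with 48 voice sessions behind the comparison.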
What this means in practice
If your deflection rate is a board-reported metric and your CSAT-on-deflected-conversations isn't measured at all, the support P&L is showing a profit and your retention number is paying the bill. The fix isn't a better chatbot; it's a different category — voice with citations and a hard escalation rule when confidence is low. The Modern Support chapter goes deep on what that looks like operationally.
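A hard escalation rule can be very small. The sketch below is one way to express "answer only with citations and sufficient confidence, otherwise hand off". The `DraftAnswer` type, the confidence score, and the 0.75 floor are all hypothetical placeholders, not values from this study or from any vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class DraftAnswer:
    # Hypothetical shape of a candidate answer produced by the AI.
    text: str
    confidence: float  # assumed to be a score in [0, 1]
    citations: list = field(default_factory=list)  # doc references backing the answer


CONFIDENCE_FLOOR = 0.75  # illustrative threshold; tune against your own data


def route(answer, floor=CONFIDENCE_FLOOR):
    """Return "answer" only when the draft is both confident and cited;
    otherwise return "escalate" so a human picks it up."""
    if answer.confidence < floor or not answer.citations:
        return "escalate"
    return "answer"


cited = DraftAnswer("Reset it from Settings.", 0.92, ["docs/password-reset"])
uncited = DraftAnswer("Probably a data residency setting.", 0.95)
unsure = DraftAnswer("Try reinstalling the integration.", 0.40, ["docs/integrations"])
decisions = [route(cited), route(uncited), route(unsure)]
```

The rule is "hard" in the sense that there is no retry loop: a low-confidence or uncited draft never reaches the customer.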
Methodology
47 SaaS vendors were selected from the open-beta cohort and public listings of "AI-powered support" deployments, with a bias toward category leaders. The 16 issues were stratified across four difficulty tiers (4 each), drawn from real (anonymized) ticket logs, and verified answerable from each vendor's public documentation. Sessions were conducted by two independent scorers using a written rubric; inter-rater agreement was 0.86. "Genuine resolution" was scored as: the customer received an answer that, if followed, would resolve the underlying issue. "Claimed deflection" was pulled from vendor self-reporting where available and otherwise computed from observed escalation behavior. The voice comparison ran on a 3-vendor subset due to access constraints, so its results should be read with appropriate caveats.
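As a sketch of how scored sessions roll up into the headline number, the code below computes a per-vendor genuine-resolution rate and then the median across vendors. It assumes that only sessions labelled "resolved" count as genuine resolution and that "escalated" does not (our reading of the rubric, not a stated rule). The vendor names and labels are made up.

```python
from statistics import median


def resolution_rate(labels):
    """Percent of a vendor's sessions scored as resolved."""
    if not labels:
        raise ValueError("a vendor needs at least one scored session")
    return 100 * sum(label == "resolved" for label in labels) / len(labels)


def median_resolution(sessions_by_vendor):
    """Median of per-vendor resolution rates, in percent."""
    return median(resolution_rate(labels) for labels in sessions_by_vendor.values())


# Toy data: 16 scored sessions per vendor, mirroring the 16-issue set.
toy_sessions = {
    "vendor_a": ["resolved"] * 4 + ["unresolved"] * 9 + ["escalated"] * 3,
    "vendor_b": ["resolved"] * 8 + ["unresolved"] * 6 + ["escalated"] * 2,
    "vendor_c": ["resolved"] * 2 + ["unresolved"] * 12 + ["escalated"] * 2,
}
toy_median = median_resolution(toy_sessions)
```

Taking the median across vendors rather than pooling all sessions keeps a few unusually strong or weak vendors from dominating the headline figure.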
Frequently asked questions
What's the real deflection rate of a typical SaaS chatbot?
Across our 47-vendor mystery-shop study, median measured genuine-resolution was 28%, against a median claimed deflection of 71%. The 43-point gap is the customer quietly giving up — and the cost shows up in churn and secondary tickets, not in the support P&L.
Why don't better AI models close the chatbot deflection gap?
Because the constraint is the medium, not the model. Text chat is serial and low-bandwidth: clarifying questions take 5–8 turns where speech takes 1–2. By turn 5 the customer is gone. In our small voice comparison (3 vendor systems, 48 sessions), voice resolved roughly 3-4× as many of the harder issues as chat, often on the same underlying AI.
Does voice deflection have the same CSAT problem?
Early data says no — voice-resolved sessions track much closer to human-handled CSAT than chat-deflected ones do. The mechanism is straightforward: the customer who got a real conversation and a real answer rates the experience as such, even when the responder is AI.
