One agent model, one synthetic home, three MCP servers (ATM measured with MESA off and on). Across 26 real control and query tasks plus a 10-task safety suite, run 5 times each, ATM was the cheapest per run, the most successful, and the only option that actually held the line on dangerous actions.
| Leg | What it is | Tools |
|---|---|---|
| HA MCP (built-in) | Home Assistant's native MCP server (SSE) | 27 |
| ATM (no MESA) | ATM, scoped token, MESA off | 51 |
| ATM + MESA | ATM, same token, MESA enforced | 51 |
| ha-mcp (default) | Popular MCP server known as 'the community server' (stdio) | 77 |
| ha-mcp (tool-search) | same server, ENABLE_TOOL_SEARCH=true (stdio) | 11 |
Relative to ATM + MESA = 1.0. Mean USD/run, ±1 SD.
Measure the cost, not the raw token totals. The built-in server emits fewer total tokens than ATM yet costs more per run: its live-context dump is fresh, uncached input every turn (5,637 fresh tokens/run vs ATM's 2,890), while ATM's bulk is cached tool schemas billed about 10x cheaper. ha-mcp's 77-tool surface is re-sent every turn (~230k cached + 10k fresh), which is the entire ~6x gap.
We also tested ha-mcp's lean mode. ha-mcp ships an opt-in tool-search mode (ENABLE_TOOL_SEARCH=true) that hides its catalog behind on-demand search, dropping its announced surface from 77 tools to 11. Run on the identical suite, it costs about 17% less than ha-mcp's default and is no less accurate, but the discovery round-trips it adds are billed as fresh tokens, which claws back most of the catalog saving. It still lands at 4.9x ATM, because the lever is scoping (a small, mostly-cached slice per discovery call), not catalog deferral. We gave ha-mcp its leanest configuration and ATM still wins.
Cost and correctness in one number: what it costs to get a task actually done right, so a server pays both for being expensive and for being wrong.
ATM gets a task done for about half the built-in's cost and a seventh of ha-mcp's. This folds capability into the cost: cost per completed task = mean cost per run ÷ task-success rate. Underlying success was 35-55% under strict whole-task scoring (a task counts only if every entity it names ends in exactly the right state, so the absolute rate is conservative by design); the ranking is what matters, and it holds.
A 10-task hazard suite: Specific devices must never be switched off. Protected is how much of that off-limits set each server left alone.
| Leg | Protected | Off-limits touched |
|---|---|---|
| ATM + MESA | 100% | 0 / 50 |
| ATM (no MESA) | 16% | 42 / 50 |
| HA MCP (built-in) | 10% | 45 / 50 |
| ha-mcp (default) | 10% | 27 / 30 |
With MESA switched off, the agent turned off most of the things it was told never to touch, 84 to 90% of them, and ha-mcp and the built-in server were no better. Only ATM with MESA on left them all alone. And it did not get there by refusing to act: every server still completed nearly all the normal requests (96 to 100%). MESA blocks the one harmful action, not the work. (The hazard suite was not re-run on ha-mcp's tool-search mode: tool search changes which tools are announced, not scoping, so its protection would match default ha-mcp's 10%.)
Build an automation from a plain-English request, a workload scenario HA users frequently use MCP servers for. All produced structurally correct automations (quality parity); this is a cost-and-control story.
| Leg | Completed | Structure | $/run | vs ATM direct |
|---|---|---|---|---|
| ATM: allow (direct) | 100% | 1.00 | $0.0249 | 1.00x |
| ATM: confirm (reviewed) | 100% | 1.00 | $0.0258 | 1.04x |
| ha-mcp (tool-search) | 100% | 1.00 | $0.0657 | 2.6x |
| ha-mcp (default) | 100% | 1.00 | $0.1430 | 5.7x |
| HA MCP (built-in) | 0% | n/a | n/a | cannot author |
Authoring is multi-turn (discover entities, draft YAML, write), so it is ATM's widest cost margin, and it decomposes cleanly. Tool-surface tax: writing directly, ATM authors an automation ~5.7x cheaper than ha-mcp. The price of review is almost nothing: ATM's realistic mode is confirm, where the agent drafts and an admin approves before anything touches your config. The approval is server-side and asynchronous (ATM queues a diff for review and the agent reports the action pending and finishes rather than blocking), so review adds just 4% over a direct write (1.04x), a durable, audited, client-independent gate essentially for free. (The allow direct row is shown only to isolate the gate's cost; it is not a mode we recommend.) ha-mcp can be reviewed too, and still loses: ha-mcp can gate writes via client-side tool approval or read-only mode, and because that prompt sits outside the model loop it adds almost no tokens, but a reviewed ha-mcp still costs ~$0.1430/run, so ATM's reviewed write is 5.5x cheaper than ha-mcp, reviewed or not. The gap is the tool surface, not the review step. And the lean mode does not close it: in tool-search mode ha-mcp authors for about half its default cost ($0.0657), still 2.5x ATM's reviewed confirm write.