For roughly two years the default assumption was that AI is a cloud business. By 2026 it’s clear that a large slice of the market is moving on-device, and that will reshape who wins. Apple’s local LLM, Gemini Nano on Pixel, Snapdragon X NPUs in Copilot+ PCs: the cloud-default era is over.

What’s running locally now

As of 2026:

  • Apple Intelligence ships a ~3B-parameter local model on every iPhone 15 Pro and newer, plus Private Cloud Compute, a server-side tier for hard tasks
  • Gemini Nano runs on Pixel 8/9 and select Samsung Galaxy devices for summarisation, smart reply, transcription
  • Microsoft Phi-4 at 14B runs on Copilot+ PCs once quantised to around 4 bits per weight, though it leaves limited headroom on a 16GB machine
  • Llama 3.1 8B, the open-weight default for everything else, runs on most modern laptops with 16GB of RAM
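
Whether a model "runs on 16GB" is mostly arithmetic: parameter count times bytes per weight. Here is a rough sketch of that calculation. It covers weights only (the KV cache, activations and the OS all need memory on top), and the 16-bit and 4-bit precisions are illustrative assumptions, not what any vendor listed above actually ships.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_weight


# Model sizes taken from the list above; precisions are assumptions.
models = {"~3B on-device model": 3, "8B open-weight model": 8, "14B model": 14}

for name, size in models.items():
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{name}: {fp16:.1f} GB at 16-bit, {int4:.1f} GB at 4-bit")
```

At 16-bit, an 8B model's weights alone fill a 16GB laptop. At 4-bit they take about 4 GB, which is why quantisation is what makes the list above practical.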

Why this matters more than people realise

Three structural shifts:

  1. Network latency disappears. With no round trip, responses start sooner and keep working offline; what remains is the device’s own generation time.
  2. Privacy stops being a tax. Sensitive data never leaves the device, which opens up healthcare, legal and finance use cases.
  3. Marginal cost approaches zero for the developer. With no per-token pricing, heavy usage stops showing up on the app developer’s bill; the user pays in battery and memory instead.
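
Points 1 and 3 are easy to sanity-check with back-of-envelope numbers. The sketch below uses entirely hypothetical inputs (round-trip time, tokens per second, price per million tokens, usage figures); swap in your own measurements before drawing conclusions.

```python
def response_latency_ms(network_rtt_ms: float, time_to_first_token_ms: float,
                        output_tokens: int, tokens_per_second: float) -> float:
    """Total time to a full response: network + prefill + generation."""
    generation_ms = output_tokens / tokens_per_second * 1000
    return network_rtt_ms + time_to_first_token_ms + generation_ms


def monthly_api_cost_usd(users: int, queries_per_user_per_day: int,
                         tokens_per_query: int, usd_per_million_tokens: float,
                         days: int = 30) -> float:
    """Developer's cloud bill under simple per-token pricing."""
    total_tokens = users * queries_per_user_per_day * tokens_per_query * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical figures, for illustration only.
cloud = response_latency_ms(network_rtt_ms=150, time_to_first_token_ms=300,
                            output_tokens=100, tokens_per_second=80)
local = response_latency_ms(network_rtt_ms=0, time_to_first_token_ms=200,
                            output_tokens=100, tokens_per_second=30)
print(f"cloud: {cloud:.0f} ms, on-device: {local:.0f} ms")

bill = monthly_api_cost_usd(users=10_000, queries_per_user_per_day=20,
                            tokens_per_query=500, usd_per_million_tokens=1.0)
print(f"monthly API bill: ${bill:,.0f}")
```

With these made-up numbers the on-device response starts sooner but finishes later, because the local model generates more slowly. Removing the network does not remove latency. The cost side is less ambiguous: the cloud bill scales linearly with usage, while the on-device equivalent stays flat for the developer.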

What founders should do

If you’re building consumer AI in 2026, your default architecture should be:

  • On-device for the 80% — most queries, fast and free
  • Cloud for the 20% — heavy reasoning, fresh information, complex tool use
  • User-controlled fallback — explicit toggle for which model handles which task

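Apple Intelligence’s architecture is the template here: a small on-device model for most requests, with Private Cloud Compute for the ones it can’t handle. Apple’s ML research blog describes the on-device and server models in some detail.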
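
The three-part default above can be sketched as a small router. This is a minimal illustration, not Apple’s or anyone else’s actual routing logic. Every name in it (Request, UserPrefs, HEAVY_TASKS, the 8,000-character threshold) is a hypothetical stand-in for signals you would tune against your own traffic.

```python
from dataclasses import dataclass, field
from enum import Enum


class Target(Enum):
    ON_DEVICE = "on_device"
    CLOUD = "cloud"


@dataclass
class Request:
    task: str                       # e.g. "summarise", "deep_research"
    prompt: str
    needs_fresh_info: bool = False  # requires data newer than the local model
    needs_tools: bool = False       # requires complex tool use


@dataclass
class UserPrefs:
    allow_cloud: bool = True        # global "keep everything local" switch
    overrides: dict[str, Target] = field(default_factory=dict)  # per-task toggle


# Hypothetical heuristics: which tasks count as "heavy", and how long
# a prompt can be before the local model is assumed to struggle.
HEAVY_TASKS = {"deep_research", "long_horizon_agent", "large_codegen"}
MAX_LOCAL_PROMPT_CHARS = 8_000


def route(req: Request, prefs: UserPrefs) -> Target:
    # A user who has disabled cloud never leaves the device.
    if not prefs.allow_cloud:
        return Target.ON_DEVICE
    # An explicit per-task choice wins over any heuristic.
    if req.task in prefs.overrides:
        return prefs.overrides[req.task]
    # Cloud for heavy reasoning, fresh information, or tool use.
    if (req.task in HEAVY_TASKS or req.needs_fresh_info or req.needs_tools
            or len(req.prompt) > MAX_LOCAL_PROMPT_CHARS):
        return Target.CLOUD
    # Everything else stays local: fast and free.
    return Target.ON_DEVICE


prefs = UserPrefs(overrides={"transcription": Target.ON_DEVICE})
print(route(Request("summarise", "Summarise this email thread."), prefs))
print(route(Request("deep_research", "Compare NPU roadmaps."), prefs))
```

The order of the checks is the design choice that matters. User controls are evaluated before any heuristic, so the toggle is a real guarantee and not just a hint.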

What this breaks

Three business models get squeezed:

  • API-as-a-business: anyone selling a thin wrapper around a frontier API loses to anyone running a local model that’s good enough.
  • Paid consumer chat: users won’t pay $20/month for chat when their phone handles the same task for free.
  • Cloud GPU resellers: the addressable market shrinks as routine inference moves off the cloud.

What this builds

The new winners:

  • Private deployment plays — running LLMs inside enterprise data centres, no internet, no leak risk
  • Personalised on-device experiences — the model that learns your preferences locally without ever phoning home
  • Hybrid orchestration tools — frameworks that smartly route between local and cloud based on task

The honest part

Frontier still matters. The hardest tasks (long-horizon agents, deep research, code generation at scale) still need a frontier-scale model in the cloud. The split isn’t local or cloud; it’s local and cloud, with the routing logic itself becoming a strategic surface. The companies that nail the routing layer in 2026 are this decade’s Stripe.

Building hybrid AI architecture? Compare notes with me.