How Fast Is On-Device AI? Speed Benchmarks Across Phones, Tablets, and Laptops

April 2026 · 10 min read

"Will it actually be fast enough running in my browser?"

That's the first question everyone asks when they hear about on-device AI chatbots. It's a fair question. "Runs locally" sounds like it means "runs slowly." We've all tried local software that felt sluggish compared to a cloud service. The assumption is reasonable.

But it's wrong. And we have the numbers to prove it.

This post breaks down real inference speeds across phones, tablets, and laptops, using published benchmark data from the WebLLM research paper, mobile LLM benchmarking studies, and the MLC LLM project. No hand-waving. Just data.


What determines on-device AI speed?

Before the benchmarks, a quick primer on what actually controls how fast a model generates text on your device.

The bottleneck for transformer-based language models isn't raw compute. It's memory bandwidth: how fast the device can shuttle model weights from memory to the processor each time it generates a token. A data center GPU like the NVIDIA A100 has 2+ TB/s of memory bandwidth. Your laptop might have 100–200 GB/s. Your phone has 50–90 GB/s.

That sounds like a huge gap. But it only matters if you're running the same massive model. A 70B parameter model needs to move ~35GB of weights per token (at 4-bit quantization). A 1.5B parameter model? About 800MB. Your phone's memory bandwidth can handle that comfortably.
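That back-of-envelope bound is easy to compute yourself. Here's a minimal sketch (the function names and the 60 GB/s figure are illustrative assumptions, not measurements from any specific phone):

```typescript
// Rough upper bound on decode speed for a memory-bandwidth-bound model:
// each generated token requires streaming all weights through the processor once.
function weightBytes(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8;
}

function maxTokensPerSecond(
  bandwidthGBps: number,
  params: number,
  bitsPerWeight = 4
): number {
  const bytesPerToken = weightBytes(params, bitsPerWeight);
  return (bandwidthGBps * 1e9) / bytesPerToken;
}

// A phone with ~60 GB/s running a 1.5B model at 4-bit:
// bytesPerToken = 1.5e9 * 0.5 bytes = 750MB, so the ceiling is 80 tok/s.
console.log(maxTokensPerSecond(60, 1.5e9).toFixed(0)); // → "80"
```

Real decode speeds land below this ceiling (compute, KV cache reads, and browser overhead all take a cut), but the bound explains the core claim: shrink the model and even phone-class bandwidth stops being the problem.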

This is why model size is the key variable. The models that Kanha deploys, fine-tuned models in the 0.5B–3B parameter range, are specifically sized to run fast on consumer hardware. They're not generic models crammed into a browser. They're purpose-built for this.

Two more things matter:

  • WebGPU provides a 3–5x speedup over older WebGL-based inference for transformer models. It gives browsers near-native access to the GPU and, according to the WebLLM research paper, retains up to 80% of native inference performance.
  • Quantization (running the model in 4-bit precision instead of 16-bit) cuts memory requirements by 4x with minimal quality loss. This is standard practice for on-device deployment.
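The quantization arithmetic is worth seeing once. A sketch of the size math (estimates only; real model files add small overhead for embeddings and quantization scales):

```typescript
// Approximate model footprint at a given quantization level.
function modelSizeMB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / 1e6;
}

const fp16 = modelSizeMB(0.5e9, 16); // 1000 MB at 16-bit
const int4 = modelSizeMB(0.5e9, 4);  // 250 MB at 4-bit, in line with the
                                     // ~300MB download quoted for 0.5B models
console.log(fp16 / int4);            // → 4 (the "4x" memory reduction)
```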

The benchmarks: device-by-device breakdown

Here's what real devices actually deliver. These numbers are for 4-bit quantized models running in-browser via WebGPU, drawn from published benchmarks by the WebLLM project, the MLC LLM benchmarking suite, and academic studies on mobile LLM inference.

| Device Category | Example Device | Model Size | Decode Speed (tok/s) | User Experience |
| --- | --- | --- | --- | --- |
| High-end laptop (Apple Silicon) | MacBook Pro M3 Max | 3B | ~90 | Instant, faster than you can read |
| High-end laptop (Apple Silicon) | MacBook Pro M3 Max | 8B | ~41 | Fast streaming, feels real-time |
| Mid-range laptop (Apple Silicon) | MacBook Air M2 | 1.5B | 15–25 | Smooth, comfortable reading pace |
| Windows laptop (discrete GPU) | RTX 3060+ laptop | 3B | 25–40 | Fast, responsive |
| Windows laptop (integrated GPU) | Intel Iris / AMD iGPU | 1.5B | 10–18 | Usable, slight initial delay |
| Flagship phone (Android) | Snapdragon 8 Gen 3 | 1.5B | 15–30 | Good, faster than most cloud chatbots' first-token latency |
| Flagship phone (iOS) | iPhone 15/16 Pro | 1.5B | 15–30 | Good, smooth streaming |
| Mid-range phone | Snapdragon 870 / older | 0.5B | 8–15 | Functional, noticeable but acceptable |
| Tablet | iPad Pro M2/M4 | 3B | 30–50 | Excellent, laptop-class |

What do these numbers feel like?

  • 8–15 tok/s: Words appear at a steady pace. You can read every word as it arrives. This is the floor for a usable experience, and even mid-range phones hit it.
  • 15–30 tok/s: Comfortable streaming. Feels like a real conversation. This is where most flagship phones and mid-range laptops land.
  • 30+ tok/s: Faster than you can read. Indistinguishable from a fast cloud chatbot. Laptops and tablets deliver this consistently.

For context, the average human reads at about 4 words per second, roughly 5 tokens per second. Even the slowest device in the table above generates tokens faster than you can read them. The bottleneck isn't the device. It's your eyes.


What about mobile? The question everyone asks

Let's address this directly because it's the biggest concern we hear: "Sure, it's fast on a MacBook. But what about my phone?"

The data is clear. Flagship phones deliver 15–30 tokens per second with 1.5B models. That's faster than comfortable reading speed. On a Snapdragon 8 Gen 3 or Apple A17 Pro, the experience is smooth, responsive, and genuinely good.

Even mid-range phones, devices with a Snapdragon 870 or equivalent, handle 0.5B models at 8–15 tok/s. That's not blazing fast, but it's faster than a person reads. The words stream in at a pace that feels natural.

Here's what makes this work:

Kanha uses fine-tuned small models, not generic large ones. We're not trying to run Llama 70B in your browser. We train 0.5B–1.5B parameter models specifically on your website data. A fine-tuned 0.5B model that knows your product catalog gives better answers about your products than a generic 8B model working from a prompt. And it runs 8–16x faster.

The model downloads once, then it's cached. The first visit downloads the model (300MB–800MB for the small models Kanha uses). After that, every subsequent visit loads from the browser cache in seconds. No repeated downloads.

Your phone already runs AI. Autocorrect, photo enhancement, voice recognition, face detection. These features use on-device models every day. A fine-tuned 0.5B language model is lighter than most of them.

Thermal throttling: the honest caveat

We want to be upfront about this. Sustained AI inference can cause phones to throttle performance. Mobile benchmarking research shows that flagship phones can experience a 10–30% speed reduction after extended continuous generation: the chip gets warm and dials back clock speeds to protect itself.

But here's why this rarely matters for chatbots: typical customer support conversations are short bursts, not sustained generation. A customer asks a question (20–50 tokens generated), reads the answer, maybe asks a follow-up. The phone has time to cool between responses. Thermal throttling affects workloads like generating 2,000-token essays back-to-back. It doesn't meaningfully affect a conversation where the user is reading between messages.
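The duty-cycle argument above can be put in rough numbers. A sketch, using assumed figures (a 40-token answer on an already-throttled phone at 12 tok/s, reading speed ~5 tok/s from earlier in the post):

```typescript
// How long does the chip actually work per chat turn?
function generationSeconds(tokens: number, tokPerSec: number): number {
  return tokens / tokPerSec;
}

const busySec = generationSeconds(40, 12); // ~3.3s of inference per answer
const readSec = 40 / 5;                    // ~8s for the user to read it
// The chip works ~3s, then idles ~8s while the user reads: a low duty
// cycle that gives it time to shed heat between responses.
console.log(busySec < readSec); // → true
```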


Why smaller models win for customer support

There's a widespread assumption that bigger models are always better. For general-purpose tasks, that's mostly true. For customer support on your website? It's not.

A fine-tuned 0.5B model that has learned your product catalog, return policy, shipping rules, and FAQ outperforms a generic 70B model that's guessing from a system prompt. The knowledge is in the weights, not in a retrieval pipeline that might miss the right chunk. Domain specificity beats raw parameter count for focused use cases.

And smaller models bring real advantages:

| Model Size | Download Size (4-bit) | RAM Required | Device Coverage |
| --- | --- | --- | --- |
| 0.5B | ~300MB | ~1GB | ~85% of all devices |
| 1.5B | ~800MB | ~2GB | ~75% of all devices |
| 3B | ~1.5GB | ~3.5GB | ~60% of desktop, ~40% of mobile |

A 0.5B model covers 85% of all devices your visitors use. A 1.5B model covers 75%. By keeping models small and specialized, Kanha maximizes the number of visitors who get a fast, local experience, without sacrificing answer quality on the domain that matters: your business.


How Kanha optimizes for every device

The SDK doesn't just load a model and hope for the best. It adapts to the device:

  • Automatic model selection. Kanha detects the visitor's device capabilities (GPU, available memory, browser support) and loads the right model size. A MacBook Pro might get a 3B model; a budget Android phone gets a 0.5B model. Both get accurate answers. Both get a fast experience.
  • WebGPU feature detection with graceful fallback. If the visitor's browser doesn't support WebGPU, Kanha can fall back to a cloud-served response or simply not show the widget. No broken experience.
  • Model caching. First visit downloads the model. Every subsequent visit loads from the browser cache. The cold start happens once.
  • Progressive loading. A visible progress indicator during model download so the visitor knows what's happening. No mysterious blank screen.

You don't configure any of this. It just works.
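The selection step above can be sketched as a pure policy function. The tiers and RAM thresholds here are illustrative assumptions (taken from the coverage table earlier), not Kanha's actual rules; in a real browser the capabilities object would be filled from `navigator.gpu` and, where available, `navigator.deviceMemory`:

```typescript
interface DeviceCaps {
  hasWebGPU: boolean;
  memoryGB: number; // approximate available RAM
}

type Choice =
  | { model: "0.5B" | "1.5B" | "3B" }
  | { fallback: "cloud" | "hidden" };

// Hypothetical tiering: pick the largest model whose RAM requirement
// fits the device; fall back when WebGPU or memory is missing.
function selectModel(caps: DeviceCaps): Choice {
  if (!caps.hasWebGPU) return { fallback: "cloud" };
  if (caps.memoryGB >= 3.5) return { model: "3B" };
  if (caps.memoryGB >= 2) return { model: "1.5B" };
  if (caps.memoryGB >= 1) return { model: "0.5B" };
  return { fallback: "cloud" };
}

console.log(selectModel({ hasWebGPU: true, memoryGB: 16 })); // → { model: "3B" }
console.log(selectModel({ hasWebGPU: false, memoryGB: 8 })); // → { fallback: "cloud" }
```

Keeping the policy a pure function of a capabilities snapshot makes it trivial to test and to update thresholds without touching detection code.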


The speed advantage nobody talks about: zero network latency

Here's something that gets overlooked in speed comparisons. Cloud chatbots have a speed floor that no amount of server optimization can eliminate: the network round-trip.

Your customer types a question. It travels to a server (50–300ms depending on connection). The server queues and processes it. The response streams back. Even with the fastest API, that's 500ms–2 seconds before the first word appears.

On-device inference has no network round-trip. The model is already loaded in the browser. First token appears in under 200ms once the model is warm. That 200ms includes everything: tokenization, prefill, first decode step.

This matters most on bad connections:

| Scenario | Cloud Chatbot (Time to First Token) | On-Device / Kanha (Time to First Token) |
| --- | --- | --- |
| Fast Wi-Fi / 5G | 500ms–1.5s | <200ms |
| Moderate 4G | 800ms–2.5s | <200ms |
| Slow 3G / congested Wi-Fi | 2–5s+ | <200ms |
| Offline | Unavailable | Works normally |

On a subway, in a coffee shop with crowded Wi-Fi, on a rural 3G connection, the on-device chatbot is faster than the cloud one. And when you're fully offline? The cloud chatbot doesn't work at all. The on-device one does.

For businesses with global audiences (customers on mobile data in emerging markets, visitors at trade shows with overloaded Wi-Fi), this isn't a minor edge case. It's a meaningful portion of your traffic getting a better experience.


The trajectory: it's only getting faster

The numbers above represent where things stand today. They're going to improve, and not gradually.

Browser-level optimizations are shipping continuously. WebGPU is still a relatively young API. Chrome, Safari, and Edge teams are shipping performance improvements every release cycle. Research from the WeInfer framework presented at the ACM Web Conference 2025 demonstrated a 3.76x improvement over the WebLLM baseline through better scheduling and memory management, without changing the model or the hardware.

NPU access is coming to browsers. Apple's Neural Engine delivers 35 TOPS. Qualcomm's Hexagon NPU hits 60 TOPS on flagship chips. Today, browsers can't access these processors. The WebNN API will change that. When it does, prefill speed (how fast the model processes your question before generating a response) could improve by 10–50x. First-token latency drops from 200ms to near-instant.

Model compression keeps improving. Every generation of quantization techniques delivers the same quality at smaller model sizes. A 2026 0.5B model matches or exceeds a 2024 1.5B model on domain-specific tasks.

The point: buying into on-device inference now means your chatbot's performance automatically improves as browsers update and hardware advances. No migration needed. No re-architecture. The same SDK, the same model format, just faster every quarter.


The numbers are clear

On-device AI chatbots are fast enough for production use today. Laptops deliver 25–90+ tok/s. Flagship phones hit 15–30 tok/s. Even mid-range phones outpace human reading speed with the right model size.

The speed question has an answer, and the answer is yes. It's fast enough. On most devices, it's more than fast enough. And it's getting faster every quarter without you doing anything.

If you're still wondering whether on-device AI can deliver a good user experience on mobile: try it on your own phone. Load kanha.ai, open the chat, and see for yourself. The numbers in this post describe what you'll feel.

Try it at kanha.ai. Free tier, no credit card, on-device by default.

Ready to try Kanha?

Train a bot on your site in under 20 minutes. Free, no credit card.

Get started free