Tutorial

Voice-First Banking with Cakra Smart Search: Replacing Menus with Intent

A deep-dive into how I rebuilt a mobile banking interface around the Web Speech API, intent-based navigation, and context-aware form hydration — and why menu-diving is a UX anti-pattern.

Azar
Feb 24, 2026 · 7 min read

Mobile banking apps have a navigation problem. Every feature lives behind three taps, four if you count the loading state. When we started Cakra Smart Search, the brief was simple on paper: let users get to any feature in one sentence. Six months later, that one sentence had reshaped how I think about UX entirely.

Why Menus Are an Anti-Pattern in Banking

The conventional mobile banking app has 30–50 features and roughly seven slots on its home screen. Everything else hides under "More," and "More" is where good UX goes to die. Cakra's research surfaced something we'd all known intuitively: users don't learn the menu structure. They learn the path to the two or three features they use most, and anything outside that path is a friction wall.

Intent-based UX isn't about hiding the menu. It's about admitting the menu was always a crutch — a side-effect of designers and engineers needing to organize work, not users needing to organize their thoughts.

The Web Speech API, Used Seriously

Most voice-input demos stop at "tell me what you said." We needed to take the spoken sentence and route it through an intent classifier, extract entities (recipient, amount, account), and pre-fill a multi-step transaction form. The Web Speech API gets you the transcription. Everything after that is yours to build.

TypeScript
import { useCallback, useMemo } from "react";

function useVoiceIntent() {
  // Lazily create a recognizer once; Chrome still ships it behind the
  // webkit prefix, and some browsers don't ship it at all.
  const recognition = useMemo(() => {
    const SR = window.SpeechRecognition ?? window.webkitSpeechRecognition;
    if (!SR) return null;
    const r = new SR();
    r.lang = "id-ID";           // Indonesian transcription
    r.interimResults = false;   // we only care about the final result
    r.maxAlternatives = 3;
    return r;
  }, []);

  return useCallback(async () => {
    if (!recognition) throw new Error("voice-unsupported");

    // Wrap the event-based API in a promise: resolve on the first final
    // transcript, reject on any recognition error.
    const transcript = await new Promise<string>((resolve, reject) => {
      recognition.onresult = (e) => resolve(e.results[0][0].transcript);
      recognition.onerror = (e) => reject(new Error(e.error));
      recognition.start();
    });

    return classifyIntent(transcript); // → { route, entities }
  }, [recognition]);
}
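
The classifyIntent call is doing a lot of work in that one line, and the real classifier is beyond the scope of this post. But a deliberately naive, keyword-based sketch shows the contract the hook relies on: transcript in, { route, entities } out. Every pattern, route, and entity name below is illustrative, not Cakra's actual implementation.

TypeScript
// Illustrative sketch only — a production classifier would be model-backed
// and handle far more phrasings. The { route, entities } shape matches the
// contract used by useVoiceIntent above.
type Intent = {
  route: string;
  entities: Record<string, string>;
};

const AMOUNT_RE = /(\d+(?:[.,]\d{3})*)\s*(?:ribu|juta|rupiah)?/i;

function classifyIntent(transcript: string): Intent {
  const text = transcript.toLowerCase();

  // "transfer 500 ribu ke Budi" → transfer flow with amount + recipient
  if (text.includes("transfer") || text.includes("kirim")) {
    const amount = text.match(AMOUNT_RE)?.[1] ?? "";
    const recipient = text.split(/\bke\b/)[1]?.trim() ?? "";
    return { route: "/transfer", entities: { amount, recipient } };
  }

  // "bayar listrik" → bill-payment flow
  if (text.includes("bayar")) {
    return { route: "/bills", entities: { biller: text.replace("bayar", "").trim() } };
  }

  // Fallback: hand the raw transcript to full-text search
  return { route: "/search", entities: { query: transcript } };
}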

Context-Aware Form Hydration

Once we had intent + entities, we needed forms that could be partially pre-filled and still feel coherent. The trick was treating the URL as the source of truth for the search context — every transaction screen reads its initial state from query params, which means the same screen reached via tap, voice, or deep link behaves identically. No two code paths for "user navigated normally" versus "user came from voice intent."

"Make voice navigation a strict subset of regular navigation. If your voice path has its own state machine, you have two products to maintain — and only one of them gets tested."

The 40+ Transaction Lifecycle Problem

Cakra supports more than 40 distinct transaction types — transfers, bill payments, virtual account top-ups, QRIS payments, foreign exchange — each with its own validation rules, OTP flow, receipt format, and success state. The temptation is to build 40 screens. The actual answer was a generic transaction shell with pluggable validation, confirmation, and success components, driven by a per-transaction config.
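
A sketch of what such a per-transaction config can look like is below. The field names and example components are hypothetical, but they show the shape of the idea: the shell stays generic, and each transaction type contributes only its points of variation.

TypeScript
import type { FC } from "react";

// Hypothetical shapes — the real schema differs, but the idea is the same:
// one generic shell, ~40 small config objects.
interface Receipt { reference: string; amount: number }

interface TransactionConfig<TForm> {
  id: string;                                   // e.g. "transfer", "qris", "fx"
  requiresOtp: boolean;
  validate: (values: TForm) => string[];        // error messages; empty = valid
  ConfirmationView: FC<{ values: TForm }>;
  SuccessView: FC<{ receipt: Receipt }>;
}

// One example config: bill payment.
interface BillPaymentForm { billerId: string; amount: number }

const billPaymentConfig: TransactionConfig<BillPaymentForm> = {
  id: "bill-payment",
  requiresOtp: true,
  validate: (v) => (v.billerId ? [] : ["Pilih penyedia tagihan"]), // "choose a biller"
  ConfirmationView: ({ values }) => <p>Bayar tagihan {values.billerId}</p>,
  SuccessView: ({ receipt }) => <p>Berhasil · ref {receipt.reference}</p>,
};

// The generic shell looks up a config by id and drives the shared lifecycle:
// form → validate → (OTP) → confirm → success. Omitted here for brevity.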

When you find yourself copy-pasting the third near-identical screen, stop. The right abstraction is almost always config-driven — but only after you've built two or three screens by hand and seen the actual axes of variation.

What Voice Taught Me About UX

I came into Cakra thinking voice was a novelty layer on top of a normal app. I left convinced it's a forcing function. When you commit to "any feature in one sentence," you stop tolerating screens that exist purely because the menu had room for them. You delete the cruft. You consolidate. The app gets smaller, and the surface that's left is the surface that matters.

If you're considering adding voice to a product, my advice is to use it as a UX audit before you ship it as a feature. Try to navigate your own app by sentence alone. The screens you can't reach are the screens you probably shouldn't have built.

Building voice or intent-based UX? Curious to compare notes.