The Web Speech API: Voice Recognition and Synthesis in the Browser

Summary

Web interfaces have evolved beyond clicks and keystrokes. The Web Speech API, specified by a W3C Community Group, provides two independent subsystems to integrate voice into any web application: SpeechRecognition (speech-to-text) for capturing and transcribing the user’s voice, and SpeechSynthesis (text-to-speech) for reading text aloud using the system’s available voices.

This article covers the architecture of both subsystems, shows practical implementations with code, and addresses critical caveats like browser fragmentation, the Chrome 15-second bug, and iOS restrictions.


1. Two Independent Subsystems

The API is divided into two completely separate modules:

| Subsystem | Direction | Interface | Browser Support |
|---|---|---|---|
| SpeechRecognition | User → App (microphone input) | SpeechRecognition / webkitSpeechRecognition | Limited (mainly Chromium-based) |
| SpeechSynthesis | App → User (audio output) | window.speechSynthesis + SpeechSynthesisUtterance | Universal (Baseline) |

Note: SpeechSynthesis enjoys universal adoption across all modern browsers. SpeechRecognition, however, remains experimental with fragmented support. Always implement feature detection before using either.


2. Speech Recognition (Speech-to-Text)

2.1 Feature Detection and Instantiation

Due to vendor prefixing, the constructor must be resolved defensively:

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

if (!SpeechRecognition) {
  console.error('SpeechRecognition is not supported in this browser.');
}

2.2 Configuration Properties

Once instantiated, the recognition engine is configured through four key properties:

| Property | Type | Description |
|---|---|---|
| lang | string | BCP 47 language tag (e.g., "en-US", "es-ES"). Defaults to the document’s lang attribute. |
| continuous | boolean | If true, keeps listening after returning results. Default: false (stops after one result). |
| interimResults | boolean | If true, emits provisional transcriptions in real time before finalization. |
| maxAlternatives | number | Number of alternative transcriptions per result segment. Default: 1. |

const recognition = new SpeechRecognition();
recognition.lang = 'en-US';
recognition.continuous = true;
recognition.interimResults = true;
recognition.maxAlternatives = 1;

2.3 The Result Hierarchy

Results are not flat strings. They arrive as a three-level object tree:

  1. SpeechRecognitionResultList — a list-like collection accessible through event.results
  2. SpeechRecognitionResult — each item contains an isFinal boolean and one or more alternatives
  3. SpeechRecognitionAlternative — holds the transcript string and a confidence score (0 to 1)

recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    const transcript = result[0].transcript;
    const confidence = result[0].confidence;

    if (result.isFinal) {
      console.log(`Final: "${transcript}" (${(confidence * 100).toFixed(1)}%)`);
    } else {
      console.log(`Interim: "${transcript}"`);
    }
  }
};
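When maxAlternatives is greater than 1, each result segment carries several candidate transcriptions ranked by confidence. A small helper (an illustrative sketch, not part of the API) can pick the best one; the mock object below merely mimics the index-accessible shape of a SpeechRecognitionResult:

```javascript
// Pick the alternative with the highest confidence from a
// SpeechRecognitionResult-like object (index-accessible, with .length).
const bestAlternative = (result) => {
  let best = result[0];
  for (let i = 1; i < result.length; i++) {
    if (result[i].confidence > best.confidence) best = result[i];
  }
  return best;
};

// Mock result for illustration (in the browser this comes from event.results[i]):
const mockResult = {
  0: { transcript: 'recognize speech', confidence: 0.92 },
  1: { transcript: 'wreck a nice beach', confidence: 0.41 },
  length: 2,
  isFinal: true,
};

console.log(bestAlternative(mockResult).transcript); // "recognize speech"
```

Inside onresult you would call bestAlternative(event.results[i]) instead of hard-coding index 0.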

2.4 Event Lifecycle

The recognition engine fires events in a deterministic sequence:

start → audiostart → soundstart → speechstart

  result (repeated while speaking)

speechend → soundend → audioend → end

Key events to handle:

recognition.onspeechstart = () => {
  console.log('Voice detected — transcription in progress...');
};

recognition.onerror = (event) => {
  switch (event.error) {
    case 'not-allowed':
      console.error('Microphone access denied by user or policy.');
      break;
    case 'network':
      console.error('Cloud recognition service unreachable.');
      break;
    case 'no-speech':
      console.warn('No speech detected — only silence or background noise.');
      break;
    case 'aborted':
      console.warn('Recognition was aborted.');
      break;
  }
};

recognition.onnomatch = () => {
  console.warn('Speech was detected but could not be recognized.');
};

recognition.onend = () => {
  console.log('Recognition session ended.');
};

2.5 Full Working Example: Live Dictation

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

if (!SpeechRecognition) {
  document.getElementById('status').textContent = 'Speech Recognition not supported.';
} else {
  const recognition = new SpeechRecognition();
  recognition.lang = 'en-US';
  recognition.continuous = true;
  recognition.interimResults = true;

  const finalOutput = document.getElementById('final');
  const interimOutput = document.getElementById('interim');
  let finalTranscript = '';

  recognition.onresult = (event) => {
    let interim = '';

    for (let i = event.resultIndex; i < event.results.length; i++) {
      const transcript = event.results[i][0].transcript;

      if (event.results[i].isFinal) {
        finalTranscript += transcript;
      } else {
        interim += transcript;
      }
    }

    finalOutput.textContent = finalTranscript;
    interimOutput.textContent = interim;
  };

  recognition.onerror = (event) => {
    document.getElementById('status').textContent = `Error: ${event.error}`;
  };

  document.getElementById('start-btn').addEventListener('click', () => {
    finalTranscript = '';
    recognition.start();
    document.getElementById('status').textContent = 'Listening...';
  });

  document.getElementById('stop-btn').addEventListener('click', () => {
    recognition.stop();
    document.getElementById('status').textContent = 'Stopped.';
  });
}

3. Speech Synthesis (Text-to-Speech)

3.1 The Singleton Controller

Unlike SpeechRecognition, the synthesis engine is accessed through a single global instance:

const synth = window.speechSynthesis;

This controller manages a FIFO queue of utterances and exposes three read-only state flags:

| Property | Description |
|---|---|
| synth.pending | true if utterances are waiting in the queue |
| synth.speaking | true if audio is currently being produced |
| synth.paused | true if playback has been paused |

Control methods:

synth.speak(utterance);
synth.pause();
synth.resume();
synth.cancel();
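speak() returns immediately; completion is only signalled through the utterance’s end event. A promise wrapper (an illustrative helper, not part of the API) makes sequential narration easier. It takes the controller as a parameter, so in the browser you would pass window.speechSynthesis:

```javascript
// Resolve when the utterance finishes speaking, reject if synthesis fails.
const speakAsync = (synth, utterance) =>
  new Promise((resolve, reject) => {
    utterance.onend = () => resolve();
    utterance.onerror = (event) => reject(new Error(event.error));
    synth.speak(utterance);
  });

// Browser usage:
//   await speakAsync(window.speechSynthesis, new SpeechSynthesisUtterance('First'));
//   await speakAsync(window.speechSynthesis, new SpeechSynthesisUtterance('Second'));
```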

3.2 Creating Utterances

The SpeechSynthesisUtterance object is the unit of speech. It holds the text and all prosodic configuration:

| Property | Range | Description |
|---|---|---|
| text | String | The text content to be spoken |
| voice | SpeechSynthesisVoice | The voice profile to use |
| lang | String (BCP 47) | Language override |
| pitch | 0 – 2 | Voice pitch (1 = normal) |
| rate | 0.1 – 10 | Speaking speed (1 = normal) |
| volume | 0 – 1 | Audio volume (1 = max) |

const utterance = new SpeechSynthesisUtterance('Hello, welcome to the application.');
utterance.lang = 'en-US';
utterance.pitch = 1;
utterance.rate = 1;
utterance.volume = 0.8;

synth.speak(utterance);

3.3 Loading Voices Asynchronously

Voices are loaded asynchronously by the browser (especially on Chromium). A synchronous call to getVoices() during page load will often return an empty array. You must listen for the voiceschanged event:

const synth = window.speechSynthesis;
let voices = [];

const loadVoices = () => {
  voices = synth.getVoices();
  console.log(`${voices.length} voices loaded.`);
};

loadVoices();

if (synth.onvoiceschanged !== undefined) {
  synth.onvoiceschanged = loadVoices;
}

Each SpeechSynthesisVoice object exposes a name, a lang tag, a voiceURI, a localService flag (true for on-device voices), and a default flag:

voices.forEach((voice) => {
  console.log(`${voice.name} (${voice.lang}) ${voice.default ? '— DEFAULT' : ''}`);
});
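Choosing a voice for a given language is a recurring task. The helper below (an illustrative sketch) matches a BCP 47 prefix against a voice list, so 'es' matches both 'es-ES' and 'es-MX'. In the browser you would pass the array returned by synth.getVoices():

```javascript
// Return the first voice whose lang matches the given BCP 47 prefix,
// falling back to the first available voice, or null if the list is empty.
const pickVoice = (voiceList, langPrefix) =>
  voiceList.find((v) => v.lang.toLowerCase().startsWith(langPrefix.toLowerCase()))
    || voiceList[0]
    || null;
```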

3.4 Full Working Example: Text Reader with Voice Selection

const synth = window.speechSynthesis;
let voices = [];

const voiceSelect = document.getElementById('voice-select');
const textInput = document.getElementById('text-input');

const populateVoices = () => {
  voices = synth.getVoices();
  voiceSelect.innerHTML = '';

  voices.forEach((voice, index) => {
    const option = document.createElement('option');
    option.value = index;
    option.textContent = `${voice.name} (${voice.lang})`;
    voiceSelect.appendChild(option);
  });
};

populateVoices();
if (synth.onvoiceschanged !== undefined) {
  synth.onvoiceschanged = populateVoices;
}

document.getElementById('speak-btn').addEventListener('click', () => {
  synth.cancel();

  const utterance = new SpeechSynthesisUtterance(textInput.value);
  utterance.voice = voices[voiceSelect.value];
  utterance.pitch = parseFloat(document.getElementById('pitch').value);
  utterance.rate = parseFloat(document.getElementById('rate').value);

  utterance.onstart = () => {
    document.getElementById('status').textContent = 'Speaking...';
  };

  utterance.onend = () => {
    document.getElementById('status').textContent = 'Done.';
  };

  utterance.onerror = (event) => {
    document.getElementById('status').textContent = `Error: ${event.error}`;
  };

  synth.speak(utterance);
});

document.getElementById('stop-btn').addEventListener('click', () => {
  synth.cancel();
  document.getElementById('status').textContent = 'Cancelled.';
});

4. Cloud vs. On-Device Processing

SpeechRecognition can route audio through two radically different pipelines:

4.1 Cloud Processing (Default)

By default, Chrome serializes the audio stream and sends it to Google’s cloud servers for transcription. This provides high accuracy (deep neural models, massive training datasets) but introduces trade-offs:

  • Latency: network round-trips degrade real-time experiences
  • Privacy: raw audio data leaves the device
  • Offline: completely non-functional without internet

4.2 On-Device Processing (Experimental)

Modern browsers are introducing local inference using on-device models. This eliminates network latency, works offline, and keeps audio private:

const recognition = new SpeechRecognition();
recognition.lang = 'en-US';
recognition.processLocally = true; // request on-device inference

const options = { langs: ['en-US'], processLocally: true };
const availability = await SpeechRecognition.available(options);

if (availability === 'available') {
  recognition.start();
} else if (availability === 'downloadable') {
  await SpeechRecognition.install(options);
  recognition.start();
} else {
  console.warn('On-device recognition not supported — falling back to cloud.');
  recognition.processLocally = false;
  recognition.start();
}

Note: SpeechRecognition.available() and SpeechRecognition.install() are experimental APIs; both take an options object naming the requested languages. They let you check whether language models are cached locally and trigger their download. Support is limited to recent Chromium versions.


5. Security Model

The Web Speech API enforces strict security constraints to prevent passive eavesdropping:

  • User Gesture Required: browsers generally require recognition.start() to be invoked from a user-initiated event handler (click, tap). Outside a gesture the call may be rejected, although Chrome does allow programmatic restarts once microphone permission has been granted.
  • HTTPS Only: SpeechRecognition works only in secure contexts, since microphone capture is restricted to HTTPS. Plain HTTP pages cannot access the microphone.
  • Visual Indicators: browsers display persistent, non-hideable recording indicators (tab icons, status bar alerts) while the microphone is active.
  • Permissions-Policy: server administrators can block microphone access at the HTTP header level for cross-origin iframes:
Permissions-Policy: microphone=()
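Conversely, a cross-origin iframe that should be allowed to use recognition must be granted access explicitly through the iframe’s allow attribute (illustrative snippet; the src URL is a placeholder):

```html
<!-- Delegate microphone access to an embedded cross-origin document -->
<iframe src="https://example.com/dictation" allow="microphone"></iframe>
```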

6. Browser Compatibility (2026)

| Browser | SpeechSynthesis (TTS) | SpeechRecognition (STT) | Notes |
|---|---|---|---|
| Chrome | ✅ Full (v33+) | ✅ Full (v25+, prefixed) | Primary implementor. Cloud-based by default. |
| Edge | ✅ Full (v14+) | ⚠️ Unreliable | Exposes the interface but events may never fire. |
| Firefox | ✅ Full (v49+) | ❌ Not implemented | Philosophical objection to routing voice biometrics to third-party clouds. |
| Safari | ✅ Full (v7+) | ⚠️ Partial (v14.1+) | Requires webkit prefix. Depends on Siri being enabled. |
| Opera | ✅ Full (v21+) | ❌ Not implemented | Mirrors Edge/Firefox limitations. |
| Brave | ⚠️ Variable | ❌ Blocked | Blocks cloud STT connections for privacy. |

6.1 Defensive Feature Detection

Always verify support before using either subsystem:

const hasSpeechRecognition = 'SpeechRecognition' in window || 'webkitSpeechRecognition' in window;
const hasSpeechSynthesis = 'speechSynthesis' in window;

if (!hasSpeechRecognition) {
  console.warn('SpeechRecognition unavailable. Consider a cloud fallback (Google Cloud STT, Azure, AWS Transcribe).');
}

if (!hasSpeechSynthesis) {
  console.warn('SpeechSynthesis unavailable.');
}

7. Known Issues and Workarounds

7.1 Chrome’s 15-Second TTS Bug

Chrome’s cloud-backed voices abort utterances longer than ~15 seconds without firing end or error events, leaving the UI in a “speaking” state forever. The fix is to chunk long texts into shorter utterances:

const speakLongText = (text, voice) => {
  const synth = window.speechSynthesis;
  synth.cancel();

  // Split on sentence boundaries; the lookbehind keeps any trailing
  // fragment that lacks terminal punctuation.
  const sentences = text.split(/(?<=[.!?])\s+/).filter(Boolean);

  sentences.forEach((sentence) => {
    const utterance = new SpeechSynthesisUtterance(sentence);
    if (voice) utterance.voice = voice;
    synth.speak(utterance);
  });
};

7.2 Safari iOS: TTS Requires a “Prime” Gesture

Safari on iOS blocks TTS audio until the user triggers a gesture. To work around this, silently “prime” the audio context on the first user interaction:

document.addEventListener('click', () => {
  const primer = new SpeechSynthesisUtterance('');
  window.speechSynthesis.speak(primer);
}, { once: true });

After this, subsequent calls to synth.speak() will work without requiring additional gestures.

7.3 SpeechRecognition in iOS PWAs

When a web page is added to the home screen on iOS (“Add to Home Screen”), webkitSpeechRecognition is completely disabled. All calls fail immediately with not-allowed errors, without even showing the permission prompt. There is no workaround within the Web Speech API; a cloud-based STT service must be used instead.


8. Accessibility Benefits

The Web Speech API directly improves accessibility for:

  • Motor-impaired users: voice commands replace keyboard and mouse interactions
  • Visually-impaired users: TTS reads content aloud, complementing screen readers
  • Low-literacy users: audio feedback reduces dependency on reading ability
  • Hands-free scenarios: cooking, driving, or medical environments where touch/type input is impractical

Best practices when integrating speech into accessible interfaces:

  • Dismiss virtual keyboards by calling element.blur() when recognition starts, reclaiming viewport space on mobile
  • Show interim results visually so the user sees real-time feedback during dictation
  • Use the boundary event of TTS to highlight text as it is being read, providing a synchronized visual-audio experience
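For the last point, the boundary event reports a charIndex into the utterance text each time a new word or sentence is reached. A small helper (illustrative, not part of the API) maps that index to the word being spoken, which is exactly what a highlighter needs:

```javascript
// Given the utterance text and a boundary event's charIndex,
// return the word starting at that index.
const wordAt = (text, charIndex) => {
  const end = text.indexOf(' ', charIndex);
  return text.slice(charIndex, end === -1 ? text.length : end);
};

// Browser usage (highlight() is a placeholder for your own DOM update):
// utterance.onboundary = (event) => {
//   if (event.name === 'word') highlight(wordAt(utterance.text, event.charIndex));
// };
```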

9. Conclusion

The Web Speech API provides a powerful bridge between voice and the web platform. SpeechSynthesis is production-ready across all browsers, while SpeechRecognition requires careful feature detection and fallback strategies due to fragmented support.

Start with TTS for accessibility improvements — it works everywhere. For STT, target Chromium browsers with cloud-based recognition, implement robust error handling, and evaluate cloud fallback services (Google Cloud Speech-to-Text, Azure Cognitive Services, AWS Transcribe) for cross-browser coverage.


Sources: W3C Web Speech API specification, MDN Web Docs, Chromium issue tracker, browser compatibility data (caniuse.com). Examples correspond to the API state as of March 2026.