The Web Speech API: Voice Recognition and Synthesis in the Browser
Summary
Web interfaces have evolved beyond clicks and keystrokes. The Web Speech API, specified under the W3C, provides two independent subsystems to integrate voice into any web application: SpeechRecognition (speech-to-text) for capturing and transcribing the user’s voice, and SpeechSynthesis (text-to-speech) for reading text aloud using the system’s available voices.
This article covers the architecture of both subsystems, shows practical implementations with code, and addresses critical caveats like browser fragmentation, the Chrome 15-second bug, and iOS restrictions.
1. Two Independent Subsystems
The API is divided into two completely separate modules:
| Subsystem | Direction | Interface | Browser Support |
|---|---|---|---|
| SpeechRecognition | User → App (microphone input) | SpeechRecognition / webkitSpeechRecognition | Limited (mainly Chromium-based) |
| SpeechSynthesis | App → User (audio output) | window.speechSynthesis + SpeechSynthesisUtterance | Universal (Baseline) |
Note: SpeechSynthesis enjoys universal adoption across all modern browsers. SpeechRecognition, however, remains experimental with fragmented support. Always implement feature detection before using either.
2. Speech Recognition (Speech-to-Text)
2.1 Feature Detection and Instantiation
Due to vendor prefixing, the constructor must be resolved defensively:
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
if (!SpeechRecognition) {
console.error('SpeechRecognition is not supported in this browser.');
}
2.2 Configuration Properties
Once instantiated, the recognition engine is configured through four key properties:
| Property | Type | Description |
|---|---|---|
| lang | string | BCP 47 language tag (e.g., "en-US", "es-ES"). Defaults to the document’s lang attribute. |
| continuous | boolean | If true, keeps listening after returning results. Default: false (stops after one result). |
| interimResults | boolean | If true, emits provisional transcriptions in real time before finalization. |
| maxAlternatives | number | Number of alternative transcriptions per result segment. Default: 1. |
const recognition = new SpeechRecognition();
recognition.lang = 'en-US';
recognition.continuous = true;
recognition.interimResults = true;
recognition.maxAlternatives = 1;
2.3 The Result Hierarchy
Results are not flat strings. They arrive as a three-level object tree:
- SpeechRecognitionResultList — a list-like collection accessible through event.results
- SpeechRecognitionResult — each item contains an isFinal boolean and one or more alternatives
- SpeechRecognitionAlternative — holds the transcript string and a confidence score (0 to 1)
recognition.onresult = (event) => {
for (let i = event.resultIndex; i < event.results.length; i++) {
const result = event.results[i];
const transcript = result[0].transcript;
const confidence = result[0].confidence;
if (result.isFinal) {
console.log(`Final: "${transcript}" (${(confidence * 100).toFixed(1)}%)`);
} else {
console.log(`Interim: "${transcript}"`);
}
}
};
2.4 Event Lifecycle
The recognition engine fires events in a deterministic sequence:
start → audiostart → soundstart → speechstart
↓
result (repeated while speaking)
↓
speechend → soundend → audioend → end
Key events to handle:
recognition.onspeechstart = () => {
console.log('Voice detected — transcription in progress...');
};
recognition.onerror = (event) => {
switch (event.error) {
case 'not-allowed':
console.error('Microphone access denied by user or policy.');
break;
case 'network':
console.error('Cloud recognition service unreachable.');
break;
case 'no-speech':
console.warn('No speech detected — only silence or background noise.');
break;
case 'aborted':
console.warn('Recognition was aborted.');
break;
}
};
recognition.onnomatch = () => {
console.warn('Speech was detected but could not be recognized.');
};
recognition.onend = () => {
console.log('Recognition session ended.');
};
2.5 Full Working Example: Live Dictation
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
if (!SpeechRecognition) {
document.getElementById('status').textContent = 'Speech Recognition not supported.';
} else {
const recognition = new SpeechRecognition();
recognition.lang = 'en-US';
recognition.continuous = true;
recognition.interimResults = true;
const finalOutput = document.getElementById('final');
const interimOutput = document.getElementById('interim');
let finalTranscript = '';
recognition.onresult = (event) => {
let interim = '';
for (let i = event.resultIndex; i < event.results.length; i++) {
const transcript = event.results[i][0].transcript;
if (event.results[i].isFinal) {
finalTranscript += transcript;
} else {
interim += transcript;
}
}
finalOutput.textContent = finalTranscript;
interimOutput.textContent = interim;
};
recognition.onerror = (event) => {
document.getElementById('status').textContent = `Error: ${event.error}`;
};
document.getElementById('start-btn').addEventListener('click', () => {
finalTranscript = '';
recognition.start();
document.getElementById('status').textContent = 'Listening...';
});
document.getElementById('stop-btn').addEventListener('click', () => {
recognition.stop();
document.getElementById('status').textContent = 'Stopped.';
});
}
3. Speech Synthesis (Text-to-Speech)
3.1 The Singleton Controller
Unlike SpeechRecognition, the synthesis engine is accessed through a single global instance:
const synth = window.speechSynthesis;
This controller manages a FIFO queue of utterances and exposes three read-only state flags:
| Property | Description |
|---|---|
| synth.pending | true if utterances are waiting in the queue |
| synth.speaking | true if audio is currently being produced |
| synth.paused | true if playback has been paused |
Control methods:
synth.speak(utterance);
synth.pause();
synth.resume();
synth.cancel();
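The state flags and control methods above combine naturally into a play/pause toggle. A minimal sketch, assuming a hypothetical helper name `toggleSpeech`; it takes any object with the `speaking`/`paused` flags and `pause()`/`resume()` methods, so it works against `window.speechSynthesis` in the browser:

```javascript
// Hypothetical helper: toggle pause/resume on a speechSynthesis-like
// controller using its read-only state flags.
// Returns the action taken so callers can update their UI.
const toggleSpeech = (synth) => {
  if (synth.speaking && !synth.paused) {
    synth.pause();
    return 'paused';
  }
  if (synth.paused) {
    synth.resume();
    return 'resumed';
  }
  return 'idle'; // nothing queued or speaking
};

// In the browser: toggleSpeech(window.speechSynthesis);
```

Because the helper only relies on the documented flags, it can be unit-tested against a plain mock object outside the browser.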
3.2 Creating Utterances
The SpeechSynthesisUtterance object is the unit of speech. It holds the text and all prosodic configuration:
| Property | Type / Range | Description |
|---|---|---|
| text | string | The text content to be spoken |
| voice | SpeechSynthesisVoice | The voice profile to use |
| lang | string (BCP 47) | Language override |
| pitch | 0 – 2 | Voice pitch (1 = normal) |
| rate | 0.1 – 10 | Speaking speed (1 = normal) |
| volume | 0 – 1 | Audio volume (1 = max) |
const utterance = new SpeechSynthesisUtterance('Hello, welcome to the application.');
utterance.lang = 'en-US';
utterance.pitch = 1;
utterance.rate = 1;
utterance.volume = 0.8;
synth.speak(utterance);
3.3 Loading Voices Asynchronously
Voices are loaded asynchronously by the browser (especially on Chromium). A synchronous call to getVoices() during page load will often return an empty array. You must listen for the voiceschanged event:
const synth = window.speechSynthesis;
let voices = [];
const loadVoices = () => {
voices = synth.getVoices();
console.log(`${voices.length} voices loaded.`);
};
loadVoices();
if (synth.onvoiceschanged !== undefined) {
synth.onvoiceschanged = loadVoices;
}
Each SpeechSynthesisVoice object exposes a human-readable name, a BCP 47 lang tag, and a default flag:
voices.forEach((voice) => {
console.log(`${voice.name} (${voice.lang}) ${voice.default ? '— DEFAULT' : ''}`);
});
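In practice you rarely want the raw list; you want the best voice for a given language. A small sketch of that selection logic — `pickVoice` is a hypothetical helper name, and the fallback order (exact tag, then same base language, then platform default) is one reasonable choice, not a prescribed one:

```javascript
// Hypothetical helper: pick the best matching voice for a BCP 47 tag.
// Works on the array returned by synth.getVoices().
const pickVoice = (voices, lang) => {
  // 1. Exact tag match, e.g. "en-GB"
  const exact = voices.find((v) => v.lang === lang);
  if (exact) return exact;
  // 2. Same base language, e.g. any "en-*" voice
  const base = lang.split('-')[0];
  const sameLanguage = voices.find((v) => v.lang.startsWith(base));
  // 3. Platform default, then first available, then null
  return sameLanguage || voices.find((v) => v.default) || voices[0] || null;
};

// In the browser:
// utterance.voice = pickVoice(speechSynthesis.getVoices(), 'en-GB');
```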
3.4 Full Working Example: Text Reader with Voice Selection
const synth = window.speechSynthesis;
let voices = [];
const voiceSelect = document.getElementById('voice-select');
const textInput = document.getElementById('text-input');
const populateVoices = () => {
voices = synth.getVoices();
voiceSelect.innerHTML = '';
voices.forEach((voice, index) => {
const option = document.createElement('option');
option.value = index;
option.textContent = `${voice.name} (${voice.lang})`;
voiceSelect.appendChild(option);
});
};
populateVoices();
if (synth.onvoiceschanged !== undefined) {
synth.onvoiceschanged = populateVoices;
}
document.getElementById('speak-btn').addEventListener('click', () => {
synth.cancel();
const utterance = new SpeechSynthesisUtterance(textInput.value);
utterance.voice = voices[voiceSelect.value];
utterance.pitch = parseFloat(document.getElementById('pitch').value);
utterance.rate = parseFloat(document.getElementById('rate').value);
utterance.onstart = () => {
document.getElementById('status').textContent = 'Speaking...';
};
utterance.onend = () => {
document.getElementById('status').textContent = 'Done.';
};
utterance.onerror = (event) => {
document.getElementById('status').textContent = `Error: ${event.error}`;
};
synth.speak(utterance);
});
document.getElementById('stop-btn').addEventListener('click', () => {
synth.cancel();
document.getElementById('status').textContent = 'Cancelled.';
});
4. Cloud vs. On-Device Processing
SpeechRecognition can route audio through two radically different pipelines:
4.1 Cloud Processing (Default)
By default, Chrome serializes the audio stream and sends it to Google’s cloud servers for transcription. This provides high accuracy (deep neural models, massive training datasets) but introduces trade-offs:
- Latency: network round-trips degrade real-time experiences
- Privacy: raw audio data leaves the device
- Offline: completely non-functional without internet
4.2 On-Device Processing (Experimental)
Modern browsers are introducing local inference using on-device models. This eliminates network latency, works offline, and keeps audio private:
const recognition = new SpeechRecognition();
const availability = await SpeechRecognition.available();
if (availability === 'available') {
recognition.start();
} else if (availability === 'downloadable') {
await SpeechRecognition.install();
recognition.start();
} else {
console.warn('On-device recognition not supported — falling back to cloud.');
recognition.start();
}
Note: SpeechRecognition.available() and SpeechRecognition.install() are experimental APIs. They allow checking whether language models are cached locally and triggering their download. Support is limited to recent Chromium versions.
5. Security Model
The Web Speech API enforces strict security constraints to prevent passive eavesdropping:
- User Gesture Required: recognition.start() must be called within a user-initiated event (click, tap). Calling it programmatically without a gesture throws an error.
- HTTPS Only: the entire API is restricted to secure contexts. HTTP pages cannot access the microphone.
- Visual Indicators: browsers display persistent, non-hideable recording indicators (tab icons, status bar alerts) while the microphone is active.
- Permissions-Policy: server administrators can block microphone access at the HTTP header level for cross-origin iframes:
Permissions-Policy: microphone=()
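The empty allowlist above blocks the microphone everywhere. The standard allowlist syntax also supports finer-grained policies; two common variants (the embedded origin here is a placeholder):

```
Permissions-Policy: microphone=(self)
Permissions-Policy: microphone=(self "https://trusted.example")
```

The first allows microphone access for same-origin documents only; the second additionally delegates it to one trusted embedded origin.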
6. Browser Compatibility (2026)
| Browser | SpeechSynthesis (TTS) | SpeechRecognition (STT) | Notes |
|---|---|---|---|
| Chrome | ✅ Full (v33+) | ✅ Full (v25+, prefixed) | Primary implementor. Cloud-based by default. |
| Edge | ✅ Full (v14+) | ⚠️ Unreliable | Exposes the interface but events may never fire. |
| Firefox | ✅ Full (v49+) | ❌ Not implemented | Philosophical objection to routing voice biometrics to third-party clouds. |
| Safari | ✅ Full (v7+) | ⚠️ Partial (v14.1+) | Requires webkit prefix. Depends on Siri being enabled. |
| Opera | ✅ Full (v21+) | ❌ Not implemented | Mirrors Edge/Firefox limitations. |
| Brave | ⚠️ Variable | ❌ Blocked | Blocks cloud STT connections for privacy. |
6.1 Defensive Feature Detection
Always verify support before using either subsystem:
const hasSpeechRecognition = 'SpeechRecognition' in window || 'webkitSpeechRecognition' in window;
const hasSpeechSynthesis = 'speechSynthesis' in window;
if (!hasSpeechRecognition) {
console.warn('SpeechRecognition unavailable. Consider a cloud fallback (Google Cloud STT, Azure, AWS Transcribe).');
}
if (!hasSpeechSynthesis) {
console.warn('SpeechSynthesis unavailable.');
}
7. Known Issues and Workarounds
7.1 Chrome’s 15-Second TTS Bug
Chrome’s cloud-backed voices abort utterances longer than ~15 seconds without firing end or error events, leaving the UI in a “speaking” state forever. The fix is to chunk long texts into shorter utterances:
const speakLongText = (text) => {
  const synth = window.speechSynthesis;
  synth.cancel();
  // Split on sentence boundaries; each chunk stays well under the ~15 s limit
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  sentences.forEach((sentence) => {
    const utterance = new SpeechSynthesisUtterance(sentence.trim());
    // Optionally assign utterance.voice here before queuing
    synth.speak(utterance); // FIFO queue — chunks play back to back
  });
};
7.2 Safari iOS: TTS Requires a “Prime” Gesture
Safari on iOS blocks TTS audio until the user triggers a gesture. To work around this, silently “prime” the audio context on the first user interaction:
document.addEventListener('click', () => {
const primer = new SpeechSynthesisUtterance('');
window.speechSynthesis.speak(primer);
}, { once: true });
After this, subsequent calls to synth.speak() will work without requiring additional gestures.
7.3 SpeechRecognition in iOS PWAs
When a web page is added to the home screen on iOS (“Add to Home Screen”), webkitSpeechRecognition is completely disabled. All calls throw NotAllowed errors without even showing the permission prompt. There is no workaround within the Web Speech API — a cloud-based STT service must be used instead.
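Because the failure mode is silent (the constructor exists but every call is rejected), it helps to choose the STT backend up front rather than waiting for the error. A sketch, assuming a hypothetical helper name `chooseSttBackend`; in the browser, standalone mode can be detected via `navigator.standalone` (iOS Safari) or `matchMedia('(display-mode: standalone)').matches`:

```javascript
// Hypothetical helper: decide the STT backend before wiring up the UI,
// so users are never shown a microphone button that immediately fails.
// hasNativeSTT    — feature detection of (webkit)SpeechRecognition
// isStandalonePWA — running as an installed home-screen app
// isIOS           — platform detection result
const chooseSttBackend = (hasNativeSTT, isStandalonePWA, isIOS) => {
  if (!hasNativeSTT) return 'cloud';
  // iOS home-screen apps expose the constructor but reject every call
  if (isIOS && isStandalonePWA) return 'cloud';
  return 'native';
};
```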
8. Accessibility Benefits
The Web Speech API directly improves accessibility for:
- Motor-impaired users: voice commands replace keyboard and mouse interactions
- Visually-impaired users: TTS reads content aloud, complementing screen readers
- Low-literacy users: audio feedback reduces dependency on reading ability
- Hands-free scenarios: cooking, driving, or medical environments where touch/type input is impractical
Best practices when integrating speech into accessible interfaces:
- Dismiss virtual keyboards by calling element.blur() when recognition starts, reclaiming viewport space on mobile
- Show interim results visually so the user sees real-time feedback during dictation
- Use the boundary event of TTS to highlight text as it is being read, providing a synchronized visual-audio experience
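The boundary-event technique in the last practice can be sketched as follows. The word extraction is a small pure helper (`wordAtIndex` is an assumed name); the commented browser wiring uses the real `onboundary` handler, whose events carry a `name` ('word' or 'sentence') and a `charIndex`. The element id is an assumption:

```javascript
// Extract the word starting at charIndex — used to mirror TTS progress.
const wordAtIndex = (text, charIndex) => {
  const rest = text.slice(charIndex);
  const match = rest.match(/^\S+/);
  return match ? match[0] : '';
};

// Browser wiring (element id is an assumption):
// const utterance = new SpeechSynthesisUtterance(text);
// utterance.onboundary = (event) => {
//   if (event.name === 'word') {
//     document.getElementById('highlight').textContent =
//       wordAtIndex(text, event.charIndex);
//   }
// };
// speechSynthesis.speak(utterance);
```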
9. Conclusion
The Web Speech API provides a powerful bridge between voice and the web platform. SpeechSynthesis is production-ready across all browsers, while SpeechRecognition requires careful feature detection and fallback strategies due to fragmented support.
Start with TTS for accessibility improvements — it works everywhere. For STT, target Chromium browsers with cloud-based recognition, implement robust error handling, and evaluate cloud fallback services (Google Cloud Speech-to-Text, Azure Cognitive Services, AWS Transcribe) for cross-browser coverage.
Sources: W3C Web Speech API specification, MDN Web Docs, Chromium issue tracker, browser compatibility data (caniuse.com). Examples correspond to the API state as of March 2026.