Pipecat speech to text
Use the Speechmatics STT service to transcribe live audio in your Pipecat voice bots.
Features
- Real-time transcription — Low-latency streaming with partial (interim) results
- Turn detection — Adaptive, fixed, ML-based, or external control modes
- Speaker diarization — Identify and attribute speech to different speakers
- Speaker filtering — Focus on specific speakers or ignore others (like the assistant)
- Custom vocabulary — Boost recognition for domain-specific terms and proper nouns
- Output formatting — Configurable templates for multi-speaker transcripts
Installation
pip install "pipecat-ai[speechmatics]"
Basic configuration
Authentication
By default, the service reads your API key from the SPEECHMATICS_API_KEY environment variable.
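If you prefer to manage the key yourself (for example, via a secrets store), you can fetch it and pass it explicitly. A minimal sketch; the `api_key` constructor parameter is an assumption based on the convention other Pipecat STT services follow, so check it against your version:

```python
import os

# Read the key from the environment (or any secrets manager you use) and
# pass it to the service explicitly instead of relying on auto-detection.
api_key = os.environ.get("SPEECHMATICS_API_KEY", "")

# stt = SpeechmaticsSTTService(api_key=api_key)
```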
Service options
Input parameters
These are passed via params=SpeechmaticsSTTService.InputParams(...):
Example
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
stt = SpeechmaticsSTTService(
params=SpeechmaticsSTTService.InputParams(
language="en",
operating_point=SpeechmaticsSTTService.OperatingPoint.ENHANCED,
),
)
Advanced configuration
Turn detection
Turn detection determines when a user has finished a complete thought; by contrast, the real-time API's EndOfUtterance message only signals a pause in speech. The service handles this distinction automatically.
Modes
Set turn_detection_mode to control how end of speech is detected:
Start with EXTERNAL mode. This lets you use Pipecat's turn detection features (like LocalSmartTurnAnalyzerV3) which are well-integrated with the pipeline. Only switch to other modes if you need Speechmatics to handle turn detection directly.
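In EXTERNAL mode, end-of-turn decisions come from the pipeline rather than from Speechmatics, so the turn analyzer is attached to the transport. A rough sketch of that pairing; the module paths and the `turn_analyzer` transport parameter reflect recent Pipecat releases and may differ in yours:

```python
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.transports.base_transport import TransportParams

# With turn_detection_mode=EXTERNAL on the STT service, the transport's
# turn analyzer decides when the user's turn has ended.
transport_params = TransportParams(
    audio_in_enabled=True,
    turn_analyzer=LocalSmartTurnAnalyzerV3(),
)
```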
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
# External mode (default, recommended) - use Pipecat's turn detection
stt = SpeechmaticsSTTService(
params=SpeechmaticsSTTService.InputParams(
turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.EXTERNAL,
),
)
# Adaptive mode - Speechmatics determines end-of-turn
stt = SpeechmaticsSTTService(
params=SpeechmaticsSTTService.InputParams(
turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.ADAPTIVE,
),
)
# Fixed mode - consistent silence threshold
stt = SpeechmaticsSTTService(
params=SpeechmaticsSTTService.InputParams(
turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.FIXED,
end_of_utterance_silence_trigger=0.8, # 800ms of silence
end_of_utterance_max_delay=5.0, # Force end after 5s
),
)
# Smart turn mode - Speechmatics ML-based turn detection
stt = SpeechmaticsSTTService(
params=SpeechmaticsSTTService.InputParams(
turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.SMART_TURN,
),
)
When using ADAPTIVE or SMART_TURN modes, remove any competing VAD or turn-detection features from your pipeline to avoid conflicts.
Advanced diarization
The service can attribute words to speakers and lets you decide which speakers are treated as active (foreground) vs passive (background).
Configuration
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
stt = SpeechmaticsSTTService(
params=SpeechmaticsSTTService.InputParams(
enable_diarization=True,
speaker_sensitivity=0.7,
max_speakers=3,
prefer_current_speaker=True,
additional_vocab=[
SpeechmaticsSTTService.AdditionalVocabEntry(content="Speechmatics"),
SpeechmaticsSTTService.AdditionalVocabEntry(content="API", sounds_like=["A P I"]),
],
),
)
Known speakers
Use known_speakers to attribute words to specific speakers across sessions. This is useful when you want consistent speaker identification for known participants.
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
stt = SpeechmaticsSTTService(
params=SpeechmaticsSTTService.InputParams(
enable_diarization=True,
known_speakers=[
SpeechmaticsSTTService.SpeakerIdentifier(label="Alice", speaker_identifiers=["speaker_abc123"]),
SpeechmaticsSTTService.SpeakerIdentifier(label="Bob", speaker_identifiers=["speaker_def456"]),
],
),
)
Speaker identifiers are unique to each Speechmatics account and can be obtained from a previous transcription session.
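One way to carry identifiers across sessions is to persist them after a call and load them back next time. A sketch only; the JSON file layout is illustrative, not a Pipecat convention:

```python
import json
import tempfile
from pathlib import Path

# Persist speaker identifiers captured in one session so a later session
# can pass them back as known_speakers.
speakers = {"Alice": ["speaker_abc123"], "Bob": ["speaker_def456"]}

path = Path(tempfile.gettempdir()) / "known_speakers.json"
path.write_text(json.dumps(speakers))

# In the next session, rebuild the known_speakers list from the file.
saved = json.loads(path.read_text())
# known_speakers=[
#     SpeechmaticsSTTService.SpeakerIdentifier(label=name, speaker_identifiers=ids)
#     for name, ids in saved.items()
# ]
```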
Speaker focus
Control which speakers are treated as active (foreground) vs passive (background):
- Active speakers are the speakers you care about in your application. They generate FINAL_TRANSCRIPT events.
- Passive speakers are still transcribed, but their words are buffered and only included in the output alongside new words from active speakers.
Focus modes
- SpeakerFocusMode.RETAIN keeps non-focused speakers as passive.
- SpeakerFocusMode.IGNORE discards non-focused speaker words entirely.
ignore_speakers always excludes those speakers from transcription, and their speech will not trigger VAD or end-of-utterance detection.
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
stt = SpeechmaticsSTTService(
params=SpeechmaticsSTTService.InputParams(
focus_speakers=["S1"],
focus_mode=SpeechmaticsSTTService.SpeakerFocusMode.RETAIN,
ignore_speakers=["S3"],
),
)
Speaker formatting
Use speaker_active_format and speaker_passive_format to format transcripts for your LLM.
The templates support {speaker_id}, {text}, {ts}, {start_time}, {end_time}, and {lang}.
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
stt = SpeechmaticsSTTService(
params=SpeechmaticsSTTService.InputParams(
speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
speaker_passive_format="<{speaker_id} background>{text}</{speaker_id}>",
),
)
When you use a custom format, include it in your bot's system prompt so the LLM can interpret speaker tags consistently.
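A hypothetical system-prompt snippet matching the tag format above; the exact wording is up to you:

```python
# Tell the LLM how to read the speaker tags emitted by the custom
# formats, so it can attribute words and ignore background speech.
SPEAKER_PROMPT = (
    "Transcripts are tagged by speaker: <S1>...</S1> wraps words from "
    "speaker S1. Tags marked 'background' come from passive speakers; "
    "use them for context only and respond to the active speakers."
)
```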
Updating speakers during transcription
You can dynamically change which speakers to focus on or ignore during an active transcription session using the update_params() method.
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
stt = SpeechmaticsSTTService(
params=SpeechmaticsSTTService.InputParams(enable_diarization=True),
)
# Later, during transcription:
stt.update_params(
SpeechmaticsSTTService.UpdateParams(
focus_speakers=["S1", "S2"],
ignore_speakers=["S3"],
focus_mode=SpeechmaticsSTTService.SpeakerFocusMode.RETAIN,
)
)
This is useful when you need to adjust speaker filtering based on runtime conditions, such as when a new participant joins or leaves a conversation.
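For example, you might recompute the focus list whenever the participant roster changes. How participants map to Speechmatics speaker labels (S1, S2, ...) is application-specific; this helper only sketches the bookkeeping, with hypothetical names throughout:

```python
# Recompute which speaker labels to focus on, given a mapping from
# participant IDs to Speechmatics speaker labels and a set to ignore.
def focus_list(participants: dict[str, str], ignore: set[str]) -> list[str]:
    """Return sorted speaker labels to focus on, excluding ignored ones."""
    return sorted(label for label in participants.values() if label not in ignore)

labels = focus_list({"alice": "S1", "bob": "S2", "bot": "S3"}, ignore={"S3"})
# stt.update_params(
#     SpeechmaticsSTTService.UpdateParams(focus_speakers=labels)
# )
```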
Example
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
stt = SpeechmaticsSTTService(
params=SpeechmaticsSTTService.InputParams(
# Service options
language="en",
operating_point=SpeechmaticsSTTService.OperatingPoint.ENHANCED,
# Turn detection
turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.EXTERNAL,
max_delay=1.5,
include_partials=True,
# Diarization
enable_diarization=True,
speaker_sensitivity=0.6,
max_speakers=4,
prefer_current_speaker=True,
# Speaker focus
focus_speakers=["S1", "S2"],
focus_mode=SpeechmaticsSTTService.SpeakerFocusMode.RETAIN,
ignore_speakers=[],
# Output formatting
speaker_active_format="[{speaker_id}]: {text}",
speaker_passive_format="[{speaker_id} (background)]: {text}",
# Custom vocabulary
additional_vocab=[
SpeechmaticsSTTService.AdditionalVocabEntry(content="Speechmatics"),
SpeechmaticsSTTService.AdditionalVocabEntry(content="Pipecat", sounds_like=["pipe cat"]),
],
),
)
Next steps
- Quickstart — Build a complete voice bot
- Text to speech — Use Speechmatics voices in your bot
- Pipecat documentation — Full Speechmatics STT reference