Overview
Cartesia provides two STT service implementations:
- CartesiaSTTService for real-time speech recognition using Cartesia’s WebSocket API with the ink-whisper model, supporting streaming transcription with both interim and final results for low-latency applications
- CartesiaTurnsSTTService for turn-based speech recognition using Cartesia’s v2 WebSocket API with the ink-2 model, where the server drives turn boundaries and pushes structured events for turn lifecycle management including start, updates, eager end predictions, resume, and final turn completion
Cartesia STT API Reference
Pipecat’s API methods for Cartesia STT integration
Cartesia Turns STT API Reference
Pipecat’s API methods for Cartesia Turns STT integration
Standard STT Example
Complete example with transcription logging
Turns STT Example
Complete example with turn-based transcription
Cartesia Documentation
Official Cartesia STT documentation and features
Cartesia Platform
Access API keys and transcription models
Installation
To use Cartesia services, install the required dependency:
Prerequisites
Cartesia Account Setup
Before using Cartesia STT services, you need:
- Cartesia Account: Sign up at Cartesia
- API Key: Generate an API key from your account dashboard
- Model Access: Ensure access to the transcription model you plan to use (ink-whisper for CartesiaSTTService, ink-2 for CartesiaTurnsSTTService)
Required Environment Variables
CARTESIA_API_KEY: Your Cartesia API key for authentication
CartesiaSTTService
Cartesia API key for authentication.
Custom API endpoint URL. If empty, defaults to "api.cartesia.ai". Override for proxied deployments.
Audio encoding format.
Audio sample rate in Hz.
Configuration options for the transcription service. Deprecated in v0.0.105. Use settings=CartesiaSTTService.Settings(...) for model/language and direct init parameters for encoding/sample_rate instead.
Runtime-configurable settings for the STT service. See Settings below.
P99 latency from speech end to final transcript in seconds. Override for your
deployment.
Settings
Runtime-configurable settings passed via the settings constructor argument using CartesiaSTTService.Settings(...). These can be updated mid-conversation with STTUpdateSettingsFrame, which triggers an automatic reconnection with the new parameters. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "ink-whisper" | The transcription model to use. (Inherited from base STT settings.) |
| language | Language \| str | "en" | Target language for transcription. (Inherited from base STT settings.) |
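The reconnection behaviour triggered by a settings update (deferred until the current user turn ends, with audio buffered while the new connection comes up, as the Notes describe) can be modelled with a small state machine. This is a toy sketch for illustration only, not the service's implementation: the class `DeferredReconnect` and all of its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DeferredReconnect:
    """Toy model: park a language change until the user turn ends."""

    language: str = "en"
    user_speaking: bool = False
    reconnecting: bool = False
    pending_language: Optional[str] = None
    buffered: List[bytes] = field(default_factory=list)
    sent: List[Tuple[str, bytes]] = field(default_factory=list)

    def update_language(self, language: str) -> None:
        # Mid-turn updates are parked; otherwise reconnect right away.
        if self.user_speaking:
            self.pending_language = language
        else:
            self._start_reconnect(language)

    def user_started_speaking(self) -> None:
        self.user_speaking = True

    def user_stopped_speaking(self) -> None:
        # The turn is over, so a parked update can now be applied.
        self.user_speaking = False
        if self.pending_language is not None:
            self._start_reconnect(self.pending_language)
            self.pending_language = None

    def push_audio(self, chunk: bytes) -> None:
        # Audio arriving during a reconnect is buffered, not dropped.
        if self.reconnecting:
            self.buffered.append(chunk)
        else:
            self.sent.append((self.language, chunk))

    def connection_ready(self) -> None:
        # Replay buffered audio in arrival order on the new connection.
        self.reconnecting = False
        for chunk in self.buffered:
            self.sent.append((self.language, chunk))
        self.buffered.clear()

    def _start_reconnect(self, language: str) -> None:
        self.language = language
        self.reconnecting = True
```

In this model, an update requested while the user is speaking leaves the in-progress turn on the old language, and the first audio sent with the new language is whatever arrived during the reconnect.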
Usage
Basic Setup
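A minimal sketch of basic setup. It assumes the service is importable from `pipecat.services.cartesia.stt` and that the constructor accepts the key as `api_key`; neither is confirmed on this page, so check both against the API reference. The `settings` argument and its `model` and `language` fields come from the Settings section above. The helper name `build_cartesia_stt` is hypothetical, and the import is deferred into the function so the snippet loads even where Pipecat is not installed.

```python
import os


def build_cartesia_stt():
    """Create a CartesiaSTTService from the CARTESIA_API_KEY environment variable."""
    # Assumed import path; adjust to match your installed Pipecat version.
    from pipecat.services.cartesia.stt import CartesiaSTTService

    return CartesiaSTTService(
        # Assumed keyword name for the API key parameter.
        api_key=os.environ["CARTESIA_API_KEY"],
        # Model and language values are the defaults listed in the Settings table.
        settings=CartesiaSTTService.Settings(model="ink-whisper", language="en"),
    )
```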
With Custom Options
Notes
- Inactivity timeout: Cartesia disconnects WebSocket connections after 3 minutes of inactivity. The timeout resets with each message sent. Silence-based keepalive is enabled by default to prevent disconnections.
- Auto-reconnect on send: If the connection is closed (e.g., due to timeout), the service automatically reconnects when the next audio data is sent.
- Runtime settings updates: Changing settings (e.g., language or model) via STTUpdateSettingsFrame triggers a reconnection with the new parameters. To avoid audio loss, reconnection is deferred until the current user turn ends (i.e., until UserStoppedSpeakingFrame is received). Audio frames arriving during the reconnect are buffered and replayed once the new connection is ready. This enables safe dynamic language switching mid-conversation.
- Finalize on VAD stop: When the pipeline’s VAD detects the user has stopped speaking, the service sends a "finalize" command to flush the transcription session and produce a final result.
Event Handlers
Cartesia STT supports the standard service connection events:
| Event | Description |
|---|---|
| on_connected | Connected to Cartesia WebSocket |
| on_disconnected | Disconnected from Cartesia WebSocket |
CartesiaTurnsSTTService
The server drives turn boundaries with the ink-2 model, pushing structured events for turn lifecycle management including start, updates, eager end predictions, resume, and final turn completion.
Cartesia API key for authentication.
WebSocket URL for the Cartesia Streaming ASR v2 endpoint.
Audio sample rate in Hz. If None, uses the pipeline sample rate.
Whether to broadcast an interruption when the server signals the start of a new turn.
Minimum idle timeout (in seconds) before sending silence to prevent dangling turns. The actual threshold is max(chunk_duration * 2, watchdog_min_timeout).
Optional additional HTTP headers to send with the WebSocket handshake.
Runtime-updatable settings. See Settings below.
Settings
Runtime-configurable settings passed via the settings constructor argument using CartesiaTurnsSTTService.Settings(...). The ink-2 model family is English-only and does not support runtime model or language switching. Attempts to update these fields will be reported as unhandled.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "ink-2" | The transcription model to use. (Inherited from base STT settings.) |
| language | Language \| str | None | Target language (fixed to English). (Inherited from base STT settings.) |
Usage
Basic Setup
With Custom Configuration
With Event Handlers
Turn-Based Protocol
The service speaks the v2 turn-based wire protocol:
- turn.start: Server detected the start of a turn. Pushes UserStartedSpeakingFrame and optionally broadcasts an interruption.
- turn.update: Incremental transcript update. Pushes InterimTranscriptionFrame.
- turn.eager_end: Server eagerly predicted the end of turn. Available via event handler for speculative downstream processing.
- turn.resume: User resumed speaking after an eager end. Available via event handler.
- turn.end: Final transcript for the completed turn. Pushes TranscriptionFrame and UserStoppedSpeakingFrame.
There is no is_final flag and no finalize command; closing the socket ends the session.
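The mapping from protocol events to pushed frames can be summarised as a lookup table. The sketch below is illustrative rather than the service's actual dispatch code: the event and frame names come from the description above, while the JSON field name `"type"` and the function `frame_names_for_message` are assumptions made for the example.

```python
import json
from typing import List

# Frames pushed for each server event, per the protocol description.
FRAMES_BY_EVENT = {
    "turn.start": ["UserStartedSpeakingFrame"],
    "turn.update": ["InterimTranscriptionFrame"],
    "turn.eager_end": [],  # surfaced through an event handler only
    "turn.resume": [],  # surfaced through an event handler only
    "turn.end": ["TranscriptionFrame", "UserStoppedSpeakingFrame"],
}


def frame_names_for_message(raw: str) -> List[str]:
    """Return the names of the frames pushed for one server message."""
    message = json.loads(raw)
    # "type" is an assumed field name for the event kind.
    event = message.get("type")
    if event not in FRAMES_BY_EVENT:
        raise ValueError(f"unknown event type: {event!r}")
    return list(FRAMES_BY_EVENT[event])
```

Note that turn.eager_end and turn.resume push no frames in this model, which matches their role as handler-only signals.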
Notes
- English-only: The ink-2 model family supports English transcription only at launch.
- No runtime model switching: Unlike the v1 API, the ink-2 model does not support runtime model or language switching.
- Watchdog for dangling turns: If audio stops flowing after a turn.start, the service sends silence to prevent the turn from hanging indefinitely. Configure the threshold with watchdog_min_timeout.
- Server-driven turns: The server controls turn boundaries. There is no client-side finalize command.
- Interruption support: Set should_interrupt=True to broadcast interruptions when the user starts speaking, enabling natural turn-taking.
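As a worked example of the watchdog threshold, the formula max(chunk_duration * 2, watchdog_min_timeout) can be evaluated for a few chunk sizes. The helper names below are hypothetical, the chunk duration is computed assuming mono 16-bit PCM, and the 0.5 second minimum used in the usage lines is an arbitrary illustrative value, not the library default.

```python
def chunk_duration_seconds(num_bytes: int, sample_rate: int, bytes_per_sample: int = 2) -> float:
    """Duration of one mono PCM chunk in seconds (16-bit samples by default)."""
    return num_bytes / (sample_rate * bytes_per_sample)


def watchdog_threshold(chunk_duration: float, watchdog_min_timeout: float) -> float:
    """Idle time before silence is sent: max(chunk_duration * 2, watchdog_min_timeout)."""
    return max(chunk_duration * 2, watchdog_min_timeout)


# 640 bytes at 16 kHz is a 20 ms chunk, so the minimum timeout dominates.
short_chunk = chunk_duration_seconds(640, 16000)
print(watchdog_threshold(short_chunk, 0.5))

# A 400 ms chunk doubles to 800 ms, which exceeds the minimum timeout.
print(watchdog_threshold(0.4, 0.5))
```

With small chunks the threshold is simply watchdog_min_timeout; the chunk-based term only matters when chunks are longer than half the configured minimum.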
Event Handlers
Cartesia Turns STT supports the following event handlers:
| Event | Handler Signature | Description |
|---|---|---|
| on_connected | async def(service) | Connected to Cartesia WebSocket |
| on_disconnected | async def(service) | Disconnected from Cartesia WebSocket |
| on_connection_error | async def(service, error_msg) | Connection error occurred |
| on_turn_start | async def(service, transcript: str) | Server detected start of a turn |
| on_turn_update | async def(service, transcript: str) | Incremental transcript update |
| on_turn_eager_end | async def(service, transcript: str) | Server eagerly predicted end of turn |
| on_turn_resume | async def(service) | User resumed speaking after an eager end |
| on_turn_end | async def(service, transcript: str) | Final transcript for the completed turn |
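One way the eager-end, resume, and end handlers might cooperate for speculative processing is sketched below. It is a self-contained illustration, not part of Pipecat: the class `SpeculativeResponder` and its `_draft` placeholder are hypothetical, and only the handler signatures are taken from the table above. Registration of the handlers on the service is left out; consult the API reference for that step.

```python
import asyncio
from typing import List, Optional


class SpeculativeResponder:
    """Start work on eager end, discard it on resume, commit it on turn end."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None
        self._drafted_for: Optional[str] = None
        self.committed: List[str] = []

    async def _draft(self, transcript: str) -> str:
        # Placeholder for speculative work, such as preparing a reply.
        await asyncio.sleep(0)
        return f"draft for: {transcript}"

    async def on_turn_eager_end(self, service, transcript: str) -> None:
        self._cancel()
        self._drafted_for = transcript
        self._task = asyncio.create_task(self._draft(transcript))

    async def on_turn_resume(self, service) -> None:
        # The prediction was wrong; throw the speculative work away.
        self._cancel()

    async def on_turn_end(self, service, transcript: str) -> None:
        task, self._task = self._task, None
        if task is not None and self._drafted_for == transcript:
            # The eager prediction matched the final transcript; reuse it.
            result = await task
        else:
            if task is not None:
                task.cancel()
            result = await self._draft(transcript)
        self.committed.append(result)

    def _cancel(self) -> None:
        if self._task is not None:
            self._task.cancel()
            self._task = None


async def demo() -> List[str]:
    responder = SpeculativeResponder()
    service = object()  # stand-in for the STT service instance
    await responder.on_turn_eager_end(service, "book a table")
    await responder.on_turn_resume(service)
    await responder.on_turn_eager_end(service, "book a table for two")
    await responder.on_turn_end(service, "book a table for two")
    return responder.committed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The draft started by the first eager end is cancelled when the user resumes, so only work based on the final transcript is ever committed.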