Overview

Cartesia provides two STT service implementations:
  • CartesiaSTTService for real-time speech recognition using Cartesia’s WebSocket API with the ink-whisper model, supporting streaming transcription with both interim and final results for low-latency applications
  • CartesiaTurnsSTTService for turn-based speech recognition using Cartesia’s v2 WebSocket API with the ink-2 model, where the server drives turn boundaries and pushes structured events for turn lifecycle management including start, updates, eager end predictions, resume, and final turn completion

Cartesia STT API Reference

Pipecat’s API methods for Cartesia STT integration

Cartesia Turns STT API Reference

Pipecat’s API methods for Cartesia Turns STT integration

Standard STT Example

Complete example with transcription logging

Turns STT Example

Complete example with turn-based transcription

Cartesia Documentation

Official Cartesia STT documentation and features

Cartesia Platform

Access API keys and transcription models

Installation

To use Cartesia services, install the required dependency:
uv add "pipecat-ai[cartesia]"

Prerequisites

Cartesia Account Setup

Before using Cartesia STT services, you need:
  1. Cartesia Account: Sign up at Cartesia
  2. API Key: Generate an API key from your account dashboard
  3. Model Access: Ensure access to the transcription model you plan to use (ink-whisper for CartesiaSTTService, ink-2 for CartesiaTurnsSTTService)

Required Environment Variables

  • CARTESIA_API_KEY: Your Cartesia API key for authentication
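
The usage examples on this page read the key with os.getenv, which returns None when the variable is missing. A minimal sketch of failing early with a clear message instead, assuming a hypothetical helper named load_cartesia_api_key (it is not part of Pipecat):

```python
import os


def load_cartesia_api_key(env=os.environ) -> str:
    """Return CARTESIA_API_KEY, or raise if it is unset or blank.

    Hypothetical helper for illustration; `env` is injectable so the
    function can be exercised without touching the real environment.
    """
    api_key = env.get("CARTESIA_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError(
            "CARTESIA_API_KEY is not set; export it before starting the pipeline."
        )
    return api_key
```

The returned value can then be passed as api_key= to either service constructor.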

CartesiaSTTService

api_key
str
required
Cartesia API key for authentication.
base_url
str
default:""
Custom API endpoint URL. If empty, defaults to "api.cartesia.ai". Override for proxied deployments.
encoding
str
default:"pcm_s16le"
Audio encoding format.
sample_rate
int
default:"None"
Audio sample rate in Hz.
live_options
CartesiaLiveOptions | None
default:"None"
deprecated
Configuration options for the transcription service. Deprecated in v0.0.105. Use settings=CartesiaSTTService.Settings(...) for model/language and direct init parameters for encoding/sample_rate instead.
settings
CartesiaSTTService.Settings
default:"None"
Runtime-configurable settings for the STT service. See Settings below.
ttfs_p99_latency
float
default:"CARTESIA_TTFS_P99"
P99 latency from speech end to final transcript in seconds. Override for your deployment.

Settings

Runtime-configurable settings passed via the settings constructor argument using CartesiaSTTService.Settings(...). These can be updated mid-conversation with STTUpdateSettingsFrame, which triggers an automatic reconnection with the new parameters. See Service Settings for details.
Parameter   Type             Default         Description
model       str              "ink-whisper"   The transcription model to use. (Inherited from base STT settings.)
language    Language | str   "en"            Target language for transcription. (Inherited from base STT settings.)

Usage

Basic Setup

import os

from pipecat.services.cartesia.stt import CartesiaSTTService

# Read the API key from the environment rather than hard-coding it.
stt = CartesiaSTTService(
    api_key=os.getenv("CARTESIA_API_KEY"),
)

With Custom Options

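import os

from pipecat.services.cartesia.stt import CartesiaSTTService

stt = CartesiaSTTService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    # Model and language are runtime-configurable settings.
    settings=CartesiaSTTService.Settings(
        model="ink-whisper",
        language="es",
    ),
    # The sample rate is a direct init parameter, not part of Settings.
    sample_rate=16000,
)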

Notes

  • Inactivity timeout: Cartesia disconnects WebSocket connections after 3 minutes of inactivity. The timeout resets with each message sent. Silence-based keepalive is enabled by default to prevent disconnections.
  • Auto-reconnect on send: If the connection is closed (e.g., due to timeout), the service automatically reconnects when the next audio data is sent.
  • Runtime settings updates: Changing settings (e.g., language or model) via STTUpdateSettingsFrame triggers a reconnection with the new parameters. To avoid audio loss, reconnection is deferred until the current user turn ends (i.e., until UserStoppedSpeakingFrame is received). Audio frames arriving during the reconnect are buffered and replayed once the new connection is ready. This enables safe dynamic language switching mid-conversation.
  • Finalize on VAD stop: When the pipeline’s VAD detects the user has stopped speaking, the service sends a "finalize" command to flush the transcription session and produce a final result.
The InputParams / params= / live_options= pattern is deprecated as of v0.0.105. Use Settings / settings= instead. See the Service Settings guide for migration details.
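
The runtime settings note above describes three steps: defer the reconnect until the user's turn ends, buffer audio while reconnecting, then replay the buffer on the new connection. The toy model below sketches that ordering only. It is not Pipecat code; the class DeferredReconnectModel and all of its methods are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DeferredReconnectModel:
    """Toy model of deferred reconnection with audio buffering (hypothetical)."""

    language: str = "en"
    user_speaking: bool = False
    reconnecting: bool = False
    pending_language: Optional[str] = None
    buffered: list = field(default_factory=list)
    sent: list = field(default_factory=list)  # (language, chunk) pairs

    def request_language(self, language: str) -> None:
        # A settings update arrives; defer it while the user is mid-turn.
        self.pending_language = language
        if not self.user_speaking:
            self.reconnecting = True

    def user_started_speaking(self) -> None:
        self.user_speaking = True

    def user_stopped_speaking(self) -> None:
        # End of turn: a pending update may now trigger the reconnect.
        self.user_speaking = False
        if self.pending_language is not None:
            self.reconnecting = True

    def audio(self, chunk: bytes) -> None:
        # Audio arriving during the reconnect is held back, not dropped.
        if self.reconnecting:
            self.buffered.append(chunk)
        else:
            self.sent.append((self.language, chunk))

    def reconnect_finished(self) -> None:
        # New connection is ready: apply settings, then replay in order.
        if self.pending_language is not None:
            self.language = self.pending_language
        self.pending_language = None
        self.reconnecting = False
        for chunk in self.buffered:
            self.sent.append((self.language, chunk))
        self.buffered.clear()


model = DeferredReconnectModel()
model.user_started_speaking()
model.audio(b"a")
model.request_language("es")   # deferred: the user is mid-turn
model.audio(b"b")              # still goes out on the old connection
model.user_stopped_speaking()  # the reconnect starts here
model.audio(b"c")              # buffered during the reconnect
model.reconnect_finished()     # replayed under the new language
```

In this model the chunks sent before the turn ends keep the old language, and only the buffered chunk is transcribed under the new one, which is the property the note relies on for safe mid-conversation language switching.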

Event Handlers

Cartesia STT supports the standard service connection events:
Event             Description
on_connected      Connected to Cartesia WebSocket
on_disconnected   Disconnected from Cartesia WebSocket
@stt.event_handler("on_connected")
async def on_connected(service):
    print("Connected to Cartesia STT")

CartesiaTurnsSTTService

The server drives turn boundaries with the ink-2 model, pushing structured events for turn lifecycle management including start, updates, eager end predictions, resume, and final turn completion.
api_key
str
required
Cartesia API key for authentication.
url
str
default:"wss://api.cartesia.ai/stt/turns/websocket"
WebSocket URL for the Cartesia Streaming ASR v2 endpoint.
sample_rate
int | None
default:"None"
Audio sample rate in Hz. If None, uses the pipeline sample rate.
should_interrupt
bool
default:"True"
Whether to broadcast an interruption when the server signals the start of a new turn.
watchdog_min_timeout
float
default:"0.5"
Minimum idle timeout (in seconds) before sending silence to prevent dangling turns. The actual threshold is max(chunk_duration * 2, watchdog_min_timeout).
extra_headers
dict[str, str] | None
default:"None"
Optional additional HTTP headers to send with the WebSocket handshake.
settings
CartesiaTurnsSTTService.Settings
default:"None"
Runtime-updatable settings. See Settings below.
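
To make the watchdog_min_timeout formula concrete, here is a small worked sketch. It assumes chunk_duration is expressed in seconds, like the timeout; the function name watchdog_threshold is hypothetical and only restates the expression given above.

```python
def watchdog_threshold(chunk_duration: float, watchdog_min_timeout: float = 0.5) -> float:
    """Idle time in seconds before silence is sent (hypothetical helper)."""
    return max(chunk_duration * 2, watchdog_min_timeout)


# 20 ms chunks: twice the chunk duration is 0.04 s, so the 0.5 s floor applies.
small_chunks = watchdog_threshold(0.02)

# 400 ms chunks: twice the chunk duration is 0.8 s, which exceeds the floor.
large_chunks = watchdog_threshold(0.4)

# Raising the floor to 1.0 s overrides both of the cases above.
raised_floor = watchdog_threshold(0.02, watchdog_min_timeout=1.0)
```

With typical short audio chunks the floor dominates, so watchdog_min_timeout is effectively the threshold unless chunks are longer than half of it.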

Settings

Runtime-configurable settings passed via the settings constructor argument using CartesiaTurnsSTTService.Settings(...). The ink-2 model family is English-only and does not support runtime model or language switching. Attempts to update these fields will be reported as unhandled.
Parameter   Type             Default   Description
model       str              "ink-2"   The transcription model to use. (Inherited from base STT settings.)
language    Language | str   None      Target language (fixed to English). (Inherited from base STT settings.)

Usage

Basic Setup

import os

from pipecat.services.cartesia.turns.stt import CartesiaTurnsSTTService

# Defaults: ink-2 model, pipeline sample rate, interruptions enabled.
stt = CartesiaTurnsSTTService(
    api_key=os.getenv("CARTESIA_API_KEY"),
)

With Custom Configuration

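import os

from pipecat.services.cartesia.turns.stt import CartesiaTurnsSTTService

stt = CartesiaTurnsSTTService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    sample_rate=16000,
    # Broadcast an interruption when the server signals a new turn.
    should_interrupt=True,
    # Wait at least 1 second of idle time before sending silence.
    watchdog_min_timeout=1.0,
)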

With Event Handlers

import os

from pipecat.services.cartesia.turns.stt import CartesiaTurnsSTTService

stt = CartesiaTurnsSTTService(
    api_key=os.getenv("CARTESIA_API_KEY"),
)

# Fired when the server detects the start of a turn.
@stt.event_handler("on_turn_start")
async def on_turn_start(service, transcript):
    print(f"User started speaking: {transcript}")

# Fired with the final transcript for the completed turn.
@stt.event_handler("on_turn_end")
async def on_turn_end(service, transcript):
    print(f"Final transcript: {transcript}")

Turn-Based Protocol

The service speaks the v2 turn-based wire protocol:
connected → turn.start → turn.update* → (turn.eager_end → turn.resume?)* → turn.end → ...
  • turn.start: Server detected the start of a turn. Pushes UserStartedSpeakingFrame and optionally broadcasts an interruption.
  • turn.update: Incremental transcript update. Pushes InterimTranscriptionFrame.
  • turn.eager_end: Server eagerly predicted the end of turn. Available via event handler for speculative downstream processing.
  • turn.resume: User resumed speaking after an eager end. Available via event handler.
  • turn.end: Final transcript for the completed turn. Pushes TranscriptionFrame and UserStoppedSpeakingFrame.
Transcripts are cumulative per turn. There is no is_final flag and no finalize command — closing the socket ends the session.
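
Because transcripts are cumulative, a consumer should replace its running text with each new event rather than append to it. The sketch below maps a stream of events to the frame names listed above and keeps only the latest transcript. The (event_type, transcript) tuple shape, the function name frames_for, and the order of the two frames on turn.end are assumptions made for illustration; the real wire format is not shown on this page.

```python
from typing import Iterable, Iterator, Optional

# Frames pushed per event type, as listed above. Eager end and resume are
# surfaced through event handlers only, so they map to no frames here.
FRAMES_BY_EVENT = {
    "turn.start": ("UserStartedSpeakingFrame",),
    "turn.update": ("InterimTranscriptionFrame",),
    "turn.eager_end": (),
    "turn.resume": (),
    "turn.end": ("TranscriptionFrame", "UserStoppedSpeakingFrame"),
}
TEXT_FRAMES = {"InterimTranscriptionFrame", "TranscriptionFrame"}


def frames_for(
    events: Iterable[tuple[str, str]],
) -> Iterator[tuple[str, Optional[str]]]:
    """Yield (frame_name, text) pairs; text is None for non-transcript frames."""
    latest = ""
    for event_type, transcript in events:
        if event_type == "turn.start":
            latest = ""  # a new turn starts a new cumulative transcript
        if transcript:
            latest = transcript  # cumulative: replace, never append
        for frame_name in FRAMES_BY_EVENT.get(event_type, ()):
            yield frame_name, (latest if frame_name in TEXT_FRAMES else None)


events = [
    ("connected", ""),
    ("turn.start", ""),
    ("turn.update", "hello"),
    ("turn.update", "hello wor"),
    ("turn.eager_end", "hello world"),
    ("turn.resume", ""),
    ("turn.end", "hello world again"),
]
frames = list(frames_for(events))
```

Unknown event types, including connected, produce no frames in this sketch, so it tolerates events it does not model.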

Notes

  • English-only: The ink-2 model family supports English transcription only at launch.
  • No runtime model switching: Unlike the v1 API, the ink-2 model does not support runtime model or language switching.
  • Watchdog for dangling turns: If audio stops flowing after a turn.start, the service sends silence to prevent the turn from hanging indefinitely. Configure the threshold with watchdog_min_timeout.
  • Server-driven turns: The server controls turn boundaries. There is no client-side finalize command.
  • Interruption support: Set should_interrupt=True to broadcast interruptions when the user starts speaking, enabling natural turn-taking.

Event Handlers

Cartesia Turns STT supports the following event handlers:
Event                 Handler Signature                      Description
on_connected          async def(service)                     Connected to Cartesia WebSocket
on_disconnected       async def(service)                     Disconnected from Cartesia WebSocket
on_connection_error   async def(service, error_msg)          Connection error occurred
on_turn_start         async def(service, transcript: str)    Server detected start of a turn
on_turn_update        async def(service, transcript: str)    Incremental transcript update
on_turn_eager_end     async def(service, transcript: str)    Server eagerly predicted end of turn
on_turn_resume        async def(service)                     User resumed speaking after an eager end
on_turn_end           async def(service, transcript: str)    Final transcript for the completed turn
Example:
@stt.event_handler("on_turn_eager_end")
async def on_turn_eager_end(service, transcript):
    print(f"Eager end prediction: {transcript}")
    # Optionally start processing speculatively

@stt.event_handler("on_turn_resume")
async def on_turn_resume(service):
    print("User resumed speaking, discard speculative processing")
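
One way to act on the eager end and resume events is to start downstream work as a cancellable asyncio task. The sketch below is self-contained and does not import Pipecat; SpeculativeRunner, its methods, and the work coroutine are hypothetical names. In a real pipeline the three methods would be called from the on_turn_eager_end, on_turn_resume, and on_turn_end handlers.

```python
import asyncio


class SpeculativeRunner:
    """Hypothetical helper: start work on eager end, cancel it on resume."""

    def __init__(self, work):
        self._work = work  # async callable that takes the transcript
        self._task = None
        self._transcript = None

    def on_eager_end(self, transcript: str) -> None:
        # Drop any earlier speculation, then start a new one.
        self.cancel()
        self._transcript = transcript
        self._task = asyncio.create_task(self._work(transcript))

    def on_resume(self) -> None:
        # The user kept talking, so the speculative result is stale.
        self.cancel()

    def cancel(self) -> None:
        if self._task is not None:
            self._task.cancel()
        self._task = None
        self._transcript = None

    async def on_turn_end(self, transcript: str):
        # Reuse the speculative result only if it was computed for this text.
        if self._task is not None and self._transcript == transcript:
            task, self._task = self._task, None
            return await task
        self.cancel()
        return await self._work(transcript)


async def demo():
    calls = []

    async def work(text: str) -> str:
        calls.append(text)
        await asyncio.sleep(0)  # stand-in for an LLM or tool call
        return text.upper()

    runner = SpeculativeRunner(work)
    runner.on_eager_end("book a table")          # speculation starts
    runner.on_resume()                           # user resumed: cancel it
    runner.on_eager_end("book a table for two")  # second speculation
    result = await runner.on_turn_end("book a table for two")
    return result, calls


result, calls = asyncio.run(demo())
```

Comparing the final transcript with the speculated one guards against reusing a result computed for text that later changed.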