---
name: cartesia-tts
description: >
  Use this skill when integrating the Cartesia TTS API (Sonic models)
  for ultra-low latency text-to-speech, voice cloning, or streaming audio.
  Covers the REST and WebSocket APIs and Python urllib integration.
---

# Cartesia TTS Skill

You are an expert at integrating the Cartesia Sonic TTS API into Python applications.

## Overview

Cartesia provides ultra-low latency TTS built on State Space Model (SSM) technology:
- **Sonic 3** -- flagship model, 42+ languages
- **Ultra-low latency** -- 40-90ms time-to-first-audio
- **Voice cloning** -- from a 3-second audio clip
- **Streaming** -- WebSocket and SSE
- **Fine-grained control** -- speed, volume, emotion, laughter

> [!IMPORTANT]
> Cartesia's models are built on a State Space Model (SSM) architecture, not a transformer.
> The low time-to-first-audio figures listed above (40-90ms) are attributed to this design.

## Current Models

| Model | Latency | Languages |
|-------|---------|-----------|
| `sonic-3` | ~90ms | 42+ |
| `sonic-2` | ~100ms | 30+ |

## API Base

```
https://api.cartesia.ai/
```

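Authentication: send the API key in the `X-API-Key` header (`X-API-Key: $CARTESIA_API_KEY`). The Quick Start example also sends a `Cartesia-Version` header with every request.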
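
The headers can be attached in one place so every call is built the same way. The following is a minimal sketch, not an official client: `build_request`, `API_BASE`, and `API_VERSION` are hypothetical names, and the version string is simply the one used in the Quick Start of this document.

```python
import json
import urllib.request

API_BASE = 'https://api.cartesia.ai'
API_VERSION = '2024-06-10'  # copied from the Quick Start; check the docs for the current value


def build_request(path, api_key, payload=None):
    """Return a urllib Request for `path` with the auth headers set.

    A JSON `payload` turns the request into a POST; without one it is a GET.
    """
    headers = {
        'X-API-Key': api_key,
        'Cartesia-Version': API_VERSION,
    }
    data = None
    if payload is not None:
        data = json.dumps(payload).encode('utf-8')
        headers['Content-Type'] = 'application/json'
    return urllib.request.Request(
        API_BASE + path,
        data=data,
        headers=headers,
        method='POST' if data is not None else 'GET',
    )
```

For example, `urllib.request.urlopen(build_request('/voices', api_key))` would issue the voice-listing call; the shape of the JSON it returns is not covered here and should be taken from the API reference.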

## Key Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/tts/bytes` | POST | Generate full audio (returns binary) |
| `/tts/sse` | POST | Stream audio via SSE |
| `wss://api.cartesia.ai/tts/websocket` | WS | Real-time WebSocket streaming |
| `/voices` | GET | List voices |
| `/voices/clone/clip` | POST | Clone voice from audio clip |
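
The `/tts/sse` endpoint streams Server-Sent Events. A minimal sketch of a parser for the generic SSE wire format is shown here; it assumes nothing about Cartesia's event names or the JSON carried in each event, which should be checked against the API reference. `iter_sse_events` is a hypothetical helper name.

```python
def iter_sse_events(lines):
    """Yield (event_name, data) pairs from an iterable of raw SSE lines (bytes)."""
    event, data = 'message', []
    for raw in lines:
        line = raw.decode('utf-8').rstrip('\r\n')
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event, '\n'.join(data)
            event, data = 'message', []
        elif line.startswith(':'):
            # Comment line, often used as a keep-alive; ignore it.
            continue
        elif line.startswith('event:'):
            event = line[len('event:'):].strip()
        elif line.startswith('data:'):
            value = line[len('data:'):]
            # The SSE format strips a single leading space after the colon.
            if value.startswith(' '):
                value = value[1:]
            data.append(value)
    # Flush a trailing event that was not followed by a blank line.
    if data:
        yield event, '\n'.join(data)
```

The response object returned by `urllib.request.urlopen` can be iterated line by line, so it can be passed directly as `lines`.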

## Quick Start (Python)

```python
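import json
import os
import urllib.request

# Example inputs; replace with values from your application.
api_key = os.environ['CARTESIA_API_KEY']
text = 'Hello from Cartesia.'
output_path = 'output.mp3'

# Request body: model, text to speak, voice, and audio format.
payload = json.dumps({
    'model_id': 'sonic-3',
    'transcript': text,
    'voice': {
        'mode': 'id',
        'id': 'a0e99841-438c-4a64-b679-ae501e7d6091'
    },
    'output_format': {
        'container': 'mp3',
        'bit_rate': 128000,
        'sample_rate': 44100
    }
}).encode('utf-8')

# Supplying `data` makes urllib send a POST request.
req = urllib.request.Request(
    'https://api.cartesia.ai/tts/bytes',
    data=payload,
    headers={
        'X-API-Key': api_key,
        'Cartesia-Version': '2024-06-10',
        'Content-Type': 'application/json'
    }
)

# /tts/bytes returns the complete audio file as the response body.
with urllib.request.urlopen(req) as resp:
    audio_data = resp.read()

with open(output_path, 'wb') as f:
    f.write(audio_data)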
```

## Voice Control

```python
voice = {
    'mode': 'id',
    'id': 'voice-id',
    '__experimental_controls': {
        'speed': 'normal',       # "slowest" to "fastest"
        'emotion': ['positivity:high', 'curiosity:medium']
    }
}
```
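
The voice object can be assembled by a small helper that validates the speed and formats emotions as `name:level` strings. This is a sketch under assumptions: `build_voice` and `SPEEDS` are hypothetical names, and the intermediate speed values `slow` and `fast` are assumed (the example above only states the range "slowest" to "fastest").

```python
# Endpoints of the range come from the example above; 'slow' and 'fast' are assumed.
SPEEDS = ('slowest', 'slow', 'normal', 'fast', 'fastest')


def build_voice(voice_id, speed='normal', emotions=None):
    """Return a voice dict in the shape shown above.

    `emotions` maps an emotion name to a level, e.g. {'positivity': 'high'}.
    """
    if speed not in SPEEDS:
        raise ValueError(f'speed must be one of {SPEEDS}, got {speed!r}')
    controls = {'speed': speed}
    if emotions:
        controls['emotion'] = [f'{name}:{level}' for name, level in emotions.items()]
    return {
        'mode': 'id',
        'id': voice_id,
        '__experimental_controls': controls,
    }
```

Calling `build_voice('voice-id', emotions={'positivity': 'high', 'curiosity': 'medium'})` reproduces the dict shown above.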

## Output Formats

- `mp3` (128kbps default)
- `raw` with one of the encodings `pcm_f32le`, `pcm_s16le`, `pcm_alaw`, or `pcm_mulaw`
- `wav`
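
Raw PCM has no header, so players need to be told the sample rate and encoding. The sketch below wraps `pcm_s16le` bytes in a WAV container with the standard-library `wave` module. It assumes mono audio and assumes the raw format is requested with an `encoding` key in `output_format`; both should be confirmed in the API reference. `RAW_S16LE_FORMAT` and `pcm_s16le_to_wav` are hypothetical names.

```python
import io
import wave

# Assumed request shape for raw output; the `encoding` key is an assumption.
RAW_S16LE_FORMAT = {
    'container': 'raw',
    'encoding': 'pcm_s16le',
    'sample_rate': 44100,
}


def pcm_s16le_to_wav(pcm_bytes, sample_rate=44100, channels=1):
    """Wrap signed 16-bit little-endian PCM bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, 'wb') as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples are 2 bytes wide
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()
```

The `sample_rate` passed to `pcm_s16le_to_wav` has to match the one sent in the request, otherwise playback speed and pitch will be wrong.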

## API Docs

- [TTS API Reference](https://docs.cartesia.ai/api-reference/tts/bytes)
- [Voices](https://docs.cartesia.ai/api-reference/voices/list)
- [Voice Cloning](https://docs.cartesia.ai/api-reference/voices/clone-voice-clip)
- [WebSocket](https://docs.cartesia.ai/api-reference/tts/websocket)

## Related Skills
- `ai-api` -- AI integration patterns (Python, TTS)
- `tts-voice-instructor` -- voice instruction engineering (OpenAI TTS)
