
Audio API

The Audio API enables speech-to-text transcription, audio translation, text-to-speech synthesis, and audio-input chat completions. This reference covers all audio-related endpoints.

Overview

| Endpoint | Description |
|---|---|
| /v1/audio/transcriptions | Convert audio to text |
| /v1/audio/translations | Translate audio to English text |
| /v1/audio/speech | Convert text to audio |
| /v1/chat/completions | Reason over audio input (gpt-4o-audio-preview) |
PAYG token required

Audio endpoints — and any chat completion request that uses an audio modality — require a Pay-As-You-Go (PAYG) token. Subscription tokens (sk-sub_...) currently receive HTTP 403 with error code audio_requires_payg. Subscription quota support for audio is on the roadmap; until then, generate a PAYG key from the dashboard to use audio features.
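
For example, a transcription request made with a subscription key is rejected before it reaches any model. A minimal sketch of catching this with the openai Python SDK (HTTP 403 surfaces as PermissionDeniedError; treating every 403 on an audio endpoint as the audio_requires_payg case is an assumption made for brevity):

from openai import OpenAI, PermissionDeniedError

client = OpenAI(api_key="sk-sub_example", base_url="https://api.apertis.ai/v1")

try:
    with open("audio.mp3", "rb") as f:
        client.audio.transcriptions.create(model="whisper-1", file=f)
except PermissionDeniedError as err:
    # Gateway returned HTTP 403: audio endpoints need a PAYG key
    print("Audio requires a PAYG token:", err)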

Speech to Text (Transcription)

Convert audio files to text in the original language.

Endpoint

POST /v1/audio/transcriptions

Request

Headers:

Authorization: Bearer sk-your-api-key
Content-Type: multipart/form-data

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file (max 25 MB) |
| model | string | Yes | Transcription model; see Supported Transcription Models |
| language | string | No | Language code (ISO 639-1) |
| prompt | string | No | Context or spelling hints |
| response_format | string | No | json, text, srt, vtt, verbose_json (whisper-* models only); gpt-4o-transcribe and gpt-4o-mini-transcribe always return json |
| temperature | number | No | Sampling temperature (0-1, default 0) |

Supported Transcription Models

| Model | Best for | Notes |
|---|---|---|
| whisper-1 | General-purpose, multilingual | OpenAI's reliable baseline |
| whisper-large-v3 | Higher accuracy than whisper-1 | Routed via OpenRouter |
| whisper-large-v3-turbo | Fastest, lowest cost in the whisper family | Routed via OpenRouter |
| gpt-4o-mini-transcribe | Cheap GPT-4o-grade transcription | Returns json only |
| gpt-4o-transcribe | Highest accuracy, especially Chinese / noisy audio | Returns json only; recommended default for CJK content |
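
For Chinese or noisy recordings, the call is the same as the whisper-1 examples below; only the model name changes. A quick sketch (client setup as in Basic Transcription; the filename is illustrative):

transcript = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=open("mandarin_meeting.m4a", "rb"),
    language="zh"
)
print(transcript.text)  # gpt-4o-transcribe always returns json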

Supported Audio Formats

  • flac
  • mp3
  • mp4
  • mpeg
  • mpga
  • m4a
  • ogg
  • wav
  • webm
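
If a recording is in an unsupported container (e.g., AAC or WMA), convert it before uploading. A minimal sketch with pydub, which wraps ffmpeg (the filenames are illustrative):

from pydub import AudioSegment

# Re-encode an unsupported container as mp3 before uploading
AudioSegment.from_file("recording.aac").export("recording.mp3", format="mp3")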

Example: Basic Transcription

cURL:

curl https://api.apertis.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-your-api-key" \
  -F file="@audio.mp3" \
  -F model="whisper-1"

Python:

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-api-key",
    base_url="https://api.apertis.ai/v1"
)

audio_file = open("audio.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)

print(transcript.text)

Node.js:

import OpenAI from 'openai';
import fs from 'fs';

const client = new OpenAI({
  apiKey: 'sk-your-api-key',
  baseURL: 'https://api.apertis.ai/v1'
});

const transcript = await client.audio.transcriptions.create({
  model: 'whisper-1',
  file: fs.createReadStream('audio.mp3')
});

console.log(transcript.text);

Response

JSON format (default):

{
  "text": "Hello, this is a transcription of the audio file."
}

Verbose JSON format:

{
  "task": "transcribe",
  "language": "english",
  "duration": 5.5,
  "text": "Hello, this is a transcription of the audio file.",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is",
      "tokens": [50364, 2425, 11, 341, 307],
      "temperature": 0.0,
      "avg_logprob": -0.25,
      "compression_ratio": 1.2,
      "no_speech_prob": 0.01
    }
  ]
}
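
The segments array is what you iterate to build timestamps. A small sketch, reusing audio_file from the basic example and assuming verbose_json with a whisper-* model:

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json"
)

for segment in transcript.segments:
    # Print each segment with its start/end offsets in seconds
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")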

Example: With Language Hint

Improve accuracy by specifying the language:

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="zh"  # Chinese
)

Example: With Spelling Hints

Provide context for uncommon words:

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    prompt="Apertis, API, OpenAI, GPT-4o"  # Spelling hints
)

Example: SRT Subtitles

Generate subtitle files:

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="srt"
)

# With srt (or text/vtt) the SDK returns the subtitle content as a plain string
with open("subtitles.srt", "w") as f:
    f.write(transcript)

Audio Translation

Translate audio from any language to English text.

Endpoint

POST /v1/audio/translations

Request

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file (max 25 MB) |
| model | string | Yes | whisper-1, whisper-large-v3, or whisper-large-v3-turbo (translation is whisper-only) |
| prompt | string | No | Context hints |
| response_format | string | No | Output format |
| temperature | number | No | Sampling temperature |

Example

Python:

audio_file = open("french_audio.mp3", "rb")
translation = client.audio.translations.create(
    model="whisper-1",
    file=audio_file
)

print(translation.text)  # English translation

cURL:

curl https://api.apertis.ai/v1/audio/translations \
  -H "Authorization: Bearer sk-your-api-key" \
  -F file="@french_audio.mp3" \
  -F model="whisper-1"

Response

{
  "text": "Hello, how are you today?"
}

Text to Speech (TTS)

Convert text to natural-sounding audio.

Endpoint

POST /v1/audio/speech

Request

Headers:

Authorization: Bearer sk-your-api-key
Content-Type: application/json

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | TTS model; see Supported TTS Models |
| input | string | Yes | Text to convert (max 4096 chars) |
| voice | string | Yes | Voice to use |
| response_format | string | No | mp3, opus, aac, flac, wav, pcm |
| speed | number | No | Speed (0.25-4.0, default 1.0) |
| instructions | string | No | Tone / accent / pacing prompt; only honoured by gpt-4o-mini-tts-2025-12-15 |

Supported TTS Models

| Model | Best for | Notes |
|---|---|---|
| tts-1 | Standard quality, low latency | OpenAI baseline |
| tts-1-hd | High quality, slower | OpenAI HD baseline |
| gpt-4o-mini-tts-2025-12-15 | Steerable voice (instruction-following) | Use the instructions field to set tone, accent, pacing |
| gemini-3.1-flash-tts-preview | Multilingual TTS | Routed via OpenRouter |
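
A sketch of steering the voice with the instructions field; the instruction wording is illustrative, and only gpt-4o-mini-tts-2025-12-15 honours it:

response = client.audio.speech.create(
    model="gpt-4o-mini-tts-2025-12-15",
    voice="nova",
    input="Thanks for calling. How can I help you today?",
    instructions="Speak warmly and at a relaxed pace, like a patient support agent."
)
response.stream_to_file("greeting.mp3")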

Available Voices

| Voice | Description |
|---|---|
| alloy | Neutral, balanced |
| echo | Warm, conversational |
| fable | Expressive, storytelling |
| onyx | Deep, authoritative |
| nova | Energetic, youthful |
| shimmer | Clear, pleasant |


Example: Basic TTS

Python:

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! Welcome to Apertis API documentation."
)

# Save to file
response.stream_to_file("output.mp3")

Node.js:

const response = await client.audio.speech.create({
  model: 'tts-1',
  voice: 'alloy',
  input: 'Hello! Welcome to Apertis API documentation.'
});

const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync('output.mp3', buffer);

cURL:

curl https://api.apertis.ai/v1/audio/speech \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "alloy",
    "input": "Hello! Welcome to Apertis API documentation."
  }' \
  --output output.mp3

Response

Binary audio data in the requested format. Default is mp3.

Example: High-Quality Audio

response = client.audio.speech.create(
    model="tts-1-hd",  # Higher quality
    voice="nova",
    input="This is high-definition audio.",
    response_format="flac"  # Lossless format
)

Example: Adjusting Speed

# Slower speech
response = client.audio.speech.create(
    model="tts-1",
    voice="onyx",
    input="Speaking slowly and clearly.",
    speed=0.75
)

# Faster speech
response = client.audio.speech.create(
    model="tts-1",
    voice="echo",
    input="Speaking quickly!",
    speed=1.5
)

Streaming TTS

For real-time audio streaming:

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Streaming audio in real-time..."
) as response:
    for chunk in response.iter_bytes():
        # Process audio chunks as they arrive
        # (audio_player is a placeholder for your playback library)
        audio_player.play(chunk)

Audio in Chat Completions

gpt-4o-audio-preview accepts audio input (and optionally produces audio output) through the standard /v1/chat/completions endpoint. This unlocks voice-agent use cases where the model reasons over the audio directly rather than passing through a transcription step.

Endpoint

POST /v1/chat/completions

Request

Use the standard chat completions schema with three additional fields:

  • modalities — array of output modalities, e.g. ["text", "audio"]
  • audio — output audio config: {"voice": "alloy", "format": "wav"}
  • A user message whose content array includes an input_audio block with {data: <base64>, format: <"wav"|"mp3">}

Example

Python (openai SDK):

import base64
from openai import OpenAI

client = OpenAI(api_key="sk-your-payg-api-key", base_url="https://api.apertis.ai/v1")

with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this recording say, and how should I reply?"},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
# Spoken reply (base64-encoded wav) is on response.choices[0].message.audio
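
To save the spoken reply, decode the base64 payload from the message's audio field (field names follow the OpenAI SDK's audio response shape):

# Decode and save the model's spoken reply
reply = response.choices[0].message.audio
with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(reply.data))
print(reply.transcript)  # text transcript of the spoken audio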

cURL:

AUDIO_B64=$(base64 -i question.wav | tr -d '\n')  # strip line wrapping so the JSON stays valid
curl https://api.apertis.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-your-payg-api-key" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gpt-4o-audio-preview\",
    \"modalities\": [\"text\", \"audio\"],
    \"audio\": {\"voice\": \"alloy\", \"format\": \"wav\"},
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"text\", \"text\": \"Summarise this clip.\"},
        {\"type\": \"input_audio\", \"input_audio\": {\"data\": \"${AUDIO_B64}\", \"format\": \"wav\"}}
      ]
    }]
  }"

Notes

  • Subscription tokens are blocked at the gateway with HTTP 403 / audio_requires_payg. Use a PAYG token.
  • Streaming (stream: true) is not yet supported for audio output — request a non-streaming response.
  • Other adaptors that route through OpenAI-format channels (e.g., OpenRouter) inherit the same content-block shape.

Supported Languages

Transcription Languages

Whisper supports transcription in 50+ languages:

| Language | Code | Language | Code |
|---|---|---|---|
| English | en | Chinese | zh |
| Spanish | es | Japanese | ja |
| French | fr | Korean | ko |
| German | de | Arabic | ar |
| Italian | it | Hindi | hi |
| Portuguese | pt | Russian | ru |

Full list of supported languages

TTS Languages

Text-to-speech primarily supports:

  • English (best quality)
  • Most major languages (good quality)
Note: While TTS can handle multiple languages, voice quality is optimized for English.

Best Practices

Audio Quality

For best transcription results:

  1. Use high-quality audio - 16kHz or higher sample rate
  2. Minimize background noise - Clear recordings work best
  3. Use appropriate format - MP3 or WAV recommended
  4. Keep files under 25MB - Split longer recordings

Optimizing TTS

  1. Use punctuation - Helps with natural pauses
  2. Break long text - Split into sentences for better pacing (see the sketch below)
  3. Choose appropriate voice - Match voice to content type
  4. Adjust speed - Slower for clarity, faster for efficiency
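
A minimal sketch of the split-then-synthesize pattern from point 2; the sentence splitting is naive, and byte-wise concatenation of MP3 segments plays correctly in most players, though a tool like pydub gives cleaner joins:

long_text = "First sentence. Second sentence. Third sentence."
sentences = long_text.split(". ")  # naive split, for illustration only

with open("narration.mp3", "wb") as out:
    for sentence in sentences:
        response = client.audio.speech.create(
            model="tts-1",
            voice="alloy",
            input=sentence
        )
        # Append each synthesized segment to a single output file
        out.write(response.read())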

Long Audio Handling

For files larger than 25MB:

from pydub import AudioSegment

# Split audio into 10-minute chunks
audio = AudioSegment.from_file("long_audio.mp3")
chunk_length = 10 * 60 * 1000  # 10 minutes in milliseconds

chunks = [audio[i:i+chunk_length] for i in range(0, len(audio), chunk_length)]

transcripts = []
for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i}.mp3", format="mp3")
    with open(f"chunk_{i}.mp3", "rb") as f:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=f
        )
    transcripts.append(result.text)

full_transcript = " ".join(transcripts)

Error Handling

Common Errors

| Error | Cause | Solution |
|---|---|---|
| Invalid file format | Unsupported audio format | Convert to MP3/WAV |
| File too large | Exceeds 25 MB limit | Split into chunks |
| Invalid model | Model not available | Check model name |
| Rate limit exceeded | Too many requests | Implement backoff (see the sketch below) |

Error Response Example

{
  "error": {
    "message": "Invalid file format. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm",
    "type": "invalid_request_error",
    "code": "invalid_file_format"
  }
}
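
For rate limits, a simple exponential backoff around the call is usually enough. A sketch using the SDK's RateLimitError (the retry count and delays are arbitrary):

import time
from openai import RateLimitError

def transcribe_with_backoff(path, max_retries=5):
    for attempt in range(max_retries):
        try:
            with open(path, "rb") as f:
                return client.audio.transcriptions.create(model="whisper-1", file=f)
        except RateLimitError:
            # Wait 1s, 2s, 4s, ... before retrying
            time.sleep(2 ** attempt)
    raise RuntimeError("Transcription failed after retries")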

Pricing

For current model pricing, please visit the Models page.