Voice AI API Pricing 2026: How to Calculate Real Costs

Learn how Voice AI API pricing works: decode all cost layers, compute loaded cost per minute, and cut spend without hurting quality. Get the formula.

TL;DR

Voice AI API pricing is the total cost of running AI-powered voice interactions, not just one advertised number. A production voice agent bills across multiple layers: platform orchestration, speech-to-text, LLM reasoning, text-to-speech, telephony, concurrency, and add-ons. To compare vendors accurately, calculate the “loaded cost per voice minute” and, more importantly, the cost per resolved call or qualified conversation. The cheapest per-minute rate often isn’t the cheapest in production.


Voice AI API pricing is the cost model for APIs that power real-time AI voice agents. In production, that cost is almost never a single number. A working voice agent uses a platform layer, speech-to-text transcription, a large language model, text-to-speech synthesis, telephony infrastructure, concurrency controls, and various add-ons. Each layer has its own meter, its own billing unit, and its own vendor.

That is why the advertised “per-minute” rate on a pricing page often bears little resemblance to the actual cost per call.

One platform might charge a platform fee and pass through STT, TTS, LLM, and telephony costs separately. Another might bundle those layers into one flat per-minute rate. Vapi’s documentation lists $0.05/minute for calls, with transcriber, model, voice, and telephony costs charged at cost on top of that source. Bland, by contrast, says its flat per-minute rates include LLM, STT, TTS, and telephony source. Same product category, completely different billing structures. This article explains how to read any voice AI API pricing page and calculate what you will actually spend.

What Is Voice AI API Pricing?

Voice AI API pricing covers the billing structure for APIs used to build AI systems that listen, understand, respond, and speak in real time. It can include per-minute platform fees, STT minutes, TTS characters or audio seconds, LLM input and output tokens, telephony minutes, phone number rental, concurrency limits, recordings, storage, transfers, denoising, knowledge base access, and compliance features.

In plain terms: a voice AI API turns a phone call or web voice conversation into a chain of paid events. The caller speaks, the system transcribes audio, the LLM decides what to say or do, the TTS engine generates speech, the telephony layer carries the call, and the platform coordinates the workflow. Each layer may have its own meter.

The pricing unit depends on the layer. Deepgram prices streaming STT per minute and TTS per 1,000 characters source. OpenAI and Anthropic price models by token, with separate input and output rates source. Twilio and Telnyx price telephony by minute source. These different billing units make apples-to-apples comparison genuinely difficult.

Why Voice AI API Pricing Is Hard to Compare

Three things make comparison frustrating:

Different vendors bundle different layers. Some quote a platform fee only, expecting you to add provider costs. Others include telephony. Others include everything. Without knowing what is inside the number, comparing two per-minute rates is meaningless.

Billing units don’t translate cleanly. A TTS provider might charge per 1,000 characters. An LLM charges per million tokens. STT charges per audio minute. Telephony charges per connected minute. Converting all of these into a single “cost per minute of conversation” requires assumptions about talk/listen ratio, agent verbosity, prompt length, and call duration.

Usage behavior changes cost. A concise agent with short prompts calling U.S. local numbers will cost less per minute than a verbose agent with long system prompts making international toll-free calls. The same platform, same pricing page, wildly different bills.

Practitioners on Reddit have repeatedly flagged confusion around realtime audio token billing, with developers asking how “1M audio tokens” maps to actual voice minutes and whether they should expect large bills from short tests source. Similar threads about ElevenLabs voice agent pricing show users asking whether they pay LLM, TTS, and STT separately, and whether the API is cheaper than the full agent dashboard source.

The confusion is real, and it starts with not knowing which meters are running.

The 7 Cost Layers in a Voice AI API Stack

Every voice AI call passes through multiple billable layers. Here is what each one covers, with current pricing examples.

1. Platform or Orchestration Fee

This is what the voice AI platform charges for coordinating the call: managing state, routing, prompts, tool calls, logs, integrations, turn-taking, and concurrency.

Orchestration is not “just markup.” It includes realtime routing, retries, observability, testing tools, state management, and deployment infrastructure. But buyers need to know whether this fee is bundled with other layers or charged separately.

Vapi charges $0.05/minute for calls, with provider costs passed through at cost source. Retell lists voice infrastructure at $0.055/minute inside its detailed component pricing source. SigmaMind uses a $0.03/minute platform fee for voice agents, with STT, TTS, LLM, and telephony provider costs billed as actuals. This modular approach lets teams using a voice agent builder pick the best provider for each layer instead of accepting a default stack.

2. Speech-to-Text (STT)

STT converts caller audio into text. It is usually billed by audio minute, audio hour, or websocket connection time.

Current examples: Deepgram’s Nova-3 Monolingual costs $0.0048/minute pay-as-you-go, while Flux English is $0.0065/minute source. ElevenLabs lists Scribe v2 Realtime at $0.39/hour (about $0.0065/minute) source.

One detail that surprises buyers: STT billing often runs during silence because the transcription engine must stay active and listening. Retell’s FAQ confirms that billing covers the entire call duration, including silence, because the STT engine remains active source. This means hold time, pauses, and caller hesitation all cost money.
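As a rough illustration of the silence point, the sketch below converts the hourly rate quoted above into a per-minute rate and compares billing on full connected duration with billing on speech only. The call length and silence duration are hypothetical, and `stt_cost` is an illustrative helper, not any vendor's API.

```python
def stt_cost(call_seconds: float, rate_per_hour: float) -> float:
    """STT cost for one call, given the number of billed seconds."""
    rate_per_minute = rate_per_hour / 60
    return (call_seconds / 60) * rate_per_minute

# $0.39/hour is the Scribe v2 Realtime rate quoted above.
per_minute = 0.39 / 60

# Hypothetical call: 4 minutes connected, 90 seconds of it silence or hold.
billed_full = stt_cost(call_seconds=240, rate_per_hour=0.39)
speech_only = stt_cost(call_seconds=240 - 90, rate_per_hour=0.39)

print(f"per-minute rate:          ${per_minute:.4f}")
print(f"billed on full duration:  ${billed_full:.4f}")
print(f"if only speech was billed: ${speech_only:.4f}")
```

The gap between the two figures is the cost of silence for that call. It is small per call, but it scales with volume and with how much hold time a workflow has.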

3. LLM Cost

The LLM interprets intent, maintains context, decides the next response, calls tools, and generates the text that feeds into TTS. LLMs bill input and output tokens separately, with output tokens often costing 3 to 5 times more than input.

Anthropic lists Claude Sonnet 4.6 at $3 per million input tokens and $15 per million output tokens, with Haiku 4.5 at $1/$5 source. OpenAI’s realtime audio model (gpt-realtime-1.5) charges $32 per million audio input tokens and $64 per million audio output tokens source. Google’s Gemini 3.1 Flash Live prices audio input at $0.005/minute and audio output at $0.018/minute source.

The critical nuance here: LLM cost is not linear with call duration. In a cascaded voice agent, longer conversations require more transcript history in each turn. Retell’s billing documentation states that agents using more than 3,500 LLM tokens in prompts can have billed duration scaled proportionally based on token usage source. Long calls are disproportionately expensive, and the section on the long-call tax later in this article explains why.
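To make the token meter concrete, here is a minimal sketch of the cost of a single LLM turn using the Sonnet and Haiku list prices quoted above. The token counts for the turn are hypothetical, and `llm_turn_cost` is an illustrative helper.

```python
def llm_turn_cost(input_tokens, output_tokens, input_per_million, output_per_million):
    """Cost in dollars of one LLM turn with separate input and output rates."""
    return (input_tokens * input_per_million
            + output_tokens * output_per_million) / 1_000_000

# (input $/M tokens, output $/M tokens) as quoted above.
SONNET = (3.00, 15.00)
HAIKU = (1.00, 5.00)

# Hypothetical turn: 2,000 input tokens (prompt + history), 60 output tokens.
sonnet_turn = llm_turn_cost(2000, 60, *SONNET)
haiku_turn = llm_turn_cost(2000, 60, *HAIKU)

print(f"Sonnet turn: ${sonnet_turn:.4f}")
print(f"Haiku turn:  ${haiku_turn:.4f}")
```

In a voice agent, replies are short and prompts are long, so input tokens usually dominate the turn even though output tokens carry the higher rate.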

4. Text-to-Speech (TTS)

TTS turns the model’s text response into spoken audio. It is often priced per 1,000 characters, per million characters, or per generated audio minute.

Deepgram lists Aura-2 at $0.030 per 1,000 characters source. ElevenLabs charges $0.05 per 1,000 characters for Flash/Turbo and $0.10 for Multilingual v2/v3 source. Cartesia’s Sonic TTS uses 1 credit per character source.

TTS cost depends on how much the agent talks, not how long the call lasts. A verbose agent that over-explains costs more in TTS than a concise one. This is a lever most teams underuse.
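One way to translate a per-character TTS rate into a per-connected-minute figure is sketched below. The speaking rate of 900 characters per spoken minute and the talk shares are assumptions for illustration, not vendor figures; the $0.030 per 1,000 characters is the Aura-2 rate quoted above.

```python
def tts_cost_per_connected_minute(chars_per_spoken_minute, agent_talk_share,
                                  rate_per_1k_chars):
    """Approximate TTS cost for one connected minute of a call.

    agent_talk_share is the fraction of the call the agent spends speaking.
    """
    chars = chars_per_spoken_minute * agent_talk_share
    return chars / 1000 * rate_per_1k_chars

CHARS_PER_SPOKEN_MINUTE = 900  # assumption: roughly conversational pace
RATE = 0.030                   # $ per 1,000 characters, as quoted above

concise_agent = tts_cost_per_connected_minute(CHARS_PER_SPOKEN_MINUTE, 0.4, RATE)
verbose_agent = tts_cost_per_connected_minute(CHARS_PER_SPOKEN_MINUTE, 0.6, RATE)

print(f"agent talks 40% of the call: ${concise_agent:.4f}/min")
print(f"agent talks 60% of the call: ${verbose_agent:.4f}/min")
```

Under these assumptions, the same call length costs 50% more in TTS when the agent's talk share rises from 40% to 60%.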

5. Telephony

Telephony carries the call through PSTN, SIP, or a programmable voice provider. It includes inbound, outbound, toll-free, SIP interface, phone number rental, transfers, recordings, and media streaming.

Twilio’s U.S. Programmable Voice lists local outbound at $0.0140/minute and local inbound at $0.0085/minute source. Telnyx lists Voice API calls at $0.002/minute plus SIP trunking fees source. Retell passes through U.S. telephony at approximately $0.015/minute source.

Telephony can make “cheap AI” look expensive in international, outbound, toll-free, or high-transfer workflows. Teams that already have a Twilio or Telnyx account should check whether the platform supports BYOC (bring your own carrier) via SIP to avoid double-paying for telephony infrastructure.

6. Concurrency and Capacity

Concurrency is the number of calls that can run at the same time. Some platforms include a baseline and charge more for additional slots.

Retell includes 20 concurrent calls on pay-as-you-go and charges $8 per additional concurrent call per month source. Bland caps concurrency by plan: 10 on Start, 50 on Build, 100 on Scale source.

A platform can look affordable at low volume but become a bottleneck if concurrency caps throttle peak demand or outbound campaigns.
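A rough capacity check can be sketched as follows. Average concurrency comes from Little's law (arrival rate times average duration); the 1.5x peak factor and the traffic numbers are hypothetical, and the defaults of 20 included slots and $8 per extra slot are the Retell figures quoted above.

```python
import math

def average_concurrency(calls_per_hour, avg_call_minutes):
    """Little's law: average calls in progress = arrival rate x average duration."""
    return calls_per_hour * avg_call_minutes / 60

def extra_concurrency_cost(peak_concurrent, included=20, price_per_extra=8.0):
    """Monthly charge for concurrent-call slots beyond the included baseline."""
    extra = max(0, peak_concurrent - included)
    return extra * price_per_extra

# Hypothetical traffic: 600 calls/hour at peak, 4 minutes average duration.
avg = average_concurrency(calls_per_hour=600, avg_call_minutes=4)
peak = math.ceil(avg * 1.5)  # assumed headroom for bursts
monthly = extra_concurrency_cost(peak)

print(f"average concurrency: {avg:.0f}")
print(f"planned peak slots:  {peak}")
print(f"extra slot cost:     ${monthly:.2f}/month")
```

On plans with a hard cap instead of an overage price, the same estimate tells you whether the cap will throttle your peak hour.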

7. Add-ons and Operational Features

Production contact centers need more than raw voice. Common add-ons include knowledge bases, denoising, PII removal, AI QA, branded caller ID, batch calling, call recording, answering machine detection, and compliance features.

Retell lists Knowledge Base at +$0.005/minute, Batch Call at +$0.005/dial, Branded Call at +$0.10/outbound call, PII Removal at +$0.01/minute, and AI Quality Assurance at $0.10/minute after the first 100 minutes source. Telnyx lists optional charges for recording, conference calls, transfer, AMD, noise suppression, and media streaming source.

These features affect compliance, debugging, analytics, and reliability. They are not optional for serious deployments. Good voice AI analytics that break down cost by layer, call, and agent are essential for managing these expenses over time.

Common Voice AI API Pricing Models

Not all vendors structure pricing the same way. There are five main models.

Model 1: Platform Fee + Pass-Through Provider Costs

The platform charges an orchestration fee and passes through actual provider costs for STT, TTS, LLM, and telephony. Vapi ($0.05/minute platform + at-cost providers) and SigmaMind ($0.03/minute platform + provider actuals) use this approach.

Good for: Teams that want transparency and the ability to swap providers for cost, quality, or latency reasons. A model-agnostic voice agent platform makes this tuning straightforward.

Tradeoff: Budgeting requires understanding provider-level pricing and usage patterns. Total cost varies with model choice and call behavior.

Model 2: Bundled Flat Per-Minute Pricing

One per-minute rate covers major layers. Bland lists $0.14/minute on Start, $0.12/minute on Build (with a $299/month platform fee), and $0.11/minute on Scale (with a $499/month platform fee), with LLM, STT, TTS, and telephony included source.

Good for: Finance teams that prefer predictable invoices and nontechnical buyers who want a simple number.

Tradeoff: Less control over which providers are used. The bundle may bake in margin for worst-case usage, meaning you might overpay for simple, short calls.
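With tiered bundled plans, the useful question is the monthly volume at which a higher tier becomes cheaper. The sketch below uses the Bland rates quoted above; the $0 monthly fee on Start is an assumption, since no fee is stated for that tier here.

```python
PLANS = {
    # monthly_fee for Start is assumed to be $0 (not stated above).
    "Start": {"monthly_fee": 0.0, "per_minute": 0.14},
    "Build": {"monthly_fee": 299.0, "per_minute": 0.12},
    "Scale": {"monthly_fee": 499.0, "per_minute": 0.11},
}

def monthly_bill(plan, minutes):
    """Fixed platform fee plus usage at the plan's per-minute rate."""
    p = PLANS[plan]
    return p["monthly_fee"] + p["per_minute"] * minutes

def cheapest_plan(minutes):
    """Name of the plan with the lowest total bill at this volume."""
    return min(PLANS, key=lambda name: monthly_bill(name, minutes))

def break_even_minutes(lower, higher):
    """Monthly minutes at which the higher tier's bill equals the lower tier's."""
    a, b = PLANS[lower], PLANS[higher]
    return (b["monthly_fee"] - a["monthly_fee"]) / (a["per_minute"] - b["per_minute"])

for minutes in (5_000, 18_000, 30_000):
    print(minutes, cheapest_plan(minutes), f"${monthly_bill(cheapest_plan(minutes), minutes):,.2f}")
print(f"Start -> Build break-even: {break_even_minutes('Start', 'Build'):,.0f} min")
print(f"Build -> Scale break-even: {break_even_minutes('Build', 'Scale'):,.0f} min")
```

Under these assumptions the tiers cross at roughly 14,950 and 20,000 minutes per month, so the headline per-minute rate on a higher tier only pays off above those volumes.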

A LinkedIn discussion comparing pricing models captured this tension well: practitioners debated whether buyers should value per-minute transparency or flat-rate predictability, concluding that “predictable” and “cheapest” serve different buyer needs source.

Model 3: DIY Component API Pricing

The buyer assembles separate APIs for STT, TTS, LLM, and telephony. This means buying Deepgram for transcription, ElevenLabs or Cartesia for synthesis, OpenAI or Anthropic for reasoning, and Twilio or Telnyx for telephony, then building the orchestration layer in-house.

Good for: Teams with strong engineering capacity that want maximum control and the lowest possible unit costs.

Tradeoff: Requires building and maintaining latency management, streaming, failure recovery, logging, tool calls, telephony integration, and monitoring. A practitioner post from Kwindla Hultman Kramer specifically notes that a self-hosted voice agent stack must cover service discovery, load balancing, logging, monitoring, bandwidth, compliance, multi-region deployment, analytics, customer support, and DevOps salaries source. DIY can reduce vendor margin, but it shifts reliability and operations costs onto your team.

Model 4: Realtime Speech-to-Speech Token Pricing

A single realtime model processes and generates speech directly, billing audio and text tokens separately. OpenAI’s Realtime API and Google’s Gemini Flash Live fit this category.

OpenAI’s gpt-realtime-1.5 charges $32 per million audio input tokens and $64 per million audio output tokens source. Gemini 3.1 Flash Live prices audio at $0.005/minute input and $0.018/minute output source.

Good for: Low-latency, simple conversational use cases.

Tradeoff: Audio-token pricing is harder to forecast. Reddit users have repeatedly questioned what “1M audio tokens” means in practice and how it translates to minutes of conversation source.
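The two realtime billing styles can be sketched side by side. The rates are the ones quoted above; the talk-time split and the token totals are hypothetical. No tokens-per-minute conversion is assumed here, so for token-billed models the counts should come from your own usage logs.

```python
def per_minute_audio_cost(input_minutes, output_minutes, input_rate, output_rate):
    """Cost when audio is metered per minute, with separate in/out rates."""
    return input_minutes * input_rate + output_minutes * output_rate

def audio_token_cost(input_tokens, output_tokens, input_per_million, output_per_million):
    """Cost when audio is metered in tokens, with separate in/out rates."""
    return (input_tokens * input_per_million
            + output_tokens * output_per_million) / 1_000_000

# Per-minute style: hypothetical 5-minute call, caller and agent each speak 2.5 minutes.
minute_billed_call = per_minute_audio_cost(2.5, 2.5, input_rate=0.005, output_rate=0.018)

# Token style: hypothetical totals read from a usage dashboard after a test session.
token_billed_session = audio_token_cost(100_000, 50_000,
                                        input_per_million=32.0, output_per_million=64.0)

print(f"per-minute billed call:   ${minute_billed_call:.4f}")
print(f"token billed test session: ${token_billed_session:.2f}")
```

To forecast a token-billed model, measure audio tokens per connected minute on a handful of real calls first, then plug that measurement into the token formula.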

Model 5: Enterprise Committed-Use Pricing

High-volume users negotiate rates, concurrency, support tiers, compliance arrangements, data residency, and SLAs. Retell offers custom enterprise pricing with dedicated servers, SSO, custom BAA, and 24/7 support source. Anthropic says volume discounts are available and negotiated case-by-case source.

Good for: Contact centers and enterprises running tens of thousands of minutes per month.

Tradeoff: Pricing is opaque until you engage sales. If you need enterprise voice AI pricing with custom telephony, compliance, or deployment, talking to a sales team is the fastest way to get real numbers.

How to Calculate True Voice AI API Cost

The most useful metric is the loaded voice-minute cost: everything that is billed during one minute of connected conversation.

Loaded voice-minute cost =
  platform/orchestration fee
  + STT cost per minute
  + LLM cost per minute (varies by turn)
  + TTS cost per minute (varies by verbosity)
  + telephony cost per minute
  + add-on costs per minute
  + concurrency/capacity allocation
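As a minimal sketch, the formula above is a plain sum of per-minute rates. In the example, the platform, STT, and telephony values are rates quoted earlier in this article ($0.05 platform fee, $0.0048 STT, $0.014 outbound telephony); the LLM and TTS per-minute values are hypothetical placeholders that depend on your prompts and agent verbosity.

```python
def loaded_minute_cost(platform, stt, llm, tts, telephony,
                       addons=0.0, concurrency=0.0):
    """Sum every per-minute meter that runs during one connected minute."""
    return platform + stt + llm + tts + telephony + addons + concurrency

headline = 0.05  # advertised platform fee per minute
loaded = loaded_minute_cost(
    platform=headline,
    stt=0.0048,
    llm=0.02,        # hypothetical: depends on model, prompt size, and turns
    tts=0.011,       # hypothetical: depends on agent talk share and voice
    telephony=0.014,
)

print(f"headline rate: ${headline:.4f}/min")
print(f"loaded rate:   ${loaded:.4f}/min")
print(f"loaded / headline: {loaded / headline:.1f}x")
```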

Then, to estimate monthly spend:

Monthly cost =
  loaded voice-minute cost × monthly connected minutes
  + phone number rental
  + fixed platform subscription (if any)
  + support/compliance/security costs
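The monthly formula can be sketched the same way. All inputs in the example are hypothetical, including the phone number rental.

```python
def monthly_cost(loaded_minute_cost, connected_minutes,
                 phone_number_rental=0.0, platform_subscription=0.0,
                 support_compliance_security=0.0):
    """Usage-based spend plus the fixed monthly items."""
    usage = loaded_minute_cost * connected_minutes
    fixed = phone_number_rental + platform_subscription + support_compliance_security
    return usage + fixed

# Hypothetical month: $0.10 loaded rate, 20,000 connected minutes, $10 in number rental.
estimate = monthly_cost(loaded_minute_cost=0.10, connected_minutes=20_000,
                        phone_number_rental=10.0)
print(f"estimated monthly spend: ${estimate:,.2f}")
```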

But the number that actually matters for the business is cost per resolved call:

Cost per resolved call =
  (total AI voice cost + human handoff cost + failed/retry cost)
  ÷ number of successfully resolved calls

For outbound campaigns, use cost per qualified conversation:

Cost per qualified conversation =
  total campaign cost
  ÷ number of calls that reached the target outcome
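The two outcome metrics above can be sketched as follows. Every figure in the example is hypothetical; the point is that a vendor with lower AI spend can still lose on cost per resolved call if more calls fall through to humans.

```python
def cost_per_resolved_call(ai_voice_cost, human_handoff_cost, failed_retry_cost,
                           resolved_calls):
    """Total spend divided by successfully resolved calls."""
    if resolved_calls <= 0:
        raise ValueError("resolved_calls must be positive")
    return (ai_voice_cost + human_handoff_cost + failed_retry_cost) / resolved_calls

def cost_per_qualified_conversation(total_campaign_cost, qualified_conversations):
    """Campaign spend divided by calls that reached the target outcome."""
    if qualified_conversations <= 0:
        raise ValueError("qualified_conversations must be positive")
    return total_campaign_cost / qualified_conversations

# Hypothetical vendor A: cheaper per minute, but resolves fewer calls.
vendor_a = cost_per_resolved_call(1500, 1500, 200, resolved_calls=800)
# Hypothetical vendor B: pricier per minute, but fewer handoffs and retries.
vendor_b = cost_per_resolved_call(2000, 600, 150, resolved_calls=1100)
campaign = cost_per_qualified_conversation(1200, 80)

print(f"vendor A: ${vendor_a:.2f} per resolved call")
print(f"vendor B: ${vendor_b:.2f} per resolved call")
print(f"campaign: ${campaign:.2f} per qualified conversation")
```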

A Reddit thread on outbound AI calls puts it bluntly: the metric that matters is cost per qualified conversation, not the raw per-minute price, because retry rate, pickup rate, conversion rate, and call quality change the economics entirely source.

If you struggle with tracking these metrics, this guide on tracking cost per support call breaks down the methodology in more detail.

The Long-Call Tax: Why LLM Costs Grow Faster Than Call Duration

This is one of the most underappreciated dynamics in voice AI API pricing.

STT and telephony costs scale roughly linearly with call duration, and TTS scales with how much the agent speaks, which also tends to grow linearly. LLM cost does not. In a cascaded voice agent, the model receives the full conversation transcript (plus system prompt, tool definitions, and tool results) on every turn. As the conversation grows, the token count per turn grows with it.

Kwindla Hultman Kramer, a practitioner who shared a detailed cost breakdown on LinkedIn, found that in a common production stack (Deepgram + GPT-4o + Cartesia), the cost split for a typical short call was roughly: STT 28%, LLM 23%, TTS 48%, hosting under 1%. But for a 30-minute conversation, the LLM share rose to 47% because conversation history had to be repeatedly sent to the model source.

This means short support calls and long advisory calls should not use the same cost assumptions. For customer support workflows with predictably short calls, TTS is likely the dominant cost. For sales, onboarding, or consulting calls that run long, LLM cost will take over.

One practical response: summarize or compress conversation history instead of sending the full transcript every turn. This reduces both cost and latency.
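The growth pattern can be shown with a small simulation. This is a simplified model under stated assumptions: a fixed system prompt, a fixed number of new tokens added to the history per turn, and an optional summarization step that replaces the history with a shorter summary once it passes a threshold. The cost of the summarization call itself is not modeled, and all token counts are hypothetical.

```python
def cumulative_input_tokens(turns, system_tokens, tokens_per_turn,
                            summarize_after=None, summary_tokens=0):
    """Total input tokens sent to the LLM across all turns of one call."""
    total = 0
    history = 0
    for _ in range(turns):
        # Each turn resends the system prompt plus the history so far.
        total += system_tokens + history
        history += tokens_per_turn
        # Optionally compress the history once it grows past the threshold.
        if summarize_after is not None and history > summarize_after:
            history = summary_tokens
    return total

short_call = cumulative_input_tokens(turns=10, system_tokens=1000, tokens_per_turn=100)
long_call = cumulative_input_tokens(turns=60, system_tokens=1000, tokens_per_turn=100)
long_call_summarized = cumulative_input_tokens(
    turns=60, system_tokens=1000, tokens_per_turn=100,
    summarize_after=1000, summary_tokens=300,
)

print(f"10 turns, full history: {short_call:,} input tokens")
print(f"60 turns, full history: {long_call:,} input tokens")
print(f"60 turns, summarized:   {long_call_summarized:,} input tokens")
```

In this model, six times as many turns produce about sixteen times as many input tokens when the full history is resent, while the summarized variant stays close to linear.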

Hidden Costs to Watch

Beyond the seven cost layers, several expenses catch teams off guard:

Silence and hold time. If billing runs for the entire connected call, pauses and hold time cost money. The STT engine stays active even when nobody is talking.

Long prompts and tool definitions. System prompts, tool schemas, retrieved knowledge base content, and accumulated transcript all count as input tokens. The more you add, the more each turn costs.

Premium voices on every call. A LinkedIn practitioner post argues that not every call needs high-fidelity emotional voice quality, suggesting teams route premium TTS to retention and sales calls while using cheaper voices for support and FAQ source.

International or toll-free telephony. U.S. local rates may be $0.002 to $0.014 per minute, but international outbound or toll-free inbound rates can be multiples of that.

Transfers and post-transfer telephony. When a voice agent transfers to a human, who pays for the rest of the call? Some platforms continue billing. Others stop.

Failed calls, voicemails, and retries. Outbound campaigns generate many unanswered calls, voicemails, and answering machine detections. Each still costs something, even if it produces no outcome.

Compliance and security features. PII redaction, call recording, HIPAA-friendly workflows, SSO, audit logs, and BAAs are often locked behind enterprise plans. Ask about this upfront.

How to Reduce Voice AI API Cost Without Hurting Call Quality

Seven tactics that work:

1. Use the cheapest model that can complete the task. Anthropic explicitly recommends choosing Haiku for simple tasks and Sonnet for complex reasoning source. Not every FAQ call needs a frontier model.

2. Shorten prompts and remove unused tools. Retell’s billing exception for prompts above 3,500 tokens demonstrates that bloated prompts can directly increase billed duration source. Trim what is not actively needed.

3. Route premium TTS to high-value flows only. Use a natural but inexpensive voice for appointment confirmations and utility calls. Reserve premium voices for sales and retention conversations.

4. Summarize long conversation history. Instead of sending the full transcript on every LLM turn, compress earlier portions into a summary. This cuts token usage and reduces latency in long calls.

5. Track cost by layer, not just total invoice. If you cannot see how much you spend on STT vs. TTS vs. LLM vs. telephony per call, you cannot optimize. Look for platforms that expose per-layer cost breakdowns.

6. Watch latency. Slower systems lengthen calls, which increases the billed minutes per call. Cartesia claims Sonic-3 has 90ms time-to-first-audio source. Faster TTS and lower orchestration latency mean shorter, more natural calls.

7. Use BYOC where it improves economics. If you already have a Twilio or Telnyx account with negotiated rates, bringing your own carrier via SIP avoids paying a second layer of telephony markup.

To measure whether cost reductions are actually hurting quality, track resolution rate, escalation rate, and caller satisfaction alongside per-minute cost.
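Tactic 3 can be estimated with a blended rate. The sketch below uses the two ElevenLabs per-1,000-character rates quoted earlier ($0.05 and $0.10); the monthly character volume and the share of traffic routed to the premium voice are hypothetical.

```python
def blended_tts_rate(premium_share, premium_rate, standard_rate):
    """Average $ per 1,000 characters when only part of the traffic gets the premium voice."""
    if not 0.0 <= premium_share <= 1.0:
        raise ValueError("premium_share must be between 0 and 1")
    return premium_share * premium_rate + (1 - premium_share) * standard_rate

MONTHLY_CHARACTERS = 10_000_000  # hypothetical volume

all_premium = blended_tts_rate(1.0, premium_rate=0.10, standard_rate=0.05)
routed = blended_tts_rate(0.2, premium_rate=0.10, standard_rate=0.05)

print(f"all premium: ${all_premium * MONTHLY_CHARACTERS / 1000:,.2f}/month")
print(f"20% premium: ${routed * MONTHLY_CHARACTERS / 1000:,.2f}/month")
```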

Pricing Meter Decoder

Different billing units appear on different pricing pages. This table translates them.

| Meter | Usually applies to | Why it matters |
| --- | --- | --- |
| Connected minute | Platform, telephony, voice agent API | Easiest to forecast, but may hide layers |
| Audio minute | STT, recording, denoising | Runs during silence |
| Generated character | TTS | Verbose agents cost more |
| Input token | LLM prompt, transcript, context, tools | Long prompts and history increase cost |
| Output token | LLM response | Often 3 to 5x more expensive than input |
| Concurrent call | Capacity | Peak load costs even if average volume is low |
| Phone number/month | Telephony | Small fee, but multiplies across numbers/regions |
| Per transfer | Call handoff | Important for support and sales |
| Per dial | Outbound campaigns | Includes unanswered calls and voicemails |

Pricing Factors by Use Case

The cost drivers change significantly depending on what the voice agent does.

| Use case | Main cost drivers |
| --- | --- |
| Customer support | Call length, knowledge base, human transfers, recordings, QA, PII redaction |
| Appointment scheduling | Calendar tool calls, short calls, telephony, reminders, retries |
| Outbound sales / lead qualification | Pickup rate, retry rate, branded caller ID, voicemail detection, cost per qualified lead |
| Debt collection | Compliance, call recording, transfer rules, identity verification |
| Healthcare scheduling | BAA, PII/PHI handling, audit logs, human escalation |
| E-commerce refunds | Tool calls, order lookup, policy logic, payment integrations |

Questions to Ask Before Choosing a Voice AI API

Use this checklist when evaluating any vendor’s voice AI API pricing:

  1. Does the advertised per-minute rate include STT, TTS, LLM, and telephony?
  2. Which costs are pass-through provider charges vs. platform margin?
  3. Are calls billed per second, per minute, or rounded up?
  4. Do you bill for silence, hold time, voicemails, and failed calls?
  5. What happens to billing after a human transfer?
  6. What are inbound, outbound, toll-free, SIP, and international rates?
  7. What are phone number fees?
  8. What are concurrency limits and overage rates?
  9. Are knowledge base, denoising, PII redaction, QA, recordings, and analytics included?
  10. Do long prompts, tool calls, or large context windows change pricing?
  11. Can we select STT, TTS, and LLM providers per workflow?
  12. Can we bring our own Twilio, Telnyx, or SIP carrier?
  13. Can we view cost per call, per agent, per workflow, and per provider?
  14. Are HIPAA/BAA, SSO, audit logs, and private cloud available? Are they included or enterprise-only?
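Question 3 matters more than it looks when calls are short. The sketch below compares per-second billing with rounding each call up to the next full minute; the call durations are hypothetical.

```python
import math

def billed_minutes(call_seconds, increment_seconds):
    """Round a call up to the next billing increment and return billed minutes."""
    increments = math.ceil(call_seconds / increment_seconds)
    return increments * increment_seconds / 60

calls = [35, 61, 125, 10]  # hypothetical call durations in seconds

per_second = sum(billed_minutes(s, 1) for s in calls)
per_minute = sum(billed_minutes(s, 60) for s in calls)

print(f"per-second billing:   {per_second:.2f} billed minutes")
print(f"rounded up to minute: {per_minute:.2f} billed minutes")
print(f"rounding overhead:    {per_minute / per_second - 1:.0%}")
```

For this mix of short calls, minute rounding bills 7 minutes for 3.85 minutes of actual talk time, so two vendors with the same per-minute rate can produce very different invoices.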

Voice AI API Pricing vs. Related Terms

| Term | Meaning |
| --- | --- |
| Voice AI API pricing | Full pricing model for APIs used to power AI voice interactions |
| AI voice agent pricing | Pricing for a complete voice agent product or platform |
| Speech-to-text pricing | Cost to transcribe incoming audio |
| Text-to-speech pricing | Cost to generate spoken output |
| Realtime API pricing | Pricing for low-latency multimodal interaction, often token-based |
| Telephony pricing | Cost to place, receive, route, transfer, and record phone calls |
| Loaded voice-minute cost | All active voice AI costs normalized to one connected minute |
| Cost per resolved call | Total cost divided by successful resolutions |
| BYOC | Bring your own carrier, usually via SIP or providers like Twilio/Telnyx |
| Concurrency pricing | Pricing or limits based on simultaneous active calls |

Estimate Your Voice AI API Costs

Voice AI API pricing rewards teams that understand the full stack, not just the headline rate. The cheapest per-minute provider can become the most expensive if it causes longer calls, forces premium voices everywhere, bills silence aggressively, or hides concurrency limits.

The best approach: pick a pricing model that matches your team’s technical capacity and need for predictability, then optimize by use case. Use the loaded voice-minute formula to compare vendors honestly. Track cost per resolved call or cost per qualified conversation to measure real business impact.

If you want to see how these layers add up for your specific call volume and provider choices, estimate your voice AI costs on SigmaMind’s pricing page. Or, if you’d rather test a workflow first, you can start building a voice agent for free and pay only for what you use.


FAQ

What is voice AI API pricing?

Voice AI API pricing is the billing structure for APIs that power real-time AI voice interactions. It typically spans multiple cost layers: a platform orchestration fee, speech-to-text, LLM reasoning, text-to-speech, telephony, concurrency, and add-ons like recordings, PII redaction, and knowledge base access. The true cost is the sum of all active layers during a call, not just the advertised per-minute rate.

How much does a voice AI API cost per minute?

It depends on the pricing model and providers chosen. The advertised rates cited in this article range from $0.03 per minute (platform fee only) to $0.14 per minute (bundled), but the loaded cost (including STT, LLM, TTS, and telephony) typically lands between $0.07 and $0.30+ per minute for a fully functional voice agent. Retell lists pay-as-you-go voice agents at $0.07 to $0.31/minute depending on configuration source.

Does voice AI API pricing include telephony?

Sometimes. Bland says its flat per-minute rate includes telephony source. Vapi and SigmaMind charge telephony separately, either at cost or through BYOC SIP integrations. Always ask whether telephony is bundled or billed as an additional layer.

Why do some voice AI platforms look much cheaper than others?

Usually because they are quoting only one layer. A platform fee of $0.03 or $0.05 per minute looks cheap, but adding STT, LLM, TTS, and telephony can triple or quadruple the real cost. Bundled platforms quote higher headline numbers because more layers are included. The only honest comparison uses loaded voice-minute cost.

How do LLM tokens affect voice agent cost?

LLM cost increases with every turn because the model receives the conversation transcript, system prompt, tool definitions, and any retrieved context. For short calls, LLM cost might be 20 to 25% of the total. For 30-minute conversations, it can rise to nearly half the total cost because accumulated history is sent on each turn source.

Is bundled per-minute pricing better than pass-through pricing?

Neither is universally better. Bundled pricing is simpler to budget and good for teams that want predictability. Pass-through pricing gives more control over provider selection and lets teams optimize cost, quality, and latency per workflow. Engineering-heavy teams tend to prefer pass-through. Finance-driven teams tend to prefer bundled.

What is the best metric: cost per minute or cost per resolved call?

Cost per resolved call. A cheap per-minute rate means nothing if the agent fails to resolve issues, causes excessive retries, or escalates most calls to humans. Practitioners on Reddit consistently argue that cost per qualified conversation is the metric that actually reflects business value source.

What hidden fees should I watch for in voice AI pricing?

The most common surprises: silence billing (STT running during pauses), long-prompt surcharges, premium TTS applied to all calls, international telephony rates, concurrency overages, transfer costs, failed-call charges, and compliance features locked behind enterprise plans. Ask about each before committing.
