Start for free, pay as you go

Usage-based pricing with no upfront commitments: just pay for what you use.

Free

Evaluate platform capabilities in a low-risk, self-serve environment. Includes limited concurrency for development and testing.
  • Access to industry-leading models featuring Speech-to-Text and Audio Intelligence capabilities
  • $50 in free credits, good for up to 185 hours of pre-recorded audio or 333 hours of streaming audio transcription
  • Process up to five new streams per minute in parallel
  • Process up to five pre-recorded audio files in parallel
Get your API key
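
The hour figures quoted for the free credits follow from dividing the credit balance by the per-hour rates listed further down this page ($0.27/hr for pre-recorded audio, $0.15/hr for streaming). A minimal sketch of that arithmetic; the function and constant names are illustrative only and are not part of any AssemblyAI SDK:

```python
from decimal import Decimal, ROUND_DOWN

# Rates copied from this page's pay-as-you-go tables (USD per audio hour).
PRE_RECORDED_RATE = Decimal("0.27")  # pre-recorded Speech-to-Text
STREAMING_RATE = Decimal("0.15")     # Universal-Streaming
FREE_CREDITS = Decimal("50")


def hours_covered(credits: Decimal, rate_per_hour: Decimal) -> int:
    """Whole hours of audio a credit balance covers at a flat hourly rate."""
    hours = credits / rate_per_hour
    return int(hours.to_integral_value(rounding=ROUND_DOWN))


print(hours_covered(FREE_CREDITS, PRE_RECORDED_RATE))  # 185
print(hours_covered(FREE_CREDITS, STREAMING_RATE))     # 333
```

Enabling any add-on model billed per hour would reduce these totals, since this sketch only accounts for the base transcription rate.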

Pay as you go

Flexible, production-ready access with high default concurrency and automatic scaling.
  • Scalable processing with unlimited concurrency available and customizable rate limits for any workload
  • Dedicated support with sub-hour response times and optional custom SLAs and SLOs
  • Enterprise-grade security including DPA available for EU data residency and BAA available for HIPAA compliance
  • Flexible deployment options including self-hosted options (On-prem, VPC coming soon) and purchasing through AWS Marketplace
Set up billing

Customize your plan

Need a plan tailored to your business? Whether you're an enterprise processing millions of hours, need dedicated infrastructure, or require custom model configurations, our team will work with you to create a solution that aligns with your business requirements.

Pre-recorded Speech-to-Text

Build on top of the most accurate Speech-to-Text model on the market with >93% accuracy.
Models
Pay as you go
Custom
Slam-1
Highest accuracy transcription powered by LLM intelligence—understands context, not just words. Only available in English.
$0.27/hr
Get custom rate limits, enhanced concurrency, and enterprise-grade flexibility tailored to your AI workloads
Universal
Fast, accurate transcription across 99 languages—exceptional accuracy straight out of the box.
$0.27/hr

Streaming Speech-to-Text

Transcribe live audio and video in real time with ultra-low latency and high accuracy. Leverage automatic punctuation and casing, next-gen end-of-turn detection, and ITN/formatting.
Models
Pay as you go
Custom
Universal-Streaming
Ultra-fast, ultra-accurate real-time transcription with built-in turn detection and unlimited concurrency.
$0.15/hr
Get custom rate limits, enhanced concurrency, and enterprise-grade flexibility tailored to your AI workloads

Speech Understanding

Audio Intelligence
AI models to summarize speech, redact personal information, detect hateful content, identify spoken topics, and more.
Models
Pay as you go
Custom
Entity Detection

Identify a wide range of entities that are spoken in your audio files, such as person and company names, email addresses, dates, and locations.

$0.08/hr
Get custom rate limits, enhanced concurrency, and enterprise-grade flexibility tailored to your AI workloads
Topic Detection

Label the topics that are spoken in your audio and video files. The predicted topic labels follow the standardized IAB Taxonomy, which makes them suitable for contextual targeting.

$0.15/hr
Key Phrases

Accurately identify significant words and phrases in your transcription, enabling you to extract the most pertinent concepts or highlights from your audio/video file.

$0.01/hr
PII Audio Redaction
$0.05/hr
PII Redaction

Identify and remove Personally Identifiable Information, such as phone numbers and social security numbers, from the transcription text before it is returned to you.

$0.08/hr
Sentiment Analysis

With Sentiment Analysis, AssemblyAI can detect the sentiment of each sentence of speech spoken in your audio files.

$0.02/hr
Content Moderation

Detect sensitive content in your audio and video files - such as hate speech, violence, sensitive social issues, alcohol, drugs, and more.

$0.15/hr
Auto Chapters

Automatically generate a summary over time for audio and video files.

$0.08/hr
Summarization

Leverage our AI-powered Summarization models to automatically summarize audio/video data in your products at scale. Customize the summary types to best fit your use case.

$0.03/hr
Models
Pay as you go
Custom
Claude 4 Opus

Model with superior performance on complex reasoning tasks, advanced creative work, and sophisticated problem-solving.

$0.015 / 1k tokens (input)
$0.075 / 1k tokens (output)
Claude 4 Sonnet

Model with enhanced reasoning and improved performance for everyday tasks while maintaining speed and cost-effectiveness.

$0.003 / 1k tokens (input)
$0.015 / 1k tokens (output)
Claude 3.7 Sonnet

Offers enhanced reasoning capabilities, strong at complex reasoning tasks.

$0.003 / 1k tokens (input)
$0.015 / 1k tokens (output)
Claude 3.5 Sonnet

A mid-tier upgrade balancing power and performance.

$0.003 / 1k tokens (input)
$0.015 / 1k tokens (output)
Claude 3.5 Haiku

The fastest model in the family, optimized for quick responses while maintaining good reasoning.

$0.0008 / 1k tokens (input)
$0.004 / 1k tokens (output)
Claude 3 Opus

The most powerful legacy Claude 3 model, excels at complex writing and analysis.

$0.015 / 1k tokens (input)
$0.075 / 1k tokens (output)
Claude 3 Haiku

A legacy model with a balanced combination of performance and speed for efficient, high-throughput tasks.

$0.00025 / 1k tokens (input)
$0.00125 / 1k tokens (output)

Security and Privacy

AssemblyAI uses enterprise-grade security practices to keep your data safe. We approach security by design and default, and continuously ensure AssemblyAI is secure for you and your team.
  • GDPR
  • PCI DSS
  • SOC 2 Type 2
  • EU Data Residency
  • ISO 27001
  • HIPAA Compliance

Frequently Asked Questions

What are the differences between Speech-to-Text models?

Universal is a high-accuracy model built for general-purpose use cases across 99 languages. It offers strong out-of-the-box performance and supports features like speaker diarization and real-time streaming.

Slam-1 is our most advanced speech language model, designed specifically for speech tasks. It uses a prompt-based architecture for deeper contextual understanding and allows domain-specific customization with no retraining needed, which makes it well suited to legal, medical, and other specialized use cases. It is only available in English.

Universal-Streaming is an ultra-fast, ultra-accurate streaming speech-to-text model designed for voice agents.

Can I sign up for free?

Yes! With the free offer, you get $50 in credits to use towards AssemblyAI’s Speech-to-Text APIs. To add more credits, simply add a credit card to your account.

Do you offer volume discounts?

Absolutely! If you plan to send large volumes of audio and video content through our API, please contact us to see if you qualify for a volume discount.

How does Universal-Streaming concurrency work?

We don't limit how many streams you can run simultaneously - only how quickly you can start new ones, giving you unlimited scale while ensuring reliable performance.

Free users can start 5 new streams per minute, while pay-as-you-go accounts start with a limit of 100 new streams per minute, which automatically grows by 10% each minute you're at capacity. This means that within 5 minutes of sustained usage, you can scale from 100 to 146 new streams per minute (for a total of 610 concurrent streams), with no ceiling as your usage grows.
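
The 146 and 610 figures can be reproduced by compounding the 10% growth minute by minute. The sketch below is only an illustration of that arithmetic: it assumes the account is at capacity every minute, that fractional limits are rounded down to whole streams, and that every stream opened stays open. The function name is hypothetical.

```python
from fractions import Fraction
from math import floor


def new_stream_limits(base: int = 100,
                      growth: Fraction = Fraction(1, 10),
                      minutes: int = 5) -> list[int]:
    """Per-minute limit on new streams while the account stays at capacity.

    Assumption: fractional limits are rounded down to whole streams.
    """
    limits = []
    limit = Fraction(base)
    for _ in range(minutes):
        limits.append(floor(limit))
        limit *= 1 + growth  # limit grows by 10% after each minute at capacity
    return limits


limits = new_stream_limits()
print(limits)       # [100, 110, 121, 133, 146]
print(sum(limits))  # 610 streams opened in total over the five minutes
```

Exact fractions are used instead of floats so that the rounding step is not affected by floating-point error.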

These limits are designed to never interfere with legitimate applications - normal scaling patterns automatically get more capacity before hitting any walls, while protecting against runaway scripts or abuse. Your baseline limit is guaranteed and never decreases, so you can scale smoothly from dozens to thousands of simultaneous streams without artificial barriers or surprise fees.

Need higher limits? Contact our sales team for custom limits that match your deployment timeline.

How does Universal-Streaming session-based pricing work?

We charge based on total session duration - the entire time your connection stays open, whether audio is flowing or not. This gives you complete transparency and control: you pay for exactly what you're using, with no hidden costs for idle streams. You can choose to keep streams open continuously for instant response or open them strategically as needed to minimize costs, scaling up and down without prepaid commitments based on how your voice application actually works.
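
As a rough illustration of session-based pricing, the cost of a session depends only on how long the connection stays open. The sketch below uses the $0.15/hr Universal-Streaming rate from this page and assumes per-second proration, which this answer does not state explicitly. The function name is hypothetical.

```python
from decimal import Decimal

STREAMING_RATE_PER_HOUR = Decimal("0.15")  # Universal-Streaming rate from this page


def session_cost(open_seconds: int,
                 rate_per_hour: Decimal = STREAMING_RATE_PER_HOUR) -> Decimal:
    """Cost of one session, based on how long the connection stays open.

    Assumption: duration is prorated per second.
    """
    return rate_per_hour * Decimal(open_seconds) / Decimal(3600)


# A connection held open for 20 minutes costs the same whether or not
# audio was flowing the whole time.
print(session_cost(20 * 60))  # 0.05
```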

How long does it take for audio and video files to process?

Most audio files sent to AssemblyAI's API can be processed in less than 60 seconds. For example, you can process a 30-minute pre-recorded audio file in 23 seconds with the Universal speech-to-text model.

How does billing work?

Once you add a credit card and deposit funds into your account, your balance is drawn down as you use the API.

How is multichannel billed?

When multichannel is enabled, each channel is transcribed and billed separately. To calculate your total cost, multiply your recording's duration by the hourly transcription rate (billed per second), then multiply that result by the number of channels.

For example, if you sent a 5-minute recording with three channels, you would be billed for the 5 minutes of audio multiplied by the standard rate, with that total multiplied by three channels. This is equivalent to being billed for 15 minutes of audio.
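
The multichannel calculation above can be sketched as follows. The $0.27/hr rate is taken from the pre-recorded pricing on this page; the function names are illustrative and not part of any AssemblyAI SDK.

```python
from decimal import Decimal


def billed_seconds(duration_seconds: int, channels: int) -> int:
    """Each channel is billed separately, so billed time scales with channel count."""
    return duration_seconds * channels


def multichannel_cost(duration_seconds: int, channels: int,
                      rate_per_hour: Decimal) -> Decimal:
    """Recording duration x hourly rate (prorated per second) x number of channels."""
    seconds = billed_seconds(duration_seconds, channels)
    return rate_per_hour * Decimal(seconds) / Decimal(3600)


# The 5-minute, three-channel example from the text:
print(billed_seconds(5 * 60, 3) // 60)                  # 15 billed minutes
print(multichannel_cost(5 * 60, 3, Decimal("0.27")))    # 0.0675
```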

Can I purchase or use AssemblyAI through the AWS Marketplace?

Yes. You can get started with AssemblyAI on the AWS Marketplace, or ask your AWS account team how to use AssemblyAI to improve the way your company understands its customers.

How can I talk to someone?

Feel free to email us at support@assemblyai.com, or click the chat button in the bottom right corner of your browser to chat live with our API Support team!

What languages do you support?

We support over 99 languages and counting, including Global English (English and all of its accents).

What is a token?

In the context of a Large Language Model (LLM), a “token” is the smallest unit of text processed by the model. 100 tokens correspond to roughly 75 words.
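
To get a rough sense of what a request might cost, the words-to-tokens rule of thumb above can be combined with the per-1k-token prices on this page. This is only an estimate: real token counts depend on the model's tokenizer, and the function names below are hypothetical.

```python
from decimal import Decimal


def estimate_tokens(word_count: int) -> int:
    """Approximate token count using the 100 tokens per 75 words rule of thumb."""
    return -(-word_count * 100 // 75)  # integer ceiling division


def estimate_cost(input_words: int, output_words: int,
                  input_price_per_1k: Decimal,
                  output_price_per_1k: Decimal) -> Decimal:
    """Estimated USD cost of one request from approximate word counts."""
    input_cost = Decimal(estimate_tokens(input_words)) * input_price_per_1k / 1000
    output_cost = Decimal(estimate_tokens(output_words)) * output_price_per_1k / 1000
    return input_cost + output_cost


# A 7,500-word transcript summarized into 750 words at the Claude 4 Sonnet
# prices listed on this page ($0.003 input, $0.015 output per 1k tokens).
print(estimate_tokens(7500))  # 10000
print(estimate_cost(7500, 750, Decimal("0.003"), Decimal("0.015")))  # 0.045
```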

Turn voice data into unparalleled product experiences

Partner with the leader in Speech AI to build powerful products with breakthrough industry impact.