Build an AI Voice Agent with DeepSeek R1, AssemblyAI, and ElevenLabs
Learn how to build a real-time AI voice agent using AssemblyAI, DeepSeek R1 via Ollama, and ElevenLabs. With AI voice agents handling an increasing share of customer interactions, this guide walks you through setting up transcription, AI-powered responses, and voice synthesis for seamless automation.

AI voice agents are revolutionizing how businesses interact with users, enabling real-time conversations with intelligent and human-like responses. According to a16z, financial services alone make up 25% of total global contact center spending and account for over $100 billion annually in business process outsourcing (BPO). With voice agents increasingly completing complex tasks successfully, businesses are rapidly adopting automated voice interactions to improve efficiency and customer experience.
In this tutorial, you'll learn how to build an AI voice agent that transcribes speech in real time using AssemblyAI's Universal-Streaming, generates intelligent responses using DeepSeek R1 (7B model) via Ollama, and converts those responses into speech with ElevenLabs. The entire process happens with ultra-low latency, allowing for smooth and interactive voice-based applications that feel natural and responsive.
By the end of this tutorial, you'll have a fully functional AI voice agent that can transcribe, process, and respond to spoken queries with industry-leading accuracy and speed. If you prefer watching the video, check it out below:
Key components of the AI voice agent
The AI voice agent consists of three major components:
- Real-time speech-to-text: AssemblyAI's Universal-Streaming model converts spoken words into text with ultra-low latency (~300ms) and superior accuracy, specifically designed for voice agents.
- AI-powered responses: DeepSeek R1 (7B model) via Ollama generates context-aware responses with intelligent conversation flow.
- Text-to-speech synthesis: ElevenLabs converts the AI-generated text response into natural-sounding audio.
In this step-by-step guide, we'll install the necessary dependencies, configure the AI agent, and run the complete system using AssemblyAI's cutting-edge Universal-Streaming technology that delivers the accuracy and speed voice agents need.
Step 1: Install required dependencies
Before you begin, ensure you have all the required dependencies installed.
1.1 Get API Keys
To use AssemblyAI and ElevenLabs, sign up for their API keys:
- AssemblyAI (Speech-to-Text): Sign up for a free API key and get $50 in free credits to get started
- ElevenLabs (Text-to-Speech): Sign up for an account
1.2 Install Ollama
DeepSeek R1 is accessed via Ollama. Install Ollama by following the instructions here.
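For example, on Linux Ollama provides an official install script (macOS and Windows installers are available from ollama.com):
curl -fsSL https://ollama.com/install.sh | sh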
1.3 Install PortAudio (Required for real-time transcription)
For Debian/Ubuntu:
sudo apt install portaudio19-dev
For macOS:
brew install portaudio
1.4 Install Python Libraries
Run the following command to install the required Python dependencies:
pip install "assemblyai[extras]" ollama elevenlabs
1.5 (macOS only) Install mpv for Audio Streaming
brew install mpv
Step 2: Download the DeepSeek R1 model
Since this script uses DeepSeek R1 via Ollama, download the model locally by running:
ollama pull deepseek-r1:7b
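To verify the download, you can send the model a quick one-off prompt from the terminal:
ollama run deepseek-r1:7b "Reply with one short sentence."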
Step 3: Implement the AI voice agent
Create a new Python file called AIVoiceAgent.py and add the following code.
3.1 Import required libraries
import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    StreamingSessionParameters,
    TerminationEvent,
    TurnEvent,
)
from elevenlabs.client import ElevenLabs
from elevenlabs import stream
import ollama
The first step is to import the necessary libraries. AssemblyAI's Universal-Streaming is used for real-time speech-to-text transcription with superior accuracy and ultra-low latency, Ollama allows us to interact with DeepSeek R1, and ElevenLabs is responsible for converting AI-generated text into speech. These tools together form the core of our AI voice agent, providing the developer-friendly integration that makes building production-ready voice applications straightforward.
3.2 Define the AI voice agent class
Next, we define the AIVoiceAgent class, which initializes our API keys and sets up the conversation state.
class AIVoiceAgent:
    def __init__(self):
        # Replace these placeholders with your own API keys
        aai.settings.api_key = "ASSEMBLYAI_API_KEY"
        self.elevenlabs_client = ElevenLabs(api_key="ELEVENLABS_API_KEY")
        self.streaming_client = None
        self.full_transcript = [
            {"role": "system", "content": "You are a language model called R1 created by DeepSeek. Answer in less than 300 characters."}
        ]
This class stores API keys for AssemblyAI and ElevenLabs and initializes full_transcript, which keeps track of the conversation history. The system message defines the behavior of the AI model, ensuring that responses are concise and relevant. AssemblyAI's Universal-Streaming provides the reliable foundation for accurate speech recognition that powers the entire conversation flow.
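Hardcoded placeholder keys keep the tutorial simple, but for any shared code you would typically load them from environment variables instead. A minimal sketch of the two key lines in __init__, assuming the keys are exported as ASSEMBLYAI_API_KEY and ELEVENLABS_API_KEY:
import os

# Read keys from the environment instead of hardcoding them
aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
self.elevenlabs_client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])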
3.3 Set up real-time transcription with Universal-Streaming
We then define the function that handles streaming real-time transcription using AssemblyAI's latest Universal-Streaming model.
    def start_transcription(self):
        print("\nReal-time transcription with Universal-Streaming:", end="\r\n")

        # Configure streaming parameters for optimal voice agent performance
        streaming_params = StreamingParameters(
            sample_rate=16000,
            format_turns=True,  # Enable intelligent turn detection
            min_end_of_turn_silence_when_confident=160,  # Optimized for voice agents
        )

        self.streaming_client = StreamingClient(
            StreamingClientOptions(api_key=aai.settings.api_key)
        )

        # Set up event handlers for Universal-Streaming
        self.streaming_client.on(StreamingEvents.Begin, self.on_begin)
        self.streaming_client.on(StreamingEvents.Turn, self.on_turn)
        self.streaming_client.on(StreamingEvents.Termination, self.on_terminated)
        self.streaming_client.on(StreamingEvents.Error, self.on_error)

        # Start the streaming session
        self.streaming_client.connect(streaming_params)

        # Stream microphone audio
        microphone_stream = aai.extras.MicrophoneStream(sample_rate=16000)
        self.streaming_client.stream(microphone_stream)
This method starts the real-time transcription process using Universal-Streaming, capturing audio from the microphone and streaming it to AssemblyAI for conversion into text with ~300ms latency. Universal-Streaming's intelligent endpointing ensures natural conversation flow without awkward pauses or premature interruptions.
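The StreamingSessionParameters import also lets you retune endpointing while a session is running, for example to give users more silence before a turn is considered finished. A sketch, assuming the SDK's set_params method (verify against the current docs):
# Hypothetical mid-session adjustment: wait longer before closing a turn
self.streaming_client.set_params(
    StreamingSessionParameters(
        min_end_of_turn_silence_when_confident=560
    )
)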
3.4 Handling Universal-Streaming events
Once the transcription starts, we need to process the incoming text using Universal-Streaming's event-driven architecture.
    def on_begin(self, client, event: BeginEvent):
        print(f"Universal-Streaming session started with ID: {event.id}")

    def on_turn(self, client, event: TurnEvent):
        if not event.transcript:
            return
        if event.end_of_turn:
            print(f"\nFinal: {event.transcript}")
            self.generate_ai_response(event.transcript)
        else:
            print(f"Partial: {event.transcript}", end="\r")

    def on_terminated(self, client, event: TerminationEvent):
        print(f"\nSession terminated: {event.audio_duration_seconds} seconds processed")

    def on_error(self, client, error: StreamingError):
        if not str(error):
            return
        print(f"Streaming error: {error}")
Universal-Streaming provides immutable transcripts, meaning the text won't change once delivered, enabling faster downstream processing. When an end-of-turn is detected using intelligent endpointing, the final transcript is sent to the AI model for response generation.
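One subtlety: with format_turns=True, the SDK may emit a turn first as a raw end-of-turn transcript and then again with formatting (punctuation, casing) applied, so the handler above could fire twice for one utterance. If you only want to respond to the formatted version, you can gate on the event's turn_is_formatted flag (assuming your SDK version exposes it; check the current docs). A drop-in variant of on_turn:
    def on_turn(self, client, event: TurnEvent):
        if not event.transcript:
            return
        # Respond only once, on the formatted final transcript
        if event.end_of_turn and event.turn_is_formatted:
            print(f"\nFinal: {event.transcript}")
            self.generate_ai_response(event.transcript)
        else:
            print(f"Partial: {event.transcript}", end="\r")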
3.5 Generate AI responses with DeepSeek R1
Once transcription is complete, the system sends the conversation history to DeepSeek R1 via Ollama to generate a response.
    def generate_ai_response(self, transcript_text):
        self.stop_transcription()
        self.full_transcript.append({"role": "user", "content": transcript_text})
        print(f"\nUser: {transcript_text}", end="\r\n")

        ollama_stream = ollama.chat(
            model="deepseek-r1:7b",
            messages=self.full_transcript,
            stream=True,
        )
        response_text = "".join(chunk['message']['content'] for chunk in ollama_stream)
        print("DeepSeek R1:", response_text)

        self.speak_response(response_text)
        self.full_transcript.append({"role": "assistant", "content": response_text})
        self.start_transcription()

    def speak_response(self, text):
        """Convert text to speech using ElevenLabs (new SDK layout)"""
        try:
            # Preferred path for current SDKs (streams chunks as they're produced)
            audio_generator = self.elevenlabs_client.text_to_speech.stream(
                text=text,
                voice_id="pNInz6obpgDQGcFmaJgB",  # Adam's voice ID
                model_id="eleven_monolingual_v1",
                output_format="mp3_44100_128",
            )
        except AttributeError:
            # Fall back to older SDKs that still expose `.generate()`
            audio_generator = self.elevenlabs_client.generate(
                text=text,
                voice="Adam",
                model="eleven_monolingual_v1",
            )
        # Plays the bytes (or generator) through your system audio
        stream(audio_generator)

    def stop_transcription(self):
        if self.streaming_client:
            self.streaming_client.disconnect(terminate=True)
This method pauses transcription, sends the user's message to DeepSeek R1, and streams back the AI-generated response. The response is then converted to speech using ElevenLabs, creating a seamless conversation experience powered by AssemblyAI's industry-leading accuracy.
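One DeepSeek-specific caveat: R1 models emit their chain-of-thought reasoning inside <think>...</think> tags, and you probably don't want that read aloud. A small helper you could apply to response_text before calling speak_response (strip_think_tags is our own addition, not part of any SDK):
import re

def strip_think_tags(text: str) -> str:
    # DeepSeek R1 wraps its internal reasoning in <think>...</think>;
    # remove it so only the final answer is spoken
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()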
3.6 Main execution
    def run(self):
        print("Starting AI Voice Agent with Universal-Streaming...")
        print("Pricing: Universal-Streaming at just $0.15/hour with unlimited concurrency")
        try:
            self.start_transcription()
            input("Press Enter to stop...")
        except KeyboardInterrupt:
            print("\nShutting down...")
        finally:
            self.stop_transcription()

if __name__ == "__main__":
    agent = AIVoiceAgent()
    agent.run()
Step 4: Running the AI voice agent
Once all dependencies are installed and the model is downloaded, simply run:
python AIVoiceAgent.py
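Speak into your microphone after the session starts. Based on the script's own print statements, the console output should look roughly like this (the session ID and transcripts are illustrative):
Starting AI Voice Agent with Universal-Streaming...
Pricing: Universal-Streaming at just $0.15/hour with unlimited concurrency
Real-time transcription with Universal-Streaming:
Universal-Streaming session started with ID: <session-id>
Partial: what is the capital of
Final: What is the capital of France?
User: What is the capital of France?
DeepSeek R1: The capital of France is Paris.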
Why Universal-Streaming for voice agents
AssemblyAI's Universal-Streaming delivers several key advantages for voice agent applications:
- Ultra-low latency: ~300ms immutable transcripts enable faster response times
- Superior accuracy: Substantial improvements in capturing emails, codes, and proper nouns
- Intelligent endpointing: Combines acoustic and semantic features for natural conversation flow
- Transparent pricing: $0.15/hour with unlimited concurrency and no hidden fees
- Production-ready: Designed specifically for voice agents with enterprise reliability
This makes Universal-Streaming the ideal choice for developers building voice agents that need to feel natural, complete tasks successfully, and scale without compromise.
Final words
By integrating AssemblyAI's Universal-Streaming, DeepSeek R1, and ElevenLabs, this tutorial demonstrates how to build a production-ready AI voice agent with ultra-low latency transcription, intelligent conversation flow, and natural speech synthesis—enabling voice agents that feel more responsive and complete tasks successfully.
Universal-Streaming's developer-friendly API and cutting-edge technology provide the reliable foundation needed for building voice agents that deliver exceptional user experiences. With industry-leading accuracy and transparent pricing, you can focus on building innovative features while AssemblyAI handles the complexities of speech recognition.
Learn more about AssemblyAI's Universal-Streaming: Check the Docs