Changelog
Follow along to see weekly accuracy and product improvements.
Introducing LeMUR, the easiest way to build LLM apps on spoken data
We've released LeMUR - our framework for applying LLMs to spoken data - for general availability. LeMUR is optimized for high accuracy on specific tasks:
- Custom Summary allows users to automatically summarize files in a flexible way
- Question & Answer allows users to ask specific questions about audio files and receive answers to these questions
- Action Items allows users to automatically generate a list of action items from virtual or in-person meetings
Additionally, LeMUR can be applied to groups of transcripts in order to analyze a set of files simultaneously, allowing users to, for example, summarize many podcast episodes or ask questions about a series of customer calls (see the group example after the code below).
Our Python SDK allows users to work with LeMUR in just a few lines of code:
```python
# version 0.15 or greater
import assemblyai as aai

# set your API key
aai.settings.api_key = f"{API_TOKEN}"

# transcribe the audio file (meeting recording)
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://storage.googleapis.com/aai-web-samples/meeting.mp4")

# generate and print action items
result = transcript.lemur.action_items(
    context="A GitLab meeting to discuss logistics",
    answer_format="**<topic header>**\n<relevant action items>\n",
)

print(result.response)
```
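LeMUR can also operate over a group of transcripts at once. The sketch below is illustrative rather than definitive: it assumes the SDK's transcribe_group helper and LemurQuestion type, and the audio URLs are placeholders:

```python
import assemblyai as aai

aai.settings.api_key = f"{API_TOKEN}"

# transcribe a set of related recordings together (placeholder URLs)
transcriber = aai.Transcriber()
transcript_group = transcriber.transcribe_group([
    "https://example.com/customer_call_1.mp3",
    "https://example.com/customer_call_2.mp3",
])

# ask one question across every transcript in the group
result = transcript_group.lemur.question(
    [aai.LemurQuestion(question="What issues did the customers report?")]
)

for qa in result.response:
    print(qa.question, qa.answer)
```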
Learn more about LeMUR in our blog post, or jump straight into the code in our associated Colab notebook.
Introducing our Conformer-2 model
We've released Conformer-2, our latest AI model for automatic speech recognition. Conformer-2 is trained on 1.1M hours of English audio data, extending Conformer-1 to provide improvements on proper nouns, alphanumerics, and robustness to noise.
Conformer-2 is now the default model for all English audio files sent to the v2/transcript endpoint for async processing and introduces no breaking changes.
We’ll be releasing Conformer-2 for real-time English transcriptions within the next few weeks.
Read our full blog post about Conformer-2 here. You can also try it out in our Playground.
New parameter and timestamps fix
We’ve introduced a new, optional speech_threshold parameter, allowing users to only transcribe files that contain at least a specified percentage of spoken audio, represented as a ratio in the range [0, 1].
You can use the speech_threshold parameter with our Python SDK as below:
```python
import assemblyai as aai

aai.settings.api_key = f"{ASSEMBLYAI_API_KEY}"

config = aai.TranscriptionConfig(speech_threshold=0.1)

file_url = "https://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(file_url, config)

print(transcript.text)
```

```
Smoke from hundreds of wildfires in Canada is triggering air quality alerts throughout the US. Skylines from ...
```
If the percentage of speech in the audio file does not meet or surpass the provided threshold, then the value of transcript.text will be None and you will receive an error:
```python
if not transcript.text:
    print(transcript.error)
```

```
Audio speech threshold 0.9461 is below the requested speech threshold value 1.0
```
As usual, you can also include the speech_threshold parameter in the JSON of raw HTTP requests for any language.
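For reference, a minimal sketch of such a request using the Python requests library (the API token is a placeholder) might look like this:

```python
import requests

headers = {"authorization": "YOUR_API_TOKEN"}

# request a transcript only if at least 10% of the file is speech
payload = {
    "audio_url": "https://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3",
    "speech_threshold": 0.1,
}

response = requests.post("https://api.assemblyai.com/v2/transcript", headers=headers, json=payload)
print(response.json()["id"])
```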
We’ve fixed a bug in which timestamps could sometimes be incorrectly reported for our Topic Detection and Content Safety models.
We’ve made improvements to detect and remove a hallucination that would sometimes occur with specific audio patterns.
Character sequence improvements
We’ve fixed an issue in which the last character in an alphanumeric sequence could fail to be transcribed. The fix is effective immediately and constitutes a 95% reduction in errors of this type.
We’ve fixed an issue in which consecutive identical numbers in a long number sequence could fail to be transcribed. This fix is effective immediately and constitutes a 66% reduction in errors of this type.
Speaker Labels improvement
We’ve made improvements to the Speaker Labels model, adjusting the impact of the speakers_expected parameter to better allow the model to determine the correct number of unique speakers, especially in cases where one or more speakers talk substantially less than the others.
We’ve expanded our caching system to include additional third-party resources to help further ensure our continued operations in the event of external resources being down.
Significant processing time improvement
We’ve made significant improvements to our transcoding pipeline, resulting in a 98% overall speedup in transcoding time and a 12% overall improvement in processing time for our asynchronous API.
We’ve implemented a caching system for some third-party resources to ensure our continued operations in the event of external resources being down.
Announcing LeMUR - our new framework for applying powerful LLMs to transcribed speech
We’re introducing our new framework LeMUR, which makes it simple to apply Large Language Models (LLMs) to transcripts of audio files up to 10 hours in length.
LLMs unlock a range of impressive capabilities that allow teams to build powerful Generative AI features. However, building these features is difficult due to the limited context windows of modern LLMs, among other challenges that necessitate the development of complicated processing pipelines.
LeMUR circumvents this problem by making it easy to apply LLMs to transcribed speech, meaning that product teams can focus on building differentiating Generative AI features rather than focusing on building infrastructure. Learn more about what LeMUR can do and how it works in our announcement blog, or jump straight to trying LeMUR in our Playground.
New PII and Entity Detection Model
We’ve upgraded to a new and more accurate PII Redaction model, which improves credit card detections in particular.
We’ve made stability improvements regarding the handling and caching of web requests. These improvements additionally fix a rare issue with punctuation detection.
Multilingual and stereo audio fixes, & Japanese model retraining
We’ve fixed two edge cases in our async transcription pipeline that were producing non-deterministic results from multilingual and stereo audio.
We’ve improved word boundary detection in our Japanese automatic speech recognition model. These changes are effective immediately for all Japanese audio files submitted to AssemblyAI.
Decreased latency and improved password reset
We’ve implemented a range of improvements to our English pipeline, leading to an average 38% improvement in overall latency for asynchronous English transcriptions.
We’ve made improvements to our password reset process, offering greater clarity to users attempting to reset their passwords while still ensuring security throughout the reset process.
Conformer-1 now available for Real-Time transcription, new Speaker Labels parameter, and more
We're excited to announce that our new Conformer-1 Speech Recognition model is now available for real-time English transcriptions, offering a 24.3% relative accuracy improvement.
Effective immediately, this state-of-the-art model will be the default model for all English audio data sent to the wss://api.assemblyai.com/v2/realtime/ws WebSocket API.
The Speaker Labels model now accepts a new optional parameter called speakers_expected. If you have high confidence in the number of speakers in an audio file, then you can specify it with speakers_expected in order to improve Speaker Labels performance, particularly for short utterances.
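For example, a minimal sketch with our Python SDK, assuming a two-speaker recording (the audio URL is a placeholder):

```python
import assemblyai as aai

aai.settings.api_key = f"{ASSEMBLYAI_API_KEY}"

# enable Speaker Labels and hint that two unique speakers are expected
config = aai.TranscriptionConfig(speaker_labels=True, speakers_expected=2)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://example.com/two_person_interview.mp3", config)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```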
TLS 1.3 is now available for use with the AssemblyAI API. Using TLS 1.3 can decrease latency when establishing a connection to the API.
Our PII redaction scaling has been improved to increase stability, particularly when processing longer files.
We've improved the quality and accuracy of our Japanese model.
Short transcripts that are unable to be summarized will now return an empty summary and a successful transcript.
Introducing our Conformer-1 model
We've released our new Conformer-1 model for speech recognition. Conformer-1 was trained on 650K hours of audio data and is our most accurate model to date.
Conformer-1 is now the default model for all English audio files sent to the /v2/transcript endpoint for async processing.
We'll be releasing it for real-time English transcriptions within the next two weeks, and will add support for more languages soon.
New AI Models for Italian / Japanese Punctuation Improvements
Our Content Safety and Topic Detection models are now available for use with Italian audio files.
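As a rough sketch, both models can be enabled on an Italian file by setting the corresponding request parameters (the audio URL is a placeholder):

```python
import requests

headers = {"authorization": "YOUR_API_TOKEN"}

payload = {
    "audio_url": "https://example.com/italian_interview.mp3",  # placeholder
    "language_code": "it",
    "content_safety": True,   # Content Safety model
    "iab_categories": True,   # Topic Detection model
}

response = requests.post("https://api.assemblyai.com/v2/transcript", headers=headers, json=payload)
print(response.json()["id"])
```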
We’ve made improvements to our Japanese punctuation model, increasing relative accuracy by 11%. These changes are effective immediately for all Japanese audio files submitted to AssemblyAI.
Hindi Punctuation Improvements
We’ve made improvements to our Hindi punctuation model, increasing relative accuracy by 26%. These changes are effective immediately for all Hindi audio files submitted to AssemblyAI.
We’ve tuned our production infrastructure to reduce latency and improve overall consistency when using the Topic Detection and Content Moderation models.
Improved PII Redaction
We’ve released a new version of our PII Redaction model to improve PII detection accuracy, especially for credit card and phone number edge cases. Improvements are effective immediately for all API calls that include PII redaction.
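As a reminder, PII Redaction is enabled per request by listing the policies to redact; a minimal sketch (the audio URL is a placeholder) might look like this:

```python
import requests

headers = {"authorization": "YOUR_API_TOKEN"}

payload = {
    "audio_url": "https://example.com/sales_call.mp3",  # placeholder
    "redact_pii": True,
    "redact_pii_policies": ["credit_card_number", "phone_number"],
}

response = requests.post("https://api.assemblyai.com/v2/transcript", headers=headers, json=payload)
print(response.json()["id"])
```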
Automatic Language Detection Upgrade
We’ve released a new version of our Automatic Language Detection model that better targets speech-dense parts of audio files, yielding improved accuracy. Additionally, support for dual-channel and low-volume files has been improved. All changes are effective immediately.
Our Core Transcription API has been migrated from EC2 to ECS in order to ensure scalable, reliable service and preemptively protect against service interruptions.
Password Reset
Users can now reset their passwords from our web UI. From the Dashboard login, simply click “Forgot your password?” to initiate a password reset. Alternatively, users who are already logged in can change their passwords from the Account tab on the Dashboard.

The maximum phrase length for our Word Search feature has been increased from 2 to 5, effective immediately.
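For instance, a five-word phrase can now be searched in a single query. A rough sketch with our Python SDK (the phrase and audio URL are illustrative):

```python
import assemblyai as aai

aai.settings.api_key = f"{ASSEMBLYAI_API_KEY}"

transcript = aai.Transcriber().transcribe("https://example.com/earnings_call.mp3")  # placeholder

# search for a phrase of up to five words
matches = transcript.word_search(["revenue growth in north america"])
for match in matches:
    print(match.text, match.count, match.timestamps)
```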
Dual Channel Support for Conversational Summarization / Improved Timestamps
We’ve made updates to our Conversational Summarization model to support dual-channel files. Effective immediately, dual_channel may be set to True when summary_model is set to conversational.
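A rough sketch of a request combining the two options (the audio URL is a placeholder; summary_type is shown for completeness):

```python
import requests

headers = {"authorization": "YOUR_API_TOKEN"}

payload = {
    "audio_url": "https://example.com/two_channel_support_call.mp3",  # placeholder
    "dual_channel": True,
    "summarization": True,
    "summary_model": "conversational",
    "summary_type": "bullets",
}

response = requests.post("https://api.assemblyai.com/v2/transcript", headers=headers, json=payload)
print(response.json()["id"])
```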
We've made significant improvements to timestamps for non-English audio. Timestamps are now typically accurate between 0 and 100 milliseconds. This improvement is effective immediately for all non-English audio files submitted to AssemblyAI for transcription.
Improved Transcription Accuracy for Phone Numbers
We’ve made updates to our Core Transcription model to improve the transcription accuracy of phone numbers by 10%. This improvement is effective immediately for all audio files submitted to AssemblyAI for transcription.
We've improved scaling for our read-only database, resulting in improved performance for read-only requests.
v9 Transcription Model Released
We are happy to announce the release of our most accurate Speech Recognition model to date - version 9 (v9). This updated model delivers increased performance across many metrics on a wide range of audio types.
Word Error Rate, or WER, is the primary quantitative metric by which the performance of an automatic transcription model is measured. Our new v9 model shows significant improvements across a range of different audio types, as seen in the chart below, with a more than 11% improvement on average.

In addition to standard overall WER advancements, the new v9 model shows marked improvements with respect to proper nouns. In the chart below, we can see the relative performance increase of v9 over v8 for various types of audio, with a nearly 15% improvement on average.

The new v9 transcription model is currently live in production. This means that customers will see improved performance with no changes required on their end. The new model will automatically be used for all transcriptions created by our /v2/transcript endpoint going forward, with no need to upgrade for special access.
While our customers enjoy the elevated performance of the v9 model, our AI research team is already hard at work on our v10 model, which is slated to launch in early 2023. Building upon v9, the v10 model is expected to radically improve the state of the art in speech recognition.
Try our new v9 transcription model through your browser using the AssemblyAI Playground. Alternatively, sign up for a free API token to test it out through our API, or schedule a time with our AI experts to learn more.