Changelog
Follow along to see weekly accuracy and product improvements.
New LeMUR parameter, reduced hold music hallucinations
Users can now pass custom text inputs directly into LeMUR through the input_text parameter, as an alternative to transcript IDs. This gives users the ability to use any information from the async API, formatted however they want, with LeMUR for maximum flexibility.
For example, users can assign action items per speaker by inputting speaker-labeled transcripts, or pull citations by inputting timestamped transcripts. Learn more about the new input_text parameter in our LeMUR API reference, or check out examples of how to use it in the AssemblyAI Cookbook.
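As a sketch of the kind of formatting this enables, the snippet below assembles a speaker-labeled input_text string from utterance data. The utterance dicts here are illustrative, loosely mirroring the shape of the async API's utterances output:

```python
# Illustrative utterance data, loosely mirroring the async API's utterances output
utterances = [
    {"speaker": "A", "text": "Let's ship the beta on Friday."},
    {"speaker": "B", "text": "I'll draft the release notes by Thursday."},
]

# Format the transcript however you like before passing it to LeMUR as input_text
input_text = "\n".join(f"Speaker {u['speaker']}: {u['text']}" for u in utterances)
print(input_text)
```

The resulting string can then be supplied to LeMUR via the input_text parameter instead of a transcript ID.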
We’ve made improvements that reduce hallucinations that sometimes occurred when transcribing hold music on phone calls. This improvement is effective immediately, with no changes required by users.
We’ve fixed an issue that would sometimes yield an inability to fulfill a request when XML was returned by the LeMUR /task endpoint.
Reduced latency, improved error messaging
We’ve made improvements to our file downloading pipeline which reduce transcription latency. Latency has been reduced by at least 3 seconds for all audio files, with greater improvements for large audio files provided via external URLs.
We’ve improved error messaging for increased clarity in the case of internal server errors.
New Dashboard features and LeMUR fix
We have released the beta for our new usage dashboard. You can now see a usage summary broken down by async transcription, real-time transcription, Audio Intelligence, and LeMUR. Additionally, you can see charts of usage over time broken down by model.
We have added support for AWS marketplace on the dashboard/account management pages of our web application.
We have fixed an issue in which LeMUR would sometimes fail when handling extremely short transcripts.
New LeMUR features and other improvements
We have added a new parameter to LeMUR that allows users to specify a temperature for LeMUR generation. Temperature refers to how stochastic the generated text is and can be a value from 0 to 1, inclusive, where 0 corresponds to low creativity and 1 corresponds to high creativity. Lower values are preferred for tasks like multiple choice, and higher values are preferred for tasks like coming up with creative summaries of clips for social media.
Here is an example of how to set the temperature parameter with our Python SDK (available in version 0.18.0 and up):
import assemblyai as aai
aai.settings.api_key = f"{API_TOKEN}"
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://storage.googleapis.com/aai-web-samples/meeting.mp4")
result = transcript.lemur.summarize(
    temperature=0.25
)
print(result.response)
We have added a new endpoint that allows users to delete the data for a previously submitted LeMUR request. The response data as well as any context provided in the original request will be removed. Continuing the example from above, we can see how to delete LeMUR data using our Python SDK:
request_id = result.request_id
deletion_result = aai.Lemur.purge_request_data(request_id)
print(deletion_result)
We have improved the error messaging for our Word Search functionality. Each phrase used in a Word Search request must be 5 words or fewer, and we have improved the clarity of the error message returned when a request contains a phrase that exceeds this limit.
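A client-side check for this limit might look like the following sketch (the helper name and constant are ours, not part of the API):

```python
MAX_PHRASE_WORDS = 5  # each Word Search phrase must be 5 words or fewer

def validate_word_search_phrases(phrases):
    """Raise if any phrase exceeds the per-phrase word limit."""
    too_long = [p for p in phrases if len(p.split()) > MAX_PHRASE_WORDS]
    if too_long:
        raise ValueError(f"Phrases over {MAX_PHRASE_WORDS} words: {too_long}")
    return phrases

# These phrases are within the limit, so validation passes
validate_word_search_phrases(["machine learning", "quarterly revenue report"])
```

Validating before submitting avoids a round trip for requests that would be rejected.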
We have fixed an edge case error that would occur when both disfluencies and Auto Chapters were enabled for audio files that contained non-fluent English.
Improvements - observability, logging, and patches
We have improved logging for our LeMUR service to allow for the surfacing of more detailed errors to users.
We have increased observability into our Speech API internally, allowing for finer grained metrics of usage.
We have fixed a minor bug that would sometimes lead to incorrect timestamps for zero-confidence words.
We have fixed an issue in which requests to LeMUR would occasionally hang during peak usage due to a memory leak issue.
Multi-language speaker labels
We have recently launched Speaker Labels for 10 additional languages:
- Spanish
- Portuguese
- German
- Dutch
- Finnish
- French
- Italian
- Polish
- Russian
- Turkish
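For example, a raw transcript request enabling Speaker Labels for a Spanish file might look like this sketch (the audio URL is a placeholder):

```python
import json

# Illustrative /v2/transcript request body enabling Speaker Labels in Spanish
payload = {
    "audio_url": "https://example.com/spanish_call.mp3",  # placeholder URL
    "language_code": "es",
    "speaker_labels": True,
}
print(json.dumps(payload))
```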
Audio Intelligence unbundling and price decreases
We have unbundled and lowered the price for our Audio Intelligence models. Previously, the bundled price for all Audio Intelligence models was $2.10/hr, regardless of the number of models used.
We have made each model accessible at a lower, unbundled, per-model rate:
- Auto Chapters: $0.30/hr
- Content Moderation: $0.25/hr
- Entity Detection: $0.15/hr
- Key Phrases: $0.06/hr
- PII Redaction: $0.20/hr
- Audio Redaction: $0.05/hr
- Sentiment Analysis: $0.12/hr
- Summarization: $0.06/hr
- Topic Detection: $0.20/hr
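To illustrate the unbundled pricing, here is a small sketch that sums the per-model rates for a chosen set of models (the rate table simply mirrors the list above):

```python
# Per-hour rates for each Audio Intelligence model, from the list above
RATES = {
    "auto_chapters": 0.30,
    "content_moderation": 0.25,
    "entity_detection": 0.15,
    "key_phrases": 0.06,
    "pii_redaction": 0.20,
    "audio_redaction": 0.05,
    "sentiment_analysis": 0.12,
    "summarization": 0.06,
    "topic_detection": 0.20,
}

def hourly_cost(models):
    """Total per-hour cost for the selected models."""
    return round(sum(RATES[m] for m in models), 2)

print(hourly_cost(["summarization", "sentiment_analysis"]))  # 0.18
```

Even enabling every model now totals $1.39/hr, below the previous $2.10/hr bundled price.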
New language support and improvements to existing languages
We now support the following additional languages for asynchronous transcription through our /v2/transcript endpoint:
- Chinese
- Finnish
- Korean
- Polish
- Russian
- Turkish
- Ukrainian
- Vietnamese
Additionally, we've made improvements in accuracy and quality to the following languages:
- Dutch
- French
- German
- Italian
- Japanese
- Portuguese
- Spanish
You can see a full list of supported languages and features here. You can see how to specify a language in your API request here. Note that not all languages support Automatic Language Detection.
Pricing decreases
We have decreased the price of Core Transcription from $0.90 per hour to $0.65 per hour, and decreased the price of Real-Time Transcription from $0.90 per hour to $0.75 per hour.
Both decreases were effective as of August 3rd.
Significant Summarization model speedups
We’ve implemented changes that yield a 43% to 200% increase in processing speed for our Summarization models, depending on which model is selected, with no measurable impact on the quality of results.
We have standardized the response from our API for automatically detected languages that do not support requested features. In particular, when Automatic Language Detection is used and the detected language does not support a feature requested in the transcript request, our API will return null in the response for that feature.
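When relying on Automatic Language Detection, client code should therefore tolerate null feature fields, as in this sketch (the response dict is illustrative, not a full API response):

```python
# Illustrative slice of a transcript response where the detected language
# did not support a requested feature, so the API returned null for it
response = {
    "language_code": "fi",
    "text": "Esimerkkiteksti ...",
    "summary": None,  # requested feature unsupported for the detected language
}

summary = response.get("summary")
if summary is None:
    print(f"Summary unavailable for detected language: {response['language_code']}")
```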
Introducing LeMUR, the easiest way to build LLM apps on spoken data
We've released LeMUR - our framework for applying LLMs to spoken data - for general availability. LeMUR is optimized for high accuracy on specific tasks:
- Custom Summary allows users to automatically summarize files in a flexible way
- Question & Answer allows users to ask specific questions about audio files and receive answers to these questions
- Action Items allows users to automatically generate a list of action items from virtual or in-person meetings
Additionally, LeMUR can be applied to groups of transcripts in order to analyze a set of files at once, allowing users to, for example, summarize many podcast episodes or ask questions about a series of customer calls.
Our Python SDK allows users to work with LeMUR in just a few lines of code:
# version 0.15 or greater
import assemblyai as aai
# set your API key
aai.settings.api_key = f"{API_TOKEN}"
# transcribe the audio file (meeting recording)
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://storage.googleapis.com/aai-web-samples/meeting.mp4")
# generate and print action items
result = transcript.lemur.action_items(
    context="A GitLab meeting to discuss logistics",
    answer_format="**<topic header>**\n<relevant action items>\n",
)
print(result.response)
Learn more about LeMUR in our blog post, or jump straight into the code in our associated Colab notebook.
Introducing our Conformer-2 model
We've released Conformer-2, our latest AI model for automatic speech recognition. Conformer-2 is trained on 1.1M hours of English audio data, extending Conformer-1 to provide improvements on proper nouns, alphanumerics, and robustness to noise.
Conformer-2 is now the default model for all English audio files sent to the v2/transcript endpoint for async processing, and introduces no breaking changes.
We’ll be releasing Conformer-2 for real-time English transcriptions within the next few weeks.
Read our full blog post about Conformer-2 here. You can also try it out in our Playground.
New parameter and timestamps fix
We’ve introduced a new, optional speech_threshold parameter, allowing users to only transcribe files that contain at least a specified percentage of spoken audio, represented as a ratio in the range [0, 1].
You can use the speech_threshold parameter with our Python SDK as below:
import assemblyai as aai
aai.settings.api_key = f"{ASSEMBLYAI_API_KEY}"
config = aai.TranscriptionConfig(speech_threshold=0.1)
file_url = "https://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"
transcriber = aai.Transcriber()
transcript = transcriber.transcribe(file_url, config)
print(transcript.text)
# Smoke from hundreds of wildfires in Canada is triggering air quality alerts throughout the US. Skylines from ...
If the percentage of speech in the audio file does not meet or surpass the provided threshold, then the value of transcript.text will be None and you will receive an error:
if not transcript.text:
    print(transcript.error)
# Audio speech threshold 0.9461 is below the requested speech threshold value 1.0
As usual, you can also include the speech_threshold parameter in the JSON of raw HTTP requests for any language.
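For raw HTTP requests, that just means adding the field to the request body, as in this sketch (the audio URL is a placeholder):

```python
import json

# Illustrative /v2/transcript request body including speech_threshold
payload = {
    "audio_url": "https://example.com/audio.mp3",  # placeholder URL
    "speech_threshold": 0.1,
}
print(json.dumps(payload))
```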
We’ve fixed a bug in which timestamps could sometimes be incorrectly reported for our Topic Detection and Content Safety models.
We’ve made improvements to detect and remove a hallucination that would sometimes occur with specific audio patterns.
Character sequence improvements
We’ve fixed an issue in which the last character in an alphanumeric sequence could fail to be transcribed. The fix is effective immediately and constitutes a 95% reduction in errors of this type.
We’ve fixed an issue in which consecutive identical numbers in a long number sequence could fail to be transcribed. This fix is effective immediately and constitutes a 66% reduction in errors of this type.
Speaker Labels improvement
We’ve made improvements to the Speaker Labels model, adjusting the impact of the speakers_expected parameter to better allow the model to determine the correct number of unique speakers, especially in cases where one or more speakers talk substantially less than others.
We’ve expanded our caching system to include additional third-party resources to help further ensure our continued operations in the event of external resources being down.
Significant processing time improvement
We’ve made significant improvements to our transcoding pipeline, resulting in a 98% overall speedup in transcoding time and a 12% overall improvement in processing time for our asynchronous API.
We’ve implemented a caching system for some third-party resources to ensure our continued operations in the event of external resources being down.
Announcing LeMUR - our new framework for applying powerful LLMs to transcribed speech
We’re introducing our new framework LeMUR, which makes it simple to apply Large Language Models (LLMs) to transcripts of audio files up to 10 hours in length.
LLMs unlock a range of impressive capabilities that allow teams to build powerful Generative AI features. However, building these features is difficult due to the limited context windows of modern LLMs, among other challenges that necessitate the development of complicated processing pipelines.
LeMUR circumvents this problem by making it easy to apply LLMs to transcribed speech, meaning that product teams can focus on building differentiating Generative AI features rather than focusing on building infrastructure. Learn more about what LeMUR can do and how it works in our announcement blog, or jump straight to trying LeMUR in our Playground.
New PII and Entity Detection Model
We’ve upgraded to a new and more accurate PII Redaction model, which improves credit card detections in particular.
We’ve made stability improvements regarding the handling and caching of web requests. These improvements additionally fix a rare issue with punctuation detection.
Multilingual and stereo audio fixes, & Japanese model retraining
We’ve fixed two edge cases in our async transcription pipeline that were producing non-deterministic results from multilingual and stereo audio.
We’ve improved word boundary detection in our Japanese automatic speech recognition model. These changes are effective immediately for all Japanese audio files submitted to AssemblyAI.
Decreased latency and improved password reset
We’ve implemented a range of improvements to our English pipeline, leading to an average 38% improvement in overall latency for asynchronous English transcriptions.
We’ve made improvements to our password reset process, offering greater clarity to users attempting to reset their passwords while still ensuring security throughout the reset process.