In 2017, we launched Amazon Transcribe, an automatic speech recognition service that makes it easy for developers to add a speech-to-text capability to their applications. Since then, we added support for more languages, enabling customers globally to transcribe audio recordings in 31 languages, including 6 in real-time.
A popular use case for Amazon Transcribe is transcribing customer calls. This allows companies to analyze the transcribed text using natural language processing techniques to detect sentiment or to identify the most common call causes. If you operate in a country with multiple official languages or across multiple regions, your audio files can contain different languages. Thus, files have to be tagged manually with the appropriate language before transcription can take place. This typically involves setting up teams of multi-lingual speakers, which creates additional costs and delays in processing audio files.
The media and entertainment industry often uses Amazon Transcribe to convert media content into accessible and searchable text files. Use cases include generating subtitles or transcripts, moderating content, and more. Amazon Transcribe is also used by operations team for quality control, for example checking that audio and video are in sync thanks to the timestamps present in the extracted text. However, other problems couldn’t be easily solved, such as verifying that the main spoken language in your videos is correctly labeled to avoid streaming video in the wrong language.
Today, I’m extremely happy to announce that Amazon Transcribe can now automatically identify the dominant language in an audio recording. This feature will help customers build more efficient transcription workflows by getting rid of manual tagging. In addition to the examples mentioned above, you can now also easily use Amazon Transcribe to automatically recognize and transcribe voicemails, meetings, and any form of recorded communication.
Introducing Automatic Language Identification
With a minimum of 30 seconds of audio, Amazon Transcribe can efficiently generate transcripts in the spoken language without wasting time and resources on manual tagging. Automatic identification of the dominant language is available in batch transcription mode for all 31 languages. Thanks to sampling techniques, language identification happens much faster than the transcription itself, in the matter of seconds.
If you’re already using Amazon Transcribe for speech recognition, you just need to enable the feature in the StartTranscriptionJob API. Before your transcription job is complete, the response of the GetTranscriptionJob API will tell the dominant language of the audio recording, and its confidence score between 0 and 1. The transcript lists the top five languages and their respective confidence scores.
Of course, if you want to use Amazon Transcribe exclusively for automatic language identification, you can simply process the API response and ignore the transcript. In this case, you should stick to short 30-45 second audio recordings to minimize costs.
You can also restrict languages that Amazon Transcribe tries to identify, by passing a list of languages to the StartTranscriptionJob API. For example, if your company call center only receives calls in English, Spanish and French, then restricting identifiable languages to this list will increase language identification accuracy.
Now, I’d like to show you how easy it us to use this new feature!
Detecting the Dominant Language With Amazon Transcribe
First, let’s try a high quality sample. I’ll use the audio track from one of my breakout sessions at AWS Summit Paris 2019. I can easily download it using the youtube-dl tool.
$ youtube-dl -f bestaudio https://www.youtube.com/watch?v=AFN5jaTurfA
$ mv AWS\ \&\ EarthCube\ _\ Deep\ learning\ démarrer\ avec\ MXNet\ et\ Tensorflow\ en\ 10\ minutes-AFN5jaTurfA.m4a video.m4a
Using ffmpeg, I shorten the audio clip to 1 minute.
$ ffmpeg -i video.m4a -ss 00:00:00.00 -t 00:01:00.00 video-1mn.m4a
Then, I upload the clip to an Amazon Simple Storage Service (S3) bucket.
$ aws s3 cp video-1mn.m4a s3://jsimon-transcribe-uswest2/
Next, I use the AWS CLI to run a transcription job on this audio clip, with language identification enabled.
$ awscli transcribe start-transcription-job --transcription-job-name video-test --identify-language --media MediaFileUri=s3://jsimon-transcribe-uswest2/video-1mn.m4a
Waiting only a few seconds, I check the status of the job. I could also use a Amazon CloudWatch event to be notified that language identification is complete.
$ awscli transcribe get-transcription-job --transcription-job-name video-test
As highlighted in the output, the dominant language has been correctly detected in seconds, with a high confidence score of 91.59%. A few more seconds later, the transcription job is complete. Running the same CLI call, I can retrieve a link to the transcription, which also includes the top 5 languages for the audio clip, sorted by decreasing score.
Adding up French and Canadian French, we pretty much get a score of 100%, so there’s no doubt that this clip is in French. In some cases, you may not care for that level of detail, and you’ll see in the next example how to restrict the list of detected languages.
Restricting the List of Detected Languages
As customer call transcription is a popular use case for Amazon Transcribe, here is a 40-second audio clip (WAV, 8KHz, 16-bit resolution), where I’m reading a paragraph from the French version of the Amazon Transcribe page. As you can hear, quality is pretty awful, and I added background music (Bach-ground, actually) for good measure.
Again, I upload the clip to an S3 bucket, and I use the AWS CLI to transcribe it. This time, I restrict the list of languages to French, Spanish, German, US English, and British English.
$ aws s3 cp speech-8k.wav s3://jsimon-transcribe-uswest2/
$ awscli transcribe start-transcription-job --transcription-job-name speech-8k-test --identify-language --media MediaFileUri=s3://jsimon-transcribe-uswest2/speech-8k.wav --language-options fr-FR es-ES de-DE en-US en-GB
A few seconds later, I check the status of the job.
$ awscli transcribe get-transcription-job --transcription-job-name speech-8k-test
As highlighted in the output, the dominant language has been correctly detected with a very high confidence score in spite of the terrible audio quality. Restricting the list of languages certainly helps, and you should use it whenever possible.
Automatic Language Identification is available today in these regions:
- US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), AWS GovCloud (US-West).
- Canada (Central).
- South America (São Paulo).
- Europe (Ireland), Europe (London), Europe (Paris), Europe (Frankfurt).
- Middle East (Bahrain).
- Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney).