Not My Real Life: Head In The Cloud: Majority of Alexa Now Running on Faster, More Cost-Effective Amazon EC2 Inf1 Instances

Today, we are announcing that the Amazon Alexa team has migrated the vast majority of their GPU-based machine learning inference workloads to Amazon Elastic Compute Cloud (EC2) Inf1 instances, powered by AWS Inferentia. This resulted in 25% lower end-to-end latency, and 30% lower cost compared to GPU-based instances for Alexa’s text-to-speech workloads. The lower latency allows Alexa engineers to innovate with more complex algorithms and to improve the overall Alexa experience for our customers.

AWS built AWS Inferentia chips from the ground up to provide the lowest-cost machine learning (ML) inference in the cloud. They power the Inf1 instances that we launched at AWS re:Invent 2019. Inf1 instances provide up to 30% higher throughput and up to 45% lower cost per inference compared to GPU-based G4 instances, which were, before Inf1, the lowest-cost instances in the cloud for ML inference.

Alexa is Amazon’s cloud-based voice service that powers Amazon Echo devices and more than 140,000 models of smart speakers, lights, plugs, smart TVs, and cameras. Today customers have connected more than 100 million devices to Alexa. And every month, tens of millions of customers interact with Alexa to control their home devices (“Alexa, increase temperature in living room,” “Alexa, turn off bedroom’“), to listen to radios and music (“Alexa, start Maxi 80 on bathroom,” “Alexa, play Van Halen from Spotify“), to be informed (“Alexa, what is the news?” “Alexa, is it going to rain today?“), or to be educated, or entertained with 100,000+ Alexa Skills.

If you ask Alexa where she lives, she’ll tell you she is right here, but her head is in the cloud. Indeed, Alexa’s brain is deployed on AWS, where she benefits from the same agility, large-scale infrastructure, and global network we built for our customers.

How Alexa Works
When I’m in my living room and ask Alexa about the weather, I trigger a complex system. First, the on-device chip detects the wake word (Alexa). Once detected, the microphones record what I’m saying and stream the sound for analysis in the cloud. At a high level, there are two phases to analyze the sound of my voice. First, Alexa converts the sound to text. This is known as Automatic Speech Recognition (ASR). Once the text is known, the second phase is to understand what I mean. This is Natural Language Understanding (NLU). The output of NLU is an Intent (what does the customer want) and associated parameters. In this example (“Alexa, what’s the weather today ?”), the intent might be “GetWeatherForecast” and the parameter can be my postcode, inferred from my profile.

This whole process uses Artificial Intelligence heavily to transform the sound of my voice to phonemes, phonemes to words, words to phrases, phrases to intents. Based on the NLU output, Alexa routes the intent to a service to fulfill it. The service might be internal to Alexa or external, like one of the skills activated on my Alexa account. The fulfillment service processes the intent and returns a response as a JSON document. The document contains the text of the response Alexa must say.

The last step of the process is to generate the voice of Alexa from the text. This is known as Text-To-Speech (TTS). As soon as the TTS starts to produce sound data, it is streamed back to my Amazon Echo device: “The weather today will be partly cloudy with highs of 16 degrees and lows of 8 degrees.” (I live in Europe, these are Celsius degrees :-) ). This Text-To-Speech process also heavily involves machine learning models to build a phrase that sounds natural in terms of pronunciations, rhythm, connection between words, intonation etc.

Alexa is one of the most popular hyperscale machine learning services in the world, with billions of inference requests every week. Of Alexa’s three main inference workloads (ASR, NLU, and TTS), TTS workloads initially ran on GPU-based instances. But the Alexa team decided to move to the Inf1 instances as fast as possible to improve the customer experience and reduce the service compute cost.

What is AWS Inferentia?
AWS Inferentia is a custom chip, built by AWS, to accelerate machine learning inference workloads and optimize their cost. Each AWS Inferentia chip contains four NeuronCores. Each NeuronCore implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, dramatically reducing latency and increasing throughput.

AWS Inferentia can be used natively from popular machine-learning frameworks like TensorFlow, PyTorch, and MXNet, with AWS Neuron. AWS Neuron is a software development kit (SDK) for running machine learning inference using AWS Inferentia chips. It consists of a compiler, run-time, and profiling tools that enable you to run high-performance and low latency inference.

Who Else is Using Amazon EC2 Inf1?
In addition to Alexa, Amazon Rekognition is also adopting AWS Inferentia. Running models such as object classification on Inf1 instances resulted in 8x lower latency and doubled throughput compared to running these models on GPU instances.

Customers, from Fortune 500 companies to startups, are using Inf1 instances for machine learning inference. For example, Snap Inc. incorporates machine learning (ML) into many aspects of Snapchat, and exploring innovation in this field is a key priority for them. Once they heard about AWS Inferentia, they collaborated with AWS to adopt Inf1 instances to help with ML deployment, including around performance and cost. They started with their recommendation models inference, and are now looking forward to deploying more models on Inf1 instances in the future.

Conde Nast, one of the world’s leading media companies, saw a 72% reduction in cost of inference compared to GPU-based instances for its recommendation engine. And Anthem, one of the leading healthcare companies in the US, observed 2x higher throughput compared to GPU-based instances for its customer sentiment machine learning workload.

How to Get Started with Amazon EC2 Inf1
You can start using Inf1 instances today.

If you prefer to manage your own machine learning application development platforms, you can get started by either launching Inf1 instances with AWS Deep Learning AMIs, which include the Neuron SDK, or you can use Inf1 instances via Amazon Elastic Kubernetes Service or Amazon ECS for containerized machine learning applications. To learn more about running containers on Inf1 instances, read this blog to get started on ECS and this blog to get started on EKS.

The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service that enables developers to build, train, and deploy machine learning models quickly.

Get started with Inf1 on Amazon SageMaker today.

-- seb

PS: The team just released this video, check it out!