Nvidia’s shrinking market
The Amazon-designed Inferentia chips reduced cost and latency in text-to-speech.
On Thursday, an Amazon AWS blog post announced that the company has moved most of the cloud processing for its Alexa personal assistant off Nvidia GPUs and onto its own Inferentia application-specific integrated circuit (ASIC). Amazon developer Sebastien Stormacq describes Inferentia’s hardware design as follows:
AWS Inferentia is a custom chip, built by AWS, to accelerate machine learning inference workloads and optimize their cost. Each AWS Inferentia chip contains four NeuronCores. Each NeuronCore implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, dramatically reducing latency and increasing throughput.
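To make the "systolic array matrix multiply engine" in that description more concrete, here is a minimal Python sketch of how an output-stationary systolic array computes a matrix product. It is a toy simulation under assumed timing, not a description of Inferentia's actual NeuronCore design; the function name `systolic_matmul` and the skewed-input schedule are illustrative assumptions.

```python
def systolic_matmul(a, b):
    """Multiply a (n x m) by b (m x p) by stepping a simulated systolic array.

    Assumption: an output-stationary layout, where processing element (i, j)
    holds the running sum for output cell (i, j).
    """
    n, m, p = len(a), len(b), len(b[0])
    acc = [[0] * p for _ in range(n)]

    # Row i of `a` enters from the left delayed by i cycles, and column j of
    # `b` enters from the top delayed by j cycles. Operand pair k therefore
    # meets at element (i, j) on cycle t = i + j + k.
    total_cycles = n + p + m - 2
    for t in range(total_cycles):
        for i in range(n):
            for j in range(p):
                k = t - i - j
                if 0 <= k < m:
                    # One multiply-accumulate per element per cycle.
                    acc[i][j] += a[i][k] * b[k][j]
    return acc


product = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In real hardware every element performs its multiply-accumulate in parallel on each cycle, which is why the whole product finishes in roughly `n + p + m` cycles rather than `n * p * m` sequential steps.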
When an Amazon customer (usually someone who owns an Echo or Echo Dot) makes use of the Alexa personal assistant, very little of the processing is done on the device itself. The workload for a typical Alexa request looks something like this:
- A human speaks to an Amazon Echo, saying: “Alexa, what’s the special ingredient in Earl Grey tea?”
- The Echo detects the wake word—Alexa—using its own on-board processing
- The Echo streams the request to Amazon data centers
- Within the Amazon data center, the voice stream is converted to phonemes (Inference AI workload)
- Still in the data center, phonemes are converted to words (Inference AI workload)
- Words are assembled into phrases (Inference AI workload)
- Phrases are distilled into intent (Inference AI workload)
- Intent is routed to an appropriate fulfillment service, which returns a response as a JSON document
- JSON document is parsed, including text for Alexa’s reply
- Text form of Alexa’s reply is converted into natural-sounding speech (Inference AI workload)
- Natural speech audio is streamed back to the Echo device for playback: “It’s bergamot orange oil.”
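The steps above can be sketched as a chain of functions. This is a rough illustration only: every function name here is hypothetical, text strings stand in for audio, and each "inference" stage is a trivial stub in place of a real neural network. Nothing in it reflects Amazon's actual services or APIs.

```python
import json

WAKE_WORD = "alexa"


def detect_wake_word(utterance):
    # Runs on the Echo itself; everything after this runs in the data center.
    return utterance.lower().startswith(WAKE_WORD)


def audio_to_phonemes(audio):
    # Inference stand-in: treat each character as a "phoneme".
    return list(audio.lower())


def phonemes_to_words(phonemes):
    # Inference stand-in: regroup phonemes into words.
    return "".join(phonemes).split()


def words_to_phrase(words):
    # Inference stand-in: assemble words into a phrase.
    return " ".join(words)


def phrase_to_intent(phrase):
    # Inference stand-in: distill the phrase into a structured intent.
    if "earl grey" in phrase:
        return {"name": "ingredient_lookup", "subject": "earl grey tea"}
    return {"name": "unknown"}


def fulfill(intent):
    # Hypothetical fulfillment service that returns a JSON document.
    if intent["name"] == "ingredient_lookup":
        return json.dumps({"reply_text": "It's bergamot orange oil."})
    return json.dumps({"reply_text": "Sorry, I don't know that."})


def text_to_speech(text):
    # Inference stand-in: the encoded bytes play the role of synthesized audio.
    return text.encode("utf-8")


def handle_request(utterance):
    if not detect_wake_word(utterance):
        return None  # nothing leaves the device without the wake word
    phonemes = audio_to_phonemes(utterance)
    words = phonemes_to_words(phonemes)
    phrase = words_to_phrase(words)
    intent = phrase_to_intent(phrase)
    reply_text = json.loads(fulfill(intent))["reply_text"]
    return text_to_speech(reply_text)


reply_audio = handle_request(
    "Alexa, what's the special ingredient in Earl Grey tea?"
)
```

Only `detect_wake_word` corresponds to work done on the device; in this sketch, five of the remaining stages are the kind of inference workload the article describes.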
As you can see, almost all of the actual work done in fulfilling an Alexa request happens in the cloud, not in an Echo or Echo Dot device itself. And the vast majority of that cloud work is performed not by traditional if-then logic but by inference, which is the answer-providing side of neural network processing.
According to Stormacq, shifting this inference workload from Nvidia GPU hardware to Amazon’s own Inferentia chip resulted in 30-percent lower cost and a 25-percent improvement in end-to-end latency on Alexa’s text-to-speech workloads. Amazon isn’t the only company using the Inferentia processor. The chip powers Amazon AWS Inf1 instances, which are available to the general public and compete with Amazon’s GPU-powered G4 instances.
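As a quick worked example of what those percentages would mean in practice, the sketch below applies them to made-up baseline figures. The baseline cost and latency are purely hypothetical (the blog post gives no absolute numbers), and it assumes the "25-percent improvement" means latency that is 25 percent lower.

```python
def apply_reduction(baseline, percent):
    # Scale a baseline value down by the given percentage.
    return baseline * (100 - percent) / 100


# Hypothetical GPU baselines, chosen only to make the arithmetic easy to follow.
gpu_cost_per_hour = 100.0
gpu_latency_ms = 400.0

inferentia_cost_per_hour = apply_reduction(gpu_cost_per_hour, 30)
inferentia_latency_ms = apply_reduction(gpu_latency_ms, 25)

print(f"cost: {gpu_cost_per_hour} -> {inferentia_cost_per_hour}")
print(f"latency (ms): {gpu_latency_ms} -> {inferentia_latency_ms}")
```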
Amazon’s AWS Neuron software development kit allows machine-learning developers to use Inferentia as a target for popular frameworks, including TensorFlow, PyTorch, and MXNet.