Amazon Elastic Kubernetes Service (EKS) has quickly become a leading choice for machine learning workloads. It combines the developer agility and scalability of Kubernetes with the wide selection of Amazon Elastic Compute Cloud (EC2) instance types available on AWS, such as the C5, P3, and G4 families.
As models become more sophisticated, hardware acceleration is increasingly required to deliver fast predictions at high throughput. Today, we’re very happy to announce that AWS customers can now use the Amazon EC2 Inf1 instances on Amazon Elastic Kubernetes Service, for high performance and the lowest prediction cost in the cloud.
A primer on EC2 Inf1 instances
Inf1 instances were launched at AWS re:Invent 2019. They are powered by AWS Inferentia, a custom chip built from the ground up by AWS to accelerate machine learning inference workloads.
Inf1 instances are available in multiple sizes, with 1, 4, or 16 AWS Inferentia chips, with up to 100 Gbps network bandwidth and up to 19 Gbps EBS bandwidth. An AWS Inferentia chip contains four NeuronCores. Each one implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, saving I/O time in the process. When several AWS Inferentia chips are available on an Inf1 instance, you can partition a model across them and store it entirely in cache memory. Alternatively, to serve multi-model predictions from a single Inf1 instance, you can partition the NeuronCores of an AWS Inferentia chip across several models.
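As a quick illustration of the arithmetic above, here is a minimal sketch that counts the NeuronCores available for each chip count, and splits the four cores of a single chip across several models. The function names and the even-split policy are assumptions made for this example; they are not part of the Neuron SDK.

```python
NEURONCORES_PER_CHIP = 4  # stated above: each AWS Inferentia chip has four NeuronCores


def total_neuroncores(chips):
    """Number of NeuronCores available with the given number of chips."""
    return chips * NEURONCORES_PER_CHIP


def split_cores(models, cores=NEURONCORES_PER_CHIP):
    """Hypothetical even split of NeuronCores across models (illustration only)."""
    if not models or len(models) > cores:
        raise ValueError("need between 1 and {} models".format(cores))
    base, extra = divmod(cores, len(models))
    # The first `extra` models get one additional core each.
    return {name: base + (1 if i < extra else 0) for i, name in enumerate(models)}


if __name__ == "__main__":
    for chips in (1, 4, 16):
        print(chips, "chip(s):", total_neuroncores(chips), "NeuronCores")
    print(split_cores(["model_a", "model_b"]))
```

The actual assignment of models to NeuronCores is handled by the Neuron tools; this sketch only shows how many cores there are to share.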
Compiling Models for EC2 Inf1 Instances
To run machine learning models on Inf1 instances, you need to compile them to a hardware-optimized representation using the AWS Neuron SDK. All tools are readily available on the AWS Deep Learning AMI, and you can also install them on your own instances. You’ll find instructions in the Deep Learning AMI documentation, as well as tutorials for TensorFlow, PyTorch, and Apache MXNet in the AWS Neuron SDK repository.
In the demo below, I will show you how to deploy a Neuron-optimized model on an EKS cluster of Inf1 instances, and how to serve predictions with TensorFlow Serving. The model in question is BERT, a state-of-the-art model for natural language processing tasks. This is a huge model with hundreds of millions of parameters, making it a great candidate for hardware acceleration.
Building an EKS Cluster of EC2 Inf1 Instances
First of all, let’s build a cluster with two inf1.2xlarge instances. I can easily do this with eksctl, the command-line tool to provision and manage EKS clusters. You can find installation instructions in the EKS documentation.
Here is the configuration file for my cluster. Eksctl detects that I’m launching a node group with an Inf1 instance type, and will start my worker nodes using the EKS-optimized Accelerated AMI.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: cluster-inf1
  region: us-west-2
nodeGroups:
  - name: ng1-public
    instanceType: inf1.2xlarge
    minSize: 0
    maxSize: 3
    desiredCapacity: 2
    ssh:
      allow: true
Then, I use eksctl to create the cluster. This process will take approximately 10 minutes.
$ eksctl create cluster -f inf1-cluster.yaml
Eksctl automatically installs the Neuron device plugin in the cluster. This plugin advertises Neuron devices to the Kubernetes scheduler, so that containers can request them in a deployment spec. I can check with kubectl that the device plugin container is running fine on both Inf1 instances.
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
aws-node-tl5xv 1/1 Running 0 14h
aws-node-wk6qm 1/1 Running 0 14h
coredns-86d5cbb4bd-4fxrh 1/1 Running 0 14h
coredns-86d5cbb4bd-sts7g 1/1 Running 0 14h
kube-proxy-7px8d 1/1 Running 0 14h
kube-proxy-zqvtc 1/1 Running 0 14h
neuron-device-plugin-daemonset-888j4 1/1 Running 0 14h
neuron-device-plugin-daemonset-tq9kc 1/1 Running 0 14h
Next, I define AWS credentials in a Kubernetes secret. They will allow me to grab my BERT model stored in S3. Please note that both keys need to be base64-encoded.
apiVersion: v1
kind: Secret
metadata:
  name: aws-s3-secret
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <base64-encoded value>
  AWS_SECRET_ACCESS_KEY: <base64-encoded value>
Finally, I store these credentials on the cluster.
$ kubectl apply -f secret.yaml
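The base64 encoding required for the secret values can be done in a few lines of Python. This is a minimal sketch: the helper name and the placeholder strings are assumptions for illustration, and real credentials should never be hard-coded in a script.

```python
import base64


def encode_secret_value(value):
    """Base64-encode a string for the `data` field of a Kubernetes Secret."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print("AWS_ACCESS_KEY_ID:", encode_secret_value("EXAMPLE_ACCESS_KEY_ID"))
    print("AWS_SECRET_ACCESS_KEY:", encode_secret_value("EXAMPLE_SECRET_ACCESS_KEY"))
```

Encoding the exact string this way avoids the trailing newline that a shell echo adds unless it is called with -n, which would otherwise end up inside the decoded credential.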
The cluster is correctly set up. Now, let’s build an application container storing a Neuron-enabled version of TensorFlow Serving.
Building an Application Container for TensorFlow Serving
The Dockerfile is very simple. We start from an Amazon Linux 2 base image. Then, we install the AWS CLI, and the TensorFlow Serving package available in the Neuron repository.
FROM amazonlinux:2
RUN yum install -y awscli
RUN echo $'[neuron] \n\
name=Neuron YUM Repository \n\
baseurl=https://yum.repos.neuron.amazonaws.com \n\
enabled=1' > /etc/yum.repos.d/neuron.repo
RUN rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB
RUN yum install -y tensorflow-model-server-neuron
I build the image, create an Amazon Elastic Container Registry repository, and push the image to it.
$ docker build . -f Dockerfile -t tensorflow-model-server-neuron
$ docker tag tensorflow-model-server-neuron 123456789012.dkr.ecr.us-west-2.amazonaws.com/inf1-demo
$ aws ecr create-repository --repository-name inf1-demo
$ docker push 123456789012.dkr.ecr.us-west-2.amazonaws.com/inf1-demo
Our application container is ready. Now, let’s define a Kubernetes service that will use this container to serve BERT predictions. I’m using a model that has already been compiled with the Neuron SDK. You can compile your own using the instructions available in the Neuron SDK repository.
Deploying BERT as a Kubernetes Service
The deployment manages two containers: the Neuron runtime container, and my application container. The Neuron runtime runs as a sidecar container, and is used to interact with the AWS Inferentia chips. At startup, the application container configures the AWS CLI with the appropriate security credentials. Then, it fetches the BERT model from S3. Finally, it launches TensorFlow Serving, loading the BERT model and waiting for prediction requests. For this purpose, the HTTP and gRPC ports are open. Here is the full manifest.
kind: Service
apiVersion: v1
metadata:
  name: eks-neuron-test
  labels:
    app: eks-neuron-test
spec:
  ports:
  - name: http-tf-serving
    port: 8500
    targetPort: 8500
  - name: grpc-tf-serving
    port: 9000
    targetPort: 9000
  selector:
    app: eks-neuron-test
    role: master
  type: ClusterIP
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: eks-neuron-test
  labels:
    app: eks-neuron-test
    role: master
spec:
  replicas: 2
  selector:
    matchLabels:
      app: eks-neuron-test
      role: master
  template:
    metadata:
      labels:
        app: eks-neuron-test
        role: master
    spec:
      volumes:
        - name: sock
          emptyDir: {}
      containers:
      - name: eks-neuron-test
        image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/inf1-demo:latest
        command: ["/bin/sh","-c"]
        args:
          - "mkdir ~/.aws/ && \
            echo '[eks-test-profile]' > ~/.aws/credentials && \
            echo AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID >> ~/.aws/credentials && \
            echo AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY >> ~/.aws/credentials; \
            /usr/bin/aws --profile eks-test-profile s3 sync s3://jsimon-inf1-demo/bert /tmp/bert && \
            /usr/local/bin/tensorflow_model_server_neuron --port=9000 --rest_api_port=8500 --model_name=bert_mrpc_hc_gelus_b4_l24_0926_02 --model_base_path=/tmp/bert/"
        ports:
        - containerPort: 8500
        - containerPort: 9000
        imagePullPolicy: Always
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              key: AWS_ACCESS_KEY_ID
              name: aws-s3-secret
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: AWS_SECRET_ACCESS_KEY
              name: aws-s3-secret
        - name: NEURON_RTD_ADDRESS
          value: unix:/sock/neuron.sock
        resources:
          limits:
            cpu: 4
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
          - name: sock
            mountPath: /sock
      - name: neuron-rtd
        image: 790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-rtd:1.0.6905.0
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
            - IPC_LOCK
        volumeMounts:
          - name: sock
            mountPath: /sock
        resources:
          limits:
            hugepages-2Mi: 256Mi
            aws.amazon.com/neuron: 1
          requests:
            memory: 1024Mi
I use kubectl to create the service.
$ kubectl create -f bert_service.yml
A few seconds later, the pods are up and running.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
eks-neuron-test-5d59b55986-7kdml 2/2 Running 0 14h
eks-neuron-test-5d59b55986-gljlq 2/2 Running 0 14h
Finally, I redirect service port 9000 to local port 9000, to let my prediction client connect locally.
$ kubectl port-forward svc/eks-neuron-test 9000:9000 &
Now, everything is ready for prediction, so let’s invoke the model.
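The manifest also opens the REST port (8500) of TensorFlow Serving. As a hedged sketch, this is how a request for the TensorFlow Serving REST predict endpoint could be built. It assumes that port 8500 has been forwarded as well (only port 9000 is forwarded above), and that the model accepts the three inputs used by the gRPC client in the next section. The helper name is hypothetical.

```python
import json

MODEL_NAME = "bert_mrpc_hc_gelus_b4_l24_0926_02"  # model name used in the manifest
SEQ_LEN = 128


def build_rest_request(host="localhost", port=8500):
    """Return the URL and JSON body for a TensorFlow Serving REST predict call."""
    url = "http://{}:{}/v1/models/{}:predict".format(host, port, MODEL_NAME)
    zeros = [0] * SEQ_LEN
    # Row format: one dict of named inputs per instance.
    instance = {"input_ids": zeros, "input_mask": zeros, "segment_ids": zeros}
    return url, json.dumps({"instances": [instance]})


if __name__ == "__main__":
    url, body = build_rest_request()
    print(url)
    print(len(body), "bytes of JSON")
    # To send it (requires a reachable endpoint):
    # import urllib.request
    # req = urllib.request.Request(url, data=body.encode("utf-8"),
    #                              headers={"Content-Type": "application/json"})
    # print(urllib.request.urlopen(req).read())
```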
Predicting with BERT on EKS and Inf1
The inner workings of BERT are beyond the scope of this post. This particular model expects a sequence of 128 tokens, encoding the words of two sentences we’d like to compare for semantic equivalence.
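To make the input format more concrete, here is a minimal sketch of how two tokenized sentences are usually packed into the three fixed-length BERT inputs. The special token ids (101 for [CLS], 102 for [SEP], 0 for padding) are assumptions taken from the bert-base-uncased vocabulary, and the helper name is hypothetical; a real client would get token ids from the tokenizer matching the model.

```python
SEQ_LEN = 128
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed ids (bert-base-uncased vocabulary)


def build_pair_inputs(tokens_a, tokens_b, seq_len=SEQ_LEN):
    """Pack two lists of token ids into fixed-length BERT inputs."""
    ids = [CLS_ID] + tokens_a + [SEP_ID] + tokens_b + [SEP_ID]
    if len(ids) > seq_len:
        raise ValueError("sentence pair is longer than {} tokens".format(seq_len))
    # Segment 0 covers [CLS], sentence A and its [SEP];
    # segment 1 covers sentence B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_mask = [1] * len(ids)  # 1 for real tokens, 0 for padding
    padding = seq_len - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,
        "input_mask": input_mask + [0] * padding,
        "segment_ids": segment_ids + [0] * padding,
    }


if __name__ == "__main__":
    inputs = build_pair_inputs([7, 8], [9])  # made-up token ids
    print({name: values[:8] for name, values in inputs.items()})
```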
Here, I’m only interested in measuring prediction latency, so dummy data is fine. I build 100 prediction requests storing a sequence of 128 zeros. I send them to the TensorFlow Serving endpoint via gRPC, and I compute the average prediction time.
import time

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

if __name__ == '__main__':
    # Connect to the port forwarded by kubectl
    channel = grpc.insecure_channel('localhost:9000')
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'bert_mrpc_hc_gelus_b4_l24_0926_02'
    # Dummy data: one sequence of 128 zeros, reused for the three BERT inputs
    zeros = np.zeros([1, 128], dtype=np.int32)
    # Note: tf.contrib is only available in TensorFlow 1.x
    request.inputs['input_ids'].CopyFrom(tf.contrib.util.make_tensor_proto(zeros, shape=zeros.shape))
    request.inputs['input_mask'].CopyFrom(tf.contrib.util.make_tensor_proto(zeros, shape=zeros.shape))
    request.inputs['segment_ids'].CopyFrom(tf.contrib.util.make_tensor_proto(zeros, shape=zeros.shape))
    latencies = []
    for i in range(100):
        start = time.time()
        result = stub.Predict(request)
        # time.time() returns seconds, so latencies are in seconds
        latencies.append(time.time() - start)
        print("Inference successful: {}".format(i))
    print("Ran {} inferences successfully. Latency average = {}".format(len(latencies), np.average(latencies)))
On average, prediction took 59.2 ms. As far as BERT goes, this is pretty good!
Ran 100 inferences successfully. Latency average = 0.05920819044113159
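Because the client measures time with time.time(), the value printed above is in seconds. An average also hides outliers, so here is a small sketch that converts latencies to milliseconds and adds nearest-rank percentiles. The helper names are hypothetical, and the sample values in the demo are made up for illustration; they are not measurements.

```python
import math


def to_ms(seconds):
    """Convert a duration in seconds to milliseconds."""
    return seconds * 1000.0


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, for 0 < pct <= 100."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(latencies):
    """Mean, median and 99th percentile of latencies given in seconds."""
    return {
        "mean_ms": to_ms(sum(latencies) / len(latencies)),
        "p50_ms": to_ms(percentile(latencies, 50)),
        "p99_ms": to_ms(percentile(latencies, 99)),
    }


if __name__ == "__main__":
    # The average reported above, converted to milliseconds.
    print(round(to_ms(0.05920819044113159), 1), "ms")
    # Made-up sample latencies in seconds.
    print(summarize([0.058, 0.059, 0.060, 0.075]))
```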
In real life, we would certainly be batching prediction requests in order to increase throughput. If needed, we could also scale to larger Inf1 instances supporting several Inferentia chips, and deliver even more prediction performance at low cost.
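As a rough sketch of the batching idea, the helpers below group pending sequences into batches and estimate throughput for a single client sending batches back to back. Both function names are hypothetical. Depending on how the model was compiled, a fixed batch size may be required, in which case the last, smaller batch would need padding.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def throughput_per_second(batch_size, batch_latency_s):
    """Sequences per second for one client sending batches back to back."""
    return batch_size / batch_latency_s


if __name__ == "__main__":
    pending = list(range(10))  # stand-ins for 10 tokenized sequences
    print([len(batch) for batch in make_batches(pending, 4)])
    # With the latency measured above and a batch size of 1:
    print(round(throughput_per_second(1, 0.0592), 1), "sequences per second")
```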
Getting Started
Kubernetes users can deploy Amazon Elastic Compute Cloud (EC2) Inf1 instances on Amazon Elastic Kubernetes Service today in the US East (N. Virginia) and US West (Oregon) regions. As Inf1 deployment progresses, you’ll be able to use them with Amazon Elastic Kubernetes Service in more regions.
Give this a try, and please send us feedback either through your usual AWS Support contacts, on the AWS Forum for Amazon Elastic Kubernetes Service, or on the container roadmap on GitHub.
- Julien Via AWS News Blog https://ift.tt/1EusYcK