Model Serving¶
Introduction¶
KServe is a standard Model Inference Platform on Kubernetes, built for highly scalable use cases. It provides a performant, standardized inference protocol across machine learning (ML) frameworks and supports modern serverless inference workloads with autoscaling, including Scale to Zero on GPUs.
You can use KServe to do the following:
Provide a Kubernetes Custom Resource Definition for serving ML models on arbitrary frameworks.
Encapsulate the complexity of autoscaling, networking, health checking, and server configuration to bring cutting edge serving features like GPU autoscaling, Scale to Zero, and canary rollouts to your ML deployments.
Enable a simple, pluggable, and complete story for your production ML inference server by providing prediction, pre-processing, post-processing and explainability out of the box.
Please browse the KServe GitHub repo for more details.
Get started¶
In this section, you deploy an InferenceService with a predictor that loads a spam email detection model trained on a custom dataset. The files needed to deploy the model through KServe, including the model file and its configuration file, have been prepared in advance. That preparation work is involved and lengthy, so it is not covered here; for details, for example how to prepare the model files, please read the KServe example tutorial.
Prepare model and configuration files¶
First, create a notebook server (refer to Notebooks). Then download the model package and unzip it in that notebook server:
!wget https://github.com/vmware/ml-ops-platform-for-vsphere/blob/main/website/content/en/docs/kubeflow-tutorial/lab4_files/v1.zip
!unzip v1.zip
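You can quickly check the extracted layout. This assumes the archive unpacks into a torchserve/ directory containing model-store/ and config/ subdirectories, which is what the upload code below expects; adjust the path if your archive unpacks elsewhere:
!ls -R torchserve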
Upload to MinIO¶
If you already have MinIO storage, skip the MinIO deployment step and go straight to uploading the data. If not, you can deploy a standalone MinIO on the Kubernetes cluster; refer to the "upload data to MinIO bucket" part of the Feature Store section, which uses the same MinIO deployment YAML files.
# create pvc
$ kubectl apply -f minio-standalone-pvc.yml
# create service
$ kubectl apply -f minio-standalone-service.yml
# create deployment
$ kubectl apply -f minio-standalone-deployment.yml
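Before moving on, you can check that MinIO is running; the command below assumes it was deployed into the kubeflow namespace, as in the commands that follow:
# check that the MinIO pod is up
$ kubectl get pods -n kubeflow | grep minio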
This step uploads v1/torchserve/model-store and v1/torchserve/config to a MinIO bucket. Before uploading, find the MinIO endpoint_url, accesskey, and secretkey by running the following commands in a terminal:
# get the endpoint url for MinIO
$ kubectl get svc minio-service -n kubeflow -o jsonpath='{.spec.clusterIP}'
# get the secret name for MinIO. <your-namespace> is the admin namespace for this Kubernetes cluster.
$ kubectl get secret -n <your-namespace> | grep minio
# get the access key for MinIO
$ kubectl get secret <minio-secret-name> -n <your-namespace> -o jsonpath='{.data.accesskey}' | base64 -d
# get the secret key for MinIO
$ kubectl get secret <minio-secret-name> -n <your-namespace> -o jsonpath='{.data.secretkey}' | base64 -d
Install the boto3 package in the notebook server created earlier, then run the following Python code to upload the model files:
!pip install boto3 -i https://pypi.tuna.tsinghua.edu.cn/simple
import os
from urllib.parse import urlparse
import boto3

os.environ["AWS_ENDPOINT_URL"] = "http://10.117.233.16:9000"  # replace with your MinIO endpoint url
os.environ["AWS_REGION"] = "us-east-1"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"  # replace with your MinIO access key
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"  # replace with your MinIO secret key

s3 = boto3.resource('s3',
                    endpoint_url=os.getenv("AWS_ENDPOINT_URL"),
                    verify=True)
print("current buckets in s3:")
print(list(s3.buckets.all()))

curr_path = os.getcwd()
base_path = os.path.join(curr_path, "torchserve")
bucket_name = "spam-bucket"  # replace with your MinIO bucket name; it must match the storageUri used later
bucket_path = "spam_email"
bucket = s3.Bucket(bucket_name)

# upload
bucket.upload_file(os.path.join(base_path, "model-store", "spam_email.mar"),
                   os.path.join(bucket_path, "model-store/spam_email.mar"))
bucket.upload_file(os.path.join(base_path, "config", "config.properties"),
                   os.path.join(bucket_path, "config/config.properties"))

# check files
for obj in bucket.objects.filter(Prefix=bucket_path):
    print(obj.key)
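Note that the upload assumes the bucket already exists. If it does not, a minimal sketch to create it with the same boto3 resource (reusing bucket_name from above) is:
# create the bucket if it does not exist yet
if not any(b.name == bucket_name for b in s3.buckets.all()):
    s3.create_bucket(Bucket=bucket_name)
    print("created bucket " + bucket_name)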
Create MinIO service account and secret¶
When you create an InferenceService to serve the model, credentials are needed to pull the model from MinIO. Create a MinIO secret and service account using the following YAML:
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: minio-s3-secret-user # you can set a different secret name
  annotations:
    serving.kserve.io/s3-endpoint: "10.117.233.16:9000" # replace with your s3 endpoint, e.g. minio-service.kubeflow:9000
    serving.kserve.io/s3-usehttps: "0" # 1 by default; if testing with MinIO you can set it to 0
    serving.kserve.io/s3-region: "us-east-2"
    serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false; if true, the provided credentials are ignored and anonymous credentials are used
type: Opaque
stringData: # use "stringData" for raw credential strings or "data" for base64-encoded strings
  AWS_ACCESS_KEY_ID: minioadmin # replace with your MinIO access key
  AWS_SECRET_ACCESS_KEY: minioadmin # replace with your MinIO secret key
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: minio-service-account-user # you can set a different service account name
secrets:
- name: minio-s3-secret-user
EOF
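Optionally, confirm that both objects were created (assuming the default names above):
# verify the secret and service account
$ kubectl get secret minio-s3-secret-user -n <your-namespace>
$ kubectl get serviceaccount minio-service-account-user -n <your-namespace>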
Run InferenceService using KServe¶
Let’s define a new InferenceService YAML for the model and apply it to the cluster. Make sure to set storageUri to s3://<bucket_name>/<bucket_path>. You may also need to change metadata.name and serviceAccountName.
cat << EOF | kubectl apply -f -
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "spam-email-serving" # you can set a different InferenceService name
spec:
  predictor:
    serviceAccountName: minio-service-account-user # replace with the MinIO service account created before
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://spam-bucket/spam_email" # set to your own model s3 path
      resources:
        requests:
          cpu: 50m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 500Mi
        # limits:
        #   nvidia.com/gpu: "1" # for an inference service on GPU
EOF
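It can take a while for the predictor pod to pull images and download the model from MinIO. You can watch it come up with the following command (assuming the InferenceService name used above):
# watch the predictor pod start
$ kubectl get pods -n <your-namespace> | grep spam-email-serving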
Check InferenceService status¶
Run the following command in a terminal to check the status of the InferenceService. READY showing True means your model server is running well.
$ kubectl get inferenceservice spam-email-serving -n <your-namespace>
NAME                 URL                                                    READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                           AGE
spam-email-serving   http://spam-email-serving.kubeflow-user.example.com   True           100                              spam-email-serving-predictor-default-47q2g    23h
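If READY stays False, describing the InferenceService usually reveals the cause, for example a wrong storageUri or missing MinIO credentials:
$ kubectl describe inferenceservice spam-email-serving -n <your-namespace>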
Perform a test inference¶
Define a Test_bot for convenience¶
Run the following cells in the notebook server to define a Test_bot helper for making predictions.
!pip install multiprocess -i https://pypi.tuna.tsinghua.edu.cn/simple
import requests
import json
import multiprocess as mp

class Test_bot():
    def __init__(self, uri, model, host, session):
        self.uri = uri
        self.model = model
        self.host = host
        self.session = session
        self.headers = {'Host': self.host, 'Content-Type': "application/json", 'Cookie': "authservice_session=" + self.session}
        self.email = [
            # features: shorter_text, body, business, html, money
            "[0, 0, 0, 0, 0] email longer than 500 character" + "a" * 500,  # ham
            "[1, 0, 0, 0, 0] email shorter than 500 character",  # ham
            "[1, 0, 1, 1, 1] email shorter than 500 character + business + html + money",  # spam
            "[0, 1, 0, 0, 1] email longer than 500 character + body" + "a" * 500,  # spam
            "[0, 1, 1, 1, 1] email longer than 500 character + body + business + html + money" + "a" * 500,  # spam
            "[1, 1, 1, 1, 1] email shorter than 500 character body + business + html + money",  # spam
        ]

    def update_uri(self, uri):
        self.uri = uri

    def update_model(self, model):
        self.model = model

    def update_host(self, host):
        self.host = host
        self.update_headers()

    def update_session(self, session):
        self.session = session
        self.update_headers()

    def update_headers(self):
        self.headers = {'Host': self.host, 'Content-Type': "application/json", 'Cookie': "authservice_session=" + self.session}

    def get_data(self, x):
        if isinstance(x, str):
            email = x
        elif isinstance(x, int):
            email = self.email[x % 6]
        else:
            email = self.email[0]
        json_data = json.dumps({
            "instances": [
                email,
            ]
        })
        return json_data

    def readiness(self):
        uri = self.uri + '/v1/models/' + self.model
        response = requests.get(uri, headers=self.headers, timeout=5)
        return response.text

    def predict(self, x=None):
        uri = self.uri + '/v1/models/' + self.model + ':predict'
        response = requests.post(uri, data=self.get_data(x), headers=self.headers, timeout=10)
        return response.text

    def explain(self, x=None):
        uri = self.uri + '/v1/models/' + self.model + ':explain'
        response = requests.post(uri, data=self.get_data(x), headers=self.headers, timeout=10)
        return response.text

    def concurrent_predict(self, num=10):
        print("fire " + str(num) + " requests to " + self.host)
        with mp.Pool() as pool:
            responses = pool.map(self.predict, range(num))
        return responses
Determine host and session¶
Run the following command in a terminal to get the host, which is set in the headers of your requests.
$ kubectl get inferenceservice spam-email-serving -o jsonpath='{.status.url}' | cut -d "/" -f 3
Use your web browser to log in to the Kubeflow on vSphere UI and get the authservice_session cookie. If you use Chrome, go to Developer Tools -> Application -> Cookies to find the session.
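If you prefer to check from a terminal first, the same readiness request that the bot's readiness() method sends can be issued with curl; substitute your own Kubeflow URL, the host printed above, and your session value:
$ curl -H "Host: <host-printed-above>" \
       -H "Cookie: authservice_session=<your-session>" \
       http://<kubeflow-url>/v1/models/spam_email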
Test model prediction¶
Run the following cell in the notebook server to make a model prediction.
# replace it with the url you used to access Kubeflow
bot = Test_bot(uri='http://10.117.233.8',
model='spam_email',
# replace it with what is printed above
host='spam-email-serving.kubeflow-user-example-com.example.com',
# replace with your browser session
session='MTY2NjE2MDYyMHxOd3dBTkZZelVqVkdOVkJIVUVGR1IweEVTbG95VVRZMU5WaEVXbE5GTlV0WlVrWk1WRk5FTkU5WVIxZFJRelpLVFZoWVVFOVdSa0U9fMj0VhQPme_rORhhdy0mtBJk-yGWdzibFfPMdU3TztbJ')
print(bot.readiness())
print(bot.predict(0))
The output looks like:
{"name": "spam_email", "ready": true}
{"predictions": [{"version": "2", "prediction": "ham"}]}
Delete InferenceService¶
When you are done with your InferenceService, run the following command in a terminal to delete it:
$ kubectl delete inferenceservice <your-inferenceservice> -n <your-namespace>
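If you no longer need the MinIO credentials either, delete the service account and secret created earlier (assuming the example names used above):
$ kubectl delete serviceaccount minio-service-account-user -n <your-namespace>
$ kubectl delete secret minio-s3-secret-user -n <your-namespace>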