Install Kubeflow on vSphere

This section walks you through installing Kubeflow on vSphere.

Note

This section installs version 1.6.1 of Kubeflow on vSphere. Configurations differ slightly for other versions.

Prerequisites

Meet the following requirements before deploying the Kubeflow on vSphere package on Tanzu Kubernetes Grid Service (TKG) clusters.

On TKG, Kubeflow on vSphere is installed on a Tanzu Kubernetes Cluster (TKC), so you need to have vSphere and a TKC ready before you deploy.

Minimum resources required for a TKG cluster to install Kubeflow

To install Kubeflow, the TKG cluster must meet the following minimum requirements:

  • Kubernetes version 1.21, 1.22, 1.23, 1.24 or 1.25

  • At least one worker node that meets the following minimum resource requirements:
    • 4 CPUs
    • 16 GB memory
    • 50 GB storage

Note

The resource requirements above support only a minimal, trial-scale Kubeflow installation, which may not be able to run heavy workloads. Size the TKG cluster according to the workloads you plan to deploy with Kubeflow.

Deploy Kubeflow on vSphere package on TKG clusters

The deployment procedure below applies to both Linux and Windows users. Windows users must first install the Windows versions of the kubectl and kctrl commands.

Add package repository

kubectl create ns carvel-kubeflow
kubectl config set-context --current --namespace=carvel-kubeflow

kctrl package repository add --repository kubeflow-carvel-repo --url projects.registry.vmware.com/kubeflow/kubeflow-carvel-repo:1.6.1

If you get the error kctrl: Error: the server could not find the requested resource (post packagerepositories.packaging.carvel.dev), the Carvel Custom Resource Definitions (CRDs) have not been installed. Install them by running:

kubectl apply -f https://github.com/vmware-tanzu/carvel-kapp-controller/releases/latest/download/release.yml

If kapp-controller fails to deploy, make sure the PodSecurityPolicy is properly configured:

kubectl create rolebinding psp:serviceaccounts --clusterrole=psp:vmware-system-restricted --group=system:serviceaccounts -n kapp-controller

Check the kapp-controller deployment by running:

kubectl get deployment.apps/kapp-controller -n kapp-controller

# example output:
# NAME              READY   UP-TO-DATE   AVAILABLE   AGE
# kapp-controller   0/1     0            0           2m11s

When READY shows 1/1, kapp-controller is running successfully and you can add the package repository again.
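
Rather than re-running the check by hand, you could block until the deployment has rolled out. The following is a minimal sketch, assuming kubectl's rollout status subcommand; the helper name wait_for_kapp_controller and the 5-minute default timeout are illustrative assumptions, not part of the Kubeflow on vSphere package.

```shell
# Hypothetical helper: wait until the kapp-controller Deployment has rolled out.
# $1 (optional): timeout passed to kubectl; the 5m default is an assumption.
wait_for_kapp_controller() {
    kubectl rollout status deployment/kapp-controller \
        -n kapp-controller --timeout="${1:-5m}"
}

# Example: wait_for_kapp_controller 10m
```

If the helper returns successfully, the deployment is available and the package repository can be added again.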

Create config.yaml file

Create a config.yaml file, which is used later to install Kubeflow on vSphere.

Note

This YAML file is based on the values schema (that is, the configurable settings) of the Kubeflow on vSphere package. For details, see Values schema.

cat <<EOF > config.yaml

service_type: "LoadBalancer"

IP_address: ""
CD_REGISTRATION_FLOW: True
EOF
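
If your environment has no load balancer, a NodePort variant of the same file might look like the sketch below. The file name config-nodeport.yaml is a hypothetical choice; the keys and the NodePort option come from the values schema described later on this page.

```shell
# Hypothetical alternative values file that exposes istio-ingressgateway as a NodePort.
# IP_address is left empty because it applies only to the LoadBalancer service type.
cat <<EOF > config-nodeport.yaml
service_type: "NodePort"
IP_address: ""
CD_REGISTRATION_FLOW: True
EOF
```

To use it, pass this file to --values-file instead of config.yaml.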

Install Kubeflow on vSphere package

kctrl package install \
    --wait-check-interval 5s \
    --wait-timeout 30m0s \
    --package-install kubeflow \
    --package kubeflow.community.tanzu.vmware.com \
    --version 1.6.1 \
    --values-file config.yaml

This takes a few minutes. If the installation is successful, a "Succeeded" message appears at the end.

../_images/install-tkgs-deploySucceed.png

To follow the installation progress, run:

kctrl package installed status -i kubeflow

Access Kubeflow on vSphere

Now you can access the deployed Kubeflow on vSphere in a browser and start using it.

To access Kubeflow on vSphere, you need the IP address of the service. There are three options:

  • If you set service_type to LoadBalancer, run the following command and visit the EXTERNAL-IP of istio-ingressgateway.

    kubectl get svc istio-ingressgateway -n istio-system
    
    # example output:
    # NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                                                                      AGE
    # istio-ingressgateway   LoadBalancer   198.51.217.125   10.105.151.142   15021:31063/TCP,80:30926/TCP,443:31275/TCP,31400:30518/TCP,15443:31204/TCP   11d
    
    # In this example, visit http://10.105.151.142:80
    
  • If you set service_type to NodePort, run the following commands and visit nodeIP:nodePort, where nodePort is the port mapped to port 80.

    kubectl get svc istio-ingressgateway -n istio-system
    
    # example output:
    # NAME                   TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                                      AGE
    # istio-ingressgateway   NodePort   198.51.217.125   <none>        15021:31063/TCP,80:30926/TCP,443:31275/TCP,31400:30518/TCP,15443:31204/TCP   11d
    
    kubectl get nodes -o wide
    
    # example output:
    # NAME                                                      STATUS   ROLES                  AGE   VERSION            INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
    # v1a2-v1-23-8-tkc-v100-8c-dcpvc-4zct9                      Ready    control-plane,master   26d   v1.23.8+vmware.2   10.105.151.73   <none>        Ubuntu 20.04.4 LTS   5.4.0-124-generic   containerd://1.6.6
    # v1a2-v1-23-8-tkc-v100-8c-workers-zwfx4-77b7df85f7-f7f6f   Ready    <none>                 26d   v1.23.8+vmware.2   10.105.151.74   <none>        Ubuntu 20.04.4 LTS   5.4.0-124-generic   containerd://1.6.6
    # v1a2-v1-23-8-tkc-v100-8c-workers-zwfx4-77b7df85f7-l5mp5   Ready    <none>                 26d   v1.23.8+vmware.2   10.105.151.75   <none>        Ubuntu 20.04.4 LTS   5.4.0-124-generic   containerd://1.6.6
    
    # In this example, any one of the following works:
    # http://10.105.151.73:30926
    # http://10.105.151.74:30926
    # http://10.105.151.75:30926
    
  • Use port-forward, then visit port 8080 on the IP address of your client host.

    kubectl port-forward -n istio-system svc/istio-ingressgateway --address 0.0.0.0 8080:80
    
    # if you run the command locally, visit http://localhost:8080
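
Instead of reading the address out of the table output, the first two options can be scripted with kubectl's jsonpath output. This is a minimal sketch: the helper names are hypothetical, and it assumes that the HTTP listener is service port 80 (as in the example output above) and that the load balancer reports an IP address rather than a hostname.

```shell
# Hypothetical helper: print the URL for service_type LoadBalancer.
kubeflow_lb_url() {
    ip=$(kubectl get svc istio-ingressgateway -n istio-system \
        -o jsonpath='{.status.loadBalancer.ingress[0].ip}') || return 1
    [ -n "$ip" ] || return 1      # no EXTERNAL-IP assigned yet
    echo "http://${ip}"
}

# Hypothetical helper: print the URL for service_type NodePort.
# $1: INTERNAL-IP of any node, taken from `kubectl get nodes -o wide`.
kubeflow_nodeport_url() {
    port=$(kubectl get svc istio-ingressgateway -n istio-system \
        -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}') || return 1
    [ -n "$port" ] || return 1    # service port 80 has no node port
    echo "http://$1:${port}"
}

# Example: kubeflow_nodeport_url 10.105.151.74
```

Both helpers return a non-zero status when the address cannot be determined, so they can be used in a script that waits for the service to come up.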
    

Then use that address to access Kubeflow on vSphere in a browser.

../_images/install-tkgs-login.png

If you have not changed the Kubeflow on vSphere configuration, the default login credentials are: user@example.com / 12341234.

The first time you log in after deployment, you are guided to the namespace creation page.

../_images/install-tkgs-createNS.png

After that, the Kubeflow on vSphere web UI looks like this:

../_images/install-tkgs-home.png

Configure pod permission and security policy

For a first-time deployment, you need to configure pod permissions and the pod security policy so that new pods can be created and configured. This matters because many Kubeflow on vSphere functions, such as Notebook Server creation, need to create pods.

To check your own user profile:

kubectl get profile
kubectl get serviceaccount,authorizationpolicies,rolebinding -n <namespace_name>

To configure the pod security policy, run the following command on your client host:

cat << EOF | kubectl apply -f -
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rb-all-sa_ns-<namespace_name>
  namespace: <namespace_name>
roleRef:
  kind: ClusterRole
  name: psp:vmware-system-privileged
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:<namespace_name>
EOF

Note

Replace <namespace_name> with the namespace that you work in.
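
To avoid editing the placeholder in three places by hand, the manifest can be generated from a single argument. The following is a sketch only; render_psp_rolebinding is a hypothetical helper name, and the manifest it prints is the RoleBinding shown above with the namespace filled in.

```shell
# Hypothetical helper: print the RoleBinding manifest for the namespace given as $1.
render_psp_rolebinding() {
    ns="$1"
    [ -n "$ns" ] || { echo "usage: render_psp_rolebinding <namespace>" >&2; return 1; }
    cat <<EOF
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rb-all-sa_ns-${ns}
  namespace: ${ns}
roleRef:
  kind: ClusterRole
  name: psp:vmware-system-privileged
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:${ns}
EOF
}

# Example (my-namespace is a placeholder):
#   render_psp_rolebinding my-namespace | kubectl apply -f -
```

Printing the manifest to standard output first also lets you review it before applying it.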

Troubleshooting

For more kctrl commands, see kapp-controller's native CLI documentation.

Delete the Kubeflow on vSphere package

To uninstall the Kubeflow on vSphere package:

kctrl package installed delete --package-install kubeflow

When you delete the Kubeflow on vSphere package, some resources may get stuck in a deleting (Terminating) state. To resolve this:

# take namespace knative-serving as an example
kubectl patch ns knative-serving -p '{"spec":{"finalizers":null}}'
kubectl delete ns knative-serving --grace-period=0 --force
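
To find out which namespaces are affected, you can filter the namespace list by its STATUS column. This is a minimal sketch; terminating_namespaces is a hypothetical helper, and it assumes the default kubectl get ns column layout (NAME, STATUS, AGE).

```shell
# Hypothetical helper: read `kubectl get ns --no-headers` output on stdin and
# print the names of namespaces whose STATUS column is Terminating.
terminating_namespaces() {
    awk '$2 == "Terminating" { print $1 }'
}

# Example: kubectl get ns --no-headers | terminating_namespaces
```

Each name it prints is a candidate for the two commands shown above.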

Reconciliation issue

kapp-controller continuously reconciles Kubeflow on vSphere, which can prevent you from editing a Kubeflow on vSphere resource. In this case, you can pause the reconciliation and trigger it again later.

  • To pause the reconciliation of a package installation:

    kctrl package installed pause --package-install kubeflow
    
  • To trigger the reconciliation of a package installation:

    kctrl package installed kick --package-install kubeflow --wait --wait-check-interval 5s --wait-timeout 30m0s
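
One way to combine the two commands is a wrapper that pauses reconciliation, runs a command of your choice, and then triggers reconciliation again. This is a sketch of one possible workflow, not a documented procedure: with_reconcile_paused is a hypothetical helper, it uses only the two kctrl commands shown above, and kctrl may ask for confirmation interactively.

```shell
# Hypothetical wrapper: pause reconciliation, run the given command, then kick.
# Returns the exit status of the wrapped command unless a kctrl call fails.
with_reconcile_paused() {
    kctrl package installed pause --package-install kubeflow || return 1
    "$@"
    rc=$?
    kctrl package installed kick --package-install kubeflow \
        --wait --wait-check-interval 5s --wait-timeout 30m0s || return 1
    return "$rc"
}

# Example (the resource name is a placeholder):
#   with_reconcile_paused kubectl edit configmap <name> -n kubeflow
```

The kick step runs even when the wrapped command fails, so the package is not left paused by accident.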
    

Inspect package installation

  • To check the status of the package installation:

    kubectl get PackageInstall kubeflow -o yaml
    
  • To print the status of the App created by the package installation:

    kctrl package installed status --package-install kubeflow
    

Update package configurations

To update the configuration of the Kubeflow on vSphere package with an updated configuration file (for example, config.yaml):

kctrl package installed update --package-install kubeflow --values-file config.yaml

Values schema

To inspect the values schema (the configurable settings) of the Kubeflow on vSphere package, run the following command:

kctrl package available get -p kubeflow.community.tanzu.vmware.com/1.6.1 --values-schema

The table below summarizes some important keys in the values schema.

Key                    Default        Type      Description
CD_REGISTRATION_FLOW   true           boolean   Turn on Registration Flow, so that the Kubeflow on vSphere Central Dashboard prompts new users to create a namespace (profile).
IP_address             ""             string    EXTERNAL_IP address of istio-ingressgateway, valid only if service_type is LoadBalancer.
service_type           LoadBalancer   string    Service type of istio-ingressgateway. Available options: LoadBalancer or NodePort.

Notebook Server creation failure

When you try to create a Notebook Server, you may encounter the following error:

FailedCreate 1s (x2 over 1s) statefulset-controller create Pod test-01-0 in StatefulSet test-01 failed error: pods "test-01-0" is forbidden: PodSecurityPolicy: unable to admit pod: []

This error occurs because Notebook Server creation requires pod creation, and the pod security policy is not configured correctly. To resolve it, configure the pod security policy as described in Configure pod permission and security policy.

cert-manager-webhook is not ready

cert-manager is used by Kubeflow components to provide certificates for admission webhooks. When you install Kubeflow, you may encounter the following cert-manager error:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.202.64:443: connect: connection refused

This error message indicates that the webhook is not yet ready to receive requests. Wait a few seconds and retry.
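
If you script the installation, the wait-and-retry step can be automated with a small loop. The following is a generic sketch, not part of the package tooling; retry is a hypothetical helper, and the attempt count and delay in the example are arbitrary.

```shell
# Hypothetical helper: run a command until it succeeds.
# $1: maximum number of attempts, $2: seconds to sleep between attempts,
# remaining arguments: the command to run.
retry() {
    attempts="$1"; delay="$2"; shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (manifest.yaml is a placeholder):
#   retry 10 5 kubectl apply -f manifest.yaml
```

The helper returns as soon as the command succeeds, and fails only after the last attempt.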

For more cert-manager troubleshooting information, see https://cert-manager.io/docs/troubleshooting/webhook/