# Known Issues
## General Issues
### Existing clusters show Kubernetes version as 0.0.0 after CSE is upgraded to 2.6.0
The way the Kubernetes version of a cluster is determined changed between CSE 2.5.x and CSE 2.6.0. If the cluster metadata is not properly updated, CSE 2.6.0 defaults the version to 0.0.0.
Workaround: CSE 2.6.1 takes care of this issue and defaults to the Kubernetes
version of the template from which the cluster was deployed. Note, however, that
if the template itself was created by CSE 2.5.x, this approach is not foolproof.
In such cases it is better to recreate the template in CSE 2.6.1 and then run the
`cse convert-cluster` command against the affected cluster to fix its metadata.
Possible error messages if the template is not recreated and `cse convert-cluster`
is not run include (but are not limited to):
**N/A shown, or patch version missing from the Kubernetes version field:**
```sh
$ vcd cse cluster list
Name                        VDC          Org      Kubernetes     Status      Provider
--------------------------  -----------  -------  -------------  ----------  ----------
used_old_tempalte           new-org-vdc  new-org  upstream 1.16  POWERED_ON  native
didn_t_run_cluster_convert  new-org-vdc  new-org  N/A            POWERED_ON  native
```
**Kubernetes upgrade operation fails:**
```sh
$ vcd cse cluster upgrade "used_old_tempalte" ubuntu-16.04_k8-1.17_weave-2.6.0 1
cluster operation: Upgrading cluster 'used_old_tempalte' software to match template
ubuntu-16.04_k8-1.17_weave-2.6.0 (revision 1): Kubernetes: 1.16 -> 1.17.2,
Docker-CE: 18.09.7 -> 19.03.5, CNI: weave 2.6.0 -> 2.6.0,
.
.
task: [REDACTED uuid], result: error, message: Unexpected error while upgrading
cluster 'used_old_tempalte': Invalid version string: '1.16'
```
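For reference, a minimal sketch of the metadata-fix step described above. The `-c/--config` flag here is an assumption based on other `cse` server commands; check `cse convert-cluster --help` for the exact options:

```sh
# Hypothetical invocation; run against the affected cluster after the
# template has been recreated in CSE 2.6.1. The -c flag is assumed to
# point at the CSE server config file, as with other cse commands.
cse convert-cluster -c config.yaml used_old_tempalte
```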
### Never-ending CSE tasks in VCD UI / Failed CSE tasks without a proper error message
If the CSE server encounters an error during cluster or node creation, users may see CSE tasks in VCD that never reach completion, or tasks that show up as failed without a proper error message. Currently, the UI lacks the ability to properly surface error messages upon operation failures. For example, a user input parameter may have been invalid, or an unexpected error (such as a network connection outage) may have occurred. Please inspect the CSE server logs in these cases, or file a GitHub issue.
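When digging into such a failure on the server side, tailing the CSE server debug log is usually the quickest route to the underlying error. The path below assumes the default `cse-logs` directory relative to where the server runs; your deployment may write logs elsewhere:

```sh
# Assumed default log location; adjust to where your CSE server writes logs
tail -f cse-logs/cse-server-debug.log
```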
### Fresh installation of CSE 2.5.1 or below via `pip install` is broken
CSE 2.5.1 and below have open-ended dependencies, which permit `pip` to pull
and install the latest versions of those dependencies. Two such dependencies
are `pyvcloud` and `vcd-cli`, whose latest available versions are incompatible
with CSE 2.5.1 and below. We are reviewing our design on dependencies, and hope
to bring improvements in the near future.
Workaround: Uninstall the incompatible `pyvcloud` and `vcd-cli` libraries, and
manually install compatible versions.
```sh
# Uninstall pyvcloud and vcd-cli
pip3 uninstall pyvcloud vcd-cli --yes
# Install specific versions of the libraries that are compatible with CSE 2.5.1 and CSE 2.0.0
pip3 install pyvcloud==21.0.0 vcd-cli==22.0.0 --upgrade --user
```
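As a quick sanity check (not part of the original workaround), you can confirm that the pinned versions are the ones pip actually installed:

```sh
# Verify the installed versions match the pins above
pip3 show pyvcloud vcd-cli | grep -E '^(Name|Version):'
```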
### `vcd cse ovdc list` operation will time out when numerous OrgVDCs exist
CSE makes one API call per OrgVDC to access the required metadata, and those calls can time out when a large number of OrgVDCs exist.
Example: running `vcd cse ovdc list` with 250+ OrgVDCs:
```sh
$ vcd cse ovdc list
Usage: vcd cse ovdc list [OPTIONS]
Try "vcd cse ovdc list -h" for help.
Error: Unknown error. Please contact your System Administrator
```
Workaround: extend the cell timeout so that VCD waits long enough for the operation to complete. See the section 'Setting the API Extension Timeout' under CSE Server Management.
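For convenience, a sketch of what extending the timeout looks like on a VCD cell. The install path and the 60-second value are illustrative; the 'Setting the API Extension Timeout' section remains the authoritative reference:

```sh
# Run on the VCD cell; the install path may differ in your deployment
cd /opt/vmware/vcloud-director/bin
# Inspect the current API extension timeout
./cell-management-tool manage-config -n extensibility.timeout -l
# Raise it (value in seconds; pick one large enough for your OrgVDC count)
./cell-management-tool manage-config -n extensibility.timeout -v 60
```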
### CSE server fails to start up after disabling Service Provider Access to the Legacy API Endpoint
Workaround: Do not disable Service Provider Access to the Legacy API Endpoint.
VCD 10.0 deprecates the `/api/sessions` REST endpoint and introduces a new
`/cloudapi/`-based REST endpoint for authenticating VCD users. CSE relies on
the `/api` endpoint for its operations, so it is necessary that the legacy API
endpoint is not disabled in vCloud Director.

Update: CSE 2.6.0 has resolved this issue.
### Failures during template creation or installation
- One of the template creation scripts may have exited with an error
- One of the scripts may be hung waiting for a response
- If the VM has no Internet access, the scripts may fail
- Check the CSE logs for script output to determine the cause of the observed failure
### CSE service fails to start
- Workaround: rebooting the VM starts the service
### CSE 1.2.6 and up are incompatible with VCD 9.0
- CSE installation fails with `MissingLinkException`
### Cluster creation fails when the VCD external network has a DNS suffix and the DNS server resolves `localhost.my.suffix` to a valid IP
This is due to a bug in etcd (More detail HERE, with the kubeadm config file contents necessary for the workaround specified in this comment).
The main issue is that etcd prioritizes the DNS server (if one exists) over the
`/etc/hosts` file when resolving hostnames, whereas the conventional behavior is
to check any hosts files before going to the DNS server. This becomes
problematic when kubeadm attempts to initialize the control plane node using
`localhost`. etcd checks the DNS server for any entry like `localhost.suffix`,
and if that actually resolves to an IP, attempts to perform operations involving
that incorrect IP instead of `localhost`.
The workaround (more detail HERE) is to create a kubeadm config file (there is
no way to specify the `listen-peer-urls` argument on the command line), and to
modify the `kubeadm init` command in the CSE control plane script for the
template of the cluster you are attempting to deploy. The CSE control plane
script is located at
`~/.cse-scripts/<template name>_rev<template_revision>/scripts/mstr.sh`.
Change the command from

```sh
kubeadm init --kubernetes-version=v1.13.5 > /root/kubeadm-init.out
```

to

```sh
kubeadm init --config /path/to/kubeadm.yaml > /root/kubeadm-init.out
```

The Kubernetes version has to be specified within the configuration file itself,
since `--kubernetes-version` and `--config` cannot be used together.
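As an illustration, a minimal sketch of such a config file, written from the control plane script. The `apiVersion` and the loopback `listen-peer-urls` value follow the etcd workaround referenced above, but field names vary across kubeadm releases, so verify them against your kubeadm version:

```sh
# Hypothetical sketch: generate /path/to/kubeadm.yaml before running kubeadm init.
# Field names follow the kubeadm v1beta1 API; adjust apiVersion for your release.
cat > /path/to/kubeadm.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
# --kubernetes-version moves here, since the flag cannot be combined with --config
kubernetesVersion: v1.13.5
etcd:
  local:
    extraArgs:
      # Pin etcd's peer URL to loopback so the DNS server is never consulted
      # for a localhost.<suffix> entry
      listen-peer-urls: "http://127.0.0.1:2380"
EOF
```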
## NFS Limitations
Currently, NFS servers in a Kubernetes cluster are accessible not only by nodes of that cluster but also by any VM (outside of the cluster) residing in the same OrgVDC. The ideal solution is to have a vApp network created for each Kubernetes cluster, which is on our road-map to implement. Until then, please choose one of the workarounds below to avert this problem if the need arises.
- Give access to only the control plane and worker nodes of the cluster by adding the
  individual IPs of the nodes to the /etc/exports file on the NFS server:

  ```
  /home 203.0.113.156(rw,sync,no_root_squash,no_subtree_check) 203.0.113.157(rw,sync,no_root_squash,no_subtree_check)
  ```

- Create and run a script periodically which retrieves the IPs of the nodes in the
  cluster and then adds them to the NFS server access list (/etc/exports); see the
  sketch after this list.
- Administrators can manually add a vApp network for each Kubernetes cluster in VCD.
- Create an ssh tunnel from each worker node (using ssh local port forwarding), and
  then use `127.0.0.1:<port>` in the Kubernetes declarative specs as the IP of the
  NFS server.
  - In the NFS server, for any given shared directory, add the below line to the
    `/etc/exports` file, and then restart the NFS service:

    ```
    /home localhost(insecure,rw,sync,no_subtree_check)
    ```

    ```sh
    systemctl restart nfs-kernel-server.service
    ```
  - Copy the ssh public key of each worker node into `~/.ssh/authorized_keys` on the
    NFS server:
    - Client: generate a key using `ssh-keygen` and copy the contents of
      `~/.ssh/id_rsa.pub`
    - NFS server: paste the contents (public key) from the client into
      `~/.ssh/authorized_keys`
  - In each control plane/worker node:

    ```sh
    apt-get install portmap
    ssh -fNv -L 3049:127.0.0.1:2049 user@NFSServer
    ```
  - Read more about this approach at
    - http://www.debianadmin.com/howto-use-ssh-local-and-remote-port-forwarding.html
    - https://gist.github.com/proudlygeek/5721498
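As referenced in the first workaround above, a rough sketch of such a periodic script. It assumes the script runs on the NFS server with a kubeconfig for the target cluster; the shared directory, export options, and inventory method are illustrative assumptions:

```sh
#!/usr/bin/env bash
# Hypothetical sketch: refresh /etc/exports with the current cluster node IPs.
# Assumes a kubeconfig for the target cluster is available here; any other
# node-inventory source (e.g. vcd cse cluster info) works the same way.
set -euo pipefail

EXPORT_DIR=/home                                   # shared directory (example)
OPTS='(rw,sync,no_root_squash,no_subtree_check)'   # export options (example)

# Collect the internal IPs of every node in the cluster
ips=$(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')

# Build one export line: /home ip1(opts) ip2(opts) ...
line="$EXPORT_DIR"
for ip in $ips; do
  line="$line ${ip}${OPTS}"
done

# Drop any stale export for this directory, then append the fresh line
sed -i "\|^$EXPORT_DIR |d" /etc/exports
echo "$line" >> /etc/exports

# Re-export without restarting the NFS server
exportfs -ra
```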
## Enterprise PKS Limitations
- When attaching an NSX-T-backed vCenter (such as an Enterprise PKS vCenter) to a
  Microsoft SQL-backed VCD, the vCenter can fail to connect. Refer to this workaround.
- The command `vcd cse node info` on native K8 clusters is broken when Enterprise PKS
  is part of the CSE set-up.
- Once `vcd cse cluster resize` is run on Enterprise PKS based clusters, organization
  administrators' attempts to view and perform CRUD operations on those clusters will
  begin to fail with errors.
- Once `vcd cse cluster resize` is run on Enterprise PKS based clusters, the commands
  `vcd cse cluster info` and `vcd cse cluster list` on those resized clusters will
  begin to display incomplete results.
- Once a given OrgVDC is enabled for Enterprise PKS, renaming that OrgVDC in VCD will
  cause further K8 cluster deployment failures in that OrgVDC.