Known Issues
General Issues
Unable to create, update, or delete cluster after upgrade to CSE 3.1.1 or 3.1.2 from CSE 3.1.0
The native cluster entity type in CSE 3.1.Z has a `hooks` section that allows for cluster creation, update, and deletion. When upgrading from CSE 3.1.0 to CSE 3.1.1 or 3.1.2, the `hooks` section of the native cluster entity type is incorrectly overwritten to `null`, leading to this issue.
Resolution
This issue does not exist when upgrading from CSE 3.1.0 to CSE 3.1.3 or 3.1.4.
Workaround for CSE 3.1.1 and 3.1.2
- Make a GET request on `https://<vcd-fqdn>/cloudapi/1.0.0/entityTypes/urn:vcloud:type:cse:nativeCluster:2.0.0`, using API version 36.0 in the `Accept` header.
- Copy the response body from (1) to form a new request body. The `hooks` section in the response body from (1) should be `null`.
- Change the `hooks` section of the request body from (2) to: `"hooks": { "PreDelete": "urn:vcloud:behavior-interface:deleteCluster:cse:k8s:1.0.0", "PostCreate": "urn:vcloud:behavior-interface:createCluster:cse:k8s:1.0.0", "PostUpdate": "urn:vcloud:behavior-interface:updateCluster:cse:k8s:1.0.0" }`
- Make a PUT request on `https://<vcd-fqdn>/cloudapi/1.0.0/entityTypes/urn:vcloud:type:cse:nativeCluster:2.0.0` using the request body from (3).
- Repeat step (1) and ensure that the `hooks` section in the response body matches the `hooks` section in (3). (A curl sketch of these steps follows this list.)
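For reference, a minimal curl sketch of the workaround above. The VCD FQDN, bearer token, and saved file name are placeholders, and the requests must be made by a user authorized to update entity types.

```bash
# Placeholders: adjust VCD_HOST and VCD_TOKEN to your environment.
VCD_HOST="vcd.example.com"
VCD_TOKEN="<bearer-token>"
URL="https://${VCD_HOST}/cloudapi/1.0.0/entityTypes/urn:vcloud:type:cse:nativeCluster:2.0.0"

# Step 1: GET the entity type at API version 36.0 and save the response body.
curl -sk -H "Accept: application/json;version=36.0" \
     -H "Authorization: Bearer ${VCD_TOKEN}" \
     "${URL}" > entity_type.json

# Steps 2-3: edit entity_type.json so that the "hooks" section contains the
# PreDelete/PostCreate/PostUpdate behavior interfaces listed above.

# Step 4: PUT the updated body back.
curl -sk -X PUT \
     -H "Accept: application/json;version=36.0" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer ${VCD_TOKEN}" \
     --data-binary @entity_type.json "${URL}"

# Step 5: repeat the GET and confirm that "hooks" is no longer null.
curl -sk -H "Accept: application/json;version=36.0" \
     -H "Authorization: Bearer ${VCD_TOKEN}" "${URL}"
```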
Resizing pre-existing TKG clusters after upgrade to CSE 3.1.3 fails with “kubeconfig not found in control plane extra configuration” in server logs
In CSE 3.1.3, the control plane node writes the kubeconfig to the extra config so that worker nodes can install core packages. During cluster resize, when the pre-existing cluster’s worker nodes look for the kubeconfig, the control plane’s extra config does not have it because the cluster was created prior to 3.1.3.
Resolution
This issue is fixed in CSE 3.1.4: pre-existing clusters no longer attempt to retrieve the kubeconfig during cluster resize.
Workaround for CSE 3.1.3
- Log in to the control plane VM.
- Add a placeholder VM extra config element by executing `vmtoolsd --cmd "info-set guestinfo.kubeconfig $(echo VMware | base64)"`. This step allows the worker nodes to retrieve a placeholder kubeconfig, even though this kubeconfig won't be used.
- Verify that the extra config element has been set properly with `vmtoolsd --cmd "info-get guestinfo.kubeconfig"` (see the sketch after this list).
- Reattempt the resize operation.
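A sketch of steps 2 and 3 as run on the control plane VM; the verification command should print back the base64-encoded placeholder string.

```bash
# Run on the control plane VM of the pre-existing cluster.
# Set a placeholder kubeconfig in the VM extra config.
vmtoolsd --cmd "info-set guestinfo.kubeconfig $(echo VMware | base64)"

# Verify the element was set; expected output: Vk13YXJlCg== (base64 of "VMware\n").
vmtoolsd --cmd "info-get guestinfo.kubeconfig"
```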
TKG Cluster creation fails with “ACCESS_TO_RESOURCE_IS_FORBIDDEN due to lack of [VAPP_VIEW] right” even though this right is not missing
This issue is due to a security context being wiped out.
Resolution
This issue is fixed in CSE 3.1.4.
Native cluster creation fails for Ubuntu 20 templates
This failure is due to a race condition that can occur in faster customer infrastructure environments.
Resolution
This issue is fixed in CSE 3.1.4.
TKG cluster creation intermittently fails due to a VM reboot
This occurs due to a cloud-init script execution error when a VM is rebooted; the issue may surface as a VM post-customization timeout.
Resolution
This issue is fixed in CSE 3.1.4.
TKG cluster resize fails after 1 day of cluster creation
This issue in resizing the TKG cluster occurs because the token to join a cluster has expired. Please note that the issue may also be encountered when trying to resize a TKG cluster that was upgraded to CSE 3.1.3 or 3.1.4.
Resolution
This issue is fixed in CSE 3.1.3 for newly created clusters.
Workaround
For clusters created prior to CSE 3.1.3, the workaround is to create a new token and update the RDE:
- Run the following on the control plane node: `kubeadm token create --print-join-command --ttl 0`
- In Postman, GET the entity at `https://<vcd-fqdn>/cloudapi/1.0.0/entities/<entity-id>`. The entity ID can be retrieved from the cluster info page or via `vcd cse cluster info`.
- Copy the response body from (2) and replace the `kubeadm join ...` command with the output of (1) to form the request body.
- Do a PUT on the same URL as in (2) with `Content-Type: application/json` and the request body formed in (3). (A curl sketch of these steps follows this list.)
- The resize operation can then be performed. If the resize fails, the operation may be triggered again after (4), because the RDE update triggers the behavior.
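A minimal curl sketch of this workaround; the VCD FQDN, bearer token, and entity ID are placeholders.

```bash
# On the control plane node: print a join command with a non-expiring token.
kubeadm token create --print-join-command --ttl 0

# From a workstation with API access (placeholders below):
VCD_HOST="vcd.example.com"
VCD_TOKEN="<bearer-token>"
ENTITY_ID="urn:vcloud:entity:cse:nativeCluster:<uuid>"

# Fetch the cluster RDE.
curl -sk -H "Accept: application/json;version=36.0" \
     -H "Authorization: Bearer ${VCD_TOKEN}" \
     "https://${VCD_HOST}/cloudapi/1.0.0/entities/${ENTITY_ID}" > cluster_rde.json

# Edit cluster_rde.json: replace the stored "kubeadm join ..." command with the
# join command printed on the control plane node, then PUT the body back.
curl -sk -X PUT \
     -H "Accept: application/json;version=36.0" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer ${VCD_TOKEN}" \
     --data-binary @cluster_rde.json \
     "https://${VCD_HOST}/cloudapi/1.0.0/entities/${ENTITY_ID}"
```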
CSE Upgrade from 3.1.3 to 3.1.3 fails
Upgrading from CSE 3.1.3 to 3.1.3 is needed when `cse upgrade` or `cse install` fails; in that case, one would need to run `cse upgrade` again for CSE to be able to run, but this upgrade fails. The workaround for this upgrade failure is to delete the CSE extension (instructions here) and then run `cse install` again, as sketched below.
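For illustration, assuming the CSE extension has already been removed per the linked instructions, the reinstall step is a plain `cse install`; the config file path below is a placeholder and whether a decryption flag is needed depends on your setup.

```bash
# Reinstall CSE against the existing configuration file.
# Add --skip-config-decryption if the config file is not encrypted.
cse install -c config.yaml
```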
No kapp-controller or metrics-server version is installed or listed in the UI/CLI on TKG clusters using TKG ova 1.3.X
The compatible kapp-controller and metrics-server versions are listed in an OVA's TKR BOM file. For TKG OVA 1.3.Z, these versions are not located in the same sections of the TKR BOM file as they are for TKG OVAs >= 1.4.0.
Output of `vcd cse cluster info` for TKG clusters has the kubeconfig of the cluster embedded in it, while output for native clusters does not
Although both native and TKG clusters use RDE 2.0.0 for representation in VCD, they differ quite a bit in their structure. Including the kubeconfig content in the output of `vcd cse cluster info` for TKG clusters, and not for native clusters, is by design.
In CSE 3.1.1, `vcd-cli` prints the error `Error: 'NoneType' object is not subscriptable` to the console on invoking CSE commands
This error is observed when CSE tries to restore a previously expired session and/or the CSE server is down or unreachable.
Workaround:
Please log out and log back in to `vcd-cli` before executing further CSE-related commands, as sketched below.
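A sketch of the re-login, with placeholder host, organization, and user names:

```bash
# Clear the stale vcd-cli session, then log in again before retrying CSE commands.
vcd logout
vcd login vcd.example.com myorg myuser -p '<password>'
```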
In CSE 3.1, pre-existing templates will not work after upgrading to CSE 3.1 (legacy_mode=true)
After upgrading to CSE 3.1 running in legacy_mode, existing templates will not work unless their corresponding script files are moved to the right location. CSE 3.0.x keeps the template script files under the folder `~/.cse_scripts`, while CSE 3.1.0 keeps them under `~/.cse_scripts/<template cookbook version>`.
Workaround(s):
- Please create a folder named `~/.cse_scripts/1.0.0` and move all contents of `~/.cse_scripts` into it (see the sketch after this list), or
- recreate the templates.
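A sketch of the first workaround; it assumes `~/.cse_scripts` already exists and uses a temporary folder to avoid moving the new directory into itself.

```bash
# Move existing CSE 3.0.x template scripts into the versioned folder
# expected by CSE 3.1.0 in legacy_mode.
mkdir ~/.cse_scripts_tmp
mv ~/.cse_scripts/* ~/.cse_scripts_tmp/
mkdir -p ~/.cse_scripts/1.0.0
mv ~/.cse_scripts_tmp/* ~/.cse_scripts/1.0.0/
rmdir ~/.cse_scripts_tmp
```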
In CSE 3.1, deleting the cluster in an error state may fail from CLI/UI
A delete operation on a cluster that is in an error state (`RDE.state = RESOLUTION_ERROR` or `status.phase = <Operation>:FAILED`) may fail with Bad Request (400).
Workaround:
VCD 10.3:
Log in as the user who installed CSE (the user provided in the CSE configuration file during `cse install`).
- RDE resolution: perform `POST https://<vcd-fqdn>/cloudapi/1.0.0/entities/{cluster-id}/resolve`
- RDE deletion: perform `DELETE https://<vcd-fqdn>/cloudapi/1.0.0/entities/{cluster-id}?invokeHooks=false`
- vApp deletion: delete the corresponding vApp from the UI or via an API call.
  - API call: perform `GET https://<vcd-fqdn>/cloudapi/1.0.0/entities/{cluster-id}` to retrieve the vApp ID, which is the same as the `externalID` property in the corresponding RDE, then invoke the delete vApp API.
  - UI: identify the vApp with the same name as the cluster in the same Organization virtual datacenter and delete it.
Update: For VCD 10.3, please use `vcd cse cluster delete --force` to delete clusters that can't be deleted. Learn more here.
VCD 10.2:
Log in as System administrator or as a user with the ADMIN_FC right on the `cse:nativeCluster` entitlement.
- RDE resolution: perform `POST https://<vcd-fqdn>/cloudapi/1.0.0/entities/{cluster-id}/resolve` (a curl sketch of the RDE steps follows this list)
- RDE deletion: perform `DELETE https://<vcd-fqdn>/cloudapi/1.0.0/entities/{id}`
- vApp deletion: delete the corresponding vApp from the UI or via an API call.
  - API call: perform `GET https://<vcd-fqdn>/cloudapi/1.0.0/entities/<cluster-id>` to retrieve the vApp ID, which is the same as the `externalID` property in the corresponding RDE, then invoke the delete vApp API.
  - UI: identify the vApp with the same name as the cluster in the same Organization virtual datacenter and delete it.
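A minimal curl sketch of the RDE resolution and deletion steps above; it applies to either VCD version (for VCD 10.3, append `?invokeHooks=false` to the DELETE as described earlier). The VCD FQDN, bearer token, and cluster ID are placeholders.

```bash
VCD_HOST="vcd.example.com"                                 # placeholder
VCD_TOKEN="<bearer-token>"                                 # placeholder
CLUSTER_ID="urn:vcloud:entity:cse:nativeCluster:<uuid>"    # placeholder

# Resolve the RDE.
curl -sk -X POST \
     -H "Accept: application/json;version=36.0" \
     -H "Authorization: Bearer ${VCD_TOKEN}" \
     "https://${VCD_HOST}/cloudapi/1.0.0/entities/${CLUSTER_ID}/resolve"

# Delete the RDE (for VCD 10.3, append ?invokeHooks=false to the URL).
curl -sk -X DELETE \
     -H "Accept: application/json;version=36.0" \
     -H "Authorization: Bearer ${VCD_TOKEN}" \
     "https://${VCD_HOST}/cloudapi/1.0.0/entities/${CLUSTER_ID}"

# If deleting the vApp via API, first GET the entity to read its externalID
# (the vApp ID), then invoke the delete vApp API against that vApp.
```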
In CSE 3.1, pending tasks are visible in the VCD UI right after `cse upgrade`
After upgrading to CSE 3.1 using the `cse upgrade` command, you may notice pending tasks on RDE-based Kubernetes clusters. This is merely a cosmetic issue, and it should not have any negative impact on the functionality. The pending tasks should disappear after a 24-hour timeout.
CSE 3.1 silently ignores the `api_version` property in the `config.yaml`
CSE 3.1 need not be started with a particular VCD API version. It is now capable of accepting incoming requests at any supported VCD API version. Refer to the changes in the configuration file.
CSE 3.1 upgrade may fail to update the clusters owned by System users correctly.
During `cse upgrade`, the RDE representation of the existing clusters is transformed to become forward compatible. The newly created RDEs are supposed to be owned by the corresponding original cluster owners in the process. However, the ownership assignment may fail if the original owners are from the System org. This is a bug in VCD.
Workaround:
Edit the RDE by updating the `owner.name` and `owner.id` in the payload of `PUT https://<vcd-fqdn>/cloudapi/1.0.0/entities/id?invokeHooks=false`, as sketched below.
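A compact curl sketch of this workaround, analogous to the earlier RDE edits; the VCD FQDN, bearer token, entity ID, and owner values are placeholders.

```bash
VCD_HOST="vcd.example.com"                                 # placeholder
VCD_TOKEN="<bearer-token>"                                 # placeholder
CLUSTER_ID="urn:vcloud:entity:cse:nativeCluster:<uuid>"    # placeholder

# Fetch the current RDE payload.
curl -sk -H "Accept: application/json;version=36.0" \
     -H "Authorization: Bearer ${VCD_TOKEN}" \
     "https://${VCD_HOST}/cloudapi/1.0.0/entities/${CLUSTER_ID}" > rde.json

# Edit rde.json: set owner.name and owner.id to the intended cluster owner.

# PUT the corrected payload back without invoking behavior hooks.
curl -sk -X PUT \
     -H "Accept: application/json;version=36.0" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer ${VCD_TOKEN}" \
     --data-binary @rde.json \
     "https://${VCD_HOST}/cloudapi/1.0.0/entities/${CLUSTER_ID}?invokeHooks=false"
```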
Unable to change the default storage profile for Native cluster deployments
The default storage profile for native cluster deployments can't be changed in CSE, unless specified via CLI.
VCD follows a particular order of precedence to pick the storage profile for any VM instantiation:
- User-specified storage-profile
- Storage-profile with which the template is created (if VM is being instantiated from a template)
- Organization virtual datacenter default storage-profile
Workaround:
- Disable the storage-profile with which the template is created on the ovdc.
- Set the desired storage-profile as default on the ovdc.
Failures during template creation or installation
- One of the template creation scripts may have exited with an error
- One of the scripts may be hung waiting for a response
- If the VM has no Internet access, scripts may fail
- Check CSE logs for script outputs to determine the cause of the observed failure
CSE service fails to start
- Workaround: rebooting the VM starts the service
Cluster creation fails when VCD external network has a DNS suffix and the DNS server resolves `localhost.my.suffix` to a valid IP
This is due to a bug in etcd (More detail HERE, with the kubeadm config file contents necessary for the workaround specified in this comment).
The main issue is that etcd prioritizes the DNS server (if it exists) over the `/etc/hosts` file to resolve hostnames, when the conventional behavior would be to prioritize checking any hosts files before going to the DNS server. This becomes problematic when kubeadm attempts to initialize the control plane node using `localhost`. etcd checks the DNS server for any entry like `localhost.suffix`, and if this actually resolves to an IP, attempts to do some operations involving that incorrect IP, instead of `localhost`.
The workaround (More detail HERE) is to create a kubeadm config file (there is no way to specify the `listen-peer-urls` argument on the command line), and to modify the `kubeadm init` command in the CSE control plane script for the template of the cluster you are attempting to deploy. The CSE control plane script is located at `~/.cse-scripts/<template name>_rev<template_revision>/scripts/mstr.sh`.
Change the command from
`kubeadm init --kubernetes-version=v1.13.5 > /root/kubeadm-init.out`
to
`kubeadm init --config /path/to/kubeadm.yaml > /root/kubeadm-init.out`
The Kubernetes version has to be specified within the configuration file itself, since `--kubernetes-version` and `--config` are incompatible.
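For illustration only, a sketch of what the modified section of `mstr.sh` might look like, assuming kubeadm's v1beta1 config API (which matches Kubernetes v1.13.x). The config file path is arbitrary, and the `listen-peer-urls` value is a placeholder that must be taken from the etcd issue comment linked above.

```bash
# Write a kubeadm config file; the listen-peer-urls value below is a
# placeholder and must come from the linked etcd issue comment.
cat > /root/kubeadm.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: v1.13.5
etcd:
  local:
    extraArgs:
      listen-peer-urls: "<value-from-linked-comment>"
EOF

# Initialize the control plane using the config file instead of
# --kubernetes-version (the two flags are incompatible).
kubeadm init --config /root/kubeadm.yaml > /root/kubeadm-init.out
```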
Task for create cluster goes on forever even when cluster create has failed in CSE server
CSE server versions 3.1.1, 3.1.2, 3.1.3, and 3.1.4 are impacted. This issue is observed if the user deploying the TKGm Kubernetes cluster is missing the “Manage user’s own API token” right in their role. The cluster create task will not be marked as “failed” even though the CSE server logs indicate that the cluster creation has failed. Adding the missing “Manage user’s own API token” right to the user’s role and reattempting the cluster creation operation should fix the issue.