FAQ

1 What to do if the first installation fails and you need to clean up before reinstalling

Note: This applies only when problems occur during the initial installation and a complete reinstallation is required. The cleanup steps below delete data, so confirm carefully before proceeding.

To remove everything and reinstall Guance from Launcher, you must clean up the following three areas:

1.1 Clean up installed Guance application services

To clean up the Guance application services installed in Kubernetes, enter the Launcher container from your operations machine and run Launcher's built-in cleanup script:

kubectl exec -it launcher-xxxxxxxx-xxx -n launcher -- /bin/bash

Here launcher-xxxxxxxx-xxx is the name of your Launcher pod. Once inside the container, you will see the k8s-clear.sh script (in versions after 1.47.103, it is located in the /config/tools directory). Running this script cleans up all Guance application services and Kubernetes resources.
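
Because the script location depends on the Launcher version, it can help to locate it before running it. The following is a minimal sketch, not part of Launcher: the function name find_cleanup_script is hypothetical, and it assumes the find utility is available inside the container.

```shell
# Hypothetical helper: print the path of the first k8s-clear.sh found under
# a root directory (default: /). Prints nothing if the script is not found.
# Assumes `find` is available in the Launcher container.
find_cleanup_script() {
  search_root="${1:-/}"
  find "$search_root" -type f -name 'k8s-clear.sh' 2>/dev/null | head -n 1
}

# Example (not executed here): look under /config first.
# find_cleanup_script /config
```

Review the printed path, and run the script only after confirming that a full cleanup is really intended.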

1.2 Clean up automatically created databases in MySQL

You can enter the Launcher container, which has a built-in MySQL client tool, and use the following command to connect to the Guance MySQL instance:

mysql -h <MySQL instance host> -u root -P <MySQL port> -p

You need to connect using the MySQL administrator account. After connecting, execute the following eight database and user cleanup commands:

drop database df_core;
drop user df_core;
drop database df_message_desk;
drop user df_message_desk;
drop database df_func;
drop user df_func;
drop database df_dialtesting;
drop user df_dialtesting;
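
If a previous cleanup attempt was interrupted, some of these objects may already be gone and a plain drop will fail. The sketch below generates an idempotent variant of the same statements. It is an illustration only: the function name gen_cleanup_sql is hypothetical, and it assumes MySQL 5.7 or later, where DROP USER accepts IF EXISTS.

```shell
# Hypothetical helper: print idempotent cleanup statements for each name given.
# Assumes MySQL 5.7 or later (DROP USER ... IF EXISTS).
gen_cleanup_sql() {
  for name in "$@"; do
    printf 'DROP DATABASE IF EXISTS %s;\n' "$name"
    printf 'DROP USER IF EXISTS %s;\n' "$name"
  done
}

# Example (not executed here): review the output before piping it to the client.
# gen_cleanup_sql df_core df_message_desk df_func df_dialtesting \
#   | mysql -h <MySQL instance host> -u root -P <MySQL port> -p
```

Printing the statements first keeps the destructive step explicit: nothing is dropped until the output is passed to the mysql client.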

1.3 Clean up automatically created users in InfluxDB

Use the influx client tool to connect to InfluxDB and execute the following two user cleanup commands:

drop user user_wr;
drop user user_ro;
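
The same statements can also be generated and run non-interactively. This is a sketch under assumptions: the function name gen_influx_drop_users is hypothetical, and the commented example assumes the InfluxDB 1.x influx CLI, whose -execute flag runs a single statement and exits.

```shell
# Hypothetical helper: print one InfluxQL DROP USER statement per user name.
gen_influx_drop_users() {
  for user in "$@"; do
    printf 'DROP USER "%s"\n' "$user"
  done
}

# Example (not executed here), assuming the InfluxDB 1.x `influx` CLI:
# influx -host <InfluxDB host> -port <InfluxDB port> -username <admin user> \
#   -password '' -execute 'SHOW USERS'
```

Running SHOW USERS before and after the cleanup is a simple way to confirm that user_wr and user_ro are gone.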

2 Deployment Precautions

2.1 Can I manually modify the Kubernetes resources generated by the installer after deployment?

No. When you upgrade with Launcher after a new version is released, Launcher regenerates the Deployment, Service, and Ingress resources from the configuration provided during installation, so manual changes to them are lost. The exception is ConfigMap: its configuration items can be modified, but arbitrary changes may cause the programs to run abnormally.

3 Updating the Certificate of a Standalone-Container Rancher Server

3.1 What to do if the certificate has not expired

In this case Rancher Server still runs normally. After an upgrade to Rancher v2.0.14+, v2.1.9+, or v2.2.2+, Rancher automatically checks the certificate validity period and generates a new certificate if the current one is about to expire. For a Rancher Server running as a standalone container, you therefore only need to upgrade to a version that supports automatic SSL certificate renewal before the certificate expires; no additional action is required.

3.2 What to do if the certificate has expired

In this case Rancher Server cannot run normally, and certificate errors may persist even after an upgrade to Rancher v2.0.14+, v2.1.9+, or v2.2.2+. If this happens, follow these steps:

Upgrade Rancher to v2.0.14+, v2.1.9+, or v2.2.2+.
Execute the commands that match your Rancher version:
For version 2.0 or 2.1

docker exec -ti <rancher_server_id> mv /var/lib/rancher/management-state/certs/bundle.json /var/lib/rancher/management-state/certs/bundle.json-bak

For version 2.2+

docker exec -ti <rancher_server_id> mv /var/lib/rancher/management-state/tls/localhost.crt /var/lib/rancher/management-state/tls/localhost.crt-bak

For version 2.3+

docker exec -ti <rancher_server_id> mv /var/lib/rancher/k3s/server/tls /var/lib/rancher/k3s/server/tlsbak

# Run the restart twice: the first run requests the certificates, the second loads them and starts the server
docker restart <rancher_server_id>

For version 2.4+

a. Exec into the Rancher Server container and run the following commands

kubectl --insecure-skip-tls-verify -n kube-system delete secrets k3s-serving
kubectl --insecure-skip-tls-verify delete secret serving-cert -n cattle-system
rm -f /var/lib/rancher/k3s/server/tls/dynamic-cert.json

b. Restart Rancher Server

docker restart <rancher_server_id>

c. Execute the following command to refresh the parameters, replacing server-url with your Rancher Server address

curl --insecure -sfL https://server-url/v3

Finally, restart the Rancher Server container

docker restart <rancher_server_id>
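
The version-specific rename steps above can be summarized as a lookup. The sketch below is illustrative only: the function name stale_cert_path is hypothetical, and it assumes that "2.2+" and "2.3+" in the steps refer to the 2.2.x and 2.3.x release lines respectively. For 2.4 and later it prints nothing, because those versions use the secret-deletion procedure instead of a rename.

```shell
# Hypothetical helper: print the path that the steps above rename for a given
# Rancher version. Returns non-zero for versions without a rename step (2.4+).
# Assumption: "2.2+" means 2.2.x and "2.3+" means 2.3.x.
stale_cert_path() {
  version="${1#v}"                                          # drop an optional leading "v"
  release_line="$(printf '%s' "$version" | cut -d. -f1,2)"  # keep major.minor only
  case "$release_line" in
    2.0|2.1) echo "/var/lib/rancher/management-state/certs/bundle.json" ;;
    2.2)     echo "/var/lib/rancher/management-state/tls/localhost.crt" ;;
    2.3)     echo "/var/lib/rancher/k3s/server/tls" ;;
    *)       return 1 ;;
  esac
}

# Example (not executed here):
# stale_cert_path v2.2.4
```

The helper only reports which path is affected; the actual rename should still be done with the docker exec commands shown for each version.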

4 Handling K8s Cluster Management Issues Caused by Rancher Server Certificate Expiration

If the cluster certificate has already expired, it cannot be rotated even after upgrading to Rancher v2.0.14, v2.1.9, or later. Rancher relies on Agents to update certificates, and once the certificate has expired, Rancher can no longer communicate with the Agents.

4.1 Solution

Manually set the node's clock back to a time before the certificate expired. Because Agents communicate only with the K8s master and Rancher Server, if the Rancher Server certificate has not expired, you only need to adjust the time on the K8s master nodes. Adjustment commands:

# Disable NTP synchronization to prevent automatic time updates
timedatectl set-ntp false
# Modify node time
timedatectl set-time '2019-01-01 00:00:00'

Then upgrade Rancher Server, wait for the certificate rotation to complete, and re-enable time synchronization:

timedatectl set-ntp true

Check the certificate validity period:

openssl x509 -in /etc/kubernetes/ssl/kube-apiserver.pem -noout -dates
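
To catch an upcoming expiration before it causes the problem described in this section, the validity check can be turned into a yes/no test. The following is a minimal sketch: the function name cert_expiry_status and the 30-day default are assumptions, and it relies on the -checkend option of openssl x509, which tests whether a certificate expires within a given number of seconds.

```shell
# Hypothetical helper: report whether a PEM certificate expires within N days
# (default: 30). Prints "ok" or "expiring". A file that cannot be read or
# parsed is also reported as "expiring".
cert_expiry_status() {
  cert_file="$1"
  days="${2:-30}"
  seconds=$((days * 86400))
  if openssl x509 -in "$cert_file" -noout -checkend "$seconds" >/dev/null 2>&1; then
    echo "ok"
  else
    echo "expiring"
  fi
}

# Example (not executed here):
# cert_expiry_status /etc/kubernetes/ssl/kube-apiserver.pem 30
```

A check like this could be run periodically on the master nodes so that Rancher can be upgraded while the certificate is still valid.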

5 Why DataWay Is Not Visible in the Frontend After Creation

5.1 Common Cause Analysis

The DataWay service did not start normally after being deployed on the server.
The DataWay configuration file is incorrect: the listening address or the workspace token is not configured correctly.
The DataWay runtime configuration is incorrect. Check the DataWay logs to troubleshoot.
The server where DataWay is deployed cannot communicate with the kodo service (for example, because the correct df-kodo resolution entry has not been added to the hosts file).
The kodo service is abnormal. Check the kodo service logs to confirm.
The df-kodo ingress is not configured correctly, so http|https://df-kodo.<xxxx>:<port> cannot be accessed.
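
The hosts-resolution cause in the list above is quick to rule out on the DataWay server. The sketch below is an illustration under assumptions: the function name has_hosts_entry is hypothetical, and it only checks for a non-comment line whose host names start with the given prefix.

```shell
# Hypothetical helper: succeed if a hosts file contains a non-comment line
# with a host name that starts with the given prefix (default: df-kodo).
has_hosts_entry() {
  hosts_file="${1:-/etc/hosts}"
  name_prefix="${2:-df-kodo}"
  awk -v prefix="$name_prefix" '
    !/^[[:space:]]*#/ {
      # Field 1 is the IP address; the remaining fields are host names.
      for (i = 2; i <= NF; i++) if (index($i, prefix) == 1) found = 1
    }
    END { exit !found }
  ' "$hosts_file" 2>/dev/null
}

# Example (not executed here):
# has_hosts_entry /etc/hosts df-kodo || echo "no df-kodo entry in /etc/hosts"
```

If the entry exists, a request such as curl -sS -o /dev/null -w '%{http_code}' http://df-kodo.<xxxx>:<port> from the same server shows whether the ingress endpoint is actually reachable.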

6 Why Dial Testing Services Cannot Be Used

6.1 Cause Analysis

Guance is deployed in an offline environment, and the physical nodes cannot access the internet (a common scenario).
The network of a self-built dial testing node is abnormal.
The network of a regional provider is abnormal.
The dial testing task was created incorrectly.

7 Common Deployment Issues and Solutions

7.1 `describe pods` reports an `unbound immediate PersistentVolumeClaims` error

Check the PVCs in all namespaces (for example, with kubectl get pvc -A):

NAMESPACE    NAME                                     STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
default      opensearch-single-opensearch-single-0    Bound     pvc-0da2cb6f-1cb9-4630-b0ab-512ce57743a8   16Gi       RWO            openebs-hostpath   19d
launcher     persistent-data                          Pending                                                                        df-nfs-storage     6m3s
middleware   data-es-cluster-0                        Bound     pvc-36e48f5a-37b3-4c28-ad14-059265ee3009   50Gi       RWO            openebs-hostpath   18d

Notice that the status of persistent-data is Pending.
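
On a cluster with many claims, the Pending ones can be filtered out of the listing instead of being spotted by eye. This is a minimal sketch: the function name list_pending_pvcs is hypothetical, and it assumes the listing above has the column layout produced by kubectl get pvc -A, with STATUS as the third column.

```shell
# Hypothetical helper: read a PVC listing on stdin (NAMESPACE, NAME, STATUS, ...)
# and print namespace/name for every claim whose STATUS is Pending.
list_pending_pvcs() {
  awk 'NR > 1 && $3 == "Pending" { print $1 "/" $2 }'
}

# Example (not executed here):
# kubectl get pvc -A | list_pending_pvcs
```

For the listing shown above, this would print launcher/persistent-data.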

Check the status of the nfs-subdir-external-provisioner container:

kubectl get pods -n kube-system | grep nfs-subdir-external-provisioner
nfs-provisioner-nfs-subdir-external-provisioner-58b7cdf6f5dr5vr   0/1     ContainerCreating   0             7h7m

Check the details of nfs-provisioner-nfs-subdir-external-provisioner-58b7cdf6f5dr5vr:

kubectl describe -n kube-system pods nfs-provisioner-nfs-subdir-external-provisioner-58b7cdf6f5dr5vr
....
  Type     Reason       Age                     From     Message
  ----     ------       ----                    ----     -------
  Warning  FailedMount  30m (x49 over 6h53m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[nfs-subdir-external-provisioner-root], unattached volumes=[kube-api-access-5p4qn nfs-subdir-external-provisioner-root]: timed out waiting for the condition
  Warning  FailedMount  5m27s (x136 over 7h4m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[nfs-subdir-external-provisioner-root], unattached volumes=[nfs-subdir-external-provisioner-root kube-api-access-5p4qn]: timed out waiting for the condition
  Warning  FailedMount  74s (x217 over 7h6m)    kubelet  MountVolume.SetUp failed for volume "nfs-subdir-external-provisioner-root" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs 10.200.14.112:/nfsdata /var/lib/kubelet/pods/3970ff5f-5dbf-419e-a6af-3080508d2524/volumes/kubernetes.io~nfs/nfs-subdir-external-provisioner-root
Output: mount: wrong fs type, bad option, bad superblock on 10.200.14.112:/nfsdata,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

The wrong fs type, bad option error occurs because nfs-utils is not installed on the host.

Install nfs-utils

Execute the following command on the host:

yum install nfs-utils
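
Before and after installing the package, you can confirm whether the mount helper named in the error message is present. The following is a sketch under assumptions: the function name has_nfs_mount_helper is hypothetical, and /sbin is used as the default location because that is the path the mount error refers to.

```shell
# Hypothetical helper: succeed if an executable mount.nfs helper exists in the
# given directory (default: /sbin, the location named in the mount error).
has_nfs_mount_helper() {
  helper_dir="${1:-/sbin}"
  [ -x "$helper_dir/mount.nfs" ]
}

# Example (not executed here): run on every node that may schedule the pod.
# has_nfs_mount_helper || echo "mount.nfs is missing; install nfs-utils"
```

On Debian or Ubuntu hosts, which do not use yum, the corresponding package is nfs-common rather than nfs-utils.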