wonderful-rain-13345 (03/24/2023, 12:01 AM)
I feel like I'm missing something: I can't reliably bring up a cluster.
Using k3s on Ubuntu, nothing special

hundreds-evening-84071 (03/24/2023, 12:09 AM)
Is the 1st node coming up? If you have not already, look at this:
journalctl -u k3s.service
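For reference, the service-log check suggested above looks like this (a sketch, assuming a standard systemd-based k3s install):

```shell
# Show the most recent k3s server logs (systemd install assumed)
journalctl -u k3s.service --no-pager -n 200

# On agent-only (worker) nodes the unit is named k3s-agent instead
journalctl -u k3s-agent.service --no-pager -n 200
```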

wonderful-rain-13345 (03/24/2023, 12:10 AM)
👀
rancher says the cluster is waiting for: "non-ready bootstrap machine(s) production-master-86-5fc57bd6c-c88m2 and join url to be available on bootstrap node"
the cluster is "explorable" in rancher

creamy-pencil-82913 (03/24/2023, 12:12 AM)
check the logs on the node?

wonderful-rain-13345 (03/24/2023, 12:12 AM)
yeah nothing of note

creamy-pencil-82913 (03/24/2023, 12:12 AM)
are all of the pods running?
are there any errors in the cluster agent pod?

wonderful-rain-13345 (03/24/2023, 12:13 AM)
only 6 pods?

creamy-pencil-82913 (03/24/2023, 12:14 AM)
you’re gonna have to give me a little more to work with

wonderful-rain-13345 (03/24/2023, 12:14 AM)
no i know
:))
cattle-cluster-agent-644ddf96b9-nj9vv_cluster-register.log
this is the service log
i'm going to upgrade rancher too

creamy-pencil-82913 (03/24/2023, 12:16 AM)
is that the cattle-cluster-agent log?
it looks like it's still starting up

wonderful-rain-13345 (03/24/2023, 12:17 AM)
the first one is the pod

creamy-pencil-82913 (03/24/2023, 12:17 AM)
is the pod failing and being restarted? you might check the --previous logs
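A sketch of that check, using the cluster-agent pod name pasted later in this thread:

```shell
# Current logs from the cluster agent deployment
kubectl -n cattle-system logs deploy/cattle-cluster-agent

# If the pod had restarted, the previous container's logs often hold the error
kubectl -n cattle-system logs cattle-cluster-agent-644ddf96b9-nj9vv --previous
```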

wonderful-rain-13345 (03/24/2023, 12:17 AM)
no restarts
i'll let it sit for a bit
i'm using kine with postgres
i'll let it simmer for a bit
Thanks for checking, appreciated

creamy-pencil-82913 (03/24/2023, 12:22 AM)
ohhh hmm, is this an imported or provisioned cluster?

wonderful-rain-13345 (03/24/2023, 12:22 AM)
nope
sorry i was unclear

creamy-pencil-82913 (03/24/2023, 12:22 AM)
Did you provision it via rancher, or just import it?

wonderful-rain-13345 (03/24/2023, 12:23 AM)
I have an old k3os cluster i started that is running rancher (via helm), running rancher v2.7.1. I used that to deploy a basically plain ubuntu image on vmware via a template. using v1.24.10+k3s1

creamy-pencil-82913 (03/24/2023, 12:25 AM)
ok. so the cluster running rancher is on kine with postgres, and the provisioned cluster (the one that is stuck waiting) is a single-node cluster with etcd
is that correct?

wonderful-rain-13345 (03/24/2023, 12:25 AM)
yes, it's got etcd, worker, controller roles.
ran with
INSTALL_K3S_EXEC=--disable-cloud-controller

creamy-pencil-82913 (03/24/2023, 12:26 AM)
ah well that might do it
wait you ran with that on the downstream cluster?

wonderful-rain-13345 (03/24/2023, 12:27 AM)
in Cluster Config, Agent Env Vars

creamy-pencil-82913 (03/24/2023, 12:27 AM)
why though
if you do
kubectl get node -o wide
on the downstream cluster, is the node NotReady?

wonderful-rain-13345 (03/24/2023, 12:27 AM)
How do you install vsphere cpi/csi on k3s?

creamy-pencil-82913 (03/24/2023, 12:28 AM)
Manually, since we don’t include any packaged cloud provider charts or the in-tree cloud providers

wonderful-rain-13345 (03/24/2023, 12:28 AM)
node is ready

creamy-pencil-82913 (03/24/2023, 12:28 AM)
I think that your disable flag didn’t take
which is probably good

wonderful-rain-13345 (03/24/2023, 12:28 AM)
yeah, my understanding was that i had to disable the CC so the CPI can be installed
because of conflicts
a conflicting port iirc

creamy-pencil-82913 (03/24/2023, 12:29 AM)
hmm so it’s showing as ready, and all the pods are ready, but the UI still shows it as waiting?

wonderful-rain-13345 (03/24/2023, 12:29 AM)
yep

creamy-pencil-82913 (03/24/2023, 12:29 AM)
can you show the output of
kubectl get pod -A -o wide
and
kubectl get node -o wide

wonderful-rain-13345 (03/24/2023, 12:30 AM)
packerbuilt@production-master-86-649b8f71-5nqln:~$ kubectl get pod -A -o wide
NAMESPACE       NAME                                    READY   STATUS      RESTARTS   AGE   IP          NODE                                  NOMINATED NODE   READINESS GATES
kube-system     coredns-7b5bbc6644-qqpp8                1/1     Running     0          24m   10.42.0.4   production-master-86-649b8f71-5nqln   <none>           <none>
kube-system     metrics-server-667586758d-7gl5r         1/1     Running     0          24m   10.42.0.5   production-master-86-649b8f71-5nqln   <none>           <none>
cattle-system   cattle-cluster-agent-644ddf96b9-nj9vv   1/1     Running     0          24m   10.42.0.6   production-master-86-649b8f71-5nqln   <none>           <none>
kube-system     helm-install-traefik-crd-snd26          0/1     Completed   0          24m   10.42.0.3   production-master-86-649b8f71-5nqln   <none>           <none>
kube-system     helm-install-traefik-w5w5t              0/1     Completed   1          24m   10.42.0.2   production-master-86-649b8f71-5nqln   <none>           <none>
kube-system     traefik-64b96ccbcd-j5qdd                1/1     Running     0          23m   10.42.0.7   production-master-86-649b8f71-5nqln   <none>           <none>
packerbuilt@production-master-86-649b8f71-5nqln:~$ kubectl get node -o wide
NAME                                  STATUS   ROLES                  AGE   VERSION         INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
production-master-86-649b8f71-5nqln   Ready    control-plane,master   25m   v1.24.10+k3s1   172.16.1.252   <none>        Ubuntu 22.04.2 LTS   5.15.0-60-generic   <containerd://1.6.15-k3s1>

creamy-pencil-82913 (03/24/2023, 12:31 AM)
You’re missing the etcd role. Did you tweak anything else in your agent env vars?
Rancher only knows how to manage clusters that use embedded etcd. If you do something to tweak the K3s config so that it doesn’t use etcd, it will be very confused.
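One way to confirm the missing role from the downstream node (the role label shown is the standard k3s one; on the kine-backed node above this grep would come back empty):

```shell
# Nodes running embedded etcd carry an etcd role label
kubectl get node --show-labels | grep 'node-role.kubernetes.io/etcd=true'
```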

wonderful-rain-13345 (03/24/2023, 12:32 AM)
production-master-86-5fc57bd6c-c88m2.yaml

creamy-pencil-82913 (03/24/2023, 12:32 AM)
that doesn’t have the info I’m looking for, I think it’s in the cluster object

wonderful-rain-13345 (03/24/2023, 12:33 AM)
i unchecked klipper lb
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: production
  annotations:
    field.cattle.io/creatorId: user-ks4kc
  creationTimestamp: '2023-03-23T05:12:52Z'
  finalizers:
    - wrangler.cattle.io/provisioning-cluster-remove
    - wrangler.cattle.io/rke-cluster-remove
    - wrangler.cattle.io/cloud-config-secret-remover
  generation: 8
  labels:
    {}
  namespace: fleet-default
  resourceVersion: '28260918'
  uid: 8213564a-d541-4aea-9966-4c829c5f5e44
  fields:
    - production
    - 'true'
    - production-kubeconfig
spec:
  agentEnvVars:
    - name: K3S_DATASTORE_ENDPOINT
      value: postgres://k3s:k3s@172.16.1.60:32768/k3s_production?sslmode=disable
    - name: INSTALL_K3S_EXEC
      value: '--disable-cloud-controller'
    - name: K3S_KUBECONFIG_MODE
      value: '644'
  cloudCredentialSecretName: cattle-global-data:cc-ps8zn
  defaultPodSecurityPolicyTemplateName: ''
  kubernetesVersion: v1.24.10+k3s1
  localClusterAuthEndpoint:
    caCerts: ''
    enabled: false
    fqdn: ''
  rkeConfig:
    chartValues:
      {}
    etcd:
      disableSnapshots: false
      s3:
        bucket: nrc-rancher
        cloudCredentialName: cattle-global-data:cc-g868h
        endpoint: nyc3.digitaloceanspaces.com
        endpointCA: ''
        folder: production
        region: nyc3
        skipSSLVerify: false
      snapshotRetention: 5
      snapshotScheduleCron: 0 */5 * * *
    etcdSnapshotCreate:
      generation: 1
    machineGlobalConfig:
      disable:
        - servicelb
        - local-storage
      disable-apiserver: false
      disable-cloud-controller: false
      disable-controller-manager: false
      disable-etcd: false
      disable-kube-proxy: false
      disable-network-policy: false
      disable-scheduler: false
      etcd-expose-metrics: false
      secrets-encryption: false
    machinePools:
      - controlPlaneRole: true
        etcdRole: true
        machineConfigRef:
          kind: VmwarevsphereConfig
          name: nc-production-master-86-rhpfw
        machineOS: linux
        name: master-86
        quantity: 1
        unhealthyNodeTimeout: 0s
        workerRole: true
      - controlPlaneRole: true
        etcdRole: true
        machineConfigRef:
          kind: VmwarevsphereConfig
          name: nc-production-master-93-hs8s4
        machineOS: linux
        name: master-93
        quantity: 0
        unhealthyNodeTimeout: 0s
        workerRole: true
      - machineConfigRef:
          kind: VmwarevsphereConfig
          name: nc-production-worker-86-b4s76
        machineOS: linux
        name: worker-86
        quantity: 0
        unhealthyNodeTimeout: 0s
        workerRole: true
      - machineConfigRef:
          kind: VmwarevsphereConfig
          name: nc-production-worker-93-sx4p6
        machineOS: linux
        name: worker-93
        quantity: 0
        unhealthyNodeTimeout: 0s
        workerRole: true
    machineSelectorConfig:
      - config:
          docker: false
          protect-kernel-defaults: false
          selinux: false
    registries:
      configs:
        {}
      mirrors:
        {}
    upgradeStrategy:
      controlPlaneConcurrency: '1'
      controlPlaneDrainOptions:
        deleteEmptyDirData: true
        disableEviction: false
        enabled: false
        force: false
        gracePeriod: -1
        ignoreDaemonSets: true
        skipWaitForDeleteTimeoutSeconds: 0
        timeout: 120
      workerConcurrency: '1'
      workerDrainOptions:
        deleteEmptyDirData: true
        disableEviction: false
        enabled: false
        force: false
        gracePeriod: -1
        ignoreDaemonSets: true
        skipWaitForDeleteTimeoutSeconds: 0
        timeout: 120
  machineSelectorConfig:
    - config: {}
__clone: true

creamy-pencil-82913 (03/24/2023, 12:34 AM)
ohhhhhh you got it to pass through the K3S_DATASTORE_ENDPOINT variable to the agent
Yeah, that’s not supported. As I said above, Rancher only supports using embedded etcd

wonderful-rain-13345 (03/24/2023, 12:34 AM)
yea
heh how'd it work before lol

creamy-pencil-82913 (03/24/2023, 12:34 AM)
if you’re OK with seeing the warning you can use it as-is but it will probably be very confused.

wonderful-rain-13345 (03/24/2023, 12:34 AM)
yeah it refuses to join workers

creamy-pencil-82913 (03/24/2023, 12:35 AM)
yep
All of the provisioning stuff expects for there to be etcd roles and join info available

wonderful-rain-13345 (03/24/2023, 12:35 AM)
so what's the story with kine? in my mental model i thought it'd save me from when nodes die. but apparently i need the join token + the DB

creamy-pencil-82913 (03/24/2023, 12:36 AM)
It works well if you have a highly available external DB. It does make your server nodes essentially disposable as long as you still have the DB and token.
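A sketch of rebuilding a lost server under that model, reusing the datastore endpoint from this thread (the token path is the k3s default; treat the exact values as examples):

```shell
# Back up the node token while the original server is still alive
sudo cat /var/lib/rancher/k3s/server/token   # save this somewhere safe

# On a replacement machine, point a fresh server at the same external DB
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint='postgres://k3s:k3s@172.16.1.60:32768/k3s_production?sslmode=disable' \
  --token='<saved token>'
```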

wonderful-rain-13345 (03/24/2023, 12:36 AM)
that's good 😅

creamy-pencil-82913 (03/24/2023, 12:36 AM)
But Rancher doesn’t support it, because we didn’t want to have to teach it how to do that

wonderful-rain-13345 (03/24/2023, 12:36 AM)
hmm so do i just not use the etcd role?

creamy-pencil-82913 (03/24/2023, 12:36 AM)
RKE1 and RKE2 both support only etcd, so the support for K3s also only supports embedded etcd
You just can’t point it at an external DB.
Blow away that node/cluster and build a new one without trying to point it at an external DB

wonderful-rain-13345 (03/24/2023, 12:37 AM)
embedded etcd is real etcd? or the one i read about with SQLite that is the default in k3s?

creamy-pencil-82913 (03/24/2023, 12:38 AM)
kine without an external db is sqlite
embedded etcd is real etcd
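For contrast, a minimal sketch of standing up embedded etcd instead (no external DB involved; hostnames and token are placeholders):

```shell
# First server: --cluster-init switches the datastore from sqlite to embedded etcd
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

# Additional servers join the same etcd cluster
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://<first-server>:6443 --token '<cluster token>'
```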

wonderful-rain-13345 (03/24/2023, 12:38 AM)
bundled into k3s?

creamy-pencil-82913 (03/24/2023, 12:38 AM)
yes

wonderful-rain-13345 (03/24/2023, 12:38 AM)
that env var switches k3s to use kine, right?

creamy-pencil-82913 (03/24/2023, 12:38 AM)
that env var tells it to use kine with an external database, yeah
you did the right thing and it is a great hack that I didn’t think would work
but Rancher just doesn’t expect it

wonderful-rain-13345 (03/24/2023, 12:39 AM)
(i realize how complicated this machinery is and I'm aware i'm super simplifying it and asking questions that are in "it depends" / "it's complicated" territory)
ok
so i guess my move here is don't use kine, and just snapshot that cluster's etcd.
and if things go horribly wrong, just restore from snapshot?
I really like rancher and k3s, but i will say i've been frustrated by a seeming lack of guidance around how to prep images (i.e. what is needed for k8s vs k3s).
It seems like a DR plan would include backing up rancher's etcd + my cluster's etcd? (gitops ci/cd aside)

creamy-pencil-82913 (03/24/2023, 12:42 AM)
for DR I would just recommend backing up the token and setting up etcd backups to S3
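A sketch of that DR setup, borrowing the S3 details already present in the cluster spec above (embedded etcd only):

```shell
# 1. Back up the join token
sudo cat /var/lib/rancher/k3s/server/token

# 2. Take an on-demand etcd snapshot to S3 (scheduled snapshots can be
#    configured the same way in the cluster's etcd settings)
k3s etcd-snapshot save \
  --s3 --s3-bucket=nrc-rancher --s3-region=nyc3 \
  --s3-endpoint=nyc3.digitaloceanspaces.com

# 3. To recover, reset the cluster from a snapshot
k3s server --cluster-reset --cluster-reset-restore-path=<snapshot file>
```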

wonderful-rain-13345 (03/24/2023, 12:42 AM)
i seem to always lose my etcds on all my clusters 😄

creamy-pencil-82913 (03/24/2023, 12:43 AM)
from that you should be able to restore the cluster

wonderful-rain-13345 (03/24/2023, 12:43 AM)
ok
i wonder how it worked before

creamy-pencil-82913 (03/24/2023, 12:44 AM)
if you didn’t try to add more workers maybe it didn’t care that the join URL wasn’t available?

wonderful-rain-13345 (03/24/2023, 12:44 AM)
in the last cluster i had 3 masters, and was churning the workers hard

creamy-pencil-82913 (03/24/2023, 12:45 AM)
I am honestly not sure what specifically it is looking for, I just know that the provisioning code only works with etcd. We talked about supporting external SQL DBs for provisioned clusters but it was removed from scope.

wonderful-rain-13345 (03/24/2023, 12:45 AM)
The problem occurred when i accidentally scaled the masters down to 0. rancher didn't kill the last master (for safety?), but it wouldn't let more join, or rather it did but they wouldn't reflect in the rancher UI. Then workers wouldn't join. Could only add masters
heh, i guess this works out
can i still pass
--disable-servicelb
or
--disable servicelb

creamy-pencil-82913 (03/24/2023, 12:51 AM)
yeah that should be fine

wonderful-rain-13345 (03/24/2023, 12:52 AM)
iirc i think the former is deprecated
Thanks a lot Brandon, much appreciated!

creamy-pencil-82913 (03/24/2023, 12:54 AM)
yeah sorry, you want --disable=servicelb not --disable-servicelb
we have some --disable-x flags and also --disable=x,y

wonderful-rain-13345 (03/24/2023, 12:55 AM)
i'll try with the UI check box 😄

creamy-pencil-82913 (03/24/2023, 12:55 AM)
depending on whether you want to disable a packaged manifest, or disable a core controller
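Side by side, the two flag families being described (a sketch):

```shell
# Packaged manifests (servicelb, traefik, local-storage, ...) use --disable=<list>
k3s server --disable=servicelb,local-storage

# Core controllers each get their own dedicated boolean flag
k3s server --disable-cloud-controller --disable-network-policy
```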

wonderful-rain-13345 (03/24/2023, 12:56 AM)
ahh yeah, i want to use metallb instead of klipper. And when you install longhorn on k3s with local-storage enabled, every time a new node starts up (maybe only masters?) it sets the local-storage class to default, even if Longhorn is the default and local-storage's default flag had been cleared, which breaks stuff

creamy-pencil-82913 (03/24/2023, 12:58 AM)
yep 😕
gotta disable local-storage as well
or be explicit about the StorageClassName on your PVCs, which I personally prefer
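An example of pinning the class explicitly (claim name and size are made up):

```shell
# A PVC that names its StorageClass, so a re-defaulted local-path class
# can never capture it
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data          # hypothetical claim name
spec:
  storageClassName: longhorn  # explicit, regardless of cluster default
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi
EOF
```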

wonderful-rain-13345 (03/24/2023, 12:59 AM)
yeah, it gets tricky with helm charts. They aren't all well written
I saw there was nfs support in the k3s tree? Does that mean I could mount an NFS pvc to a pod without a separate CSI driver?

polite-piano-74233 (03/24/2023, 1:22 AM)
that's native in kubernetes so yeah, you can just point the container volume at an nfs share directly
also i learned a lot from this thread 😄
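A minimal sketch of that native NFS volume (server address and export path are examples):

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nfs-example
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: share
          mountPath: /data
  volumes:
    - name: share
      nfs:                      # in-tree NFS volume plugin, no CSI driver
        server: 172.16.1.10     # example NFS server
        path: /exports/share    # example export
EOF
```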

wonderful-rain-13345 (03/24/2023, 1:22 AM)
nice thanks @polite-piano-74233

creamy-pencil-82913 (03/24/2023, 1:29 AM)
I like the NFS subdir provisioner too, it'll give you CSI PVs that are just a subdirectory off a base export. Handles cleanup and everything.
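The provisioner being recommended installs roughly like this (server and path values are examples):

```shell
# NFS subdir provisioner: dynamic PVs as subdirectories of one base export
helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=172.16.1.10 \
  --set nfs.path=/exports/k8s
```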

wonderful-rain-13345 (03/24/2023, 1:30 AM)
yeah i've used it previously
i'm on esx 6.7 and can't really go to 7, so the vsphere csi is kinda limited for me. Longhorn seems like a silver bullet.