# k3s
w
Using k3s on ubuntu. Nothing special
h
Is the 1st node coming up? If you have not already, take a look at:
journalctl -u k3s.service
w
👀
rancher says the cluster is waiting on: "non-ready bootstrap machine(s) production-master-86-5fc57bd6c-c88m2 and join url to be available on bootstrap node"
cluster is "explorable" in rancher
c
check the logs on the node?
w
yeah nothing of note
c
are all of the pods running?
are there any errors in the cluster agent pod?
w
only 6 pods?
c
you’re gonna have to give me a little more to work with
w
no i know
:))
this is the service log
i'm going to upgrade rancher too
c
is that the cattle-cluster-agent log?
it looks like it's still starting up
w
the first one is the pod
c
is the pod failing and being restarted? you might check the --previous logs
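e.g. (pod name here is just a placeholder):
kubectl -n cattle-system get pods
kubectl -n cattle-system logs cattle-cluster-agent-xxxxx --previous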
w
no restarts
i'll let it sit for a bit
i'm using kine with pgsql
i'll let it simmer for a bit
Thanks for checking, appreciated
c
ohhh hmm, is this an imported or provisioned cluster?
w
nope
sorry i was unclear
c
Did you provision it via rancher, or just import it?
w
I have an old k3os cluster that is running rancher v2.7.1 (via helm). I used that to deploy a basically plain ubuntu image on vmware via a template, using v1.24.10+k3s1
c
ok. so the cluster running rancher is on kine with postgres, and the provisioned cluster (the one that is stuck waiting) is a single-node cluster with etcd
is that correct?
w
yes, it's got etcd, worker, controller roles.
ran with
INSTALL_K3S_EXEC=--disable-cloud-controller
c
ah well that might do it
wait you ran with that on the downstream cluster?
w
in Cluster Config, Agent Env Vars
c
why though
if you do
kubectl get node -o wide
on the downstream cluster, is the node NotReady?
w
How do you install vsphere cpi/csi on k3s?
c
Manually, since we don’t include any packaged cloud provider charts or the in-tree cloud providers
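roughly, something like this (repo/chart names are the upstream defaults, not something k3s ships; adjust for your vSphere version):
```
# CPI via the upstream helm chart
helm repo add vsphere-cpi https://kubernetes.github.io/cloud-provider-vsphere
helm upgrade --install vsphere-cpi vsphere-cpi/vsphere-cpi -n kube-system \
  -f cpi-values.yaml   # vCenter address/credentials go in the chart values
# CSI: create the vmware-system-csi namespace and the csi-vsphere.conf secret,
# then apply the manifests from kubernetes-sigs/vsphere-csi-driver for your vSphere version
```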
w
node is ready
c
I think that your disable flag didn’t take
which is probably good
w
yeah that was my understanding, that i had to disable the CC so the CPI can be installed
because of conflicts
conflicting port iirc
c
hmm so it’s showing as ready, and all the pods are ready, but the UI still shows it as waiting?
w
yep
c
can you show the output of
kubectl get pod -A -o wide
and
kubectl get node -o wide
w
```
packerbuilt@production-master-86-649b8f71-5nqln:~$ kubectl get pod -A -o wide
NAMESPACE       NAME                                    READY   STATUS      RESTARTS   AGE   IP          NODE                                  NOMINATED NODE   READINESS GATES
kube-system     coredns-7b5bbc6644-qqpp8                1/1     Running     0          24m   10.42.0.4   production-master-86-649b8f71-5nqln   <none>           <none>
kube-system     metrics-server-667586758d-7gl5r         1/1     Running     0          24m   10.42.0.5   production-master-86-649b8f71-5nqln   <none>           <none>
cattle-system   cattle-cluster-agent-644ddf96b9-nj9vv   1/1     Running     0          24m   10.42.0.6   production-master-86-649b8f71-5nqln   <none>           <none>
kube-system     helm-install-traefik-crd-snd26          0/1     Completed   0          24m   10.42.0.3   production-master-86-649b8f71-5nqln   <none>           <none>
kube-system     helm-install-traefik-w5w5t              0/1     Completed   1          24m   10.42.0.2   production-master-86-649b8f71-5nqln   <none>           <none>
kube-system     traefik-64b96ccbcd-j5qdd                1/1     Running     0          23m   10.42.0.7   production-master-86-649b8f71-5nqln   <none>           <none>
packerbuilt@production-master-86-649b8f71-5nqln:~$ kubectl get node -o wide
NAME                                  STATUS   ROLES                  AGE   VERSION         INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
production-master-86-649b8f71-5nqln   Ready    control-plane,master   25m   v1.24.10+k3s1   172.16.1.252   <none>        Ubuntu 22.04.2 LTS   5.15.0-60-generic   containerd://1.6.15-k3s1
```
c
You’re missing the etcd role. Did you tweak anything else in your agent env vars?
Rancher only knows how to manage clusters that use embedded etcd. If you do something to tweak the K3s config so that it doesn’t use etcd, it will be very confused.
that doesn’t have the info I’m looking for, I think it’s in the cluster object
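something like this against the local (rancher) cluster should dump it:
kubectl -n fleet-default get clusters.provisioning.cattle.io <cluster-name> -o yaml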
w
i unchecked klipper lb
```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: production
  annotations:
    field.cattle.io/creatorId: user-ks4kc
#    key: string
  creationTimestamp: '2023-03-23T05:12:52Z'
  finalizers:
    - wrangler.cattle.io/provisioning-cluster-remove
    - wrangler.cattle.io/rke-cluster-remove
    - wrangler.cattle.io/cloud-config-secret-remover
#    - string
  generation: 8
  labels:
    {}
#    key: string
  namespace: fleet-default
  resourceVersion: '28260918'
  uid: 8213564a-d541-4aea-9966-4c829c5f5e44
  fields:
    - production
    - 'true'
    - production-kubeconfig
spec:
  agentEnvVars:
    - name: K3S_DATASTORE_ENDPOINT
      value: postgres://k3s:k3s@172.16.1.60:32768/k3s_production?sslmode=disable
    - name: INSTALL_K3S_EXEC
      value: '--disable-cloud-controller'
    - name: K3S_KUBECONFIG_MODE
      value: '644'
#    - name: string
#      value: string
  cloudCredentialSecretName: cattle-global-data:cc-ps8zn
  defaultPodSecurityPolicyTemplateName: ''
  kubernetesVersion: v1.24.10+k3s1
  localClusterAuthEndpoint:
    caCerts: ''
    enabled: false
    fqdn: ''
  rkeConfig:
    chartValues:
      {}
    etcd:
      disableSnapshots: false
      s3:
        bucket: nrc-rancher
        cloudCredentialName: cattle-global-data:cc-g868h
        endpoint: nyc3.digitaloceanspaces.com
        endpointCA: ''
        folder: production
        region: nyc3
        skipSSLVerify: false
      snapshotRetention: 5
      snapshotScheduleCron: 0 */5 * * *
    etcdSnapshotCreate:
      generation: 1
    machineGlobalConfig:
      disable:
        - servicelb
        - local-storage
      disable-apiserver: false
      disable-cloud-controller: false
      disable-controller-manager: false
      disable-etcd: false
      disable-kube-proxy: false
      disable-network-policy: false
      disable-scheduler: false
      etcd-expose-metrics: false
      secrets-encryption: false
    machinePools:
      - controlPlaneRole: true
        etcdRole: true
        machineConfigRef:
          kind: VmwarevsphereConfig
          name: nc-production-master-86-rhpfw
        machineOS: linux
        name: master-86
        quantity: 1
        unhealthyNodeTimeout: 0s
        workerRole: true
      - controlPlaneRole: true
        etcdRole: true
        machineConfigRef:
          kind: VmwarevsphereConfig
          name: nc-production-master-93-hs8s4
        machineOS: linux
        name: master-93
        quantity: 0
        unhealthyNodeTimeout: 0s
        workerRole: true
      - machineConfigRef:
          kind: VmwarevsphereConfig
          name: nc-production-worker-86-b4s76
        machineOS: linux
        name: worker-86
        quantity: 0
        unhealthyNodeTimeout: 0s
        workerRole: true
      - machineConfigRef:
          kind: VmwarevsphereConfig
          name: nc-production-worker-93-sx4p6
        machineOS: linux
        name: worker-93
        quantity: 0
        unhealthyNodeTimeout: 0s
        workerRole: true
#      - cloudCredentialSecretName: string
#        controlPlaneRole: boolean
#        displayName: string
#        drainBeforeDelete: boolean
#        drainBeforeDeleteTimeout: string
#        etcdRole: boolean
#        labels:
#          key: string
#        machineConfigRef:
#          apiVersion: string
#          fieldPath: string
#          kind: string
#          name: string
#          namespace: string
#          resourceVersion: string
#          uid: string
#        machineDeploymentAnnotations:
#          key: string
#        machineDeploymentLabels:
#          key: string
#        machineOS: string
#        maxUnhealthy: string
#        name: string
#        nodeStartupTimeout: string
#        paused: boolean
#        quantity: int
#        rollingUpdate:
#          maxSurge: string
#          maxUnavailable: string
#        taints:
#          - effect: string
#            key: string
#            timeAdded: string
#            value: string
#        unhealthyNodeTimeout: string
#        unhealthyRange: string
#        workerRole: boolean
    machineSelectorConfig:
      - config:
          docker: false
          protect-kernel-defaults: false
          selinux: false
#      - config:
#        
#        machineLabelSelector:
#          matchExpressions:
#            - key: string
#              operator: string
#              values:
#                - string
#          matchLabels:
#            key: string
    registries:
      configs:
        {}
#        authConfigSecretName: string
#        caBundle: string
#        insecureSkipVerify: boolean
#        tlsSecretName: string
      mirrors:
        {}
#        endpoint:
#          - string
#        rewrite:
#          key: string
    upgradeStrategy:
      controlPlaneConcurrency: '1'
      controlPlaneDrainOptions:
        deleteEmptyDirData: true
        disableEviction: false
        enabled: false
        force: false
        gracePeriod: -1
        ignoreDaemonSets: true
        skipWaitForDeleteTimeoutSeconds: 0
        timeout: 120
#        ignoreErrors: boolean
#        postDrainHooks:
#          - annotation: string
#        preDrainHooks:
#          - annotation: string
      workerConcurrency: '1'
      workerDrainOptions:
        deleteEmptyDirData: true
        disableEviction: false
        enabled: false
        force: false
        gracePeriod: -1
        ignoreDaemonSets: true
        skipWaitForDeleteTimeoutSeconds: 0
        timeout: 120
#        ignoreErrors: boolean
#        postDrainHooks:
#          - annotation: string
#        preDrainHooks:
#          - annotation: string
#    additionalManifest: string
#    etcdSnapshotRestore:
#      generation: int
#      name: string
#      restoreRKEConfig: string
#    infrastructureRef:
#      apiVersion: string
#      fieldPath: string
#      kind: string
#      name: string
#      namespace: string
#      resourceVersion: string
#      uid: string
#    provisionGeneration: int
#    rotateCertificates:
#      generation: int
#      services:
#        - string
#    rotateEncryptionKeys:
#      generation: int
  machineSelectorConfig:
    - config: {}
#  clusterAPIConfig:
#    clusterName: string
#  defaultClusterRoleForProjectMembers: string
#  enableNetworkPolicy: boolean
#  redeploySystemAgentGeneration: int
__clone: true
```
c
ohhhhhh you got it to pass through the K3S_DATASTORE_ENDPOINT variable to the agent
Yeah, that’s not supported. As I said above, Rancher only supports using embedded etcd
w
yea
heh how'd it work before lol
c
if you’re OK with seeing the warning you can use it as-is but it will probably be very confused.
w
yeah it refuses to join workers
c
yep
All of the provisioning stuff expects for there to be etcd roles and join info available
w
so what's the story with kine? in my mental model i thought it'd save me from when nodes die. but apparently i need the join token + the DB
c
It works well if you have a highly available external DB. It does make your server nodes essentially disposable as long as you still have the DB and token.
w
that's good 😅
c
But Rancher doesn’t support it, because we didn’t want to have to teach it how to do that
w
hmm so do i just not use the etcd role?
c
RKE1 and RKE2 both support only etcd, so the support for K3s also only supports embedded etcd
You just can’t point it at an external DB.
Blow away that node/cluster and build a new one without trying to point it at an external DB
w
embedded etcd is real etcd? or the one i read about with sqlite that is the default in k3s?
c
kine without an external db is sqlite
embedded etcd is real etcd
w
bundled into k3s?
c
yes
w
that env var switches k3s to use kine, right?
c
that env var tells it to use kine with an external database, yeah
you did the right thing and it is a great hack that I didn’t think would work
but Rancher just doesn’t expect it
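for reference, outside of rancher provisioning the two modes look roughly like this (endpoints/credentials are placeholders):
```
# embedded etcd (what Rancher provisioning expects)
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

# kine backed by an external Postgres DB (fine standalone, not supported for Rancher-provisioned clusters)
curl -sfL https://get.k3s.io | K3S_DATASTORE_ENDPOINT='postgres://user:pass@db-host:5432/k3s?sslmode=disable' sh -s - server
```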
w
(i realize how complicated this machinery is and I'm aware i'm super simplifying it and asking questions that are in "it depends" / "it's complicated" territory)
ok
so i guess my move here is don't use kine, and just snapshot that cluster's etcd.
and if things go horribly wrong, just restore from snapshot?
I really like rancher and k3s, but i will say i've been frustrated by a seeming lack of guidance around how to prep images (i.e. what is needed for k8s vs k3s).
It seems like a DR plan would include backing up rancher's etcd + my cluster's etcd? (gitops ci/cd aside)
c
for DR I would just recommend backing up the token and setting up etcd backups to S3
w
i seem to always lose my etcds on all my clusters 😄
c
from that you should be able to restore the cluster
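roughly (bucket/endpoint are placeholders; scheduled snapshots come from the cluster's snapshot config):
```
# on-demand etcd snapshot straight to S3
k3s etcd-snapshot save --s3 --s3-bucket=my-bucket --s3-region=nyc3 --s3-endpoint=nyc3.digitaloceanspaces.com

# restore: stop k3s, then reset the cluster from a snapshot
k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<snapshot-name>
```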
w
ok
i wonder how it worked before
c
if you didn’t try to add more workers maybe it didn’t care that the join URL wasn’t available?
w
in the last cluster, i was churning workers
I had 3 masters, and was churning the workers hard.
c
I am honestly not sure what specifically it is looking for, I just know that the provisioning code only works with etcd. We talked about supporting external SQL DBs for provisioned clusters but it was removed from scope.
w
The problem occurred when i accidentally scaled the masters down to 0. rancher didn't kill the last master (for safety?), but it wouldn't let more join, or rather it did but it wouldn't reflect in the rancher UI. Then workers wouldn't join; i could only add masters
heh, i guess this works out
can i still pass
--disable-servicelb
or
--disable servicelb
c
yeah that should be fine
w
iirc i think the former is deprecated
Thanks a lot Brandon, much appreciated!
c
yeah sorry, you want --disable=servicelb not --disable-servicelb
we have some --disable-x flags and also --disable=x,y
w
i'll try with the UI check box 😄
c
depending on whether you want to disable a packaged manifest, or disable a core controller
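on a standalone k3s install that distinction looks roughly like this in /etc/rancher/k3s/config.yaml (equivalent to the CLI flags):
```yaml
# things handled by the --disable list
disable:
  - servicelb
  - local-storage
# core controllers get their own boolean flags
disable-cloud-controller: true
```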
w
ahh yeah, i wanna use metallb instead of klipper. And when you install longhorn on k3s with local-storage enabled, every time a new node starts up (maybe only masters?) it sets the local-storage class to default, even if Longhorn is the default and local-storage's default flag had been cleared, which breaks stuff
c
yep 😕
gotta disable local-storage as well
or be explicit about the StorageClassName on your PVCs, which I personally prefer
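i.e. something like (class/size are just examples):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  storageClassName: longhorn   # explicit, so a flipped default class can't bite you
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```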
w
yeah, gets tricky with helm charts. They aren't all well written
I saw there was nfs support in the k3s tree? Does that mean I could mount an NFS pvc to a pod without a separate CSI?
p
that's native in kubernetes so yea, you can just point the container volume at an nfs share directly
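e.g. (server/path are placeholders):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nfs-example
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: share
          mountPath: /data
  volumes:
    - name: share
      nfs:
        server: nfs.example.com
        path: /exports/data
```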
also i learned a lot from this thread 😄
w
nice thanks @polite-piano-74233
c
I like the NFS subdir provisioner too, it'll give you CSI PVs that are just a subdirectory off a base export. Handles cleanup and everything.
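setup is roughly (repo/chart names from the upstream kubernetes-sigs project; server/path are placeholders):
```
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=nfs.example.com \
  --set nfs.path=/exports
```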
w
yeah i've used it previously
i'm on esx 6.7, can't really go to 7, so the vsphere csi is kinda limited for me. Longhorn seems like a silver bullet.