# harvester
g
how many healthy master nodes do you have left when you are trying to add the new node?
q
2
it's also not promoting my 3rd node to master
@great-bear-19718 ^
@great-bear-19718 any chance you can give me a hand on this? i'd really like to get my cluster back to health w/ 3 masters.
g
Copy code
(⎈|default:N/A)➜  ~ k get nodes
NAME           STATUS   ROLES                       AGE    VERSION
harvester-01   Ready    control-plane,etcd,master   374d   v1.24.7+rke2r1
harvester-03   Ready    control-plane,etcd,master   219d   v1.24.7+rke2r1
harvester-07   Ready    <none>                      200d   v1.24.7+rke2r1
is this the node you are trying to add?
harvester-07
?
q
no
7 is the node that was already present, but it wont promote it to a master for some reason
so right now, i have a very-scary ha setup, w/ only 2 masters and 1 worker.
g
so node got added to the cluster?
q
usually it just auto-promotes the 3rd machine when you delete a master
g
just not getting promoted.. yeah.. it should
let me check what is going on
q
lmk if you want a new bundle πŸ™‚
the node i added was harvester-02-r or something like that (old one i lost was harvester-02)
g
so what is
harvester-07
is that node marked ready?
i dont see
harvester-02-r
in the support bundle
q
yeah. it's happy
g
ideally 07 should have been promoted to master
q
02-r hasnt actually joined yet. it just sits at not ready
yeah, that's what i experienced before, but for some reason it wont now. and i've reset the whole cluster and what-not too. so it's not a "try rebooting it" kinda fix
g
on the node that is still not ready
can you please check the output of
journalctl -fu rancherd
?
q
the one that wont join?
g
yeah the one that wont join
q
k. give me a few, i have to spin it up. we just moved all our gear to a new suite in the building over the weekend.
brb
g
the support bundle has no reference to it, which makes sense since it has obviously not joined this cluster yet
no rush
q
so yeah, that other node is fubar right now, i had to salvage a part after the move to get one of the others up.
any chance we can figure out why 7 is not promoting?
i have another machine i can try to bring in too if needed though
g
ok let me check why 7 did not promote
q
thanks. i dont like running an ha w/ only 2 masters... i lose one, and it's a bad day.
g
other 2 nodes have a topology setup
harvester-07 does not
i assume you defined the topology
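you can compare with something like this and copy whatever zone/region labels the other two carry onto 07 (the exact key/value depends on how you defined it, so treat this as a sketch):
Copy code
kubectl get nodes --show-labels | grep topology
kubectl label node harvester-07 topology.kubernetes.io/zone=<same-zone-as-the-other-masters>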
q
ah. is that why?
i need to add a topology?
look at that. it's promoting! lol
geeze.
g
πŸ‘
q
okay, i'll stand up another node really fast and let you know if it has issues joining again
can you help me with these:
i really want to upgrade to 1.1.2 but i dont want to till these are fixed.
g
is that from the last upgrade?
q
yeah
i upped from 1.0 to 1.1.1
it's been lingering since i think
g
Copy code
(⎈|default:harvester-system)➜  v1 k get deploy -n cattle-monitoring-system
NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
rancher-monitoring-grafana              0/1     1            0           375d
grafana is not ready
q
also, looks like 7 is stuck promoting. it's still showing cordoned 😞
g
there will be a promote job in the
harvester-system
ns that should have more info
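something along these lines should surface it (the job name here is a guess based on the node name):
Copy code
kubectl get jobs -n harvester-system
kubectl logs -n harvester-system job/harvester-promote-harvester-07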
also output of
kubectl describe pod rancher-monitoring-grafana-787b587b6d-nqccz -n cattle-monitoring-system
it is stuck
i'd like to know why its stuck
q
looks like it cant mount a pvc
should i try deleting it really fast so it can retry? or want to see the describe?
g
yeah sure
cant make it worse
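something like this; the deployment will just recreate it:
Copy code
kubectl delete pod rancher-monitoring-grafana-787b587b6d-nqccz -n cattle-monitoring-system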
q
for promote:
Copy code
Normal  Scheduled  31m   default-scheduler  Successfully assigned harvester-system/harvester-promote-harvester-07-qqnlp to harvester-07
  Normal  Pulled     31m   kubelet            Container image "busybox:1.32.0" already present on machine
  Normal  Created    31m   kubelet            Created container promote
  Normal  Started    31m   kubelet            Started container promote
want logs?
g
yep
q
Copy code
E1002 16:57:58.119302    6624 memcache.go:255] couldn't get resource list for custom.metrics.k8s.io/v1beta1: Got empty response for: custom.metrics.k8s.io/v1beta1
deployment "rancher-webhook" successfully rolled out
machine.cluster.x-k8s.io/custom-eb75b22fefcf labeled
secret/custom-eb75b22fefcf-machine-plan labeled
rkebootstrap.rke.cattle.io/custom-eb75b22fefcf labeled
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
then it's waiting for promotion... A LOT
g
kubectl get pods -n kube-system
q
everything is running, want output?
g
yes please
q
Copy code
NAME                                                    READY   STATUS    RESTARTS         AGE
cloud-controller-manager-harvester-01                   1/1     Running   1661 (41h ago)   282d
cloud-controller-manager-harvester-03                   1/1     Running   24 (41h ago)     30d
etcd-harvester-01                                       1/1     Running   37 (41h ago)     282d
etcd-harvester-03                                       1/1     Running   6 (41h ago)      30d
harvester-whereabouts-h6mvg                             1/1     Running   21 (41h ago)     219d
harvester-whereabouts-mnx4v                             1/1     Running   36 (41h ago)     282d
harvester-whereabouts-rrwkg                             1/1     Running   17 (41h ago)     200d
kube-apiserver-harvester-01                             1/1     Running   1702 (41h ago)   282d
kube-apiserver-harvester-03                             1/1     Running   24 (41h ago)     30d
kube-controller-manager-harvester-01                    1/1     Running   1543 (41h ago)   282d
kube-controller-manager-harvester-03                    1/1     Running   38 (41h ago)     30d
kube-proxy-harvester-01                                 1/1     Running   37 (41h ago)     282d
kube-proxy-harvester-03                                 1/1     Running   23 (41h ago)     219d
kube-proxy-harvester-07                                 1/1     Running   18 (41h ago)     200d
kube-scheduler-harvester-01                             1/1     Running   56 (41h ago)     282d
kube-scheduler-harvester-03                             1/1     Running   11 (41h ago)     30d
rke2-canal-bzxwf                                        2/2     Running   34 (41h ago)     200d
rke2-canal-fkpnw                                        2/2     Running   99 (41h ago)     282d
rke2-canal-vx8m7                                        2/2     Running   43 (41h ago)     219d
rke2-coredns-rke2-coredns-58fd75f64b-jqgn6              1/1     Running   5 (41h ago)      14d
rke2-coredns-rke2-coredns-58fd75f64b-sjtnx              1/1     Running   5 (41h ago)      14d
rke2-coredns-rke2-coredns-autoscaler-768bfc5985-sw7kd   1/1     Running   5 (41h ago)      14d
rke2-ingress-nginx-controller-6mx6k                     1/1     Running   22 (41h ago)     219d
rke2-ingress-nginx-controller-dn5wj                     1/1     Running   36 (41h ago)     282d
rke2-ingress-nginx-controller-hht5t                     1/1     Running   22 (41h ago)     200d
rke2-metrics-server-5df44dfc84-28tx9                    1/1     Running   5 (41h ago)      14d
rke2-multus-ds-86zvx                                    1/1     Running   35 (41h ago)     282d
rke2-multus-ds-dxdwd                                    1/1     Running   21 (41h ago)     219d
rke2-multus-ds-kfl99                                    1/1     Running   17 (41h ago)     200d
snapshot-controller-7c4887cf-5rv67                      1/1     Running   7 (41h ago)      14d
snapshot-controller-7c4887cf-pggp6                      1/1     Running   14 (41h ago)     14d
g
also
kubectl get machine -n fleet-local
q
ty for the help btw...
Copy code
NAME                  CLUSTER   NODENAME       PROVIDERID            PHASE          AGE    VERSION
custom-716cb3ba930e   local     harvester-01   rke2://harvester-01   Running        374d
custom-955c2bcfc429   local                                          Provisioning   7d5h
custom-98b7fe6bc0be   local     harvester-03   rke2://harvester-03   Running        219d
custom-eb75b22fefcf   local     harvester-07   rke2://harvester-07   Running        200d
custom-f4c99741d9b4   local                                          Provisioning   7d6h
g
are you able to delete the two stuck in
Provisioning
? they are likely leftover references from your old machine that failed to join
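roughly like this, using the names from your get machine output:
Copy code
kubectl delete machine custom-955c2bcfc429 custom-f4c99741d9b4 -n fleet-local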
or we can do it later.. if you shell into
harvester-07
there should be a
rancher-system-agent
running i need to see its logs
journalctl -fu rancher-system-agent
q
machines deleted. here are logs from rancher-system-agent:
Copy code
rancher@harvester-07:~> journalctl -fu rancher-system-agent
-- Logs begin at Thu 2023-03-16 05:38:44 UTC. --
Oct 01 06:40:39 harvester-07 rancher-system-agent[11989]: time="2023-10-01T06:40:39Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + echo EnvironmentFile=-/var/lib/rancher/rke2/system-agent-installer/rke2-sa.env"
Oct 01 06:40:39 harvester-07 rancher-system-agent[11989]: time="2023-10-01T06:40:39Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + '[' -n ffb03c631d25480057e7bdad200aaf8835233029b9b271d0490e49198dd0b2aa ']'"
Oct 01 06:40:39 harvester-07 rancher-system-agent[11989]: time="2023-10-01T06:40:39Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + echo ffb03c631d25480057e7bdad200aaf8835233029b9b271d0490e49198dd0b2aa"
Oct 01 06:40:39 harvester-07 rancher-system-agent[11989]: time="2023-10-01T06:40:39Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + systemctl daemon-reload"
Oct 01 06:40:40 harvester-07 rancher-system-agent[11989]: time="2023-10-01T06:40:40Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + '[' '' = true ']'"
Oct 01 06:40:40 harvester-07 rancher-system-agent[11989]: time="2023-10-01T06:40:40Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + '[' agent = server ']'"
Oct 01 06:40:40 harvester-07 rancher-system-agent[11989]: time="2023-10-01T06:40:40Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + systemctl enable rke2-agent"
Oct 01 06:40:40 harvester-07 rancher-system-agent[11989]: time="2023-10-01T06:40:40Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + '[' '' = true ']'"
Oct 01 06:40:40 harvester-07 rancher-system-agent[11989]: time="2023-10-01T06:40:40Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + '[' false = true ']'"
Oct 01 06:40:40 harvester-07 rancher-system-agent[11989]: time="2023-10-01T06:40:40Z" level=info msg="[Applyinator] Command sh [-c run.sh] finished with err: <nil> and exit code: 0"
g
that is not today's date?
q
you are not wrong...
Copy code
rancher@harvester-07:~> sudo timedatectl
               Local time: Tue 2023-10-03 00:06:13 UTC
           Universal time: Tue 2023-10-03 00:06:13 UTC
                 RTC time: Tue 2023-10-03 00:06:13
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no
g
it is the 3rd where i am.. so i figured it should not be the 1st October anywhere now 🀣
q
true story. lol
g
can you restart that?
systemctl restart rancher-system-agent
?
might need to check the logs once this is done
q
Copy code
-- Logs begin at Thu 2023-03-16 05:38:44 UTC. --
Oct 03 00:07:15 harvester-07 rancher-system-agent[16702]: time="2023-10-03T00:07:15Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + echo EnvironmentFile=-/var/lib/rancher/rke2/system-agent-installer/rke2-sa.env"
Oct 03 00:07:15 harvester-07 rancher-system-agent[16702]: time="2023-10-03T00:07:15Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + '[' -n ffb03c631d25480057e7bdad200aaf8835233029b9b271d0490e49198dd0b2aa ']'"
Oct 03 00:07:15 harvester-07 rancher-system-agent[16702]: time="2023-10-03T00:07:15Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + echo ffb03c631d25480057e7bdad200aaf8835233029b9b271d0490e49198dd0b2aa"
Oct 03 00:07:15 harvester-07 rancher-system-agent[16702]: time="2023-10-03T00:07:15Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + systemctl daemon-reload"
Oct 03 00:07:15 harvester-07 rancher-system-agent[16702]: time="2023-10-03T00:07:15Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + '[' '' = true ']'"
Oct 03 00:07:15 harvester-07 rancher-system-agent[16702]: time="2023-10-03T00:07:15Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + '[' agent = server ']'"
Oct 03 00:07:15 harvester-07 rancher-system-agent[16702]: time="2023-10-03T00:07:15Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + systemctl enable rke2-agent"
Oct 03 00:07:16 harvester-07 rancher-system-agent[16702]: time="2023-10-03T00:07:16Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + '[' '' = true ']'"
Oct 03 00:07:16 harvester-07 rancher-system-agent[16702]: time="2023-10-03T00:07:16Z" level=info msg="[02acbe9892ef3951f3ce1a95cecc909d312aae52712baaca9b97c35c2099bfdf_0:stderr]: + '[' false = true ']'"
Oct 03 00:07:16 harvester-07 rancher-system-agent[16702]: time="2023-10-03T00:07:16Z" level=info msg="[Applyinator] Command sh [-c run.sh] finished with err: <nil> and exit code: 0"
g
any chance i could please see this..
Copy code
kubectl get cluster.provisioning -n fleet-local -o yaml
q
Copy code
apiVersion: v1
items:
- apiVersion: provisioning.cattle.io/v1
  kind: Cluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","metadata":{"annotations":{},"labels":{"rke.cattle.io/init-node-machine-id":"42hqkq5728cv59wl99hmwjglvq6hv4pnw4ps9x2d6nfchstmtb82jp"},"name":"local","namespace":"fleet-local"},"spec":{"kubernetesVersion":"v1.22.12+rke2r1","rkeConfig":{"controlPlaneConfig":null}}}
      objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yQzU7DMBCEXwXt2Slt079Y4oAQ4sCVF9jYS2Ow15G9CYfK746SVqJC4udo78xovjlBIEGLgqBPgMxRUFzkPD1j+0ZGMskiubgwKOJp4eKts6ChT3F02UV2fKyMH7JQqkwiFAL1ozV+MKXqOL6DhoCMRwrEciUYa3Xz7NjePZwj/8xiDAQafDTo/yXOPZrJAUXB3NdFfnGBsmDoQfPgvQKPLflfR+gwd6Bhu9ztt3XdUGNwc7Crdr9u6jW1y/pg91vb2LXdbHarA6jzYpbSVwho6DCNNIMWBd9Yrtu+eiKpzpeiIPdkpnbzx2Wq+0G6R7Z9dCygT2WSCcpwwciURrJPxJRmZtDLUj4DAAD//5CVWGcAAgAA
      objectset.rio.cattle.io/id: provisioning-cluster-create
      objectset.rio.cattle.io/owner-gvk: management.cattle.io/v3, Kind=Cluster
      objectset.rio.cattle.io/owner-name: local
      objectset.rio.cattle.io/owner-namespace: ""
    creationTimestamp: "2022-09-22T17:45:55Z"
    finalizers:
    - wrangler.cattle.io/provisioning-cluster-remove
    - wrangler.cattle.io/rke-cluster-remove
    generation: 4
    labels:
      objectset.rio.cattle.io/hash: 50675339e9ca48d1b72932eb038d75d9d2d44618
      provider.cattle.io: harvester
    name: local
    namespace: fleet-local
    resourceVersion: "624758340"
    uid: ee63516e-be79-43b0-a331-59e7e18c264b
  spec:
    kubernetesVersion: v1.24.7+rke2r1
    localClusterAuthEndpoint: {}
    rkeConfig:
      chartValues: null
      machineGlobalConfig: null
      provisionGeneration: 1
      upgradeStrategy:
        controlPlaneDrainOptions:
          timeout: 0
        workerDrainOptions:
          timeout: 0
  status:
    clientSecretName: local-kubeconfig
    clusterName: local
    conditions:
    - status: "True"
      type: Ready
    - lastUpdateTime: "2022-09-22T17:45:55Z"
      status: "False"
      type: Reconciling
    - lastUpdateTime: "2022-09-22T17:45:55Z"
      status: "False"
      type: Stalled
    - lastUpdateTime: "2023-02-24T21:16:06Z"
      status: "True"
      type: Created
    - lastUpdateTime: "2023-09-30T09:06:07Z"
      status: "True"
      type: RKECluster
    - status: Unknown
      type: DefaultProjectCreated
    - status: Unknown
      type: SystemProjectCreated
    - lastUpdateTime: "2022-12-24T02:22:27Z"
      status: "True"
      type: Provisioned
    - lastUpdateTime: "2023-09-30T09:06:07Z"
      message: 'configuring bootstrap node(s) custom-716cb3ba930e: waiting for probes:
        kube-controller-manager, kube-scheduler'
      reason: Waiting
      status: Unknown
      type: Updated
    - lastUpdateTime: "2022-12-24T03:43:59Z"
      status: "True"
      type: Connected
    observedGeneration: 4
    ready: true
kind: List
metadata:
  resourceVersion: ""
g
yeah its still waiting on the node probes to report back
any chance i could have a new support-bundle?
or tail the logs of rancher pods in
cattle-system
namespace.. based on the scripts run.. everything has been done to get rancher to trigger the reconcile of this node
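something like this should do it (the selector may differ slightly on your setup):
Copy code
kubectl logs -n cattle-system -l app=rancher --tail=100 -f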
q
sure, one sec.
sorry, ac techs from the move needed something.
g
no rush
you are missing a dns record..
Copy code
dial tcp: lookup rancher.mgt.natimark.com on 10.53.0.10:53: no such host
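you can confirm from inside the cluster with something like:
Copy code
kubectl run dnstest --rm -it --restart=Never --image=busybox:1.32.0 -- nslookup rancher.mgt.natimark.com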
q
thats the rancher manager. it's off right now, is that why?
g
the cluster agent and rancher pods are being spammed by it
q
i can bring it back online. give me a few
okay, rancherd is back online
should i restart that service again? or you want something else?
g
can you please check the logs for rancher now?
kubectl logs -f rancher-bbf4bdf96-6gl7w -n cattle-system
this should be running the promotion..
q
Copy code
2023/10/03 01:02:13 [ERROR] Failed to dial steve aggregation server: dial tcp: lookup rancher.mgt.natimark.com on 10.53.0.10:53: no such host
2023/10/03 01:02:18 [ERROR] Failed to dial steve aggregation server: dial tcp: lookup rancher.mgt.natimark.com on 10.53.0.10:53: no such host
2023/10/03 01:02:23 [ERROR] Failed to dial steve aggregation server: websocket: bad handshake
2023/10/03 01:02:28 [ERROR] Failed to dial steve aggregation server: dial tcp: lookup rancher.mgt.natimark.com on 10.53.0.10:53: no such host
2023/10/03 01:02:33 [ERROR] Failed to dial steve aggregation server: dial tcp: lookup rancher.mgt.natimark.com on 10.53.0.10:53: no such host
2023/10/03 01:04:35 [INFO] Downloading repo index from https://releases.rancher.com/server-charts/stable/index.yaml
2023/10/03 01:05:18 [INFO] [planner] rkecluster fleet-local/local: waiting: configuring bootstrap node(s) custom-716cb3ba930e: waiting for probes: kube-controller-manager, kube-scheduler
looks like its moving now
Copy code
2023/10/03 01:04:35 [INFO] Downloading repo index from https://releases.rancher.com/server-charts/stable/index.yaml
2023/10/03 01:05:18 [INFO] [planner] rkecluster fleet-local/local: waiting: configuring bootstrap node(s) custom-716cb3ba930e: waiting for probes: kube-controller-manager, kube-scheduler
2023/10/03 01:05:32 [INFO] Downloading repo index from http://harvester-cluster-repo.cattle-system/charts/index.yaml
2023/10/03 01:05:51 [INFO] [planner] rkecluster fleet-local/local: waiting: configuring bootstrap node(s) custom-716cb3ba930e: waiting for probes: kube-controller-manager, kube-scheduler
2023/10/03 01:06:34 [ERROR] Error during subscribe websocket: close sent
g
are you able to check on
harvester-07
node if anything got triggered?
q
still showing promoting, and same error on the journalctl -fu rancher-system-agent
g
k get machines.cluster custom-eb75b22fefcf -n fleet-local -o yaml
q
Copy code
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  annotations:
    objectset.rio.cattle.io/applied: H4sIAAAAAAAA/5yTwW7bPBCEX+XHniXHlixREvBfWvQUtAe36H21XNqsKdIgV06LwO9eSHHcuEHSNkeJnMU3M8t7GFhQoyB094DeB0GxwafpM/TfmCSxLKINC0IRxwsbbqyGDkY/oMct63xA2lnPkL0oCHeeY7497qGDm+Mq++/Wev3/Z6bI8keZx4GhAxqThCHnXlV9URg2ZP5Kmg5Ik94FQgenDCjybPGLHTgJDgfo/OhcBg57drNxcmMSjovv+b5J07jzj0eWh1kZ7DAeeTrYkZ2unROBDiSOrwWyw7SDDpalMqQ1NTWu123F67pmXlFd9kvTlqYxpaqU5goyiHt+on+B5/rSuZd8rksva1UqomrVUFG0Rq2JG1Sq4LZet31ZVKopmrLqm74tWa206TUutTJFRbUyz4bfhbjnmMfg+NHuKYNXu3rahnHMkl86SQemKfg+BEkS8TC3ELyx2w2beTUP9ivHZIOH7jeU4woy2Fs/2dzcfnh3mfFWnmlJHgL+dJ2v9SZikjiSjJH/jez9zPDx8lbejJYEZUxXaW0Y9Q/oDLrEzyl/nZ1OPwMAAP//RguWUfADAAA
    objectset.rio.cattle.io/id: unmanaged-machine
    objectset.rio.cattle.io/owner-gvk: /v1, Kind=Secret
    objectset.rio.cattle.io/owner-name: custom-eb75b22fefcf
    objectset.rio.cattle.io/owner-namespace: local
  creationTimestamp: "2023-03-16T05:39:03Z"
  finalizers:
  - machine.cluster.x-k8s.io
  generation: 3
  labels:
    cluster.x-k8s.io/cluster-name: local
    harvesterhci.io/managed: "true"
    objectset.rio.cattle.io/hash: 037fcddc86a4495e466ee1c63b0f93f8f3757de5
    rke.cattle.io/cluster-name: local
    rke.cattle.io/control-plane-role: "true"
    rke.cattle.io/etcd-role: "true"
    rke.cattle.io/machine-id: d06737cc518c229f74ce8a772e9649b325782835b8b93e71dfbda0d7f25c67f
    rke.cattle.io/worker-role: "true"
  name: custom-eb75b22fefcf
  namespace: fleet-local
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: local
    uid: 736f82bc-3db3-4252-84f2-4ef8a4c46846
  resourceVersion: "629257036"
  uid: 41236e43-79aa-40fd-8cc7-03122e34ff42
spec:
  bootstrap:
    configRef:
      apiVersion: rke.cattle.io/v1
      kind: RKEBootstrap
      name: custom-eb75b22fefcf
      namespace: fleet-local
    dataSecretName: custom-eb75b22fefcf-machine-bootstrap
  clusterName: local
  infrastructureRef:
    apiVersion: rke.cattle.io/v1
    kind: CustomMachine
    name: custom-eb75b22fefcf
    namespace: fleet-local
  providerID: rke2://harvester-07
status:
  addresses:
  - address: 192.168.5.170
    type: InternalIP
  - address: harvester-07
    type: Hostname
  bootstrapReady: true
  conditions:
  - lastTransitionTime: "2023-03-16T05:39:04Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-03-16T05:39:04Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2023-03-16T05:39:03Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2023-10-01T06:40:43Z"
    status: "True"
    type: NodeHealthy
  - lastTransitionTime: "2023-03-16T05:39:03Z"
    status: "True"
    type: PlanApplied
  - lastTransitionTime: "2023-06-28T06:55:43Z"
    status: "True"
    type: Reconciled
  infrastructureReady: true
  lastUpdated: "2023-03-16T05:39:27Z"
  nodeInfo:
    architecture: amd64
    bootID: e120d133-b9c9-45ad-bb97-d1a42a3b261d
    containerRuntimeVersion: containerd://1.6.8-k3s1
    kernelVersion: 5.3.18-150300.59.101-default
    kubeProxyVersion: v1.24.7+rke2r1
    kubeletVersion: v1.24.7+rke2r1
    machineID: f9b21763a25ec86d013eafc56412ab6b
    operatingSystem: linux
    osImage: Harvester v1.1.1
    systemUUID: 00000000-0000-0000-0000-309c23e612c0
  nodeRef:
    apiVersion: v1
    kind: Node
    name: harvester-07
    uid: 80bc540c-30c0-46a5-b863-953187a2e314
  observedGeneration: 3
  phase: Running
g
q
what node should i do this on?
Copy code
echo "Rotating kube-controller-manager certificate"
does it have to be a master? or on har-7?
g
harv-7
i dont think this will be the case
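fwiw that rotation is basically just removing the cert pair on whichever control-plane node actually has that tls dir and letting rke2 regenerate it, roughly (a sketch, not the exact script you're following):
Copy code
sudo rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
sudo crictl rm -f $(sudo crictl ps -q --name kube-controller-manager)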
q
so i dont have the tls folder for:
Copy code
sudo rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
should i keep going?
i do on a master, but not on har-7
and i've had this cluster for 375 days... so maybe
g
well issue would have happened on harv-7 i'd think
crictl ps
are you able to please check this on harv-7
q
Copy code
CONTAINER           IMAGE               CREATED             STATE               NAME                            ATTEMPT             POD ID              POD
d3f7b46617e71       219ee5171f800       2 hours ago         Running             promote                         0                   ec8543dca05e9       harvester-promote-harvester-07-qqnlp
529c10d7636d7       9b80adc8eaa31       41 hours ago        Running             compute                         0                   57bca8824ce13       virt-launcher-addrbbox-dhnbf
5accad7b5047d       9b80adc8eaa31       42 hours ago        Running             compute                         0                   ce0d629e94526       virt-launcher-accuzip-rvwnd
cd88f91d3224f       1a7095f7e9bc9       43 hours ago        Running             backing-image-manager           0                   faabb0153f36c       backing-image-manager-d7ad-8fd6
881d47a92dcba       de93f80351954       43 hours ago        Running             longhorn-csi-plugin             27                  018a77a10bf01       longhorn-csi-plugin-qxl9n
703120bf37f67       77c44d54b1211       43 hours ago        Running             virt-handler                    210                 3bd6db0b6bc2e       virt-handler-v86pv
72d8301278723       36b11648e019a       43 hours ago        Running             replica-manager                 0                   4290477270099       instance-manager-r-17fd168b
485914591d6f5       36b11648e019a       43 hours ago        Running             engine-manager                  0                   0796d85ade51a       instance-manager-e-ce98fd9a
bd92b9dd18961       28003e667aa9f       43 hours ago        Running             rke2-ingress-nginx-controller   22                  245e088876dda       rke2-ingress-nginx-controller-hht5t
bbcdda3ad8019       8318b9b61b32b       43 hours ago        Running             rancher                         4                   1c5a89a77e057       rancher-bbf4bdf96-5qqzz
5ff221b27fbf3       de93f80351954       43 hours ago        Running             longhorn-manager                22                  5c4df84d9f4f3       longhorn-manager-czr8c
366d62b2a4bee       cb03930a2bd42       43 hours ago        Running             node-driver-registrar           17                  018a77a10bf01       longhorn-csi-plugin-qxl9n
55a83adb228af       5131c4e1af289       43 hours ago        Running             fluent-bit                      17                  d56fb75e55ee6       rancher-logging-kube-audit-fluentbit-8xsvn
51a4170c2db93       5131c4e1af289       43 hours ago        Running             fluent-bit                      17                  3e2411765caea       rancher-logging-root-fluentbit-stzfj
f316b187484ae       8681890ac02c0       43 hours ago        Running             engine-image-ei-a5371358        17                  0175ce5479730       engine-image-ei-a5371358-2pj9f
ec0d39076e4b3       5131c4e1af289       43 hours ago        Running             fluentbit                       17                  8c53d37ef207b       rancher-logging-rke2-journald-aggregator-4zfdz
d0e625f5cb9c4       8203b8fd46399       43 hours ago        Running             apiserver                       7                   bd3db7e8593fb       harvester-7794f4b7c4-4b7n4
0754a0c885c59       803347fbe5a24       43 hours ago        Running             harvester-webhook               3                   d632f61ef9586       harvester-webhook-5b88c99f5d-jfxw5
e9b21c061a2ab       39dd4d1e9ee87       43 hours ago        Running             node-manager                    17                  66389f94b96e6       harvester-node-manager-gv9l2
7cba533871094       ab979157630fc       43 hours ago        Running             kube-flannel                    17                  f0f4fd67fa1ec       rke2-canal-bzxwf
061d519275735       5dddbb6d554c6       43 hours ago        Running             calico-node                     17                  f0f4fd67fa1ec       rke2-canal-bzxwf
cf5802af468b3       12f4ea63839f6       43 hours ago        Running             harvester-network               22                  55f3e1c0299de       harvester-network-controller-j8wgg
4b142c0e05db7       347508c544b98       43 hours ago        Running             harvester-node-disk-manager     24                  fdcf00e9d2a02       harvester-node-disk-manager-ttwx9
c467b19be9789       38df782a74380       43 hours ago        Running             kube-proxy                      18                  8251deb85f40c       kube-proxy-harvester-07
2cafc8ed23e17       a49a7ca14bb9d       43 hours ago        Running             whereabouts                     17                  b55955860656a       harvester-whereabouts-rrwkg
2cb89b4f9433b       0482afd7c6409       43 hours ago        Running             longhorn-loop-device-cleaner    17                  0ec3bcc0b8f7a       longhorn-loop-device-cleaner-vrvgc
47393a3a21fbe       9ef244af5338c       43 hours ago        Running             kube-rke2-multus                17                  745584a10ed80       rke2-multus-ds-kfl99
36c1f32678c9e       0fafea1498594       43 hours ago        Running             node-exporter                   25                  3ced0fb0ad0e8       rancher-monitoring-prometheus-node-exporter-9qgjw
g
we could try restarting rancher pods
the embedded rancher in harvester else i will need to ask in our rancher team since i am not across the logic its trying to run
q
i can try the delete. any specific command? or type of pod i wanna kill?
g
just the rancher pods..
kubectl delete pod rancher-bbf4bdf96-5qqzz rancher-bbf4bdf96-6gl7w rancher-bbf4bdf96-hkhpf -n cattle-system
it should spin up new ones and we will know if anything happens
q
🀞
its not spinning up on 7. but it's cordoned because of the promote... should i uncordon?
g
leave it as such
there are other pods.. only 1 is actually doing the processing anyways
q
got yah.
Copy code
I1003 01:44:04.652321      33 leaderelection.go:258] successfully acquired lease kube-system/cattle-controllers
2023/10/03 01:44:04 [INFO] Steve auth startup complete
2023/10/03 01:44:05 [INFO] Starting /v1, Kind=Node controller
2023/10/03 01:44:05 [INFO] Starting management.cattle.io/v3, Kind=User controller
2023/10/03 01:44:05 [INFO] Starting /v1, Kind=Namespace controller
2023/10/03 01:44:05 [INFO] Starting /v1, Kind=Pod controller
2023/10/03 01:44:05 [INFO] Starting rke.cattle.io/v1, Kind=RKECluster controller
2023/10/03 01:44:05 [INFO] Starting apps/v1, Kind=Deployment controller
2023/10/03 01:44:05 [INFO] Starting rke.cattle.io/v1, Kind=RKEControlPlane controller
2023/10/03 01:44:05 [INFO] Starting admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration controller
2023/10/03 01:44:05 [INFO] Starting rke.cattle.io/v1, Kind=ETCDSnapshot controller
2023/10/03 01:44:05 [INFO] Starting cluster.x-k8s.io/v1beta1, Kind=Cluster controller
2023/10/03 01:44:05 [INFO] Starting rke.cattle.io/v1, Kind=RKEBootstrapTemplate controller
2023/10/03 01:44:05 [INFO] Starting catalog.cattle.io/v1, Kind=Operation controller
2023/10/03 01:44:05 [INFO] Starting admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration controller
2023/10/03 01:44:05 [INFO] Starting catalog.cattle.io/v1, Kind=App controller
2023/10/03 01:44:05 [INFO] Starting fleet.cattle.io/v1alpha1, Kind=Bundle controller
2023/10/03 01:44:05 [INFO] Starting fleet.cattle.io/v1alpha1, Kind=Cluster controller
2023/10/03 01:44:05 [INFO] Starting management.cattle.io/v3, Kind=FleetWorkspace controller
2023/10/03 01:44:05 [INFO] Starting /v1, Kind=Service controller
2023/10/03 01:44:05 [INFO] Starting apps/v1, Kind=DaemonSet controller
2023/10/03 01:44:05 [INFO] Starting rke.cattle.io/v1, Kind=CustomMachine controller
2023/10/03 01:44:05 [INFO] Starting management.cattle.io/v3, Kind=ManagedChart controller
2023/10/03 01:44:05 [INFO] Starting cluster.x-k8s.io/v1beta1, Kind=MachineDeployment controller
2023/10/03 01:44:08 [INFO] [planner] rkecluster fleet-local/local: waiting: configuring bootstrap node(s) custom-716cb3ba930e: waiting for probes: kube-controller-manager, kube-scheduler
2023/10/03 01:44:08 [INFO] [planner] rkecluster fleet-local/local: waiting: configuring bootstrap node(s) custom-716cb3ba930e: waiting for probes: kube-controller-manager, kube-scheduler
2023/10/03 01:45:22 [INFO] [planner] rkecluster fleet-local/local: waiting: configuring bootstrap node(s) custom-716cb3ba930e: waiting for probes: kube-controller-manager, kube-scheduler
2023/10/03 01:45:55 [INFO] [planner] rkecluster fleet-local/local: waiting: configuring bootstrap node(s) custom-716cb3ba930e: waiting for probes: kube-controller-manager, kube-scheduler
2023/10/03 01:46:10 [ERROR] Error during subscribe websocket: close sent
2023/10/03 01:46:57 [ERROR] Error during subscribe websocket: close sent
2023/10/03 01:48:18 [INFO] Downloading repo index from https://releases.rancher.com/server-charts/stable/index.yaml
2023/10/03 01:48:18 [INFO] Downloading repo index from http://harvester-cluster-repo.cattle-system/charts/index.yaml
g
do you see anything in
rancher-system-agent
logs?
q
nothing.
bounce that service?
g
you could try.. but i doubt it will do anything
q
nope. same thing.
g
i will need to ask someone in the rancher team about this
q
is there a way to kill the promo if i try and join another node?
g
not sure.. i think it will fail eventually
and then it will try another node if there is one
q
okay. well if you wanna reach out to the rancherd team, i'll try and join another machine again. i'll let you know if it doesnt join like i was observing before.
for the grafana issue, it doesnt seem to want to attach the pvc, despite it being healthy and detached.
any reason why i shouldnt shutdown?
@great-bear-19718 anything on this stuck promotion?
g
i have not heard back yet.. i will chase up again
q
@great-bear-19718 no worries. thanks. also i have another node up, still not registering either. for that server, here's the log from rancherd:
Copy code
-- Logs begin at Wed 2023-10-04 21:22:22 UTC. --
Oct 04 21:22:54 harvester-02-r2 rancherd[2368]: time="2023-10-04T21:22:54Z" level=info msg="[stdout]: [INFO]  Successfully downloaded Rancher connection information"
Oct 04 21:22:54 harvester-02-r2 rancherd[2368]: time="2023-10-04T21:22:54Z" level=info msg="[stdout]: [INFO]  systemd: Creating service file"
Oct 04 21:22:54 harvester-02-r2 rancherd[2368]: time="2023-10-04T21:22:54Z" level=info msg="[stdout]: [INFO]  Creating environment file /etc/systemd/system/rancher-system-agent.env"
Oct 04 21:22:54 harvester-02-r2 rancherd[2368]: time="2023-10-04T21:22:54Z" level=info msg="[stdout]: [INFO]  Enabling rancher-system-agent.service"
Oct 04 21:22:54 harvester-02-r2 rancherd[2368]: time="2023-10-04T21:22:54Z" level=info msg="[stderr]: Created symlink /etc/systemd/system/multi-user.target.wants/rancher-system-agent.service β†’ /etc/systemd/system/rancher-system-agent.service."
Oct 04 21:22:54 harvester-02-r2 rancherd[2368]: time="2023-10-04T21:22:54Z" level=info msg="[stdout]: [INFO]  Starting/restarting rancher-system-agent.service"
Oct 04 21:22:54 harvester-02-r2 rancherd[2368]: time="2023-10-04T21:22:54Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20231004-212252-applied.plan/_1"
Oct 04 21:22:54 harvester-02-r2 rancherd[2368]: time="2023-10-04T21:22:54Z" level=info msg="Running command: /usr/bin/rancherd [probe]"
Oct 04 21:22:54 harvester-02-r2 rancherd[2368]: time="2023-10-04T21:22:54Z" level=info msg="[stderr]: time=\"2023-10-04T21:22:54Z\" level=info msg=\"Running probes defined in /var/lib/rancher/rancherd/plan/plan.json\""
Oct 04 21:22:55 harvester-02-r2 rancherd[2368]: time="2023-10-04T21:22:55Z" level=info msg="[stderr]: time=\"2023-10-04T21:22:55Z\" level=info msg=\"Probe [kubelet] is unhealthy\""
here's the status for that machine too:
Copy code
Status:
  Bootstrap Ready:  true
  Conditions:
    Last Transition Time:  2023-10-04T21:22:55Z
    Status:                True
    Type:                  Ready
    Last Transition Time:  2023-10-04T21:22:55Z
    Status:                True
    Type:                  BootstrapReady
    Last Transition Time:  2023-10-04T21:22:54Z
    Status:                True
    Type:                  InfrastructureReady
    Last Transition Time:  2023-10-04T21:22:54Z
    Reason:                WaitingForNodeRef
    Severity:              Info
    Status:                False
    Type:                  NodeHealthy
  Last Updated:            2023-10-04T21:22:55Z
  Observed Generation:     2
  Phase:                   Provisioning
g
i assume
rancher-system-agent
is running on both nodes?
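a quick check on each would be:
Copy code
systemctl status rancher-system-agent
journalctl -u rancher-system-agent -n 50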
q
so two things: 1) i removed the rancher management server from harvester and the cluster in case it was part of the issue. 2) no, i do not see anything for rancher-system-agent running
g
i dont think running rancher is the issue..
q
@great-bear-19718 i wasnt sure it was either, just wanted to take it out of the equation just in case. posted a bundle i pulled last night
@great-bear-19718 any ideas on why i cant join a node or get the 3rd node to promote?
g
i dont have an answer yet.. do you see anything else in
rancher-system-agent
logs?
also are you able to zip up this folder?
/var/lib/rancher/rancherd/plan/
on the node which is waiting to be bootstrapped?
might be best to DM me since it may have some sensitive info about the cluster
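something like this on that node should grab it:
Copy code
sudo tar czf rancherd-plan.tgz -C /var/lib/rancher/rancherd plan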
q
for rancher-system-agent. is that a service on the machine? and if yes, which machine? the node i'm trying to join? or a different one?
i can send you that folder tomorrow, the server is currently offline
@great-bear-19718 i see it's a service, which node do you want it off of?
g
the one which is not joining
there should be a plan folder which should have some info which might help me figure out what is going on
q
okay, i can check it tomorrow then. it's offline right now 😞
g
πŸ‘