# k3s
a
This is a 3 node k3s cluster with an external RDS db
When I run `kubectl get deployments -n kube-system` I see:
```
local-path-provisioner   0/1
metrics-server           0/1
coredns                  0/1
```
f
Looks like your `kube-dns` service may already be listening on `10.43.0.10`
There's also something about a bad/dead Bearer token; it seems like it may go away later, possibly linked to the already-provisioned service / IP.
That's probably the best shot I have. I'd suggest doing a reboot / fresh check of your pods to make sure things are starting correctly.
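As a minimal sketch of that kind of fresh check, assuming the stock k3s deployment names from the output above (the `k8s-app=kube-dns` label is the usual CoreDNS selector and is an assumption about this install):
```sh
# Quick health pass over kube-system before reaching for a node reboot.
kubectl get pods -n kube-system -o wide
kubectl describe pod -n kube-system -l k8s-app=kube-dns

# A rollout restart is a lighter touch than rebooting the node.
kubectl rollout restart deployment/coredns -n kube-system
kubectl rollout restart deployment/metrics-server -n kube-system
```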
a
Yea, actually I just saw that by doing `kubectl get service --all-namespaces` and the kube-dns service is running
f
Might be an old pod you saw before (in the initial message)
a
running and listening on that IP
f
What about a quick `kubectl get po -A`? That should show everything up by now, I'd assume.
a
That shows the fleet-controller and metrics-server in a CrashLoopBackOff. Then it shows these as Running:
```
svclb-traefik
gitjob
rancher-webhook
fleet-agent
```
It shows Rancher as Running, but with 0/1 ready
f
Check the `fleet-controller` pod logs if you can.
Might need to start poking each pod on that node with a reboot and make sure it comes back up. Ideally it should self-heal, though, if you still have a proper cluster state (assuming 3+ nodes).
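For reference, a rough sketch of pulling those logs; the `cattle-fleet-system` namespace is an assumption (it's where Rancher's Fleet components usually land), so locate the pod first:
```sh
# Find the fleet pods and their actual namespace rather than assuming it.
kubectl get pods -A | grep fleet

# Pull the logs; --previous shows the container that just crashed,
# which is what you want for a CrashLoopBackOff.
kubectl logs -n cattle-fleet-system <fleet-controller-pod> --tail=100
kubectl logs -n cattle-fleet-system <fleet-controller-pod> --previous
```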
a
Weird, the `fleet-controller` log shows:
```
Error: Unauthorized
Usage:
  fleet-manager [flags]
...
time="<timestamp>" level=fatal msg=Unauthorized
```
I thought the usage statement was an error with my command at first lol
f
Weird. If that is stdout from the service, maybe it's being called wrong by the other containers, and then possibly that one node is broken? At this point I'd take it out of the cluster and re-provision it into the cluster again. I don't poke around too much at the system level; I just make sure I can re-provision quickly.
a
We keep our manager nodes in an autoscale group in AWS, and it looks like all 3 were replaced within a few hours of each other overnight a few days ago. I have no idea what could have happened; this is in our air-gapped network. So there are 3 nodes right now that all show as healthy. I did remove one and add another about 45 minutes ago and it joined fine, but it's showing the same error as the one before. All 3 show the same messages from journalctl.
Would it be easier to troubleshoot if I brought the cluster down to just 1 node?
I was hoping adding the new node would clear things up but it didn't. We run the k3s install.sh script from our user-data script
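For context, a server user-data for that kind of setup looks roughly like the sketch below; the endpoint, credentials, token, and URL are all placeholders (an air-gapped install would pull install.sh from an internal mirror rather than get.k3s.io). One thing worth checking: with an external datastore, every server has to join with the same `--token`, since k3s uses it to protect the bootstrap data stored in the DB.
```sh
#!/bin/bash
# Hypothetical user-data sketch -- endpoint, credentials, and token are placeholders.
# In an air-gapped network, a local copy of install.sh / mirror settings would
# replace the public URL below.
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint="postgres://<user>:<pass>@<rds-endpoint>:5432/<dbname>" \
  --token="<shared-cluster-token>" \
  --tls-san="<api-load-balancer-dns>"
```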
I just deleted the node that those pods were running on, it tried starting them on a new node but it's showing the same errors
I'm tempted to delete all of my nodes, restore my DB from backups, and bring up a new node. But I'd prefer to get this sorted the right way
The log from the fleet-agent pod says "Failed to register agent: Unauthorized"
I brought the cluster down to just 1 node and then rebooted that node so all pods should start fresh, same errors though
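Given that "Failed to register agent: Unauthorized", one place worth looking is the credentials the agent registers with, since restarting pods won't refresh a bad token. A rough sketch, with the namespaces as assumptions (on the cluster that runs Rancher itself the agent usually sits in `cattle-fleet-local-system`):
```sh
# Locate the agent and its namespace rather than assuming it.
kubectl get pods -A | grep fleet-agent

# Recent agent logs around the registration attempt.
kubectl logs -n cattle-fleet-local-system <fleet-agent-pod> --tail=50

# The agent authenticates with secrets in its namespace; if those predate the
# node replacement they may simply no longer be accepted.
kubectl get secrets -n cattle-fleet-local-system
kubectl get secrets -n cattle-fleet-system
```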