# k3s
a
Error in journalctl on one of the workers is:
```
Mar 20 12:05:36 ip-10-200-1-76 k3s[4305]: E0320 12:05:36.246717    4305 remote_runtime.go:193] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"rancher/mirrored-pause:3.6\": failed to pull image \"rancher/mirrored-pause:3.6\": failed to pull and unpack image \"docker.io/rancher/mirrored-pause:3.6\": failed to resolve reference \"docker.io/rancher/mirrored-pause:3.6\": unexpected status from HEAD request to https://127.0.0.1:6443/v2/rancher/mirrored-pause/manifests/3.6?ns=docker.io: 500 Internal Server Error"
```
c
Check the logs for other messages from Spegel / libp2p to confirm that it's able to connect to the p2p mesh to discover images from other nodes. You're sure the p2p ports are open? You've configured the registries.yaml identically on all nodes?
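(For context, a minimal sketch of the setup being discussed. File paths are the k3s defaults; the exact set of mirrored registries here is an assumption, not taken from the thread:)

```yaml
# /etc/rancher/k3s/config.yaml on the servers (sketch; the embedded
# Spegel mirror can also be enabled with the --embedded-registry flag)
embedded-registry: true

# /etc/rancher/k3s/registries.yaml, identical on every node -- listing
# a registry under mirrors with no endpoints tells k3s to serve it from
# the p2p mesh, which is why pulls are redirected to 127.0.0.1:6443
mirrors:
  docker.io:
  registry.k8s.io:
```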
a
Hmm, no logs at all on any of the nodes for Spegel or libp2p 🤔 All ports are open between the hosts, and registries.yaml is identical across all of them 😞
c
Are you sure you’re looking in the right place? grep for `dht` in the k3s/k3s-agent logs.
You can add `debug: true` to the config to enable additional logging.
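(A sketch of that change, assuming the default config path; after restarting the service, something like `journalctl -u k3s-agent | grep dht` should show the extra output:)

```yaml
# /etc/rancher/k3s/config.yaml (server), or the agent's config.yaml --
# restart k3s / k3s-agent for the change to take effect
debug: true
```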
a
Ahh I was looking for Spegel and libp2p... my bad! `dht` warnings every 10 mins though!
```
Mar 21 07:45:00 ip-10-200-1-24 k3s[4293]: 2024-03-21T07:45:00.857Z        WARN        dht/RtRefreshManager        rtrefresh/rt_refresh_manager.go:233        failed when refreshing routing table        {"error": "2 errors occurred:\n\t* failed to query for self, err=failed to find any peer in table\n\t* failed to refresh cpl=0, err=failed to find any peer in table\n\n"}
```
c
looks like it can’t connect to the other nodes on the DHT port (5001). You’re sure that’s open?
if you enable debug you should see the connection attempts
a
I can see one of these logs every so often on the controlplane:
```
Mar 21 09:55:05 ip-10-200-1-243 k3s[66500]: 2024-03-21T09:55:05.229Z        DEBUG        dht        go-libp2p-kad-dht@v0.25.2/routing.go:397        providing        {"cid": "bafkreiabxdoaa3ceibgvpy2mcazwu47ndya6xeh6sjrqtezqhf7jpzeo4a", "mh": "bciqadog4abweiqcnk7ruyebtnjz62hqb5oip5etdbgjtaol6s7si5ya"}
```
And one of these every so often on the agents:
```
Mar 21 09:57:04 ip-10-200-1-76 k3s[60114]: 2024-03-21T09:57:04.842Z        DEBUG        dht        go-libp2p-kad-dht@v0.25.2/routing.go:510        finding providers        {"cid": "bafkreig5lsn4dh3jstov5pvfqvqqjhjp2yq26zry5rk2maeqqxt5efsgee", "mh": "bciqn2xe3ygpwtfg5l27klblbasos7vrbv5tdr3cvuyajbbph2ilemii"}
```
And that's it? 😞
The port is open - can I view the config of DHT anywhere? Maybe hostnames/IPs are wrong? 🤔
Hmm maybe an internal CA issue?
```
Mar 21 10:11:01 ip-10-200-1-76 k3s[61080]: time="2024-03-21T10:11:01Z" level=info msg="spegel 2024/03/21 10:11:01 p2p: \"msg\"=\"could not get bootstrap addresses\" \"error\"=\"CA cert validation failed: Get \\\"https://ip-10-200-1-243.eu-west-2.compute.internal:6443/cacerts\\\": tls: failed to verify certificate: x509: certificate is valid for ip-10-200-1-243, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, localhost, not ip-10-200-1-243.eu-west-2.compute.internal\""
```
I added `tls-san` to my control plane's config.yaml and that fixed it! 😄 Thanks for your help @creamy-pencil-82913!
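(A sketch of that fix; the FQDN is the one the agent was rejecting in the log above. The server should regenerate its serving certificate with the extra SAN after a restart:)

```yaml
# /etc/rancher/k3s/config.yaml on the control plane (sketch)
tls-san:
  - ip-10-200-1-243.eu-west-2.compute.internal
```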
c
ah that’s interesting, we must be using node hostname instead of node name somewhere…