# k3s
g
Hi! Using K3s 1.30.11+k3s1, I have a master node and an agent node. We are using the embedded registry to allow the agent node to pull images from the master node, as we have a private registry protected by mTLS and don't want to put the mTLS certs on the agent node. This use case seems perfect for the embedded registry with p2p (with Spegel, if I understood correctly). Some images are only used by the agent and are pre-downloaded on the master (by running a dummy pod with a node selector) to make them available through the embedded registry (we made sure to adjust the GC thresholds so that the master does not remove images it does not use itself, to keep them available for the agent). It worked fine for a while, but a few days ago we updated the application running on the cluster and, with it, the tag of some images used by the app. After the image tag change, the agent can no longer pull the image through the embedded registry and I'm getting errors like this:
Jun 18 08:16:55 SECPRDAPP12 k3s[1552863]: E0618 08:16:55.323339 1552863 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"superset\" with ErrImagePull: \"rpc error: code = NotFound desc = failed to pull and unpack image \\\"update.domain.com/rc/app/superset:1.30.0-rc3\\\": failed to resolve reference \\\"update.domain.com/rc/app/superset:1.30.0-rc3\\\": update.domain.com/rc/app/superset:1.30.0-rc3: not found\"" pod="app-data/superset-6476db494f-js5hc" podUID="d9d9b0f6-5886-4de3-b70d-cf2664917c83"
However, the image has been successfully pulled on the master and is still there according to `ctr`:
# ctr -n k8s.io images ls|grep superset
update.domain.com/rc/app/superset:1.30.0-rc3 application/vnd.oci.image.manifest.v1+json sha256:7cdd2ff8134ad0f176f6d63ce93e91b25c74ed127caa20d995e82b6da9575cb0 426.7 MiB linux/amd64 io.cri-containerd.image=managed
update.domain.com/rc/app/superset@sha256:7cdd2ff8134ad0f176f6d63ce93e91b25c74ed127caa20d995e82b6da9575cb0 application/vnd.oci.image.manifest.v1+json sha256:7cdd2ff8134ad0f176f6d63ce93e91b25c74ed127caa20d995e82b6da9575cb0 426.7 MiB linux/amd64
Does anyone have any idea why this would be happening, how I could debug it, and how I could resolve it? Thanks a lot!
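For reference, the pre-pull trick we use is roughly this (pod/namespace/node names below are placeholders, not our exact values):
```
# Dummy pod pinned to the master so containerd pulls the image there and the
# embedded registry can then serve it to the agent. Names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: prepull-superset
  namespace: app-data
spec:
  nodeSelector:
    kubernetes.io/hostname: master-node   # placeholder for the master's hostname
  containers:
    - name: prepull
      image: update.domain.com/rc/app/superset:1.30.0-rc3
      command: ["sleep", "infinity"]
```
and the GC thresholds are raised via kubelet args on the master (e.g. `--kubelet-arg=image-gc-high-threshold=...`) so the image is not garbage collected even though nothing on the master uses it.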
c
you’d need to run the server with `--debug` or `debug: true` in the config and check the logs for messages from spegel
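e.g. something like this on the server (assuming the default config path):
```
# /etc/rancher/k3s/config.yaml
debug: true
```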
running the agent with debug enabled and doing `crictl pull update.domain.com/rc/app/superset:1.30.0-rc3` might also show some relevant logs
what do you mean by "the tag of some images used by the app" though? Are you mutating the tag, i.e. changing the content that it refers to?
g
The app is deployed using Helm charts. The newer version of the chart contains image tags in its values that correspond to the version of the app we are deploying. So when we did the most recent update, we went from 1.30.0-rc2 to 1.30.0-rc3 on all the in-house images.
Regarding the debug output, I will try to activate it, but since this is a production server it's not always easy to restart the k3s process mid-day.
Should I only activate the debug on the master node or will there also be interesting messages in the agent's logs?
Activated `--debug` on the master, and when I `crictl pull` from the agent, I still get the `NotFound` error but nothing in the master's logs. Worth noting, I have `--disable-default-registry-endpoint` set on the agent to force everything to go through the embedded registry, and my registries.yaml looks like this:
```
mirrors:
  update.domain.com:
  docker.io:
  public.ecr.aws:
  quay.io:
  registry.k8s.io:
```
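For completeness, the relevant parts of our k3s config are roughly this (simplified; the embedded registry itself is enabled on the server):
```
# /etc/rancher/k3s/config.yaml on the server (simplified)
embedded-registry: true

# /etc/rancher/k3s/config.yaml on the agent (simplified)
disable-default-registry-endpoint: true
```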
I just got the info from one of my colleagues that the p2p registry might never have worked in this environment. We tested the same deployment on AWS with no issues, but this particular deployment is on Outscale, which might have specificities that AWS does not 🤔 The ports are open and accessible, but maybe there is another network requirement that is not immediately obvious?
I'm seeing messages like this on the agent node, maybe this has something to do with it?
Jun 18 13:21:11 SECPRDAPP12 k3s[3885758]: time="2025-06-18T13:21:11Z" level=info msg="spegel 2025/06/18 13:21:11 p2p: \"msg\"=\"could not get bootstrap addresses\" \"error\"=\"client not ready\""
I can also see this in the master's logs:
Jun 18 14:59:56 ip-10-201-101-11 k3s[1901844]: 2025-06-18T14:59:56.698Z WARN dht/RtRefreshManager rtrefresh/rt_refresh_manager.go:233 failed when refreshing routing table {"error": "2 errors occurred:\n\t* failed to query for self, err=failed to find any peer in table\n\t* failed to refresh cpl=0, err=failed to find any peer in table\n\n"}
as well as in the agent's logs:
Jun 18 14:57:11 SECPRDAPP12 k3s[3885758]: 2025-06-18T14:57:11.078Z WARN dht/RtRefreshManager rtrefresh/rt_refresh_manager.go:233 failed when refreshing routing table {"error": "2 errors occurred:\n\t* failed to query for self, err=failed to find any peer in table\n\t* failed to refresh cpl=0, err=failed to find any peer in table\n\n"}
c
Yeah this node has no connected peers. That would definitely break mirroring
You need to make sure each node is reachable from other nodes at the address listed in its `p2p.k3s.cattle.io/node-address` annotation. If they are not routable to each other, or the p2p or registry ports are blocked, mirroring will not work. If a node does not have that annotation, then the embedded registry isn’t enabled on it.
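You can check it with something like this (the dots in the annotation key need escaping in jsonpath):
```
# list the p2p address annotation for every node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.p2p\.k3s\.cattle\.io/node-address}{"\n"}{end}'

# or just
kubectl describe node <node-name> | grep p2p.k3s.cattle.io
```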
g
Great, thanks for the info, I will check that!
After checking, the master node does have the annotation but the agent node does not. They can both communicate over TCP 5001. I have a tcpdump running on that port, but no packets are exchanged between the two machines. Are there any other network requirements for this to work? How is node discovery done? Just by looking at the node IPs from the cluster info?
Are the hostnames used for communication at all or only IP addresses? (so that I can also check name resolution if needed)
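For reference, here is roughly what I'm running to check connectivity (addresses are placeholders):
```
# on each node: is the p2p listener up?
ss -tlnp | grep 5001

# from the agent: can the master's p2p and registry mirror ports be reached?
nc -vz <master-ip> 5001
nc -vz <master-ip> 6443

# watch for p2p traffic on either node
tcpdump -ni any tcp port 5001
```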
c
IPs. If the agent doesn't have it set then it's not running spegel. Are you sure you configured registries.yaml on the agent? That is node-specific; you need to make sure it is set properly on every node.
If you don't have any registries enabled for mirroring then spegel won't even start on that node and the annotation won't be set.
g
Yes, I'm sure the registries.yaml is there and not empty. I tried specifying the registries directly and using the wildcard; neither worked. I use the exact same setup on another env and it works. Also, Spegel does start, as I see these messages:
Jun 18 10:24:14 SECPRDAPP12 k3s[3657753]: time="2025-06-18T10:24:14Z" level=info msg="Starting distributed registry mirror at https://10.51.1.12:6443/v2 for registries [update.domain.com docker.io public.ecr.aws quay.io registry.k8s.io]"
Jun 18 10:24:14 SECPRDAPP12 k3s[3657753]: time="2025-06-18T10:24:14Z" level=info msg="Starting distributed registry P2P node at 10.51.1.12:5001"
Or this one with the wildcard:
Jun 19 13:07:32 SECPRDAPP12 k3s[2281683]: time="2025-06-19T13:07:32Z" level=info msg="Starting distributed registry mirror at https://10.51.1.12:6443/v2 for registries [*]"
Jun 19 13:07:32 SECPRDAPP12 k3s[2281683]: time="2025-06-19T13:07:32Z" level=info msg="Starting distributed registry P2P node at 10.51.1.12:5001"
c
Hmm, maybe only servers set that annotation, it's been a minute since I touched that. The debug logs should show it trying to find peers for the content when the pull happens.
g
Given the other messages, I'm not confident in this working on a pull. Currently, if I try a crictl pull, here is what I get in the logs:
Jun 20 07:42:59 SECPRDAPP12 k3s[15040]: time="2025-06-20T07:42:59Z" level=info msg="spegel 2025/06/20 07:42:59 \"msg\"=\"\" \"error\"=\"mirror with image component update.domain.com/rc/app/superset:1.30.0-rc3 could not be found\" \"path\"=\"/v2/rc/app/superset/manifests/1.30.0-rc3\" \"status\"=404 \"method\"=\"HEAD\" \"latency\"=\"769.416µs\" \"ip\"=\"127.0.0.1\" \"handler\"=\"mirror\""
With this in registries.yaml:
```
mirrors:
  "*":
```
c
g
I did and had subscribed to notifications on this thread, but unfortunately my problem persists. No peer is ever found in my case.
c
what DO you see? Do you see another node publishing the key? Do you see the nodes connecting to each other?
g
Not much more than what I pasted here. The nodes are joined as far as k3s is concerned and pods can be scheduled on any of them, but the embedded registry is not usable from the agent node. I can see from the logs (see previous messages) that Spegel appears to be started, I can see that there is a daemon listening on port 5001 on the agent, I can see rtrefresh WARN messages about not finding peers in the table (master and agent), I can see the expected labels/annotations on the master node but not on the agent node, and I can see that TCP/5001 traffic is open between the master and agent.
c
can you share the full logs from both nodes, with debug enabled?
g
Sure, from a clean restart?
I'm sending you the logs in a DM for privacy reasons. I have removed the original app name and domain for privacy as well; otherwise no changes. Thank you again very much for your help and time!
b
I just tried out the embedded registry and it doesn't work for me at all. I was getting pull errors on all non-docker.io images (the docker.io ones were configured to use mirror.gcr.io). I'll try to repro it on a test cluster and feed back more logs. A sample pull error looks like this:
time="2025-06-23T10:03:54.825176653+01:00" level=error msg="PullImage \"<http://registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\|registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\>" failed" error="rpc error: code = NotFound desc = failed to pull and unpack image \"<http://registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\|registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\>": failed to resolve reference \"<http://registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\|registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\>": <http://registry.k8s.io/sig-storage/csi-provisioner:v5.0.2|registry.k8s.io/sig-storage/csi-provisioner:v5.0.2>: not found"
g
BTW, we found our issue: it was (probably) a name resolution issue with the `K3S_URL`, which was using the wrong machine name. We set it to use the machine's IP (or correct hostname) and it worked.
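Concretely, on the agent we just fixed the env file the install script had created (placeholder values; the path may differ depending on how the agent was installed):
```
# /etc/systemd/system/k3s-agent.service.env
K3S_URL=https://<master-ip>:6443
K3S_TOKEN=<node-token>

# then: systemctl restart k3s-agent
```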
b
'K3S_URL' being the API address?
g
Indeed, the master's address
b
Interesting, I'll have to check, but as far as I remember it is correct, since new nodes are joining just fine and this setup has been in place for ages. I actually haven't reproduced this issue on my other test cluster, so I just took out the "*" from the mirrors config. I was only trying to work around the Docker registry throttling and figured I'd try the embedded registry to reduce unnecessary pulls. I redirected the Docker registry to the GCP mirror and the rest works directly.
g
It is worth mentioning that we did not have joining issues either; it was only the embedded registry that had issues.