# k3s
g
Hi! Using K3s 1.30.11+k3s1, I have a master node and an agent node. We are using the embedded registry to allow the agent node to pull images from the master node, as we have a private registry protected by mTLS and don't want to put the mTLS certs on the agent node. This use case seems perfect for the embedded registry with p2p (with Spegel, if I understood correctly). Some images are only used by the agent and are pre-downloaded on the master (by running a dummy pod with a node selector) to make them available through the embedded registry (we made sure to adjust the GC thresholds so that the master does not remove images it does not use itself, to keep them available for the agent). It worked fine for a while, but a few days ago we updated the application running on the cluster and, with it, the tag of some images used by the app. After the image tag change, the agent can no longer pull the image through the embedded registry and I'm getting errors like this:
Jun 18 08:16:55 SECPRDAPP12 k3s[1552863]: E0618 08:16:55.323339 1552863 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"superset\" with ErrImagePull: \"rpc error: code = NotFound desc = failed to pull and unpack image \\\"update.domain.com/rc/app/superset:1.30.0-rc3\\\": failed to resolve reference \\\"update.domain.com/rc/app/superset:1.30.0-rc3\\\": update.domain.com/rc/app/superset:1.30.0-rc3: not found\"" pod="app-data/superset-6476db494f-js5hc" podUID="d9d9b0f6-5886-4de3-b70d-cf2664917c83"
However, the image has been successfully pulled on the master and is still there according to `ctr`:
# ctr -n k8s.io images ls|grep superset
update.domain.com/rc/app/superset:1.30.0-rc3 application/vnd.oci.image.manifest.v1+json sha256:7cdd2ff8134ad0f176f6d63ce93e91b25c74ed127caa20d995e82b6da9575cb0 426.7 MiB linux/amd64 io.cri-containerd.image=managed
update.domain.com/rc/app/superset@sha256:7cdd2ff8134ad0f176f6d63ce93e91b25c74ed127caa20d995e82b6da9575cb0 application/vnd.oci.image.manifest.v1+json sha256:7cdd2ff8134ad0f176f6d63ce93e91b25c74ed127caa20d995e82b6da9575cb0 426.7 MiB linux/amd64
Does anyone have any idea why this would be happening, how I could debug it, and how I could resolve it? Thanks a lot!
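For reference, the pre-pull trick we use is roughly this (pod/namespace/node names below are placeholders, not our exact values):
```
# Dummy pod pinned to the master so containerd pulls the image there and the
# embedded registry can then serve it to the agent. Names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: prepull-superset
  namespace: app-data
spec:
  nodeSelector:
    kubernetes.io/hostname: master-node   # placeholder for the master's hostname
  containers:
    - name: prepull
      image: update.domain.com/rc/app/superset:1.30.0-rc3
      command: ["sleep", "infinity"]
```
and the GC thresholds are raised via kubelet args on the master (e.g. `--kubelet-arg=image-gc-high-threshold=...`) so the image is not garbage collected even though nothing on the master uses it.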
c
you’d need to run the server with `--debug` or `debug: true` in the config and check the logs for messages from spegel
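e.g. something like this on the server (assuming the default config path):
```
# /etc/rancher/k3s/config.yaml
debug: true
```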
running the agent with debug enabled and doing `crictl pull update.domain.com/rc/app/superset:1.30.0-rc3` might also show some relevant logs
what do you mean by "the tag of some images used by the app" though? Are you mutating the tag, i.e. changing the content that it refers to?
g
The app is deployed using Helm charts. The newer version of the chart contains image tags in its values that correspond to the version of the app we are deploying. So when we did the most recent update, we went from 1.30.0-rc2 to 1.30.0-rc3 on all the in-house images.
Regarding the debug output, I will try to activate it, but since this is a production server it's not always easy to restart the k3s process mid-day.
Should I only activate the debug on the master node or will there also be interesting messages in the agent's logs?
Activated `--debug` on the master, and when I `crictl pull` from the agent, I still get the `NotFound` error but nothing in the master's logs. Worth noting, I have `--disable-default-registry-endpoint` set on the agent to force everything to go through the embedded registry, and my registries.yaml looks like this:
```
mirrors:
  update.domain.com:
  docker.io:
  public.ecr.aws:
  quay.io:
  registry.k8s.io:
```
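For completeness, the relevant parts of our k3s config are roughly this (simplified; the embedded registry itself is enabled on the server):
```
# /etc/rancher/k3s/config.yaml on the server (simplified)
embedded-registry: true

# /etc/rancher/k3s/config.yaml on the agent (simplified)
disable-default-registry-endpoint: true
```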
I just got the info from one of my colleagues that the p2p registry might never have worked in this environment. We tested the same deployment on AWS with no issues, but this particular deployment is on Outscale, which might have specificities that AWS does not 🤔 The ports are open and accessible, but maybe there is another network requirement that is not immediately obvious?
I'm seeing messages like this on the agent node, maybe this has something to do with it?
Jun 18 13:21:11 SECPRDAPP12 k3s[3885758]: time="2025-06-18T13:21:11Z" level=info msg="spegel 2025/06/18 13:21:11 p2p: \"msg\"=\"could not get bootstrap addresses\" \"error\"=\"client not ready\""
I can also see this in the master's logs:
Jun 18 14:59:56 ip-10-201-101-11 k3s[1901844]: 2025-06-18T14:59:56.698Z WARN dht/RtRefreshManager rtrefresh/rt_refresh_manager.go:233 failed when refreshing routing table {"error": "2 errors occurred:\n\t* failed to query for self, err=failed to find any peer in table\n\t* failed to refresh cpl=0, err=failed to find any peer in table\n\n"}
as well as in the agent's logs:
Jun 18 14:57:11 SECPRDAPP12 k3s[3885758]: 2025-06-18T14:57:11.078Z WARN dht/RtRefreshManager rtrefresh/rt_refresh_manager.go:233 failed when refreshing routing table {"error": "2 errors occurred:\n\t* failed to query for self, err=failed to find any peer in table\n\t* failed to refresh cpl=0, err=failed to find any peer in table\n\n"}
c
Yeah this node has no connected peers. That would definitely break mirroring
You need to make sure each node is reachable from other nodes at the address listed in its `p2p.k3s.cattle.io/node-address` annotation. If they are not routable to each other, or the p2p or registry ports are blocked, mirroring will not work. If a node does not have that annotation, then the embedded registry isn’t enabled on it.
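You can check it with something like this (the dots in the annotation key need escaping in jsonpath):
```
# list the p2p address annotation for every node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.p2p\.k3s\.cattle\.io/node-address}{"\n"}{end}'

# or just
kubectl describe node <node-name> | grep p2p.k3s.cattle.io
```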
g
Great, thanks for the info, I will check that!
After checking, the master node does have the annotation but the agent node does not. They can both communicate over TCP 5001. I have a tcpdump running on that port, but no packets are exchanged between the two machines. Are there any other network requirements for this to work? How is node discovery done? Just by looking at the node IPs from the cluster info?
Are the hostnames used for communication at all or only IP addresses? (so that I can also check name resolution if needed)
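For reference, here is roughly what I'm running to check connectivity (addresses are placeholders):
```
# on each node: is the p2p listener up?
ss -tlnp | grep 5001

# from the agent: can the master's p2p and registry mirror ports be reached?
nc -vz <master-ip> 5001
nc -vz <master-ip> 6443

# watch for p2p traffic on either node
tcpdump -ni any tcp port 5001
```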
c
IPs. If the agent doesn't have it set then it's not running spegel. Are you sure you configured registries.yaml on the agent? That is node-specific; you need to make sure it is set properly on every node.
If you don't have any registries enabled for mirroring then spegel won't even start on that node and the annotation won't be set.
g
Yes, I'm sure the registries.yaml is there and not empty. I tried specifying the registries directly and using the wildcard; neither worked. I use the exact same setup on another env and it works. Also, Spegel does start, as I see these messages:
Jun 18 10:24:14 SECPRDAPP12 k3s[3657753]: time="2025-06-18T10:24:14Z" level=info msg="Starting distributed registry mirror at https://10.51.1.12:6443/v2 for registries [update.domain.com docker.io public.ecr.aws quay.io registry.k8s.io]"
Jun 18 10:24:14 SECPRDAPP12 k3s[3657753]: time="2025-06-18T10:24:14Z" level=info msg="Starting distributed registry P2P node at 10.51.1.12:5001"
Or this one with the wildcard:
Jun 19 13:07:32 SECPRDAPP12 k3s[2281683]: time="2025-06-19T13:07:32Z" level=info msg="Starting distributed registry mirror at https://10.51.1.12:6443/v2 for registries [*]"
Jun 19 13:07:32 SECPRDAPP12 k3s[2281683]: time="2025-06-19T13:07:32Z" level=info msg="Starting distributed registry P2P node at 10.51.1.12:5001"
c
Hmm, maybe only servers set that annotation, it's been a minute since I touched that. The debug logs should show it trying to find peers for the content when the pull happens.
g
Given the other messages, I'm not confident in this working on a pull. Currently, if I try a crictl pull, here is what I get in the logs:
Jun 20 07:42:59 SECPRDAPP12 k3s[15040]: time="2025-06-20T07:42:59Z" level=info msg="spegel 2025/06/20 07:42:59 \"msg\"=\"\" \"error\"=\"mirror with image component update.domain.com/rc/app/superset:1.30.0-rc3 could not be found\" \"path\"=\"/v2/rc/app/superset/manifests/1.30.0-rc3\" \"status\"=404 \"method\"=\"HEAD\" \"latency\"=\"769.416µs\" \"ip\"=\"127.0.0.1\" \"handler\"=\"mirror\""
With this in registries.yaml:
```
mirrors:
  "*":
```
c
g
I did and had subscribed to notifications on this thread, but unfortunately my problem persists. No peer is ever found in my case.
c
what DO you see? Do you see another node publishing the key? Do you see the nodes connecting to each other?
g
Not much more than what I pasted here. The nodes are joined as far as k3s is concerned and pods can be scheduled on any of them, but the embedded registry is not usable from the agent node. I can see from the logs (see previous messages) that Spegel appears to be started, I can see that there is a daemon listening on port 5001 on the agent, I can see rtrefresh WARN messages about not finding peers in the table (master and agent), I can see the expected labels/annotations on the master node but not on the agent node, and I can see that TCP/5001 traffic is open between the master and agent.
c
can you share the full logs from both nodes, with debug enabled?
g
Sure, from a clean restart?
I'm sending you the logs in a DM for privacy reasons. I have removed the original app name and domain for privacy as well; otherwise no changes. Thank you again very much for your help and time!
b
I just tried out the embedded registry and it doesn't work for me at all. I was getting pull errors on all non-docker.io images (the docker.io ones were configured to use mirror.gcr.io). I'll try to repro it on a test cluster and feed back more logs. A sample pull error looks like this:
time="2025-06-23T10:03:54.825176653+01:00" level=error msg="PullImage \"<http://registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\|registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\>" failed" error="rpc error: code = NotFound desc = failed to pull and unpack image \"<http://registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\|registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\>": failed to resolve reference \"<http://registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\|registry.k8s.io/sig-storage/csi-provisioner:v5.0.2\>": <http://registry.k8s.io/sig-storage/csi-provisioner:v5.0.2|registry.k8s.io/sig-storage/csi-provisioner:v5.0.2>: not found"
g
BTW, we found our issue: it was (probably) a name resolution issue with the `K3S_URL`, which was using the wrong machine name. We set it to use the machine's IP (or correct hostname) and it worked.
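Concretely, on the agent we just fixed the env file the install script had created (placeholder values; the path may differ depending on how the agent was installed):
```
# /etc/systemd/system/k3s-agent.service.env
K3S_URL=https://<master-ip>:6443
K3S_TOKEN=<node-token>

# then: systemctl restart k3s-agent
```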
b
'K3S_URL' being the API address?
g
Indeed, the master's address
b
Interesting, I'll have to check, but as far as I remember it is correct, since new nodes are joining just fine and this setup has been in place for ages. I actually haven't reproduced this issue on my other test cluster, so I just took out the "*" from the mirrors config. I was only trying to work around the Docker registry throttling and figured I'd try the embedded registry to reduce unnecessary pulls. I redirected the Docker registry to the GCP mirror and the rest works directly.
g
It is worth mentioning that we did not have joining issues either; it was only the embedded registry that had issues.