# general
On triggering a snapshot of a managed cluster, Rancher shows a popup with some HTML garbage and "Request Entity Too Large". https://ranchermanager.docs.rancher.com/troubleshooting/kubernetes-components/troubleshooting-nginx-proxy and the other troubleshooting pages don't explain this. I've got several other clusters managed in Rancher (some are much larger), so how come I ran into nginx's proxy-body-size limit for this cluster?
Larger cluster spec probably?
It's not necessarily how many nodes or pods are in the cluster, it's the size of the cluster resource itself when it gets modified to trigger the snapshot.
How can I check the size of the cluster resource? Shouldn't nginx accept at least 10x the size of an average cluster resource by default? I got the same red popup for the upgrade from rke2 1.28.8 to 1.28.9 when choosing the new version in the selection box. If I edit the YAML in the web GUI instead, simply replacing "1.28.8" with "1.28.9", there is no popup on saving the new version and the upgrade succeeds. So I would guess the web GUI is to blame here.
If you're using Chrome you can hit F12 and pull up the Network tab before making the change, to see exactly what it's posting to the server and how large the request is...
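As a rough alternative sketch (assuming kubectl access to the Rancher local cluster, and that the provisioning Cluster object lives in the usual fleet-default namespace; <cluster-name> is just a placeholder), you can also measure the stored cluster object itself:
```
# Dump the provisioning Cluster object and count its size in bytes.
# <cluster-name> is a placeholder for the affected downstream cluster's name.
kubectl -n fleet-default get clusters.provisioning.cattle.io <cluster-name> -o yaml | wc -c
```
Keep in mind the request the UI sends can still be larger than that, since it may bundle related resources.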
Hi @creamy-pencil-82913, thank you for the hint. Content-Length is 1996145 for a [Snapshot Now] on the buggy cluster. On the neighboring cluster the Content-Length is just 52508 for the same snapshot button. Wouldn't you agree that this looks weird?
Looking at the data, there are a bazillion snapshot entries in the JSON it tries to transfer:
```
:
        "state": "active",
        "message": "Resource is current"
      },
      {
        "toId": "fleet-default/extkube001-etcd-snapshot-node01.dmz.aixigo.de-1708488004-s3",
        "toType": "rke.cattle.io.etcdsnapshot",
        "rel": "owner",
        "state": "active",
        "message": "Resource is current"
      },
      {
        "toId": "fleet-default/extkube001-etcd-snapshot-node01.dmz.aixigo.de-1708617605-s3",
        "toType": "rke.cattle.io.etcdsnapshot",
:
```
The interesting part is that "backup snapshots to S3" is disabled, see the attached screenshot. Nevertheless, there are 162 days of failed hourly backups somewhere in Rancher's database, all with 0 bytes written. I can see them in the snapshots overview, listed under "S3". Each of them carries an error message:
```
failed to test for existence of bucket rancher02: Head "https://minio.ac.aixigo.de:9010/rancher02/": dial tcp 172.19.96.219:9010: connect: connection refused
```
The ECONNREFUSED is expected: I had configured an internal S3 target similar to other internal clusters, but this cluster runs on another network, so I disabled S3 storage again. The question is, how can I get rid of these failed zombie backups to an S3 bucket that doesn't exist? I tried editing the rke2-etcd-snapshots ConfigMap, but that didn't help; the entries came back.
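The entries in the JSON above look like ordinary namespaced objects in the Rancher local cluster, so they can at least be listed there. A rough sketch, assuming extkube001 is the cluster name as in the snapshot IDs above; whether deleting them actually sticks, or whether Rancher just recreates them from the cluster status, is exactly what I'm unsure about:
```
# List the ETCDSnapshot objects Rancher tracks for this cluster (run against the local cluster).
# The "-s3" suffix marks the failed S3 entries seen in the JSON above.
kubectl -n fleet-default get etcdsnapshots.rke.cattle.io | grep 'extkube001-etcd-snapshot-.*-s3'

# Deleting a single entry is possible, but it is unclear whether it stays gone.
# <snapshot-name> is a placeholder for one of the names listed above.
kubectl -n fleet-default delete etcdsnapshots.rke.cattle.io <snapshot-name>
```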
https://github.com/rancher/rancher/issues/45664. Some advice on how to get rid of the unwanted error entries in the cluster spec would be very welcome.
@creamy-pencil-82913, is there maybe some way to reset the snapshot database to get rid of the unwanted entries?