# rke2
Hello. Using Rancher to manage Helm/K8s (Docker) deployments, I had been using an RKE1 Triton template to deploy a Debian 11 image that mounted a Triton NFS volume over NFSv3 for permanent storage. After switching to RKE2, the underlying image changed to Ubuntu, the runtime moved from Docker to containerd, and at the same time the NFS version changed to NFSv4.

Our deployment has a production system that takes backups and syncs them to a disaster recovery (DR) system in a different data center. This had been working just fine under RKE1/Docker/Debian/NFSv3. After all of the changes above, however, every few weeks Postgres on the DR system reports after a backup that a `stale file handle` existed in the `$PGDATA/data` directory while trying to apply the WAL changes, and shuts itself down. This causes the pod to restart, which also removes the `standby.signal` file that tells Postgres to start in standby mode, and from that point backups fail to be applied. After manually restoring the database files, the DR system runs fine for several weeks before it happens again.

I have tried adding a few NFSv4 mount options to resolve the issue (`actimeo=0,noac,lookupcache=none`), but so far none of these have worked. Also importantly, trying to force NFSv3 with `nfsvers=3` in the PV for the mount has no effect: NFSv3 is used, but the `stale file handle` still occurs eventually. Might anyone have suggestions or ideas of what might be going on here? Thank you.
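For reference, here is a minimal sketch of the kind of PV spec I've been experimenting with. The name, capacity, server address, and export path below are placeholders rather than our real values; the `mountOptions` reflect the options mentioned above.

```yaml
# Minimal PV sketch (placeholder name, size, server, and path).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pg-dr-data
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    # NFSv4 options tried so far; none resolved the stale handles:
    - actimeo=0
    - noac
    - lookupcache=none
    # Forcing v3 also made no difference:
    # - nfsvers=3
  nfs:
    server: 10.0.0.10        # Triton NFS volume endpoint (placeholder)
    path: /exports/pgdata    # placeholder export path
```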