
acceptable-printer-7134

01/24/2023, 1:16 PM
API server is overwhelmed and we're seeing intermittent timeouts, apparently due to rancher. Any pointers on how to solve this? K8s: EKS 1.21+, Rancher: 2.7
we see some rancher reconciliation errors related to grafana
level=error msg="error syncing 'monitoring/sh.helm.release.v1.grafana.v49': handler helm-app-secret: failed to create monitoring/grafana catalog.cattle.io/v1, Kind=App for helm-app monitoring/sh.helm.release.v1.grafana.v49: etcdserver: request is too large, requeuing"
how can we avoid this reconciliation by the rancher agent?
This is affecting our prod cluster. @fast-piano-59234 any pointers? or anyone from rancher?
trying my luck again here if anyone from rancher can help
level=error msg="error syncing 'monitoring/sh.helm.release.v1.grafana.v49': handler helm-app-secret: failed to create monitoring/grafana catalog.cattle.io/v1, Kind=App for helm-app monitoring/sh.helm.release.v1.grafana.v49: etcdserver: request is too large, requeuing"
not sure what cluster-agent is trying to do - we still have that problem.
sorry for tagging you @fast-piano-59234 but we're really stuck with this. can you please help? is there a way we can stop rancher-cluster-agent from performing this sync?
posted this 14 days back, still no response from Rancher. Not sure if we have anyone from rancher still supporting this channel? it's causing load on the API Server.

witty-honey-18052

02/07/2023, 2:30 PM
have you tried uninstalling the monitoring?

acceptable-printer-7134

02/07/2023, 2:34 PM
in general i understand it's the grafana helm release metadata size issue, as helm keeps release info in the form of a secret, helm.release.v1.grafana.v49 in this case. but why does the rancher agent keep syncing that, causing issues on the API?
btw yes @witty-honey-18052 uninstalling the grafana release does help. but that's not something we can afford at this time.
how can we avoid rancher agent doing any reconciliation in this case?
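Editor's note: for anyone wanting to see what is actually inside such a release secret, here is a minimal sketch. It assumes the usual Helm v3 storage layout, where the secret's `release` field is base64 (added by Kubernetes) wrapping base64 (added by Helm) wrapping gzip-compressed JSON. The function names are hypothetical helpers, not part of Helm or Rancher.

```python
import base64
import gzip
import json


def decode_helm_release(secret_data_release: str) -> dict:
    """Decode the `release` field of a Helm v3 release Secret.

    Assumed layers, outermost first: base64 (Kubernetes Secret data),
    base64 (Helm), gzip, JSON.
    """
    helm_encoded = base64.b64decode(secret_data_release)
    compressed = base64.b64decode(helm_encoded)
    return json.loads(gzip.decompress(compressed))


def encode_helm_release(release: dict) -> str:
    """Inverse of decode_helm_release; only used to build sample input."""
    compressed = gzip.compress(json.dumps(release).encode("utf-8"))
    return base64.b64encode(base64.b64encode(compressed)).decode("ascii")


def largest_fields(release: dict, top: int = 3) -> list:
    """Return (field name, serialized size in bytes) pairs, largest first."""
    sizes = [(k, len(json.dumps(v).encode("utf-8"))) for k, v in release.items()]
    return sorted(sizes, key=lambda kv: kv[1], reverse=True)[:top]
```

Feeding it the value of `.data.release` from the secret and then calling `largest_fields` on the result should show which part of the release (chart files, values, manifest) accounts for most of the size.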

witty-honey-18052

02/07/2023, 2:37 PM
yea, it's obviously not ideal, but that's what i was checking, wondering if it was an issue with the helm chart, or an upgrade carrying over an object that's now too large
my thought was does a fresh install of the monitoring stack resolve it
i'm not sure why that isn't backing off

acceptable-printer-7134

02/07/2023, 2:38 PM
yes, earlier releases didn't have this issue. actually the dashboards json being deployed might be the cause in our case. we have a plan to migrate to a better monitoring architecture in the near future.
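Editor's note: a rough way to check whether dashboards JSON could plausibly push an object over the limit is to compare its compressed size (roughly what a gzip-compressed Helm release stores) with its expanded size. This is only a sketch: 1572864 bytes is etcd's default `--max-request-bytes`, the value on a managed control plane is an assumption, and the dashboard data below is made up for illustration.

```python
import gzip
import json

# etcd's default --max-request-bytes (1.5 MiB); assumed, a managed control plane may differ.
ETCD_DEFAULT_MAX_REQUEST_BYTES = 1572864


def sizes(obj) -> tuple:
    """Return (expanded_bytes, gzip_bytes) for the JSON serialization of obj."""
    raw = json.dumps(obj).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


def fits(n_bytes: int, limit: int = ETCD_DEFAULT_MAX_REQUEST_BYTES) -> bool:
    """True if a payload of n_bytes is within the given request limit."""
    return n_bytes <= limit


# Hypothetical dashboards: repetitive JSON that compresses very well.
panel = {"type": "graph", "title": "cpu", "targets": [{"expr": "rate(x[5m])"}]}
dashboards = {f"dash-{i}.json": {"panels": [panel] * 400} for i in range(60)}

expanded, compressed = sizes(dashboards)
print("expanded bytes:", expanded, "fits:", fits(expanded))
print("gzip bytes:", compressed, "fits:", fits(compressed))
```

If the same pattern holds for the real release, it would be consistent with a compressed secret being stored fine while an object built from its expanded content is rejected, though that is a guess about what happens here, not something confirmed in this thread.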

witty-honey-18052

02/07/2023, 2:39 PM
could be if that's over 1mb
from what i'm reading that's going to be a general etcd issue, not limited to rancher
but it seems like there should be an error back-off regardless
have you opened a GH issue?
(i'm also guessing you don't have paid support w/suse?)
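Editor's note: the kind of error back-off mentioned above can be sketched as a per-item exponential delay, where each repeated failure for the same key doubles the wait before the next requeue, up to a cap. This is an illustration of the general technique only, not Rancher's actual requeue logic; the class name and the `base` and `cap` values are hypothetical.

```python
class ItemBackoff:
    """Per-key exponential backoff: delay = min(cap, base * 2**failures)."""

    def __init__(self, base: float = 0.005, cap: float = 1000.0):
        self.base = base
        self.cap = cap
        self.failures = {}  # key -> number of failures seen so far

    def when(self, key: str) -> float:
        """Record one more failure for key and return the delay in seconds."""
        n = self.failures.get(key, 0)
        self.failures[key] = n + 1
        # Clamp the exponent so very long failure streaks cannot overflow.
        return min(self.cap, self.base * (2 ** min(n, 64)))

    def forget(self, key: str) -> None:
        """Reset the failure count once the key syncs successfully."""
        self.failures.pop(key, None)


backoff = ItemBackoff(base=1.0, cap=8.0)
key = "monitoring/sh.helm.release.v1.grafana.v49"
print([backoff.when(key) for _ in range(5)])
```

With a scheme like this, an object that can never be written (for example because it is too large) would be retried less and less often instead of producing a steady stream of API requests.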

acceptable-printer-7134

02/07/2023, 2:41 PM
object being large is a common issue, i agree. but the rancher agent continually requeuing it seems to be the cause.
gonna file a GH issue.
but since it's causing load on the API even in prod, i was hoping to check if we have some workaround other than uninstalling that release

witty-honey-18052

02/07/2023, 2:46 PM
I find the gh issues get surfaced a little better. support tickets obv the better option if available
did this start in 2.7 or did you upgrade after to try to resolve it?

acceptable-printer-7134

02/07/2023, 2:47 PM
have used rancher in the past, but it's new in this organisation and we are using 2.7 directly, for the first time here.
issue started appearing after the latest grafana release, which definitely tells me it's an object size issue. but if i can stop rancher doing any reconciliation that would be helpful in this case.
Also, i feel this is related in some way: https://github.com/rancher/rancher/issues/32939
i am also curious to know what catalog.cattle.io/v1 holds? helm-app-secret - these are rancher terms i believe

witty-honey-18052

02/07/2023, 2:58 PM
and monitoring probably isn't working at all anyways right now, right? or is it just failing to upgrade?

acceptable-printer-7134

02/07/2023, 3:00 PM
just failing to upgrade i believe. but the major issue is this:
seems to me that if the kubernetes-apiservers job can be disabled in prometheus, that won't perform any POST on the API server.

stocky-account-63046

02/08/2023, 10:09 AM
@acceptable-printer-7134 This slack is primarily for users of Rancher to get together, share their stories and support each other. Members of Rancher are here helping out where they can, but it's primarily for community support.