# neuvector-security
a
Anyone ever seen this happen with the neuvector enforcer?
```
Failed to put key - error=Put "http://127.0.0.1:8500/v1/kv/object/host/agent/...": dial tcp 127.0.0.1:8500: connect: connection refused
```
We see this on enforcer startup, and when it happens the enforcer fails to cluster up and doesn't show as connected in the manager/web console. For a while we just patched a liveness + readiness probe onto the pods to check port 8500, but with the latest version (5.4.3) it's happening a lot more often.
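A probe patch along those lines might look roughly like the sketch below; the DaemonSet and container names match the default Helm chart and may differ in your install, and it assumes `nc` exists in the enforcer image. An exec probe is used so the check hits 127.0.0.1:8500 the same way the agent does in the error above:
```
# Sketch only: add an exec readiness probe that checks the embedded consul
# port on loopback.
kubectl -n neuvector patch daemonset neuvector-enforcer-pod --type=strategic -p '{
  "spec": {"template": {"spec": {"containers": [{
    "name": "neuvector-enforcer-pod",
    "readinessProbe": {
      "exec": {"command": ["sh", "-c", "nc -z 127.0.0.1 8500"]},
      "initialDelaySeconds": 30,
      "periodSeconds": 10
    }
  }]}}}}'
```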
c
Do you have Istio running in the NeuVector namespace?
a
Yep - istio injection + mTLS strict.
We have a few "workarounds" that were required to get that working in the past:
• permissive mTLS on controller ports 18300 and 30443
• headless services for the controller, enforcer, and scanner, with a number of ports declared so that Istio sets up listeners and the proper protocols for them (gossip ports, healthz ports, etc.)
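For anyone with the same setup, the port-level permissive mTLS piece would look roughly like the sketch below; the namespace and the controller selector label are assumptions based on the default chart, so adjust to match your install:
```
# Sketch only: keep STRICT mTLS for the controller workload but make ports
# 18300 and 30443 permissive, per the workaround above.
kubectl apply -f - <<'EOF'
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: neuvector-controller-port-mtls
  namespace: neuvector
spec:
  selector:
    matchLabels:
      app: neuvector-controller-pod
  mtls:
    mode: STRICT
  portLevelMtls:
    18300:
      mode: PERMISSIVE
    30443:
      mode: PERMISSIVE
EOF
```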
c
Would you be able to open an issue on GitHub and provide an example enforcer log showing the issue?
As a test, can you disable Istio on the NeuVector namespace and see if it improves?
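One way to run that test, assuming injection is enabled via the namespace label rather than per-pod annotations or an istio.io/rev label:
```
# Drop sidecar injection for the namespace, then restart the workloads so the
# pods come back without proxies.
kubectl label namespace neuvector istio-injection-
kubectl -n neuvector rollout restart deployment
kubectl -n neuvector rollout restart daemonset
```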
a
Yeah, let me give that a shot - are you aware of any issues running with Istio? It definitely didn't seem to work out of the box for us in the past. Also, if relevant, this is on our dev clusters at the moment (k3d), although I believe we've seen some of the same issues on RKE2 as well.
Not really seeing any better results without Istio injection/mTLS enforcement. Going to try to get an issue written up with full reproduction steps.
c
Thank you.
a
Discovered our issue. We're using images from Ironbank (the DoD repository of hardened images). The change to the affinity check in 5.4.3 (here) resulted in the `neuvector.role` label being required on the container image. The Ironbank image is not built with that label, so the enforcer was getting stuck on the affinity check and never starting up consul.
Working with that team to hopefully resolve though...
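A quick way to see whether a given image actually carries the label is to inspect its config; the image references below are placeholders for the Ironbank builds:
```
# If the image is present locally:
docker inspect --format '{{ index .Config.Labels "neuvector.role" }}' <registry>/neuvector/enforcer:<tag>
# Or straight from the registry, without pulling:
skopeo inspect docker://<registry>/neuvector/controller:<tag> | jq '.Labels'
```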
@clean-magazine-25026 do you happen to know if there would be other places in the code looking for `neuvector.role` on the other images? We got it added to the enforcer image (and that resolved the enforcer issue), but we're experiencing some weirdness on controllers in some clusters and trying to figure out if it's related. I skimmed through the code and didn't see anything obvious where the label would be needed on the controller - working to get logs to debug further...
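For what it's worth, grepping the source tree is a quick way to double-check for other consumers of the label (checking out the tag that matches the deployed version would be more precise than the default branch):
```
# Shallow clone and search for the label string across the codebase.
git clone --depth 1 https://github.com/neuvector/neuvector.git
grep -rn 'neuvector.role' neuvector/
```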
c
Not that I'm aware of, so it is likely the controller issue is something else. Please describe the symptom(s).
a
Controllers just aren't becoming ready. Due to where this is happening I don't have logs yet; going to try to get those tomorrow. I noticed this and am wondering if the self-ID might be getting messed up - https://github.com/neuvector/neuvector/blob/main/share/container/cri_client.go#L290-L295 (because I don't have the role label on the controllers).
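If the self-identification really does key off what the runtime reports, one way to see exactly what it returns is to query the runtime directly on the node (for k3d, that means inside the node container); the name filter and image reference below are placeholders:
```
# Labels the runtime reports for the running controller container:
crictl ps --name neuvector-controller -o json | jq '.containers[].labels'
# Image-level metadata for the image it was created from:
crictl inspecti -o json <controller-image-ref> | jq .
```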
c
The logs should give a hint on why it is not becoming ready. You can also enable debug logging on startup if needed to see more output.
a
Yeah we'll have to reproduce this in an environment where we have more access to debug. Some of these errors were hit in ephemeral environments - mostly just wanted to check if you were aware of anything before we start digging further.
So all we're really seeing is this on the pod events:
```
Warning  Unhealthy  3m20s (x341 over 32m)  kubelet            Readiness probe failed:
```
Controller logs appear normal; we even see the readiness log (which should be when the `/tmp/ready` file gets created/updated?):
```
❯ kl -n neuvector neuvector-controller-pod-xxx | grep "ctrl init done"
2025-04-03T14:10:15.407|INFO|CTL|utils.SetReady: - value=ctrl init done
```
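One way to narrow that down is to run the readiness check by hand, outside the kubelet probe (pod name is a placeholder):
```
# If the file exists and the exec succeeds, the probe failure is coming from
# something other than a missing /tmp/ready.
kubectl -n neuvector exec neuvector-controller-pod-xxx -- cat /tmp/ready
```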
Is it possible that the probe is getting killed by the enforcers? I was seeing this in enforcer logs:
```
2025-04-03T14:53:15.843|DEBU|AGT|main.reportIncident: - eLog={LogUID: ID:11 HostID:xxx HostName:xxx AgentID:xxx AgentName:xxx WorkloadID:xxx WorkloadName: ReportedAt:2025-04-03 14:53:15.843069795 +0000 UTC ProcName:cat ProcPath:/usr/bin/busybox ProcCmds:[cat /tmp/ready ] ProcRealUID:0 ProcEffUID:0 ProcRealUser: ProcEffUser:root FilePath: Files:[] LocalIP:<nil> RemoteIP:<nil> EtherType:0 LocalPort:0 RemotePort:0 IPProto:0 ConnIngress:false LocalPeer:false ProcPName:runc ProcPPath:/usr/bin/runc Count:15 StartAt:2025-04-03 14:52:10.811128089 +0000 UTC m=+2506.654157597 Action:deny RuleID:00000000-0000-0000-0000-000000000006 Group:NV.Protect Msg:Process profile violation, not from its root process: execution denied}
```
Seems like it's probably our custom image registry again and `cat` being in a different place 😞
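A quick check for that theory, assuming `readlink` is available in the hardened image (pod name is a placeholder):
```
# Show what `cat` actually resolves to inside the controller container.
kubectl -n neuvector exec neuvector-controller-pod-xxx -- sh -c 'readlink -f "$(command -v cat)"'
```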