adamant-kite-43734
10/30/2023, 6:49 PM
salmon-bear-45866
10/30/2023, 7:19 PM
E1030 18:44:05.892151 1 controller.go:320] error processing service network/test-app (will retry): failed to ensure load balancer: update load balancer IP of service network/test-app failed, error: Operation cannot be fulfilled on services "test-app": the object has been modified; please apply your changes to the latest version and try again
I1030 18:44:05.892458 1 event.go:294] "Event occurred" object="network/test-app" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: update load balancer IP of service network/test-app failed, error: Operation cannot be fulfilled on services \"test-app\": the object has been modified; please apply your changes to the latest version and try again"
salmon-bear-45866
10/30/2023, 7:21 PM
Type     Reason                  Age                  From                 Message
---- ------ ---- ---- -------
Warning SyncLoadBalancerFailed 36m service-controller Error syncing load balancer: failed to ensure load balancer: update load balancer IP of service network/test-app failed, error: Operation cannot be fulfilled on services "test-app": the object has been modified; please apply your changes to the latest version and try again
Normal EnsuringLoadBalancer 36m (x3 over 36m) service-controller Ensuring load balancer
Normal EnsuredLoadBalancer 36m (x2 over 36m) service-controller Ensured load balancer
salmon-bear-45866
10/30/2023, 8:11 PM
Updated the LoadBalancer manifest to include a selector for all my guest cluster VMIs. Going to create a fresh RKE2 guest cluster and see if it has the same issue tomorrow.
great-bear-19718
10/31/2023, 2:34 AM
hundreds-easter-25520
10/31/2023, 3:17 AM
salmon-bear-45866
10/31/2023, 3:45 AM
I've gone through harvester-load-balancer, harvester-load-balancer-webhook, and harvester-cloud-provider on the guest cluster, and they even seem to be working correctly.
The constructLB function on the guest cluster seems to be doing what it's supposed to, so I don't think it's a guest cluster config issue or an issue with the cloud provider. But I can't find where in the harvester-load-balancer or webhook code the spec.backendServerSelector is getting injected into the spec. The v1alpha1 LoadBalancers of workloadType cluster that were converted into v1beta1 have the backendServerSelector defined, so that has to be the way it's done, right?
But I looked through the load-balancer repo and I can't for the life of me find anything that would act to create that.
Is there anywhere I can go to understand how the load balancer creation code flows in a proper case?
great-bear-19718
10/31/2023, 3:48 AM
great-bear-19718
10/31/2023, 3:48 AM
great-bear-19718
10/31/2023, 3:48 AM
salmon-bear-45866
10/31/2023, 3:51 AM
salmon-bear-45866
10/31/2023, 3:57 AM
The LoadBalancer spec is created bare with just two values:
spec:
  ipam: dhcp
  workloadType: cluster
and this seems to be what the cloud provider intends.
But when I create an LB manually, or look at the LBs automatically converted from v1alpha1, they have a spec.backendServerSelector that defines which VMIs they map to. I don't see where this gets injected, or how the LB manifest gets mapped to those VMIs for the guest cluster. My assumption would be that this happens in the mutating webhook for the LB: when a workloadType of cluster is passed in, something maps the backend servers into the manifest? But I can't find anything that even touches backendServerSelector or workloadType: cluster.
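For example, a manually created (vm-type) LoadBalancer looks roughly like this - the name, labels and listener below are placeholders, and the listener shape is from memory, so treat it as a sketch rather than a copy of my manifest:
apiVersion: loadbalancer.harvesterhci.io/v1beta1
kind: LoadBalancer
metadata:
  name: example-vm-lb            # placeholder name
  namespace: default
spec:
  ipam: dhcp
  workloadType: vm               # the "vm" use case, unlike the cluster LBs above
  backendServerSelector:         # matches VMI labels; key/values here are illustrative
    app:
      - example-app
  listeners:                     # assumed listener fields; check against the actual CRD
    - name: dns
      port: 53
      protocol: UDP
      backendPort: 53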
prehistoric-balloon-31801
10/31/2023, 5:41 AM
red-king-19196
10/31/2023, 9:02 AM
When the workloadType is cluster, there's nothing to do with .spec.backendServerSelector. That selector is for the vm use case (creating a LoadBalancer CR directly on the Harvester cluster and pointing it at a set of VMs). The LoadBalancer CR manifest you provided above is okay. Inside the manifest, the .status.allocatedAddress.ip is 0.0.0.0, which means it's kube-vip's turn to grab a valid IP address from the DHCP server.
In order to do this, we need to signify to kube-vip and the cloud provider that we don't need one of their managed addresses. We do this by explicitly exposing a Service on the address 0.0.0.0. When kube-vip sees a Service on this address, it will create a macvlan interface on the host and request a DHCP address. Once this address is provided, it will assign it as the LoadBalancer IP and update the Kubernetes Service.
Could you check and provide the log of the kube-vip Pod on the guest cluster?
P.S. For the workflow of how the cloud provider and the load balancer work together, you can refer to the HEP.
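Roughly speaking, the Service that kube-vip acts on should look something like this while it waits for DHCP - the name, namespace, selector and port below are placeholders, not taken from your cluster:
apiVersion: v1
kind: Service
metadata:
  name: my-app                   # placeholder
  namespace: default             # placeholder
spec:
  type: LoadBalancer
  loadBalancerIP: 0.0.0.0        # the "please request an address via DHCP" signal kube-vip looks for
  selector:
    app: my-app                  # placeholder
  ports:
    - port: 53
      protocol: UDP
      targetPort: 53
Once kube-vip obtains a lease, it should replace the 0.0.0.0 with the real address on the Service.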
salmon-bear-45866
10/31/2023, 1:48 PM
Checked on the kube-vip pod in the guest cluster. Strangely, I'm not seeing any fresh logs in kube-vip upon the creation of the load balancer named kubernetes-network-blocky-app-4-2ddd3a3f.
Below are the full kube-vip logs.
❯ kubectl logs -n harvester-system kube-vip-dk44j ─╯
time="2023-10-29T02:27:26Z" level=info msg="Starting kube-vip.io [v0.6.0]"
time="2023-10-29T02:27:26Z" level=info msg="namespace [kube-system], Mode: [ARP], Features(s): Control Plane:[false], Services:[true]"
time="2023-10-29T02:27:26Z" level=info msg="No interface is specified for VIP in config, auto-detecting default Interface"
time="2023-10-29T02:27:26Z" level=info msg="prometheus HTTP server started"
time="2023-10-29T02:27:26Z" level=info msg="kube-vip will bind to interface [mgmt-br]"
time="2023-10-29T02:27:26Z" level=info msg="Starting Kube-vip Manager with the ARP engine"
time="2023-10-29T02:27:26Z" level=info msg="beginning services leadership, namespace [harvester-system], lock name [plndr-svcs-lock], id [harvester0]"
I1029 02:27:26.488358 1 leaderelection.go:248] attempting to acquire leader lease harvester-system/plndr-svcs-lock...
E1029 02:27:56.516269 1 leaderelection.go:330] error retrieving resource lock harvester-system/plndr-svcs-lock: Get "https://10.53.0.1:443/apis/coordination.k8s.io/v1/namespaces/harvester-system/leases/plndr-svcs-lock": dial tcp 10.53.0.1:443: i/o timeout
I1029 02:27:58.256755 1 leaderelection.go:258] successfully acquired lease harvester-system/plndr-svcs-lock
time="2023-10-29T02:27:58Z" level=info msg="starting services watcher for all namespaces"
time="2023-10-29T02:27:58Z" level=info msg="Creating new macvlan interface for DHCP [vip-f33fec36]"
time="2023-10-29T02:27:58Z" level=info msg="New interface [vip-f33fec36] mac is 00:00:6c:0c:e9:60"
time="2023-10-29T02:27:58Z" level=info msg="DHCP VIP [0.0.0.0] for [default/kubernetes-network-blocky-app-58296840] "
time="2023-10-29T02:27:58Z" level=info msg="[service] adding VIP [0.0.0.0] for [default/kubernetes-network-blocky-app-58296840]"
time="2023-10-29T02:27:58Z" level=info msg="[service] synchronised in 80ms"
time="2023-10-29T02:27:58Z" level=info msg="Creating new macvlan interface for DHCP [vip-fbbe9ecc]"
time="2023-10-29T02:27:58Z" level=info msg="New interface [vip-fbbe9ecc] mac is 00:00:6c:db:f2:c1"
time="2023-10-29T02:27:58Z" level=info msg="DHCP VIP [0.0.0.0] for [default/kubernetes-network-ingress-nginx-internal-controller-c6de3eb7] "
time="2023-10-29T02:27:58Z" level=info msg="[service] adding VIP [0.0.0.0] for [default/kubernetes-network-ingress-nginx-internal-controller-c6de3eb7]"
time="2023-10-29T02:27:58Z" level=info msg="[service] synchronised in 82ms"
time="2023-10-29T02:27:58Z" level=info msg="[service] adding VIP [192.168.10.2] for [kube-system/ingress-expose]"
time="2023-10-29T02:27:58Z" level=info msg="[service] synchronised in 12ms"
time="2023-10-29T14:28:33Z" level=error msg="renew failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-29T14:28:33Z" level=error msg="renew failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-29T23:28:33Z" level=error msg="rebind failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-29T23:28:33Z" level=error msg="rebind failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-30T13:56:26Z" level=info msg="Creating new macvlan interface for DHCP [vip-96b5a5ba]"
time="2023-10-30T13:56:26Z" level=info msg="Generated mac: 00:00:6C:62:97:a8"
time="2023-10-30T13:56:26Z" level=info msg="New interface [vip-96b5a5ba] mac is 00:00:6c:62:97:a8"
time="2023-10-30T13:56:27Z" level=info msg="DHCP VIP [0.0.0.0] for [default/manual-test] "
time="2023-10-30T13:56:27Z" level=info msg="[service] adding VIP [0.0.0.0] for [default/manual-test]"
time="2023-10-30T13:56:27Z" level=info msg="[service] synchronised in 1242ms"
time="2023-10-31T01:57:02Z" level=error msg="renew failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-31T04:18:19Z" level=info msg="[LOADBALANCER] Stopping load balancers"
time="2023-10-31T04:18:19Z" level=info msg="[VIP] Releasing the Virtual IP [192.168.10.75]"
time="2023-10-31T04:18:19Z" level=info msg="release, lease: &{Offer:DHCPv4(xid=0x48665d6b hwaddr=00:00:6c:62:97:a8 msg_type=OFFER, your_ip=192.168.10.75, server_ip=192.168.10.1) ACK:DHCPv4(xid=0x48665d6b hwaddr=00:00:6c:62:97:a8 msg_type=ACK, your_ip=192.168.10.75, server_ip=192.168.10.1) CreationTime:2023-10-30 13:56:27.183661196 +0000 UTC m=+127740.787232155}"
time="2023-10-31T04:18:19Z" level=info msg="Removed [96b5a5ba-0224-43cb-9514-4a65fe7ba5d8] from manager, [3] advertised services remain"
time="2023-10-31T04:18:19Z" level=info msg="service [default/manual-test] has been deleted"
salmon-bear-45866
10/31/2023, 1:56 PM
From the kube-vip documentation, it looks like it acts on a Service with type LoadBalancer, not the loadbalancers.harvesterhci.io LoadBalancer. Those Services, for the LoadBalancer with workloadType: cluster, don't seem to exist or be getting created. Perhaps that's the missing link?
salmon-bear-45866
10/31/2023, 1:59 PM
salmon-bear-45866
10/31/2023, 2:09 PM
> Once the kube-vip inside the guest cluster watches the service
https://github.com/harvester/harvester/blob/master/enhancements/20220214-harvester-cloud-provider-enhancement.md
So kube-vip should be running in the guest cluster? That's probably why it's not working lmao.
I'm guessing something went wrong in the upgrade process - I have harvester-cloud-provider:v0.2.0 running in the guest cluster (bumped from v0.1.5 I think??) but no kube-vip pod!
red-king-19196
10/31/2023, 2:10 PM
kube-vip (in the guest cluster)
salmon-bear-45866
10/31/2023, 2:11 PM
Do you know where that requirement was introduced? I'm mostly sure that it did not exist when I was running rke2 1.24.x on harvester 1.1.2, but I'm not positive.
salmon-bear-45866
10/31/2023, 2:11 PM
salmon-bear-45866
10/31/2023, 2:16 PM
helm-install-harvester-cloud-provider job logs on the guest cluster:
❯ kubectl logs helm-install-harvester-cloud-provider-mg6h2 -n kube-system ─╯
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
echo "KUBERNETES_SERVICE_HOST is using IPv6"
CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi
set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/harvester-cloud-provider.tgz.base64
+ CHART_PATH=/tmp/harvester-cloud-provider.tgz
+ [[ ! -f /chart/harvester-cloud-provider.tgz.base64 ]]
+ base64 -d /chart/harvester-cloud-provider.tgz.base64
+ CHART=/tmp/harvester-cloud-provider.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/harvester-cloud-provider.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
++ helm_v3 ls --all -f '^harvester-cloud-provider$' --namespace kube-system --output json
+ LINE=v0.2.0,deployed
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-10_HelmChartConfig.yaml'
+ [[ install = \d\e\l\e\t\e ]]
+ [[ v0.2.0 =~ ^(|null)$ ]]
+ [[ deployed =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ deployed == \d\e\p\l\o\y\e\d ]]
+ echo 'Already installed harvester-cloud-provider'
Already installed harvester-cloud-provider
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ helm_v3 mapkubeapis harvester-cloud-provider --namespace kube-system
2023/10/30 03:47:03 Release 'harvester-cloud-provider' will be checked for deprecated or removed Kubernetes APIs and will be updated if necessary to supported API versions.
2023/10/30 03:47:03 Get release 'harvester-cloud-provider' latest version.
2023/10/30 03:47:03 Check release 'harvester-cloud-provider' for deprecated or removed APIs...
2023/10/30 03:47:04 Finished checking release 'harvester-cloud-provider' for deprecated or removed APIs.
2023/10/30 03:47:04 Release 'harvester-cloud-provider' has no deprecated or removed APIs.
2023/10/30 03:47:04 Map of release 'harvester-cloud-provider' deprecated or removed APIs to supported versions, completed successfully.
+ echo 'Upgrading helm_v3 chart'
+ echo 'Upgrading harvester-cloud-provider'
+ shift 1
+ helm_v3 upgrade --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16 harvester-cloud-provider /tmp/harvester-cloud-provider.tgz --values /config/values-10_HelmChartConfig.yaml
Upgrading harvester-cloud-provider
Release "harvester-cloud-provider" has been upgraded. Happy Helming!
NAME: harvester-cloud-provider
LAST DEPLOYED: Mon Oct 30 03:47:06 2023
NAMESPACE: kube-system
STATUS: deployed
REVISION: 5
TEST SUITE: None
+ exit
red-king-19196
10/31/2023, 2:18 PM
> Do you know where that requirement was introduced? I'm mostly sure that it did not exist when I was running rke2 1.24.x on harvester 1.1.2, but I'm not positive.
It was introduced in the latest change of the chart: https://github.com/harvester/charts/commit/677c166aa61531e106b1db47878c0e595051ed68
salmon-bear-45866
10/31/2023, 2:27 PM
❯ helm repo update > /dev/null 2>&1 && helm search repo harvester-cloud-provider ─╯
NAME CHART VERSION APP VERSION DESCRIPTION
harvester/harvester-cloud-provider 0.2.2 v0.2.0 A Helm chart for Harvester Cloud Provider
rancher-charts/harvester-cloud-provider 102.0.1+up0.1.14 v0.1.5 A Helm chart for Harvester Cloud Provider
salmon-bear-45866
10/31/2023, 2:27 PM
v0.1.5
red-king-19196
10/31/2023, 2:29 PM
salmon-bear-45866
10/31/2023, 2:32 PM
Component   Version
Rancher v2.7.6
Dashboard v2.7.6
Helm v2.16.8-rancher2
Machine v0.15.0-rancher100
salmon-bear-45866
10/31/2023, 2:33 PM
v1.26.4, if that matters
salmon-bear-45866
10/31/2023, 2:39 PM
The kube-vip DaemonSet already existed! It was added via the helm upgrade job. I had been looking for a kube-vip pod, which is not running, and for whatever reason my alerting doesn't think this is a problem.
❯ kubectl get daemonset -n kube-system kube-vip ─╯
NAME       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                 AGE
kube-vip   0         0         0       0            0           node-role.kubernetes.io/control-plane=true   35h
Looking at why there are no desired pods here. Something with control-plane tainting, maybe?
red-king-19196
10/31/2023, 2:41 PM
Could you describe the DaemonSet to see if there are any events at the bottom?
salmon-bear-45866
10/31/2023, 2:42 PM
❯ kubectl describe daemonset -n kube-system kube-vip
...
Events: <none>
salmon-bear-45866
10/31/2023, 2:46 PM
The node spec has these taints:
spec:
  ...
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
And the kube-vip DaemonSet only tolerates:
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/control-plane
  operator: Exists
I think it also needs a toleration for the NoExecute taint on the etcd role?
salmon-bear-45866
10/31/2023, 2:49 PM
salmon-bear-45866
10/31/2023, 2:52 PM
tolerations:
- effect: NoExecute
  key: node-role.kubernetes.io/etcd
  operator: Exists
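For anyone following along, the scheduling-relevant part of the kube-vip DaemonSet should then look roughly like this (sketched from the node selector and taints above, not copied from the chart):
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
      - effect: NoExecute                 # the toleration added above
        key: node-role.kubernetes.io/etcd
        operator: Exists
Without the NoExecute toleration, the only matching control-plane node is excluded by its etcd taint, which is why DESIRED stayed at 0.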
salmon-bear-45866
10/31/2023, 2:52 PM
salmon-bear-45866
10/31/2023, 2:53 PM
❯ kubectl get svc -n network blocky-app-4 ─╯
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
blocky-app-4 LoadBalancer 10.43.188.12 192.168.10.78 53:31656/UDP 73m
salmon-bear-45866
10/31/2023, 2:54 PM
salmon-bear-45866
10/31/2023, 3:00 PM
hundreds-easter-25520
10/31/2023, 3:04 PM
hundreds-easter-25520
10/31/2023, 7:02 PM
hundreds-easter-25520
11/01/2023, 3:53 AM
kube-vip is at 0 pods because there are no nodes that match the NodeSelector that don't have a taint that the pods can't tolerate.
hundreds-easter-25520
11/01/2023, 3:59 AM
One of the cases ended up with no kube-vip DaemonSet at all, but I also ended up with an upgraded rancher2 terraform provider, and I think the changes there might have broken something. I'll dig into that, but I don't think it's this issue, rather a configuration problem.
#3 works fine: there are no taints on the node, so there is nothing to stop kube-vip from happily running on the control-plane node that just happens to also be an etcd and worker node. The LoadBalancer service worked fine.
#4 also worked perfectly well. kube-vip is running on the control-plane node, as it matches both the node selector and can tolerate the taint.
I think the helm chart needs to be updated to include the toleration for the etcd:NoExecute taint, and then all these cases should work fine.
brave-napkin-80104
11/03/2023, 9:48 PM