# harvester
s
The error from harvester-cloud-provider might be of some use?
Copy code
E1030 18:44:05.892151       1 controller.go:320] error processing service network/test-app (will retry): failed to ensure load balancer: update load balancer IP of service network/test-app failed, error: Operation cannot be fulfilled on services "test-app": the object has been modified; please apply your changes to the latest version and try again
I1030 18:44:05.892458       1 event.go:294] "Event occurred" object="network/test-app" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: update load balancer IP of service network/test-app failed, error: Operation cannot be fulfilled on services \"test-app\": the object has been modified; please apply your changes to the latest version and try again"
Ultimately, though, the guest cluster service says the load balancer is ensured... meaning it's a harvester-side issue?
Copy code
Type     Reason                  Age                From                Message
  ----     ------                  ----               ----                -------
  Warning  SyncLoadBalancerFailed  36m                service-controller  Error syncing load balancer: failed to ensure load balancer: update load balancer IP of service network/test-app failed, error: Operation cannot be fulfilled on services "test-app": the object has been modified; please apply your changes to the latest version and try again
  Normal   EnsuringLoadBalancer    36m (x3 over 36m)  service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer     36m (x2 over 36m)  service-controller  Ensured load balancer
I manually updated the LoadBalancer manifest to include a selector for all my guest cluster VMIs. Going to create a fresh RKE2 guest cluster tomorrow and see if it has the same issue.
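For anyone reading along, that kind of edit looks roughly like this. A sketch only: the label key and values under backendServerSelector are hypothetical and have to match labels that actually exist on the guest cluster VMIs.
Copy code
spec:
  ipam: dhcp
  workloadType: cluster
  backendServerSelector:
    # hypothetical label key/values; they must match labels carried by the guest cluster VMIs
    harvesterhci.io/cluster:
      - my-guest-cluster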
g
if it fails a support bundle would be nice
👍 1
h
I've had this problem with terraform-created clusters in 1.2.0. A cluster created manually through the Rancher GUI was always able to get an IP, but the terraform-created cluster never did. I've spent hours poring over the yaml for the two clusters, the logs, the events, and the code, but have never been able to figure out what's happening.
s
Oof, that's unfortunate, but at least it's a lead. I'm confused about the proper interactions of the harvester-load-balancer, harvester-load-balancer-webhook and harvester-cloud-provider on the guest cluster, even when they're working correctly. The constructLB function on the guest cluster seems to be doing what it's supposed to, so I don't think it's a guest cluster config issue or an issue with the cloud provider. But I can't find where in the harvester-load-balancer or webhook code the spec.backendServerSelector is getting injected into the spec. The v1alpha1 LoadBalancers of workloadType cluster that were converted into v1beta1 have the backendServerSelector defined, so that has to be the way it's done, right? But I looked through the load balancer repo and I can't for the life of me find anything that would act to create that. Is there anywhere I can go to understand how the load balancer creation code would flow in a proper case?
g
when a downstream cluster is created by rancher, it automatically generates a scoped kubeconfig for the hosting cluster and injects the harvester cloud provider via cloud-init
the harvester-cloud-provider syncs the load balancer requests to the host harvester cluster and does the work of fulfilling these requests
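roughly, the wiring that ends up in the guest cluster config looks like the sketch below (field names follow Rancher's rkeConfig conventions; the exact manifest Rancher generates may differ, and the kubeconfig value is a placeholder):
Copy code
machineSelectorConfig:
  - config:
      cloud-provider-name: harvester
      # scoped kubeconfig for the hosting Harvester cluster, generated by Rancher
      cloud-provider-config: <scoped-kubeconfig-for-the-harvester-cluster>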
s
(I created an issue with my above description and a support bundle attached: https://github.com/harvester/harvester/issues/4678)
@great-bear-19718 I've got that far, but I'm not sure what's supposed to happen once it gets into the harvester side. The LoadBalancer spec is created bare with just two values:
Copy code
spec:
  ipam: dhcp
  workloadType: cluster
and this seems to be what the cloud provider intends. But when I create an LB manually, or look at the automatically converted LBs from v1alpha1, they have a spec.backendServerSelector that defines what VMIs they map to. I don't see where this gets injected, or how the LB manifest gets mapped to those VMIs for the guest cluster. My assumption was that this happens in the mutating webhook for the LB when workloadType: cluster is passed in, and that something there maps the backend servers into the manifest? But I can't find anything that even touches backendServerSelector or workloadType: cluster.
p
@red-king-19196 @ancient-pizza-13099 do you have any insight to the issue here?
r
If the workloadType is cluster, there's nothing to do with .spec.backendServerSelector. That selector is for the vm use case (creating a LoadBalancer CR directly on the Harvester cluster and pointing it to a set of VMs). The LoadBalancer CR manifest you provided above is okay. Inside the manifest, the .status.allocatedAddress.ip is 0.0.0.0, which means it's `kube-vip`'s turn to grab a valid IP address from the DHCP server.
In order to do this, we need to signify to kube-vip and the cloud provider that we don't need one of their managed addresses. We do this by explicitly exposing a Service on the address 0.0.0.0. When kube-vip sees a Service on this address, it will create a macvlan interface on the host and request a DHCP address. Once this address is provided, it will assign it as the LoadBalancer IP and update the Kubernetes Service (ref).
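To make that concrete, the trigger on the kube-vip side is just a Service carrying the 0.0.0.0 placeholder, along these lines (a sketch; names, labels and ports are hypothetical):
Copy code
apiVersion: v1
kind: Service
metadata:
  name: test-app        # hypothetical
  namespace: network
spec:
  type: LoadBalancer
  # the 0.0.0.0 placeholder tells kube-vip to create a macvlan interface, run a
  # DHCP client, and publish the leased address as the LoadBalancer IP
  loadBalancerIP: 0.0.0.0
  selector:
    app: test-app       # hypothetical pod label
  ports:
    - name: dns
      port: 53
      protocol: UDP
      targetPort: 53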
Could you check and provide the log of the kube-vip Pod on the guest cluster? P.S. For the workflow of how the cloud provider and the load balancer work together, refer to the HEP.
s
Thanks @red-king-19196 for the explanation and HEP link! I re-created the service in the guest cluster. I am checking the kube-vip logs in the harvester cluster, as there is no kube-vip pod in the guest cluster. Strangely, I'm not seeing any fresh logs in kube-vip upon the creation of the load balancer named kubernetes-network-blocky-app-4-2ddd3a3f. Below are the full kube-vip logs.
Copy code
❯ kubectl logs -n harvester-system kube-vip-dk44j
time="2023-10-29T02:27:26Z" level=info msg="Starting kube-vip.io [v0.6.0]"
time="2023-10-29T02:27:26Z" level=info msg="namespace [kube-system], Mode: [ARP], Features(s): Control Plane:[false], Services:[true]"
time="2023-10-29T02:27:26Z" level=info msg="No interface is specified for VIP in config, auto-detecting default Interface"
time="2023-10-29T02:27:26Z" level=info msg="prometheus HTTP server started"
time="2023-10-29T02:27:26Z" level=info msg="kube-vip will bind to interface [mgmt-br]"
time="2023-10-29T02:27:26Z" level=info msg="Starting Kube-vip Manager with the ARP engine"
time="2023-10-29T02:27:26Z" level=info msg="beginning services leadership, namespace [harvester-system], lock name [plndr-svcs-lock], id [harvester0]"
I1029 02:27:26.488358       1 leaderelection.go:248] attempting to acquire leader lease harvester-system/plndr-svcs-lock...
E1029 02:27:56.516269       1 leaderelection.go:330] error retrieving resource lock harvester-system/plndr-svcs-lock: Get "https://10.53.0.1:443/apis/coordination.k8s.io/v1/namespaces/harvester-system/leases/plndr-svcs-lock": dial tcp 10.53.0.1:443: i/o timeout
I1029 02:27:58.256755       1 leaderelection.go:258] successfully acquired lease harvester-system/plndr-svcs-lock
time="2023-10-29T02:27:58Z" level=info msg="starting services watcher for all namespaces"
time="2023-10-29T02:27:58Z" level=info msg="Creating new macvlan interface for DHCP [vip-f33fec36]"
time="2023-10-29T02:27:58Z" level=info msg="New interface [vip-f33fec36] mac is 00:00:6c:0c:e9:60"
time="2023-10-29T02:27:58Z" level=info msg="DHCP VIP [0.0.0.0] for [default/kubernetes-network-blocky-app-58296840] "
time="2023-10-29T02:27:58Z" level=info msg="[service] adding VIP [0.0.0.0] for [default/kubernetes-network-blocky-app-58296840]"
time="2023-10-29T02:27:58Z" level=info msg="[service] synchronised in 80ms"
time="2023-10-29T02:27:58Z" level=info msg="Creating new macvlan interface for DHCP [vip-fbbe9ecc]"
time="2023-10-29T02:27:58Z" level=info msg="New interface [vip-fbbe9ecc] mac is 00:00:6c:db:f2:c1"
time="2023-10-29T02:27:58Z" level=info msg="DHCP VIP [0.0.0.0] for [default/kubernetes-network-ingress-nginx-internal-controller-c6de3eb7] "
time="2023-10-29T02:27:58Z" level=info msg="[service] adding VIP [0.0.0.0] for [default/kubernetes-network-ingress-nginx-internal-controller-c6de3eb7]"
time="2023-10-29T02:27:58Z" level=info msg="[service] synchronised in 82ms"
time="2023-10-29T02:27:58Z" level=info msg="[service] adding VIP [192.168.10.2] for [kube-system/ingress-expose]"
time="2023-10-29T02:27:58Z" level=info msg="[service] synchronised in 12ms"
time="2023-10-29T14:28:33Z" level=error msg="renew failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-29T14:28:33Z" level=error msg="renew failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-29T23:28:33Z" level=error msg="rebind failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-29T23:28:33Z" level=error msg="rebind failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-30T13:56:26Z" level=info msg="Creating new macvlan interface for DHCP [vip-96b5a5ba]"
time="2023-10-30T13:56:26Z" level=info msg="Generated mac: 00:00:6C:62:97:a8"
time="2023-10-30T13:56:26Z" level=info msg="New interface [vip-96b5a5ba] mac is 00:00:6c:62:97:a8"
time="2023-10-30T13:56:27Z" level=info msg="DHCP VIP [0.0.0.0] for [default/manual-test] "
time="2023-10-30T13:56:27Z" level=info msg="[service] adding VIP [0.0.0.0] for [default/manual-test]"
time="2023-10-30T13:56:27Z" level=info msg="[service] synchronised in 1242ms"
time="2023-10-31T01:57:02Z" level=error msg="renew failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-31T04:18:19Z" level=info msg="[LOADBALANCER] Stopping load balancers"
time="2023-10-31T04:18:19Z" level=info msg="[VIP] Releasing the Virtual IP [192.168.10.75]"
time="2023-10-31T04:18:19Z" level=info msg="release, lease: &{Offer:DHCPv4(xid=0x48665d6b hwaddr=00:00:6c:62:97:a8 msg_type=OFFER, your_ip=192.168.10.75, server_ip=192.168.10.1) ACK:DHCPv4(xid=0x48665d6b hwaddr=00:00:6c:62:97:a8 msg_type=ACK, your_ip=192.168.10.75, server_ip=192.168.10.1) CreationTime:2023-10-30 13:56:27.183661196 +0000 UTC m=+127740.787232155}"
time="2023-10-31T04:18:19Z" level=info msg="Removed [96b5a5ba-0224-43cb-9514-4a65fe7ba5d8] from manager, [3] advertised services remain"
time="2023-10-31T04:18:19Z" level=info msg="service [default/manual-test] has been deleted"
Looking at the kube-vip documentation, it looks like it acts on a Service with type LoadBalancer, not the loadbalancers.harvesterhci.io loadbalancer. Those services, for a loadbalancer with workloadType: cluster, don't seem to exist or to be getting created. Perhaps that's the missing link?
❌ 1
"Once the kube-vip inside the guest cluster watches the service..." https://github.com/harvester/harvester/blob/master/enhancements/20220214-harvester-cloud-provider-enhancement.md
So kube-vip should be running in the guest cluster? That's probably why it's not working lmao. I'm guessing something went wrong in the upgrade process - I have harvester-cloud-provider:v0.2.0 running in the guest cluster (bumped from v0.1.5, I think??) but no kube-vip pod!
r
That could be the case. The whole process needs kube-vip (in the guest cluster).
s
Do you know where that requirement was introduced? I'm mostly sure that it did not exist when I was running rke2 1.24.x on harvester 1.1.2, but I'm not positive.
I can try installing the latest CCM helm chart in the guest cluster to get kube-vip and see if that works: https://github.com/harvester/charts/blob/master/charts/harvester-cloud-provider/values.yaml
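If I go that route, on RKE2 the packaged chart can also be tweaked with a HelmChartConfig instead of a manual helm install, something like this sketch (the kube-vip values key is an assumption on my part; the real keys are in the values.yaml linked above):
Copy code
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: harvester-cloud-provider
  namespace: kube-system
spec:
  valuesContent: |-
    # assumed sub-chart toggle; check the chart's values.yaml for the actual keys
    kube-vip:
      enabled: true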
For reference, these are the logs from the helm-install-harvester-cloud-provider job on the guest cluster:
Copy code
❯ kubectl logs helm-install-harvester-cloud-provider-mg6h2 -n kube-system
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
        echo "KUBERNETES_SERVICE_HOST is using IPv6"
        CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
        CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/harvester-cloud-provider.tgz.base64
+ CHART_PATH=/tmp/harvester-cloud-provider.tgz
+ [[ ! -f /chart/harvester-cloud-provider.tgz.base64 ]]
+ base64 -d /chart/harvester-cloud-provider.tgz.base64
+ CHART=/tmp/harvester-cloud-provider.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/harvester-cloud-provider.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
++ helm_v3 ls --all -f '^harvester-cloud-provider$' --namespace kube-system --output json
+ LINE=v0.2.0,deployed
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-10_HelmChartConfig.yaml'
+ [[ install = \d\e\l\e\t\e ]]
+ [[ v0.2.0 =~ ^(|null)$ ]]
+ [[ deployed =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ deployed == \d\e\p\l\o\y\e\d ]]
+ echo 'Already installed harvester-cloud-provider'
Already installed harvester-cloud-provider
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ helm_v3 mapkubeapis harvester-cloud-provider --namespace kube-system
2023/10/30 03:47:03 Release 'harvester-cloud-provider' will be checked for deprecated or removed Kubernetes APIs and will be updated if necessary to supported API versions.
2023/10/30 03:47:03 Get release 'harvester-cloud-provider' latest version.
2023/10/30 03:47:03 Check release 'harvester-cloud-provider' for deprecated or removed APIs...
2023/10/30 03:47:04 Finished checking release 'harvester-cloud-provider' for deprecated or removed APIs.
2023/10/30 03:47:04 Release 'harvester-cloud-provider' has no deprecated or removed APIs.
2023/10/30 03:47:04 Map of release 'harvester-cloud-provider' deprecated or removed APIs to supported versions, completed successfully.
+ echo 'Upgrading helm_v3 chart'
+ echo 'Upgrading harvester-cloud-provider'
+ shift 1
+ helm_v3 upgrade --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16 harvester-cloud-provider /tmp/harvester-cloud-provider.tgz --values /config/values-10_HelmChartConfig.yaml
Upgrading harvester-cloud-provider
Release "harvester-cloud-provider" has been upgraded. Happy Helming!
NAME: harvester-cloud-provider
LAST DEPLOYED: Mon Oct 30 03:47:06 2023
NAMESPACE: kube-system
STATUS: deployed
REVISION: 5
TEST SUITE: None
+ exit
r
"Do you know where that requirement was introduced? I'm mostly sure that it did not exist when I was running rke2 1.24.x on harvester 1.1.2 but I'm not positive."
It was introduced in the latest change to the chart: https://github.com/harvester/charts/commit/677c166aa61531e106b1db47878c0e595051ed68
🙌 1
s
Side note, it looks like there are two sources for this chart, from both rancher-charts and harvester-charts, and they look like they're out of sync.
Copy code
❯ helm repo update > /dev/null 2>&1 && helm search repo harvester-cloud-provider
NAME                                    CHART VERSION           APP VERSION     DESCRIPTION                              
harvester/harvester-cloud-provider      0.2.2                   v0.2.0          A Helm chart for Harvester Cloud Provider
rancher-charts/harvester-cloud-provider 102.0.1+up0.1.14        v0.1.5          A Helm chart for Harvester Cloud Provider
rancher-charts still has app version v0.1.5
r
May I ask what’s the version of your external Rancher?
s
From the about page:
Copy code
Component 	Version
Rancher 	v2.7.6
Dashboard 	v2.7.6
Helm 	v2.16.8-rancher2
Machine 	v0.15.0-rancher100
Running on rke1 v1.26.4, if that matters
🙌 1
OK, closing in on the issue. The kube-vip DaemonSet already existed! It was added via the helm upgrade job. I had been looking for a kube-vip pod, which is not running, and for whatever reason my alerting doesn't think that's a problem.
Copy code
❯ kubectl get daemonset -n kube-system kube-vip
NAME       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                AGE
kube-vip   0         0         0       0            0           node-role.kubernetes.io/control-plane=true   35h
Looking at why there are no desired pods here. Something with control-plane tainting, maybe?
r
Could you describe the DaemonSet to see if there are any events at the bottom?
s
Copy code
❯ kubectl describe daemonset -n kube-system kube-vip 
...
Events:                         <none>
So my CP nodes are tainted with:
Copy code
spec:
...
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
And the kube-vip daemonset only tolerates:
Copy code
tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
I think it also needs a toleration for the NoExecute taint on the etcd role?
That was it, kube-vip is running with the addition of:
Copy code
tolerations:
        - effect: NoExecute
          key: node-role.kubernetes.io/etcd
          operator: Exists
Now on to test the service load balancer stuff again!
WOOHOO! It works!
Copy code
❯ kubectl get svc -n network blocky-app-4
NAME           TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
blocky-app-4   LoadBalancer   10.43.188.12   192.168.10.78   53:31656/UDP   73m
👍 1
Well, that ended up being super simple 🤕
If @hundreds-easter-25520 is correct though, there's something misaligned here between the taints and tolerations on a cluster created by the terraform provider and one created by rancher. My guess is the terraform provider creates the etcd nodes with that taint, and the rancher interface doesn't.
h
I've been reinstalling 1.2.0 to look at another issue, so I can't check exactly, but my cluster that was built via the GUI was only a single node, while my terraform cluster was 3-node control/etcd + workers. 1.2.0 just came back up, so I'll install via terraform again and see about the taints.
Adding the toleration to the kube-vip daemonset let the LB service get an IP address in the terraform-created cluster. I'm going to start up the Rancher GUI-created cluster and take a look at it, both single-node and split control-plane/etcd and worker.
I don't think there's much else to add here. I tested 4 cluster configurations, one via terraform and the other three created through the GUI:
1. GUI-created, etcd+control separate from workers
2. Terraform-created, etcd+control separate from workers
3. All-in-one, etcd+control+worker on one node
4. Full split: etcd, control, and worker all in different pools
For cluster #1 we had exactly what we'd expect after this discussion: kube-vip at 0 pods, because there are no nodes matching the NodeSelector that don't have a taint the pods can't tolerate, and LoadBalancer services are created fine. #2 doesn't have a kube-vip DaemonSet at all, but I ended up with an upgraded rancher2 terraform provider and I think the changes there might have broken something; I'll dig into that, but I don't think it's this issue, just a configuration problem. #3 works fine: there are no taints on the node, so there is nothing to stop kube-vip from happily running on the control-plane node that just happens to also be an etcd and worker node, and the LoadBalancer service worked fine. #4 also worked perfectly well: kube-vip is running on the control-plane node, as it both matches the node selector and can tolerate the taint. I think the helm chart needs to be updated to include the toleration for the etcd:NoExecute taint, and then all these cases should work fine.
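For reference, the full toleration set that lets kube-vip schedule across these topologies is just the chart's existing control-plane toleration plus the etcd one added by hand earlier in the thread:
Copy code
tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Exists
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    operator: Exists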
🙌 1
👍 1
b
thanks for your updates and workaround here 👍