# harvester
s
The error from harvester-cloud-provider might be of some use?
Copy code
E1030 18:44:05.892151       1 controller.go:320] error processing service network/test-app (will retry): failed to ensure load balancer: update load balancer IP of service network/test-app failed, error: Operation cannot be fulfilled on services "test-app": the object has been modified; please apply your changes to the latest version and try again
I1030 18:44:05.892458       1 event.go:294] "Event occurred" object="network/test-app" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: update load balancer IP of service network/test-app failed, error: Operation cannot be fulfilled on services \"test-app\": the object has been modified; please apply your changes to the latest version and try again"
Ultimately, though, the guest cluster service says the load balancer is ensured... meaning it's a harvester-side issue?
Copy code
Type     Reason                  Age                From                Message
  ----     ------                  ----               ----                -------
  Warning  SyncLoadBalancerFailed  36m                service-controller  Error syncing load balancer: failed to ensure load balancer: update load balancer IP of service network/test-app failed, error: Operation cannot be fulfilled on services "test-app": the object has been modified; please apply your changes to the latest version and try again
  Normal   EnsuringLoadBalancer    36m (x3 over 36m)  service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer     36m (x2 over 36m)  service-controller  Ensured load balancer
I manually updated the LoadBalancer manifest to include a selector for all my guest cluster VMIs. Going to create a fresh RKE2 guest cluster tomorrow and see if it has the same issue.
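For anyone reading along, that kind of edit looks roughly like this. A sketch only: the label key and values under backendServerSelector are hypothetical and have to match labels that actually exist on the guest cluster VMIs.
Copy code
spec:
  ipam: dhcp
  workloadType: cluster
  backendServerSelector:
    # hypothetical label key/values; they must match labels carried by the guest cluster VMIs
    harvesterhci.io/cluster:
      - my-guest-cluster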
g
if it fails a support bundle would be nice
👍 1
h
I've had this problem with terraform-created clusters in 1.2.0. A cluster created manually through the Rancher GUI was always able to get an IP, but the terraform-created cluster never did. I've spent hours poring over the yaml for the two clusters, the logs, the events, and the code, but have never been able to figure out what's happening.
s
Oof, that's unfortunate, but at least it's a lead. I'm confused about the proper interactions of the harvester-load-balancer, harvester-load-balancer-webhook and harvester-cloud-provider on the guest cluster, even when they're working correctly. The constructLB function on the guest cluster seems to be doing what it's supposed to, so I don't think it's a guest cluster config issue or an issue with the cloud provider. But I can't find where in the harvester-load-balancer or webhook code the spec.backendServerSelector is getting injected into the spec. The v1alpha1 LoadBalancers of workloadType cluster that were converted into v1beta1 have the backendServerSelector defined, so that has to be the way it's done, right? But I looked through the load balancer repo and I can't for the life of me find anything that would act to create that. Is there anywhere I can go to understand how the load balancer creation code would flow in a proper case?
g
when a downstream cluster is created by rancher, it automatically generates a scoped kubeconfig for the hosting cluster and injects the harvester cloud provider via cloud-init
the harvester-cloud-provider syncs the load balancer requests to the host harvester cluster and does the work of fulfilling these requests
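roughly, the wiring that ends up in the guest cluster config looks like the sketch below (field names follow Rancher's rkeConfig conventions; the exact manifest Rancher generates may differ, and the kubeconfig value is a placeholder):
Copy code
machineSelectorConfig:
  - config:
      cloud-provider-name: harvester
      # scoped kubeconfig for the hosting Harvester cluster, generated by Rancher
      cloud-provider-config: <scoped-kubeconfig-for-the-harvester-cluster>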
s
(I created an issue with my above description and a support bundle attached: https://github.com/harvester/harvester/issues/4678)
@great-bear-19718 I've got that far, but I'm not sure what's supposed to happen once it gets into the harvester side. The LoadBalancer spec is created bare with just two values:
Copy code
spec:
  ipam: dhcp
  workloadType: cluster
and this seems to be what the cloud provider intends. But when I create an LB manually, or look at the automatically converted LBs from v1alpha1, they have a spec.backendServerSelector that defines what VMIs they map to. I don't see where this gets injected, or how the LB manifest gets mapped to those VMIs for the guest cluster. My assumption was that this happens in the mutating webhook for the LB when workloadType: cluster is passed in, and that something there maps the backend servers into the manifest? But I can't find anything that even touches backendServerSelector or workloadType: cluster.
p
@red-king-19196 @ancient-pizza-13099 do you have any insight to the issue here?
r
If the workloadType is cluster, there's nothing to do with .spec.backendServerSelector. That selector is for the vm use case (creating a LoadBalancer CR directly on the Harvester cluster and pointing it to a set of VMs). The LoadBalancer CR manifest you provided above is okay. Inside the manifest, the .status.allocatedAddress.ip is 0.0.0.0, which means it's `kube-vip`'s turn to grab a valid IP address from the DHCP server.
In order to do this, we need to signify to kube-vip and the cloud provider that we don't need one of their managed addresses. We do this by explicitly exposing a Service on the address 0.0.0.0. When kube-vip sees a Service on this address, it will create a macvlan interface on the host and request a DHCP address. Once this address is provided, it will assign it as the LoadBalancer IP and update the Kubernetes Service (ref).
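To make that concrete, the trigger on the kube-vip side is just a Service carrying the 0.0.0.0 placeholder, along these lines (a sketch; names, labels and ports are hypothetical):
Copy code
apiVersion: v1
kind: Service
metadata:
  name: test-app        # hypothetical
  namespace: network
spec:
  type: LoadBalancer
  # the 0.0.0.0 placeholder tells kube-vip to create a macvlan interface, run a
  # DHCP client, and publish the leased address as the LoadBalancer IP
  loadBalancerIP: 0.0.0.0
  selector:
    app: test-app       # hypothetical pod label
  ports:
    - name: dns
      port: 53
      protocol: UDP
      targetPort: 53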
Could you check and provide the log of the kube-vip Pod on the guest cluster? P.S. For the workflow of how the cloud provider and the load balancer work together, refer to the HEP.
s
Thanks @red-king-19196 for the explanation and HEP link! I re-created the service in the guest cluster. I am checking the kube-vip logs in the harvester cluster, as there is no kube-vip pod in the guest cluster. Strangely, I'm not seeing any fresh logs in kube-vip upon the creation of the load balancer named kubernetes-network-blocky-app-4-2ddd3a3f. Below are the full kube-vip logs.
Copy code
❯ kubectl logs -n harvester-system kube-vip-dk44j
time="2023-10-29T02:27:26Z" level=info msg="Starting kube-vip.io [v0.6.0]"
time="2023-10-29T02:27:26Z" level=info msg="namespace [kube-system], Mode: [ARP], Features(s): Control Plane:[false], Services:[true]"
time="2023-10-29T02:27:26Z" level=info msg="No interface is specified for VIP in config, auto-detecting default Interface"
time="2023-10-29T02:27:26Z" level=info msg="prometheus HTTP server started"
time="2023-10-29T02:27:26Z" level=info msg="kube-vip will bind to interface [mgmt-br]"
time="2023-10-29T02:27:26Z" level=info msg="Starting Kube-vip Manager with the ARP engine"
time="2023-10-29T02:27:26Z" level=info msg="beginning services leadership, namespace [harvester-system], lock name [plndr-svcs-lock], id [harvester0]"
I1029 02:27:26.488358       1 leaderelection.go:248] attempting to acquire leader lease harvester-system/plndr-svcs-lock...
E1029 02:27:56.516269       1 leaderelection.go:330] error retrieving resource lock harvester-system/plndr-svcs-lock: Get "https://10.53.0.1:443/apis/coordination.k8s.io/v1/namespaces/harvester-system/leases/plndr-svcs-lock": dial tcp 10.53.0.1:443: i/o timeout
I1029 02:27:58.256755       1 leaderelection.go:258] successfully acquired lease harvester-system/plndr-svcs-lock
time="2023-10-29T02:27:58Z" level=info msg="starting services watcher for all namespaces"
time="2023-10-29T02:27:58Z" level=info msg="Creating new macvlan interface for DHCP [vip-f33fec36]"
time="2023-10-29T02:27:58Z" level=info msg="New interface [vip-f33fec36] mac is 00:00:6c:0c:e9:60"
time="2023-10-29T02:27:58Z" level=info msg="DHCP VIP [0.0.0.0] for [default/kubernetes-network-blocky-app-58296840] "
time="2023-10-29T02:27:58Z" level=info msg="[service] adding VIP [0.0.0.0] for [default/kubernetes-network-blocky-app-58296840]"
time="2023-10-29T02:27:58Z" level=info msg="[service] synchronised in 80ms"
time="2023-10-29T02:27:58Z" level=info msg="Creating new macvlan interface for DHCP [vip-fbbe9ecc]"
time="2023-10-29T02:27:58Z" level=info msg="New interface [vip-fbbe9ecc] mac is 00:00:6c:db:f2:c1"
time="2023-10-29T02:27:58Z" level=info msg="DHCP VIP [0.0.0.0] for [default/kubernetes-network-ingress-nginx-internal-controller-c6de3eb7] "
time="2023-10-29T02:27:58Z" level=info msg="[service] adding VIP [0.0.0.0] for [default/kubernetes-network-ingress-nginx-internal-controller-c6de3eb7]"
time="2023-10-29T02:27:58Z" level=info msg="[service] synchronised in 82ms"
time="2023-10-29T02:27:58Z" level=info msg="[service] adding VIP [192.168.10.2] for [kube-system/ingress-expose]"
time="2023-10-29T02:27:58Z" level=info msg="[service] synchronised in 12ms"
time="2023-10-29T14:28:33Z" level=error msg="renew failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-29T14:28:33Z" level=error msg="renew failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-29T23:28:33Z" level=error msg="rebind failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-29T23:28:33Z" level=error msg="rebind failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-30T13:56:26Z" level=info msg="Creating new macvlan interface for DHCP [vip-96b5a5ba]"
time="2023-10-30T13:56:26Z" level=info msg="Generated mac: 00:00:6C:62:97:a8"
time="2023-10-30T13:56:26Z" level=info msg="New interface [vip-96b5a5ba] mac is 00:00:6c:62:97:a8"
time="2023-10-30T13:56:27Z" level=info msg="DHCP VIP [0.0.0.0] for [default/manual-test] "
time="2023-10-30T13:56:27Z" level=info msg="[service] adding VIP [0.0.0.0] for [default/manual-test]"
time="2023-10-30T13:56:27Z" level=info msg="[service] synchronised in 1242ms"
time="2023-10-31T01:57:02Z" level=error msg="renew failed, error: got an error while processing the request: no matching response packet received"
time="2023-10-31T04:18:19Z" level=info msg="[LOADBALANCER] Stopping load balancers"
time="2023-10-31T04:18:19Z" level=info msg="[VIP] Releasing the Virtual IP [192.168.10.75]"
time="2023-10-31T04:18:19Z" level=info msg="release, lease: &{Offer:DHCPv4(xid=0x48665d6b hwaddr=00:00:6c:62:97:a8 msg_type=OFFER, your_ip=192.168.10.75, server_ip=192.168.10.1) ACK:DHCPv4(xid=0x48665d6b hwaddr=00:00:6c:62:97:a8 msg_type=ACK, your_ip=192.168.10.75, server_ip=192.168.10.1) CreationTime:2023-10-30 13:56:27.183661196 +0000 UTC m=+127740.787232155}"
time="2023-10-31T04:18:19Z" level=info msg="Removed [96b5a5ba-0224-43cb-9514-4a65fe7ba5d8] from manager, [3] advertised services remain"
time="2023-10-31T04:18:19Z" level=info msg="service [default/manual-test] has been deleted"
Looking at the kube-vip documentation, it looks like it acts on a Service with type LoadBalancer, not the loadbalancers.harvesterhci.io loadbalancer. Those services, for a loadbalancer with workloadType: cluster, don't seem to exist or to be getting created. Perhaps that's the missing link?
❌ 1
"Once the kube-vip inside the guest cluster watches the service..." https://github.com/harvester/harvester/blob/master/enhancements/20220214-harvester-cloud-provider-enhancement.md
So kube-vip should be running in the guest cluster? That's probably why it's not working lmao. I'm guessing something went wrong in the upgrade process - I have harvester-cloud-provider:v0.2.0 running in the guest cluster (bumped from v0.1.5, I think??) but no kube-vip pod!
r
That could be the case. The whole process needs kube-vip (in the guest cluster).
s
Do you know where that requirement was introduced? I'm mostly sure that it did not exist when I was running rke2 1.24.x on harvester 1.1.2, but I'm not positive.
I can try installing the latest CCM helm chart in the guest cluster to get kube-vip and see if that works: https://github.com/harvester/charts/blob/master/charts/harvester-cloud-provider/values.yaml
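If I go that route, on RKE2 the packaged chart can also be tweaked with a HelmChartConfig instead of a manual helm install, something like this sketch (the kube-vip values key is an assumption on my part; the real keys are in the values.yaml linked above):
Copy code
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: harvester-cloud-provider
  namespace: kube-system
spec:
  valuesContent: |-
    # assumed sub-chart toggle; check the chart's values.yaml for the actual keys
    kube-vip:
      enabled: true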
For reference, these are the logs from the helm-install-harvester-cloud-provider job on the guest cluster:
Copy code
❯ kubectl logs helm-install-harvester-cloud-provider-mg6h2 -n kube-system
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
        echo "KUBERNETES_SERVICE_HOST is using IPv6"
        CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
        CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/harvester-cloud-provider.tgz.base64
+ CHART_PATH=/tmp/harvester-cloud-provider.tgz
+ [[ ! -f /chart/harvester-cloud-provider.tgz.base64 ]]
+ base64 -d /chart/harvester-cloud-provider.tgz.base64
+ CHART=/tmp/harvester-cloud-provider.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/harvester-cloud-provider.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
++ helm_v3 ls --all -f '^harvester-cloud-provider$' --namespace kube-system --output json
+ LINE=v0.2.0,deployed
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-10_HelmChartConfig.yaml'
+ [[ install = \d\e\l\e\t\e ]]
+ [[ v0.2.0 =~ ^(|null)$ ]]
+ [[ deployed =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ deployed == \d\e\p\l\o\y\e\d ]]
+ echo 'Already installed harvester-cloud-provider'
Already installed harvester-cloud-provider
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ helm_v3 mapkubeapis harvester-cloud-provider --namespace kube-system
2023/10/30 03:47:03 Release 'harvester-cloud-provider' will be checked for deprecated or removed Kubernetes APIs and will be updated if necessary to supported API versions.
2023/10/30 03:47:03 Get release 'harvester-cloud-provider' latest version.
2023/10/30 03:47:03 Check release 'harvester-cloud-provider' for deprecated or removed APIs...
2023/10/30 03:47:04 Finished checking release 'harvester-cloud-provider' for deprecated or removed APIs.
2023/10/30 03:47:04 Release 'harvester-cloud-provider' has no deprecated or removed APIs.
2023/10/30 03:47:04 Map of release 'harvester-cloud-provider' deprecated or removed APIs to supported versions, completed successfully.
+ echo 'Upgrading helm_v3 chart'
+ echo 'Upgrading harvester-cloud-provider'
+ shift 1
+ helm_v3 upgrade --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16 harvester-cloud-provider /tmp/harvester-cloud-provider.tgz --values /config/values-10_HelmChartConfig.yaml
Upgrading harvester-cloud-provider
Release "harvester-cloud-provider" has been upgraded. Happy Helming!
NAME: harvester-cloud-provider
LAST DEPLOYED: Mon Oct 30 03:47:06 2023
NAMESPACE: kube-system
STATUS: deployed
REVISION: 5
TEST SUITE: None
+ exit
r
"Do you know where that requirement was introduced? I'm mostly sure that it did not exist when I was running rke2 1.24.x on harvester 1.1.2 but I'm not positive."
It was introduced in the latest change to the chart: https://github.com/harvester/charts/commit/677c166aa61531e106b1db47878c0e595051ed68
🙌 1
s
Side note, it looks like there are two sources for this chart, from both rancher-charts and harvester-charts, and they look like they're out of sync.
Copy code
❯ helm repo update > /dev/null 2>&1 && helm search repo harvester-cloud-provider
NAME                                    CHART VERSION           APP VERSION     DESCRIPTION                              
harvester/harvester-cloud-provider      0.2.2                   v0.2.0          A Helm chart for Harvester Cloud Provider
rancher-charts/harvester-cloud-provider 102.0.1+up0.1.14        v0.1.5          A Helm chart for Harvester Cloud Provider
rancher-charts still has app version v0.1.5
r
May I ask what’s the version of your external Rancher?
s
From the about page:
Copy code
Component 	Version
Rancher 	v2.7.6
Dashboard 	v2.7.6
Helm 	v2.16.8-rancher2
Machine 	v0.15.0-rancher100
Running on rke1 v1.26.4, if that matters
🙌 1
OK, closing in on the issue. The kube-vip DaemonSet already existed! It was added via the helm upgrade job. I had been looking for a kube-vip pod, which is not running, and for whatever reason my alerting doesn't think that's a problem.
Copy code
❯ kubectl get daemonset -n kube-system kube-vip
NAME       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                AGE
kube-vip   0         0         0       0            0           node-role.kubernetes.io/control-plane=true   35h
Looking at why there are no desired pods here. Something with control-plane tainting, maybe?
r
Could you describe the DaemonSet to see if there are any events at the bottom?
s
Copy code
❯ kubectl describe daemonset -n kube-system kube-vip 
...
Events:                         <none>
So my CP nodes are tainted with:
Copy code
spec:
...
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
And the kube-vip daemonset only tolerates:
Copy code
tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
I think it also needs a toleration for the NoExecute taint on the etcd role?
That was it, kube-vip is running with the addition of:
Copy code
tolerations:
        - effect: NoExecute
          key: node-role.kubernetes.io/etcd
          operator: Exists
Now on to test the service load balancer stuff again!
WOOHOO! It works!
Copy code
❯ kubectl get svc -n network blocky-app-4
NAME           TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
blocky-app-4   LoadBalancer   10.43.188.12   192.168.10.78   53:31656/UDP   73m
👍 1
Well, that ended up being super simple 🤕
If @hundreds-easter-25520 is correct though, there's something misaligned here between the taints and tolerations on a cluster created by the terraform provider and one created by rancher. My guess is the terraform provider creates the etcd nodes with that taint, and the rancher interface doesn't.
h
I've been reinstalling 1.2.0 to look at another issue, so I can't check exactly, but my cluster that was built via the GUI was only a single node, while my terraform cluster was 3-node control/etcd + workers. 1.2.0 just came back up, so I'll install via terraform again and see about the taints.
Adding the toleration to the kube-vip daemonset let the LB service get an IP address in the terraform-created cluster. I'm going to start up the Rancher GUI-created cluster and take a look at it, both single-node and split control-plane/etcd and worker.
I don't think there's much else to add here. I tested 4 cluster configurations, one via terraform and the other three created through the GUI:
1. GUI-created, etcd+control separate from workers
2. Terraform-created, etcd+control separate from workers
3. All-in-one, etcd+control+worker on one node
4. Full split: etcd, control, and worker all in different pools
For cluster #1 we had exactly what we'd expect after this discussion: kube-vip at 0 pods, because there are no nodes matching the NodeSelector that don't have a taint the pods can't tolerate, and LoadBalancer services are created fine. #2 doesn't have a kube-vip DaemonSet at all, but I ended up with an upgraded rancher2 terraform provider and I think the changes there might have broken something; I'll dig into that, but I don't think it's this issue, just a configuration problem. #3 works fine: there are no taints on the node, so there is nothing to stop kube-vip from happily running on the control-plane node that just happens to also be an etcd and worker node, and the LoadBalancer service worked fine. #4 also worked perfectly well: kube-vip is running on the control-plane node, as it both matches the node selector and can tolerate the taint. I think the helm chart needs to be updated to include the toleration for the etcd:NoExecute taint, and then all these cases should work fine.
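For reference, the full toleration set that lets kube-vip schedule across these topologies is just the chart's existing control-plane toleration plus the etcd one added by hand earlier in the thread:
Copy code
tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Exists
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    operator: Exists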
🙌 1
👍 1
b
thanks for your updates and workaround here 👍