# harvester
h
Help me troubleshoot. The Management URL stays unavailable while setting up Harvester: single-node cluster on a Dell R630, DHCP worked, and the node hostname and IP address show on the Harvester ASCII screen. I've poked through numerous logs but cannot pin down the reason yet. Which logs will help me while I'm running this test for Harvester 1.6.0-rc5?
h
I’ve seen that it says Unavailable, but is it really?
t
DHCP is a bad idea for the management IP. You should be able to SSH in as the rancher user with the password you set up during the install.
b
• Is the Management URL an FQDN?
• Is the IP on the ASCII console a VIP or the DHCP address of the box you set up?
• What's the output of `ip -br a` when you log into the node? (Using SSH like Andy said, or via F12 from the console?)
• Can you ping the IP from the same subnet/VLAN? What does `nmap` say is open for your host node? (Rough versions of these checks are sketched below.)
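A rough sketch of those checks; the 192.168.1.100 address and the `mgmt-br` name are taken from later in this thread, and the ping/nmap steps assume another machine on the same VLAN, so adjust for your setup:

```bash
# On the node itself (SSH as rancher, or F12 console):
ip -br a                    # which interfaces/bridges actually hold an address
ip -br a show mgmt-br       # the management bridge specifically

# From another host on the same subnet/VLAN:
ping -c 3 192.168.1.100                     # basic reachability
nmap -Pn -p 22,80,443,6443 192.168.1.100    # anything listening for the GUI (443) and the K8s API (6443)?
```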
h
`mgmt-br` is 192.168.1.100. `mgmt-bo` has no IP but is assigned, it seems, the same MAC as `mgmt-br`. I can SSH into rancher@192.168.1.100 and also `sudo -i` just fine. I cannot seem to get or find any responding URL for managing via the web GUI as the docs suggest.
`eno1` and `eno2` are disabled (only 2 ports on the Intel NIC for this server), leaving both `eno3` and `eno4` available, and `eno3` is mapped with the same MAC as `mgmt-br`.
I really do need help here. I've looked at the `harvester` and `rancher-system-agent` logs and checked them. Seems like `kubectl get pods` is complaining about not reaching the API; it reports that `https://127.0.0.1:6443` is not responding. The first time I installed via ISO and left it static rather than DHCP, I did get the Management URL. But with DHCP mode it doesn't appear, and I'm trying to learn how to trace this down, maybe to help with better error logging, etc. via a PR.
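A minimal sketch for chasing that 127.0.0.1:6443 refusal on the node itself; the kubeconfig and binary paths below are the standard RKE2 defaults (the kubeconfig path also comes up later in this thread), so verify them on your install:

```bash
# Is the rke2 server actually up?
systemctl status rke2-server.service
journalctl -u rke2-server.service -b --no-pager | tail -n 50

# Is anything listening on the API port?
ss -tlnp | grep 6443

# Point kubectl at the admin kubeconfig that RKE2 writes out
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes -o wide
```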
I can change the IP via a DHCP mapping on the router just fine, and a reboot of the Harvester server does indeed change it for `mgmt-br` and `mgmt-bo`. `vip_hw_addr` is set to the same MAC as `mgmt-br` in `/oem/90_custom.yaml` ??? weird? I also see Cattle reporting an unexpected IP, 192.168.1.91, in `90_custom.yaml`, which has an external URL of `https://192.168.1.91/api/v1/namespaces/cattle-monitoring-system/services`? And `tls-san` has that same 192.168.1.91 under the path for `90-harvester-server.yaml`?
Maybe there's some weird auto-detection of the IP happening with this Intel NIC slave or bonding? I just don't know how to filter down through the Harvester logs ecosystem yet, but I'm learning more about it as I read the code base.
but 192.168.1.91 is not shown as mapped to any interface I see when I do `ip addr`
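To locate where that stray 192.168.1.91 actually lives on disk, a rough sweep over the config trees mentioned in this thread (any path beyond /oem and /etc/rancher is just a guess, so adjust as needed):

```bash
# Which files mention the unexpected address?
grep -rn '192.168.1.91' /oem/ /etc/rancher/ 2>/dev/null

# And the tls-san entries specifically
grep -rn -A3 'tls-san' /etc/rancher/rke2/ 2>/dev/null
```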
p
> Seems like `kubectl get pods` is complaining about not reaching the API; it reports that `https://127.0.0.1:6443` is not responding.
sounds borked. reinstall w/ static IP config and try again?
t
Yup
h
```
Aug 14 14:20:25 junglebox rke2[2952]: time="2025-08-14T14:20:25Z" level=info msg="Failed to test etcd connection: this server is a not a member of the etcd cluster. Found [junglebox-63a6a0d9=https://192.168.1.83:2380], expect: junglebox-63a6a0d9=https://192.168.1.100:2380"
```
Looks like the `etcd` membership just needs a tweak? How can I update the server membership if the `etcdctl` command doesn't exist as part of Harvester directly?
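For reference, a sketch of getting at `etcdctl` on an RKE2-based node even when the kube API is down; the crictl config path and the etcd TLS cert paths below are the usual RKE2 defaults, not something confirmed in this thread, so verify them on your node:

```bash
# Talk to containerd directly (works even when kubectl can't reach 6443)
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps --name etcd      # note the container ID

# Shell into the etcd container and ask it for its member list
/var/lib/rancher/rke2/bin/crictl exec -it <etcd-container-id> sh
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list
```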
h
It’s in the etcd pod - you’d shell into that pod on the host. But it’s likely an issue with the wrong IP. Did you check that’s really the correct original IP?
And then reboot
h
the IP is 192.168.1.100 indeed... but not .83
After a reboot, where would the etcd pod be picking up its configuration? Somehow that .83 IP is stuck in some file somewhere? My VIP and `mgmt-br` are both static now and the problem still exists with the wrong IP being found/expected.
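One rough way to hunt for where that .83 is persisted; the etcd paths under /var/lib/rancher/rke2 are an assumption based on RKE2 defaults, and the member list itself lives in etcd's binary database, so a grep won't necessarily find it:

```bash
# Any text config still carrying the old address?
grep -rn '192.168.1.83' /oem/ /etc/rancher/ /var/lib/rancher/rke2/server/db/etcd/config 2>/dev/null

# The member list itself sits in the etcd data dir, not in a YAML file
ls /var/lib/rancher/rke2/server/db/etcd/
```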
b
The suggestion to start with a fresh install with static values is a good one.
You don't want this to cause you problems a year from now.
h
this is not about "a year from now" but about testing 1.6 rc5 and contributing back
b
Still probably worth the 20 minutes to reinstall
Plus any bug you find might be related to this
h
I can see now that the rke2 config gets deleted by a shell script upon bootup. That's actually where the IP is stored correctly... but that config gets `rm`'d by the shell script, and I don't know why someone would have done that... I guess because the config gets recreated? But in actual use after a reboot, I'm not seeing that `rancher-vip-config.yaml` getting recreated.
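To pin down which script does the `rm` and whether anything recreates the file afterwards, a rough sweep; the search roots beyond the /oem path already quoted in this thread are guesses:

```bash
# Who references a VIP config at all?
grep -rli 'vip' /oem/ /etc/rancher/ 2>/dev/null

# Boot-time log lines around the removal
journalctl -b --no-pager | grep -i 'vip'
```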
t
upon every boot harvester rebuilds the host OS. the configs are based on the /oem/ yamls. It's easy to bork up those files and cause the node not to boot.
b
Even then, there are certain values that k8s expects to be immutable. So even if you change it, there's some value in an object that can't be changed and will always pull the wrong one.
h
The issue happened after I changed the VIP in `/oem/90_custom.yaml` from mode: static to mode: dhcp
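If anyone wants to reproduce that safely, a small sketch for editing the file with a way back; only /oem/90_custom.yaml and the mode/vip keys come from this thread, and the backup location is arbitrary:

```bash
# Keep a copy outside /oem before touching anything
cp /oem/90_custom.yaml /root/90_custom.yaml.bak

# Show the VIP-related lines before and after the edit
grep -n -i -B2 -A4 'vip' /oem/90_custom.yaml

# After editing mode: static -> mode: dhcp, confirm exactly what changed
diff -u /root/90_custom.yaml.bak /oem/90_custom.yaml
```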
b
I think node IP is one of those values, which is probably why etcd is wrong. `kubectl get nodes -owide`
I had the same problem with VMs on DHCP failing to check in with rancher properly.
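Once the API answers again, a small sketch for comparing what Kubernetes recorded for the node against what it should be; the node name junglebox is taken from the log line earlier in this thread:

```bash
# What addresses does Kubernetes think the node has?
kubectl get nodes -o wide
kubectl get node junglebox -o jsonpath='{.status.addresses}{"\n"}'
```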
h
```
E0814 15:16:06.726863   19984 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused"
```
Likely because the rke2 config defaults to 127.0.0.1 when that rancher-vip- file gets `rm`'d by the shell script, I think?
`/etc/rancher/rke2/rke2.yaml`
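A small check of that theory, assuming the standard RKE2 file layout (the drop-in file name is the one that shows up in the boot log further down this thread):

```bash
# Which API endpoint does kubectl use from this kubeconfig?
grep 'server:' /etc/rancher/rke2/rke2.yaml

# Is the VIP drop-in actually present after boot?
ls -l /etc/rancher/rke2/config.yaml.d/
cat /etc/rancher/rke2/config.yaml.d/90-harvester-vip.yaml 2>/dev/null
```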
p
`journalctl -u rke2-server.service`?
h
no errors, and it ends with successfully generating the self-signed certificate.
At the beginning it says:
```
r-url[3133]: + HARVESTER_CONFIG_FILE=/oem/harvester.config
r-url[3133]: + RKE2_VIP_CONFIG_FILE=/etc/rancher/rke2/config.yaml.d/90-harvester-vip.yaml
r-url[3133]: + case $1 in
r-url[3133]: + rm -f /etc/rancher/rke2/config.yaml.d/90-harvester-vip.yaml
```
1 sec... more....
```
r":"v3rpc/health.go:61","msg":"grpc service status changed","service":"","status":"SERVING"}
r":"etcdserver/server.go:759","msg":"started as single-node; fast-forwarding election ticks","l>
r":"embed/etcd.go:633","msg":"serving peer traffic","address":"127.0.0.1:2400"}
r":"embed/etcd.go:292","msg":"now serving peer/client/metrics","local-member-id":"c5a37df222778>
r":"embed/etcd.go:603","msg":"cmux::serve","address":"127.0.0.1:2400"}
eived: \"terminated\", canceling context..."
t temporary data store connection: failed to get etcd status: context canceled"
t temporary data store connection: etcd datastore is not started"
t temporary data store connection: etcd datastore is not started"
t temporary data store connection: etcd datastore is not started"
t temporary data store connection: etcd datastore is not started"
```
I will reinstall Harvester again via ISO and start with a static VIP and `mgmt-br` again... then look at the files and containers in more detail... but the bug was simple: change the VIP from mode: static to mode: dhcp in `/oem/90_custom.yaml` and then reboot.
b
I remember getting the exact same error message (about the IP not being what is expected) after I accidentally switched two NIC cables (both DHCP) on the node after the RKE2 install.
h
@brash-petabyte-67855 thanks but I've confirmed the NIC ports already.