Part 109: Mind the maintenance window

We're going to have a puppy in about one week. Cannot wait!

BUT ... our home router was in a bit of a danger zone considering the puppy, so I moved the home router from our living room to my home office.

First things first

As our home is very much monitored by Zabbix, of course I needed to first set up a maintenance window before starting the move.
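For those who prefer to script this instead of clicking through the frontend, here is a minimal sketch of what a one-time maintenance window could look like as a Zabbix JSON-RPC request body. It only builds the payload and sends nothing. The function name, the group IDs and the window name are made up for illustration. The `groups` parameter is what I understand recent Zabbix versions to expect (older releases used `groupids`), and authentication also differs between versions, so check the API documentation for your release.

```python
import json
import time


def build_maintenance_request(name, group_ids, duration_s, now=None, request_id=1):
    """Build a JSON-RPC body for Zabbix maintenance.create (nothing is sent)."""
    start = int(time.time() if now is None else now)
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": name,
            # The window expires on its own once active_till has passed.
            "active_since": start,
            "active_till": start + duration_s,
            # Assumption: recent API style; older versions expect "groupids".
            "groups": [{"groupid": gid} for gid in group_ids],
            # timeperiod_type 0 is a one-time period; period is in seconds.
            "timeperiods": [
                {"timeperiod_type": 0, "start_date": start, "period": duration_s}
            ],
        },
        "id": request_id,
    }


if __name__ == "__main__":
    # Hypothetical group IDs and a one-hour window.
    body = build_maintenance_request("Router move", ["2", "4"], 3600)
    print(json.dumps(body, indent=2))
```

Keeping `active_till` close to the expected end of the work means the window ends by itself even if nobody remembers to remove it.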

Next I checked from my local Jenkins if my home router daily backup had succeeded last night. Yes, it had.

If you are curious why it's raining for the Jenkins job status, well, a few nights earlier the backup had failed with this spectacular error. Something to do with Jenkins and/or my Mac running Jenkins:

01:52:00 Started by timer
01:52:00 Running as SYSTEM
01:52:00 Building on the built-in node in workspace /Users/jpikkarainen/.jenkins/workspace/WhatsUpHome/BackupAsusToPi
01:52:00 [BackupAsusToPi] $ /bin/sh -xe /var/folders/wj/2zz9vf1x5vq75mdn4tq117fw0000gp/T/jenkins13524886402736305905.sh
01:52:00 FATAL: command execution failed
01:52:00 java.io.IOException: error=0, posix_spawn failed
01:52:00 	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
01:52:00 	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:295)
01:52:00 	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:225)
01:52:00 	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1126)
01:52:00 Caused: java.io.IOException: Cannot run program "/bin/sh" (in directory "/Users/jpikkarainen/.jenkins/workspace/WhatsUpHome/BackupAsusToPi"): error=0, posix_spawn failed
01:52:00 	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
01:52:00 	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
01:52:00 	at hudson.Proc$LocalProc.<init>(Proc.java:252)
01:52:00 	at hudson.Proc$LocalProc.<init>(Proc.java:221)
01:52:00 	at hudson.Launcher$LocalLauncher.launch(Launcher.java:995)
01:52:00 	at hudson.Launcher$ProcStarter.start(Launcher.java:507)
01:52:00 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:144)
01:52:00 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:92)
01:52:00 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
01:52:00 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
01:52:00 	at hudson.model.Build$BuildExecution.build(Build.java:199)
01:52:00 	at hudson.model.Build$BuildExecution.doRun(Build.java:164)
01:52:00 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:527)
01:52:00 	at hudson.model.Run.execute(Run.java:1833)
01:52:00 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
01:52:00 	at hudson.model.ResourceController.execute(ResourceController.java:101)
01:52:00 	at hudson.model.Executor.run(Executor.java:446)
01:52:00 Build step 'Execute shell' marked build as failure
01:52:00 Started calculate disk usage of build
01:52:00 Finished Calculation of disk usage of build in 0 seconds
01:52:00 Started calculate disk usage of workspace
01:52:00 Finished Calculation of disk usage of workspace in 0 seconds
01:52:01 Finished: FAILURE

Do the move

For the next part, nothing much to blog about: move the router from the living room to my home office, attach the cables. Then, in our hallway, switch a cable on the patch panel from one port to another. Plug some cables, power on the router, done.

Or was it really done?

This is the part where Zabbix can be so handy. And, today, SO WEIRD. This is proof that I often write these blog posts in real time. For the duration of the move, my Raspberry Pi, and thus Zabbix, was powered on all the time. I can access my Zabbix just fine. It reports fresh new data. Yet still, what I see for the problems right now is

Wait? What? I usually have at least SOME problems there. If nothing else, at least my Proxmox is complaining about something, as my Linux laptop and its Proxmox are my real lab rats for anything.

Now, let's debug this together.

Debug Zabbix logs

Next, I went to see the Zabbix server log on my Raspberry Pi, /var/log/zabbix/zabbix_server.log. Ah-ha! IP addresses had changed even though I'm pretty sure I had hardcoded them to stay the same, no matter what happens with the home router.

2903748:20250131:201747.895 error reason for "192.168.50.45:proxmox.qemu.mem[qemu/101]" changed: Cannot perform request: Failed to connect to 192.168.50.45 port 8006 after 3075 ms: Couldn't connect to server

That's likely the reason for my Proxmox issues, but what else happened? My Cozify is not alerting anything either, and it's a separate physical appliance. What's going on with that?
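When the log is long, pulling out every address Zabbix failed to reach can be quicker than scrolling. Below is a small sketch that extracts IP and port pairs from lines shaped like the one above. The regular expression is my own guess based on that single log line, not an official Zabbix log format, and the function name is hypothetical.

```python
import re

# Matches the "Failed to connect to <ip> port <port>" fragment seen in the
# log line above. Assumption: other failures are worded the same way.
CONNECT_FAILURE = re.compile(
    r"Failed to connect to (\d{1,3}(?:\.\d{1,3}){3}) port (\d+)"
)


def unreachable_endpoints(lines):
    """Return the set of (ip, port) pairs the log reports as unreachable."""
    found = set()
    for line in lines:
        match = CONNECT_FAILURE.search(line)
        if match:
            found.add((match.group(1), int(match.group(2))))
    return found


sample = (
    '2903748:20250131:201747.895 error reason for '
    '"192.168.50.45:proxmox.qemu.mem[qemu/101]" changed: '
    "Cannot perform request: Failed to connect to 192.168.50.45 port 8006 "
    "after 3075 ms: Couldn't connect to server"
)
print(unreachable_endpoints([sample]))  # {('192.168.50.45', 8006)}
```

On the Pi itself, the same function could be fed the open log file, for example `unreachable_endpoints(open("/var/log/zabbix/zabbix_server.log"))`.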

Debug Cozify logs

Not exactly on Cozify itself, but rather on my Raspberry Pi, as it uses my custom Python scripts to harvest the Cozify data. This is what is happening there:

cozify.Error.ConnectionError: Connection error: HTTPSConnectionPool(host='cloud2.cozify.fi', port=443): Max retries exceeded with url: /ui/0.2/user/hubkeys (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fb968ce50>: Failed to resolve 'cloud2.cozify.fi' ([Errno -3] Temporary failure in name resolution)"))

Ah-ha! Next clue, some DNS error. My home router is -- or should be -- running a local DNS server, so let's check that next. Can my Raspberry Pi resolve DNS names OK?

pi@raspberrypi ~/g/broadlink_ac_mqtt (master)> ping google.com
PING google.com (216.58.211.238) 56(84) bytes of data.
64 bytes from mad01s24-in-f238.1e100.net (216.58.211.238): icmp_seq=1 ttl=59 time=4.64 ms
64 bytes from mad07s20-in-f14.1e100.net (216.58.211.238): icmp_seq=2 ttl=59 time=4.73 ms
64 bytes from mad07s20-in-f14.1e100.net (216.58.211.238): icmp_seq=3 ttl=59 time=4.83 ms
64 bytes from mad01s24-in-f238.1e100.net (216.58.211.238): icmp_seq=4 ttl=59 time=4.63 ms

Yes it can.
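Ping only proves that the system resolver works for that one name at that moment. The same kind of check can be done from Python with `socket.getaddrinfo`, the standard library call that raises `socket.gaierror` when a name cannot be resolved, which is the kind of failure behind the "Temporary failure in name resolution" text in the error above. This is a minimal sketch. The function name and the `resolver` parameter (there only to make the function testable without a network) are my own additions.

```python
import socket


def can_resolve(hostname, port=443, resolver=socket.getaddrinfo):
    """Return (True, addresses) if hostname resolves, else (False, error text)."""
    try:
        results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # Errno -3 on Linux is the "Temporary failure in name resolution" case.
        return False, str(exc)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    addresses = sorted({info[4][0] for info in results})
    return True, addresses
```

Calling `can_resolve("cloud2.cozify.fi")` on the Pi would show whether the name the script choked on resolves right now, rather than testing some other name.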

What else could be going on? Let me restart my custom Python stuff.

... or, while I typed that sentence, I decided not to do that, as I realized that those errors must have happened while I made the physical move. True enough, my Zabbix is showing only a short "oops, no data" gap on the temperatures graph, for example

Then it hit me

Let's get back to the very beginning of this blog post -- the maintenance window. I had marked ALL my host groups to be under maintenance, which is something I don't do even at work. Thus, all this was murky waters for me.

Of course the maintenance window was the root cause. I had not removed the maintenance window, and after I finally did, this happened:

Now that's more like it. I have a Selenium instance running on my Proxmox, and also on my personal Mac, but both of those are now unreachable due to whatever is going on with the IP addresses. I'll leave that for another story, but here you can see live how Zabbix can help you find something that's invisible to you and, over time, make it less invisible.
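A forgotten maintenance window is easy to catch with a small check. The sketch below takes records shaped the way I understand the Zabbix `maintenance.get` API method returns them (with `active_since` and `active_till` as Unix timestamps in strings) and lists the ones that are still in force. The sample records are made up, and the check looks only at the outer active range, not at the individual time periods inside a maintenance.

```python
import time


def active_maintenances(maintenances, now=None):
    """Return names of maintenance periods whose active range covers `now`."""
    now = int(time.time()) if now is None else int(now)
    return [
        m["name"]
        for m in maintenances
        if int(m["active_since"]) <= now < int(m["active_till"])
    ]


# Hypothetical records, not real data from my Zabbix.
sample = [
    {"maintenanceid": "1", "name": "Router move",
     "active_since": "1000", "active_till": "5000"},
    {"maintenanceid": "2", "name": "Old window",
     "active_since": "100", "active_till": "200"},
]
print(active_maintenances(sample, now=3000))  # ['Router move']
```

An empty problems list combined with a non-empty result here is a strong hint that the silence is caused by maintenance rather than by good health.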