The following is a story about partially blind monitoring, with many details missing for reasons explained later in this post, but it is also a story about how a single hint on your dashboard can sometimes be your best friend.
Last year the SD card in my Raspberry Pi 4 started to tell me it should be retired. There were some I/O errors and, worse than that, the I/O request times suddenly climbed to as much as 10,000-20,000 milliseconds.
Now, I suspect the same is happening to the old external HDD that is connected to my home router. The router runs the Asus WRT-Merlin firmware, which comes with entware packages. The latter are stored -- along with some data files -- on the external HDD. The hard disk has been working just fine, but now, possibly not for much longer.
How did Zabbix help me spot this?
First of all, without me changing anything and without any change in network traffic, the home router started to consume a lot of CPU. It changed from this
.. to this.
Now, when hovering over any spike, we get this:
26% I/O wait time? 44% system time? That's not normal.
While that is happening, load averages are also very high.
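For readers who want to cross-check numbers like these from the shell, here is a minimal Python sketch of how I/O wait and system time percentages can be derived from two samples of the aggregate `cpu` line in /proc/stat. It assumes a Linux system with Python available (not a given on a router), and the function names are made up for this illustration; this is not how Zabbix itself computes the values.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
# Guest time is already counted inside "user", so it is left out here.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(value) for value in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Share of each CPU state between two samples, in percent."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_sample(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as handle:
        return parse_cpu_line(handle.readline())


def sample_over(seconds=5):
    """Hypothetical helper: take two samples and return the percentages."""
    first = read_cpu_sample()
    time.sleep(seconds)
    return cpu_percentages(first, read_cpu_sample())
```

Calling `sample_over(5)` during a spike and looking at the `iowait` and `system` entries should give figures comparable to what the graph tooltip shows, although the sampling windows will differ.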
Next, I went to verify that nothing was consuming too much RAM. For that, rather than the current memory usage, I wanted to see the trend over a longer period. Here's the 30-day graph -- yes, there are changes, but if anything, they're for the better.
Highway to shell
That's about where the hints from Zabbix end, though. I'm monitoring the router through SNMP, and something makes the router's snmpd expose only a very limited set of data. Even though I have enabled the block device and filesystem SNMP templates for the host, and those templates use the same HashiCorp Vault secret as the working SNMP Interfaces and SNMP Generic templates, they return nothing. Go figure.
Maybe the device's snmpd does not reveal everything I would want it to reveal, as I have not touched its config apart from the SNMP get community, which I set through the Asus WRT-Merlin web interface. Or, perhaps more likely, the snmpd on the device does not have those deeper system MIBs enabled, only the networking part, to keep the resource usage as small as possible.
Anyway, for these reasons I had to continue the investigation through the shell. Sadly, this did not take me too far.
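One way to test that second guess is to walk the storage-related subtrees by hand and see whether the agent answers with data or with a "not here" message. The sketch below is only an illustration: the helper name is made up, and the two OIDs are my assumption about which subtrees storage templates typically rely on, not something taken from the templates themselves. The marker strings are the messages the net-snmp command line tools print for a missing object, a missing instance, or the end of the MIB view.

```python
# Messages printed by net-snmp tools when the agent does not serve an OID.
MISSING_MARKERS = (
    "No Such Object available on this agent at this OID",
    "No Such Instance currently exists at this OID",
    "No more variables left in this MIB View",
)

# Assumption: subtrees that storage-related templates are likely to need.
SUBTREES = {
    "hrStorage (HOST-RESOURCES-MIB)": "1.3.6.1.2.1.25.2",
    "diskIOTable (UCD-DISKIO-MIB)": "1.3.6.1.4.1.2021.13.15.1",
}


def subtree_is_exposed(walk_output):
    """Return True if the snmpwalk output contains at least one real value.

    Lines carrying one of the 'missing' messages are ignored, so an empty
    walk or a walk that only yields such messages counts as not exposed.
    """
    for line in walk_output.splitlines():
        if not line.strip():
            continue
        if any(marker in line for marker in MISSING_MARKERS):
            continue
        return True
    return False
```

The output to feed in would come from something like `snmpwalk -v2c -c <community> <router> 1.3.6.1.2.1.25.2`, with the community and router address filled in for your own setup. If the interface subtree returns values while these two do not, the agent is simply not serving them.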
- dmesg is not reporting any I/O errors
- the usual file operations on the hard drive seemed fast in a few tests
- either due to the home router or due to my very old and cheap external HDD, SMART is not properly supported
smartctl 7.4 2023-08-01 r5530 [aarch64-linux-4.1.52] (localbuild)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: WD
Product: Elements 10A2
Revision: 1033
Compliance: SPC-4
User Capacity: 500,074,283,008 bytes [500 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
Rotation Rate: 5400 rpm
Serial number: WX91E13YDA31
Device type: disk
Local Time is: Wed Aug 7 23:18:46 2024 EEST
SMART support is: Unavailable - device lacks SMART capability.
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature: 0 C
Drive Trip Temperature: 0 C
Error Counter logging not supported
No Self-tests have been logged
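With SMART giving nothing useful, another angle is to measure request latency directly, the same symptom that gave away the SD card. Below is a minimal Python sketch that derives the average wait per I/O request from two snapshots of /proc/diskstats, roughly what iostat reports as await. It assumes a Linux kernel that exposes /proc/diskstats with the standard field layout; the function names and the device name `sda` in the usage note are placeholders for illustration.

```python
def parse_diskstats(text):
    """Map device name -> (reads, ms_reading, writes, ms_writing).

    Columns in /proc/diskstats: major, minor, name, reads completed,
    reads merged, sectors read, ms spent reading, writes completed,
    writes merged, sectors written, ms spent writing, ...
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 11:
            continue  # skip blank or truncated lines
        stats[parts[2]] = (
            int(parts[3]),   # reads completed
            int(parts[6]),   # milliseconds spent reading
            int(parts[7]),   # writes completed
            int(parts[10]),  # milliseconds spent writing
        )
    return stats


def average_wait_ms(before, after, device):
    """Average milliseconds per completed request between two snapshots.

    Returns None when the device completed no requests in the interval.
    """
    reads0, read_ms0, writes0, write_ms0 = before[device]
    reads1, read_ms1, writes1, write_ms1 = after[device]
    requests = (reads1 - reads0) + (writes1 - writes0)
    if requests <= 0:
        return None
    return ((read_ms1 - read_ms0) + (write_ms1 - write_ms0)) / requests
```

In practice you would read /proc/diskstats twice, some seconds apart, parse both snapshots and call `average_wait_ms(first, second, "sda")`. A healthy spinning disk usually stays in the tens of milliseconds; values in the thousands would point at the same kind of trouble the SD card had.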
Has the disk broken yet? No, but I'm not going to trust it much longer. Luckily my local Jenkins instance is taking backups of the external HDD every day.
So, that's it, end of story, nothing more can be done? Everybody, go home? No, of course not, this is Zabbix.
There's an agent for that
Searching the entware repository -- that's like your typical package repository, but for this router -- shows what's there:
# opkg search *zabbix*
zabbix-agentd - 7.0.0-1
zabbix-agentd - 7.0.0-1
zabbix-agentd - 7.0.0-1
.. and once I had installed the agent and created another host for my router in Zabbix, I started to get more details. A LOT more details.
Displaying 401 to 418 of 418 found
418 items for this host, and the best part is that now it also contains the storage.
As I just installed the agent, there's not much data yet, but I will have some answers in a day or two, once Zabbix has been collecting data for a while. This story will continue then. Still, it's amazing that Zabbix is so universal and lightweight that its agent is available even for a humble home router. And why wouldn't it be? It only takes a few megs of RAM and practically no CPU.