Part 86: Thanks to Zabbix, a possible early warning about a dying HDD

What's up, home? part 86 cover image -- an exploding hard drive

The following is a story about partially blind monitoring, with many details missing for reasons explained later in this post, but it is also a story about how a single hint on your dashboard can sometimes be your best friend.


Last year my Raspberry Pi 4 SD card started to tell me it should be retired. There were some I/O errors and, more tellingly, the I/O request times suddenly climbed to 10,000-20,000 milliseconds.

Now, I suspect the same is happening to the old external HDD that is connected to my home router. The router runs on Asus WRT-Merlin firmware, which comes with entware packages. The latter are stored -- along with some data files -- on the external HDD. The hard disk has been working just fine, but possibly not for much longer.

How did Zabbix help me spot this?

First of all, without me changing anything and with no change in network traffic, the home router started to consume lots of CPU. It changed from this

Before the odd behaviour

.. to this.

Spikes! Spikes everywhere!

Now, when hovering over any spike, we get this:

Lots of I/O wait

26% I/O wait time? 44% system time? That's not normal.

While that is happening, load averages are also very high.

Load average
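The percentages in that tooltip can also be cross-checked by hand on the router. Below is a rough sketch, assuming the router shell has awk and a Linux-style /proc/stat; the function name cpu_share is made up for this example. It takes two "cpu" lines sampled a few seconds apart and prints how much of the elapsed CPU time went to I/O wait and to system time.

```shell
# Share of I/O wait and system time between two "cpu" lines from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal
cpu_share() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
            if (total == 0) { print "no ticks elapsed"; exit 1 }
            printf "iowait=%.1f%% system=%.1f%%\n", 100 * d[6] / total, 100 * d[4] / total
        }'
}

# Example with made-up samples:
cpu_share "cpu 100 0 50 800 50 0 0 0" "cpu 120 0 94 810 76 0 0 0"
```

On a live system the two arguments would come from something like `grep '^cpu ' /proc/stat`, run twice with a short sleep in between.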

Next, I went to verify that nothing was consuming too much RAM. For that, rather than the current memory usage, I wanted to see the trend over a longer period. Here's a 30-day graph -- yes, there are changes, but if anything, they are for the better.

RAM usage

Highway to shell

That's about where the hints from Zabbix end, though. I'm monitoring the router through SNMP, and something makes the router's snmpd expose only a very limited set of data. Even though I have enabled the block device and filesystem SNMP templates for the host, and those templates use the same HashiCorp Vault secret as the working SNMP Interfaces and SNMP Generic templates, they return nothing. Go figure.

Templates for home router

Maybe the device's snmpd does not reveal everything I would want it to reveal, as I have not touched its config apart from the SNMP get community, which I set through the Asus WRT-Merlin web interface. Or, maybe more likely, the snmpd on the device does not have those deeper system MIBs enabled, only the networking part, to keep resource usage as small as possible.

Anyway, for these reasons I had to continue the investigation in the shell. Sadly, this did not take me too far.
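One way to test that guess would be to walk a few subtrees by hand and see which ones the agent answers for. This is only a sketch: it assumes net-snmp's snmpwalk is installed on the machine doing the check, the helper names are invented for this example, and the host and community in the comments are placeholders. The OIDs are the standard ifTable, hrStorageTable (HOST-RESOURCES-MIB) and diskIOTable (UCD-DISKIO-MIB); whether the Zabbix templates in question query exactly these tables is an assumption.

```shell
# Classify the first line snmpwalk returned for one subtree.
subtree_status() {
    case "$1" in
        ""|*"No Such Object"*|*"No Such Instance"*|*"No more variables"*) echo "missing" ;;
        *"Timeout"*) echo "unreachable" ;;
        *) echo "present" ;;
    esac
}

# usage: check_subtree HOST COMMUNITY OID
check_subtree() {
    subtree_status "$(snmpwalk -v2c -c "$2" "$1" "$3" 2>&1 | head -n 1)"
}

# Placeholder host and community, adjust to your own setup:
# check_subtree 192.168.1.1 public 1.3.6.1.2.1.2.2             # ifTable
# check_subtree 192.168.1.1 public 1.3.6.1.2.1.25.2.3          # hrStorageTable
# check_subtree 192.168.1.1 public 1.3.6.1.4.1.2021.13.15.1    # diskIOTable
```

If the interface table comes back as "present" while the storage and disk I/O tables come back as "missing", that would point to a trimmed-down snmpd rather than a problem on the Zabbix side.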

  • dmesg is not reporting any I/O errors
  • usual file operations on the hard drive seemed fast in a few quick tests
  • due to either the home router or my very old and cheap external HDD, SMART is not properly supported:
smartctl 7.4 2023-08-01 r5530 [aarch64-linux-4.1.52] (localbuild)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              Elements 10A2
Revision:             1033
Compliance:           SPC-4
User Capacity:        500,074,283,008 bytes [500 GB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Rotation Rate:        5400 rpm
Serial number:        WX91E13YDA31
Device type:          disk
Local Time is:        Wed Aug  7 23:18:46 2024 EEST
SMART support is:     Unavailable - device lacks SMART capability.
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C
Error Counter logging not supported
No Self-tests have been logged
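The output above looks like smartctl talking to the USB enclosure in plain SCSI mode. Some USB bridges do pass SMART data through when smartctl is told which passthrough type to use, so it may be worth a try before giving up; there is no guarantee this particular enclosure supports any of them. A minimal sketch, with an invented helper name probe_smart and /dev/sda as an assumed device path:

```shell
# Try the USB passthrough types smartctl knows and print the first one
# for which the drive reports SMART as available.
# usage: probe_smart /dev/sdX
probe_smart() {
    for dtype in sat usbjmicron usbsunplus usbcypress; do
        if smartctl -d "$dtype" -i "$1" 2>/dev/null | grep -q 'SMART support is: *Available'; then
            echo "$dtype"
            return 0
        fi
    done
    echo "none"
    return 1
}

# On the router (device path is an assumption):
# probe_smart /dev/sda
```

If a type is printed, `smartctl -d <type> -a /dev/sda` should then show the usual attribute table; if the result is "none", the enclosure really is hiding SMART.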

Has the disk broken yet? No, but I'm not going to trust it much longer. Luckily my local Jenkins instance is taking backups of the external HDD every day.

So, that's it, end of story, nothing more can be done? Everybody, go home? No, of course not, this is Zabbix.

There's an agent for that

Searching the entware repository -- that's like your typical package repository, but for this router -- shows what's there:

# opkg search *zabbix*
zabbix-agentd - 7.0.0-1
zabbix-agentd - 7.0.0-1
zabbix-agentd - 7.0.0-1
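Getting from that search result to a reporting agent takes an install and a few lines of configuration. The sketch below is hedged: Server, ServerActive and Hostname are standard zabbix_agentd.conf parameters, but the helper name agent_conf, the IP address, the host name and the config path under /opt/etc are placeholders and assumptions, not details from my setup.

```shell
# Print a minimal agent configuration.
# usage: agent_conf ZABBIX_SERVER_IP AGENT_HOSTNAME
agent_conf() {
    cat <<EOF
Server=$1
ServerActive=$1
Hostname=$2
EOF
}

# On the router, roughly (values and path are assumptions):
#   opkg update && opkg install zabbix-agentd
#   agent_conf 192.0.2.10 home-router > /opt/etc/zabbix_agentd.conf
#   ...then restart the agent.
```

The Hostname value has to match the host name configured in Zabbix for active checks to work.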

When I then created another host for my router in Zabbix, I started to get more details. A LOT more details.

Displaying 401 to 418 of 418 found

418 items for this host, and the best part is that they now also cover the storage.

Zabbix agent details
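The storage number I care about most is the same one that gave away the SD card: how long a request waits on average. While Zabbix collects its own data, this can be approximated by hand from two samples of /sys/block/<dev>/stat. A rough sketch follows, assuming the kernel exposes that file for the disk and that awk is available; io_await is a name made up for this example, and this is not necessarily how the Zabbix template calculates its values.

```shell
# Average wait per completed request, in ms, between two samples of
# /sys/block/<dev>/stat. Fields used: 1 = reads completed, 4 = ms spent
# reading, 5 = writes completed, 8 = ms spent writing.
io_await() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { r = $1; rt = $4; w = $5; wt = $8 }
        NR == 2 {
            dr = $1 - r; dw = $5 - w
            ra = (dr ? ($4 - rt) / dr : 0)
            wa = (dw ? ($8 - wt) / dw : 0)
            printf "r_await=%.1fms w_await=%.1fms\n", ra, wa
        }'
}

# Example with made-up samples:
io_await "100 0 800 500 50 0 400 1000 0 900 1500" \
         "110 0 880 700 60 0 480 151000 0 9000 151700"
```

On the router the two arguments would be `cat /sys/block/sda/stat` (device name assumed) taken some seconds apart. Values in the thousands of milliseconds would match the pattern I saw on the dying SD card.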

As I just installed the agent, there's not much data yet, but I will have some answers in a day or two, once Zabbix has been collecting data for a while. This story will continue then. Still, it's amazing that Zabbix is so universal and lightweight that its agent is available even for a humble home router. And why wouldn't it be? It only takes a few megs of RAM and practically no CPU.

Zabbix agent htop
