Part 62: Zabbix the AI alert journalist

Zabbix the AI journalist

As I am now entering the dungeons of generative AI, of course I'm having some fun. Instead of just analyzing a single alert through Zabbix manual actions, how about using the Zabbix API and the official zabbix_utils library to query the active alerts and feed the whole list to GPT4All?

See what happens if I feed several active alerts at once to my GPT4All Python script, in which I changed the default max_tokens=200 to max_tokens=1000, and ask it to generate various content. To make this blog post more readable, I applied some formatting by hand.

'tis the script

The Zabbix API part of the script is pretty much straight from a Zabbix blog post, and the GPT4All part is almost 1:1 with their own hello world example script. The two are glued together with some special ninja moves and very limited time. But it works!

#!/usr/bin/python3
from zabbix_utils import ZabbixAPI
from gpt4all import GPT4All
import argparse
ZABBIX_SERVER = "https://my.zabbix.server/"
zapi = ZabbixAPI(ZABBIX_SERVER)
zapi.login(token='my_access_token')
# Get a list of all issues (AKA tripped triggers)
triggers = zapi.trigger.get(
   only_true=1,
   skipDependent=1,
   monitored=1,
   active=1,
   output="extend",
   expandDescription=1,
   selectHosts=["host"],
)
# Do another query to find out which issues are Unacknowledged
unack_triggers = zapi.trigger.get(
   only_true=1,
   skipDependent=1,
   monitored=1,
   active=1,
   output="extend",
   expandDescription=1,
   selectHosts=["host"],
   withLastEventUnacknowledged=1,
)
unack_trigger_ids = [t["triggerid"] for t in unack_triggers]
for t in triggers:
   t["unacknowledged"] = t["triggerid"] in unack_trigger_ids
# Build a newline-separated list containing only "tripped" triggers
alerts = ""
for t in triggers:
   if int(t["value"]) == 1:
      alerts += "{} - {} {}\n".format(
               t["hosts"][0]["host"],
               t["description"],
               "(Unack)" if t["unacknowledged"] else "",
               )
parser = argparse.ArgumentParser(description='Pass question to GPT4All')
# The question is required; without it there is nothing to prepend to the alerts
parser.add_argument('-q', '--question', required=True)
args = parser.parse_args()
model = GPT4All('wizardlm-13b-v1.2.Q4_0.gguf')
system_template = 'A chat between a curious user and an artificial intelligence assistant.'
# many models use triple hash '###' for keywords, Vicunas are simpler:
prompt_template = 'USER: {0}\nASSISTANT: '
with model.chat_session(system_template, prompt_template):
   # Put the alert list on its own lines after the question
   response1 = model.generate(args.question + "\n" + alerts, max_tokens=1000)
   print(response1)
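
The alert list is the only thing the model ever sees, so it can be handy to try out the formatting step without a Zabbix server. Below is a minimal sketch, assuming the trigger dictionaries look like the ones trigger.get returns in the script above. The function name format_alerts and the sample data are made up for illustration. The prompts later in this post tell the model that alerts are separated with a newline, so the sketch joins them with one.

```python
def format_alerts(triggers, unack_ids):
    """Return one line per tripped trigger, joined with newlines."""
    lines = []
    for t in triggers:
        if int(t["value"]) != 1:
            continue  # skip triggers that are not in a problem state
        line = "{} - {}".format(t["hosts"][0]["host"], t["description"])
        if t["triggerid"] in unack_ids:
            line += " (Unack)"
        lines.append(line)
    return "\n".join(lines)


# Hypothetical sample data, shaped like a trigger.get() response
sample = [
    {"triggerid": "1", "value": "1", "description": "Backup error",
     "hosts": [{"host": "lappy.whatsuphome.local"}]},
    {"triggerid": "2", "value": "0", "description": "Service is down",
     "hosts": [{"host": "lappy.whatsuphome.local"}]},
    {"triggerid": "3", "value": "1", "description": "Unavailable by ICMP ping",
     "hosts": [{"host": "Living room TV"}]},
]

print(format_alerts(sample, unack_ids={"1"}))
```

Keeping this as a pure function means the Zabbix queries and the GPT4All call can stay untouched while the line format is tweaked.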

Using the script

Using the script is easy. I just call it with:

python3 ./zabbix_journalist.py --question "Explain the active alerts like a teenager"

And it responds in its own creative way.

🚨 Alerts, alerts everywhere! Let's break it down like a boss. 💪

1️⃣ lappy.whatsuphome.local - 192.168.xxx.xxx: Backup error (Unack)

This means there was an issue with the backup process on this device, and it didn't complete successfully. It's like when you try to save your homework but something goes wrong, so you have to do it again. 📝✏️

2️⃣ lappy.whatsuphome.local - ES whatsuphome-1: Node whatsuphome-1 has been restarted (uptime < 10m) (Unack)

This alert means that a server called "whatshome-1" was restarted, and it's been running for less than 10 minutes. It's like when you accidentally turn off your gaming console while playing your favorite game; you have to wait until it starts up again. 🕹️

3️⃣ lappy.whatsuphome.local - ES: Cluster has been restarted (uptime < 10m) (Unack)

This alert is similar to the previous one, but it's about a cluster of servers that got restarted and have been running for less than 10 minutes. It's like when you accidentally knock over your tower of blocks; you have to rebuild them all again! 🏗️

4️⃣ lappy.whatsuphome.local - ES: Service is down (Unack)

This alert means that a service on the server isn't working properly, like when your favorite app crashes and you can't use it until it gets fixed. 📱

Embed the stuff inside Zabbix

I still need to build the Zabbix 7.0 custom widget that would communicate directly with GPT4All, but for now, if I just convert the plain text to HTML with txt2html or a similar tool, put the resulting HTML file within reach of Zabbix, and embed that file with the Zabbix URL widget...

Zabbix URL widget

... this happens. Sure, that's super ugly without any additional formatting and CSS styles, but the concept is cool.
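
The text-to-HTML step could be sketched in a few lines of Python as well. This is not txt2html and not what I actually ran, just a rough stand-in that uses only the standard library. The function name text_to_html and the default page title are made up for illustration.

```python
import html


def text_to_html(text, title="Zabbix AI journalist"):
    """Wrap plain text in a minimal HTML page, one <p> per non-empty line."""
    paragraphs = [
        "<p>{}</p>".format(html.escape(line.strip()))
        for line in text.splitlines()
        if line.strip()
    ]
    return (
        "<!DOCTYPE html>\n"
        "<html><head><meta charset=\"utf-8\">"
        "<title>{}</title></head>\n"
        "<body>\n{}\n</body></html>\n"
    ).format(html.escape(title), "\n".join(paragraphs))


# Alert texts contain characters like "<" (uptime < 10m), so escaping matters
page = text_to_html("ES: Cluster has been restarted (uptime < 10m)\n\nAll good otherwise.")
print(page)
```

Writing the returned string to a file that a web server can serve would then give the URL widget something to point at.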

Zabbix AI journalist

As this is generative AI, we can get very creative. Let's do some more examples.

Group alerts by similarities

My prompt: Group alerts by similarities. Respond in table format. In alerts, DC means Designated Controller and ES means Elasticsearch. Alerts are separated with a newline.

GPT4All response:

1. Alerts related to system restarts or uptime issues:
       * lappy.whatsuphome.local - ES whatsuphome-1: Node whatsuphome-1 has been restarted (uptime < 10m) (Unack)
       * lappy.whatsuphome.local - ES: Cluster has been restarted (uptime < 10m) (Unack)
2. Alerts related to backup errors or issues:
       * lappy.whatsuphome.local - Backup error (Unack)
3. Alerts related to Pacemaker and resource status:
       * pacemaker1 - pacemaker1/pacemaker: DC was changed (Unack)
       * pacemaker1 - pacemaker1/pacemaker: Failed actions found (Unack)
       * pacemaker1 - pacemaker1/pacemaker: One or more resources have passed into "failed" state (Unack)
4. Alerts related to Zabbix and sensor issues:
       * Zabbix server - Hallway motion sensor battery getting low (Unack)
       * Zabbix server - Living room air is fucking dry (Unack)
       * Zabbix server - Outdoors lights power button off or fuse probably tripped (Unack)
5. Alerts related to disk space issues:
       * Work Mac - /Library/Developer/CoreSimulator/Volumes/watchOS_21S364: Free disk space is less than 5%
       * Work Mac - /Library/Developer/CoreSimulator/Volumes/iOS_21C62: Free disk space is less than 5%
       * Work Mac - /Library/Developer/CoreSimulator/Volumes/tvOS_21K364: Free disk space is less than 5%
6. Alerts related to service issues or not running services:
       * Zabbix server - smartmontools.service: Service is not running (Unack)
       * Zabbix server - gldriver-test.service: Service is not running (Unack)
       * Zabbix server - dhcpcd.service: Service is not running (Unack)
7. Alerts related to package installation or removal issues:
       * Zabbix server - Linux: Number of installed packages has been changed (Unack)
8. Alerts related to system availability by ICMP ping:
       * Personal MacBook Pro - Unavailable by ICMP ping (Unack)
       * Living room TV - Unavailable by ICMP ping (Unack)

How about some nature documentary?

My prompt: You are now a nature documentary guy David Attenborough. Comment the following Zabbix alerts like alerts would be animals. In alerts, DC means Designated Controller and ES means Elasticsearch. Alerts are separated with a newline.

GPT4All response:

1. lappy.whatsuphome.local - 192.168.xxx.xxx: Backup error (Unack): A rogue gorilla, disrupting the backup process with its unruly strength and agility.
2. lappy.whatsuphome.local - ES whatsuphome-1: Node whatsuphome-1 has been restarted (uptime < 10m) (Unack): A swift cheetah, quickly rejuvenating itself after a brief slumber.
3. lappy.whatsuphome.local - ES: Cluster has been restarted (uptime < 10m) (Unack): A colossal blue whale, rising from the depths to breach the surface with a mighty splash.
4. lappy.whatsuphome.local - ES: Service is down (Unack): A majestic eagle, grounded by an unfortunate injury, unable to soar through the skies.
5. pacemaker1 - pacemaker1/pacemaker: DC was changed (Unack): A nimble chameleon, adapting its coloration to blend in with its surroundings as it shifts between different roles.
6. Zabbix server - Hallway motion sensor battery getting low (Unack): A humble ant, struggling under the weight of a heavy load as its energy supply dwindles.
7. Zabbix server - Living room air is dry (Unack): A parched desert landscape, cracked and barren from prolonged drought.
8. pacemaker1 - pacemaker1/pacemaker: Failed actions found (Unack): A falcon, circling above its prey with precision and patience, waiting for the perfect moment to strike.
9. pacemaker1 - pacemaker1/pacemaker: One or more resources have passed into "failed" state (Unack): A once-thriving forest, now reduced to a desolate wasteland as its vital components succumb to decay and neglect.
10. Zabbix server - Outdoors lights power button off or fuse probably tripped: A campfire in the wilderness, extinguished by an unseen force, leaving only embers and darkness behind.
11. Lunch menus - A new lunch menu! (Unack): A vibrant coral reef teeming with life as a new species emerges to dazzle the underwater world.
12. Work Mac - /Library/Developer/CoreSimulator/Volumes/watchOS_21S364: Free disk space is less than 5% : A treacherous iceberg, looming in the distance with a warning of impending doom for those who venture too close.
13. Work Mac - /Library/Developer/CoreSimulator/Volumes/iOS_21C62: Free disk space is less than 5% : A vast desert sandstorm, engulfing everything in its path with relentless fury and destruction.
14. Work Mac - /Library/Developer/CoreSimulator/Volumes/tvOS_21K364: Free disk space is less than 5% : A collapsing glacier, slowly melting under the relentless heat of a changing climate.
15. Zabbix server - smartmontools.service: Service is not running (Unack): A slumbering bear, hibernating through the harsh winter months as it conserves energy and prepares for the coming spring.
16. Zabbix server - gldriver-test.service: Service is not running (Unack): A dormant volcano, bubbling with molten lava beneath its tranquil surface, waiting for the perfect moment to erupt once more.
17. Living room TV - Unavailable by ICMP ping (Unack): A silent, abandoned observatory, left to gather dust as it awaits the return of curious minds eager to explore the mysteries of the universe.

A tech blog post

My prompt: You are now a tech journalist. Write an entertaining blog post about the following Zabbix alerts. In alerts, DC means Designated Controller and ES is Elasticsearch. The alerts are separated with a newline.

GPT4All response: 

 Zabbix Alerts Galore! 🚨
=====================================

It's been a wild day in the world of tech, folks! The Zabbix alerts are buzzing with activity like bees around a honeypot. Let's dive into what's happening and why these alerts matter for your Designated Controllers (DC) and Elasticsearch (ES).

First up, we have lappy.whatsuphome.local - 192.168.xxx.xxx: Backup error (Unack), indicating a potential issue with the backup process on this designated controller. We'll need to investigate and resolve this promptly to avoid any data loss or corruption.

Next, we have two alerts related to Elasticsearch nodes whatsuphome-1 and the entire cluster: Node whatsuphome-1 has been restarted (uptime < 10m) (Unack), and Cluster has been restarted (uptime < 10m) (Unack). These alerts suggest that there might be some instability or performance issues with these ES nodes, which we'll need to address promptly.

Moving on, we have a service down alert for Elasticsearch itself: Service is down (Unack), indicating a more severe issue with the overall functionality of this critical system. We should prioritize getting this back up and running as soon as possible.

In the world of pacemaker1, we've got two alerts: DC was changed (Unack) and Failed actions found (Unack). These suggest that there might be some issues with the designated controller or its tasks, which require our immediate attention to prevent any disruptions in service.

Zabbix server is also causing a stir with three alerts: Hallway motion sensor battery getting low (Unack), Living room air is dry (Unack), and Outdoors lights power button off or fuse probably tripped (Unack). These indicate potential issues with the physical environment around our Zabbix server, which we should address to ensure optimal performance.

In the realm of lunch menus, there's an alert for a new lunch menu: A new lunch menu! (Unack), but this one seems less urgent and more exciting 😋

Moving on to hardware issues, we have three alerts related to low disk space on Work Mac devices running watchOS_21S364, iOS_21C62, and tvOS_21K364: Free disk space is less than 5% (Unack). These should be addressed promptly to avoid any data loss or performance issues.

Last but not least, we have three service-related alerts for Zabbix server: smartmontools.service: Service is not running (Unack), gldriver-test.service: Service is not running (Unack), and dhcpcd.service: Service is not available (Unack). These indicate that some critical services are down or not functioning properly, which we'll need to rectify as soon as possible.

In conclusion, it seems like there's a lot going on in the world of Zabbix alerts today! We should prioritize these issues based on their severity and potential impact on our systems, ensuring that everything is running smoothly and efficiently. Stay tuned for more updates as we tackle each one of these alerts head-on! 💻

An on-call handover from one team member to another

My prompt: Write like an operations team member during a handover to another team member about the following Zabbix alerts. The alerts are separated with a newline. 

GPT4All response: 

1. lappy.whatsuphome.local - 192.168.xxx.xxx: Backup error (Unack): We have a backup error on the "lappy" server at WhatsUpHome. The issue is currently unacknowledged, but we should monitor this closely to ensure that data isn't being lost or corrupted during backups.

2. lappy.whatsuphome.local - ES whatsuphome-1: Node whatsuphome-1 has been restarted (uptime < 10m) (Unack): The "ES" node on the "lappy" server has been restarted, and it's uptime is less than 10 minutes. We should investigate this issue to determine if there was a problem with the node or if it was part of an intentional maintenance operation.

3. lappy.whatsuphome.local - ES: Cluster has been restarted (uptime < 10m): The entire "ES" cluster on the "lappy" server has also been restarted, with uptime less than 10 minutes. This could be related to the previous alert about node whatsuphome-1 being restarted. We should investigate this issue as well.

4. lappy.whatsuphome.local - ES: Service is down (Unack): The "ES" service on the "lappy" server has gone offline, and we don't have any information about why yet. This needs immediate attention to determine if there was a problem with the service or if it was intentionally shut down for maintenance.

5. pacemaker1 - pacemaker/pacemaker: DC was changed (Unack): There has been a change in the data center configuration on "pacemaker1". We should investigate this to ensure that there are no issues with our infrastructure or services as a result of this change.

6. Zabbix server - Hallway motion sensor battery getting low: The hallway motion sensor's battery level is low, which could indicate an issue with the sensor itself or its power source. We should check on this to ensure that it doesn't affect our monitoring capabilities.

7. Zabbix server - Living room air is dry (Unack): This alert seems to be a humorous comment rather than a serious issue, but we should still double-check the environment around the Zabbix server in case there are any actual issues with the system's cooling or ventilation.

8. pacemaker1 - pacemaker/pacemaker: Failed actions found (Unack): There have been failed actions on "pacemaker1". We should investigate this to determine if there were any problems with our services that need attention.

9. pacemaker1 - pacemaker/pacemaker: One or more resources have passed into "failed" state (Unack): Additionally, one or more resources on "pacemaker1" have gone into a failed state. We should investigate this to determine if there were any problems with our services that need attention.

10. Zabbix server - Outdoors lights power button off or fuse probably tripped: The outdoor lights' power source seems to be off, which could indicate an issue with the power supply or a tripped fuse. We should check on this to ensure that it doesn't affect our monitoring capabilities.

11. Lunch menus - A new lunch menu! (Unack): This alert appears to be unrelated to system operations, but we should still acknowledge the update and inform relevant teams about any changes in their meal plans.

12. Work Mac - /Library/Developer/CoreSimulator/Volumes/watchOS_21S364: Free disk space is less than 5% (Unack): The free disk space on the "Work Mac" device running watchOS has dropped below 5%. We should monitor this closely to ensure that there's enough storage available for future updates and operations.

13. Work Mac - /Library/Developer/CoreSimulator/Volumes/iOS_21C62: Free disk space is less than 5% (Unack): The same issue has occurred on another device running iOS, so we should monitor this closely to ensure that there's enough storage available for future updates and operations.

Closing words

I'm surprised how capable GPT4All actually is. All of this happens without me training it on my alert history or anything else.

Maybe in addition to embedding its content inside my Zabbix, I will go on and make it create some blog posts for me. Not under these What's up, home? stories, as that would be confusing and annoying, but in a separate category or even on a separate Drupal instance. There's an API for that.
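
As a rough idea of what that could look like, here is a sketch that builds a request body for Drupal's core JSON:API module. It assumes JSON:API is the API in question, that it is enabled, and that the site has the default "article" content type. The function name build_article_payload is made up, and nothing is actually sent anywhere.

```python
import json


def build_article_payload(title, body_text):
    """Build a JSON:API document for creating an 'article' node."""
    return {
        "data": {
            "type": "node--article",  # assumes the default 'article' content type
            "attributes": {
                "title": title,
                "body": {"value": body_text, "format": "plain_text"},
            },
        }
    }


payload = build_article_payload("Zabbix Alerts Galore!", "It's been a wild day...")

# This would be POSTed to /jsonapi/node/article with the header
# Content-Type: application/vnd.api+json. Host and authentication
# are left out here because they depend on the site.
print(json.dumps(payload, indent=2))
```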

Comments

I’m really inspired, me and my team is following you, and started improving our monitoring environment.

Keep it up,

It would be great to train a model on your own infrastructure, you could include documentation from the manufacturer. Thinking of this makes my head spin :)
