# AI-Powered Infrastructure Monitoring with n8n, Ollama, and Zabbix
Getting a Telegram notification that says "Host CPU > 80%" is useful. Getting one that says "docker02 CPU spiked to 92% — likely caused by the weekly Proxmox backup that started 10 minutes ago; no action required" is transformative. That's what I built using n8n, Ollama, and Zabbix.
## The Stack
- Zabbix 7: Monitors 18 hosts across my homelab — hypervisors, Docker hosts, LXC containers, NAS, and network devices
- n8n: Workflow automation engine running on docker03
- Ollama: Local LLM server running Mistral 7B on a GTX 1050 Ti 4GB
- Telegram: Notification delivery via bot API
All of this runs entirely on local hardware. No OpenAI API keys, no cloud costs, no data leaving my network.
## Why AI-Powered Alerts?
Traditional monitoring alerts tell you what happened but rarely why. In a homelab with 18 monitored hosts, you learn quickly that most alerts are noise:
- CPU spikes during scheduled backups
- Memory pressure when Docker pulls new images
- Disk I/O alerts during Proxmox snapshot operations
- Network utilization spikes during media streaming
An AI layer between Zabbix and Telegram can provide context: "This problem correlates with the daily PBS backup window" or "Multiple hosts showing elevated CPU — check for a cluster-wide event."
## Architecture
Zabbix API → n8n (scheduled trigger) → Ollama (Mistral 7B) → Telegram Bot
- n8n polls Zabbix every hour for active problems
- Active problems are formatted and sent to Ollama for analysis
- Ollama generates a human-readable summary with context
- The summary is sent to Telegram via bot API
### Why problem.get and Not trigger.get
This is an important distinction. The Zabbix API offers two approaches:
- trigger.get: Returns triggers that are in a "problem" state, but a trigger can remain "firing" even after the underlying problem resolves
- problem.get: Returns only currently unresolved problems
Using trigger.get would flood the AI with stale alerts that have already been resolved. problem.get gives a clean picture of what's actually wrong right now.
## The n8n Workflow
### Step 1: Schedule Trigger
The workflow runs every hour via n8n's built-in cron trigger. I chose hourly because:
- Frequent enough to catch real issues before they escalate
- Infrequent enough to avoid Telegram notification fatigue
- Matches my Zabbix maintenance windows (backups run at specific hours)
### Step 2: Query Zabbix API
The HTTP Request node calls the Zabbix API:
```json
{
  "jsonrpc": "2.0",
  "method": "problem.get",
  "params": {
    "output": "extend",
    "selectTags": "extend",
    "recent": true,
    "sortfield": ["eventid"],
    "sortorder": "DESC"
  },
  "auth": "your-zabbix-api-token",
  "id": 1
}
```

This returns all active problems with full details: host, severity, timestamp, and any associated tags.
### Step 3: Format for AI Analysis
A Function node transforms the raw Zabbix data into a structured prompt:
```
You are a homelab infrastructure monitoring assistant.
Analyze the following active Zabbix problems and provide:
1. A brief summary of the overall infrastructure health
2. For each problem, assess severity and likely cause
3. Recommend whether immediate action is needed

Active problems:
- [Host: docker02] CPU utilization > 85% (started 22:15)
- [Host: pve01] Disk I/O wait > 20% (started 21:05)
```
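n8n Function nodes run JavaScript, so the transformation above can be sketched roughly like this (the severity labels are Zabbix's standard scale; how the hostname is surfaced depends on your setup, so this sketch keys on the problem name and `clock` timestamp that problem.get returns):

```javascript
// Sketch of an n8n Function node body: turn problem.get results into
// the analysis prompt. Assumes the previous HTTP Request node returned
// the raw JSON-RPC response, with problems in $json.result.
const SEVERITIES = ["Not classified", "Information", "Warning",
                    "Average", "High", "Disaster"];

function buildPrompt(problems) {
  const lines = problems.map((p) => {
    const started = new Date(p.clock * 1000)  // "clock" is a unix timestamp
      .toTimeString().slice(0, 5);            // keep HH:MM
    const sev = SEVERITIES[Number(p.severity)] ?? "Unknown";
    return `- [${sev}] ${p.name} (started ${started})`;
  });
  return [
    "You are a homelab infrastructure monitoring assistant.",
    "Analyze the following active Zabbix problems and provide:",
    "1. A brief summary of the overall infrastructure health",
    "2. For each problem, assess severity and likely cause",
    "3. Recommend whether immediate action is needed",
    "",
    "Active problems:",
    ...lines,
  ].join("\n");
}

// In n8n: return [{ json: { prompt: buildPrompt($json.result) } }];
```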
### Step 4: Ollama Analysis
The HTTP Request node sends the formatted prompt to Ollama:
```json
{
  "model": "mistral:7b-instruct-q4_0",
  "prompt": "...",
  "stream": false
}
```

Mistral 7B processes the prompt and returns a contextual analysis. A typical response:
```
Infrastructure Health: Minor Issues

The two active problems appear related. pve01 disk I/O started at 21:05,
which aligns with the daily PBS backup schedule (21:00 EST). The docker02
CPU spike at 22:15 is likely a downstream effect — containers on docker02
may be handling backup-related network I/O.

Recommended Action: No immediate action needed. Monitor for persistence
beyond the backup window (typically 45 minutes). If problems persist past
23:00, investigate PBS backup for stuck jobs.
```
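With `stream` set to false, Ollama's `/api/generate` endpoint returns one JSON object whose `response` field holds the generated text. A minimal sketch of building the request and unpacking the reply (the helper names are mine; the endpoint and field names are Ollama's):

```javascript
// Build the body for Ollama's /api/generate endpoint and pull the
// analysis text out of the non-streaming reply.
function buildOllamaRequest(prompt) {
  return {
    model: "mistral:7b-instruct-q4_0",
    prompt,
    stream: false,  // one JSON object instead of a token stream
  };
}

function extractAnalysis(ollamaReply) {
  // With stream:false, Ollama returns { model, response, done, ... }
  if (!ollamaReply.done) throw new Error("Ollama reply incomplete");
  return ollamaReply.response.trim();
}

// Usage in an n8n Function node feeding the HTTP Request node:
// return [{ json: buildOllamaRequest($json.prompt) }];
```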
### Step 5: Telegram Notification
The final node sends the AI analysis to Telegram using the Bot API:
```
POST https://api.telegram.org/bot<token>/sendMessage

{
  "chat_id": "<chat_id>",
  "text": "🔍 Homelab Alert Analysis\n\n<ai_response>",
  "parse_mode": "Markdown"
}
```
## Model Selection: Why Mistral 7B?
Running on a GTX 1050 Ti with only 4 GB VRAM, model selection is critical. I tested several options:
| Model | GPU Offload | Response Time | Quality |
|---|---|---|---|
| mistral:7b-instruct-q4_0 | 75% GPU / 25% CPU | ~15 seconds | Good context, actionable advice |
| starcoder2:3b | 100% GPU | ~5 seconds | Too code-focused, poor at ops analysis |
| llama3:8b | 0% (CPU only) | ~90 seconds | Better quality but risks system freeze |
mistral:7b-instruct-q4_0 hits the sweet spot:
- Mostly runs on the GPU (75% offload)
- Generates useful infrastructure analysis
- 15-second response time is fine for hourly batch processing
- Doesn't risk memory exhaustion on pve02
The OLLAMA_KEEP_ALIVE=5m setting ensures the model unloads from VRAM after 5 minutes, freeing resources between analysis runs.
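If Ollama runs as a systemd service (the default for the Linux installer), one way to set this is a drop-in override; the file path below is the conventional location, not something from my actual config:

```
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=5m"
```

Followed by `systemctl daemon-reload` and a restart of the ollama service.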
## Auto-Resolve for Informational Alerts
Not every alert needs AI analysis. Zabbix generates severity-1 (informational) alerts for events like configuration changes, which are expected and harmless. To keep them from cluttering the AI's input, a cron job on the Zabbix LXC automatically resolves informational alerts older than 24 hours:
```
# /etc/cron.d/zabbix-auto-resolve
0 * * * * zabbix /usr/local/bin/auto-resolve-info-alerts.sh
```

This keeps the problem.get response focused on genuine issues, improving the quality of AI analysis.
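The script body isn't shown here, but its core is a single API call: event.acknowledge with the close bit set in the action bitmask (1 = close problem; this only works for triggers with "Allow manual close" enabled). A hypothetical payload, with a placeholder event ID:

```json
{
  "jsonrpc": "2.0",
  "method": "event.acknowledge",
  "params": {
    "eventids": ["12345"],
    "action": 1
  },
  "auth": "your-zabbix-api-token",
  "id": 1
}
```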
## Real-World Results
After running this workflow for several weeks, here's what I've observed:
### Noise Reduction
About 60% of Zabbix alerts correlate with known scheduled events (backups, updates, cron jobs). The AI correctly identifies these and recommends "no action needed" — saving me from investigating false positives.
### Faster Triage
When a real problem occurs, the AI provides immediate context. A disk space alert on docker01 came with the analysis: "Docker image cache consuming space — run docker system prune to reclaim." That's faster than SSH-ing in and running diagnostic commands manually.
### Pattern Recognition
The AI occasionally spots correlations I'd miss: "Three hosts showing elevated memory simultaneously — check if a cluster-wide operation is running." That kind of cross-host analysis is exactly what makes the AI layer worthwhile.
## Limitations
The model isn't perfect. It sometimes:
- Suggests causes that don't apply to my specific setup
- Over-recommends restarting services when the issue is transient
- Misses context about custom configurations (it doesn't know my backup schedule unless I encode it in the prompt)
These are acceptable trade-offs for a fully local, zero-cost solution.
## Lessons Learned
### 1. Prompt Engineering Matters
The quality of AI output depends heavily on the prompt. Including context like "this is a homelab environment" and "backups run at 21:00 EST" dramatically improves relevance.
### 2. Batch Processing > Real-Time
Running analysis hourly rather than on every alert is the right approach. Real-time AI analysis of every Zabbix event would overwhelm the GPU and flood Telegram. Batch processing gives the AI a complete picture of all current problems.
### 3. Local LLMs Are Good Enough
You don't need GPT-4 for infrastructure alert analysis. Mistral 7B running on a consumer GPU provides adequate analysis for homelab-scale monitoring, and the privacy and cost advantages of local inference outweigh the quality gap.
### 4. Keep the Human in the Loop
The AI provides analysis and recommendations, but I make the decisions. It never auto-remediates — that's a line I'm not comfortable crossing with a 7B parameter model on a 4 GB GPU.
## Future Improvements
- Richer context: Feed the AI historical data about recurring problems to improve pattern recognition
- Severity-based routing: Only trigger AI analysis for severity 3+ problems, send lower severities directly to Telegram
- Larger model: If I upgrade the GPU, a 13B or 30B model would provide significantly better analysis
For details on the GPU passthrough that makes this possible, see GPU Passthrough in Proxmox LXC Containers. For the complete homelab overview, check out Building My Homelab.