Monitoring Stack: Zabbix Migration & Automation
Led Zabbix 4 → 7 migration across multiple internal cloud teams with zero monitoring gaps. Built Python automation eliminating ~10 hrs/week of manual reporting across 1000+ servers.
Overview
This is production observability work at SymphonyAI — not a home lab. The scope covers 1000+ servers across Azure and OCI, four internal cloud-team categories, and a full Zabbix 4 → 7 migration executed with zero monitoring blind spots during cutover.
The Python automation suite I built against the Zabbix API eliminated approximately 10 hours/week of manual reporting work.
Zabbix 4 → 7 Migration
The Problem
The legacy Zabbix 4 instance had grown organically across multiple teams — inconsistent template naming, overlapping host groups, 15,000+ URL checks, and custom trigger expressions that varied per team. A hard switchover would guarantee monitoring gaps.
What I Built
1. Bulk host migration with automated validation
Python script using the Zabbix JSON-RPC API to import hosts from CSV with groups, templates, interfaces, and tags — then validate each import against expected state:
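def add_host(session, hostname, ip, group_id, template_id, tags):
    """Create one host via host.create and return the decoded JSON-RPC reply."""
    response = session.post(
        ZABBIX_URL,
        # Zabbix 6.4+ takes the API token as a Bearer header; the "auth"
        # field in the request body is deprecated.
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        json={
            "jsonrpc": "2.0",
            "method": "host.create",
            "params": {
                "host": hostname,
                # type 1 = Zabbix agent interface, addressed by IP
                "interfaces": [{"type": 1, "main": 1, "useip": 1,
                                "ip": ip, "dns": "", "port": "10050"}],
                "groups": [{"groupid": group_id}],
                "templates": [{"templateid": template_id}],
                "tags": tags,  # list of {"tag": ..., "value": ...} dicts
            },
            "id": 1,
        },
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP-level failures
    return response.json()       # API-level errors arrive under the "error" key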
Output: CSV with hostid, status (success/error), and tags per host — full audit trail.
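The validation step is not shown above. A minimal sketch of how a created host might be compared against its CSV row follows. The function name `diff_host` and the flattened dictionary shape are assumptions for illustration; the `actual` record is assumed to have been fetched separately (for example with `host.get`) and normalised into the same shape.

```python
def diff_host(expected, actual):
    """Return a list of human-readable mismatches between two host records.

    Both arguments are plain dicts with the hypothetical keys 'host', 'ip',
    'groupids', 'templateids' and 'tags' (tags as {'tag', 'value'} dicts).
    """
    problems = []
    for key in ("host", "ip"):
        if expected[key] != actual.get(key):
            problems.append(f"{key}: expected {expected[key]!r}, got {actual.get(key)!r}")
    # Group and template order is not significant, so compare as sets.
    for key in ("groupids", "templateids"):
        missing = set(expected[key]) - set(actual.get(key, []))
        if missing:
            problems.append(f"{key}: missing {sorted(missing)}")
    want_tags = {(t["tag"], t["value"]) for t in expected["tags"]}
    have_tags = {(t["tag"], t["value"]) for t in actual.get("tags", [])}
    if want_tags - have_tags:
        problems.append(f"tags: missing {sorted(want_tags - have_tags)}")
    return problems


# Invented sample data: the created host lost its template link.
expected = {"host": "web-01", "ip": "10.0.0.5", "groupids": ["12"],
            "templateids": ["10001"], "tags": [{"tag": "team", "value": "a"}]}
actual = dict(expected, templateids=[])
print(diff_host(expected, actual))
```

An empty list means the host matches its CSV row; anything else can be written to the status column of the audit CSV.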
2. Custom items and triggers extraction
Before decommissioning the old instance, I extracted every item and trigger defined directly on a host rather than inherited from a template (the custom ones teams had added manually over the years) to preserve institutional knowledge:
# Extract only custom items, not those inherited from templates
for host in hosts:
    items = pyzabbix_api.item.get(
        hostids=host['hostid'],
        inherited=False,  # custom only
        output=['itemid', 'name', 'key_', 'type', 'status', 'delay']
    )
    triggers = pyzabbix_api.trigger.get(
        hostids=host['hostid'],
        inherited=False,
        output=['triggerid', 'description', 'expression', 'priority', 'status']
    )
Progress output: a [N/M] counter per host so the ops team could track a multi-hour run.
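A rough sketch of how the extraction loop, the CSV output, and the progress counter might fit together. `export_custom_items` and the `fetch_items` callable are hypothetical names; in a real run `fetch_items` would wrap the `item.get` call shown above, while here it is replaced with stand-in data so the example runs without a Zabbix server.

```python
import csv
import io


def export_custom_items(hosts, fetch_items, out):
    """Write one CSV row per custom item and print a [N/M] line per host."""
    fields = ["host", "itemid", "name", "key_", "type", "status", "delay"]
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    total = len(hosts)
    rows = 0
    for n, host in enumerate(hosts, start=1):
        items = fetch_items(host["hostid"])
        for item in items:
            writer.writerow({"host": host["host"], **item})
            rows += 1
        print(f"[{n}/{total}] {host['host']}: {len(items)} custom items")
    return rows


# Stand-in data (invented) in place of live API responses.
fake_items = {
    "101": [{"itemid": "1", "name": "Queue depth", "key_": "app.queue",
             "type": "0", "status": "0", "delay": "1m"}],
    "102": [],
}
hosts = [{"hostid": "101", "host": "app-01"}, {"hostid": "102", "host": "app-02"}]
buffer = io.StringIO()
row_count = export_custom_items(hosts, lambda hostid: fake_items[hostid], buffer)
```

Passing the fetcher in as a callable keeps the export logic testable offline; the same shape would work for triggers with a different field list.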
3. Enhanced trigger expressions for web scenarios
The old expressions only checked HTTP response codes. I upgraded all 15,000+ URL check triggers to also alert on data timeout — catching cases where monitoring itself goes silent:
import re

# Old pattern: last(/host/web.test.rspcode[...])<>200
# Enhanced: adds a nodata condition, which catches monitoring gaps too
new_expr = re.sub(
    # Capture the /host/key reference: nodata() takes it directly, not last(...)
    r'last\((/[^/]+/web\.test\.rspcode\[.*?\])\)<>200',
    r'last(\1)<>200 or nodata(\1,600)=1',
    old_expression
)
This was a significant alerting improvement: previously, if a web check stopped returning data (for example, because the server or proxy poller running it had stalled), the URL check went silent with no alert.
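To make the effect concrete, here is a self-contained sketch of the rewrite applied to one sample expression. The host and scenario names are invented, `add_nodata` is a hypothetical helper, and the sketch assumes the Zabbix 5.4+ expression syntax, in which `nodata()` takes the `/host/key` reference as its first argument.

```python
import re

# Matches last(/host/web.test.rspcode[scenario,step])<>200 and captures the
# /host/key reference so it can be reused as the first argument of nodata().
RSPCODE_CHECK = re.compile(r'last\((/[^/]+/web\.test\.rspcode\[.*?\])\)<>200')


def add_nodata(expression, timeout=600):
    """Append a nodata() condition to a response-code trigger expression."""
    if "nodata(" in expression:
        return expression  # already upgraded; keeps the rewrite idempotent
    return RSPCODE_CHECK.sub(
        rf'last(\g<1>)<>200 or nodata(\g<1>,{timeout})=1', expression)


old = 'last(/web-01/web.test.rspcode[Login,Home page])<>200'
new = add_nodata(old)
print(new)
```

The early return matters for a bulk run over thousands of triggers: re-running the script after a partial failure should not append a second `nodata()` clause. Expressions that combine the response-code check with other conditions using `and` would additionally need parentheses around the new clause, which this sketch does not handle.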
4. Ansible playbook for agent rollout
Idempotent Ansible playbook that deploys the Zabbix 7 agent across the Linux fleet. It reduced per-server onboarding from ~30 minutes to under 2 minutes:
- name: Deploy Zabbix Agent 7
  hosts: all
  become: yes
  tasks:
    - name: Copy agent config
      copy:
        src: zabbix_agentd.conf.org
        dest: /etc/zabbix/zabbix_agentd.conf
        owner: root
        group: root
        mode: '0644'
    - name: Run agent installation script
      script: z_agent.sh
      args:
        creates: /usr/sbin/zabbix_agentd
Python Automation: 10 hrs/week Eliminated
Monthly Utilisation Reports (4 Internal Cloud Teams)
The biggest toil item: hand-collecting CPU and memory stats per host, categorising by team, formatting into Excel. I automated the full pipeline.
Host categorisation: each team has its own list of hostname patterns (18 to 45 per team), matched in Python to sort 1000+ hosts into the right bucket automatically:
TEAM_PATTERNS = {
    'Team A': ['prefix-a', 'svc-a', 'a-prod', ...],  # 40 patterns
    'Team B': ['prefix-b', 'svc-b', 'b-prod', ...],  # 45 patterns
    'Team C': ['prefix-c', 'svc-c', ...],            # 18 patterns
    'Team D': []                                     # catch-all
}
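The matching logic itself might look like the sketch below. It is standalone and lists only the patterns visible above (the real lists are longer). The function name `categorise`, the case-insensitive substring match, and the first-match-wins ordering are assumptions; the write-up does not say whether patterns are matched as prefixes, substrings, or regular expressions.

```python
# Only the patterns shown in the write-up; the elided ones are not guessed.
TEAM_PATTERNS = {
    "Team A": ["prefix-a", "svc-a", "a-prod"],
    "Team B": ["prefix-b", "svc-b", "b-prod"],
    "Team C": ["prefix-c", "svc-c"],
}
CATCH_ALL = "Team D"  # hosts matching no pattern land here


def categorise(hostname):
    """Return the first team with a pattern that occurs in the hostname."""
    name = hostname.lower()
    for team, patterns in TEAM_PATTERNS.items():
        if any(pattern in name for pattern in patterns):
            return team
    return CATCH_ALL


for host in ("svc-b-db01", "PREFIX-A-web01", "legacy-box"):
    print(host, "->", categorise(host))
```

Because dictionaries preserve insertion order, a hostname that matches patterns from two teams is assigned to whichever team is listed first, so the order of the teams acts as a priority.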
Metric extraction from Zabbix trend data — pulls previous month’s CPU and memory min/avg/max from the Zabbix API:
# CPU utilisation trend (Linux)
trend_data = pyzabbix_api.trend.get(
    itemids=[item_id],
    time_from=month_start_epoch,
    time_till=month_end_epoch,
    output=['clock', 'num', 'value_min', 'value_avg', 'value_max']
)
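Two pieces around that call are worth sketching: computing the previous month's epoch bounds, and collapsing the hourly trend rows into a single min/avg/max. Both helper names are hypothetical, UTC month boundaries are an assumption, and weighting the average by each row's sample count (`num`) is one reasonable choice rather than a description of the original script.

```python
from datetime import date, datetime, timezone


def previous_month_bounds(today):
    """Return (start, end) epoch seconds for the month before `today`, in UTC."""
    this_month = datetime(today.year, today.month, 1, tzinfo=timezone.utc)
    if today.month == 1:
        prev_month = datetime(today.year - 1, 12, 1, tzinfo=timezone.utc)
    else:
        prev_month = datetime(today.year, today.month - 1, 1, tzinfo=timezone.utc)
    return int(prev_month.timestamp()), int(this_month.timestamp()) - 1


def summarise_trends(trend_rows):
    """Collapse hourly trend rows into one (min, avg, max) tuple."""
    if not trend_rows:
        return None
    # The API returns numbers as strings, so convert before aggregating.
    low = min(float(r["value_min"]) for r in trend_rows)
    high = max(float(r["value_max"]) for r in trend_rows)
    samples = sum(int(r["num"]) for r in trend_rows)
    # Weight each hourly average by the number of samples behind it.
    avg = sum(float(r["value_avg"]) * int(r["num"]) for r in trend_rows) / samples
    return round(low, 2), round(avg, 2), round(high, 2)


# Invented sample rows in the shape trend.get returns.
rows = [
    {"clock": "1704067200", "num": "60", "value_min": "5.0",
     "value_avg": "10.0", "value_max": "40.0"},
    {"clock": "1704070800", "num": "30", "value_min": "2.0",
     "value_avg": "40.0", "value_max": "95.0"},
]
print(previous_month_bounds(date(2024, 2, 15)))
print(summarise_trends(rows))
```

A plain mean of the hourly averages would give 25.0 for the sample rows; the weighted version gives 20.0 because the first hour contributed twice as many samples.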
Threshold-based flagging:
| Metric | Under-utilised | Over-utilised |
|---|---|---|
| CPU | < 20% | > 80% |
| Memory | < 50% | > 80% |
| Storage | — | > 2 TB flagged |
Output: four Excel workbooks per month — one per internal cloud team.
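The thresholds in the table above translate into a few lines of Python. This is a sketch: the function names are hypothetical, and it assumes the monthly average is the value compared against the CPU and memory thresholds, which the write-up does not specify.

```python
# Thresholds from the table; storage is flagged on absolute size, not percent.
CPU_LOW, CPU_HIGH = 20.0, 80.0
MEM_LOW, MEM_HIGH = 50.0, 80.0
STORAGE_FLAG_TB = 2.0


def flag_percent(value, low, high):
    """Classify a percentage against an under/over threshold pair."""
    if value < low:
        return "under-utilised"
    if value > high:
        return "over-utilised"
    return "ok"


def flag_host(cpu_avg, mem_avg, storage_tb):
    """Return one flag per metric for a single host."""
    return {
        "cpu": flag_percent(cpu_avg, CPU_LOW, CPU_HIGH),
        "memory": flag_percent(mem_avg, MEM_LOW, MEM_HIGH),
        "storage": "flagged" if storage_tb > STORAGE_FLAG_TB else "ok",
    }


print(flag_host(cpu_avg=12.5, mem_avg=65.0, storage_tb=3.1))
```

Values exactly on a threshold (for example 20% CPU) count as "ok" here because the table uses strict inequalities.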
Unresolved Alert Digest
Daily automated report of open Zabbix problems by severity (Average, High, Disaster) — emailed to the on-call team each morning, removing the need for anyone to manually check the Zabbix dashboard before shift start.
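A minimal sketch of the grouping step behind such a digest. It assumes the open problems have already been fetched (for example with `problem.get`) as dicts carrying `name` and `severity`; `build_digest` is a hypothetical name, and the email delivery is left out.

```python
from collections import defaultdict

# Zabbix severity codes for the three levels named in the digest.
SEVERITY_NAMES = {5: "Disaster", 4: "High", 3: "Average"}


def build_digest(problems):
    """Group open problems by severity and render a plain-text summary."""
    by_severity = defaultdict(list)
    for problem in problems:
        severity = int(problem["severity"])  # the API returns it as a string
        if severity in SEVERITY_NAMES:
            by_severity[severity].append(problem["name"])
    lines = []
    for severity in sorted(SEVERITY_NAMES, reverse=True):  # most severe first
        names = by_severity.get(severity, [])
        lines.append(f"{SEVERITY_NAMES[severity]}: {len(names)}")
        lines.extend(f"  - {name}" for name in sorted(names))
    return "\n".join(lines)


# Invented sample problems.
sample = [
    {"name": "Disk full on db-01", "severity": "5"},
    {"name": "High CPU on app-02", "severity": "3"},
    {"name": "Agent ping lost on web-03", "severity": "2"},  # Warning: excluded
]
print(build_digest(sample))
```

Severities with no open problems still get a line with a zero count, so the reader can tell an empty category from a report that failed to run.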
Failed Agent Tracking
Script to identify disabled/unreachable Zabbix agents across all teams — catches agent drift before it causes monitoring blind spots. Outputs CSV with agent name, status, and last-seen timestamp.
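The classification and CSV output could be sketched as follows. Everything about the record shape is an assumption: each host is taken to be a flattened dict with `host`, `status` (Zabbix host status, where "1" means unmonitored), `available` (agent interface availability, where "2" means unavailable), and a `last_seen` epoch collected separately, for instance from the `lastclock` of an agent item. The function names are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone


def agent_status(host):
    """Classify one host record as 'disabled', 'unreachable' or 'ok'."""
    if host["status"] == "1":     # host disabled in Zabbix
        return "disabled"
    if host["available"] == "2":  # agent interface marked unavailable
        return "unreachable"
    return "ok"


def failed_agents_csv(hosts, out):
    """Write failed agents to `out` as CSV and return how many were found."""
    writer = csv.writer(out)
    writer.writerow(["agent", "status", "last_seen_utc"])
    failed = 0
    for host in hosts:
        status = agent_status(host)
        if status == "ok":
            continue
        last_seen = datetime.fromtimestamp(int(host["last_seen"]), tz=timezone.utc)
        writer.writerow([host["host"], status, last_seen.strftime("%Y-%m-%d %H:%M:%S")])
        failed += 1
    return failed


# Invented sample records.
agents = [
    {"host": "web-01", "status": "0", "available": "1", "last_seen": "1704067200"},
    {"host": "web-02", "status": "0", "available": "2", "last_seen": "1704067200"},
    {"host": "old-03", "status": "1", "available": "0", "last_seen": "1704067200"},
]
report = io.StringIO()
failed_count = failed_agents_csv(agents, report)
print(report.getvalue())
```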
Architecture
┌───────────────────────────────────────────────────────────────────┐
│                  Infrastructure (1000+ servers)                   │
│            Azure VMs · OCI Compute · Kubernetes Nodes             │
└───────────┬───────────────────────────────────┬───────────────────┘
            │ Zabbix Agent (port 10050)         │ Prometheus exporter
            ▼                                   ▼
  ┌───────────────────┐               ┌───────────────────┐
  │    Zabbix 7.2     │               │    Prometheus     │
  │    1000+ hosts    │               │Kubernetes metrics │
  │  15k+ URL checks  │               │   node-exporter   │
  │ 4 internal teams  │               └─────────┬─────────┘
  └─────────┬─────────┘                         │
            └─────────────────┬─────────────────┘
                              ▼
                    ┌───────────────────┐
                    │      Grafana      │
                    │Unified dashboards │
                    │   SLA tracking    │
                    └─────────┬─────────┘
                              ▼
                 ┌─────────────────────────┐
                 │    Python Automation    │
                 │   Zabbix API JSON-RPC   │
                 │  Monthly Excel reports  │
                 │   Daily alert digests   │
                 └─────────────────────────┘
What This Demonstrates
- Migration execution at scale: 15,000+ URL checks, zero gap. Parallel validation before cutover, not a hard switchover.
- Alerting improvement: the nodata() condition upgrade means monitoring failures are now visible, not silent.
- API-first automation: JSON-RPC Zabbix API, pyzabbix, pandas, openpyxl. The full Python data pipeline.
- Toil quantification: ~10 hrs/week eliminated is a number I can defend: 4 reports × ~2.5 hrs manual effort each.
- Ansible at fleet scale: idempotent agent rollout, 30 min → 2 min per server onboarding.