Monitoring Stack: Zabbix Migration & Automation

Led Zabbix 4 → 7 migration across multiple internal cloud teams with zero monitoring gaps. Built Python automation eliminating ~10 hrs/week of manual reporting across 1000+ servers.

Zabbix Python Ansible Prometheus Grafana Observability

Overview

This is production observability work at SymphonyAI — not a home lab. The scope covers 1000+ servers across Azure and OCI, four internal cloud-team categories, and a full Zabbix 4 → 7 migration executed with zero monitoring blind spots during cutover.

The Python automation suite I built against the Zabbix API eliminated approximately 10 hours/week of manual reporting work.


Zabbix 4 → 7 Migration

The Problem

The legacy Zabbix 4 instance had grown organically across multiple teams — inconsistent template naming, overlapping host groups, 15,000+ URL checks, and custom trigger expressions that varied per team. A hard switchover would guarantee monitoring gaps.

What I Built

1. Bulk host migration with automated validation

Python script using the Zabbix JSON-RPC API to import hosts from CSV with groups, templates, interfaces, and tags — then validate each import against expected state:

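def add_host(session, hostname, ip, group_id, template_id, tags):
    """Create one host via host.create and return the parsed JSON-RPC response."""
    response = session.post(
        ZABBIX_URL,
        # Zabbix 6.4+ takes the API token as a Bearer header; the request-body
        # "auth" field is deprecated.
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        json={
            "jsonrpc": "2.0",
            "method": "host.create",
            "params": {
                "host": hostname,
                # type 1 = Zabbix agent interface on the default agent port
                "interfaces": [{"type": 1, "main": 1, "useip": 1,
                                "ip": ip, "dns": "", "port": "10050"}],
                "groups": [{"groupid": group_id}],
                "templates": [{"templateid": template_id}],
                "tags": tags,  # list of {"tag": ..., "value": ...} dicts
            },
            "id": 1,
        },
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP-level failures early
    return response.json()       # holds "result" on success, "error" on failure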

Output: CSV with hostid, status (success/error), and tags per host — full audit trail.
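The validation and audit step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script used in the migration: `validate_import`, `write_audit`, and the dictionary shapes are hypothetical, and the host's actual state is assumed to have been fetched already (for example via `host.get`) and normalised into plain sets.

```python
import csv
import io

def validate_import(expected, actual):
    """Return a list of mismatches between expected and actual host state.

    Both arguments are hypothetical normalised dicts with the keys
    'ip' (str), 'groups' and 'templates' (sets of IDs), and
    'tags' (set of (tag, value) tuples).
    """
    problems = []
    if expected["ip"] != actual["ip"]:
        problems.append(f"ip: expected {expected['ip']}, got {actual['ip']}")
    for field in ("groups", "templates", "tags"):
        missing = expected[field] - actual[field]
        if missing:
            problems.append(f"{field}: missing {sorted(missing)}")
    return problems

def write_audit(rows, out):
    """Write one audit row per host: hostname, hostid, status, detail, tags."""
    writer = csv.writer(out)
    writer.writerow(["hostname", "hostid", "status", "detail", "tags"])
    for row in rows:
        problems = validate_import(row["expected"], row["actual"])
        status = "success" if not problems else "error"
        tags = ";".join(f"{t}={v}" for t, v in sorted(row["expected"]["tags"]))
        writer.writerow([row["hostname"], row["hostid"], status,
                         "; ".join(problems), tags])
```

Comparing sets rather than lists keeps the check independent of the order in which the API returns groups, templates, and tags.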

2. Custom items and triggers extraction

Before decommissioning the old instance, I extracted all non-discovered items and triggers (the custom ones teams had added manually over years) to preserve institutional knowledge:

# Extract only custom items and triggers, i.e. those not inherited from templates
custom_items, custom_triggers = {}, {}
for host in hosts:
    hostid = host['hostid']
    custom_items[hostid] = pyzabbix_api.item.get(
        hostids=hostid,
        inherited=False,   # custom only
        output=['itemid', 'name', 'key_', 'type', 'status', 'delay']
    )
    custom_triggers[hostid] = pyzabbix_api.trigger.get(
        hostids=hostid,
        inherited=False,
        expandExpression=True,   # readable expressions instead of internal function IDs
        output=['triggerid', 'description', 'expression', 'priority', 'status']
    )

Progress output: [N/M] counter per host so the ops team could track a multi-hour run.
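A rough sketch of how the per-host counter and a CSV export might fit together is shown here. The function name `export_custom_items`, the `fetch_items` callable (imagined as a thin wrapper around `item.get`), and the column layout are all assumptions made for illustration.

```python
import csv
import io

def export_custom_items(hosts, fetch_items, out, log=print):
    """Write one CSV row per custom item and log an [N/M] counter per host.

    `fetch_items` is a hypothetical callable that takes a hostid and
    returns a list of item dicts.
    """
    fields = ["host", "itemid", "name", "key_", "type", "status", "delay"]
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    total = len(hosts)
    for n, host in enumerate(hosts, start=1):
        items = fetch_items(host["hostid"])
        for item in items:
            writer.writerow({"host": host["host"], **item})
        log(f"[{n}/{total}] {host['host']}: {len(items)} custom items")
```

Passing the logger in as a parameter keeps the function testable and lets the same loop write to a file, a terminal, or a logging handler.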

3. Enhanced trigger expressions for web scenarios

The old expressions only checked HTTP response codes. I upgraded all 15,000+ URL check triggers to also alert when no data arrives for 10 minutes, catching cases where monitoring itself goes silent:

import re

# Old pattern: last(/host/web.test.rspcode[...])<>200
# Enhanced: adds a nodata condition, which catches monitoring gaps too
new_expr = re.sub(
    # Capture the /host/key reference so nodata() receives the item, not last(...)
    r'last\((/[^/]+/web\.test\.rspcode\[.*?\])\)<>200',
    r'last(\1)<>200 or nodata(\1,600)=1',
    old_expression
)

This was a significant alerting improvement: web scenarios run on the Zabbix server or proxy rather than on the agent, so before this change a stalled poller or unreachable proxy would silently stop URL checks with no alert.
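To make the rewrite concrete, here is a small sketch applied to a sample expression. The host name `web01`, the scenario and step names, and the helper `add_nodata` are made up for illustration. The idempotency guard is an addition of this sketch, so that re-running a bulk update does not append the condition twice.

```python
import re

# Matches last(/host/web.test.rspcode[scenario,step])<>200 and captures the
# /host/key reference so it can be reused as the nodata() argument.
RSPCODE_RE = re.compile(r'last\((/[^/]+/web\.test\.rspcode\[.*?\])\)<>200')

def add_nodata(expression, window=600):
    """Append a nodata() condition to each response-code check (idempotent)."""
    if "nodata(" in expression:
        return expression  # already upgraded, leave untouched
    return RSPCODE_RE.sub(
        rf'last(\1)<>200 or nodata(\1,{window})=1', expression
    )

old = "last(/web01/web.test.rspcode[Login,Home])<>200"
print(add_nodata(old))
# last(/web01/web.test.rspcode[Login,Home])<>200 or nodata(/web01/web.test.rspcode[Login,Home],600)=1
```

Because `and` binds tighter than `or` in Zabbix expressions, a trigger that combines the response-code check with other conditions may need the rewritten part wrapped in parentheses.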

4. Ansible playbook for agent rollout

Idempotent Ansible playbook deploying Zabbix 7 agent across the Linux fleet. Reduced per-server onboarding from ~30 minutes to under 2 minutes:

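- name: Deploy Zabbix Agent 7
  hosts: all
  become: true
  tasks:
    # Install first so /etc/zabbix exists before the config is copied
    - name: Run agent installation script
      ansible.builtin.script: z_agent.sh
      args:
        creates: /usr/sbin/zabbix_agentd   # skip hosts where the agent is already installed

    - name: Copy agent config
      ansible.builtin.copy:
        src: zabbix_agentd.conf.org
        dest: /etc/zabbix/zabbix_agentd.conf
        owner: root
        group: root
        mode: '0644'
      notify: Restart zabbix-agent

  handlers:
    # Assumes the service is named zabbix-agent, as in the official packages
    - name: Restart zabbix-agent
      ansible.builtin.service:
        name: zabbix-agent
        state: restarted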

Python Automation: 10 hrs/week Eliminated

Monthly Utilisation Reports (4 Internal Cloud Teams)

The biggest toil item: hand-collecting CPU and memory stats per host, categorising by team, formatting into Excel. I automated the full pipeline.

Host categorisation — 40+ hostname patterns per team, matched in Python to sort 1000+ hosts into the right bucket automatically:

TEAM_PATTERNS = {
    'Team A': ['prefix-a', 'svc-a', 'a-prod', ...],   # 40 patterns
    'Team B': ['prefix-b', 'svc-b', 'b-prod', ...],   # 45 patterns
    'Team C': ['prefix-c', 'svc-c', ...],             # 18 patterns
    'Team D': []  # catch-all
}
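The matching step could look something like the sketch below. It assumes case-insensitive substring matching, first match wins in dictionary order, and an empty pattern list as the catch-all; the source only says the patterns are "matched in Python", so the exact rule is an assumption. The function `categorise` and the sample hostnames are hypothetical.

```python
def categorise(hostname, team_patterns, default="Team D"):
    """Return the first team whose pattern occurs in the hostname, else the default."""
    name = hostname.lower()
    for team, patterns in team_patterns.items():
        if any(pattern in name for pattern in patterns):
            return team
    return default

# Shortened, illustrative pattern table
patterns = {
    "Team A": ["prefix-a", "svc-a"],
    "Team B": ["prefix-b", "svc-b"],
    "Team D": [],  # catch-all
}
print(categorise("PREFIX-A-web01", patterns))  # Team A
print(categorise("legacy-box", patterns))      # Team D
```

With first-match-wins semantics, the order of teams in the dictionary matters whenever two teams' patterns could match the same hostname.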

Metric extraction from Zabbix trend data — pulls previous month’s CPU and memory min/avg/max from the Zabbix API:

# CPU utilization trend (Linux)
trend_data = zabbix_api.trend.get(
    itemids=[item_id],
    time_from=month_start_epoch,
    time_till=month_end_epoch,
    output=['clock', 'num', 'value_min', 'value_avg', 'value_max']
)
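Two details around this call benefit from a sketch: deriving the previous month's epoch range, and collapsing the hourly trend rows into one monthly figure. The helpers `previous_month_range` and `summarise_trends` are hypothetical, the month boundaries are assumed to be in UTC, and weighting the average by `num` (samples per hour) is one reasonable choice rather than the confirmed method. The Zabbix API returns trend values as strings, hence the conversions.

```python
from datetime import datetime, timedelta, timezone

def previous_month_range(today):
    """Return (start_epoch, end_epoch) for the calendar month before `today`."""
    first_this = today.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    first_prev = (first_this - timedelta(days=1)).replace(day=1)
    return int(first_prev.timestamp()), int(first_this.timestamp()) - 1

def summarise_trends(rows):
    """Collapse hourly trend rows into one min/avg/max summary.

    The average is weighted by each row's sample count so that hours
    with fewer samples do not skew the result.
    """
    if not rows:
        return None
    total = sum(int(r["num"]) for r in rows)
    return {
        "min": min(float(r["value_min"]) for r in rows),
        "avg": sum(float(r["value_avg"]) * int(r["num"]) for r in rows) / total,
        "max": max(float(r["value_max"]) for r in rows),
    }
```

Returning `None` for an empty result makes hosts with no trend data easy to distinguish from hosts that genuinely sat at 0%.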

Threshold-based flagging:

Metric     Under-utilised    Over-utilised
CPU        < 20%             > 80%
Memory     < 50%             > 80%
Storage    > 2 TB flagged
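The thresholds in the table above translate into a small classification step. The sketch below assumes they are applied to the monthly average and that storage is compared as a size in TB; the source states neither, so both are assumptions, and the names `flag_utilisation` and `flag_storage` are hypothetical.

```python
# (under-utilised below, over-utilised above), in percent
THRESHOLDS = {
    "cpu": (20.0, 80.0),
    "memory": (50.0, 80.0),
}
STORAGE_FLAG_TB = 2.0

def flag_utilisation(metric, avg_percent):
    """Classify a CPU or memory average against the report thresholds."""
    low, high = THRESHOLDS[metric]
    if avg_percent < low:
        return "under-utilised"
    if avg_percent > high:
        return "over-utilised"
    return "ok"

def flag_storage(size_tb):
    """Flag hosts whose storage exceeds the 2 TB review threshold."""
    return "flagged" if size_tb > STORAGE_FLAG_TB else "ok"

print(flag_utilisation("cpu", 12.5))    # under-utilised
print(flag_utilisation("memory", 65))   # ok
print(flag_storage(2.5))                # flagged
```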

Output: four Excel workbooks per month — one per internal cloud team.

Unresolved Alert Digest

Daily automated report of open Zabbix problems by severity (Average, High, Disaster) — emailed to the on-call team each morning, removing the need for anyone to manually check the Zabbix dashboard before shift start.
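The grouping behind such a digest might be sketched as follows. Zabbix encodes severities numerically (3 = Average, 4 = High, 5 = Disaster) and returns them as strings. The `host` field on each problem and the function `build_digest` are hypothetical; in practice the host name would have to be joined in from a separate lookup.

```python
from collections import defaultdict

SEVERITY_NAMES = {3: "Average", 4: "High", 5: "Disaster"}

def build_digest(problems):
    """Group open problems by severity and render plain text, worst first."""
    by_severity = defaultdict(list)
    for problem in problems:
        severity = int(problem["severity"])
        if severity in SEVERITY_NAMES:  # ignore anything below Average
            by_severity[severity].append(problem)
    lines = []
    for severity in sorted(by_severity, reverse=True):
        group = by_severity[severity]
        lines.append(f"{SEVERITY_NAMES[severity]} ({len(group)})")
        lines.extend(f"  - {p['host']}: {p['name']}" for p in group)
    return "\n".join(lines) if lines else "No open problems."
```

Sending an explicit "No open problems." message on quiet days lets the on-call team tell a clean dashboard apart from a digest job that failed to run.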

Failed Agent Tracking

Script to identify disabled/unreachable Zabbix agents across all teams — catches agent drift before it causes monitoring blind spots. Outputs CSV with agent name, status, and last-seen timestamp.
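As a sketch of the classification and CSV output, the code below takes host status `"1"` to mean not monitored and agent interface availability `"2"` to mean unavailable, following Zabbix conventions. The input dict is a hypothetical pre-flattened record: `available` and `last_seen` (an epoch timestamp) are assumed to have been gathered beforehand, and how the original script derives last-seen is not stated in the source.

```python
import csv
import io
from datetime import datetime, timezone

def agent_status(host):
    """Classify a host as 'disabled', 'unreachable' or 'ok'."""
    if host["status"] == "1":       # host is not monitored
        return "disabled"
    if host["available"] == "2":    # agent interface unavailable
        return "unreachable"
    return "ok"

def write_failed_agents(hosts, out):
    """Write a CSV row for every host whose agent is disabled or unreachable."""
    writer = csv.writer(out)
    writer.writerow(["agent", "status", "last_seen_utc"])
    for host in hosts:
        state = agent_status(host)
        if state != "ok":
            seen = datetime.fromtimestamp(int(host["last_seen"]), tz=timezone.utc)
            writer.writerow([host["host"], state,
                             seen.strftime("%Y-%m-%d %H:%M:%S")])
```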


Architecture

┌───────────────────────────────────────────────────────────────────┐
│                  Infrastructure (1000+ servers)                   │
│           Azure VMs · OCI Compute · Kubernetes Nodes              │
└─────────────┬────────────────────────────┬────────────────────────┘
              │ Zabbix Agent (port 10050)  │ Prometheus exporter
              ▼                            ▼
    ┌───────────────────┐        ┌───────────────────┐
    │    Zabbix 7.2     │        │    Prometheus     │
    │    1000+ hosts    │        │Kubernetes metrics │
    │  15k+ URL checks  │        │   node-exporter   │
    │ 4 internal teams  │        └─────────┬─────────┘
    └─────────┬─────────┘                  │
              └───────┬────────────────────┘
                      ▼
             ┌──────────────────┐
             │     Grafana      │
             │Unified dashboards│
             │   SLA tracking   │
             └──────────────────┘


          ┌─────────────────────────┐
          │    Python Automation    │
          │   Zabbix API JSON-RPC   │
          │  Monthly Excel reports  │
          │   Daily alert digests   │
          └─────────────────────────┘

What This Demonstrates

  • Migration execution at scale — 15,000+ URL checks, zero gap. Parallel validation before cutover, not a hard switchover.
  • Alerting improvement: the nodata() condition upgrade means monitoring failures are now visible, not silent.
  • API-first automation — JSON-RPC Zabbix API, pyzabbix, pandas, openpyxl — the full Python data pipeline.
  • Toil quantification — ~10 hrs/week eliminated is a number I can defend: 4 reports × ~2.5 hrs manual effort each.
  • Ansible at fleet scale — idempotent agent rollout, 30 min → 2 min per server onboarding.