
Adding New Infrastructure — Quick Start

Quick reference for the GPUS-IT team. The full rules live in governance/component-coverage-standard.md on infras.greenpeace.us.

What inventory.yaml is

inventory.yaml at the repo root is the single source of truth for every host, hypervisor, network appliance, storage device, and power device we monitor. It feeds:

  • the docs portal (via documented_in: page links),
  • the status and SOC portals (via inventory.json baked at Docker build),
  • the backends (via the generated servers.py).

If a thing isn't in inventory.yaml, it doesn't really exist for ops. If it's in inventory.yaml but not in the portals, the Cloud Build fails.

The three monitoring states

  • live — telemetry is wired and reporting today. Green dot. Requires monitoring_intent naming the active source.
  • planned — documented and committed, telemetry not yet wired. Striped-yellow dot. Default for new entries. Requires monitoring_intent naming the eventual source.
  • unmonitored — exists, but we have no intent to wire telemetry. Grey dot. Requires a justification: field explaining why.

(A decommissioned state also exists for retired gear: shown with strike-through, requires decommissioned_reason:.)
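The state rules above can be sketched as a small lookup. This is a minimal illustration, not the coverage script's actual code: the function name check_state_fields and the message wording for an unknown status are hypothetical, and it covers only the state-specific fields, not the universal ones.

```python
# Hypothetical sketch of the per-state field rules described above.
# State and field names come from this page; everything else is illustrative.
STATE_REQUIRED_FIELD = {
    "live": "monitoring_intent",
    "planned": "monitoring_intent",
    "unmonitored": "justification",
    "decommissioned": "decommissioned_reason",
}


def check_state_fields(entity_id, entry):
    """Return a list of problems for one inventory entry (empty if none)."""
    status = entry.get("monitoring_status")
    if status not in STATE_REQUIRED_FIELD:
        return [f"{entity_id}: unknown or missing monitoring_status {status!r}"]
    field = STATE_REQUIRED_FIELD[status]
    # Treat an empty or whitespace-only value the same as a missing one.
    if not str(entry.get(field) or "").strip():
        return [f"{entity_id}: monitoring_status={status} requires '{field}' field"]
    return []


# Example with a made-up entry: unmonitored without a justification.
print(check_state_fields("example-host", {"monitoring_status": "unmonitored"}))
```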

The 7-step workflow (checklist)

  1. Add an entry to inventory.yaml under the right category, with monitoring_status: planned and a monitoring_intent value.
  2. Add (or extend) a doc page under docs/architecture/, docs/infrastructure/, docs/hostregistry/, or docs/response-plans/, then list its path in the entry's documented_in: field.
  3. If it's a Linux host: run python3 scripts/regenerate-servers-py.py to refresh both backends' servers.py.
  4. Add a card to the status-site front-end.
  5. Add a tile to the soc-site front-end.
  6. Run python3 scripts/check-component-coverage.py locally.
  7. Push to main. Cloud Build re-runs the coverage check pre-deploy.

Common Cloud Build coverage failures

The script's error messages are written to teach the workflow: each one tells you the exact fix.

| Error prefix | What it means | Fix |
| --- | --- | --- |
| [schema] … missing required universal field 'monitoring_intent' (or documented_in, location, monitoring_status) | Inventory entry is missing a required field. | Add the field to the entry in inventory.yaml. |
| [schema] … monitoring_status=unmonitored requires 'justification' field | An unmonitored entry has no explanation. | Add justification: <reason> to the entry. |
| [schema] … monitoring_status=decommissioned requires 'decommissioned_reason' field | Same idea, for retired gear. | Add decommissioned_reason: <reason>. |
| [reference] …: unresolved reference 'X' | A powers: / powered_by: / fed_from: / hosted_on: / vms: value points at something that isn't in the inventory or external_references. | Add the target to inventory.yaml, or add an entry under external_references: if it's truly outside our scope. |
| [doc] …: documented_in is empty | Inventory entry has no doc page. | Add a path in documented_in: and ensure that page mentions the entity ID. |
| [servers.py] <dir>: 'X' present in inventory.linux_hosts but not in <dir>/servers.py | You added a host but didn't regenerate. | python3 scripts/regenerate-servers-py.py from repo root, then commit. |
| [servers.py] <dir>: 'X' present in servers.py but not in inventory.yaml | Someone edited servers.py by hand. | Add the entity to inventory.yaml under linux_hosts, then run python3 scripts/regenerate-servers-py.py. |
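To make the [reference] and [servers.py] rows concrete, here is a rough sketch of what those two checks might look like. It is an assumption-laden illustration, not the real check-component-coverage.py: the function names, the dict-keyed-by-entity-ID shape, and the example IDs are all hypothetical. Only the field names and the [servers.py] message text are taken from the rows above.

```python
# Illustrative only: field names come from the error descriptions above;
# the data shapes and function names are assumptions.
REFERENCE_FIELDS = ("powers", "powered_by", "fed_from", "hosted_on", "vms")


def unresolved_references(entities, external_references=()):
    """Return (entity_id, field, target) for every reference to an unknown ID."""
    known = set(entities) | set(external_references)
    findings = []
    for entity_id, entry in entities.items():
        for field in REFERENCE_FIELDS:
            value = entry.get(field) or []
            # Accept either a single ID or a list of IDs.
            targets = [value] if isinstance(value, str) else list(value)
            for target in targets:
                if target not in known:
                    findings.append((entity_id, field, target))
    return findings


def servers_py_drift(directory, inventory_hosts, servers_py_hosts):
    """Compare inventory Linux hosts with one backend's servers.py hosts."""
    inventory_hosts = set(inventory_hosts)
    servers_py_hosts = set(servers_py_hosts)
    findings = []
    # In the inventory but never regenerated into servers.py.
    for host in sorted(inventory_hosts - servers_py_hosts):
        findings.append(
            f"[servers.py] {directory}: '{host}' present in "
            f"inventory.linux_hosts but not in {directory}/servers.py"
        )
    # In servers.py but unknown to the inventory (hand edit).
    for host in sorted(servers_py_hosts - inventory_hosts):
        findings.append(
            f"[servers.py] {directory}: '{host}' present in "
            f"servers.py but not in inventory.yaml"
        )
    return findings
```

Both checks reduce to set membership, which is why the fixes in each row are symmetric: either add the missing target to the inventory or regenerate the derived file.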

Warnings (CSV hostname mismatches, missing doc paths pre-merge, inventory.json drift) do not fail the build but should be cleaned up.

Temporary exceptions

If you genuinely need to defer a fix, add an entry to .coverage-exceptions.yaml at the repo root. Every exception must have an expires: date in ISO format (YYYY-MM-DD) — indefinite exceptions are rejected by schema. Expired entries are ignored and the finding becomes a build failure again.

exceptions:
  - finding_id: csv:meraki-hostregistry.csv:wdc-wap-1
    expires: 2026-06-15
    rationale: WAPs covered by inventory but CSV hostname slug differs.
    owner: rajesh.chhetry@greenpeace.us

The finding_id is the stable identifier printed by the coverage script — copy it from the error or warning line.
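The expiry rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the coverage script's implementation: the function name active_exceptions is hypothetical, and treating an exception as still valid on its expires date is a guess the source does not confirm.

```python
from datetime import date


def active_exceptions(exceptions, today=None):
    """Return the finding_ids whose exceptions are still in force."""
    today = today or date.today()
    active = set()
    for exc in exceptions:
        expires = exc.get("expires")
        if expires is None:
            # Indefinite exceptions are not allowed.
            raise ValueError(
                f"exception for {exc.get('finding_id')!r} has no expires date"
            )
        # A YAML loader may already have parsed an unquoted date into a
        # date object; accept that or an ISO string (YYYY-MM-DD).
        if not isinstance(expires, date):
            expires = date.fromisoformat(str(expires))
        # Assumption: the exception still applies on the expiry day itself.
        if expires >= today:
            active.add(exc["finding_id"])
    return active
```

A finding whose ID is not in the returned set is treated as a normal finding again, which matches the behaviour described above for expired entries.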

Going deeper

Read the full Component Coverage Standard on infras.greenpeace.us for the schema, compliance mapping, and the planned → live promotion workflow.