Creating Procedures

Procedures are executable step-by-step runbooks that your team follows during incidents, maintenance windows, and routine operations. They reduce human error by providing clear instructions, and they create an audit trail of every execution.

Each procedure consists of:

  • Steps: Ordered actions with descriptions, expected outcomes, and optional commands
  • Variables: Dynamic placeholders replaced at execution time (service names, environment targets, thresholds)
  • Approval gates: Checkpoints that require explicit sign-off before continuing
  • Estimated durations: Time estimates per step for progress tracking
  • Linked incident types: Associations that surface the procedure when relevant alerts fire

To create a procedure from scratch:

  1. Navigate to Procedures in the left sidebar of the Overwatch dashboard
  2. Click Create Procedure
  3. Fill in the procedure metadata:
    • Title: A clear, action-oriented name (e.g., “Restart Checkout Service on Kubernetes”)
    • Description: When and why this procedure should be used
    • Category: Group related procedures (Deployment, Database, Networking, Scaling)
    • Severity scope: Which incident severity levels this procedure applies to
  4. Click Save Draft to begin adding steps

After resolving an incident with the AI chat, the resolution summary can be converted into a procedure directly. Click Save as Procedure in the resolution panel to pre-populate steps based on the commands that were executed.

Tip: Procedures created from real incident resolutions tend to be more accurate and complete than those written from memory. Build procedures immediately after resolving an incident while the context is fresh.

Click Add Step to append a new step to the procedure. Each step has the following fields:

  • Title: Brief action description (e.g., “Check pod status”)
  • Instructions: Detailed explanation of what to do and why
  • Command (optional): A shell command that can be executed via the Helper CLI
  • Expected outcome: What the operator should observe if the step succeeds
  • Failure guidance: What to do if the expected outcome is not met
  • Estimated duration: How long this step typically takes

Example step:

Title: Verify database connectivity
Instructions: Confirm the application can reach the primary database.
Command: kubectl exec -n production deploy/checkout-service -- pg_isready -h $DB_HOST
Expected outcome: "accepting connections" message
Failure guidance: Check security group rules and RDS instance status in AWS console
Duration: 2 minutes
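
The step fields above map naturally onto a small record type. The following is a minimal sketch for reasoning about steps outside the dashboard. The `Step` class, its field names, and `total_estimated_minutes` are hypothetical and are not part of any Overwatch API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """Hypothetical in-memory model of one procedure step (not an Overwatch API)."""
    title: str
    instructions: str
    expected_outcome: str
    failure_guidance: str
    estimated_minutes: int
    command: Optional[str] = None  # optional, matching the field list above


# The example step from this page, expressed with the hypothetical model.
verify_db = Step(
    title="Verify database connectivity",
    instructions="Confirm the application can reach the primary database.",
    command="kubectl exec -n production deploy/checkout-service -- pg_isready -h $DB_HOST",
    expected_outcome='"accepting connections" message',
    failure_guidance="Check security group rules and RDS instance status in AWS console",
    estimated_minutes=2,
)


def total_estimated_minutes(steps: list[Step]) -> int:
    """Sum per-step estimates to get a rough duration for the whole procedure."""
    return sum(step.estimated_minutes for step in steps)
```

Keeping the duration as a number rather than free text makes it straightforward to add up estimates across steps for progress tracking.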

Approval gates pause procedure execution and require explicit authorization from a user with the Manager or Admin role before continuing. Use them for steps that involve:

  • Production database modifications
  • Infrastructure scaling or teardown
  • Customer-facing configuration changes
  • Credential rotation or secret updates

To add an approval gate, toggle Requires Approval on the step configuration. Specify which roles can approve and, optionally, a justification prompt.

Note: When a procedure execution reaches an approval gate, all subscribed users receive a notification. The execution remains paused until an authorized user approves or rejects continuation.
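
The pause-then-decide behavior described above can be pictured as a small state machine. This is an illustrative sketch only: `ApprovalGate`, `GateStatus`, and the rule that a gate can be decided once are assumptions, not Overwatch's actual implementation. The default roles come from the Manager and Admin roles named above.

```python
from dataclasses import dataclass, field
from enum import Enum


class GateStatus(Enum):
    PENDING = "pending"    # execution is paused at the gate
    APPROVED = "approved"  # execution may continue
    REJECTED = "rejected"  # execution stays stopped


@dataclass
class ApprovalGate:
    """Hypothetical model of an approval gate (not an Overwatch API)."""
    approver_roles: set = field(default_factory=lambda: {"Manager", "Admin"})
    status: GateStatus = GateStatus.PENDING

    def decide(self, role: str, approve: bool) -> GateStatus:
        """Record an approve/reject decision from a user with the given role."""
        if role not in self.approver_roles:
            raise PermissionError(f"role {role!r} cannot decide this gate")
        if self.status is not GateStatus.PENDING:
            # Assumption: a gate is decided at most once.
            raise ValueError("gate already decided")
        self.status = GateStatus.APPROVED if approve else GateStatus.REJECTED
        return self.status

    def may_continue(self) -> bool:
        """Execution proceeds only after an explicit approval."""
        return self.status is GateStatus.APPROVED
```

In this sketch, an unauthorized role raises an error instead of silently being ignored, so a misconfigured approver list is noticed early.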

Variables make procedures reusable across services and environments. Define them in the procedure header, and reference them in step commands and instructions with $VARIABLE_NAME syntax.

Common variable patterns:

Variable         Purpose                Example values
$SERVICE_NAME    Target service         checkout-service, auth-api
$NAMESPACE       Kubernetes namespace   production, staging
$ENVIRONMENT     Deploy environment     prod, staging, dev
$REPLICA_COUNT   Scaling target         3, 5, 10
$DB_HOST         Database endpoint      primary.db.internal

When a team member executes the procedure, they are prompted to fill in each variable before the first step begins.
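
To preview how a command will look once variables are filled in, the substitution can be approximated locally. This sketch uses Python's standard `string.Template`, which happens to share the `$VARIABLE_NAME` syntax. How Overwatch performs substitution internally is not documented here, and `missing_variables`, `render_command`, and the sample `kubectl scale` command are illustrative assumptions.

```python
import re
from string import Template

# Matches the $VARIABLE_NAME form used on this page (uppercase, digits, underscores).
VARIABLE_PATTERN = re.compile(r"\$([A-Z_][A-Z0-9_]*)")


def missing_variables(command: str, values: dict) -> list:
    """Return referenced variable names that have no value yet, in order of appearance."""
    missing = []
    for name in VARIABLE_PATTERN.findall(command):
        if name not in values and name not in missing:
            missing.append(name)
    return missing


def render_command(command: str, values: dict) -> str:
    """Substitute every variable, refusing to render if any value is missing."""
    missing = missing_variables(command, values)
    if missing:
        raise KeyError(f"missing values for: {', '.join(missing)}")
    return Template(command).substitute(values)


command = "kubectl scale deploy/$SERVICE_NAME -n $NAMESPACE --replicas=$REPLICA_COUNT"
values = {"SERVICE_NAME": "checkout-service", "NAMESPACE": "staging", "REPLICA_COUNT": 5}
print(render_command(command, values))
# prints: kubectl scale deploy/checkout-service -n staging --replicas=5
```

Checking for missing values before rendering mirrors the prompt described above: nothing runs until every variable has a value.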

Associate procedures with specific alert patterns so they surface automatically during incident response:

  1. Open the procedure and click Settings
  2. Under Incident Associations, add matching criteria:
    • Service name: Matches against the service field in incoming alerts
    • Alert keywords: Terms that appear in alert titles or descriptions
    • Severity levels: Which severity levels trigger the association
  3. Save the associations

When an incident matches these criteria, the procedure appears in the Suggested Procedures section of the incident detail page.
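
The matching criteria above can be read as a predicate over an incoming alert. The sketch below is one plausible interpretation, not Overwatch's matching logic: it assumes all three criteria must hold, that any single keyword is enough, and that matching is case-insensitive. The `Association` class, the alert dictionary keys, and the `SEV1`/`SEV2` labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Association:
    """Hypothetical model of one incident association (not an Overwatch API)."""
    service_name: str
    keywords: list
    severities: set


def matches(assoc: Association, alert: dict) -> bool:
    """Return True when the alert satisfies the service, severity, and keyword criteria."""
    text = f"{alert.get('title', '')} {alert.get('description', '')}".lower()
    return (
        alert.get("service") == assoc.service_name
        and alert.get("severity") in assoc.severities
        # Assumption: one matching keyword in the title or description is enough.
        and any(keyword.lower() in text for keyword in assoc.keywords)
    )


assoc = Association(
    service_name="checkout-service",
    keywords=["timeout", "latency"],
    severities={"SEV1", "SEV2"},
)
alert = {
    "service": "checkout-service",
    "title": "High latency on checkout",
    "description": "p99 above threshold",
    "severity": "SEV2",
}
print(matches(assoc, alert))  # prints: True
```

Broad keywords surface a procedure more often, so narrow terms are a reasonable starting point if the suggestions become noisy.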

Procedures start in Draft status and are only visible to their creator and Admins. Before publishing:

  1. Click Test Run to execute the procedure in dry-run mode
  2. Walk through each step and verify that commands, instructions, and expected outcomes are accurate
  3. Ask a teammate to review the procedure for clarity and completeness
  4. Once satisfied, click Publish to make the procedure available to your team

Tip: Run the test against a staging environment first. This validates that commands work without risking production systems.
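
The idea behind a dry run is to walk the steps and show what would be executed without running anything. The exact behavior of Test Run is not specified on this page, so the following is only a sketch of the general technique. `run_commands` and its log format are hypothetical.

```python
import shlex
import subprocess


def run_commands(commands, dry_run=True):
    """Return one log line per command; commands execute only when dry_run is False."""
    log = []
    for number, command in enumerate(commands, start=1):
        if dry_run:
            # Record the command for review instead of executing it.
            log.append(f"[dry-run] step {number}: {command}")
            continue
        # shlex.split avoids passing the string through a shell.
        result = subprocess.run(shlex.split(command), capture_output=True, text=True)
        log.append(f"step {number}: exit code {result.returncode}")
    return log


for line in run_commands(["kubectl get pods -n staging"]):
    print(line)
# prints: [dry-run] step 1: kubectl get pods -n staging
```

Defaulting `dry_run` to `True` means a command only executes when the caller opts in explicitly.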

Overwatch includes starter templates for common operational scenarios:

  • Service Restart: Rolling restart with health checks
  • Database Failover: Primary/replica promotion with connection verification
  • Scaling Up/Down: Adjust replica count with load validation
  • Certificate Renewal: TLS certificate rotation with endpoint verification
  • Dependency Health Check: Verify upstream and downstream service connectivity

To use a template, click Create from Template on the Procedures page, select a template, and customize the steps and variables for your environment.

Follow these practices to keep procedures reliable:

  • Keep steps atomic: Each step should do one thing. If a step has multiple sub-actions, split it into separate steps.
  • Include rollback steps: Add steps at the end of the procedure that reverse the changes if something goes wrong.
  • Update after incidents: Revise procedures when post-incident reviews reveal missing steps or incorrect assumptions.
  • Version your procedures: Overwatch tracks procedure revisions. Add a note when you update a procedure explaining what changed and why.