
Incident Response Workflow

This guide walks through a complete incident response scenario using Overwatch, from the moment an alert fires to verified resolution. The example uses a Datadog alert, but the workflow applies to any supported monitoring platform.

A Datadog monitor triggers a critical alert: “High error rate on checkout-service (>5% 5xx responses)”. Here is how to investigate and resolve it with Overwatch.

You are viewing your Datadog dashboard when the monitor fires. The Overwatch Chrome extension detects the alert automatically through its content script and network interception layer.

Look for the Overwatch icon in your browser toolbar. A red badge indicates an active alert has been captured.

Tip: The extension works best when the alert detail page is open. Navigate to the specific monitor or event page so the extension can extract the full alert payload, including tags, thresholds, and affected hosts.

Press Ctrl+Shift+I (Windows/Linux) or Cmd+Shift+I (macOS) to open the Overwatch side panel. The AI chat interface appears alongside your Datadog dashboard.

The extension automatically injects the detected alert context into the conversation. You will see a summary of the alert data at the top of the chat, including:

  • Alert title and severity
  • Affected service and hosts
  • Metric values that breached the threshold
  • Relevant tags from Datadog
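As a rough illustration, the injected context resembles the following JSON. The field names here are assumptions for the sake of example, not the extension's actual schema:

```shell
# Hypothetical shape of the injected alert context (field names are
# illustrative assumptions, not Overwatch's actual payload schema).
ALERT_CONTEXT='{
  "title": "High error rate on checkout-service (>5% 5xx responses)",
  "severity": "critical",
  "service": "checkout-service",
  "hosts": ["i-0abc123", "i-0def456"],
  "metric_value": 7.2,
  "threshold": 5.0,
  "tags": ["env:production", "team:payments"]
}'
printf '%s\n' "$ALERT_CONTEXT"
```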

Even though the extension captures alert metadata, you often have additional context that improves the AI’s analysis. Type a message that includes what you observe:

The checkout-service started throwing 5xx errors about 20 minutes ago.
We deployed version 2.4.1 roughly 30 minutes ago. The error rate chart
shows a sharp increase right after the deployment window.

The AI combines your description with the injected alert data and your Service Registry configuration to build a full picture of the problem.

Tip: Good prompts are specific. Mention recent changes (deployments, config updates, scaling events), timing correlations, and any patterns you notice. The more context you provide, the more targeted the diagnosis.

Based on the alert and your description, the AI proposes a set of diagnostic commands. For a Kubernetes-hosted checkout service, you might see:

Suggested commands:
1. kubectl get pods -n production -l app=checkout-service
2. kubectl logs -n production -l app=checkout-service --tail=100 --since=30m
3. kubectl describe deployment checkout-service -n production

Each command includes an explanation of what it checks and why it is relevant to the current incident.

If the Helper CLI is running on your machine, click Approve next to each command to execute it. The Helper runs the command locally using your existing credentials and Kubernetes context, then streams the results back to the AI.

The AI reads the output in real time. For example, the pod listing might reveal CrashLoopBackOff on two of four pods, and the logs might show a database connection error introduced by the new deployment.

Note: You can approve commands individually or in batch. The Helper CLI enforces an allowlist of safe commands and blocks destructive operations. See the Helper CLI security model for details.
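The allowlist idea can be sketched in a few lines of shell. This is a simplified illustration of the concept only; the Helper CLI's actual command list and matching rules may differ:

```shell
# Simplified sketch of read-only command allowlisting. The real Helper CLI
# applies more sophisticated rules than this prefix match.
is_allowed() {
  case "$1" in
    "kubectl get "*|"kubectl logs "*|"kubectl describe "*|"kubectl rollout status "*)
      return 0 ;;  # read-only diagnostics: allowed
    *)
      return 1 ;;  # everything else, including destructive ops: blocked
  esac
}

is_allowed "kubectl get pods -n production" && echo "approved"
is_allowed "kubectl delete deployment checkout-service" || echo "blocked"
```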

After reviewing the command output, the AI synthesizes its findings:

Root Cause: The v2.4.1 deployment introduced a new environment variable
(DB_POOL_MAX) that is not set in the production ConfigMap. Two pods are
crash-looping because the application fails at startup when the variable
is missing.
Affected: checkout-service pods checkout-service-7b9f4-abc12 and
checkout-service-7b9f4-def34 in the production namespace.

The AI then proposes a remediation plan with specific commands, such as:

1. kubectl edit configmap checkout-service-config -n production
→ Add DB_POOL_MAX=20
2. kubectl rollout restart deployment/checkout-service -n production
3. kubectl rollout status deployment/checkout-service -n production

After you approve and execute these commands, run a verification step:

kubectl get pods -n production -l app=checkout-service

Confirm all pods are in Running state. Check the Datadog monitor to verify the error rate has dropped below the threshold.
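One way to script this check (assuming the same labels and namespace as above) is kubectl wait, which blocks until the pods report Ready or the timeout expires:

```shell
# Wait up to two minutes for every checkout-service pod to become Ready;
# exits non-zero if any pod fails to reach the condition in time.
kubectl wait pod \
  -l app=checkout-service \
  -n production \
  --for=condition=Ready \
  --timeout=120s
```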

Once the incident is verified as resolved:

  1. The AI generates a resolution summary that you can save to the incident record
  2. Navigate to the Overwatch dashboard and update the incident status to Resolved
  3. Add any additional notes about the root cause and preventive actions
  4. Close the incident after the monitoring window confirms stability

Tip: Resolutions stored in Overwatch feed the semantic cache. The next time a similar alert fires, the AI references this resolution to provide faster, more accurate diagnosis.

The quality of your prompts directly affects the quality of the AI’s diagnosis. Follow these guidelines:

  • Include timestamps and durations, not vague descriptions like “it’s broken”
  • Mention recent changes (deploys, config) rather than assuming the AI knows your change history
  • Reference specific services and hosts, not generic references like “the server”
  • Describe observed vs. expected behavior, not just the symptom without context
  • Share relevant metrics or thresholds instead of omitting numerical data visible on your dashboard

Not every incident can be resolved through the AI loop alone. Escalate when:

  • The AI’s suggested commands require permissions you do not have
  • The root cause involves infrastructure outside your team’s ownership (network, DNS, third-party providers)
  • The incident has been open for longer than your organization’s escalation threshold
  • Multiple unrelated services are affected, suggesting a broader infrastructure problem
  • The AI explicitly recommends involving another team or specialist