Incident Response Workflow

This guide walks through a complete incident response scenario using Overwatch, from the moment an alert fires to verified resolution. The example uses a Datadog alert, but the workflow applies to any supported monitoring platform.
Scenario: Production Alert on Datadog
A Datadog monitor triggers a critical alert: “High error rate on checkout-service (>5% 5xx responses)”. Here is how to investigate and resolve it with Overwatch.
Step 1: Alert Detection
You are viewing your Datadog dashboard when the monitor fires. The Overwatch Chrome extension detects the alert automatically through its content script and network interception layer.
Look for the Overwatch icon in your browser toolbar. A red badge indicates an active alert has been captured.
Tip: The extension works best when the alert detail page is open. Navigate to the specific monitor or event page so the extension can extract the full alert payload, including tags, thresholds, and affected hosts.
Step 2: Open the AI Chat Side Panel
Press Ctrl+Shift+I (Windows/Linux) or Cmd+Shift+I (macOS) to open the Overwatch side panel. The AI chat interface appears alongside your Datadog dashboard.
The extension automatically injects the detected alert context into the conversation. You will see a summary of the alert data at the top of the chat, including:
- Alert title and severity
- Affected service and hosts
- Metric values that breached the threshold
- Relevant tags from Datadog
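As a rough illustration, the injected context can be thought of as a structured payload like the one below. The field names and values here are hypothetical, not Overwatch's actual schema:

```python
# Hypothetical shape of the injected alert context. Field names and the
# summarize() helper are illustrative assumptions, not Overwatch's real schema.
alert_context = {
    "title": "High error rate on checkout-service (>5% 5xx responses)",
    "severity": "critical",
    "service": "checkout-service",
    "hosts": ["i-0abc123", "i-0def456"],
    "metric": {"name": "http.5xx_rate", "value": 0.082, "threshold": 0.05},
    "tags": ["env:production", "team:payments"],
}

def summarize(ctx: dict) -> str:
    """Render a one-line summary like the one shown at the top of the chat."""
    pct = ctx["metric"]["value"] * 100
    return f'[{ctx["severity"].upper()}] {ctx["service"]}: {ctx["metric"]["name"]} at {pct:.1f}%'

print(summarize(alert_context))
```

Having the alert data in a structured form like this is what lets the AI correlate it with your Service Registry entries in the next step.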
Step 3: Describe What You See
Even though the extension captures alert metadata, you often have additional context that improves the AI’s analysis. Type a message that includes what you observe:
> The checkout-service started throwing 5xx errors about 20 minutes ago. We deployed version 2.4.1 roughly 30 minutes ago. The error rate chart shows a sharp increase right after the deployment window.

The AI combines your description with the injected alert data and your Service Registry configuration to build a full picture of the problem.
Tip: Good prompts are specific. Mention recent changes (deployments, config updates, scaling events), timing correlations, and any patterns you notice. The more context you provide, the more targeted the diagnosis.
Step 4: AI Suggests Diagnostic Commands
Based on the alert and your description, the AI proposes a set of diagnostic commands. For a Kubernetes-hosted checkout service, you might see:
Suggested commands:

1. `kubectl get pods -n production -l app=checkout-service`
2. `kubectl logs -n production -l app=checkout-service --tail=100 --since=30m`
3. `kubectl describe deployment checkout-service -n production`

Each command includes an explanation of what it checks and why it is relevant to the current incident.
Step 5: Execute Commands via Helper CLI
If the Helper CLI is running on your machine, click Approve next to each command to execute it. The Helper runs the command locally using your existing credentials and Kubernetes context, then streams the results back to the AI.
The AI reads the output in real time. For example, the pod listing might reveal CrashLoopBackOff on two of four pods, and the logs might show a database connection error introduced by the new deployment.
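To make the "reads the output" step concrete, here is a minimal sketch of how unhealthy pods could be picked out of `kubectl get pods` text. This is an illustrative helper, not Overwatch's implementation:

```python
def unhealthy_pods(kubectl_output: str) -> list[str]:
    """Return names of pods whose STATUS is not Running, given the
    plain-text output of `kubectl get pods` (header row + pod rows)."""
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header
        cols = line.split()
        name, status = cols[0], cols[2]
        if status != "Running":
            bad.append(name)
    return bad

# Sample output matching the incident described above (pod names are
# the illustrative ones used in this guide).
sample = """\
NAME                          READY   STATUS             RESTARTS   AGE
checkout-service-7b9f4-abc12  0/1     CrashLoopBackOff   6          12m
checkout-service-7b9f4-def34  0/1     CrashLoopBackOff   6          12m
checkout-service-7b9f4-ghi56  1/1     Running            0          3h
checkout-service-7b9f4-jkl78  1/1     Running            0          3h
"""

print(unhealthy_pods(sample))
```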
Note: You can approve commands individually or in batch. The Helper CLI enforces an allowlist of safe commands and blocks destructive operations. See the Helper CLI security model for details.
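The allowlist idea can be sketched as a simple prefix check plus a blocklist of destructive verbs. The specific rules below are assumptions for illustration; the real Helper CLI's policy may differ (see its security model documentation):

```python
import shlex

# Hypothetical policy in the spirit of the Helper CLI's allowlist;
# the actual rules are defined by the Helper, not by this sketch.
ALLOWED_PREFIXES = {
    ("kubectl", "get"),
    ("kubectl", "logs"),
    ("kubectl", "describe"),
    ("kubectl", "rollout"),
}
BLOCKED_VERBS = {"delete", "drain", "cordon", "exec", "apply"}

def is_allowed(command: str) -> bool:
    """Approve only read/observe-style commands; block destructive verbs."""
    parts = shlex.split(command)
    if len(parts) < 2 or parts[1] in BLOCKED_VERBS:
        return False
    return (parts[0], parts[1]) in ALLOWED_PREFIXES
```

Checking the (binary, subcommand) pair rather than doing substring matching avoids accidentally approving commands that merely mention a safe verb in an argument.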
Step 6: Root Cause Identification
After reviewing the command output, the AI synthesizes its findings:
> Root Cause: The v2.4.1 deployment introduced a new environment variable (DB_POOL_MAX) that is not set in the production ConfigMap. Two pods are crash-looping because the application fails to parse the missing variable at startup.
>
> Affected: checkout-service pods checkout-service-7b9f4-abc12 and checkout-service-7b9f4-def34 in the production namespace.

The AI then suggests a remediation plan with specific commands to fix the issue.
Step 7: Execute the Fix and Verify
The AI proposes a fix, such as:
1. `kubectl edit configmap checkout-service-config -n production` → add `DB_POOL_MAX=20`
2. `kubectl rollout restart deployment/checkout-service -n production`
3. `kubectl rollout status deployment/checkout-service -n production`

After you approve and execute these commands, run a verification step:

`kubectl get pods -n production -l app=checkout-service`

Confirm that all pods are in the Running state, then check the Datadog monitor to verify the error rate has dropped below the threshold.
Step 8: Document and Close
Once the incident is verified as resolved:
- The AI generates a resolution summary that you can save to the incident record
- Navigate to the Overwatch dashboard and update the incident status to Resolved
- Add any additional notes about the root cause and preventive actions
- Close the incident after the monitoring window confirms stability
Tip: Resolutions stored in Overwatch feed the semantic cache. The next time a similar alert fires, the AI references this resolution to provide faster, more accurate diagnosis.
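Conceptually, a semantic cache matches a new alert against stored resolutions by embedding similarity rather than exact text. The sketch below uses a toy bag-of-words "embedding" and cosine similarity purely to illustrate the idea; Overwatch's actual cache, embedding model, and threshold are not documented here and everything below is an assumption:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real semantic cache would use
    # a learned embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical cache: alert text -> stored resolution summary.
cache = {
    "high error rate on checkout-service 5xx responses":
        "v2.4.1 missing DB_POOL_MAX in ConfigMap; add the key and restart the rollout",
}

def lookup(alert_text: str, threshold: float = 0.5):
    """Return the closest cached resolution if it is similar enough."""
    best = max(cache, key=lambda key: cosine(embed(key), embed(alert_text)))
    if cosine(embed(best), embed(alert_text)) >= threshold:
        return cache[best]
    return None
```

A similarity threshold matters here: without one, every new alert would match *some* cached resolution, however unrelated.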
Writing Effective Prompts
The quality of your prompts directly affects the quality of the AI’s diagnosis. Follow these guidelines:
| Do | Avoid |
|---|---|
| Include timestamps and durations | Vague descriptions like “it’s broken” |
| Mention recent changes (deploys, config) | Assuming the AI knows your change history |
| Reference specific services and hosts | Generic references like “the server” |
| Describe observed vs expected behavior | Only describing the symptom without context |
| Share relevant metrics or thresholds | Omitting numerical data visible on your dashboard |
When to Escalate
Not every incident can be resolved through the AI loop alone. Escalate when:
- The AI’s suggested commands require permissions you do not have
- The root cause involves infrastructure outside your team’s ownership (network, DNS, third-party providers)
- The incident has been open for longer than your organization’s escalation threshold
- Multiple unrelated services are affected, suggesting a broader infrastructure problem
- The AI explicitly recommends involving another team or specialist
Next Steps
- Creating Procedures — Build runbooks to standardize common fixes
- Integration Setup — Connect additional monitoring platforms
- Team Collaboration — Coordinate with teammates during incidents
- AI Chat Guide — Advanced prompting and chat features