Incident Response Workflow

This guide walks through a complete incident response scenario using Overwatch, from the moment an alert fires to verified resolution. The example uses a Datadog alert, but the workflow applies to any supported monitoring platform.
Scenario: Production Alert on Datadog
A Datadog monitor triggers a critical alert: “High error rate on checkout-service (>5% 5xx responses)”. Here is how to investigate and resolve it with Overwatch.
Step 1: Alert Detection
You are viewing your Datadog dashboard when the monitor fires. The Overwatch Chrome extension detects the alert automatically through its content script and network interception layer.
Look for the Overwatch icon in your browser toolbar. A red badge indicates an active alert has been captured.
Tip: The extension works best when the alert detail page is open. Navigate to the specific monitor or event page so the extension can extract the full alert payload, including tags, thresholds, and affected hosts.
Step 2: Open the AI Chat Side Panel
Press Ctrl+Shift+I (Windows/Linux) or Cmd+Shift+I (macOS) to open the Overwatch side panel. The AI chat interface appears alongside your Datadog dashboard.
The extension automatically injects the detected alert context into the conversation. You will see a summary of the alert data at the top of the chat, including:
- Alert title and severity
- Affected service and hosts
- Metric values that breached the threshold
- Relevant tags from Datadog
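As a rough illustration, the injected context can be thought of as a structured payload like the one below. The field names and values here are hypothetical, not Overwatch's actual schema:

```python
# Hypothetical shape of the injected alert context. Field names and the
# summarize() helper are illustrative assumptions, not Overwatch's real schema.
alert_context = {
    "title": "High error rate on checkout-service (>5% 5xx responses)",
    "severity": "critical",
    "service": "checkout-service",
    "hosts": ["i-0abc123", "i-0def456"],
    "metric": {"name": "http.5xx_rate", "value": 0.082, "threshold": 0.05},
    "tags": ["env:production", "team:payments"],
}

def summarize(ctx: dict) -> str:
    """Render a one-line summary like the one shown at the top of the chat."""
    pct = ctx["metric"]["value"] * 100
    return f'[{ctx["severity"].upper()}] {ctx["service"]}: {ctx["metric"]["name"]} at {pct:.1f}%'

print(summarize(alert_context))
```

Having the alert data in a structured form like this is what lets the AI correlate it with your Service Registry entries in the next step.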
Step 3: Describe What You See
Even though the extension captures alert metadata, you often have additional context that improves the AI’s analysis. Type a message that includes what you observe:
> The checkout-service started throwing 5xx errors about 20 minutes ago. We deployed version 2.4.1 roughly 30 minutes ago. The error rate chart shows a sharp increase right after the deployment window.

The AI combines your description with the injected alert data and your Service Registry configuration to build a full picture of the problem.
Tip: Good prompts are specific. Mention recent changes (deployments, config updates, scaling events), timing correlations, and any patterns you notice. The more context you provide, the more targeted the diagnosis.
Step 4: AI Suggests Diagnostic Commands
Based on the alert and your description, the AI proposes a set of diagnostic commands. For a Kubernetes-hosted checkout service, you might see:
Suggested commands:

1. `kubectl get pods -n production -l app=checkout-service`
2. `kubectl logs -n production -l app=checkout-service --tail=100 --since=30m`
3. `kubectl describe deployment checkout-service -n production`

Each command includes an explanation of what it checks and why it is relevant to the current incident.
Step 5: Execute Commands via Helper CLI
If the Helper CLI is running on your machine, click Approve next to each command to execute it. The Helper runs the command locally using your existing credentials and Kubernetes context, then streams the results back to the AI.
The AI reads the output in real time. For example, the pod listing might reveal CrashLoopBackOff on two of four pods, and the logs might show a database connection error introduced by the new deployment.
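To make the "reads the output" step concrete, here is a minimal sketch of how unhealthy pods could be picked out of `kubectl get pods` text. This is an illustrative helper, not Overwatch's implementation:

```python
def unhealthy_pods(kubectl_output: str) -> list[str]:
    """Return names of pods whose STATUS is not Running, given the
    plain-text output of `kubectl get pods` (header row + pod rows)."""
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header
        cols = line.split()
        name, status = cols[0], cols[2]
        if status != "Running":
            bad.append(name)
    return bad

# Sample output matching the incident described above (pod names are
# the illustrative ones used in this guide).
sample = """\
NAME                          READY   STATUS             RESTARTS   AGE
checkout-service-7b9f4-abc12  0/1     CrashLoopBackOff   6          12m
checkout-service-7b9f4-def34  0/1     CrashLoopBackOff   6          12m
checkout-service-7b9f4-ghi56  1/1     Running            0          3h
checkout-service-7b9f4-jkl78  1/1     Running            0          3h
"""

print(unhealthy_pods(sample))
```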
Note: You can approve commands individually or in batch. The Helper CLI enforces an allowlist of safe commands and blocks destructive operations. See the Helper CLI security model for details.
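The allowlist idea can be sketched as a simple prefix check plus a blocklist of destructive verbs. The specific rules below are assumptions for illustration; the real Helper CLI's policy may differ (see its security model documentation):

```python
import shlex

# Hypothetical policy in the spirit of the Helper CLI's allowlist;
# the actual rules are defined by the Helper, not by this sketch.
ALLOWED_PREFIXES = {
    ("kubectl", "get"),
    ("kubectl", "logs"),
    ("kubectl", "describe"),
    ("kubectl", "rollout"),
}
BLOCKED_VERBS = {"delete", "drain", "cordon", "exec", "apply"}

def is_allowed(command: str) -> bool:
    """Approve only read/observe-style commands; block destructive verbs."""
    parts = shlex.split(command)
    if len(parts) < 2 or parts[1] in BLOCKED_VERBS:
        return False
    return (parts[0], parts[1]) in ALLOWED_PREFIXES
```

Checking the (binary, subcommand) pair rather than doing substring matching avoids accidentally approving commands that merely mention a safe verb in an argument.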
Step 6: Root Cause Identification
After reviewing the command output, the AI synthesizes its findings:
> Root Cause: The v2.4.1 deployment introduced a new environment variable (DB_POOL_MAX) that is not set in the production ConfigMap. Two pods are crash-looping because the application fails to parse the missing variable at startup.
>
> Affected: checkout-service pods checkout-service-7b9f4-abc12 and checkout-service-7b9f4-def34 in the production namespace.

The AI then suggests a remediation plan with specific commands to fix the issue.
Step 7: Execute the Fix and Verify
The AI proposes a fix, such as:
1. `kubectl edit configmap checkout-service-config -n production` → add `DB_POOL_MAX=20`
2. `kubectl rollout restart deployment/checkout-service -n production`
3. `kubectl rollout status deployment/checkout-service -n production`

After you approve and execute these commands, run a verification step:

`kubectl get pods -n production -l app=checkout-service`

Confirm that all pods are in the Running state, then check the Datadog monitor to verify the error rate has dropped below the threshold.
Step 8: Document and Close
Once the incident is verified as resolved:
- The AI generates a resolution summary that you can save to the incident record
- Navigate to the Overwatch dashboard and update the incident status to Resolved
- Add any additional notes about the root cause and preventive actions
- Close the incident after the monitoring window confirms stability
Tip: Resolutions stored in Overwatch feed the semantic cache. The next time a similar alert fires, the AI references this resolution to provide faster, more accurate diagnosis.
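Conceptually, a semantic cache matches a new alert against stored resolutions by embedding similarity rather than exact text. The sketch below uses a toy bag-of-words "embedding" and cosine similarity purely to illustrate the idea; Overwatch's actual cache, embedding model, and threshold are not documented here and everything below is an assumption:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real semantic cache would use
    # a learned embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical cache: alert text -> stored resolution summary.
cache = {
    "high error rate on checkout-service 5xx responses":
        "v2.4.1 missing DB_POOL_MAX in ConfigMap; add the key and restart the rollout",
}

def lookup(alert_text: str, threshold: float = 0.5):
    """Return the closest cached resolution if it is similar enough."""
    best = max(cache, key=lambda key: cosine(embed(key), embed(alert_text)))
    if cosine(embed(best), embed(alert_text)) >= threshold:
        return cache[best]
    return None
```

A similarity threshold matters here: without one, every new alert would match *some* cached resolution, however unrelated.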
Writing Effective Prompts
The quality of your prompts directly affects the quality of the AI’s diagnosis. Follow these guidelines:
| Do | Avoid |
|---|---|
| Include timestamps and durations | Vague descriptions like “it’s broken” |
| Mention recent changes (deploys, config) | Assuming the AI knows your change history |
| Reference specific services and hosts | Generic references like “the server” |
| Describe observed vs expected behavior | Only describing the symptom without context |
| Share relevant metrics or thresholds | Omitting numerical data visible on your dashboard |
When to Escalate
Not every incident can be resolved through the AI loop alone. Escalate when:
- The AI’s suggested commands require permissions you do not have
- The root cause involves infrastructure outside your team’s ownership (network, DNS, third-party providers)
- The incident has been open for longer than your organization’s escalation threshold
- Multiple unrelated services are affected, suggesting a broader infrastructure problem
- The AI explicitly recommends involving another team or specialist
Next Steps
- Creating Procedures — Build runbooks to standardize common fixes
- Integration Setup — Connect additional monitoring platforms
- Team Collaboration — Coordinate with teammates during incidents
- AI Chat Guide — Advanced prompting and chat features