Incident Management

Incidents in Overwatch represent problems that require investigation and resolution. The platform provides AI-powered assistance throughout the incident lifecycle, from creation through resolution and documentation.

Creating an Incident

Quick Create

You can create an incident from three places:

From Dashboard

Dashboard → "Create Incident" button

From Incidents Page

Incidents → "New Incident" button

From Chrome Extension

Monitoring Platform → Detect Alert → "Report to Overwatch" button

Incident Details

When creating an incident, provide the following information:

Required Fields

Title: Clear, descriptive incident title (e.g., “Database connection timeout in production”)
Description: Detailed problem description including error messages and impact
Severity: Priority level for triage and response

Optional Fields

Assignee: Team member responsible for resolution
Tags: Keywords for organization and searchability (e.g., “database”, “production”, “timeout”)
Related Links: Links to monitoring dashboards, logs, or external resources
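
As a rough illustration of the required and optional fields, the sketch below models an incident draft and checks that the required fields are filled in. The class name, attribute names, and validation messages are hypothetical and are not part of the Overwatch API; the severity values come from the Severity Levels table.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SEVERITIES = ("Critical", "High", "Medium", "Low")


@dataclass
class IncidentDraft:
    """Hypothetical in-memory model of the incident creation form."""
    title: str
    description: str
    severity: str
    assignee: Optional[str] = None                          # optional
    tags: List[str] = field(default_factory=list)           # optional
    related_links: List[str] = field(default_factory=list)  # optional

    def validation_errors(self) -> List[str]:
        """Return a list of problems; an empty list means the draft is complete."""
        errors = []
        if not self.title.strip():
            errors.append("title is required")
        if not self.description.strip():
            errors.append("description is required")
        if self.severity not in SEVERITIES:
            errors.append(f"severity must be one of {', '.join(SEVERITIES)}")
        return errors


draft = IncidentDraft(
    title="Database connection timeout in production",
    description="",
    severity="High",
    tags=["database", "production", "timeout"],
)
print(draft.validation_errors())  # ['description is required']
```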

Severity Levels

Choose the appropriate severity to ensure proper prioritization:

Severity	Description	Response Time
Critical	Complete service outage affecting all users	Immediate (< 15 min)
High	Major functionality impaired, significant user impact	Urgent (< 1 hour)
Medium	Partial functionality affected, workarounds available	Normal (< 4 hours)
Low	Minor issues, minimal impact on users	Scheduled (< 24 hours)
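
The response-time targets in the table can be turned into a concrete deadline. The following is a minimal sketch, assuming the target is measured from the incident's creation time; the function and constant names are illustrative only.

```python
from datetime import datetime, timedelta

# Response-time targets copied from the severity table.
RESPONSE_TARGETS = {
    "Critical": timedelta(minutes=15),
    "High": timedelta(hours=1),
    "Medium": timedelta(hours=4),
    "Low": timedelta(hours=24),
}


def response_deadline(severity: str, created_at: datetime) -> datetime:
    """Return the latest time a first response is still within target."""
    return created_at + RESPONSE_TARGETS[severity]


created = datetime(2024, 1, 1, 9, 0)
print(response_deadline("High", created))  # 2024-01-01 10:00:00
```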

Auto-Population from Chrome Extension

The Chrome extension speeds up incident creation by pre-filling the form from the alert you are viewing:

Automatic Context Extraction

Detects alerts in monitoring platforms (Datadog, New Relic, Grafana, etc.)
Extracts error messages and stack traces automatically
Captures relevant metrics and time ranges
Links to monitoring platform dashboards
Includes screenshots of alerts (optional)

Supported Platforms

Datadog - Monitor alerts and APM traces
New Relic - Violations and anomalies
Grafana - Dashboard alerts and annotations
PagerDuty - Incident webhooks
Prometheus - Alertmanager notifications
Elasticsearch - Watcher alerts
SigNoz - OpenTelemetry alerts
Sentry - Error tracking events

How It Works

Extension detects alert on monitoring platform page
Extracts relevant context (error messages, metrics, timestamps)
Pre-populates incident creation form with extracted data
You review and adjust details before creating incident
Incident is linked to original monitoring platform resource

Incident Workflow

Standard Lifecycle

Incidents progress through a standard workflow:

New → In Progress → Resolved → Closed

1. New

Incident created, awaiting assignment
Triage phase to determine severity and priority
Initial investigation to gather context

2. In Progress

Assigned to team member or actively being investigated
Active resolution efforts underway
Regular updates and communication happening
Procedures may be executed during this phase

3. Resolved

Solution has been implemented
Awaiting verification that issue is fixed
Monitoring for recurrence before closing

4. Closed

Issue fully resolved and verified
Documentation complete
Post-incident review completed (for critical/high severity)
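
The lifecycle above can be thought of as a small state machine. The sketch below encodes the four documented statuses; the forward transitions follow the lifecycle, while the two backward transitions (returning a Resolved incident to In Progress, and reopening a Closed incident into In Progress) are assumptions made for illustration, since this guide does not say which status a reopened incident returns to.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    IN_PROGRESS = "In Progress"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Forward transitions follow the documented lifecycle. The backward
# transitions (failed verification, reopening) are assumptions.
ALLOWED = {
    Status.NEW: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED, Status.IN_PROGRESS},
    Status.CLOSED: {Status.IN_PROGRESS},
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


status = Status.NEW
for step in (Status.IN_PROGRESS, Status.RESOLVED, Status.CLOSED):
    status = transition(status, step)
print(status.value)  # Closed
```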

Incident Assignment

Manual Assignment

Incident Detail Page → Assignee Dropdown → Select Team Member

Auto-Assignment Rules (Admin Configuration)

By severity level
By tags or categories
By team member availability
Round-robin distribution
Based on expertise and past resolution success
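
To make one of these rules concrete, here is a minimal sketch of round-robin distribution. It only illustrates the idea; how Overwatch implements or combines auto-assignment rules is not described here, and the incident IDs and usernames are made up.

```python
from itertools import cycle
from typing import Dict, List


def assign_round_robin(incident_ids: List[str], team: List[str]) -> Dict[str, str]:
    """Map each incident ID to the next team member in a fixed rotation."""
    if not team:
        raise ValueError("team must contain at least one member")
    rotation = cycle(team)  # endless iterator over the team, in order
    return {incident_id: next(rotation) for incident_id in incident_ids}


print(assign_round_robin(["INC-1", "INC-2", "INC-3"], ["ana", "ben"]))
# {'INC-1': 'ana', 'INC-2': 'ben', 'INC-3': 'ana'}
```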

Reassignment

Any team member can reassign incidents
New assignee receives notification
Assignment history tracked in timeline
Original assignee can be notified of reassignment

AI-Powered Resolution

Overwatch provides intelligent assistance throughout the incident resolution process:

Semantic Search Integration

When viewing an incident, the platform automatically:

Searches for Similar Incidents

Analyzes incident description using semantic understanding
Finds conceptually related incidents (not just keyword matches)
Ranks by relevance and resolution success
Shows resolution time and success rates

Suggests Relevant Procedures

Recommends procedures based on incident details
Highlights procedures with high success rates for similar issues
Shows procedure execution history
Indicates estimated resolution time

Displays Historical Context

Previous incidents with similar symptoms
Successful resolution patterns
Common root causes
Prevention measures that worked

Search Confidence Scores

Results include confidence scores to help you evaluate suggestions:

Score Range	Meaning	Action
0.8 - 1.0	Very high confidence match	Strongly recommended
0.7 - 0.79	Good match	Recommended
0.6 - 0.69	Moderate match	Consider carefully
< 0.6	Low confidence	Use caution
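
If you post-process search results yourself, the table can be expressed as a simple threshold function. This is a sketch only: it uses the lower bound of each range, so a score such as 0.795 that falls between two table rows is treated as part of the lower band. The function name is hypothetical.

```python
def recommended_action(score: float) -> str:
    """Map a confidence score in [0, 1] to the action from the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.8:
        return "Strongly recommended"
    if score >= 0.7:
        return "Recommended"
    if score >= 0.6:
        return "Consider carefully"
    return "Use caution"


print(recommended_action(0.72))  # Recommended
```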

AI Suggestion Types

Layer 1: Customer-Specific

Your organization’s historical solutions
Highest confidence (0.8-1.0)
Learns from your team’s successful resolutions

Layer 2: Public Knowledge Base (Current)

Community-contributed solutions from Weaviate
Best practices and common resolutions
Confidence: 0.7-0.9

Layer 3: LLM-Generated Solutions (Available)

AI-generated guidance via AWS Bedrock
Used when no existing solutions found
Confidence: 0.6-0.8
Cost-optimized with semantic caching
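
One way to picture how the layers relate is as a fallback chain: stored solutions are consulted first, and LLM-generated guidance is used only when nothing existing is found. The sketch below assumes the layers are tried in numeric order, which this guide does not state explicitly, and uses plain dictionaries and a lambda as stand-ins for the real lookups.

```python
from typing import Callable, List, Optional, Tuple

# A layer is a (name, lookup) pair; lookup returns a suggestion or None.
Layer = Tuple[str, Callable[[str], Optional[str]]]


def first_suggestion(query: str, layers: List[Layer]) -> Optional[Tuple[str, str]]:
    """Try each layer in order; return (layer name, suggestion) for the first hit."""
    for name, lookup in layers:
        suggestion = lookup(query)
        if suggestion is not None:
            return name, suggestion
    return None


# Stand-in data; real layers would query stored solutions or an LLM.
customer_history = {"db timeout": "Restart the connection pool"}
public_kb = {"disk full": "Rotate and compress old logs"}

layers: List[Layer] = [
    ("Layer 1: Customer-Specific", customer_history.get),
    ("Layer 2: Public Knowledge Base", public_kb.get),
    ("Layer 3: LLM-Generated", lambda q: f"Generated guidance for: {q}"),
]

print(first_suggestion("disk full", layers))
# ('Layer 2: Public Knowledge Base', 'Rotate and compress old logs')
```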

Blast Radius Analysis

The incident detail page includes a Blast Radius visualization that shows the scope of impact:

Total Services Impacted: How many services are affected by this incident
Correlated Alerts: Count of related alerts across monitoring sources
Unique Sources: Number of distinct monitoring platforms reporting related issues

This helps you quickly determine whether an incident is isolated to a single service or part of a broader cascading failure. When multiple services are impacted, the blast radius view helps prioritize which components to investigate first.
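
The three blast radius figures are simple aggregates over the alerts correlated with an incident. A minimal sketch, assuming each alert records the affected service and the monitoring platform that raised it (the Alert type and its field names are hypothetical):

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    """Hypothetical correlated alert record."""
    service: str
    source: str  # monitoring platform that raised the alert


def blast_radius(alerts: List[Alert]) -> Dict[str, int]:
    """Summarize correlated alerts into the three blast radius counts."""
    return {
        "total_services_impacted": len({a.service for a in alerts}),
        "correlated_alerts": len(alerts),
        "unique_sources": len({a.source for a in alerts}),
    }


alerts = [
    Alert("api", "Datadog"),
    Alert("api", "Sentry"),
    Alert("postgres", "Datadog"),
]
print(blast_radius(alerts))
# {'total_services_impacted': 2, 'correlated_alerts': 3, 'unique_sources': 2}
```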

AI Chat Diagnosis

For hands-on investigation, open the AI Chat panel from the Chrome extension:

Navigate to the monitoring platform page related to the incident
Open the side panel (Ctrl+Shift+I / Cmd+Shift+I)
The extension auto-detects the alert and injects context into the chat
Describe what you’re seeing or ask the AI to analyze the alert
The AI provides diagnosis, suggests commands, and guides you through resolution

If the Helper CLI is running, the AI can suggest diagnostic commands (kubectl, aws, docker, etc.) that execute locally and feed results back into the conversation.

See the AI Chat Guide for detailed usage instructions.

Working with Incidents

Incident Detail Page

The incident detail page provides comprehensive information and actions:

Information Sections

Header: Title, severity, status, assignee, timestamps
Description: Full incident details with formatting
Timeline: Complete audit trail of all activities
Comments: Threaded discussions and @mentions
Attachments: Logs, screenshots, diagnostic files
Related Items: Linked procedures, similar incidents
AI Suggestions: Recommended solutions and procedures

Available Actions

Update incident details
Change status or severity
Assign/reassign to team members
Add comments and @mention colleagues
Upload attachments
Execute suggested procedures
Link to related incidents
Close or reopen incident

Comments and Collaboration

Adding Comments

Incident Detail → Comments Section → Type Comment → Post

Comment Features

Markdown Support: Format comments with markdown syntax
@Mentions: Tag team members for notifications (@username)
Threading: Reply to specific comments for organized discussions
Timestamps: All comments include author and timestamp
Edit History: Track comment modifications
File Attachments: Attach files directly to comments

@Mention Notifications

When you mention someone:

They receive an in-app notification
Email notification sent (if configured)
Slack/Teams message (if integration enabled)
Highlighted in their activity feed
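
For teams that process comment text outside the platform, for example when mirroring comments to another tool, mentions can be pulled out with a regular expression. This sketch assumes usernames consist of letters, digits, and underscores; the actual username rules in Overwatch may differ.

```python
import re
from typing import List

# "@" followed by word characters, but not when it sits inside a word
# (which skips e-mail addresses such as user@example.com).
MENTION = re.compile(r"(?<!\w)@(\w+)")


def mentioned_users(comment: str) -> List[str]:
    """Return unique mentioned usernames in order of first appearance."""
    seen: List[str] = []
    for name in MENTION.findall(comment):
        if name not in seen:
            seen.append(name)
    return seen


print(mentioned_users("@ana can you check the pool? cc @ben, @ana"))
# ['ana', 'ben']
```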

Best Practices for Comments

Use comments for status updates and findings
@mention team members for urgent items
Document decisions and rationale
Link to relevant resources
Keep comments professional and constructive

Attachments

Supported File Types

Logs (.log, .txt)
Screenshots (.png, .jpg, .gif)
Configuration files (.yaml, .json, .conf)
Diagnostic reports (.pdf, .csv)
Code snippets (any text file)

File Size Limits

Individual files: Up to 25 MB
Total per incident: Up to 100 MB
Large files automatically compressed
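
A pre-upload check against these limits might look like the sketch below. It assumes 1 MB means 1,048,576 bytes and ignores automatic compression, neither of which this guide specifies; the names are illustrative.

```python
from typing import List

MB = 1024 * 1024  # assumption: binary megabytes
MAX_FILE_BYTES = 25 * MB
MAX_INCIDENT_BYTES = 100 * MB


def can_attach(new_file_bytes: int, existing_file_bytes: List[int]) -> bool:
    """Check a new upload against the per-file and per-incident limits."""
    if new_file_bytes > MAX_FILE_BYTES:
        return False
    return sum(existing_file_bytes) + new_file_bytes <= MAX_INCIDENT_BYTES


print(can_attach(10 * MB, [25 * MB, 25 * MB]))  # True
print(can_attach(30 * MB, []))                  # False
```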

Attachment Management

Incident Detail → Attachments Tab → Upload Files

Drag and drop files to upload
Preview images inline
Download individual or all files
Delete attachments (with permission)

Incident Analytics

Track incident metrics to improve your team’s performance:

Individual Incident Metrics

Resolution Statistics

Time to first response
Time to resolution (contributes to MTTR)
Number of updates/comments
Team members involved
Procedures executed
Similar incidents found

Success Indicators

Was resolution successful?
Did suggested procedures help?
Was incident reopened?
Customer satisfaction (if applicable)

Team Analytics

Access team-wide metrics:

Dashboard → Analytics → Incidents

Key Metrics

MTTR by Severity: Average resolution time broken down by severity level
Incident Volume: Trends over time with severity distribution
Team Performance: Individual and team resolution statistics
Common Issues: Most frequent incident types and tags
Resolution Patterns: Successful resolution methods and procedures
Escalation Trends: Incidents requiring escalation or reassignment
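
As an illustration of the first metric, MTTR by severity is the average of (resolved time minus created time) within each severity level. The sketch below computes it from exported incident records; the tuple layout is an assumption, not an Overwatch export format.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# (severity, created_at, resolved_at) for each resolved incident
ResolvedIncident = Tuple[str, datetime, datetime]


def mttr_by_severity(incidents: List[ResolvedIncident]) -> Dict[str, timedelta]:
    """Average time from creation to resolution, grouped by severity."""
    durations = defaultdict(list)
    for severity, created_at, resolved_at in incidents:
        durations[severity].append(resolved_at - created_at)
    return {
        severity: sum(values, timedelta()) / len(values)
        for severity, values in durations.items()
    }


day = datetime(2024, 1, 1)
sample = [
    ("High", day, day + timedelta(minutes=30)),
    ("High", day, day + timedelta(minutes=90)),
    ("Low", day, day + timedelta(hours=6)),
]
print(mttr_by_severity(sample)["High"])  # 1:00:00
```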

Filtering Options

Time range (1 day to 90 days)
Severity level
Team member
Tags or categories
Status (resolved, open, closed)

Using Analytics for Improvement

Identify Patterns

Recurring incidents that could be prevented
Procedures that consistently succeed
Team members with high success rates
Time periods with high incident volume

Process Improvements

Create procedures for recurring issues
Update documentation based on successful resolutions
Adjust on-call rotations based on incident patterns
Implement preventive measures for common problems

Training Opportunities

Share successful resolution techniques
Identify knowledge gaps
Cross-train team members on high-frequency issues
Document lessons learned

Best Practices

Creating Effective Incidents

Clear Titles

✅ Good: “PostgreSQL connection timeout in production API”
❌ Bad: “Database problem”

Detailed Descriptions

Include:

What happened (symptoms, error messages)
When it started
What systems/users are affected
What you’ve tried (if anything)
Relevant logs or error messages

Proper Tagging

Use consistent tag naming
Include technology (e.g., “postgresql”, “kubernetes”)
Add environment (e.g., “production”, “staging”)
Include service name (e.g., “api”, “frontend”)
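
Consistent naming is easier to keep if tags are normalized before they are saved. The convention below (lowercase, trimmed, spaces replaced by hyphens, duplicates dropped) is one possible choice for illustration, not a rule enforced by Overwatch.

```python
from typing import Iterable, List


def normalize_tags(tags: Iterable[str]) -> List[str]:
    """Lowercase, trim, hyphenate spaces, and drop duplicates, keeping order."""
    result: List[str] = []
    for tag in tags:
        cleaned = "-".join(tag.strip().lower().split())
        if cleaned and cleaned not in result:
            result.append(cleaned)
    return result


print(normalize_tags(["PostgreSQL", " production ", "postgresql", "API Gateway"]))
# ['postgresql', 'production', 'api-gateway']
```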

Collaboration Guidelines

Regular Updates

Post updates every 30-60 minutes for critical incidents
Update status when investigation milestones reached
Notify stakeholders of progress
Document dead ends to save others’ time

Effective Communication

Use @mentions for urgent items requiring specific attention
Keep comments professional and factual
Document decisions and rationale clearly
Link to relevant resources and context
Summarize findings before closing

Knowledge Capture

Document the root cause clearly
Describe the solution in detail
Note any preventive measures taken
Link to related incidents or documentation
Add tags to improve future searchability

Incident Closure Checklist

Before closing an incident, ensure:

Root cause identified and documented
Solution implemented and verified
No recurrence in monitoring period (varies by severity)
Documentation updated (if applicable)
Preventive measures implemented (if applicable)
Post-incident review completed (for critical/high severity)
Related procedures created or updated
All stakeholders notified of resolution

Incident Templates

Create templates for common incident types to speed up creation:

Using Templates

Incidents → New Incident → Select Template

Template Examples

Database connection issues
API timeout errors
Memory leak investigations
Deployment rollback scenarios
Security incident response

Creating Templates (Manager/Admin role)

Settings → Incident Templates → Create New Template

Define:

Template name and description
Pre-filled title pattern
Default severity and tags
Standard investigation steps
Relevant procedures to suggest
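
Conceptually, a template is a set of defaults that is expanded into a pre-filled form. The sketch below shows the idea with a hypothetical IncidentTemplate class; the `{placeholder}` syntax in the title pattern is an assumption and may not match how Overwatch templates express patterns.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IncidentTemplate:
    """Hypothetical template holding the fields listed above."""
    name: str
    title_pattern: str  # e.g. "{service} API timeout in {environment}"
    default_severity: str
    default_tags: List[str] = field(default_factory=list)
    investigation_steps: List[str] = field(default_factory=list)

    def prefill(self, **values: str) -> Dict[str, object]:
        """Build pre-filled form values; raises KeyError if a placeholder is missing."""
        return {
            "title": self.title_pattern.format(**values),
            "severity": self.default_severity,
            "tags": list(self.default_tags),
            "description": "\n".join(f"- {step}" for step in self.investigation_steps),
        }


template = IncidentTemplate(
    name="API timeout errors",
    title_pattern="{service} API timeout in {environment}",
    default_severity="High",
    default_tags=["api", "timeout"],
    investigation_steps=["Check upstream latency", "Review recent deployments"],
)
form = template.prefill(service="Billing", environment="production")
print(form["title"])  # Billing API timeout in production
```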

Troubleshooting

Common Issues

Can’t Create Incidents

Verify you have Engineer role or higher
Check organization quota hasn’t been exceeded
Ensure required fields are completed
Contact administrator if issue persists

AI Suggestions Not Appearing

Wait a few seconds for search to complete
Check incident description is detailed enough
Verify Weaviate connection in system status
Try rephrasing incident description

Attachments Won’t Upload

Check file size (max 25 MB per file)
Verify file type is supported
Check network connection
Try different browser if issue persists

Chrome Extension Not Detecting Alerts

Verify extension is installed and enabled
Check you’re on a supported monitoring platform
Ensure you’re viewing an alert or incident page
Refresh the page and try again

Next Steps

AI Chat Guide — Conversational incident diagnosis
Helper CLI — Local command execution for investigation
Procedure Management — Execute runbooks for resolution
Search Features — Semantic search for faster solutions
Analytics Dashboard — Track team performance metrics
Chrome Extension — Browser integration guide

Need Help?

Troubleshooting: See Common Issues
Support: Contact support@overwatch-observability.com