Incident Management

Incidents in Overwatch represent problems or issues that require investigation and resolution. The platform provides intelligent assistance throughout the incident lifecycle, from creation through resolution and documentation.

There are multiple ways to create incidents:

From Dashboard

Dashboard → "Create Incident" button

From Incidents Page

Incidents → "New Incident" button

From Chrome Extension

Monitoring Platform → Detect Alert → "Report to Overwatch" button

When creating an incident, provide the following information:

Required Fields

  • Title: Clear, descriptive incident title (e.g., “Database connection timeout in production”)
  • Description: Detailed problem description including error messages and impact
  • Severity: Priority level for triage and response

Optional Fields

  • Assignee: Team member responsible for resolution
  • Tags: Keywords for organization and searchability (e.g., “database”, “production”, “timeout”)
  • Related Links: Links to monitoring dashboards, logs, or external resources
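The required and optional fields above can be modelled as a small record with a validation step. This is a minimal sketch only: the `IncidentDraft` class, its field names, and the `validate` helper are hypothetical illustrations, not part of any documented Overwatch API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Severity names taken from the severity table in this guide.
SEVERITIES = ("critical", "high", "medium", "low")


@dataclass
class IncidentDraft:
    """Hypothetical in-memory shape of the incident creation form."""
    # Required fields
    title: str
    description: str
    severity: str
    # Optional fields
    assignee: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    related_links: List[str] = field(default_factory=list)


def validate(draft: IncidentDraft) -> List[str]:
    """Return a list of problems; an empty list means the draft is complete."""
    problems = []
    if not draft.title.strip():
        problems.append("title is required")
    if not draft.description.strip():
        problems.append("description is required")
    if draft.severity.lower() not in SEVERITIES:
        problems.append("severity must be one of " + ", ".join(SEVERITIES))
    return problems
```

Returning a list of problems instead of raising on the first one lets a form show every missing field at once.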

Choose the appropriate severity to ensure proper prioritization:

Severity | Description | Response Time
Critical | Complete service outage affecting all users | Immediate (< 15 min)
High | Major functionality impaired, significant user impact | Urgent (< 1 hour)
Medium | Partial functionality affected, workarounds available | Normal (< 4 hours)
Low | Minor issues, minimal impact on users | Scheduled (< 24 hours)
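To make the response-time targets concrete, the sketch below turns each severity into a deadline. The durations come from the table above; the function names and the choice to measure from the creation time are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Response-time targets from the severity table.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=24),
}


def response_deadline(severity: str, created_at: datetime) -> datetime:
    """Latest time by which a first response is expected (assumed to count from creation)."""
    return created_at + RESPONSE_TARGETS[severity.lower()]


def is_overdue(severity: str, created_at: datetime, now: datetime) -> bool:
    """True once the target window has fully elapsed without being met."""
    return now >= response_deadline(severity, created_at)
```

For example, a critical incident created at 12:00 has a deadline of 12:15 under these assumptions.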

The Chrome extension speeds up incident creation by capturing alert context directly from your monitoring platform:

Automatic Context Extraction

  • Detects alerts in monitoring platforms (Datadog, New Relic, Grafana, etc.)
  • Extracts error messages and stack traces automatically
  • Captures relevant metrics and time ranges
  • Links to monitoring platform dashboards
  • Includes screenshots of alerts (optional)

Supported Platforms

  • Datadog - Monitor alerts and APM traces
  • New Relic - Violations and anomalies
  • Grafana - Dashboard alerts and annotations
  • PagerDuty - Incident webhooks
  • Prometheus - Alertmanager notifications
  • Elasticsearch - Watcher alerts
  • SigNoz - OpenTelemetry alerts
  • Sentry - Error tracking events

How It Works

  1. The extension detects an alert on a monitoring platform page.
  2. It extracts the relevant context (error messages, metrics, timestamps).
  3. It pre-populates the incident creation form with the extracted data.
  4. You review and adjust the details before creating the incident.
  5. The incident is linked to the original monitoring platform resource.

Incidents progress through a standard workflow:

New → In Progress → Resolved → Closed

1. New

  • Incident created, awaiting assignment
  • Triage phase to determine severity and priority
  • Initial investigation to gather context

2. In Progress

  • Assigned to team member or actively being investigated
  • Active resolution efforts underway
  • Regular updates and communication happening
  • Procedures may be executed during this phase

3. Resolved

  • Solution has been implemented
  • Awaiting verification that issue is fixed
  • Monitoring for recurrence before closing

4. Closed

  • Issue fully resolved and verified
  • Documentation complete
  • Post-incident review completed (for critical/high severity)
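The lifecycle can be read as a small state machine. The sketch below encodes the forward path New → In Progress → Resolved → Closed. The two backward transitions (failed verification and reopening) are assumptions: this guide mentions reopening but does not say which status a reopened incident returns to. The status identifiers are also hypothetical.

```python
# Forward transitions follow the documented workflow.
ALLOWED_TRANSITIONS = {
    "new": {"in_progress"},
    "in_progress": {"resolved"},
    # Assumption: a fix that fails verification goes back to active work.
    "resolved": {"closed", "in_progress"},
    # Assumption: reopening a closed incident returns it to active work.
    "closed": {"in_progress"},
}


def advance(current: str, target: str) -> str:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move incident from {current!r} to {target!r}")
    return target
```

Keeping the transitions in one table makes it easy to see at a glance that, for instance, an incident cannot jump from New straight to Closed.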

Manual Assignment

Incident Detail Page → Assignee Dropdown → Select Team Member

Auto-Assignment Rules (Admin Configuration)

  • By severity level
  • By tags or categories
  • By team member availability
  • Round-robin distribution
  • Based on expertise and past resolution success
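As an illustration of how two of these rules could combine, here is a minimal sketch of round-robin distribution that skips unavailable team members. The `RoundRobinAssigner` class is hypothetical and is not a description of how Overwatch implements auto-assignment.

```python
from typing import Iterable, List, Optional, Set


class RoundRobinAssigner:
    """Hands out incidents in rotation, skipping members who are unavailable."""

    def __init__(self, members: Iterable[str]) -> None:
        self._members: List[str] = list(members)
        self._next = 0  # index of the member whose turn is next

    def assign(self, available: Set[str]) -> Optional[str]:
        """Return the next available member in rotation, or None if nobody is free."""
        count = len(self._members)
        for offset in range(count):
            index = (self._next + offset) % count
            candidate = self._members[index]
            if candidate in available:
                # The rotation resumes after whoever was just picked.
                self._next = (index + 1) % count
                return candidate
        return None
```

Resuming the rotation after the member who was actually picked means nobody repeatedly absorbs the turns of an unavailable neighbour.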

Reassignment

  • Any team member can reassign incidents
  • Assignee receives notification
  • Assignment history tracked in timeline
  • Original assignee can be notified of reassignment

Overwatch provides intelligent assistance throughout the incident resolution process:

When viewing an incident, the platform automatically:

Searches for Similar Incidents

  • Analyzes incident description using semantic understanding
  • Finds conceptually related incidents (not just keyword matches)
  • Ranks by relevance and resolution success
  • Shows resolution time and success rates

Suggests Relevant Procedures

  • Recommends procedures based on incident details
  • Highlights procedures with high success rates for similar issues
  • Shows procedure execution history
  • Indicates estimated resolution time

Displays Historical Context

  • Previous incidents with similar symptoms
  • Successful resolution patterns
  • Common root causes
  • Prevention measures that worked

Results include confidence scores to help you evaluate suggestions:

Score Range | Meaning | Action
0.8 - 1.0 | Very high confidence match | Strongly recommended
0.7 - 0.79 | Good match | Recommended
0.6 - 0.69 | Moderate match | Consider carefully
< 0.6 | Low confidence | Use caution
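The score bands translate directly into a lookup. In this sketch the lower bound of each band is treated as inclusive, so a score such as 0.795 falls into "Good match"; that boundary handling is an assumption, since the table lists ranges to two decimal places only.

```python
from typing import Tuple


def confidence_band(score: float) -> Tuple[str, str]:
    """Map a confidence score in [0, 1] to its (meaning, action) pair."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.8:
        return ("Very high confidence match", "Strongly recommended")
    if score >= 0.7:
        return ("Good match", "Recommended")
    if score >= 0.6:
        return ("Moderate match", "Consider carefully")
    return ("Low confidence", "Use caution")
```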

Layer 2: Public Knowledge Base (Current)

  • Community-contributed solutions from Weaviate
  • Best practices and common resolutions
  • Confidence: 0.7-0.9

Layer 3: LLM-Generated Solutions (Available)

  • AI-generated guidance via AWS Bedrock
  • Used when no existing solutions found
  • Confidence: 0.6-0.8
  • Cost-optimized with semantic caching

Layer 1: Customer-Specific (Phase 2)

  • Your organization’s historical solutions
  • Highest confidence (0.8-1.0)
  • Learns from your team’s successful resolutions
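One way to picture the layers is as an ordered fallback: consult each knowledge source in turn and stop at the first one that returns anything. This is a sketch under that assumption. The guide only states that LLM-generated solutions are used when no existing solutions are found, and the search callables here are hypothetical stand-ins for the real backends.

```python
from typing import Callable, List, Optional, Tuple

# A layer is a (name, search function) pair; the function returns matching solutions.
Layer = Tuple[str, Callable[[str], List[str]]]


def first_layer_with_results(query: str, layers: List[Layer]) -> Tuple[Optional[str], List[str]]:
    """Try each layer in order and return the first non-empty result set."""
    for name, search in layers:
        results = search(query)
        if results:
            return name, results
    return None, []
```

A caller would pass the layers in priority order, for example customer-specific first, then the public knowledge base, then the LLM-backed generator.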

The incident detail page provides comprehensive information and actions:

Information Sections

  • Header: Title, severity, status, assignee, timestamps
  • Description: Full incident details with formatting
  • Timeline: Complete audit trail of all activities
  • Comments: Threaded discussions and @mentions
  • Attachments: Logs, screenshots, diagnostic files
  • Related Items: Linked procedures, similar incidents
  • AI Suggestions: Recommended solutions and procedures

Available Actions

  • Update incident details
  • Change status or severity
  • Assign/reassign to team members
  • Add comments and @mention colleagues
  • Upload attachments
  • Execute suggested procedures
  • Link to related incidents
  • Close or reopen incident

Adding Comments

Incident Detail → Comments Section → Type Comment → Post

Comment Features

  • Markdown Support: Format comments with markdown syntax
  • @Mentions: Tag team members for notifications (@username)
  • Threading: Reply to specific comments for organized discussions
  • Timestamps: All comments include author and timestamp
  • Edit History: Track comment modifications
  • File Attachments: Attach files directly to comments

@Mention Notifications

When you mention someone:

  1. They receive an in-app notification
  2. Email notification sent (if configured)
  3. Slack/Teams message (if integration enabled)
  4. Highlighted in their activity feed

Best Practices for Comments

  • Use comments for status updates and findings
  • @mention team members for urgent items
  • Document decisions and rationale
  • Link to relevant resources
  • Keep comments professional and constructive

Supported File Types

  • Logs (.log, .txt)
  • Screenshots (.png, .jpg, .gif)
  • Configuration files (.yaml, .json, .conf)
  • Diagnostic reports (.pdf, .csv)
  • Code snippets (any text file)

File Size Limits

  • Individual files: Up to 25 MB
  • Total per incident: Up to 100 MB
  • Large files automatically compressed
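A pre-upload check against these limits might look like the sketch below. Several details are assumptions: 1 MB is taken as 1024 × 1024 bytes, sizes are checked before any automatic compression, and the extension set only covers the types listed explicitly above (other text files are described as accepted too, so an unknown extension is reported as a warning-style message, not a hard rule).

```python
from pathlib import PurePath
from typing import List

MB = 1024 * 1024  # assumption: binary megabytes
MAX_FILE_BYTES = 25 * MB
MAX_INCIDENT_BYTES = 100 * MB
LISTED_EXTENSIONS = {
    ".log", ".txt", ".png", ".jpg", ".gif",
    ".yaml", ".json", ".conf", ".pdf", ".csv",
}


def check_upload(filename: str, size_bytes: int, existing_total_bytes: int) -> List[str]:
    """Return the reasons an upload may be rejected; empty means it looks fine."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 25 MB")
    if existing_total_bytes + size_bytes > MAX_INCIDENT_BYTES:
        problems.append("incident would exceed 100 MB total")
    if PurePath(filename).suffix.lower() not in LISTED_EXTENSIONS:
        problems.append("file type not in the documented list")
    return problems
```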

Attachment Management

Incident Detail → Attachments Tab → Upload Files

  • Drag and drop files to upload
  • Preview images inline
  • Download individual or all files
  • Delete attachments (with permission)

Track incident metrics to improve your team’s performance:

Resolution Statistics

  • Time to first response
  • Time to resolution (the per-incident figure behind MTTR)
  • Number of updates/comments
  • Team members involved
  • Procedures executed
  • Similar incidents found

Success Indicators

  • Was resolution successful?
  • Did suggested procedures help?
  • Was incident reopened?
  • Customer satisfaction (if applicable)

Access team-wide metrics:

Dashboard → Analytics → Incidents

Key Metrics

  • MTTR by Severity: Average resolution time broken down by severity level
  • Incident Volume: Trends over time with severity distribution
  • Team Performance: Individual and team resolution statistics
  • Common Issues: Most frequent incident types and tags
  • Resolution Patterns: Successful resolution methods and procedures
  • Escalation Trends: Incidents requiring escalation or reassignment
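For readers who want to reproduce the "MTTR by Severity" figure from exported data, the sketch below averages resolution time per severity. The record layout (dictionaries with `severity`, `created_at`, and `resolved_at` keys) is a hypothetical export format, and measuring from creation to resolution is an assumption about how the metric is defined.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


def mttr_by_severity(incidents: Iterable[dict]) -> Dict[str, timedelta]:
    """Average (resolved_at - created_at) per severity, ignoring unresolved incidents."""
    durations: Dict[str, List[timedelta]] = defaultdict(list)
    for incident in incidents:
        resolved_at = incident.get("resolved_at")
        if resolved_at is None:
            continue  # still open, so it has no resolution time yet
        durations[incident["severity"]].append(resolved_at - incident["created_at"])
    return {
        severity: sum(values, timedelta()) / len(values)
        for severity, values in durations.items()
    }
```

Unresolved incidents are skipped so that open work does not distort the average.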

Filtering Options

  • Time range (1 day to 90 days)
  • Severity level
  • Team member
  • Tags or categories
  • Status (resolved, open, closed)

Identify Patterns

  • Recurring incidents that could be prevented
  • Procedures that consistently succeed
  • Team members with high success rates
  • Time periods with high incident volume

Process Improvements

  • Create procedures for recurring issues
  • Update documentation based on successful resolutions
  • Adjust on-call rotations based on incident patterns
  • Implement preventive measures for common problems

Training Opportunities

  • Share successful resolution techniques
  • Identify knowledge gaps
  • Cross-train team members on high-frequency issues
  • Document lessons learned

Clear Titles

  • ✅ Good: “PostgreSQL connection timeout in production API”
  • ❌ Bad: “Database problem”

Detailed Descriptions

Include:

  • What happened (symptoms, error messages)
  • When it started
  • What systems/users are affected
  • What you’ve tried (if anything)
  • Relevant logs or error messages

Proper Tagging

  • Use consistent tag naming
  • Include technology (e.g., “postgresql”, “kubernetes”)
  • Add environment (e.g., “production”, “staging”)
  • Include service name (e.g., “api”, “frontend”)
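Consistent tag naming is easier to enforce with a small normalization step. The convention sketched here (lowercase, hyphens instead of spaces or underscores, duplicates removed) is one possible choice for illustration, not a rule the platform is known to apply.

```python
import re
from typing import Iterable, List


def normalize_tags(tags: Iterable[str]) -> List[str]:
    """Lowercase, hyphenate, and de-duplicate tags while preserving their order."""
    normalized: List[str] = []
    for tag in tags:
        cleaned = re.sub(r"[\s_]+", "-", tag.strip().lower())
        if cleaned and cleaned not in normalized:
            normalized.append(cleaned)
    return normalized
```

With this convention, "PostgreSQL" and "postgresql" collapse into a single tag, which keeps tag-based search and analytics from splitting across spellings.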

Regular Updates

  • Post updates every 30-60 minutes for critical incidents
  • Update status when investigation milestones reached
  • Notify stakeholders of progress
  • Document dead ends to save others’ time

Effective Communication

  • Use @mentions for urgent items requiring specific attention
  • Keep comments professional and factual
  • Document decisions and rationale clearly
  • Link to relevant resources and context
  • Summarize findings before closing

Knowledge Capture

  • Document the root cause clearly
  • Describe the solution in detail
  • Note any preventive measures taken
  • Link to related incidents or documentation
  • Add tags to improve future searchability

Before closing an incident, ensure:

  • Root cause identified and documented
  • Solution implemented and verified
  • No recurrence in monitoring period (varies by severity)
  • Documentation updated (if applicable)
  • Preventive measures implemented (if applicable)
  • Post-incident review completed (for critical/high severity)
  • Related procedures created or updated
  • All stakeholders notified of resolution
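The checklist above can be expressed as a simple gate. In this sketch the item keys are hypothetical names for the checklist entries, the two "if applicable" items are left out because their applicability is a judgment call, and the post-incident review is required only for critical and high severity, as stated above.

```python
from typing import List, Set

# Hypothetical keys for the checklist items that always apply.
ALWAYS_REQUIRED = [
    "root_cause_documented",
    "solution_verified",
    "no_recurrence_in_monitoring_period",
    "procedures_created_or_updated",
    "stakeholders_notified",
]
REVIEW_SEVERITIES = {"critical", "high"}


def unmet_closure_items(severity: str, completed: Set[str]) -> List[str]:
    """Return the checklist items still outstanding before the incident can close."""
    required = list(ALWAYS_REQUIRED)
    if severity.lower() in REVIEW_SEVERITIES:
        required.append("post_incident_review_completed")
    return [item for item in required if item not in completed]
```

An empty result means every applicable item has been ticked off.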

Create templates for common incident types to speed up creation:

Using Templates

Incidents → New Incident → Select Template

Template Examples

  • Database connection issues
  • API timeout errors
  • Memory leak investigations
  • Deployment rollback scenarios
  • Security incident response

Creating Templates (Manager/Admin role)

Settings → Incident Templates → Create New Template

Define:

  • Template name and description
  • Pre-filled title pattern
  • Default severity and tags
  • Standard investigation steps
  • Relevant procedures to suggest
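The elements of a template listed above map naturally onto a small record. The `IncidentTemplate` class below is a hypothetical illustration, and the use of `$placeholder` syntax for the title pattern is an assumption; this guide does not specify the pattern format.

```python
from dataclasses import dataclass, field
from string import Template
from typing import List


@dataclass
class IncidentTemplate:
    """Hypothetical representation of an incident template."""
    name: str
    description: str
    title_pattern: str  # assumption: uses $placeholders, e.g. "$service"
    default_severity: str
    default_tags: List[str] = field(default_factory=list)
    investigation_steps: List[str] = field(default_factory=list)
    suggested_procedures: List[str] = field(default_factory=list)

    def render_title(self, **values: str) -> str:
        """Fill the title pattern; raises KeyError if a placeholder is missing."""
        return Template(self.title_pattern).substitute(**values)
```

For instance, a pattern of "Database connection issue: $service ($environment)" rendered with `service="orders-db"` and `environment="production"` produces a title that already follows the clear-title guidance.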

Can’t Create Incidents

  • Verify you have Engineer role or higher
  • Check organization quota hasn’t been exceeded
  • Ensure required fields are completed
  • Contact administrator if issue persists

AI Suggestions Not Appearing

  • Wait a few seconds for search to complete
  • Check incident description is detailed enough
  • Verify Weaviate connection in system status
  • Try rephrasing incident description

Attachments Won’t Upload

  • Check file size (max 25 MB per file)
  • Verify file type is supported
  • Check network connection
  • Try different browser if issue persists

Chrome Extension Not Detecting Alerts

  • Verify extension is installed and enabled
  • Check you’re on a supported monitoring platform
  • Ensure you’re viewing an alert or incident page
  • Refresh the page and try again

  • In-App Help: Press ? key for keyboard shortcuts and help
  • Troubleshooting: See Common Issues
  • Support: Contact your system administrator

Last updated: October 2025