Incident Management
Incidents in Overwatch represent problems or issues that require investigation and resolution. The platform provides intelligent assistance throughout the incident lifecycle, from creation through resolution and documentation.
Creating an Incident
Quick Create
There are multiple ways to create incidents:
From Dashboard
Dashboard → "Create Incident" button
From Incidents Page
Incidents → "New Incident" button
From Chrome Extension
Monitoring Platform → Detect Alert → "Report to Overwatch" button
Incident Details
When creating an incident, provide the following information:
Required Fields
- Title: Clear, descriptive incident title (e.g., “Database connection timeout in production”)
- Description: Detailed problem description including error messages and impact
- Severity: Priority level for triage and response
Optional Fields
- Assignee: Team member responsible for resolution
- Tags: Keywords for organization and searchability (e.g., “database”, “production”, “timeout”)
- Related Links: Links to monitoring dashboards, logs, or external resources
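As an illustration of the required and optional fields above, here is a minimal sketch of a draft incident with a completeness check. The `IncidentDraft` class and its `missing_required` method are hypothetical names chosen for this example and are not part of any documented Overwatch API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Severity names taken from the Severity Levels table in this guide.
SEVERITIES = ("critical", "high", "medium", "low")


@dataclass
class IncidentDraft:
    """Hypothetical container mirroring the incident form fields."""
    title: str
    description: str
    severity: str
    assignee: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    related_links: List[str] = field(default_factory=list)

    def missing_required(self) -> List[str]:
        """Return the names of required fields that are empty or invalid."""
        problems = []
        if not self.title.strip():
            problems.append("title")
        if not self.description.strip():
            problems.append("description")
        if self.severity.lower() not in SEVERITIES:
            problems.append("severity")
        return problems


draft = IncidentDraft(
    title="Database connection timeout in production",
    description="Connections to the database time out; error messages attached.",
    severity="High",
    tags=["database", "production", "timeout"],
)
print(draft.missing_required())  # prints []
```

A draft with an empty title or an unknown severity would report those field names, which is roughly what a form validator would surface before the incident is created.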
Severity Levels
Choose the appropriate severity to ensure proper prioritization:
| Severity | Description | Response Time |
|---|---|---|
| Critical | Complete service outage affecting all users | Immediate (< 15 min) |
| High | Major functionality impaired, significant user impact | Urgent (< 1 hour) |
| Medium | Partial functionality affected, workarounds available | Normal (< 4 hours) |
| Low | Minor issues, minimal impact on users | Scheduled (< 24 hours) |
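To make the response-time targets concrete, the sketch below computes the latest acceptable first-response time for an incident. The targets are copied from the table above; the function name `response_deadline` is an assumption made for this example, not something the platform exposes.

```python
from datetime import datetime, timedelta

# Response-time targets copied from the severity table above.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=24),
}


def response_deadline(severity: str, created_at: datetime) -> datetime:
    """Return the latest time a first response is still within target."""
    return created_at + RESPONSE_TARGETS[severity.lower()]


created = datetime(2025, 10, 1, 9, 0)
print(response_deadline("High", created))  # 2025-10-01 10:00:00
```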
Auto-Population from Chrome Extension
The Chrome extension significantly accelerates incident creation:
Automatic Context Extraction
- Detects alerts in monitoring platforms (Datadog, New Relic, Grafana, etc.)
- Extracts error messages and stack traces automatically
- Captures relevant metrics and time ranges
- Links to monitoring platform dashboards
- Includes screenshots of alerts (optional)
Supported Platforms
- Datadog - Monitor alerts and APM traces
- New Relic - Violations and anomalies
- Grafana - Dashboard alerts and annotations
- PagerDuty - Incident webhooks
- Prometheus - Alertmanager notifications
- Elasticsearch - Watcher alerts
- SigNoz - OpenTelemetry alerts
- Sentry - Error tracking events
How It Works
1. Extension detects alert on monitoring platform page
2. Extracts relevant context (error messages, metrics, timestamps)
3. Pre-populates incident creation form with extracted data
4. You review and adjust details before creating incident
5. Incident is linked to original monitoring platform resource
Incident Workflow
Standard Lifecycle
Incidents progress through a standard workflow:
New → In Progress → Resolved → Closed
1. New
- Incident created, awaiting assignment
- Triage phase to determine severity and priority
- Initial investigation to gather context
2. In Progress
- Assigned to team member or actively being investigated
- Active resolution efforts underway
- Regular updates and communication happening
- Procedures may be executed during this phase
3. Resolved
- Solution has been implemented
- Awaiting verification that issue is fixed
- Monitoring for recurrence before closing
4. Closed
- Issue fully resolved and verified
- Documentation complete
- Post-incident review completed (for critical/high severity)
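The lifecycle above can be read as a small state machine. The following sketch encodes it that way; the forward transitions follow the documented workflow, while the two "reopen" edges back to In Progress are an assumption based on the reopen action mentioned elsewhere in this guide. All names here are illustrative.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    IN_PROGRESS = "In Progress"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Forward edges follow New -> In Progress -> Resolved -> Closed.
# The edges back to IN_PROGRESS model reopening and are an assumption.
ALLOWED = {
    Status.NEW: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED, Status.IN_PROGRESS},
    Status.CLOSED: {Status.IN_PROGRESS},
}


def can_transition(current: Status, target: Status) -> bool:
    """Return True if moving from current to target is allowed in this sketch."""
    return target in ALLOWED[current]


print(can_transition(Status.NEW, Status.IN_PROGRESS))  # True
print(can_transition(Status.NEW, Status.CLOSED))       # False
```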
Incident Assignment
Manual Assignment
Incident Detail Page → Assignee Dropdown → Select Team Member
Auto-Assignment Rules (Admin Configuration)
- By severity level
- By tags or categories
- By team member availability
- Round-robin distribution
- Based on expertise and past resolution success
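Of the rules listed above, round-robin distribution is the simplest to illustrate. The sketch below shows the general technique only; how Overwatch implements it is not documented here, and the function name, incident IDs, and team member names are made up for the example.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def round_robin_assign(incident_ids: Iterable[str], team: List[str]) -> Dict[str, str]:
    """Assign each incident to the next team member in a repeating rotation."""
    if not team:
        raise ValueError("team must contain at least one member")
    rotation = cycle(team)
    return {incident_id: next(rotation) for incident_id in incident_ids}


assignments = round_robin_assign(["INC-1", "INC-2", "INC-3"], ["alice", "bob"])
print(assignments)  # {'INC-1': 'alice', 'INC-2': 'bob', 'INC-3': 'alice'}
```

A real rule engine would likely combine this with the other criteria, for example skipping members who are unavailable before taking the next name in the rotation.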
Reassignment
- Any team member can reassign incidents
- Assignee receives notification
- Assignment history tracked in timeline
- Original assignee can be notified of reassignment
AI-Powered Resolution
Overwatch provides intelligent assistance throughout the incident resolution process:
Semantic Search Integration
When viewing an incident, the platform automatically:
Searches for Similar Incidents
- Analyzes incident description using semantic understanding
- Finds conceptually related incidents (not just keyword matches)
- Ranks by relevance and resolution success
- Shows resolution time and success rates
Suggests Relevant Procedures
- Recommends procedures based on incident details
- Highlights procedures with high success rates for similar issues
- Shows procedure execution history
- Indicates estimated resolution time
Displays Historical Context
- Previous incidents with similar symptoms
- Successful resolution patterns
- Common root causes
- Prevention measures that worked
Search Confidence Scores
Results include confidence scores to help you evaluate suggestions:
| Score Range | Meaning | Action |
|---|---|---|
| 0.8 - 1.0 | Very high confidence match | Strongly recommended |
| 0.7 - 0.79 | Good match | Recommended |
| 0.6 - 0.69 | Moderate match | Consider carefully |
| < 0.6 | Low confidence | Use caution |
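The table can be expressed as a simple threshold lookup. This is a sketch of how one might interpret a score, not the platform's own logic; it treats each range's lower bound as a threshold, so a value such as 0.795 (which falls between the listed ranges) is classified as "Recommended". The function name is hypothetical.

```python
def recommended_action(score: float) -> str:
    """Map a confidence score in [0, 1] to the action from the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= 0.8:
        return "Strongly recommended"
    if score >= 0.7:
        return "Recommended"
    if score >= 0.6:
        return "Consider carefully"
    return "Use caution"


print(recommended_action(0.85))  # Strongly recommended
print(recommended_action(0.65))  # Consider carefully
```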
AI Suggestion Types
Layer 2: Public Knowledge Base (Current)
- Community-contributed solutions from Weaviate
- Best practices and common resolutions
- Confidence: 0.7-0.9
Layer 3: LLM-Generated Solutions (Available)
- AI-generated guidance via AWS Bedrock
- Used when no existing solutions found
- Confidence: 0.6-0.8
- Cost-optimized with semantic caching
Layer 1: Customer-Specific (Phase 2)
- Your organization’s historical solutions
- Highest confidence (0.8-1.0)
- Learns from your team’s successful resolutions
Working with Incidents
Incident Detail Page
The incident detail page provides comprehensive information and actions:
Information Sections
- Header: Title, severity, status, assignee, timestamps
- Description: Full incident details with formatting
- Timeline: Complete audit trail of all activities
- Comments: Threaded discussions and @mentions
- Attachments: Logs, screenshots, diagnostic files
- Related Items: Linked procedures, similar incidents
- AI Suggestions: Recommended solutions and procedures
Available Actions
- Update incident details
- Change status or severity
- Assign/reassign to team members
- Add comments and @mention colleagues
- Upload attachments
- Execute suggested procedures
- Link to related incidents
- Close or reopen incident
Comments and Collaboration
Adding Comments
Incident Detail → Comments Section → Type Comment → Post
Comment Features
- Markdown Support: Format comments with markdown syntax
- @Mentions: Tag team members for notifications (@username)
- Threading: Reply to specific comments for organized discussions
- Timestamps: All comments include author and timestamp
- Edit History: Track comment modifications
- File Attachments: Attach files directly to comments
@Mention Notifications
When you mention someone:
- They receive an in-app notification
- Email notification sent (if configured)
- Slack/Teams message (if integration enabled)
- Highlighted in their activity feed
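For readers curious how `@username` mentions might be picked out of a comment, here is a minimal sketch using a regular expression. The allowed username characters (letters, digits, underscores, with dots or hyphens between parts) are an assumption; the actual rules Overwatch applies are not documented in this guide.

```python
import re
from typing import List

# "@" must not follow a word character or another "@", so email
# addresses such as ops@example.com are not treated as mentions.
MENTION_PATTERN = re.compile(r"(?<![\w@])@([A-Za-z0-9_]+(?:[.-][A-Za-z0-9_]+)*)")


def extract_mentions(comment: str) -> List[str]:
    """Return mentioned usernames in order of first appearance, without duplicates."""
    seen: List[str] = []
    for name in MENTION_PATTERN.findall(comment):
        if name not in seen:
            seen.append(name)
    return seen


text = "@alice can you check? cc @bob.smith, @alice. Mail ops@example.com"
print(extract_mentions(text))  # ['alice', 'bob.smith']
```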
Best Practices for Comments
- Use comments for status updates and findings
- @mention team members for urgent items
- Document decisions and rationale
- Link to relevant resources
- Keep comments professional and constructive
Attachments
Supported File Types
- Logs (.log, .txt)
- Screenshots (.png, .jpg, .gif)
- Configuration files (.yaml, .json, .conf)
- Diagnostic reports (.pdf, .csv)
- Code snippets (any text file)
File Size Limits
- Individual files: Up to 25 MB
- Total per incident: Up to 100 MB
- Large files automatically compressed
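A pre-upload check against these limits could look like the sketch below. It assumes 1 MB means 1024 × 1024 bytes and ignores automatic compression, since this guide does not say how either is handled; the function and constant names are illustrative.

```python
MB = 1024 * 1024  # assumption: binary megabytes
MAX_FILE_BYTES = 25 * MB       # per-file limit from the list above
MAX_INCIDENT_BYTES = 100 * MB  # per-incident limit from the list above


def can_attach(new_file_bytes: int, existing_bytes: int = 0) -> bool:
    """Return True if a new file fits both the per-file and per-incident limits."""
    if new_file_bytes > MAX_FILE_BYTES:
        return False
    return existing_bytes + new_file_bytes <= MAX_INCIDENT_BYTES


print(can_attach(10 * MB))                           # True
print(can_attach(30 * MB))                           # False: exceeds per-file limit
print(can_attach(20 * MB, existing_bytes=90 * MB))   # False: exceeds incident total
```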
Attachment Management
Incident Detail → Attachments Tab → Upload Files
- Drag and drop files to upload
- Preview images inline
- Download individual or all files
- Delete attachments (with permission)
Incident Analytics
Track incident metrics to improve your team’s performance:
Individual Incident Metrics
Resolution Statistics
- Time to first response
- Time to resolution (MTTR)
- Number of updates/comments
- Team members involved
- Procedures executed
- Similar incidents found
Success Indicators
- Was resolution successful?
- Did suggested procedures help?
- Was incident reopened?
- Customer satisfaction (if applicable)
Team Analytics
Access team-wide metrics:
Dashboard → Analytics → Incidents
Key Metrics
- MTTR by Severity: Average resolution time broken down by severity level
- Incident Volume: Trends over time with severity distribution
- Team Performance: Individual and team resolution statistics
- Common Issues: Most frequent incident types and tags
- Resolution Patterns: Successful resolution methods and procedures
- Escalation Trends: Incidents requiring escalation or reassignment
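To show what "MTTR by Severity" means arithmetically, the sketch below averages resolution time per severity from a list of (severity, created, resolved) records. The record shape and function name are assumptions for illustration, and the sample timestamps are invented; the dashboard computes this for you.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

IncidentRecord = Tuple[str, datetime, datetime]  # (severity, created_at, resolved_at)


def mttr_by_severity(incidents: Iterable[IncidentRecord]) -> Dict[str, timedelta]:
    """Return the mean time to resolution for each severity level."""
    totals: Dict[str, timedelta] = defaultdict(timedelta)
    counts: Dict[str, int] = defaultdict(int)
    for severity, created_at, resolved_at in incidents:
        totals[severity] += resolved_at - created_at
        counts[severity] += 1
    return {severity: totals[severity] / counts[severity] for severity in totals}


day = datetime(2025, 10, 1)
sample = [
    ("critical", day.replace(hour=9), day.replace(hour=9, minute=30)),
    ("critical", day.replace(hour=10), day.replace(hour=11, minute=30)),
    ("low", day.replace(hour=9), day.replace(hour=15)),
]
result = mttr_by_severity(sample)
print(result["critical"])  # 1:00:00
print(result["low"])       # 6:00:00
```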
Filtering Options
- Time range (1 day to 90 days)
- Severity level
- Team member
- Tags or categories
- Status (resolved, open, closed)
Using Analytics for Improvement
Identify Patterns
- Recurring incidents that could be prevented
- Procedures that consistently succeed
- Team members with high success rates
- Time periods with high incident volume
Process Improvements
- Create procedures for recurring issues
- Update documentation based on successful resolutions
- Adjust on-call rotations based on incident patterns
- Implement preventive measures for common problems
Training Opportunities
- Share successful resolution techniques
- Identify knowledge gaps
- Cross-train team members on high-frequency issues
- Document lessons learned
Best Practices
Creating Effective Incidents
Clear Titles
✅ Good: “PostgreSQL connection timeout in production API”
❌ Bad: “Database problem”
Detailed Descriptions
Include:
- What happened (symptoms, error messages)
- When it started
- What systems/users are affected
- What you’ve tried (if anything)
- Relevant logs or error messages
Proper Tagging
- Use consistent tag naming
- Include technology (e.g., “postgresql”, “kubernetes”)
- Add environment (e.g., “production”, “staging”)
- Include service name (e.g., “api”, “frontend”)
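One way to keep tag naming consistent is to normalize tags before saving them. The convention sketched here (lowercase, hyphens instead of spaces or underscores, no other punctuation) is only an example; your team may prefer a different scheme, and the function name is hypothetical.

```python
import re


def normalize_tag(tag: str) -> str:
    """Lowercase a tag, join words with hyphens, and drop other punctuation."""
    tag = tag.strip().lower()
    tag = re.sub(r"[\s_]+", "-", tag)
    return re.sub(r"[^a-z0-9-]", "", tag)


print(normalize_tag(" PostgreSQL "))  # postgresql
print(normalize_tag("Payment API"))   # payment-api
```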
Collaboration Guidelines
Regular Updates
- Post updates every 30-60 minutes for critical incidents
- Update status when investigation milestones reached
- Notify stakeholders of progress
- Document dead ends to save others’ time
Effective Communication
- Use @mentions for urgent items requiring specific attention
- Keep comments professional and factual
- Document decisions and rationale clearly
- Link to relevant resources and context
- Summarize findings before closing
Knowledge Capture
- Document the root cause clearly
- Describe the solution in detail
- Note any preventive measures taken
- Link to related incidents or documentation
- Add tags to improve future searchability
Incident Closure Checklist
Before closing an incident, ensure:
- Root cause identified and documented
- Solution implemented and verified
- No recurrence in monitoring period (varies by severity)
- Documentation updated (if applicable)
- Preventive measures implemented (if applicable)
- Post-incident review completed (for critical/high severity)
- Related procedures created or updated
- All stakeholders notified of resolution
Incident Templates
Create templates for common incident types to speed up creation:
Using Templates
Incidents → New Incident → Select Template
Template Examples
- Database connection issues
- API timeout errors
- Memory leak investigations
- Deployment rollback scenarios
- Security incident response
Creating Templates (Manager/Admin role)
Settings → Incident Templates → Create New Template
Define:
- Template name and description
- Pre-filled title pattern
- Default severity and tags
- Standard investigation steps
- Relevant procedures to suggest
Troubleshooting
Common Issues
Can’t Create Incidents
- Verify you have Engineer role or higher
- Check organization quota hasn’t been exceeded
- Ensure required fields are completed
- Contact administrator if issue persists
AI Suggestions Not Appearing
- Wait a few seconds for search to complete
- Check incident description is detailed enough
- Verify Weaviate connection in system status
- Try rephrasing incident description
Attachments Won’t Upload
- Check file size (max 25 MB per file)
- Verify file type is supported
- Check network connection
- Try different browser if issue persists
Chrome Extension Not Detecting Alerts
- Verify extension is installed and enabled
- Check you’re on a supported monitoring platform
- Ensure you’re viewing an alert or incident page
- Refresh the page and try again
Next Steps
- Procedure Management - Execute runbooks for resolution
- Search Features - Master semantic search for faster solutions
- Analytics Dashboard - Track team performance metrics
- Chrome Extension - Install browser integration
Need Help?
- In-App Help: Press the ? key for keyboard shortcuts and help
- Troubleshooting: See Common Issues
- Support: Contact your system administrator
Last updated: October 2025