Key Concepts
This guide introduces the core concepts and terminology you’ll encounter when using Overwatch. Understanding these concepts will help you work more effectively with the platform.
Incidents
Incidents are records of problems or issues that require investigation and resolution.
Incident Lifecycle
Incidents follow a standard workflow:
- New → Incident created, awaiting assignment
- In Progress → Active investigation and resolution underway
- Resolved → Solution implemented, awaiting verification
- Closed → Incident fully resolved and documented
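The lifecycle above can be read as a small state machine. The following is a minimal sketch, assuming transitions only move forward in the order listed; the names `Status`, `ALLOWED`, and `advance` are illustrative and not part of any Overwatch API.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    IN_PROGRESS = "In Progress"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Forward-only transitions, in the order the lifecycle lists them.
# Whether Overwatch allows reopening (for example Resolved -> In Progress)
# is not stated in this guide, so it is left out of the sketch.
ALLOWED = {
    Status.NEW: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

For example, `advance(Status.NEW, Status.IN_PROGRESS)` succeeds, while `advance(Status.NEW, Status.CLOSED)` raises `ValueError` because it skips two stages.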
Incident Properties
- Title: Clear, descriptive incident title for quick identification
- Description: Detailed problem description including error messages and context
- Severity: Priority level (Critical, High, Medium, Low)
- Status: Current state in the incident lifecycle
- Assignee: Team member responsible for resolution
- Tags: Keywords for organization and searchability
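As a rough illustration, the properties above map naturally onto a small record type. This is a sketch only: the field names and the `Incident` class are assumptions made for the example, not the platform's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional

# Severity levels as listed in this guide.
SEVERITIES = ("Critical", "High", "Medium", "Low")


@dataclass
class Incident:
    title: str
    description: str
    severity: str
    status: str = "New"             # first stage of the incident lifecycle
    assignee: Optional[str] = None  # unassigned until someone picks it up
    tags: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject severities outside the four documented levels.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
```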
Context Integration
The Chrome extension can automatically populate incident details by:
- Detecting alerts in monitoring platforms (Datadog, New Relic, Grafana, etc.)
- Extracting error messages and metrics from dashboards
- Linking to relevant monitoring platform resources
- Capturing screenshots and diagnostic information
Procedures
Procedures are executable runbooks that provide step-by-step guidance for operational tasks.
Procedure Components
- Steps: Ordered sequence of actions to complete
- Approvals: Required confirmations for sensitive operations
- Variables: Dynamic values that can be substituted during execution
- Estimated Duration: Expected time to complete each step
- Execution History: Complete record of all procedure executions
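To make the idea of variables concrete, here is a minimal sketch of substituting values into step text at execution time. The `${name}` placeholder syntax is an assumption borrowed from Python's `string.Template`; this guide does not specify how Overwatch writes variables, and `render_steps` is a hypothetical helper.

```python
from string import Template

# Example steps with placeholders (hypothetical syntax and content).
steps = [
    "Drain traffic from ${host}",
    "Restart ${service} on ${host}",
]


def render_steps(steps: list[str], variables: dict[str, str]) -> list[str]:
    """Fill every placeholder; a missing variable raises KeyError."""
    return [Template(step).substitute(variables) for step in steps]


rendered = render_steps(steps, {"host": "web-01", "service": "nginx"})
```

Using `substitute` rather than `safe_substitute` means an execution fails early when a variable is missing, instead of running a step that still contains a raw placeholder.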
Procedure Types
- Manual Procedures: Human-executed steps with verification checkpoints
- Approval-Required: Procedures that need manager/admin approval to execute
- Template Procedures: Standardized procedures for common scenarios
- Custom Procedures: Organization-specific operational runbooks
Execution Tracking
When executing a procedure, the platform provides:
- Real-time progress monitoring via WebSocket updates
- Step-by-step guidance with completion tracking
- Note-taking capability for observations and deviations
- Success/failure outcome recording
- Automatic linkage to related incidents
Search & AI Features
Semantic Search
Overwatch uses vector search technology to understand the meaning of your queries:
- Natural Language: Search using plain language descriptions
- Conceptual Matching: Finds related content even without keyword matches
- Context-Aware: Results ranked by relevance, recency, and success rates
- Cross-Content: Searches across incidents, procedures, comments, and history
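The core idea behind vector search can be shown with a toy example: texts are represented as vectors, and results are ranked by how closely their vectors point in the same direction as the query vector. The sketch below uses hand-written three-dimensional vectors and cosine similarity purely for illustration; a real system derives vectors from an embedding model, and Overwatch's actual ranking also weighs recency and success rates.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float], docs: dict[str, list[float]]) -> list[str]:
    """Return document names ordered from most to least similar."""
    return sorted(docs, key=lambda name: cosine(query_vec, docs[name]), reverse=True)


# Toy vectors, invented for this example.
docs = {
    "database connection timeout": [0.9, 1.0, 0.1],
    "TLS certificate expired": [0.0, 0.1, 1.0],
}
ranking = rank([1.0, 0.9, 0.0], docs)
```

Here the query vector lies close to the first document's vector, so that document ranks first even though no keywords are compared.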
3-Layer Search Architecture
The platform uses a progressive search strategy:
Layer 1: Customer-Specific (Phase 2)
- Confidence: 0.8-1.0
- Source: Your organization’s historical incident resolutions
- Use Case: Problems your team has solved before
Layer 2: Public Knowledge Base (Current)
- Confidence: 0.7-0.9
- Source: Community knowledge base in Weaviate vector database
- Use Case: Common problems with established solutions
Layer 3: LLM-Generated Solutions
- Confidence: 0.6-0.8
- Source: AWS Bedrock (Claude Sonnet, Nova Lite, GPT-4)
- Use Case: Novel problems requiring AI-generated guidance
- Cost Management: Semantic caching reduces costs by 30-50%
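One plausible reading of "progressive" is that layers are tried in order and the search stops at the first sufficiently confident answer. The sketch below assumes exactly that; the cutoff value, the layer names, and the `progressive_search` function are hypothetical, and the stub layers stand in for the real customer, knowledge base, and LLM lookups.

```python
from typing import Callable, Optional

Result = tuple[str, float]                 # (answer, confidence)
Layer = Callable[[str], Optional[Result]]  # returns None when nothing is found


def progressive_search(
    query: str,
    layers: list[tuple[str, Layer]],
    min_confidence: float = 0.7,  # assumed cutoff, not a documented value
) -> Optional[tuple[str, str, float]]:
    """Try layers in order; return the first result at or above the cutoff.

    If no layer is confident enough, fall back to the best result seen.
    """
    best = None
    for name, layer in layers:
        result = layer(query)
        if result is None:
            continue
        answer, confidence = result
        if confidence >= min_confidence:
            return name, answer, confidence
        if best is None or confidence > best[2]:
            best = (name, answer, confidence)
    return best


# Stub layers for illustration only.
layers = [
    ("customer", lambda q: None),
    ("public_kb", lambda q: ("Restart the connection pool", 0.65)),
    ("llm", lambda q: ("Generated guidance", 0.75)),
]
found = progressive_search("db timeouts", layers)
```

Ordering the layers from cheapest and most trusted to most expensive means the LLM is only consulted when earlier layers come up short.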
AI Suggestions
The platform provides intelligent suggestions for:
- Relevant procedures based on incident descriptions
- Similar incidents with successful resolutions
- Next steps during incident investigation
- Procedure improvements based on execution patterns
Roles & Permissions
Overwatch uses role-based access control (RBAC) to manage permissions.
User Roles
Viewer
- Permissions: Read-only access to incidents and procedures
- Use Case: Stakeholders who need visibility without modification rights
- Actions: View incidents, procedures, analytics, and dashboards
Engineer
- Permissions: Create and manage incidents, execute procedures
- Use Case: DevOps engineers and SRE team members doing day-to-day operations
- Actions: All Viewer permissions plus create/update incidents, execute procedures, comment on incidents
Manager
- Permissions: All Engineer permissions plus procedure management
- Use Case: Team leads responsible for standardizing operational processes
- Actions: All Engineer permissions plus create/update procedures, approve executions, manage procedure templates
Admin
- Permissions: Full platform access including organization management
- Use Case: System administrators responsible for platform configuration
- Actions: All Manager permissions plus user management, integration setup, organization settings, billing and usage monitoring
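Because each role includes everything the role below it can do, the hierarchy can be modeled as an ordered scale. The following is a sketch under that assumption; the action names and the `can` helper are invented for illustration, and the top role is assumed to be named Admin.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 1
    ENGINEER = 2
    MANAGER = 3
    ADMIN = 4


# Minimum role for a few example actions (hypothetical action names,
# grouped according to the role descriptions in this guide).
REQUIRED_ROLE = {
    "view_incident": Role.VIEWER,
    "create_incident": Role.ENGINEER,
    "execute_procedure": Role.ENGINEER,
    "create_procedure": Role.MANAGER,
    "approve_execution": Role.MANAGER,
    "manage_users": Role.ADMIN,
}


def can(role: Role, action: str) -> bool:
    """A role may perform an action if it ranks at or above the minimum."""
    return role >= REQUIRED_ROLE[action]
```

With this model, `can(Role.ENGINEER, "create_procedure")` is false and `can(Role.MANAGER, "create_incident")` is true, mirroring the "all permissions of the previous role plus" wording.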
Multi-Tenant Architecture
Multi-tenancy means multiple organizations can use the same platform with complete data isolation.
Organization Isolation
- Each organization has a dedicated namespace
- All data is automatically scoped to your organization
- Zero cross-organization data visibility
- Separate user management and permissions per organization
How It Works
- Database models include organization_id for data isolation
- SQL queries are automatically filtered by organization context
- API requests include organization context from authentication
- RBAC enforced at the service layer before database access
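The scoping described above can be illustrated with an in-memory SQLite database. This is a simplified sketch, not Overwatch's schema: the table layout and the `list_incidents` helper are assumptions, and only the `organization_id` column name comes from this guide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents ("
    "id INTEGER PRIMARY KEY, organization_id TEXT NOT NULL, title TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO incidents (organization_id, title) VALUES (?, ?)",
    [
        ("org_a", "API latency spike"),
        ("org_b", "Disk full"),
        ("org_a", "Login errors"),
    ],
)


def list_incidents(conn: sqlite3.Connection, organization_id: str) -> list[str]:
    """Every query carries the caller's organization as a bound parameter."""
    rows = conn.execute(
        "SELECT title FROM incidents WHERE organization_id = ? ORDER BY id",
        (organization_id,),
    )
    return [row[0] for row in rows]
```

Calling `list_incidents(conn, "org_a")` returns only the two `org_a` rows. In a real service the organization would come from the authenticated request rather than a function argument, so callers cannot choose another tenant's scope.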
Benefits
- Secure data separation between organizations
- Centralized platform management by Overwatch
- Consistent feature updates across all organizations
- Cost-effective infrastructure sharing
Real-Time Collaboration
Overwatch uses WebSocket technology for live collaboration features.
Live Updates
- Changes sync instantly across all connected users
- See team member activities as they happen
- Automatic conflict resolution for concurrent edits
- Green indicator shows active WebSocket connection
Collaboration Features
- Comments: Threaded discussions on incidents and procedures
- @Mentions: Tag team members for notifications and attention
- Activity Feed: Real-time stream of team activities
- Status Indicators: Online/offline presence for team members
- Live Execution Monitoring: Watch procedure executions in real time
Connection Management
- Automatic reconnection if connection is lost
- Offline queue for actions taken while disconnected
- Sync when connection is restored
- Connection status always visible in the interface
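The offline queue and sync-on-restore behavior can be sketched as a small buffer in front of the transport. This is an illustrative model only: `OfflineQueue` and its method names are hypothetical, and the `send` callable stands in for whatever actually writes to the WebSocket.

```python
from collections import deque
from typing import Callable


class OfflineQueue:
    """Buffer actions while disconnected and flush them in order on reconnect."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._pending: deque[str] = deque()
        self.connected = False

    def submit(self, action: str) -> None:
        if self.connected:
            self._send(action)
        else:
            self._pending.append(action)  # held until the connection returns

    def on_disconnect(self) -> None:
        self.connected = False

    def on_reconnect(self) -> None:
        self.connected = True
        while self._pending:
            self._send(self._pending.popleft())  # oldest action first
```

Flushing in first-in, first-out order keeps queued actions in the sequence the user performed them. How the platform resolves conflicts between queued actions and changes made by others in the meantime is outside this sketch.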
Chrome Extension
The Chrome Extension bridges your monitoring platforms with Overwatch.
Key Features
- Platform Detection: Automatically detects alerts in Datadog, New Relic, Grafana, PagerDuty, and more
- Context Extraction: Captures alert details, error messages, and metrics
- On-Demand Reporting: Report problems directly from monitoring dashboards
- Automatic Incident Creation: Optionally create incidents automatically from alerts
- Screenshot Capture: Capture and attach visual context to incidents
Supported Platforms
Production Platforms:
- Datadog
- New Relic
- Grafana Cloud
- PagerDuty
- SigNoz Cloud
- Elasticsearch Cloud
- Kibana
- Sentry
- Honeycomb
Local Development (localhost):
- Prometheus (port 9090)
- Alertmanager (port 9093)
- Grafana (port 3001)
- SigNoz (port 3301)
- Kibana (port 5601)
- Elasticsearch (port 9200)
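For local development, the port list above suggests a simple lookup from URL to platform. The sketch below assumes detection can be keyed on the localhost port alone; the extension itself may well inspect page content too, and `detect_local_platform` is a hypothetical helper, written in Python for illustration rather than as extension code.

```python
from typing import Optional
from urllib.parse import urlparse

# Ports as listed in this guide for local development.
LOCAL_PORTS = {
    9090: "Prometheus",
    9093: "Alertmanager",
    3001: "Grafana",
    3301: "SigNoz",
    5601: "Kibana",
    9200: "Elasticsearch",
}


def detect_local_platform(url: str) -> Optional[str]:
    """Return the platform name for a known localhost port, else None."""
    parsed = urlparse(url)
    if parsed.hostname != "localhost":
        return None
    return LOCAL_PORTS.get(parsed.port)
```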
How It Works
- Extension detects alerts on monitoring platform pages
- Extracts relevant context (error messages, metrics, timestamps)
- Pre-populates incident creation form with extracted data
- Links incident to monitoring platform dashboard
- Maintains context throughout incident resolution
Integrations
Integrations connect Overwatch with your existing observability tools.
Integration Types
Monitoring Platforms
- Purpose: Receive alerts and incidents automatically
- Method: Webhook-based or API polling
- Examples: Datadog monitors, New Relic violations, Grafana alerts
Communication Channels
- Purpose: Send notifications to your team
- Method: Webhook delivery to external services
- Examples: Slack channels, Microsoft Teams, email, SMS
Identity Providers
- Purpose: Single sign-on (SSO) authentication
- Method: SAML 2.0 or OAuth 2.0
- Examples: Okta, Auth0, Azure AD, Google Workspace
Integration Architecture
Each monitoring platform integration includes:
- Alert Parser: Platform-specific alert format parsing
- API Client: Native API integration for bidirectional communication
- Webhook Processor: Handles incoming webhooks from platform
- Data Transformer: Normalizes data into Overwatch format
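The parser and transformer roles can be sketched as one function per platform that maps its payload into a common shape. The payload fields below are invented and deliberately attached to made-up platforms (`platform_a`, `platform_b`); they do not represent the webhook format of any real monitoring product, and the normalized shape is an assumption rather than the actual Overwatch format.

```python
# Hypothetical payload shapes; every real platform defines its own schema.
def parse_platform_a(payload: dict) -> dict:
    return {
        "title": payload["alert_title"],
        "severity": payload["priority"].capitalize(),
        "source": "platform_a",
    }


def parse_platform_b(payload: dict) -> dict:
    levels = {1: "Critical", 2: "High", 3: "Medium", 4: "Low"}
    return {
        "title": payload["name"],
        "severity": levels[payload["level"]],
        "source": "platform_b",
    }


PARSERS = {"platform_a": parse_platform_a, "platform_b": parse_platform_b}


def normalize(platform: str, payload: dict) -> dict:
    """Dispatch to the platform-specific parser and return the common shape."""
    try:
        parser = PARSERS[platform]
    except KeyError:
        raise ValueError(f"no parser registered for {platform}") from None
    return parser(payload)
```

Keeping the platform-specific logic behind a single `normalize` entry point means the rest of the system only ever sees one alert shape, whatever the source.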
Configuration
Integrations are configured through:
- Admin Dashboard: Web UI for integration setup
- API Keys: Secure credential management
- Webhook URLs: Unique URLs for each integration
- Test Functions: Validate integration configuration
Analytics & Monitoring
Incident Analytics
Track key performance metrics:
- Resolution Time: Average time to resolve by severity level
- Team Performance: Individual and team effectiveness metrics
- Pattern Analysis: Common incident types and frequencies
- Trend Analysis: Volume and severity trends over time
Procedure Analytics
Monitor operational efficiency:
- Execution Success Rates: Success vs failure rates by procedure
- Execution Times: Average completion time tracking
- Most Used Procedures: Identify frequently executed runbooks
- Optimization Opportunities: Procedures with low success rates or long durations
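As a worked example of the success-rate metric, the sketch below computes a per-procedure rate from a list of execution outcomes and flags procedures below a cutoff. The record shape, the procedure names, and the 0.8 cutoff are assumptions made for the example, not platform defaults.

```python
from collections import defaultdict


def success_rates(executions: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each procedure name to successes divided by total runs."""
    totals = defaultdict(lambda: [0, 0])  # name -> [successes, runs]
    for procedure, succeeded in executions:
        totals[procedure][1] += 1
        if succeeded:
            totals[procedure][0] += 1
    return {name: ok / runs for name, (ok, runs) in totals.items()}


def needs_attention(rates: dict[str, float], cutoff: float = 0.8) -> list[str]:
    """Procedures whose success rate falls below the (assumed) cutoff."""
    return sorted(name for name, rate in rates.items() if rate < cutoff)


rates = success_rates(
    [("restart-db", True), ("restart-db", False), ("rotate-keys", True)]
)
```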
LLM Cost Monitoring
Track AI feature usage and costs:
- Monthly Budget Tracking: Real-time cost monitoring
- Usage by Provider: Cost breakdown by model (Claude, Nova, GPT-4)
- Budget Alerts: Notifications at 25%, 50%, 75%, 100% thresholds
- Caching Savings: Cost reduction from semantic caching
- Hard Limits: Automatic cutoff at configured spending limits
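The threshold alerts and hard limit can be expressed as two small checks. This sketch assumes an alert fires once, when spend first crosses a threshold, and that requests are refused once spend reaches the limit; the function names are hypothetical and only the threshold percentages come from this guide.

```python
# Alert thresholds as fractions of the monthly budget.
THRESHOLDS = (0.25, 0.50, 0.75, 1.00)


def crossed_thresholds(
    previous_spend: float, current_spend: float, budget: float
) -> list[float]:
    """Thresholds passed between the previous and current spend totals."""
    return [t for t in THRESHOLDS if previous_spend < t * budget <= current_spend]


def allow_request(current_spend: float, hard_limit: float) -> bool:
    """Refuse further LLM calls once spend has reached the hard limit."""
    return current_spend < hard_limit
```

For a budget of 100, moving from 20 to 55 in spend crosses the 25% and 50% thresholds in one step, so `crossed_thresholds(20, 55, 100)` reports both; a later move from 55 to 60 reports none.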
Next Steps
Now that you understand the core concepts, you’re ready to:
- Follow the Quickstart Guide - Get hands-on experience in 15 minutes
- Explore the User Guide - Learn detailed operational workflows
- Read the Admin Guide - Set up your organization (admins only)
Need Clarification?
- Glossary: See the Glossary for definitions of all platform terms
- API Reference: Explore the API Documentation for technical details
- Support: Contact your system administrator for organization-specific help
Last updated: October 2025