Key Concepts

This guide introduces the core concepts and terminology you’ll encounter when using Overwatch. Understanding these concepts will help you work more effectively with the platform.

Incidents are records of problems or issues that require investigation and resolution.

Incidents follow a standard workflow:

  1. New → Incident created, awaiting assignment
  2. In Progress → Active investigation and resolution underway
  3. Resolved → Solution implemented, awaiting verification
  4. Closed → Incident fully resolved and documented

Incident fields:

  • Title: Clear, descriptive incident title for quick identification
  • Description: Detailed problem description including error messages and context
  • Severity: Priority level (Critical, High, Medium, Low)
  • Status: Current state in the incident lifecycle
  • Assignee: Team member responsible for resolution
  • Tags: Keywords for organization and searchability
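
The workflow and fields above can be pictured as a small state machine. The following is a minimal sketch, not Overwatch's actual data model: the `Incident` class, the `Status` enum, and the rule that only forward transitions are allowed are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    NEW = "New"
    IN_PROGRESS = "In Progress"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumption: an incident may only move one step forward in the workflow.
ALLOWED_TRANSITIONS = {
    Status.NEW: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED},
    Status.CLOSED: set(),
}


@dataclass
class Incident:
    title: str
    description: str
    severity: str  # "Critical", "High", "Medium", or "Low"
    status: Status = Status.NEW
    assignee: Optional[str] = None
    tags: List[str] = field(default_factory=list)

    def transition(self, new_status: Status) -> None:
        """Move to new_status, rejecting moves the workflow does not allow."""
        if new_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(
                f"Cannot move from {self.status.value} to {new_status.value}"
            )
        self.status = new_status
```

A real deployment may permit other moves, such as reopening a resolved incident when verification fails; that would only require extending `ALLOWED_TRANSITIONS`.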

The Chrome extension can automatically populate incident details by:

  • Detecting alerts in monitoring platforms (Datadog, New Relic, Grafana, etc.)
  • Extracting error messages and metrics from dashboards
  • Linking to relevant monitoring platform resources
  • Capturing screenshots and diagnostic information

Procedures are executable runbooks that provide step-by-step guidance for operational tasks.

  • Steps: Ordered sequence of actions to complete
  • Approvals: Required confirmations for sensitive operations
  • Variables: Dynamic values that can be substituted during execution
  • Estimated Duration: Expected time to complete each step
  • Execution History: Complete record of all procedure executions

Procedure types:

  • Manual Procedures: Human-executed steps with verification checkpoints
  • Approval-Required: Procedures that need manager/admin approval to execute
  • Template Procedures: Standardized procedures for common scenarios
  • Custom Procedures: Organization-specific operational runbooks
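
To illustrate how variables might be substituted into procedure steps at execution time, here is a minimal sketch using Python's standard `string.Template`. The `${name}` placeholder syntax and the `render_steps` helper are assumptions for illustration; the syntax Overwatch actually uses is not specified here.

```python
from string import Template
from typing import Dict, List


def render_steps(steps: List[str], variables: Dict[str, str]) -> List[str]:
    """Fill each step's placeholders; raises KeyError if a variable is missing."""
    return [Template(step).substitute(variables) for step in steps]


# Hypothetical procedure steps with two variables.
steps = [
    "Restart the ${service} deployment in ${environment}",
    "Confirm ${service} health checks pass",
]

rendered = render_steps(steps, {"service": "api", "environment": "staging"})
```

Failing loudly on a missing variable, rather than leaving the placeholder in place, avoids executing a step with an unresolved value.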

When executing a procedure, the platform provides:

  • Real-time progress monitoring via WebSocket updates
  • Step-by-step guidance with completion tracking
  • Note-taking capability for observations and deviations
  • Success/failure outcome recording
  • Automatic linkage to related incidents

Overwatch uses vector search technology to understand the meaning of your queries:

  • Natural Language: Search using plain language descriptions
  • Conceptual Matching: Finds related content even without keyword matches
  • Context-Aware: Results ranked by relevance, recency, and success rates
  • Cross-Content: Searches across incidents, procedures, comments, and history
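
As a rough illustration of conceptual matching, vector search compares embedding vectors rather than keywords. The sketch below ranks documents by cosine similarity; the two-dimensional vectors are toy values chosen by hand, and the `rank` helper is hypothetical. Real embeddings have hundreds of dimensions and come from a model.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: Sequence[float], documents: Dict[str, Sequence[float]]) -> List[str]:
    """Return document titles ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in documents.items()]
    return [title for _, title in sorted(scored, reverse=True)]


# Toy embeddings: the first axis loosely stands for "database", the second for "TLS".
documents = {
    "Database connection pool exhausted": [0.9, 0.1],
    "TLS certificate expired": [0.1, 0.9],
}
results = rank([1.0, 0.0], documents)
```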

The platform uses a progressive search strategy:

Organization history:

  • Confidence: 0.8-1.0
  • Source: Your organization’s historical incident resolutions
  • Use Case: Problems your team has solved before

Community knowledge base:

  • Confidence: 0.7-0.9
  • Source: Community knowledge base in Weaviate vector database
  • Use Case: Common problems with established solutions

AI-generated guidance:

  • Confidence: 0.6-0.8
  • Source: AWS Bedrock (Claude Sonnet, Nova Lite, GPT-4)
  • Use Case: Novel problems requiring AI-generated guidance
  • Cost Management: Semantic caching reduces costs by 30-50%
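
One way to read the progressive strategy is as an ordered fallback: try the most trusted source first and move on only when it has no sufficiently confident answer. The sketch below assumes that interpretation. The `Suggestion` type, the `progressive_search` function, and the stub lookups are hypothetical; the minimum confidences are the lower bounds of the ranges listed above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Suggestion:
    text: str
    confidence: float
    source: str


Lookup = Callable[[str], Optional[Suggestion]]


def progressive_search(query: str, tiers: List[Tuple[float, Lookup]]) -> Optional[Suggestion]:
    """Return the first result that meets its tier's minimum confidence."""
    for min_confidence, lookup in tiers:
        result = lookup(query)
        if result is not None and result.confidence >= min_confidence:
            return result
    return None


# Stub lookups standing in for the three documented sources.
def org_history(query: str) -> Optional[Suggestion]:
    return None  # nothing similar in this organization's history


def community_kb(query: str) -> Optional[Suggestion]:
    return Suggestion("Increase the connection pool size", 0.75, "community")


def ai_guidance(query: str) -> Optional[Suggestion]:
    return Suggestion("Generated troubleshooting plan", 0.65, "ai")


tiers = [(0.8, org_history), (0.7, community_kb), (0.6, ai_guidance)]
best = progressive_search("database connections exhausted", tiers)
```

Ordering the tiers this way means the AI provider, the only source with a per-request cost, is called only when the cheaper sources come up empty.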

The platform provides intelligent suggestions for:

  • Relevant procedures based on incident descriptions
  • Similar incidents with successful resolutions
  • Next steps during incident investigation
  • Procedure improvements based on execution patterns

Overwatch uses role-based access control (RBAC) to manage permissions.

Viewer:

  • Permissions: Read-only access to incidents and procedures
  • Use Case: Stakeholders who need visibility without modification rights
  • Actions: View incidents, procedures, analytics, and dashboards

Engineer:

  • Permissions: Create and manage incidents, execute procedures
  • Use Case: DevOps engineers and SRE team members doing day-to-day operations
  • Actions: All Viewer permissions plus create/update incidents, execute procedures, comment on incidents

Manager:

  • Permissions: All Engineer permissions plus procedure management
  • Use Case: Team leads responsible for standardizing operational processes
  • Actions: All Engineer permissions plus create/update procedures, approve executions, manage procedure templates

Admin:

  • Permissions: Full platform access including organization management
  • Use Case: System administrators responsible for platform configuration
  • Actions: All Manager permissions plus user management, integration setup, organization settings, billing and usage monitoring
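
Because each role described above includes all permissions of the one before it, the roles form a simple ordered hierarchy. The sketch below models that with an `IntEnum`. The `Role` enum, the action identifiers, and the `can` helper are hypothetical names for illustration, not Overwatch's API.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 1
    ENGINEER = 2
    MANAGER = 3
    ADMIN = 4


# Lowest role allowed to perform each action (hypothetical action names).
REQUIRED_ROLE = {
    "view_incident": Role.VIEWER,
    "create_incident": Role.ENGINEER,
    "execute_procedure": Role.ENGINEER,
    "create_procedure": Role.MANAGER,
    "approve_execution": Role.MANAGER,
    "manage_users": Role.ADMIN,
}


def can(role: Role, action: str) -> bool:
    """True if the role meets the minimum for the action; unknown actions are denied."""
    required = REQUIRED_ROLE.get(action)
    return required is not None and role >= required
```

Denying unknown actions by default keeps a typo in an action name from silently granting access.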

Multi-tenancy means multiple organizations can use the same platform with complete data isolation.

Data isolation:

  • Each organization has a dedicated namespace
  • All data is automatically scoped to your organization
  • Zero cross-organization data visibility
  • Separate user management and permissions per organization

Technical implementation:

  • Database models include organization_id for data isolation
  • SQL queries automatically filtered by organization context
  • API requests include organization context from authentication
  • RBAC enforced at the service layer before database access

Benefits:

  • Secure data separation between organizations
  • Centralized platform management by Overwatch
  • Consistent feature updates across all organizations
  • Cost-effective infrastructure sharing
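
The idea of scoping every query by `organization_id` can be shown with an in-memory SQLite database. This is a minimal sketch: the `incidents` table layout and the `list_incidents` helper are assumptions, and only the `organization_id` column name comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents ("
    "id INTEGER PRIMARY KEY, "
    "organization_id TEXT NOT NULL, "
    "title TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO incidents (organization_id, title) VALUES (?, ?)",
    [
        ("org_a", "Database latency spike"),
        ("org_b", "Cache eviction storm"),
        ("org_a", "Login errors"),
    ],
)


def list_incidents(conn: sqlite3.Connection, organization_id: str) -> list:
    """Return incident titles for one organization only."""
    rows = conn.execute(
        "SELECT title FROM incidents WHERE organization_id = ? ORDER BY id",
        (organization_id,),
    )
    return [title for (title,) in rows]
```

In this sketch the organization is passed explicitly; in a service it would typically be taken from the authenticated request so that callers cannot choose another tenant's identifier.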

Overwatch uses WebSocket technology for live collaboration features.

Real-time updates:

  • Changes sync instantly across all connected users
  • See team member activities as they happen
  • Automatic conflict resolution for concurrent edits
  • Green indicator shows active WebSocket connection

Collaboration features:

  • Comments: Threaded discussions on incidents and procedures
  • @Mentions: Tag team members for notifications and attention
  • Activity Feed: Real-time stream of team activities
  • Status Indicators: Online/offline presence for team members
  • Live Execution Monitoring: Watch procedure executions in real-time

Connection management:

  • Automatic reconnection if connection is lost
  • Offline queue for actions taken while disconnected
  • Sync when connection is restored
  • Connection status always visible in the interface
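
The offline queue and sync-on-reconnect behavior can be sketched independently of any WebSocket library. The `OfflineQueue` class below is hypothetical: it buffers actions while disconnected and flushes them in order once the connection returns. The `send` callable stands in for whatever actually transmits an action.

```python
from collections import deque
from typing import Callable


class OfflineQueue:
    """Buffers actions while disconnected and replays them in order on reconnect."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._pending = deque()
        self.connected = False

    def submit(self, action: str) -> None:
        # Send immediately when online; otherwise hold the action for later.
        if self.connected:
            self._send(action)
        else:
            self._pending.append(action)

    def on_disconnect(self) -> None:
        self.connected = False

    def on_reconnect(self) -> None:
        # Flush queued actions oldest-first, then resume normal sending.
        self.connected = True
        while self._pending:
            self._send(self._pending.popleft())
```

This sketch does not handle a send that fails midway through the flush; a production version would need to re-queue the failed action.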

The Chrome Extension bridges your monitoring platforms with Overwatch.

  • Platform Detection: Automatically detects alerts in Datadog, New Relic, Grafana, PagerDuty, and more
  • Context Extraction: Captures alert details, error messages, and metrics
  • On-Demand Reporting: Report problems directly from monitoring dashboards
  • Automatic Incident Creation: Optionally create incidents automatically from alerts
  • Screenshot Capture: Capture and attach visual context to incidents

Production Platforms:

  • Datadog
  • New Relic
  • Grafana Cloud
  • PagerDuty
  • SigNoz Cloud
  • Elasticsearch Cloud
  • Kibana
  • Sentry
  • Honeycomb

Local Development (localhost):

  • Prometheus (port 9090)
  • Alertmanager (port 9093)
  • Grafana (port 3001)
  • SigNoz (port 3301)
  • Kibana (port 5601)
  • Elasticsearch (port 9200)

How it works:

  1. Extension detects alerts on monitoring platform pages
  2. Extracts relevant context (error messages, metrics, timestamps)
  3. Pre-populates incident creation form with extracted data
  4. Links incident to monitoring platform dashboard
  5. Maintains context throughout incident resolution
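
For local development, detection could be as simple as matching the port of a localhost URL against the ports listed above. The sketch below assumes that approach; the `detect_local_platform` function is hypothetical, and how the extension actually identifies a platform is not described here.

```python
from typing import Optional
from urllib.parse import urlparse

# Ports taken from the local development list above.
LOCAL_PORTS = {
    9090: "Prometheus",
    9093: "Alertmanager",
    3001: "Grafana",
    3301: "SigNoz",
    5601: "Kibana",
    9200: "Elasticsearch",
}


def detect_local_platform(url: str) -> Optional[str]:
    """Return the platform name for a known localhost port, else None."""
    parsed = urlparse(url)
    if parsed.hostname != "localhost":
        return None
    return LOCAL_PORTS.get(parsed.port)
```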

Integrations connect Overwatch with your existing observability tools.

Monitoring integrations:

  • Purpose: Receive alerts and incidents automatically
  • Method: Webhook-based or API polling
  • Examples: Datadog monitors, New Relic violations, Grafana alerts

Notification integrations:

  • Purpose: Send notifications to your team
  • Method: Webhook delivery to external services
  • Examples: Slack channels, Microsoft Teams, email, SMS

Authentication integrations:

  • Purpose: Single sign-on (SSO) authentication
  • Method: SAML 2.0 or OAuth 2.0
  • Examples: Okta, Auth0, Azure AD, Google Workspace

Each monitoring platform integration includes:

  1. Alert Parser: Platform-specific alert format parsing
  2. API Client: Native API integration for bidirectional communication
  3. Webhook Processor: Handles incoming webhooks from platform
  4. Data Transformer: Normalizes data into Overwatch format

Integrations are configured through:

  • Admin Dashboard: Web UI for integration setup
  • API Keys: Secure credential management
  • Webhook URLs: Unique URLs for each integration
  • Test Functions: Validate integration configuration

Track key performance metrics:

  • Resolution Time: Average time to resolve by severity level
  • Team Performance: Individual and team effectiveness metrics
  • Pattern Analysis: Common incident types and frequencies
  • Trend Analysis: Volume and severity trends over time
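
As an example of the resolution time metric, the sketch below averages time to resolve per severity level. The input shape, a list of `(severity, created, resolved)` tuples, and the `mean_resolution_hours` helper are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, List, Tuple


def mean_resolution_hours(
    incidents: List[Tuple[str, datetime, datetime]]
) -> Dict[str, float]:
    """Average hours from creation to resolution, grouped by severity."""
    hours_by_severity = defaultdict(list)
    for severity, created, resolved in incidents:
        hours = (resolved - created).total_seconds() / 3600
        hours_by_severity[severity].append(hours)
    return {severity: mean(hours) for severity, hours in hours_by_severity.items()}


# Hypothetical sample data.
sample = [
    ("Critical", datetime(2025, 10, 1, 10, 0), datetime(2025, 10, 1, 12, 0)),
    ("Critical", datetime(2025, 10, 2, 9, 0), datetime(2025, 10, 2, 13, 0)),
    ("Low", datetime(2025, 10, 3, 8, 0), datetime(2025, 10, 3, 20, 0)),
]
averages = mean_resolution_hours(sample)
```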

Monitor operational efficiency:

  • Execution Success Rates: Success vs failure rates by procedure
  • Execution Times: Average completion time tracking
  • Most Used Procedures: Identify frequently executed runbooks
  • Optimization Opportunities: Procedures with low success rates or long durations
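
Success rate by procedure is a straightforward ratio of successful runs to total runs. The sketch below assumes execution history is available as `(procedure_name, succeeded)` pairs; that shape and the `success_rates` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def success_rates(executions: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of successful executions for each procedure."""
    counts = defaultdict(lambda: [0, 0])  # name -> [successes, total]
    for name, succeeded in executions:
        counts[name][1] += 1
        if succeeded:
            counts[name][0] += 1
    return {name: ok / total for name, (ok, total) in counts.items()}


# Hypothetical execution history.
history = [
    ("Restart service", True),
    ("Restart service", False),
    ("Database failover", True),
]
rates = success_rates(history)
```

Sorting the result by ascending rate would surface the optimization candidates mentioned above.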

Track AI feature usage and costs:

  • Monthly Budget Tracking: Real-time cost monitoring
  • Usage by Provider: Cost breakdown by model (Claude, Nova, GPT-4)
  • Budget Alerts: Notifications at 25%, 50%, 75%, 100% thresholds
  • Caching Savings: Cost reduction from semantic caching
  • Hard Limits: Automatic cutoff at configured spending limits
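
The budget alerts and hard limit above amount to comparing spend against fixed fractions of the budget. The sketch below fires each alert once, at the moment spend crosses its threshold. The function names are hypothetical, and whether Overwatch evaluates thresholds this way is an assumption.

```python
from typing import List

# Alert thresholds from the list above, as fractions of the monthly budget.
THRESHOLDS = (0.25, 0.50, 0.75, 1.00)


def crossed_thresholds(previous_spend: float, current_spend: float, budget: float) -> List[float]:
    """Thresholds passed between the previous and current spend readings."""
    return [t for t in THRESHOLDS if previous_spend < t * budget <= current_spend]


def is_cut_off(current_spend: float, hard_limit: float) -> bool:
    """True once spend reaches the configured hard limit."""
    return current_spend >= hard_limit
```

Comparing against the previous reading, rather than only the current one, prevents the same alert from being sent again on every check after a threshold has been passed.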

Now that you understand the core concepts, you’re ready to:

  1. Follow the Quickstart Guide - Get hands-on experience in 15 minutes
  2. Explore the User Guide - Learn detailed operational workflows
  3. Read the Admin Guide - Set up your organization (admins only)

Additional resources:

  • Glossary: See the Glossary for definitions of all platform terms
  • API Reference: Explore the API Documentation for technical details
  • Support: Contact your system administrator for organization-specific help

Last updated: October 2025