Key Concepts
This guide introduces the core concepts and terminology you’ll encounter when using Overwatch. Understanding these concepts will help you work more effectively with the platform.
Incidents
Incidents are records of problems or issues that require investigation and resolution.
Incident Lifecycle
Incidents follow a standard workflow:
- New → Incident created, awaiting assignment
- In Progress → Active investigation and resolution underway
- Resolved → Solution implemented, awaiting verification
- Closed → Incident fully resolved and documented
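The workflow above can be read as a small state machine. The sketch below is illustrative only: `Status`, `ALLOWED`, and `advance` are hypothetical names, and it assumes strictly forward transitions, which Overwatch may or may not enforce.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    IN_PROGRESS = "In Progress"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumed forward-only transitions, taken from the workflow order above.
ALLOWED = {
    Status.NEW: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return the target status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

For example, `advance(Status.NEW, Status.IN_PROGRESS)` succeeds, while `advance(Status.NEW, Status.CLOSED)` raises `ValueError`.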
Incident Properties
- Title: Clear, descriptive incident title for quick identification
- Description: Detailed problem description including error messages and context
- Severity: Priority level (Critical, High, Medium, Low)
- Status: Current state in the incident lifecycle
- Assignee: Team member responsible for resolution
- Tags: Keywords for organization and searchability
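As a rough mental model, the properties above map onto a record like the one below. This is a sketch, not Overwatch's actual schema: the field names, the types, and the default status of "New" are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Severity levels listed above.
SEVERITIES = ("Critical", "High", "Medium", "Low")


@dataclass
class Incident:
    title: str
    description: str
    severity: str
    status: str = "New"              # assumed default: first lifecycle stage
    assignee: Optional[str] = None   # unassigned until someone takes it
    tags: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject severities outside the documented set.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
```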
Context Integration
The Chrome extension can automatically populate incident details by:
- Detecting alerts in monitoring platforms (Datadog, New Relic, Grafana, etc.)
- Extracting error messages and metrics from dashboards
- Linking to relevant monitoring platform resources
- Capturing screenshots and diagnostic information
Procedures
Procedures are executable runbooks that provide step-by-step guidance for operational tasks.
Procedure Components
- Steps: Ordered sequence of actions to complete
- Approvals: Required confirmations for sensitive operations
- Variables: Dynamic values that can be substituted during execution
- Estimated Duration: Expected time to complete each step
- Execution History: Complete record of all procedure executions
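To illustrate what substituting variables during execution could look like, here is a minimal sketch using Python's standard `string.Template`. The `$name` placeholder syntax and the `render_step` helper are assumptions made for illustration; the source does not document Overwatch's placeholder format.

```python
from string import Template


def render_step(step: str, variables: dict[str, str]) -> str:
    """Fill $placeholders in a step; raises KeyError if a variable is missing."""
    return Template(step).substitute(variables)


# Hypothetical step text and values.
command = render_step(
    "kubectl rollout restart deployment/$service -n $namespace",
    {"service": "checkout", "namespace": "prod"},
)
```

Failing loudly on a missing variable is a deliberate choice here: a half-rendered command is riskier than a step that refuses to run.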
Procedure Types
- Manual Procedures: Human-executed steps with verification checkpoints
- Approval-Required: Procedures that need manager/admin approval to execute
- Template Procedures: Standardized procedures for common scenarios
- Custom Procedures: Organization-specific operational runbooks
Execution Tracking
When executing a procedure, the platform provides:
- Real-time progress monitoring via WebSocket updates
- Step-by-step guidance with completion tracking
- Note-taking capability for observations and deviations
- Success/failure outcome recording
- Automatic linkage to related incidents
AI Chat
AI Chat is Overwatch’s core diagnostic interface: a conversational AI that helps you investigate and resolve incidents in real time.
How It Works
- You describe a problem or the extension auto-detects an alert
- The AI analyzes alert data, service registry context, and your environment
- It suggests diagnostic commands that the Helper CLI can execute
- You approve commands, results flow back to the AI, and it refines its diagnosis
- The loop continues until the incident is resolved
Key Properties
- Sessions: Each chat is linked to a specific incident
- Multi-Turn: Maintains conversation history of up to 20 messages
- Context Injection: Alert data and service registry info are fed to the AI automatically
- Prompt Injection Detection: Built-in security checks on user messages
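One way to picture the 20-message history limit is a bounded queue attached to an incident. The sketch below assumes the oldest messages are dropped first, which the source does not state; `ChatSession` and its fields are hypothetical names.

```python
from collections import deque

MAX_MESSAGES = 20  # history limit stated above


class ChatSession:
    """A chat tied to one incident, keeping only the most recent messages."""

    def __init__(self, incident_id: str) -> None:
        self.incident_id = incident_id
        # deque with maxlen silently discards the oldest entry when full.
        self.history = deque(maxlen=MAX_MESSAGES)

    def add(self, role: str, content: str) -> None:
        self.history.append({"role": role, "content": content})
```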
Helper CLI
The Helper CLI is an optional local agent that bridges AI diagnosis with your actual infrastructure.
What It Does
- Executes approved shell commands on your machine (kubectl, aws, docker, gh, etc.)
- Auto-detects your environment: Kubernetes context, AWS profile, Docker setup, installed tools
- Rate-limited to 15 commands per minute with allowlist-based security
- Available for macOS, Linux, and Windows
Security Model
- Allowlist: Only approved CLI tools can run (kubectl, aws, docker, gh, terraform, etc.)
- Blocked: Destructive operations (sudo, rm, mv) and credential-exposing commands
- Audit Log: Every executed command is logged per session
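The allowlist and the 15-commands-per-minute limit could be combined in a single gate, roughly as sketched below. This is not the Helper CLI's implementation: `CommandGate` is a hypothetical name, the sliding 60-second window is an assumption, and the check only inspects the first token, so it does not cover credential-exposing subcommands.

```python
import shlex
import time
from collections import deque
from typing import Optional

ALLOWED_TOOLS = {"kubectl", "aws", "docker", "gh", "terraform"}  # examples named above
BLOCKED_TOOLS = {"sudo", "rm", "mv"}                             # examples named above
MAX_PER_MINUTE = 15


class CommandGate:
    def __init__(self) -> None:
        self._timestamps = deque()  # times of recently accepted commands

    def check(self, command: str, now: Optional[float] = None) -> bool:
        """Return True if the command may run, recording it against the limit."""
        now = time.monotonic() if now is None else now
        tokens = shlex.split(command)
        if not tokens or tokens[0] in BLOCKED_TOOLS or tokens[0] not in ALLOWED_TOOLS:
            return False
        # Drop timestamps that have left the 60-second window.
        while self._timestamps and now - self._timestamps[0] >= 60:
            self._timestamps.popleft()
        if len(self._timestamps) >= MAX_PER_MINUTE:
            return False
        self._timestamps.append(now)
        return True
```

Rejected commands do not count against the limit in this sketch; only accepted ones are recorded.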
Service Registry
The Service Registry maps monitoring alerts to your infrastructure so the AI knows what to investigate.
Configuration
For each service you register:
- Service Name: As it appears in monitoring platform alerts
- GitHub Repository: Where the service code lives
- Deploy Platform: Railway, ECS, Kubernetes, GCP Cloud Run, Azure, Vercel, Fly.io
- Deploy Identifier: Platform-specific service/project name
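Conceptually, a registry entry is a small record keyed by the service name that appears in alerts. The sketch below shows that shape; the `ServiceEntry` type, the `lookup` helper, and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceEntry:
    service_name: str        # as it appears in monitoring alerts
    github_repository: str   # where the service code lives
    deploy_platform: str     # e.g. "Kubernetes" or "ECS"
    deploy_identifier: str   # platform-specific service/project name


# Hypothetical registry with a single made-up service.
REGISTRY = {
    "checkout-api": ServiceEntry(
        service_name="checkout-api",
        github_repository="example-org/checkout-api",
        deploy_platform="Kubernetes",
        deploy_identifier="checkout-api-prod",
    ),
}


def lookup(alert_service: str) -> Optional[ServiceEntry]:
    """Find the registry entry for the service named in an alert, if any."""
    return REGISTRY.get(alert_service)
```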
Impact
When an alert fires for a registered service, the AI automatically:
- Suggests relevant GitHub repo searches
- Provides deployment-specific CLI commands
- Uses the correct platform context in its analysis
Model Tiers
Overwatch uses 5-tier model routing to balance response quality with cost:
| Tier | Model | Use Case |
|---|---|---|
| 1 | Amazon Nova Micro | Quick triage, simple queries |
| 2 | Claude Haiku | Fast responses, minor incidents |
| 3 | Claude Sonnet | Default tier, balanced analysis |
| 4 | Claude Opus | Complex root-cause analysis |
| 5 | Weaviate Search | Knowledge base fallback |
The system automatically selects the appropriate tier based on incident complexity and your organization’s quota.
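The routing rules themselves are not documented here, so the following is only a guess at their general shape. It assumes complexity arrives as a coarse label and that an exhausted quota falls back to tier 5; both assumptions, along with the `select_tier` name and the labels, are hypothetical.

```python
# Tier table copied from above.
TIERS = {
    1: "Amazon Nova Micro",
    2: "Claude Haiku",
    3: "Claude Sonnet",
    4: "Claude Opus",
    5: "Weaviate Search",
}

# Assumed mapping from a coarse complexity label to a tier.
COMPLEXITY_TO_TIER = {"trivial": 1, "minor": 2, "standard": 3, "complex": 4}


def select_tier(complexity: str, quota_remaining: bool) -> int:
    """Pick a tier number; unknown labels use the default tier (3)."""
    if not quota_remaining:
        return 5  # assumption: no LLM quota left, use knowledge base search
    return COMPLEXITY_TO_TIER.get(complexity, 3)
```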
Semantic Caching
Semantic caching stores vector embeddings of previous AI queries. When a similar question is asked, the cached response is returned instead of making a new LLM call, which reduces costs by 30-50%.
- Cache is scoped per organization
- Similar queries (not just identical ones) trigger cache hits
- Cache hits return near-instant responses
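The idea can be sketched with cosine similarity over embeddings, scoped by organization. This is a simplified illustration rather than Overwatch's cache: the 0.9 threshold, the `SemanticCache` class, and the linear scan are assumptions, and the embeddings are expected to come from whatever model the caller uses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed similarity cutoff for a hit
        self._entries = {}          # org_id -> list of (embedding, response)

    def put(self, org_id, embedding, response) -> None:
        self._entries.setdefault(org_id, []).append((embedding, response))

    def get(self, org_id, embedding):
        """Return the most similar cached response for this org, or None."""
        best, best_score = None, self.threshold
        for stored, response in self._entries.get(org_id, []):
            score = cosine(stored, embedding)
            if score >= best_score:
                best, best_score = response, score
        return best
```

Because lookups only consult the caller's own organization, a query from another organization never sees the cached response, even for an identical embedding.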
Blast Radius
Blast radius is a visualization that shows the scope of an incident’s impact:
- Total services impacted by a single incident
- Correlated alerts across monitoring sources
- Unique monitoring sources reporting related issues
This helps you understand whether an incident is isolated or part of a broader failure.
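The three figures behind the visualization are simple counts over the alerts correlated with an incident. A minimal sketch, assuming each alert is a dict with hypothetical `service` and `source` keys and that correlation has already happened upstream:

```python
def blast_radius(alerts):
    """Summarize the alerts already correlated with one incident."""
    return {
        "services_impacted": len({a["service"] for a in alerts}),
        "correlated_alerts": len(alerts),
        "monitoring_sources": len({a["source"] for a in alerts}),
    }


# Made-up alerts for illustration.
summary = blast_radius([
    {"service": "checkout", "source": "Datadog"},
    {"service": "checkout", "source": "Grafana"},
    {"service": "payments", "source": "Datadog"},
])
```

A single impacted service reported by a single source suggests an isolated incident; higher counts point to a broader failure.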
Search & AI Features
Semantic Search
Overwatch uses vector search to understand the meaning of your queries:
- Natural Language: Search using plain language descriptions
- Conceptual Matching: Finds related content even without keyword matches
- Context-Aware: Results ranked by relevance, recency, and success rates
- Cross-Content: Searches across incidents, procedures, comments, and history
3-Layer Search Architecture
The platform uses a progressive search strategy:
- Layer 1 (Customer): Organization-specific historical solutions
- Layer 2 (Public): Community knowledge base via Weaviate vector database
- Layer 3 (LLM): AI-generated solutions via AWS Bedrock with semantic caching
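Assuming "progressive" means each layer is consulted only when the previous one finds nothing, the strategy reduces to a fall-through loop. The `progressive_search` function and the stub layers below are hypothetical; real layers would query the organization's history, Weaviate, and AWS Bedrock respectively.

```python
from typing import Callable, Optional, Sequence

# A layer takes a query and returns a solution, or None on a miss.
Layer = Callable[[str], Optional[str]]


def progressive_search(
    query: str, layers: Sequence[tuple[str, Layer]]
) -> tuple[str, Optional[str]]:
    """Try each layer in order and return the first hit with its layer name."""
    for name, layer in layers:
        result = layer(query)
        if result is not None:
            return name, result
    return "none", None


# Stub layers standing in for the three real backends.
layers = [
    ("customer", lambda q: None),
    ("public", lambda q: "community fix" if "oom" in q else None),
    ("llm", lambda q: "generated answer"),
]
```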
Roles & Permissions
Overwatch uses role-based access control (RBAC) to manage permissions.
User Roles
Viewer
- Permissions: Read-only access to incidents and procedures
- Use Case: Stakeholders who need visibility without modification rights
- Actions: View incidents, procedures, analytics, and dashboards
Engineer
- Permissions: Create and manage incidents, execute procedures
- Use Case: DevOps engineers and SRE team members doing day-to-day operations
- Actions: All Viewer permissions plus create/update incidents, execute procedures, comment on incidents
Manager
- Permissions: All Engineer permissions plus procedure management
- Use Case: Team leads responsible for standardizing operational processes
- Actions: All Engineer permissions plus create/update procedures, approve executions, manage procedure templates
Admin
- Permissions: Full platform access including organization management
- Use Case: System administrators responsible for platform configuration
- Actions: All Manager permissions plus user management, integration setup, organization settings, billing and usage monitoring
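Because each role inherits everything from the one below it, the roles can be modeled as an ordered scale. The sketch below is illustrative: the action names and the `can` helper are hypothetical, and `Role.ADMIN` stands for the full-access role described last.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 1
    ENGINEER = 2
    MANAGER = 3
    ADMIN = 4


# Minimum role per action, derived from the role descriptions above.
REQUIRED_ROLE = {
    "view_incident": Role.VIEWER,
    "create_incident": Role.ENGINEER,
    "execute_procedure": Role.ENGINEER,
    "create_procedure": Role.MANAGER,
    "approve_execution": Role.MANAGER,
    "manage_users": Role.ADMIN,
}


def can(role: Role, action: str) -> bool:
    """True if the role is at or above the minimum required for the action."""
    return role >= REQUIRED_ROLE[action]
```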
Multi-Tenant Architecture
Multi-tenancy means multiple organizations can use the same platform with complete data isolation.
Organization Isolation
- Each organization has a dedicated namespace
- All data is automatically scoped to your organization
- Zero cross-organization data visibility
- Separate user management and permissions per organization
How It Works
- Database models include organization_id for data isolation
- SQL queries automatically filtered by organization context
- API requests include organization context from authentication
- RBAC enforced at the service layer before database access
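To make the organization filter concrete, here is a minimal sketch using an in-memory SQLite database. The table layout, the sample rows, and both helper functions are hypothetical; only the `organization_id` column comes from the description above.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table with made-up rows for two organizations."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE incidents (id INTEGER PRIMARY KEY, organization_id TEXT, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO incidents (organization_id, title) VALUES (?, ?)",
        [
            ("org-a", "API latency spike"),
            ("org-b", "Login errors"),
            ("org-a", "Disk full on db-1"),
        ],
    )
    return conn


def incidents_for(conn: sqlite3.Connection, organization_id: str) -> list[str]:
    """Return incident titles for one organization only."""
    rows = conn.execute(
        "SELECT title FROM incidents WHERE organization_id = ? ORDER BY id",
        (organization_id,),
    )
    return [title for (title,) in rows]
```

In a real service the `organization_id` would come from the authenticated request rather than from the caller, so that application code cannot forget or override the filter.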
Benefits
- Secure data separation between organizations
- Centralized platform management by Overwatch
- Consistent feature updates across all organizations
- Cost-effective infrastructure sharing
Real-Time Collaboration
Overwatch uses WebSocket technology for live collaboration features.
Live Updates
- Changes sync instantly across all connected users
- See team member activities as they happen
- Automatic conflict resolution for concurrent edits
- Green indicator shows active WebSocket connection
Collaboration Features
- Comments: Threaded discussions on incidents and procedures
- @Mentions: Tag team members for notifications and attention
- Activity Feed: Real-time stream of team activities
- Status Indicators: Online/offline presence for team members
- Live Execution Monitoring: Watch procedure executions in real-time
Connection Management
- Automatic reconnection if connection is lost
- Offline queue for actions taken while disconnected
- Sync when connection is restored
- Connection status always visible in the interface
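The offline queue and sync-on-restore behavior can be sketched as follows. `OfflineQueue` and its `send` callback are hypothetical, and replaying queued actions in their original order is an assumption about how the sync works.

```python
from collections import deque
from typing import Callable


class OfflineQueue:
    """Buffer actions while disconnected and replay them on reconnect."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send        # delivers one action over the live connection
        self._pending = deque()  # actions taken while offline
        self.connected = False

    def submit(self, action: str) -> None:
        if self.connected:
            self._send(action)
        else:
            self._pending.append(action)

    def on_reconnect(self) -> None:
        # Flush queued actions first, oldest first, then resume live sending.
        self.connected = True
        while self._pending:
            self._send(self._pending.popleft())
```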
Chrome Extension
The Chrome Extension (v3, Manifest V3) is your primary interface with Overwatch, embedding AI chat directly into your monitoring dashboards.
Key Features
- Side-Panel AI Chat: Open a conversational AI panel from any monitoring dashboard (Ctrl+Shift+I / Cmd+Shift+I)
- Alert Auto-Detection: Content scripts monitor dashboard DOM for active alerts and extract context
- Network Interception: Captures monitoring platform API responses for enriched context
- Service Registry Integration: Connects alerts to your configured services and deploy targets
- Helper CLI Connection: Detects when the local Helper CLI is running for command execution
Supported Platforms
- Datadog (including EU region)
- Grafana Cloud and Grafana.net
- New Relic
- PagerDuty
- Prometheus
- SigNoz
- Elasticsearch
- AWS CloudWatch
How It Works
- Extension detects alerts on your monitoring platform
- Extracts alert context (error messages, metrics, timestamps, source data)
- You open the AI chat side panel
- Alert data is injected into the AI prompt before analysis
- AI diagnoses the issue and suggests commands
- Helper CLI executes approved commands locally
- Results flow back to the chat for iterative diagnosis
Integrations
Integrations connect Overwatch with your existing observability tools.
Integration Types
Monitoring Platforms
- Purpose: Receive alerts and incidents automatically
- Method: Webhook-based or API polling
- Examples: Datadog monitors, New Relic violations, Grafana alerts
Communication Channels
- Purpose: Send notifications to your team
- Method: Webhook delivery to external services
- Examples: Slack channels, Microsoft Teams, email, SMS
Identity Providers
- Purpose: Single sign-on (SSO) authentication
- Method: SAML 2.0 or OAuth 2.0
- Examples: Okta, Auth0, Azure AD, Google Workspace
Integration Architecture
Each monitoring platform integration includes:
- Alert Parser: Platform-specific alert format parsing
- API Client: Native API integration for bidirectional communication
- Webhook Processor: Handles incoming webhooks from platform
- Data Transformer: Normalizes data into Overwatch format
Configuration
Integrations are configured through:
- Admin Dashboard: Web UI for integration setup
- API Keys: Secure credential management
- Webhook URLs: Unique URLs for each integration
- Test Functions: Validate integration configuration
Analytics & Monitoring
Incident Analytics
Track key performance metrics:
- Resolution Time: Average time to resolve by severity level
- Team Performance: Individual and team effectiveness metrics
- Pattern Analysis: Common incident types and frequencies
- Trend Analysis: Volume and severity trends over time
Procedure Analytics
Monitor operational efficiency:
- Execution Success Rates: Success vs failure rates by procedure
- Execution Times: Average completion time tracking
- Most Used Procedures: Identify frequently executed runbooks
- Optimization Opportunities: Procedures with low success rates or long durations
LLM Cost Monitoring
Track AI feature usage and costs:
- Monthly Budget Tracking: Real-time cost monitoring
- Usage by Provider: Cost breakdown by model (Claude, Nova, GPT-4)
- Budget Alerts: Notifications at 25%, 50%, 75%, 100% thresholds
- Caching Savings: Cost reduction from semantic caching
- Hard Limits: Automatic cutoff at configured spending limits
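The budget alert thresholds can be thought of as lines that spending crosses over time. The sketch below reports which thresholds were crossed between two spend readings; the `crossed_thresholds` function and the assumption that each threshold fires once, when spending first reaches it, are illustrative.

```python
THRESHOLDS = (25, 50, 75, 100)  # percentages listed above


def crossed_thresholds(previous_spend: float, current_spend: float, budget: float) -> list[int]:
    """Return thresholds (in percent) reached since the previous reading."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    prev_pct = previous_spend / budget * 100
    cur_pct = current_spend / budget * 100
    return [t for t in THRESHOLDS if prev_pct < t <= cur_pct]
```

For example, moving from 20 to 55 on a budget of 100 crosses both the 25% and 50% thresholds in one step.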
Next Steps
Now that you understand the core concepts:
- Quickstart Guide: Get running in 15 minutes
- AI Chat Guide: Learn conversational incident diagnosis
- Chrome Extension Guide: Advanced extension features
- Helper CLI Guide: Local command execution details
Need Clarification?
- Glossary: See the Glossary for definitions of all platform terms
- API Reference: Explore the API Documentation for technical details
- Support: Contact support@overwatch-observability.com