Key Concepts

This guide introduces the core concepts and terminology you’ll encounter when using Overwatch. Understanding these concepts will help you work more effectively with the platform.

Incidents

Incidents are records of problems or issues that require investigation and resolution.

Incident Lifecycle

Incidents follow a standard workflow:

New → Incident created, awaiting assignment
In Progress → Active investigation and resolution underway
Resolved → Solution implemented, awaiting verification
Closed → Incident fully resolved and documented

Incident Properties

Title: Clear, descriptive incident title for quick identification
Description: Detailed problem description including error messages and context
Severity: Priority level (Critical, High, Medium, Low)
Status: Current state in the incident lifecycle
Assignee: Team member responsible for resolution
Tags: Keywords for organization and searchability

Context Integration

The Chrome extension can automatically populate incident details by:

Detecting alerts in monitoring platforms (Datadog, New Relic, Grafana, etc.)
Extracting error messages and metrics from dashboards
Linking to relevant monitoring platform resources
Capturing screenshots and diagnostic information

Procedures

Procedures are executable runbooks that provide step-by-step guidance for operational tasks.

Procedure Components

Steps: Ordered sequence of actions to complete
Approvals: Required confirmations for sensitive operations
Variables: Dynamic values that can be substituted during execution
Estimated Duration: Expected time to complete each step
Execution History: Complete record of all procedure executions

Procedure Types

Manual Procedures: Human-executed steps with verification checkpoints
Approval-Required: Procedures that need manager/admin approval to execute
Template Procedures: Standardized procedures for common scenarios
Custom Procedures: Organization-specific operational runbooks

Execution Tracking

When executing a procedure, the platform provides:

Real-time progress monitoring via WebSocket updates
Step-by-step guidance with completion tracking
Note-taking capability for observations and deviations
Success/failure outcome recording
Automatic linkage to related incidents

AI Chat

AI Chat is Overwatch’s core diagnostic interface — a conversational AI that helps you investigate and resolve incidents in real time.

How It Works

You describe a problem or the extension auto-detects an alert
The AI analyzes alert data, service registry context, and your environment
It suggests diagnostic commands that the Helper CLI can execute
You approve commands, results flow back to the AI, and it refines its diagnosis
The loop continues until the incident is resolved

Key Properties

Sessions: Each chat is linked to a specific incident
Multi-Turn: Maintains full conversation history (up to 20 messages)
Context Injection: Alert data and service registry info are fed to the AI automatically
Prompt Injection Detection: Built-in security checks on user messages

Helper CLI

The Helper CLI is an optional local agent that bridges AI diagnosis with your actual infrastructure.

What It Does

Executes approved shell commands on your machine (kubectl, aws, docker, gh, etc.)
Auto-detects your environment: Kubernetes context, AWS profile, Docker setup, installed tools
Rate-limited to 15 commands per minute with allowlist-based security
Available for macOS, Linux, and Windows

Security Model

Allowlist: Only approved CLI tools can run (kubectl, aws, docker, gh, terraform, etc.)
Blocked: Destructive operations (sudo, rm, mv), credential-exposing commands
Audit Log: Every executed command is logged per session

Service Registry

The Service Registry maps monitoring alerts to your infrastructure so the AI knows what to investigate.

Configuration

For each service you register:

Service Name: As it appears in monitoring platform alerts
GitHub Repository: Where the service code lives
Deploy Platform: Railway, ECS, Kubernetes, GCP Cloud Run, Azure, Vercel, Fly.io
Deploy Identifier: Platform-specific service/project name

Impact

When an alert fires for a registered service, the AI automatically:

Suggests relevant GitHub repo searches
Provides deployment-specific CLI commands
Uses the correct platform context in its analysis

Model Tiers

Overwatch uses 5-tier model routing to balance response quality with cost:

Tier	Model	Use Case
1	Amazon Nova Micro	Quick triage, simple queries
2	Claude Haiku	Fast responses, minor incidents
3	Claude Sonnet	Default tier, balanced analysis
4	Claude Opus	Complex root-cause analysis
5	Weaviate Search	Knowledge base fallback

The system automatically selects the appropriate tier based on incident complexity and your organization’s quota.

Semantic Caching

Semantic caching stores vector embeddings of previous AI queries. When a similar question is asked, the cached response is returned instead of making a new LLM call — reducing costs by 30-50%.

Cache is scoped per organization
Similar queries (not just identical ones) trigger cache hits
Cache hits return near-instant responses

Blast Radius

Blast radius is a visualization that shows the scope of an incident’s impact:

Total services impacted by a single incident
Correlated alerts across monitoring sources
Unique monitoring sources reporting related issues

This helps you understand whether an incident is isolated or part of a broader failure.

Search & AI Features

Semantic Search

Overwatch uses vector search to understand the meaning of your queries:

Natural Language: Search using plain language descriptions
Conceptual Matching: Finds related content even without keyword matches
Context-Aware: Results ranked by relevance, recency, and success rates
Cross-Content: Searches across incidents, procedures, comments, and history

3-Layer Search Architecture

The platform uses a progressive search strategy:

Layer 1 (Customer): Organization-specific historical solutions
Layer 2 (Public): Community knowledge base via Weaviate vector database
Layer 3 (LLM): AI-generated solutions via AWS Bedrock with semantic caching

Roles & Permissions

Overwatch uses role-based access control (RBAC) to manage permissions.

User Roles

Viewer

Permissions: Read-only access to incidents and procedures
Use Case: Stakeholders who need visibility without modification rights
Actions: View incidents, procedures, analytics, and dashboards

Engineer

Permissions: Create and manage incidents, execute procedures
Use Case: DevOps engineers and SRE team members doing day-to-day operations
Actions: All Viewer permissions plus create/update incidents, execute procedures, comment on incidents

Manager

Permissions: All Engineer permissions plus procedure management
Use Case: Team leads responsible for standardizing operational processes
Actions: All Engineer permissions plus create/update procedures, approve executions, manage procedure templates

Admin

Permissions: Full platform access including organization management
Use Case: System administrators responsible for platform configuration
Actions: All Manager permissions plus user management, integration setup, organization settings, billing and usage monitoring

Multi-Tenant Architecture

Multi-tenancy means multiple organizations can use the same platform with complete data isolation.

Organization Isolation

Each organization has a dedicated namespace
All data is automatically scoped to your organization
Zero cross-organization data visibility
Separate user management and permissions per organization

How It Works

Database models include organization_id for data isolation
SQL queries automatically filtered by organization context
API requests include organization context from authentication
RBAC enforced at the service layer before database access

Benefits

Secure data separation between organizations
Centralized platform management by Overwatch
Consistent feature updates across all organizations
Cost-effective infrastructure sharing

Real-Time Collaboration

Overwatch uses WebSocket technology for live collaboration features.

Live Updates

Changes sync instantly across all connected users
See team member activities as they happen
Automatic conflict resolution for concurrent edits
Green indicator shows active WebSocket connection

Collaboration Features

Comments: Threaded discussions on incidents and procedures
@Mentions: Tag team members for notifications and attention
Activity Feed: Real-time stream of team activities
Status Indicators: Online/offline presence for team members
Live Execution Monitoring: Watch procedure executions in real-time

Connection Management

Automatic reconnection if connection is lost
Offline queue for actions taken while disconnected
Sync when connection is restored
Connection status always visible in the interface

Chrome Extension

The Chrome Extension (v3, Manifest V3) is your primary interface with Overwatch, embedding AI chat directly into your monitoring dashboards.

Key Features

Side-Panel AI Chat: Open a conversational AI panel from any monitoring dashboard (Ctrl+Shift+I / Cmd+Shift+I)
Alert Auto-Detection: Content scripts monitor dashboard DOM for active alerts and extract context
Network Interception: Captures monitoring platform API responses for enriched context
Service Registry Integration: Connects alerts to your configured services and deploy targets
Helper CLI Connection: Detects when the local Helper CLI is running for command execution

Supported Platforms

Datadog (including EU region)
Grafana Cloud and Grafana.net
New Relic
PagerDuty
Prometheus
SigNoz
Elasticsearch
AWS CloudWatch

How It Works

Extension detects alerts on your monitoring platform
Extracts alert context (error messages, metrics, timestamps, source data)
You open the AI chat side panel
Alert data is injected into the AI prompt before analysis
AI diagnoses the issue and suggests commands
Helper CLI executes approved commands locally
Results flow back to the chat for iterative diagnosis

Integrations

Integrations connect Overwatch with your existing observability tools.

Integration Types

Monitoring Platforms

Purpose: Receive alerts and incidents automatically
Method: Webhook-based or API polling
Examples: Datadog monitors, New Relic violations, Grafana alerts

Communication Channels

Purpose: Send notifications to your team
Method: Webhook delivery to external services
Examples: Slack channels, Microsoft Teams, email, SMS

Identity Providers

Purpose: Single sign-on (SSO) authentication
Method: SAML 2.0 or OAuth 2.0
Examples: Okta, Auth0, Azure AD, Google Workspace

Integration Architecture

Each monitoring platform integration includes:

Alert Parser: Platform-specific alert format parsing
API Client: Native API integration for bidirectional communication
Webhook Processor: Handles incoming webhooks from platform
Data Transformer: Normalizes data into Overwatch format

Configuration

Integrations are configured through:

Admin Dashboard: Web UI for integration setup
API Keys: Secure credential management
Webhook URLs: Unique URLs for each integration
Test Functions: Validate integration configuration

Analytics & Monitoring

Incident Analytics

Track key performance metrics:

Resolution Time: Average time to resolve by severity level
Team Performance: Individual and team effectiveness metrics
Pattern Analysis: Common incident types and frequencies
Trend Analysis: Volume and severity trends over time

Procedure Analytics

Monitor operational efficiency:

Execution Success Rates: Success vs failure rates by procedure
Execution Times: Average completion time tracking
Most Used Procedures: Identify frequently executed runbooks
Optimization Opportunities: Procedures with low success rates or long durations

LLM Cost Monitoring

Track AI feature usage and costs:

Monthly Budget Tracking: Real-time cost monitoring
Usage by Provider: Cost breakdown by model (Claude, Nova, GPT-4)
Budget Alerts: Notifications at 25%, 50%, 75%, 100% thresholds
Caching Savings: Cost reduction from semantic caching
Hard Limits: Automatic cutoff at configured spending limits

Next Steps

Now that you understand the core concepts:

Follow the Quickstart Guide — Get running in 15 minutes
AI Chat Guide — Learn conversational incident diagnosis
Chrome Extension Guide — Advanced extension features
Helper CLI Guide — Local command execution details

Need Clarification?

Glossary: See the Glossary for definitions of all platform terms
API Reference: Explore the API Documentation for technical details
Support: Contact support@overwatch-observability.com