Platform Overview

Version 2.0 | Last Updated: February 2026

Overwatch is an AI-powered incident resolution platform that helps DevOps teams diagnose and resolve incidents through conversational AI, directly integrated into the monitoring tools you already use.

Overwatch combines three core components:

  1. Chrome Extension — Detects alerts on your monitoring dashboards and opens an AI chat panel for real-time diagnosis
  2. AI Chat — Conversational interface powered by AWS Bedrock with 5-tier model routing for cost-optimized incident analysis
  3. Helper CLI — Optional local agent that executes approved diagnostic commands on your infrastructure

Together, these provide a complete loop: detect an alert, diagnose it with AI, execute commands to gather data, and iterate until the problem is resolved.

AI Chat:

  • Conversational Diagnosis: Describe a problem in plain language and get step-by-step guidance
  • Alert Context Injection: Alert data is automatically fed to the AI before analysis begins
  • Multi-Turn Sessions: Linked to specific incidents for full conversation history
  • Command Suggestions: AI suggests diagnostic commands that the Helper CLI can execute locally

Chrome Extension:

  • Alert Auto-Detection: Content scripts watch your monitoring dashboards for active alerts
  • Side-Panel AI Chat: Open Overwatch’s AI chat directly from any monitoring platform (Ctrl+Shift+I / Cmd+Shift+I)
  • Network Interception: Captures monitoring platform API responses for enriched context
  • 8+ Platform Support: Datadog, Grafana, New Relic, PagerDuty, Prometheus, SigNoz, Elasticsearch, CloudWatch

Helper CLI:

  • Local Command Execution: Run kubectl, aws, docker, gh, and other CLI tools with AI guidance
  • Environment Auto-Detection: Discovers your Kubernetes context, AWS profile, Docker setup, and installed tools
  • Security Controls: Allowlist-based command validation, rate limiting (15 cmd/min), and audit logging
  • Cross-Platform: macOS (ARM/x86), Linux (ARM/x86), Windows

Service Registry:

  • Alert-to-Service Mapping: Map monitoring alerts to GitHub repos and deploy targets
  • Multi-Cloud Support: Railway, AWS ECS, Kubernetes, GCP Cloud Run, Azure, Vercel, Fly.io
  • AI Context Enrichment: Service registry data is injected into chat prompts so the AI knows your infrastructure

LLM Cost Management:

  • 5-Tier Model Routing: Nova Micro → Haiku → Sonnet → Opus → Weaviate fallback
  • Semantic Caching: Reduces AI costs 30-50% by caching similar queries
  • Organization Quotas: Per-org budget controls with admin overrides
  • Per-Message Tracking: Decimal-precision cost tracking for every AI interaction
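
The tiered routing and semantic caching described above can be sketched as follows. This is an illustrative sketch only: the tier order comes from this page, but the complexity heuristic, the token-overlap similarity (a stand-in for real embedding similarity), and the 0.8 threshold are invented for the example.

```python
# Illustrative sketch: tier names are from the docs; the scoring heuristic,
# cache threshold, and all function names are assumptions for this example.
from dataclasses import dataclass, field

TIERS = ["nova-micro", "haiku", "sonnet", "opus"]  # Weaviate KB is the final fallback

def pick_tier(query: str) -> str:
    """Toy complexity heuristic: longer, multi-question prompts escalate tiers."""
    score = len(query.split()) // 40 + query.count("?")
    return TIERS[min(score, len(TIERS) - 1)]

def similarity(a: str, b: str) -> float:
    """Stand-in for real embedding similarity: token Jaccard overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

@dataclass
class SemanticCache:
    threshold: float = 0.8
    entries: list = field(default_factory=list)  # (query, answer) pairs

    def get(self, query):
        for cached_q, answer in self.entries:
            if similarity(query, cached_q) >= self.threshold:
                return answer  # cache hit: skip the model call entirely
        return None

    def put(self, query, answer):
        self.entries.append((query, answer))

cache = SemanticCache()
cache.put("why is the api pod crashlooping", "Check OOMKilled events ...")
hit = cache.get("why is the api pod crashlooping again")  # near-duplicate query hits the cache
```

A near-duplicate query returns the cached answer without a model call, which is where the caching savings come from.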

Monitoring Platforms:

  • Datadog, New Relic, Grafana, PagerDuty
  • Prometheus, Elasticsearch, SigNoz, AWS CloudWatch

Communication: Slack webhooks and notifications
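
As a sketch of the Slack integration, the snippet below builds a notification payload in Slack's standard incoming-webhook format. The alert fields, function names, and URLs are illustrative, not Overwatch's actual schema.

```python
# Sketch of an outgoing Slack notification. The payload shape ("text" plus
# "blocks") follows Slack's incoming-webhook format; everything else here
# (alert fields, helper names, URLs) is invented for illustration.
import json
import urllib.request

def build_alert_message(alert_name: str, severity: str, dashboard_url: str) -> dict:
    return {
        "text": f":rotating_light: [{severity.upper()}] {alert_name}",
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*{alert_name}*\nSeverity: `{severity}`\n<{dashboard_url}|Open dashboard>",
                },
            },
        ],
    }

def post_to_slack(webhook_url: str, payload: dict) -> None:
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # network call; not exercised here

payload = build_alert_message("HighErrorRate", "critical", "https://grafana.example.com/d/abc")
```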

API-First Design:

  • REST API with interactive Swagger documentation
  • WebSocket API for real-time collaboration
  • Webhook support for external notifications

Collaboration:

  • Real-Time Updates: WebSocket-powered multi-user incident rooms
  • Multi-Tenant Architecture: Organization-level data isolation with RBAC
  • Role-Based Access: Engineer, Manager, Admin, and Viewer roles

Procedures & Analytics:

  • Procedure Management: Executable runbooks with step tracking and approval gates
  • Incident Analytics: MTTR tracking, severity trends, team performance
  • LLM Cost Monitoring: Per-model cost breakdown, caching savings, budget alerts
  • Procedure Analytics: Execution success rates and optimization insights
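
A minimal sketch of organization scoping and role checks, using an in-memory store. Overwatch enforces this in its service layer against PostgreSQL; the class and function names below are illustrative, and only the four role names come from this page.

```python
# Illustrative sketch of org-scoped queries plus a role hierarchy. The role
# names are from the docs; the store, data, and helper names are assumptions.
from dataclasses import dataclass

ROLES = {"viewer": 0, "engineer": 1, "manager": 2, "admin": 3}

@dataclass
class User:
    org_id: str
    role: str

INCIDENTS = [
    {"id": 1, "org_id": "acme", "title": "API latency spike"},
    {"id": 2, "org_id": "globex", "title": "Disk full on db-1"},
]

def list_incidents(user: User):
    """Every query is filtered by the caller's organization_id."""
    return [i for i in INCIDENTS if i["org_id"] == user.org_id]

def require_role(user: User, minimum: str) -> None:
    """Raise unless the user's role meets the minimum for this action."""
    if ROLES[user.role] < ROLES[minimum]:
        raise PermissionError(f"{user.role} cannot perform {minimum}-level actions")

alice = User(org_id="acme", role="engineer")
visible = list_incidents(alice)   # only acme incidents, never globex's
require_role(alice, "engineer")   # passes; a viewer would be rejected
```

Because filtering happens in one place rather than per-endpoint, cross-organization leaks require an explicit bypass rather than an accidental omission.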
Architecture:

  1. Frontend: Next.js 15 dashboard with React 18 and TypeScript
  2. Backend: FastAPI async API with service-layer architecture
  3. Data Layer: PostgreSQL (relational), Redis (cache), Weaviate (vector search)
  4. Chrome Extension: Manifest V3 with side-panel chat interface
  5. Helper CLI: Rust-based local command execution agent
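
The Helper CLI itself is a Rust binary; the Python sketch below illustrates its documented security controls (allowlist-based validation and the 15-commands-per-minute rate limit). The allowlist contents and function names are assumptions for the example.

```python
# Python sketch of the Helper CLI's documented controls: allowlist validation
# and a sliding-window rate limit of 15 commands per minute. The allowlist
# entries and helper names here are illustrative.
import shlex
import time
from collections import deque

ALLOWED_BINARIES = {"kubectl", "aws", "docker", "gh"}  # example policy, not the shipped default

class RateLimiter:
    def __init__(self, max_calls: int = 15, window_s: float = 60.0):
        self.max_calls, self.window_s = max_calls, window_s
        self.calls = deque()  # timestamps of recent commands

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        while self.calls and now - self.calls[0] > self.window_s:
            self.calls.popleft()  # drop timestamps outside the window
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True

def validate(command: str, limiter: RateLimiter) -> bool:
    """Reject any command whose binary is not allowlisted or that exceeds the rate limit."""
    argv = shlex.split(command)
    return bool(argv) and argv[0] in ALLOWED_BINARIES and limiter.allow()

limiter = RateLimiter()
validate("kubectl get pods -n prod", limiter)  # allowlisted binary, under the limit
validate("rm -rf /", limiter)                  # rejected: 'rm' is not allowlisted
```

Validating the binary name before the rate-limit check means rejected commands don't consume quota, which is one reasonable design choice among several.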

The platform uses a progressive search strategy:

  • Layer 1 (Customer): Organization-specific historical solutions
  • Layer 2 (Public): Community knowledge base via Weaviate vector database
  • Layer 3 (LLM): AI-generated solutions via AWS Bedrock with semantic caching

Multi-Tenant Isolation:

  • All database models scoped by organization_id
  • Queries automatically filtered by organization context
  • RBAC enforced at the service layer
  • Zero cross-organization data visibility
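
The three-layer progressive search described above can be sketched with stub lookups standing in for the real stores (org-scoped PostgreSQL history, the Weaviate community knowledge base, and Bedrock). All names and data below are illustrative.

```python
# Sketch of the three-layer progressive search: try the cheapest layer first
# and fall through. Dicts stand in for the real stores; all names are invented.
CUSTOMER_SOLUTIONS = {("acme", "api latency spike"): "Scale the api deployment to 5 replicas."}
PUBLIC_KB = {"api latency spike": "Check upstream DB connection pool saturation."}

def search_customer(org_id, query):
    return CUSTOMER_SOLUTIONS.get((org_id, query))  # Layer 1: org-specific history

def search_public(query):
    return PUBLIC_KB.get(query)                     # Layer 2: community knowledge base

def ask_llm(query):
    return f"LLM analysis of: {query}"              # Layer 3: generated answer (cached)

def resolve(org_id: str, query: str):
    """Return (layer, answer) from the cheapest layer that can answer."""
    for layer, lookup in (
        ("customer", lambda: search_customer(org_id, query)),
        ("public", lambda: search_public(query)),
        ("llm", lambda: ask_llm(query)),
    ):
        answer = lookup()
        if answer is not None:
            return layer, answer
    raise RuntimeError("unreachable: the LLM layer always answers")

layer, answer = resolve("acme", "api latency spike")  # answered from org history
```

Note how an organization that has seen the incident before never pays for an LLM call, while a new organization falls through to the public layer or the model.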

Key Capabilities:

  • Real-time alert detection and AI-powered diagnosis
  • Helper CLI for hands-on infrastructure debugging
  • Procedure-guided resolution workflows
  • Service registry maps alerts to infrastructure components
  • Blast radius analysis shows incident impact scope
  • Analytics for MTTR trends and team performance
  • LLM cost management and quota controls
  • Standardized procedures across teams
  • Compliance-ready audit trails

Getting Started:

  1. Quickstart Guide — Get running in 15 minutes
  2. Key Concepts — Core platform terminology
  3. AI Chat Guide — Learn conversational incident diagnosis

Administration:

  1. Organization Setup — Configure your organization
  2. Service Registry — Map alerts to infrastructure
  3. LLM Cost Management — Control AI spending

Developers:

  1. API Documentation — Explore the REST API
  2. Webhooks — Receive events from Overwatch
  3. API Examples — Code samples in multiple languages

For support, contact support@overwatch-observability.com or see the Troubleshooting Guide.