Platform Overview

Version 2.0 | Last Updated: February 2026

Overwatch is an AI-powered incident resolution platform that helps DevOps teams diagnose and resolve incidents through conversational AI, directly integrated into the monitoring tools you already use.

Overwatch combines three core components:

  1. Chrome Extension — Detects alerts on your monitoring dashboards and opens an AI chat panel for real-time diagnosis
  2. AI Chat — Conversational interface powered by AWS Bedrock with 5-tier model routing for cost-optimized incident analysis
  3. Helper CLI — Optional local agent that executes approved diagnostic commands on your infrastructure

Together, these provide a complete loop: detect an alert, diagnose it with AI, execute commands to gather data, and iterate until the problem is resolved.

AI Chat:

  • Conversational Diagnosis: Describe a problem in plain language and get step-by-step guidance
  • Alert Context Injection: Alert data is automatically fed to the AI before analysis begins
  • Multi-Turn Sessions: Linked to specific incidents for full conversation history
  • Command Suggestions: AI suggests diagnostic commands that the Helper CLI can execute locally

Chrome Extension:

  • Alert Auto-Detection: Content scripts watch your monitoring dashboards for active alerts
  • Side-Panel AI Chat: Open Overwatch’s AI chat directly from any monitoring platform (Ctrl+Shift+I / Cmd+Shift+I)
  • Network Interception: Captures monitoring platform API responses for enriched context
  • 8+ Platform Support: Datadog, Grafana, New Relic, PagerDuty, Prometheus, SigNoz, Elasticsearch, CloudWatch

Helper CLI:

  • Local Command Execution: Run kubectl, aws, docker, gh, and other CLI tools with AI guidance
  • Environment Auto-Detection: Discovers your Kubernetes context, AWS profile, Docker setup, and installed tools
  • Security Controls: Allowlist-based command validation, rate limiting (15 cmd/min), and audit logging
  • Cross-Platform: macOS (ARM/x86), Linux (ARM/x86), Windows

Service Registry:

  • Alert-to-Service Mapping: Map monitoring alerts to GitHub repos and deploy targets
  • Multi-Cloud Support: Railway, AWS ECS, Kubernetes, GCP Cloud Run, Azure, Vercel, Fly.io
  • AI Context Enrichment: Service registry data is injected into chat prompts so the AI knows your infrastructure

LLM Cost Management:

  • 5-Tier Model Routing: Nova Micro → Haiku → Sonnet → Opus → Weaviate fallback
  • Semantic Caching: Reduces AI costs 30-50% by caching similar queries
  • Organization Quotas: Per-org budget controls with admin overrides
  • Per-Message Tracking: Decimal-precision cost tracking for every AI interaction
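
The tiered routing and semantic caching described above can be sketched as follows. This is an illustrative sketch only: the tier order comes from this page, but the complexity heuristic, the token-overlap similarity (a stand-in for real embedding similarity), and the 0.8 threshold are invented for the example.

```python
# Illustrative sketch: tier names are from the docs; the scoring heuristic,
# cache threshold, and all function names are assumptions for this example.
from dataclasses import dataclass, field

TIERS = ["nova-micro", "haiku", "sonnet", "opus"]  # Weaviate KB is the final fallback

def pick_tier(query: str) -> str:
    """Toy complexity heuristic: longer, multi-question prompts escalate tiers."""
    score = len(query.split()) // 40 + query.count("?")
    return TIERS[min(score, len(TIERS) - 1)]

def similarity(a: str, b: str) -> float:
    """Stand-in for real embedding similarity: token Jaccard overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

@dataclass
class SemanticCache:
    threshold: float = 0.8
    entries: list = field(default_factory=list)  # (query, answer) pairs

    def get(self, query):
        for cached_q, answer in self.entries:
            if similarity(query, cached_q) >= self.threshold:
                return answer  # cache hit: skip the model call entirely
        return None

    def put(self, query, answer):
        self.entries.append((query, answer))

cache = SemanticCache()
cache.put("why is the api pod crashlooping", "Check OOMKilled events ...")
hit = cache.get("why is the api pod crashlooping again")  # near-duplicate query hits the cache
```

A near-duplicate query returns the cached answer without a model call, which is where the caching savings come from.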

Monitoring Platforms:

  • Datadog, New Relic, Grafana, PagerDuty
  • Prometheus, Elasticsearch, SigNoz, AWS CloudWatch

Communication: Slack webhooks and notifications
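
As a sketch of the Slack integration, the snippet below builds a notification payload in Slack's standard incoming-webhook format. The alert fields, function names, and URLs are illustrative, not Overwatch's actual schema.

```python
# Sketch of an outgoing Slack notification. The payload shape ("text" plus
# "blocks") follows Slack's incoming-webhook format; everything else here
# (alert fields, helper names, URLs) is invented for illustration.
import json
import urllib.request

def build_alert_message(alert_name: str, severity: str, dashboard_url: str) -> dict:
    return {
        "text": f":rotating_light: [{severity.upper()}] {alert_name}",
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*{alert_name}*\nSeverity: `{severity}`\n<{dashboard_url}|Open dashboard>",
                },
            },
        ],
    }

def post_to_slack(webhook_url: str, payload: dict) -> None:
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # network call; not exercised here

payload = build_alert_message("HighErrorRate", "critical", "https://grafana.example.com/d/abc")
```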

API-First Design:

  • REST API with interactive Swagger documentation
  • WebSocket API for real-time collaboration
  • Webhook support for external notifications

Collaboration:

  • Real-Time Updates: WebSocket-powered multi-user incident rooms
  • Multi-Tenant Architecture: Organization-level data isolation with RBAC
  • Role-Based Access: Engineer, Manager, Admin, and Viewer roles

Procedures & Analytics:

  • Procedure Management: Executable runbooks with step tracking and approval gates
  • Incident Analytics: MTTR tracking, severity trends, team performance
  • LLM Cost Monitoring: Per-model cost breakdown, caching savings, budget alerts
  • Procedure Analytics: Execution success rates and optimization insights
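
A minimal sketch of organization scoping and role checks, using an in-memory store. Overwatch enforces this in its service layer against PostgreSQL; the class and function names below are illustrative, and only the four role names come from this page.

```python
# Illustrative sketch of org-scoped queries plus a role hierarchy. The role
# names are from the docs; the store, data, and helper names are assumptions.
from dataclasses import dataclass

ROLES = {"viewer": 0, "engineer": 1, "manager": 2, "admin": 3}

@dataclass
class User:
    org_id: str
    role: str

INCIDENTS = [
    {"id": 1, "org_id": "acme", "title": "API latency spike"},
    {"id": 2, "org_id": "globex", "title": "Disk full on db-1"},
]

def list_incidents(user: User):
    """Every query is filtered by the caller's organization_id."""
    return [i for i in INCIDENTS if i["org_id"] == user.org_id]

def require_role(user: User, minimum: str) -> None:
    """Raise unless the user's role meets the minimum for this action."""
    if ROLES[user.role] < ROLES[minimum]:
        raise PermissionError(f"{user.role} cannot perform {minimum}-level actions")

alice = User(org_id="acme", role="engineer")
visible = list_incidents(alice)   # only acme incidents, never globex's
require_role(alice, "engineer")   # passes; a viewer would be rejected
```

Because filtering happens in one place rather than per-endpoint, cross-organization leaks require an explicit bypass rather than an accidental omission.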
Architecture:

  1. Frontend: Next.js 15 dashboard with React 18 and TypeScript
  2. Backend: FastAPI async API with service-layer architecture
  3. Data Layer: PostgreSQL (relational), Redis (cache), Weaviate (vector search)
  4. Chrome Extension: Manifest V3 with side-panel chat interface
  5. Helper CLI: Rust-based local command execution agent
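
The Helper CLI itself is a Rust binary; the Python sketch below illustrates its documented security controls (allowlist-based validation and the 15-commands-per-minute rate limit). The allowlist contents and function names are assumptions for the example.

```python
# Python sketch of the Helper CLI's documented controls: allowlist validation
# and a sliding-window rate limit of 15 commands per minute. The allowlist
# entries and helper names here are illustrative.
import shlex
import time
from collections import deque

ALLOWED_BINARIES = {"kubectl", "aws", "docker", "gh"}  # example policy, not the shipped default

class RateLimiter:
    def __init__(self, max_calls: int = 15, window_s: float = 60.0):
        self.max_calls, self.window_s = max_calls, window_s
        self.calls = deque()  # timestamps of recent commands

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        while self.calls and now - self.calls[0] > self.window_s:
            self.calls.popleft()  # drop timestamps outside the window
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True

def validate(command: str, limiter: RateLimiter) -> bool:
    """Reject any command whose binary is not allowlisted or that exceeds the rate limit."""
    argv = shlex.split(command)
    return bool(argv) and argv[0] in ALLOWED_BINARIES and limiter.allow()

limiter = RateLimiter()
validate("kubectl get pods -n prod", limiter)  # allowlisted binary, under the limit
validate("rm -rf /", limiter)                  # rejected: 'rm' is not allowlisted
```

Validating the binary name before the rate-limit check means rejected commands don't consume quota, which is one reasonable design choice among several.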

The platform uses a progressive search strategy:

  • Layer 1 (Customer): Organization-specific historical solutions
  • Layer 2 (Public): Community knowledge base via Weaviate vector database
  • Layer 3 (LLM): AI-generated solutions via AWS Bedrock with semantic caching

Multi-Tenant Isolation:

  • All database models scoped by organization_id
  • Queries automatically filtered by organization context
  • RBAC enforced at the service layer
  • Zero cross-organization data visibility
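
The three-layer progressive search described above can be sketched with stub lookups standing in for the real stores (org-scoped PostgreSQL history, the Weaviate community knowledge base, and Bedrock). All names and data below are illustrative.

```python
# Sketch of the three-layer progressive search: try the cheapest layer first
# and fall through. Dicts stand in for the real stores; all names are invented.
CUSTOMER_SOLUTIONS = {("acme", "api latency spike"): "Scale the api deployment to 5 replicas."}
PUBLIC_KB = {"api latency spike": "Check upstream DB connection pool saturation."}

def search_customer(org_id, query):
    return CUSTOMER_SOLUTIONS.get((org_id, query))  # Layer 1: org-specific history

def search_public(query):
    return PUBLIC_KB.get(query)                     # Layer 2: community knowledge base

def ask_llm(query):
    return f"LLM analysis of: {query}"              # Layer 3: generated answer (cached)

def resolve(org_id: str, query: str):
    """Return (layer, answer) from the cheapest layer that can answer."""
    for layer, lookup in (
        ("customer", lambda: search_customer(org_id, query)),
        ("public", lambda: search_public(query)),
        ("llm", lambda: ask_llm(query)),
    ):
        answer = lookup()
        if answer is not None:
            return layer, answer
    raise RuntimeError("unreachable: the LLM layer always answers")

layer, answer = resolve("acme", "api latency spike")  # answered from org history
```

Note how an organization that has seen the incident before never pays for an LLM call, while a new organization falls through to the public layer or the model.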

Key Capabilities:

  • Real-time alert detection and AI-powered diagnosis
  • Helper CLI for hands-on infrastructure debugging
  • Procedure-guided resolution workflows
  • Service registry maps alerts to infrastructure components
  • Blast radius analysis shows incident impact scope
  • Analytics for MTTR trends and team performance
  • LLM cost management and quota controls
  • Standardized procedures across teams
  • Compliance-ready audit trails

Getting Started:

  1. Quickstart Guide — Get running in 15 minutes
  2. Key Concepts — Core platform terminology
  3. AI Chat Guide — Learn conversational incident diagnosis

Administration:

  1. Organization Setup — Configure your organization
  2. Service Registry — Map alerts to infrastructure
  3. LLM Cost Management — Control AI spending

Developers:

  1. API Documentation — Explore the REST API
  2. Webhooks — Receive events from Overwatch
  3. API Examples — Code samples in multiple languages

For support, contact support@overwatch-observability.com or see the Troubleshooting Guide.