Platform Overview

Version 1.0 | Production Ready | Last Updated: October 2025

Welcome to Overwatch, an AI-powered incident resolution platform designed to help DevOps teams resolve incidents faster and more reliably through intelligent, turn-by-turn guidance.

Overwatch is a production-ready platform that provides:

  • AI-Powered Resolution: Semantic search and intelligent solution suggestions
  • Turn-by-Turn Guidance: Step-by-step resolution procedures with execution tracking
  • Context Extraction: Automatic context gathering from monitoring platforms
  • Real-Time Collaboration: WebSocket-powered live team collaboration
  • Enterprise Integration: Native support for Datadog, New Relic, Grafana, PagerDuty, and more

Key Features:

  • Semantic Search: AI-powered vector search using Weaviate for finding relevant solutions
  • LLM Layer 3: AWS Bedrock integration for AI-generated solutions when needed
  • Context Extraction: Automatic context gathering from monitoring platforms via Chrome extension
  • Learning System: Platform learns from successful incident resolutions
  • Real-Time Updates: WebSocket-powered live collaboration with automatic sync
  • Multi-Tenant Architecture: Secure organization isolation with comprehensive RBAC
  • Role-Based Access: Granular permissions for Engineers, Managers, Admins, and Viewers
  • Audit Trail: Complete activity logging for compliance and post-incident analysis

Monitoring Platforms:

  • Datadog - Complete monitors API integration
  • New Relic - NerdGraph GraphQL client with NRQL support
  • Grafana - LogQL/PromQL parsers with alert rules
  • PagerDuty - Incident webhooks and escalation policies
  • Prometheus - PromQL queries and Alertmanager integration
  • Elasticsearch - Watcher API and alert rule management
  • SigNoz - OpenTelemetry-based observability
  • Slack - Webhook processing and notifications

Browser Integration:

  • Chrome Extension - Browser overlay for context extraction from observability platforms
  • On-Demand Reporting - Report problems directly from monitoring dashboards
  • Platform Detection - Automatic detection of alerts across multiple platforms

API-First Design:

  • Comprehensive REST API for custom integrations
  • WebSocket API for real-time features
  • JavaScript and Python SDKs (coming soon)
  • Webhook support for external notifications

Enterprise Features:

  • SSO Authentication: JWT-based authentication with API key support
  • Advanced Analytics: Performance metrics, incident insights, and LLM cost monitoring
  • Resource Management: Usage tracking and quota management by subscription tier
  • High Availability: AWS-optimized architecture supporting 100+ concurrent users
  • Security: SOC 2 ready, GDPR compliant, comprehensive audit logging

Use Cases:

  • Incident Response: Streamlined incident management with AI assistance
  • Knowledge Sharing: Centralized runbook library for consistent procedures
  • Performance Tracking: Analytics to improve MTTR and resolution success rates
  • Team Coordination: Real-time collaboration during critical incident response
  • Proactive Monitoring: Integration with observability platforms for early detection
  • Automated Procedures: Executable runbooks with approval workflows and gates
  • Post-Incident Analysis: Comprehensive incident analysis and learning loops
  • Escalation Management: Structured escalation paths and multi-channel notifications
  • Team Performance: Analytics dashboard for team effectiveness and MTTR trends
  • Process Standardization: Standardized procedures across teams and services
  • Resource Planning: Usage metrics for capacity planning and cost management
  • Compliance Reporting: Audit trails for regulatory requirements

Architecture Layers:

  1. Presentation Layer: Next.js 15 frontend with server-side rendering
  2. Application Layer: FastAPI async backend with business logic services
  3. Data Layer: PostgreSQL (relational), Redis (cache), Weaviate (vector search)

Current (Phase 1): Layer 2 (Weaviate public database)
Phase 2 (ready): Layer 1 → Layer 2 → Layer 3 cascade

  • Layer 1 (Customer): 0.8-1.0 confidence - Organization-specific historical solutions
  • Layer 2 (Public): 0.7-0.9 confidence - Community knowledge base from Weaviate
  • Layer 3 (LLM): 0.6-0.8 confidence - AI-generated solutions via AWS Bedrock
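
The layer ordering and confidence ranges above suggest a fallback search. The sketch below is one way to picture that cascade, not Overwatch's actual implementation. It assumes each layer is a callable that returns an optional suggestion, and that the lower bound of each layer's range acts as an acceptance threshold. The names `Suggestion`, `cascade`, `MIN_CONFIDENCE`, and the stub layer functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Suggestion:
    layer: str
    confidence: float
    text: str


# Lower confidence bound per layer, taken from the ranges listed above.
# Using it as an acceptance threshold is an assumption of this sketch.
MIN_CONFIDENCE = {"customer": 0.8, "public": 0.7, "llm": 0.6}

# A layer takes a query string and returns a suggestion or None.
Layer = Callable[[str], Optional[Suggestion]]


def cascade(query: str, layers: Sequence[Tuple[str, Layer]]) -> Optional[Suggestion]:
    """Try each layer in order and return the first sufficiently confident hit."""
    for name, search in layers:
        hit = search(query)
        if hit is not None and hit.confidence >= MIN_CONFIDENCE[name]:
            return hit
    return None


# Stub layers standing in for the real searches (hypothetical data).
def customer_layer(query: str) -> Optional[Suggestion]:
    return None  # no organization-specific history for this query


def public_layer(query: str) -> Optional[Suggestion]:
    return Suggestion("public", 0.82, "Restart the ingest worker")


def llm_layer(query: str) -> Optional[Suggestion]:
    return Suggestion("llm", 0.70, "LLM-generated procedure")


LAYERS = [("customer", customer_layer), ("public", public_layer), ("llm", llm_layer)]
result = cascade("disk full on ingest node", LAYERS)
print(result.layer, result.confidence)  # public 0.82
```

Because the customer layer returns nothing here, the query falls through to the public layer, and the LLM layer is never called.
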

Multi-Tenant Data Isolation:

  • All database models have an organization_id field for data isolation
  • SQLAlchemy queries automatically filtered by organization context
  • RBAC enforced at service layer before database access
  • Zero cross-organization data leakage
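
The isolation rules above (an `organization_id` on every model, queries filtered by organization, and RBAC checked in the service layer before data access) can be illustrated with a small in-memory sketch. It deliberately uses plain Python instead of SQLAlchemy so it runs without dependencies. The `PERMISSIONS` table, the permission strings, and the `IncidentService` class are hypothetical and only the four role names come from this page.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Incident:
    id: int
    organization_id: int
    title: str


@dataclass(frozen=True)
class User:
    name: str
    organization_id: int
    role: str  # "viewer", "engineer", "manager", or "admin"


# Hypothetical permission table; the real role-to-permission mapping is not documented here.
PERMISSIONS = {
    "viewer": {"incident:read"},
    "engineer": {"incident:read", "incident:write"},
    "manager": {"incident:read", "incident:write"},
    "admin": {"incident:read", "incident:write", "org:manage"},
}


class IncidentService:
    def __init__(self, rows: Iterable[Incident]) -> None:
        self._rows = list(rows)  # stand-in for a database table

    def list_incidents(self, user: User) -> List[Incident]:
        # RBAC is checked first, before any data is touched.
        if "incident:read" not in PERMISSIONS.get(user.role, set()):
            raise PermissionError(f"role {user.role!r} may not read incidents")
        # Every read is scoped to the caller's organization.
        return [r for r in self._rows if r.organization_id == user.organization_id]


service = IncidentService([
    Incident(1, 10, "API latency"),
    Incident(2, 20, "Disk full"),
])
alice = User("alice", 10, "engineer")
print([i.title for i in service.list_incidents(alice)])  # ['API latency']
```

Alice only sees the incident that belongs to organization 10, and a user with an unknown role is rejected before any rows are read.
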

Browser Requirements:

  • Modern browsers with JavaScript enabled (Chrome, Firefox, Safari, Edge)
  • Chrome/Chromium-based browsers for extension features
  • WebSocket support for real-time features

Network Requirements:

  • HTTPS endpoints with valid authentication
  • WebSocket connections for live collaboration
  • Outbound HTTPS for integration webhooks

Security and Compliance:

  • Data encryption in transit (TLS 1.2+) and at rest
  • SOC 2 ready security controls
  • GDPR compliant data privacy controls
  • Comprehensive audit logging for compliance
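
For client code that calls the platform over HTTPS, the "TLS 1.2+" requirement listed above can also be enforced on the client side. The following is a minimal sketch using Python's standard `ssl` module. The helper name `make_client_context` is hypothetical, and how Overwatch itself configures TLS is not described on this page.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything older than TLS 1.2."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Explicitly pin the minimum protocol version to TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_client_context()
print(ctx.minimum_version.name)  # TLSv1_2
```

The context can then be passed to standard-library clients, for example `urllib.request.urlopen(url, context=ctx)`.
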

Ready to get started with Overwatch? Here are your next steps:

For Users:

  1. Quickstart Guide - Get up and running in 15 minutes
  2. Key Concepts - Understand core platform concepts
  3. User Guide - Learn day-to-day operations

For Administrators:

  1. Admin Guide - Configure your organization
  2. User Management - Invite team members and assign roles
  3. Integrations - Connect your observability platforms

For Developers:

  1. API Documentation - Explore the REST API
  2. WebSocket API - Implement real-time features
  3. SDKs - JavaScript and Python SDKs (coming soon)

Getting Help:

  • In-App Help: Access help documentation directly in the platform (press the ? key)
  • API Documentation: Interactive API docs at /docs endpoint
  • Troubleshooting: See Troubleshooting Guide for common issues
  • Support: Contact your system administrator for organization-specific help

Version 1.0 (October 2025) - Production Release

  • ✅ Complete full-stack application with all Phase 1 features
  • ✅ Multi-tenant architecture with enterprise RBAC
  • ✅ AI-powered vector search and semantic incident resolution
  • ✅ LLM Layer 3 with AWS Bedrock (Claude Sonnet, Nova Lite, GPT-4)
  • ✅ Real-time collaboration via WebSocket
  • ✅ Complete integrations with 8 major observability platforms
  • ✅ Chrome extension for context extraction and on-demand reporting
  • ✅ Comprehensive analytics, dashboards, and LLM cost monitoring
  • ✅ AWS-optimized production architecture

Planned for Phase 2:

  • Customer-specific vector database (Layer 1 search)
  • Community-driven knowledge sharing with voting and reputation
  • Enhanced LLM analytics and cost optimization
  • Extended integration ecosystem
  • Advanced analytics and predictive incident detection

For technical support or feature requests, please contact your system administrator or refer to the Troubleshooting Guide.