Performance

This page addresses performance-related issues across the Overwatch platform, from AI response latency to dashboard rendering and search quality. Each section identifies the symptom, explains the underlying cause, and provides actionable steps to improve performance.
Slow AI Responses
Problem: The AI chat takes several seconds or longer to return a diagnosis.
Cause: AI response time depends on three factors: whether the semantic cache contains a similar previous query, which model tier handles the request, and network latency between your browser and the API. A cache miss on a complex query that escalates to a higher model tier (Sonnet or Opus) results in the longest response times.
Solution:
- Check model tier routing: Go to Settings > AI Configuration to see which model tier is handling your queries. Lower tiers (Nova Micro, Haiku) respond in under 2 seconds for simple queries, while higher tiers (Sonnet, Opus) may take 5-15 seconds for complex analysis
- Review semantic cache hit rate: Navigate to Analytics > AI Usage and check the cache hit ratio. A low hit rate means most queries are going to the LLM instead of being served from cache
- Reduce query complexity: Break complex questions into smaller, focused queries. A specific question like “What caused the 500 errors on the payments service in the last hour?” resolves faster than “What is wrong with production?”
- Check network latency: Run a basic connectivity check against your API endpoint to rule out network-level delays
Tip: The first query on a new topic will always be slower because there is no cached result. Subsequent similar queries benefit from semantic cache hits and return near-instantly.
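The tier and cache behavior above bound how long a response should take. Here is a minimal sketch of that rule of thumb, using the latency figures quoted above; the 0.5-second cache-hit ceiling and the helper names are assumptions for illustration:

```python
# Rough per-tier latency ceilings, taken from the guidance above. The helper
# flags a response that exceeds its tier's normal range -- a sign of a cache
# miss on a complex query, or of network-level delay worth investigating.
TIER_MAX_SECONDS = {
    "Nova Micro": 2.0,
    "Haiku": 2.0,
    "Sonnet": 15.0,
    "Opus": 15.0,
}

def expected_max_latency(tier: str, cache_hit: bool) -> float:
    """Upper-bound response time in seconds; cache hits return near-instantly."""
    return 0.5 if cache_hit else TIER_MAX_SECONDS[tier]

def is_abnormally_slow(tier: str, elapsed_s: float, cache_hit: bool = False) -> bool:
    return elapsed_s > expected_max_latency(tier, cache_hit)
```

A 4-second Haiku response is worth investigating; an 8-second Sonnet response is within normal range.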
High LLM Costs
Problem: Your organization’s AI usage costs are higher than expected or approaching the budget limit.
Cause: Every AI query that misses the semantic cache results in an LLM API call. Costs increase when queries frequently escalate to higher model tiers, when caching is underutilized, or when team members ask repetitive questions in slightly different phrasing that defeats deduplication.
Solution:
- Enable and tune semantic caching: Verify caching is enabled in Settings > AI Configuration. The cache stores vector embeddings of previous queries and returns cached responses for semantically similar questions, reducing LLM calls by 30-50%
- Configure model tier limits: Restrict the maximum model tier available to your organization. If most queries are routine triage, limiting to Tier 2 (Haiku) or Tier 3 (Sonnet) significantly reduces per-query costs
- Set budget alerts and hard limits: In Settings > Billing, configure alerts at 25%, 50%, and 75% of your monthly budget, and set a hard cutoff to prevent unexpected overages
- Monitor usage by team: Review per-user and per-team usage in Analytics > AI Usage to identify patterns and optimize usage habits
- Standardize common queries: Encourage your team to use consistent phrasing for routine questions, which improves cache hit rates
Note: You can view cost breakdowns by model tier and date range in Analytics > AI Usage. Use this to identify which query types drive the most cost.
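The levers above combine into a back-of-envelope cost model. This is a sketch with hypothetical per-query prices, not Overwatch’s actual billing rates; substitute the figures from your own Analytics > AI Usage data:

```python
# Hypothetical per-query prices in USD -- substitute figures from your own
# billing data. Only cache misses reach the LLM.
COST_PER_QUERY = {"Haiku": 0.002, "Sonnet": 0.02, "Opus": 0.10}

def monthly_llm_cost(queries: int, cache_hit_rate: float,
                     tier_mix: dict[str, float]) -> float:
    """tier_mix maps tier name -> fraction of cache misses routed to it."""
    misses = queries * (1.0 - cache_hit_rate)
    return sum(misses * frac * COST_PER_QUERY[tier]
               for tier, frac in tier_mix.items())
```

At the same query volume and tier mix, raising the cache hit rate from 20% to 50% cuts misses, and therefore cost, by 37.5%.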
Dashboard Slow to Load
Problem: The Overwatch dashboard takes a long time to render, especially the Incidents list or Analytics pages.
Cause: The browser is loading and rendering too many records at once. This typically happens when pagination is set to a high page size, when filters are not applied, or when the browser is under memory pressure from other tabs.
Solution:
- Apply filters: Use severity, status, date range, and assignee filters to reduce the number of incidents loaded at once
- Check pagination settings: Reduce the page size if it is set above the default. Loading 100+ incident cards simultaneously impacts render time
- Close unnecessary browser tabs: Heavy pages in other tabs compete for memory and CPU, slowing React rendering
- Check your browser: Open Chrome Task Manager (Shift+Esc) and check if the Overwatch tab is using excessive memory. If so, refresh the page
- Disable heavy browser extensions: Extensions that inject scripts into every page (ad blockers, Grammarly, etc.) add overhead
WebSocket Lag
Problem: Real-time updates (incident status changes, new alerts, procedure execution progress) arrive with noticeable delay.
Cause: WebSocket messages travel through the same network path as API requests. High network latency, unstable connections, or too many concurrent WebSocket connections from the same browser can introduce delays.
Solution:
- Check network quality: Run a latency test against your Overwatch API endpoint. Anything above 200ms round-trip will produce noticeable lag in real-time updates
- Reduce concurrent connections: If you have multiple Overwatch dashboard tabs open, each maintains its own WebSocket connection. Close duplicate tabs
- Check your VPN: VPN routing can add 50-200ms of latency. If real-time responsiveness is critical, test without the VPN to isolate the impact
- Monitor connection status: The connection indicator in the dashboard header shows the WebSocket state. If it flickers between connected and disconnected, the issue is likely network instability
Tip: For the best real-time experience, keep a single Overwatch dashboard tab open and use browser notifications for alerts from other tabs.
Search Returning Poor Results
Problem: Searching for incidents or solutions returns irrelevant or low-quality results.
Cause: Overwatch uses semantic (vector) search, which matches on meaning rather than exact keywords. Poor results typically indicate that incident descriptions lack sufficient detail, the service registry is incomplete, or the knowledge base has not been enriched with enough historical data.
Solution:
- Improve incident descriptions: When creating incidents, include specific error messages, affected service names, timestamps, and observable symptoms. Richer descriptions produce better vector embeddings and improve future search results
- Enrich the service registry: Go to the Chrome extension options and ensure all services are registered with correct names, GitHub repos, and deploy targets. The AI uses this context to scope its search
- Build historical context: The semantic search improves over time as more incidents and resolutions are recorded. The 3-layer search architecture (organization history, public knowledge base, LLM generation) produces the best results when the first two layers have sufficient data
- Use natural language queries: The search engine understands meaning, so phrasing like “memory leak in checkout service” will outperform keyword searches like “OOM checkout”
- Check the search scope: Verify you are searching across the correct content types (incidents, procedures, comments) in the search interface
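Under the hood, vector search ranks results by embedding similarity, which is why two differently worded descriptions of the same failure can still match. A toy illustration with made-up 3-dimensional vectors (real embeddings come from a model and have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: values near 1.0 mean near-identical meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings: two phrasings of one incident sit close together,
# while an unrelated incident sits far away.
memory_leak = [0.9, 0.1, 0.2]   # "memory leak in checkout service"
oom_checkout = [0.8, 0.2, 0.3]  # "checkout pods OOM-killed"
dns_outage = [0.1, 0.9, 0.1]    # "DNS resolution failures"
```

Richer incident descriptions matter because they give the embedding model more signal to place the incident near its true neighbors in this vector space.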
Performance Monitoring
To proactively identify performance issues before they impact your team:
- AI Usage dashboard: Track cache hit rates, average response times, and model tier distribution in Analytics > AI Usage
- Incident volume trends: Monitor incident creation rates to anticipate periods of high dashboard load
- WebSocket health: The connection indicator provides real-time feedback on connectivity quality
- Browser DevTools: Use the Network and Performance tabs in Chrome DevTools to profile slow dashboard interactions
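One simple proactive check is to track the cache hit ratio from raw hit/miss counts and flag a drop, since a falling ratio predicts both slower responses and higher cost. A sketch; the 0.3 floor is an assumption, so tune it against your own baseline in Analytics > AI Usage:

```python
def cache_health(hits: int, misses: int, floor: float = 0.3) -> tuple[float, bool]:
    """Return (hit_ratio, degraded); degraded is True when the ratio drops
    below the floor, signaling more queries are reaching the LLM."""
    total = hits + misses
    ratio = hits / total if total else 0.0
    return ratio, ratio < floor
```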
Still Stuck?
- Check the Common Issues page for general platform problems
- Review the API Errors page if slow responses are accompanied by error codes
- Contact support at support@overwatch-observability.com with your browser version, network environment details, and specific performance observations