Performance

This page addresses performance-related issues across the Overwatch platform, from AI response latency to dashboard rendering and search quality. Each section identifies the symptom, explains the underlying cause, and provides actionable steps to improve performance.
Slow AI Responses
Problem: The AI chat takes several seconds or longer to return a diagnosis.
Cause: AI response time depends on three factors: whether the semantic cache contains a similar previous query, which model tier handles the request, and network latency between your browser and the API. A cache miss on a complex query that escalates to a higher model tier (Sonnet or Opus) results in the longest response times.
Solution:
- Check model tier routing: Go to Settings > AI Configuration to see which model tier is handling your queries. Lower tiers (Nova Micro, Haiku) respond in under 2 seconds for simple queries, while higher tiers (Sonnet, Opus) may take 5-15 seconds for complex analysis
- Review semantic cache hit rate: Navigate to Analytics > AI Usage and check the cache hit ratio. A low hit rate means most queries are going to the LLM instead of being served from cache
- Reduce query complexity: Break complex questions into smaller, focused queries. A specific question like “What caused the 500 errors on the payments service in the last hour?” resolves faster than “What is wrong with production?”
- Check network latency: Run a basic connectivity check against your API endpoint to rule out network-level delays
Tip: The first query on a new topic will always be slower because there is no cached result. Subsequent similar queries benefit from semantic cache hits and return near-instantly.
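The tier and cache behavior above bound how long a response should take. Here is a minimal sketch of that rule of thumb, using the latency figures quoted above; the 0.5-second cache-hit ceiling and the helper names are assumptions for illustration:

```python
# Rough per-tier latency ceilings, taken from the guidance above. The helper
# flags a response that exceeds its tier's normal range -- a sign of a cache
# miss on a complex query, or of network-level delay worth investigating.
TIER_MAX_SECONDS = {
    "Nova Micro": 2.0,
    "Haiku": 2.0,
    "Sonnet": 15.0,
    "Opus": 15.0,
}

def expected_max_latency(tier: str, cache_hit: bool) -> float:
    """Upper-bound response time in seconds; cache hits return near-instantly."""
    return 0.5 if cache_hit else TIER_MAX_SECONDS[tier]

def is_abnormally_slow(tier: str, elapsed_s: float, cache_hit: bool = False) -> bool:
    return elapsed_s > expected_max_latency(tier, cache_hit)
```

A 4-second Haiku response is worth investigating; an 8-second Sonnet response is within normal range.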
High LLM Costs
Problem: Your organization’s AI usage costs are higher than expected or approaching the budget limit.
Cause: Every AI query that misses the semantic cache results in an LLM API call. Costs increase when queries frequently escalate to higher model tiers, when caching is underutilized, or when team members ask repetitive questions in slightly different phrasing that defeats deduplication.
Solution:
- Enable and tune semantic caching: Verify caching is enabled in Settings > AI Configuration. The cache stores vector embeddings of previous queries and returns cached responses for semantically similar questions, reducing LLM calls by 30-50%
- Configure model tier limits: Restrict the maximum model tier available to your organization. If most queries are routine triage, limiting to Tier 2 (Haiku) or Tier 3 (Sonnet) significantly reduces per-query costs
- Set budget alerts and hard limits: In Settings > Billing, configure alerts at 25%, 50%, and 75% of your monthly budget, and set a hard cutoff to prevent unexpected overages
- Monitor usage by team: Review per-user and per-team usage in Analytics > AI Usage to identify patterns and optimize usage habits
- Standardize common queries: Encourage your team to use consistent phrasing for routine questions, which improves cache hit rates
Note: You can view cost breakdowns by model tier and date range in Analytics > AI Usage. Use this to identify which query types drive the most cost.
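The levers above combine into a back-of-envelope cost model. This is a sketch with hypothetical per-query prices, not Overwatch’s actual billing rates; substitute the figures from your own Analytics > AI Usage data:

```python
# Hypothetical per-query prices in USD -- substitute figures from your own
# billing data. Only cache misses reach the LLM.
COST_PER_QUERY = {"Haiku": 0.002, "Sonnet": 0.02, "Opus": 0.10}

def monthly_llm_cost(queries: int, cache_hit_rate: float,
                     tier_mix: dict[str, float]) -> float:
    """tier_mix maps tier name -> fraction of cache misses routed to it."""
    misses = queries * (1.0 - cache_hit_rate)
    return sum(misses * frac * COST_PER_QUERY[tier]
               for tier, frac in tier_mix.items())
```

At the same query volume and tier mix, raising the cache hit rate from 20% to 50% cuts misses, and therefore cost, by 37.5%.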
Dashboard Slow to Load
Problem: The Overwatch dashboard takes a long time to render, especially the Incidents list or Analytics pages.
Cause: The browser is loading and rendering too many records at once. This typically happens when pagination is set to a high page size, when filters are not applied, or when the browser is under memory pressure from other tabs.
Solution:
- Apply filters: Use severity, status, date range, and assignee filters to reduce the number of incidents loaded at once
- Check pagination settings: Reduce the page size if it is set above the default. Loading 100+ incident cards simultaneously impacts render time
- Close unnecessary browser tabs: Heavy pages in other tabs compete for memory and CPU, slowing React rendering
- Check your browser: Open Chrome Task Manager (Shift+Esc) and check if the Overwatch tab is using excessive memory. If so, refresh the page
- Disable heavy browser extensions: Extensions that inject scripts into every page (ad blockers, Grammarly, etc.) add overhead
WebSocket Lag
Problem: Real-time updates (incident status changes, new alerts, procedure execution progress) arrive with noticeable delay.
Cause: WebSocket messages travel through the same network path as API requests. High network latency, unstable connections, or too many concurrent WebSocket connections from the same browser can introduce delays.
Solution:
- Check network quality: Run a latency test against your Overwatch API endpoint. Anything above 200ms round-trip will produce noticeable lag in real-time updates
- Reduce concurrent connections: If you have multiple Overwatch dashboard tabs open, each maintains its own WebSocket connection. Close duplicate tabs
- Check your VPN: VPN routing can add 50-200ms of latency. If real-time responsiveness is critical, test without the VPN to isolate the impact
- Monitor connection status: The connection indicator in the dashboard header shows the WebSocket state. If it flickers between connected and disconnected, the issue is likely network instability
Tip: For the best real-time experience, keep a single Overwatch dashboard tab open and use browser notifications for alerts from other tabs.
Search Returning Poor Results
Problem: Searching for incidents or solutions returns irrelevant or low-quality results.
Cause: Overwatch uses semantic (vector) search, which matches on meaning rather than exact keywords. Poor results typically indicate that incident descriptions lack sufficient detail, the service registry is incomplete, or the knowledge base has not been enriched with enough historical data.
Solution:
- Improve incident descriptions: When creating incidents, include specific error messages, affected service names, timestamps, and observable symptoms. Richer descriptions produce better vector embeddings and improve future search results
- Enrich the service registry: Go to the Chrome extension options and ensure all services are registered with correct names, GitHub repos, and deploy targets. The AI uses this context to scope its search
- Build historical context: The semantic search improves over time as more incidents and resolutions are recorded. The 3-layer search architecture (organization history, public knowledge base, LLM generation) produces the best results when the first two layers have sufficient data
- Use natural language queries: The search engine understands meaning, so phrasing like “memory leak in checkout service” will outperform keyword searches like “OOM checkout”
- Check the search scope: Verify you are searching across the correct content types (incidents, procedures, comments) in the search interface
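Under the hood, vector search ranks results by embedding similarity, which is why two differently worded descriptions of the same failure can still match. A toy illustration with made-up 3-dimensional vectors (real embeddings come from a model and have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: values near 1.0 mean near-identical meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings: two phrasings of one incident sit close together,
# while an unrelated incident sits far away.
memory_leak = [0.9, 0.1, 0.2]   # "memory leak in checkout service"
oom_checkout = [0.8, 0.2, 0.3]  # "checkout pods OOM-killed"
dns_outage = [0.1, 0.9, 0.1]    # "DNS resolution failures"
```

Richer incident descriptions matter because they give the embedding model more signal to place the incident near its true neighbors in this vector space.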
Performance Monitoring
To proactively identify performance issues before they impact your team:
- AI Usage dashboard: Track cache hit rates, average response times, and model tier distribution in Analytics > AI Usage
- Incident volume trends: Monitor incident creation rates to anticipate periods of high dashboard load
- WebSocket health: The connection indicator provides real-time feedback on connectivity quality
- Browser DevTools: Use the Network and Performance tabs in Chrome DevTools to profile slow dashboard interactions
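One simple proactive check is to track the cache hit ratio from raw hit/miss counts and flag a drop, since a falling ratio predicts both slower responses and higher cost. A sketch; the 0.3 floor is an assumption, so tune it against your own baseline in Analytics > AI Usage:

```python
def cache_health(hits: int, misses: int, floor: float = 0.3) -> tuple[float, bool]:
    """Return (hit_ratio, degraded); degraded is True when the ratio drops
    below the floor, signaling more queries are reaching the LLM."""
    total = hits + misses
    ratio = hits / total if total else 0.0
    return ratio, ratio < floor
```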
Still Stuck?
- Check the Common Issues page for general platform problems
- Review the API Errors page if slow responses are accompanied by error codes
- Contact support at support@overwatch-observability.com with your browser version, network environment details, and specific performance observations