Overwatch uses large language models (LLMs) to generate resolution procedures when knowledge base results are insufficient. Because every LLM call has a per-token cost, the platform includes a multi-layered cost management system: tiered model routing, organization-level quotas, and semantic caching.
This guide explains how each mechanism works and how to configure them.
Overwatch calculates a complexity score (0.0 — 1.0) for each incident and uses it to select the most cost-effective model that can still produce a quality response. The score is derived from four factors:
| Factor | Low-end contribution | High-end contribution |
|---|---|---|
| Incident severity | low = 0.1 | critical = 0.6 |
| Technology stack size | 1—3 components = 0.15 | 5+ components = 0.25 |
| Error message count | 1 error = 0.15 | 3+ errors = 0.25 |
| Infrastructure components | fewer than 4 = 0 | 4+ = 0.15 |
The final score is capped at 1.0. Based on that score, the system selects a model tier.
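The scoring rule above can be sketched as follows. This is a hypothetical reconstruction: the low/high contributions come from the table, but the function name and the intermediate severity weights (`medium`, `high`) are assumptions, since the table only lists the two extremes.

```python
# Hypothetical sketch of Overwatch's complexity scoring. Contributions for
# each factor are taken from the table above; the "medium" and "high"
# severity weights are assumed values, not documented ones.

def complexity_score(severity: str, stack_size: int,
                     error_count: int, infra_components: int) -> float:
    """Combine the four factors into a 0.0-1.0 complexity score."""
    severity_weight = {"low": 0.1, "medium": 0.3, "high": 0.45, "critical": 0.6}
    score = severity_weight.get(severity, 0.1)
    score += 0.25 if stack_size >= 5 else 0.15        # technology stack size
    score += 0.25 if error_count >= 3 else 0.15       # error message count
    score += 0.15 if infra_components >= 4 else 0.0   # infrastructure components
    return min(score, 1.0)                            # final score capped at 1.0
```

A critical incident with a large stack, several errors, and many infrastructure components sums to 1.25 before the cap, which is why the cap matters.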
Overwatch routes requests across five model tiers, each optimized for a different trade-off between cost and reasoning depth.
| Tier | Model | Complexity range | Use case | Relative cost |
|---|---|---|---|---|
| 1 | Amazon Nova Micro | Below 0.15 | Trivial alerts, simple triage | Lowest |
| 2 | Amazon Nova Lite | 0.15 — 0.30 | Straightforward incidents, fast responses | Low |
| 3 | Claude Haiku 4.5 | 0.30 — 0.50 | Standard incidents, balanced analysis | Medium |
| 4 | Claude Sonnet 4.5 | 0.50 — 0.70 | Complex multi-component incidents | High |
| 5 | Claude Opus 4.1 | Above 0.70 | Critical root-cause analysis, P0 incidents | Highest |
All models run on AWS Bedrock, so no API keys for external LLM providers are required. Overwatch uses the organization’s AWS IAM role credentials.
If a selected model is throttled (AWS rate limit), Overwatch automatically falls back down the tier chain until a response is obtained:
Opus 4.1 --> Sonnet 4.5 --> Haiku 4.5 --> Nova Lite --> Nova Micro

This ensures that users always receive a response, even during high-demand periods. The actual model used is recorded in the response metadata so cost tracking remains accurate.
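A minimal sketch of the routing-plus-fallback behavior described above. The model identifiers, boundary handling at the range edges, and the `ThrottledError` type are illustrative assumptions, not Overwatch's actual API:

```python
# Illustrative tier chain and throttle fallback; model ids are assumptions.

class ThrottledError(Exception):
    """Stand-in for an AWS Bedrock throttling exception."""

# (lower score bound, model id), most capable first
TIER_CHAIN = [
    (0.70, "claude-opus-4-1"),
    (0.50, "claude-sonnet-4-5"),
    (0.30, "claude-haiku-4-5"),
    (0.15, "nova-lite"),
    (0.00, "nova-micro"),
]

def start_index(score: float) -> int:
    """Pick the starting tier for a complexity score."""
    if score > TIER_CHAIN[0][0]:                      # above 0.70: Opus
        return 0
    for i, (lower, _) in enumerate(TIER_CHAIN[1:], start=1):
        if score >= lower:
            return i
    return len(TIER_CHAIN) - 1

def invoke_with_fallback(score: float, prompt: str, invoke):
    """Walk down the chain until a model responds; report which one."""
    for _, model_id in TIER_CHAIN[start_index(score):]:
        try:
            # the model actually used is returned so it can be recorded
            # in response metadata for accurate cost tracking
            return model_id, invoke(model_id, prompt)
        except ThrottledError:
            continue  # rate-limited: fall back to the next cheaper tier
    raise RuntimeError("all model tiers are throttled")
```

If every tier is throttled the sketch raises rather than silently returning nothing; a production router would more likely retry with backoff.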
The router also considers how much of an organization’s monthly budget remains:
| Budget remaining | Routing behavior |
|---|---|
| Above 75% | Normal tiered routing (full tier selection) |
| 25% — 75% | Mid-tier preferred (avoids Opus-class models) |
| Below 25% | Budget-constrained (routes all requests to Nova Lite) |
This prevents unexpected cost overruns toward the end of a billing period.
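The budget-aware routing above might look like the following sketch; the function name and model identifiers are assumptions, and the percentage thresholds mirror the table:

```python
# Illustrative budget guardrail using the thresholds from the table above.

def allowed_models(budget_remaining_pct: float, chain: list[str]) -> list[str]:
    """Restrict which models the router may pick as budget runs down."""
    if budget_remaining_pct > 75:
        return chain                                   # normal tiered routing
    if budget_remaining_pct >= 25:
        return [m for m in chain if "opus" not in m]   # avoid Opus-class models
    return ["nova-lite"]                               # budget-constrained mode
```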
Each organization has an AI usage budget that administrators can configure. The system tracks per-request costs with decimal precision and enforces limits automatically.
Administrators can configure multiple alert thresholds to receive early warnings before limits are hit:
| Threshold | Action |
|---|---|
| 50% of monthly budget | Informational notification sent to admins |
| 75% of monthly budget | Warning notification; routing shifts to mid-tier models |
| 90% of monthly budget | Critical alert; routing forced to budget-constrained mode |
| 100% of monthly budget | Hard block; all LLM requests return an error until the next billing period |
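As a sketch, the threshold checks could be expressed like this; the action names and function are hypothetical, while the percentages come from the table above:

```python
# Every threshold at or below the current spend percentage fires.
THRESHOLDS = [          # (percent of monthly budget, action) - assumed names
    (100, "hard_block"),
    (90, "critical_alert"),
    (75, "warning"),
    (50, "info"),
]

def triggered_actions(spent_usd: float, budget_usd: float) -> list[str]:
    """Actions triggered at the current spend level, most severe first."""
    pct = 100.0 * spent_usd / budget_usd
    return [action for limit, action in THRESHOLDS if pct >= limit]
```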
Semantic caching reduces LLM costs by 30—50% by reusing responses for queries that are similar to previously answered ones. This avoids sending duplicate or near-duplicate queries to the LLM.
Query similarity is computed over embeddings produced by the text-embedding-3-small model.

| Setting | Default | Description |
|---|---|---|
| Similarity threshold | 0.95 | Minimum cosine similarity for a cache hit |
| Cache TTL | 30 days | How long cached responses are retained |
| Organization isolation | Enabled | Each organization’s cache is separate |
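A minimal sketch of the cache lookup under the defaults above. The embedding step and storage layer are stubbed out, and the cache entry layout is an assumption:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cache_lookup(query_vec, cache, threshold=0.95):
    """Return the cached response for the most similar prior query,
    but only if it clears the similarity threshold."""
    best = max(cache, key=lambda e: cosine(query_vec, e["vec"]), default=None)
    if best is not None and cosine(query_vec, best["vec"]) >= threshold:
        return best["response"]   # cache hit: no LLM call, no token cost
    return None                   # cache miss: route the query to an LLM
```

In practice each organization would get its own cache store (the "organization isolation" setting), and entries older than the TTL would be evicted before the lookup.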
Administrators can monitor cache performance, including hit rates, through the Analytics dashboard.
Follow these practices to keep LLM costs low while maintaining resolution quality.
1. Configure the Service Registry
Enriched incident context (service names, dependencies, runbook URLs) helps the LLM produce accurate resolutions on the first attempt, reducing follow-up queries.
2. Include Specific Error Messages
Pasting the exact error message or stack trace into the incident description gives the model enough context to avoid back-and-forth clarification, which would consume additional tokens.
3. Use Procedures for Recurring Issues
Documented procedures in the knowledge base are returned directly from Weaviate without invoking any LLM at all. Building a procedure library for common incidents is the single most effective cost reduction strategy.
4. Monitor Usage via Analytics
The Analytics dashboard shows per-organization LLM usage, cost trends, and cache hit rates. Review these monthly to identify optimization opportunities.
Administrators have the following tools for managing LLM costs.
Navigate to Dashboard --> Organization --> Billing --> AI Usage to open the AI Usage panel.
Navigate to Dashboard --> Organization --> Settings --> AI Quotas to manage quota settings in the UI.
Use the Admin API to manage quotas programmatically:
```shell
# Get current quota status
curl -H "Authorization: Bearer $TOKEN" \
  https://api.overwatch-observability.com/api/v1/organizations/{org_id}/ai-quota

# Update quota limit
curl -X PATCH \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"monthly_budget_usd": 500.00}' \
  https://api.overwatch-observability.com/api/v1/organizations/{org_id}/ai-quota
```

Organization owners and system administrators can temporarily raise or remove quotas during critical incidents:
Dashboard --> Organization --> Settings --> AI Quotas --> Override

Cost alerts are sent through the organization’s configured notification channels (email, Slack, webhooks).
The following table shows approximate per-1K-token pricing for each model tier. Actual costs depend on prompt length and response length.
| Model | Input (per 1K tokens) | Output (per 1K tokens) | Typical request cost |
|---|---|---|---|
| Nova Micro | $0.000035 | $0.00014 | Under $0.001 |
| Nova Lite | $0.00006 | $0.00024 | Under $0.001 |
| Haiku 4.5 | $0.00025 | $0.00125 | $0.002 — $0.005 |
| Sonnet 4.5 | $0.003 | $0.015 | $0.02 — $0.05 |
| Opus 4.1 | $0.015 | $0.075 | $0.10 — $0.30 |
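Estimating a single request's cost from this table is simple arithmetic; the sketch below uses shorthand model keys, not real model identifiers:

```python
PRICING = {  # model: (input $/1K tokens, output $/1K tokens), from the table
    "nova-micro": (0.000035, 0.00014),
    "nova-lite":  (0.00006,  0.00024),
    "haiku-4.5":  (0.00025,  0.00125),
    "sonnet-4.5": (0.003,    0.015),
    "opus-4.1":   (0.015,    0.075),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-1K-token rates."""
    in_rate, out_rate = PRICING[model]
    return (input_tokens / 1000) * in_rate + (output_tokens / 1000) * out_rate
```

For example, a Sonnet 4.5 request with a 2,000-token prompt and an 800-token response costs about 2 x $0.003 + 0.8 x $0.015 = $0.018.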
Organization Setup
Configure billing preferences and resource quotas for your organization.
Integration Management
Connect monitoring platforms to enrich incident context and reduce LLM reliance.
Security & Compliance
Review audit logging for quota overrides and cost alert history.
If you have questions about LLM cost management or need to adjust your organization’s quotas, contact support@overwatch-observability.com.