
LLM Cost Management

Overwatch uses large language models (LLMs) to generate resolution procedures when knowledge base results are insufficient. Because every LLM call has a per-token cost, the platform includes a multi-layered cost management system: tiered model routing, organization-level quotas, and semantic caching.

This guide explains how each mechanism works and how to configure them.

Overwatch calculates a complexity score (0.0–1.0) for each incident and uses it to select the most cost-effective model that can still produce a quality response. The score is derived from four factors:

| Factor | Low weight | High weight |
|---|---|---|
| Incident severity | low = 0.1 | critical = 0.6 |
| Technology stack size | 1–3 components = 0.15 | 5+ components = 0.25 |
| Error message count | 1 error = 0.15 | 3+ errors = 0.25 |
| Infrastructure components | fewer than 4 = 0 | 4+ = 0.15 |

The final score is capped at 1.0. Based on that score, the system selects a model tier.
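The scoring above can be sketched in a few lines of Python. The weights come directly from the factor table; the function name, parameter names, and the handling of undocumented middle cases (intermediate severities, a 4-component stack) are assumptions, not Overwatch's actual API.

```python
# Illustrative sketch of the complexity-score calculation.
# Only the "low" and "critical" severity weights are documented; the
# default for other severities is an assumption made for this sketch.

SEVERITY_WEIGHT = {"low": 0.1, "critical": 0.6}

def complexity_score(severity: str, stack_size: int,
                     error_count: int, infra_components: int) -> float:
    score = SEVERITY_WEIGHT.get(severity, 0.1)
    score += 0.25 if stack_size >= 5 else 0.15        # technology stack size
    score += 0.25 if error_count >= 3 else 0.15       # error message count
    score += 0.15 if infra_components >= 4 else 0.0   # infrastructure components
    return min(score, 1.0)                            # final score capped at 1.0

# A critical incident with 6 stack components, 4 errors, and 5 infra
# components sums to 1.25 and is capped:
print(complexity_score("critical", 6, 4, 5))  # 1.0
```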

Overwatch routes requests across five tiers of LLM models, each optimized for a different trade-off between cost and reasoning depth.

| Tier | Model | Complexity range | Use case | Relative cost |
|---|---|---|---|---|
| 1 | Amazon Nova Micro | Below 0.15 | Trivial alerts, simple triage | Lowest |
| 2 | Amazon Nova Lite | 0.15–0.30 | Straightforward incidents, fast responses | Low |
| 3 | Claude Haiku 4.5 | 0.30–0.50 | Standard incidents, balanced analysis | Medium |
| 4 | Claude Sonnet 4.5 | 0.50–0.70 | Complex multi-component incidents | High |
| 5 | Claude Opus 4.1 | Above 0.70 | Critical root-cause analysis, P0 incidents | Highest |

All models run on AWS Bedrock, so no API keys for external LLM providers are required. Overwatch uses the organization’s AWS IAM role credentials.

If a selected model is throttled (AWS rate limit), Overwatch automatically falls back down the tier chain until a response is obtained:

Opus 4.1 → Sonnet 4.5 → Haiku 4.5 → Nova Lite → Nova Micro

This ensures that users always receive a response, even during high-demand periods. The actual model used is recorded in the response metadata so cost tracking remains accurate.
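Tier selection and the fallback chain can be sketched as follows. The tier boundaries and chain order come from the tables above; the function names and the `ThrottledError` type are illustrative assumptions, not Overwatch's real interfaces.

```python
# Sketch of tier selection plus the throttling fallback chain.

class ThrottledError(Exception):
    """Stand-in for an AWS Bedrock rate-limit error."""

TIERS = [  # (exclusive upper bound of complexity range, model)
    (0.15, "Amazon Nova Micro"),
    (0.30, "Amazon Nova Lite"),
    (0.50, "Claude Haiku 4.5"),
    (0.70, "Claude Sonnet 4.5"),
    (float("inf"), "Claude Opus 4.1"),
]

FALLBACK_CHAIN = ["Claude Opus 4.1", "Claude Sonnet 4.5", "Claude Haiku 4.5",
                  "Amazon Nova Lite", "Amazon Nova Micro"]

def select_model(score: float) -> str:
    """Pick the cheapest tier whose complexity range contains the score."""
    return next(model for bound, model in TIERS if score < bound)

def invoke_with_fallback(score: float, call):
    """Try the selected model; on throttling, walk down the chain."""
    start = FALLBACK_CHAIN.index(select_model(score))
    for model in FALLBACK_CHAIN[start:]:
        try:
            # the model actually used is returned so cost tracking stays accurate
            return model, call(model)
        except ThrottledError:
            continue
    raise RuntimeError("all model tiers throttled")
```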

The router also considers how much of an organization’s monthly budget remains:

| Budget remaining | Routing behavior |
|---|---|
| Above 75% | Normal tiered routing (full tier selection) |
| 25%–75% | Mid-tier preferred (avoids Opus-class models) |
| Below 25% | Budget-constrained (routes all requests to Nova Lite) |

This prevents unexpected cost overruns toward the end of a billing period.
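The budget adjustment can be sketched as a post-processing step on the selected model. The percentage thresholds match the table; the function name and the choice of Sonnet as the mid-tier substitute for Opus are assumptions made for illustration.

```python
# Sketch of budget-aware routing applied after tier selection.

def adjust_for_budget(model: str, budget_remaining_pct: float) -> str:
    if budget_remaining_pct > 75:
        return model                                  # normal tiered routing
    if budget_remaining_pct >= 25:
        # mid-tier preferred: step Opus-class requests down a tier
        return "Claude Sonnet 4.5" if model == "Claude Opus 4.1" else model
    return "Amazon Nova Lite"                         # budget-constrained mode
```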

Each organization has an AI usage budget that administrators can configure. The system tracks per-request costs with decimal precision and enforces limits automatically.

  1. Before each LLM request, the cost monitor service checks whether the organization has exceeded its hard limit.
  2. If the hard limit is reached, the request is blocked and the user receives an error indicating that LLM-backed resolution generation (Layer 3) will resume in the next billing period.
  3. After each LLM request, the cost of the request is recorded and threshold checks run. If a warning or alert threshold is crossed, the system generates a cost alert.

Administrators can configure multiple alert thresholds to receive early warnings before limits are hit:

| Threshold | Action |
|---|---|
| 50% of monthly budget | Informational notification sent to admins |
| 75% of monthly budget | Warning notification; routing shifts to mid-tier models |
| 90% of monthly budget | Critical alert; routing forced to budget-constrained mode |
| 100% of monthly budget | Hard block; all LLM requests return an error until the next billing period |
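The check-then-record flow and the alert thresholds above can be sketched together. The `CostMonitor` class, its in-memory storage, and the alert message format are assumptions; the real service persists spend and dispatches alerts through notification channels.

```python
# Sketch of pre-request hard-limit enforcement and post-request
# threshold alerting, per the quota flow described above.

ALERT_THRESHOLDS = (0.50, 0.75, 0.90)  # fractions of the monthly budget

class CostMonitor:
    def __init__(self, monthly_budget: float):
        self.budget = monthly_budget
        self.spent = 0.0
        self.alerted = set()  # thresholds already announced this period

    def check_before_request(self) -> None:
        """Step 1: block the request once the hard limit is reached."""
        if self.spent >= self.budget:
            raise RuntimeError("AI budget exhausted; resumes next billing period")

    def record_after_request(self, cost: float) -> list:
        """Step 3: record the cost and fire any newly crossed alerts."""
        self.spent += cost
        alerts = []
        for t in ALERT_THRESHOLDS:
            if self.spent >= t * self.budget and t not in self.alerted:
                self.alerted.add(t)
                alerts.append(f"{int(t * 100)}% of monthly budget reached")
        return alerts
```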

Semantic caching reduces LLM costs by 30–50% by reusing responses for queries that are similar to previously answered ones. This avoids sending duplicate or near-duplicate queries to the LLM.

  1. When a new query arrives, the system generates an embedding vector for it using OpenAI’s text-embedding-3-small model.
  2. It first checks for an exact hash match in Redis (fast path).
  3. If no exact match exists, it compares the query embedding against all cached embeddings for the organization using cosine similarity.
  4. If a cached entry meets the similarity threshold (default: 0.95), the cached response is returned instantly at zero LLM cost.
  5. If no match is found, the query goes to the LLM and the response is cached for future reuse.

Cache behavior is controlled by the following settings:

| Setting | Default | Description |
|---|---|---|
| Similarity threshold | 0.95 | Minimum cosine similarity for a cache hit |
| Cache TTL | 30 days | How long cached responses are retained |
| Organization isolation | Enabled | Each organization's cache is separate |
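The lookup flow above can be sketched with plain data structures. The in-memory dicts stand in for Redis and the real vector store, and the class and method names are illustrative; only the exact-hash fast path, cosine-similarity comparison, and 0.95 default threshold come from the description.

```python
# Sketch of semantic cache lookup: exact-hash fast path, then
# cosine-similarity search over the organization's cached embeddings.

import hashlib
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    def __init__(self, threshold: float = 0.95):  # default similarity threshold
        self.threshold = threshold
        self.exact = {}    # query hash -> response (stands in for Redis)
        self.entries = []  # (embedding, response) pairs for this organization

    def lookup(self, query: str, embedding):
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in self.exact:                       # fast path: exact match
            return self.exact[key]
        for emb, response in self.entries:          # semantic path
            if cosine(embedding, emb) >= self.threshold:
                return response
        return None                                 # miss: query goes to the LLM

    def store(self, query: str, embedding, response: str):
        self.exact[hashlib.sha256(query.encode()).hexdigest()] = response
        self.entries.append((embedding, response))
```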

Administrators can monitor cache performance through the following metrics:

  • Hit rate: Percentage of queries served from cache
  • Total cost saved: Cumulative dollar amount saved by cache hits
  • Cache size: Number of cached responses per organization

Follow these practices to keep LLM costs low while maintaining resolution quality.

1. Configure the Service Registry

Enriched incident context (service names, dependencies, runbook URLs) helps the LLM produce accurate resolutions on the first attempt, reducing follow-up queries.

2. Include Specific Error Messages

Pasting the exact error message or stack trace into the incident description gives the model enough context to avoid back-and-forth clarification, which would consume additional tokens.

3. Use Procedures for Recurring Issues

Documented procedures in the knowledge base are returned directly from Weaviate without invoking any LLM at all. Building a procedure library for common incidents is the single most effective cost reduction strategy.

4. Monitor Usage via Analytics

The Analytics dashboard shows per-organization LLM usage, cost trends, and cache hit rates. Review these monthly to identify optimization opportunities.

  • Set realistic alert thresholds — Configure the 50% and 75% thresholds so you have time to adjust before hitting hard limits.
  • Review high-cost incidents — Periodically review which incidents triggered Opus-tier models. If similar incidents recur, add a procedure to the knowledge base so future instances are resolved from cache.
  • Leverage the fallback chain — The automatic fallback to cheaper models during rate limiting means you do not need to over-provision expensive model access.

Administrators have the following tools for managing LLM costs.

Dashboard → Organization → Billing → AI Usage

The AI Usage panel displays:

  • Current month spend against the configured budget
  • Per-model breakdown showing how many tokens were consumed at each tier
  • Cost trend chart for the past 6 months
  • Cache hit rate and estimated savings

Navigate to Dashboard → Organization → Settings → AI Quotas to:

  • Set the monthly budget limit
  • Configure warning thresholds (50%, 75%, 90%)
  • Enable or disable the hard block at 100%
  • View current usage against limits

Organization owners and system administrators can temporarily raise or remove quotas during critical incidents:

Dashboard → Organization → Settings → AI Quotas → Override

Cost alerts are sent through the organization’s configured notification channels (email, Slack, webhooks). Each alert includes:

  • The threshold that was crossed
  • Current spend and remaining budget
  • The routing mode that was activated (mid-tier, budget-constrained, or hard block)
  • A link to the AI Usage panel

The following table shows approximate per-1K-token pricing for each model tier. Actual costs depend on prompt length and response length.

| Model | Input (per 1K tokens) | Output (per 1K tokens) | Typical request cost |
|---|---|---|---|
| Nova Micro | $0.000035 | $0.00014 | Under $0.001 |
| Nova Lite | $0.00006 | $0.00024 | Under $0.001 |
| Haiku 4.5 | $0.00025 | $0.00125 | $0.002–$0.005 |
| Sonnet 4.5 | $0.003 | $0.015 | $0.02–$0.05 |
| Opus 4.1 | $0.015 | $0.075 | $0.10–$0.30 |
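As a worked example, a request's cost is the token counts multiplied by the per-1K rates from the table. The function name here is illustrative:

```python
# Estimate a single request's cost from token counts, using the
# per-1K-token rates in the pricing table above.

RATES = {  # model: (input $/1K tokens, output $/1K tokens)
    "Nova Micro": (0.000035, 0.00014),
    "Nova Lite":  (0.00006, 0.00024),
    "Haiku 4.5":  (0.00025, 0.00125),
    "Sonnet 4.5": (0.003, 0.015),
    "Opus 4.1":   (0.015, 0.075),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = RATES[model]
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate

# A 5,000-token prompt with a 1,500-token response on Sonnet 4.5 costs
# 5 x $0.003 + 1.5 x $0.015, which lands in the $0.02-$0.05 typical range:
print(round(request_cost("Sonnet 4.5", 5000, 1500), 4))  # 0.0375
```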

  • Organization Setup: Configure billing preferences and resource quotas for your organization.
  • Integration Management: Connect monitoring platforms to enrich incident context and reduce LLM reliance.
  • Security & Compliance: Review audit logging for quota overrides and cost alert history.

If you have questions about LLM cost management or need to adjust your organization’s quotas, contact support@overwatch-observability.com.

