
LLM Cost Management

Overwatch uses large language models (LLMs) to generate resolution procedures when knowledge base results are insufficient. Because every LLM call has a per-token cost, the platform includes a multi-layered cost management system: tiered model routing, organization-level quotas, and semantic caching.

This guide explains how each mechanism works and how to configure them.

Overwatch calculates a complexity score (0.0–1.0) for each incident and uses it to select the most cost-effective model that can still produce a quality response. The score is derived from four factors:

| Factor | Low weight | High weight |
|---|---|---|
| Incident severity | low = 0.1 | critical = 0.6 |
| Technology stack size | 1–3 components = 0.15 | 5+ components = 0.25 |
| Error message count | 1 error = 0.15 | 3+ errors = 0.25 |
| Infrastructure components | fewer than 4 = 0 | 4+ = 0.15 |

The final score is capped at 1.0. Based on that score, the system selects a model tier.
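The scoring above can be sketched in a few lines of Python. The weights come directly from the factor table; the function name, parameter names, and the handling of undocumented middle cases (intermediate severities, a 4-component stack) are assumptions, not Overwatch's actual API.

```python
# Illustrative sketch of the complexity-score calculation.
# Only the "low" and "critical" severity weights are documented; the
# default for other severities is an assumption made for this sketch.

SEVERITY_WEIGHT = {"low": 0.1, "critical": 0.6}

def complexity_score(severity: str, stack_size: int,
                     error_count: int, infra_components: int) -> float:
    score = SEVERITY_WEIGHT.get(severity, 0.1)
    score += 0.25 if stack_size >= 5 else 0.15        # technology stack size
    score += 0.25 if error_count >= 3 else 0.15       # error message count
    score += 0.15 if infra_components >= 4 else 0.0   # infrastructure components
    return min(score, 1.0)                            # final score capped at 1.0

# A critical incident with 6 stack components, 4 errors, and 5 infra
# components sums to 1.25 and is capped:
print(complexity_score("critical", 6, 4, 5))  # 1.0
```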

Overwatch routes requests across five tiers of LLM models, each optimized for a different trade-off between cost and reasoning depth.

| Tier | Model | Complexity range | Use case | Relative cost |
|---|---|---|---|---|
| 1 | Amazon Nova Micro | Below 0.15 | Trivial alerts, simple triage | Lowest |
| 2 | Amazon Nova Lite | 0.15–0.30 | Straightforward incidents, fast responses | Low |
| 3 | Claude Haiku 4.5 | 0.30–0.50 | Standard incidents, balanced analysis | Medium |
| 4 | Claude Sonnet 4.5 | 0.50–0.70 | Complex multi-component incidents | High |
| 5 | Claude Opus 4.1 | Above 0.70 | Critical root-cause analysis, P0 incidents | Highest |

All models run on AWS Bedrock, so no API keys for external LLM providers are required. Overwatch uses the organization’s AWS IAM role credentials.

If a selected model is throttled (AWS rate limit), Overwatch automatically falls back down the tier chain until a response is obtained:

Opus 4.1 → Sonnet 4.5 → Haiku 4.5 → Nova Lite → Nova Micro

This ensures that users always receive a response, even during high-demand periods. The actual model used is recorded in the response metadata so cost tracking remains accurate.
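Tier selection and the fallback chain can be sketched as follows. The tier boundaries and chain order come from the tables above; the function names and the `ThrottledError` type are illustrative assumptions, not Overwatch's real interfaces.

```python
# Sketch of tier selection plus the throttling fallback chain.

class ThrottledError(Exception):
    """Stand-in for an AWS Bedrock rate-limit error."""

TIERS = [  # (exclusive upper bound of complexity range, model)
    (0.15, "Amazon Nova Micro"),
    (0.30, "Amazon Nova Lite"),
    (0.50, "Claude Haiku 4.5"),
    (0.70, "Claude Sonnet 4.5"),
    (float("inf"), "Claude Opus 4.1"),
]

FALLBACK_CHAIN = ["Claude Opus 4.1", "Claude Sonnet 4.5", "Claude Haiku 4.5",
                  "Amazon Nova Lite", "Amazon Nova Micro"]

def select_model(score: float) -> str:
    """Pick the cheapest tier whose complexity range contains the score."""
    return next(model for bound, model in TIERS if score < bound)

def invoke_with_fallback(score: float, call):
    """Try the selected model; on throttling, walk down the chain."""
    start = FALLBACK_CHAIN.index(select_model(score))
    for model in FALLBACK_CHAIN[start:]:
        try:
            # the model actually used is returned so cost tracking stays accurate
            return model, call(model)
        except ThrottledError:
            continue
    raise RuntimeError("all model tiers throttled")
```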

The router also considers how much of an organization’s monthly budget remains:

| Budget remaining | Routing behavior |
|---|---|
| Above 75% | Normal tiered routing (full tier selection) |
| 25%–75% | Mid-tier preferred (avoids Opus-class models) |
| Below 25% | Budget-constrained (routes all requests to Nova Lite) |

This prevents unexpected cost overruns toward the end of a billing period.
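The budget adjustment can be sketched as a post-processing step on the selected model. The percentage thresholds match the table; the function name and the choice of Sonnet as the mid-tier substitute for Opus are assumptions made for illustration.

```python
# Sketch of budget-aware routing applied after tier selection.

def adjust_for_budget(model: str, budget_remaining_pct: float) -> str:
    if budget_remaining_pct > 75:
        return model                                  # normal tiered routing
    if budget_remaining_pct >= 25:
        # mid-tier preferred: step Opus-class requests down a tier
        return "Claude Sonnet 4.5" if model == "Claude Opus 4.1" else model
    return "Amazon Nova Lite"                         # budget-constrained mode
```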

Each organization has an AI usage budget that administrators can configure. The system tracks per-request costs with decimal precision and enforces limits automatically.

  1. Before each LLM request, the cost monitor service checks whether the organization has exceeded its hard limit.
  2. If the hard limit is reached, the request is blocked and the user receives an error indicating that LLM-backed resolution generation (Layer 3) will resume in the next billing period.
  3. After each LLM request, the cost of the request is recorded and threshold checks run. If a warning or alert threshold is crossed, the system generates a cost alert.

Administrators can configure multiple alert thresholds to receive early warnings before limits are hit:

| Threshold | Action |
|---|---|
| 50% of monthly budget | Informational notification sent to admins |
| 75% of monthly budget | Warning notification; routing shifts to mid-tier models |
| 90% of monthly budget | Critical alert; routing forced to budget-constrained mode |
| 100% of monthly budget | Hard block; all LLM requests return an error until the next billing period |
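The check-then-record flow and the alert thresholds above can be sketched together. The `CostMonitor` class, its in-memory storage, and the alert message format are assumptions; the real service persists spend and dispatches alerts through notification channels.

```python
# Sketch of pre-request hard-limit enforcement and post-request
# threshold alerting, per the quota flow described above.

ALERT_THRESHOLDS = (0.50, 0.75, 0.90)  # fractions of the monthly budget

class CostMonitor:
    def __init__(self, monthly_budget: float):
        self.budget = monthly_budget
        self.spent = 0.0
        self.alerted = set()  # thresholds already announced this period

    def check_before_request(self) -> None:
        """Step 1: block the request once the hard limit is reached."""
        if self.spent >= self.budget:
            raise RuntimeError("AI budget exhausted; resumes next billing period")

    def record_after_request(self, cost: float) -> list:
        """Step 3: record the cost and fire any newly crossed alerts."""
        self.spent += cost
        alerts = []
        for t in ALERT_THRESHOLDS:
            if self.spent >= t * self.budget and t not in self.alerted:
                self.alerted.add(t)
                alerts.append(f"{int(t * 100)}% of monthly budget reached")
        return alerts
```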

Semantic caching reduces LLM costs by 30–50% by reusing responses for queries that are similar to previously answered ones. This avoids sending duplicate or near-duplicate queries to the LLM.

  1. When a new query arrives, the system generates an embedding vector for it using OpenAI’s text-embedding-3-small model.
  2. It first checks for an exact hash match in Redis (fast path).
  3. If no exact match exists, it compares the query embedding against all cached embeddings for the organization using cosine similarity.
  4. If a cached entry meets the similarity threshold (default: 0.95), the cached response is returned instantly at zero LLM cost.
  5. If no match is found, the query goes to the LLM and the response is cached for future reuse.

Cache behavior is controlled by the following settings:

| Setting | Default | Description |
|---|---|---|
| Similarity threshold | 0.95 | Minimum cosine similarity for a cache hit |
| Cache TTL | 30 days | How long cached responses are retained |
| Organization isolation | Enabled | Each organization's cache is separate |
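The lookup flow above can be sketched with plain data structures. The in-memory dicts stand in for Redis and the real vector store, and the class and method names are illustrative; only the exact-hash fast path, cosine-similarity comparison, and 0.95 default threshold come from the description.

```python
# Sketch of semantic cache lookup: exact-hash fast path, then
# cosine-similarity search over the organization's cached embeddings.

import hashlib
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    def __init__(self, threshold: float = 0.95):  # default similarity threshold
        self.threshold = threshold
        self.exact = {}    # query hash -> response (stands in for Redis)
        self.entries = []  # (embedding, response) pairs for this organization

    def lookup(self, query: str, embedding):
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in self.exact:                       # fast path: exact match
            return self.exact[key]
        for emb, response in self.entries:          # semantic path
            if cosine(embedding, emb) >= self.threshold:
                return response
        return None                                 # miss: query goes to the LLM

    def store(self, query: str, embedding, response: str):
        self.exact[hashlib.sha256(query.encode()).hexdigest()] = response
        self.entries.append((embedding, response))
```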

Administrators can monitor cache performance through the following metrics:

  • Hit rate: Percentage of queries served from cache
  • Total cost saved: Cumulative dollar amount saved by cache hits
  • Cache size: Number of cached responses per organization

Follow these practices to keep LLM costs low while maintaining resolution quality.

1. Configure the Service Registry

Enriched incident context (service names, dependencies, runbook URLs) helps the LLM produce accurate resolutions on the first attempt, reducing follow-up queries.

2. Include Specific Error Messages

Pasting the exact error message or stack trace into the incident description gives the model enough context to avoid back-and-forth clarification, which would consume additional tokens.

3. Use Procedures for Recurring Issues

Documented procedures in the knowledge base are returned directly from Weaviate without invoking any LLM at all. Building a procedure library for common incidents is the single most effective cost reduction strategy.

4. Monitor Usage via Analytics

The Analytics dashboard shows per-organization LLM usage, cost trends, and cache hit rates. Review these monthly to identify optimization opportunities.

  • Set realistic alert thresholds — Configure the 50% and 75% thresholds so you have time to adjust before hitting hard limits.
  • Review high-cost incidents — Periodically review which incidents triggered Opus-tier models. If similar incidents recur, add a procedure to the knowledge base so future instances are resolved from cache.
  • Leverage the fallback chain — The automatic fallback to cheaper models during rate limiting means you do not need to over-provision expensive model access.

Administrators have the following tools for managing LLM costs.

Dashboard → Organization → Billing → AI Usage

The AI Usage panel displays:

  • Current month spend against the configured budget
  • Per-model breakdown showing how many tokens were consumed at each tier
  • Cost trend chart for the past 6 months
  • Cache hit rate and estimated savings

Navigate to Dashboard → Organization → Settings → AI Quotas to:

  • Set the monthly budget limit
  • Configure warning thresholds (50%, 75%, 90%)
  • Enable or disable the hard block at 100%
  • View current usage against limits

Organization owners and system administrators can temporarily raise or remove quotas during critical incidents:

Dashboard → Organization → Settings → AI Quotas → Override

Cost alerts are sent through the organization’s configured notification channels (email, Slack, webhooks). Each alert includes:

  • The threshold that was crossed
  • Current spend and remaining budget
  • The routing mode that was activated (mid-tier, budget-constrained, or hard block)
  • A link to the AI Usage panel

The following table shows approximate per-1K-token pricing for each model tier. Actual costs depend on prompt length and response length.

| Model | Input (per 1K tokens) | Output (per 1K tokens) | Typical request cost |
|---|---|---|---|
| Nova Micro | $0.000035 | $0.00014 | Under $0.001 |
| Nova Lite | $0.00006 | $0.00024 | Under $0.001 |
| Haiku 4.5 | $0.00025 | $0.00125 | $0.002–$0.005 |
| Sonnet 4.5 | $0.003 | $0.015 | $0.02–$0.05 |
| Opus 4.1 | $0.015 | $0.075 | $0.10–$0.30 |
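As a worked example, a request's cost is the token counts multiplied by the per-1K rates from the table. The function name here is illustrative:

```python
# Estimate a single request's cost from token counts, using the
# per-1K-token rates in the pricing table above.

RATES = {  # model: (input $/1K tokens, output $/1K tokens)
    "Nova Micro": (0.000035, 0.00014),
    "Nova Lite":  (0.00006, 0.00024),
    "Haiku 4.5":  (0.00025, 0.00125),
    "Sonnet 4.5": (0.003, 0.015),
    "Opus 4.1":   (0.015, 0.075),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = RATES[model]
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate

# A 5,000-token prompt with a 1,500-token response on Sonnet 4.5 costs
# 5 x $0.003 + 1.5 x $0.015, which lands in the $0.02-$0.05 typical range:
print(round(request_cost("Sonnet 4.5", 5000, 1500), 4))  # 0.0375
```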

  • Organization Setup: Configure billing preferences and resource quotas for your organization.
  • Integration Management: Connect monitoring platforms to enrich incident context and reduce LLM reliance.
  • Security & Compliance: Review audit logging for quota overrides and cost alert history.

If you have questions about LLM cost management or need to adjust your organization’s quotas, contact support@overwatch-observability.com.

