
Analytics Dashboard

The Analytics Dashboard provides comprehensive insights into incident response performance, procedure effectiveness, team productivity, and AI-powered solution costs. Use these metrics to identify improvement opportunities, optimize workflows, and demonstrate ROI.

Access the Analytics Dashboard:

Dashboard → Analytics

What You Can Track

The analytics system provides four main categories of insights:

| Category | Key Metrics | Access Level |
|---|---|---|
| Incident Analytics | MTTR, MTTA, resolution trends, severity distribution | All users |
| Procedure Analytics | Success rates, execution times, usage patterns | All users |
| Team Performance | Individual metrics, team comparisons, workload distribution | Managers and above |
| LLM Cost Monitoring | AI generation costs, budget tracking, efficiency metrics | Admins and above |

Role-Based Visibility

  • Engineers: View personal metrics, team-wide incident and procedure analytics
  • Managers: All Engineer access plus team performance comparisons and detailed reports
  • Admins: Full access including LLM cost analytics and organization-wide metrics

Time Range Selection

All analytics pages support flexible time range selection:

  • 24 Hours: Real-time operational view
  • 7 Days: Weekly trends and patterns
  • 30 Days: Monthly performance analysis (default)
  • 90 Days: Quarterly trends and long-term patterns

Detailed metrics about incident response and resolution:

Access Incident Analytics

Dashboard → Analytics → Incident Analytics

Mean Time To Resolution (MTTR)

  • Average time from incident creation to resolution
  • Calculated per severity level (Critical, High, Medium, Low)
  • Trend analysis shows improvement or degradation over time
  • Industry benchmark: Critical incidents < 1 hour, High < 4 hours

Example MTTR Display

Overall MTTR: 4.2 hours
├─ Critical: 0.8 hours (target: < 1 hour) ✅
├─ High: 3.2 hours (target: < 4 hours) ✅
├─ Medium: 6.1 hours (target: < 8 hours) ✅
└─ Low: 12.5 hours (target: < 24 hours) ✅
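
MTTR is an average of creation-to-resolution durations, grouped by severity. A minimal sketch of that calculation in Python, assuming incidents are available as records with created/resolved timestamps (the field names and sample data are illustrative, not the Overwatch schema):

from collections import defaultdict
from datetime import datetime

# Illustrative incident records; field names are assumptions, not the Overwatch schema.
incidents = [
    {"severity": "critical", "created": "2025-10-01T08:00:00", "resolved": "2025-10-01T08:45:00"},
    {"severity": "high",     "created": "2025-10-02T10:00:00", "resolved": "2025-10-02T13:30:00"},
    {"severity": "high",     "created": "2025-10-03T09:00:00", "resolved": "2025-10-03T12:00:00"},
]

durations = defaultdict(list)
for inc in incidents:
    created = datetime.fromisoformat(inc["created"])
    resolved = datetime.fromisoformat(inc["resolved"])
    durations[inc["severity"]].append((resolved - created).total_seconds() / 3600)

for severity, hours in durations.items():
    print(f"{severity}: MTTR {sum(hours) / len(hours):.1f}h over {len(hours)} incidents")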

Mean Time To Acknowledge (MTTA)

  • Time from incident creation to first acknowledgment
  • Measures initial response speed
  • Critical for on-call and alerting effectiveness
  • Target: < 15 minutes for Critical severity

Severity Breakdown

Visual distribution of incidents by severity:

| Severity | Count | Percentage | Resolution Time |
|---|---|---|---|
| Critical | 3 | 6.7% | 0.8h average |
| High | 12 | 26.7% | 3.2h average |
| Medium | 22 | 48.9% | 6.1h average |
| Low | 8 | 17.8% | 12.5h average |

Status Distribution

Current state of incidents in the system:

  • Open: 4 incidents - Newly created, awaiting assignment
  • In Progress: 3 incidents - Active investigation and resolution
  • Resolved: 35 incidents - Solution implemented, awaiting verification
  • Closed: 3 incidents - Fully resolved and documented

Daily Incident Trends

Line chart showing incident creation over time:

  • Identify peak incident periods
  • Detect anomalous spikes requiring investigation
  • Track effectiveness of preventive measures
  • Forecast capacity planning needs

Pattern Analysis

Automated pattern detection identifies:

  • Recurring Incidents: Same root cause appearing multiple times
  • Time-Based Patterns: Incidents occurring at specific times (e.g., after deployments)
  • Cascading Failures: Multiple related incidents from single root cause
  • Seasonal Trends: Periodic patterns related to business cycles

Resolution Time Trends

Track how resolution efficiency changes over time:

  • Average resolution time per day/week
  • Comparison against baseline and targets
  • Impact of process improvements
  • Training effectiveness measurement

Longest Incidents

List of incidents taking the most time to resolve:

Top 5 Longest Incidents:
1. Database connection timeout - 12.5 hours (Critical)
2. API gateway overload - 8.2 hours (High)
3. Memory leak investigation - 7.8 hours (High)
4. DNS propagation delay - 6.5 hours (Medium)
5. Certificate expiration - 5.2 hours (Medium)

Use Case: Identify incidents needing procedure improvements or better documentation.

Most frequent incident categories:

| Type | Count | % of Total | Avg Resolution Time |
|---|---|---|---|
| Database Performance | 12 | 26.7% | 4.2 hours |
| Resource Exhaustion | 10 | 22.2% | 5.1 hours |
| API Latency | 8 | 17.8% | 3.8 hours |
| Network Connectivity | 6 | 13.3% | 6.2 hours |
| Service Outage | 5 | 11.1% | 8.5 hours |
| Security Breach | 2 | 4.4% | 15.2 hours |
| Other | 2 | 4.4% | 3.2 hours |

Actionable Insights

  • High-frequency types: Create dedicated procedures or automation
  • High-duration types: Improve documentation and training
  • Recurring types: Investigate root causes for permanent fixes

Performance metrics for runbook procedures and execution tracking:

Access Procedure Analytics

Dashboard → Analytics → Procedure Analytics

Overall Success Rate

  • Percentage of successful procedure executions
  • Calculated across all procedures
  • Target: > 90% success rate
  • Trend analysis shows execution quality improvements

Example Display

Success Rate: 87.1%
├─ Successful Executions: 298
├─ Failed Executions: 44
├─ Total Executions: 342
└─ Trend: ↑ 3.2% vs last month
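
The success rate and its month-over-month trend reduce to two ratios. A small sketch of the calculation, using made-up execution counts for the prior period rather than real platform data:

def success_rate(successful: int, failed: int) -> float:
    """Percentage of executions that succeeded."""
    total = successful + failed
    return 100.0 * successful / total if total else 0.0

current = success_rate(successful=298, failed=44)    # 87.1%, matching the display above
previous = success_rate(successful=260, failed=50)   # prior period, illustrative numbers
trend = current - previous

print(f"Success rate: {current:.1f}% ({'↑' if trend >= 0 else '↓'} {abs(trend):.1f}% vs last month)")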

Average Execution Time

  • Mean time to complete procedure executions
  • Compared against estimated duration
  • Used for capacity planning
  • Identifies procedures needing optimization

Usage Statistics

  • Total number of procedures in library
  • Total executions in time period
  • Most frequently executed procedures
  • Procedures never executed (candidates for archival)

Top procedures by execution frequency:

Top 5 Most Executed Procedures:
1. Database Performance Recovery
- Executions: 45
- Success Rate: 91.1%
- Avg Duration: 12.3 minutes
2. API Gateway Restart
- Executions: 38
- Success Rate: 94.7%
- Avg Duration: 8.5 minutes
3. Memory Cleanup Process
- Executions: 32
- Success Rate: 84.4%
- Avg Duration: 15.2 minutes
4. Cache Invalidation
- Executions: 28
- Success Rate: 96.4%
- Avg Duration: 3.1 minutes
5. Certificate Renewal
- Executions: 24
- Success Rate: 100.0%
- Avg Duration: 6.8 minutes

Strategic Actions

  • High-execution procedures: Candidates for automation
  • Low success rate: Need revision or better error handling
  • Long duration: Opportunities for step optimization

Most reliable procedures:

| Procedure | Success Rate | Executions | Notes |
|---|---|---|---|
| SSL Certificate Renewal | 100.0% | 12 | Fully automated verification |
| Load Balancer Health Check | 98.5% | 67 | Clear success criteria |
| DNS Update | 97.8% | 45 | Well-documented rollback |
| Cache Clear | 96.4% | 89 | Simple, low-risk operation |
| Log Rotation | 95.2% | 34 | Scheduled maintenance |

Best Practices Identified

  • Clear verification steps increase success rate
  • Automated checks reduce human error
  • Rollback capabilities increase confidence
  • Regular testing maintains reliability

Execution Time Distribution

  • Fastest 10% of executions
  • Average execution time
  • Slowest 10% of executions
  • Variance analysis (consistency measure)

Longest Procedures

Procedures requiring most time:

Longest Average Execution Time:
1. Full System Recovery - 45.2 minutes (3 executions)
2. Database Migration - 38.7 minutes (7 executions)
3. Blue-Green Deployment - 32.4 minutes (12 executions)

Time Comparison: Estimated vs Actual

  • Procedures consistently exceeding estimates need revised time estimates
  • Procedures completing faster indicate potential over-estimation
  • Large variance suggests inconsistent execution or environment differences
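
One way to act on this comparison is to flag procedures whose actual durations drift far from the estimate, or whose spread is large. A sketch with illustrative thresholds and sample data (none of the numbers below come from the platform):

from statistics import mean, stdev

def review_estimate(name: str, estimated_min: float, actual_minutes: list[float]) -> str:
    """Compare actual durations against the estimate; thresholds are illustrative assumptions."""
    avg = mean(actual_minutes)
    spread = stdev(actual_minutes) if len(actual_minutes) > 1 else 0.0
    if avg > estimated_min * 1.25:
        verdict = "raise the estimate"
    elif avg < estimated_min * 0.75:
        verdict = "estimate looks too generous"
    else:
        verdict = "estimate is reasonable"
    if spread > 0.5 * avg:
        verdict += "; high variance, check for environment differences"
    return f"{name}: est {estimated_min}m, actual avg {avg:.1f}m (±{spread:.1f}m) → {verdict}"

print(review_estimate("Database Migration", 30, [42.0, 35.5, 38.0, 45.2, 33.0, 39.8, 37.4]))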

AI-Powered Insights

Track effectiveness of AI-suggested procedures:

Metrics Tracked

  • AI Accuracy: 88.5% - Percentage of AI suggestions that successfully resolve incidents
  • AI Confidence Average: 0.82 - Average confidence score of AI recommendations
  • Suggestion Acceptance Rate: 76.3% - How often engineers choose AI-suggested procedures
  • Time Saved: 3.2 hours average per incident using AI suggestions

Learning Loop Impact

  • Accuracy improves as system learns from successful resolutions
  • Confidence scores correlate with actual success rates
  • Low-confidence suggestions prompt human expertise
  • Failed suggestions improve future recommendations

Individual and team metrics for tracking performance and identifying training needs:

Access Team Performance

Dashboard → Analytics → Performance Analytics

Note: Team performance details are visible to Managers and Admins only. Engineers see their own metrics.

Performance tracking per team member:

Example Team Performance Table

| Team Member | Incidents Assigned | Resolved | Avg Resolution Time | Success Rate |
|---|---|---|---|---|
| John Doe | 15 | 13 | 3.8 hours | 86.7% |
| Jane Smith | 18 | 17 | 4.2 hours | 94.4% |
| Mike Johnson | 12 | 10 | 5.1 hours | 83.3% |
| Sarah Williams | 14 | 13 | 3.2 hours | 92.9% |

Metrics Explained

Incidents Assigned

  • Total incidents assigned to team member
  • Includes all severities and statuses
  • Used for workload balancing

Incidents Resolved

  • Successfully resolved incidents
  • Excludes reassigned or escalated incidents
  • Success rate = Resolved / Assigned

Average Resolution Time

  • Mean time from assignment to resolution
  • Adjusted for incident severity
  • Lower is generally better (but context matters)

Success Rate

  • Percentage of incidents resolved without escalation
  • High success rate indicates expertise and effectiveness
  • Low success rate may indicate training needs or complexity
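
The per-member rows in the table above follow from straightforward aggregation. A sketch, assuming each incident record carries an assignee, a resolved flag, and a resolution time in hours (illustrative fields and sample data, not the Overwatch schema):

from collections import defaultdict

assignments = [
    # (assignee, resolved, resolution_hours) — made-up records for illustration
    ("Jane Smith", True, 4.0), ("Jane Smith", True, 4.4), ("Jane Smith", False, None),
    ("John Doe",   True, 3.8), ("John Doe",   False, None),
]

stats = defaultdict(lambda: {"assigned": 0, "resolved": 0, "hours": []})
for assignee, resolved, hours in assignments:
    row = stats[assignee]
    row["assigned"] += 1
    if resolved:
        row["resolved"] += 1
        row["hours"].append(hours)

for member, row in stats.items():
    avg = sum(row["hours"]) / len(row["hours"]) if row["hours"] else 0.0
    rate = 100.0 * row["resolved"] / row["assigned"]
    print(f"{member}: {row['assigned']} assigned, {row['resolved']} resolved, "
          f"{avg:.1f}h avg, {rate:.1f}% success")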

Workload Distribution

Visualize assignment balance across team:

  • Identify overloaded team members
  • Ensure fair distribution of on-call responsibilities
  • Plan capacity for vacation and sick time
  • Detect skill-based assignment patterns

Performance Rankings

Not for competition, but for identifying:

  • Top performers to mentor others
  • Team members needing additional training
  • Expertise areas for knowledge sharing
  • Recognition opportunities

Skill Gap Identification

Analytics reveal areas where training would help:

Low Success Rate in Specific Categories

Database Incidents - John Doe: 62% success rate
→ Recommendation: Database troubleshooting training
Network Issues - Mike Johnson: 58% success rate
→ Recommendation: Network fundamentals course

Long Resolution Times

API Performance - All team members: 8.2h average
→ Recommendation: Create dedicated procedure or documentation

Procedure Usage Patterns

  • Team members not using certain procedures: May not know they exist
  • Team members with high procedure success rates: Good mentors for others
  • Procedures never executed by certain members: Training opportunity

Response Time Metrics

  • Average First Response Time: 8.5 minutes - How quickly incidents are acknowledged
  • Average Resolution Time: 4.3 hours - Team-wide incident resolution speed
  • Escalation Rate: 12.4% - Incidents requiring manager or expert intervention

Collaboration Indicators

  • Incidents with multiple team members involved
  • Comment and mention activity levels
  • Knowledge sharing through procedure updates
  • Cross-training effectiveness

Track AI-powered solution generation costs and optimize budget usage:

Access LLM Cost Analytics

Dashboard → Analytics → LLM Costs

Note: LLM cost analytics are visible to Admins and Organization Owners only.

What is LLM Layer 3?

When vector database search (Weaviate) doesn’t find suitable solutions with high confidence, the system falls back to AI-powered solution generation using AWS Bedrock:

3-Layer Search Architecture

  1. Layer 1 (Phase 2): Customer-specific solutions from private vector database
  2. Layer 2 (Current): Public community solutions from Weaviate vector database
  3. Layer 3 (Current): AI-generated solutions via AWS Bedrock (fallback)

Why Monitor LLM Costs?

Unlike free vector database queries, LLM generation incurs costs:

  • AWS Bedrock charges per token generated
  • Costs vary by provider (Nova Lite, Claude Sonnet, GPT-4)
  • Need to balance quality with budget constraints
  • Optimize cache hit rates to reduce costs

Monthly Budget Overview

Real-time budget tracking:

Monthly Budget Status: October 2025
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Used: $15.48 / Budget: $100.00
[████████░░░░░░░░░░░░░░░░░░░░░] 15.5%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Remaining: $84.52
Status: NORMAL ✅
Next Alert: 25% ($25.00)

Budget Status Levels

| Status | Threshold | Action | Color |
|---|---|---|---|
| Normal | 0-25% | No action needed | Green |
| Low | 25-50% | Monitor usage | Yellow |
| Medium | 50-75% | Review usage patterns | Orange |
| High | 75-100% | Urgent optimization needed | Red |
| Over Budget | 100-150% | Alerts sent, continue with caution | Red |
| Blocked | >150% | Hard limit reached, LLM disabled | Gray |

Alert Thresholds

Automatic alerts trigger at:

  • 25% Used: First warning - Monitor usage patterns
  • 50% Used: Mid-month check - Review if on track
  • 75% Used: Urgent attention - Optimize immediately
  • 100% Used: Budget exceeded - Hard limit approaching
  • 150% Used: Hard limit reached - LLM Layer 3 disabled until next month
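
Both the status level and the next alert follow directly from the percentage of budget used. A minimal sketch of that mapping, using the thresholds from the table and list above (function and variable names are illustrative, not a platform API):

def budget_status(used: float, budget: float) -> tuple[str, int | None]:
    """Map budget usage to a status level and the next alert threshold described above."""
    pct = 100.0 * used / budget
    levels = [(25, "Normal"), (50, "Low"), (75, "Medium"),
              (100, "High"), (150, "Over Budget"), (float("inf"), "Blocked")]
    status = next(label for limit, label in levels if pct < limit)
    next_alert = next((a for a in (25, 50, 75, 100, 150) if pct < a), None)
    return status, next_alert

monthly_budget = 100.00
status, next_alert = budget_status(used=15.48, budget=monthly_budget)
print(f"Status: {status}")                                              # Normal
if next_alert is not None:
    print(f"Next alert at {next_alert}% (${monthly_budget * next_alert / 100:.2f})")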

Total Cost Tracking

Key cost metrics for the selected time period:

Cost Summary Display

Total LLM Cost: $15.48
├─ Total Requests: 258
├─ Average Cost per Request: $0.06
├─ Successful Generations: 255 (98.8%)
└─ Failed Generations: 3 (1.2%)

Cost Per Request

  • Average: $0.06 per LLM generation
  • Range: $0.03 (Nova Lite) to $0.10 (GPT-4)
  • Used for ROI calculations
  • Compared against time saved

Different AI providers have different costs:

Cost by Provider

| Provider | Cost per Request | Total Cost | Total Requests | Use Case |
|---|---|---|---|---|
| Nova Lite | $0.03 | $2.32 | 77 | Simple incidents, standard patterns |
| Claude Sonnet | $0.06 | $12.45 | 207 | Complex technical issues (default) |
| GPT-4 | $0.10 | $0.71 | 7 | Highly complex, multi-system issues |
| Total | - | $15.48 | 258 | - |

Provider Selection Logic

The system automatically selects providers based on:

  • Incident Complexity: Simple vs complex technical issues
  • Context Size: Amount of incident details and logs
  • Historical Success: Which provider worked best for similar issues
  • Budget Status: Prefer cheaper providers when budget is low
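
The exact selection rules are not documented here, but the criteria above suggest a heuristic roughly along these lines. This is a sketch, not the actual implementation; the thresholds and provider identifiers are illustrative:

def pick_provider(complexity: str, context_tokens: int, budget_pct_used: float) -> str:
    """Illustrative provider-selection heuristic; not the actual Overwatch logic."""
    if budget_pct_used >= 75:                      # prefer the cheapest option when budget is tight
        return "nova-lite"
    if complexity == "simple" and context_tokens < 2_000:
        return "nova-lite"                         # $0.03 per request
    if complexity == "highly_complex" or context_tokens > 8_000:
        return "gpt-4"                             # $0.10, reserved for multi-system issues
    return "claude-sonnet"                         # $0.06, default for complex technical issues

print(pick_provider("complex", context_tokens=3_500, budget_pct_used=15.5))  # claude-sonnet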

Weaviate Hit Rate vs LLM Fallback

The most important cost optimization metric:

Search Efficiency (Last 30 Days)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Weaviate Hit Rate: 87.5% (FREE) ✅
[████████████████████████████░░░░]
LLM Fallback Rate: 12.5% (PAID) 💰
[████░░░░░░░░░░░░░░░░░░░░░░░░░░░░]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What This Means

  • 87.5% Hit Rate: Most searches find solutions in free Weaviate database
  • 12.5% Fallback: Only 1 in 8 searches requires paid LLM generation
  • Target: Maintain >85% Weaviate hit rate through continuous learning

Cost Savings from Caching

If all searches used LLM (no Weaviate):

  • Hypothetical cost: $123.84 per month
  • Actual cost with Weaviate: $15.48 per month
  • Savings: $108.36 (87.5% cost reduction)
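
The savings figure follows from the fallback rate and the average per-request cost. A quick check of the arithmetic, using the numbers from the examples above:

llm_requests = 258                # paid Layer 3 generations (12.5% of all searches)
fallback_rate = 0.125
avg_cost_per_request = 0.06

total_searches = llm_requests / fallback_rate               # ≈ 2,064 searches
hypothetical_cost = total_searches * avg_cost_per_request   # if every search hit the LLM
actual_cost = llm_requests * avg_cost_per_request

print(f"Hypothetical: ${hypothetical_cost:.2f}, actual: ${actual_cost:.2f}, "
      f"saved: ${hypothetical_cost - actual_cost:.2f}")
# → Hypothetical: $123.84, actual: $15.48, saved: $108.36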

Request Metrics

Detailed usage breakdown:

Usage Statistics (Last 30 Days)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total Requests: 258
Successful Generations: 255 (98.8%)
Failed Generations: 3 (1.2%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total Tokens Used: 322,500
Avg Tokens per Request: 1,250
Avg Generation Time: 2.3 seconds
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Token Usage

  • Input tokens: Context from incident (description, logs, metrics)
  • Output tokens: Generated solution steps and explanations
  • Total tokens: Input + output (what you’re charged for)

Generation Time

  • Average time for AI to generate solution
  • Typical range: 1.5-3.5 seconds
  • Longer for complex incidents
  • Includes thinking and verification time

Track which incidents generated the most expensive AI solutions:

Top 5 Most Expensive AI Solutions

1. Kubernetes Multi-Cluster Networking Issue
Provider: GPT-4 | Cost: $0.10
Date: Oct 5, 2025
Reason: Complex multi-system diagnosis requiring extensive context
2. Database Replication Failure
Provider: Claude Sonnet | Cost: $0.08
Date: Oct 8, 2025
Reason: Large log file analysis and correlation
3. Memory Leak Investigation
Provider: Claude Sonnet | Cost: $0.07
Date: Oct 12, 2025
Reason: Multiple service interaction analysis
4. API Gateway Timeout Pattern
Provider: Claude Sonnet | Cost: $0.07
Date: Oct 15, 2025
Reason: Historical pattern analysis across services
5. Certificate Chain Validation Error
Provider: Claude Sonnet | Cost: $0.06
Date: Oct 18, 2025
Reason: Security context and compliance review

Actionable Insights

  • Recurring expensive incidents: Create dedicated procedures to avoid future LLM costs
  • High-cost patterns: Add these solutions to Weaviate for free future searches
  • Provider selection: Review if GPT-4 usage was necessary or if Claude Sonnet would suffice

Monthly Cost Forecast

Based on current usage patterns:

Budget Projection
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Current Month: October 2025
Days Elapsed: 18 days
Days Remaining: 13 days
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Current Cost: $15.48
Daily Average: $0.86
Projected Month-End: $23.22
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Monthly Budget: $100.00
Projected Usage: 23.2%
Status: On Track ✅
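
A straight-line extrapolation of the daily average gives a quick sanity check on the projection. This is a naive sketch; the dashboard's forecast may weight recent usage differently, so its number can differ from this estimate:

def project_month_end(current_cost: float, days_elapsed: int, days_remaining: int) -> tuple[float, float]:
    """Naive straight-line projection from the daily average so far."""
    daily_avg = current_cost / days_elapsed
    projected = current_cost + daily_avg * days_remaining
    return daily_avg, projected

daily_avg, projected = project_month_end(current_cost=15.48, days_elapsed=18, days_remaining=13)
budget = 100.00
print(f"Daily average: ${daily_avg:.2f}")
print(f"Projected month-end: ${projected:.2f} ({100 * projected / budget:.1f}% of budget)")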

Projection Scenarios

Best Case (Current pace maintained)

  • Projected Cost: $23.22
  • Budget Remaining: $76.78
  • Status: Well within budget

Expected Case (Historical average)

  • Projected Cost: $28.50
  • Budget Remaining: $71.50
  • Status: Normal usage pattern

Worst Case (Peak usage sustained)

  • Projected Cost: $42.80
  • Budget Remaining: $57.20
  • Status: Higher than typical but acceptable

Reduce LLM Costs

  1. Improve Incident Descriptions

    • Better descriptions help Weaviate find matches
    • Reduces need for LLM fallback
    • Use consistent terminology
  2. Create Procedures from Solutions

    • Convert successful LLM solutions to procedures
    • Future similar incidents use procedures (free)
    • Builds organizational knowledge base
  3. Update Weaviate Database

    • Add high-cost solutions to Weaviate
    • Improves hit rate for similar future incidents
    • One-time LLM cost, infinite free reuse
  4. Use Template Matching

    • Leverage incident templates
    • Better Weaviate query construction
    • Higher confidence scores = less LLM fallback
  5. Review Provider Selection

    • Nova Lite for simple incidents ($0.03)
    • Claude Sonnet for standard incidents ($0.06)
    • GPT-4 only for highly complex incidents ($0.10)

Create personalized views for your specific needs:

Access Dashboard Configuration

Dashboard → Analytics → Custom Dashboards → Create New

Dashboard Builder

Select metrics and visualizations:

Available Widgets

  • Incident volume trends (line chart)
  • Severity breakdown (pie chart)
  • Resolution time distribution (histogram)
  • Top procedures by usage (bar chart)
  • Team performance comparison (table)
  • LLM cost trends (line chart)
  • Success rate over time (line chart)
  • Procedure execution heatmap (calendar view)

Configuration Options

  • Time range (defaults to 30 days)
  • Refresh interval (manual, 5 min, 15 min, 30 min, 1 hour)
  • Widget size and positioning
  • Color schemes and themes
  • Data filters and groupings

Dashboard Persistence

Save your custom layouts:

Save Options

  • Personal Dashboard: Private to your account
  • Team Dashboard: Shared with your team
  • Organization Dashboard: Available to entire organization
  • Default Dashboard: Replace standard analytics view

Naming Conventions

✅ Good Names:
- "On-Call Weekly Report"
- "Database Team Performance"
- "LLM Cost Tracking - Q4 2025"
❌ Poor Names:
- "Dashboard 1"
- "My Analytics"
- "Untitled"

Collaboration Features

Share Dashboard

Custom Dashboard → Share → Select Recipients

Sharing Options

  • View Only: Recipients can see but not modify
  • Edit Access: Recipients can customize their copy
  • Public Link: Generate shareable URL (read-only)

Use Cases

  • Weekly team review meetings
  • Manager reports for leadership
  • On-call handoff summaries
  • Incident response post-mortems

Extract analytics data for external reporting and analysis:

Available Export Formats

| Format | Use Case | Size Limit | Includes |
|---|---|---|---|
| JSON | API integration, programmatic analysis | 100 MB | Full detail with metadata |
| CSV | Excel analysis, reporting tools | 50 MB | Tabular data only |
| PDF | Executive reports, presentations | 20 MB | Visualizations and summaries |
| Excel | Advanced spreadsheet analysis | 50 MB | Multiple sheets with formatting |

Quick Export

From any analytics page:

Analytics Page → Export Button → Select Format

Exported Data Includes

  • All visible metrics for selected time range
  • Chart data and visualizations
  • Filter configurations
  • Export timestamp and user
  • Data source information

Custom Export

For specific data subsets:

Analytics → Export → Custom Export → Configure

Custom Export Options

  • Select specific metrics and dimensions
  • Choose date range (custom start/end dates)
  • Filter by tags, categories, teams
  • Include or exclude specific data points
  • Schedule recurring exports (daily, weekly, monthly)

Programmatic Data Access

Access analytics data via API:

Available Endpoints

GET /api/v1/analytics/incidents?period_days=30
GET /api/v1/analytics/procedures?period_days=30
GET /api/v1/analytics/performance?period_days=30
GET /api/v1/analytics/llm/costs?period_days=30

Authentication

  • Use personal API key or service account key
  • Include in Authorization header: Bearer YOUR_API_KEY
  • Rate limits: 100 requests per minute

Example API Request

curl -X GET "https://your-instance.overwatch.com/api/v1/analytics/incidents?period_days=30" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Accept: application/json"

Example Response

{
  "total_incidents": 45,
  "total_resolved": 38,
  "mttr": 4.5,
  "mtta": 0.3,
  "severity_breakdown": {
    "critical": 3,
    "high": 12,
    "medium": 22,
    "low": 8
  },
  "period_start": "2025-09-15T00:00:00Z",
  "period_end": "2025-10-15T00:00:00Z"
}
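
The same data can be pulled with any HTTP client. A minimal Python equivalent of the curl request above, using the requests library (the instance URL and key are placeholders):

import requests

BASE_URL = "https://your-instance.overwatch.com"
API_KEY = "YOUR_API_KEY"   # personal API key or service account key

resp = requests.get(
    f"{BASE_URL}/api/v1/analytics/incidents",
    params={"period_days": 30},
    headers={"Authorization": f"Bearer {API_KEY}", "Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

print(f"MTTR: {data['mttr']}h across {data['total_incidents']} incidents "
      f"({data['total_resolved']} resolved)")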

Automated Reporting

Generate and deliver reports automatically:

Report Configuration

Analytics → Reports → New Scheduled Report

Scheduling Options

  • Frequency: Daily, weekly, monthly, quarterly
  • Day/Time: Specific day of week/month and time
  • Recipients: Email addresses or Slack channels
  • Format: PDF for executives, CSV for analysis
  • Filters: Specific teams, categories, or metrics

Example Weekly Report

Report Name: "Weekly Team Performance Report"
Frequency: Every Monday at 9:00 AM
Recipients: team-leads@company.com, #engineering-metrics
Format: PDF with charts and CSV data attachment
Content:
- Incident volume and trends
- MTTR and MTTA
- Team performance summary
- Top 5 procedures executed
- Key insights and recommendations

Transform data into actionable insights:

Pattern Recognition

Analytics reveal trends that need attention:

High-Frequency Incidents

Pattern: Database connection timeouts spike every Monday morning
Analysis: Weekend deployment causes connection pool exhaustion
Action: Implement connection pool warming after deployments
Result: 78% reduction in Monday morning incidents

Resolution Time Degradation

Pattern: Average MTTR increasing from 3.2h to 5.8h over 3 months
Analysis: New team members lack training on key procedures
Action: Mandatory procedure training for all engineers
Result: MTTR improved to 3.6h within 2 months

Low Success Rate Procedures

Pattern: "Database Migration" procedure fails 42% of the time
Analysis: Missing validation steps and rollback guidance
Action: Revise procedure with detailed verification and rollback
Result: Success rate improved to 96%

Data-Driven Optimization

Procedure Creation from Patterns

  1. Identify incidents occurring ≥3 times per month
  2. Review resolution steps from incident history
  3. Create standardized procedure from successful resolutions
  4. Measure incident recurrence after procedure deployment
  5. Track time savings and success rate
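
Step 1 of this loop — finding incident types that recur at least three times in a month — is easy to script against exported incident data. A sketch; the threshold of 3 comes from the process above, while the data shape and category names are illustrative:

from collections import Counter

# Illustrative export: (category, month) pairs, one per incident
incident_log = [
    ("Redis Cache Failure", "2025-10"), ("Redis Cache Failure", "2025-10"),
    ("Redis Cache Failure", "2025-10"), ("DNS Propagation Delay", "2025-10"),
    ("Redis Cache Failure", "2025-10"),
]

counts = Counter(incident_log)
candidates = [(category, month, n) for (category, month), n in counts.items() if n >= 3]

for category, month, n in candidates:
    print(f"{category}: {n} occurrences in {month} → candidate for a dedicated procedure")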

Example Success Story

Incident: "Redis Cache Failure" - 8 occurrences per month
Before Procedure: Avg resolution time 45 minutes
After Procedure: Avg resolution time 8 minutes
Time Saved: 296 minutes per month (4.9 hours)
LLM Cost Avoided: $1.20 per month (now uses Weaviate)

Training Prioritization

Use analytics to guide training investments:

Low Performer Identification

  • Not for punishment, but for support
  • Target training to actual skill gaps
  • Measure improvement post-training
  • Pair with high performers for mentoring

High-Value Training Areas

Analysis: 68% of database incidents take >2 hours
Insight: Database troubleshooting skills need improvement
Action: Database performance workshop for entire team
Result: Database incident MTTR reduced by 43%

Demonstrate Platform Value

Quantify benefits with analytics:

Time Savings Calculation

Manual Incident Resolution (Before Overwatch)
├─ Avg Resolution Time: 6.2 hours
├─ Monthly Incidents: 45
└─ Total Manual Time: 279 hours/month
With Overwatch Procedures (After)
├─ Avg Resolution Time: 3.8 hours
├─ Monthly Incidents: 45
└─ Total Procedural Time: 171 hours/month
Time Saved: 108 hours/month (17.6 work days)

Cost Savings Calculation

Engineer Cost: $75/hour (loaded rate)
Time Saved: 108 hours/month
Monthly Savings: $8,100
Annual Savings: $97,200

Additional ROI Factors

  • Reduced downtime costs (customer impact)
  • Improved team morale (less firefighting)
  • Knowledge retention (procedures capture expertise)
  • Faster onboarding (new engineers productive sooner)
  • Audit compliance (complete execution records)

LLM Layer 3 ROI

LLM Generation Cost: $15.48/month
Time Saved by AI Solutions: 32 hours/month
Value of Time Saved: $2,400/month
ROI: 15,500% ($2,400 saved per $15.48 spent)

Daily Review (On-Call Lead)

  • Active incidents status
  • New high-severity incidents
  • Team workload distribution
  • LLM budget status (if admin)

Weekly Review (Engineering Managers)

  • Incident volume trends
  • MTTR and MTTA performance
  • Procedure success rates
  • Team performance highlights
  • LLM cost tracking

Monthly Review (Team)

  • Comprehensive analytics review
  • Process improvement identification
  • Procedure updates and archival
  • Training needs assessment
  • Celebrate wins and improvements

Quarterly Review (Leadership)

  • Long-term trend analysis
  • ROI reporting and budget justification
  • Strategic initiatives from analytics insights
  • Team capacity planning
  • Tool and process investment decisions

Focus on What Matters

Don’t track everything - focus on actionable metrics:

Critical Metrics (Must track)

  • MTTR and MTTA
  • Incident volume trends
  • Procedure success rates
  • LLM budget usage (admins)

Important Metrics (Should track)

  • Resolution time by severity
  • Procedure execution frequency
  • Team performance indicators
  • Weaviate hit rate

Nice-to-Have Metrics (Can track)

  • Incident type distribution
  • Time-of-day patterns
  • Individual step execution times
  • Cross-team collaboration metrics

Ensure Accurate Analytics

Analytics are only as good as the data:

Data Quality Checklist

  • ✅ Incidents properly categorized and tagged
  • ✅ Resolution times accurately recorded
  • ✅ Procedure steps completed in order
  • ✅ Notes and observations documented
  • ✅ Severity levels consistently applied
  • ✅ Team assignments accurate and up-to-date

Common Data Quality Issues

Incorrect Severity Assignments

Problem: Engineers mark all incidents as "High" to get attention
Impact: Severity analytics become meaningless
Solution: Clear severity criteria and regular audits

Missing Resolution Documentation

Problem: Incidents closed without resolution notes
Impact: Can't learn from successful resolutions
Solution: Required resolution notes field

Incomplete Procedure Executions

Problem: Procedures started but not marked complete
Impact: Success rate appears artificially low
Solution: Automatic timeout and completion reminders

Data Not Appearing

Symptom: Analytics page shows “No data available”

Possible Causes and Solutions

  1. Time range too narrow
    • Try expanding to 30 or 90 days
    • Check if any incidents exist in period
  2. Filter too restrictive
    • Clear all filters and retry
    • Review filter criteria
  3. Organization has no data yet
    • Create incidents and procedures first
    • Wait for execution history to build
  4. Permission issue
    • Verify role has analytics access
    • Contact admin for permission review

Incorrect Metrics

Symptom: Numbers don’t match expected values

Possible Causes and Solutions

  1. Timezone confusion
    • Check timezone setting in profile
    • Verify organization timezone
    • Consider UTC vs local time
  2. Filter applied without notice
    • Look for active filters at top of page
    • Clear filters and refresh
  3. Calculation period mismatch
    • Verify time range selection
    • Check “Last Updated” timestamp
  4. Cached data
    • Click refresh button
    • Hard refresh browser (Ctrl+Shift+R)

Export Failures

Symptom: Export button doesn’t work or file corrupted

Possible Causes and Solutions

  1. Data set too large
    • Reduce time range
    • Select specific metrics instead of all
    • Use API for large exports
  2. Browser popup blocked
    • Allow popups for Overwatch domain
    • Check browser download settings
  3. Network timeout
    • Try smaller export
    • Download during off-peak hours
    • Use scheduled reports instead

LLM Cost Discrepancies

Symptom: Reported costs don’t match AWS billing

Possible Causes and Solutions

  1. Billing period mismatch
    • Overwatch uses UTC months
    • AWS billing may use different timezone
    • Compare actual date ranges
  2. Multiple organizations
    • Ensure viewing correct organization
    • Check organization selector at top
  3. Delayed billing data
    • AWS billing may lag 24-48 hours
    • Compare costs after billing cycle closes
  4. Non-LLM AWS costs
    • Overwatch only tracks Bedrock LLM costs
    • Other AWS services billed separately

Slow Dashboard Loading

Solutions

  • Reduce time range (90d → 30d)
  • Refresh less frequently (disable auto-refresh)
  • Close unused browser tabs
  • Clear browser cache
  • Disable chart animations in settings

Chart Rendering Problems

Solutions

  • Update browser to latest version
  • Disable browser extensions temporarily
  • Try different browser (Chrome, Firefox, Safari)
  • Reduce chart data points (shorter time range)
  • Contact support if issue persists

Getting Help

  • In-App Help: Click the ? icon in the analytics dashboard for contextual help
  • Keyboard Shortcuts: Press ? key to see analytics shortcuts
  • API Documentation: Visit /docs for programmatic access to analytics data
  • Support: Contact your system administrator for organization-specific questions

Last updated: October 2025