Analytics Dashboard
The Analytics Dashboard provides comprehensive insights into incident response performance, procedure effectiveness, team productivity, and AI-powered solution costs. Use these metrics to identify improvement opportunities, optimize workflows, and demonstrate ROI.
Analytics Overview
Access the Analytics Dashboard:
Dashboard → Analytics
What You Can Track
The analytics system provides four main categories of insights:
| Category | Key Metrics | Access Level |
|---|---|---|
| Incident Analytics | MTTR, MTTA, resolution trends, severity distribution | All users |
| Procedure Analytics | Success rates, execution times, usage patterns | All users |
| Team Performance | Individual metrics, team comparisons, workload distribution | Managers and above |
| LLM Cost Monitoring | AI generation costs, budget tracking, efficiency metrics | Admins and above |
Role-Based Visibility
- Engineers: View personal metrics, team-wide incident and procedure analytics
- Managers: All Engineer access plus team performance comparisons and detailed reports
- Admins: Full access including LLM cost analytics and organization-wide metrics
Time Range Selection
All analytics pages support flexible time range selection:
- 24 Hours: Real-time operational view
- 7 Days: Weekly trends and patterns
- 30 Days: Monthly performance analysis (default)
- 90 Days: Quarterly trends and long-term patterns
Incident Analytics
Detailed metrics about incident response and resolution:
Access Incident Analytics
Dashboard → Analytics → Incident Analytics
Key Incident Metrics
Mean Time To Resolution (MTTR)
- Average time from incident creation to resolution
- Calculated per severity level (Critical, High, Medium, Low)
- Trend analysis shows improvement or degradation over time
- Industry benchmark: Critical incidents < 1 hour, High < 4 hours
Example MTTR Display
```
Overall MTTR: 4.2 hours
├─ Critical: 0.8 hours (target: < 1 hour) ✅
├─ High: 3.2 hours (target: < 4 hours) ✅
├─ Medium: 6.1 hours (target: < 8 hours) ✅
└─ Low: 12.5 hours (target: < 24 hours) ✅
```
Mean Time To Acknowledge (MTTA)
- Time from incident creation to first acknowledgment
- Measures initial response speed
- Critical for on-call and alerting effectiveness
- Target: < 15 minutes for Critical severity
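Both MTTR and MTTA reduce to simple averages over incident timestamps. A minimal sketch of the calculation, assuming a hypothetical record shape (`created`, `acknowledged`, and `resolved` are illustrative field names, not Overwatch's actual data model):

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are illustrative only.
incidents = [
    {"severity": "Critical",
     "created": datetime(2025, 10, 1, 9, 0),
     "acknowledged": datetime(2025, 10, 1, 9, 10),
     "resolved": datetime(2025, 10, 1, 9, 48)},
    {"severity": "High",
     "created": datetime(2025, 10, 2, 14, 0),
     "acknowledged": datetime(2025, 10, 2, 14, 20),
     "resolved": datetime(2025, 10, 2, 17, 12)},
]

def mttr_hours(records):
    """Mean time from creation to resolution, in hours."""
    return mean((r["resolved"] - r["created"]).total_seconds() / 3600 for r in records)

def mtta_minutes(records):
    """Mean time from creation to first acknowledgment, in minutes."""
    return mean((r["acknowledged"] - r["created"]).total_seconds() / 60 for r in records)

critical = [r for r in incidents if r["severity"] == "Critical"]
print(f"MTTR overall: {mttr_hours(incidents):.1f} h")
print(f"MTTA (Critical): {mtta_minutes(critical):.1f} min (target: < 15 min)")
```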
Severity Breakdown
Visual distribution of incidents by severity:
| Severity | Count | Percentage | Resolution Time |
|---|---|---|---|
| Critical | 3 | 6.7% | 0.8h average |
| High | 12 | 26.7% | 3.2h average |
| Medium | 22 | 48.9% | 6.1h average |
| Low | 8 | 17.8% | 12.5h average |
Status Distribution
Current state of incidents in the system:
- Open: 4 incidents - Newly created, awaiting assignment
- In Progress: 3 incidents - Active investigation and resolution
- Resolved: 35 incidents - Solution implemented, awaiting verification
- Closed: 3 incidents - Fully resolved and documented
Incident Volume Trends
Daily Incident Trends
Line chart showing incident creation over time:
- Identify peak incident periods
- Detect anomalous spikes requiring investigation
- Track effectiveness of preventive measures
- Forecast capacity planning needs
Pattern Analysis
Automated pattern detection identifies:
- Recurring Incidents: Same root cause appearing multiple times
- Time-Based Patterns: Incidents occurring at specific times (e.g., after deployments)
- Cascading Failures: Multiple related incidents from single root cause
- Seasonal Trends: Periodic patterns related to business cycles
Resolution Time Analysis
Resolution Time Trends
Track how resolution efficiency changes over time:
- Average resolution time per day/week
- Comparison against baseline and targets
- Impact of process improvements
- Training effectiveness measurement
Longest Incidents
List of incidents taking the most time to resolve:
```
Top 5 Longest Incidents:
1. Database connection timeout - 12.5 hours (Critical)
2. API gateway overload - 8.2 hours (High)
3. Memory leak investigation - 7.8 hours (High)
4. DNS propagation delay - 6.5 hours (Medium)
5. Certificate expiration - 5.2 hours (Medium)
```
Use Case: Identify incidents needing procedure improvements or better documentation.
Incident Type Breakdown
Most frequent incident categories:
| Type | Count | % of Total | Avg Resolution Time |
|---|---|---|---|
| Database Performance | 12 | 26.7% | 4.2 hours |
| Resource Exhaustion | 10 | 22.2% | 5.1 hours |
| API Latency | 8 | 17.8% | 3.8 hours |
| Network Connectivity | 6 | 13.3% | 6.2 hours |
| Service Outage | 5 | 11.1% | 8.5 hours |
| Security Breach | 2 | 4.4% | 15.2 hours |
| Other | 2 | 4.4% | 3.2 hours |
Actionable Insights
- High-frequency types: Create dedicated procedures or automation
- High-duration types: Improve documentation and training
- Recurring types: Investigate root causes for permanent fixes
Procedure Analytics
Performance metrics for runbook procedures and execution tracking:
Access Procedure Analytics
Dashboard → Analytics → Procedure Analytics
Procedure Performance Metrics
Overall Success Rate
- Percentage of successful procedure executions
- Calculated across all procedures
- Target: > 90% success rate
- Trend analysis shows execution quality improvements
Example Display
```
Success Rate: 87.1%
├─ Successful Executions: 298
├─ Failed Executions: 44
├─ Total Executions: 342
└─ Trend: ↑ 3.2% vs last month
```
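The display above follows directly from the definition: successful executions divided by total, with the trend taken against the previous period. A minimal sketch (last month's counts are invented for illustration):

```python
def success_rate(successful: int, failed: int) -> float:
    """Percentage of executions that succeeded."""
    total = successful + failed
    return 100.0 * successful / total if total else 0.0

current = success_rate(298, 44)   # 87.1%, matching the display above
previous = success_rate(251, 48)  # hypothetical last-month counts
print(f"Success Rate: {current:.1f}% (trend: {current - previous:+.1f} pts vs last month)")
```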
Average Execution Time
- Mean time to complete procedure executions
- Compared against estimated duration
- Used for capacity planning
- Identifies procedures needing optimization
Usage Statistics
- Total number of procedures in library
- Total executions in time period
- Most frequently executed procedures
- Procedures never executed (candidates for archival)
Most Used Procedures
Top procedures by execution frequency:
```
Top 5 Most Executed Procedures:
1. Database Performance Recovery
   - Executions: 45
   - Success Rate: 91.1%
   - Avg Duration: 12.3 minutes
2. API Gateway Restart
   - Executions: 38
   - Success Rate: 94.7%
   - Avg Duration: 8.5 minutes
3. Memory Cleanup Process
   - Executions: 32
   - Success Rate: 84.4%
   - Avg Duration: 15.2 minutes
4. Cache Invalidation
   - Executions: 28
   - Success Rate: 96.4%
   - Avg Duration: 3.1 minutes
5. Certificate Renewal
   - Executions: 24
   - Success Rate: 100.0%
   - Avg Duration: 6.8 minutes
```
Strategic Actions
- High-execution procedures: Candidates for automation
- Low success rate: Need revision or better error handling
- Long duration: Opportunities for step optimization
Highest Success Rate Procedures
Most reliable procedures:
| Procedure | Success Rate | Executions | Notes |
|---|---|---|---|
| SSL Certificate Renewal | 100.0% | 12 | Fully automated verification |
| Load Balancer Health Check | 98.5% | 67 | Clear success criteria |
| DNS Update | 97.8% | 45 | Well-documented rollback |
| Cache Clear | 96.4% | 89 | Simple, low-risk operation |
| Log Rotation | 95.2% | 34 | Scheduled maintenance |
Best Practices Identified
- Clear verification steps increase success rate
- Automated checks reduce human error
- Rollback capabilities increase confidence
- Regular testing maintains reliability
Procedure Execution Time Analysis
Execution Time Distribution
- Fastest 10% of executions
- Average execution time
- Slowest 10% of executions
- Variance analysis (consistency measure)
Longest Procedures
Procedures requiring most time:
```
Longest Average Execution Time:
1. Full System Recovery - 45.2 minutes (3 executions)
2. Database Migration - 38.7 minutes (7 executions)
3. Blue-Green Deployment - 32.4 minutes (12 executions)
```
Time Comparison: Estimated vs Actual
- Procedures consistently exceeding estimates need revised time estimates
- Procedures completing faster indicate potential over-estimation
- Large variance suggests inconsistent execution or environment differences
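One way to operationalize these checks is to flag procedures whose actual durations drift from the estimate or vary widely between runs. A sketch over hypothetical execution history, with arbitrary 20% drift and 30% variance thresholds:

```python
from statistics import mean, stdev

# Hypothetical history: estimated minutes plus the actual duration of each run.
history = {
    "Database Migration": (30, [38.1, 44.0, 34.0]),
    "Cache Invalidation": (5, [3.2, 2.9, 3.3]),
}

for name, (estimate, actuals) in history.items():
    avg, spread = mean(actuals), stdev(actuals)
    if avg > estimate * 1.2:
        note = "consistently exceeds estimate; revise the time estimate"
    elif avg < estimate * 0.8:
        note = "finishes well under estimate; likely over-estimated"
    else:
        note = "estimate looks accurate"
    if spread > 0.3 * avg:
        note += " (high variance: inconsistent execution or environments)"
    print(f"{name}: est {estimate} min, actual {avg:.1f} ± {spread:.1f} min -> {note}")
```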
AI Recommendation Accuracy
AI-Powered Insights
Track effectiveness of AI-suggested procedures:
Metrics Tracked
- AI Accuracy: 88.5% - Percentage of AI suggestions that successfully resolve incidents
- AI Confidence Average: 0.82 - Average confidence score of AI recommendations
- Suggestion Acceptance Rate: 76.3% - How often engineers choose AI-suggested procedures
- Time Saved: 3.2 hours average per incident using AI suggestions
Learning Loop Impact
- Accuracy improves as system learns from successful resolutions
- Confidence scores correlate with actual success rates
- Low-confidence suggestions prompt human expertise
- Failed suggestions improve future recommendations
Team Performance
Individual and team metrics for performance tracking and training identification:
Access Team Performance
Dashboard → Analytics → Performance Analytics
Note: Team performance details are visible to Managers and Admins only. Engineers see their own metrics.
Individual Metrics
Performance tracking per team member:
Example Team Performance Table
| Team Member | Incidents Assigned | Resolved | Avg Resolution Time | Success Rate |
|---|---|---|---|---|
| John Doe | 15 | 13 | 3.8 hours | 86.7% |
| Jane Smith | 18 | 17 | 4.2 hours | 94.4% |
| Mike Johnson | 12 | 10 | 5.1 hours | 83.3% |
| Sarah Williams | 14 | 13 | 3.2 hours | 92.9% |
Metrics Explained
Incidents Assigned
- Total incidents assigned to team member
- Includes all severities and statuses
- Used for workload balancing
Incidents Resolved
- Successfully resolved incidents
- Excludes reassigned or escalated incidents
- Success rate = Resolved / Assigned
Average Resolution Time
- Mean time from assignment to resolution
- Adjusted for incident severity
- Lower is generally better (but context matters)
Success Rate
- Percentage of incidents resolved without escalation
- High success rate indicates expertise and effectiveness
- Low success rate may indicate training needs or complexity
Team Comparisons
Workload Distribution
Visualize assignment balance across team:
- Identify overloaded team members
- Ensure fair distribution of on-call responsibilities
- Plan capacity for vacation and sick time
- Detect skill-based assignment patterns
Performance Rankings
Not for competition, but for identifying:
- Top performers to mentor others
- Team members needing additional training
- Expertise areas for knowledge sharing
- Recognition opportunities
Training Opportunities
Skill Gap Identification
Analytics reveal areas where training would help:
Low Success Rate in Specific Categories
```
Database Incidents - John Doe: 62% success rate
→ Recommendation: Database troubleshooting training

Network Issues - Mike Johnson: 58% success rate
→ Recommendation: Network fundamentals course
```
Long Resolution Times
```
API Performance - All team members: 8.2h average
→ Recommendation: Create dedicated procedure or documentation
```
Procedure Usage Patterns
- Team members not using certain procedures: May not know they exist
- Team members with high procedure success rates: Good mentors for others
- Procedures never executed by certain members: Training opportunity
Team Collaboration Metrics
Response Time Metrics
- Average First Response Time: 8.5 minutes - How quickly incidents are acknowledged
- Average Resolution Time: 4.3 hours - Team-wide incident resolution speed
- Escalation Rate: 12.4% - Incidents requiring manager or expert intervention
Collaboration Indicators
- Incidents with multiple team members involved
- Comment and mention activity levels
- Knowledge sharing through procedure updates
- Cross-training effectiveness
LLM Cost Monitoring
Track AI-powered solution generation costs and optimize budget usage:
Access LLM Cost Analytics
Dashboard → Analytics → LLM Costs
Note: LLM cost analytics are visible to Admins and Organization Owners only.
Understanding LLM Layer 3
What is LLM Layer 3?
When vector database search (Weaviate) doesn’t find suitable solutions with high confidence, the system falls back to AI-powered solution generation using AWS Bedrock:
3-Layer Search Architecture
- Layer 1 (Phase 2): Customer-specific solutions from private vector database
- Layer 2 (Current): Public community solutions from Weaviate vector database
- Layer 3 (Current): AI-generated solutions via AWS Bedrock (fallback)
Why Monitor LLM Costs?
Unlike free vector database queries, LLM generation incurs costs:
- AWS Bedrock charges per token generated
- Costs vary by provider (Nova Lite, Claude Sonnet, GPT-4)
- Need to balance quality with budget constraints
- Optimize cache hit rates to reduce costs
Budget Status Dashboard
Monthly Budget Overview
Real-time budget tracking:
```
Monthly Budget Status: October 2025
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Used: $15.48 / Budget: $100.00
[████████░░░░░░░░░░░░░░░░░░░░░] 15.5%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Remaining: $84.52
Status: NORMAL ✅
Next Alert: 25% ($25.00)
```
Budget Status Levels
| Status | Threshold | Action | Color |
|---|---|---|---|
| Normal | 0-25% | No action needed | Green |
| Low | 25-50% | Monitor usage | Yellow |
| Medium | 50-75% | Review usage patterns | Orange |
| High | 75-100% | Urgent optimization needed | Red |
| Over Budget | 100-150% | Alerts sent, continue with caution | Red |
| Blocked | >150% | Hard limit reached, LLM disabled | Gray |
Alert Thresholds
Automatic alerts trigger at:
- 25% Used: First warning - Monitor usage patterns
- 50% Used: Mid-month check - Review if on track
- 75% Used: Urgent attention - Optimize immediately
- 100% Used: Budget exceeded - Hard limit approaching
- 150% Used: Hard limit reached - LLM Layer 3 disabled until next month
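The status levels and alert thresholds above amount to a simple lookup on percent-of-budget used. A minimal sketch mirroring the table (this is an illustration, not an actual Overwatch API):

```python
def budget_status(used: float, budget: float) -> str:
    """Map spend against budget to the status levels described above."""
    pct = 100.0 * used / budget
    if pct > 150:
        return "BLOCKED"      # hard limit reached: LLM Layer 3 disabled
    if pct > 100:
        return "OVER BUDGET"  # alerts sent, continue with caution
    if pct > 75:
        return "HIGH"
    if pct > 50:
        return "MEDIUM"
    if pct > 25:
        return "LOW"
    return "NORMAL"

print(budget_status(15.48, 100.00))  # NORMAL, matching the dashboard above
```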
Cost Metrics
Total Cost Tracking
Key cost metrics for the selected time period:
Cost Summary Display
```
Total LLM Cost: $15.48
├─ Total Requests: 258
├─ Average Cost per Request: $0.06
├─ Successful Generations: 255 (98.8%)
└─ Failed Generations: 3 (1.2%)
```
Cost Per Request
- Average: $0.06 per LLM generation
- Range: $0.03 (Nova Lite) to $0.10 (GPT-4)
- Used for ROI calculations
- Compared against time saved
Provider Cost Breakdown
Different AI providers have different costs:
Cost by Provider
| Provider | Cost per Request | Total Cost | Total Requests | Use Case |
|---|---|---|---|---|
| Nova Lite | $0.03 | $2.32 | 77 | Simple incidents, standard patterns |
| Claude Sonnet | $0.06 | $12.45 | 207 | Complex technical issues (default) |
| GPT-4 | $0.10 | $0.71 | 7 | Highly complex, multi-system issues |
| Total | - | $15.48 | 258 | - |
Provider Selection Logic
The system automatically selects providers based on:
- Incident Complexity: Simple vs complex technical issues
- Context Size: Amount of incident details and logs
- Historical Success: Which provider worked best for similar issues
- Budget Status: Prefer cheaper providers when budget is low
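The platform's actual routing logic is internal; the sketch below is only a plausible illustration of complexity- and budget-aware selection, using the provider names and rates from the table above (the thresholds are invented):

```python
# Rates per request, mirroring the provider table above.
PROVIDER_RATES = {"nova-lite": 0.03, "claude-sonnet": 0.06, "gpt-4": 0.10}

def pick_provider(complexity: str, context_tokens: int, budget_pct_used: float) -> str:
    """Illustrative provider routing; thresholds are assumptions, not platform values."""
    if budget_pct_used > 75:                 # budget tight: prefer the cheapest model
        return "nova-lite"
    if complexity == "simple" and context_tokens < 2_000:
        return "nova-lite"
    if complexity == "highly_complex" or context_tokens > 20_000:
        return "gpt-4"
    return "claude-sonnet"                   # default for standard technical issues

print(pick_provider("simple", 900, 15.5))     # nova-lite
print(pick_provider("complex", 5_000, 15.5))  # claude-sonnet
```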
Efficiency Metrics
Weaviate Hit Rate vs LLM Fallback
The most important cost optimization metric:
```
Search Efficiency (Last 30 Days)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Weaviate Hit Rate: 87.5% (FREE) ✅
[████████████████████████████░░░░]

LLM Fallback Rate: 12.5% (PAID) 💰
[████░░░░░░░░░░░░░░░░░░░░░░░░░░░░]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
What This Means
- 87.5% Hit Rate: Most searches find solutions in free Weaviate database
- 12.5% Fallback: Only 1 in 8 searches requires paid LLM generation
- Target: Maintain >85% Weaviate hit rate through continuous learning
Cost Savings from Caching
If all searches used LLM (no Weaviate):
- Hypothetical cost: $123.84 per month
- Actual cost with Weaviate: $15.48 per month
- Savings: $108.36 (87.5% cost reduction)
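These savings follow arithmetically from the hit rate and the average generation cost. A worked sketch; the total search volume of 2,064 is not shown on the dashboard but is implied by 258 paid fallbacks at a 12.5% rate:

```python
total_searches = 2_064   # implied: 258 fallbacks / 0.125 fallback rate
hit_rate = 0.875         # fraction of searches answered free by Weaviate
avg_llm_cost = 0.06      # average cost per LLM generation (see Cost Metrics)

llm_requests = total_searches * (1 - hit_rate)    # 258 paid generations
actual = llm_requests * avg_llm_cost              # $15.48
hypothetical = total_searches * avg_llm_cost      # $123.84 if every search used the LLM
print(f"Actual: ${actual:.2f} | Without Weaviate: ${hypothetical:.2f} | "
      f"Savings: ${hypothetical - actual:.2f} ({hit_rate:.1%} reduction)")
```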
Usage Statistics
Request Metrics
Detailed usage breakdown:
```
Usage Statistics (Last 30 Days)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total Requests: 258
Successful Generations: 255 (98.8%)
Failed Generations: 3 (1.2%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total Tokens Used: 322,500
Avg Tokens per Request: 1,250
Avg Generation Time: 2.3 seconds
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
Token Usage
- Input tokens: Context from incident (description, logs, metrics)
- Output tokens: Generated solution steps and explanations
- Total tokens: Input + output (what you’re charged for)
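Bedrock-style billing prices input and output tokens at separate per-1,000-token rates, so a request's cost is a weighted sum. The rates below are placeholders for illustration only, not published prices:

```python
# Hypothetical $/1K-token rates; real rates vary by provider and are not shown here.
IN_RATE_PER_1K, OUT_RATE_PER_1K = 0.02, 0.08

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated charge for one generation: input context plus generated output."""
    return (input_tokens / 1000) * IN_RATE_PER_1K + (output_tokens / 1000) * OUT_RATE_PER_1K

print(f"${request_cost(900, 350):.3f} for a 1,250-token request")  # $0.046
```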
Generation Time
- Average time for AI to generate solution
- Typical range: 1.5-3.5 seconds
- Longer for complex incidents
- Includes thinking and verification time
Highest Cost Incidents
Track which incidents generated the most expensive AI solutions:
Top 5 Most Expensive AI Solutions
```
1. Kubernetes Multi-Cluster Networking Issue
   Provider: GPT-4 | Cost: $0.10
   Date: Oct 5, 2025
   Reason: Complex multi-system diagnosis requiring extensive context

2. Database Replication Failure
   Provider: Claude Sonnet | Cost: $0.08
   Date: Oct 8, 2025
   Reason: Large log file analysis and correlation

3. Memory Leak Investigation
   Provider: Claude Sonnet | Cost: $0.07
   Date: Oct 12, 2025
   Reason: Multiple service interaction analysis

4. API Gateway Timeout Pattern
   Provider: Claude Sonnet | Cost: $0.07
   Date: Oct 15, 2025
   Reason: Historical pattern analysis across services

5. Certificate Chain Validation Error
   Provider: Claude Sonnet | Cost: $0.06
   Date: Oct 18, 2025
   Reason: Security context and compliance review
```
Actionable Insights
- Recurring expensive incidents: Create dedicated procedures to avoid future LLM costs
- High-cost patterns: Add these solutions to Weaviate for free future searches
- Provider selection: Review if GPT-4 usage was necessary or if Claude Sonnet would suffice
Budget Projections
Monthly Cost Forecast
Based on current usage patterns:
```
Budget Projection
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Current Month: October 2025
Days Elapsed: 18 days
Days Remaining: 13 days
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Current Cost: $15.48
Daily Average: $0.86
Projected Month-End: $23.22
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Monthly Budget: $100.00
Projected Usage: 23.2%
Status: On Track ✅
```
Projection Scenarios
Best Case (Current pace maintained)
- Projected Cost: $23.22
- Budget Remaining: $76.78
- Status: Well within budget
Expected Case (Historical average)
- Projected Cost: $28.50
- Budget Remaining: $71.50
- Status: Normal usage pattern
Worst Case (Peak usage sustained)
- Projected Cost: $42.80
- Budget Remaining: $57.20
- Status: Higher than typical but acceptable
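A naive forecast extends the daily average to the end of the month, as sketched below. The dashboard's own projection may weight recent usage differently, so treat this as an approximation rather than the exact formula:

```python
import calendar
from datetime import date

def project_month_end(cost_to_date: float, today: date) -> float:
    """Linear forecast: average daily spend so far, extended over the full month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return (cost_to_date / today.day) * days_in_month

projected = project_month_end(15.48, date(2025, 10, 18))
print(f"Projected month-end: ${projected:.2f} of $100.00 budget")
```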
Cost Optimization Tips
Reduce LLM Costs
1. Improve Incident Descriptions
   - Better descriptions help Weaviate find matches
   - Reduces need for LLM fallback
   - Use consistent terminology
2. Create Procedures from Solutions
   - Convert successful LLM solutions to procedures
   - Future similar incidents use procedures (free)
   - Builds organizational knowledge base
3. Update Weaviate Database
   - Add high-cost solutions to Weaviate
   - Improves hit rate for similar future incidents
   - One-time LLM cost, infinite free reuse
4. Use Template Matching
   - Leverage incident templates
   - Better Weaviate query construction
   - Higher confidence scores = less LLM fallback
5. Review Provider Selection
   - Nova Lite for simple incidents ($0.03)
   - Claude Sonnet for standard incidents ($0.06)
   - GPT-4 only for highly complex incidents ($0.10)
Custom Dashboards
Create personalized views for your specific needs:
Access Dashboard Configuration
Dashboard → Analytics → Custom Dashboards → Create New
Creating Custom Views
Dashboard Builder
Select metrics and visualizations:
Available Widgets
- Incident volume trends (line chart)
- Severity breakdown (pie chart)
- Resolution time distribution (histogram)
- Top procedures by usage (bar chart)
- Team performance comparison (table)
- LLM cost trends (line chart)
- Success rate over time (line chart)
- Procedure execution heatmap (calendar view)
Configuration Options
- Time range (default to 30 days)
- Refresh interval (manual, 5 min, 15 min, 30 min, 1 hour)
- Widget size and positioning
- Color schemes and themes
- Data filters and groupings
Saving Configurations
Dashboard Persistence
Save your custom layouts:
Save Options
- Personal Dashboard: Private to your account
- Team Dashboard: Shared with your team
- Organization Dashboard: Available to entire organization
- Default Dashboard: Replace standard analytics view
Naming Conventions
```
✅ Good Names:
- "On-Call Weekly Report"
- "Database Team Performance"
- "LLM Cost Tracking - Q4 2025"

❌ Poor Names:
- "Dashboard 1"
- "My Analytics"
- "Untitled"
```
Sharing with Team
Collaboration Features
Share Dashboard
Custom Dashboard → Share → Select Recipients
Sharing Options
- View Only: Recipients can see but not modify
- Edit Access: Recipients can customize their copy
- Public Link: Generate shareable URL (read-only)
Use Cases
- Weekly team review meetings
- Manager reports for leadership
- On-call handoff summaries
- Incident response post-mortems
Exporting Data
Extract analytics data for external reporting and analysis:
Export Formats
Available Export Formats
| Format | Use Case | Size Limit | Includes |
|---|---|---|---|
| JSON | API integration, programmatic analysis | 100 MB | Full detail with metadata |
| CSV | Excel analysis, reporting tools | 50 MB | Tabular data only |
| PDF | Executive reports, presentations | 20 MB | Visualizations and summaries |
| Excel | Advanced spreadsheet analysis | 50 MB | Multiple sheets with formatting |
Export Options
Quick Export
From any analytics page:
Analytics Page → Export Button → Select Format
Exported Data Includes
- All visible metrics for selected time range
- Chart data and visualizations
- Filter configurations
- Export timestamp and user
- Data source information
Custom Export
For specific data subsets:
Analytics → Export → Custom Export → Configure
Custom Export Options
- Select specific metrics and dimensions
- Choose date range (custom start/end dates)
- Filter by tags, categories, teams
- Include or exclude specific data points
- Schedule recurring exports (daily, weekly, monthly)
API Access
Programmatic Data Access
Access analytics data via API:
Available Endpoints
```
GET /api/v1/analytics/incidents?period_days=30
GET /api/v1/analytics/procedures?period_days=30
GET /api/v1/analytics/performance?period_days=30
GET /api/v1/analytics/llm/costs?period_days=30
```
Authentication
- Use personal API key or service account key
- Include in Authorization header: `Bearer YOUR_API_KEY`
- Rate limits: 100 requests per minute
Example API Request
```
curl -X GET "https://your-instance.overwatch.com/api/v1/analytics/incidents?period_days=30" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Accept: application/json"
```
Example Response
{ "total_incidents": 45, "total_resolved": 38, "mttr": 4.5, "mtta": 0.3, "severity_breakdown": { "critical": 3, "high": 12, "medium": 22, "low": 8 }, "period_start": "2025-09-15T00:00:00Z", "period_end": "2025-10-15T00:00:00Z"}Scheduled Reports
Scheduled Reports
Automated Reporting
Generate and deliver reports automatically:
Report Configuration
Analytics → Reports → New Scheduled Report
Scheduling Options
- Frequency: Daily, weekly, monthly, quarterly
- Day/Time: Specific day of week/month and time
- Recipients: Email addresses or Slack channels
- Format: PDF for executives, CSV for analysis
- Filters: Specific teams, categories, or metrics
Example Weekly Report
Report Name: "Weekly Team Performance Report"Frequency: Every Monday at 9:00 AMRecipients: team-leads@company.com, #engineering-metricsFormat: PDF with charts and CSV data attachmentContent: - Incident volume and trends - MTTR and MTTA - Team performance summary - Top 5 procedures executed - Key insights and recommendationsUsing Analytics for Improvement
Transform data into actionable insights:
Identifying Patterns
Pattern Recognition
Analytics reveal trends that need attention:
High-Frequency Incidents
```
Pattern: Database connection timeouts spike every Monday morning
Analysis: Weekend deployment causes connection pool exhaustion
Action: Implement connection pool warming after deployments
Result: 78% reduction in Monday morning incidents
```
Resolution Time Degradation
```
Pattern: Average MTTR increasing from 3.2h to 5.8h over 3 months
Analysis: New team members lack training on key procedures
Action: Mandatory procedure training for all engineers
Result: MTTR improved to 3.6h within 2 months
```
Low Success Rate Procedures
```
Pattern: "Database Migration" procedure fails 42% of the time
Analysis: Missing validation steps and rollback guidance
Action: Revise procedure with detailed verification and rollback
Result: Success rate improved to 96%
```
Process Improvements
Data-Driven Optimization
Procedure Creation from Patterns
1. Identify incidents occurring ≥3 times per month (a filtering sketch follows this list)
2. Review resolution steps from incident history
3. Create standardized procedure from successful resolutions
4. Measure incident recurrence after procedure deployment
5. Track time savings and success rate
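A minimal sketch of step 1, grouping a month of incidents by title and applying the ≥3 threshold (the titles are illustrative; in practice you would pull them from the analytics API):

```python
from collections import Counter

# Hypothetical month of incident titles.
titles = (["Redis Cache Failure"] * 8
          + ["API gateway overload"] * 4
          + ["DNS propagation delay"] * 2)

recurring = {title: n for title, n in Counter(titles).items() if n >= 3}
for title, count in sorted(recurring.items(), key=lambda kv: -kv[1]):
    print(f"{title}: {count}x this month -> candidate for a standardized procedure")
```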
Example Success Story
Incident: "Redis Cache Failure" - 8 occurrences per monthBefore Procedure: Avg resolution time 45 minutesAfter Procedure: Avg resolution time 8 minutesTime Saved: 296 minutes per month (4.9 hours)LLM Cost Avoided: $1.20 per month (now uses Weaviate)Training Prioritization
Use analytics to guide training investments:
Low Performer Identification
- Not for punishment, but for support
- Target training to actual skill gaps
- Measure improvement post-training
- Pair with high performers for mentoring
High-Value Training Areas
```
Analysis: 68% of database incidents take >2 hours
Insight: Database troubleshooting skills need improvement
Action: Database performance workshop for entire team
Result: Database incident MTTR reduced by 43%
```
ROI Tracking
Demonstrate Platform Value
Quantify benefits with analytics:
Time Savings Calculation
```
Manual Incident Resolution (Before Overwatch)
├─ Avg Resolution Time: 6.2 hours
├─ Monthly Incidents: 45
└─ Total Manual Time: 279 hours/month

With Overwatch Procedures (After)
├─ Avg Resolution Time: 3.8 hours
├─ Monthly Incidents: 45
└─ Total Procedural Time: 171 hours/month

Time Saved: 108 hours/month (17.6 work days)
```
Cost Savings Calculation
```
Engineer Cost: $75/hour (loaded rate)
Time Saved: 108 hours/month
Monthly Savings: $8,100
Annual Savings: $97,200
```
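The arithmetic generalizes to any before/after comparison. A sketch parameterized on the loaded hourly rate (the $75/hour figure above is an example rate, not a universal constant):

```python
def monthly_savings(before_hrs: float, after_hrs: float,
                    incidents_per_month: int, hourly_rate: float = 75.0):
    """Hours and dollars saved per month from faster resolutions."""
    hours_saved = (before_hrs - after_hrs) * incidents_per_month
    return hours_saved, hours_saved * hourly_rate

hours, dollars = monthly_savings(6.2, 3.8, 45)
print(f"Saved {hours:.0f} h/month -> ${dollars:,.0f}/month (${dollars * 12:,.0f}/year)")
```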
Additional ROI Factors
- Reduced downtime costs (customer impact)
- Improved team morale (less firefighting)
- Knowledge retention (procedures capture expertise)
- Faster onboarding (new engineers productive sooner)
- Audit compliance (complete execution records)
LLM Layer 3 ROI
```
LLM Generation Cost: $15.48/month
Time Saved by AI Solutions: 32 hours/month
Value of Time Saved: $2,400/month
ROI: 15,500% ($2,400 saved per $15.48 spent)
```
Best Practices
Regular Review Cadence
Daily Review (On-Call Lead)
- Active incidents status
- New high-severity incidents
- Team workload distribution
- LLM budget status (if admin)
Weekly Review (Engineering Managers)
- Incident volume trends
- MTTR and MTTA performance
- Procedure success rates
- Team performance highlights
- LLM cost tracking
Monthly Review (Team)
- Comprehensive analytics review
- Process improvement identification
- Procedure updates and archival
- Training needs assessment
- Celebrate wins and improvements
Quarterly Review (Leadership)
- Long-term trend analysis
- ROI reporting and budget justification
- Strategic initiatives from analytics insights
- Team capacity planning
- Tool and process investment decisions
Metric Selection
Focus on What Matters
Don’t track everything - focus on actionable metrics:
Critical Metrics (Must track)
- MTTR and MTTA
- Incident volume trends
- Procedure success rates
- LLM budget usage (admins)
Important Metrics (Should track)
- Resolution time by severity
- Procedure execution frequency
- Team performance indicators
- Weaviate hit rate
Nice-to-Have Metrics (Can track)
- Incident type distribution
- Time-of-day patterns
- Individual step execution times
- Cross-team collaboration metrics
Data Quality
Ensure Accurate Analytics
Analytics are only as good as the data:
Data Quality Checklist
- ✅ Incidents properly categorized and tagged
- ✅ Resolution times accurately recorded
- ✅ Procedure steps completed in order
- ✅ Notes and observations documented
- ✅ Severity levels consistently applied
- ✅ Team assignments accurate and up-to-date
Common Data Quality Issues
Incorrect Severity Assignments
```
Problem: Engineers mark all incidents as "High" to get attention
Impact: Severity analytics become meaningless
Solution: Clear severity criteria and regular audits
```
Missing Resolution Documentation
```
Problem: Incidents closed without resolution notes
Impact: Can't learn from successful resolutions
Solution: Required resolution notes field
```
Incomplete Procedure Executions
```
Problem: Procedures started but not marked complete
Impact: Success rate appears artificially low
Solution: Automatic timeout and completion reminders
```
Troubleshooting
Common Analytics Issues
Data Not Appearing
Symptom: Analytics page shows “No data available”
Possible Causes and Solutions
- Time range too narrow
  - Try expanding to 30 or 90 days
  - Check if any incidents exist in period
- Filter too restrictive
  - Clear all filters and retry
  - Review filter criteria
- Organization has no data yet
  - Create incidents and procedures first
  - Wait for execution history to build
- Permission issue
  - Verify role has analytics access
  - Contact admin for permission review
Incorrect Metrics
Symptom: Numbers don’t match expected values
Possible Causes and Solutions
- Timezone confusion
  - Check timezone setting in profile
  - Verify organization timezone
  - Consider UTC vs local time
- Filter applied without notice
  - Look for active filters at top of page
  - Clear filters and refresh
- Calculation period mismatch
  - Verify time range selection
  - Check “Last Updated” timestamp
- Cached data
  - Click refresh button
  - Hard refresh browser (Ctrl+Shift+R)
Export Failures
Symptom: Export button doesn’t work or file corrupted
Possible Causes and Solutions
- Data set too large
  - Reduce time range
  - Select specific metrics instead of all
  - Use API for large exports
- Browser popup blocked
  - Allow popups for Overwatch domain
  - Check browser download settings
- Network timeout
  - Try smaller export
  - Download during off-peak hours
  - Use scheduled reports instead
LLM Cost Discrepancies
Symptom: Reported costs don’t match AWS billing
Possible Causes and Solutions
- Billing period mismatch
  - Overwatch uses UTC months
  - AWS billing may use different timezone
  - Compare actual date ranges
- Multiple organizations
  - Ensure viewing correct organization
  - Check organization selector at top
- Delayed billing data
  - AWS billing may lag 24-48 hours
  - Compare costs after billing cycle closes
- Non-LLM AWS costs
  - Overwatch only tracks Bedrock LLM costs
  - Other AWS services billed separately
Performance Issues
Slow Dashboard Loading
Solutions
- Reduce time range (90d → 30d)
- Refresh less frequently (disable auto-refresh)
- Close unused browser tabs
- Clear browser cache
- Disable chart animations in settings
Chart Rendering Problems
Solutions
- Update browser to latest version
- Disable browser extensions temporarily
- Try different browser (Chrome, Firefox, Safari)
- Reduce chart data points (shorter time range)
- Contact support if issue persists
Next Steps
- Incident Management - Create and resolve incidents tracked in analytics
- Procedure Management - Build procedures monitored in analytics
- Search Features - Use AI-powered search that feeds analytics
- Admin Guide - Configure analytics settings and permissions
Need Help?
- In-App Help: Click the `?` icon in the analytics dashboard for contextual help
- Keyboard Shortcuts: Press the `?` key to see analytics shortcuts
- API Documentation: Visit `/docs` for programmatic access to analytics data
- Support: Contact your system administrator for organization-specific questions
Last updated: October 2025