Analytics Dashboard
The Analytics Dashboard provides comprehensive insights into incident response performance, procedure effectiveness, team productivity, and AI-powered solution costs. Use these metrics to identify improvement opportunities, optimize workflows, and demonstrate ROI.
Analytics Overview
Access the Analytics Dashboard:
Dashboard → Analytics
What You Can Track
The analytics system provides four main categories of insights:
| Category | Key Metrics | Access Level |
|---|---|---|
| Incident Analytics | MTTR, MTTA, resolution trends, severity distribution | All users |
| Procedure Analytics | Success rates, execution times, usage patterns | All users |
| Team Performance | Individual metrics, team comparisons, workload distribution | Managers and above |
| LLM Cost Monitoring | AI generation costs, budget tracking, efficiency metrics | Admins and above |
Role-Based Visibility
- Engineers: View personal metrics, team-wide incident and procedure analytics
- Managers: All Engineer access plus team performance comparisons and detailed reports
- Admins: Full access including LLM cost analytics and organization-wide metrics
Time Range Selection
All analytics pages support flexible time range selection:
- 24 Hours: Real-time operational view
- 7 Days: Weekly trends and patterns
- 30 Days: Monthly performance analysis (default)
- 90 Days: Quarterly trends and long-term patterns
Incident Analytics
Detailed metrics about incident response and resolution:
Access Incident Analytics
Dashboard → Analytics → Incident Analytics
Key Incident Metrics
Mean Time To Resolution (MTTR)
- Average time from incident creation to resolution
- Calculated per severity level (Critical, High, Medium, Low)
- Trend analysis shows improvement or degradation over time
- Industry benchmark: Critical incidents < 1 hour, High < 4 hours
Example MTTR Display
```
Overall MTTR: 4.2 hours
├─ Critical: 0.8 hours (target: < 1 hour) ✅
├─ High: 3.2 hours (target: < 4 hours) ✅
├─ Medium: 6.1 hours (target: < 8 hours) ✅
└─ Low: 12.5 hours (target: < 24 hours) ✅
```
Mean Time To Acknowledge (MTTA)
- Time from incident creation to first acknowledgment
- Measures initial response speed
- Critical for on-call and alerting effectiveness
- Target: < 15 minutes for Critical severity
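Both MTTR and MTTA reduce to simple averages over incident timestamps. A minimal sketch of the calculation, assuming a hypothetical record shape (`created`, `acknowledged`, and `resolved` are illustrative field names, not Overwatch's actual data model):

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are illustrative only.
incidents = [
    {"severity": "Critical",
     "created": datetime(2025, 10, 1, 9, 0),
     "acknowledged": datetime(2025, 10, 1, 9, 10),
     "resolved": datetime(2025, 10, 1, 9, 48)},
    {"severity": "High",
     "created": datetime(2025, 10, 2, 14, 0),
     "acknowledged": datetime(2025, 10, 2, 14, 20),
     "resolved": datetime(2025, 10, 2, 17, 12)},
]

def mttr_hours(records):
    """Mean time from creation to resolution, in hours."""
    return mean((r["resolved"] - r["created"]).total_seconds() / 3600 for r in records)

def mtta_minutes(records):
    """Mean time from creation to first acknowledgment, in minutes."""
    return mean((r["acknowledged"] - r["created"]).total_seconds() / 60 for r in records)

critical = [r for r in incidents if r["severity"] == "Critical"]
print(f"MTTR overall: {mttr_hours(incidents):.1f} h")
print(f"MTTA (Critical): {mtta_minutes(critical):.1f} min (target: < 15 min)")
```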
Severity Breakdown
Visual distribution of incidents by severity:
| Severity | Count | Percentage | Resolution Time |
|---|---|---|---|
| Critical | 3 | 6.7% | 0.8h average |
| High | 12 | 26.7% | 3.2h average |
| Medium | 22 | 48.9% | 6.1h average |
| Low | 8 | 17.8% | 12.5h average |
Status Distribution
Current state of incidents in the system:
- Open: 4 incidents - Newly created, awaiting assignment
- In Progress: 3 incidents - Active investigation and resolution
- Resolved: 35 incidents - Solution implemented, awaiting verification
- Closed: 3 incidents - Fully resolved and documented
Incident Volume Trends
Daily Incident Trends
Line chart showing incident creation over time:
- Identify peak incident periods
- Detect anomalous spikes requiring investigation
- Track effectiveness of preventive measures
- Forecast capacity planning needs
Pattern Analysis
Automated pattern detection identifies:
- Recurring Incidents: Same root cause appearing multiple times
- Time-Based Patterns: Incidents occurring at specific times (e.g., after deployments)
- Cascading Failures: Multiple related incidents from single root cause
- Seasonal Trends: Periodic patterns related to business cycles
Resolution Time Analysis
Resolution Time Trends
Track how resolution efficiency changes over time:
- Average resolution time per day/week
- Comparison against baseline and targets
- Impact of process improvements
- Training effectiveness measurement
Longest Incidents
List of incidents taking the most time to resolve:
```
Top 5 Longest Incidents:
1. Database connection timeout - 12.5 hours (Critical)
2. API gateway overload - 8.2 hours (High)
3. Memory leak investigation - 7.8 hours (High)
4. DNS propagation delay - 6.5 hours (Medium)
5. Certificate expiration - 5.2 hours (Medium)
```
Use Case: Identify incidents needing procedure improvements or better documentation.
Incident Type Breakdown
Most frequent incident categories:
| Type | Count | % of Total | Avg Resolution Time |
|---|---|---|---|
| Database Performance | 12 | 26.7% | 4.2 hours |
| Resource Exhaustion | 10 | 22.2% | 5.1 hours |
| API Latency | 8 | 17.8% | 3.8 hours |
| Network Connectivity | 6 | 13.3% | 6.2 hours |
| Service Outage | 5 | 11.1% | 8.5 hours |
| Security Breach | 2 | 4.4% | 15.2 hours |
| Other | 2 | 4.4% | 3.2 hours |
Actionable Insights
- High-frequency types: Create dedicated procedures or automation
- High-duration types: Improve documentation and training
- Recurring types: Investigate root causes for permanent fixes
Procedure Analytics
Performance metrics for runbook procedures and execution tracking:
Access Procedure Analytics
Dashboard → Analytics → Procedure Analytics
Procedure Performance Metrics
Overall Success Rate
- Percentage of successful procedure executions
- Calculated across all procedures
- Target: > 90% success rate
- Trend analysis shows execution quality improvements
Example Display
```
Success Rate: 87.1%
├─ Successful Executions: 298
├─ Failed Executions: 44
├─ Total Executions: 342
└─ Trend: ↑ 3.2% vs last month
```
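The display above follows directly from the definition: successful executions divided by total, with the trend taken against the previous period. A minimal sketch (last month's counts are invented for illustration):

```python
def success_rate(successful: int, failed: int) -> float:
    """Percentage of executions that succeeded."""
    total = successful + failed
    return 100.0 * successful / total if total else 0.0

current = success_rate(298, 44)   # 87.1%, matching the display above
previous = success_rate(251, 48)  # hypothetical last-month counts
print(f"Success Rate: {current:.1f}% (trend: {current - previous:+.1f} pts vs last month)")
```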
Average Execution Time
- Mean time to complete procedure executions
- Compared against estimated duration
- Used for capacity planning
- Identifies procedures needing optimization
Usage Statistics
- Total number of procedures in library
- Total executions in time period
- Most frequently executed procedures
- Procedures never executed (candidates for archival)
Most Used Procedures
Top procedures by execution frequency:
```
Top 5 Most Executed Procedures:
1. Database Performance Recovery
   - Executions: 45
   - Success Rate: 91.1%
   - Avg Duration: 12.3 minutes
2. API Gateway Restart
   - Executions: 38
   - Success Rate: 94.7%
   - Avg Duration: 8.5 minutes
3. Memory Cleanup Process
   - Executions: 32
   - Success Rate: 84.4%
   - Avg Duration: 15.2 minutes
4. Cache Invalidation
   - Executions: 28
   - Success Rate: 96.4%
   - Avg Duration: 3.1 minutes
5. Certificate Renewal
   - Executions: 24
   - Success Rate: 100.0%
   - Avg Duration: 6.8 minutes
```
Strategic Actions
- High-execution procedures: Candidates for automation
- Low success rate: Need revision or better error handling
- Long duration: Opportunities for step optimization
Highest Success Rate Procedures
Most reliable procedures:
| Procedure | Success Rate | Executions | Notes |
|---|---|---|---|
| SSL Certificate Renewal | 100.0% | 12 | Fully automated verification |
| Load Balancer Health Check | 98.5% | 67 | Clear success criteria |
| DNS Update | 97.8% | 45 | Well-documented rollback |
| Cache Clear | 96.4% | 89 | Simple, low-risk operation |
| Log Rotation | 95.2% | 34 | Scheduled maintenance |
Best Practices Identified
- Clear verification steps increase success rate
- Automated checks reduce human error
- Rollback capabilities increase confidence
- Regular testing maintains reliability
Procedure Execution Time Analysis
Execution Time Distribution
- Fastest 10% of executions
- Average execution time
- Slowest 10% of executions
- Variance analysis (consistency measure)
Longest Procedures
Procedures requiring most time:
```
Longest Average Execution Time:
1. Full System Recovery - 45.2 minutes (3 executions)
2. Database Migration - 38.7 minutes (7 executions)
3. Blue-Green Deployment - 32.4 minutes (12 executions)
```
Time Comparison: Estimated vs Actual
- Procedures consistently exceeding estimates need revised time estimates
- Procedures completing faster indicate potential over-estimation
- Large variance suggests inconsistent execution or environment differences
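One way to operationalize these checks is to flag procedures whose actual durations drift from the estimate or vary widely between runs. A sketch over hypothetical execution history, with arbitrary 20% drift and 30% variance thresholds:

```python
from statistics import mean, stdev

# Hypothetical history: estimated minutes plus the actual duration of each run.
history = {
    "Database Migration": (30, [38.1, 44.0, 34.0]),
    "Cache Invalidation": (5, [3.2, 2.9, 3.3]),
}

for name, (estimate, actuals) in history.items():
    avg, spread = mean(actuals), stdev(actuals)
    if avg > estimate * 1.2:
        note = "consistently exceeds estimate; revise the time estimate"
    elif avg < estimate * 0.8:
        note = "finishes well under estimate; likely over-estimated"
    else:
        note = "estimate looks accurate"
    if spread > 0.3 * avg:
        note += " (high variance: inconsistent execution or environments)"
    print(f"{name}: est {estimate} min, actual {avg:.1f} ± {spread:.1f} min -> {note}")
```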
AI Recommendation Accuracy
AI-Powered Insights
Track effectiveness of AI-suggested procedures:
Metrics Tracked
- AI Accuracy: 88.5% - Percentage of AI suggestions that successfully resolve incidents
- AI Confidence Average: 0.82 - Average confidence score of AI recommendations
- Suggestion Acceptance Rate: 76.3% - How often engineers choose AI-suggested procedures
- Time Saved: 3.2 hours average per incident using AI suggestions
Learning Loop Impact
- Accuracy improves as system learns from successful resolutions
- Confidence scores correlate with actual success rates
- Low-confidence suggestions prompt human expertise
- Failed suggestions improve future recommendations
Team Performance
Individual and team metrics for performance tracking and training identification:
Access Team Performance
Dashboard → Analytics → Performance Analytics
Note: Team performance details are visible to Managers and Admins only. Engineers see their own metrics.
Individual Metrics
Performance tracking per team member:
Example Team Performance Table
| Team Member | Incidents Assigned | Resolved | Avg Resolution Time | Success Rate |
|---|---|---|---|---|
| John Doe | 15 | 13 | 3.8 hours | 86.7% |
| Jane Smith | 18 | 17 | 4.2 hours | 94.4% |
| Mike Johnson | 12 | 10 | 5.1 hours | 83.3% |
| Sarah Williams | 14 | 13 | 3.2 hours | 92.9% |
Metrics Explained
Incidents Assigned
- Total incidents assigned to team member
- Includes all severities and statuses
- Used for workload balancing
Incidents Resolved
- Successfully resolved incidents
- Excludes reassigned or escalated incidents
- Success rate = Resolved / Assigned
Average Resolution Time
- Mean time from assignment to resolution
- Adjusted for incident severity
- Lower is generally better (but context matters)
Success Rate
- Percentage of incidents resolved without escalation
- High success rate indicates expertise and effectiveness
- Low success rate may indicate training needs or complexity
Team Comparisons
Workload Distribution
Visualize assignment balance across team:
- Identify overloaded team members
- Ensure fair distribution of on-call responsibilities
- Plan capacity for vacation and sick time
- Detect skill-based assignment patterns
Performance Rankings
Not for competition, but for identifying:
- Top performers to mentor others
- Team members needing additional training
- Expertise areas for knowledge sharing
- Recognition opportunities
Training Opportunities
Skill Gap Identification
Analytics reveal areas where training would help:
Low Success Rate in Specific Categories
```
Database Incidents - John Doe: 62% success rate
→ Recommendation: Database troubleshooting training

Network Issues - Mike Johnson: 58% success rate
→ Recommendation: Network fundamentals course
```
Long Resolution Times
```
API Performance - All team members: 8.2h average
→ Recommendation: Create dedicated procedure or documentation
```
Procedure Usage Patterns
- Team members not using certain procedures: May not know they exist
- Team members with high procedure success rates: Good mentors for others
- Procedures never executed by certain members: Training opportunity
Team Collaboration Metrics
Response Time Metrics
- Average First Response Time: 8.5 minutes - How quickly incidents are acknowledged
- Average Resolution Time: 4.3 hours - Team-wide incident resolution speed
- Escalation Rate: 12.4% - Incidents requiring manager or expert intervention
Collaboration Indicators
- Incidents with multiple team members involved
- Comment and mention activity levels
- Knowledge sharing through procedure updates
- Cross-training effectiveness
LLM Cost Monitoring
Track AI-powered solution generation costs and optimize budget usage:
Access LLM Cost Analytics
Dashboard → Analytics → LLM Costs
Note: LLM cost analytics are visible to Admins and Organization Owners only.
Understanding LLM Layer 3
What is LLM Layer 3?
When vector database search (Weaviate) doesn’t find suitable solutions with high confidence, the system falls back to AI-powered solution generation using AWS Bedrock:
3-Layer Search Architecture
- Layer 1 (Phase 2): Customer-specific solutions from private vector database
- Layer 2 (Current): Public community solutions from Weaviate vector database
- Layer 3 (Current): AI-generated solutions via AWS Bedrock (fallback)
Why Monitor LLM Costs?
Unlike free vector database queries, LLM generation incurs costs:
- AWS Bedrock charges per token generated
- Costs vary by provider (Nova Lite, Claude Sonnet, GPT-4)
- Need to balance quality with budget constraints
- Optimize cache hit rates to reduce costs
Budget Status Dashboard
Monthly Budget Overview
Real-time budget tracking:
```
Monthly Budget Status: October 2025
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Used: $15.48 / Budget: $100.00
[████████░░░░░░░░░░░░░░░░░░░░░] 15.5%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Remaining: $84.52
Status: NORMAL ✅
Next Alert: 25% ($25.00)
```
Budget Status Levels
| Status | Threshold | Action | Color |
|---|---|---|---|
| Normal | 0-25% | No action needed | Green |
| Low | 25-50% | Monitor usage | Yellow |
| Medium | 50-75% | Review usage patterns | Orange |
| High | 75-100% | Urgent optimization needed | Red |
| Over Budget | 100-150% | Alerts sent, continue with caution | Red |
| Blocked | >150% | Hard limit reached, LLM disabled | Gray |
Alert Thresholds
Automatic alerts trigger at:
- 25% Used: First warning - Monitor usage patterns
- 50% Used: Mid-month check - Review if on track
- 75% Used: Urgent attention - Optimize immediately
- 100% Used: Budget exceeded - Hard limit approaching
- 150% Used: Hard limit reached - LLM Layer 3 disabled until next month
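The status levels and alert thresholds above amount to a simple lookup on percent-of-budget used. A minimal sketch mirroring the table (this is an illustration, not an actual Overwatch API):

```python
def budget_status(used: float, budget: float) -> str:
    """Map spend against budget to the status levels described above."""
    pct = 100.0 * used / budget
    if pct > 150:
        return "BLOCKED"      # hard limit reached: LLM Layer 3 disabled
    if pct > 100:
        return "OVER BUDGET"  # alerts sent, continue with caution
    if pct > 75:
        return "HIGH"
    if pct > 50:
        return "MEDIUM"
    if pct > 25:
        return "LOW"
    return "NORMAL"

print(budget_status(15.48, 100.00))  # NORMAL, matching the dashboard above
```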
Cost Metrics
Total Cost Tracking
Key cost metrics for the selected time period:
Cost Summary Display
```
Total LLM Cost: $15.48
├─ Total Requests: 258
├─ Average Cost per Request: $0.06
├─ Successful Generations: 255 (98.8%)
└─ Failed Generations: 3 (1.2%)
```
Cost Per Request
- Average: $0.06 per LLM generation
- Range: $0.03 (Nova Lite) to $0.10 (GPT-4)
- Used for ROI calculations
- Compared against time saved
Provider Cost Breakdown
Different AI providers have different costs:
Cost by Provider
| Provider | Cost per Request | Total Cost | Total Requests | Use Case |
|---|---|---|---|---|
| Nova Lite | $0.03 | $2.32 | 77 | Simple incidents, standard patterns |
| Claude Sonnet | $0.06 | $12.45 | 207 | Complex technical issues (default) |
| GPT-4 | $0.10 | $0.71 | 7 | Highly complex, multi-system issues |
| Total | - | $15.48 | 258 | - |
Provider Selection Logic
The system automatically selects providers based on:
- Incident Complexity: Simple vs complex technical issues
- Context Size: Amount of incident details and logs
- Historical Success: Which provider worked best for similar issues
- Budget Status: Prefer cheaper providers when budget is low
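The platform's actual routing logic is internal; the sketch below is only a plausible illustration of complexity- and budget-aware selection, using the provider names and rates from the table above (the thresholds are invented):

```python
# Rates per request, mirroring the provider table above.
PROVIDER_RATES = {"nova-lite": 0.03, "claude-sonnet": 0.06, "gpt-4": 0.10}

def pick_provider(complexity: str, context_tokens: int, budget_pct_used: float) -> str:
    """Illustrative provider routing; thresholds are assumptions, not platform values."""
    if budget_pct_used > 75:                 # budget tight: prefer the cheapest model
        return "nova-lite"
    if complexity == "simple" and context_tokens < 2_000:
        return "nova-lite"
    if complexity == "highly_complex" or context_tokens > 20_000:
        return "gpt-4"
    return "claude-sonnet"                   # default for standard technical issues

print(pick_provider("simple", 900, 15.5))     # nova-lite
print(pick_provider("complex", 5_000, 15.5))  # claude-sonnet
```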
Efficiency Metrics
Weaviate Hit Rate vs LLM Fallback
The most important cost optimization metric:
```
Search Efficiency (Last 30 Days)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Weaviate Hit Rate: 87.5% (FREE) ✅
[████████████████████████████░░░░]

LLM Fallback Rate: 12.5% (PAID) 💰
[████░░░░░░░░░░░░░░░░░░░░░░░░░░░░]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
What This Means
- 87.5% Hit Rate: Most searches find solutions in free Weaviate database
- 12.5% Fallback: Only 1 in 8 searches requires paid LLM generation
- Target: Maintain >85% Weaviate hit rate through continuous learning
Cost Savings from Caching
If all searches used LLM (no Weaviate):
- Hypothetical cost: $123.84 per month
- Actual cost with Weaviate: $15.48 per month
- Savings: $108.36 (87.5% cost reduction)
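These savings follow arithmetically from the hit rate and the average generation cost. A worked sketch; the total search volume of 2,064 is not shown on the dashboard but is implied by 258 paid fallbacks at a 12.5% rate:

```python
total_searches = 2_064   # implied: 258 fallbacks / 0.125 fallback rate
hit_rate = 0.875         # fraction of searches answered free by Weaviate
avg_llm_cost = 0.06      # average cost per LLM generation (see Cost Metrics)

llm_requests = total_searches * (1 - hit_rate)    # 258 paid generations
actual = llm_requests * avg_llm_cost              # $15.48
hypothetical = total_searches * avg_llm_cost      # $123.84 if every search used the LLM
print(f"Actual: ${actual:.2f} | Without Weaviate: ${hypothetical:.2f} | "
      f"Savings: ${hypothetical - actual:.2f} ({hit_rate:.1%} reduction)")
```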
Usage Statistics
Request Metrics
Detailed usage breakdown:
```
Usage Statistics (Last 30 Days)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total Requests: 258
Successful Generations: 255 (98.8%)
Failed Generations: 3 (1.2%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total Tokens Used: 322,500
Avg Tokens per Request: 1,250
Avg Generation Time: 2.3 seconds
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
Token Usage
- Input tokens: Context from incident (description, logs, metrics)
- Output tokens: Generated solution steps and explanations
- Total tokens: Input + output (what you’re charged for)
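Bedrock-style billing prices input and output tokens at separate per-1,000-token rates, so a request's cost is a weighted sum. The rates below are placeholders for illustration only, not published prices:

```python
# Hypothetical $/1K-token rates; real rates vary by provider and are not shown here.
IN_RATE_PER_1K, OUT_RATE_PER_1K = 0.02, 0.08

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated charge for one generation: input context plus generated output."""
    return (input_tokens / 1000) * IN_RATE_PER_1K + (output_tokens / 1000) * OUT_RATE_PER_1K

print(f"${request_cost(900, 350):.3f} for a 1,250-token request")  # $0.046
```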
Generation Time
- Average time for AI to generate solution
- Typical range: 1.5-3.5 seconds
- Longer for complex incidents
- Includes thinking and verification time
Highest Cost Incidents
Track which incidents generated the most expensive AI solutions:
Top 5 Most Expensive AI Solutions
```
1. Kubernetes Multi-Cluster Networking Issue
   Provider: GPT-4 | Cost: $0.10
   Date: Oct 5, 2025
   Reason: Complex multi-system diagnosis requiring extensive context

2. Database Replication Failure
   Provider: Claude Sonnet | Cost: $0.08
   Date: Oct 8, 2025
   Reason: Large log file analysis and correlation

3. Memory Leak Investigation
   Provider: Claude Sonnet | Cost: $0.07
   Date: Oct 12, 2025
   Reason: Multiple service interaction analysis

4. API Gateway Timeout Pattern
   Provider: Claude Sonnet | Cost: $0.07
   Date: Oct 15, 2025
   Reason: Historical pattern analysis across services

5. Certificate Chain Validation Error
   Provider: Claude Sonnet | Cost: $0.06
   Date: Oct 18, 2025
   Reason: Security context and compliance review
```
Actionable Insights
- Recurring expensive incidents: Create dedicated procedures to avoid future LLM costs
- High-cost patterns: Add these solutions to Weaviate for free future searches
- Provider selection: Review if GPT-4 usage was necessary or if Claude Sonnet would suffice
Budget Projections
Monthly Cost Forecast
Based on current usage patterns:
```
Budget Projection
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Current Month: October 2025
Days Elapsed: 18 days
Days Remaining: 13 days
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Current Cost: $15.48
Daily Average: $0.86
Projected Month-End: $23.22
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Monthly Budget: $100.00
Projected Usage: 23.2%
Status: On Track ✅
```
Projection Scenarios
Best Case (Current pace maintained)
- Projected Cost: $23.22
- Budget Remaining: $76.78
- Status: Well within budget
Expected Case (Historical average)
- Projected Cost: $28.50
- Budget Remaining: $71.50
- Status: Normal usage pattern
Worst Case (Peak usage sustained)
- Projected Cost: $42.80
- Budget Remaining: $57.20
- Status: Higher than typical but acceptable
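A naive forecast extends the daily average to the end of the month, as sketched below. The dashboard's own projection may weight recent usage differently, so treat this as an approximation rather than the exact formula:

```python
import calendar
from datetime import date

def project_month_end(cost_to_date: float, today: date) -> float:
    """Linear forecast: average daily spend so far, extended over the full month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return (cost_to_date / today.day) * days_in_month

projected = project_month_end(15.48, date(2025, 10, 18))
print(f"Projected month-end: ${projected:.2f} of $100.00 budget")
```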
Cost Optimization Tips
Reduce LLM Costs
1. Improve Incident Descriptions
   - Better descriptions help Weaviate find matches
   - Reduces need for LLM fallback
   - Use consistent terminology
2. Create Procedures from Solutions
   - Convert successful LLM solutions to procedures
   - Future similar incidents use procedures (free)
   - Builds organizational knowledge base
3. Update Weaviate Database
   - Add high-cost solutions to Weaviate
   - Improves hit rate for similar future incidents
   - One-time LLM cost, infinite free reuse
4. Use Template Matching
   - Leverage incident templates
   - Better Weaviate query construction
   - Higher confidence scores = less LLM fallback
5. Review Provider Selection
   - Nova Lite for simple incidents ($0.03)
   - Claude Sonnet for standard incidents ($0.06)
   - GPT-4 only for highly complex incidents ($0.10)
Custom Dashboards
Create personalized views for your specific needs:
Access Dashboard Configuration
Dashboard → Analytics → Custom Dashboards → Create New
Creating Custom Views
Dashboard Builder
Select metrics and visualizations:
Available Widgets
- Incident volume trends (line chart)
- Severity breakdown (pie chart)
- Resolution time distribution (histogram)
- Top procedures by usage (bar chart)
- Team performance comparison (table)
- LLM cost trends (line chart)
- Success rate over time (line chart)
- Procedure execution heatmap (calendar view)
Configuration Options
- Time range (default to 30 days)
- Refresh interval (manual, 5 min, 15 min, 30 min, 1 hour)
- Widget size and positioning
- Color schemes and themes
- Data filters and groupings
Saving Configurations
Dashboard Persistence
Save your custom layouts:
Save Options
- Personal Dashboard: Private to your account
- Team Dashboard: Shared with your team
- Organization Dashboard: Available to entire organization
- Default Dashboard: Replace standard analytics view
Naming Conventions
```
✅ Good Names:
- "On-Call Weekly Report"
- "Database Team Performance"
- "LLM Cost Tracking - Q4 2025"

❌ Poor Names:
- "Dashboard 1"
- "My Analytics"
- "Untitled"
```
Sharing with Team
Collaboration Features
Share Dashboard
Custom Dashboard → Share → Select Recipients
Sharing Options
- View Only: Recipients can see but not modify
- Edit Access: Recipients can customize their copy
- Public Link: Generate shareable URL (read-only)
Use Cases
- Weekly team review meetings
- Manager reports for leadership
- On-call handoff summaries
- Incident response post-mortems
Exporting Data
Extract analytics data for external reporting and analysis:
Export Formats
Available Export Formats
| Format | Use Case | Size Limit | Includes |
|---|---|---|---|
| JSON | API integration, programmatic analysis | 100 MB | Full detail with metadata |
| CSV | Excel analysis, reporting tools | 50 MB | Tabular data only |
| PDF | Executive reports, presentations | 20 MB | Visualizations and summaries |
| Excel | Advanced spreadsheet analysis | 50 MB | Multiple sheets with formatting |
Export Options
Quick Export
From any analytics page:
Analytics Page → Export Button → Select Format
Exported Data Includes
- All visible metrics for selected time range
- Chart data and visualizations
- Filter configurations
- Export timestamp and user
- Data source information
Custom Export
For specific data subsets:
Analytics → Export → Custom Export → Configure
Custom Export Options
- Select specific metrics and dimensions
- Choose date range (custom start/end dates)
- Filter by tags, categories, teams
- Include or exclude specific data points
- Schedule recurring exports (daily, weekly, monthly)
API Access
Programmatic Data Access
Access analytics data via API:
Available Endpoints
```
GET /api/v1/analytics/incidents?period_days=30
GET /api/v1/analytics/procedures?period_days=30
GET /api/v1/analytics/performance?period_days=30
GET /api/v1/analytics/llm/costs?period_days=30
```
Authentication
- Use personal API key or service account key
- Include in Authorization header: `Bearer YOUR_API_KEY`
- Rate limits: 100 requests per minute
Example API Request
```
curl -X GET "https://your-instance.overwatch.com/api/v1/analytics/incidents?period_days=30" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Accept: application/json"
```
Example Response
{ "total_incidents": 45, "total_resolved": 38, "mttr": 4.5, "mtta": 0.3, "severity_breakdown": { "critical": 3, "high": 12, "medium": 22, "low": 8 }, "period_start": "2025-09-15T00:00:00Z", "period_end": "2025-10-15T00:00:00Z"}Scheduled Reports
Scheduled Reports
Automated Reporting
Generate and deliver reports automatically:
Report Configuration
Analytics → Reports → New Scheduled Report
Scheduling Options
- Frequency: Daily, weekly, monthly, quarterly
- Day/Time: Specific day of week/month and time
- Recipients: Email addresses or Slack channels
- Format: PDF for executives, CSV for analysis
- Filters: Specific teams, categories, or metrics
Example Weekly Report
Report Name: "Weekly Team Performance Report"Frequency: Every Monday at 9:00 AMRecipients: team-leads@company.com, #engineering-metricsFormat: PDF with charts and CSV data attachmentContent: - Incident volume and trends - MTTR and MTTA - Team performance summary - Top 5 procedures executed - Key insights and recommendationsUsing Analytics for Improvement
Transform data into actionable insights:
Identifying Patterns
Pattern Recognition
Analytics reveal trends that need attention:
High-Frequency Incidents
```
Pattern: Database connection timeouts spike every Monday morning
Analysis: Weekend deployment causes connection pool exhaustion
Action: Implement connection pool warming after deployments
Result: 78% reduction in Monday morning incidents
```
Resolution Time Degradation
```
Pattern: Average MTTR increasing from 3.2h to 5.8h over 3 months
Analysis: New team members lack training on key procedures
Action: Mandatory procedure training for all engineers
Result: MTTR improved to 3.6h within 2 months
```
Low Success Rate Procedures
```
Pattern: "Database Migration" procedure fails 42% of the time
Analysis: Missing validation steps and rollback guidance
Action: Revise procedure with detailed verification and rollback
Result: Success rate improved to 96%
```
Process Improvements
Data-Driven Optimization
Procedure Creation from Patterns
1. Identify incidents occurring ≥3 times per month (a filtering sketch follows this list)
2. Review resolution steps from incident history
3. Create standardized procedure from successful resolutions
4. Measure incident recurrence after procedure deployment
5. Track time savings and success rate
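A minimal sketch of step 1, grouping a month of incidents by title and applying the ≥3 threshold (the titles are illustrative; in practice you would pull them from the analytics API):

```python
from collections import Counter

# Hypothetical month of incident titles.
titles = (["Redis Cache Failure"] * 8
          + ["API gateway overload"] * 4
          + ["DNS propagation delay"] * 2)

recurring = {title: n for title, n in Counter(titles).items() if n >= 3}
for title, count in sorted(recurring.items(), key=lambda kv: -kv[1]):
    print(f"{title}: {count}x this month -> candidate for a standardized procedure")
```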
Example Success Story
Incident: "Redis Cache Failure" - 8 occurrences per monthBefore Procedure: Avg resolution time 45 minutesAfter Procedure: Avg resolution time 8 minutesTime Saved: 296 minutes per month (4.9 hours)LLM Cost Avoided: $1.20 per month (now uses Weaviate)Training Prioritization
Use analytics to guide training investments:
Low Performer Identification
- Not for punishment, but for support
- Target training to actual skill gaps
- Measure improvement post-training
- Pair with high performers for mentoring
High-Value Training Areas
```
Analysis: 68% of database incidents take >2 hours
Insight: Database troubleshooting skills need improvement
Action: Database performance workshop for entire team
Result: Database incident MTTR reduced by 43%
```
ROI Tracking
Demonstrate Platform Value
Quantify benefits with analytics:
Time Savings Calculation
```
Manual Incident Resolution (Before Overwatch)
├─ Avg Resolution Time: 6.2 hours
├─ Monthly Incidents: 45
└─ Total Manual Time: 279 hours/month

With Overwatch Procedures (After)
├─ Avg Resolution Time: 3.8 hours
├─ Monthly Incidents: 45
└─ Total Procedural Time: 171 hours/month

Time Saved: 108 hours/month (17.6 work days)
```
Cost Savings Calculation
```
Engineer Cost: $75/hour (loaded rate)
Time Saved: 108 hours/month
Monthly Savings: $8,100
Annual Savings: $97,200
```
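The arithmetic generalizes to any before/after comparison. A sketch parameterized on the loaded hourly rate (the $75/hour figure above is an example rate, not a universal constant):

```python
def monthly_savings(before_hrs: float, after_hrs: float,
                    incidents_per_month: int, hourly_rate: float = 75.0):
    """Hours and dollars saved per month from faster resolutions."""
    hours_saved = (before_hrs - after_hrs) * incidents_per_month
    return hours_saved, hours_saved * hourly_rate

hours, dollars = monthly_savings(6.2, 3.8, 45)
print(f"Saved {hours:.0f} h/month -> ${dollars:,.0f}/month (${dollars * 12:,.0f}/year)")
```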
Additional ROI Factors
- Reduced downtime costs (customer impact)
- Improved team morale (less firefighting)
- Knowledge retention (procedures capture expertise)
- Faster onboarding (new engineers productive sooner)
- Audit compliance (complete execution records)
LLM Layer 3 ROI
```
LLM Generation Cost: $15.48/month
Time Saved by AI Solutions: 32 hours/month
Value of Time Saved: $2,400/month
ROI: 15,500% ($2,400 saved per $15.48 spent)
```
Best Practices
Regular Review Cadence
Daily Review (On-Call Lead)
- Active incidents status
- New high-severity incidents
- Team workload distribution
- LLM budget status (if admin)
Weekly Review (Engineering Managers)
- Incident volume trends
- MTTR and MTTA performance
- Procedure success rates
- Team performance highlights
- LLM cost tracking
Monthly Review (Team)
- Comprehensive analytics review
- Process improvement identification
- Procedure updates and archival
- Training needs assessment
- Celebrate wins and improvements
Quarterly Review (Leadership)
- Long-term trend analysis
- ROI reporting and budget justification
- Strategic initiatives from analytics insights
- Team capacity planning
- Tool and process investment decisions
Metric Selection
Focus on What Matters
Don’t track everything - focus on actionable metrics:
Critical Metrics (Must track)
- MTTR and MTTA
- Incident volume trends
- Procedure success rates
- LLM budget usage (admins)
Important Metrics (Should track)
- Resolution time by severity
- Procedure execution frequency
- Team performance indicators
- Weaviate hit rate
Nice-to-Have Metrics (Can track)
- Incident type distribution
- Time-of-day patterns
- Individual step execution times
- Cross-team collaboration metrics
Data Quality
Ensure Accurate Analytics
Analytics are only as good as the data:
Data Quality Checklist
- ✅ Incidents properly categorized and tagged
- ✅ Resolution times accurately recorded
- ✅ Procedure steps completed in order
- ✅ Notes and observations documented
- ✅ Severity levels consistently applied
- ✅ Team assignments accurate and up-to-date
Common Data Quality Issues
Incorrect Severity Assignments
```
Problem: Engineers mark all incidents as "High" to get attention
Impact: Severity analytics become meaningless
Solution: Clear severity criteria and regular audits
```
Missing Resolution Documentation
```
Problem: Incidents closed without resolution notes
Impact: Can't learn from successful resolutions
Solution: Required resolution notes field
```
Incomplete Procedure Executions
```
Problem: Procedures started but not marked complete
Impact: Success rate appears artificially low
Solution: Automatic timeout and completion reminders
```
Troubleshooting
Common Analytics Issues
Data Not Appearing
Symptom: Analytics page shows “No data available”
Possible Causes and Solutions
- Time range too narrow
  - Try expanding to 30 or 90 days
  - Check if any incidents exist in period
- Filter too restrictive
  - Clear all filters and retry
  - Review filter criteria
- Organization has no data yet
  - Create incidents and procedures first
  - Wait for execution history to build
- Permission issue
  - Verify role has analytics access
  - Contact admin for permission review
Incorrect Metrics
Symptom: Numbers don’t match expected values
Possible Causes and Solutions
- Timezone confusion
  - Check timezone setting in profile
  - Verify organization timezone
  - Consider UTC vs local time
- Filter applied without notice
  - Look for active filters at top of page
  - Clear filters and refresh
- Calculation period mismatch
  - Verify time range selection
  - Check “Last Updated” timestamp
- Cached data
  - Click refresh button
  - Hard refresh browser (Ctrl+Shift+R)
Export Failures
Symptom: Export button doesn’t work or file corrupted
Possible Causes and Solutions
- Data set too large
  - Reduce time range
  - Select specific metrics instead of all
  - Use API for large exports
- Browser popup blocked
  - Allow popups for Overwatch domain
  - Check browser download settings
- Network timeout
  - Try smaller export
  - Download during off-peak hours
  - Use scheduled reports instead
LLM Cost Discrepancies
Symptom: Reported costs don’t match AWS billing
Possible Causes and Solutions
- Billing period mismatch
  - Overwatch uses UTC months
  - AWS billing may use different timezone
  - Compare actual date ranges
- Multiple organizations
  - Ensure viewing correct organization
  - Check organization selector at top
- Delayed billing data
  - AWS billing may lag 24-48 hours
  - Compare costs after billing cycle closes
- Non-LLM AWS costs
  - Overwatch only tracks Bedrock LLM costs
  - Other AWS services billed separately
Performance Issues
Slow Dashboard Loading
Solutions
- Reduce time range (90d → 30d)
- Refresh less frequently (disable auto-refresh)
- Close unused browser tabs
- Clear browser cache
- Disable chart animations in settings
Chart Rendering Problems
Solutions
- Update browser to latest version
- Disable browser extensions temporarily
- Try different browser (Chrome, Firefox, Safari)
- Reduce chart data points (shorter time range)
- Contact support if issue persists
Next Steps
- Incident Management - Create and resolve incidents tracked in analytics
- Procedure Management - Build procedures monitored in analytics
- Search Features - Use AI-powered search that feeds analytics
- Admin Guide - Configure analytics settings and permissions
Need Help?
- In-App Help: Click the `?` icon in the analytics dashboard for contextual help
- Keyboard Shortcuts: Press the `?` key to see analytics shortcuts
- API Documentation: Visit `/docs` for programmatic access to analytics data
- Support: Contact your system administrator for organization-specific questions
Last updated: October 2025