On-Demand Problem Reporting

AI-Powered Problem Resolution at Your Fingertips

On-Demand Problem Reporting allows you to report problems directly from any monitoring dashboard and receive instant AI-powered solutions with step-by-step troubleshooting guidance.

On-Demand Reporting enables DevOps engineers to:

Report Anywhere

Report problems directly from any monitoring platform without switching contexts or filling out complex forms.

AI-Powered Analysis

Get instant problem analysis powered by AWS Bedrock LLMs with intelligent context understanding.

Turn-by-Turn Guidance

Receive step-by-step troubleshooting procedures tailored to your specific problem and environment.

Learning Loop

Successful resolutions are captured and used to improve future recommendations for the entire organization.

3-Layer Search Architecture:

  1. Layer 1: Customer-Specific Solutions

    • Searches your organization’s historical incident resolutions
    • Finds solutions that worked for similar problems in your environment
    • Highest confidence score (0.8-1.0)
  2. Layer 2: Public Knowledge Base

    • Searches Weaviate public database with community solutions
    • Finds proven solutions from the broader DevOps community
    • Medium confidence score (0.7-0.9)
  3. Layer 3: LLM-Generated Solutions (Fallback)

    • AI-powered solution generation when no vector match is found
    • Uses AWS Bedrock (Claude Sonnet, Nova Lite, or GPT-4)
    • Lower confidence score (0.6-0.8) but highly contextual
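
The fallback order above can be sketched as follows. Function and class names here are hypothetical; the real backend uses Weaviate vector search and AWS Bedrock, which this sketch stands in for:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Solution:
    text: str
    confidence: float  # 0.0-1.0
    source: str        # which layer produced it

def search_solutions(problem: str,
                     search_customer: Callable[[str], Optional[Solution]],
                     search_public: Callable[[str], Optional[Solution]],
                     generate_llm: Callable[[str], Solution]) -> Solution:
    """Try each layer in order; fall back to LLM generation."""
    # Layer 1: organization's historical resolutions (confidence 0.8-1.0)
    if (hit := search_customer(problem)) is not None:
        return hit
    # Layer 2: public knowledge base (confidence 0.7-0.9)
    if (hit := search_public(problem)) is not None:
        return hit
    # Layer 3: LLM generation as a fallback (confidence 0.6-0.8)
    return generate_llm(problem)
```

Because Layers 1 and 2 are plain vector lookups, the expensive LLM call only happens when both return nothing.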

Report a Problem in 3 Clicks:

  1. Activate Reporting

    • Click Overwatch extension icon while on monitoring dashboard
    • Or use keyboard shortcut: Ctrl+Shift+R (Cmd+Shift+R on Mac)
    • Or click “Report Problem” button in overlay
  2. Describe the Problem

    • Enter problem description (minimum 50 characters)
    • Extension automatically extracts dashboard context
    • Add additional details if needed
  3. Get Solutions

    • AI analyzes your problem and context
    • Solutions appear in overlay with confidence scores
    • Follow turn-by-turn guidance to resolve

Using Extension Icon:

  1. Navigate to monitoring dashboard showing the problem
  2. Click Overwatch extension icon in toolbar
  3. Select “Report Problem” from menu
  4. Report form opens in overlay

What to Include:

  • Problem Description (required, minimum 50 characters):

    • What is happening or not working?
    • When did it start?
    • What have you tried already?
  • Problem Type (auto-detected from context):

    • Application Error
    • Performance Issue
    • Infrastructure Problem
    • Database Issue
    • Network Problem
    • Security Incident
  • Urgency Level:

    • Critical: Production down, immediate attention required
    • High: Significant impact, needs quick resolution
    • Medium: Moderate impact, can wait for business hours
    • Low: Minor issue, low priority

Example Problem Descriptions:

Good: "API response time increased from 200ms to 2000ms in last hour.
Users reporting timeout errors. Checked database queries - all normal.
CPU usage on API servers is 85% but memory looks fine."

Bad: "API is slow"
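
The 50-character minimum can be checked before submitting. This is an illustrative client-side sketch, not the extension's actual validation; the digit heuristic in particular is purely a suggestion:

```python
MIN_DESCRIPTION_LENGTH = 50  # minimum enforced by the report form

def validate_description(description: str) -> list[str]:
    """Return a list of issues; an empty list means the description is acceptable."""
    issues = []
    text = description.strip()
    if len(text) < MIN_DESCRIPTION_LENGTH:
        issues.append(f"Description must be at least {MIN_DESCRIPTION_LENGTH} characters")
    # Nudge toward the "good description" pattern above: concrete numbers
    # (response times, error rates) make for better vector matches.
    if not any(ch.isdigit() for ch in text):
        issues.append("Consider including metrics (e.g. response times, error rates)")
    return issues
```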

Auto-Extracted Context:

  • Current monitoring platform (Datadog, New Relic, etc.)
  • Dashboard URL and time range
  • Visible metrics and their values
  • Alert status and severity
  • Affected services or hosts
  • Browser and system information
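
As a rough illustration, the extracted context might serialize to something like the following. All field names here are hypothetical, not the extension's actual schema:

```python
# Hypothetical shape of the auto-extracted context payload sent with a report.
report_context = {
    "platform": "datadog",
    "dashboard_url": "https://app.datadoghq.com/dashboard/abc",
    "time_range": {"from": "2024-01-01T12:00:00Z", "to": "2024-01-01T13:00:00Z"},
    "metrics": [{"name": "api.response_time.p95", "value": 2000, "unit": "ms"}],
    "alerts": [{"name": "High latency", "severity": "warning"}],
    "affected_services": ["api-gateway"],
    "browser": {"name": "Chrome", "version": "120"},
}
```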

What Happens Next:

  1. Problem Analysis

    • Extension sends problem description + context to backend
    • LLM analyzes problem and identifies key aspects
    • Problem categorized and enriched with technical details
  2. Solution Search

    • System searches 3 layers for relevant solutions
    • Layer 1: Your organization’s historical solutions
    • Layer 2: Public knowledge base (Weaviate)
    • Layer 3: LLM generates new solution if needed
  3. Results Display

    • Solutions appear in overlay ranked by confidence
    • Each solution shows confidence score (0-100%)
    • Turn-by-turn steps with estimated time
    • Related procedures and similar incidents linked

Analysis Time:

  • Layer 1 + 2: < 2 seconds (vector search)
  • Layer 3: 5-10 seconds (LLM generation)
  • Total: Typically 2-10 seconds for complete results

Solution Overlay Interface:

┌──────────────────────────────────────────────┐
│ Solution: Database Connection Pool Exhausted │
│ Confidence: 92% | Source: Customer Layer     │
├──────────────────────────────────────────────┤
│                                              │
│ Step 1 of 5: Check Connection Pool Size      │
│ ▶ Run: kubectl get pods -n production        │
│                                              │
│ [ Run Command ] [ Copy ] [ Next Step ]       │
│                                              │
│ Estimated time: 15 minutes                   │
│ Success rate: 94% (based on 47 incidents)    │
└──────────────────────────────────────────────┘

Interactive Features:

  • Run Command: Execute commands with one click (where supported)
  • Copy: Copy commands or code snippets to clipboard
  • Next/Previous: Navigate through resolution steps
  • Mark Complete: Track which steps you’ve completed
  • Report Success: Let system know if solution worked
  • Report Failure: Get alternative solutions if needed

Automatic Tracking:

  • Incident created with source_type: 'manual_report'
  • Progress tracked through each resolution step
  • Time spent on each step recorded
  • Final outcome (success/failure) captured

Learning Loop:

  • Successful resolutions (with your confirmation) captured
  • Solutions added to your organization’s knowledge base (Layer 1)
  • Future similar problems resolve even faster
  • Team benefits from your troubleshooting experience

Application Error

Examples:

  • Exception errors and crashes
  • API failures and timeouts
  • 500 errors and application exceptions
  • Memory leaks and resource exhaustion

What to Include:

  • Error message and stack trace
  • Affected endpoints or services
  • Request volume and error rate
  • Recent deployments or changes

Performance Issue

Examples:

  • Slow response times
  • High CPU or memory usage
  • Database query slowness
  • Network latency problems

What to Include:

  • Performance metrics (response time, throughput)
  • Baseline vs current values
  • Affected services or components
  • Time when degradation started

Infrastructure Problem

Examples:

  • Container crashes or restarts
  • Pod evictions or scheduling issues
  • Disk space or I/O problems
  • Load balancer issues

What to Include:

  • Infrastructure metrics (CPU, memory, disk)
  • Affected nodes or instances
  • Resource utilization patterns
  • Recent infrastructure changes

Database Issue

Examples:

  • Connection pool exhaustion
  • Slow queries or deadlocks
  • Replication lag
  • Transaction failures

What to Include:

  • Database metrics (connections, query time)
  • Slow query logs if available
  • Recent schema changes
  • Connection pool configuration

Network Problem

Examples:

  • Connection timeouts
  • DNS resolution failures
  • Packet loss or latency
  • Firewall or routing issues

What to Include:

  • Network metrics (latency, packet loss)
  • Affected endpoints or services
  • Network topology or configuration
  • Recent network changes

Reporting Limits:

  • 5 reports per hour per user
  • 20 reports per day per user
  • 200 reports per month per organization

LLM Usage Limits:

  • Semantic caching reduces costs by 30-50%
  • Budget alerts at 25%, 50%, 75%, 100%
  • Hard limit at monthly budget cap
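
A minimal sketch of how the 25/50/75/100% budget alerts could be tracked; the function name and signature are hypothetical:

```python
ALERT_THRESHOLDS = (0.25, 0.50, 0.75, 1.00)  # matches the 25/50/75/100% alerts

def budget_alerts(spent: float, budget: float, already_sent: set[float]) -> list[float]:
    """Return the threshold fractions newly crossed; they are added to already_sent."""
    fraction = spent / budget if budget else 1.0
    fired = [t for t in ALERT_THRESHOLDS if fraction >= t and t not in already_sent]
    already_sent.update(fired)
    return fired
```

Past 100% (`spent >= budget`), the hard cap would block further Layer 3 generation until the budget resets.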

Increased Limits (Professional tier):

  • 20 reports per hour per user
  • 100 reports per day per user
  • Unlimited monthly reports (fair use)

Priority Processing:

  • Faster LLM processing
  • Priority access to Layer 3 generation
  • Dedicated support for failed resolutions

When Limit Reached:

⚠️ Hourly Rate Limit Reached
You've used all 5 problem reports for this hour.
Next report available in: 23 minutes
Need more reports? Upgrade to Professional tier.
[Upgrade Now] [Learn More]

Do:

  • ✅ Be specific about what’s not working
  • ✅ Include error messages or metrics
  • ✅ Mention what you’ve already tried
  • ✅ Provide context about when it started
  • ✅ Include affected services or users

Don’t:

  • ❌ Be too vague (“something is broken”)
  • ❌ Just paste error codes without context
  • ❌ Skip important details to save time
  • ❌ Assume the AI knows your infrastructure
  • ❌ Report multiple unrelated problems together

When Reporting:

  1. Use Descriptive Titles: Clear problem summary helps search
  2. Provide Full Context: More context = better solutions
  3. Select Correct Problem Type: Helps narrow solution space
  4. Set Accurate Urgency: Affects solution prioritization
  5. Follow Steps Completely: Don’t skip steps to save time

After Resolution:

  1. Report Outcome: Always report success or failure
  2. Provide Feedback: Rate solution quality
  3. Add Notes: Document what worked and what didn’t
  4. Share with Team: Successful solutions help everyone
  5. Update Procedures: Capture new procedures for team

Cost Factors:

  • Vector Search (Layers 1-2): No LLM cost, fast and cheap
  • LLM Generation (Layer 3): Only when vector search fails
  • Semantic Caching: Reduces repeat costs by 30-50%
  • Model Selection: Claude Sonnet (high quality) vs Nova Lite (economical)

Cost Per Report:

  • With Cache Hit: $0.00 (no LLM call)
  • Nova Lite: ~$0.02-0.05 per report
  • Claude Sonnet: ~$0.10-0.20 per report
  • Cached Result: ~$0.01-0.02 per report
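
Using the figures above, a back-of-the-envelope monthly estimate; this is illustrative arithmetic, not billing logic:

```python
def estimated_monthly_cost(reports: int, cache_hit_rate: float,
                           cost_per_llm_report: float) -> float:
    """Cache hits cost ~$0; the remaining reports pay the per-report LLM price."""
    llm_reports = reports * (1 - cache_hit_rate)
    return llm_reports * cost_per_llm_report

# e.g. 200 reports/month, 40% cache hit rate, Nova Lite at ~$0.05/report
# -> 200 * 0.6 * 0.05 = about $6/month
```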

Cost Reduction Strategies:

  1. Leverage Vector Search: Write clear descriptions for better vector matches
  2. Use Semantic Caching: Similar problems reuse cached results
  3. Batch Similar Issues: Report one detailed problem, not multiple vague ones
  4. Review Before Submitting: Ensure quality to avoid wasted reports
  5. Learn from History: Check solved incidents before reporting

Monitor Your Usage:

  • View cost dashboard: Dashboard → Analytics → LLM Costs
  • See monthly budget progress and alerts
  • Review most expensive queries
  • Identify cost optimization opportunities

Description Too Short

Cause: Description must be at least 50 characters

Solution: Add more details about:

  • What’s happening?
  • What did you expect?
  • What have you tried?

Rate Limit Exceeded

Cause: Exceeded hourly or daily report limit

Solution:

  • Wait for rate limit reset (shown in error message)
  • Upgrade to Professional tier for higher limits
  • Review existing incidents for similar problems

No Solutions Found

Cause: Problem too unique or poorly described

Solution:

  • Refine problem description with more context
  • Try different problem type classification
  • Search existing incidents manually
  • Contact support for complex problems

Solution Generation Failed

Cause: LLM service temporarily unavailable or budget exceeded

Solution:

  • Retry in a few moments
  • Check organization LLM budget status
  • Fall back to manual procedure search
  • Contact administrator if persistent

If you have questions about on-demand reporting, contact support@overwatch-observability.com.

