AI-Powered Problem Resolution at Your Fingertips
On-Demand Problem Reporting allows you to report problems directly from any monitoring dashboard and receive instant AI-powered solutions with step-by-step troubleshooting guidance.
On-Demand Reporting is a revolutionary feature that enables DevOps engineers to:
Report Anywhere
Report problems directly from any monitoring platform without switching contexts or filling out complex forms.
AI-Powered Analysis
Get instant problem analysis powered by AWS Bedrock LLMs with intelligent context understanding.
Turn-by-Turn Guidance
Receive step-by-step troubleshooting procedures tailored to your specific problem and environment.
Learning Loop
Successful resolutions are captured and used to improve future recommendations for the entire organization.
3-Layer Search Architecture :
Layer 1: Customer-Specific Solutions
Searches your organization’s historical incident resolutions
Finds solutions that worked for similar problems in your environment
Highest confidence score (0.8-1.0)
Layer 2: Public Knowledge Base
Searches Weaviate public database with community solutions
Finds proven solutions from the broader DevOps community
Medium confidence score (0.7-0.9)
Layer 3: LLM-Generated Solutions (Fallback)
AI-powered solution generation when no vector match is found
Uses AWS Bedrock models (Claude Sonnet, Nova Lite) or GPT-4
Lower confidence score (0.6-0.8) but highly contextual
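In practice the three layers form a fallback chain: a lower layer is only consulted when the layer above it returns nothing. The TypeScript sketch below illustrates that orchestration; the Solution shape, the Searcher functions, and their names are illustrative assumptions, not the actual Overwatch implementation.

// A minimal sketch of the 3-layer fallback chain, assuming hypothetical searcher functions.
interface Solution {
  title: string;
  steps: string[];
  confidence: number;                    // 0.0-1.0, as described above
  source: 'customer' | 'public' | 'llm';
}

type Searcher = (problem: string, context: Record<string, unknown>) => Promise<Solution[]>;

async function findSolutions(
  problem: string,
  context: Record<string, unknown>,
  layer1: Searcher,   // your organization's historical resolutions
  layer2: Searcher,   // Weaviate public knowledge base
  layer3: Searcher,   // LLM generation via AWS Bedrock
): Promise<Solution[]> {
  const customer = await layer1(problem, context);   // confidence 0.8-1.0
  if (customer.length > 0) return customer;

  const community = await layer2(problem, context);  // confidence 0.7-0.9
  if (community.length > 0) return community;

  return layer3(problem, context);                   // confidence 0.6-0.8, fully generated
}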
Report a Problem in 3 Clicks :
Activate Reporting
Click Overwatch extension icon while on monitoring dashboard
Or use keyboard shortcut: Ctrl+Shift+R (Cmd+Shift+R on Mac)
Or click “Report Problem” button in overlay
Describe the Problem
Enter problem description (minimum 50 characters)
Extension automatically extracts dashboard context
Add additional details if needed
Get Solutions
AI analyzes your problem and context
Solutions appear in overlay with confidence scores
Follow turn-by-turn guidance to resolve
Using Extension Icon :
Navigate to monitoring dashboard showing the problem
Click Overwatch extension icon in toolbar
Select “Report Problem” from menu
Report form opens in overlay
Using Keyboard Shortcut :
While on any monitoring platform page
Press Ctrl+Shift+R (Windows/Linux) or Cmd+Shift+R (Mac)
Report form opens immediately
Begin typing your problem description
Using Overlay Button :
Open extension overlay (click icon or Ctrl+Shift+O)
Click “Report Problem” button at top
Report form opens in expanded overlay
Context from current page pre-populated
What to Include :
What is not working, and since when
Error messages or relevant metrics
What you have already tried
Affected services or users
Example Problem Descriptions :
Good: "API response time increased from 200ms to 2000ms in last hour.
Users reporting timeout errors. Checked database queries - all normal.
CPU usage on API servers is 85% but memory looks fine."
Auto-Extracted Context :
Current monitoring platform (Datadog, New Relic, etc.)
Dashboard URL and time range
Visible metrics and their values
Alert status and severity
Affected services or hosts
Browser and system information
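This context travels with your description as a structured payload. A hypothetical TypeScript shape for it, with field names assumed purely for illustration, could look like:

// Hypothetical shape of the auto-extracted context; field names are assumptions.
interface ReportContext {
  platform: string;                                   // e.g. "Datadog", "New Relic"
  dashboardUrl: string;
  timeRange: { from: string; to: string };            // ISO 8601 timestamps
  visibleMetrics: { name: string; value: number; unit?: string }[];
  alertStatus?: { severity: string; triggeredAt: string };
  affectedServices: string[];                         // services or hosts in view
  browser: { userAgent: string; viewport: string };
}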
What Happens Next :
Problem Analysis
Extension sends problem description + context to backend
LLM analyzes problem and identifies key aspects
Problem categorized and enriched with technical details
Solution Search
System searches 3 layers for relevant solutions
Layer 1: Your organization’s historical solutions
Layer 2: Public knowledge base (Weaviate)
Layer 3: LLM generates new solution if needed
Results Display
Solutions appear in overlay ranked by confidence
Each solution shows confidence score (0-100%)
Turn-by-turn steps with estimated time
Related procedures and similar incidents linked
Analysis Time :
Layers 1-2 : < 2 seconds (vector search)
Layer 3 : 5-10 seconds (LLM generation)
Total : Typically 2-10 seconds for complete results
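For reference, the payload the overlay renders could look something like the following sketch; the field names and structure are assumptions, not the documented API contract.

// Hypothetical response shape; the real API contract may differ.
interface SolutionStep {
  order: number;
  instruction: string;
  command?: string;            // optional command to run or copy
  estimatedMinutes?: number;
}

interface RankedSolution {
  title: string;
  confidence: number;          // shown as 0-100% in the overlay
  source: 'customer' | 'public' | 'llm';
  steps: SolutionStep[];
  relatedIncidents: string[];  // similar past incidents, if any
}

interface ReportResponse {
  incidentId: string;
  analysisMs: number;          // ~2000 for vector search, 5000-10000 with LLM fallback
  solutions: RankedSolution[]; // ranked by confidence, highest first
}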
Solution Overlay Interface :
┌─────────────────────────────────────────────┐
│ Solution: Database Connection Pool Exhausted │
│ Confidence: 92% | Source: Customer Layer │
├─────────────────────────────────────────────┤
│ Step 1 of 5: Check Connection Pool Size │
│ ▶ Run: kubectl get pods -n production │
│ [ Run Command ] [ Copy ] [ Next Step ] │
│ Estimated time: 15 minutes │
│ Success rate: 94% (based on 47 incidents) │
└─────────────────────────────────────────────┘
Interactive Features :
Run Command : Execute commands with one click (where supported)
Copy : Copy commands or code snippets to clipboard
Next/Previous : Navigate through resolution steps
Mark Complete : Track which steps you’ve completed
Report Success : Let system know if solution worked
Report Failure : Get alternative solutions if needed
Automatic Tracking :
Incident created with source_type: 'manual_report'
Progress tracked through each resolution step
Time spent on each step recorded
Final outcome (success/failure) captured
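A sketch of what such a tracked incident record might contain is shown below; apart from the source_type value quoted above, the field names are assumptions.

// Hypothetical shape of a tracked incident; only source_type is taken from this page.
interface TrackedIncident {
  id: string;
  source_type: 'manual_report';             // marks the incident as an on-demand report
  reportedAt: string;                        // ISO 8601 timestamp
  steps: { order: number; completed: boolean; timeSpentSeconds: number }[];
  outcome?: 'success' | 'failure';           // captured when you report the result
}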
Learning Loop :
Successful resolutions (with your confirmation) captured
Solutions added to your organization’s knowledge base (Layer 1)
Future similar problems resolve even faster
Team benefits from your troubleshooting experience
Problem Types :
Application Errors :
Examples :
Exception errors and crashes
API failures and timeouts
500 errors and application exceptions
Memory leaks and resource exhaustion
What to Include :
Error message and stack trace
Affected endpoints or services
Request volume and error rate
Recent deployments or changes
Performance Degradation :
Examples :
Slow response times
High CPU or memory usage
Database query slowness
Network latency problems
What to Include :
Performance metrics (response time, throughput)
Baseline vs current values
Affected services or components
Time when degradation started
Infrastructure Issues :
Examples :
Container crashes or restarts
Pod evictions or scheduling issues
Disk space or I/O problems
Load balancer issues
What to Include :
Infrastructure metrics (CPU, memory, disk)
Affected nodes or instances
Resource utilization patterns
Recent infrastructure changes
Database Problems :
Examples :
Connection pool exhaustion
Slow queries or deadlocks
Replication lag
Transaction failures
What to Include :
Database metrics (connections, query time)
Slow query logs if available
Recent schema changes
Connection pool configuration
Network Issues :
Examples :
Connection timeouts
DNS resolution failures
Packet loss or latency
Firewall or routing issues
What to Include :
Network metrics (latency, packet loss)
Affected endpoints or services
Network topology or configuration
Recent network changes
Reporting Limits :
5 reports per hour per user
20 reports per day per user
200 reports per month per organization
LLM Usage Limits :
Semantic caching reduces costs by 30-50%
Budget alerts at 25%, 50%, 75%, 100%
Hard limit at monthly budget cap
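For reference, the documented defaults above can be summarized as a single configuration object; the values come from this page, while the structure and key names are illustrative.

// Documented default quotas expressed as a config object; structure is illustrative.
const defaultLimits = {
  reporting: {
    perUserPerHour: 5,
    perUserPerDay: 20,
    perOrgPerMonth: 200,
  },
  llm: {
    budgetAlertThresholds: [0.25, 0.5, 0.75, 1.0],   // alerts at 25%, 50%, 75%, 100%
    hardStopAtBudgetCap: true,                        // requests blocked at the monthly budget cap
  },
};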
Increased Limits (Professional tier) :
20 reports per hour per user
100 reports per day per user
Unlimited monthly reports (fair use)
Priority Processing :
Faster LLM processing
Priority access to Layer 3 generation
Dedicated support for failed resolutions
When Limit Reached :
⚠️ Hourly Rate Limit Reached
You've used all 5 problem reports for this hour.
Next report available in: 23 minutes
Need more reports? Upgrade to Professional tier.
[Upgrade Now] [Learn More]
Do :
✅ Be specific about what’s not working
✅ Include error messages or metrics
✅ Mention what you’ve already tried
✅ Provide context about when it started
✅ Include affected services or users
Don’t :
❌ Be too vague (“something is broken”)
❌ Just paste error codes without context
❌ Skip important details to save time
❌ Assume the AI knows your infrastructure
❌ Report multiple unrelated problems together
Use Descriptive Titles : Clear problem summary helps search
Provide Full Context : More context = better solutions
Select Correct Problem Type : Helps narrow solution space
Set Accurate Urgency : Affects solution prioritization
Follow Steps Completely : Don’t skip steps to save time
Report Outcome : Always report success or failure
Provide Feedback : Rate solution quality
Add Notes : Document what worked and what didn’t
Share with Team : Successful solutions help everyone
Update Procedures : Capture new procedures for team
Cost Factors :
Vector Search (Layers 1-2) : No LLM cost, fast and cheap
LLM Generation (Layer 3) : Only when vector search fails
Semantic Caching : Reduces repeat costs by 30-50%
Model Selection : Claude Sonnet (high quality) vs Nova Lite (economical)
Cost Per Report :
Exact Cache Hit : $0.00 (no LLM call)
Nova Lite : ~$0.02-0.05 per report
Claude Sonnet : ~$0.10-0.20 per report
Semantic Cache Hit : ~$0.01-0.02 per report
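As a rough worked example (the mix of report types is assumed, for illustration only): out of 100 reports in a month, if 50 are answered by vector search (no LLM cost), 20 are semantic cache hits (~$0.015 each), and 30 need Nova Lite generation (~$0.035 each), the LLM cost is roughly 20 × $0.015 + 30 × $0.035 ≈ $1.35. Routing those 30 generations through Claude Sonnet at ~$0.15 each instead brings the total to about $4.80.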
Cost Reduction Strategies :
Leverage Vector Search : Write clear descriptions for better vector matches
Use Semantic Caching : Similar problems reuse cached results
Batch Similar Issues : Report one detailed problem, not multiple vague ones
Review Before Submitting : Ensure quality to avoid wasted reports
Learn from History : Check solved incidents before reporting
Monitor Your Usage :
View cost dashboard: Dashboard → Analytics → LLM Costs
See monthly budget progress and alerts
Review most expensive queries
Identify cost optimization opportunities
Troubleshooting :
Description Too Short :
Cause : Description must be at least 50 characters
Solution : Add more details about:
What’s happening?
What did you expect?
What have you tried?
Rate Limit Reached :
Cause : Exceeded hourly or daily report limit
Solution :
Wait for rate limit reset (shown in error message)
Upgrade to Professional tier for higher limits
Review existing incidents for similar problems
No Solutions Found :
Cause : Problem too unique or poorly described
Solution :
Refine problem description with more context
Try different problem type classification
Search existing incidents manually
Contact support for complex problems
Solution Generation Failed :
Cause : LLM service temporarily unavailable or budget exceeded
Solution :
Retry in a few moments
Check organization LLM budget status
Fall back to manual procedure search
Contact administrator if persistent
Now that you understand on-demand reporting, try reporting your next problem directly from your monitoring dashboard.
If you have questions about on-demand reporting, contact support@overwatch-observability.com.