On-Demand Problem Reporting

AI-Powered Problem Resolution at Your Fingertips

On-Demand Problem Reporting allows you to report problems directly from any monitoring dashboard and receive instant AI-powered solutions with step-by-step troubleshooting guidance.

On-Demand Reporting enables DevOps engineers to:

Report Anywhere

Report problems directly from any monitoring platform without switching contexts or filling out complex forms.

AI-Powered Analysis

Get instant problem analysis powered by AWS Bedrock LLMs with intelligent context understanding.

Turn-by-Turn Guidance

Receive step-by-step troubleshooting procedures tailored to your specific problem and environment.

Learning Loop

Successful resolutions are captured and used to improve future recommendations for the entire organization.

3-Layer Search Architecture:

  1. Layer 1: Customer-Specific Solutions

    • Searches your organization’s historical incident resolutions
    • Finds solutions that worked for similar problems in your environment
    • Highest confidence score (0.8-1.0)
  2. Layer 2: Public Knowledge Base

    • Searches Weaviate public database with community solutions
    • Finds proven solutions from the broader DevOps community
    • Medium confidence score (0.7-0.9)
  3. Layer 3: LLM-Generated Solutions (Fallback)

    • AI-powered solution generation when no vector match is found
    • Uses AWS Bedrock (Claude Sonnet, Nova Lite, or GPT-4)
    • Lower confidence score (0.6-0.8) but highly contextual
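
The fallback order above can be sketched as follows. Function and class names here are hypothetical; the real backend uses Weaviate vector search and AWS Bedrock, which this sketch stands in for:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Solution:
    text: str
    confidence: float  # 0.0-1.0
    source: str        # which layer produced it

def search_solutions(problem: str,
                     search_customer: Callable[[str], Optional[Solution]],
                     search_public: Callable[[str], Optional[Solution]],
                     generate_llm: Callable[[str], Solution]) -> Solution:
    """Try each layer in order; fall back to LLM generation."""
    # Layer 1: organization's historical resolutions (confidence 0.8-1.0)
    if (hit := search_customer(problem)) is not None:
        return hit
    # Layer 2: public knowledge base (confidence 0.7-0.9)
    if (hit := search_public(problem)) is not None:
        return hit
    # Layer 3: LLM generation as a fallback (confidence 0.6-0.8)
    return generate_llm(problem)
```

Because Layers 1 and 2 are plain vector lookups, the expensive LLM call only happens when both return nothing.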

Report a Problem in 3 Clicks:

  1. Activate Reporting

    • Click Overwatch extension icon while on monitoring dashboard
    • Or use keyboard shortcut: Ctrl+Shift+R (Cmd+Shift+R on Mac)
    • Or click “Report Problem” button in overlay
  2. Describe the Problem

    • Enter problem description (minimum 50 characters)
    • Extension automatically extracts dashboard context
    • Add additional details if needed
  3. Get Solutions

    • AI analyzes your problem and context
    • Solutions appear in overlay with confidence scores
    • Follow turn-by-turn guidance to resolve

Using Extension Icon:

  1. Navigate to monitoring dashboard showing the problem
  2. Click Overwatch extension icon in toolbar
  3. Select “Report Problem” from menu
  4. Report form opens in overlay

What to Include:

  • Problem Description (required, minimum 50 characters):

    • What is happening or not working?
    • When did it start?
    • What have you tried already?
  • Problem Type (auto-detected from context):

    • Application Error
    • Performance Issue
    • Infrastructure Problem
    • Database Issue
    • Network Problem
    • Security Incident
  • Urgency Level:

    • Critical: Production down, immediate attention required
    • High: Significant impact, needs quick resolution
    • Medium: Moderate impact, can wait for business hours
    • Low: Minor issue, low priority

Example Problem Descriptions:

Good: "API response time increased from 200ms to 2000ms in last hour.
Users reporting timeout errors. Checked database queries - all normal.
CPU usage on API servers is 85% but memory looks fine."

Bad: "API is slow"
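
The 50-character minimum can be checked before submitting. This is an illustrative client-side sketch, not the extension's actual validation; the digit heuristic in particular is purely a suggestion:

```python
MIN_DESCRIPTION_LENGTH = 50  # minimum enforced by the report form

def validate_description(description: str) -> list[str]:
    """Return a list of issues; an empty list means the description is acceptable."""
    issues = []
    text = description.strip()
    if len(text) < MIN_DESCRIPTION_LENGTH:
        issues.append(f"Description must be at least {MIN_DESCRIPTION_LENGTH} characters")
    # Nudge toward the "good description" pattern above: concrete numbers
    # (response times, error rates) make for better vector matches.
    if not any(ch.isdigit() for ch in text):
        issues.append("Consider including metrics (e.g. response times, error rates)")
    return issues
```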

Auto-Extracted Context:

  • Current monitoring platform (Datadog, New Relic, etc.)
  • Dashboard URL and time range
  • Visible metrics and their values
  • Alert status and severity
  • Affected services or hosts
  • Browser and system information
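
As a rough illustration, the extracted context might serialize to something like the following. All field names here are hypothetical, not the extension's actual schema:

```python
# Hypothetical shape of the auto-extracted context payload sent with a report.
report_context = {
    "platform": "datadog",
    "dashboard_url": "https://app.datadoghq.com/dashboard/abc",
    "time_range": {"from": "2024-01-01T12:00:00Z", "to": "2024-01-01T13:00:00Z"},
    "metrics": [{"name": "api.response_time.p95", "value": 2000, "unit": "ms"}],
    "alerts": [{"name": "High latency", "severity": "warning"}],
    "affected_services": ["api-gateway"],
    "browser": {"name": "Chrome", "version": "120"},
}
```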

What Happens Next:

  1. Problem Analysis

    • Extension sends problem description + context to backend
    • LLM analyzes problem and identifies key aspects
    • Problem categorized and enriched with technical details
  2. Solution Search

    • System searches 3 layers for relevant solutions
    • Layer 1: Your organization’s historical solutions
    • Layer 2: Public knowledge base (Weaviate)
    • Layer 3: LLM generates new solution if needed
  3. Results Display

    • Solutions appear in overlay ranked by confidence
    • Each solution shows confidence score (0-100%)
    • Turn-by-turn steps with estimated time
    • Related procedures and similar incidents linked

Analysis Time:

  • Layer 1 + 2: < 2 seconds (vector search)
  • Layer 3: 5-10 seconds (LLM generation)
  • Total: Typically 2-10 seconds for complete results

Solution Overlay Interface:

┌──────────────────────────────────────────────┐
│ Solution: Database Connection Pool Exhausted │
│ Confidence: 92% | Source: Customer Layer     │
├──────────────────────────────────────────────┤
│                                              │
│ Step 1 of 5: Check Connection Pool Size      │
│ ▶ Run: kubectl get pods -n production        │
│                                              │
│ [ Run Command ] [ Copy ] [ Next Step ]       │
│                                              │
│ Estimated time: 15 minutes                   │
│ Success rate: 94% (based on 47 incidents)    │
└──────────────────────────────────────────────┘

Interactive Features:

  • Run Command: Execute commands with one click (where supported)
  • Copy: Copy commands or code snippets to clipboard
  • Next/Previous: Navigate through resolution steps
  • Mark Complete: Track which steps you’ve completed
  • Report Success: Let system know if solution worked
  • Report Failure: Get alternative solutions if needed

Automatic Tracking:

  • Incident created with source_type: 'manual_report'
  • Progress tracked through each resolution step
  • Time spent on each step recorded
  • Final outcome (success/failure) captured

Learning Loop:

  • Successful resolutions (with your confirmation) captured
  • Solutions added to your organization’s knowledge base (Layer 1)
  • Future similar problems resolve even faster
  • Team benefits from your troubleshooting experience

Application Error

Examples:

  • Exception errors and crashes
  • API failures and timeouts
  • 500 errors and application exceptions
  • Memory leaks and resource exhaustion

What to Include:

  • Error message and stack trace
  • Affected endpoints or services
  • Request volume and error rate
  • Recent deployments or changes

Performance Issue

Examples:

  • Slow response times
  • High CPU or memory usage
  • Database query slowness
  • Network latency problems

What to Include:

  • Performance metrics (response time, throughput)
  • Baseline vs current values
  • Affected services or components
  • Time when degradation started

Infrastructure Problem

Examples:

  • Container crashes or restarts
  • Pod evictions or scheduling issues
  • Disk space or I/O problems
  • Load balancer issues

What to Include:

  • Infrastructure metrics (CPU, memory, disk)
  • Affected nodes or instances
  • Resource utilization patterns
  • Recent infrastructure changes

Database Issue

Examples:

  • Connection pool exhaustion
  • Slow queries or deadlocks
  • Replication lag
  • Transaction failures

What to Include:

  • Database metrics (connections, query time)
  • Slow query logs if available
  • Recent schema changes
  • Connection pool configuration

Network Problem

Examples:

  • Connection timeouts
  • DNS resolution failures
  • Packet loss or latency
  • Firewall or routing issues

What to Include:

  • Network metrics (latency, packet loss)
  • Affected endpoints or services
  • Network topology or configuration
  • Recent network changes

Reporting Limits:

  • 5 reports per hour per user
  • 20 reports per day per user
  • 200 reports per month per organization

LLM Usage Limits:

  • Semantic caching reduces costs by 30-50%
  • Budget alerts at 25%, 50%, 75%, 100%
  • Hard limit at monthly budget cap
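
A minimal sketch of how the 25/50/75/100% budget alerts could be tracked; the function name and signature are hypothetical:

```python
ALERT_THRESHOLDS = (0.25, 0.50, 0.75, 1.00)  # matches the 25/50/75/100% alerts

def budget_alerts(spent: float, budget: float, already_sent: set[float]) -> list[float]:
    """Return the threshold fractions newly crossed; they are added to already_sent."""
    fraction = spent / budget if budget else 1.0
    fired = [t for t in ALERT_THRESHOLDS if fraction >= t and t not in already_sent]
    already_sent.update(fired)
    return fired
```

Past 100% (`spent >= budget`), the hard cap would block further Layer 3 generation until the budget resets.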

Increased Limits (Professional tier):

  • 20 reports per hour per user
  • 100 reports per day per user
  • Unlimited monthly reports (fair use)

Priority Processing:

  • Faster LLM processing
  • Priority access to Layer 3 generation
  • Dedicated support for failed resolutions

When Limit Reached:

⚠️ Hourly Rate Limit Reached
You've used all 5 problem reports for this hour.
Next report available in: 23 minutes
Need more reports? Upgrade to Professional tier.
[Upgrade Now] [Learn More]

Do:

  • ✅ Be specific about what’s not working
  • ✅ Include error messages or metrics
  • ✅ Mention what you’ve already tried
  • ✅ Provide context about when it started
  • ✅ Include affected services or users

Don’t:

  • ❌ Be too vague (“something is broken”)
  • ❌ Just paste error codes without context
  • ❌ Skip important details to save time
  • ❌ Assume the AI knows your infrastructure
  • ❌ Report multiple unrelated problems together

When Reporting:

  1. Use Descriptive Titles: Clear problem summary helps search
  2. Provide Full Context: More context = better solutions
  3. Select Correct Problem Type: Helps narrow solution space
  4. Set Accurate Urgency: Affects solution prioritization
  5. Follow Steps Completely: Don’t skip steps to save time

After Resolution:

  1. Report Outcome: Always report success or failure
  2. Provide Feedback: Rate solution quality
  3. Add Notes: Document what worked and what didn’t
  4. Share with Team: Successful solutions help everyone
  5. Update Procedures: Capture new procedures for team

Cost Factors:

  • Vector Search (Layers 1-2): No LLM cost, fast and cheap
  • LLM Generation (Layer 3): Only when vector search fails
  • Semantic Caching: Reduces repeat costs by 30-50%
  • Model Selection: Claude Sonnet (high quality) vs Nova Lite (economical)

Cost Per Report:

  • With Cache Hit: $0.00 (no LLM call)
  • Nova Lite: ~$0.02-0.05 per report
  • Claude Sonnet: ~$0.10-0.20 per report
  • Cached Result: ~$0.01-0.02 per report
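
Using the figures above, a back-of-the-envelope monthly estimate; this is illustrative arithmetic, not billing logic:

```python
def estimated_monthly_cost(reports: int, cache_hit_rate: float,
                           cost_per_llm_report: float) -> float:
    """Cache hits cost ~$0; the remaining reports pay the per-report LLM price."""
    llm_reports = reports * (1 - cache_hit_rate)
    return llm_reports * cost_per_llm_report

# e.g. 200 reports/month, 40% cache hit rate, Nova Lite at ~$0.05/report
# -> 200 * 0.6 * 0.05 = about $6/month
```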

Cost Reduction Strategies:

  1. Leverage Vector Search: Write clear descriptions for better vector matches
  2. Use Semantic Caching: Similar problems reuse cached results
  3. Batch Similar Issues: Report one detailed problem, not multiple vague ones
  4. Review Before Submitting: Ensure quality to avoid wasted reports
  5. Learn from History: Check solved incidents before reporting

Monitor Your Usage:

  • View cost dashboard: Dashboard → Analytics → LLM Costs
  • See monthly budget progress and alerts
  • Review most expensive queries
  • Identify cost optimization opportunities

Description Too Short

Cause: Description must be at least 50 characters

Solution: Add more details about:

  • What’s happening?
  • What did you expect?
  • What have you tried?

Rate Limit Exceeded

Cause: Exceeded hourly or daily report limit

Solution:

  • Wait for rate limit reset (shown in error message)
  • Upgrade to Professional tier for higher limits
  • Review existing incidents for similar problems

No Solutions Found

Cause: Problem too unique or poorly described

Solution:

  • Refine problem description with more context
  • Try different problem type classification
  • Search existing incidents manually
  • Contact support for complex problems

Solution Generation Failed

Cause: LLM service temporarily unavailable or budget exceeded

Solution:

  • Retry in a few moments
  • Check organization LLM budget status
  • Fall back to manual procedure search
  • Contact administrator if persistent

If you have questions about on-demand reporting, contact support@overwatch-observability.com.

