Datadog Integration

Seamless Integration with Datadog Monitoring Platform

The Overwatch Chrome Extension provides comprehensive integration with Datadog, enabling automatic context extraction from alerts, metrics, logs, and dashboards.

Alert Detection

  • Automatic alert detection and context extraction
  • Monitor and SLO status extraction
  • Alert correlation and impact analysis

Metric Correlation

  • Time series data extraction
  • Metric correlation and analysis
  • Dashboard context capture

Log Aggregation

  • Log message extraction
  • Error pattern recognition
  • Stack trace capture

Dashboard Integration

  • Dashboard screenshot capture
  • Widget and panel extraction
  • Time range and filter context

Automatic Detection For:

  • app.datadoghq.com (US1)
  • app.datadoghq.eu (EU)
  • us3.datadoghq.com (US3)
  • us5.datadoghq.com (US5)
  • Custom Datadog instances (configure in settings)
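Detection against this domain list can be sketched as follows. This is a minimal illustration, not the extension's actual implementation; the `customDomains` parameter is a hypothetical stand-in for the custom-instance setting.

```javascript
// Known Datadog sites the extension watches for.
const DATADOG_DOMAINS = [
  "app.datadoghq.com", // US1
  "app.datadoghq.eu",  // EU
  "us3.datadoghq.com", // US3
  "us5.datadoghq.com", // US5
];

// Returns true when the page URL belongs to a Datadog instance.
// customDomains models the "configure in settings" entry above.
function isDatadogPage(url, customDomains = []) {
  const host = new URL(url).hostname;
  return [...DATADOG_DOMAINS, ...customDomains].includes(host);
}
```

A custom instance passed via `customDomains` is matched the same way as the built-in sites.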
Quick Setup:

  1. Navigate to Datadog

    • Open any Datadog page (app.datadoghq.com)
    • Extension automatically detects platform
  2. Grant Permissions

    • Click Overwatch extension icon
    • Select “Enable Datadog Integration”
    • Grant required permissions when prompted
  3. Configure Settings

    • Enable auto-detect alerts: ✅ Recommended
    • Enable metric extraction: ✅ Recommended
    • Enable log extraction: ✅ Recommended
    • Enable screenshot capture: ❌ Optional (performance impact)
  4. Verify Detection

    • Navigate to a Datadog alert
    • Extension icon should show active status
    • Context should be automatically extracted

Datadog-Specific Settings:

{
  "datadog": {
    "enabled": true,
    "domains": ["app.datadoghq.com", "app.datadoghq.eu"],
    "auto_extract": true,
    "extract_logs": true,
    "extract_metrics": true,
    "screenshot_on_alert": false,
    "notification_types": ["alerts", "metrics"],
    "max_log_lines": 100,
    "max_metric_points": 1000
  }
}
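One way to apply such a config is to merge user-provided values over defaults. The sketch below assumes defaults mirroring the settings shown above; the `loadDatadogSettings` name and `DEFAULTS` object are illustrative, not the extension's actual API.

```javascript
// Illustrative defaults for the Datadog integration settings.
const DEFAULTS = {
  enabled: false,
  domains: ["app.datadoghq.com"],
  auto_extract: true,
  extract_logs: true,
  extract_metrics: true,
  screenshot_on_alert: false,
  notification_types: ["alerts"],
  max_log_lines: 100,
  max_metric_points: 1000,
};

// Merge user settings over the defaults; unknown keys are ignored.
function loadDatadogSettings(userConfig = {}) {
  const merged = { ...DEFAULTS };
  for (const key of Object.keys(DEFAULTS)) {
    if (key in userConfig) merged[key] = userConfig[key];
  }
  return merged;
}
```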

Alert Context:

  • Alert ID and name
  • Current status (OK, Alert, Warning, No Data)
  • Severity level
  • Trigger condition and threshold
  • Affected tags and hosts
  • Alert message and details

Example Extracted Data:

{
  "platform": "datadog",
  "alert_id": "12345",
  "alert_name": "High Error Rate - API",
  "severity": "critical",
  "status": "triggered",
  "message": "Error rate above 5% for API endpoints",
  "tags": ["env:production", "service:api"],
  "metrics": {
    "error_rate": 7.2,
    "threshold": 5.0,
    "time_range": "last_15_minutes"
  },
  "affected_hosts": ["api-1", "api-2", "api-3"],
  "dashboard_url": "https://app.datadoghq.com/dashboard/..."
}
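A payload like this can be folded into an incident title for the pre-populated form. The sketch below assumes the field names from the example; the `incidentTitle` helper and its formatting are illustrative only.

```javascript
// Build a one-line incident title from extracted alert context.
// Field names follow the example payload above.
function incidentTitle(ctx) {
  const breach = ctx.metrics
    ? ` (${ctx.metrics.error_rate}% vs ${ctx.metrics.threshold}% threshold)`
    : "";
  return `[${ctx.severity.toUpperCase()}] ${ctx.alert_name}${breach}`;
}
```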

Automatic Workflow:

  1. Alert Triggered

    • Datadog alert appears on dashboard or notification
    • Extension detects alert context automatically
  2. Context Extraction

    • Alert details extracted (name, severity, status)
    • Related metrics and logs captured
    • Affected infrastructure identified
  3. Incident Creation

    • Click extension icon
    • Review pre-populated incident form
    • Click “Create Incident” with full context
  4. AI Suggestions

    • Overwatch suggests relevant procedures
    • Similar incidents displayed
    • Turn-by-turn guidance provided

What Gets Extracted from Dashboards:

  • Dashboard name and URL
  • Time range (from/to)
  • Visible widgets and their queries
  • Current metric values
  • Threshold configurations

Example Dashboard Context:

{
  "dashboard": {
    "id": "dashboard-abc-123",
    "title": "API Performance Dashboard",
    "url": "https://app.datadoghq.com/dashboard/abc-123",
    "time_range": {
      "from": "now-1h",
      "to": "now"
    }
  },
  "widgets": [
    {
      "id": 1,
      "title": "Request Rate",
      "type": "timeseries",
      "query": "sum:api.requests{env:production}",
      "current_value": 1250,
      "threshold": {
        "warning": 2000,
        "critical": 3000
      }
    }
  ]
}
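Extracted widget data can be classified against its own thresholds. The sketch below assumes the widget shape from the example and that higher values are worse, as in the request-rate widget; the `widgetStatus` helper is illustrative.

```javascript
// Classify a widget's current value against warning/critical thresholds.
// Assumes higher values are worse (e.g. error or request rate).
function widgetStatus(widget) {
  if (!widget.threshold) return "ok";
  if (widget.current_value >= widget.threshold.critical) return "critical";
  if (widget.current_value >= widget.threshold.warning) return "warning";
  return "ok";
}
```

For the example widget above (1250 against warning 2000 / critical 3000), this classifies as "ok".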

Extracted from Monitor Pages:

  • Monitor configuration
  • Query and formula
  • Threshold settings (warning/critical)
  • Notification settings
  • Historical trigger data

What Gets Extracted from Logs:

  • Log messages and timestamps
  • Log levels (ERROR, WARN, INFO, DEBUG)
  • Service and source tags
  • Stack traces (if present)
  • Related logs in time window

Example Log Context:

{
  "logs": [
    {
      "timestamp": "2025-10-15T10:25:00Z",
      "level": "ERROR",
      "service": "api-service",
      "message": "DatabaseTimeoutException: Connection timeout after 30 seconds",
      "stack_trace": "...",
      "tags": ["env:production", "host:api-1"],
      "attributes": {
        "error.kind": "DatabaseTimeoutException",
        "error.stack": "..."
      }
    }
  ],
  "log_patterns": {
    "error_count": 47,
    "unique_errors": 3,
    "most_common": "DatabaseTimeoutException"
  }
}
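A `log_patterns` summary like the one above can be derived from the extracted log list. The sketch below assumes the log shape from the example and groups errors by their `error.kind` attribute, falling back to the message; the `summarizeLogs` helper is illustrative, not the extension's actual pattern-recognition logic.

```javascript
// Derive an error summary from a list of extracted log entries.
function summarizeLogs(logs) {
  const counts = {};
  let errorCount = 0;
  for (const log of logs) {
    if (log.level !== "ERROR") continue;
    errorCount++;
    const kind =
      (log.attributes && log.attributes["error.kind"]) || log.message;
    counts[kind] = (counts[kind] || 0) + 1;
  }
  const kinds = Object.keys(counts);
  const mostCommon =
    kinds.sort((a, b) => counts[b] - counts[a])[0] || null;
  return {
    error_count: errorCount,
    unique_errors: kinds.length,
    most_common: mostCommon,
  };
}
```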

Using with Log Explorer:

  1. Open Log Explorer

    • Navigate to Logs → Explorer
    • Apply filters for your investigation
  2. Review Logs

    • Identify relevant error patterns
    • Note timestamps and affected services
  3. Report Problem

    • Click Overwatch extension icon
    • Or use Ctrl+Shift+R shortcut
    • Log context automatically extracted
  4. Get Solutions

    • AI analyzes log patterns
    • Similar errors identified
    • Solutions suggested with steps

Scenario: High error rate alert triggered

  1. Receive Alert

    • Datadog alert notification received
    • Navigate to alert details page
    • Extension detects alert context
  2. Extract Context

    • Click Overwatch extension icon
    • Review extracted alert data
    • Verify affected services and metrics
  3. Create Incident

    • Click “Create Incident” in overlay
    • Incident pre-populated with alert context
    • Tags automatically assigned
  4. Get Guidance

    • AI suggests relevant procedures
    • View similar past incidents
    • Follow turn-by-turn resolution steps
  5. Track Resolution

    • Execute suggested procedures
    • Monitor metrics for improvement
    • Document successful resolution

Scenario: Performance degradation observed on dashboard

  1. Identify Issue

    • Reviewing API performance dashboard
    • Notice response time spike
    • Correlation with error rate increase
  2. Report Problem

    • Press Ctrl+Shift+R to open report form
    • Describe: “API response time increased 10x in last hour”
    • Dashboard context automatically captured
  3. Receive Analysis

    • AI analyzes metrics and trends
    • Identifies potential root causes
    • Suggests investigation procedures
  4. Follow Guidance

    • Execute diagnostic procedures
    • Check database connections
    • Review recent deployments
  5. Resolve and Learn

    • Identify root cause (connection pool exhaustion)
    • Apply fix and verify
    • Solution captured for future use

Scenario: Investigating complex issue across multiple services

  1. Review Metrics

    • Open composite dashboard
    • Multiple services showing issues
    • Time correlation observed
  2. Capture Context

    • Click extension icon on dashboard
    • Review extracted metrics from all widgets
    • Note time ranges and correlations
  3. Create Detailed Report

    • Use “Report Problem” feature
    • Describe multi-service impact
    • Extension includes all dashboard context
  4. Get Coordinated Response

    • AI identifies service dependencies
    • Suggests investigation order
    • Provides steps for each service
  5. Execute Coordinated Fix

    • Follow service-specific procedures
    • Monitor cross-service metrics
    • Verify cascade resolution
Best Practices:

  1. Use Descriptive Alert Names: Helps with solution matching
  2. Tag Consistently: Tags improve context extraction
  3. Create Focused Dashboards: Better metric correlation
  4. Use Log Patterns: Enable pattern recognition
  5. Document Resolutions: Build organizational knowledge

Performance Tips:

  1. Disable Screenshot Capture: Reduces overhead
  2. Limit Log Extraction: Set a reasonable max_log_lines
  3. Configure Time Windows: Shorter windows mean faster extraction
  4. Use Selective Detection: Enable only the features you need

Context Quality Tips:

  1. Verify Extracted Context: Always review before creating an incident
  2. Add Manual Context: Supplement automated extraction
  3. Update Tags: Keep Datadog tags current and accurate
  4. Maintain Dashboards: Well-organized dashboards extract better
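Limiting log extraction amounts to capping the captured entries at the configured `max_log_lines`. A minimal sketch, assuming the newest entries are the ones worth keeping (the `capLogs` helper name is illustrative):

```javascript
// Cap extracted logs at the configured max_log_lines, keeping the
// most recent entries (assumes logs are ordered oldest to newest).
function capLogs(logs, maxLogLines) {
  return logs.length <= maxLogLines ? logs : logs.slice(-maxLogLines);
}
```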

Troubleshooting

Symptoms: Extension icon grayed out on Datadog pages

Solutions:

  1. Verify URL matches configured domains
  2. Check if Datadog integration enabled in settings
  3. Refresh page and wait for full load
  4. Clear browser cache and reload
  5. Check browser console for errors

Symptoms: Missing metrics or logs in extracted data

Solutions:

  1. Wait for page to fully load before extraction
  2. Verify extraction limits in settings
  3. Check if data is visible on Datadog page
  4. Try manual refresh and re-extract
  5. Enable debug mode to see what’s extracted

Symptoms: Slow page load or browser lag on Datadog

Solutions:

  1. Disable screenshot capture in settings
  2. Reduce max_log_lines and max_metric_points
  3. Disable auto-extraction and extract manually
  4. Clear browser cache
  5. Check for browser extension conflicts

Symptoms: “Unable to access Datadog data” errors

Solutions:

  1. Verify Overwatch connection status
  2. Check Datadog session is active
  3. Re-authenticate with Datadog if needed
  4. Verify browser permissions granted
  5. Contact administrator for access issues

For Datadog-specific issues, contact support@overwatch-observability.com.


Related Documentation: