Datadog Integration

Seamless Integration with Datadog Monitoring Platform

The Overwatch Chrome Extension provides comprehensive integration with Datadog, enabling automatic context extraction from alerts, metrics, logs, and dashboards.

Alert Detection

  • Automatic alert detection and context extraction
  • Monitor and SLO status extraction
  • Alert correlation and impact analysis

Metric Correlation

  • Time series data extraction
  • Metric correlation and analysis
  • Dashboard context capture

Log Aggregation

  • Log message extraction
  • Error pattern recognition
  • Stack trace capture

Dashboard Integration

  • Dashboard screenshot capture
  • Widget and panel extraction
  • Time range and filter context

Automatic Detection For:

  • app.datadoghq.com (US1)
  • app.datadoghq.eu (EU)
  • us3.datadoghq.com (US3)
  • us5.datadoghq.com (US5)
  • Custom Datadog instances (configure in settings)
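Detection against this domain list can be sketched as follows. This is a minimal illustration, not the extension's actual implementation; the `customDomains` parameter is a hypothetical stand-in for the custom-instance setting.

```javascript
// Known Datadog sites the extension watches for.
const DATADOG_DOMAINS = [
  "app.datadoghq.com", // US1
  "app.datadoghq.eu",  // EU
  "us3.datadoghq.com", // US3
  "us5.datadoghq.com", // US5
];

// Returns true when the page URL belongs to a Datadog instance.
// customDomains models the "configure in settings" entry above.
function isDatadogPage(url, customDomains = []) {
  const host = new URL(url).hostname;
  return [...DATADOG_DOMAINS, ...customDomains].includes(host);
}
```

A custom instance passed via `customDomains` is matched the same way as the built-in sites.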
Quick Setup:

  1. Navigate to Datadog

    • Open any Datadog page (app.datadoghq.com)
    • Extension automatically detects platform
  2. Grant Permissions

    • Click Overwatch extension icon
    • Select “Enable Datadog Integration”
    • Grant required permissions when prompted
  3. Configure Settings

    • Enable auto-detect alerts: ✅ Recommended
    • Enable metric extraction: ✅ Recommended
    • Enable log extraction: ✅ Recommended
    • Enable screenshot capture: ❌ Optional (performance impact)
  4. Verify Detection

    • Navigate to a Datadog alert
    • Extension icon should show active status
    • Context should be automatically extracted

Datadog-Specific Settings:

{
  "datadog": {
    "enabled": true,
    "domains": ["app.datadoghq.com", "app.datadoghq.eu"],
    "auto_extract": true,
    "extract_logs": true,
    "extract_metrics": true,
    "screenshot_on_alert": false,
    "notification_types": ["alerts", "metrics"],
    "max_log_lines": 100,
    "max_metric_points": 1000
  }
}
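One way to apply such a config is to merge user-provided values over defaults. The sketch below assumes defaults mirroring the settings shown above; the `loadDatadogSettings` name and `DEFAULTS` object are illustrative, not the extension's actual API.

```javascript
// Illustrative defaults for the Datadog integration settings.
const DEFAULTS = {
  enabled: false,
  domains: ["app.datadoghq.com"],
  auto_extract: true,
  extract_logs: true,
  extract_metrics: true,
  screenshot_on_alert: false,
  notification_types: ["alerts"],
  max_log_lines: 100,
  max_metric_points: 1000,
};

// Merge user settings over the defaults; unknown keys are ignored.
function loadDatadogSettings(userConfig = {}) {
  const merged = { ...DEFAULTS };
  for (const key of Object.keys(DEFAULTS)) {
    if (key in userConfig) merged[key] = userConfig[key];
  }
  return merged;
}
```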

Alert Context:

  • Alert ID and name
  • Current status (OK, Alert, Warning, No Data)
  • Severity level
  • Trigger condition and threshold
  • Affected tags and hosts
  • Alert message and details

Example Extracted Data:

{
  "platform": "datadog",
  "alert_id": "12345",
  "alert_name": "High Error Rate - API",
  "severity": "critical",
  "status": "triggered",
  "message": "Error rate above 5% for API endpoints",
  "tags": ["env:production", "service:api"],
  "metrics": {
    "error_rate": 7.2,
    "threshold": 5.0,
    "time_range": "last_15_minutes"
  },
  "affected_hosts": ["api-1", "api-2", "api-3"],
  "dashboard_url": "https://app.datadoghq.com/dashboard/..."
}
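A payload like this can be folded into an incident title for the pre-populated form. The sketch below assumes the field names from the example; the `incidentTitle` helper and its formatting are illustrative only.

```javascript
// Build a one-line incident title from extracted alert context.
// Field names follow the example payload above.
function incidentTitle(ctx) {
  const breach = ctx.metrics
    ? ` (${ctx.metrics.error_rate}% vs ${ctx.metrics.threshold}% threshold)`
    : "";
  return `[${ctx.severity.toUpperCase()}] ${ctx.alert_name}${breach}`;
}
```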

Automatic Workflow:

  1. Alert Triggered

    • Datadog alert appears on dashboard or notification
    • Extension detects alert context automatically
  2. Context Extraction

    • Alert details extracted (name, severity, status)
    • Related metrics and logs captured
    • Affected infrastructure identified
  3. Incident Creation

    • Click extension icon
    • Review pre-populated incident form
    • Click “Create Incident” with full context
  4. AI Suggestions

    • Overwatch suggests relevant procedures
    • Similar incidents displayed
    • Turn-by-turn guidance provided

What Gets Extracted from Dashboards:

  • Dashboard name and URL
  • Time range (from/to)
  • Visible widgets and their queries
  • Current metric values
  • Threshold configurations

Example Dashboard Context:

{
  "dashboard": {
    "id": "dashboard-abc-123",
    "title": "API Performance Dashboard",
    "url": "https://app.datadoghq.com/dashboard/abc-123",
    "time_range": {
      "from": "now-1h",
      "to": "now"
    }
  },
  "widgets": [
    {
      "id": 1,
      "title": "Request Rate",
      "type": "timeseries",
      "query": "sum:api.requests{env:production}",
      "current_value": 1250,
      "threshold": {
        "warning": 2000,
        "critical": 3000
      }
    }
  ]
}
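Extracted widget data can be classified against its own thresholds. The sketch below assumes the widget shape from the example and that higher values are worse, as in the request-rate widget; the `widgetStatus` helper is illustrative.

```javascript
// Classify a widget's current value against warning/critical thresholds.
// Assumes higher values are worse (e.g. error or request rate).
function widgetStatus(widget) {
  if (!widget.threshold) return "ok";
  if (widget.current_value >= widget.threshold.critical) return "critical";
  if (widget.current_value >= widget.threshold.warning) return "warning";
  return "ok";
}
```

For the example widget above (1250 against warning 2000 / critical 3000), this classifies as "ok".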

Extracted from Monitor Pages:

  • Monitor configuration
  • Query and formula
  • Threshold settings (warning/critical)
  • Notification settings
  • Historical trigger data

What Gets Extracted from Logs:

  • Log messages and timestamps
  • Log levels (ERROR, WARN, INFO, DEBUG)
  • Service and source tags
  • Stack traces (if present)
  • Related logs in time window

Example Log Context:

{
  "logs": [
    {
      "timestamp": "2025-10-15T10:25:00Z",
      "level": "ERROR",
      "service": "api-service",
      "message": "DatabaseTimeoutException: Connection timeout after 30 seconds",
      "stack_trace": "...",
      "tags": ["env:production", "host:api-1"],
      "attributes": {
        "error.kind": "DatabaseTimeoutException",
        "error.stack": "..."
      }
    }
  ],
  "log_patterns": {
    "error_count": 47,
    "unique_errors": 3,
    "most_common": "DatabaseTimeoutException"
  }
}
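A `log_patterns` summary like the one above can be derived from the extracted log list. The sketch below assumes the log shape from the example and groups errors by their `error.kind` attribute, falling back to the message; the `summarizeLogs` helper is illustrative, not the extension's actual pattern-recognition logic.

```javascript
// Derive an error summary from a list of extracted log entries.
function summarizeLogs(logs) {
  const counts = {};
  let errorCount = 0;
  for (const log of logs) {
    if (log.level !== "ERROR") continue;
    errorCount++;
    const kind =
      (log.attributes && log.attributes["error.kind"]) || log.message;
    counts[kind] = (counts[kind] || 0) + 1;
  }
  const kinds = Object.keys(counts);
  const mostCommon =
    kinds.sort((a, b) => counts[b] - counts[a])[0] || null;
  return {
    error_count: errorCount,
    unique_errors: kinds.length,
    most_common: mostCommon,
  };
}
```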

Using with Log Explorer:

  1. Open Log Explorer

    • Navigate to Logs → Explorer
    • Apply filters for your investigation
  2. Review Logs

    • Identify relevant error patterns
    • Note timestamps and affected services
  3. Report Problem

    • Click Overwatch extension icon
    • Or use Ctrl+Shift+R shortcut
    • Log context automatically extracted
  4. Get Solutions

    • AI analyzes log patterns
    • Similar errors identified
    • Solutions suggested with steps

Scenario: High error rate alert triggered

  1. Receive Alert

    • Datadog alert notification received
    • Navigate to alert details page
    • Extension detects alert context
  2. Extract Context

    • Click Overwatch extension icon
    • Review extracted alert data
    • Verify affected services and metrics
  3. Create Incident

    • Click “Create Incident” in overlay
    • Incident pre-populated with alert context
    • Tags automatically assigned
  4. Get Guidance

    • AI suggests relevant procedures
    • View similar past incidents
    • Follow turn-by-turn resolution steps
  5. Track Resolution

    • Execute suggested procedures
    • Monitor metrics for improvement
    • Document successful resolution

Scenario: Performance degradation observed on dashboard

  1. Identify Issue

    • Reviewing API performance dashboard
    • Notice response time spike
    • Correlation with error rate increase
  2. Report Problem

    • Press Ctrl+Shift+R to open report form
    • Describe: “API response time increased 10x in last hour”
    • Dashboard context automatically captured
  3. Receive Analysis

    • AI analyzes metrics and trends
    • Identifies potential root causes
    • Suggests investigation procedures
  4. Follow Guidance

    • Execute diagnostic procedures
    • Check database connections
    • Review recent deployments
  5. Resolve and Learn

    • Identify root cause (connection pool exhaustion)
    • Apply fix and verify
    • Solution captured for future use

Scenario: Investigating complex issue across multiple services

  1. Review Metrics

    • Open composite dashboard
    • Multiple services showing issues
    • Time correlation observed
  2. Capture Context

    • Click extension icon on dashboard
    • Review extracted metrics from all widgets
    • Note time ranges and correlations
  3. Create Detailed Report

    • Use “Report Problem” feature
    • Describe multi-service impact
    • Extension includes all dashboard context
  4. Get Coordinated Response

    • AI identifies service dependencies
    • Suggests investigation order
    • Provides steps for each service
  5. Execute Coordinated Fix

    • Follow service-specific procedures
    • Monitor cross-service metrics
    • Verify cascade resolution
Best Practices:

  1. Use Descriptive Alert Names: Helps with solution matching
  2. Tag Consistently: Tags improve context extraction
  3. Create Focused Dashboards: Better metric correlation
  4. Use Log Patterns: Enable pattern recognition
  5. Document Resolutions: Build organizational knowledge

Performance Tips:

  1. Disable Screenshot Capture: Reduces overhead
  2. Limit Log Extraction: Set a reasonable max_log_lines
  3. Configure Time Windows: Shorter windows mean faster extraction
  4. Use Selective Detection: Enable only the features you need

Context Quality Tips:

  1. Verify Extracted Context: Always review before creating an incident
  2. Add Manual Context: Supplement automated extraction
  3. Update Tags: Keep Datadog tags current and accurate
  4. Maintain Dashboards: Well-organized dashboards extract better
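Limiting log extraction amounts to capping the captured entries at the configured `max_log_lines`. A minimal sketch, assuming the newest entries are the ones worth keeping (the `capLogs` helper name is illustrative):

```javascript
// Cap extracted logs at the configured max_log_lines, keeping the
// most recent entries (assumes logs are ordered oldest to newest).
function capLogs(logs, maxLogLines) {
  return logs.length <= maxLogLines ? logs : logs.slice(-maxLogLines);
}
```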

Troubleshooting

Symptoms: Extension icon grayed out on Datadog pages

Solutions:

  1. Verify URL matches configured domains
  2. Check if Datadog integration enabled in settings
  3. Refresh page and wait for full load
  4. Clear browser cache and reload
  5. Check browser console for errors

Symptoms: Missing metrics or logs in extracted data

Solutions:

  1. Wait for page to fully load before extraction
  2. Verify extraction limits in settings
  3. Check if data is visible on Datadog page
  4. Try manual refresh and re-extract
  5. Enable debug mode to see what’s extracted

Symptoms: Slow page load or browser lag on Datadog

Solutions:

  1. Disable screenshot capture in settings
  2. Reduce max_log_lines and max_metric_points
  3. Disable auto-extraction and extract manually
  4. Clear browser cache
  5. Check for browser extension conflicts

Symptoms: “Unable to access Datadog data” errors

Solutions:

  1. Verify Overwatch connection status
  2. Check Datadog session is active
  3. Re-authenticate with Datadog if needed
  4. Verify browser permissions granted
  5. Contact administrator for access issues

For Datadog-specific issues, contact support@overwatch-observability.com.


Related Documentation: