Alerts Guide

Get notified when things go wrong

Overview

Qorrelate alerts monitor your logs, metrics, and traces 24/7 and notify you when conditions are met. Configure thresholds, set up notification channels, and reduce alert fatigue with smart grouping.

Log Alerts

Alert on error patterns, specific messages, or log volume spikes.

Metric Alerts

Alert on thresholds, anomalies, or rate of change in metrics.

Trace Alerts

Alert on high latency, error rates, or trace patterns.

Creating an Alert

Via Dashboard

  1. Navigate to Alerts in the sidebar
  2. Click + Create Alert
  3. Choose alert type (Log, Metric, or Trace)
  4. Configure the condition and threshold
  5. Add notification destinations
  6. Click Save

Via API

curl -X POST https://qorrelate.io/v1/organizations/{org_id}/alerts \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "High Error Rate",
    "type": "metric",
    "condition": {
      "query": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
      "operator": ">",
      "threshold": 5,
      "for": "5m"
    },
    "notifications": ["slack-engineering"],
    "severity": "critical",
    "enabled": true
  }'

Alert Conditions

Operators

Operator   Description     Example
>          Greater than    Error rate > 5%
<          Less than       Request rate < 100/min
=          Equal to        Healthy instances = 0
!=         Not equal to    Status != "running"

Duration (for)

The for parameter prevents flapping by requiring the condition to be true for a sustained period:

  • for: "0s" — Alert immediately (may be noisy)
  • for: "1m" — Alert after 1 minute (recommended minimum)
  • for: "5m" — Alert after 5 minutes (recommended for most alerts)
  • for: "15m" — Alert after 15 minutes (for slow-burn issues)

Log Alerts

Alert based on log content, patterns, or volume.

Example: Alert on Error Logs

{
  "name": "Critical Errors in Production",
  "type": "log",
  "condition": {
    "query": "severity:ERROR AND resource.environment:production",
    "count_operator": ">",
    "count_threshold": 10,
    "time_window": "5m"
  },
  "notifications": ["slack-oncall", "pagerduty-critical"]
}

Example: Alert on Specific Pattern

{
  "name": "Database Connection Failures",
  "type": "log",
  "condition": {
    "query": "\"connection refused\" OR \"connection timeout\" AND service.name:api",
    "count_operator": ">",
    "count_threshold": 5,
    "time_window": "1m"
  },
  "notifications": ["slack-backend"]
}

Metric Alerts

Alert based on metric thresholds using PromQL queries.

Example: High Latency Alert

{
  "name": "API P99 Latency > 500ms",
  "type": "metric",
  "condition": {
    "query": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"api\"}[5m])) by (le))",
    "operator": ">",
    "threshold": 0.5,
    "for": "5m"
  },
  "notifications": ["slack-engineering"]
}

Example: High Error Rate

{
  "name": "Error Rate > 5%",
  "type": "metric",
  "condition": {
    "query": "100 * sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
    "operator": ">",
    "threshold": 5,
    "for": "5m"
  },
  "notifications": ["pagerduty-critical"],
  "severity": "critical"
}

Example: Low Traffic (Service Down)

{
  "name": "No Traffic to API",
  "type": "metric",
  "condition": {
    "query": "sum(rate(http_requests_total{service=\"api\"}[5m]))",
    "operator": "<",
    "threshold": 1,
    "for": "5m"
  },
  "notifications": ["pagerduty-critical"],
  "severity": "critical"
}

Trace Alerts

Alert based on trace-derived metrics like latency and error rates.

{
  "name": "Checkout Latency Spike",
  "type": "trace",
  "condition": {
    "service": "checkout-service",
    "operation": "POST /checkout",
    "metric": "p95_latency",
    "operator": ">",
    "threshold": 1000,
    "for": "5m"
  },
  "notifications": ["slack-checkout-team"]
}

Slack Integration

  1. Go to Settings → Notifications
  2. Click Add Destination → Slack
  3. Click Add to Slack to authorize
  4. Select the channel for alerts
  5. Click Save

Via Webhook (alternative)

{
  "type": "slack",
  "name": "slack-engineering",
  "webhook_url": "https://hooks.slack.com/services/T00/B00/xxx"
}
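
If you prefer to manage destinations programmatically, the same config can be sent with your API key. The /notifications path below is an assumption for illustration; confirm the exact destinations endpoint in your API reference.

# NOTE: the /notifications path is assumed, not confirmed by this guide
curl -X POST https://qorrelate.io/v1/organizations/{org_id}/notifications \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "slack",
    "name": "slack-engineering",
    "webhook_url": "https://hooks.slack.com/services/T00/B00/xxx"
  }'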

PagerDuty Integration

  1. In PagerDuty, create a new integration and get the Integration Key
  2. In Qorrelate, go to Settings → Notifications
  3. Click Add Destination → PagerDuty
  4. Enter the Integration Key
  5. Click Save

{
  "type": "pagerduty",
  "name": "pagerduty-critical",
  "integration_key": "your-pagerduty-integration-key"
}

Email Notifications

{
  "type": "email",
  "name": "email-oncall",
  "addresses": ["oncall@yourcompany.com", "team@yourcompany.com"]
}

Custom Webhook

Send alerts to any HTTP endpoint:

{
  "type": "webhook",
  "name": "custom-webhook",
  "url": "https://your-server.com/alerts",
  "method": "POST",
  "headers": {
    "Authorization": "Bearer your-token"
  }
}

Webhook Payload

{
  "alert_name": "High Error Rate",
  "status": "firing",
  "severity": "critical",
  "timestamp": "2024-01-15T10:30:00Z",
  "condition": {
    "query": "error_rate > 5%",
    "value": 7.5
  },
  "labels": {
    "service": "api-gateway",
    "environment": "production"
  }
}
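
Before relying on a custom webhook, you can check that your endpoint parses this payload by posting the sample above to it by hand. The URL and bearer token are the placeholders from the Custom Webhook example:

# Placeholder URL and token from the Custom Webhook example above
curl -X POST https://your-server.com/alerts \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "alert_name": "High Error Rate",
    "status": "firing",
    "severity": "critical",
    "timestamp": "2024-01-15T10:30:00Z",
    "condition": {"query": "error_rate > 5%", "value": 7.5},
    "labels": {"service": "api-gateway", "environment": "production"}
  }'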

API Reference

List Alerts

GET /v1/organizations/{org_id}/alerts
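
For example, using the same base URL and API key as the create example earlier in this guide:

curl https://qorrelate.io/v1/organizations/{org_id}/alerts \
  -H "Authorization: Bearer YOUR_API_KEY"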

Get Alert

GET /v1/organizations/{org_id}/alerts/{alert_id}

Create Alert

POST /v1/organizations/{org_id}/alerts

Update Alert

PUT /v1/organizations/{org_id}/alerts/{alert_id}
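
For example, to disable an alert during planned maintenance. This sketch assumes the update accepts the same fields as create and resends the full definition; adjust if your workspace supports partial updates:

curl -X PUT https://qorrelate.io/v1/organizations/{org_id}/alerts/{alert_id} \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "High Error Rate",
    "type": "metric",
    "condition": {
      "query": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
      "operator": ">",
      "threshold": 5,
      "for": "5m"
    },
    "notifications": ["slack-engineering"],
    "severity": "critical",
    "enabled": false
  }'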

Delete Alert

DELETE /v1/organizations/{org_id}/alerts/{alert_id}

Silence Alert

POST /v1/organizations/{org_id}/alerts/{alert_id}/silence

{
  "duration": "2h",
  "reason": "Deploying fix"
}
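
For example, silencing an alert for two hours while a fix is rolled out:

curl -X POST https://qorrelate.io/v1/organizations/{org_id}/alerts/{alert_id}/silence \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "duration": "2h",
    "reason": "Deploying fix"
  }'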

Best Practices

Use meaningful names

Alert names should clearly describe what's wrong: "API Error Rate > 5%", not "Alert 1"

Set appropriate durations

Use for: "5m" for most alerts to avoid alert fatigue from transient spikes; treat "1m" as the practical minimum

Use severity levels

Reserve "critical" for true emergencies. Route critical alerts to PagerDuty, warnings to Slack.

Include runbook links

Add a runbook_url to alerts so the on-call engineer knows how to respond

Test your alerts

Intentionally trigger alerts in staging to verify they work before relying on them
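
One simple way to exercise the full notification path is to create a throwaway alert in staging whose threshold is low enough to fire against normal traffic, confirm the notification arrives, then delete it. The query and channel below are placeholders; point them at your staging data:

curl -X POST https://qorrelate.io/v1/organizations/{org_id}/alerts \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "TEST - Notification Path Check",
    "type": "metric",
    "condition": {
      "query": "sum(rate(http_requests_total[5m]))",
      "operator": ">",
      "threshold": 0,
      "for": "1m"
    },
    "notifications": ["slack-engineering"],
    "enabled": true
  }'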