Alerts, Dashboards and Workbooks

Intermediate | Application Insights & Monitoring | 2026-03-14

Azure Monitor Alerts

Alerts proactively notify you when your monitoring data meets conditions that you define, such as a metric crossing a threshold or a log query returning results. They are essential for maintaining the health of your integration platform.

Microsoft Reference: Azure Monitor alerts overview

Alert Types

| Type | Description | Use Case |
|---|---|---|
| Metric alerts | Threshold on metric values | CPU > 80%, capacity > 70% |
| Log alerts | KQL query results | Failed Logic App runs, error patterns |
| Activity log alerts | Azure resource events | Resource deleted, deployment failed |
| Smart detection | AI-powered anomaly detection | Unusual failure rates, latency spikes |
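
Metric and log alerts are shown in Bicep later in this guide; activity log alerts use a different resource type. The following is a minimal sketch of one, assuming the `ag-integration-ops` action group from the Action Groups section is already deployed in the same resource group. The alert name and the choice of operation (deleting a Logic App workflow) are illustrative assumptions, not part of the original guide.

```bicep
// Assumption: the action group from the Action Groups section already exists.
resource actionGroup 'Microsoft.Insights/actionGroups@2023-01-01' existing = {
  name: 'ag-integration-ops'
}

// Hypothetical alert: fires when a Logic App workflow in this resource group is deleted.
resource workflowDeletedAlert 'Microsoft.Insights/activityLogAlerts@2020-10-01' = {
  name: 'alert-activity-workflow-deleted'
  location: 'global'
  properties: {
    description: 'Alert when a Logic App workflow is deleted'
    enabled: true
    scopes: [resourceGroup().id]
    condition: {
      // All conditions must match the activity log event
      allOf: [
        {
          field: 'category'
          equals: 'Administrative'
        }
        {
          field: 'operationName'
          equals: 'Microsoft.Logic/workflows/delete'
        }
        {
          field: 'status'
          equals: 'Succeeded'
        }
      ]
    }
    actions: {
      actionGroups: [
        {
          actionGroupId: actionGroup.id
        }
      ]
    }
  }
}
```

Activity log alerts have no severity or evaluation window: they fire once per matching event.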

Action Groups

Action groups define who to notify and how:

resource actionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ag-integration-ops'
  location: 'global'
  properties: {
    groupShortName: 'IntOps'
    enabled: true
    emailReceivers: [
      {
        name: 'OpsTeamLead'
        emailAddress: 'ops-lead@enterprise.com'
        useCommonAlertSchema: true
      }
      {
        name: 'OpsTeam'
        emailAddress: 'ops-team@enterprise.com'
        useCommonAlertSchema: true
      }
    ]
    smsReceivers: [
      {
        name: 'OnCallSMS'
        countryCode: '44'
        phoneNumber: '7700900000'
      }
    ]
    webhookReceivers: [
      {
        name: 'TeamsWebhook'
        serviceUri: 'https://enterprise.webhook.office.com/webhookb2/...'
        useCommonAlertSchema: true
      }
      {
        name: 'PagerDuty'
        serviceUri: 'https://events.pagerduty.com/integration/{key}/enqueue'
        useCommonAlertSchema: true
      }
      {
        name: 'ServiceNow'
        serviceUri: 'https://enterprise.service-now.com/api/now/table/incident'
        useCommonAlertSchema: true
      }
    ]
    azureFunctionReceivers: [
      {
        name: 'AutoRemediation'
        functionAppResourceId: functionApp.id
        functionName: 'HandleAlert'
        httpTriggerUrl: 'https://func-remediation.azurewebsites.net/api/HandleAlert'
        useCommonAlertSchema: true
      }
    ]
  }
}

Microsoft Reference: Action groups

Logic Apps Alert Rules

Failed Workflow Runs

resource logicAppFailedRunsAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-la-failed-runs'
  location: 'global'
  properties: {
    description: 'Alert when Logic App workflow runs fail'
    severity: 1
    enabled: true
    scopes: [logicApp.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion' // static (not dynamic) threshold
          name: 'FailedRuns'
          metricName: 'RunsFailed'
          operator: 'GreaterThan'
          threshold: 0
          timeAggregation: 'Total'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}

High Latency Workflows (Log Alert)

resource latencyLogAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-la-high-latency'
  location: location
  properties: {
    description: 'Alert when workflow duration exceeds threshold'
    severity: 2
    enabled: true
    scopes: [logAnalyticsWorkspace.id]
    evaluationFrequency: 'PT15M'
    windowSize: 'PT15M'
    criteria: {
      allOf: [
        {
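          query: '''
            AzureDiagnostics
            | where ResourceType == "WORKFLOWS"
            | where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
            | extend DurationMs = datetime_diff("millisecond", endTime_t, startTime_t)
            | where DurationMs > 30000
            | project TimeGenerated, Resource, DurationMs
          '''
          // 'Count' counts result rows, so each row must be one slow run.
          // Splitting by Resource evaluates the threshold per workflow.
          timeAggregation: 'Count'
          dimensions: [
            {
              name: 'Resource'
              operator: 'Include'
              values: ['*']
            }
          ]
          operator: 'GreaterThan'
          threshold: 5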
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [actionGroup.id]
    }
  }
}
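
The query above only returns rows if the Logic App sends its runtime logs to the Log Analytics workspace. A minimal sketch of the diagnostic setting that does this follows, assuming a Consumption Logic App (`Microsoft.Logic/workflows`). The parameter names and the setting name `diag-logicapp-to-law` are hypothetical.

```bicep
// Hypothetical parameter names; substitute your own resource names.
param logicAppName string
param workspaceName string

resource logicApp 'Microsoft.Logic/workflows@2019-05-01' existing = {
  name: logicAppName
}

resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' existing = {
  name: workspaceName
}

// Sends workflow runtime events and platform metrics to the workspace,
// where the runtime events can be queried through AzureDiagnostics.
resource logicAppDiagnostics 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'diag-logicapp-to-law'
  scope: logicApp
  properties: {
    workspaceId: logAnalyticsWorkspace.id
    logs: [
      {
        category: 'WorkflowRuntime'
        enabled: true
      }
    ]
    metrics: [
      {
        category: 'AllMetrics'
        enabled: true
      }
    ]
  }
}
```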

Throttled Actions

resource throttlingAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-la-throttled'
  location: 'global'
  properties: {
    description: 'Alert when Logic App actions are being throttled'
    severity: 2
    enabled: true
    scopes: [logicApp.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'ThrottledActions'
          metricName: 'ActionThrottledEvents' // action-level throttling, matching the description
          operator: 'GreaterThan'
          threshold: 10
          timeAggregation: 'Total'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}

API Management Alert Rules

High Error Rate

resource apimErrorAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-apim-errors'
  location: 'global'
  properties: {
    description: 'Alert when more than 50 APIM gateway requests fail within 5 minutes'
    severity: 1
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'FailedRequests'
          metricName: 'FailedRequests' // a request count, not a percentage
          operator: 'GreaterThan'
          threshold: 50
          timeAggregation: 'Total'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}
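
The metric alert above compares a count of failed requests against a fixed number, so it does not scale with traffic. A percentage-based error rate can be sketched as a log alert instead. This is a minimal sketch, assuming APIM gateway logs are sent to the workspace in resource-specific mode (the `ApiManagementGatewayLogs` table) and that only 5xx responses count as errors. The parameter names and the rule name are hypothetical.

```bicep
param location string = resourceGroup().location
param workspaceName string // hypothetical parameter

resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' existing = {
  name: workspaceName
}

resource actionGroup 'Microsoft.Insights/actionGroups@2023-01-01' existing = {
  name: 'ag-integration-ops'
}

resource apimErrorRateAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-apim-error-rate'
  location: location
  properties: {
    description: 'Alert when more than 5% of APIM gateway requests return 5xx'
    severity: 1
    enabled: true
    scopes: [logAnalyticsWorkspace.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      allOf: [
        {
          // One row per evaluation window; skipped when there is no traffic
          query: '''
            ApiManagementGatewayLogs
            | summarize Total = count(), Failed = countif(ResponseCode >= 500)
            | where Total > 0
            | extend ErrorRatePct = 100.0 * Failed / Total
          '''
          metricMeasureColumn: 'ErrorRatePct' // compare this column, not the row count
          timeAggregation: 'Maximum'
          operator: 'GreaterThan'
          threshold: 5
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [actionGroup.id]
    }
  }
}
```

With low traffic a single failed request can exceed 5%, so a minimum request volume (for example `where Total > 100`) may be worth adding once a baseline is known.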

Capacity Warning

resource capacityAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-apim-capacity'
  location: 'global'
  properties: {
    description: 'APIM gateway capacity averaging above 80% over 15 minutes'
    severity: 2
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'HighCapacity'
          metricName: 'Capacity'
          operator: 'GreaterThan'
          threshold: 80
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}

Unauthorised Access Spike

resource authAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-apim-auth-failures'
  location: location
  properties: {
    description: 'Spike in unauthorised API requests'
    severity: 1
    scopes: [logAnalyticsWorkspace.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      allOf: [
        {
          query: '''
            ApiManagementGatewayLogs
            | where ResponseCode == 401 or ResponseCode == 403
            | summarize UnauthorisedCount = count() by bin(TimeGenerated, 5m)
            | where UnauthorisedCount > 100
          '''
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 0
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [actionGroup.id]
    }
  }
}

Azure Dashboards

Creating a Dashboard

resource dashboard 'Microsoft.Portal/dashboards@2020-09-01-preview' = {
  name: 'dashboard-integration-ops'
  location: location
  properties: {
    lenses: [
      {
        order: 0
        parts: [
          // Each part is a tile on the dashboard
          // Configured via portal or exported as JSON
        ]
      }
    ]
  }
  tags: {
    'hidden-title': 'Integration Platform Operations'
  }
}
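
To make the shape of a tile concrete, here is a minimal sketch of a dashboard containing a single Markdown part. The resource name, position values, and Markdown text are illustrative assumptions. Metric and log query tiles carry much larger `inputs` and `settings` payloads, so exporting them from the portal is usually easier than writing them by hand.

```bicep
param location string = resourceGroup().location

// Hypothetical dashboard with one Markdown tile in the top-left corner.
resource dashboardWithHeader 'Microsoft.Portal/dashboards@2020-09-01-preview' = {
  name: 'dashboard-integration-ops-sample'
  location: location
  tags: {
    'hidden-title': 'Integration Platform Operations (sample)'
  }
  properties: {
    lenses: [
      {
        order: 0
        parts: [
          {
            // Grid position and size, in dashboard cells
            position: {
              x: 0
              y: 0
              colSpan: 6
              rowSpan: 2
            }
            metadata: {
              type: 'Extension/HubsExtension/PartType/MarkdownPart'
              inputs: []
              settings: {
                content: {
                  settings: {
                    content: '## Integration Platform Health'
                    title: ''
                    subtitle: ''
                  }
                }
              }
            }
          }
        ]
      }
    ]
  }
}
```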

Recommended Dashboard Layout

┌─────────────────────────────────────────────────────────┐
│               Integration Platform Health               │
├──────────────────────┬──────────────────────────────────┤
│  Logic App Runs      │  API Management Requests         │
│  (metric chart)      │  (metric chart)                  │
├──────────────────────┼──────────────────────────────────┤
│  Failed Runs         │  Error Rate                      │
│  (KPI tile)          │  (KPI tile)                      │
├──────────────────────┼──────────────────────────────────┤
│  Workflow Latency    │  APIM Latency (p95)              │
│  (line chart)        │  (line chart)                    │
├──────────────────────┼──────────────────────────────────┤
│  Top Errors          │  APIM Capacity                   │
│  (log query table)   │  (metric chart)                  │
├──────────────────────┼──────────────────────────────────┤
│  Active Alerts       │  Resource Health                 │
│  (alerts summary)    │  (resource health tile)          │
└──────────────────────┴──────────────────────────────────┘

Pinning Queries to Dashboards

  1. Run your KQL query in Log Analytics
  2. Click Pin to dashboard in the toolbar
  3. Select the target dashboard
  4. Configure tile size and refresh interval
  5. Repeat for each query

Microsoft Reference: Azure dashboards

Azure Workbooks

Workbooks provide interactive, customisable reports that combine text, KQL queries, metrics, and parameters.

Microsoft Reference: Azure Monitor workbooks

Creating an Integration Platform Workbook

Step 1: Parameters

Add time range and resource selectors:

Parameter: TimeRange (type: Time range picker, default: Last 24 hours)
Parameter: Environment (type: Dropdown, values: dev, test, prod)
Parameter: LogicAppName (type: Resource picker, resource type: Microsoft.Logic/workflows)

Step 2: Overview Section

// Summary statistics
AzureDiagnostics
| where ResourceType == "WORKFLOWS"
| where TimeGenerated {TimeRange}
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize
    TotalRuns = count(),
    Succeeded = countif(status_s == "Succeeded"),
    Failed = countif(status_s == "Failed"),
    AvgDuration = avg(datetime_diff("millisecond", endTime_t, startTime_t))
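// Guard against division by zero when no runs fall in the selected time range
| extend SuccessRate = iff(TotalRuns == 0, real(null), round(100.0 * Succeeded / TotalRuns, 2))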

Step 3: Failure Analysis

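// Failed runs by workflow
AzureDiagnostics
| where ResourceType == "WORKFLOWS"
| where TimeGenerated {TimeRange}
// Restrict to run-level events so failed actions and triggers are not counted as runs
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| where status_s == "Failed"
| summarize Failures = count() by Resource, error_code_s
| order by Failures desc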

Step 4: Trend Charts

// Success/failure trend
AzureDiagnostics
| where ResourceType == "WORKFLOWS"
| where TimeGenerated {TimeRange}
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize
    Succeeded = countif(status_s == "Succeeded"),
    Failed = countif(status_s == "Failed")
  by bin(TimeGenerated, {TimeRange:grain})
| render timechart

Workbook Templates

Common workbook sections for an integration platform:

| Section | Content |
|---|---|
| Executive Summary | KPIs, success rates, total volumes |
| Logic Apps Health | Run counts, failures, duration trends |
| APIM Performance | Request volume, latency, error rates |
| Error Analysis | Top errors, error trends, affected workflows |
| Consumer Insights | API usage by subscription, top consumers |
| Cost Analysis | Execution counts, billable operations |
| SLA Compliance | Uptime percentages, breach incidents |
| Capacity Planning | Resource utilisation, scaling recommendations |

Exporting Workbooks as ARM/Bicep

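# Retrieve the workbook resource definition as JSON; this is the resource itself,
# which can serve as the basis for an ARM/Bicep template.
# --name takes the workbook's resource name (a GUID), not its display name.
az monitor app-insights workbook show \
  --resource-group rg-integration-prod \
  --name "<workbook-resource-name>" \
  --output json > workbook-template.json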
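
Going the other way, a workbook can be deployed from Bicep through the `Microsoft.Insights/workbooks` resource, whose `serializedData` property holds the workbook content as a JSON string. The following is a minimal sketch under stated assumptions: the parameter names are hypothetical, and the content is a hand-written two-item workbook rather than an export of the workbook built in the steps above.

```bicep
param location string = resourceGroup().location
param workspaceName string // hypothetical parameter

resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' existing = {
  name: workspaceName
}

// Minimal workbook content: one text item and one log query item.
var workbookContent = {
  version: 'Notebook/1.0'
  items: [
    {
      type: 1 // text item
      content: {
        json: '## Integration Platform Health'
      }
      name: 'title'
    }
    {
      type: 3 // query item
      content: {
        version: 'KqlItem/1.0'
        query: 'AzureDiagnostics | where ResourceType == "WORKFLOWS" | summarize Runs = count() by status_s'
        size: 0
        queryType: 0 // Logs
        resourceType: 'microsoft.operationalinsights/workspaces'
      }
      name: 'runs-by-status'
    }
  ]
  fallbackResourceIds: [logAnalyticsWorkspace.id]
}

resource workbook 'Microsoft.Insights/workbooks@2022-04-01' = {
  // Workbook resource names must be GUIDs; guid() keeps the name stable across deployments
  name: guid(resourceGroup().id, 'integration-platform-health')
  location: location
  kind: 'shared'
  properties: {
    displayName: 'Integration Platform Health'
    category: 'workbook'
    sourceId: logAnalyticsWorkspace.id
    serializedData: string(workbookContent)
  }
}
```

For a workbook authored in the portal, the content JSON is typically copied from the workbook's advanced editor into a file and loaded instead of being written inline.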

Alert Processing Rules

Alert processing rules control notifications for fired alerts, either suppressing action groups or adding them, based on scope, conditions, and schedule:

resource alertProcessingRule 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'suppress-dev-alerts-outside-hours'
  location: 'global'
  properties: {
    scopes: ['/subscriptions/{sub}/resourceGroups/rg-integration-dev']
    conditions: [
      {
        field: 'Severity'
        operator: 'Equals'
        values: ['Sev3', 'Sev4']
      }
    ]
    actions: [
      {
        actionType: 'RemoveAllActionGroups'
      }
    ]
    schedule: {
      effectiveFrom: '2026-01-01T00:00:00' // local time in timeZone, so no UTC 'Z' suffix
      timeZone: 'GMT Standard Time'
      recurrences: [
        {
          // Daily overnight window; a Monday-to-Friday weekly rule would
          // leave Monday 00:00-09:00 unsuppressed
          recurrenceType: 'Daily'
          startTime: '18:00:00'
          endTime: '09:00:00'
        }
        {
          recurrenceType: 'Weekly'
          daysOfWeek: ['Saturday', 'Sunday']
        }
      ]
    }
  }
}
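
The rule above covers suppression. Routing works the same way with the `AddActionGroups` action type, which attaches an action group to every matching alert regardless of how the alert rule itself is configured. A minimal sketch follows, assuming an on-call action group named `ag-integration-oncall` already exists; that name and the rule name are hypothetical.

```bicep
// Assumption: an on-call action group with this (hypothetical) name already exists.
resource onCallActionGroup 'Microsoft.Insights/actionGroups@2023-01-01' existing = {
  name: 'ag-integration-oncall'
}

// Adds the on-call action group to every Sev0/Sev1 alert fired in this resource group.
resource routeCriticalAlerts 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'route-critical-alerts-to-oncall'
  location: 'global'
  properties: {
    description: 'Route Sev0 and Sev1 alerts to the on-call action group'
    enabled: true
    scopes: [resourceGroup().id]
    conditions: [
      {
        field: 'Severity'
        operator: 'Equals'
        values: ['Sev0', 'Sev1']
      }
    ]
    actions: [
      {
        actionType: 'AddActionGroups'
        actionGroupIds: [onCallActionGroup.id]
      }
    ]
  }
}
```

Without a `schedule` block the rule applies at all times.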

Microsoft Reference: Alert processing rules

Recommended Alert Strategy

Severity Levels

| Severity | Description | Response | Example |
|---|---|---|---|
| Sev 0 | Critical: service down | Immediate (< 15 min) | All APIs returning 503 |
| Sev 1 | High: significant impact | Urgent (< 30 min) | >20% error rate |
| Sev 2 | Warning: degraded | Standard (< 2 hours) | High latency, capacity >80% |
| Sev 3 | Informational: monitor | Review next business day | Unusual traffic patterns |
| Sev 4 | Verbose: tracking | Weekly review | Cost threshold approaching |

Alert Matrix

| Resource | Metric | Sev 0 | Sev 1 | Sev 2 |
|---|---|---|---|---|
| Logic Apps | Failed runs | >50% fail rate | Any failure | Latency >30s |
| APIM | Error rate | >20% 5xx | >5% errors | >2% errors |
| APIM | Capacity | >95% | >85% | >70% |
| APIM | Latency | p99 >10s | p95 >5s | p95 >2s |
| App Insights | Availability | <95% | <99% | <99.5% |
| Key Vault | Throttled | Any | - | - |
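
One row of the matrix can be turned into alert rules with a Bicep loop, which keeps the thresholds in one place. The sketch below does this for the APIM capacity row (95%, 85%, 70%), reusing the metric name and window from the capacity alert earlier in this guide. The `apimName` parameter and the rule naming pattern are hypothetical.

```bicep
param apimName string // hypothetical parameter

resource apim 'Microsoft.ApiManagement/service@2022-08-01' existing = {
  name: apimName
}

resource actionGroup 'Microsoft.Insights/actionGroups@2023-01-01' existing = {
  name: 'ag-integration-ops'
}

// APIM capacity row of the alert matrix
var capacityTiers = [
  {
    severity: 0
    threshold: 95
  }
  {
    severity: 1
    threshold: 85
  }
  {
    severity: 2
    threshold: 70
  }
]

// One metric alert rule per tier
resource capacityAlerts 'Microsoft.Insights/metricAlerts@2018-03-01' = [for tier in capacityTiers: {
  name: 'alert-apim-capacity-sev${tier.severity}'
  location: 'global'
  properties: {
    description: 'APIM gateway capacity averaging above ${tier.threshold}%'
    severity: tier.severity
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'HighCapacity'
          metricName: 'Capacity'
          operator: 'GreaterThan'
          threshold: tier.threshold
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}]
```

Because the tiers overlap, capacity above 95% fires all three rules at once; pairing this with severity-specific action groups or an alert processing rule avoids triple notifications.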

Best Practices

  1. Use severity levels consistently across all alert rules
  2. Avoid alert fatigue — tune thresholds based on baselines, not guesses
  3. Create separate action groups for different severity levels
  4. Suppress non-critical alerts outside business hours
  5. Use workbooks for interactive investigation, dashboards for at-a-glance monitoring
  6. Review and tune alerts quarterly — adjust thresholds as baselines change
  7. Document alert runbooks — what to do when each alert fires
  8. Test alert delivery regularly to ensure notifications reach the right people
  9. Use auto-remediation for known, automatable issues
  10. Centralise monitoring in a shared Log Analytics workspace for cross-service queries
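
Practice 3 can be sketched as two action groups: one that pages for high-severity alerts and one that only emails for lower severities. The group names and short names are hypothetical; the receivers reuse the example addresses and phone number from the Action Groups section.

```bicep
// Hypothetical group for Sev 0 and Sev 1: SMS plus email to the team lead.
resource onCallGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ag-integration-oncall'
  location: 'global'
  properties: {
    groupShortName: 'IntOnCall' // short names are limited to 12 characters
    enabled: true
    smsReceivers: [
      {
        name: 'OnCallSMS'
        countryCode: '44'
        phoneNumber: '7700900000'
      }
    ]
    emailReceivers: [
      {
        name: 'OpsTeamLead'
        emailAddress: 'ops-lead@enterprise.com'
        useCommonAlertSchema: true
      }
    ]
  }
}

// Hypothetical group for Sev 2 and below: team email only, no paging.
resource reviewGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ag-integration-review'
  location: 'global'
  properties: {
    groupShortName: 'IntReview'
    enabled: true
    emailReceivers: [
      {
        name: 'OpsTeam'
        emailAddress: 'ops-team@enterprise.com'
        useCommonAlertSchema: true
      }
    ]
  }
}
```

Each alert rule then references `onCallGroup.id` or `reviewGroup.id` according to its severity.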
