Alerts, Dashboards and Workbooks

Intermediate | Application Insights & Monitoring | 2026-03-14

Azure Monitor Alerts

Alerts notify you proactively when your monitoring data meets conditions you define, such as a metric crossing a threshold or a log query returning results. They are essential for maintaining the health of your integration platform.

Microsoft Reference: Azure Monitor alerts overview

Alert Types

Type	Description	Use Case
Metric alerts	Threshold on metric values	CPU > 80%, capacity > 70%
Log alerts	KQL query results	Failed Logic App runs, error patterns
Activity log alerts	Azure resource events	Resource deleted, deployment failed
Smart detection	AI-powered anomaly detection	Unusual failure rates, latency spikes

Action Groups

Action groups define who to notify and how:

resource actionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ag-integration-ops'
  location: 'global'
  properties: {
    groupShortName: 'IntOps'
    enabled: true
    emailReceivers: [
      {
        name: 'OpsTeamLead'
        emailAddress: 'ops-lead@enterprise.com'
        useCommonAlertSchema: true
      }
      {
        name: 'OpsTeam'
        emailAddress: 'ops-team@enterprise.com'
        useCommonAlertSchema: true
      }
    ]
    smsReceivers: [
      {
        name: 'OnCallSMS'
        countryCode: '44'
        phoneNumber: '7700900000'
      }
    ]
    webhookReceivers: [
      {
        name: 'TeamsWebhook'
        serviceUri: 'https://enterprise.webhook.office.com/webhookb2/...'
        useCommonAlertSchema: true
      }
      {
        name: 'PagerDuty'
        serviceUri: 'https://events.pagerduty.com/integration/{key}/enqueue'
        useCommonAlertSchema: true
      }
      {
        name: 'ServiceNow'
        serviceUri: 'https://enterprise.service-now.com/api/now/table/incident'
        useCommonAlertSchema: true
      }
    ]
    azureFunctionReceivers: [
      {
        name: 'AutoRemediation'
        functionAppResourceId: functionApp.id
        functionName: 'HandleAlert'
        httpTriggerUrl: 'https://func-remediation.azurewebsites.net/api/HandleAlert'
        useCommonAlertSchema: true
      }
    ]
  }
}

Microsoft Reference: Action groups
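
The Bicep resources in this article reference symbols (functionApp, logicApp, apim, logAnalyticsWorkspace and location) that are never declared. A minimal sketch of those declarations follows, assuming all of the resources already exist in the resource group being deployed to. The parameter names and API versions are illustrative assumptions, not part of the original templates.

```bicep
// Assumed parameters: these names are illustrative, not from the original templates
param location string = resourceGroup().location
param logicAppName string
param apimName string
param workspaceName string
param functionAppName string

// 'existing' references resolve resource IDs without redeploying the resources
resource logicApp 'Microsoft.Logic/workflows@2019-05-01' existing = {
  name: logicAppName
}

resource apim 'Microsoft.ApiManagement/service@2022-08-01' existing = {
  name: apimName
}

resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' existing = {
  name: workspaceName
}

resource functionApp 'Microsoft.Web/sites@2022-09-01' existing = {
  name: functionAppName
}
```

With these in place, expressions such as logicApp.id and apim.id in the alert rules resolve to the full resource IDs at deployment time.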

Logic Apps Alert Rules

Failed Workflow Runs

resource logicAppFailedRunsAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-la-failed-runs'
  location: 'global'
  properties: {
    description: 'Alert when Logic App workflow runs fail'
    severity: 1
    enabled: true
    scopes: [logicApp.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
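        {
          // Static threshold: fire on any failed run within the window
          criterionType: 'StaticThresholdCriterion'
          name: 'FailedRuns'
          metricName: 'RunsFailed'
          operator: 'GreaterThan'
          threshold: 0
          timeAggregation: 'Total'
        }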
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}

High Latency Workflows (Log Alert)

resource latencyLogAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-la-high-latency'
  location: location
  properties: {
    description: 'Alert when workflow duration exceeds threshold'
    severity: 2
    enabled: true
    scopes: [logAnalyticsWorkspace.id]
    evaluationFrequency: 'PT15M'
    windowSize: 'PT15M'
    criteria: {
      allOf: [
        {
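          query: '''
            AzureDiagnostics
            | where ResourceType == "WORKFLOWS"
            | where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
            | extend DurationMs = datetime_diff("millisecond", endTime_t, startTime_t)
            | where DurationMs > 30000
            | summarize SlowRuns = count() by Resource
          '''
          // Evaluate SlowRuns per workflow; 'Count' would count result rows (workflows), not slow runs
          metricMeasureColumn: 'SlowRuns'
          timeAggregation: 'Total'
          dimensions: [
            {
              name: 'Resource'
              operator: 'Include'
              values: ['*']
            }
          ]
          operator: 'GreaterThan'
          threshold: 5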
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [actionGroup.id]
    }
  }
}

Throttled Actions

resource throttlingAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-la-throttled'
  location: 'global'
  properties: {
    description: 'Alert when Logic App actions are being throttled'
    severity: 2
    enabled: true
    scopes: [logicApp.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          // Action-level throttling; RunThrottledEvents tracks throttled runs instead
          criterionType: 'StaticThresholdCriterion'
          name: 'ThrottledActions'
          metricName: 'ActionThrottledEvents'
          operator: 'GreaterThan'
          threshold: 10
          timeAggregation: 'Total'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}
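
Workflow Deleted (Activity Log Alert)

The Alert Types table lists activity log alerts for events such as a resource being deleted, but none of the rules in this section use one. The following is a minimal sketch, assuming the actionGroup resource from the Action Groups section and a deployment scoped to the resource group that holds the workflows. The rule name is hypothetical.

```bicep
// Sketch only: assumes the actionGroup resource from the Action Groups section
resource workflowDeletedAlert 'Microsoft.Insights/activityLogAlerts@2020-10-01' = {
  name: 'alert-la-workflow-deleted' // hypothetical name
  location: 'global'
  properties: {
    description: 'Alert when a Logic App workflow is deleted'
    enabled: true
    // Watch every resource in the current resource group
    scopes: [resourceGroup().id]
    condition: {
      // All conditions must match for the alert to fire
      allOf: [
        {
          field: 'category'
          equals: 'Administrative'
        }
        {
          field: 'operationName'
          equals: 'Microsoft.Logic/workflows/delete'
        }
        {
          field: 'status'
          equals: 'Succeeded'
        }
      ]
    }
    actions: {
      actionGroups: [
        {
          actionGroupId: actionGroup.id
        }
      ]
    }
  }
}
```

Activity log alerts have no severity or evaluation window: they fire once per matching event.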

API Management Alert Rules

High Error Rate

resource apimErrorAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-apim-errors'
  location: 'global'
  properties: {
    description: 'Alert when APIM failed requests exceed 50 in 5 minutes'
    severity: 1
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          // Absolute count of failed requests, not a percentage of traffic
          criterionType: 'StaticThresholdCriterion'
          name: 'FailedRequests'
          metricName: 'FailedRequests'
          operator: 'GreaterThan'
          threshold: 50
          timeAggregation: 'Total'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}
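
Error Rate as a Percentage (Log Alert)

The metric rule above thresholds on an absolute count of failed requests, so it cannot express a rate such as 5% of traffic. One way to alert on a true percentage is a log alert that computes the ratio in KQL. The sketch below assumes APIM gateway logs are sent to the shared Log Analytics workspace, treats only 5xx responses as errors, and uses a hypothetical rule name.

```bicep
// Sketch only: assumes APIM diagnostic logs flow to logAnalyticsWorkspace
// and that actionGroup and location are declared elsewhere in the template
resource apimErrorRateAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-apim-error-rate' // hypothetical name
  location: location
  properties: {
    description: 'Alert when more than 5% of APIM requests return 5xx'
    severity: 1
    enabled: true
    scopes: [logAnalyticsWorkspace.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      allOf: [
        {
          query: '''
            ApiManagementGatewayLogs
            | summarize Total = count(), Failed = countif(ResponseCode >= 500)
            | where Total > 0
            | extend ErrorRatePct = 100.0 * Failed / Total
          '''
          // Compare the computed percentage rather than the number of result rows
          metricMeasureColumn: 'ErrorRatePct'
          timeAggregation: 'Average'
          operator: 'GreaterThan'
          threshold: 5
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [actionGroup.id]
    }
  }
}
```

The Total > 0 filter avoids a division by zero during quiet periods. On low-traffic APIs a percentage can swing widely from a handful of requests, so a minimum request count may also be worth adding.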

Capacity Warning

resource capacityAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-apim-capacity'
  location: 'global'
  properties: {
    description: 'APIM gateway capacity exceeding 80%'
    severity: 2
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          // Average over the 15-minute window smooths out short bursts
          criterionType: 'StaticThresholdCriterion'
          name: 'HighCapacity'
          metricName: 'Capacity'
          operator: 'GreaterThan'
          threshold: 80
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}

Unauthorised Access Spike

resource authAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-apim-auth-failures'
  location: location
  properties: {
    description: 'Spike in unauthorised API requests'
    severity: 1
    scopes: [logAnalyticsWorkspace.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      allOf: [
        {
          query: '''
            ApiManagementGatewayLogs
            | where ResponseCode == 401 or ResponseCode == 403
            | summarize UnauthorisedCount = count() by bin(TimeGenerated, 5m)
            | where UnauthorisedCount > 100
          '''
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 0
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [actionGroup.id]
    }
  }
}

Azure Dashboards

Creating a Dashboard

resource dashboard 'Microsoft.Portal/dashboards@2020-09-01-preview' = {
  name: 'dashboard-integration-ops'
  location: location
  properties: {
    lenses: [
      {
        order: 0
        parts: [
          // Each part is a tile on the dashboard
          // Configured via portal or exported as JSON
        ]
      }
    ]
  }
  tags: {
    'hidden-title': 'Integration Platform Operations'
  }
}
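
The parts array above is left empty. As a minimal sketch of what a single tile looks like, the variant below adds one Markdown part as a title banner. The position values and Markdown content are illustrative assumptions; metric and log query tiles have much larger definitions and are usually easier to build in the portal and export.

```bicep
// Sketch only: a variant of the dashboard resource with one Markdown tile filled in
param location string = resourceGroup().location

resource dashboardWithTitle 'Microsoft.Portal/dashboards@2020-09-01-preview' = {
  name: 'dashboard-integration-ops'
  location: location
  tags: {
    'hidden-title': 'Integration Platform Operations'
  }
  properties: {
    lenses: [
      {
        order: 0
        parts: [
          {
            // Grid position and size, in dashboard tile units
            position: {
              x: 0
              y: 0
              colSpan: 6
              rowSpan: 1
            }
            metadata: {
              type: 'Extension/HubsExtension/PartType/MarkdownPart'
              inputs: []
              settings: {
                content: {
                  settings: {
                    content: '## Integration Platform Health'
                    title: ''
                    subtitle: ''
                  }
                }
              }
            }
          }
        ]
      }
    ]
  }
}
```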

Recommended Dashboard Layout

┌─────────────────────────────────────────────────────────┐
│               Integration Platform Health               │
├──────────────────────┬──────────────────────────────────┤
│  Logic App Runs      │  API Management Requests         │
│  (metric chart)      │  (metric chart)                  │
├──────────────────────┼──────────────────────────────────┤
│  Failed Runs         │  Error Rate                      │
│  (KPI tile)          │  (KPI tile)                      │
├──────────────────────┼──────────────────────────────────┤
│  Workflow Latency    │  APIM Latency (p95)              │
│  (line chart)        │  (line chart)                    │
├──────────────────────┼──────────────────────────────────┤
│  Top Errors          │  APIM Capacity                   │
│  (log query table)   │  (metric chart)                  │
├──────────────────────┼──────────────────────────────────┤
│  Active Alerts       │  Resource Health                 │
│  (alerts summary)    │  (resource health tile)          │
└──────────────────────┴──────────────────────────────────┘

Pinning Queries to Dashboards

1. Run your KQL query in Log Analytics
2. Click "Pin to dashboard" in the toolbar
3. Select the target dashboard
4. Configure tile size and refresh interval
5. Repeat for each query

Microsoft Reference: Azure dashboards

Azure Workbooks

Workbooks provide interactive, customisable reports that combine text, KQL queries, metrics, and parameters.

Microsoft Reference: Azure Monitor workbooks

Creating an Integration Platform Workbook

Step 1: Parameters

Add time range and resource selectors:

Parameter: TimeRange (type: Time range picker, default: Last 24 hours)
Parameter: Environment (type: Dropdown, values: dev, test, prod)
Parameter: LogicAppName (type: Resource picker, resource type: Microsoft.Logic/workflows)

Step 2: Overview Section

// Summary statistics
AzureDiagnostics
| where ResourceType == "WORKFLOWS"
| where TimeGenerated {TimeRange}
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize
    TotalRuns = count(),
    Succeeded = countif(status_s == "Succeeded"),
    Failed = countif(status_s == "Failed"),
    AvgDuration = avg(datetime_diff("millisecond", endTime_t, startTime_t))
| extend SuccessRate = round(100.0 * Succeeded / TotalRuns, 2)

Step 3: Failure Analysis

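// Failed runs by workflow
AzureDiagnostics
| where ResourceType == "WORKFLOWS"
| where TimeGenerated {TimeRange}
// Restrict to run-level events so failed actions and triggers are not counted as runs
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| where status_s == "Failed"
| summarize Failures = count() by Resource, error_code_s
| order by Failures desc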

Step 4: Trend Charts

// Success/failure trend
AzureDiagnostics
| where ResourceType == "WORKFLOWS"
| where TimeGenerated {TimeRange}
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize
    Succeeded = countif(status_s == "Succeeded"),
    Failed = countif(status_s == "Failed")
  by bin(TimeGenerated, {TimeRange:grain})
| render timechart

Workbook Templates

Common workbook sections for an integration platform:

Section	Content
Executive Summary	KPIs, success rates, total volumes
Logic Apps Health	Run counts, failures, duration trends
APIM Performance	Request volume, latency, error rates
Error Analysis	Top errors, error trends, affected workflows
Consumer Insights	API usage by subscription, top consumers
Cost Analysis	Execution counts, billable operations
SLA Compliance	Uptime percentages, breach incidents
Capacity Planning	Resource utilisation, scaling recommendations

Exporting Workbooks as ARM/Bicep

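# Dump the workbook resource definition as JSON (a starting point for a template, not a deployable ARM template)
# --name expects the workbook's resource name, which is a GUID, not its display name
az monitor app-insights workbook show \
  --resource-group rg-integration-prod \
  --name "<workbook-resource-guid>" \
  --output json > workbook-template.json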
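
To redeploy a workbook from source control, its content can be wrapped in a Microsoft.Insights/workbooks resource. The following is a minimal sketch: workbook-content.json is a hypothetical file containing the workbook JSON (for example, copied from the Gallery Template view of the workbook's Advanced Editor), and the workspaceId parameter is an assumed name.

```bicep
// Sketch only: workbook-content.json is a hypothetical file holding the workbook JSON
param location string = resourceGroup().location
param workspaceId string // resource ID of the Log Analytics workspace the workbook queries

resource workbook 'Microsoft.Insights/workbooks@2022-04-01' = {
  // Workbook resource names must be GUIDs; derive a stable one from the resource group
  name: guid(resourceGroup().id, 'integration-platform-health')
  location: location
  kind: 'shared'
  properties: {
    displayName: 'Integration Platform Health'
    category: 'workbook'
    sourceId: workspaceId
    // Embed the workbook definition at compile time
    serializedData: loadTextContent('workbook-content.json')
  }
}
```

Because the name is derived with guid(), redeploying the template updates the same workbook rather than creating a duplicate.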

Alert Processing Rules

Control alert notifications by suppressing or routing based on conditions:

resource alertProcessingRule 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'suppress-dev-alerts-outside-hours'
  location: 'global'
  properties: {
    scopes: ['/subscriptions/{sub}/resourceGroups/rg-integration-dev']
    conditions: [
      {
        field: 'Severity'
        operator: 'Equals'
        values: ['Sev3', 'Sev4']
      }
    ]
    actions: [
      {
        actionType: 'RemoveAllActionGroups'
      }
    ]
    schedule: {
      // Local time in the timeZone below; the schedule format takes no 'Z' or offset suffix
      effectiveFrom: '2026-01-01T00:00:00'
      timeZone: 'GMT Standard Time'
      recurrences: [
        {
          recurrenceType: 'Weekly'
          daysOfWeek: ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
          startTime: '18:00:00'
          endTime: '09:00:00'
        }
        {
          recurrenceType: 'Weekly'
          daysOfWeek: ['Saturday', 'Sunday']
        }
      ]
    }
  }
}

Microsoft Reference: Alert processing rules
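
The rule above covers suppression. Routing works the same way with the AddActionGroups action type. The sketch below adds an on-call action group to every Sev 0 and Sev 1 alert in the resource group being deployed to; the rule name and the onCallActionGroupId parameter are hypothetical.

```bicep
// Sketch only: onCallActionGroupId is a hypothetical parameter
param onCallActionGroupId string

resource routeCriticalAlerts 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'route-critical-to-oncall' // hypothetical name
  location: 'global'
  properties: {
    description: 'Add the on-call action group to Sev0 and Sev1 alerts'
    enabled: true
    // Applies to alerts fired on any resource in this resource group
    scopes: [resourceGroup().id]
    conditions: [
      {
        field: 'Severity'
        operator: 'Equals'
        values: ['Sev0', 'Sev1']
      }
    ]
    actions: [
      {
        actionType: 'AddActionGroups'
        actionGroupIds: [onCallActionGroupId]
      }
    ]
    // No schedule block: the rule is always active
  }
}
```

This keeps the on-call routing in one place instead of repeating the action group on every alert rule.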

Recommended Alert Strategy

Severity Levels

Severity	Description	Response	Example
Sev 0	Critical — service down	Immediate (< 15 min)	All APIs returning 503
Sev 1	High — significant impact	Urgent (< 30 min)	>20% error rate
Sev 2	Warning — degraded	Standard (< 2 hours)	High latency, capacity >80%
Sev 3	Informational — monitor	Review next business day	Unusual traffic patterns
Sev 4	Verbose — tracking	Weekly review	Cost threshold approaching

Alert Matrix

Resource	Metric	Sev 0	Sev 1	Sev 2
Logic Apps	Failed runs	>50% fail rate	Any failure	Latency >30s
APIM	Error rate	>20% 5xx	>5% errors	>2% errors
APIM	Capacity	>95%	>85%	>70%
APIM	Latency	p99 >10s	p95 >5s	p95 >2s
App Insights	Availability	<95%	<99%	<99.5%
Key Vault	Throttled	Any	-	-
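
A row of the matrix can be turned into alert rules with a Bicep loop. The sketch below does this for the APIM capacity row (Sev 0 above 95%, Sev 1 above 85%, Sev 2 above 70%), assuming the apim and actionGroup resources are declared elsewhere in the template. The rule names are generated and hypothetical.

```bicep
// Sketch only: assumes apim and actionGroup are declared elsewhere in the template
var capacityTiers = [
  {
    severity: 0
    threshold: 95
  }
  {
    severity: 1
    threshold: 85
  }
  {
    severity: 2
    threshold: 70
  }
]

// One metric alert per tier, sharing everything except severity and threshold
resource capacityAlerts 'Microsoft.Insights/metricAlerts@2018-03-01' = [for tier in capacityTiers: {
  name: 'alert-apim-capacity-sev${tier.severity}'
  location: 'global'
  properties: {
    description: 'APIM gateway capacity above ${tier.threshold}%'
    severity: tier.severity
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'HighCapacity'
          metricName: 'Capacity'
          operator: 'GreaterThan'
          threshold: tier.threshold
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroup.id
      }
    ]
  }
}]
```

Note that the tiers overlap: at 96% capacity all three rules fire. Routing each severity to a different action group, or suppressing lower severities with an alert processing rule, keeps that from producing duplicate notifications.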

Best Practices

Use severity levels consistently across all alert rules
Avoid alert fatigue — tune thresholds based on baselines, not guesses
Create separate action groups for different severity levels
Suppress non-critical alerts outside business hours
Use workbooks for interactive investigation, dashboards for at-a-glance monitoring
Review and tune alerts quarterly — adjust thresholds as baselines change
Document alert runbooks — what to do when each alert fires
Test alert delivery regularly to ensure notifications reach the right people
Use auto-remediation for known, automatable issues
Centralise monitoring in a shared Log Analytics workspace for cross-service queries
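
The recommendation to create separate action groups per severity level can be sketched with a loop. The group names, short names and email addresses below are hypothetical placeholders.

```bicep
// Sketch only: names and email addresses are hypothetical placeholders
var severityGroups = [
  {
    name: 'ag-integration-sev0'
    shortName: 'IntSev0'
    email: 'oncall@enterprise.com'
  }
  {
    name: 'ag-integration-sev1'
    shortName: 'IntSev1'
    email: 'ops-lead@enterprise.com'
  }
  {
    name: 'ag-integration-sev2'
    shortName: 'IntSev2'
    email: 'ops-team@enterprise.com'
  }
]

// One action group per severity; groupShortName is limited to 12 characters
resource severityActionGroups 'Microsoft.Insights/actionGroups@2023-01-01' = [for group in severityGroups: {
  name: group.name
  location: 'global'
  properties: {
    groupShortName: group.shortName
    enabled: true
    emailReceivers: [
      {
        name: '${group.shortName}Email'
        emailAddress: group.email
        useCommonAlertSchema: true
      }
    ]
  }
}]
```

An alert rule can then reference the matching group by index, for example severityActionGroups[0].id for Sev 0 rules.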