Alerts, Dashboards and Workbooks
Azure Monitor Alerts
Alerts proactively notify you when your monitoring data meets conditions that you define. They are essential for maintaining the health of your integration platform.
Microsoft Reference: Azure Monitor alerts overview
Alert Types
| Type | Description | Use Case |
|---|---|---|
| Metric alerts | Threshold on metric values | CPU > 80%, capacity > 70% |
| Log alerts | KQL query results | Failed Logic App runs, error patterns |
| Activity log alerts | Azure resource events | Resource deleted, deployment failed |
| Smart detection | AI-powered anomaly detection | Unusual failure rates, latency spikes |
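The sections below show metric and log alerts only. As a minimal sketch of the activity log type from the table, the following rule would fire when a Logic App workflow in the current resource group is deleted. The rule name and the choice of resource-group scope are assumptions for illustration; `actionGroup` refers to the action group defined in the next section.

```bicep
// Sketch only: activity log alert for the "Resource deleted" use case.
resource workflowDeletedAlert 'Microsoft.Insights/activityLogAlerts@2020-10-01' = {
  name: 'alert-la-workflow-deleted' // hypothetical name
  location: 'global'
  properties: {
    description: 'Alert when a Logic App workflow is deleted'
    enabled: true
    // Assumption: watch every workflow in the deployment's resource group
    scopes: [resourceGroup().id]
    condition: {
      allOf: [
        {
          field: 'category'
          equals: 'Administrative'
        }
        {
          field: 'operationName'
          equals: 'Microsoft.Logic/workflows/delete'
        }
      ]
    }
    actions: {
      actionGroups: [
        { actionGroupId: actionGroup.id }
      ]
    }
  }
}
```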
Action Groups
Action groups define who to notify and how:
resource actionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ag-integration-ops'
  location: 'global'
  properties: {
    groupShortName: 'IntOps'
    enabled: true
    emailReceivers: [
      {
        name: 'OpsTeamLead'
        emailAddress: 'ops-lead@enterprise.com'
        useCommonAlertSchema: true
      }
      {
        name: 'OpsTeam'
        emailAddress: 'ops-team@enterprise.com'
        useCommonAlertSchema: true
      }
    ]
    smsReceivers: [
      {
        name: 'OnCallSMS'
        countryCode: '44'
        phoneNumber: '7700900000'
      }
    ]
    webhookReceivers: [
      {
        name: 'TeamsWebhook'
        serviceUri: 'https://enterprise.webhook.office.com/webhookb2/...'
        useCommonAlertSchema: true
      }
      {
        name: 'PagerDuty'
        serviceUri: 'https://events.pagerduty.com/integration/{key}/enqueue'
        useCommonAlertSchema: true
      }
      {
        name: 'ServiceNow'
        serviceUri: 'https://enterprise.service-now.com/api/now/table/incident'
        useCommonAlertSchema: true
      }
    ]
    azureFunctionReceivers: [
      {
        name: 'AutoRemediation'
        functionAppResourceId: functionApp.id
        functionName: 'HandleAlert'
        httpTriggerUrl: 'https://func-remediation.azurewebsites.net/api/HandleAlert'
        useCommonAlertSchema: true
      }
    ]
  }
}
Microsoft Reference: Action groups
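The Bicep snippets on this page reference `functionApp`, `logicApp`, `apim`, `logAnalyticsWorkspace`, and `location` without declaring them. One way to supply them, assuming the resources already exist in the same resource group as the deployment, is with `existing` declarations. Every resource name below except `func-remediation` (taken from the function URL above) is a placeholder, and the API versions are simply ones known to be valid, not a recommendation.

```bicep
// Sketch only: declarations assumed by the other snippets on this page.
param location string = resourceGroup().location

resource logicApp 'Microsoft.Logic/workflows@2019-05-01' existing = {
  name: 'la-example-workflow' // placeholder name
}

resource apim 'Microsoft.ApiManagement/service@2022-08-01' existing = {
  name: 'apim-example' // placeholder name
}

resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' existing = {
  name: 'log-example' // placeholder name
}

resource functionApp 'Microsoft.Web/sites@2022-09-01' existing = {
  name: 'func-remediation'
}
```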
Logic Apps Alert Rules
Failed Workflow Runs
resource logicAppFailedRunsAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-la-failed-runs'
  location: 'global'
  properties: {
    description: 'Alert when Logic App workflow runs fail'
    severity: 1
    enabled: true
    scopes: [logicApp.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'FailedRuns'
          metricName: 'RunsFailed'
          operator: 'GreaterThan'
          threshold: 0 // any failed run in the window fires the alert
          timeAggregation: 'Total'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}
High Latency Workflows (Log Alert)
resource latencyLogAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-la-high-latency'
  location: location
  properties: {
    description: 'Alert when more than 5 runs of a workflow exceed 30 seconds within the window'
    severity: 2
    enabled: true
    scopes: [logAnalyticsWorkspace.id]
    evaluationFrequency: 'PT15M'
    windowSize: 'PT15M'
    criteria: {
      allOf: [
        {
          // The query returns one row per slow run. The rule itself counts the rows,
          // so the query must not pre-aggregate with summarize.
          query: '''
            AzureDiagnostics
            | where ResourceType == "WORKFLOWS"
            | where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
            | extend DurationMs = datetime_diff("millisecond", endTime_t, startTime_t)
            | where DurationMs > 30000
          '''
          timeAggregation: 'Count'
          // Evaluate the threshold separately for each workflow
          dimensions: [
            {
              name: 'Resource'
              operator: 'Include'
              values: ['*']
            }
          ]
          operator: 'GreaterThan'
          threshold: 5
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [actionGroup.id]
    }
  }
}
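A fixed 30-second limit may not suit every workflow. As a hedged alternative, a metric alert with a dynamic threshold lets Azure Monitor learn a baseline instead. The sketch below assumes the Logic App exposes the `RunLatency` platform metric; the rule name, sensitivity, and failing-period values are illustrative choices, not tuned recommendations.

```bicep
// Sketch only: dynamic-threshold alternative to the fixed 30-second latency limit.
resource latencyDynamicAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-la-latency-dynamic' // hypothetical name
  location: 'global'
  properties: {
    description: 'Alert when run latency deviates above its learned baseline'
    severity: 2
    enabled: true
    scopes: [logicApp.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      // Dynamic criteria use the multi-resource criteria type, even for a single scope
      'odata.type': 'Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'DynamicThresholdCriterion'
          name: 'RunLatencyAnomaly'
          metricName: 'RunLatency' // assumption: metric is available for this Logic App
          operator: 'GreaterThan'
          alertSensitivity: 'Medium'
          failingPeriods: {
            numberOfEvaluationPeriods: 4
            minFailingPeriodsToAlert: 3
          }
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}
```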
Throttled Actions
resource throttlingAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-la-throttled'
  location: 'global'
  properties: {
    description: 'Alert when Logic App actions are being throttled'
    severity: 2
    enabled: true
    scopes: [logicApp.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'ThrottledActions'
          // Action-level throttling; use 'RunThrottledEvents' to track throttled runs instead
          metricName: 'ActionThrottledEvents'
          operator: 'GreaterThan'
          threshold: 10
          timeAggregation: 'Total'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}
API Management Alert Rules
High Error Rate
resource apimErrorAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-apim-errors'
  location: 'global'
  properties: {
    // The threshold is an absolute count, not a percentage of total requests
    description: 'Alert when more than 50 APIM requests fail within 5 minutes'
    severity: 1
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'FailedRequests'
          metricName: 'FailedRequests'
          operator: 'GreaterThan'
          threshold: 50
          timeAggregation: 'Total'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}
Capacity Warning
resource capacityAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-apim-capacity'
  location: 'global'
  properties: {
    description: 'APIM gateway capacity exceeding 80%'
    severity: 2
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'HighCapacity'
          metricName: 'Capacity'
          operator: 'GreaterThan'
          threshold: 80
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}
Unauthorised Access Spike
resource authAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-apim-auth-failures'
  location: location
  properties: {
    description: 'Spike in unauthorised API requests'
    severity: 1
    enabled: true
    scopes: [logAnalyticsWorkspace.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      allOf: [
        {
          query: '''
            ApiManagementGatewayLogs
            | where ResponseCode == 401 or ResponseCode == 403
            | summarize UnauthorisedCount = count() by bin(TimeGenerated, 5m)
            | where UnauthorisedCount > 100
          '''
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 0
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [actionGroup.id]
    }
  }
}
Azure Dashboards
Creating a Dashboard
resource dashboard 'Microsoft.Portal/dashboards@2020-09-01-preview' = {
  name: 'dashboard-integration-ops'
  location: location
  properties: {
    lenses: [
      {
        order: 0
        parts: [
          // Each part is a tile on the dashboard
          // Configured via portal or exported as JSON
        ]
      }
    ]
  }
  tags: {
    'hidden-title': 'Integration Platform Operations'
  }
}
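The `parts` array above is left empty. As a rough sketch of what a single tile looks like, the variable below describes a Markdown header tile. The variable name, position, and text are illustrative assumptions. The metadata shape follows the portal's Markdown part as it appears in exported dashboard JSON, so verify it against an export from your own dashboard before relying on it.

```bicep
// Sketch only: one Markdown tile that could be placed in the dashboard's parts array.
var headerTile = {
  position: {
    x: 0
    y: 0
    colSpan: 12
    rowSpan: 1
  }
  metadata: {
    inputs: []
    type: 'Extension/HubsExtension/PartType/MarkdownPart'
    settings: {
      content: {
        settings: {
          content: '## Integration Platform Health'
          title: ''
          subtitle: ''
        }
      }
    }
  }
}
```

To use it, list `headerTile` as an element of `parts`. Metric and log query tiles carry much larger metadata, so exporting them from the portal is usually easier than writing them by hand.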
Recommended Dashboard Layout
┌─────────────────────────────────────────────────────────┐
│               Integration Platform Health               │
├──────────────────────┬──────────────────────────────────┤
│ Logic App Runs       │ API Management Requests          │
│ (metric chart)       │ (metric chart)                   │
├──────────────────────┼──────────────────────────────────┤
│ Failed Runs          │ Error Rate                       │
│ (KPI tile)           │ (KPI tile)                       │
├──────────────────────┼──────────────────────────────────┤
│ Workflow Latency     │ APIM Latency (p95)               │
│ (line chart)         │ (line chart)                     │
├──────────────────────┼──────────────────────────────────┤
│ Top Errors           │ APIM Capacity                    │
│ (log query table)    │ (metric chart)                   │
├──────────────────────┼──────────────────────────────────┤
│ Active Alerts        │ Resource Health                  │
│ (alerts summary)     │ (resource health tile)           │
└──────────────────────┴──────────────────────────────────┘
Pinning Queries to Dashboards
- Run your KQL query in Log Analytics
- Click Pin to dashboard in the toolbar
- Select the target dashboard
- Configure tile size and refresh interval
- Repeat for each query
Microsoft Reference: Azure dashboards
Azure Workbooks
Workbooks provide interactive, customisable reports that combine text, KQL queries, metrics, and parameters.
Microsoft Reference: Azure Monitor workbooks
Creating an Integration Platform Workbook
Step 1: Parameters
Add time range and resource selectors:
Parameter: TimeRange (type: Time range picker, default: Last 24 hours)
Parameter: Environment (type: Dropdown, values: dev, test, prod)
Parameter: LogicAppName (type: Resource picker, resource type: Microsoft.Logic/workflows)
Step 2: Overview Section
// Summary statistics
AzureDiagnostics
| where ResourceType == "WORKFLOWS"
| where TimeGenerated {TimeRange}
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize
    TotalRuns = count(),
    Succeeded = countif(status_s == "Succeeded"),
    Failed = countif(status_s == "Failed"),
    AvgDuration = avg(datetime_diff("millisecond", endTime_t, startTime_t))
| extend SuccessRate = round(100.0 * Succeeded / TotalRuns, 2)
Step 3: Failure Analysis
// Failed runs by workflow
AzureDiagnostics
| where ResourceType == "WORKFLOWS"
| where TimeGenerated {TimeRange}
| where status_s == "Failed"
| summarize Failures = count() by Resource, error_code_s
| order by Failures desc
Step 4: Trend Charts
// Success/failure trend
AzureDiagnostics
| where ResourceType == "WORKFLOWS"
| where TimeGenerated {TimeRange}
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| summarize
    Succeeded = countif(status_s == "Succeeded"),
    Failed = countif(status_s == "Failed")
    by bin(TimeGenerated, {TimeRange:grain})
| render timechart
Workbook Templates
Common workbook sections for an integration platform:
| Section | Content |
|---|---|
| Executive Summary | KPIs, success rates, total volumes |
| Logic Apps Health | Run counts, failures, duration trends |
| APIM Performance | Request volume, latency, error rates |
| Error Analysis | Top errors, error trends, affected workflows |
| Consumer Insights | API usage by subscription, top consumers |
| Cost Analysis | Execution counts, billable operations |
| SLA Compliance | Uptime percentages, breach incidents |
| Capacity Planning | Resource utilisation, scaling recommendations |
Exporting Workbooks as ARM/Bicep
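# Export the workbook resource definition as JSON.
# This is the resource itself, not a complete ARM template.
# --name expects the workbook's resource name (typically a GUID), not its display name.
az monitor app-insights workbook show \
  --resource-group rg-integration-prod \
  --name "<workbook-resource-name>" \
  --output json > workbook-template.json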
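To redeploy a workbook from source control, its content can be wrapped in a `Microsoft.Insights/workbooks` resource. The following is a minimal sketch, assuming the workbook's serialized content (the JSON shown in the workbook's advanced editor) has been saved next to the Bicep file as `workbook-content.json`. That file name and the GUID seed are hypothetical.

```bicep
// Sketch only: deploy a shared workbook from serialized content kept in source control.
resource workbook 'Microsoft.Insights/workbooks@2022-04-01' = {
  // Workbook resource names must be GUIDs; guid() derives a stable one from the seed values
  name: guid(resourceGroup().id, 'integration-platform-health')
  location: location
  kind: 'shared'
  properties: {
    displayName: 'Integration Platform Health'
    category: 'workbook'
    sourceId: logAnalyticsWorkspace.id
    // Assumption: this file holds the workbook content, not the full resource JSON
    serializedData: loadTextContent('workbook-content.json')
  }
}
```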
Alert Processing Rules
Control alert notifications by suppressing or routing based on conditions:
resource alertProcessingRule 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'suppress-dev-alerts-outside-hours'
  location: 'global'
  properties: {
    scopes: ['/subscriptions/{sub}/resourceGroups/rg-integration-dev']
    conditions: [
      {
        field: 'Severity'
        operator: 'Equals'
        values: ['Sev3', 'Sev4']
      }
    ]
    actions: [
      {
        actionType: 'RemoveAllActionGroups'
      }
    ]
    schedule: {
      // No timezone suffix here: the time is interpreted in the timeZone below
      effectiveFrom: '2026-01-01T00:00:00'
      timeZone: 'GMT Standard Time'
      recurrences: [
        {
          recurrenceType: 'Weekly'
          daysOfWeek: ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
          startTime: '18:00:00'
          endTime: '09:00:00'
        }
        {
          // No start or end time: applies for the whole day
          recurrenceType: 'Weekly'
          daysOfWeek: ['Saturday', 'Sunday']
        }
      ]
    }
  }
}
Microsoft Reference: Alert processing rules
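The rule above covers suppression. For the routing case, a processing rule can attach an action group to matching alerts instead of removing one. A minimal sketch, assuming the rule should apply to every Sev 0 and Sev 1 alert in the deployment's resource group at all times (no schedule); the rule name is hypothetical.

```bicep
// Sketch only: route high-severity alerts to the ops action group.
resource routeCriticalAlerts 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'route-critical-alerts-to-ops' // hypothetical name
  location: 'global'
  properties: {
    description: 'Add the ops action group to all Sev 0 and Sev 1 alerts'
    enabled: true
    scopes: [resourceGroup().id]
    conditions: [
      {
        field: 'Severity'
        operator: 'Equals'
        values: ['Sev0', 'Sev1']
      }
    ]
    actions: [
      {
        actionType: 'AddActionGroups'
        actionGroupIds: [actionGroup.id]
      }
    ]
  }
}
```

Routing this way keeps notification targets out of the individual alert rules, so changing who is paged does not require touching each rule.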
Recommended Alert Strategy
Severity Levels
| Severity | Description | Response | Example |
|---|---|---|---|
| Sev 0 | Critical — service down | Immediate (< 15 min) | All APIs returning 503 |
| Sev 1 | High — significant impact | Urgent (< 30 min) | >5% error rate |
| Sev 2 | Warning — degraded | Standard (< 2 hours) | High latency, capacity >80% |
| Sev 3 | Informational — monitor | Review next business day | Unusual traffic patterns |
| Sev 4 | Verbose — tracking | Weekly review | Cost threshold approaching |
Alert Matrix
| Resource | Metric | Sev 0 | Sev 1 | Sev 2 |
|---|---|---|---|---|
| Logic Apps | Failed runs | >50% fail rate | Any failure | Latency >30s |
| APIM | Error rate | >20% 5xx | >5% errors | >2% errors |
| APIM | Capacity | >95% | >85% | >70% |
| APIM | Latency | p99 >10s | p95 >5s | p95 >2s |
| App Insights | Availability | <95% | <99% | <99.5% |
| Key Vault | Throttled | Any | - | - |
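A matrix row maps naturally onto a Bicep loop. As an illustration, the sketch below generates one metric alert per severity tier for the APIM capacity row. The variable and resource names are assumptions; the thresholds come from the matrix, and the evaluation settings mirror the capacity alert shown earlier.

```bicep
// Sketch only: one capacity alert per severity tier from the matrix.
var capacityTiers = [
  {
    severity: 0
    threshold: 95
  }
  {
    severity: 1
    threshold: 85
  }
  {
    severity: 2
    threshold: 70
  }
]

resource capacityTierAlerts 'Microsoft.Insights/metricAlerts@2018-03-01' = [for tier in capacityTiers: {
  name: 'alert-apim-capacity-sev${tier.severity}'
  location: 'global'
  properties: {
    description: 'APIM gateway capacity above ${tier.threshold}%'
    severity: tier.severity
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'HighCapacity'
          metricName: 'Capacity'
          operator: 'GreaterThan'
          threshold: tier.threshold
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      { actionGroupId: actionGroup.id }
    ]
  }
}]
```

Note that the tiers overlap: at 96% capacity all three rules fire, so either accept the lower-severity alerts as context or suppress them with an alert processing rule.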
Best Practices
- Use severity levels consistently across all alert rules
- Avoid alert fatigue — tune thresholds based on baselines, not guesses
- Create separate action groups for different severity levels
- Suppress non-critical alerts outside business hours
- Use workbooks for interactive investigation, dashboards for at-a-glance monitoring
- Review and tune alerts quarterly — adjust thresholds as baselines change
- Document alert runbooks — what to do when each alert fires
- Test alert delivery regularly to ensure notifications reach the right people
- Use auto-remediation for known, automatable issues
- Centralise monitoring in a shared Log Analytics workspace for cross-service queries
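The practice of keeping separate action groups per severity level can also be expressed as a loop. This is a sketch under stated assumptions: the group names, short names, and the `oncall@enterprise.com` address are hypothetical, and each group has only a single email receiver to keep the example short.

```bicep
// Sketch only: one action group per severity band.
var severityGroups = [
  {
    name: 'ag-integration-critical' // hypothetical: Sev 0 and Sev 1
    shortName: 'IntCrit' // short names are limited to 12 characters
    email: 'oncall@enterprise.com' // hypothetical address
  }
  {
    name: 'ag-integration-warning' // hypothetical: Sev 2 and below
    shortName: 'IntWarn'
    email: 'ops-team@enterprise.com'
  }
]

resource severityActionGroups 'Microsoft.Insights/actionGroups@2023-01-01' = [for group in severityGroups: {
  name: group.name
  location: 'global'
  properties: {
    groupShortName: group.shortName
    enabled: true
    emailReceivers: [
      {
        name: '${group.shortName}Email'
        emailAddress: group.email
        useCommonAlertSchema: true
      }
    ]
  }
}]
```

An alert rule can then reference a specific group by index, for example `severityActionGroups[0].id` for the critical group.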