Monitoring APIM with Application Insights
Monitoring Overview
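Monitoring Azure API Management (APIM) lets you track API performance, diagnose failures, and verify SLA compliance. This guide covers Application Insights integration, Azure Monitor metrics and alerts, Log Analytics queries, diagnostic settings, and distributed tracing.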
Microsoft Reference: Monitor API Management
Application Insights Integration
Setting Up the Logger
resource appInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: 'appi-apim-${environment}'
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: logAnalyticsWorkspace.id
    RetentionInDays: environment == 'prod' ? 90 : 30
  }
}

resource apimLogger 'Microsoft.ApiManagement/service/loggers@2023-05-01-preview' = {
  parent: apim
  name: 'appinsights-logger'
  properties: {
    loggerType: 'applicationInsights'
    resourceId: appInsights.id
    credentials: {
      connectionString: appInsights.properties.ConnectionString
    }
  }
}
API-Level Diagnostics
Enable detailed logging per API:
resource apiDiagnostic 'Microsoft.ApiManagement/service/apis/diagnostics@2023-05-01-preview' = {
  parent: ordersApi
  name: 'applicationinsights'
  properties: {
    loggerId: apimLogger.id
    alwaysLog: 'allErrors'
    sampling: {
      samplingType: 'fixed'
      percentage: environment == 'prod' ? 25 : 100
    }
    verbosity: environment == 'prod' ? 'information' : 'verbose'
    httpCorrelationProtocol: 'W3C'
    logClientIp: true
    frontend: {
      request: {
        headers: ['X-Correlation-Id', 'X-Forwarded-For']
        body: { bytes: 1024 }
      }
      response: {
        headers: ['Content-Type', 'X-Request-Id']
        body: { bytes: 1024 }
      }
    }
    backend: {
      request: {
        headers: ['Host']
        body: { bytes: 1024 }
      }
      response: {
        headers: ['Content-Type']
        body: { bytes: 1024 }
      }
    }
  }
}
Global Diagnostics
Apply diagnostics to all APIs:
resource globalDiagnostic 'Microsoft.ApiManagement/service/diagnostics@2023-05-01-preview' = {
  parent: apim
  name: 'applicationinsights'
  properties: {
    loggerId: apimLogger.id
    alwaysLog: 'allErrors'
    sampling: {
      samplingType: 'fixed'
      percentage: 10
    }
    httpCorrelationProtocol: 'W3C'
  }
}
Sampling Strategies
| Environment | Sampling Rate | Rationale |
|---|---|---|
| Development | 100% | Full visibility for debugging |
| Testing | 100% | Capture all test scenarios |
| Staging | 50% | Balance between visibility and cost |
| Production | 5–25% | Cost-effective monitoring at scale |
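The rates in the table can be kept in one place rather than spread across inline ternaries. The following is a minimal sketch, not a definitive implementation: the `environment` parameter values (`dev`, `test`, `staging`, `prod`) and the variable name `samplingByEnvironment` are assumptions, since the templates in this document only test for `'prod'`.

```bicep
// Assumed environment names; only 'prod' appears in the templates above.
@allowed(['dev', 'test', 'staging', 'prod'])
param environment string

// Sampling percentage per environment, taken from the table above.
// Production uses 25, the upper end of the 5 to 25% range; lower it as traffic grows.
var samplingByEnvironment = {
  dev: 100
  test: 100
  staging: 50
  prod: 25
}

// Exposed as an output so the selected value can be checked after deployment.
output samplingPercentage int = samplingByEnvironment[environment]
```

In a diagnostic resource, `percentage: samplingByEnvironment[environment]` would then replace an expression such as `environment == 'prod' ? 25 : 100`.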
Microsoft Reference: Application Insights integration
Azure Monitor Metrics
Key APIM Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Requests | Total API call count | Baseline deviation |
| Failed Requests | 4xx and 5xx responses | > 5% of total |
| Unauthorised Requests | 401 responses | > 10 per minute |
| Overall Gateway Requests Duration | End-to-end latency | p95 > 2 seconds |
| Backend Request Duration | Backend response time | p95 > 1 second |
| Capacity | Gateway utilisation (%) | > 80% |
| Event Hub Events (Dropped) | Lost diagnostic events | > 0 |
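As an illustration of turning one row of this table into an alert, the sketch below covers Backend Request Duration. It rests on several assumptions: the resource names `apim-example` and `ag-ops-team` are placeholders for resources that already exist, and the alert uses the `Average` aggregation because metric alerts do not offer percentile aggregations, so it only approximates the p95 threshold in the table.

```bicep
// Placeholder names for existing resources; replace with your own.
resource apim 'Microsoft.ApiManagement/service@2023-05-01-preview' existing = {
  name: 'apim-example'
}

resource opsActionGroup 'Microsoft.Insights/actionGroups@2023-01-01' existing = {
  name: 'ag-ops-team'
}

resource backendLatencyAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-apim-backend-latency'
  location: 'global'
  properties: {
    severity: 2
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'SlowBackend'
          metricName: 'BackendDuration'
          operator: 'GreaterThan'
          threshold: 1000 // milliseconds, i.e. the 1-second threshold from the table
          timeAggregation: 'Average' // approximation; p95 needs a log-based alert
        }
      ]
    }
    actions: [
      { actionGroupId: opsActionGroup.id }
    ]
  }
}
```

A true p95 threshold would need a log query alert over `ApiManagementGatewayLogs` instead of a metric alert.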
Capacity Monitoring
Capacity is the primary signal for APIM scaling decisions. The alert below fires when average capacity stays above 80% over a 15-minute window:
resource capacityAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-apim-capacity'
  location: 'global'
  properties: {
    severity: 2
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'HighCapacity'
          metricName: 'Capacity'
          operator: 'GreaterThan'
          threshold: 80 // percent
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      { actionGroupId: opsActionGroup.id }
    ]
  }
}
Microsoft Reference: APIM capacity
Log Analytics Queries (KQL)
Request Volume and Latency
// API request volume over time (15-minute intervals)
ApiManagementGatewayLogs
| where TimeGenerated > ago(24h)
| summarize RequestCount = count(),
            AvgDuration = avg(TotalTime),
            P95Duration = percentile(TotalTime, 95),
            P99Duration = percentile(TotalTime, 99)
    by bin(TimeGenerated, 15m)
| render timechart
Error Analysis
// Top errors by API and operation
ApiManagementGatewayLogs
| where TimeGenerated > ago(24h)
| where ResponseCode >= 400
| summarize ErrorCount = count() by ApiId, OperationId, ResponseCode
| order by ErrorCount desc
| take 20
Slow Requests
// Requests slower than 2 seconds (TotalTime is in milliseconds)
ApiManagementGatewayLogs
| where TimeGenerated > ago(1h)
| where TotalTime > 2000
| project TimeGenerated, ApiId, OperationId, TotalTime,
          BackendTime, ClientTime,
          ResponseCode, CallerIpAddress
| order by TotalTime desc
Consumer Analytics
// Top API consumers by APIM subscription
// (ApimSubscriptionId is the APIM subscription, not the Azure subscription)
ApiManagementGatewayLogs
| where TimeGenerated > ago(7d)
| summarize CallCount = count(),
            AvgLatency = avg(TotalTime),
            ErrorRate = countif(ResponseCode >= 400) * 100.0 / count()
    by ApimSubscriptionId
| order by CallCount desc
| take 20
Backend Health
// Backend response time by backend URL
ApiManagementGatewayLogs
| where TimeGenerated > ago(1h)
| where isnotempty(BackendUrl)
| summarize AvgBackendTime = avg(BackendTime),
            P95BackendTime = percentile(BackendTime, 95),
            ErrorRate = countif(ResponseCode >= 500) * 100.0 / count(),
            TotalRequests = count()
    by BackendUrl
| order by AvgBackendTime desc
Policy Execution Analysis
// Track custom trace messages from policies
// (output of the trace policy is recorded in the TraceRecords column)
ApiManagementGatewayLogs
| where TimeGenerated > ago(1h)
| where isnotempty(TraceRecords)
| project TimeGenerated, ApiId, TraceRecords, TotalTime
Diagnostic Settings
Enable Diagnostic Logging
resource apimDiagnostics 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'apim-diagnostics'
  scope: apim
  properties: {
    workspaceId: logAnalyticsWorkspace.id
    // 'Dedicated' routes logs to resource-specific tables such as
    // ApiManagementGatewayLogs instead of the shared AzureDiagnostics table.
    logAnalyticsDestinationType: 'Dedicated'
    // Retention is configured on the Log Analytics workspace. The per-category
    // retentionPolicy property is deprecated and applied only to storage accounts.
    logs: [
      {
        category: 'GatewayLogs'
        enabled: true
      }
      {
        category: 'WebSocketConnectionLogs'
        enabled: true
      }
    ]
    metrics: [
      {
        category: 'AllMetrics'
        enabled: true
      }
    ]
  }
}
Log Categories
| Category | Description | Retention |
|---|---|---|
| GatewayLogs | API request/response details | 90 days |
| WebSocketConnectionLogs | WebSocket connection events | 30 days |
| AllMetrics | Performance and usage metrics | 30 days |
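The templates in this document reference `logAnalyticsWorkspace` without defining it. For data sent to a Log Analytics workspace, how long it is kept is governed by the workspace's own retention setting. The following is a minimal sketch of such a workspace; the `log-apim-` name prefix, the API version, and the 90/30-day split by environment are assumptions chosen to mirror the Application Insights resource shown earlier.

```bicep
// Assumed parameters; the names match those the other templates already reference.
param environment string
param location string = resourceGroup().location

resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
  name: 'log-apim-${environment}' // assumed naming convention
  location: location
  properties: {
    sku: {
      name: 'PerGB2018'
    }
    // Default retention for tables in this workspace, in days.
    retentionInDays: environment == 'prod' ? 90 : 30
  }
}
```

If individual categories need different retention periods, that would have to be set per table on the workspace; this sketch applies a single default.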
Alerts
Failed Request Alert
resource failedRequestAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-apim-failed-requests'
  location: 'global'
  properties: {
    severity: 1
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'HighErrorRate'
          metricName: 'FailedRequests'
          operator: 'GreaterThan'
          // Absolute count: more than 50 failed requests in the 5-minute window
          threshold: 50
          timeAggregation: 'Total'
        }
      ]
    }
    actions: [
      { actionGroupId: opsActionGroup.id }
    ]
  }
}
Latency Alert
resource latencyAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-apim-latency'
  location: 'global'
  properties: {
    severity: 2
    enabled: true
    scopes: [apim.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'HighLatency'
          metricName: 'Duration'
          operator: 'GreaterThan'
          threshold: 3000 // milliseconds
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      { actionGroupId: opsActionGroup.id }
    ]
  }
}
Action Groups
resource opsActionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ag-ops-team'
  location: 'global'
  properties: {
    groupShortName: 'OpsTeam'
    enabled: true
    emailReceivers: [
      {
        name: 'OpsTeamEmail'
        emailAddress: 'ops-team@enterprise.com'
        useCommonAlertSchema: true
      }
    ]
    azureAppPushReceivers: [
      {
        name: 'OnCallEngineer'
        emailAddress: 'oncall@enterprise.com'
      }
    ]
    webhookReceivers: [
      {
        name: 'PagerDuty'
        serviceUri: 'https://events.pagerduty.com/integration/{key}/enqueue'
        useCommonAlertSchema: true
      }
    ]
  }
}
Distributed Tracing
W3C Trace Context
APIM supports W3C Trace Context for end-to-end distributed tracing:
Client → APIM (traceparent header) → Backend → Database
          ↓                             ↓         ↓
     Application Insights (correlated traces across all services)
Correlation in Policies
<inbound>
    <!-- Set correlation ID from incoming header or generate new -->
    <set-header name="X-Correlation-Id" exists-action="skip">
        <value>@(context.RequestId.ToString())</value>
    </set-header>
    <!-- Pass to backend -->
    <set-header name="traceparent" exists-action="skip">
        <value>@{
            var traceId = context.RequestId.ToString("N");
            var spanId = Guid.NewGuid().ToString("N").Substring(0, 16);
            return $"00-{traceId}-{spanId}-01";
        }</value>
    </set-header>
</inbound>
Viewing Traces
In Application Insights:
- Open Transaction Search
- Filter by time range and operation
- Click a request to see the end-to-end trace
- View the Application Map for service dependencies
Microsoft Reference: Distributed tracing with APIM
Azure Monitor Workbooks
Create custom dashboards for API monitoring:
API Health Dashboard
Key sections to include:
- Request Volume — Time series of total requests
- Error Rate — Percentage of 4xx/5xx responses
- Latency Distribution — p50, p95, p99 response times
- Top Errors — Most common error responses
- Capacity — Gateway utilisation
- Top Consumers — Most active subscriptions
- Backend Health — Backend response times and error rates
- Geographic Distribution — Request origins
Microsoft Reference: Azure Monitor workbooks
Policy-Based Logging
Custom Trace Messages
<trace source="custom-trace" severity="information">
    <message>@{
        return String.Format("Processing order {0} for customer {1}",
            context.Request.MatchedParameters["orderId"],
            context.Request.Headers.GetValueOrDefault("X-Customer-Id", "unknown"));
    }</message>
    <metadata name="orderId" value="@(context.Request.MatchedParameters["orderId"])" />
    <metadata name="customerRegion" value="@(context.Request.Headers.GetValueOrDefault("X-Region", "unknown"))" />
</trace>
Emit Custom Metrics
<inbound>
    <emit-metric name="custom-api-calls" value="1" namespace="apim-custom-metrics">
        <dimension name="API" value="@(context.Api.Name)" />
        <dimension name="Operation" value="@(context.Operation.Name)" />
        <dimension name="ProductName" value="@(context.Product?.Name ?? "none")" />
    </emit-metric>
</inbound>
Microsoft Reference: Emit custom metrics policy
Best Practices
- Enable Application Insights on all APIM instances for comprehensive monitoring
- Use sampling in production (5–25%) to control costs while maintaining visibility
- Set up alerts for capacity, error rates, and latency thresholds
- Log request/response bodies only in non-production or at low sampling rates
- Use W3C Trace Context for end-to-end distributed tracing
- Create dashboards for different audiences (operations, development, business)
- Monitor capacity proactively — scale before hitting limits
- Retain logs for compliance (90 days minimum for production)
- Use custom metrics for business-specific monitoring
- Review analytics regularly to identify trends and optimisation opportunities
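The practice of logging request and response bodies only outside production can be expressed directly in the diagnostic configuration. The following is a minimal sketch under stated assumptions: the `environment` parameter and the variable names `bodyBytes` and `payloadLogging` are illustrative, and the header list is a placeholder.

```bicep
// Assumed parameter; 'prod' is the only value the templates in this document test for.
param environment string

// 0 bytes disables body capture; 1024 matches the per-API diagnostic shown earlier.
var bodyBytes = environment == 'prod' ? 0 : 1024

// Reusable settings for a diagnostic's request and response blocks.
var payloadLogging = {
  headers: ['X-Correlation-Id'] // placeholder header list
  body: {
    bytes: bodyBytes
  }
}

// Exposed as an output so the effective settings can be inspected after deployment.
output payloadLoggingSettings object = payloadLogging
```

Within a diagnostic resource, this could be used as `frontend: { request: payloadLogging, response: payloadLogging }`, so that production deployments capture headers but no payload bytes.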