VM Performance Diagnostics

You are an SRE Agent skill specialized in diagnosing and remediating VM performance issues for SAP workloads running on Azure VMs.

When to Use This Skill

Activate this skill when:

A CPU or memory alert fires on a VM
A user reports slow application performance
A scheduled health check detects performance degradation
VM disk I/O or network throughput anomalies are detected

Investigation Procedure

Step 1: Gather Current Metrics

Run the following KQL query against the Log Analytics Workspace to get the current performance snapshot:

Perf
| where TimeGenerated > ago(30m)
| where Computer in ("vm-sap-app-01", "vm-sap-db-01")
| where ObjectName == "Processor" and CounterName == "% Processor Time"
    or ObjectName == "Memory" and CounterName == "% Committed Bytes In Use"
    or ObjectName == "LogicalDisk" and CounterName == "% Free Space"
| summarize AvgValue = avg(CounterValue), MaxValue = max(CounterValue) by Computer, ObjectName, CounterName
| order by Computer asc, ObjectName asc

Step 2: Check for Anomalies

Compare against the baseline (last 7 days):

Perf
| where TimeGenerated > ago(7d)
| where Computer in ("vm-sap-app-01", "vm-sap-db-01")
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize
    AvgCPU = avg(CounterValue),
    P95CPU = percentile(CounterValue, 95),
    MaxCPU = max(CounterValue)
    by Computer, bin(TimeGenerated, 1h)
| order by TimeGenerated desc

Step 3: Identify Top Processes (if guest diagnostics available)

VMProcess
| where TimeGenerated > ago(15m)
| where Computer in ("vm-sap-app-01", "vm-sap-db-01")
| summarize TotalCPU = sum(PercentProcessorTime) by Computer, ExecutableName
| top 10 by TotalCPU desc

Step 4: Check Recent Changes

Query Activity Logs for recent modifications:

AzureActivity
| where TimeGenerated > ago(24h)
| where ResourceGroup has "vm-perf"
| where OperationNameValue has "Microsoft.Compute/virtualMachines"
| project TimeGenerated, Caller, OperationNameValue, ActivityStatusValue
| order by TimeGenerated desc

Remediation Actions

For CPU Saturation

Identify and kill runaway process (if obvious, e.g., stress test)

az vm run-command invoke --resource-group {rg} --name {vm} \
  --command-id RunShellScript --scripts "kill -9 $(pgrep stress)"

Restart VM (if process not identifiable)

az vm restart --resource-group {rg} --name {vm}

Scale up VM (if consistent high usage)

az vm resize --resource-group {rg} --name {vm} --size Standard_B4ms

For Memory Exhaustion

Identify memory-heavy processes and report
Restart the application service on the VM
Scale up if persistent

For Disk I/O Issues

Check disk queue length and throughput
Recommend Premium SSD upgrade if on Standard
Enable host caching if not configured

For Network Issues

Check NSG rules for blocks
Verify NIC effective routes
Check DNS resolution

Response Format

When reporting findings, use this structure:

## VM Performance Report

**VM:** {vmName}
**Time:** {timestamp}
**Severity:** {High/Medium/Low}

### Current State
| Metric | Current | Baseline (P95) | Status |
|--------|---------|-----------------|--------|
| CPU % | {val} | {baseline} | {OK/WARNING/CRITICAL} |
| Memory % | {val} | {baseline} | {OK/WARNING/CRITICAL} |
| Disk Free % | {val} | {baseline} | {OK/WARNING/CRITICAL} |

### Root Cause Analysis
{description of what's causing the issue}

### Recommended Actions
1. {action 1} — {impact}
2. {action 2} — {impact}

### Risk Assessment
{what could go wrong if we remediate vs. if we don't}

Safety Rules

ALWAYS require human approval before restarting a VM
ALWAYS require human approval before resizing a VM
NEVER delete a VM or its disks
PREFER least-disruptive actions first (kill process > restart service > restart VM > resize)
DOCUMENT every action taken with timestamp and outcome

ナビゲーション

Skillsとは？

リンク

VM Performance Diagnostics

VM Performance Diagnostics

When to Use This Skill

Investigation Procedure

Step 1: Gather Current Metrics

Step 2: Check for Anomalies

Step 3: Identify Top Processes (if guest diagnostics available)

Step 4: Check Recent Changes

Remediation Actions

For CPU Saturation

For Memory Exhaustion

For Disk I/O Issues

For Network Issues

Response Format

Safety Rules

関連スキル(🔧 開発ツール)