---
name: backdoor-deployment
description: "Validate a container image change via backdoor deployment. Use when: deploying test image to a cluster, comparing data volume between deployments, comparing resource consumption, backdoor deploy, validate container image, image regression testing, build and deploy branch."
argument-hint: "Provide branch name, current production image, and YAML file path"
---
# Backdoor Deployment Automation
Validates a container image change by deploying the current production image, collecting baseline data, then deploying the test image (from a CI build) and comparing data volume and resource consumption. No regressions = pass.
## Required Inputs
Ask the user whether they want to use the default values or provide new ones.
| Input | Description | Default |
|---|---|---|
| Branch name | Git branch to build | `suyadav/aiautomation` |
| Current production image | Production image tag (e.g. `ciprod:X.Y.Z`) | `ciprod:3.1.35` |
| YAML file path | Helm values file for backdoor deployment | `./../azuremonitor-containerinsights-for-prod-clusters/values.yaml` |
## Derived Values
Parse these automatically from the YAML file — do not ask the user.
| Value | Source |
|---|---|
| Cluster Resource ID | `OmsAgent.aksResourceID` |
| Log Analytics Workspace ID | `OmsAgent.workspaceID` (a GUID used with `az monitor log-analytics query -w`) |
| Cluster Name | Last segment of the cluster resource ID (for `kubectl config use-context`) |
| Subscription ID | Extracted from the cluster resource ID (`/subscriptions/<this>/...`) |
| Resource Group | Extracted from the cluster resource ID (`/resourceGroups/<this>/...`) |
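A minimal shell sketch of this parsing, assuming the two keys appear on single lines in the values file (key names from the table above; adjust the grep patterns if the file formats them differently):

```bash
YAML=./../azuremonitor-containerinsights-for-prod-clusters/values.yaml

# Source values straight from the Helm values file.
RESOURCE_ID=$(grep -m1 'aksResourceID:' "$YAML" | awk '{print $2}' | tr -d '"')
WORKSPACE_ID=$(grep -m1 'workspaceID:' "$YAML" | awk '{print $2}' | tr -d '"')

# Everything else is derived from the ARM resource ID:
# /subscriptions/<sub>/resourceGroups/<rg>/providers/.../managedClusters/<name>
SUBSCRIPTION_ID=$(echo "$RESOURCE_ID" | cut -d/ -f3)
RESOURCE_GROUP=$(echo "$RESOURCE_ID" | cut -d/ -f5)
CLUSTER_NAME=$(basename "$RESOURCE_ID")
```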
## Build Pipeline
| Field | Value |
|---|---|
| Organization | `github-private` |
| Project | `microsoft` |
| Build Definition ID | 444 |
## General Rules
- Save the output of each step to `BackdoorDeploymentOutput.md` in the repo root. Always append new results at the end. Beautify for readability. Don't clear the file until explicitly asked.
- If asked "what's the next step", read `BackdoorDeploymentOutput.md` and suggest the next step.
- Before executing any step, verify the previous step's data exists in `BackdoorDeploymentOutput.md`. If missing, confirm with the user before proceeding.
- If the build must be retriggered, keep the existing production baseline data; do not re-deploy the production image or re-collect baseline data.
- After the workflow completes, restore the YAML file to its original production image values.
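One possible shell pattern for the append-only rule (the step name and output are placeholders):

```bash
# Append a timestamped section; never truncate BackdoorDeploymentOutput.md.
{
  echo ""
  echo "## $(date -u '+%Y-%m-%d %H:%M UTC') - <step name>"
  echo "<step output>"
} >> BackdoorDeploymentOutput.md
```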
## Procedures
### Update YAML Image Tags

- Only update the image version; do NOT change any other part of the file.
- Update exactly two fields: `imageTagLinux` and `imageTagWindows`.
- Windows naming convention: prefix `win-` after the image type. Examples:
  - `cidev:3.1.27-2-abc123-20250520184627` → `cidev:win-3.1.27-2-abc123-20250520184627`
  - `ciprod:3.1.27` → `ciprod:win-3.1.27`
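Since the convention is purely textual, the Windows tag can be derived with a single substitution; a bash sketch:

```bash
LINUX_TAG='cidev:3.1.27-2-abc123-20250520184627'  # example tag from above
WINDOWS_TAG="${LINUX_TAG/:/:win-}"                # replace the first ':' with ':win-'
echo "$WINDOWS_TAG"                               # cidev:win-3.1.27-2-abc123-20250520184627
```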
### Deploy with Helm

Always use `--install` to handle both fresh installs and upgrades:

```bash
helm upgrade --install ama-logs <chart-path> -n kube-system
```

where `<chart-path>` is the directory containing the YAML (e.g. `./../azuremonitor-containerinsights-for-prod-clusters/`).
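For example, with the default values path from the inputs table, followed by `helm status` to confirm the release:

```bash
helm upgrade --install ama-logs ./../azuremonitor-containerinsights-for-prod-clusters/ -n kube-system
helm status ama-logs -n kube-system
```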
### Collect Table Data

Run Kusto queries via `az monitor log-analytics query -w <workspaceId>` (or the kusto-mcp MCP server if available).

Collect aggregated row counts in 1-minute bins from (deployment time + 5 min) to (deployment time + 10 min) for these tables:

- `ContainerInventory`
- `KubeNodeInventory`
- `KubePodInventory`
- `InsightsMetrics`
- `Perf`
- `ContainerLogV2`

Query template (run once per table; all 6 can run in parallel):

```kusto
<TableName>
| where TimeGenerated between(datetime('<deployTime+5min>') .. datetime('<deployTime+10min>'))
| where _ResourceId =~ '<clusterResourceId>'
| summarize Count=count() by bin(TimeGenerated, 1m)
| order by TimeGenerated asc
```
Timing: Wait at least 15 minutes after deployment before running these queries — this accounts for pod startup (~5 min) plus Log Analytics ingestion latency (~5–10 min). The query window (deploy+5 to deploy+10) captures steady-state data only.
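A shell sketch fanning the template out across all six tables; `WORKSPACE_ID` and `RESOURCE_ID` come from the derived values, and the window timestamps are placeholders:

```bash
START='<deployTime+5min>'   # e.g. 2025-05-20 18:51:00 (UTC)
END='<deployTime+10min>'    # e.g. 2025-05-20 18:56:00 (UTC)

for TABLE in ContainerInventory KubeNodeInventory KubePodInventory InsightsMetrics Perf ContainerLogV2; do
  az monitor log-analytics query -w "$WORKSPACE_ID" --analytics-query "
    $TABLE
    | where TimeGenerated between(datetime('$START') .. datetime('$END'))
    | where _ResourceId =~ '$RESOURCE_ID'
    | summarize Count=count() by bin(TimeGenerated, 1m)
    | order by TimeGenerated asc" -o table &
done
wait   # all six queries run in parallel
```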
### Compare Data Volume

- Compare production vs test counts side by side for each table.
- For `ContainerInventory`, `KubeNodeInventory`, `KubePodInventory`, `InsightsMetrics`, `Perf`: counts must match exactly per minute, excluding first/last-minute edge windows. If they differ by even 1, investigate.
- For `ContainerLogV2`: an exact match is not required, but check for sustained upward/downward trends indicating regression.
### Check Build Failure Reason

Query the build timeline to find which task(s) failed:

```bash
az devops invoke --organization "https://dev.azure.com/github-private" \
  --area build --resource timeline \
  --route-parameters project=microsoft buildId=<BUILD_ID> \
  --query "records[?result=='failed'].{name:name, type:type}" -o table
```
- If the only failed task name contains "Trivy" (vulnerability scan), the build images are valid — continue using this build. Do NOT fall back to a previous build. Extract the image tag from this build's logs.
- If any other task failed, the build is unusable — report the failure to the user.
### Extract Image Version from Build Logs

Use the ADO API to read the build log directly (no need to download zip files):

1. Find the log ID for the "Multi-arch Linux build" task:

   ```bash
   az devops invoke --organization "https://dev.azure.com/github-private" \
     --area build --resource timeline \
     --route-parameters project=microsoft buildId=<BUILD_ID> \
     --query "records[?name=='Multi-arch Linux build'].{name:name, logId:log.id}" -o json
   ```

2. Read the log and extract the image tag. The log contains a line like:

   ```
   ##[warning]Linux image built with tag: containerinsightsprod.azurecr.io/public/azuremonitor/containerinsights/cidev:3.1.34-17-g67321cf0d-20260323045331
   ```

   Use `grep -o 'cidev:[^ ]*'` or similar to extract the tag.

3. Derive the Windows tag from the Linux tag using the naming convention (prefix `win-`). Alternatively, find the "Docker windows build for multi-arc image" log for a line like:

   ```
   ##[warning]Windows image built with tag: ...cidev:win-3.1.34-17-g67321cf0d-20260323045331
   ```
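A sketch combining steps 1–2 from the shell. The `logs` resource route is an assumption by analogy with the `timeline` calls above; verify it against your `az devops` version:

```bash
# Fetch the raw log for the build task, then pull out the image tag.
az devops invoke --organization "https://dev.azure.com/github-private" \
  --area build --resource logs \
  --route-parameters project=microsoft buildId=<BUILD_ID> logId=<LOG_ID> \
  -o json | grep -o 'cidev:[^ "]*' | head -n1
```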
### Get PodUid

Query KubePodInventory scoped to the relevant deployment window:

```kusto
KubePodInventory
| where TimeGenerated between(datetime('<windowStart>') .. datetime('<windowEnd>'))
| where _ResourceId =~ '<clusterResourceId>'
| where Name in ('<pod1>', '<pod2>', ...)
| distinct PodUid, Name
```
### Compare Resource Consumption

Query per-minute resource consumption. You can batch multiple pods in one query using `or`:

```kusto
Perf
| where TimeGenerated between(datetime('<windowStart>') .. datetime('<windowEnd>'))
| where _ResourceId =~ '<clusterResourceId>'
| where CounterName =~ '<counterName>'
| where InstanceName contains '<podUid1>' or InstanceName contains '<podUid2>' or ...
| extend Pod = case(
    InstanceName contains '<podUid1>', '<podName1>',
    InstanceName contains '<podUid2>', '<podName2>',
    'unknown')
| summarize MaxValue=max(CounterValue/1000/1000/1000) by bin(TimeGenerated, 1m), Pod
| order by Pod asc, TimeGenerated asc
```
Compare the two counter names:

- `memoryWorkingSetBytes` (memory; the `/1000/1000/1000` divisor converts bytes to GB)
- `cpuUsageNanoCores` (CPU; the same divisor converts nanocores to cores)
Flag any regression (sustained increase in the test deployment).
### Investigate Data Volume Regression

When a table's counts differ between production and test (or ContainerLogV2 shows a sustained trend), investigate before marking it as a regression:

1. Break down by `ContainerName` in both windows to identify which container(s) are responsible:

   ```kusto
   <TableName>
   | where TimeGenerated between(datetime('<windowStart>') .. datetime('<windowEnd>'))
   | where _ResourceId =~ '<clusterResourceId>'
   | summarize Count=count() by ContainerName
   | sort by Count desc
   ```

2. Compare the per-container breakdown between production and test. Look for:
   - Containers present in one window but not the other (cluster workload change, not a code regression).
   - A specific container with significantly higher counts in the test window.

3. If a container is only present in one window, verify it was running independently of the deployment by checking a broader time range (e.g., 30 min before the deployment):

   ```kusto
   <TableName>
   | where TimeGenerated between(datetime('<deployTime-30min>') .. datetime('<deployTime>'))
   | where _ResourceId =~ '<clusterResourceId>'
   | where ContainerName == '<suspectContainer>'
   | summarize Count=count() by bin(TimeGenerated, 1m)
   | order by TimeGenerated asc
   ```

4. Classify the finding:
   - If the difference is caused by a container that started/stopped independently of the deployment → not a regression (cluster workload difference). Note this in the output file and mark as PASS.
   - If the difference is caused by an ama-logs container or directly relates to the code change → potential regression. Flag it and ask the user to review.
### Investigate Resource Consumption Regression

When memory or CPU shows a sustained increase in the test deployment:

1. Check per-container resource usage within each pod to isolate which container is consuming more. The ama-logs pods run multiple containers (ama-logs, ama-logs-prometheus, addon-token-adapter). Use:

   ```kusto
   Perf
   | where TimeGenerated between(datetime('<windowStart>') .. datetime('<windowEnd>'))
   | where _ResourceId =~ '<clusterResourceId>'
   | where CounterName =~ '<counterName>'
   | where InstanceName contains '<podUid>'
   | summarize MaxValue=max(CounterValue/1000/1000/1000) by bin(TimeGenerated, 1m), InstanceName
   | order by InstanceName asc, TimeGenerated asc
   ```

2. Compare the per-container breakdown between production and test to pinpoint the specific container causing the increase.

3. Classify the finding:
   - Increases < 10% are within normal variance → not a regression. Note in the output file and mark as PASS.
   - Sustained increases ≥ 10% in an ama-logs container → potential regression. Flag it and ask the user to review.
## Steps

The workflow has two parallel tracks that converge after the build completes.

### Phase 1: Obtain Build + Deploy Production Image (parallel)

1. Parse derived values from the YAML file (see the Derived Values table). Save all values to the output file.
2. Set the kubectl context: `kubectl config use-context <cluster name>`.
3. Check for an existing build on the branch for the latest commit (definition ID 444, org: `github-private`, project: `microsoft`); see the sketch after this list.
   - If a completed build exists on the latest commit → use it (even if it failed due to Trivy; see "Check Build Failure Reason").
   - IMPORTANT: A build that failed ONLY due to Trivy is still usable. Do NOT fall back to a previous build. The images are already built and pushed before Trivy runs. Always extract the image tag from the failed build's logs (see "Extract Image Version from Build Logs").
   - If no usable build exists → trigger a new build. Save the build ID.
4. If the build is already complete, skip to Phase 2 after finishing the production baseline steps. If the build is still running, proceed with steps 5–7 in parallel; periodically check build status during wait times.
5. Update the YAML with the current production image and deploy (see "Update YAML Image Tags" and "Deploy with Helm"). Record the production deployment time (UTC).
6. Wait 15 minutes, then verify pods: `kubectl get pods -n kube-system | grep ama-logs`. Confirm all are Running with 0 restarts. Save pod names to the output file.
7. Collect production baseline data for all 6 tables (see "Collect Table Data"). Save results to the output file.
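A sketch of the build check in step 3, assuming the Azure DevOps CLI extension is installed (the branch shown is the default from the inputs table):

```bash
# Latest build of definition 444 on the branch; compare its commit to the branch HEAD.
az pipelines build list \
  --organization "https://dev.azure.com/github-private" --project microsoft \
  --definition-ids 444 --branch suyadav/aiautomation --top 1 \
  --query "[0].{id:id, status:status, result:result, commit:sourceVersion}" -o table
git rev-parse origin/suyadav/aiautomation   # branch HEAD for comparison
```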
### Phase 2: Deploy Test Image (after build completes)

1. Confirm the build completed. Check the failure reason if needed (see "Check Build Failure Reason"). If it failed for a non-Trivy reason, ask the user whether to retrigger. If it failed only due to Trivy, treat it as a successful build: the images are valid. Do NOT fall back to a previous build.
2. Extract the test image version from the build logs (see "Extract Image Version from Build Logs"). Save to the output file.
3. Update the YAML with the test image and deploy. Record the test deployment time (UTC).
4. Wait 15 minutes, then verify pods are Running. If any pod restarted, get the reason via `kubectl describe pod <name> -n kube-system`. Save pod names to the output file.
5. Collect test data for all 6 tables (see "Collect Table Data"). Save results to the output file.
### Phase 3: Compare Results

1. Compare data volume between production and test for all tables (see "Compare Data Volume"). If any table shows a difference, investigate before reporting (see "Investigate Data Volume Regression").
2. Get the PodUid for all pods in both deployments (see "Get PodUid").
3. Compare resource consumption for `memoryWorkingSetBytes` and `cpuUsageNanoCores` (see "Compare Resource Consumption"). If any metric shows a sustained increase, investigate before reporting (see "Investigate Resource Consumption Regression").
4. Restore the YAML to its original production image values.
5. Write a summary to the output file: pass/fail for each table and resource check. Include investigation findings for any anomalies, clearly distinguishing code regressions from cluster workload differences.