Cloud Well-Architected Frameworks
Overview
The three major cloud providers (AWS, Azure, and Google Cloud) each publish a well-architected framework -- a set of pillars, design principles, and best practices for building reliable, secure, performant, and cost-effective workloads in the cloud. Google Cloud calls its version the Architecture Framework. Terminology and organization differ, but the core concerns are consistent across all three.
This skill covers all three frameworks in a unified view, enabling cross-cloud comparison and provider-agnostic architecture reasoning.
Cross-Cloud Pillar Comparison
| Concern | AWS (6 Pillars) | Azure (5 Pillars) | GCP (6 Pillars) |
|---|---|---|---|
| Operations | Operational Excellence | Operational Excellence | Operational Excellence |
| Security | Security | Security | Security, Privacy & Compliance |
| Reliability | Reliability | Reliability | Reliability |
| Performance | Performance Efficiency | Performance Efficiency | Performance Optimization |
| Cost | Cost Optimization | Cost Optimization | Cost Optimization |
| Sustainability | Sustainability | -- | -- |
| System Design | -- | -- | System Design |
Key observation: All three frameworks agree on the five core concerns (operations, security, reliability, performance, cost). AWS adds Sustainability as an explicit pillar and GCP adds System Design. Azure has no dedicated pillar for either and addresses both concerns within its five pillars.
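The comparison table can also be held as data, which makes cross-cloud lookups and gap checks mechanical. The following is a minimal sketch: the names `PILLARS`, `equivalent_pillar`, and `gaps` are illustrative and not part of any provider's tooling, and the pillar names are copied from the table above.

```python
from typing import Optional

# Pillar names copied from the comparison table above.
# None stands for "--" (the provider has no dedicated pillar for that concern).
PILLARS: dict[str, dict[str, Optional[str]]] = {
    "Operations": {"AWS": "Operational Excellence", "Azure": "Operational Excellence", "GCP": "Operational Excellence"},
    "Security": {"AWS": "Security", "Azure": "Security", "GCP": "Security, Privacy & Compliance"},
    "Reliability": {"AWS": "Reliability", "Azure": "Reliability", "GCP": "Reliability"},
    "Performance": {"AWS": "Performance Efficiency", "Azure": "Performance Efficiency", "GCP": "Performance Optimization"},
    "Cost": {"AWS": "Cost Optimization", "Azure": "Cost Optimization", "GCP": "Cost Optimization"},
    "Sustainability": {"AWS": "Sustainability", "Azure": None, "GCP": None},
    "System Design": {"AWS": None, "Azure": None, "GCP": "System Design"},
}


def equivalent_pillar(concern: str, cloud: str) -> Optional[str]:
    """Return the pillar name `cloud` uses for `concern`, or None if it has no explicit pillar."""
    return PILLARS[concern][cloud]


def gaps(cloud: str) -> list[str]:
    """List the concerns that `cloud` does not model as an explicit pillar."""
    return [concern for concern, row in PILLARS.items() if row[cloud] is None]
```

For a migration from AWS to Azure, `gaps("Azure")` flags the concerns that need to be mapped by hand because no same-named pillar exists on the target cloud.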
AWS Well-Architected Framework (6 Pillars)
1. Operational Excellence
Design, run, and monitor systems to deliver business value and continually improve processes and procedures.
Key principles:
- Perform operations as code (Infrastructure as Code)
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure; learn from all operational events
- Use managed services to reduce operational burden
2. Security
Protect data, systems, and assets through risk assessments, security controls, and automated security best practices.
Key principles:
- Implement a strong identity foundation (least privilege, IAM)
- Enable traceability (logging, auditing, monitoring)
- Apply security at all layers (edge, VPC, subnet, instance, OS, application)
- Automate security best practices
- Protect data in transit and at rest
- Keep people away from data (reduce direct access)
- Prepare for security events (incident response runbooks)
3. Reliability
Ensure a workload can recover from failures and meet demand through proper planning and design.
Key principles:
- Automatically recover from failure
- Test recovery procedures
- Scale horizontally to increase aggregate availability
- Stop guessing capacity (use auto-scaling)
- Manage change through automation
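To make "stop guessing capacity" concrete, here is a minimal sketch of a proportional scaling rule: scale the replica count by the ratio of the observed metric to its target. This is the rule documented for the Kubernetes Horizontal Pod Autoscaler, used here only as an illustration. Managed auto-scaling services on each cloud apply their own policies, and the function name, parameters, and default bounds below are assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,    # assumed lower bound
    max_replicas: int = 20,   # assumed upper bound
) -> int:
    """Proportional scaling: ceil(current * observed / target), clamped to the bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas at 90% average CPU and a 60% target, the rule asks for 6 replicas. At 30% it asks for 2. The clamp keeps a metric spike from scaling past a budgeted ceiling.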
4. Performance Efficiency
Use computing resources efficiently and maintain that efficiency as demand changes and technologies evolve.
Key principles:
- Democratize advanced technologies (use managed services)
- Go global in minutes (multi-region)
- Use serverless architectures where possible
- Experiment more often
- Consider mechanical sympathy (understand how services are consumed)
5. Cost Optimization
Avoid unnecessary costs and understand where money is being spent.
Key principles:
- Implement cloud financial management
- Adopt a consumption model (pay for what you use)
- Measure overall efficiency
- Stop spending money on undifferentiated heavy lifting
- Analyze and attribute expenditure
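"Analyze and attribute expenditure" usually comes down to grouping billing line items by an ownership tag. The sketch below assumes a simplified, hypothetical line-item shape (a dict with `cost` and `tags` keys) rather than any provider's actual billing export format.

```python
from collections import defaultdict


def attribute_costs(line_items: list[dict], tag_key: str = "team") -> dict[str, float]:
    """Sum cost per value of `tag_key`; items without the tag land in 'untagged'."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


# Hypothetical billing data for illustration only.
example_items = [
    {"service": "compute", "cost": 120.0, "tags": {"team": "payments"}},
    {"service": "storage", "cost": 30.0, "tags": {"team": "payments"}},
    {"service": "compute", "cost": 50.0, "tags": {"team": "search"}},
    {"service": "network", "cost": 10.0, "tags": {}},
]
```

The size of the `untagged` bucket is itself a useful signal: spend that cannot be attributed cannot be optimized by its owner.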
6. Sustainability
Minimize the environmental impact of cloud workloads.
Key principles:
- Understand your impact
- Establish sustainability goals
- Maximize utilization
- Anticipate and adopt new, more efficient offerings
- Use managed services (shared infrastructure is more efficient)
- Reduce downstream impact of your cloud workloads
Azure Well-Architected Framework (5 Pillars)
1. Reliability
Ensure the application meets its availability commitments through resiliency and recovery design.
Key principles:
- Design for business requirements (define SLA/SLO/SLI)
- Design for failure (assume everything can fail)
- Observe application health (monitoring, alerting)
- Drive automation (minimize human error)
- Design for self-healing
- Design for scale-out
2. Security
Protect the confidentiality, integrity, and availability of the application and its data.
Key principles:
- Plan resources and how to harden them
- Automate and use least privilege
- Classify and encrypt data
- Guard with identity management (Zero Trust)
- Monitor security for the entire system
- Secure the supply chain
3. Cost Optimization
Balance business goals with budget to create a cost-effective workload while avoiding waste.
Key principles:
- Develop cost-management discipline
- Design with a cost-efficiency mindset
- Design for usage optimization (right-size, auto-scale)
- Continuously monitor and optimize
4. Operational Excellence
Reduce issues in production by building holistic observability and automated processes.
Key principles:
- Embrace DevOps culture
- Establish development standards (IaC, CI/CD)
- Evolve operations with observability
- Deploy with confidence (progressive rollout, rollback)
- Automate for efficiency
- Adopt safe deployment practices
5. Performance Efficiency
Efficiently scale your workload to meet demand without over-provisioning or under-provisioning.
Key principles:
- Negotiate realistic performance targets (SLAs/SLOs)
- Design to meet capacity requirements
- Achieve and sustain performance
- Improve efficiency through optimization
- Monitor and collect data to measure performance
GCP Architecture Framework (6 Pillars)
1. System Design
Design systems that meet functional and non-functional requirements using cloud-native patterns.
Key principles:
- Design for change (loosely coupled components)
- Design for automation
- Design for managed services
- Design for portability where appropriate
- Design for observability
2. Operational Excellence
Deploy, operate, and monitor systems efficiently with minimal manual intervention.
Key principles:
- Automate deployments (CI/CD)
- Practice infrastructure as code
- Monitor and alert on SLIs
- Conduct game days and chaos engineering
- Implement progressive rollouts
3. Security, Privacy & Compliance
Protect data and systems, maintain privacy, and meet compliance requirements.
Key principles:
- Leverage the shared responsibility model
- Apply defense in depth
- Automate security controls
- Classify data by sensitivity
- Implement identity federation and least privilege
- Manage compliance as code
4. Reliability
Design and operate a resilient, highly available service that meets availability targets.
Key principles:
- Define and measure SLOs/SLIs
- Build redundancy to handle failures
- Design for graceful degradation
- Implement health monitoring and automated remediation
- Test for reliability (disaster recovery, chaos engineering)
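Defining and measuring SLOs and SLIs can be reduced to a little arithmetic. The sketch below assumes an availability SLI computed as good requests over total requests and a 30-day window. The function names and the window length are illustrative choices, not something the framework prescribes.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be in (0, 1]")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Error budget left after the observed downtime (negative means the SLO is breached)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

A 99.9% SLO over 30 days leaves roughly 43 minutes of error budget. Once `budget_remaining` approaches zero, a common practice is to slow feature rollouts in favor of reliability work.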
5. Cost Optimization
Manage and optimize costs while maintaining performance and reliability.
Key principles:
- Identify cost drivers
- Right-size and auto-scale resources
- Use committed use discounts and sustained use discounts
- Monitor and forecast costs
- Build a cost-aware culture
6. Performance Optimization
Design, validate, and tune resources for optimal performance.
Key principles:
- Define performance requirements early
- Benchmark and load-test
- Optimize at the application and infrastructure layers
- Use caching and CDNs
- Monitor performance continuously
Well-Architected Review Process
A Well-Architected Review (WAR) is a structured assessment of a workload against the framework's pillars. Each cloud provides tooling or guidance to support the review:
| Cloud | Tool | How It Works |
|---|---|---|
| AWS | AWS Well-Architected Tool | Answer questions per pillar; generates findings and improvement plan |
| Azure | Azure Well-Architected Review (online assessment) | Self-service questionnaire; generates recommendations |
| GCP | Architecture Framework checklists + Cloud Architecture Center | Checklist-driven review; reference architectures |
Review Steps
1. Scope the workload -- Define the boundary of what is being reviewed (a single application, a platform, a service).
2. Assemble the team -- Include architects, developers, operations, security, and finance.
3. Walk through each pillar -- Answer the framework's questions honestly. Identify gaps.
4. Prioritize findings -- Rank by business impact and effort. Focus on high-risk, high-impact items first.
5. Create an improvement plan -- Assign owners, set deadlines, track progress.
6. Schedule regular reviews -- Architecture is not a one-time activity. Review quarterly or after major changes.
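The prioritization step can be sketched as a sort over findings. The `Finding` type, its 1-5 scoring scale, and the sort order (high-risk first, then highest impact, then lowest effort) are assumptions made for illustration. The provider review tools produce their own risk ratings.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    pillar: str
    impact: int              # 1 (low) .. 5 (high), hypothetical scale
    effort: int              # 1 (low) .. 5 (high), hypothetical scale
    high_risk: bool = False


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Order findings: high-risk first, then by descending impact, then by ascending effort."""
    return sorted(findings, key=lambda f: (not f.high_risk, -f.impact, f.effort))
```

The ordered list feeds directly into the improvement plan: the top entries get owners and deadlines first.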
Review Frequency
| Trigger | Action |
|---|---|
| New workload launch | Full review before production |
| Major architecture change | Review affected pillars |
| Quarterly cadence | Lightweight review of all pillars |
| Incident or outage | Review Reliability and Operational Excellence pillars |
| Cost spike | Review Cost Optimization pillar |
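The trigger table maps naturally onto a small lookup. In this sketch the `Trigger` enum, the depth labels (`full`, `lightweight`, `targeted`), and the use of the five core concern names in place of provider-specific pillar names are all assumptions.

```python
from enum import Enum
from typing import Optional


class Trigger(Enum):
    NEW_WORKLOAD = "New workload launch"
    MAJOR_CHANGE = "Major architecture change"
    QUARTERLY = "Quarterly cadence"
    INCIDENT = "Incident or outage"
    COST_SPIKE = "Cost spike"


# The five core concerns shared by all three frameworks; add provider-specific
# pillars (Sustainability, System Design) where they apply.
CORE_CONCERNS = ["Operations", "Security", "Reliability", "Performance", "Cost"]


def review_scope(trigger: Trigger, affected: Optional[list[str]] = None) -> tuple[str, list[str]]:
    """Return (review depth, concerns to review) for a trigger from the table above."""
    if trigger is Trigger.NEW_WORKLOAD:
        return ("full", list(CORE_CONCERNS))
    if trigger is Trigger.QUARTERLY:
        return ("lightweight", list(CORE_CONCERNS))
    if trigger is Trigger.INCIDENT:
        return ("targeted", ["Reliability", "Operations"])
    if trigger is Trigger.COST_SPIKE:
        return ("targeted", ["Cost"])
    # MAJOR_CHANGE: the caller must say which concerns the change touches.
    if not affected:
        raise ValueError("affected concerns are required for a major architecture change")
    return ("targeted", list(affected))
```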
Pillar Tensions and Tradeoffs
The pillars are inherently in tension. Optimizing one often increases costs or complexity in another:
| Tradeoff | Example |
|---|---|
| Reliability vs. Cost | Multi-region deployment increases availability but can roughly double infrastructure cost when the second region is fully provisioned |
| Security vs. Performance | Encryption at rest and in transit adds CPU overhead and connection-setup latency |
| Performance vs. Cost | Over-provisioning ensures headroom but wastes money |
| Operational Excellence vs. Speed | Comprehensive CI/CD and observability take time to set up but pay off long-term |
| Sustainability vs. Performance | Right-sizing reduces waste but may reduce performance headroom |
Key principle: Make tradeoffs explicitly. Document which pillars are prioritized and why (use Architecture Decision Records -- see specs/documentation/adr).
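The Reliability vs. Cost tradeoff can be put in numbers before it is written into an ADR. The sketch below rests on strong simplifying assumptions, stated loudly: regions fail independently, any single region can serve all traffic, and each added region duplicates the full infrastructure cost. Real deployments see correlated failures and partial duplication, so treat the result as an optimistic bound.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Availability when the workload is up if at least one region is up (independent failures assumed)."""
    if not 0 <= per_region <= 1 or regions < 1:
        raise ValueError("per_region must be in [0, 1] and regions >= 1")
    return 1 - (1 - per_region) ** regions


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours in a 365-day year."""
    return (1 - availability) * 365 * 24


def relative_cost(regions: int) -> int:
    """Cost multiplier, assuming every region is a full copy of the stack."""
    return regions
```

Under these assumptions, going from one region at 99.9% to two cuts expected downtime from about 8.76 hours per year to well under a minute, at twice the infrastructure cost. Whether that is worth paying for is the decision the ADR should record.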
Best Practices
- Use the well-architected framework as a common language between architects, developers, and stakeholders -- not as a compliance checklist.
- Conduct well-architected reviews early and often, not just before launch.
- Prioritize the pillars that matter most for your workload (e.g., a financial system prioritizes Security and Reliability; a data pipeline prioritizes Performance and Cost).
- Leverage the cloud provider's native review tooling to structure the assessment.
- Document all tradeoff decisions in Architecture Decision Records.
- Remember that well-architected is aspirational -- no workload scores perfectly on every pillar. The goal is continuous improvement.
- When working across clouds (multi-cloud or migration), use this cross-cloud comparison to map equivalent concerns and avoid gaps.