AWS Well-Architected: How We Really Do It

AWS Well-Architected Framework. Six pillars. Hundreds of best practices. Dozens of whitepapers. Every AWS partner knows it. Most treat it like a checklist: Read, check off, forget.

We don't. We live in these pillars. Daily. Across 1,500+ AWS accounts at Siemens, in ML platforms at Volkswagen, in an HR data platform whose running costs dropped from over one million euros per year to under 40,000. Here's how we actually implement the six pillars. Not the AWS marketing version. Ours.

Operational Excellence: Observability Over Dashboards

AWS recommends: Implement monitoring and logging. Most teams interpret that as: Create CloudWatch dashboards and write alert rules.

We go further.

Self-healing agents instead of alert fatigue. At the HR Data Hub at Siemens Energy, our AWS Architect Agent doesn't passively monitor CloudWatch logs. It analyzes them. Detects patterns that no human would define in an alert rule. Creates root cause analyses. Resolves simple problems independently. Escalates complex ones with context.

The result: No alert fatigue. No overnight pager alerts for false positives. An agent that knows the difference between a cold-start timeout and a real infrastructure issue.

Infrastructure as Code as documentation. We don't write separate architecture documents. The CDK code IS the documentation. Every resource, every configuration, every relationship between services is defined in code. When the code changes, the documentation updates automatically. No architecture diagram that's outdated after three months.

Security: At the Beginning, Not the End

AWS recommends: Security by design. Most teams interpret that as: Security review before go-live.

We implement security from the first line of code:

Least privilege as default. Every IAM role starts with zero permissions. We add what's needed. Never more. At Siemens with 1,500 AWS accounts, this isn't optional. One account that's allowed too much is a security risk for the entire enterprise.
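What "start at zero, add only what's needed" can look like in code: a minimal sketch, not our production tooling. The function name `build_policy`, the wildcard rules, and the bucket ARN are assumptions made for illustration.

```python
from typing import Optional


def build_policy(grants: list[tuple[list[str], list[str]]]) -> Optional[dict]:
    """Build an IAM policy document from explicit (actions, resources) grants.

    With no grants, no policy is produced at all: the role keeps zero
    permissions, because IAM denies everything that is not explicitly allowed.
    """
    if not grants:
        return None
    statements = []
    for actions, resources in grants:
        for action in actions:
            # Reject "*" and service-wide wildcards such as "s3:*".
            if action == "*" or action.endswith(":*"):
                raise ValueError(f"wildcard action not allowed: {action}")
        if "*" in resources:
            raise ValueError("wildcard resource not allowed")
        statements.append(
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": sorted(resources),
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}
```

Calling `build_policy([(["s3:GetObject"], ["arn:aws:s3:::hr-data-example/*"])])` yields a single Allow statement for exactly that action on exactly that bucket path. Anything broader raises an error before it reaches a deployment.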

Cross-account security. No account blindly trusts another. Every cross-account access is explicitly defined, time-limited, and audited. That sounds expensive. It is. But with 1,500 accounts, the alternative is unacceptable.
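One possible way to make a cross-account grant explicit and time-limited is a role trust policy that names a single principal and carries an expiry condition. The sketch below assumes this approach; the function name `trust_policy`, the role name, and the account ID (an AWS documentation placeholder) are illustrative, not taken from a real setup.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def trust_policy(
    principal_arn: str,
    valid_for: timedelta,
    now: Optional[datetime] = None,
) -> dict:
    """Trust policy that lets one named principal assume a role until an expiry time."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    expires = (now + valid_for).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Explicit principal: one role in one account, never a wildcard.
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
                # aws:CurrentTime is a global IAM condition key.
                "Condition": {"DateLessThan": {"aws:CurrentTime": expires}},
            }
        ],
    }
```

The condition is evaluated when the role is assumed, so sessions issued before the expiry stay valid until they run out. Session length is limited separately. For the audit part, CloudTrail records `AssumeRole` calls.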

Security tests in the pipeline. Every push goes through SAST and dependency checks. Not as a separate step after development. As part of every commit. If a security test fails, the code doesn't go to staging.

No access for us after handover. When we hand over a project, we lose access. Completely. No admin backdoor, no "for emergencies." Your system, your access.

Reliability: Serverless as Insurance

AWS recommends: Design for failure. We interpret that as: Eliminate what can fail.

Serverless first. Lambda, Step Functions, S3, DynamoDB, SQS. No server for us to patch or restart. No container that hangs. No EC2 instance that has a kernel panic at 3 AM. Hardware still fails underneath. But AWS absorbs that, not our on-call rotation.

At the HR Data Hub, the platform processes data from 150+ countries with different time windows and formats. All serverless. Operating costs under 40,000 EUR per year. Not a single night where someone had to get up.

Retry with exponential backoff. Every external call (API, database, AI model) has retry logic. Not a blanket three retries. But with exponentially growing wait times and a circuit breaker. When an AI provider is temporarily unreachable, the agent falls back to an alternative provider. Provider agnosticism as a reliability feature.
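The three pieces (exponential backoff, circuit breaker, provider fallback) fit in a few lines. This is a minimal sketch under assumptions: all names, thresholds, and delays are illustrative, and real code would catch only transient errors instead of every `Exception`.

```python
import random
import time


class CircuitBreaker:
    """Opens after repeated failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has passed, let a trial call through.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_retry(fn, breaker, max_attempts=4, base_delay=0.5,
                    max_delay=8.0, sleep=time.sleep):
    """Call fn, backing off exponentially (with jitter) between failed attempts."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open")
        try:
            result = fn()
        except Exception:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # random wait up to the current cap
        else:
            breaker.record_success()
            return result


def call_with_fallback(providers, **retry_options):
    """Try each (callable, breaker) pair in order; return the first success."""
    last_error = None
    for fn, breaker in providers:
        try:
            return call_with_retry(fn, breaker, **retry_options)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Each provider gets its own breaker. One provider that is down stops receiving calls after a few failures, and the next provider in the list takes over.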

Chaos engineering (light). We don't randomly shut down services. But we regularly test what happens when an AI provider doesn't respond, when a Lambda hits its timeout limit, when DynamoDB throttles. These tests are part of our CI/CD pipeline.
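A fault-injection test of this kind needs no real outage, only a test double that misbehaves on demand. The sketch below simulates throttling with a fake table. `ThrottlingError`, `FaultyTable`, and `save_with_retry` are hypothetical names, and the retry loop skips the waiting so the test stays fast.

```python
class ThrottlingError(Exception):
    """Stand-in for a throttling error raised by the database client."""


class FaultyTable:
    """Test double that throttles the first `failures` writes, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.items = []

    def put_item(self, item):
        if self.failures > 0:
            self.failures -= 1
            raise ThrottlingError("simulated throttle")
        self.items.append(item)


def save_with_retry(table, item, max_attempts=3):
    """Write an item, retrying on throttling; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            table.put_item(item)
            return attempt
        except ThrottlingError:
            if attempt == max_attempts:
                raise


def test_recovers_from_short_throttle():
    table = FaultyTable(failures=2)
    assert save_with_retry(table, {"id": "1"}) == 3
    assert table.items == [{"id": "1"}]


def test_surfaces_sustained_throttle():
    table = FaultyTable(failures=5)
    try:
        save_with_retry(table, {"id": "1"})
    except ThrottlingError:
        return
    raise AssertionError("expected ThrottlingError")
```

The first test checks that a short throttle is absorbed. The second checks that a sustained one is surfaced instead of swallowed. Both run in milliseconds, which is what makes them cheap enough for every pipeline run.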

Performance Efficiency: Measure, Don't Guess

Right compute choice per workload. Not everything is Lambda. GPU-intensive ML workloads run on SageMaker or ECS with GPU instances. Batch processing runs on Step Functions with parallel Lambda invocations. Real-time APIs run on Fargate with auto-scaling.

At VW Snowpark, we run 100+ ML environments. Each with its own compute profile, optimized for the specific ML workload. No one-size-fits-all.

Caching with intent. Not everything needs to be cached. But when an AI model returns identical responses for identical inputs, we cache them. This reduces API costs by 30–40% for certain workloads. DynamoDB serves as the cache for LLM responses, with a TTL based on the model release cycle.
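A sketch of the caching idea, assuming a deterministic model call. The in-memory dictionary stands in for a DynamoDB table; `ResponseCache`, `cache_key`, and the `expires_at` attribute name are illustrative choices, not our actual schema.

```python
import hashlib
import json
import time


def cache_key(model_id, prompt, params):
    """Stable key: identical model, prompt, and parameters give the same hash."""
    payload = json.dumps(
        {"model": model_id, "prompt": prompt, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    def __init__(self, ttl_seconds, clock=time.time):
        self._items = {}  # stand-in for a DynamoDB table keyed by cache_key
        self._ttl = ttl_seconds
        self._clock = clock

    def get_or_compute(self, model_id, prompt, params, compute):
        key = cache_key(model_id, prompt, params)
        now = int(self._clock())
        item = self._items.get(key)
        # Check expiry on read: DynamoDB TTL deletes expired items with a delay.
        if item is not None and item["expires_at"] > now:
            return item["response"]
        response = compute()  # the expensive model call
        self._items[key] = {"response": response, "expires_at": now + self._ttl}
        return response
```

Because the model identifier is part of the key, a new model release misses the cache automatically. The TTL then clears out entries for the old release.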

Cost Optimization: Architecture Over FinOps Tools

Serverless as cost control. The HR Data Hub at Siemens Energy. Old solution: Over 1,000,000 EUR per year. New solution: Under 40,000 EUR. No FinOps tool would have produced this saving. The architecture decision did.

Tagging policies from day one. Every resource is tagged: project, team, environment, cost center. Automated via CDK. No resource exists without tags. No tag, no deployment.
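"No tag, no deployment" boils down to a check that fails the build. A minimal sketch, assuming resources arrive as a name-to-tags mapping; the exact tag key spellings and function names are assumptions.

```python
REQUIRED_TAGS = ("project", "team", "environment", "cost_center")


def missing_tags(resources):
    """Return {resource_name: [missing tag keys]} for under-tagged resources."""
    problems = {}
    for name, tags in resources.items():
        missing = [key for key in REQUIRED_TAGS if not str(tags.get(key, "")).strip()]
        if missing:
            problems[name] = missing
    return problems


def assert_deployable(resources):
    """Raise if any resource lacks a required tag, so the pipeline stops."""
    problems = missing_tags(resources)
    if problems:
        details = "; ".join(
            f"{name}: missing {', '.join(keys)}"
            for name, keys in sorted(problems.items())
        )
        raise ValueError(f"untagged resources block deployment: {details}")
```

In a CDK app, `Tags.of(scope).add(key, value)` applies tags to everything in a scope, and an Aspect can run a comparable check at synth time. The standalone function above only shows the rule itself.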

Right-sizing through data. No engineer guesses which Lambda size or Fargate task fits. We measure. AWS Compute Optimizer plus our own metrics. Quarterly review of all running resources.
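The measuring part can be sketched for Lambda, where compute is billed by memory multiplied by duration. The function names, the latency target, and the sample measurements in the usage note are hypothetical; the price per GB-second is a parameter because it depends on region and architecture.

```python
def cost_per_invocation(memory_mb, avg_duration_ms, price_per_gb_second):
    """Compute cost of one invocation: GB of memory times seconds of runtime."""
    return (memory_mb / 1024) * (avg_duration_ms / 1000) * price_per_gb_second


def pick_memory(measurements, max_duration_ms, price_per_gb_second):
    """Pick the cheapest memory size whose measured duration meets the target.

    measurements maps memory size in MB to measured average duration in ms.
    """
    candidates = {
        memory: duration
        for memory, duration in measurements.items()
        if duration <= max_duration_ms
    }
    if not candidates:
        raise ValueError("no measured configuration meets the latency target")
    return min(
        candidates,
        key=lambda memory: cost_per_invocation(
            memory, candidates[memory], price_per_gb_second
        ),
    )
```

With hypothetical measurements of `{512: 1200, 1024: 500, 2048: 300}` and a one-second target, the 512 MB setting is too slow, and 1024 MB beats 2048 MB on cost even though it is not the fastest. More memory is not automatically more expensive, and less is not automatically cheaper. That is why we measure.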

Sustainability: Less Compute, Same Results

AI model selection as a sustainability decision. Not every task needs Claude Opus. For simple classifications, Haiku suffices. For standard text, Sonnet suffices. Model selection reduces not just costs, but also the compute footprint.
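As a sketch, the selection rule is a lookup from task type to the smallest sufficient model tier. The task categories and the mapping of everything else to the largest tier are assumptions for illustration; the tier names stand in for concrete model identifiers.

```python
from enum import Enum


class Task(Enum):
    CLASSIFICATION = "classification"
    STANDARD_TEXT = "standard_text"
    COMPLEX_REASONING = "complex_reasoning"


# Smallest tier that is sufficient for each task type.
MODEL_TIER = {
    Task.CLASSIFICATION: "haiku",
    Task.STANDARD_TEXT: "sonnet",
    Task.COMPLEX_REASONING: "opus",
}


def select_model(task: Task) -> str:
    """Return the model tier for a task; unknown task types are an error."""
    return MODEL_TIER[task]
```

Keeping the mapping in one place makes the decision reviewable: moving a task type to a larger tier becomes a visible change in a pull request, not a default nobody questioned.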

Serverless = no idle compute. No server sitting idle 23 hours a day still consuming power. Lambda only runs when there's work to do. For workloads that run sporadically, this reduces compute consumption by more than 90%.
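The idle-server example works out as simple arithmetic. A rough sketch that compares provisioned compute hours only; it assumes on-demand compute runs exactly as long as the work takes and ignores differences in per-hour efficiency.

```python
def compute_reduction(busy_hours_per_day: float) -> float:
    """Share of always-on compute hours avoided when running only while busy."""
    hours_per_day = 24
    return 1 - busy_hours_per_day / hours_per_day
```

A server that is busy one hour and idle 23 gives `compute_reduction(1)`, which is 23/24, or roughly 96% fewer compute hours. A workload that is busy twelve hours a day saves only half, so the effect depends on how sporadic the workload really is.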

Not a Framework. A Way of Working.

Well-Architected isn't a review we perform before go-live. It's how we work every day. Every architecture decision, every pull request, every code review follows these principles.

Not because AWS recommends it. But because it works. Across 1,500 accounts. For 120,000 users. With operating costs under 40,000 EUR for a platform that previously cost over a million.
