
In late 2023, our team faced a challenge that had never existed before: building an AI platform for 120,000 Siemens employees — infrastructure, backend, AI. Not as an experiment. Not as a prototype. In production.
Today, over 50 AI models run on this platform. Hundreds of thousands of messages per day. Thousands of agents created by employees with no programming skills. It's a success story. But the path to get there was anything but straightforward.
Here's what we learned.
1. Provider Agnosticism Isn't an Option — It's a Requirement
When we designed SiemensGPT, the temptation was strong to bet on a single AI provider. OpenAI was dominant, the API well-documented, the ecosystem growing fast.
We decided against it. And that was one of the best architectural decisions of the project.
Within 18 months, the model landscape changed fundamentally three times. Claude became better than GPT-4 for certain use cases. Open-source models like LLaMA reached enterprise-grade quality. Google launched Gemini with multimodal capabilities.
Had we locked ourselves into a single provider, we would have had to rebuild the platform three times. Instead, we were able to integrate new models within days.
Takeaway: Build abstraction layers. Not because it's elegant, but because AI models change faster than any other technology we've ever worked with.
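To make that concrete, here's a minimal sketch of what such an abstraction layer can look like. The class names are illustrative and the providers are stubbed; this is not our actual code, just the shape of the idea:

```python
from abc import ABC, abstractmethod

class ChatProvider(ABC):
    """Uniform interface the platform codes against; providers plug in behind it."""

    @abstractmethod
    def complete(self, messages: list[dict]) -> str:
        ...

class OpenAIProvider(ChatProvider):
    def complete(self, messages: list[dict]) -> str:
        # Would call the OpenAI API here; stubbed for this sketch.
        return f"openai: {messages[-1]['content']}"

class AnthropicProvider(ChatProvider):
    def complete(self, messages: list[dict]) -> str:
        # Would call the Anthropic API here; stubbed for this sketch.
        return f"anthropic: {messages[-1]['content']}"

# A registry keyed by model name: integrating a new model is one entry,
# not a platform rewrite.
PROVIDERS: dict[str, ChatProvider] = {
    "gpt-4": OpenAIProvider(),
    "claude": AnthropicProvider(),
}

def chat(model: str, messages: list[dict]) -> str:
    return PROVIDERS[model].complete(messages)
```

The rest of the platform only ever calls `chat()`. When a new provider ships a better model, it becomes one new registry entry instead of a rebuild.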
2. LLMs Couldn't Do Agents in 2023. We Built Them Anyway.
When we started, there were no native agentic workflows in LLMs. No tool calling as we know it today. No function calling that worked reliably. The models could generate text, but they couldn't control systems.
Our team built a custom agent framework. From scratch. Prompt engineering, retry logic, tool orchestration, error handling. All of it ourselves.
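A stripped-down sketch of what one step of such a framework looks like, assuming a text-only model with no native function calling. The prompt format, retry strategy, and names here are illustrative, not our production code:

```python
import json
import re

def run_tool_step(call_model, tools: dict, user_msg: str, max_retries: int = 3):
    """One agent step: ask the model for a tool call as JSON embedded in text,
    parse it, execute the tool, and retry when the output is malformed."""
    prompt = (
        f"{user_msg}\n"
        'Respond with a JSON object {"tool": ..., "args": {...}} and nothing else.'
    )
    for _ in range(max_retries):
        raw = call_model(prompt)
        # Models in 2023 loved to chatter around the JSON; tolerate that.
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if not match:
            prompt += "\nYour last reply contained no JSON. Try again."
            continue
        try:
            call = json.loads(match.group())
            return tools[call["tool"]](**call["args"])
        except (json.JSONDecodeError, KeyError, TypeError):
            prompt += "\nYour last reply was not a valid tool call. Try again."
    raise RuntimeError("model never produced a usable tool call")
```

Multiply this by orchestration across multiple tools, error handling, and guardrails, and you have the shape of what we built by hand before the native APIs existed.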
Six months later, OpenAI released function calling. A year later, every provider had native agents. Our framework was technically superseded before it ever had time to mature.
Was it wasted effort? No. Because during those six months, 120,000 employees already had access to agents. Our customers didn't wait for the industry. And the know-how from building our own framework helped us leverage the native APIs more effectively, faster than teams that only started with function calling.
Takeaway: Don't wait for the perfect technology. Work with what's available. The experience is more valuable than the shortcut.
3. 1,500 AWS Accounts Aren't an Infrastructure Problem. They're an Organizational Problem.
Siemens operates over 1,500 AWS accounts. Cross-account access, different security policies, teams with varying compliance requirements. Technically solvable. Organizationally challenging.
Most of the problems we had to solve weren't code problems. They were governance problems: Who is allowed to access which data? How do you ensure an AI agent in Account A can read data from Account B without violating security policies? How do you deploy updates across hundreds of accounts without breaking existing workflows?
Our solution: Infrastructure as Code with CDK, cross-account roles with least-privilege permissions, and a deployment pipeline that treats every account as an independent unit. No account blindly trusts another.
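As an illustration of "no account blindly trusts another," here is roughly what the trust and permission policies for such a cross-account role look like. The account ID, role name, external ID, and bucket path are all placeholders, not our real values:

```python
PLATFORM_ACCOUNT = "111111111111"  # placeholder account ID

# Trust policy: only the platform's deployment role may assume this role,
# and only when it presents the agreed external ID.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "AWS": f"arn:aws:iam::{PLATFORM_ACCOUNT}:role/platform-deploy"
        },
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "platform-xaccount"}},
    }],
}

# Permission policy: least privilege means one action on one prefix,
# not s3:* on the whole account.
permissions = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::team-b-data/agent-readable/*",
    }],
}
```

With CDK, policies like these are generated per account from code, which is what makes deploying across hundreds of accounts tractable in the first place.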
Takeaway: At enterprise scale, 80% of the challenges are organizational, not technical. Plan for that.
4. The No-Code Agent Builder Works.
We were skeptical. Can employees without technical know-how really create useful AI agents? Will the agents be good enough? Will the support overhead eat up the productivity gains?
The answer after 10,000+ agents created: Yes, it works. Better than expected.
The best agents don't come from us. They come from business units that know their processes better than any consulting team. HR creates onboarding agents. Finance creates reporting agents. Engineering creates code review agents. Each with domain knowledge and a clear problem.
The key: The platform must abstract complexity without losing power. Guardrails instead of restrictions. Templates instead of blank pages. Monitoring instead of trust.
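A toy sketch of that principle. The template fields and guardrail check are hypothetical, but they show the division of labor: the domain expert fills in the prompt, the platform enforces the boundaries:

```python
from dataclasses import dataclass, field

@dataclass
class AgentTemplate:
    """A pre-filled starting point instead of a blank page.
    The creator edits the prompt; the platform enforces the guardrails."""
    name: str
    system_prompt: str
    allowed_tools: list[str] = field(default_factory=list)  # guardrail, not restriction
    max_tokens_per_reply: int = 1000

# Hypothetical example of what an HR team might configure.
ONBOARDING = AgentTemplate(
    name="onboarding-assistant",
    system_prompt="You answer new-hire questions using the linked HR documents.",
    allowed_tools=["search_hr_docs"],
)

def tool_allowed(template: AgentTemplate, requested_tool: str) -> bool:
    # Monitoring instead of trust: out-of-scope requests are denied (and logged).
    return requested_tool in template.allowed_tools
```

The no-code creator never sees this layer. They see a form; the platform sees an enforceable contract.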
Takeaway: The most valuable AI applications aren't built by AI experts, but by domain experts with the right tools.
5. Observability at 50 Models Is a Problem of Its Own
Beyond a certain scale, standard monitoring no longer suffices. 50 models from five providers, each with its own latency profile, rate limits, and error patterns. Hundreds of thousands of messages per day. Thousands of agents created by business units.
Every provider behaves differently under load. Claude slows down, GPT returns 429s, Gemini has different timeout patterns. When a provider degrades, the platform must detect it and reroute requests. Not after minutes. After seconds.
We treated observability as an architectural component from day one. Every model request is logged: latency, token consumption, error rate, cost. Per model, per agent, per user. This gives us a real-time picture of the entire platform and the foundation for automatic failover between providers.
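A simplified sketch of how a sliding window over live requests can drive failover in seconds rather than minutes. The window size and error threshold are illustrative, and a real system would ship these metrics to a monitoring backend rather than keep them in memory:

```python
from collections import defaultdict, deque

WINDOW = 50            # recent requests to keep per provider
ERROR_THRESHOLD = 0.2  # reroute when >20% of recent requests fail (illustrative)

# provider -> sliding window of request outcomes (True = success)
recent: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))

def record(provider: str, ok: bool, latency_ms: float, tokens: int, cost: float):
    # In production, latency, tokens, and cost would also be shipped to the
    # metrics backend, tagged per model, per agent, per user.
    recent[provider].append(ok)

def healthy(provider: str) -> bool:
    window = recent[provider]
    if not window:
        return True  # no data yet: assume healthy
    return window.count(False) / len(window) <= ERROR_THRESHOLD

def pick_provider(preferred: str, fallbacks: list[str]) -> str:
    """Route to the preferred provider while it is healthy; because health is
    computed from live requests, degradation is detected within seconds."""
    for candidate in [preferred, *fallbacks]:
        if healthy(candidate):
            return candidate
    return preferred  # everyone degraded: stay put rather than flap
```

The same per-request records that power dashboards also power the routing decision; that's what it means for observability to be an architectural component rather than an ops add-on.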
Takeaway: In a multi-model platform, observability isn't an ops concern. It's an architecture concern. If you run 50 models, you need monitoring that understands the differences between providers.
What We Learned From All This
Three principles that are now our standard:
- Observability from day one. Monitoring is part of every architecture from the start. Retrofitting it always costs more.
- Standards before custom. Custom frameworks only when there's no alternative. And as soon as native APIs are available, migrate.
- Forward-deployed engineering. The best results happen when our engineers work directly with business units. Together, on the same problem.
The One Thing That Remains
50 AI models in production have taught me one thing: The technology changes every few months. What remains is the ability to build large, secure, scalable systems.
Systems engineering. That's our core. AI is the tool. But the craft behind it is the same as it was 20 years ago: architecture, decisions, accountability.
We think in possibilities and make them real. And that won't change, no matter which model comes next.
