Workload Identity

Why Service Accounts Are the New Attack Surface

David Goldschlag · Oct 12, 2023

Pull up your Kubernetes cluster's service account list. Count them. Now ask yourself: how many of those were created by engineers who no longer work at your company? How many were scoped for a one-off migration that ran in Q3 of last year? How many have never been audited against what they actually need?

We spend an enormous amount of time and tooling hardening human identity: MFA, SSO, privileged access workstations, session recording. The average service account gets none of that. Too often it gets a long-lived token written into a Kubernetes Secret, mounted into a pod, and forgotten. We've watched this pattern replicate across every team we've talked to while building Aembit, and the gap between how we treat human identity and how we treat non-human identity is striking.

The Math Is Worse Than You Think

Consider a medium-sized microservice deployment running 30-50 services. Each service typically needs: its own Kubernetes service account, credentials to reach a database or cache, credentials to call downstream services, and credentials to push metrics or logs to an observability platform. That's conservatively 3-4 non-human credentials per service, before you add CI/CD pipeline secrets, third-party API keys, and inter-cluster identities.
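To make that arithmetic concrete, here is a minimal sketch that multiplies out the ranges above. The function name is ours and purely illustrative; the only inputs are the figures already quoted in the paragraph.

```python
# Back-of-the-envelope credential count for the deployment described above.
# The ranges come from the text; the function name is illustrative.

def credential_range(services_low, services_high, per_service_low, per_service_high):
    """Return the (minimum, maximum) number of non-human credentials."""
    return services_low * per_service_low, services_high * per_service_high


low, high = credential_range(30, 50, 3, 4)
print(f"{low} to {high} credentials before CI/CD secrets and third-party API keys")
```

Even the low end of that range is more credentials than most teams could list from memory.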

A count of 40 or more service accounts in a deployment this size isn't theoretical. In the deployments we've examined while scoping Aembit's integrations, the count runs higher. One Kubernetes-heavy deployment we analyzed had 67 service accounts across two clusters, and roughly a third of them had RBAC bindings that nobody on the current team could explain from memory.

The problem compounds because service accounts are cheap to create and expensive to audit. When an engineer needs a new service account to test something, they create one. When that engineer leaves, nobody deletes it. When the test becomes production code, the credentials get copied into CI. When the permissions turn out to be too narrow, someone broadens them rather than scoping them precisely. Each of these decisions is individually rational. Collectively, they produce an ever-expanding surface of dormant, over-privileged machine identities.

What Your SIEM Actually Sees

Here's the asymmetry that makes this hard to track: SIEMs are optimized for detecting anomalous human behavior, such as lateral movement after a credential compromise, unusual login times, and geographic impossibilities. The signal model for human identity is built around detecting deviation from a baseline.

Non-human identities don't behave that way. A service account that connects to your PostgreSQL instance at 2:17 AM every night isn't anomalous — it's probably your backup job. A service account that makes 10,000 API calls per hour isn't suspicious — it's probably your event processing pipeline. The traffic patterns for machine-to-machine authentication look completely different from human-generated traffic, and most SIEM rules aren't tuned to detect stale service accounts being used for lateral movement within that traffic.

The typical attack pattern doesn't look like a breach from the outside at all. An attacker with access to one compromised pod can use the mounted service account token to query the Kubernetes API, discover other services and their endpoints, read whatever Secrets and resources the over-provisioned RBAC on that service account allows, and then call those services and move through the environment without generating the kinds of alerts your security team is watching for.

We're not saying SIEMs are wrong for what they do. We're saying that the non-human identity attack surface operates on different principles and requires different detection logic — which means it tends to stay invisible until an incident exposes it.
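As one illustration of what different detection logic could look like, the sketch below keeps a per-account record of which API actions each service account has performed and flags anything new. It assumes events shaped like Kubernetes audit log entries (`user.username`, `verb`, `objectRef`). The function names and sample events are hypothetical, and a real rule would need a longer history and some tolerance for deploys.

```python
# Sketch: flag service-account API activity that falls outside what each
# account has done before. Sample events are made up for illustration.

SA_PREFIX = "system:serviceaccount:"


def action_key(event):
    """Reduce an audit event to (verb, resource, namespace)."""
    ref = event.get("objectRef") or {}
    return (event.get("verb"), ref.get("resource"), ref.get("namespace"))


def build_baseline(events):
    """Map each service account to the set of actions it has performed."""
    baseline = {}
    for event in events:
        user = (event.get("user") or {}).get("username", "")
        if user.startswith(SA_PREFIX):
            baseline.setdefault(user, set()).add(action_key(event))
    return baseline


def find_novel_actions(baseline, events):
    """Return (username, action) pairs that are absent from the baseline."""
    novel = []
    for event in events:
        user = (event.get("user") or {}).get("username", "")
        if not user.startswith(SA_PREFIX):
            continue
        key = action_key(event)
        if key not in baseline.get(user, set()):
            novel.append((user, key))
    return novel


def make_event(account, verb, resource, namespace):
    """Build a minimal audit-style event for the demo below."""
    return {
        "user": {"username": SA_PREFIX + account},
        "verb": verb,
        "objectRef": {"resource": resource, "namespace": namespace},
    }


history = [make_event("batch:backup", "get", "configmaps", "batch")] * 3
today = [
    make_event("batch:backup", "get", "configmaps", "batch"),   # routine
    make_event("batch:backup", "list", "secrets", "payments"),  # never seen before
]

for user, action in find_novel_actions(build_baseline(history), today):
    print(f"review: {user} performed {action}")
```

The baseline here is per identity rather than per time of day or volume, which is why the 2:17 AM backup job stays quiet while a first-ever read of Secrets in another namespace does not.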

Why RBAC Alone Doesn't Fix This

The standard advice for Kubernetes service account hygiene is: use RBAC, apply least privilege, review regularly. This is correct advice. It also fails in practice, for predictable reasons.

First, least-privilege RBAC requires knowing exactly what a workload needs at the moment of creation, and maintaining that knowledge through subsequent changes. Service behavior changes over time. Permissions get broadened to fix an immediate problem and never tightened back. RBAC policies accumulate technical debt the same way application code does, except that a linter can only go so far: it can flag a ClusterRole that grants get on all secrets, but it has no way of knowing that the service only needs one specific secret.
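A basic version of that check is easy to sketch. The example below assumes input shaped like the output of `kubectl get clusterroles -o json` and flags any rule that grants read access to Secrets without a `resourceNames` restriction. The role names are made up. It cannot tell you which secret the service actually needs, only that the grant is unscoped.

```python
# Sketch: flag RBAC rules that can read every Secret in scope.
# The sample roles are hypothetical.

READ_VERBS = {"get", "list", "watch", "*"}


def broad_secret_rules(roles):
    """Yield (role name, rule) for rules granting unscoped read on secrets."""
    for role in roles.get("items", []):
        for rule in role.get("rules") or []:
            groups = set(rule.get("apiGroups") or [])
            resources = set(rule.get("resources") or [])
            verbs = set(rule.get("verbs") or [])
            in_core_group = bool(groups & {"", "*"})  # Secrets live in the core group
            touches_secrets = bool(resources & {"secrets", "*"})
            can_read = bool(verbs & READ_VERBS)
            unscoped = not rule.get("resourceNames")
            if in_core_group and touches_secrets and can_read and unscoped:
                yield role["metadata"]["name"], rule


roles = {
    "items": [
        {
            "metadata": {"name": "payments-reader"},
            "rules": [{"apiGroups": [""], "resources": ["secrets"], "verbs": ["get"]}],
        },
        {
            "metadata": {"name": "payments-scoped"},
            "rules": [
                {
                    "apiGroups": [""],
                    "resources": ["secrets"],
                    "verbs": ["get"],
                    "resourceNames": ["payments-db"],
                }
            ],
        },
    ]
}

for name, rule in broad_secret_rules(roles):
    print(f"{name}: unscoped read on secrets via verbs {rule['verbs']}")
```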

Second, RBAC controls what a workload is allowed to do within the Kubernetes API. It doesn't help with the much larger category of credentials that service accounts carry to reach external systems: database passwords, third-party API tokens, cloud provider access keys. Those live outside RBAC entirely, typically as environment variables or mounted secrets, with no equivalent of RBAC scoping or automatic expiration.
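One way to start mapping that larger category is to list which Secrets each pod pulls in through environment variables and volume mounts. The sketch below assumes a pod object shaped like the output of `kubectl get pod -o json`. The pod and Secret names are hypothetical, and it only sees credentials that flow through Kubernetes Secrets, not values baked into images or pasted into plain env vars.

```python
# Sketch: list the Secrets a pod consumes. Sample pod is made up.


def secrets_used_by_pod(pod):
    """Return sorted names of Secrets referenced by env vars, envFrom, or volumes."""
    spec = pod.get("spec") or {}
    names = set()
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for container in containers:
        for env in container.get("env") or []:
            ref = (env.get("valueFrom") or {}).get("secretKeyRef")
            if ref:
                names.add(ref["name"])
        for source in container.get("envFrom") or []:
            ref = source.get("secretRef")
            if ref:
                names.add(ref["name"])
    for volume in spec.get("volumes") or []:
        secret = volume.get("secret")
        if secret:
            names.add(secret["secretName"])
    return sorted(names)


pod = {
    "metadata": {"name": "orders-7d9f", "namespace": "shop"},
    "spec": {
        "containers": [
            {
                "name": "orders",
                "image": "registry.example.org/orders:1.0",
                "env": [
                    {
                        "name": "DB_PASSWORD",
                        "valueFrom": {"secretKeyRef": {"name": "orders-db", "key": "password"}},
                    },
                    {"name": "LOG_LEVEL", "value": "info"},
                ],
                "envFrom": [{"secretRef": {"name": "payments-api-key"}}],
            }
        ],
        "volumes": [
            {"name": "cloud", "secret": {"secretName": "aws-access-key"}},
            {"name": "tmp", "emptyDir": {}},
        ],
    },
}

print(secrets_used_by_pod(pod))
```

None of those three references carries an expiry or a scope. That gap is the point of the paragraph above.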

Third, RBAC gives you authorization controls without identity verification. If you have two different pods — one running a legitimate payment processor and one running a compromised dependency — and both are assigned to the same service account (which happens more than you'd think when teams copy deployment configs), RBAC can't distinguish between them. Both get the same permissions regardless of which code is actually running.
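To find out how often this happens in your own cluster, a rough check is to group running pods by service account and look for accounts shared by pods running different images. The sketch assumes input shaped like `kubectl get pods --all-namespaces -o json`; the names are hypothetical. It compares full image strings, so a rolling update between two tags of the same image will also show up and needs a human look.

```python
# Sketch: find service accounts shared by pods that run different images.
from collections import defaultdict


def shared_service_accounts(pods):
    """Map (namespace, account) to distinct per-pod image sets, keeping accounts with more than one."""
    image_sets = defaultdict(set)
    for pod in pods.get("items", []):
        namespace = pod["metadata"]["namespace"]
        # Pods without an explicit account run as the namespace's "default" account.
        account = pod["spec"].get("serviceAccountName", "default")
        images = tuple(sorted(c["image"] for c in pod["spec"].get("containers", [])))
        image_sets[(namespace, account)].add(images)
    return {key: sorted(sets) for key, sets in image_sets.items() if len(sets) > 1}


def make_pod(namespace, account, image):
    """Build a minimal pod object for the demo below."""
    return {
        "metadata": {"namespace": namespace},
        "spec": {"serviceAccountName": account, "containers": [{"image": image}]},
    }


pods = {
    "items": [
        make_pod("payments", "processor", "registry.example.org/processor:1.4"),
        make_pod("payments", "processor", "registry.example.org/report-export:0.9"),
        make_pod("payments", "ledger", "registry.example.org/ledger:2.0"),
    ]
}

for (namespace, account), sets in shared_service_accounts(pods).items():
    print(f"{namespace}/{account} is shared by {len(sets)} different workloads")
```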

The Lifecycle Problem

Human identity has a well-understood lifecycle: hire date triggers account creation, role changes trigger access changes, termination triggers deprovisioning. HR systems, identity providers, and provisioning tools handle most of this automatically. Offboarding a person reliably removes their access within hours, not weeks.

Service accounts have no equivalent lifecycle. They're created when a service is built, sometimes documented in a runbook, often not. They're modified when something breaks. They're never deprovisioned because nobody wants to be responsible for breaking production by removing a credential that turns out to still be in use.

The practical consequence: service accounts accumulate. In a deployment that's been running for two years, you'll find service accounts created for services that were deprecated in the first six months. You'll find credentials scoped to third-party APIs that the company stopped using. You'll find database credentials for databases that were migrated away from. All of these represent real, valid credentials that a compromised host could potentially use.

The fix isn't just better offboarding checklists — it's making non-human credentials ephemeral by design, so that a stale service account becomes meaningless because it can't produce a valid credential without passing current attestation. That's the core principle behind what we're building: a workload proves its identity every time it needs access, rather than carrying a static credential that remains valid indefinitely.

Attestation vs. Authentication

There's a useful distinction worth drawing here. Authentication answers "are you who you claim to be?" Attestation answers "is what's running here actually what we expect to be running here?"

Traditional service account tokens do weak authentication at best. A token presented to the Kubernetes API proves that someone has the bytes of that token. It doesn't prove that the workload presenting the token is the expected code, running in the expected environment, with expected dependencies. A compromised pod with a stolen token looks identical to a legitimate pod with the same token.

SPIFFE-based workload identity starts from attestation: the identity is derived from facts about the running environment (what image is this? what namespace? what service account name? what node is it running on?) rather than from possession of a secret. An identity document, called an SVID, is issued by a trusted authority (SPIRE) to a workload that passes attestation. The SVID has a short TTL, typically on the order of an hour, so even if it's stolen, it's useful only briefly. And because the identity is derived from attestation rather than from a static token, you can't mint a valid identity document just by copying bytes from one context to another.
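The shape of that flow can be sketched in a few lines. This is a toy model, not SPIRE's API: the registration entry, selector names, and functions are all hypothetical, and a real SVID is a signed X.509 certificate or JWT, which this sketch leaves out.

```python
# Toy model of attestation-based issuance. Not SPIRE's API; unsigned for brevity.
import time

TTL_SECONDS = 3600  # about an hour, in line with the TTL mentioned above

# Hypothetical registration: a workload must match every selector to get the ID.
REGISTRATIONS = [
    {
        "spiffe_id": "spiffe://example.org/ns/payments/sa/processor",
        "selectors": {
            "namespace": "payments",
            "service_account": "processor",
            "image": "registry.example.org/processor@sha256:abc123",
        },
    },
]


def attest(observed, now=None):
    """Return a short-lived identity document if the observed facts match an entry."""
    now = time.time() if now is None else now
    for entry in REGISTRATIONS:
        if all(observed.get(k) == v for k, v in entry["selectors"].items()):
            return {"spiffe_id": entry["spiffe_id"], "expires_at": now + TTL_SECONDS}
    return None  # facts do not match: no identity is issued


def is_valid(document, now=None):
    """A document is usable only if it exists and has not expired."""
    now = time.time() if now is None else now
    return document is not None and now < document["expires_at"]


legit = {
    "namespace": "payments",
    "service_account": "processor",
    "image": "registry.example.org/processor@sha256:abc123",
}
tampered = dict(legit, image="registry.example.org/processor@sha256:evil00")

doc = attest(legit, now=0)
print(is_valid(doc, now=60))           # True: fresh document
print(is_valid(doc, now=2 * 3600))     # False: expired after the TTL
print(attest(tampered, now=0))         # None: same account name, wrong image
```

The tampered workload uses the same namespace and service account name as the legitimate one. It is refused because the image fact does not match, which is the distinction a bearer token cannot make.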

We're not saying attestation-based identity is a complete replacement for good RBAC hygiene or secret scanning. We're saying that if the root cause of the service account problem is that static credentials persist indefinitely and can be used without proof of the original context, then the structural fix is to eliminate static credentials in favor of ephemeral, context-bound identity — not to add more process around managing the static credentials.

What to Audit Today

Before re-architecting anything, it's worth understanding the scope of what you have. A useful starting audit for Kubernetes environments:

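# List every service account with its namespace and creation time, oldest first.
kubectl get serviceaccounts --all-namespaces -o json \
  | jq '.items | sort_by(.metadata.creationTimestamp) | .[] | {name: .metadata.name, namespace: .metadata.namespace, created: .metadata.creationTimestamp}'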

Working from the oldest entries, identify any service account older than six months. For each one, check: is there a running workload that references it? What RBAC roles are bound to it? Are there Secrets associated with it that, judging by your API server audit logs (if your audit policy records Secret reads), nothing has read recently?
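The first of those checks can be automated. The sketch below cross-references service accounts against running pods and lists accounts that are older than the cutoff and referenced by nothing. It assumes inputs shaped like `kubectl get serviceaccounts -A -o json` and `kubectl get pods -A -o json`; the function names and sample data are hypothetical. Treat the output as candidates for review rather than a delete list, since CronJobs and scaled-to-zero workloads have no running pod at the moment you look.

```python
# Sketch: service accounts older than a cutoff that no running pod references.
from datetime import datetime, timedelta, timezone


def parse_timestamp(value):
    """Parse a Kubernetes timestamp such as 2023-01-15T08:30:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_service_accounts(accounts, pods, now, max_age_days=180):
    """Return sorted (namespace, name) pairs that are old and unreferenced."""
    in_use = {
        (p["metadata"]["namespace"], p["spec"].get("serviceAccountName", "default"))
        for p in pods.get("items", [])
    }
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for account in accounts.get("items", []):
        meta = account["metadata"]
        key = (meta["namespace"], meta["name"])
        if parse_timestamp(meta["creationTimestamp"]) < cutoff and key not in in_use:
            stale.append(key)
    return sorted(stale)


def make_account(namespace, name, created):
    """Build a minimal service account object for the demo below."""
    return {"metadata": {"namespace": namespace, "name": name, "creationTimestamp": created}}


accounts = {
    "items": [
        make_account("batch", "migration-2022", "2022-09-01T00:00:00Z"),  # old, unused
        make_account("shop", "orders", "2022-01-10T00:00:00Z"),           # old, in use
        make_account("shop", "new-svc", "2023-09-20T00:00:00Z"),          # recent
    ]
}
pods = {"items": [{"metadata": {"namespace": "shop"}, "spec": {"serviceAccountName": "orders"}}]}
now = datetime(2023, 10, 12, tzinfo=timezone.utc)

for namespace, name in stale_service_accounts(accounts, pods, now):
    print(f"candidate for review: {namespace}/{name}")
```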

For non-Kubernetes credentials — database passwords, API keys, cloud access keys — the same logic applies. When was this credential last rotated? Who owns it? Is there a runbook for what to do if it's compromised? The absence of good answers to these questions maps directly to the attack surface.
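Those three questions translate directly into a check you can run over whatever inventory you keep. The sketch assumes a hand-maintained list of records with `last_rotated`, `owner`, and `runbook` fields. The field names, the sample records, and the 90-day threshold are our assumptions for illustration, not a standard.

```python
# Sketch: turn the three audit questions into findings per credential.
from datetime import date


def audit_findings(credential, today, max_age_days=90):
    """Return a list of human-readable gaps for one credential record."""
    findings = []
    last_rotated = credential.get("last_rotated")
    if last_rotated is None:
        findings.append("rotation date unknown")
    elif (today - last_rotated).days > max_age_days:
        findings.append(f"not rotated in {(today - last_rotated).days} days")
    if not credential.get("owner"):
        findings.append("no owner")
    if not credential.get("runbook"):
        findings.append("no compromise runbook")
    return findings


inventory = [
    {
        "name": "orders-db password",
        "last_rotated": date(2023, 1, 1),
        "owner": None,
        "runbook": "wiki/orders-db-rotation",
    },
    {
        "name": "metrics API key",
        "last_rotated": date(2023, 6, 15),
        "owner": "platform-team",
        "runbook": "wiki/metrics-key",
    },
]

today = date(2023, 7, 1)
for credential in inventory:
    for finding in audit_findings(credential, today):
        print(f"{credential['name']}: {finding}")
```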

The goal of the audit isn't necessarily to delete everything immediately — it's to make the surface visible. The service account problem persists partly because it's invisible. Engineers know about the service accounts they created. Nobody has a comprehensive view of the full inventory, and without that view, it's impossible to reason about risk.

Moving Toward Machine-Speed Rotation

The long-term direction for non-human identity management is the same as where human identity has been heading for the past decade: eliminate the conditions that make long-lived credentials dangerous.

For humans, that meant moving from passwords to sessions, from sessions to short-lived tokens, from short-lived tokens to cryptographic hardware keys. The common thread is reducing the window of exposure when a credential is compromised.

For workloads, the equivalent is eliminating stored secrets in favor of credentials that are issued at runtime, scoped to the specific request or session, and expire automatically. OIDC federation with cloud providers (AWS STS web identity roles, GCP Workload Identity Federation, Azure workload identity federation) moves in this direction. SPIFFE/SPIRE does this for service-to-service identity within a cluster or across clusters. What's missing in most environments is a policy layer that can evaluate whether the workload requesting credentials is authorized to have them at that moment, not just whether it knows the right secret.

That policy layer is what we're building at Aembit. Every access request carries attestation about the requesting workload, evaluated against access policies that can be reviewed and updated independently of the workload code itself. When an engineer leaves your company and their service gets decommissioned, the access policy for that service gets removed — and any remaining code that tries to use that service account gets denied, not silently permitted.
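As a generic illustration of that idea (not Aembit's actual policy model or API), the sketch below evaluates an access request against a policy table with a default of deny. The policy keys, attestation fields, and function name are hypothetical.

```python
# Sketch: default-deny policy evaluation keyed on workload identity and target.
# Illustrative only; names and structure are assumptions.


def evaluate(request, policies):
    """Allow only when a policy exists and the attested facts satisfy it."""
    conditions = policies.get((request["workload"], request["target"]))
    if conditions is None:
        return "deny"  # no policy means no credential, regardless of what the caller holds
    attested = request["attestation"]
    if all(attested.get(key) == value for key, value in conditions.items()):
        return "allow"
    return "deny"


policies = {
    ("spiffe://example.org/ns/payments/sa/processor", "postgres://payments-db"): {
        "namespace": "payments",
    },
}

request = {
    "workload": "spiffe://example.org/ns/payments/sa/processor",
    "target": "postgres://payments-db",
    "attestation": {"namespace": "payments"},
}

print(evaluate(request, policies))  # allow

# Decommissioning the service removes its policy; leftover callers are denied.
policies.pop((request["workload"], request["target"]))
print(evaluate(request, policies))  # deny
```

The design choice that matters is the first branch: access is a property of the current policy set, so removing a policy revokes access without anyone having to hunt down and delete a credential.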

The service account problem is solvable. It requires treating non-human identity with the same lifecycle discipline and policy rigor that we apply to human identity. The tools to do that are getting more mature. The patterns are established. The remaining gap is mostly organizational: getting security and platform teams to treat the service account audit as a first-class security concern, not an infrastructure maintenance task.