Platform Team Identity Operations Guide

Maya Chen · 15 min read

This is a practical guide for platform teams who own the workload identity migration problem — moving from static credentials (service account passwords, API keys, long-lived tokens) to short-lived, workload-attested identity. It's not a vision document. It's a runbook: specific steps, sequencing rationale, and how to measure whether the work is having the intended effect.

We wrote this from what we've learned deploying Aembit in environments ranging from 20-service monorepo-style platforms to 200+ microservice architectures. The process is similar regardless of scale; the timelines differ.

Phase 0: Inventory Before You Migrate Anything

The most common mistake is starting migration before completing inventory. You migrate the services you know about, discover there are twice as many credential types as you thought, and have to rework your migration plan mid-flight.

Spend two to four weeks on inventory before touching a single credential. The output is a credential register: a table with one row per credential that a non-human workload uses to authenticate to another system.

Inventory Sources

Pull from all of these:

  • Secrets manager exports: AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager — list all secrets, filter to those accessed by non-human principals. Note last-accessed date, accessing principal, and creating principal.
  • IAM credential reports: The AWS IAM credential report lists each IAM user's access keys with their status, last-used date, and last-rotated date. GCP IAM has equivalent service account key inventories.
  • Kubernetes Secrets: kubectl get secrets -A --field-selector type=Opaque lists Opaque secrets across all namespaces. Examine which workloads mount them as volumes or reference them in environment variables.
  • Deployment configurations: Helm values files, ECS task definitions, Lambda environment variable configs. Search for patterns that suggest credentials: _KEY, _SECRET, _PASSWORD, _TOKEN suffixes.
  • CI/CD platforms: GitHub Actions secrets, GitLab CI variables, CircleCI contexts. These are often overlooked and frequently hold long-lived cloud provider credentials.
  • Application code: Static analysis with gitleaks or trufflehog on your git history. You'll find credentials that were committed at some point even if they've since been moved to a secrets manager.
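The suffix search over deployment configurations is easy to automate once the files are parsed. The sketch below is illustrative rather than a complete scanner: it assumes the config has already been loaded into Python dicts and lists (for example from a Helm values file), and find_credential_keys and sample_values are hypothetical names.

```python
import re

# Suffixes taken from the inventory list above; matching is case-insensitive.
CREDENTIAL_SUFFIX = re.compile(r"(_KEY|_SECRET|_PASSWORD|_TOKEN)$", re.IGNORECASE)


def find_credential_keys(config, prefix=""):
    """Return dotted paths of keys whose names look like credentials."""
    hits = []
    if isinstance(config, dict):
        for key, value in config.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            if CREDENTIAL_SUFFIX.search(str(key)):
                hits.append(path)
            hits.extend(find_credential_keys(value, path))
    elif isinstance(config, list):
        for index, item in enumerate(config):
            hits.extend(find_credential_keys(item, f"{prefix}[{index}]"))
    return hits


# Hypothetical Helm-style values, already parsed into Python objects.
sample_values = {
    "image": {"tag": "1.4.2"},
    "env": {"STRIPE_API_KEY": "redacted", "LOG_LEVEL": "info"},
    "sidecars": [{"env": {"db_password": "redacted"}}],
}
print(find_credential_keys(sample_values))
```

Name matching only finds candidates. Each hit still needs a human to confirm it is a credential and to identify the owning team.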

For each credential found, record: credential type, identifier, owner team, accessing workload, accessed system, creation date (if known), last rotation date (if any), rotation policy, and whether rotation is automated or manual.
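One way to keep register rows consistent is to give them a fixed shape. This is a minimal sketch, assuming the register lives in code or is exported to it; the class name, field names, and example values are hypothetical and simply mirror the fields listed above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CredentialRecord:
    credential_type: str
    identifier: str
    owner_team: str
    accessing_workload: str
    accessed_system: str
    creation_date: Optional[date]       # None when unknown
    last_rotation_date: Optional[date]  # None when never rotated
    rotation_policy: Optional[str]      # None when no policy exists
    rotation_automated: bool

    def days_since_rotation(self, today: date) -> Optional[int]:
        """Days since last rotation, falling back to the creation date."""
        reference = self.last_rotation_date or self.creation_date
        return (today - reference).days if reference else None


# Hypothetical row for a never-rotated API key.
example = CredentialRecord(
    credential_type="api_key",
    identifier="orders-to-inventory",
    owner_team="fulfilment",
    accessing_workload="orders-service",
    accessed_system="inventory-api",
    creation_date=date(2023, 1, 10),
    last_rotation_date=None,
    rotation_policy=None,
    rotation_automated=False,
)
print(example.days_since_rotation(date(2024, 1, 10)))  # 365
```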

This inventory is your baseline. Every metric you'll use to measure migration progress starts from this register.

Phase 1: Triage and Prioritization

Not all credentials are equal. After inventory, run each credential through a priority score based on two factors: sensitivity of the accessed system, and rotation risk.

Sensitivity tiers

  • Critical: Credentials accessing secrets managers, key management systems, production databases containing PII, payment processors, identity providers
  • High: Credentials accessing cloud provider APIs with write permissions, production data stores, customer-facing services
  • Medium: Credentials accessing internal services with read-write permissions, analytics systems, non-production data
  • Low: Read-only access to internal services, non-sensitive analytics, logging and monitoring endpoints

Rotation risk

  • High rotation risk: No rotation policy, last rotated more than 180 days ago, manual rotation only, multiple services sharing the same credential
  • Medium rotation risk: Rotation policy exists but manual, rotated within 90 days, single service using the credential
  • Low rotation risk: Automated rotation in place, rotated within 30 days

Migrate credentials in order of: Critical+High Risk first, then High+High Risk, then Critical+Low Risk, and so on down the matrix. This prioritization ensures you're addressing the most dangerous combinations first — credentials with broad access that rarely rotate are the ones most likely to be involved in a breach.
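The matrix ordering can be expressed as a sort key. The numeric weights below are an assumption made for illustration (the guide only defines the tiers), chosen so that Critical+High sorts first, then High+High, then Critical+Low, with ties broken by sensitivity. The credential names are hypothetical.

```python
# Assumed weights; only the tier names come from the guide.
SENSITIVITY = {"critical": 4, "high": 3, "medium": 2, "low": 1}
ROTATION_RISK = {"high": 3, "medium": 2, "low": 1}


def priority_score(sensitivity, rotation_risk):
    """Higher score means migrate sooner."""
    return SENSITIVITY[sensitivity] + ROTATION_RISK[rotation_risk]


def migration_order(credentials):
    """Sort by score (descending), breaking ties by sensitivity (descending)."""
    return sorted(
        credentials,
        key=lambda c: (
            -priority_score(c["sensitivity"], c["rotation_risk"]),
            -SENSITIVITY[c["sensitivity"]],
        ),
    )


register = [
    {"id": "ci-deploy-key", "sensitivity": "high", "rotation_risk": "high"},
    {"id": "metrics-reader", "sensitivity": "low", "rotation_risk": "medium"},
    {"id": "vault-root", "sensitivity": "critical", "rotation_risk": "high"},
    {"id": "billing-db", "sensitivity": "critical", "rotation_risk": "low"},
]
ordered_ids = [c["id"] for c in migration_order(register)]
print(ordered_ids)
```

A simple additive score is easy to explain to the teams whose credentials are being ranked. If your environment needs sensitivity to dominate, adjust the weights rather than the sort.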

Phase 2: Establish the Identity Foundation

Before migrating individual credentials, you need the infrastructure that short-lived tokens depend on: a trust anchor per environment, an OIDC discovery endpoint, and a token exchange service that validates platform identity assertions and issues scoped tokens.

Kubernetes environments

Kubernetes already provides OIDC-signed service account tokens via the projected volume mechanism. Enable the API server's --service-account-issuer flag if it's not already set, and configure your OIDC discovery endpoint:

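# --api-audiences controls which audiences the API server itself accepts.
# Keep it at the issuer (the default); the Aembit audience is requested
# per token in the projected volume, not here.
kube-apiserver \
  --service-account-issuer=https://k8s.internal/cluster-1 \
  --service-account-jwks-uri=https://k8s.internal/cluster-1/openid/v1/jwks \
  --api-audiences=https://k8s.internal/cluster-1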

Configure Pods to request service account tokens with a specific audience:

volumes:
- name: workload-identity-token
  projected:
    sources:
    - serviceAccountToken:
        audience: "https://aembit.internal"
        expirationSeconds: 900
        path: token

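Assuming the volume is mounted at /var/run/secrets/workload-identity, the token at /var/run/secrets/workload-identity/token is a 15-minute OIDC JWT signed by the Kubernetes API server, with the pod's service account as the subject. The kubelet rotates the file before it expires, so read it fresh for each exchange rather than caching it at startup. This is the platform credential that gets exchanged for resource-specific tokens.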
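When debugging the exchange, it helps to look at the claims the token actually carries. The sketch below decodes the payload for inspection only and does not verify the signature, so it must never be used to make an access decision. The sample token is a stand-in built in code; the namespace and service account in its sub claim are hypothetical.

```python
import base64
import json


def decode_claims(jwt: str) -> dict:
    """Decode a JWT payload for inspection. Does NOT verify the signature."""
    payload_b64 = jwt.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def _segment(obj: dict) -> str:
    """Encode a dict the way JWT segments are encoded (base64url, no padding)."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


# Stand-in token with hypothetical claims; a real one is read from the mounted file.
sample_token = ".".join([
    _segment({"alg": "RS256", "typ": "JWT"}),
    _segment({
        "iss": "https://k8s.internal/cluster-1",
        "sub": "system:serviceaccount:payments:checkout",
        "aud": ["https://aembit.internal"],
        "iat": 1700000000,
        "exp": 1700000900,
    }),
    "signature-placeholder",
])

claims = decode_claims(sample_token)
print(claims["sub"], claims["exp"] - claims["iat"])
```

The three things worth checking are that iss matches the configured issuer, aud contains the audience requested in the projected volume, and exp minus iat equals expirationSeconds.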

ECS and Lambda environments

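ECS task roles and Lambda execution roles provide the equivalent platform identity. ECS tasks read temporary STS credentials for the task role from the container credentials endpoint (the path in AWS_CONTAINER_CREDENTIALS_RELATIVE_URI), and Lambda functions receive execution-role credentials as environment variables. In both cases the AWS SDK resolves them automatically. These credentials can be exchanged for OIDC tokens through a token exchange endpoint that accepts proof of the STS identity as the subject token, for example a signed sts:GetCallerIdentity request that the endpoint forwards to AWS for verification.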

Phase 3: Migration Wave Structure

Run migrations in waves rather than all at once. Each wave covers one credential type or one system. Validate the wave before starting the next one.

Wave 1: CI/CD credentials (2-4 weeks)

GitHub Actions OIDC federation with AWS and GCP is well-supported and has good documentation. This wave eliminates some of the highest-risk credentials in most environments: IAM access keys used by CI pipelines that have broad permissions to deploy infrastructure and application code.

The migration for each pipeline: replace the static credential secret with an OIDC federation configuration, update the workflow to use the id-token: write permission, and test. The rollback is straightforward — restore the secret if the federation doesn't work.
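On the AWS side, the federation configuration is an IAM role whose trust policy accepts GitHub's OIDC tokens. The sketch below generates such a trust policy as JSON. The account ID, repository, and function name are placeholders, and pinning the sub claim to a single branch is one reasonable scoping choice among several.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_trust_policy(account_id, repo, branch="main"):
    """Build a role trust policy that only the given repo and branch can assume."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


# Placeholder account and repository.
policy = github_trust_policy("123456789012", "example-org/payments")
print(json.dumps(policy, indent=2))
```

Always constrain the sub claim. A trust policy that checks only the audience lets any repository on GitHub assume the role.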

Wave 2: Service-to-service API authentication (4-8 weeks)

Service-to-service API calls are typically the largest category by credential count. The migration per service has four steps: add the Aembit client agent as a sidecar or as a dependency, configure the client workload identity (SPIFFE ID mapping), configure the access policy for the target service, and replace the static credential in the application config with the Aembit SDK call or environment variable injection.

Don't attempt to migrate all services simultaneously. Start with two or three services that call each other — a known dependency pair — validate end-to-end, then expand.

Wave 3: Database credentials (4-8 weeks)

Database credential migration is the most operationally complex wave because the connection pool behavior matters. Work through the connection pooling considerations first (token refresh before expiry, pool eviction on token expiry, reconnect logic for long-running queries), then migrate one database at a time.
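Pool eviction on token expiry can be sketched independently of any particular driver. The example below assumes each pooled connection remembers when the credential used to open it expires; PooledConnection, evictable, and sweep are hypothetical names, and a real pool would expose this through its own lifetime or validation hooks.

```python
from dataclasses import dataclass


@dataclass
class PooledConnection:
    name: str
    credential_expires_at: float  # epoch seconds for the token used at connect time
    in_use: bool = False


def evictable(conn, now, buffer_seconds=60.0):
    """Idle connections whose credential expires within the buffer are evicted."""
    return not conn.in_use and conn.credential_expires_at - now <= buffer_seconds


def sweep(pool, now, buffer_seconds=60.0):
    """Return (connections to keep, names of evicted connections)."""
    keep = [c for c in pool if not evictable(c, now, buffer_seconds)]
    evicted = [c.name for c in pool if evictable(c, now, buffer_seconds)]
    return keep, evicted


pool = [
    PooledConnection("a", credential_expires_at=1030.0),
    PooledConnection("b", credential_expires_at=1030.0, in_use=True),
    PooledConnection("c", credential_expires_at=1900.0),
]
kept, evicted = sweep(pool, now=1000.0)
print([c.name for c in kept], evicted)
```

Connections that are in use are left alone here so a long-running query is not cut off mid-flight. Whether the database keeps such a session alive after the credential expires depends on the database and proxy, so test that case explicitly.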

For PostgreSQL, the pam auth method in pg_hba.conf can delegate to a PAM module that validates OIDC tokens, but this requires control over both PostgreSQL and PAM configuration on the database host. (scram-sha-256 is a separate password-based method and cannot validate tokens.) The simpler path for most teams is a database proxy (RDS Proxy, Cloud SQL Auth Proxy) that authenticates the workload by its cloud IAM identity and handles the database credential itself, keeping token validation out of the application layer.

Wave 4: Third-party API credentials (ongoing)

External API credentials — Stripe, Twilio, Slack, SaaS platform APIs — migrate via the credential mediation pattern. The credential stays in Aembit's vault; the workload gets a scoped token that triggers authenticated calls through the outbound policy. This wave has no application code changes beyond removing the credential from environment variables.

Phase 4: Legacy Service Handling

Some services can't be migrated quickly: on-premises systems with LDAP authentication, third-party applications that only accept API keys, legacy services with hardcoded credential paths. These need a documented exceptions process rather than indefinite deferral.

For each legacy exception:

  1. Document why migration is blocked (technical constraint, not "we haven't gotten to it")
  2. Define compensating controls: network segmentation, rotation schedule, access monitoring
  3. Set a review date — not a migration date, but a date to reassess whether the blocking constraint still applies
  4. Add the exception to your credential register with an exception_reason field and review_date

Compensating controls for credentials that can't be migrated: rotate on a defined schedule (90 days maximum for credentials accessing sensitive systems), store only in the secrets manager (no environment variables or config files), restrict network access to the minimum necessary source IPs, and monitor for anomalous usage patterns.
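Exceptions only stay honest if something checks them. Below is a minimal sketch of such a check, assuming exception rows carry the exception_reason and review_date fields described above plus a last rotation date and a storage flag. The class, function, and example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

MAX_ROTATION_DAYS = 90  # ceiling for credentials accessing sensitive systems


@dataclass
class LegacyException:
    identifier: str
    exception_reason: str
    review_date: date
    last_rotation_date: date
    stored_in_secrets_manager: bool


def findings(exc, today):
    """List the ways an exception record violates the exceptions process."""
    problems = []
    if not exc.exception_reason.strip():
        problems.append("missing exception_reason")
    if exc.review_date < today:
        problems.append("review overdue")
    if (today - exc.last_rotation_date).days > MAX_ROTATION_DAYS:
        problems.append("rotation overdue")
    if not exc.stored_in_secrets_manager:
        problems.append("stored outside secrets manager")
    return problems


# Hypothetical exception for an on-premises LDAP bind account.
ldap_bind = LegacyException(
    identifier="ldap-bind-reporting",
    exception_reason="Vendor appliance only supports LDAP simple bind",
    review_date=date(2024, 5, 1),
    last_rotation_date=date(2024, 1, 15),
    stored_in_secrets_manager=True,
)
print(findings(ldap_bind, today=date(2024, 6, 1)))
```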

Measuring Progress

Track these metrics throughout the migration:

  • Static credential count: Number of credentials in your register that are long-lived (not platform-attested short-lived tokens). Target: trending down toward zero outside the exceptions list.
  • Average credential age: Mean time since last rotation (or since creation, for credentials that have never been rotated). This should drop as you migrate to short-lived tokens (age is always under TTL by definition) and as automated rotation covers legacy credentials.
  • Credentials per service: Ratio of static credentials to services. A well-migrated service has zero static credentials; it authenticates entirely through workload identity.
  • Policy coverage: Percentage of service-to-service connections that have an explicit Aembit policy defined. A high percentage here means your access control model is explicit; a low percentage means services might be authenticating through undocumented channels.
  • Unattested access events per day: Authentication events (from your logs) where the authenticating workload presented a static credential rather than a platform-attested token. Target: trending toward zero.
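The first three metrics can be computed straight from the credential register. The sketch below assumes each row is a dict with a static flag, a last_rotated date, and an optional exception_reason; those field names and the sample rows are illustrative. Policy coverage and unattested access events come from policy and log data, so they are left out.

```python
from datetime import date


def migration_metrics(register, service_count, today):
    """Compute three of the tracked metrics from credential register rows."""
    static = [row for row in register if row["static"]]
    outside_exceptions = [row for row in static if not row.get("exception_reason")]
    ages = [(today - row["last_rotated"]).days for row in static]
    return {
        "static_credential_count": len(outside_exceptions),
        "average_credential_age_days": sum(ages) / len(ages) if ages else 0.0,
        "credentials_per_service": len(static) / service_count,
    }


# Hypothetical register: one static key, one documented exception, one migrated workload.
rows = [
    {"id": "orders-api-key", "static": True, "last_rotated": date(2024, 1, 1)},
    {"id": "ldap-bind", "static": True, "last_rotated": date(2024, 3, 1),
     "exception_reason": "Vendor appliance only supports LDAP simple bind"},
    {"id": "checkout-workload", "static": False},
]
metrics = migration_metrics(rows, service_count=4, today=date(2024, 3, 31))
print(metrics)
```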

Report these metrics quarterly to security leadership. The credential count trend is particularly effective as an executive-level metric — it's concrete and directional, and it makes progress visible without requiring deep technical context to interpret.

Common Failure Modes to Watch For

Token service availability is a dependency you didn't have before. If the Aembit control plane is unreachable, workloads can't get new tokens. You need either a local token cache with graceful degradation or a defined incident runbook for token service outages. We recommend both: cache tokens at the workload level with a pre-expiry refresh buffer, and maintain a runbook for manual credential fallback in the event of an extended outage.
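A workload-level cache with a pre-expiry refresh buffer might look like the sketch below. It is not the Aembit client's implementation. fetch_token stands in for whatever call obtains a token and its expiry, and the demo uses a fake clock and a fetcher that fails once to show the degradation path.

```python
import time


class TokenCache:
    """Refreshes ahead of expiry and keeps serving a valid token if refresh fails."""

    def __init__(self, fetch_token, refresh_buffer=120.0, clock=time.time):
        self._fetch_token = fetch_token  # callable returning (token, expires_at)
        self._refresh_buffer = refresh_buffer
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or self._expires_at - now <= self._refresh_buffer:
            try:
                self._token, self._expires_at = self._fetch_token()
            except Exception:
                # Graceful degradation: reuse the cached token while it is still valid.
                if self._token is None or now >= self._expires_at:
                    raise
        return self._token


# Demo with a fake clock and a token service that is down on the second call.
now = [1000.0]
calls = []


def fake_fetch():
    calls.append(now[0])
    if len(calls) == 2:
        raise ConnectionError("token service unreachable")
    return f"token-{len(calls)}", now[0] + 900


cache = TokenCache(fake_fetch, refresh_buffer=120.0, clock=lambda: now[0])
first = cache.get()           # initial fetch, valid until t=1900
now[0] = 1800.0
during_outage = cache.get()   # refresh fails, cached token still valid
now[0] = 1850.0
after_recovery = cache.get()  # refresh succeeds
print(first, during_outage, after_recovery)
```

The buffer determines how long an outage the workload can ride out. With a 15-minute token and a 2-minute buffer, that is at most two minutes, which is why the runbook for extended outages is still needed.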

Clock skew between the token issuer and validators causes intermittent authentication failures that are hard to diagnose. Ensure all nodes are running NTP and that clock skew is under 30 seconds. Monitor for clock drift as part of your infrastructure health checks.
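The failure pattern is easier to recognize once you see how validators treat the time claims. This sketch assumes a validator that allows a fixed leeway and falls back to iat when nbf is absent. Both are illustrative choices, and real libraries differ in their defaults.

```python
def check_time_claims(claims, now, leeway=30):
    """Return 'ok', 'expired', or 'not yet valid' for a token's time claims."""
    not_before = claims.get("nbf", claims["iat"])
    if now > claims["exp"] + leeway:
        return "expired"
    if now + leeway < not_before:
        return "not yet valid"
    return "ok"


# A 15-minute token issued at t=1000, checked by validators with different clocks.
token_claims = {"iat": 1000, "nbf": 1000, "exp": 1900}
results = {
    "validator 45s behind": check_time_claims(token_claims, now=955),
    "validator 20s behind": check_time_claims(token_claims, now=980),
    "25s past expiry": check_time_claims(token_claims, now=1925),
    "31s past expiry": check_time_claims(token_claims, now=1931),
}
print(results)
```

A validator whose clock runs behind rejects only freshly issued tokens, which is why the failures look intermittent: the same request succeeds on retry a few seconds later.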

JWKS caching misconfiguration causes key rotation outages. Test your key rotation procedure in a non-production environment before doing it in production. The test should include verifying that all validators successfully refresh their JWKS cache when new tokens start arriving signed with the new key.
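The behaviour to test for is a validator that refetches the JWKS when it sees a key ID it does not recognize, with a rate limit so that garbage key IDs cannot trigger a fetch storm. Below is a minimal sketch with a simulated issuer and clock. JwksCache and its parameters are hypothetical, and signature verification is out of scope.

```python
import time


class JwksCache:
    """Caches signing keys by kid and refetches when an unknown kid appears."""

    def __init__(self, fetch_jwks, min_refresh_interval=60.0, clock=time.time):
        self._fetch_jwks = fetch_jwks  # callable returning {"keys": [...]}
        self._min_interval = min_refresh_interval
        self._clock = clock
        self._keys = {}
        self._last_refresh = None
        self.fetch_count = 0

    def key_for(self, kid):
        if kid not in self._keys and self._refresh_allowed():
            self._refresh()
        return self._keys.get(kid)

    def _refresh_allowed(self):
        if self._last_refresh is None:
            return True
        return self._clock() - self._last_refresh >= self._min_interval

    def _refresh(self):
        document = self._fetch_jwks()
        self._keys = {key["kid"]: key for key in document["keys"]}
        self._last_refresh = self._clock()
        self.fetch_count += 1


# Simulated issuer and clock so the rotation can be exercised without a network.
published = {"keys": [{"kid": "key-2023", "kty": "RSA"}]}
now = [0.0]
cache = JwksCache(lambda: published, clock=lambda: now[0])

before = cache.key_for("key-2023")
published["keys"].append({"kid": "key-2024", "kty": "RSA"})  # issuer rotates
now[0] = 300.0
after = cache.key_for("key-2024")        # unknown kid triggers a refetch
now[0] = 301.0
bogus = cache.key_for("not-a-real-kid")  # rate limited, no refetch
print(before["kid"], after["kid"], bogus, cache.fetch_count)
```

Publishing the new key before signing with it, and keeping the old key published until the last token it signed has expired, avoids depending on the refetch path at all.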