mTLS Is Not Enough

Raj Patel · Jan 21, 2024

Teams that have deployed mutual TLS for service-to-service communication often describe it as "we have mTLS, so our internal traffic is authenticated." This is technically true and operationally misleading. mTLS establishes that both endpoints hold valid certificates signed by a trusted CA. It says nothing about whether the workload presenting that certificate is supposed to be calling this service, whether the policy authorizing that call was reviewed recently, or whether the certificate belongs to an active service or a decommissioned one that never had its cert revoked.

The gap isn't that mTLS is broken — it isn't. The gap is that teams treat certificate presence as a proxy for authorization, and that conflation creates a class of security problems that mTLS was never designed to prevent.

What mTLS Actually Proves

Mutual TLS, at its core, establishes a channel with three properties: confidentiality (traffic is encrypted), server authentication (the client verified the server's certificate), and client authentication (the server verified the client's certificate). The client authentication step is what distinguishes mTLS from standard TLS — the server asks the client to present a certificate, and the client does.

What "the client has a valid certificate" means in practice: the certificate was signed by a CA that the server trusts, the certificate hasn't expired, and the certificate hasn't been revoked (assuming you check the CRL or OCSP, which many internal implementations don't). That's a meaningful cryptographic guarantee. It proves the client has access to the private key corresponding to a certificate that was issued by your CA.

What it does not prove: that the workload running on the other end is the workload that was supposed to receive that certificate. If service A and service B share a CA (which they will if they're in the same service mesh) and service B somehow gets a copy of service A's certificate and private key, or is issued a certificate whose CommonName or SAN is broad enough to match what the server expects from service A, then service B's requests look identical to service A's requests to the downstream server. The mTLS handshake succeeds. The policy question (should service B be allowed to call this endpoint?) was never asked.

The Authorization Gap

Consider a typical service mesh deployment with Istio. You enable mTLS in STRICT mode across the mesh. Good. Now every service-to-service call requires valid mTLS. What controls which services can talk to which other services? That is the job of AuthorizationPolicy resources. AuthorizationPolicy in Istio operates on the SPIFFE ID embedded in the URI SAN field of the certificate (when the cert was issued by Istio's own CA, which is istiod, formerly Citadel).
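
For reference, mesh-wide STRICT mode is typically expressed as a PeerAuthentication resource in the mesh root namespace. The sketch below assumes the root namespace is istio-system, which is Istio's default but may differ in your installation.

```yaml
# Mesh-wide STRICT mTLS: a PeerAuthentication in the root namespace
# applies to every workload in the mesh.
# Assumption: the root namespace is istio-system (Istio's default).
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```

This resource only governs whether plaintext connections are accepted. It does not restrict which authenticated callers may reach which workloads.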

An AuthorizationPolicy that actually pins the expected caller looks like this:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payment-processor-ingress
  namespace: payments
spec:
  # Applies to workloads labeled app=payment-processor in this namespace.
  selector:
    matchLabels:
      app: payment-processor
  action: ALLOW
  rules:
  - from:
    - source:
        # Only this exact service account identity is admitted.
        principals:
          - "cluster.local/ns/checkout/sa/checkout-service"

This policy says: only traffic from the service account checkout-service in the checkout namespace can reach the payment processor. mTLS is what carries the identity claim. The AuthorizationPolicy is what enforces whether that identity is allowed to make this call.

The problem we see in practice: most teams configure mTLS correctly but write AuthorizationPolicy at the namespace level rather than the service account level. A namespace-scoped policy that allows all traffic from namespace checkout means any compromised service in that namespace can call your payment processor. mTLS is doing its job; the authorization policy just isn't specific enough to prevent lateral movement after a compromise in the same namespace.
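
For contrast, a namespace-scoped version of the policy above might look like the following sketch. It reuses the names from the earlier example; the only change is `namespaces` in place of `principals`.

```yaml
# Over-broad variant: admits every workload identity in the checkout
# namespace, not just checkout-service.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payment-processor-ingress
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payment-processor
  action: ALLOW
  rules:
  - from:
    - source:
        namespaces:
          - "checkout"
```

Both versions pass the same mTLS handshake. The difference shows up only after a compromise: here, any pod in checkout inherits access to the payment processor.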

Certificate Lifetime and Revocation

Istio-issued certificates have a default lifetime of 24 hours. SPIRE-issued SVIDs default to 1 hour. In contrast, certificates issued by internal CAs for long-running services frequently have lifetimes of 90 days, 1 year, or longer. The longer the certificate lifetime, the larger the window of opportunity if a private key is ever compromised.

Revocation is the theoretical solution to this problem. In practice, revocation is rarely effective for internal mTLS certificates, for three reasons: CRL distribution points (CDPs) often aren't configured for internal PKIs, OCSP stapling is frequently not enabled on internal services, and checking revocation status adds latency to every connection. Teams skip it. The certificate stays trusted until it expires naturally.

A workload that gets its identity from an ephemeral SVID with a 1-hour TTL has a much shorter exposure window than a workload carrying a 90-day certificate. If the private key is compromised, an attacker has at most an hour before the stolen SVID expires, provided they cannot also obtain renewals. You can't rotate static certificates at that frequency without automation that most teams haven't built, so the practical security gap between "mTLS with well-managed certs" and "mTLS with SPIFFE-derived identity" is significant.

Who Made That Call?

The audit log problem is where mTLS-only architectures show their weakest seams. Suppose you have an incident and you need to answer: "Every call to the accounts database in the last 48 hours — which service made each call, and was each call authorized by a valid policy at the time?"

With mTLS and a service mesh, you have access log data. You can see source IPs and certificate CNs. But CNs in internal certificates are often set to service names like payment-service.payments.svc.cluster.local, which doesn't tell you which pod instance made the call, which version of the code was deployed, or whether that service had drifted from its expected configuration. The certificate identifies the certificate holder, not the runtime state of the workload.

With SPIFFE-derived identity and an authorization layer that evaluates each request against explicit policy, you can log: which SPIFFE ID presented which SVID, which policy rule matched, and what access was granted. That audit trail is what an incident response actually needs — not just "service A called service B" but "service A instance X with attestation attributes Y was granted access under policy Z."
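
To make that concrete, a single authorization decision record might carry fields like the ones below. This is a hypothetical schema written for illustration: the field names and values are assumptions, and it is not the log format of Istio, SPIRE, or any particular product.

```yaml
# Hypothetical authorization decision record. Field names and values are
# illustrative only; no real tool is claimed to emit this format.
timestamp: "2024-01-20T14:03:27Z"
decision: ALLOW
caller:
  spiffe_id: "spiffe://cluster.local/ns/checkout/sa/checkout-service"
  pod: "checkout-service-7d9f"        # which instance made the call
  attestation:                        # attributes verified at issuance
    namespace: checkout
    service_account: checkout-service
target:
  service: "accounts-db"
policy:
  name: "accounts-db-ingress"         # which policy granted access
  rule_index: 0                       # which rule within it matched
```

The useful property is that the policy name and matched rule sit in the same record as the caller identity, so "was this call authorized at the time?" can be answered from the log alone.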

We're not saying mTLS produces no useful audit data. We're saying that mTLS alone doesn't produce the structured authorization decision log that meaningful compliance and incident response require.

The Policy Review Problem

When authorization on top of mTLS relies on network policies or service mesh rules, those policies live in YAML or HCL checked into a repository somewhere. Who owns them? Who reviews changes? How frequently are they audited for drift between what's authorized and what's actually needed?

The typical answer: nobody owns them explicitly, they accumulate alongside the services they govern, and they're audited only when something breaks or when a compliance review forces a scan. Permissions granted for a temporary migration stay in place long after the migration is done because removing them requires understanding what they were for — and that institutional memory has often left with the engineer who created the rule.

An authorization policy layer that's decoupled from the workload deployment — where policies are explicitly created, named, and have owners — makes this review tractable. It's much easier to audit "here are the 47 active access policies, here's when each was last modified, here are the ones that haven't been reviewed in 60 days" than to audit "here are the 200 Istio AuthorizationPolicy objects spread across 12 namespaces, some of which are in default-ALLOW mode."
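
If you stay with Istio-native policy objects, two small conventions can make that review easier. The sketch below is one possible approach, not a feature of any specific tool. It sets an explicit allow-nothing baseline for a namespace, so access exists only where a named ALLOW policy grants it, and it records ownership in annotations. The annotation keys and values are hypothetical; Istio does not interpret them.

```yaml
# Allow-nothing baseline: an ALLOW policy with no rules matches no
# requests, so workloads in this namespace reject traffic unless
# another ALLOW policy admits it.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-nothing
  namespace: payments
  annotations:
    # Hypothetical keys for review tooling; not read by Istio.
    example.com/owner: "payments-platform-team"
    example.com/last-reviewed: "2024-01-15"
spec: {}
```

With a baseline like this in place, a namespace can no longer be in the implicit allow-everything state that applies when no policy selects a workload.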

Workload Identity Beyond the Mesh

The harder problem: mTLS in a service mesh handles internal service-to-service traffic. It does nothing for the traffic that crosses the mesh boundary — workloads calling external APIs, workloads connecting to managed databases, workloads pushing data to SaaS platforms. Those connections use static API keys, database passwords, or cloud provider access keys, none of which have the short TTLs or policy evaluation that mTLS-within-mesh proponents use to justify their security posture.

A realistic threat model for a Kubernetes deployment has to account for all paths out of the cluster, not just the paths between services. An attacker who compromises a pod in an mTLS-enforced mesh still has access to any external credentials mounted into that pod. The mTLS hardening of internal traffic doesn't change the exposure of the external credential surface.
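
As an illustration of that exposure, here is a sketch of how a static credential commonly reaches a pod. All names (the pod, image, Secret, and environment variable) are hypothetical.

```yaml
# Hypothetical pod spec: the mesh sidecar secures in-mesh traffic, but
# the application container still receives a long-lived database
# password from a Kubernetes Secret.
apiVersion: v1
kind: Pod
metadata:
  name: checkout-service
  namespace: checkout
spec:
  serviceAccountName: checkout-service
  containers:
  - name: app
    image: registry.example.com/checkout-service:1.0.0  # placeholder image
    env:
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: accounts-db-credentials  # hypothetical Secret name
          key: password
```

Anyone who gains code execution in the app container can read DB_PASSWORD, and that value stays valid until someone rotates it, regardless of how short-lived the pod's mesh certificate is.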

That's the problem space where we focus at Aembit. We don't replace mTLS for internal traffic — that's well-handled by SPIFFE-based service meshes. We handle the credential surface for everything that goes outside the cluster or outside the mesh's identity boundary: the database, the third-party API, the cross-account resource. Workloads present their attested identity (a SPIFFE SVID), Aembit evaluates whether that identity is authorized to access the target resource at this moment, and if yes, Aembit hands the workload a short-lived credential for that specific resource rather than requiring the workload to hold a permanent secret.

mTLS is necessary. mTLS with workload-attested identity is better. mTLS with workload-attested identity plus external access policy evaluation is what "authenticated and authorized" actually means for the full surface of a production service deployment.

What to Check in Your Own Deployment

If you're running a service mesh with mTLS, a useful self-audit:

Are AuthorizationPolicy rules scoped to service account principals, or to namespace-level wildcards?
What's the certificate TTL? If it's over 24 hours, is rotation automated?
Is revocation checking (OCSP or CRL) actually functional, or is it configured but never tested?
Do you have a structured log of authorization decisions (not just connection logs) for each service endpoint?
What fraction of your service-to-external-resource calls still use static stored credentials?

mTLS handles one layer. The full picture requires thinking about each layer independently: transport security, workload identity, authorization policy, access audit, and credential lifetime. Checking one box doesn't check the others.