Workload Identity for Snowflake Connections

Maya Chen · 10 min read

Snowflake is often the most sensitive system in a data stack — it holds the analytical layer on top of your production data, aggregated and queryable. The authentication pattern used to connect pipelines to Snowflake matters. Most pipelines connect with a username and password, or a service account key pair, stored as a secret that rarely changes. That credential is effectively a standing access grant to your Snowflake environment for as long as it exists.

Snowflake supports two approaches that eliminate stored passwords from the authentication flow: key pair authentication with automatic rotation, and OAuth token-based authentication for programmatic access. The latter, combined with OIDC-based workload identity, gets you to a model where your pipeline authenticates to Snowflake using a short-lived token issued by your identity provider — not a password stored in a secrets manager.

This post walks through how that works technically, what Snowflake's OAuth support actually provides, and where the integration points are for workload identity systems.

The Standard Pattern and Its Problems

The typical Snowflake pipeline setup looks like this:

import os
import snowflake.connector

conn = snowflake.connector.connect(
    user='svc_pipeline_user',
    password=os.environ['SNOWFLAKE_PASSWORD'],  # long-lived secret injected at runtime
    account='myorg-myaccount',
    warehouse='PIPELINE_WH',
    database='ANALYTICS',
    role='PIPELINE_ROLE'
)

SNOWFLAKE_PASSWORD comes from your secrets manager, injected at runtime. The password is long-lived: it was set when the service account was created and is rotated quarterly at best, manually, by someone who then has to update every downstream system that uses it.

The failure modes here are well-understood: password leaks through log scraping, environment variable inspection, or credential report exfiltration; rotation failures that break pipelines because a downstream consumer missed the update; over-broad permissions on the service account because the exact required permissions weren't well-defined at creation time; and no per-pipeline audit trail because all pipelines authenticate as the same service account.

Snowflake OAuth for Service Accounts

Snowflake supports OAuth 2.0 for programmatic access through its External OAuth feature. This allows you to configure Snowflake to trust a specific OAuth authorization server — your identity provider — to issue tokens that Snowflake accepts for authentication.

The setup involves:

  1. Configuring a security integration in Snowflake that points at your IdP's issuer URL and public key set (JWKS)
  2. Mapping OAuth token claims to Snowflake roles
  3. Configuring your workloads to request tokens from your IdP and present them to Snowflake

The Snowflake security integration looks like this:

CREATE SECURITY INTEGRATION pipeline_oauth_integration
  TYPE = EXTERNAL_OAUTH
  ENABLED = TRUE
  EXTERNAL_OAUTH_TYPE = CUSTOM
  EXTERNAL_OAUTH_ISSUER = 'https://idp.internal/oauth2'
  EXTERNAL_OAUTH_JWS_KEYS_URL = 'https://idp.internal/oauth2/.well-known/jwks.json'
  EXTERNAL_OAUTH_TOKEN_USER_MAPPING_CLAIM = 'sub'
  EXTERNAL_OAUTH_SNOWFLAKE_USER_MAPPING_ATTRIBUTE = 'LOGIN_NAME'
  EXTERNAL_OAUTH_AUDIENCE_LIST = ('https://snowflake.example.com');

With this integration, a workload that obtains a JWT from your identity provider with the appropriate sub claim and audience will be able to authenticate to Snowflake without a password.
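To make the integration's expectations concrete, here is a minimal sketch of the claim checks that correspond to the parameters above. It only decodes and inspects the token payload; it does not verify the signature, which Snowflake does against the JWKS URL. The helper names and the sample claim values are illustrative assumptions, not part of Snowflake's or any IdP's API.

```python
import base64
import json
import time

def decode_jwt_payload(token: str) -> dict:
    """Decode the (unverified) payload segment of a compact JWT."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

def claim_problems(claims: dict, issuer: str, audience: str, now: float) -> list:
    """Return mismatches against what the security integration expects."""
    problems = []
    if claims.get("iss") != issuer:
        problems.append("iss does not match EXTERNAL_OAUTH_ISSUER")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if audience not in audiences:
        problems.append("aud not in EXTERNAL_OAUTH_AUDIENCE_LIST")
    if not claims.get("sub"):
        problems.append("sub claim missing (it is mapped to LOGIN_NAME)")
    if claims.get("exp", 0) <= now:
        problems.append("token expired")
    return problems

def make_unsigned_token(claims: dict) -> str:
    """Build a sample token for local experiments only (placeholder signature)."""
    def b64(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{b64({'alg': 'none'})}.{b64(claims)}.sig"

sample = make_unsigned_token({
    "iss": "https://idp.internal/oauth2",
    "aud": "https://snowflake.example.com",
    "sub": "pipeline_ingestion_prod",  # hypothetical workload identity
    "exp": int(time.time()) + 900,     # 15 minutes from now
})
print(claim_problems(decode_jwt_payload(sample),
                     "https://idp.internal/oauth2",
                     "https://snowflake.example.com",
                     time.time()))
```

Running a check like this against a real token from your IdP is a quick way to debug a rejected login before looking at the Snowflake side.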

Connecting Workload Identity to Snowflake OAuth

The interesting part is how a workload obtains the OAuth token in the first place. If the pipeline is running on a cloud compute instance — EKS pod, ECS task, GCE VM — it has a platform identity: the Kubernetes service account, the ECS task role, the GCE service account. That platform identity can be exchanged for an OAuth token scoped specifically for Snowflake access.

The exchange works through a token service that sits between the workload and the IdP. The workload presents its platform credential (a projected OIDC token from the Kubernetes API server, or an AWS STS token from the instance metadata service) to the token service. The token service validates the platform credential — confirming the workload is who it claims to be — and issues a short-lived OAuth token scoped for Snowflake.
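As a rough sketch of what the workload side of that exchange could look like, the snippet below builds a token-exchange request body. It assumes the token service implements OAuth 2.0 Token Exchange (RFC 8693), which the post does not state; the endpoint URL, the projected token path, and the function names are all hypothetical.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical values: neither the endpoint nor the token path comes from the post.
TOKEN_SERVICE_URL = "https://token-service.internal/oauth2/token"
PROJECTED_TOKEN_PATH = "/var/run/secrets/tokens/snowflake-token"

def read_platform_token(path: str = PROJECTED_TOKEN_PATH) -> str:
    """Read the platform credential, e.g. a projected Kubernetes OIDC token."""
    with open(path) as f:
        return f.read().strip()

def build_exchange_body(platform_token: str, audience: str) -> str:
    """Form-encode an RFC 8693 token-exchange request."""
    return urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": platform_token,
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        "audience": audience,
    })

# A placeholder stands in for the real platform token in this example.
body = build_exchange_body("platform-token-placeholder",
                           "https://snowflake.example.com")
fields = parse_qs(body)
print(fields["grant_type"][0])
```

The body would be sent as an `application/x-www-form-urlencoded` POST to the token service, and the response's access token is what gets presented to Snowflake.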

In Aembit's model, the workload doesn't manage this exchange directly. The platform-side agent handles the credential acquisition transparently, so the pipeline code doesn't need to know about token exchange flows. The connection call looks like:

conn = snowflake.connector.connect(
    account='myorg-myaccount',
    authenticator='oauth',
    token=aembit.get_credential('snowflake-analytics'),
    warehouse='PIPELINE_WH',
    database='ANALYTICS',
    role='PIPELINE_ROLE'
)

aembit.get_credential('snowflake-analytics') returns a token that was issued for this specific workload's identity, is scoped to the Snowflake audience, and expires in 15 minutes. The pipeline itself never touches a password. No single token value is fixed for the lifetime of a long-running job: the token is fetched per connection, or refreshed before expiry in connection pool scenarios.

Scoping Per Pipeline, Not Per Service Account

The per-pipeline identity model solves a problem that the shared-service-account model structurally can't: audit granularity.

Consider a data platform with five pipelines reading from Snowflake: ingestion from operational databases, nightly aggregation jobs, ML feature computation, ad hoc analytics for the product team, and a real-time dashboard refresh. With a shared service account, all five pipelines appear in Snowflake's query history as the same user. When you need to investigate why a particular table was queried at 3am, you know which service account queried it but not which pipeline.

With workload-identity-based authentication, each pipeline has a distinct identity. The Snowflake user mapping can encode the pipeline identity in the username: pipeline_ingestion_prod, pipeline_aggregation_prod, and so on. Snowflake role assignments can be per-pipeline, so the ingestion pipeline has write access to staging tables but the dashboard refresh pipeline is read-only. The audit trail is immediately actionable because it's pipeline-granular.
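One way to picture the per-pipeline mapping is as a small inventory that generates one Snowflake user and one role grant per pipeline. The sketch below only produces SQL text for review; the inventory entries, `sub` values, and role names are hypothetical, and it assumes the integration maps the token's `sub` claim to `LOGIN_NAME` as in the security integration shown earlier.

```python
import re

# Hypothetical pipeline inventory; users, sub claims, and roles are illustrative.
PIPELINES = [
    {"user": "pipeline_ingestion_prod", "sub": "pipeline-ingestion-prod",
     "role": "INGESTION_RW"},
    {"user": "pipeline_dashboard_prod", "sub": "pipeline-dashboard-prod",
     "role": "DASHBOARD_RO"},
]

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_SUBJECT = re.compile(r"^[A-Za-z0-9_.:@/-]+$")

def provisioning_sql(pipeline: dict) -> list:
    """Return the statements that create one pipeline's user and grant its role."""
    user, sub, role = pipeline["user"], pipeline["sub"], pipeline["role"]
    # Refuse values that could not be safely interpolated into SQL text.
    if not (_IDENTIFIER.match(user) and _IDENTIFIER.match(role)
            and _SUBJECT.match(sub)):
        raise ValueError(f"unsafe value in pipeline definition: {pipeline!r}")
    return [
        # LOGIN_NAME must equal the token's sub claim for the mapping to resolve.
        f"CREATE USER IF NOT EXISTS {user} LOGIN_NAME = '{sub}' "
        f"DEFAULT_ROLE = {role};",
        f"GRANT ROLE {role} TO USER {user};",
    ]

for pipeline in PIPELINES:
    for statement in provisioning_sql(pipeline):
        print(statement)
```

Keeping the inventory in version control gives you a reviewable record of which pipeline identity maps to which role.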

Key Pair Authentication as a Stepping Stone

If your IdP doesn't yet support the External OAuth integration, or if you're not ready to deploy a workload identity layer, Snowflake's key pair authentication is a meaningful improvement over password-based auth. Key pair auth uses RSA key pairs rather than passwords: the private key is never sent to Snowflake, and Snowflake stores only the public key. The pipeline still has to load the private key from the secrets manager at runtime, as shown here:

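import snowflake.connector
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.serialization import (
    load_pem_private_key, Encoding, PrivateFormat, NoEncryption
)

# Read the PEM-encoded private key as bytes; the with-block closes the file.
with open('/run/secrets/snowflake_rsa_key', 'rb') as key_file:
    private_key = load_pem_private_key(
        key_file.read(),
        password=None,  # the key file is assumed to be unencrypted
        backend=default_backend()
    )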

conn = snowflake.connector.connect(
    user='svc_pipeline_user',
    account='myorg-myaccount',
    private_key=private_key.private_bytes(
        Encoding.DER, PrivateFormat.PKCS8, NoEncryption()
    ),
    warehouse='PIPELINE_WH'
)

Key pair auth can be rotated by generating a new key pair and updating both the secrets manager and the Snowflake user's public key, with no password management involved. Automated rotation is more straightforward than password rotation because a Snowflake user can hold two public keys at once (RSA_PUBLIC_KEY and RSA_PUBLIC_KEY_2): the new key can be registered and rolled out before the old one is removed, so there is no cutover moment at which consumers break. But key pair auth is still a long-lived credential model: the private key persists until explicitly rotated. OAuth-based authentication with short-lived tokens is the more complete solution.
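A rotation step can be sketched as generating the statements that register a new public key in Snowflake's second key slot and later retire the old one. This is a minimal sketch that only builds SQL text from a PEM-encoded public key; the function names are illustrative, and the sample key is a placeholder rather than a real key.

```python
def public_key_body(pem: str) -> str:
    """Strip the PEM header and footer and join the base64 lines."""
    lines = [line.strip() for line in pem.strip().splitlines()]
    return "".join(line for line in lines
                   if line and not line.startswith("-----"))

def rotation_statements(user: str, new_public_pem: str) -> list:
    """Return the two rotation statements, to be run at different times."""
    body = public_key_body(new_public_pem)
    return [
        # Step 1: register the new key alongside the existing one.
        f"ALTER USER {user} SET RSA_PUBLIC_KEY_2 = '{body}';",
        # Step 2: only after every consumer uses the new private key.
        f"ALTER USER {user} UNSET RSA_PUBLIC_KEY;",
    ]

# Placeholder PEM content for illustration only.
SAMPLE_PEM = """-----BEGIN PUBLIC KEY-----
AAAA
BBBB
-----END PUBLIC KEY-----
"""

for statement in rotation_statements("svc_pipeline_user", SAMPLE_PEM):
    print(statement)
```

The next rotation would use the slots the other way around, setting RSA_PUBLIC_KEY and then unsetting RSA_PUBLIC_KEY_2.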

Handling Connection Pools and Long-Running Jobs

One practical complication with short-lived tokens is connection pool management. A connection established with a 15-minute token will fail when that token expires if the connection stays open. Connection pools that hold connections open across query batches need a refresh strategy.

The correct approach is pre-expiry token refresh. Before establishing a pooled connection, check the token's exp claim against the current time. If the token expires within the next refresh window (typically 2-5 minutes), fetch a new token before opening the connection. For connections already in the pool, implement a health check that validates the token's remaining validity and evicts connections whose tokens are expiring soon.
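The expiry check can be sketched as two small helpers, named here `parse_exp` and `token_expiring_soon`. This assumes the credential is a JWT whose payload carries a numeric `exp` claim, which may not hold for every token format; the three-minute window is just one value within the range mentioned above.

```python
import base64
import json
import time

REFRESH_WINDOW_SECONDS = 180  # assumed value inside the 2-5 minute range

def parse_exp(token: str) -> int:
    """Read the exp claim (seconds since epoch) from a JWT without verifying it."""
    payload_b64 = token.split(".")[1]
    # Restore the base64url padding that compact JWTs omit.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return int(json.loads(base64.urlsafe_b64decode(padded))["exp"])

def token_expiring_soon(exp, window=REFRESH_WINDOW_SECONDS, now=None) -> bool:
    """True when the token expires within the refresh window or exp is unknown."""
    if exp is None:
        return True  # no recorded expiry: treat the connection as stale
    current = time.time() if now is None else now
    return exp - current <= window

# Build a sample token locally so the helpers can be tried without an IdP.
_payload = base64.urlsafe_b64encode(
    json.dumps({"exp": 1700000000}).encode()
).rstrip(b"=").decode()
demo_token = f"header.{_payload}.signature"

print(parse_exp(demo_token))
print(token_expiring_soon(parse_exp(demo_token), now=1700000000 - 60))
```

Treating a missing expiry as "expiring" errs toward opening a fresh connection, which costs a reconnect but avoids running a query on a dead session.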

Most connection pool libraries support a validation callback. In Python's SQLAlchemy, the pool_pre_ping flag plus a custom event listener can handle this:

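from sqlalchemy import event
from sqlalchemy.exc import DisconnectionError
from sqlalchemy.pool import Pool

@event.listens_for(Pool, "checkout")
def checkout_listener(dbapi_con, con_record, con_proxy):
    # token_exp must be stored in con_record.info when the connection is
    # created, e.g. by the pool's creator via parse_exp(token).
    if token_expiring_soon(con_record.info.get('token_exp')):
        # Invalidate this connection; the pool retries with a fresh one,
        # whose creator calls aembit.get_credential() for a new token.
        raise DisconnectionError("token refresh required")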

The DisconnectionError causes the pool to establish a new connection rather than reusing the stale one. This works cleanly for batch jobs. For streaming jobs with very long-lived connections, consider using Snowflake's JWT reauthentication support to refresh the session token without disconnecting.

What This Doesn't Solve

We're not saying token-based authentication solves all Snowflake security concerns. It solves credential management — no passwords, no rotation burden, short-lived access grants. It doesn't solve data access governance, which is a Snowflake-side concern: role design, column-level security, row access policies, and data masking. Those controls are orthogonal and need to be configured regardless of your authentication mechanism.

It also doesn't solve the problem of a compromised pipeline binary. If an attacker controls the pipeline process, they can call aembit.get_credential() directly and get a valid token. The identity model gives you a better audit trail of that access — you'll see the pipeline identity making unusual queries — but it doesn't prevent the access itself. Defense in depth still requires runtime monitoring and anomaly detection on the Snowflake query audit log.

The token-based model does make the blast radius of a compromise smaller: the attacker gets a 15-minute token scoped to the Snowflake audience, not a persistent password that grants indefinite access until manually revoked. That's a meaningful improvement in a real incident scenario.