Mutual TLS (mTLS) authenticates both sides of a service-to-service connection using cryptographic certificates, blocking lateral movement and service impersonation inside your network. After being on-call for NTP-drifted VMs rejecting handshakes at 3am and "rotated" certs still cached in non-reloaded processes, I wrote this guide as the mTLS implementation reference I wish I'd had. It covers working Go and Python code, real configs, the failure modes that will page you, and honest takes on when a service mesh beats the DIY path.
What mTLS Actually Buys You (And What It Doesn't)
Mutual TLS provides cryptographic proof of identity at the transport layer by having the client present a certificate that the server validates against a trusted CA. That's the entire pitch: it blocks lateral movement, prevents service impersonation, and gives you an identity primitive to build authorization on top of.
What mTLS does NOT solve:
- Compromised hosts — an attacker who pops a service owns its private key and becomes that service.
- Leaked private keys — same problem, lower bar.
- Authorization — mTLS proves who you are, not what you're allowed to do.
- Application-layer attacks — SQLi, SSRF, and broken auth flows are still on the menu.
According to the CNCF's 2024 annual survey, 58% of organizations running Kubernetes report using a service mesh, and the overwhelming majority use the mesh's built-in mTLS rather than rolling their own. That ratio is the right starting point. If you run on Kubernetes without strong reasons to hand-roll, install Istio, Linkerd, or Consul Connect and let them issue and rotate certs.
The DIY path in this article fits these cases:
- Bare-metal services
- Non-Kubernetes workloads
- Third-party integrations
- Environments requiring direct PKI control
mTLS vs JWT/OAuth2 — when to pick which:
| Use Case | Right Tool |
|---|---|
| Service-to-service, you control both ends | mTLS |
| User-facing identity | JWT + OAuth2 |
| Identity carries scopes/tenants/roles | JWT + OAuth2 |
| Transport-layer identity only | mTLS |
They compose well: mTLS for the transport, JWT for the application semantics.
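Here is a minimal sketch of that composition in Go. The file paths and the userToken variable are placeholders rather than anything defined elsewhere in this article: the client certificate authenticates the hop, while the bearer token carries the application-level claims.

// Sketch: mTLS secures the connection, a JWT carries user/tenant/scope claims.
// Paths and userToken are illustrative placeholders.
clientCert, err := tls.LoadX509KeyPair("client.crt", "client.key")
if err != nil {
    log.Fatalf("load client cert: %v", err)
}
caCert, err := os.ReadFile("ca.crt")
if err != nil {
    log.Fatalf("read CA bundle: %v", err)
}
caPool := x509.NewCertPool()
caPool.AppendCertsFromPEM(caCert)

httpClient := &http.Client{
    Transport: &http.Transport{
        TLSClientConfig: &tls.Config{
            Certificates: []tls.Certificate{clientCert}, // transport identity (mTLS)
            RootCAs:      caPool,
            MinVersion:   tls.VersionTLS13,
        },
    },
}

req, _ := http.NewRequest("GET", "https://svc.internal:8443/api/v1/charges", nil)
req.Header.Set("Authorization", "Bearer "+userToken) // application identity (JWT)
resp, err := httpClient.Do(req)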
Setting Up a Private CA with step-ca in 15 Minutes
Smallstep's step-ca is the fastest path to an internal PKI: it speaks ACME natively, supports SSH certificates, and skips the OpenSSL config file format entirely. Expect a working private CA in roughly 15 minutes, which is about how long it takes to read the OpenSSL man page and give up.
Bootstrap the root and intermediate:
step ca init \
--name "Internal CA" \
--dns ca.internal \
--address ":8443" \
--provisioner admin@internal \
--deployment-type standalone
This generates four artifacts: a root CA, an intermediate, a JWK provisioner for admin@internal, and a config in ~/.step/. The root key is written to disk encrypted — pull it out, put it on a YubiKey or sealed offline backup, and never let it touch a running server again. The intermediate signs everything day-to-day.
Add an ACME provisioner so services request certs without operator involvement:
step ca provisioner add acme --type ACME
Any service with the step CLI and your root fingerprint can now request a cert:
step ca certificate svc.internal svc.crt svc.key \
--provisioner acme \
--san svc.internal \
--not-after 24h
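Before any of that works on a new host, the host has to trust your root. A hedged example of that bootstrap step, assuming the CA answers at ca.internal:8443:

# On the CA host, print the root fingerprint; on each service host, bootstrap trust with it.
step certificate fingerprint $(step path)/certs/root_ca.crt
step ca bootstrap --ca-url https://ca.internal:8443 --fingerprint <root-fingerprint>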
Do not reuse your public-facing CA for internal mTLS. Three reasons:
- Blast radius — a public CA compromise should not grant access to internal services.
- Lifecycle mismatch — public CAs are pushing toward the 47-day validity era while internal certs should be 24 hours.
- CT exposure — publicly trusted certs are logged to Certificate Transparency, which publishes internal hostnames you'd rather keep private.
Keep them separate.
Issuing and Loading Client Certificates in Go and Python
Authentication is step one; mapping the verified identity to an authorization decision is the part that actually matters and the part most tutorials skip. Below is a Go server requiring and verifying client certificates, with the handler enforcing per-route authorization based on SPIFFE identity.
A Go server that requires and verifies client certificates:
caCert, err := os.ReadFile("ca.crt")
if err != nil {
    log.Fatalf("read CA bundle: %v", err)
}
caPool := x509.NewCertPool()
caPool.AppendCertsFromPEM(caCert)

tlsConfig := &tls.Config{
    ClientAuth: tls.RequireAndVerifyClientCert, // reject any connection without a valid client cert
    ClientCAs:  caPool,
    MinVersion: tls.VersionTLS13,
}

server := &http.Server{
    Addr:      ":8443",
    TLSConfig: tlsConfig,
    Handler:   http.HandlerFunc(handle),
}
log.Fatal(server.ListenAndServeTLS("server.crt", "server.key"))
The handler pulls the verified identity off the request and uses it for authorization:
func handle(w http.ResponseWriter, r *http.Request) {
    // With RequireAndVerifyClientCert this chain is always populated,
    // but the guard keeps the handler safe behind other TLS configs.
    if len(r.TLS.VerifiedChains) == 0 {
        http.Error(w, "no client cert", http.StatusUnauthorized)
        return
    }
    cert := r.TLS.VerifiedChains[0][0]

    // Extract the SPIFFE URI SAN, if the cert carries one.
    spiffeID := ""
    for _, uri := range cert.URIs {
        if uri.Scheme == "spiffe" {
            spiffeID = uri.String()
            break
        }
    }

    if !canAccess(spiffeID, r.URL.Path) {
        http.Error(w, "forbidden", http.StatusForbidden)
        return
    }
    // Identity is authorized for this route; serve the request.
}
The SPIFFE ID gives you a stable identity (spiffe://corp.internal/svc/billing) that survives cert rotation. Map it to a permission set in your authz layer.
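A sketch of what that mapping can look like. The canAccess helper and the route prefixes here are illustrative, not a drop-in policy engine:

// Illustrative policy: map each SPIFFE ID to the route prefixes it may call.
var routePolicy = map[string][]string{
    "spiffe://corp.internal/svc/billing":   {"/api/v1/charges", "/api/v1/invoices"},
    "spiffe://corp.internal/svc/reporting": {"/api/v1/reports"},
}

func canAccess(spiffeID, path string) bool {
    for _, prefix := range routePolicy[spiffeID] {
        if strings.HasPrefix(path, prefix) {
            return true
        }
    }
    return false
}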
The Python httpx client is six lines:
import httpx
client = httpx.Client(
cert=("client.crt", "client.key"),
verify="ca.crt",
)
r = client.get("https://svc.internal:8443/api/v1/charges")
The server now knows it's talking to the billing service and can decide whether the billing service is allowed to read charges. Use SAN-based or SPIFFE-based identity, mapped to roles, checked per-route. Don't ship the cert-verified-therefore-trusted pattern — that's how you accidentally let any internal service read everything.
Short-Lived Certs and Automatic Rotation
Use 24-hour certificates for service-to-service authentication instead of 1-year certs. Three benefits: the window for a stolen key collapses, you exercise rotation continuously instead of annually, and you stop pretending revocation works (it mostly doesn't). Smallstep's published guidance recommends 24-hour or shorter validity for service identity, and SPIFFE/SPIRE defaults sit in the same ballpark.
Rotation tooling by environment:
- Kubernetes — cert-manager with the csi-driver-spiffe addon. Pods get a tmpfs volume with their cert and key, rotated automatically.
- Bare metal / VMs — systemd timer plus the step CLI does the same job:
[Unit]
Description=Renew service cert

[Service]
Type=oneshot
ExecStart=/usr/bin/step ca renew /etc/svc/cert.pem /etc/svc/key.pem --force
ExecStartPost=/bin/systemctl reload svc
Pair it with a .timer unit firing every few hours.
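A matching .timer unit looks roughly like this; it assumes the renewal service and timer share a unit name (e.g. svc-cert-renew.service and svc-cert-renew.timer):

[Unit]
Description=Renew service cert every few hours

[Timer]
OnBootSec=5min
OnUnitActiveSec=4h
RandomizedDelaySec=10m

[Install]
WantedBy=timers.target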
The silent failure to watch for: rotation works, the file on disk is fresh, but your running process is still serving the old cert because it loaded the file once at startup and never re-read it. SIGHUP only helps if you wired up SIGHUP handling. The fix in Go is GetCertificate instead of Certificates:
var (
    certMu sync.RWMutex
    cert   *tls.Certificate
)

// loadCert re-reads the key pair from disk; call it at startup and on every reload.
func loadCert() error {
    c, err := tls.LoadX509KeyPair("server.crt", "server.key")
    if err != nil {
        return err
    }
    certMu.Lock()
    cert = &c
    certMu.Unlock()
    return nil
}

tlsConfig := &tls.Config{
    // GetCertificate is consulted on every handshake, so a reload takes
    // effect immediately without restarting the process.
    GetCertificate: func(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
        certMu.RLock()
        defer certMu.RUnlock()
        return cert, nil
    },
}
Reload via fsnotify when the file changes, or on a timer. In my experience running this in production, I've seen this pattern skipped in code that was "doing mTLS correctly" right up until the cert expired and traffic dropped to zero. Same family of failure as the gap between renewal and deploy, hiding inside a single process.
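A minimal timer-based reloader, reusing loadCert from the snippet above; an fsnotify watcher on the cert directory is the event-driven equivalent. The interval is an assumption — anything comfortably shorter than the cert lifetime works.

// Reload the key pair periodically so rotations on disk reach the running process.
go func() {
    ticker := time.NewTicker(1 * time.Hour)
    defer ticker.Stop()
    for range ticker.C {
        if err := loadCert(); err != nil {
            log.Printf("cert reload failed, keeping previous cert: %v", err)
        }
    }
}()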
The Failure Modes That Will Page You
These four failure modes have woken me up on-call. They are not exhaustive, but they are what to check first when mTLS misbehaves in production.
- Clock skew. TLS handshakes verify cert validity against the local clock. A VM losing NTP and drifting 90 seconds will start failing handshakes with "x509: certificate has expired or is not yet valid" even though everything is fine. Monitor NTP offset and alert at 30 seconds. With 24-hour certs, your margin is far smaller than you think.
- CRL/OCSP fail-open. Most TLS clients fail open when revocation checking can't reach the CA. A CA outage doesn't break traffic — which sounds great until you remember it also means revocation isn't enforced. For real revocation, use short-lived certs and skip CRL/OCSP entirely. OCSP stapling has its own pile of issues.
- Cipher suite mismatches. A Java 8 client trying to talk to a TLS 1.3-only server fails with "no cipher suites in common" and a useless stack trace. Pin minimum TLS versions explicitly on both sides and document the supported matrix.
- Stale CA bundles in container images. You rotated the CA. New pods have the new bundle. The old image floating around in a forgotten DaemonSet does not. Treat your trust bundle as a versioned artifact and alert on services presenting certs from CAs you've retired.
The debugging command worth committing to muscle memory:
openssl s_client -connect svc.internal:8443 \
-cert client.crt -key client.key \
-CAfile ca.crt -verify_return_error -showcerts
Read the actual handshake. The error tells you whether the failure is in the chain, the SAN, the validity window, or the cipher negotiation. Don't guess.
Observability for Your Mutual TLS Mesh
You need three signals to know the mesh is working: cert age distribution, handshake failure rate, and time-to-expiry per service. The standard "alert 30 days before expiry" rule does not work when certs live 24 hours — it would page you constantly. Tune alerts to a percentage of cert lifetime instead.
A Prometheus exporter walking /etc/svc/*.crt and emitting expiry per service is roughly 80 lines of Go. Export these metrics:
- cert_expiry_seconds{service="..."} — seconds until notAfter
- cert_age_seconds{service="..."} — seconds since notBefore
- mtls_handshake_failures_total{reason="..."} — counter from your server's TLS error callback
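A hedged sketch of the exporter's core, using prometheus/client_golang. The directory layout and the filename-as-service-label convention are assumptions about your setup, and the handshake-failure counter is omitted here because it belongs in the server process itself:

package main

import (
    "crypto/x509"
    "encoding/pem"
    "log"
    "net/http"
    "os"
    "path/filepath"
    "strings"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    certExpiry = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{Name: "cert_expiry_seconds", Help: "Seconds until notAfter."},
        []string{"service"},
    )
    certAge = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{Name: "cert_age_seconds", Help: "Seconds since notBefore."},
        []string{"service"},
    )
)

// scan walks /etc/svc/*.crt and refreshes the gauges; the filename doubles as the service label.
func scan() {
    paths, _ := filepath.Glob("/etc/svc/*.crt")
    for _, p := range paths {
        data, err := os.ReadFile(p)
        if err != nil {
            continue
        }
        block, _ := pem.Decode(data)
        if block == nil {
            continue
        }
        cert, err := x509.ParseCertificate(block.Bytes)
        if err != nil {
            continue
        }
        svc := strings.TrimSuffix(filepath.Base(p), ".crt")
        certExpiry.WithLabelValues(svc).Set(time.Until(cert.NotAfter).Seconds())
        certAge.WithLabelValues(svc).Set(time.Since(cert.NotBefore).Seconds())
    }
}

func main() {
    prometheus.MustRegister(certExpiry, certAge)
    go func() {
        for {
            scan()
            time.Sleep(1 * time.Minute)
        }
    }()
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9100", nil))
}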
Alerting rules tuned for 24-hour certs:
- alert: CertNearExpiry
  expr: cert_expiry_seconds < (cert_age_seconds + cert_expiry_seconds) * 0.25
  for: 5m
  annotations:
    summary: "{{ $labels.service }} cert under 25% lifetime remaining"

- alert: HandshakeFailureSpike
  expr: rate(mtls_handshake_failures_total[5m]) > 0.5
  for: 10m
The first rule fires when remaining lifetime drops under 25% of total lifetime, which works whether certs are 24 hours or 90 days. The second catches the rotation-broke-my-process pattern. Pair this with broader certificate monitoring across your fleet so internal PKI doesn't sit in a blind spot relative to your public certs.
Wrapping Up
Mutual TLS is the right primitive for service-to-service authentication when you control both ends and need short-lived, cryptographically verifiable identity. The failure modes — in-memory cert caching, NTP drift, fail-open revocation, and stale CA bundles — will all bite you in production if you skip the boring instrumentation work.
The minimum viable mTLS checklist:
- Start with step-ca for the internal PKI.
- Run 24-hour certificate lifetimes.
- Wire up GetCertificate-based reloading.
- Alert on lifetime percentage rather than absolute days.
- If you're on Kubernetes and don't need to own the PKI, install a service mesh.
CertPulse monitors TLS certificate expiry across services, which is exactly the work teams skip when they don't want to babysit a Prometheus exporter forever.
FAQ
When should I use mTLS instead of JWT or OAuth2?
Use mTLS for service-to-service traffic where both ends are infrastructure you control and identity is the only thing the receiving side needs. Use JWT/OAuth2 when identity carries application context (user, tenant, scopes) or when one end is a third party. Combining them is normal: mTLS for the transport, JWT inside the request for application semantics.
Do I need SPIFFE/SPIRE to do mTLS properly?
No, but it helps once you have more than a handful of services. SPIFFE provides a standard identity format (spiffe://trust-domain/path) and SPIRE automates issuance and rotation against workload attestation. For under 10 services, step-ca with ACME and a naming convention works. Past 10 services, the SPIFFE/SPIRE workflow scales better than ad-hoc SAN conventions.
How short should my certificate lifetimes be?
24 hours is a reasonable default for service identity. Some teams go to 1 hour with SPIRE. Going much shorter stresses your issuance path without buying additional security; going much longer makes revocation impossible without CRL/OCSP, which you should not rely on.
What happens when the CA is unreachable?
Existing connections keep working because the trust anchor is already on disk. New cert issuance fails until the CA returns, and any service whose cert expires during the outage will start rejecting handshakes. Run the CA highly available, keep the root offline, and monitor the issuance pipeline as a separate signal from the certs themselves.
Can I run mTLS without a service mesh?
Yes. The setup in this article runs without one. The tradeoff: you own rotation, observability, and authorization plumbing yourself. A mesh handles those at the cost of a sidecar per pod and the operational weight of the mesh control plane. If you're already on Kubernetes without a hard reason to avoid a mesh, the mesh is usually the lower-overhead path.
This is why we built CertPulse
CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.
If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.