
Short-lived certificates and the secrets management problem nobody planned for

May 25, 2026 · 9 min read · CertPulse Engineering

I rebuilt a system once where the same wildcard cert was pinned in eleven places. Eleven. We caught it because a Prometheus blackbox exporter — one nobody remembered configuring — kept paging on a stale cert two weeks after the "official" renewal. The runbook said the renewal was a five-minute job. It was. If you ignored the other ten places.

That system rotated certificates once a year. We're about to live in a world where it rotates every 47 days.

The math your secrets workflow wasn't built for

Annual renewal means a cert touches your pipeline once. Twice if you count staging. The whole workflow can be a Confluence page titled "yearly cert dance" and a calendar reminder, and it works fine. Plenty of teams operate this way and never have an incident, because the human in the loop catches the mistakes.

Move to the SC-081v3 endgame — 47 days plus 10-day DCV reuse by March 2029 — and that same cert flows through your pipeline roughly eight times a year. The 200-day phase is live as of March 2026. The 100-day phase hits March 2027. Each step roughly doubles the renewal rate. Each step exposes a different weakness in how secrets actually move through your infrastructure.
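To make the cadence concrete, here is the arithmetic as a quick sketch. The 200, 100, and 47-day caps are the phases above, and 398 days is the cap that preceded them. The two-thirds renewal point is my assumption (a common convention, not something the ballot requires), and `renewals_per_year` is just an illustrative helper.

```python
# Maximum certificate validity in days, per phase of the schedule.
PHASES = {
    "pre-2026 cap": 398,
    "March 2026": 200,
    "March 2027": 100,
    "March 2029": 47,
}


def renewals_per_year(validity_days: int, renew_at: float = 1.0) -> float:
    """Renewals per year if you renew once `renew_at` of the lifetime has elapsed.

    renew_at=1.0 means waiting until expiry; renew_at=2/3 is an assumed
    convention of renewing with a third of the lifetime still left.
    """
    return 365 / (validity_days * renew_at)


for phase, days in PHASES.items():
    at_expiry = renewals_per_year(days)
    early = renewals_per_year(days, renew_at=2 / 3)
    print(f"{phase:>12}: {days:>3} days, {at_expiry:.1f}/yr at expiry, {early:.1f}/yr renewing early")
```

The "roughly eight" figure is the floor. If you renew with a third of the lifetime remaining, which you should, the 47-day phase lands closer to twelve renewals a year per cert.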

The failure mode isn't "renewals fail." Renewing certs is the easy part. Let's Encrypt has been doing 90-day renewals for a decade, and ACM has issued automated certs since 2016. The failure mode is that the new cert sits at the CA, or in Vault, or in the cert-manager Secret, and never reaches the seven other places it needs to be. Or it reaches six of them and the seventh is a Prometheus exporter nobody remembers configuring.

The workflows that quietly fall apart

The most common pattern I see at mid-market companies isn't dynamic. It's GitOps with certs baked in. Someone PEM-encodes the cert and key, runs it through SOPS or sealed-secrets, commits the encrypted blob, and ArgoCD syncs it into a Kubernetes Secret. Works beautifully for annual rotations. Also means every renewal is a pull request. Eight PRs a year, per cert, per environment, with a reviewer who increasingly rubber-stamps the diff because it's "another cert thing."

Then there's the manually-synced Vault KV pattern. The CA emails ops, ops downloads the bundle, ops runs vault kv put. Maybe there's a script. Maybe the script lives on one engineer's laptop. I've watched this entire workflow collapse the first time the cert holder went on vacation during a renewal week.

ArgoCD pinning Kubernetes Secrets with automated sync and selfHeal: true is its own special hell. cert-manager updates the cert, ArgoCD sees the drift, ArgoCD reverts it back to the GitOps version, and now there's a perfectly renewed cert sitting in a git branch nobody merged. The application keeps serving the old cert. Monitoring is green because the cert in git is technically valid. It just isn't the one cert-manager wanted you to use.

None of these break catastrophically. They degrade. Each renewal cycle, a slightly different subset of consumers ends up out of sync, you fix it by hand, and the institutional knowledge of "how this actually works" lives in the heads of two or three people. Multiply by eight cycles a year and you have a part-time job nobody signed up for.

Four distribution patterns, ranked by how well they age

I've seen roughly four ways teams actually get cert bytes from a source of truth to a running process. They scale very differently under short-lived certs.

File-on-disk with a sidecar reloader. The cert lives at a known path, a sidecar or systemd timer fetches updates, and the application either watches the file or gets a SIGHUP. This is the pattern nginx, HAProxy, and Envoy were designed for. It scales fine to 47-day rotations if — and only if — the reloader actually triggers a graceful reload. I've seen environments where the file got updated and the process held the old cert in memory until the next deploy six weeks later. Watch your reload signals carefully.
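A minimal sketch of the reloader half of that pattern, assuming a Unix host and a proxy that reloads on SIGHUP (nginx does; other proxies have their own reload mechanisms, so treat the signal as an assumption). The function names and the 30-second interval are illustrative, not from any particular tool.

```python
import hashlib
import os
import signal
import time
from pathlib import Path
from typing import Callable, Optional


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents. Content, not mtime, decides whether to reload."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def reload_if_changed(
    cert_path: Path,
    last_digest: Optional[str],
    pid: int,
    send_signal: Callable[[int, int], None] = os.kill,
) -> str:
    """Send SIGHUP to `pid` when the cert file differs from the last digest seen.

    The first call (last_digest=None) only records a baseline and sends nothing.
    Returns the current digest so the caller can pass it to the next call.
    """
    current = file_digest(cert_path)
    if last_digest is not None and current != last_digest:
        send_signal(pid, signal.SIGHUP)
    return current


def watch(cert_path: Path, pid: int, interval_s: float = 30.0) -> None:
    """Poll forever. A real sidecar would also log and handle a missing file."""
    digest = None
    while True:
        digest = reload_if_changed(cert_path, digest, pid)
        time.sleep(interval_s)
```

Sending the signal only proves the signal was sent. Whether the process actually swapped certs is a separate question, and the only reliable answer comes from connecting to the port and looking at what it serves.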

Vault Agent sidecar. The agent maintains the lease, writes the rendered template to a tmpfs mount, and the app re-reads on its own schedule. Cleanest pattern I've used, but it pushes complexity into the app. The app has to know the cert can change underneath it. Go's crypto/tls has good support for this via GetCertificate callbacks. Most Node.js HTTP servers do not, and you'll end up with restart-on-update workarounds that defeat the point.
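For a sense of what "the app knows the cert can change" looks like in code, here is a rough Python analogue of the GetCertificate approach, using the standard library's `sni_callback` hook. `ReloadingContext` and `build_context` are hypothetical names, and the mtime check assumes the agent rewrites the cert file on every renewal.

```python
import os
import ssl
import threading
from typing import Callable, Optional


def build_context(cert_path: str, key_path: str) -> ssl.SSLContext:
    """Build a server-side TLS context from the cert and key on disk."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    return ctx


class ReloadingContext:
    """Rebuilds the TLS context whenever the cert file's mtime changes."""

    def __init__(self, cert_path: str, key_path: str,
                 build: Callable[[str, str], ssl.SSLContext] = build_context):
        self.cert_path = cert_path
        self.key_path = key_path
        self._build = build
        self._lock = threading.Lock()
        self._mtime_ns: Optional[int] = None
        self._ctx: Optional[ssl.SSLContext] = None

    def current(self) -> ssl.SSLContext:
        """Return a context for the cert currently on disk, rebuilding if stale."""
        mtime_ns = os.stat(self.cert_path).st_mtime_ns
        with self._lock:
            if self._ctx is None or mtime_ns != self._mtime_ns:
                self._ctx = self._build(self.cert_path, self.key_path)
                self._mtime_ns = mtime_ns
            return self._ctx

    def sni_callback(self, ssl_sock, server_name, initial_ctx):
        """Runs during each handshake; swaps in the freshest context."""
        ssl_sock.context = self.current()
        return None
```

To wire it up, build the listener's context once with `build_context` and set its `sni_callback` attribute to `reloader.sni_callback`. New handshakes then pick up a renewed cert without a restart. One caveat: if the cert and key are written as two separate files, a handshake can land between the two writes and fail to load a matching pair, so a production version needs a retry or an atomic swap.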

Secrets Store CSI Driver. Mounts the cert as a volume sourced directly from Vault or a cloud secrets manager. Conceptually elegant, operationally fiddly. The driver does support rotation once you enable it, but the application still needs to handle a changing file. The pattern that bites people is using the sync-to-Kubernetes-Secret feature and then consuming that Secret as an environment variable. Env vars are baked at process start and never update.

Init-container fetch. The init container pulls the cert at pod startup, writes it to a shared volume, exits. This was fine at one rotation a year because pods restarted often enough to pick up the new cert. At 47 days you're coupling cert lifecycle to pod lifecycle, which means a long-running pod can serve a cert that's been rotated three times since it started. This pattern needs to die.

The honest ranking: Vault Agent sidecar and well-implemented file-on-disk survive the transition. CSI works if you commit to handling file changes. Init-container fetch is a trap, and you should migrate off it before 2027.

The hidden coupling problem

Here's the part that makes a written-down inventory mandatory rather than nice-to-have.

Take any production wildcard cert. Now list every place its bytes physically exist. A realistic inventory for a mid-sized SaaS:

The load balancer's terminator (ALB, Cloudflare, or an Envoy gateway). The application pods serving HTTPS internally for mTLS. A monitoring sidecar that needs the cert to scrape its own service. The Prometheus blackbox exporter that probes the external endpoint and pins the expected SANs. The internal API client that does cert pinning for a partner integration. A nightly backup script uploading to an S3 bucket fronted by the same cert. A Terraform state somewhere that has the cert ARN cached, which becomes a problem the next time someone runs terraform plan against an old state.

Seven places. Each has its own refresh mechanism. The load balancer is automatic if you're on ACM. The pods rely on cert-manager. The sidecar is on a different cert-manager Issuer because it needs an internal CA. The blackbox exporter has its expected SANs in a ConfigMap that nobody updates. The backup script is a cron job pulling the cert from S3 at runtime. The internal API client has a pinned fingerprint hardcoded into a config file.

When you rotated once a year, you could batch all seven into a single change window and hand-verify each one. When you rotate eight times a year, you cannot. The cost of automating each consumer is fixed. The cost of doing it manually is multiplied by your rotation frequency. Somewhere around the 100-day phase the manual cost crosses the automation cost, and you're losing engineering time on a recurring tax.

The coupling is hidden because nobody documents it. The cert is referenced by seven different mechanisms that nobody mapped. That's the work nobody planned for.

A practical migration path

The order matters here, because trying to "modernize secrets management" as one big rewrite will fail every time.

First, inventory. For each production cert, list every consumer and how it gets the bytes today. Yes, this is tedious. Yes, you need to do it. The artifact is the input to every later decision, and you cannot skip it. Network captures, audit logs, and grepping your config repos for the cert's CN or SAN strings will catch consumers nobody remembers.
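The grep step can be as simple as the sketch below, assuming your config repos are checked out under one directory. `find_references` is a hypothetical helper, and the needles are whatever CN, SAN, or fingerprint strings identify the cert you're inventorying.

```python
import os
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def find_references(
    root: Path,
    needles: Iterable[str],
    max_bytes: int = 1_000_000,
) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, needle) for each line mentioning a needle.

    Skips .git directories and files larger than max_bytes. Undecodable
    bytes are ignored rather than treated as errors.
    """
    needles = list(needles)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into .git.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.stat().st_size > max_bytes:
                    continue
                text = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue
            for lineno, line in enumerate(text.splitlines(), start=1):
                for needle in needles:
                    if needle in line:
                        yield path, lineno, needle
```

Run it with the hostname and with the current cert's fingerprint as separate needles. The hostname finds probes and client configs. The fingerprint finds the pinned copies, which are the ones that break on renewal. It will not find anything held only in a running process's memory, which is why the network captures and audit logs still matter.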

Second, pick a source of truth. For Kubernetes-heavy shops, that's cert-manager with an Issuer pointing at your CA of choice — Let's Encrypt, an internal Vault PKI mount, or ACM Private CA. For mixed estates, Vault PKI with cert-manager pulling from it works well. The point is that there's exactly one place certificates are issued, and everything downstream is a copy. If you have certs being issued from three different places today, consolidating is the prerequisite to everything else.

Third, peel off consumers one at a time. Start with the most-rotated, highest-blast-radius cert and migrate its seven consumers from manual to automated, one per sprint. Don't try to migrate every cert simultaneously. The pattern you build for the first one becomes the template for the rest, and the first one is always the hardest because that's where you find the assumptions baked into your environment.

Fourth — and this is the part most teams skip — instrument the gap between "renewed at the CA" and "deployed to all consumers." Your CA tells you when it issued a new cert. cert-manager tells you when it wrote a new Secret. Nothing in this chain tells you that the cert your application is actually serving to clients matches the one in the Secret. That verification has to happen at the endpoint, not at the source.

Where the visibility gap bites you

This is where products like CertPulse fit, and I want to be honest about the boundary. Knowing the cert renewed at the CA is necessary but not sufficient. The job isn't done until every consumer is serving the new bytes. At 47 days, "every consumer" is a moving target with weekly deploys, autoscaling pods, and three engineers who left last quarter.

Endpoint probing closes the loop. That means actually connecting to the HTTPS port, reading the cert the server returns, and comparing it against what you expect. You renewed at the CA. cert-manager updated the Secret. Did the load balancer pick it up? Did the internal mTLS service? Is the Prometheus exporter still pinning the old SAN? CT log monitoring catches the converse problem — a cert got issued for one of your domains that you didn't authorize — which becomes a real threat surface when issuance volume goes up 8x.
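The probe itself is not much code. A minimal sketch using only the Python standard library, assuming you can get the expected cert as PEM text (from the cert-manager Secret, say) and that comparing SHA-256 fingerprints of the leaf cert is enough for your case. The function names are illustrative.

```python
import hashlib
import socket
import ssl

PEM_END = "-----END CERTIFICATE-----"


def sha256_fingerprint(der_bytes: bytes) -> str:
    """Hex SHA-256 of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()


def served_fingerprint(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect, complete a TLS handshake, and fingerprint the leaf cert served.

    Uses default verification, so an expired or untrusted cert raises
    ssl.SSLError before any fingerprint is returned.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return sha256_fingerprint(der)


def expected_fingerprint(pem_text: str) -> str:
    """Fingerprint the first (leaf) certificate in a PEM bundle."""
    pem_text = pem_text.lstrip()
    leaf = pem_text[: pem_text.index(PEM_END) + len(PEM_END)]
    return sha256_fingerprint(ssl.PEM_cert_to_DER_cert(leaf))


def is_current(served: str, expected: str) -> bool:
    """True when the endpoint serves exactly the cert you expect."""
    return served.lower() == expected.lower()
```

Call `served_fingerprint` against every consumer in the inventory, not just the public hostname, and compare each result to `expected_fingerprint` of the cert in your source of truth. Behind a load balancer a single probe only hits one backend, so probe repeatedly or address backends directly.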

The pages you'll get at 2am in 2029 won't be "the CA failed to renew." They'll be "the renewal happened five days ago, but consumer seven is still serving the old cert and a client just hit the expiry." Design the verification layer now, while you have time to do it properly, and the 47-day cadence stops being a recurring incident.

The teams that come out of this transition clean are the ones who took the boring inventory step seriously, two years before they had to.

This is why we built CertPulse

CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.

If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.
