Certificate Automation: A Practical Guide for Platform Engineers Managing Hundreds of Certs

Last year, a team I worked with had 347 certificates across three cloud providers and a handful of on-prem appliances. They knew about 280 of them. The other 67 surfaced during an audit after a wildcard cert expired on an internal load balancer at 2:47am on a Saturday. Nobody got paged because nobody had monitoring on that endpoint.

Certificate automation isn't just scripting certbot renew on a cron job. It's the full operational pipeline for TLS certificate management without a human in the critical path: discovery, issuance, deployment, rotation, revocation, and monitoring. This guide covers what that actually looks like when you're managing hundreds of certs across mixed infrastructure.

What certificate automation actually means in practice

Certificate automation is the process of programmatically handling every stage of a TLS/SSL certificate's lifecycle — discovery, issuance, deployment, rotation, revocation, and monitoring — without manual intervention at any stage. According to a 2024 Ponemon Institute report, 67% of organizations experienced a certificate-related outage in the past two years. Most of those outages were preventable with proper automation.

Beyond renewal: the full lifecycle

When vendors say "automated certificate management," they usually mean automated renewal. Renewal is roughly 20% of the problem. After managing certificate pipelines across hundreds of environments, I've found the full certificate lifecycle management pipeline breaks down into six distinct stages:

Discovery: finding every certificate across your infrastructure, including ones you didn't know about
Issuance: requesting and receiving certificates from CAs or internal PKI
Deployment: getting the certificate to every endpoint that needs it — load balancers, CDNs, API gateways, service mesh sidecars
Rotation: replacing certificates before expiry without downtime
Revocation: invalidating compromised certificates immediately
Monitoring: validating that the correct certificate is actually serving on every endpoint, continuously

Most teams automate renewal and call it done. Then a certificate rotates on disk but the reverse proxy never reloads, and they're back to a 2am page. In my experience, the deployment and monitoring stages are where certificate renewal automation actually falls apart.

Why manual cert management breaks at ~50 certificates

Manual certificate management becomes unreliable once an organization exceeds approximately 50 certificates. At 10 certificates, spreadsheets and calendar reminders work fine. At 50, things crack. At 200, they shatter.

The failure modes are predictable:

Someone leaves the company and their name sits in the "owner" column of a spreadsheet nobody has updated in 8 months
A staging environment uses a cert copied from production, and nobody remembers it exists until it expires and breaks the CI pipeline
A team provisions a new service with a cert from a different CA, creating two renewal processes to maintain

According to Gartner, the average enterprise manages over 50,000 machine identities, growing 20% annually. Even at mid-market scale with 200–2,000 certificates, the combinatorial complexity of tracking expiry dates, owners, deployment targets, and CA relationships exceeds what any human can reliably manage.

The 4 approaches to certificate automation

There are four distinct approaches to automating certificate management: ACME protocol, vendor API integration, infrastructure-native tools, and custom scripts. Each carries real tradeoffs in cost, flexibility, and operational complexity. No single approach works for every environment, and most production setups combine two or three.

ACME protocol (Let's Encrypt, ZeroSSL, Google Trust Services)

The ACME protocol, defined in RFC 8555, is the closest thing to a universal standard for automated SSL/TLS certificate issuance and renewal. ACME clients like Certbot, acme.sh, and lego handle the heavy lifting. You configure DNS or HTTP challenges, point at a CA, and certificates renew automatically.

The tradeoffs are real:

The free public ACME CAs issue only domain-validated (DV) certificates
DNS-01 challenges require API access to your DNS provider; if that provider has an outage, renewals fail, often silently
Rate limits apply (Let's Encrypt enforces 50 certificates per registered domain per week)

You can read more about how ACME works in production, including challenge types and rate limit gotchas.
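To make the challenge step less abstract, here is a minimal sketch of the values an ACME client computes for HTTP-01 and DNS-01, following the key authorization rules in RFC 8555. The function names are illustrative, and the token and account thumbprint are assumed to be supplied by the client and CA; a real client such as Certbot or lego handles all of this for you.

```python
import base64
import hashlib


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the encoding ACME uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def key_authorization(token: str, account_thumbprint: str) -> str:
    """HTTP-01 response body: the challenge token joined to the account key thumbprint."""
    return f"{token}.{account_thumbprint}"


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """DNS-01 TXT record value: base64url(SHA-256(key authorization))."""
    key_auth = key_authorization(token, account_thumbprint).encode("ascii")
    return b64url(hashlib.sha256(key_auth).digest())


def dns01_record_name(domain: str) -> str:
    """The TXT record lives under _acme-challenge; a wildcard validates its base name."""
    base = domain[2:] if domain.startswith("*.") else domain
    return "_acme-challenge." + base
```

The DNS-01 path is the one that needs API access to your DNS provider, because the client has to publish that TXT record before the CA looks for it.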

Vendor API integration (DigiCert, Sectigo, Entrust)

Commercial CAs like DigiCert, Sectigo, and Entrust offer REST APIs for certificate lifecycle operations. DigiCert's CertCentral API and Sectigo's SCM API support OV/EV issuance, which the free ACME CAs don't offer. That matters when regulatory or compliance requirements call for organization or extended validation.

The downsides:

Vendor lock-in: each CA has a different API, authentication model, and rate limits
Cost: ranges from $10 to $300+ per certificate per year
Migration complexity: switching CAs means rewriting your automation layer

Infrastructure-native tools (cert-manager, AWS ACM, Azure Key Vault)

For organizations in a single cloud or running Kubernetes-native workloads, infrastructure-native tools provide the path of least resistance:

cert-manager handles issuance and renewal inside Kubernetes clusters with Issuer resources supporting both ACME and private CAs
AWS ACM provides free public certificates that auto-renew and deploy to ALBs and CloudFront distributions
Azure Key Vault handles certificate storage and rotation for Azure services

The catch: these tools don't cross boundaries well. AWS ACM's free public certificates can't be exported for use outside AWS-integrated services. cert-manager doesn't manage F5 load balancers. For multi-cloud or hybrid environments, you'll need an additional orchestration layer on top.

Custom scripts and cron jobs (and why they rot)

Every infrastructure team has a renew_certs.sh sitting in a repo somewhere. It worked when one person wrote it for 12 certificates. Then that person left, the script grew to 400 lines with hardcoded paths, and nobody touches it because nobody understands it.

According to a 2023 Venafi survey, 38% of organizations still rely on scripts or spreadsheets for certificate management. These scripts rot because they:

Lack error handling for edge cases
Don't surface failures visibly
Encode assumptions about infrastructure that quietly become wrong over time

Criteria	ACME	Vendor API	Infrastructure-native	Custom scripts
Cost	Free	$10–300+/cert/yr	Free (cloud)	Engineering time
Cert types	DV only (free CAs)	DV, OV, EV	DV (varies)	Any
Multi-cloud	Yes	Yes	No	Manual effort
Internal CA	Limited	No	Some (cert-manager)	Manual effort
Maintenance	Low	Medium	Low	High
Failure visibility	Good	Good	Good	Poor

Automating public vs. internal certificates

Internal certificate management is typically the harder problem at mid-market scale, despite public certificate automation getting most of the attention. Organizations with 500 public-facing certs often have 2,000+ internal certificates for mTLS, code signing, and client authentication — with little to no automation covering them.

Public certificate automation with ACME and CAs

Public certificate automation is a largely solved problem because the tooling is mature. ACME with Let's Encrypt or Google Trust Services handles the majority of use cases. For the 10–15% of public certs requiring OV/EV validation, vendor APIs from DigiCert or Sectigo fill the gap. The main challenge is deployment breadth: a single domain might need its certificate deployed simultaneously to an ALB, a CloudFront distribution, and an on-prem reverse proxy.

Internal PKI automation with private CAs

Private CA automation requires running or consuming a CA service, then building issuance, distribution, and rotation around it. The primary tooling options include:

Smallstep step-ca: open-source, ACME-compatible private CA that works well for mTLS automation between services
HashiCorp Vault PKI secrets engine: generates short-lived certificates on demand, strong for service-to-service auth but requires Vault operational expertise
AWS Private CA: managed service at $400/month per CA, integrates with ACM but costs compound fast with multiple CAs
EJBCA: enterprise-grade open-source option, powerful but complex to operate

In my experience managing certificate infrastructure across mixed environments, the reason internal cert automation lags behind isn't tooling — it's ownership. Public certs have clear owners. Internal certs for mTLS between microservices often fall between platform engineering, security, and application teams. Nobody automates what nobody owns.

Building a certificate automation pipeline

A certificate automation pipeline connects four stages into a continuous loop: discover what you have, issue what you need, deploy where it belongs, and monitor that it's working. According to the 2024 State of Machine Identity report, organizations that implement end-to-end certificate pipeline automation reduce certificate-related outages by up to 90%.

Discovery: finding every certificate you have

Certificate discovery — the process of scanning your entire infrastructure to build a complete certificate inventory — is the required first step in any automation pipeline. You can't automate what you haven't found. Key discovery methods include:

Network scanning with sslyze or nmap across your IP ranges
Cloud API queries to ACM, Azure Key Vault, and GCP Certificate Manager to enumerate managed certificates
Kubernetes resource parsing of Ingress and Gateway resources for TLS references
Certificate transparency log monitoring for your domains to catch certificates issued outside your normal process
Configuration audits of load balancer configs and configuration management databases

Run discovery continuously, not once. New certificates appear weekly as teams provision services. A quarterly audit finds problems months too late.
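The merge step behind a continuous inventory can be sketched as follows. This is a minimal illustration, assuming each scanner (network scan, cloud API, Kubernetes parser) emits tuples of source, location, SHA-256 fingerprint, and common name; the record shape and function names are hypothetical, not part of any tool named above.

```python
from dataclasses import dataclass, field


@dataclass
class CertRecord:
    fingerprint: str                      # SHA-256 of the DER certificate, lowercase hex
    common_name: str
    sources: set = field(default_factory=set)    # which discovery methods saw it
    locations: set = field(default_factory=set)  # where it was seen


def merge_findings(findings):
    """Collapse (source, location, fingerprint, common_name) tuples into one record per cert."""
    inventory = {}
    for source, location, fingerprint, common_name in findings:
        key = fingerprint.lower()
        record = inventory.setdefault(key, CertRecord(key, common_name))
        record.sources.add(source)
        record.locations.add(location)
    return inventory


def untracked(inventory, known_fingerprints):
    """Fingerprints found by discovery that are missing from the existing inventory."""
    known = {fp.lower() for fp in known_fingerprints}
    return sorted(fp for fp in inventory if fp not in known)
```

Keying on the fingerprint means the same certificate found by a network scan and a cloud API query shows up once, with both locations attached, and anything returned by untracked() is a certificate nobody knew about.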

Issuance and deployment: GitOps and infrastructure as code

Certificate issuance should be declarative, managed through GitOps and infrastructure as code workflows. In Kubernetes, cert-manager lets you define a Certificate resource in YAML, commit it to Git, and let the controller handle issuance and renewal. For cloud resources, Terraform's aws_acm_certificate and azurerm_key_vault_certificate resources bring certificate automation into your existing IaC pipeline.

The harder part is deployment to endpoints that don't natively integrate:

On-prem load balancers: Ansible playbooks push certificates and trigger reloads
CDNs and SaaS platforms: custom deployment scripts via vendor APIs fill the gap

Make deployment idempotent and verifiable: deploy the cert, then confirm it's actually serving.
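One way to do the "confirm it's actually serving" half is to compare the fingerprint of the certificate you just deployed with the one the endpoint presents. The sketch below uses only the Python standard library; the function names are illustrative, and it assumes the PEM file lists the leaf certificate first, as a typical fullchain file does.

```python
import hashlib
import socket
import ssl

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def leaf_fingerprint_from_pem(pem_text: str) -> str:
    """SHA-256 fingerprint of the first certificate in a PEM bundle."""
    start = pem_text.index(PEM_BEGIN)
    stop = pem_text.index(PEM_END, start) + len(PEM_END)
    der = ssl.PEM_cert_to_DER_cert(pem_text[start:stop])
    return hashlib.sha256(der).hexdigest()


def served_fingerprint(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """SHA-256 fingerprint of the certificate the endpoint presents right now."""
    # Verification is disabled on purpose: an expired or mismatched cert
    # should still be fetched so it can be compared, not rejected.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der).hexdigest()


def deployment_verified(pem_path: str, host: str, port: int = 443) -> bool:
    """True when the endpoint serves the same leaf certificate that is on disk."""
    with open(pem_path, encoding="ascii") as handle:
        expected = leaf_fingerprint_from_pem(handle.read())
    return served_fingerprint(host, port) == expected
```

Running deployment_verified() as the last step of a deploy job, against every endpoint that should carry the certificate, turns "the push succeeded" into "the new certificate is live".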

Monitoring and alerting: catching what automation misses

Certificate monitoring validates that your automation pipeline is working and catches the exceptions that slip through. Automation tends to fail silently, which is what makes monitoring essential. Set up expiry alerting at multiple thresholds:

30 days before expiry: informational — triggers investigation if auto-renewal hasn't fired
14 days before expiry: warning — something in the automation pipeline is likely broken
7 days before expiry: critical — manual intervention required

The failure mode that catches most teams is when a certificate renews but never deploys to the endpoint actually serving traffic. Your monitoring needs to check what's being served over the network, not just what's on the filesystem.
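A network-level check that applies the 30/14/7-day thresholds above might look like the following sketch. It uses the Python standard library only; the function names and the choice to treat any TLS or connection failure as critical are assumptions for illustration, not a description of any particular monitoring product.

```python
import socket
import ssl
import time


def days_until_expiry(host, port=443, timeout=5.0, now=None):
    """Days remaining on the certificate the endpoint is serving over the network."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def classify(days_left):
    """Map days remaining onto the alert levels described above."""
    if days_left <= 7:
        return "critical"
    if days_left <= 14:
        return "warning"
    if days_left <= 30:
        return "info"
    return "ok"


def check_endpoint(host, port=443):
    """Alert level for one endpoint; an expired or unreachable endpoint is critical."""
    try:
        return classify(days_until_expiry(host, port))
    except OSError:  # ssl.SSLError, including verification failures, is a subclass
        return "critical"
```

Because the check opens a real TLS connection, it reports on the certificate held in the server's memory, which is exactly the case a filesystem check misses.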

Preparing for 47-day certificate lifetimes

The CA/Browser Forum approved Ballot SC-081, reducing maximum public TLS certificate lifetimes from 398 days to 47 days by March 2029. This is a ratified decision with a fixed timeline. Any certificate management process that involves a human clicking buttons in a web portal will break under this requirement.

What the CA/Browser Forum change means

The reduction to 47-day certificate lifetimes happens in three phases:

March 2026: maximum certificate lifetime drops to 200 days
March 2027: maximum certificate lifetime drops to 100 days
March 2029: maximum certificate lifetime drops to 47 days

Domain validation reuse periods shrink on the same schedule, reaching 10 days by 2029. We've written up the full timeline and what each phase requires.

At 47-day lifetimes, a certificate issued on day one expires before most teams complete a monthly change management cycle. There's no room for manual processes, vacation coverage gaps, or "we'll get to it next sprint."
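A quick back-of-the-envelope calculation shows how the renewal workload scales across the three phases. The sketch assumes each certificate is renewed two-thirds of the way through its lifetime, which is a common client default but not something the ballot mandates, and the function names are illustrative.

```python
# Maximum lifetimes from the phased schedule above, in days.
MAX_LIFETIME_DAYS = {
    "today": 398,
    "March 2026": 200,
    "March 2027": 100,
    "March 2029": 47,
}

# Assumption: renew after two-thirds of the lifetime has elapsed.
RENEW_FRACTION = 2 / 3


def renewals_per_year(lifetime_days, renew_fraction=RENEW_FRACTION):
    """Renewals per certificate per year if every cert runs at the maximum lifetime."""
    return 365 / (lifetime_days * renew_fraction)


def fleet_renewals_per_year(cert_count, lifetime_days):
    """Total renewals per year across a fleet of cert_count certificates."""
    return cert_count * renewals_per_year(lifetime_days)


def workload_table(cert_count):
    """(phase, per-cert renewals, fleet renewals) for each phase, rounded for display."""
    return [
        (phase, round(renewals_per_year(days), 1), round(fleet_renewals_per_year(cert_count, days)))
        for phase, days in MAX_LIFETIME_DAYS.items()
    ]
```

Under these assumptions a 47-day lifetime works out to between 11 and 12 renewals per certificate per year, compared with between one and two at 398 days. For a fleet of a few hundred certificates, that moves renewal from an occasional event to something happening most days.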

What breaks when cert lifetimes shrink

The systems most at risk from 47-day certificate lifetimes aren't Kubernetes clusters or cloud load balancers — those already have automation. The risk sits in the long tail:

Legacy appliances (F5, NetScaler) that require manual cert uploads through a web UI
IoT devices and embedded systems with hardcoded certificates
Third-party SaaS integrations where you upload a cert through a vendor portal
Internal services running on VMs that nobody has touched in two years
Client certificates distributed to partners with no automated rotation path

Start inventorying these systems now. Each one needs either an automation path or an architectural change — like moving TLS termination to a proxy that supports automation — before short-lived certificates become mandatory.

Common mistakes that break certificate automation

Certificate automation fails most often after initial setup, when teams assume the pipeline is working and stop watching. After monitoring 347+ certificates across mixed infrastructure, these are the failure patterns I see repeatedly.

DNS and challenge infrastructure failures

ACME DNS-01 challenge failures are the leading cause of silent renewal breakdowns. These challenges depend on your DNS provider's API being available and DNS propagation completing before the CA validates. If your ACME client's propagation timeout is shorter than your provider's actual propagation time, challenges fail intermittently. According to Let's Encrypt data, DNS challenge failures account for roughly 15% of all failed validations.

The fix:

Configure generous propagation timeouts (120–180 seconds)
Use a DNS provider with fast propagation (Cloudflare, Route 53)
Implement retry logic with exponential backoff
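The retry item can be sketched generically. This is a minimal example, assuming the renewal attempt is wrapped in a callable that raises on failure; the parameter defaults are placeholders to tune, and the sleep function is injectable so the logic can be exercised without waiting.

```python
import random
import time


def retry_with_backoff(operation, attempts=5, base_delay=2.0, max_delay=300.0, sleep=time.sleep):
    """Call operation() until it succeeds, doubling the wait after each failure.

    Delays are capped at max_delay and padded with random jitter so that many
    certificates failing at once do not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure instead of hiding it
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

Re-raising on the final attempt matters as much as the retries themselves: a renewal that exhausts its attempts should produce an alert, not a quiet log line.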

Rate limits and blast radius

Let's Encrypt enforces a limit of 50 certificates per registered domain per week. If your automation renews all certificates simultaneously — because they were all issued on the same day — you can hit rate limits and leave some certs un-renewed.

The fix: stagger renewal windows. Distribute certificate issuance dates across the renewal period so you never hit rate limits during normal operations. Keep a buffer for emergency re-issuance.
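One simple way to stagger is to derive a stable per-certificate offset from the certificate's name, so certificates issued on the same day drift apart at their next renewal. The sketch below is illustrative: the 14-day window and the day-60 nominal renewal point (for a 90-day certificate) are assumptions, and the function names are hypothetical.

```python
import hashlib


def renewal_offset_days(name: str, window_days: int = 14) -> int:
    """Stable pseudo-random offset in [0, window_days) derived from the cert name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_days


def renew_on_day(name: str, nominal_renew_day: int = 60, window_days: int = 14) -> int:
    """Day of the certificate's life on which to renew.

    The offset is subtracted, so staggering only ever renews earlier than the
    nominal point and never eats into the safety margin before expiry.
    """
    return nominal_renew_day - renewal_offset_days(name, window_days)
```

Because the offset comes from a hash rather than a random draw, the same certificate lands on the same day every run, which keeps the schedule predictable and easy to audit.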

The cert rotated but the service didn't reload

This is the single most common certificate rotation failure and the hardest to detect. Certbot writes the new certificate to /etc/letsencrypt/live/. Nginx continues serving the old certificate from memory because nobody ran nginx -s reload. The cert on disk is valid. The cert being served is expired.

The fix:

Post-renewal hooks that trigger service reloads (e.g., systemctl reload nginx)
Endpoint monitoring that checks what's actually being served over the network, not just what's on the filesystem
In Kubernetes: cert-manager handles this better because Secrets mounted as volumes update automatically, but even there, some applications cache TLS contexts and need a restart

Where to start

If you're managing more than 50 certificates and still relying on manual processes or aging scripts, start with discovery. Build a complete inventory, identify which automation approach fits each certificate type, and implement certificate monitoring before you implement automation. Knowing when things break is more immediately valuable than preventing all breakage.

Certificate automation is an ongoing operational practice, not a one-time project. The 47-day lifetime deadline gives every team a hard date to work toward, but organizations that start now will spend the next three years iterating calmly instead of scrambling in 2028. CertPulse gives you visibility into every certificate across your infrastructure so you can see the full picture before you start automating.

Frequently asked questions

What is certificate automation? Certificate automation is the practice of programmatically managing the full lifecycle of TLS/SSL certificates — including discovery, issuance, deployment, renewal, rotation, revocation, and monitoring — without requiring manual intervention at each stage. According to the 2024 Ponemon Institute report, 67% of organizations experienced a certificate-related outage in the past two years, making automation essential.

How do I automate Let's Encrypt certificate renewal? Automate Let's Encrypt certificate renewal using an ACME client like Certbot, acme.sh, or lego configured with either HTTP-01 or DNS-01 challenges. Set up a systemd timer to run the renewal command daily and configure post-renewal hooks to reload services (e.g., systemctl reload nginx). In Kubernetes, cert-manager automates the entire ACME process declaratively through Certificate resources.

What is the 47-day certificate lifetime change? The CA/Browser Forum approved Ballot SC-081, reducing maximum public TLS certificate lifetimes from 398 days to 47 days by March 2029. The change phases in across three milestones: 200 days by March 2026, 100 days by March 2027, and 47 days by March 2029. Domain validation reuse periods shrink on the same schedule, reaching 10 days by 2029.

How do I automate internal certificates and mTLS? Automate internal certificates and mTLS using a private CA solution: Smallstep's step-ca (open-source, ACME-compatible), HashiCorp Vault's PKI secrets engine (short-lived certs on demand), or AWS Private CA ($400/month per CA, managed). These tools integrate with service meshes and IaC pipelines to issue and rotate internal certificates automatically.

What's the most common certificate automation failure? The most common certificate automation failure is when the certificate renews on disk but the service never reloads the new cert. Your automation reports success while the endpoint continues serving an expired certificate. Fix this with post-renewal reload hooks and endpoint-level monitoring that checks what's actually being served over the network, not just what's on the filesystem.

This is why we built CertPulse

CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.

If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.
