SSL Certificate Management: A Practitioner's Guide for Platform and DevOps Teams

Most teams don't think about SSL certificate management until a certificate expires and something breaks in production. Maybe it's a payment gateway that starts rejecting connections at 2am, or a wildcard cert that silently expired on a load balancer nobody remembered existed. The discipline of managing certificates only feels urgent after the first outage. By then, you're already behind.

This guide covers how platform and DevOps teams actually operate certificate infrastructure at mid-market scale, from discovery through automation, with specific tooling comparisons and an implementation playbook you can start executing this week.

What SSL certificate management actually involves at scale

SSL certificate management is the operational practice of discovering, inventorying, issuing, deploying, monitoring, renewing, and revoking every TLS certificate across your infrastructure. At 50+ certificates, it stops being a task and becomes a system that either runs itself or eventually fails.

Beyond the textbook definition

The textbook version of certificate lifecycle management describes a neat loop: generate a CSR, get it signed, install the cert, renew before expiry. That loop describes one certificate on one server. It doesn't describe reality at a company with 200 engineers, three cloud providers, a Kubernetes cluster running cert-manager, a legacy on-prem HAProxy that someone hand-configured in 2019, and a marketing team that bought their own domain and pointed it at a Netlify deploy.

The actual scope includes certificates you don't know about. According to a 2024 Ponemon Institute study, 62% of organizations say they don't know exactly how many certificates they have. After conducting discovery audits across multiple enterprise environments, I can confirm that number tracks: every discovery audit I've been part of has surfaced 15–25% more certificates than the team expected.

The real scope: discovery, tracking, renewal, revocation

The full certificate lifecycle breaks down into six phases that compound in complexity as certificate count grows:

Discovery: finding every certificate across cloud providers, CDNs, load balancers, container orchestrators, and internal services
Inventory: mapping each cert to an owner, environment, and expiry date
Issuance and deployment: getting new certs signed and installed without manual steps
Monitoring: tracking expiry, chain validity, key strength, and revocation status
Renewal: automating the re-issuance cycle before anything expires
Revocation: invalidating compromised certs and rotating the underlying keys

At 10 certificates, a spreadsheet works. At 200, it doesn't. The difference isn't just volume — it's that the failure modes shift from "I forgot to renew" to "I didn't know that cert existed."

Why certificate management breaks down at 50+ certificates

Manual certificate tracking fails at scale for three specific reasons: renewal volume exceeds what humans can reliably calendar, infrastructure sprawl exceeds what any single person can see, and the industry is actively shortening certificate lifespans.

Spreadsheet tracking and its failure modes

Spreadsheet-based certificate tracking breaks when any of these conditions hit — and at 50+ certs, at least one always does:

An employee leaves the company and their name is on 30 certificates
A team provisions certificates through Terraform without updating the sheet
Three tabs maintained by three different people contain conflicting data
New infrastructure gets deployed without anyone logging the cert

The core issue isn't the spreadsheet format. Any manually maintained inventory drifts from reality within weeks. Certificate discovery tools exist specifically because static inventories can't keep up with dynamic infrastructure.

Multi-cloud and hybrid environments

Most mid-market teams run certificates across at least two of the following platforms, each with its own API, renewal logic, and alerting model:

Platform	Auto-Renewal Behavior	Key Limitation
AWS ACM	Auto-renews for ALB, CloudFront, API Gateway	Only works with AWS-attached resources
Azure Key Vault	Supports DigiCert/GlobalSign integration	Renewal workflows are clunky, limited ACME support
GCP Certificate Manager	Integrates with Google Cloud load balancing	Newer, fewer integrations than ACM or Key Vault
Kubernetes cert-manager	Handles in-cluster certs via ACME or internal CAs	Does not cover anything outside the cluster
On-prem load balancers	No auto-renewal	Requires manual or scripted renewal
CDNs (Cloudflare, Fastly)	Own certificate stores with separate renewal	Siloed from central management

Auditing certificates across dozens of AWS accounts alone is a project. Multiply that by every provider in your stack. Certificate expiration monitoring across all of these requires either a purpose-built tool or a fragile collection of scripts and cron jobs.

The shift to shorter certificate lifespans

The CA/Browser Forum has voted to move the entire industry to 47-day maximum certificate lifespans by March 2029. Here's what that means in concrete renewal volume for a team managing 200 certificates:

Certificate Lifespan	Renewal Events per Year	Renewals per Day
1 year (365 days)	200	~0.5
90 days (Let's Encrypt standard)	800+	~2.2
47 days (March 2029 mandate)	~1,600	~4.4

At 1,600 renewals per year, you're processing more than 4 per day, every day, including weekends. Manual SSL certificate renewal stops being tedious and starts being impossible. Automation isn't a nice-to-have at these volumes — it's a prerequisite for keeping services online.
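The arithmetic behind the table can be reproduced with a short sketch. It assumes every certificate is renewed exactly once per lifespan; ACME clients typically renew well before expiry, so real volume runs somewhat higher. The function name is illustrative. Computed this way, the 47-day row comes out at roughly 1,553 renewals per year, which the table rounds up.

```python
def renewal_load(cert_count: int, lifespan_days: int) -> tuple[float, float]:
    """Return (renewals per year, renewals per day) for a fleet of certificates.

    Assumes each certificate is renewed exactly once per lifespan.
    """
    per_year = cert_count * 365 / lifespan_days
    return per_year, per_year / 365


# Print the load for the three lifespans discussed in the table.
for days in (365, 90, 47):
    per_year, per_day = renewal_load(200, days)
    print(f"{days:>3}-day certs: {per_year:6.0f} renewals/year, {per_day:.1f}/day")
```

Swap in your own certificate count to see where your team lands.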

Core components of an SSL certificate management strategy

A working certificate management strategy requires four capabilities: automated discovery, centralized inventory with team ownership, automated renewal via ACME or native integrations, and alerting that escalates before expiry becomes an outage.

Automated discovery and inventory

Certificate discovery means finding certificates you didn't know about. The three primary discovery approaches are:

CT log monitoring: Certificate Transparency logs reveal certificates issued for your domains, including unauthorized ones
Network scanning: probing your IP ranges and DNS records to find TLS endpoints
Cloud API integration: querying AWS ACM, Azure Key Vault, and GCP Certificate Manager APIs to enumerate managed certificates
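As a rough illustration of the CT log approach, the sketch below builds a query against crt.sh, a public CT search service that this guide does not otherwise cover, and extracts the unique names from its JSON response. The `name_value` field reflects crt.sh's JSON output as I understand it and may change; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def ct_search_url(domain: str) -> str:
    """Build a crt.sh query URL for names under the given domain."""
    query = urllib.parse.quote(f"%.{domain}")  # "%" is the wildcard, sent as %25
    return f"https://crt.sh/?q={query}&output=json"


def names_from_ct_entries(entries: list[dict]) -> set[str]:
    """Collect the unique DNS names found in a list of crt.sh JSON entries."""
    names: set[str] = set()
    for entry in entries:
        # name_value may hold several names separated by newlines
        for name in entry.get("name_value", "").splitlines():
            names.add(name.strip().lower())
    names.discard("")
    return names


def discover(domain: str) -> set[str]:
    """Fetch CT entries for a domain over the network and return the names seen."""
    with urllib.request.urlopen(ct_search_url(domain), timeout=30) as resp:
        return names_from_ct_entries(json.load(resp))
```

Any name in the result that is missing from your inventory is a candidate for follow-up, whether it turns out to be forgotten infrastructure or an unauthorized issuance.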

A certificate inventory should track these fields for every certificate:

Domain and SANs
Issuing CA
Expiry date
Key algorithm and length
Owning team (not individual)
Environment
Renewal method

Ownership mapped to teams survives employee turnover. Ownership mapped to individuals doesn't.
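One way to represent that inventory record is a small data class. This is a minimal sketch: the field names and example values are illustrative, not a schema from any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CertRecord:
    """One row in the certificate inventory (field names are illustrative)."""
    domain: str
    sans: list[str]
    issuing_ca: str
    expires_on: date
    key_algorithm: str           # e.g. "RSA" or "ECDSA"
    key_bits: int                # e.g. 2048 for RSA, 256 for P-256
    owning_team: Optional[str]   # a team, never an individual; None means orphaned
    environment: str             # e.g. "prod", "staging"
    renewal_method: str          # e.g. "acme", "cloud-native", "manual"

    def days_left(self, today: date) -> int:
        """Days until expiry; negative once the certificate has expired."""
        return (self.expires_on - today).days

    def is_orphaned(self) -> bool:
        """True when no team has claimed the certificate."""
        return not self.owning_team
```

Keeping `owning_team` optional makes orphaned certificates explicit rather than hiding them behind a placeholder name.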

Policy enforcement and approval workflows

Certificate policy enforcement covers the minimum security standards every certificate must meet. According to NIST SP 800-52 Rev. 2, TLS 1.2 is the minimum acceptable version. Certificate policies should enforce:

Minimum RSA 2048-bit or ECDSA P-256 keys
No SHA-1 signatures
SANs that match your approved domain list
Maximum validity periods aligned with CA/Browser Forum requirements
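The rules above could be expressed as a simple check that runs against each inventory record. This is a sketch under stated assumptions: the function name, parameters, and the suffix-based domain match are all illustrative, and the maximum validity is passed in rather than hard-coded.

```python
def policy_violations(
    key_algorithm: str,
    key_bits: int,
    signature_hash: str,
    sans: list[str],
    validity_days: int,
    approved_domains: set[str],
    max_validity_days: int,
) -> list[str]:
    """Return a human-readable list of policy violations (empty if compliant)."""
    problems = []
    minimum_bits = {"RSA": 2048, "ECDSA": 256}  # ECDSA 256 corresponds to P-256
    algo = key_algorithm.upper()
    if algo not in minimum_bits:
        problems.append(f"unapproved key algorithm: {key_algorithm}")
    elif key_bits < minimum_bits[algo]:
        problems.append(f"{algo} key too short: {key_bits} bits")
    if signature_hash.lower().replace("-", "") == "sha1":
        problems.append("SHA-1 signature")
    for san in sans:
        # Treat "*.example.com" like "example.com" for the domain match
        name = san.lower().removeprefix("*.")
        if not any(name == d or name.endswith("." + d) for d in approved_domains):
            problems.append(f"SAN outside approved domains: {san}")
    if validity_days > max_validity_days:
        problems.append(f"validity {validity_days}d exceeds {max_validity_days}d")
    return problems
```

Returning every violation, rather than stopping at the first, gives the owning team one complete list to fix.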

Automated renewal with ACME and native CA integrations

The ACME protocol is the industry standard for automated certificate management. Here's how the major tools handle ACME-based renewal:

cert-manager handles ACME natively in Kubernetes, covering ~90% of in-cluster use cases
Certbot handles ACME on VMs and bare-metal servers
AWS ACM, Azure Key Vault, and GCP Certificate Manager auto-renew their own managed certs

The automation gap lives in everything between these tools: internal CA certs, certs on legacy appliances, and certs on third-party SaaS platforms that don't support ACME.

Alerting, escalation, and incident response

Certificate monitoring should watch for more than just expiry dates. After managing certificate infrastructure across hundreds of environments, I've found these five alert types catch the failures that cause outages:

Certificates expiring within 30, 14, and 7 days
Renewal success without deployment confirmation
Weak keys or signature algorithms (RSA 1024, SHA-1)
Unexpected certificate issuance detected via CT log anomalies
OCSP stapling failures across your endpoints

Alerts should route to the owning team in Slack or PagerDuty, not a shared inbox.
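The expiry thresholds and team routing can be sketched as follows. The tuple layout and the `platform-oncall` fallback for unowned certificates are assumptions for illustration; delivery to Slack or PagerDuty is left out.

```python
from collections import defaultdict
from datetime import date
from typing import Optional

THRESHOLDS_DAYS = (7, 14, 30)  # tightest window first


def expiry_alert(expires_on: date, today: date) -> Optional[str]:
    """Return the tightest alert window the certificate falls into, if any."""
    days_left = (expires_on - today).days
    if days_left < 0:
        return "expired"
    for window in THRESHOLDS_DAYS:
        if days_left <= window:
            return f"expires within {window} days"
    return None


def alerts_by_team(certs, today: date, fallback: str = "platform-oncall") -> dict:
    """Group alert messages by owning team so each goes to that team's channel.

    certs is an iterable of (domain, owning_team, expires_on) tuples.
    """
    grouped = defaultdict(list)
    for domain, team, expires_on in certs:
        alert = expiry_alert(expires_on, today)
        if alert:
            # Unowned certificates still need a destination, so use the fallback
            grouped[team or fallback].append(f"{domain}: {alert}")
    return dict(grouped)
```

Sending unowned certificates to a named fallback team keeps them from landing in the shared inbox this section warns against.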

Build-vs-buy decision matrix

The right approach depends on your certificate count and infrastructure complexity:

Scale	Recommended Approach	Build Cost	Maintenance Cost
50–100 certs, single cloud	Cloud-native tools (ACM, Key Vault) + cert-manager for Kubernetes	Low	Low
100–500 certs, multi-cloud	Certificate management platform that aggregates across providers	1–2 engineers part-time	Medium
500–2,000+ certs, hybrid	Commercial CLM or dedicated internal platform	2–4 engineering months	Permanent line item

Tooling landscape: open source, cloud-native, and commercial options

No single tool covers every certificate management scenario. The right choice depends on where your certs live, how your team operates, and what you're willing to pay.

Cloud provider native tools

AWS ACM, Azure Key Vault, and GCP Certificate Manager are free and auto-renew within their own ecosystems. They fall apart the moment you need a certificate on something outside that cloud. Key tradeoffs:

AWS ACM auto-renews for ALB, CloudFront, and API Gateway, but by default you cannot export the private key of a public certificate, which ties those certificates to AWS services
Azure Key Vault manages certificates and secrets together with DigiCert and GlobalSign integration, but renewal workflows are clunky and ACME support is limited
GCP Certificate Manager integrates with Google Cloud load balancing but offers fewer integrations than ACM or Key Vault

Open source: cert-manager, step-ca, Boulder

cert-manager: the standard for Kubernetes certificate automation. Supports ACME, Venafi, Vault, and custom issuers. Covers ~90% of in-cluster use cases but does not cover anything outside the cluster.
step-ca: a private CA for internal PKI, useful for mTLS and service mesh certificates. Requires you to operate your own CA infrastructure.
Boulder: the ACME CA server that powers Let's Encrypt. Overkill for most teams, but relevant if you're building an internal ACME-based PKI.

Commercial CLM platforms

Venafi, Sectigo, DigiCert Trust Lifecycle Manager, and AppViewX target enterprise teams with 1,000+ certificates. These platforms offer broad integrations, compliance reporting, and multi-CA support. Industry pricing typically starts at $50K+ annually, which puts them out of reach for many mid-market teams. Keyfactor and Smallstep occupy a middle ground with more accessible pricing.

When you need more than one tool

Most mid-market teams end up running a combination: cert-manager for Kubernetes, ACM or Key Vault for cloud-native resources, and something else for everything that doesn't fit. The "something else" is where the pain lives — it might be a collection of Certbot cron jobs, a custom Go service that wraps ACME, or a monitoring tool like CertPulse that aggregates visibility across all of the above.

Implementation playbook: from chaos to automated certificate management

Moving from manual certificate tracking to automated certificate management takes four phases. Based on implementations I've led, expect 6–10 weeks for a team managing 500 certificates — not the 30-minute onboarding that vendor marketing pages promise.

Phase 1: discovery and audit (weeks 1–2)

Run discovery across every environment using three methods simultaneously:

CT log queries for all your registered domains
Cloud provider API enumeration across ACM, Key Vault, and GCP
Network scanning for on-prem and legacy assets

Document every certificate you find, including the ones nobody claims. A team with 500 known certs should expect to find 575–625 actual certs during discovery. That 15–25% gap is normal and consistent across every audit I've participated in.
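Reconciling what discovery found against what the team already tracked is a set comparison. The sketch below assumes each certificate is identified by a fingerprint string; the function names and the result keys are illustrative.

```python
def reconcile(known: set[str], discovered: set[str]) -> dict[str, set[str]]:
    """Compare inventory fingerprints against fingerprints found by discovery."""
    return {
        "unknown": discovered - known,    # live certs missing from the inventory
        "stale": known - discovered,      # inventory rows no scan could find
        "confirmed": known & discovered,  # tracked and seen in the wild
    }


def discovery_gap_percent(known: set[str], discovered: set[str]) -> float:
    """Unknown certificates as a percentage of the known inventory."""
    if not known:
        return 0.0
    return 100 * len(discovered - known) / len(known)
```

For a team with 500 known certificates that discovers 100 more, `discovery_gap_percent` returns 20.0, inside the gap described above. The `stale` bucket is worth reviewing too, since it often points at decommissioned services still listed in the inventory.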

Phase 2: centralize inventory and assign ownership (weeks 2–4)

Build a single certificate inventory with team ownership, not individual ownership. For every certificate:

Map it to the team responsible for the service it protects
Flag any certificate with no clear owner
Prioritize orphaned certs as your highest-risk assets

Phase 3: automate renewal for the high-risk certs first (weeks 4–7)

Prioritize SSL certificate automation in this order:

Wildcard certificates — single point of failure for multiple services
Public-facing endpoints — direct customer impact on expiry
Anything expiring within 30 days — immediate risk

Use ACME where possible. For certs that can't use ACME, build renewal runbooks with explicit deployment verification steps.
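The prioritization order above can be turned into a sort key for the inventory. This is a minimal sketch: the tuple layout and function names are hypothetical, and ties within a tier are broken by the earliest expiry date.

```python
from datetime import date


def automation_rank(is_wildcard: bool, is_public: bool, expires_on: date, today: date) -> int:
    """Lower rank means automate sooner, following the order in the list above."""
    if is_wildcard:
        return 0
    if is_public:
        return 1
    if (expires_on - today).days <= 30:
        return 2
    return 3


def automation_queue(certs, today: date) -> list[str]:
    """Order certificates for automation work.

    certs is an iterable of (name, is_wildcard, is_public, expires_on) tuples.
    """
    ranked = sorted(
        certs,
        key=lambda c: (automation_rank(c[1], c[2], c[3], today), c[3]),
    )
    return [c[0] for c in ranked]
```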

Phase 4: policy enforcement and continuous monitoring (weeks 7–10)

Enforce minimum key lengths, approved CAs, and SAN policies. Set up continuous certificate expiration monitoring with escalation paths. Review the full inventory monthly for the first quarter, then quarterly after that. The goal is certificate management best practices baked into process, not heroics.

Common failures and how to prevent them

Certificate outages follow three predictable patterns: expired intermediates, wildcard over-reliance, and incomplete key rotation after compromise. Each is preventable with the right monitoring and process.

The outage nobody saw coming: expired intermediate certificates

In 2020, Microsoft Teams went down for multiple hours because an authentication certificate expired. In 2017, Equifax's breach went undetected for weeks because a traffic-inspection device was running with an expired certificate and couldn't inspect encrypted traffic. According to Gartner, certificate-related outages cost large organizations an average of $300,000 per hour of downtime.

Most monitoring checks only the leaf certificate. Incomplete chains break silently because browsers often paper over the gap with cached intermediates, while API clients, curl, and mobile apps generally don't. To prevent this:

Verify the full chain with openssl s_client -connect host:443 -showcerts
Check each certificate in the chain for expiry, not just the leaf
Monitor intermediate certificate expiry dates alongside your own certs
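Once you have the expiry date of each certificate in the chain, finding the weakest link is a small comparison. The sketch below assumes you have already collected `notAfter=` lines, for example by running `openssl x509 -noout -enddate` on each certificate from the `-showcerts` output. The function names are illustrative; `ssl.cert_time_to_seconds` is part of Python's standard library.

```python
import ssl
from datetime import datetime, timezone


def parse_not_after(line: str) -> datetime:
    """Parse 'notAfter=Mar 15 12:00:00 2026 GMT' (or the bare timestamp) as UTC."""
    stamp = line.strip().removeprefix("notAfter=")
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(stamp), tz=timezone.utc)


def earliest_chain_expiry(chain: list[tuple[str, str]]) -> tuple[str, datetime]:
    """Return the (label, expiry) of the first certificate in the chain to expire.

    chain holds (label, notAfter line) pairs for the leaf and each intermediate.
    """
    parsed = [(label, parse_not_after(not_after)) for label, not_after in chain]
    return min(parsed, key=lambda item: item[1])
```

Alerting on the earliest expiry in the chain, rather than the leaf's, catches the case where an intermediate runs out first.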

Wildcard certificate over-reliance

A single wildcard certificate shared across 30 services creates two compounding risks:

Key compromise blast radius: one compromised private key requires emergency rotation on all 30 services simultaneously
Renewal failure blast radius: one renewal failure takes down all 30 services simultaneously

Wildcards are convenient right up until they're catastrophic. Individual certificates per service, renewed via ACME automation, reduce both blast radius and incident cost.

Key rotation gaps after compromise

When a certificate is revoked after a key compromise, teams commonly make two mistakes:

Replacing the cert but reusing the same compromised private key
Rotating the key on the primary service but forgetting the three other services sharing that cert

Certificate revocation without complete key rotation is security theater. Audit which services share each certificate and rotate the key everywhere it's deployed.
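The audit of which services share a certificate can be sketched as a grouping over deployment records. It assumes your inventory can produce (service, fingerprint) pairs; the names are hypothetical.

```python
from collections import defaultdict


def shared_certificates(deployments: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each certificate fingerprint to the services using it, keeping only shared ones.

    deployments holds (service_name, cert_fingerprint) pairs from the inventory.
    """
    by_cert = defaultdict(set)
    for service, fingerprint in deployments:
        by_cert[fingerprint].add(service)
    # A certificate deployed to a single service has no shared blast radius
    return {fp: sorted(services) for fp, services in by_cert.items() if len(services) > 1}
```

Every list in the result is a rotation checklist: if that key is compromised, each service named there needs the new key before the incident is closed.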

What changes with short-lived certificates and post-quantum readiness

Two shifts will reshape certificate management within the next 3–5 years: mandatory short-lived certificates and post-quantum cryptography migration. Teams that prepare now avoid emergency migrations later.

Preparing for 47-day and shorter lifespans

The CA/Browser Forum's ballot SC-081 establishes a concrete timeline for maximum certificate validity:

Effective Date	Maximum Certificate Lifespan
March 2026	200 days
March 2027	100 days
March 2029	47 days

Any certificate that isn't renewed via automation today will become a recurring outage source. Audit your infrastructure now for anything that requires manual renewal — every one of those is a future incident.
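The timeline can be encoded as a lookup so tooling knows the maximum validity that applies on a given issuance date. Two details here are assumptions not stated in the table above: the 15th as the effective day of each month, and 398 days as the limit before the first step takes effect.

```python
from datetime import date

# (effective date, maximum validity in days); the day of month is an assumption
SCHEDULE = [
    (date(2026, 3, 15), 200),
    (date(2027, 3, 15), 100),
    (date(2029, 3, 15), 47),
]


def max_validity_days(issued_on: date, before_schedule: int = 398) -> int:
    """Return the maximum certificate validity allowed for a given issuance date."""
    limit = before_schedule
    for effective, days in SCHEDULE:
        if issued_on >= effective:
            limit = days  # later steps override earlier ones
    return limit
```

Feeding this into a policy check lets the allowed validity tighten on schedule without anyone editing configuration on the day.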

Post-quantum cryptography and certificate management impact

NIST finalized ML-KEM (formerly CRYSTALS-Kyber) in FIPS 203 and ML-DSA (formerly CRYSTALS-Dilithium) in FIPS 204 in 2024. Post-quantum certificates will be significantly larger: ML-DSA-65 public keys are 1,952 bytes compared to 91 bytes for ECDSA P-256 — a 21x size increase that affects TLS handshake performance, certificate storage, and any system that parses or validates certificates.

To prepare for post-quantum certificate migration now:

Ensure all renewal paths support ACME and can be updated without code changes
Audit for hardcoded certificate size assumptions in parsers, proxies, and middleware
Test PQC certificate support in your TLS libraries (OpenSSL 3.5+ and BoringSSL have experimental support)
Track your CA's PQC readiness timeline

Frequently asked questions

How many certificates can you manage manually before you need automation?

The practical limit is around 50 certificates with annual lifespans. Below 50, calendar reminders and a spreadsheet work if the person maintaining them doesn't leave the company. Above 50, or with 90-day lifespans, the renewal volume exceeds what manual processes can handle reliably. At 200+ certs, automated certificate management isn't optional.

What's the difference between certificate management and certificate lifecycle management (CLM)?

Certificate management and CLM describe the same discipline. CLM is the term vendors use to emphasize full-lifecycle coverage from issuance through revocation. In practice, any useful certificate management solution covers the full lifecycle. The distinction is marketing, not technical.

Should we use one wildcard certificate or individual certificates per service?

Individual certificates per service. Wildcards reduce operational work up front but create a single point of failure and a larger blast radius during key compromise. The operational cost of managing individual certs with ACME automation is lower than the incident cost of a shared wildcard failure.

How do we prepare for 47-day certificate lifespans?

Start by identifying every certificate that requires manual renewal and migrate those to ACME-based automation using cert-manager, Certbot, or your cloud provider's auto-renewal. Then verify that renewal actually results in deployment. In my experience managing certificate infrastructure at scale, the most common failure mode with short-lived certs isn't renewal failure; it's renewal success without deployment.

What's the first step if we have no idea how many certificates we have?

Run a CT log query for all your registered domains. That gives you every publicly trusted certificate issued for your domains, including ones you didn't authorize. Pair that with cloud provider API enumeration (AWS ACM, Azure Key Vault, GCP Certificate Manager) and you'll have 80–90% visibility within a day. The remaining 10–20% requires network scanning for internal and legacy infrastructure.

This is why we built CertPulse

CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.

If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.
