
SSL Monitoring for Production Infrastructure: What Actually Matters

April 14, 2026 · 12 min read · CertPulse Engineering

The worst cert incident I've worked on wasn't an expiry. It was a cert that renewed fine, deployed to three of four load balancers, and silently broke about 25% of API traffic for six hours before anyone noticed. That's what SSL monitoring actually has to catch in 2026: not just the dates, but the drift between what you think is deployed and what's actually serving bytes on the wire.

This post is what I'd hand a new hire on day one of inheriting a 500-cert fleet. Opinionated, specific, and written against the new 47-day reality.

What SSL Monitoring Actually Means in 2026

SSL monitoring in 2026 is five overlapping problems: expiry tracking, chain validity, trust state (revocation plus CA distrust events), issuance visibility through CT logs, and deployment drift across every place a cert is supposed to live. Treating it as a single "check expiry" cron is how most of the cert outages I've responded to started.

Beyond expiry dates

Expiry is table stakes. It tells you a cert will fail in N days. It does not tell you:

  • Whether the chain your server is actually sending is complete
  • Whether your intermediate is still trusted by major root programs
  • Whether OCSP stapling is returning a fresh response
  • Whether a CT log saw a cert for your domain you didn't issue
  • Whether every replica behind your load balancer serves the same bytes

In my experience responding to cert incidents, I've paged out on all five. Expiry is the easiest and the least interesting.

The shift to 47-day certificates

The CA/Browser Forum ratified the lifetime reduction in 2025. The phase-in schedule:

| Deadline | Max validity | DV reuse |
|---|---|---|
| March 2026 | 200 days | |
| March 2027 | 100 days | |
| March 2029 | 47 days | 10 days |

At 398 days you can manually renew in a pinch. At 47 you cannot — a single missed pipeline run on a non-automated cert becomes a production outage inside one sprint.

The math that changed everything: a 47-day validity with 10-day DV reuse means your pipeline re-validates, re-issues, and redeploys every cert roughly 8-9 times per year. Multiply that by fleet size and your tolerance for manual anything drops to zero. The full 47-day certificate timeline has the per-phase breakdown.
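That cadence claim checks out on the back of an envelope. A sketch, assuming you renew with a 5-day safety buffer (my number; the ballot only caps validity):

```shell
# renewals/yr = 365 / (validity - safety buffer); the buffer is an assumption.
for v in 398 200 100 47; do
  awk -v v="$v" 'BEGIN { printf "%3d-day certs: %4.1f renewals/yr\n", v, 365 / (v - 5) }'
done
```

At 47 days that lands at ~8.7 issuances per cert per year, which is where the 8-9 figure comes from.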

The Failure Modes Nobody Talks About

The cert failures that actually wake you up are not expiries. They're intermediate CA distrust events, partial deployments across load balancer pools, OCSP responder outages against hard-fail clients, and SNI mismatches behind CDNs. Generic uptime tools miss all four because they test one endpoint, once, from one client, and call it green.

Intermediate CA revocation

In September 2021, DST Root CA X3, the IdenTrust root that cross-signed Let's Encrypt's chain, expired and broke OpenSSL 1.0.2 clients, a subset of older Android devices, and a long tail of IoT gear. Leaf certs were fine. Browsers were fine. Chain path validation on legacy trust stores was not.

Detection requires validating against multiple trust stores — Mozilla NSS, Apple, Android, OpenSSL default — and alerting on any that fail. openssl verify -CAfile handles one store at a time; for the full matrix you need each trust bundle on disk and passed explicitly.
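A minimal sketch of that matrix: loop openssl verify over each bundle explicitly. The bundle filenames here are placeholders for trust stores you download and extract yourself:

```shell
# Validate a leaf + intermediates against several trust bundles.
# Bundle filenames are placeholders for stores you fetch separately.
verify_against_stores() {  # $1 = leaf, $2 = intermediates, rest = bundles
  leaf="$1"; chain="$2"; shift 2
  for store in "$@"; do
    if openssl verify -CAfile "$store" -untrusted "$chain" "$leaf" >/dev/null 2>&1; then
      echo "OK   $store"
    else
      echo "FAIL $store"
    fi
  done
}
# verify_against_stores leaf.pem intermediates.pem \
#   mozilla-nss.pem apple.pem android.pem openssl-default.pem
```

Alert on any FAIL line: a chain that passes your system store but fails Mozilla NSS is exactly the DST-Root class of incident.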

Chain order bugs

nginx, HAProxy, and Envoy all happily serve a chain where the intermediate is missing or in the wrong order. AIA fetch support splits like this:

  • Papers over a missing intermediate: Chrome and Safari (AIA fetching), Firefox (preloaded intermediates)
  • Does not: curl, Python requests, Go crypto/tls

This is how you get a cert that passes a browser smoke test and then breaks every mobile client and server-to-server integration you own. When your certificate works in Chrome but breaks everywhere else covers the detection side in depth.
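The detection itself is mechanical: split what the server actually sent into individual certs, then check that each cert's issuer is the next cert's subject, leaf first. A sketch that checks name chaining only, not signatures, which is all a wire-order probe needs:

```shell
# Check that a PEM bundle (in wire order) chains by name: each cert's
# issuer must equal the next cert's subject. Names only, no signatures.
chain_order_ok() {  # $1 = PEM bundle in the order the server sent it
  dir=$(mktemp -d)
  awk -v d="$dir" '/BEGIN CERTIFICATE/{n++} n{print > (d "/" n ".pem")}' "$1"
  i=1
  while [ -f "$dir/$((i + 1)).pem" ]; do
    issuer=$(openssl x509 -in "$dir/$i.pem" -noout -issuer)
    subject=$(openssl x509 -in "$dir/$((i + 1)).pem" -noout -subject)
    if [ "${issuer#issuer=}" != "${subject#subject=}" ]; then
      echo "BROKEN between cert $i and $((i + 1))"; return 1
    fi
    i=$((i + 1))
  done
  echo "chain order OK"
}
# Feed it what the wire shows:
# echo | openssl s_client -connect api.example.com:443 -showcerts 2>/dev/null \
#   | awk '/BEGIN CERT/,/END CERT/' > served.pem && chain_order_ok served.pem
```

Run this from a probe host, not a browser, and you catch the "works in Chrome" class of bug before your mobile clients do.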

Mixed deployment states across load balancers

This one paged me at 2:47 a.m. on a Tuesday. ACM auto-renewed a cert bound to an ALB. The ALB fronted four targets in an Auto Scaling group behind a Route 53 weighted record. Three targets got the new cert. One kept serving the old one. The old expired at 00:00 UTC. 25% of TLS handshakes failed for six hours until the Slack signal got loud enough to escalate.

A per-endpoint check that hit the public DNS name would have been green 75% of the time. Detection requires probing every backend target separately, presenting the public hostname via SNI. This failure mode is why renewal and deployment need to be monitored separately.
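Probing a target directly means connecting to its IP while still presenting the public name via SNI, then comparing what each target serves. A sketch — the IPs and hostname are placeholders; the comparison loop is the part worth keeping:

```shell
# Fingerprint the exact leaf each backend serves when asked for the
# public hostname via SNI. IPs and hostname below are placeholders.
fingerprint() {  # $1 = target ip, $2 = public hostname (SNI)
  echo | openssl s_client -servername "$2" -connect "$1:443" 2>/dev/null \
    | openssl x509 -outform DER | openssl dgst -sha256 -r | awk '{print $1}'
}

check_drift() {  # $1 = hostname, remaining args = target IPs
  sni="$1"; shift
  ref=""
  for ip in "$@"; do
    fp=$(fingerprint "$ip" "$sni")
    # Reference = first target; any disagreement is page-worthy either way.
    if [ -z "$ref" ]; then ref="$fp"; fi
    if [ "$fp" != "$ref" ]; then echo "DRIFT: $ip serves a different cert"; fi
  done
}
# check_drift api.example.com 10.0.1.11 10.0.1.12 10.0.1.13 10.0.1.14
```

In the incident above, this check would have printed one DRIFT line at 00:01 UTC instead of a Slack thread at 02:47.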

What to Monitor: A Concrete Checklist

Monitor at three layers: per-endpoint (what's actually served), per-certificate (what the cert itself claims), and per-issuer (what's happening upstream that you can't control). Anything less leaves at least one failure mode uncovered. Here's the reference table I keep around, re-thresholded for 47-day math.

Per-endpoint checks

| Check | Frequency | Warn | Page |
|---|---|---|---|
| Days to expiry | 1h | 7d | 3d |
| Chain completeness | 1h | any gap | any gap |
| Hostname SAN match | 1h | mismatch | mismatch |
| Protocol ≥ TLS 1.2 | 6h | TLS 1.1 offered | TLS 1.0 offered |
| Cipher suite health | 24h | RC4/3DES | export ciphers |
| OCSP stapling fresh | 1h | stale > 24h | absent (hard-fail svc) |

With 47-day certs, the old 30/14/7/1 warning cascade stops making sense. A 14-day warning on a 47-day cert fires at 70% of lifetime, which is noise. 7-day warn and 3-day page is my current default.

Per-certificate checks

  • Key size: RSA ≥ 2048, EC ≥ 256
  • Signature algorithm: SHA-256 minimum; SHA-1 pages immediately
  • CT log presence: absence on a public cert is either a bug or a rogue issuer
  • Revocation status: via OCSP and CRL
  • SAN drift: versus the last known-good snapshot

Certificate chain validation belongs here too — not just whether a chain exists, but whether it validates cleanly against every trust store that matters to your clients.
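For the key-size and signature-algorithm rows, plain openssl gets you most of the way. A rough sketch, RSA-oriented (the 2048-bit floor deliberately ignores EC keys, where 256 is fine):

```shell
# Per-cert inspection with nothing but openssl. RSA-oriented sketch:
# the bit-count floor below does not apply to EC keys.
check_cert() {  # $1 = cert file
  text=$(openssl x509 -in "$1" -noout -text)
  sig=$(echo "$text" | grep -m1 'Signature Algorithm' | awk '{print $3}')
  bits=$(echo "$text" | grep -m1 'Public-Key' | tr -dc '0-9')
  case "$sig" in
    *[Ss][Hh][Aa]1*) echo "PAGE: $1 signed with SHA-1" ;;
  esac
  if [ "${bits:-0}" -lt 2048 ]; then echo "WARN: $1 key is only $bits bits"; fi
  echo "$sig, ${bits}-bit key"
}
```

Run it over every file your discovery layer finds, diff the SAN list against the last snapshot separately.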

Per-issuer checks

Stuff outside your control that still breaks your stack:

  • CA distrust announcements (Mozilla Bugzilla, Chrome Root Program mailing list)
  • OCSP responder availability on the CA side
  • CT log shard health — logs get frozen and decommissioned
  • ACME account rate limits at your issuer

Monitoring at Scale: 50 vs 500 vs 2000 Certs

Scale transitions follow a predictable pattern:

  • 50 certs: a spreadsheet handles it
  • 500 certs: forces you to solve discovery
  • 2000 certs: forces you to solve ownership routing

Each transition hurts because the approach that worked yesterday does not stretch, and most teams do not notice until an alert has been ignored for 72 hours straight.

The discovery problem

At 50 certs you know where they all live. At 500 you do not, and anyone who claims otherwise has not actually gone looking. Certificate discovery has to cover:

  • ACM (regional, per-account, across every org account)
  • Cloudflare (account-scoped)
  • GCP Certificate Manager
  • Azure Key Vault
  • Kubernetes Secrets and cert-manager Certificate CRDs
  • nginx configs on long-lived EC2 instances
  • IIS boxes nobody in the current org remembers provisioning
  • SaaS vendors where a PM set up a custom domain in 2022

Sources worth wiring up: cloud provider ListCertificates APIs paginated across every region and account, cross-account enumeration when you're in AWS, a CT log listener (certstream or a local log follower) for your registered domains, and a Kubernetes secret watcher. CT log monitoring also catches the shadow-IT cert your marketing team bought without telling you.

Alert fatigue math

The numbers get brutal fast:

| Fleet size | Validity | Annual alerts | Real failures (99% success) | Noise |
|---|---|---|---|---|
| 50 certs | 398-day | ~50 | ~1 | ~49 |
| 2000 certs | 47-day | ~32,000 | ~320 | ~31,680 |

Anything above 2-3 actionable alerts per engineer per day gets filtered into a folder and ignored within a month. The fix is routing, not better alerts.
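The 2000-cert row reproduces if you assume ~8 renewal cycles per cert per year and two alert events (a warn plus a page window) per cycle. That per-cycle count is my reading of the table's accounting, not a standard:

```shell
# 2000 certs x ~8 cycles/yr x 2 alerts/cycle, with 1% tied to real failures.
# The 2-alerts-per-cycle figure is an assumption about the accounting.
certs=2000; cycles=8; per_cycle=2
alerts=$((certs * cycles * per_cycle))
real=$((alerts / 100))
echo "alerts=$alerts real=$real noise=$((alerts - real))"
```

Swap in your own fleet size and success rate; the noise column is the one that decides whether anyone still reads the channel.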

Ownership mapping

Tag every cert with owner, service, and environment at issuance time. If you cannot do it at issuance, run a reconciliation job that maps SANs to services via your service catalog and writes tags back. Route alerts on the owner tag. A shared certs@ inbox at 500 certs is where expiry warnings go to die quietly.
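The routing itself is trivial once the tag exists. Channel and team naming here is illustrative:

```shell
# Map an alert to a destination from the cert's owner tag.
# Team/channel naming conventions are illustrative, not prescribed.
route_alert() {  # $1 = owner tag (may be empty), $2 = severity
  owner="$1"; sev="$2"
  if [ -z "$owner" ]; then
    echo "slack:#certs-unowned"   # the reconciliation job's backlog
    return
  fi
  case "$sev" in
    page) echo "pagerduty:$owner" ;;
    *)    echo "slack:#$owner-alerts" ;;
  esac
}
# route_alert payments page   → pagerduty:payments
# route_alert "" warn         → slack:#certs-unowned
```

The untagged branch matters most: an explicit unowned queue gets drained; a shared inbox does not.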

Build vs Buy: An Honest Tradeoff

A 40-line bash script with openssl s_client and a cron job covers about 80% of what a small shop needs. It breaks at multi-cloud discovery, alert routing, historical data, and ownership mapping.

  • Under 100 certs, one cloud, one on-call: do not buy anything
  • Over 500 certs: the script is costing more engineering time than a tool would

What a cron + openssl gets you

#!/usr/bin/env bash
set -eu
while read -r host; do
  end=$(echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  # GNU date; on macOS use: date -j -f '%b %e %T %Y %Z' "$end" +%s
  days=$(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
  # "if" rather than "[ ] &&": the latter leaves a nonzero exit status
  # when the last host is healthy, which cron reports as a failure.
  if [ "$days" -lt 7 ]; then
    echo "WARN: $host expires in $days days"
  fi
done < endpoints.txt

Run it hourly, pipe warnings to a Slack webhook, done. That's your starter SSL health check.

Where it breaks

  • No discovery: endpoints.txt is manual and goes stale the day you ship it
  • Single trust store: openssl uses the system CA bundle, not Mozilla NSS or Apple
  • No historical data: you cannot answer "when did this chain last change"
  • No alert routing: everything goes to one channel
  • No CT log watching: no unauthorized-issuance catch
  • No deployment drift check: it hits the DNS name, not each backend

When a tool is worth it

When the script's maintenance tax exceeds its value. For me that line sits somewhere around 200-300 certs, or the moment you cross two clouds, or when the on-call rotation grows past one engineer. Until then, openssl plus cron plus jq is honestly fine and I'll say so.

Integrating SSL Monitoring Into Your Existing Stack

Most teams already run Prometheus, Datadog, or a cloud-native monitoring stack. You don't need a separate SSL tool to get baseline coverage. You need the right probe config, thresholds that match 47-day math, and routing that splits warnings from pages.

Prometheus + blackbox_exporter

modules:
  tls_connect:
    prober: tcp
    timeout: 10s
    tcp:
      tls: true
      tls_config:
        insecure_skip_verify: false

Alert rules:

- alert: CertExpiryWarn
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
  for: 10m
  labels: { severity: warning }
- alert: CertExpiryPage
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
  for: 5m
  labels: { severity: page }

blackbox_exporter covers expiry and hostname match cleanly. It does not cover chain validation against alternate trust stores, CT log presence, or OCSP stapling freshness. For those, run a sidecar script and feed results in via the textfile collector.
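The textfile handoff is just writing a .prom file that node_exporter scrapes. The metric name below is my own convention; the atomic rename is the part that matters, so a scrape never reads a half-written file:

```shell
# Sidecar: publish extra cert checks through node_exporter's textfile
# collector. Metric name is my convention, not a standard.
write_metrics() {  # $1 = output path, then check=0|1 pairs
  out="$1"; shift
  tmp=$(mktemp "${out}.XXXXXX")
  for pair in "$@"; do
    printf 'cert_check_ok{check="%s"} %s\n' "${pair%%=*}" "${pair#*=}"
  done > "$tmp"
  mv "$tmp" "$out"   # atomic rename: scrapes see old file or new, never partial
}
# write_metrics /var/lib/node_exporter/cert_checks.prom chain_mozilla=1 ocsp_staple_fresh=0
```

Point the alternate-trust-store and OCSP checks at this and they show up next to probe_ssl_earliest_cert_expiry with no new infrastructure.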

Datadog synthetics

One SSL test per endpoint, alert on days_before_expiry < 7. The gotcha: Datadog SSL tests resolve the public DNS name and hit whatever the CDN or load balancer returns, which hides per-target drift. For ALB target-level coverage you need a separate HTTP check per target IP, or you accept the blind spot.

PagerDuty routing

Three-tier routing I use in production:

| Trigger | Severity | Action |
|---|---|---|
| Expiry warnings (3-7 day window) | Low | Opens Jira ticket |
| Chain broken, hostname mismatch, cert invalid | Sev-2 | Pages on-call |
| Revocation events, distrust announcements | Sev-1 | Wakes whole rotation |
| OCSP stapling failures (hard-fail svc only) | Sev-2 | Pages on-call |

OCSP stapling failures break far more often than you'd expect; OCSP stapling is probably broken on half your endpoints covers the detection problem in depth.

FAQ

How often should I check SSL certificates?

Check hourly for expiry and chain on production endpoints, every 6 hours for protocol and cipher checks, and daily for CT log scanning against your registered domains. With 47-day validity, daily expiry checks don't leave enough margin for DNS TTLs, pipeline latency, and on-call handoff.

What's the difference between SSL monitoring and TLS monitoring?

Nothing operational. SSL is the legacy term; TLS is the protocol name since 1999. Tools, dashboards, and runbooks still say SSL because that's what ops teams type into search bars. Use whichever your team already uses — TLS monitoring and SSL monitoring describe the same work.

Is OCSP still worth monitoring with short-lived certs?

Yes, for now. Browsers are moving off live OCSP lookups (Chrome has long relied on CRLSets; Firefox is rolling out CRLite), but legacy clients, mail servers, and hard-fail services still depend on it. Once validity drops to 47 days the revocation model weakens further (certs expire before revocation propagates), but stapling failures still break live connections today.

What should I monitor first if I'm starting from zero?

Monitor expiry across every endpoint you can discover, with 7-day warnings, and chain completeness tested from an OpenSSL-only client (not a browser). Those two give you the biggest risk reduction per hour of work. Everything else is layer two.

Do I need to monitor CT logs if I'm not on a security team?

If you own a domain, yes. CT log monitoring catches unauthorized issuance, typosquatting, and shadow-IT certs on subdomains you didn't know existed. A certstream listener is 15 minutes of setup and it pays off the first time you catch something you didn't issue.
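The matching half of that listener fits in one shell function; wiring it to the certstream feed is left out here:

```shell
# Decide whether a SAN seen in a CT entry falls under a domain you own.
# Matching logic only; the certstream plumbing is out of scope.
concerns_us() {  # $1 = SAN from a CT entry; WATCHED = space-separated domains
  san=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  san="${san#\*.}"                     # treat wildcards as their base name
  for d in $WATCHED; do
    case "$san" in
      "$d"|*".$d") return 0 ;;         # the domain itself or any subdomain
    esac
  done
  return 1
}
WATCHED="example.com example.net"
concerns_us api.staging.example.com && echo "needs review"
```

Note the suffix anchor: example.com.evil.net must not match, which is exactly the typosquat shape you're watching for.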

The takeaway

SSL monitoring in 2026 is a multi-layer problem, and a single expiry check does not cover it. Work across three layers: endpoint, certificate, issuer. Build your own certificate inventory with openssl and cron until the maintenance tax hurts. Re-threshold every alert for 47-day math before March 2026. If you want all of that pre-wired, CertPulse handles discovery, drift, and CT logs in one place — but the bash script works too, and I'll never pretend otherwise.

This is why we built CertPulse

CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.

If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.