The worst cert incident I've worked on wasn't an expiry. It was a cert that renewed fine, deployed to three of four load balancers, and silently broke about 25% of API traffic for six hours before anyone noticed. That's what SSL monitoring actually has to catch in 2026: not just the dates, but the drift between what you think is deployed and what's actually serving bytes on the wire.
This post is what I'd hand a new hire on day one of inheriting a 500-cert fleet. Opinionated, specific, and written against the new 47-day reality.
What SSL Monitoring Actually Means in 2026
SSL monitoring in 2026 is five overlapping problems: expiry tracking, chain validity, trust state (revocation plus CA distrust events), issuance visibility through CT logs, and deployment drift across every place a cert is supposed to live. Treating it as a single "check expiry" cron is how most of the cert outages I've responded to started.
Beyond expiry dates
Expiry is table stakes. It tells you a cert will fail in N days. It does not tell you:
- Whether the chain your server is actually sending is complete
- Whether your intermediate is still trusted by major root programs
- Whether OCSP stapling is returning a fresh response
- Whether a CT log saw a cert for your domain you didn't issue
- Whether every replica behind your load balancer serves the same bytes
In my experience responding to cert incidents, I've paged out on all five. Expiry is the easiest and the least interesting.
The shift to 47-day certificates
The CA/Browser Forum ratified the lifetime reduction in 2025. The phase-in schedule:
| Deadline | Max validity | DV reuse |
|---|---|---|
| March 2026 | 200 days | — |
| March 2027 | 100 days | — |
| March 2029 | 47 days | 10 days |
At 398 days you can manually renew in a pinch. At 47 you cannot — a single missed pipeline run on a non-automated cert becomes a production outage inside one sprint.
The math that changed everything: a 47-day validity with 10-day DV reuse means your pipeline re-validates, re-issues, and redeploys every cert roughly 8-9 times per year. Multiply that by fleet size and your tolerance for manual anything drops to zero. The full 47-day certificate timeline has the per-phase breakdown.
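That cadence is easy to sanity-check. A minimal sketch of the arithmetic, assuming a 7-day renewal buffer before expiry (the buffer is my assumption, not part of the ballot):

```bash
# Renewals per cert per year = 365 / (validity - renewal buffer).
# The 7-day buffer is an assumption; tighten it and the count rises.
for validity in 398 100 47; do
  awk -v v="$validity" \
    'BEGIN { printf "%d-day certs: ~%.1f renewals per cert per year\n", v, 365 / (v - 7) }'
done
```

At 47 days that works out to roughly 9 renewals per cert per year; a tighter buffer pushes it toward 12.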
The Failure Modes Nobody Talks About
The cert failures that actually wake you up are not expiries. They're intermediate CA distrust events, partial deployments across load balancer pools, OCSP responder outages against hard-fail clients, and SNI mismatches behind CDNs. Generic uptime tools miss all four because they test one endpoint, once, from one client, and call it green.
Intermediate CA revocation
In September 2021, the IdenTrust DST Root CA X3 root that cross-signed Let's Encrypt expired and took down OpenSSL 1.0.2 clients, older Android, and a long tail of IoT devices. Leaf certs were fine. Browsers were fine. Chain path validation on legacy trust stores was not.
Detection requires validating against multiple trust stores — Mozilla NSS, Apple, Android, OpenSSL default — and alerting on any that fail. openssl verify -CAfile handles one at a time; for the full matrix you need the trust bundles shipped explicitly.
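A minimal sketch of that matrix check, assuming you've already shipped each trust bundle to disk (the bundle filenames below are placeholders, not real artifacts):

```bash
#!/usr/bin/env bash
# Validate the chain a server actually sends against several trust
# bundles. Bundle filenames are placeholders -- ship the real
# Mozilla/Apple/Android bundles alongside the script.
set -u

fetch_chain() {  # $1 = hostname, $2 = output file for the served PEM chain
  echo | openssl s_client -servername "$1" -connect "$1:443" -showcerts 2>/dev/null \
    | awk '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/' > "$2"
}

verify_against() {  # $1 = trust bundle, $2 = chain file; exit 0 iff a path validates
  openssl verify -CAfile "$1" -untrusted "$2" "$2" > /dev/null 2>&1
}

check_trust_matrix() {
  local host="$1" chain bundle
  chain=$(mktemp)
  fetch_chain "$host" "$chain"
  for bundle in mozilla.pem apple.pem android.pem openssl-default.pem; do
    if verify_against "$bundle" "$chain"; then
      echo "OK   $host vs $bundle"
    else
      echo "FAIL $host vs $bundle"   # some client population just lost trust
    fi
  done
  rm -f "$chain"
}
# usage: check_trust_matrix api.example.com
```

Alert on any FAIL, not just a majority: each bundle stands in for a client population you presumably still serve.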
Chain order bugs
nginx, HAProxy, and Envoy all happily serve a chain where the intermediate is missing or in the wrong order. AIA fetch support splits like this:
- Fetches missing intermediates: Chrome, Firefox
- Does not: curl, Python requests, Go crypto/tls
This is how you get a cert that passes a browser smoke test and then breaks every mobile client and server-to-server integration you own. When your certificate works in Chrome but breaks everywhere else covers the detection side in depth.
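One cheap detection, sketched below with the host name as a placeholder: count the certificates the server actually puts on the wire. One certificate means leaf-only, which browsers rescue via AIA fetching and nothing else does.

```bash
#!/usr/bin/env bash
# Detect a leaf-only chain the way curl or Python requests would see it.
# A count > 1 is necessary but not sufficient -- order can still be wrong.
set -u

sent_cert_count() {  # reads a PEM stream on stdin, prints the cert count
  grep -c -e '-----BEGIN CERTIFICATE-----'
}

check_chain_depth() {  # $1 = hostname (placeholder in the usage line)
  local n
  n=$(echo | openssl s_client -servername "$1" -connect "$1:443" -showcerts 2>/dev/null \
        | sent_cert_count)
  if [ "$n" -le 1 ]; then
    echo "WARN: $1 sent $n certificate(s) -- non-browser clients will fail"
  else
    echo "ok: $1 sent $n certificates"
  fi
}
# usage: check_chain_depth api.example.com
```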
Mixed deployment states across load balancers
This one paged me at 2:47 a.m. on a Tuesday. ACM auto-renewed a cert bound to an ALB. The ALB fronted four targets in an Auto Scaling group behind a Route 53 weighted record. Three targets got the new cert. One kept serving the old one. The old expired at 00:00 UTC. 25% of TLS handshakes failed for six hours until the Slack signal got loud enough to escalate.
A per-endpoint check that hit the public DNS name would have been green 75% of the time. Detection requires probing every backend target separately with the right Host header. This failure mode is why renewal and deployment need to be monitored separately.
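A sketch of that per-backend probe, assuming you can reach each target IP directly (the hostname and IPs are placeholders). It compares serial numbers, which differ between the old and renewed cert even when every other field matches:

```bash
#!/usr/bin/env bash
# Probe each backend with the production SNI and compare serials.
# Hostname and target IPs are placeholders.
set -u

probe_serial() {  # $1 = target IP, $2 = SNI hostname
  echo | openssl s_client -servername "$2" -connect "$1:443" 2>/dev/null \
    | openssl x509 -noout -serial 2>/dev/null
}

all_same_serial() {  # stdin: one "ip serial=XX" line per target; exit 0 iff uniform
  [ "$(awk '{ print $2 }' | sort -u | grep -c .)" -le 1 ]
}

# usage:
#   for ip in 10.0.1.10 10.0.1.11 10.0.1.12 10.0.1.13; do
#     echo "$ip $(probe_serial "$ip" api.example.com)"
#   done | all_same_serial || echo "PAGE: backends serve different certs"
```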
What to Monitor: A Concrete Checklist
Monitor at three layers: per-endpoint (what's actually served), per-certificate (what the cert itself claims), and per-issuer (what's happening upstream that you can't control). Anything less leaves at least one failure mode uncovered. Here's the reference table I keep around, re-thresholded for 47-day math.
Per-endpoint checks
| Check | Frequency | Warn | Page |
|---|---|---|---|
| Days to expiry | 1h | 7d | 3d |
| Chain completeness | 1h | any gap | any gap |
| Hostname SAN match | 1h | mismatch | mismatch |
| Protocol ≥ TLS 1.2 | 6h | TLS 1.1 offered | TLS 1.0 offered |
| Cipher suite health | 24h | RC4/3DES | export ciphers |
| OCSP stapling fresh | 1h | stale > 24h | absent (hard-fail svc) |
With 47-day certs, the old 30/14/7/1 warning cascade stops making sense. A 14-day warning on a 47-day cert fires at 70% of lifetime, which is noise. 7-day warn and 3-day page is my current default.
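The stapling row in that table reduces to one probe. A sketch, with the host name as a placeholder: `openssl s_client -status` requests a stapled response during the handshake, and the absence of a successful OCSP Response Status in the output is the alert condition.

```bash
#!/usr/bin/env bash
# Check for a stapled OCSP response; absence should page on hard-fail
# services per the thresholds above. Host name is a placeholder.
set -u

staple_status() {  # reads `openssl s_client -status` output on stdin
  if grep -q 'OCSP Response Status: successful'; then
    echo present
  else
    echo absent
  fi
}

check_staple() {  # $1 = hostname
  echo | openssl s_client -servername "$1" -connect "$1:443" -status 2>/dev/null \
    | staple_status
}
# usage: [ "$(check_staple api.example.com)" = present ] || echo "WARN: no staple"
```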
Per-certificate checks
- Key size: RSA ≥ 2048, EC ≥ 256
- Signature algorithm: SHA-256 minimum; SHA-1 pages immediately
- CT log presence: absence on a public cert is either a bug or a rogue issuer
- Revocation status: via OCSP and CRL
- SAN drift: versus the last known-good snapshot
Certificate chain validation belongs here too — not just whether a chain exists, but whether it validates cleanly against every trust store that matters to your clients.
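The per-certificate layer needs no network at all; it runs against PEM files in your inventory. A sketch of the key-size and signature-algorithm checks (the path in the usage line is a placeholder):

```bash
#!/usr/bin/env bash
# Per-certificate checks against a PEM file on disk. Thresholds mirror
# the list above; the cert path in the usage line is a placeholder.
set -u

key_bits() {  # $1 = PEM file; prints public key size in bits
  openssl x509 -in "$1" -noout -text \
    | sed -n 's/.*Public-Key: (\([0-9]*\) bit.*/\1/p' | head -n 1
}

sig_alg() {  # $1 = PEM file; prints e.g. sha256WithRSAEncryption
  openssl x509 -in "$1" -noout -text \
    | awk -F': ' '/Signature Algorithm/ { print $2; exit }'
}

check_cert() {
  case "$(sig_alg "$1")" in
    *sha1*|*SHA1*) echo "PAGE: $1 is SHA-1 signed" ;;   # SHA-1 pages immediately
    *)             echo "ok: $1 $(sig_alg "$1"), $(key_bits "$1")-bit key" ;;
  esac
}
# usage: check_cert /etc/ssl/inventory/cert.pem
```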
Per-issuer checks
Stuff outside your control that still breaks your stack:
- CA distrust announcements (Mozilla Bugzilla, Chrome Root Program mailing list)
- OCSP responder availability on the CA side
- CT log shard health — logs get frozen and decommissioned
- ACME account rate limits at your issuer
Monitoring at Scale: 50 vs 500 vs 2000 Certs
Scale transitions follow a predictable pattern:
- 50 certs: a spreadsheet handles it
- 500 certs: forces you to solve discovery
- 2000 certs: forces you to solve ownership routing
Each transition hurts because the approach that worked yesterday does not stretch, and most teams do not notice until an alert has been ignored for 72 hours straight.
The discovery problem
At 50 certs you know where they all live. At 500 you do not, and anyone who claims otherwise has not actually gone looking. Certificate discovery has to cover:
- ACM (regional, per-account, across every org account)
- Cloudflare (account-scoped)
- GCP Certificate Manager
- Azure Key Vault
- Kubernetes Secrets and cert-manager Certificate CRDs
- nginx configs on long-lived EC2 instances
- IIS boxes nobody in the current org remembers provisioning
- SaaS vendors where a PM set up a custom domain in 2022
Sources worth wiring up: cloud provider ListCertificates APIs paginated across every region and account, cross-account enumeration when you're in AWS, a CT log listener (certstream or a local log follower) for your registered domains, and a Kubernetes secret watcher. CT log monitoring also catches the shadow-IT cert your marketing team bought without telling you.
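The filtering half of that CT listener is the part worth getting right; the feed itself is a firehose. A sketch, with the zone list invented for illustration — it keeps only domains (and subdomains) of zones you own:

```bash
#!/usr/bin/env bash
# Filter CT-log-observed domains down to ones under zones you own.
# The zone list is an assumption; extend the case patterns per zone.
set -u

is_ours() {  # $1 = domain seen in a CT entry (may be a wildcard)
  case "$1" in
    example.com|*.example.com|example.net|*.example.net) return 0 ;;
    *) return 1 ;;
  esac
}

# stdin: one domain per line (what a certstream-style feed emits);
# stdout: only the hits that belong to you -- review anything here
# that your pipeline did not issue.
ours_only() {
  while read -r d; do
    if is_ours "$d"; then echo "$d"; fi
  done
}
```

Note the glob does the right thing on near-miss lookalikes: `evil-example.com` fails `*.example.com` because the suffix match requires a literal dot.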
Alert fatigue math
The numbers get brutal fast:
| Fleet size | Validity | Annual alerts | Real failures (99% success) | Noise |
|---|---|---|---|---|
| 50 certs | 398-day | ~50 | ~1 | ~49 |
| 2000 certs | 47-day | ~32,000 | ~320 | ~31,680 |
Anything above 2-3 actionable alerts per engineer per day gets filtered into a folder and ignored within a month. The fix is routing, not better alerts.
Ownership mapping
Tag every cert with owner, service, and environment at issuance time. If you cannot do it at issuance, run a reconciliation job that maps SANs to services via your service catalog and writes tags back. Route alerts on the owner tag. A shared certs@ inbox at 500 certs is where expiry warnings go to die quietly.
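A sketch of the routing lookup itself, with the SAN-to-owner table invented for illustration — in practice you'd generate it from your service catalog:

```bash
#!/usr/bin/env bash
# Map a cert's SAN to an owner tag for alert routing. The table below
# is invented; generate the real one from your service catalog.
set -u

owner_for() {  # $1 = a SAN from the cert; prints the routing target
  case "$1" in
    api.example.com|*.api.example.com) echo platform-team ;;
    shop.example.com)                  echo storefront ;;
    *)                                 echo unrouted ;;   # these die in certs@
  esac
}
# usage: page "$(owner_for "$san")" instead of a shared inbox
```

Anything landing in `unrouted` is itself worth a weekly report: those are the certs whose expiry warnings nobody will ever see.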
Build vs Buy: An Honest Tradeoff
A 40-line bash script with openssl s_client and a cron job covers about 80% of what a small shop needs. It breaks at multi-cloud discovery, alert routing, historical data, and ownership mapping.
- Under 100 certs, one cloud, one on-call: do not buy anything
- Over 500 certs: the script is costing more engineering time than a tool would
What a cron + openssl gets you
```bash
#!/usr/bin/env bash
set -eu

# GNU date assumed for `date -d`
while read -r host; do
  end=$(echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  days=$(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
  # a bare `[ ... ] && echo` trips set -e when the test fails; use if
  if [ "$days" -lt 7 ]; then
    echo "WARN: $host expires in $days days"
  fi
done < endpoints.txt
```
Run it hourly, pipe warnings to a Slack webhook, done. That's your starter SSL health check.
Where it breaks
- No discovery: endpoints.txt is manual and goes stale the day you ship it
- Single trust store: openssl uses the system CA bundle, not Mozilla NSS or Apple
- No historical data: you cannot answer "when did this chain last change"
- No alert routing: everything goes to one channel
- No CT log watching: no unauthorized-issuance catch
- No deployment drift check: it hits the DNS name, not each backend
When a tool is worth it
When the script's maintenance tax exceeds its value. For me that line sits somewhere around 200-300 certs, or the moment you cross two clouds, or when the on-call rotation grows past one engineer. Until then, openssl plus cron plus jq is honestly fine and I'll say so.
Integrating SSL Monitoring Into Your Existing Stack
Most teams already run Prometheus, Datadog, or a cloud-native monitoring stack. You don't need a separate SSL tool to get baseline coverage. You need the right probe config, thresholds that match 47-day math, and routing that splits warnings from pages.
Prometheus + blackbox_exporter
```yaml
modules:
  tls_connect:
    prober: tcp
    timeout: 10s
    tcp:
      tls: true
      tls_config:
        insecure_skip_verify: false
```
Alert rules:
```yaml
groups:
  - name: tls
    rules:
      - alert: CertExpiryWarn
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
        for: 10m
        labels: { severity: warning }
      - alert: CertExpiryPage
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
        for: 5m
        labels: { severity: page }
```
blackbox_exporter covers expiry and hostname match cleanly. It does not cover chain validation against alternate trust stores, CT log presence, or OCSP stapling freshness. For those, run a sidecar script and feed results in via the textfile collector.
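The sidecar's output is just the Prometheus text exposition format. A sketch, with metric names, labels, and values as assumptions — write it to a `.prom` file in the collector's directory via an atomic rename:

```bash
#!/usr/bin/env bash
# Emit sidecar check results in Prometheus exposition format for
# node_exporter's textfile collector. Metric names, labels, and the
# example values are assumptions.
set -u

write_metric() {  # $1 = metric name, $2 = label pairs, $3 = value
  printf '%s{%s} %s\n' "$1" "$2" "$3"
}

emit_metrics() {
  # 1 = chain validated against that store's bundle, 0 = failed
  write_metric tls_chain_valid 'host="api.example.com",store="mozilla"' 1
  write_metric tls_chain_valid 'host="api.example.com",store="android"' 0
  write_metric tls_ocsp_staple_present 'host="api.example.com"' 1
}

# usage (atomic rename so the collector never reads a half-written file):
#   emit_metrics > /var/lib/node_exporter/textfile/tls_extra.prom.tmp \
#     && mv /var/lib/node_exporter/textfile/tls_extra.prom.tmp \
#           /var/lib/node_exporter/textfile/tls_extra.prom
```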
Datadog synthetics
One SSL test per endpoint, alert on days_before_expiry < 7. The gotcha: Datadog SSL tests resolve the public DNS name and hit whatever the CDN or load balancer returns, which hides per-target drift. For ALB target-level coverage you need a separate HTTP check per target IP, or you accept the blind spot.
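One way to close that blind spot without another vendor, assuming the targets are directly reachable (host, health path, and IPs are placeholders): `curl --resolve` pins the TCP connection to one backend while keeping the public hostname for SNI, certificate validation, and the Host header.

```bash
#!/usr/bin/env bash
# Probe one ALB target directly while still validating the cert
# against the public hostname. Host, path, and IPs are placeholders.
set -u

probe_target() {  # $1 = public hostname, $2 = target IP
  curl --silent --fail --output /dev/null --max-time 10 \
       --resolve "$1:443:$2" "https://$1/healthz"
}

# usage:
#   for ip in 10.0.1.10 10.0.1.11 10.0.1.12 10.0.1.13; do
#     probe_target api.example.com "$ip" && echo "OK $ip" || echo "FAIL $ip"
#   done
```

A target serving the wrong or expired cert fails validation here even though the public DNS name still looks green.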
PagerDuty routing
Three-tier routing I use in production:
| Trigger | Severity | Action |
|---|---|---|
| Expiry warnings (3-7 day window) | Low | Opens Jira ticket |
| Chain broken, hostname mismatch, cert invalid | Sev-2 | Pages on-call |
| Revocation events, distrust announcements | Sev-1 | Wakes whole rotation |
| OCSP stapling failures (hard-fail svc only) | Sev-2 | Pages on-call |
OCSP stapling failures break far more often than you'd expect; OCSP stapling is probably broken on half your endpoints covers the detection problem in depth.
FAQ
How often should I check SSL certificates?
Check hourly for expiry and chain on production endpoints, every 6 hours for protocol and cipher checks, and daily for CT log scanning against your registered domains. With 47-day validity, daily expiry checks don't leave enough margin for DNS TTLs, pipeline latency, and on-call handoff.
What's the difference between SSL monitoring and TLS monitoring?
Nothing operational. SSL is the legacy term; TLS is the protocol name since 1999. Tools, dashboards, and runbooks still say SSL because that's what ops teams type into search bars. Use whichever your team already uses — tls monitoring and ssl monitoring describe the same work.
Is OCSP still worth monitoring with short-lived certs?
Yes, for now. Chrome is moving toward CRLite and deprecating OCSP checks, but legacy clients, mail servers, and hard-fail services still rely on it. Once validity drops to 47 days the revocation model weakens (certs expire before revocation propagates) but stapling failures still break live connections today.
What should I monitor first if I'm starting from zero?
Monitor expiry across every endpoint you can discover, with 7-day warnings, and chain completeness tested from an OpenSSL-only client (not a browser). Those two give you the biggest risk reduction per hour of work. Everything else is layer two.
Do I need to monitor CT logs if I'm not on a security team?
If you own a domain, yes. CT log monitoring catches unauthorized issuance, typosquatting, and shadow-IT certs on subdomains you didn't know existed. A certstream listener is 15 minutes of setup and it pays off the first time you catch something you didn't issue.
The takeaway
SSL monitoring in 2026 is a multi-layer problem and a single expiry check does not cover it. Work across three layers: endpoint, certificate, issuer. Build your own certificate inventory with openssl and cron until the maintenance tax hurts. Re-threshold every alert for 47-day math before March 2026. If you want all of that pre-wired, CertPulse handles discovery, drift, and CT logs in one place — but the bash script works too, and I'll never pretend otherwise.
This is why we built CertPulse
CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.
If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.