At 2:07 on a Tuesday morning, PagerDuty did what PagerDuty does. The alert said CRITICAL: api.prod health check failing, which could mean approximately forty things. What it actually meant was that a single expired TLS certificate would cascade through five organizational gaps and cause 79 minutes of customer-facing downtime. According to the Ponemon Institute, 74% of organizations have experienced an unplanned outage caused by an expired certificate in the past two years. This is the postmortem we wrote after, the runbook we built from it, and the systemic fixes that actually moved the needle.
The Page That Ruined a Tuesday
In our incident, diagnosis alone took 22 minutes because monitoring didn't cover origin certificates: 22 minutes of customer-facing downtime while the on-call engineer checked deploys, DNS, and target group health before anyone said "check the cert." According to a 2025 Uptime Institute survey, 47% of outages involve a configuration or certificate error that existing monitoring should have caught.
Here's the actual incident timeline, wrong turns included:
| Time | Action | Result |
|---|---|---|
| 2:07 | Page fires. On-call SSHs into the bastion, half-awake. | Incident begins |
| 2:11 | Checks recent deploys. Nothing shipped since 4pm. | Dead end |
| 2:14 | Checks DNS resolution. | Fine. Another dead end |
| 2:19 | Looks at ALB target group health. Targets healthy, ALB returning 503s. | Confusing |
| 2:28 | Someone says "check the cert." Runs openssl s_client -connect api.prod.example.com:443 | verify error:num=10:certificate has expired |
| 2:29 | The collective sigh of an entire incident channel. | Root cause confirmed |
Twenty-two minutes from page to root cause. Not because the engineer was bad — because nothing in our monitoring said "the certificate on the origin server expires tomorrow."
Why the monitoring missed it
The root cause was a gap between what health checks validated and what the traffic path actually used:
- ALB health checks hit port 8080 over HTTP — targets appeared healthy
- The ALB's own TLS certificate was managed by AWS Certificate Manager (ACM) and auto-renewed — no issues
- The origin certificate (used by Nginx to terminate TLS between the ALB and the backend) expired at midnight — completely unmonitored
- The result: the ALB saw a healthy backend over HTTP and a broken TLS handshake over HTTPS simultaneously
The health check and the traffic path were checking different things.
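A quick way to surface this class of gap is to probe both layers explicitly. A minimal sketch, assuming a hypothetical origin hostname and the port layout from this incident (HTTP health checks on 8080, TLS on 443):

```shell
# Hypothetical origin host; the 8080/443 split mirrors the incident above.
ORIGIN="origin-1.internal"

check_layers() {
  # Layer the ALB health check validates: plain HTTP on the health port.
  if curl -fsS --connect-timeout 5 "http://$1:8080/healthz" >/dev/null 2>&1; then
    echo "health check (HTTP :8080): OK"
  else
    echo "health check (HTTP :8080): FAILED"
  fi
  # Layer real traffic uses: a full TLS handshake, including cert validity.
  if openssl s_client -connect "$1:443" -verify_return_error </dev/null >/dev/null 2>&1; then
    echo "TLS handshake (:443): OK"
  else
    echo "TLS handshake (:443): FAILED"
  fi
}

check_layers "$ORIGIN"
```

If the first line says OK and the second says FAILED, you have exactly the split-brain state described above.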
Root Cause: How a Certificate Falls Through Every Crack
Certificate outages rarely have a single cause — they have a failure chain where five organizational and technical gaps must be open simultaneously. Per the Ponemon Institute, the average organization manages over 50,000 certificates, making this kind of gap statistically inevitable without systematic inventory management.
Here are the five failures that compounded in our incident:
- Tribal knowledge departed. The origin cert was provisioned 13 months earlier by an engineer who left the company in month 10. The cert, the renewal process, and the context for why it existed all left with them. Certificate ownership tracking was nonexistent.
- The tracking spreadsheet had blind spots. Our certificate tracking spreadsheet listed ALB certs but omitted origin certs, internal mTLS certs, and anything provisioned outside of Terraform. The origin cert was provisioned manually via certbot during a previous incident — exactly the kind of cert that most needs tracking.
- Monitoring checked the wrong layer. Our TLS monitoring checked the certificate presented to the browser (the ALB cert) but never checked the origin cert behind it. This is the shadow certificates problem: certs that exist in your infrastructure but not in your inventory.
- Renewal reminders went to a black hole. The cert was registered to a team distribution list routing to a shared inbox with 4,200 unread messages. The 30-day and 7-day expiry warnings from Let's Encrypt arrived, got buried, and expired themselves.
- No escalation path existed. Even if someone had seen the email, there was no runbook for "a cert is expiring and I don't know what it does." The expected action was "the person who set it up will renew it." That person was on LinkedIn updating their resume.
Every layer assumed the adjacent layer was handling certificate lifecycle management.
The Fix at 2AM vs. The Fix at 10AM
Emergency certificate renewal under pressure follows a predictable pattern: you fix it wrong the first time, then fix it correctly in the morning. According to the SANS Institute, incident remediation performed during off-hours takes 2–3x longer and has a significantly higher error rate than the same work done during business hours. Our experience confirmed this exactly — the 2am fix took 47 minutes and caused a second outage, while the 10am fix took 20 minutes.
The 2am fix (the messy one)
Mistake 1: Partial deployment. At 2:31, the on-call engineer issued a new cert with certbot:
certbot certonly --nginx -d api.prod.example.com
Certbot dropped a fresh cert into /etc/letsencrypt/live/. The engineer verified it:
openssl x509 -in /etc/letsencrypt/live/api.prod.example.com/fullchain.pem -noout -dates
Dates looked good. Nginx reloaded. Still broken for 40% of requests, because the engineer had reloaded Nginx on only one of the three origin servers. The other two were still serving the expired cert.
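A check that would have caught the partial rollout immediately: compare the serial of the certificate each origin actually serves. A sketch, with hypothetical origin hostnames:

```shell
# Hypothetical origin fleet; replace with your real server list.
ORIGINS="origin-1.internal origin-2.internal origin-3.internal"

cert_serial() {
  # Print the serial of the certificate presented at $1:443.
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -serial 2>/dev/null
}

for host in $ORIGINS; do
  printf '%s %s\n' "$host" "$(cert_serial "$host")"
done
# More than one distinct serial across the fleet means a partial rollout.
```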
Mistake 2: Incomplete certificate chain. After cycling all three servers at 2:48, TLS handshakes succeeded for browsers. But Java-based clients in a partner integration failed with PKIX path building failed. The Nginx config referenced cert.pem (leaf only) instead of fullchain.pem (leaf plus intermediate). Browsers cached the intermediate from previous connections; Java's TLS implementation does not — it fails hard.
# the verification step that would have caught both mistakes (run it against each origin)
openssl s_client -connect api.prod.example.com:443 -verify_return_error < /dev/null
At 3:14, after fixing the Nginx config to reference the full chain and reloading across all three origins, the certificate chain was complete and all clients recovered.
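For reference, the relevant Nginx directives after the fix looked like this (paths are illustrative; the point is that `fullchain.pem` bundles the leaf plus intermediates, which strict clients require):

```nginx
# /etc/nginx/conf.d/api.conf (illustrative paths)
server {
    listen 443 ssl;
    server_name api.prod.example.com;

    # fullchain.pem = leaf + intermediates; cert.pem (leaf only) breaks
    # clients that do not cache intermediates, e.g. Java's PKIX validation.
    ssl_certificate     /etc/letsencrypt/live/api.prod.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.prod.example.com/privkey.pem;
}
```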
The 10am fix (the proper one)
The morning crew verified certificate rotation across every endpoint:
- All three origin servers
- The staging environment (also expired — nobody had noticed)
- The partner-facing mTLS endpoint
- Verification from multiple external vantage points
- Full chain validity confirmation
- Customer communication that should have gone out at 2:30
Total customer-facing downtime: 67 minutes for the initial outage, 12 minutes for the self-inflicted second outage from the partial fix.
The Certificate Incident Runbook
A certificate incident runbook prevents 3am reasoning from first principles. Every step below maps directly to something that went wrong during our incident. Based on aggregated postmortem data from the SRE community, an estimated 60% of certificate-related outages could be resolved in under 15 minutes with a documented TLS troubleshooting checklist.
Step 1: Detection and triage
- Confirm the symptom is TLS-related:
  openssl s_client -connect HOST:443 < /dev/null 2>&1 | grep -i "verify\|expire\|error"
- Check if the cert is expired, expiring soon, or mismatched:
  openssl s_client -connect HOST:443 < /dev/null 2>&1 | openssl x509 -noout -dates -subject
- Check from outside your network (clock skew on the host can mask local checks)
- Severity classification: customer-facing = P1, internal-only = P2, dev/staging = P3
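The triage steps above can be sketched as a small helper, assuming GNU date and the severity mapping from this runbook (the function names are ours, not a standard tool):

```shell
days_until_expiry() {
  # Whole days until the cert presented at $1:$2 expires (GNU date).
  end=$(openssl s_client -connect "$1:$2" -servername "$1" </dev/null 2>/dev/null \
        | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
  [ -n "$end" ] || { echo "unreachable"; return 1; }
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

classify_severity() {
  # Audience -> priority, per the severity classification above.
  case "$1" in
    customer) echo P1 ;;
    internal) echo P2 ;;
    *)        echo P3 ;;  # dev/staging
  esac
}
```

Usage: `days_until_expiry api.prod.example.com 443`, then `classify_severity customer` to pick the priority.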
Step 2: Identification
- Which cert?
  openssl s_client -connect HOST:PORT -showcerts (note the serial and issuer)
- Which endpoint? Check ALB, CDN, origin, and any internal hops separately
- Who owns it? Check your certificate inventory. If you don't have one, check git blame on the Nginx/Apache config, Terraform state, or the cert's registered email
Step 3: Remediation by certificate type
| Certificate Type | Renewal Method | Key Verification |
|---|---|---|
| ACM-managed (ALB, CloudFront) | Should auto-renew. Check DNS validation records. Re-trigger validation in ACM console. | Verify DNS CNAME records exist |
| certbot / Let's Encrypt (Nginx, origin) | Run certbot renew --cert-name DOMAIN. Reload Nginx on all servers. | Verify fullchain.pem is referenced, not cert.pem |
| Internal / mTLS | Reissue from internal CA. Distribute to both sides. | Coordinate with partner or service owner — cannot be rushed |
Step 4: Verification
# check the leaf certificate's validity window
openssl s_client -connect HOST:443 < /dev/null 2>/dev/null | openssl x509 -noout -dates
# confirm the full intermediate chain is being served
openssl s_client -connect HOST:443 -showcerts < /dev/null 2>/dev/null | grep -A2 "Certificate chain"
# verify as a strict client would (Java-style chain validation)
curl --cacert /etc/ssl/certs/ca-certificates.crt https://HOST/health
Step 5: Things that look like cert problems but aren't
- Clock skew: Host clock is wrong, making a valid cert appear expired. Check timedatectl.
- OCSP stapling failure: Cert is valid but the OCSP responder is down. Nginx logs show "OCSP responder error". Temporarily disable stapling.
- Client-side CA bundle: The client has an outdated CA bundle. Common with old Java runtimes and embedded devices. Not your problem to fix, but you need to identify it so you stop debugging your own infra.
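To rule out clock skew quickly, compare the local clock against a Date header from the outside world. A sketch assuming GNU date; the reference URL is arbitrary and any reliable external endpoint works:

```shell
skew_seconds() {
  # Seconds between the local clock and an HTTP Date header value ($1).
  echo $(( $(date +%s) - $(date -d "$1" +%s) ))
}

remote_date=$(curl -sI --connect-timeout 5 https://example.com | tr -d '\r' \
              | awk -F': ' 'tolower($1)=="date" {print $2}')
if [ -n "$remote_date" ]; then
  skew=$(skew_seconds "$remote_date")
  echo "clock skew: ${skew}s"
  [ "${skew#-}" -lt 120 ] || echo "significant skew: check timedatectl / NTP sync"
fi
```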
Step 6: Communication
Send the status update before you send the fix. Customers are more forgiving of "we know and we're working on it" than silence followed by "it's fixed." Use your incident communication template. If you don't have one, write one now — you will need it again.
Systemic Fixes That Actually Prevent the Next 2AM Page
Preventing certificate expiry incidents requires fixes at five layers: discovery, monitoring, alerting, automation, and process. According to Gartner, organizations that implement automated certificate lifecycle management reduce certificate-related outages by 90%, but the median time to full implementation is 6–9 months. We implemented seven changes over three months.
Here they are, ranked by effort-to-impact:
| Fix | Timeline | Impact | Description |
|---|---|---|---|
| Certificate inventory discovery | 1 week | High | Scanned every load balancer, origin server, and internal endpoint. Found 23 unknown certs. Tools: certigo, tlsx, or openssl s_client loop across CIDR blocks. |
| Multi-layer TLS monitoring | 2 weeks | High | Prometheus blackbox exporter with probe_ssl_earliest_cert_expiry checks certs at the edge and at every hop behind it. Caught 3 more shadow certs in month one. |
| Tiered alert thresholds with escalation | 3 days | Medium | 60/30/14/7/3/1-day alerts. 60-day goes to cert owner. 14-day goes to team lead. 3-day pages on-call. Solved alert fatigue by making early warnings low-urgency. |
| Ownership tagging | 2 weeks | Medium | Every cert has an owner tag — a team, not a person. Enforced via OPA policy: no cert resource in Terraform without an owner tag. |
| Automated renewal | Ongoing | High | cert-manager in Kubernetes, ACM for ALB-fronted services, certbot cron jobs for legacy Nginx. Covers ~80% of certs. |
| Quarterly cert review | 1 hour/quarter | Low | Calendar invite. Spreadsheet. An hour of "is this cert still needed, who owns it, is renewal automated." |
| Manual tracking for the 20% | Ongoing | Critical | Partner mTLS certs, legacy appliances without ACME support, vendor certs with manual CSR workflows. Explicit renewal dates and assigned owners. |
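The tiered thresholds from the table translate directly into Prometheus alerting rules on `probe_ssl_earliest_cert_expiry`. A sketch showing two of the six tiers; the `severity` label values are our routing convention, not a blackbox exporter default:

```yaml
groups:
  - name: tls-expiry
    rules:
      - alert: CertExpiresIn30Days
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: ticket        # routed to the cert owner, not on-call
        annotations:
          summary: "Cert on {{ $labels.instance }} expires in under 30 days"
      - alert: CertExpiresIn3Days
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 3
        for: 15m
        labels:
          severity: page          # this one wakes someone up
        annotations:
          summary: "Cert on {{ $labels.instance }} expires in under 3 days"
```

Routing the two severities to different receivers in Alertmanager is what makes the early tiers low-urgency instead of noise.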
We built internal tooling to handle the discovery and monitoring pieces. If you're looking for something purpose-built, CertPulse handles certificate inventory discovery, multi-layer monitoring, and ownership tracking that we spent weeks cobbling together. The tiered alerting with escalation policies made the biggest difference in practice — not just knowing a cert will expire, but making sure the right person acts on it before the page fires.
The honest truth about certificate lifecycle management: you will never fully automate it. The goal is to automate the 80% that can be automated and make the remaining 20% impossible to forget.
FAQ
How do I check if a TLS certificate is about to expire?
Run openssl s_client -connect yourhost.com:443 < /dev/null 2>&1 | openssl x509 -noout -enddate from any machine with OpenSSL installed. This returns the certificate's expiration date in plaintext. For monitoring at scale, Prometheus blackbox exporter provides the probe_ssl_earliest_cert_expiry metric, which gives days-to-expiry as a numeric value you can alert on with tiered thresholds.
What's the difference between a certificate expiry and a certificate chain error?
A certificate expiry means the leaf certificate's validity period has passed. A certificate chain error means the server isn't presenting the full trust path — typically a missing intermediate CA certificate. Both cause TLS handshake failures, but the fix is different: expiry requires renewal, while a chain error requires correcting the server's certificate bundle to include fullchain.pem instead of cert.pem.
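A single handshake tells you which of the two you have; the verify return code differs. A sketch (HOST is a placeholder, and the helper name is ours):

```shell
verify_code() {
  # Print the TLS verify return code reported for $1:443.
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>&1 \
    | sed -n 's/^ *Verify return code: *//p' | head -n1
}
# Typical outputs:
#   "10 (certificate has expired)"                 -> renew the leaf
#   "21 (unable to verify the first certificate)"  -> missing intermediate;
#                                                     serve fullchain.pem
```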
How should certificate expiry alerts be structured to avoid alert fatigue?
Use tiered thresholds with escalating severity: 60-day as a low-priority notification to the cert owner, 30-day as a ticket, 14-day as an escalation to the team lead, and 7-day and under as an on-call page. The key principle is that early warnings should be informational and routed to the owner, not broadcast to everyone. Alert fatigue happens when every warning feels like an emergency.
Which certificates can't be auto-renewed with ACME or certbot?
Four categories of certificates require manual renewal: partner mTLS certs with contractual CSR exchange processes, EV certificates (though increasingly rare), certs on legacy hardware or appliances that don't support the ACME protocol, and certs issued by private certificate authorities without ACME integration. These should be tracked separately with explicit ownership and renewal dates.
What should a certificate incident runbook include?
A certificate incident runbook should include six sections: a detection and triage checklist with OpenSSL verification commands, identification steps (which cert, which endpoint, who owns it), remediation procedures for each cert type in your environment (ACM-managed, certbot/Let's Encrypt, internal mTLS), verification commands that check the full certificate chain from multiple vantage points, a "things that look like cert problems but aren't" section covering clock skew and OCSP stapling issues, and a communication template. Every step should reference a specific past failure it prevents.
This is why we built CertPulse
CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.
If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.