Operations

Debugging TLS Handshake Failures: A Field Guide From Packet Capture to Root Cause

May 22, 202614 min readCertPulse Engineering

The pager went off at 2:47am. A payments integration partner couldn't reach our API. Dashboards green, app logs showed nothing past connection refused, and TCP retransmits sat at normal levels. Forty minutes in, we figured out it was a TLS handshake failure caused by a missing intermediate certificate on a freshly-rotated leaf. The handshake itself failed in under 80ms. The bridge call to find that out took four hours.

This is the worst class of outage we deal with: a fast-failing protocol error that produces almost no useful signal in the places engineers normally look. This guide walks through the eight failure modes I've actually hit in production, with the commands I run on autopilot at 2am to triage them.

Why TLS Handshake Failures Are the Worst Kind of Outage

TLS handshake failures are the hardest production outages to debug because they fail inside one RTT, leaving no signal in latency dashboards, APM tools, or application logs. According to our internal incident log, handshake-class failures average 3.8 hours to root-cause versus 47 minutes for a typical HTTP 5xx outage.

The asymmetry is brutal. The failure event itself lasts microseconds. The investigation lasts the rest of your night because:

  • Most APM tools don't decode TLS alert codes
  • Cloud load balancer access logs strip the handshake before logging the connection
  • App logs see a closed socket, not a failed negotiation
  • The two sides of the connection often record completely different stories
  • SDKs surface generic "connection reset" or "broken pipe" errors

Treat this as a debugging-cost problem, not a TLS problem. The fix is almost always trivial once you know what failed. The cost is in finding out.

The Handshake in 90 Seconds: What Actually Has to Go Right

A TLS handshake negotiates cipher suites, validates certificate chains, and checks hostnames against SANs. TLS 1.3 completes in one round trip and encrypts the certificate. TLS 1.2 takes two RTTs and sends the certificate in cleartext. Four ClientHello fields explain roughly 90% of failures: cipher list, SNI value, supported_versions extension, and ALPN.

Here's what a real ClientHello looks like in Wireshark (ssl.handshake.type == 1):

Handshake Type: Client Hello (1)
Version: TLS 1.2 (0x0303)
Cipher Suites Length: 32
  TLS_AES_128_GCM_SHA256 (0x1301)
  TLS_AES_256_GCM_SHA384 (0x1302)
Extension: server_name (len=20)
  Server Name: api.example.com
Extension: supported_versions
  Version: TLS 1.3 (0x0304)
  Version: TLS 1.2 (0x0303)
Extension: application_layer_protocol_negotiation
  ALPN Protocol: h2

TLS 1.3 moves the certificate behind the application traffic key, so if you sniff the wire and can't see the cert payload, that's expected, not a bug.

Across our scanner fleet, 71% of new connections now negotiate TLS 1.3, up from 38% in early 2022. That transition is what makes a lot of the failures below newly relevant.

Your First Three Commands: openssl s_client, curl -v, tcpdump

Three tools cover 95% of TLS troubleshooting before you ever open Wireshark: openssl s_client confirms the raw handshake, curl -v adds HTTP semantics, and tcpdump arbitrates when both lie to you. Run them in that order.

The exact invocations I keep in muscle memory:

# Confirm raw handshake, force SNI, dump full chain
openssl s_client -connect api.example.com:443 \
  -servername api.example.com -showcerts -tls1_2

# Bypass DNS to test a specific backend
curl -v --resolve api.example.com:443:10.0.0.42 \
  https://api.example.com/health

# Capture only the handshake from one host
tcpdump -i any -w handshake.pcap \
  'port 443 and host 10.0.0.42'

Each tool has a blind spot:

  • s_client doesn't speak HTTP/2 by default, so it misses ALPN-driven failures
  • curl hides handshake details unless you add --trace
  • tcpdump shows bytes on the wire but can't decrypt TLS 1.3 without keylog files

Knowing which tool is lying to you is half the job.

Failure Mode 1: SNI Mismatch and the "Wrong Certificate" Class

An SNI mismatch occurs when a client omits the Server Name Indication extension or sends the wrong hostname, and the server returns a default certificate belonging to a different service. Roughly 12% of the SSL handshake error tickets we've seen in customer postmortems trace back to this exact pattern. The symptom looks like hostname verification failure even though the cert isn't expired and the chain is intact.

Run openssl s_client against the same IP twice:

openssl s_client -connect 10.0.0.42:443 -servername api.example.com
# Returns: CN=api.example.com

openssl s_client -connect 10.0.0.42:443
# Returns: CN=default.k8s-ingress.local

Two different certificates from one IP. That's SNI routing doing its job, and the client failing to use it.

Usual offenders:

  • Old Java 7 clients with SNI support disabled
  • HTTP/1.0 software that doesn't send Host headers in a parseable form
  • Reverse proxies stripping SNI when terminating and re-originating TLS
  • Hardcoded Host: headers that don't match the URL hostname

This bites mTLS in production the hardest, because the wrong certificate triggers a chain failure on top of the hostname mismatch, and the actual root cause hides two errors deep.

Failure Mode 2: Chain Issues, Missing Intermediates, and Cross-Signed Roots

A TLS chain incomplete error happens when the server sends only its leaf certificate and skips intermediates needed to build a trust path to a root. Browsers paper over this with AIA fetching; curl, Go's crypto/tls, and most Java HTTP clients do not. Your laptop says fine, your Go service says no.

The Let's Encrypt DST Root CA X3 expiry in September 2021 was the canonical case. Cross-signed roots meant some clients trusted the new R3 path and others followed the expired chain. We still see the same shape every quarter from internal CAs that rotate their issuing intermediate without updating the bundle servers serve.

Verify the chain explicitly:

openssl s_client -connect host:443 -servername host -showcerts \
  </dev/null 2>/dev/null | \
  awk '/BEGIN CERT/,/END CERT/' > /tmp/chain.pem

openssl verify -CAfile /etc/ssl/certs/ca-bundle.pem \
  -untrusted /tmp/chain.pem /tmp/leaf.pem

A passing browser plus a failing curl is the canonical fingerprint here. We covered the broader pattern in when your certificate works in Chrome but breaks everywhere else, including the Salesforce, Postman, and mobile-app combinations that catch teams out.

Failure Mode 3: Cipher Suite and Protocol Version Negotiation Failures

A handshake_failure alert (code 40) or protocol_version alert (code 70) in the ServerHello means the two sides found no overlap in what they were willing to negotiate. Usually one side mandates TLS 1.2+, AEAD-only ciphers, or specific curves, and the other side can't comply. The fatal alert lands in the server log within milliseconds; the client sees a closed connection.

The TLS 1.3 rollout is making this worse, not better. We see more hardened nginx configs running ssl_protocols TLSv1.3 only, and more legacy IoT firmware that caps at TLS 1.0 with RC4. They cannot talk to each other.

Force a specific suite to isolate:

openssl s_client -connect host:443 -servername host \
  -tls1_2 -cipher 'ECDHE-RSA-AES128-GCM-SHA256'

If that connects and the default invocation fails, you've found a negotiation problem rather than a cert problem. Common signatures:

  • Alert 40 immediately after ClientHello: cipher suite intersection empty
  • Alert 70 before ServerHello: protocol version unsupported
  • FIPS-only OpenSSL builds rejecting non-FIPS suites the peer offers
  • Java keystores missing modern algorithms because the JRE is 8u191 or older

If you're standardizing config, TLS 1.3 across nginx, HAProxy, and Envoy has copy-paste cipher lists that survive most legacy audits.

Failure Mode 4: Clock Skew, Expired Certs, and Not-Yet-Valid Certs

Certificate validity errors come in three shapes: expired cert, not-yet-valid cert, or wrong client clock. According to our incident postmortems, roughly 18% of "expired cert" pages turn out to be clock issues somewhere in the path. The first triggers the classic 2am page. The second and third get misdiagnosed for hours because operators don't expect them.

Common clock-skew sources:

  • Containers with no NTP
  • Isolated networks behind air gaps
  • Virtual machines that lost time on suspend
  • CI runners minting certs with notBefore two minutes in the future because the runner clock drifted

In openssl s_client output, check the validity window:

Validity
    Not Before: May 22 14:00:00 2026 GMT
    Not After : Aug 20 14:00:00 2026 GMT

On the host, compare:

date -u
curl -sI https://google.com | grep -i date

A 6-minute drift is enough to break some pinned chains. Automated renewal pipelines that pre-stage certificates with future notBefore values are increasingly common as we push toward shorter lifetimes. The 47-day certificate timeline makes this failure mode more likely, not less.

Failure Mode 5: mTLS, Client Cert Missing, Wrong CA, or Revoked

mTLS debugging is harder because the handshake fails further in. The server sends a CertificateRequest, the client sends nothing or sends a cert from the wrong CA, and the server alerts with bad_certificate or unknown_ca. Both sides have logs. Both sides describe different events.

The asymmetry trap looks like this:

  • Server log: "no client certificate presented"
  • Client log: "certificate sent successfully, peer closed connection"

Both are technically correct. The client sent the cert; the server rejected it before parsing because it was signed by a CA not in the server's trust store, and the server's log line fires after the rejection. We see this almost every week in customer environments.

Debug from the server side first:

openssl s_client -connect mtls.example.com:8443 \
  -cert client.crt -key client.key \
  -CAfile server-ca.pem -showcerts

Alert interpretation:

  • tlsv13 alert certificate required: the client didn't send one or sent an empty Certificate message
  • tlsv13 alert unknown ca: the server doesn't trust the issuing CA

Trust store drift is the most common root cause, especially when client certs rotate to a new intermediate and the server bundle doesn't follow.

Failure Mode 6: OCSP and CRL Hangs Pretending to Be Handshake Failures

An OCSP stapling failure or unreachable CRL Distribution Point sometimes manifests as a 30-second handshake followed by a timeout, not a clean alert. Strict clients hard-fail on revocation check failures even when documentation says soft-fail. The handshake completes the bytes; the validation step times out. About 7% of our scanner fleet's slow-handshake alerts trace to this.

The fingerprint to watch for: handshake duration with a long tail. Median 80ms, p99 32 seconds. The variance is the giveaway, and it usually gets misdiagnosed as network latency for weeks before anyone correlates it to a CDP host being slow.

Confirm OCSP responder health directly:

openssl ocsp -issuer issuer.pem -cert leaf.pem \
  -url http://ocsp.example.com -noverify

curl -v --max-time 5 \
  http://crl.example.com/intermediate.crl > /dev/null

If either takes more than 2 seconds, you've found the latency source. We covered the broader pattern in OCSP stapling is probably broken on half your endpoints, including the silent-failure cases where the staple is missing but nobody hard-fails until one strict client hits the endpoint.

Failure Mode 7: Load Balancer and SNI Routing Misconfiguration

Load balancer misconfiguration accounts for roughly a quarter of the TLS handshake failure incidents tagged in our reviews, and it's almost always a mismatch between what the LB thinks the SNI routes to and what the backend serves. The certificate is fine. The backend is fine. The thing in the middle is wrong.

Patterns that show up most:

  • AWS NLB with TLS passthrough vs termination: passthrough forwards the original ClientHello, termination rewrites it. Mix them up and SNI disappears.
  • Cloudflare in Full (strict) mode: requires a complete chain on origin, not just the leaf
  • HAProxy use_backend if { req_ssl_sni -i } rules: stale hostname after a rename
  • Envoy SDS reloading certs without draining connections: leaves in-flight handshakes mid-rotation

AWS adjusted ALB SNI behavior in 2024 around how it picks a default certificate when no SNI is sent. If you rely on undocumented fallback behavior, you eventually pay for it. Test with s_client against the load balancer's IP and the backend's IP separately. If the certs differ in unexpected ways, the LB is the suspect.

Failure Mode 8: Connection Reuse, Session Resumption, and TLS 1.3 0-RTT Weirdness

Stale session tickets, 0-RTT replay rejection, and HTTP/2 connection coalescing produce the hardest class of bugs to reproduce: handshakes that succeed on the first request and fail on every subsequent one within the same connection, or vice versa. Up to 60% of traffic on high-volume endpoints resumes sessions rather than performing fresh handshakes, so resumption bugs only fire under load.

Classic symptoms:

  • First request works, second on the same connection returns "stream reset"
  • Works in staging, fails in prod because staging never resumes
  • Works from one POP but not another after a key rotation

HTTP/2 makes this worse because clients coalesce connections by IP and certificate SAN coverage. A second request to a different hostname can reuse the first connection's TLS state, and SNI for the second hostname never gets sent.

Isolate by disabling resumption:

openssl s_client -connect host:443 -servername host \
  -no_ticket -sess_out /dev/null

If that connects clean while resumed connections fail, you have a session ticket or 0-RTT issue. Rotate the ticket key on every server that handled the original handshake.

A 5-Minute Triage Checklist For The Next Page

The fastest path to root cause asks four questions in order: is it one host or many, is it a new deploy or steady state, all clients or one client family, and does the failure happen before or after the first byte? Each branch points at a specific failure mode above.

Symptom Likely cause First command
One host, post-deploy, all clients Cipher/version mismatch s_client -tls1_2 -cipher ...
Many hosts, no deploy, one client family SNI or chain s_client -servername ...
Steady-state, intermittent, long tail OCSP/CRL openssl ocsp -url ...
New backend pool, all clients Load balancer s_client LB then origin
First request works, second fails Session resumption s_client -no_ticket
mTLS only Trust store drift Check server CA bundle
Sudden, exactly at midnight UTC Cert expiry openssl x509 -noout -dates
Container only, never bare metal Clock skew date -u; ntpq -p

Tape it to the wall. The flowchart matters more than memorizing the tools.

What We Catch With CertPulse Before It Becomes a Page

CertPulse catches six of these eight failure modes through automated scanning before they page on-call. The platform validates certificate chains on every scan (Mode 2), detects weak ciphers and protocol versions (Mode 3), runs validity window checks across scanner regions to surface clock skew patterns (Mode 4), monitors OCSP responder health (Mode 6), and performs SNI-aware multi-IP probing for load-balanced endpoints (Mode 7). It also flags scan-time anomalies that correlate with Mode 8 resumption issues.

Honest limits:

  • CertPulse cannot catch mTLS trust store drift on internal services we can't probe with a client cert
  • CertPulse cannot catch the SNI-mismatch class when the misbehaving client is on the public internet and we never observe it

For those, the right tool is structured logging on the server side plus alerting on bad_certificate and unknown_ca alert codes.

A TLS handshake failure is rarely a TLS problem at the protocol level. It's a configuration drift problem, a clock problem, or a chain-of-trust problem that surfaces through TLS because that's where the validation runs. Catching it before the page means watching the things that drift, not just the things that expire.

FAQ

How do I tell the difference between an SSL handshake error and a network connectivity issue?

Run openssl s_client -connect host:443 first. If it returns "Connection refused" or hangs on TCP, you have a network problem. If it returns "no peer certificate available" or any line starting with tlsv1 alert, you have a TLS problem. The TCP handshake completing but the TLS handshake failing is the giveaway.

What's the fastest way to capture a TLS handshake for analysis?

Use tcpdump -i any -s 0 -w handshake.pcap 'port 443 and host x.x.x.x', reproduce the failure, then open the pcap in Wireshark and filter tls.handshake. The capture should be under 50KB if you scope it tightly. For TLS 1.3 decryption, set SSLKEYLOGFILE in the client environment first.

Why does my certificate work in browsers but fail in curl or Go?

Almost always missing intermediates. Browsers fetch them via AIA (Authority Information Access) extensions; most CLI tools and language stdlibs do not. Run openssl s_client -showcerts and count the certificates returned. If you see one BEGIN CERTIFICATE block instead of two or three, your server isn't sending the full chain.

How long should a TLS handshake actually take?

TLS 1.2 typically completes in 60-150ms across continents. TLS 1.3 cuts that roughly in half. Anything over 500ms is suspicious, and anything over 2 seconds is almost always OCSP or CRL fetching during validation. Persistent variance with a long tail is the signature.

Should I disable OCSP checking to make handshakes faster?

No. Configure OCSP stapling on the server side so the responder gets hit by your infrastructure, not by every client. That gives you fresh revocation data without adding latency to user connections, and it puts the OCSP responder's health on your monitoring rather than at the edge of an unpredictable internet path.

This is why we built CertPulse

CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.

If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.