
The certificates nobody monitors: TLS on Postgres, Kafka, SMTP, and the rest of your non-web stack

May 31, 2026 · 8 min read · CertPulse Engineering

The certificates nobody monitors

When your cert monitoring fires, does it actually know about every certificate that can page you? For most setups the answer is no.

Most monitoring was built around the web. Someone wired up a check against port 443, added a few hostnames behind a load balancer, set an expiry threshold, and called it done. That covers the certs your customers see. It misses the ones your systems see: the certs terminating TLS on your database, your message broker, your mail relay, your directory server. They expire on the same calendar as every other cert. But when they go, there's no browser warning. There's a connection error deep inside an application someone wrote three years ago and nobody fully remembers.

I once watched a Postgres cert expiry get diagnosed as an application bug for forty minutes, because the error surfaced as a connection pool exhaustion alert instead of a TLS alert. The cert existed. It was just expired, and sslmode=verify-full was doing exactly what it had been told to do.

The blind spot is everything that isn't a browser

Walk your own stack and count the TLS listeners that aren't on 443. A typical haul:

  • Postgres on 5432, with sslmode set to something that actually validates
  • Kafka on 9093 doing TLS or mTLS between brokers and clients
  • Redis on 6379 with TLS enabled (since Redis 6)
  • SMTP on 25 and 587 using STARTTLS, plus implicit-TLS submission on 465
  • IMAP on 993, the mail server nobody migrated off
  • LDAP on 636, or 389 with STARTTLS — your auth backbone
  • MQTT on 8883 if anything touches IoT or device fleets
  • etcd, the Kubernetes API, internal gRPC services, all TLS, often mTLS

Every one of these has a certificate with a notAfter date. Every one pages someone when it lapses. And almost none of them live in the dashboard watching your public web endpoints. The cert sits on a port your HTTP-shaped monitoring was never pointed at, and that's the whole gap right there.

Why a plain openssl probe doesn't cut it

The instinct is to reach for openssl s_client -connect host:port and read the dates off the output. For an HTTPS endpoint that works fine. For half the list above it hangs or returns nothing, and the reason is worth chasing, because it's the entire problem in miniature.

TLS doesn't always start the moment the TCP connection opens. Some protocols negotiate the upgrade inside their own wire protocol first. SMTP is the textbook case: the connection opens in plaintext, the client sends EHLO, the server advertises STARTTLS, the client asks for it, and only then does the handshake begin. Point a raw TLS client at port 587 and fire off a ClientHello immediately, and the mail server just sits there waiting for an EHLO that never comes. Your probe hangs until it times out.
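To make that sequence concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not how any particular tool does it. The function names, the default port, and the cafile parameter are assumptions, and it expects the server's chain to verify against the system trust store or a CA bundle you pass in. An already-expired cert makes starttls() raise ssl.SSLCertVerificationError instead of returning a date, which is itself a usable signal.

```python
import smtplib
import ssl
from typing import Optional


def smtp_starttls_not_after(host: str, port: int = 587,
                            cafile: Optional[str] = None,
                            timeout: float = 10.0) -> str:
    """Return the server cert's notAfter string after a STARTTLS upgrade."""
    # Verifies chain and hostname; pass cafile when the cert comes from a private CA.
    ctx = ssl.create_default_context(cafile=cafile)
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.ehlo()                 # plaintext greeting; server advertises STARTTLS
        smtp.starttls(context=ctx)  # asks for the upgrade, then runs the handshake
        cert = smtp.sock.getpeercert()
    return cert["notAfter"]


def days_left(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a notAfter string such as
    'Jan  5 09:34:43 2018 GMT'. Negative means already expired."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400
```

Usage would look something like days_left(smtp_starttls_not_after("mail.example.internal"), time.time()), where the hostname is a placeholder.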

openssl knows this, which is why it ships -starttls with protocol arguments: smtp, imap, pop3, ldap, xmpp, and the one that matters here, postgres. Postgres does something genuinely its own. The client sends an SSLRequest message — a specific byte sequence the wire protocol defines — and the server replies with a single byte: S for "yes, let's do TLS," N for "no." Only after that S does the handshake start. A naive TLS probe never sends the SSLRequest, the server never says S, and you get nothing. So Postgres cert expiry slips through generic monitoring with depressing reliability: the thing checking it has to speak a little Postgres before it can speak TLS.
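The exchange described above is small enough to sketch by hand. The following Python is a rough illustration under stated assumptions: the function name and parameters are made up for the example, and it assumes the server cert verifies against the system trust store or the CA bundle passed as cafile. The SSLRequest itself is eight bytes: a 4-byte length of 8, then the 4-byte request code 80877103.

```python
import socket
import ssl
import struct
from typing import Optional

# SSLRequest: int32 length (8), then int32 code 80877103 (1234 << 16 | 5679).
SSL_REQUEST = struct.pack("!II", 8, 80877103)


def postgres_peer_cert(host: str, port: int = 5432,
                       cafile: Optional[str] = None,
                       timeout: float = 10.0) -> dict:
    """Speak just enough Postgres to start TLS, then return the peer cert."""
    ctx = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=timeout) as raw:
        raw.sendall(SSL_REQUEST)
        answer = raw.recv(1)          # b"S" = proceed with TLS, b"N" = refused
        if answer != b"S":
            raise ConnectionError(f"server declined TLS: {answer!r}")
        # Only now does the TLS handshake begin.
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()  # dict including 'notAfter'
```

The returned dict carries the notAfter string, which ssl.cert_time_to_seconds() can turn into an epoch timestamp for alerting.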

Kafka is similar in spirit. TLS sits underneath the Kafka protocol, and depending on listener config you may have to satisfy mTLS — present a client cert — before you see anything at all. Run a probe with no client certificate against an mTLS-required broker and it won't tell you the server cert is expired. It'll tell you the handshake failed. That's a different fact, and a far less useful one.
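A sketch of what that means for a probe, in Python with the standard library: load a client cert into the TLS context before connecting, and keep "the server cert is expired" separate from "the handshake failed" when reporting. The function names and file-path parameters here are hypothetical. The value 10 is OpenSSL's verify code for an expired certificate.

```python
import socket
import ssl

X509_V_ERR_CERT_HAS_EXPIRED = 10  # OpenSSL verify code for an expired cert


def mtls_peer_cert(host: str, port: int, cafile: str,
                   certfile: str, keyfile: str, timeout: float = 10.0) -> dict:
    """TLS-on-connect probe that presents a client certificate."""
    ctx = ssl.create_default_context(cafile=cafile)
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)  # client identity
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()


def describe_failure(exc: ssl.SSLError) -> str:
    """Turn a handshake exception into the fact it actually establishes."""
    if isinstance(exc, ssl.SSLCertVerificationError):
        if getattr(exc, "verify_code", None) == X509_V_ERR_CERT_HAS_EXPIRED:
            return "server certificate expired"
        return "server certificate failed verification"
    return "handshake failed (no certificate verdict)"
```

A caller would wrap mtls_peer_cert("broker.example.internal", 9093, ...) in try/except ssl.SSLError and alert on the describe_failure() result, so a rejected client cert is not reported as an expired server cert or the other way round. The hostname is a placeholder.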

So probing non-HTTP TLS means knowing, per protocol, when the handshake happens and what you have to say first. No single command covers all of it.

These are internal-PKI certs, and that changes the rules

Here's the part that lulls people into assuming these certs are safe: most of them aren't public. Your Postgres cert, your broker certs, your internal service mesh — almost always issued by a private CA. Maybe Vault. Maybe an internal step-ca. Maybe a root someone generated in 2019 and stashed in a wiki page.

Two things fall out of that, and they pull in opposite directions.

First, Certificate Transparency monitoring won't see any of it. CT logs record certs from publicly-trusted CAs. Your internal CA doesn't log to them, and shouldn't. If your cert visibility strategy leans on CT log watching — which is great for catching typosquats and unauthorized public issuance — it's completely blind to your internal estate. CT tells you when someone pulls a cert for your domain from a public CA. It tells you nothing about the cert on 5432.

Second, the new short-lifetime rules don't apply. The CA/Browser Forum's march toward 47-day certificates by 2029 governs publicly-trusted TLS. Your internal CA can issue 10-year certs if it feels like it. Sounds like a reprieve. It's a trap. Long-lived internal certs mean rotation happens so rarely that nobody builds the muscle memory for it. The runbook is stale, or missing. The person who set it up left. And an expired internal CA root doesn't take down one service — it invalidates every leaf cert that chains to it, which can drop an entire broker cluster or service mesh in one shot. Rotation discipline matters more on internal PKI, not less, precisely because no browser vendor is handing you the deadline. You have to set it yourself.
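One way to encode that in monitoring is to treat a listener's effective deadline as the earliest notAfter anywhere in its chain, not just the leaf's. A minimal Python sketch; the labels and dates below are invented purely for illustration:

```python
from datetime import datetime, timezone
from typing import Dict, Tuple


def effective_expiry(chain: Dict[str, datetime]) -> Tuple[str, datetime]:
    """Given label -> notAfter for every cert in a chain, return the
    (label, notAfter) pair that expires first."""
    label = min(chain, key=chain.get)
    return label, chain[label]


# Hypothetical chain: the leaf looks healthy, the root does not.
example_chain = {
    "leaf": datetime(2031, 1, 1, tzinfo=timezone.utc),
    "intermediate": datetime(2030, 1, 1, tzinfo=timezone.utc),
    "root": datetime(2029, 6, 1, tzinfo=timezone.utc),
}
```

With the example data, effective_expiry(example_chain) points at the root, which is the date the alert should be keyed on.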

What actually breaks

The failure modes here are nastier than a web cert expiry, mostly because they don't announce themselves.

mTLS failures look like auth bugs. When a client cert expires on a Kafka or gRPC connection, the broker rejects the handshake, the application sees a connection refused or an authentication failure, and it logs it as exactly that. Your on-call engineer goes digging through credentials, IAM, service accounts — anywhere but the certificate, because nothing in the alert said "certificate."

Validation failures look like network flakiness. A renewed server cert that doesn't chain correctly, or a clock-skewed notBefore, shows up as intermittent connection errors. People restart pods, blame the network, and the real cause just sits there.

And then there's the one that gets everybody: renewal "succeeded," but the service is still serving the old cert. Plenty of these services load their certificate once at startup and never look at the file again. cert-manager rotates the secret, the file on disk updates, your automation reports success, and Postgres, or HAProxy, or the broker, is still holding the old cert in memory because nothing told it to reload. Postgres 10 and later picks up a new server cert on pg_ctl reload (a SIGHUP). Older Postgres versions, and some other services, need a full restart. Until that happens, your monitoring says renewed and the listener says otherwise. The only way to catch it is to probe the live listener and read the cert it's actually serving, not the cert sitting in a file or a Kubernetes secret. Probe the socket. The file will lie to you; the socket can't.
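A simple way to catch the stale-cert case is to fingerprint both sides and compare. The Python sketch below assumes a TLS-on-connect listener (a Postgres or STARTTLS service would need its protocol preamble first) and a PEM file whose first certificate is the leaf. The function names are illustrative. Verification is switched off on purpose, because the goal is to read whatever the socket serves, valid or not.

```python
import hashlib
import socket
import ssl

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def pem_fingerprint(pem_text: str) -> str:
    """SHA-256 hex digest of the first certificate in a PEM bundle."""
    start = pem_text.index(PEM_BEGIN)
    end = pem_text.index(PEM_END, start) + len(PEM_END)
    der = ssl.PEM_cert_to_DER_cert(pem_text[start:end])
    return hashlib.sha256(der).hexdigest()


def live_fingerprint(host: str, port: int, timeout: float = 10.0) -> str:
    """SHA-256 hex digest of the cert the listener is serving right now."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE  # read the cert even if it is invalid
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der).hexdigest()


def is_stale(pem_text: str, host: str, port: int) -> bool:
    """True when the file on disk and the live listener disagree."""
    return pem_fingerprint(pem_text) != live_fingerprint(host, port)
```

If is_stale() returns True after a renewal reports success, the service has not reloaded.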

Building coverage

You don't need a new platform for this. You need to find the listeners and fold them into the alerting you already run.

Start by enumerating what's actually listening. On each host, ss -tlnp shows every listening TCP socket and the process behind it. Across the network, an nmap sweep with service and version detection turns up TLS on ports you forgot existed — and that's usually where the surprises hide, like the staging broker someone stood up before they left. Build the inventory from what's running, not from what the architecture diagram claims is running.

Keep a per-protocol probe cheat sheet. For each service, write down the one thing that makes its probe work. SMTP on 25 and 587, IMAP on 143, and LDAP on 389 take openssl s_client -starttls with the protocol name. Their implicit-TLS counterparts (465, 993, 636) handshake on connect, as do Redis and MQTT over TLS, so a plain -connect is enough. Postgres takes -starttls postgres. Kafka needs a client cert if the listener requires mTLS. Once it's written down, it's automatable. The knowledge is the bottleneck, not the tooling.
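The cheat sheet can live as data. Here is one possible shape in Python: a table mapping each service to its -starttls argument (or None for handshake-on-connect), and a helper that builds the openssl s_client argument list. The table layout, function name, and hostnames are assumptions for the example. The flags (-connect, -servername, -starttls, -cert, -key) are standard s_client options.

```python
from typing import Dict, List, Optional

# Service -> value for -starttls, or None when TLS starts on connect.
# Implicit-TLS ports (465, 993, 636) belong in the None group as well.
STARTTLS_ARG: Dict[str, Optional[str]] = {
    "smtp": "smtp",
    "imap": "imap",
    "ldap": "ldap",
    "postgres": "postgres",
    "redis": None,
    "mqtt": None,
    "kafka": None,
}


def probe_command(service: str, host: str, port: int,
                  client_cert: Optional[str] = None,
                  client_key: Optional[str] = None) -> List[str]:
    """Build the openssl s_client argv for one listener."""
    cmd = ["openssl", "s_client", "-connect", f"{host}:{port}",
           "-servername", host]
    starttls = STARTTLS_ARG[service]
    if starttls:
        cmd += ["-starttls", starttls]
    if client_cert and client_key:   # mTLS listeners, e.g. Kafka
        cmd += ["-cert", client_cert, "-key", client_key]
    return cmd
```

The resulting list can be handed to subprocess.run with stdin closed, and the output piped through openssl x509 -noout -enddate to extract the date.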

Then fold the expiry dates into your existing 443 alerting. Whatever fires when a web cert hits 30 days: same threshold, same channel, same escalation. The goal isn't a separate "database certs" dashboard that nobody opens. It's that a Postgres cert at 14 days pages exactly the way a web cert at 14 days does. One queue, one set of eyes.
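In code, "same threshold" just means the function that grades a cert takes no protocol argument. A tiny Python sketch: the 30 and 14 day values echo the examples above, while the level names and the split between warning and paging are assumptions to replace with whatever your web alerting already uses.

```python
def alert_level(days_left: float, warn_days: int = 30,
                page_days: int = 14) -> str:
    """Grade a certificate by days remaining, regardless of which
    port or protocol it was read from."""
    if days_left <= 0:
        return "expired"
    if days_left <= page_days:
        return "page"
    if days_left <= warn_days:
        return "warn"
    return "ok"
```

A Postgres cert and a web cert with the same days_left get the same answer, which is the whole point.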

This is the seam a tool like CertPulse fits into. Its endpoint probes reach the externally-resolvable, TLS-on-connect services cleanly — anything where the handshake starts on connect, plus the STARTTLS protocols it speaks natively — and those land in the same inventory and expiry alerting as your web endpoints. For the deep-internal cases (an mTLS-only broker on a private subnet, a Postgres listener behind your VPC), you'll still want a sidecar or an internal probe runner that can present a client cert, touch the socket, and report dates outward. One inventory, one alerting path, probes running wherever they have to run to reach the listener.

The cert that takes you down at 2am won't be the one on your homepage. Everybody watches that one. It'll be the one on 5432 that never made it into the dashboard, expiring quietly while your monitoring glows green. Go find your listeners before the calendar does.

This is why we built CertPulse

CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.

If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.
