
How We Cut CertPulse's Scan Time From 47 Minutes to 90 Seconds: A Concurrency Postmortem

May 2, 2026 · 11 min read · CertPulse Engineering

CertPulse cut TLS certificate scan time from 47 minutes to 90 seconds across an 1,800-certificate fleet spanning AWS, Azure, and GCP. The single biggest win was not concurrency. It was connection reuse plus DNS caching plus bounded worker pools, in that order. After monitoring certificates across multi-cloud environments at CertPulse, I can tell you the unsexy fixes outperformed every clever goroutine pattern we tried.

Last spring, a customer opened a ticket with the subject line "scan times unusable." 1,800 certs across three clouds. A full pass took 47 minutes. By the time it finished, half the data was stale and they were paying us for the privilege of waiting. We rewrote the scanner using better Go concurrency patterns and got that same fleet down to 90 seconds. This is the postmortem: what we tried, what broke, what stuck, and which unsexy optimizations mattered more than the clever ones. If you are tuning a certificate scanner architecture or doing devops SaaS engineering on anything that hits a lot of network I/O, the failure modes here will look familiar.

The baseline: why 47 minutes was a product-killer

47 minutes per scan was a product-killer because the customer's required refresh cycle was 6 hours, meaning 13% of every day was spent scanning, and a single transient DNS failure pushed a refresh to 90 minutes. Industry data on Go I/O profiling indicates network-bound workloads typically show <10% CPU utilization. Ours sat at 4%.

The original scanner was one goroutine looping over a slice of cert targets. For each target it did DNS resolution, opened a TCP connection, completed a TLS handshake, parsed the certificate chain, and queried CT logs. Sequential, predictable, embarrassingly slow.
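
The shape of that loop is easy to reconstruct. A minimal sketch, not the original source: the function name, the hardcoded port 443, and the skipped error handling are all illustrative.

func scanFleet(hosts []string) {
    for _, h := range hosts {
        addrs, err := net.LookupHost(h) // DNS resolution
        if err != nil || len(addrs) == 0 {
            continue
        }
        // TCP dial + full TLS handshake; the chain comes back with the handshake.
        conn, err := tls.Dial("tcp", net.JoinHostPort(addrs[0], "443"), &tls.Config{ServerName: h})
        if err != nil {
            continue
        }
        leaf := conn.ConnectionState().PeerCertificates[0]
        _ = leaf.NotAfter // expiry check, CT log lookup, and storage happened here
        conn.Close()
    }
}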

A scan per cert broke down like this:

Phase                Latency
DNS resolution       8–40 ms
TCP dial             20–120 ms (geography dependent)
TLS handshake        80–300 ms
Chain parse          ~2 ms
CT log lookup        200–600 ms
End-to-end average   1.57 seconds per cert

Multiply 1.57 seconds by 1,800 certs and you get 47 minutes. A flame graph from a real production scan showed 89% of wall time was network wait. The CPU sat at 4% utilization while goroutines blocked on syscalls. We were not CPU-bound. We were politely-asking-the-network-bound, which is the dominant failure mode for certificate scanner performance and TLS handshake latency at any meaningful fleet size.

First attempt: naive Go concurrency patterns and why they broke everything

Spawning a goroutine per certificate fails at three specific thresholds: file descriptor exhaustion at ~1,024 concurrent, AWS API throttling at ~1,500 concurrent, and OOM kill at ~3,200 concurrent. "Just add concurrency" is the engineering equivalent of pouring gasoline on a grease fire. It worked great on a dev fleet of 200 certs. At customer scale it set three different things on fire inside 90 seconds.
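
The change itself was tiny, which is exactly the trap. Roughly (a reconstruction, not the shipped code; scanOne stands in for the per-cert scan sketched above):

var wg sync.WaitGroup
for _, h := range hosts {
    wg.Add(1)
    go func(host string) { // one unbounded goroutine per certificate
        defer wg.Done()
        scanOne(host) // same sequential DNS, dial, handshake, parse steps
    }(h)
}
wg.Wait()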

Failure timeline at the actual concurrency thresholds:

  • ~1,024 concurrent — fd exhaustion. Default ulimit -n on Ubuntu 22.04 is 1,024. Error: dial tcp: socket: too many open files.
  • ~1,500 concurrent — AWS API rate limits. ThrottlingException: Rate exceeded from describe-certificate. AWS documents ACM at 50 RPS per region, but burst capacity drops to ~10 within seconds.
  • ~3,200 concurrent — OOM killed. dmesg showed Memory cgroup out of memory: Killed process 1142 (certpulse-scan).

DNS exhaustion was the surprise. The Go DNS resolver serializes lookups when there is one nameserver in /etc/resolv.conf. We watched 2,000 goroutines sit on a single mutex inside net.Resolver. The fix was not more goroutines. The fix was a goroutine pool with proper bounds.

The architecture that worked: bounded worker pools with per-provider backpressure

The architecture that worked uses a three-tier pool: a global semaphore caps total in-flight scans at 256, per-provider sub-pools throttle each cloud API independently, and a separate pool of 128 workers handles raw TLS dials. Backpressure is channel-based, so a slow provider blocks itself, not the others. After tuning across customer fleets, these are the pool sizes we settled on.

Pool sizes, with the constraint each one respects:

Pool            Workers   Constraint
Global          256       Peak useful parallelism on a 4-vCPU node
AWS             32        ACM 50 RPS per region, 5 regions concurrent
Azure           16        Key Vault throttles aggressively above this
GCP             16        Certificate Manager API quota: 60 RPM default
Raw TLS dials   128       Customer endpoint dials

type ScanPool struct {
    global    chan struct{}            // buffered to 256: total in-flight scans
    providers map[string]chan struct{} // buffered per provider: 32 AWS, 16 Azure, 16 GCP
}

func (p *ScanPool) Acquire(ctx context.Context, provider string) error {
    // Take a global slot first; give up immediately if the scan is cancelled.
    select {
    case p.global <- struct{}{}:
    case <-ctx.Done():
        return ctx.Err()
    }
    // Then take the provider slot. On cancellation, hand the global slot back
    // so a stuck provider cannot starve the rest of the fleet.
    select {
    case p.providers[provider] <- struct{}{}:
        return nil
    case <-ctx.Done():
        <-p.global
        return ctx.Err()
    }
}
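
The snippet above only shows Acquire. The matching Release is the obvious mirror image (a sketch, not quoted from our code), and every worker pairs the two with defer:

func (p *ScanPool) Release(provider string) {
    <-p.providers[provider] // hand back the per-provider slot
    <-p.global              // then the global slot
}

// Worker-side usage:
if err := pool.Acquire(ctx, "aws"); err != nil {
    return err
}
defer pool.Release("aws")
// ... describe-certificate call, TLS dial, chain parse ...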

We shipped a deadlock to staging because the AWS sub-pool was waiting on a result channel the global pool was supposed to drain. Fixed by separating result channels from signal channels and adding a context cancel path. The worker pool pattern is well-known. Getting the backpressure design right is the part most articles skip. Our cross-account certificate audit walkthrough covers the per-account enumeration logic that feeds this pool.

DNS was the real bottleneck (and the fix was ugly)

DNS resolution was eating 40% of the remaining wall time after the pool rewrite, because Go's pure-Go resolver serializes lookups under contention when only one upstream nameserver is configured. The fix was an in-process DNS cache with TTL-aware eviction. According to Go's net package documentation, the netgo build tag forces the pure-Go resolver; the cgo resolver parallelizes lookups but pulls in libc, which we did not want in our scratch container image.

strace confirmed the contention:

[pid 1142] futex(0x7f2a...4, FUTEX_WAIT_PRIVATE, 0, NULL
[pid 1143] futex(0x7f2a...4, FUTEX_WAIT_PRIVATE, 0, NULL
[pid 1144] futex(0x7f2a...4, FUTEX_WAIT_PRIVATE, 0, NULL

Three workers, one mutex, sequential resolution. The fix:

  • Pre-resolve known hostnames every 5 minutes
  • Cache hits return inline with no lock contention
  • Eviction respects the record's actual TTL

The bug we shipped: the first version of DNS caching ignored short TTLs on CNAME records that pointed to load-balanced ACM endpoints. For a week, customers got phantom cert mismatch alerts because we were checking a stale IP against a freshly-issued cert. The fix was honoring the actual record TTL instead of a hardcoded 5-minute window.
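
A minimal version of the TTL-honoring cache looks like this. It is a sketch: resolveWithTTL stands in for a resolver that actually exposes record TTLs (the standard library's does not, so in practice that means a DNS library such as miekg/dns), and the 5-minute pre-resolve sweep is omitted.

type dnsEntry struct {
    addrs   []string
    expires time.Time
}

type dnsCache struct {
    mu      sync.RWMutex
    entries map[string]dnsEntry
}

func (c *dnsCache) lookup(ctx context.Context, host string) ([]string, error) {
    c.mu.RLock()
    e, ok := c.entries[host]
    c.mu.RUnlock()
    if ok && time.Now().Before(e.expires) {
        return e.addrs, nil // hit: no resolver mutex, no network round trip
    }
    addrs, ttl, err := resolveWithTTL(ctx, host) // stand-in: a direct query that surfaces the record's TTL
    if err != nil {
        return nil, err
    }
    c.mu.Lock()
    c.entries[host] = dnsEntry{addrs: addrs, expires: time.Now().Add(ttl)} // honor the real TTL, not a fixed window
    c.mu.Unlock()
    return addrs, nil
}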

Connection reuse, TLS session tickets, and the 4x speedup we almost missed

TLS session resumption was the single largest contribution to scan performance in the entire rewrite. Adding a connection pool keyed on host:port plus session ticket caching cut p50 handshake time from 142 ms to 38 ms on repeat scans. After running this in production, I would now call connection reuse the highest-leverage optimization available for any TLS scanner.

Before/after numbers on a wildcard load balancer covering 200 subdomains:

Metric                     Before   After
p50 handshake              142 ms   38 ms
p99 handshake              480 ms   110 ms
Total time on that fleet   28 s     7 s

The trick is that one wildcard cert often answers for 200 subdomains. Without TLS session resumption, each scan was a full handshake costing ~150 ms. With ticket caching, the second through 200th scans dropped to ~40 ms each. Connection pooling and TLS handshake optimization are boring topics nobody covers in conference talks because they do not look impressive on a slide. They mattered more than every flashy concurrency change combined.
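
The session ticket half of this is mostly a matter of sharing one tls.ClientSessionCache across every dial in the scan run. A sketch using the standard library, with the host:port connection pool elided and the cache size picked arbitrarily:

var sessionCache = tls.NewLRUClientSessionCache(4096) // shared across the whole scan run

func dialTLS(ctx context.Context, host, port string) (*tls.Conn, error) {
    d := &tls.Dialer{Config: &tls.Config{
        ServerName:         host,
        ClientSessionCache: sessionCache, // enables session ticket reuse on repeat handshakes
    }}
    conn, err := d.DialContext(ctx, "tcp", net.JoinHostPort(host, port))
    if err != nil {
        return nil, err
    }
    return conn.(*tls.Conn), nil
}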

What broke at scale (and what I'd do differently)

Three things broke after the rewrite hit production: a goroutine leak in the cert chain parser, Postgres connection pool exhaustion, and a 6-hour customer outage caused by a misconfigured deploy. The worst of them surfaced at the storage boundary, not the network boundary, which is where most articles stop looking.

Issues that surfaced under load:

  • Goroutine leak. The parser spawned a goroutine per OCSP fetch and never reaped it on context cancel; we saw 40,000 goroutines in pprof after 12 hours. The pattern and its fix are sketched after this list.
  • Postgres pool exhaustion. 256 workers × 1 conn each does not fit a 100-connection pool. We hit pq: too many clients already and queries piled up.
  • The 6-hour outage. A deploy raised the global pool to 512 without raising the Postgres pool. Writes stalled, scan workers blocked on result inserts, and the queue backed up until the on-call rolled it back.
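
The goroutine leak deserves that sketch, because the pattern is so common. This is a reconstruction of the shape of the bug, not the original parser code: the unbuffered version blocks forever once the caller has returned on ctx.Done(), while a one-slot buffer lets the sender finish and exit.

// Before: the fetch goroutine wrote to an unbuffered channel; when the caller returned
// on ctx.Done() nobody ever read it, so the goroutine lived forever.
// After: a one-slot buffer lets the sender complete and exit even if the caller is gone.
func fetchOCSPWithCancel(ctx context.Context, cert *x509.Certificate) ([]byte, error) {
    ch := make(chan []byte, 1)
    go func() { ch <- fetchOCSP(cert) }() // fetchOCSP stands in for the real responder HTTP call
    select {
    case resp := <-ch:
        return resp, nil
    case <-ctx.Done():
        return nil, ctx.Err()
    }
}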

The fix for the database side: Postgres batch insert with a 500 ms flush window. One writer drains a results channel and inserts 50 rows per transaction. Postgres CPU dropped 70%. These problems showed where our Go concurrency patterns needed reinforcement at the storage boundary, not the network one. For the broader picture on what tends to break under sustained load, our piece on what actually fails in certificate monitoring covers the failure taxonomy.
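
The writer loop is short enough to sketch. Illustrative only: scanResult and insertBatch are stand-ins for our actual row type and the multi-row INSERT transaction.

func writeResults(ctx context.Context, db *sql.DB, results <-chan scanResult) {
    ticker := time.NewTicker(500 * time.Millisecond) // flush window
    defer ticker.Stop()
    batch := make([]scanResult, 0, 50)

    flush := func() {
        if len(batch) == 0 {
            return
        }
        insertBatch(ctx, db, batch) // one multi-row INSERT per transaction
        batch = batch[:0]
    }

    for {
        select {
        case r, ok := <-results:
            if !ok {
                flush()
                return
            }
            batch = append(batch, r)
            if len(batch) == 50 { // row cap per transaction
                flush()
            }
        case <-ticker.C:
            flush()
        case <-ctx.Done():
            flush()
            return
        }
    }
}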

If I rebuilt today: scanner emits results to a queue (NATS or Redis Streams), a separate writer service drains. Decoupling scan throughput from database throughput is the right move when scaling SaaS workloads horizontally beats stacking everything into one binary.

The numbers, one year later

CertPulse currently scans at a p50 of 87 seconds per 1,000 certs and a p99 of 134 seconds, at an infrastructure cost of ~$0.003 per scan including egress and provider API calls. The metrics dashboard catches regressions before customers notice because we alert on p50 drift relative to the rolling 7-day baseline.

What we monitor for SaaS metrics and scan performance benchmarks:

  • Scan duration (p50, p95, p99) per customer fleet size bucket
  • Provider API error rate, segmented by AWS / Azure / GCP
  • Postgres write batch size and flush latency
  • Goroutine count (alarm at 5x baseline)

What we alert on:

  • p50 +25% over 7-day rolling
  • Any provider error rate above 1%
  • Postgres connection pool > 80% utilization

What we ignore: CPU. We have never been CPU-bound and probably never will be. The embarrassing metric: CT log queries are still sequential per cert. They add roughly 12 seconds to p99. It is on the list. The list is long.

Conclusion

SaaS performance optimization rarely comes from one heroic change. It comes from a sequence of unglamorous fixes: connection reuse, DNS caching, batch writes, per-provider throttling, and a willingness to read flame graphs instead of guessing. Better Go concurrency patterns matter, but only if you back them with bounded pools, real backpressure, and honest measurement. Build in public, measure everything, and accept that the boring optimizations win.

If you are tuning your own pipeline against multi-cloud scanning quirks, the comparison of AWS ACM, Azure Key Vault, and Google Certificate Manager covers the per-provider footguns we hit on the way here. CertPulse is what we use to monitor our own fleet, and the scanner described in this post is what powers it.

FAQ

What concurrency level should a TLS certificate scanner target?

Start at 256 global workers and tune from there. Push much past that and you hit file descriptor limits at ~1,024 concurrent, AWS API throttling at ~1,500, and OOM kills at ~3,200 before you see further throughput gains. Per-provider sub-pools matter more than raw global parallelism, and concurrency tuning should always be paired with per-API rate budgets.

Should I use Go's pure-Go DNS resolver or cgo?

Pure-Go is the right default for containerized workloads, but it serializes lookups when you have one upstream resolver. If you cannot add resolvers to /etc/resolv.conf, bundle an in-process DNS cache that respects each record's TTL. Do not hardcode a TTL value across all record types — we shipped that bug and it produced a week of phantom cert mismatch alerts.

How much does TLS session resumption save in practice?

In our production data, p50 handshake time dropped from 142 ms to 38 ms on repeat scans against the same host, a 73% reduction. For wildcard load balancers covering many subdomains, this is the single highest-impact optimization available, and it costs almost nothing to implement.

How do you avoid AWS API throttling when scanning at scale?

Per-region sub-pools and exponential backoff on ThrottlingException. AWS documents ACM at 50 RPS per region, but burst capacity drops to ~10 within seconds. CertPulse caps at 32 concurrent describe calls per region across 5 regions and has not been throttled since.

What is the most overlooked scanner optimization?

Connection reuse and DNS caching. Both are unsexy, neither makes a good demo, and both delivered larger speedups than any concurrency rewrite. In our case, connection reuse plus session tickets cut a 28-second wildcard scan to 7 seconds. If your scanner is slow, profile network wait before reaching for a goroutine.

This is why we built CertPulse

CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.

If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.
