How We Built a Multi-Cloud Certificate Scanner That Doesn't Suck

April 3, 2026 · 11 min read · CertPulse Engineering

We started building CertPulse's multi-cloud certificate scanner because of a 2:47am PagerDuty alert for a wildcard cert that expired on a load balancer nobody remembered provisioning. The post-incident review revealed something worse than the outage itself: our certificate inventory spreadsheet listed 214 certs, but a manual audit across three cloud accounts found 347. Multi-cloud certificate discovery wasn't a feature we wanted to build. It was a problem that kept waking us up.

This is the story of how we built a scanner for AWS, GCP, and Azure whose inventory actually reflects reality, the architectural decisions that worked, and the ones that didn't. If you manage TLS certificate monitoring across more than one cloud provider, some of this might save you a few 2am pages.

The problem: nobody knows where all their certs are

Most organizations' certificate inventory is inaccurate. Based on our survey of 40 engineering teams before writing any code, 80% tracked certificates in spreadsheets or internal wikis, and zero had a complete picture across all their cloud accounts. Certificate sprawl is the default state of multi-cloud environments.

Here's what our pre-build research found:

  • 32 of 40 teams tracked certificates in spreadsheets or internal wikis
  • 6 teams used a single-provider dashboard (usually the AWS Certificate Manager console)
  • 2 teams had custom cron-based scripts that broke every time someone rotated IAM credentials
  • 0 teams had complete visibility across all cloud accounts

Certificate sprawl gets worse in predictable ways. The median team we talked to operated across 5–6 separate cloud consoles: 3 AWS accounts, 1–2 GCP projects, and at least one Azure subscription. Each console has its own IAM model and paginates certificate lists differently. In our experience building CertPulse, custom scripts handle one provider well, maybe two. By the third provider, maintenance cost exceeds the time saved.

The single-provider dashboards — ACM console, GCP's Certificate Manager page, Azure Key Vault's certificate blade — each show a fraction of the picture. They work for cloud certificate management within their own ecosystem. They tell you nothing about the cert on another provider that shares a SAN with the one you're looking at. Certificate expiration monitoring across providers requires a tool that sits above all of them.

Three cloud APIs, three different philosophies

AWS ACM, GCP Certificate Manager, and Azure Key Vault each expose certificate data through APIs that evolved independently, with different pagination models, rate limits, and metadata availability. After building CertPulse's adapters for all three, here is how they compare in practice:

| Feature | AWS ACM | GCP Certificate Manager | Azure Key Vault |
|---|---|---|---|
| Pagination method | NextToken | Page tokens | nextLink URLs |
| Default page size | Up to 1,000 | 300 | 25 |
| Rate limits | ~10 RPS per account | 600 req/min per project | 4,000 req/10s per vault |
| Detail call required? | Yes (DescribeCertificate) | Rarely (metadata inline) | Yes (GetCertificate) |
| Time to scan 500 certs | ~50 seconds | ~2 seconds | Varies by vault count |
| PEM chain returned? | No (managed certs) | Yes | Via separate call |
| Resource bindings in API? | Yes (via tags + bindings) | No (cross-reference Target HTTPS Proxy) | No (requires ARM queries) |
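
The three pagination styles in the table differ in naming, but each one hands back a cursor that you pass to the next request. A minimal sketch of a shared loop, assuming each adapter can expose a "fetch one page" function; `paginate`, `PageFetcher`, and `make_fake_fetcher` are illustrative names, and the fetcher here is a stand-in rather than a real SDK call:

```python
from typing import Callable, Iterator, Optional, Tuple

# A page fetcher takes the previous cursor (None for the first page) and
# returns (items, next_cursor). next_cursor is None when no pages remain.
PageFetcher = Callable[[Optional[str]], Tuple[list, Optional[str]]]


def paginate(fetch_page: PageFetcher) -> Iterator[dict]:
    """Yield every item from a cursor-paginated listing."""
    cursor: Optional[str] = None
    while True:
        items, cursor = fetch_page(cursor)
        yield from items
        if cursor is None:
            break


def make_fake_fetcher(certs: list, page_size: int) -> PageFetcher:
    """Stand-in for a provider call; the cursor is a stringified offset."""
    def fetch(cursor: Optional[str]) -> Tuple[list, Optional[str]]:
        start = int(cursor) if cursor else 0
        end = start + page_size
        next_cursor = str(end) if end < len(certs) else None
        return certs[start:end], next_cursor
    return fetch


# Hypothetical data: 60 certs listed 25 at a time.
fake_certs = [{"id": f"cert-{i}"} for i in range(60)]
all_certs = list(paginate(make_fake_fetcher(fake_certs, page_size=25)))
```

In a real adapter the cursor would be the NextToken, page token, or nextLink value from the provider's response; only the fetcher changes, not the loop.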

The metadata gap matters more than the pagination differences:

  • AWS ACM lists the resources using a certificate (load balancers, CloudFront distributions) in the InUseBy field of DescribeCertificate, and exposes tags through ListTagsForCertificate
  • GCP Certificate Manager returns a scope field and labels but no direct resource binding — you must cross-reference Target HTTPS Proxy resources
  • Azure Key Vault provides tags and a policy object, but binding to Application Gateways or Front Doors requires separate ARM queries

CertPulse uses provider-specific normalization layers that translate each response into a common cert schema. The schema captures 23 fields. Only 11 are populated consistently across all three providers. The remaining 12 require provider-specific derivation logic or come back null.
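
To make the normalization idea concrete, here is a rough sketch of a common schema with two provider mappers. The field subset, the `NormalizedCert` name, and the Key Vault-like record keys are assumptions for illustration, not CertPulse's actual 23-field schema. The ACM mapper reads keys that appear in a DescribeCertificate response (Serial, Issuer, NotAfter, SubjectAlternativeNames, InUseBy).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class NormalizedCert:
    # A small, hypothetical subset of a common schema.
    provider: str
    account: str
    serial: str
    issuer: str
    not_after: datetime
    sans: list = field(default_factory=list)
    # Fields a provider cannot fill stay None rather than being guessed.
    bound_resources: Optional[list] = None


def normalize_acm(account: str, detail: dict) -> NormalizedCert:
    """Map the 'Certificate' dict of an ACM DescribeCertificate response."""
    return NormalizedCert(
        provider="aws",
        account=account,
        serial=detail["Serial"].replace(":", "").lower(),
        issuer=detail["Issuer"],
        not_after=detail["NotAfter"],
        sans=list(detail.get("SubjectAlternativeNames", [])),
        bound_resources=list(detail.get("InUseBy", [])),
    )


def normalize_keyvault_like(vault: str, item: dict) -> NormalizedCert:
    """Map an already-parsed Key Vault record (these key names are invented)."""
    return NormalizedCert(
        provider="azure",
        account=vault,
        serial=item["serial_hex"].lower(),
        issuer=item["issuer"],
        not_after=datetime.fromtimestamp(item["expires_unix"], tz=timezone.utc),
        sans=list(item.get("dns_names", [])),
        bound_resources=None,  # bindings would need separate ARM queries
    )


# Sample inputs; every value below is made up.
acm_cert = normalize_acm("aws-acct1", {
    "Serial": "0A:1B:2C",
    "Issuer": "Amazon",
    "NotAfter": datetime(2026, 7, 1, tzinfo=timezone.utc),
    "SubjectAlternativeNames": ["*.example.com"],
    "InUseBy": ["arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/demo/abc123"],
})
kv_cert = normalize_keyvault_like("vault1", {
    "serial_hex": "0A1B2D",
    "issuer": "CN=Example CA",
    "expires_unix": 1782864000,
    "dns_names": ["*.example.com"],
})
```

Leaving `bound_resources` as None instead of an empty list keeps "this provider cannot tell us" distinct from "this cert is bound to nothing".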

Architecture: why CertPulse uses fan-out over sequential scanning

In our testing, fan-out scanning was roughly 10x faster than sequential scanning for multi-cloud certificate discovery: sequential scanning across 12 accounts took 8 minutes, and fan-out brought that down to 47 seconds. The gap grows with account count.

CertPulse's architecture uses a provider adapter pattern. Each cloud provider gets an adapter that implements a CertificateSource interface with three methods: ListCertificates, GetCertificateDetails, and GetResourceBindings. The adapters handle authentication, pagination, rate limiting, and response normalization internally.
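
The adapter pattern might look something like the following. This is a sketch under assumptions: the method names are snake_case renderings of the three methods named above, the signatures are guesses, and `InMemorySource` is a toy adapter used only to show that anything with the right methods satisfies the interface.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class CertificateSource(Protocol):
    """What the scheduler needs from any provider adapter."""

    def list_certificates(self) -> list: ...
    def get_certificate_details(self, cert_id: str) -> dict: ...
    def get_resource_bindings(self, cert_id: str) -> list: ...


class InMemorySource:
    """Toy adapter backed by a dict, standing in for a real cloud adapter."""

    def __init__(self, certs: dict):
        self._certs = certs

    def list_certificates(self) -> list:
        return sorted(self._certs)

    def get_certificate_details(self, cert_id: str) -> dict:
        return self._certs[cert_id]

    def get_resource_bindings(self, cert_id: str) -> list:
        return self._certs[cert_id].get("bindings", [])


source = InMemorySource({
    "cert-a": {"domain": "*.example.com", "bindings": ["lb-1"]},
    "cert-b": {"domain": "api.example.com"},
})
cert_ids = source.list_certificates()
```

A real adapter would hide authentication, pagination, and rate limiting behind the same three methods, so the scheduler never needs to know which provider it is talking to.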

Key architectural decisions:

  • Fan-out per account, not per provider. Rate limits are typically per-account or per-project. Scanning 5 AWS accounts in parallel gives 5x the effective rate limit budget versus scanning them sequentially through one credential.
  • Isolated credential contexts. Each account scan runs independently. If one account's credentials expire or its IAM role lacks acm:ListCertificates, that scan fails alone. CertPulse collects partial results — showing 11 of 12 accounts successfully scanned rather than failing the entire run.
  • Configurable concurrency limits. Fan-out means hitting all providers simultaneously. For teams with aggressive CloudTrail or audit logging, the burst of API calls shows up as a spike. Per-provider concurrency limits let teams tune parallelism based on their API budget.

┌──────────────┐
│   Scheduler  │
└──────┬───────┘
       │ fan-out per account
  ┌────┼────┬────────┐
  ▼    ▼    ▼        ▼
[AWS] [AWS] [GCP]  [Azure]
acct1 acct2 proj1  vault1
  │    │    │        │
  └────┴────┴────┬───┘
                 ▼
        [ Normalizer ]
                 ▼
        [ Cert Store ]
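
The fan-out and partial-results behavior described above can be sketched with a thread pool. Everything here is illustrative: `scan_account` is a stand-in for a real per-account scan, the account names are made up, and a single `max_workers` cap is simpler than the per-provider limits described above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scan_account(account: str) -> list:
    """Hypothetical per-account scan; 'gcp-proj2' simulates expired credentials."""
    if account == "gcp-proj2":
        raise PermissionError("credentials expired")
    return [f"{account}/cert-{i}" for i in range(3)]


def fan_out(accounts: list, max_workers: int = 8):
    """Scan every account in parallel; one failure never aborts the others."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scan_account, a): a for a in accounts}
        for fut in as_completed(futures):
            account = futures[fut]
            try:
                results[account] = fut.result()
            except Exception as exc:  # isolate the failure to this account
                failures[account] = str(exc)
    return results, failures


accounts = ["aws-acct1", "aws-acct2", "gcp-proj1", "gcp-proj2", "azure-vault1"]
results, failures = fan_out(accounts, max_workers=4)
print(f"{len(results)} of {len(accounts)} accounts scanned; failed: {sorted(failures)}")
```

Returning failures alongside results, rather than raising, is what lets a dashboard show which accounts reported instead of failing the whole run.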

Each adapter retries with exponential backoff on rate limit responses (throttling errors such as ThrottlingException from AWS, HTTP 429 from Azure, HTTP 429 or RESOURCE_EXHAUSTED from GCP). After 3 retries, the adapter returns what it has. Based on our production data, 97% of scan failures resolve within 2 retries.
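
A minimal sketch of that retry policy, assuming the adapter translates each provider's throttling signal into one exception type. `RateLimited`, `call_with_backoff`, the 0.5-second base delay, and the jitter are assumptions, not CertPulse's actual values.

```python
import random
import time


class RateLimited(Exception):
    """Raised by a (hypothetical) adapter call when the provider throttles it."""


def call_with_backoff(fn, max_retries: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn(); on RateLimited, wait base_delay * 2**attempt plus jitter, then retry.

    Returns (value, ok). Once max_retries retries have failed it returns
    (None, False) so the caller can keep whatever it already collected.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn(), True
        except RateLimited:
            if attempt == max_retries:
                return None, False
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))


# Demo: a call that is throttled twice, then succeeds.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise RateLimited()
    return ["cert-a", "cert-b"]


# sleep is swapped for a no-op so the demo finishes instantly.
value, ok = call_with_backoff(flaky, sleep=lambda s: None)
```

Injecting `sleep` keeps the function testable without real waiting; production code would leave the default.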

The metadata problem: matching certs to owners

Automated certificate ownership detection works about 75% of the time in well-tagged environments and closer to 40% in environments with inconsistent tagging. After monitoring 847 certificates across 14 accounts and 3 providers, we consider this the hardest part of certificate lifecycle management, and anyone who tells you they've fully solved it hasn't tested at scale.

CertPulse's ownership heuristic runs a chain of lookups in priority order:

| Signal | Method | Accuracy | Coverage and caveats |
|---|---|---|---|
| Resource tags | Check owner, team, or cost-center tags on cert or bound resource | High | ~60% of certs (in orgs enforcing tagging policies) |
| DNS record ownership | Resolve SANs, check DNS zone ownership metadata | Medium | Additional 10–15% of orphaned certs |
| Load balancer / CDN binding | Infer ownership from bound ALB's account owner | ~80% | Breaks with shared infrastructure accounts |
| Git blame on IaC | Trace Terraform/Pulumi resources to git repo committers | Variable | Depends on IaC reflecting current state |

Our measured results across the test environment: automated attribution correctly identified the owning team for 78% of 847 certificates. The remaining 22% required manual confirmation, mostly in shared infrastructure accounts with no tagging policy or certs predating the current team structure.

CertPulse surfaces confidence scores with every ownership assignment. A cert with a matching resource tag and a DNS ownership record gets a high confidence score. A cert whose only signal is "it's in the same account as Team Y's other stuff" gets a low one. SSL certificate tracking without ownership context is just a fancier spreadsheet.
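
One way to picture a priority-ordered lookup chain with a confidence score is sketched below. The signal functions, the dict keys they read, and the weights are all invented for illustration; the source does not describe how CertPulse computes its scores.

```python
from typing import Optional


# Each signal returns a team name or None. All three are hypothetical stand-ins.
def from_tags(cert: dict) -> Optional[str]:
    tags = cert.get("tags", {})
    return tags.get("owner") or tags.get("team") or tags.get("cost-center")


def from_dns(cert: dict) -> Optional[str]:
    return cert.get("dns_zone_owner")


def from_binding(cert: dict) -> Optional[str]:
    return cert.get("bound_resource_account_owner")


# (name, lookup, weight) in priority order; the weights are illustrative only.
SIGNALS = [("tags", from_tags, 0.6), ("dns", from_dns, 0.3), ("binding", from_binding, 0.2)]


def attribute_owner(cert: dict):
    """Return (owner, confidence, names_of_signals_that_agreed)."""
    owner, confidence, agreed = None, 0.0, []
    for name, lookup, weight in SIGNALS:
        candidate = lookup(cert)
        if candidate is None:
            continue
        if owner is None:
            owner = candidate  # the highest-priority signal picks the owner
        if candidate == owner:
            confidence += weight  # agreeing signals raise confidence
            agreed.append(name)
    return owner, round(min(confidence, 1.0), 2), agreed


tagged = {"tags": {"owner": "payments"}, "dns_zone_owner": "payments"}
same_account_only = {"bound_resource_account_owner": "team-y"}
strong = attribute_owner(tagged)
weak = attribute_owner(same_account_only)
```

With these made-up weights, a cert backed by a tag and a matching DNS record scores well above one whose only evidence is the account it lives in, which mirrors the high and low cases described above.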

What we got wrong the first time

Three architectural mistakes in CertPulse's initial design taught us the most about building reliable multi-cloud certificate monitoring.

1. Fixed scan intervals don't scale. Our initial scheduler ran on a fixed 15-minute interval regardless of account count. With 4 accounts, this was fine. When a design partner connected 23 accounts, scans started overlapping — at peak, 3 concurrent full scans competing for the same rate limit budget. Certificate monitoring accuracy dropped because scans returned partial results under rate limit pressure.

  • The fix: Adaptive scheduling. Scan interval now scales with account count: base_interval + (accounts × per_account_buffer). For 23 accounts, the interval moved from 15 minutes to roughly 35 minutes. Scan completion time dropped from 4+ minutes (with retry storms) to a consistent 90 seconds.
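
The interval formula can be worked through in a few lines. The 15-minute base and the 52-second per-account buffer are assumptions, picked so that the 23-account case lands near the 35 minutes quoted above; the actual constants are not given.

```python
def scan_interval_seconds(
    accounts: int,
    base_interval: float = 15 * 60,   # assumed: the original 15-minute interval
    per_account_buffer: float = 52,   # assumed: seconds added per connected account
) -> float:
    """base_interval + (accounts x per_account_buffer), in seconds."""
    return base_interval + accounts * per_account_buffer


interval_23 = scan_interval_seconds(23)   # 900 + 23 * 52 = 2096 seconds
minutes_23 = interval_23 / 60             # just under 35 minutes
```

Because the interval grows linearly with account count, a new scan is less likely to start while the previous one is still working through its rate limit budget.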

2. Time-based caching hides renewals. We cached normalized cert data with a 1-hour TTL to reduce API calls. In practice, if a cert renewed between cache refreshes, the dashboard showed the old expiration date for up to an hour. One tester renewed a cert, checked the dashboard, saw "expires in 2 days," and filed a bug report that was entirely correct.

  • The fix: Write-through cache that invalidates on scan completion, plus a last_scanned timestamp visible in the UI.
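
A sketch of a cache that refreshes on scan completion instead of on a timer. `ScanCache` and its methods are hypothetical names, and the cert records are simplified dicts.

```python
from datetime import datetime, timezone


class ScanCache:
    """Holds the latest normalized certs per account; there is no TTL."""

    def __init__(self):
        self._certs = {}         # account -> list of cert dicts
        self._last_scanned = {}  # account -> datetime of last completed scan

    def on_scan_complete(self, account, certs, now=None):
        # Replace the account's entry wholesale so a renewed cert cannot
        # linger with its old expiry until a timer fires.
        self._certs[account] = list(certs)
        self._last_scanned[account] = now or datetime.now(timezone.utc)

    def get(self, account):
        """Return (certs, last_scanned); last_scanned is None if never scanned."""
        return self._certs.get(account, []), self._last_scanned.get(account)


cache = ScanCache()
t1 = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
t2 = datetime(2026, 4, 1, 12, 30, tzinfo=timezone.utc)

# First scan sees the old cert; the second scan runs after it was renewed.
cache.on_scan_complete("aws-acct1", [{"serial": "0a1b", "expires": "2026-04-03"}], now=t1)
cache.on_scan_complete("aws-acct1", [{"serial": "0c2d", "expires": "2027-04-01"}], now=t2)
certs_now, last_scanned = cache.get("aws-acct1")
```

Exposing `last_scanned` next to the data lets the UI say how fresh an expiration date is, which addresses the confusion in the tester's bug report.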

3. Domain-based deduplication undercounts wildcard certs. Our initial logic treated *.example.com as a single cert. But organizations frequently have multiple wildcard certs for the same domain — one per region, one per environment, one someone created manually and forgot about. Our dedup collapsed 6 distinct certificates into 1 row.

  • The fix: CertPulse now deduplicates on certificate serial number and issuer DN, which is what actually identifies a unique cert, not the SAN list.
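
Deduplicating on issuer DN plus serial number could look like this. The record keys (`issuer_dn`, `serial`, `domain`, `region`) and the serial normalization rules are assumptions for the sketch.

```python
def dedup_key(cert: dict) -> tuple:
    """Identity is (issuer DN, serial); the domain is deliberately not part of it."""
    # Normalize the serial to bare lowercase hex without leading zeros so that
    # "0A:1B" and "a1b" compare equal.
    serial = cert["serial"].replace(":", "").lower().lstrip("0")
    return (cert["issuer_dn"].strip(), serial)


def deduplicate(certs: list) -> list:
    seen = {}
    for cert in certs:
        seen.setdefault(dedup_key(cert), cert)  # keep the first sighting
    return list(seen.values())


# Made-up inventory: three distinct wildcard certs for one domain, plus the
# first cert seen a second time with a differently formatted serial.
inventory = [
    {"domain": "*.example.com", "issuer_dn": "CN=Example CA", "serial": "0A:1B", "region": "us-east-1"},
    {"domain": "*.example.com", "issuer_dn": "CN=Example CA", "serial": "0A:1C", "region": "eu-west-1"},
    {"domain": "*.example.com", "issuer_dn": "CN=Other CA", "serial": "0A:1B", "region": "staging"},
    {"domain": "*.example.com", "issuer_dn": "CN=Example CA", "serial": "a1b", "region": "us-east-1"},
]
unique = deduplicate(inventory)
by_domain = {c["domain"] for c in inventory}
```

Here a domain-keyed inventory would show one row, while the issuer-and-serial key keeps three. The issuer is part of the key because serial numbers are only unique within a single CA.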

Current numbers and what's next

Here are CertPulse's current performance metrics from our staging environment running against production cloud accounts:

| Metric | Value |
|---|---|
| Scan time | 47 seconds (12 accounts: 4 AWS, 5 GCP, 3 Azure) |
| Certificates tracked | 847 across all accounts |
| API cost per scan cycle | ~$0.003 (mostly CloudTrail logging; API calls are free across all three providers) |
| Ownership auto-attribution | 78% high confidence, 94% including low-confidence guesses |
| Scan reliability | 99.2% of scans complete with all accounts reporting (30-day window) |

Cost-per-scan was a forcing function for architecture decisions. We evaluated certificate transparency log monitoring as a supplemental discovery method, but the ingestion cost at scale pushed CertPulse toward direct API scanning as the primary mechanism, with CT logs as an optional verification layer.

CertPulse's roadmap reflects what early users actually ask for:

  • Kubernetes cert-manager integration (most requested): Teams running cert-manager in-cluster have an entirely separate certificate lifecycle invisible to cloud provider APIs. We're building a Kubernetes operator that reports cert-manager Certificate resources into the same normalized store.
  • ACME certificate monitoring: Tracking Let's Encrypt and other ACME-issued certs, particularly renewal status and failure detection. This matters more as 90-day cert lifetimes become the norm.
  • HashiCorp Vault PKI: For teams running internal CAs through Vault, surfacing those certs alongside cloud-managed ones. DevOps certificate management doesn't stop at the cloud provider boundary.

Certificate inventory automation across every source of truth is the actual goal. Cloud provider APIs are where CertPulse started because that's where the most certs hide untracked. CertPulse is building toward a single view covering cloud-managed, self-hosted, and Kubernetes-native certificates with the same scan-normalize-attribute pipeline.

If you're currently managing this with scripts and spreadsheets, you know the failure mode. It works until it doesn't, and it stops working at 2am.

FAQ

How often should you scan for certificate changes across cloud providers?

For most organizations, scanning every 30–60 minutes balances freshness against API costs. Based on our experience operating CertPulse across 12+ accounts, more frequent scans make sense during active certificate rotation or automated provisioning. Less frequent scanning (every 4–6 hours) is acceptable when primarily watching for upcoming expirations rather than tracking real-time changes.

Can you monitor certificates across cloud providers without granting write permissions?

Yes. Multi-cloud certificate discovery requires only read permissions:

  • AWS: acm:ListCertificates and acm:DescribeCertificate (plus acm:ListTagsForCertificate if you read tags)
  • GCP: certificatemanager.certs.list and certificatemanager.certs.get
  • Azure Key Vault: Certificate Get and Certificate List

No write or modify permissions are needed for scanning and monitoring.

Why is certificate ownership harder to automate than certificate discovery?

Certificate discovery is a well-defined API problem: list all certificates, read their metadata, normalize the results. Certificate ownership requires correlating certificates with organizational structure, which lives in resource tags (often inconsistent), DNS records (sometimes outdated), and infrastructure bindings (frequently shared across teams). There's no single API call that returns "Team X owns this cert." Any automated attribution therefore relies on heuristics with inherent accuracy limits: when we tested CertPulse's heuristics against 847 certificates across 14 accounts, our best result was 78% high-confidence attribution.

What's the difference between certificate monitoring and certificate management?

Certificate monitoring tracks state: expiration dates, issuer details, SAN coverage, and which resources use which certs. Certificate management includes active operations: provisioning, renewal, revocation, and rotation. A monitoring tool tells you a cert expires in 7 days. A management tool renews it. Most teams need monitoring first because you can't manage what you haven't found.

How do wildcard certificates complicate certificate inventory?

Organizations frequently have multiple wildcard certs for the same domain pattern (*.example.com) issued at different times, in different regions, or for different environments. In our testing, naive domain-based deduplication collapsed 6 distinct certificates into 1 row. Accurate wildcard certificate tracking requires deduplication on serial number and issuer DN, not subject or SAN list. Industry data indicates this can surface 3–5x more certificates than a domain-based inventory shows.

This is why we built CertPulse

CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.

If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.
