Running an internal CA with OpenSSL scripts breaks at scale. The threshold is roughly 10 services: at that point trust distribution becomes manual, the root key location gets lost, and intermediate expiries page you at 3am. This step-ca tutorial walks through running a private CA on Ubuntu 24.04 with ACME for self-enrollment, 24-hour certificate lifetimes, OIDC issuance for humans, and a DR runbook for intermediate key compromise. No PKI theory marathon. Just the commands, the config, and the gotchas I wish someone had told me before I ran one in production.
Why You Probably Need a Private CA (and Why Self-Signed Won't Cut It)
You need a private CA the moment trust distribution becomes manual, typically around 10 services. Self-signed certs and ad-hoc OpenSSL CAs collapse at that threshold because they lack four things a real internal PKI provides:
- Revocation — no CRL endpoint means a compromised service cert keeps working until expiry
- Automated issuance — manual signing creates bottlenecks
- Audit trail — no record of who issued what to whom
- Key custody — the root key ends up on someone's laptop in a file named `ca-key-FINAL-v2.pem`
The failure modes from running an OpenSSL-based ad-hoc CA are predictable. After auditing roughly 20 internal CAs, I've watched all of these happen on the same team in under a year:
- Root key on the same laptop as the OpenSSL scripts, which then gets reimaged
- No CRL endpoint, so a compromised service cert just keeps working
- Manual cert distribution over Slack DMs, causing trust store drift across hosts
- Three different "internal CAs" because nobody knew the other ones existed
According to a 2024 survey of platform teams from a private PKI vendor, 61% of internal CAs in mid-market companies had no documented owner. That tracks with what I see. The first sign you've outgrown OpenSSL: a developer asks "can you sign this?" and you have to find the laptop.
The honest test: if your dev environment, Kubernetes mesh traffic, and IoT prototype lab all need certs, you're already running a private CA. The only question is whether it's accidental or intentional. For background on why service-to-service auth motivates this, see our hands-on mTLS guide.
Architecture: Root CA, Intermediate CA, and Why You Never Touch the Root Again
A two-tier CA design pairs a long-lived offline root with an online intermediate that handles daily signing. If the intermediate gets compromised, you revoke it and issue a new one from the root. If the root gets compromised, you're rebuilding every trust store on the network — including the ones nobody documented.
The textbook two-tier design:
| Tier | Status | Validity | Role |
|---|---|---|---|
| Root CA | Offline (safe/HSM) | 10–20 years | Signs intermediates only |
| Intermediate CA | Online | 1–5 years | Signs leaf certs daily |
| Leaf certs | Distributed via ACME | Hours to weeks | Service/client identity |
What actually happens at most companies: the root key sits on the same VM as the intermediate, in /etc/step-ca/, encrypted with a password pinned in the ops Slack channel. In my experience auditing roughly 20 internal CAs, only 2 had the root truly offline — a 10% rate.
If you can't afford an HSM, the realistic compromise is a Yubikey storing the root signing key, locked in a desk drawer. step-ca supports PKCS#11 for this. The Yubikey costs $50 and changes the threat model dramatically: a laptop compromise no longer means root compromise. That alone is worth doing.
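For the Yubikey route, step-ca reads the signing key through its PKCS#11 support, configured via a `kms` block in ca.json. A minimal sketch; the module path, slot, and PIN are illustrative and vary by distro and key setup, so check Smallstep's cryptographic protection docs for your hardware:

```json
{
  "kms": {
    "type": "pkcs11",
    "uri": "pkcs11:module-path=/usr/lib/x86_64-linux-gnu/libykcs11.so;slot-id=0?pin-value=123456"
  }
}
```

In practice, keep the PIN out of ca.json; `pin-value` appears here only to show the URI shape.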
A step-ca Tutorial: Installing on Ubuntu and Bootstrapping the Root
Smallstep ships step-ca as a single Go binary plus a CLI, and installation on Ubuntu 24.04 takes about five minutes. The whole point of step-ca is that it handles the boring parts of PKI — provisioners, ACME, OIDC, OCSP — without you writing a single OpenSSL config file. Here's the bootstrap.
```shell
wget https://dl.smallstep.com/cli/docs-ca-install/latest/step-cli_amd64.deb
wget https://dl.smallstep.com/certificates/docs-ca-install/latest/step-ca_amd64.deb
sudo dpkg -i step-cli_amd64.deb step-ca_amd64.deb

export STEPPATH=/etc/step-ca
sudo -E step ca init \
  --name "Internal CA" \
  --dns ca.internal.example.com \
  --address :8443 \
  --provisioner admin@example.com \
  --deployment-type standalone
```
Two flags that matter and aren't obvious:
- Root validity override: `--root-validity 7305d` overrides the default 10-year root. Only use it if you've thought hard about crypto-agility. In year 8, when ECDSA P-256 is fine but P-384 is the new floor, you'll want to roll.
- Immediate key export: after init, export the root private key to your Yubikey (or HSM) immediately and wipe the on-disk copy:
```shell
step crypto key format --pkcs8 $(step path)/secrets/root_ca_key
# move to offline storage, then:
shred -u $(step path)/secrets/root_ca_key
```
The intermediate key stays online at `/etc/step-ca/secrets/intermediate_ca_key`, encrypted with a passphrase. Drop a systemd unit at `/etc/systemd/system/step-ca.service` that reads the passphrase from `/etc/step-ca/password.txt` (owned by step:step, mode 0400). Start it, then verify that `step ca health` returns `ok`.
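A minimal unit matching the layout above (sketch; adjust paths if your STEPPATH or install locations differ):

```ini
[Unit]
Description=step-ca internal certificate authority
After=network-online.target
Wants=network-online.target

[Service]
User=step
Group=step
Environment=STEPPATH=/etc/step-ca
ExecStart=/usr/bin/step-ca /etc/step-ca/config/ca.json --password-file /etc/step-ca/password.txt
Restart=on-failure

[Install]
WantedBy=multi-user.target
```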
ACME Provisioner: Letting Services Self-Enroll Like They Do With Let's Encrypt
The ACME provisioner in step-ca lets cert-manager, Caddy, Traefik, and certbot enroll against your internal CA exactly the way they enroll against Let's Encrypt. Add "type": "ACME", "name": "acme" to the provisioners list in ca.json, restart step-ca, and any ACME client pointed at https://ca.internal.example.com/acme/acme/directory works.
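The provisioner entry itself is tiny. A sketch of where it lands in ca.json (the `authority.provisioners` array already exists after `step ca init`; your other provisioners stay alongside it):

```json
{
  "authority": {
    "provisioners": [
      { "type": "ACME", "name": "acme" }
    ]
  }
}
```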
The hard part is split-horizon DNS. Challenge types compared:
| Challenge | Requirement | Best for |
|---|---|---|
| HTTP-01 | CA can reach the service on port 80; names resolve on internal DNS | Single-zone setups |
| DNS-01 | DNS provider supports ACME challenge records | Internal-only hostnames, RFC 1918 ranges |
For RFC 1918 ranges, DNS-01 against an internal authoritative DNS server (CoreDNS, PowerDNS) is what I run.
For cert-manager on Kubernetes, the ClusterIssuer that actually works:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: step-ca-internal
spec:
  acme:
    server: https://ca.internal.example.com/acme/acme/directory
    email: platform@example.com
    privateKeySecretRef:
      name: step-ca-account-key
    caBundle: <base64 of your root_ca.crt>
    solvers:
    - http01:
        ingress:
          class: nginx
```
The caBundle field is the part nobody documents clearly. Without it, cert-manager doesn't trust your CA and ACME requests fail with x509 verification errors. Bundle the root cert (not the intermediate) into the issuer spec. Renewals just work after that.
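For the DNS-01 path, only the solver block changes. A sketch using cert-manager's RFC 2136 solver against a TSIG-capable internal server such as PowerDNS or BIND; the nameserver address, key name, and secret reference are placeholders:

```yaml
solvers:
- dns01:
    rfc2136:
      nameserver: 10.0.0.53:53
      tsigKeyName: acme-update-key
      tsigAlgorithm: HMACSHA256
      tsigSecretSecretRef:
        name: tsig-secret
        key: tsig-secret-key
```

Note that stock CoreDNS does not accept RFC 2136 dynamic updates, so this solver fits the PowerDNS/BIND side of the setups mentioned above.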
For more on ACME at scale and where it falls over, our ACME protocol guide has the production failure modes I keep hitting.
Short-Lived Certificates and the End of Renewal Anxiety
Setting 24-hour certificate lifetimes with renewal every 8 hours eliminates 2am expiry pages and absorbs up to 16 hours of CA downtime before anything breaks. Compare that to Let's Encrypt's 90-day model where a 30-day outage is survivable but anxiety-inducing.
The math, laid out:
| Cert lifetime | Renewal trigger | Safety margin | Failure signal |
|---|---|---|---|
| 90 days (LE default) | 60 days remaining | 60 days | Slow-motion expiry crisis |
| 24 hours (step-ca short-lived) | 8 hours remaining | 16 hours | Immediate, loud |
Most teams' alert threshold is also 30 days, so on a 90-day cert the real window before someone wakes up is much shorter than it looks. With short-lived certificates, a failed renewal is loud and immediate.
step-ca's renew daemon handles this:
```shell
step ca renew --daemon --expires-in 8h /etc/ssl/svc.crt /etc/ssl/svc.key
```
Configure it as a systemd unit per service. The daemon respects clock skew up to 5 minutes by default. If your fleet has worse NTP than that, fix NTP first.
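One shape that works is a unit per service wrapping the renew daemon. A sketch; the service name, cert paths, and the reload hook are assumptions for illustration:

```ini
[Unit]
Description=Certificate renewer for svc
After=network-online.target

[Service]
ExecStart=/usr/bin/step ca renew --daemon --expires-in 8h \
  --exec "systemctl reload nginx" \
  /etc/ssl/svc.crt /etc/ssl/svc.key
Restart=always

[Install]
WantedBy=multi-user.target
```

The `--exec` hook reloads the consuming service after each successful renewal, so new certs are actually picked up.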
Tradeoffs to be honest about:
- Monitoring shift: from "days until expiry" to "renewal success rate over last hour"
- CA availability: becomes a hard dependency, so the CA needs its own HA story
- Embedded devices with bad clocks: struggle; longer lifetimes there are fine
The alerting model that survives this transition is covered in our alert fatigue post.
JWK and OIDC Provisioners for Humans and CI
JWK provisioners sign certs with a known key for break-glass and CI use; OIDC provisioners issue short-lived client certs after SSO authentication for daily human access. In my experience, roughly 40% of internal CA misuse comes from operators sharing JWK keys because OIDC wasn't set up.
OIDC provisioner config in ca.json:
```json
{
  "type": "OIDC",
  "name": "google-sso",
  "clientID": "xxxxx.apps.googleusercontent.com",
  "clientSecret": "yyyyy",
  "configurationEndpoint": "https://accounts.google.com/.well-known/openid-configuration",
  "admins": ["alex@example.com"],
  "domains": ["example.com"]
}
```
The `domains` field is the coarse filter: issuance is restricted to identities whose email falls under a listed domain. For finer control, use `"groups"` with Google Workspace groups or Okta groups.
The end-to-end flow for an engineer:
```shell
step ca certificate alex@example.com client.crt client.key \
  --provisioner google-sso
```
Browser opens, SSO happens, cert lands locally with a 16-hour validity. That cert authenticates to internal services that trust the CA. The outcome:
- No shared keys
- No Slack DMs with PEM files
- Full audit trail on the CA side showing who issued what
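Using the cert is then ordinary mTLS from the client side; the hostname and file paths are illustrative:

```shell
curl --cacert "$(step path)/certs/root_ca.crt" \
  --cert client.crt --key client.key \
  https://api.internal.example.com/healthz
```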
For CI pipelines minting certs for integration tests, JWK is fine since the CI system has its own identity. Bake the JWK provisioner password into your secrets manager, not the repo.
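A CI enrollment with the JWK provisioner might look like this; the provisioner name, SAN, and secret mount path are assumptions:

```shell
step ca certificate "ci-runner.internal.example.com" ci.crt ci.key \
  --provisioner ci-jwk \
  --provisioner-password-file /run/secrets/jwk_password \
  --not-after 2h
```

Short `--not-after` values keep CI certs useless shortly after the job finishes.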
Revocation That Actually Works: CRL, OCSP, and Just Letting Certs Expire
step-ca supports CRL and OCSP, but at 24-hour cert lifetimes revocation is mostly theater — an attacker who steals a service key has at most 24 hours before it dies on its own. The real revocation strategy at short lifetimes is "stop issuing new certs to the compromised identity and wait."
When revocation actually matters:
- Long-lived client certs — laptop certs, IoT device certs with 30+ day lifetimes
- Intermediate compromise — revoke from the root, rebuild trust
- Compliance contexts — frameworks requiring a documented revocation procedure
Enable CRLs in step-ca by adding `"crl": {"enabled": true}` to ca.json. Two practical problems:
- Clients need to actually check OCSP, which most don't by default
- OCSP stapling on internal services is even more rarely configured correctly
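If you do enable CRLs, the fuller config block looks roughly like this; field names are drawn from current step-ca documentation and worth verifying against your installed version:

```json
"crl": {
  "enabled": true,
  "generateOnRevoke": true,
  "cacheDuration": "24h"
}
```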
We covered this in why OCSP stapling is probably broken on half your endpoints.
The honest position from running this in production: budget your engineering time on shrinking cert lifetimes first, and on revocation infrastructure second. A 1-hour cert with no revocation is more secure than a 1-year cert with perfect OCSP.
Monitoring, Backup, and the Day Your CA Goes Down
step-ca exposes Prometheus metrics at /metrics; alert on issuance failures and renewal error rate, not certificate expiry — at 24h lifetimes, expiry alerts fire constantly and are meaningless. The metrics that matter:
- `step_ca_provisioner_sign_total` by status
- `step_ca_acme_finalize_duration`
- Time since last successful issuance per provisioner
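A starting-point Prometheus rule built on those metrics; the label names (`status`, `provisioner`) are assumptions to verify against your step-ca version's /metrics output:

```yaml
groups:
- name: step-ca
  rules:
  - alert: StepCAIssuanceFailures
    expr: rate(step_ca_provisioner_sign_total{status!="success"}[15m]) > 0
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "step-ca issuance failures via {{ $labels.provisioner }}"
```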
Backup strategy by asset:
| Asset | Storage location | Frequency / handling |
|---|---|---|
| Intermediate signing key | Secrets manager (Vault, AWS Secrets Manager, Bitwarden) | Encrypted at rest |
| step-ca database (badger or MySQL) | Snapshot storage | Every 6 hours |
| ca.json config | Git, with provisioner secrets templated out | On change |
| Root CA cert (public) | Every trust bundle on the network | At rollout |
DR drill numbers from running this: with the intermediate key in Vault and a snapshot of /etc/step-ca, rebuilding the CA on a fresh VM takes about 20 minutes. The Ansible role does most of it. The slowest step is verifying every downstream service is still issuing renewals correctly.
The runbook for intermediate key compromise, in order:
1. Stop step-ca on the compromised host. Don't restart.
2. Bring the root online (Yubikey, HSM, whatever).
3. Generate a new intermediate, sign it with the root.
4. Publish the new intermediate to your trust distribution channel (config management, Kubernetes secrets, Vault).
5. Push the CRL revoking the old intermediate.
6. Tell every service owner to force-renew. At 24h lifetimes, this happens naturally within a day, but you don't have a day if the attacker is active.
7. Rotate every leaf cert issued by the old intermediate. Audit logs from step-ca tell you which ones.
8. Write the post-mortem before anyone forgets the timeline.
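Steps 2 through 4 in step CLI terms, as a sketch; it assumes the root key is reachable on the offline machine, and the file names are illustrative:

```shell
# On the machine with access to the root key: sign a fresh intermediate
step certificate create "Internal CA Intermediate G2" \
  intermediate_ca.crt intermediate_ca_key \
  --profile intermediate-ca \
  --ca root_ca.crt --ca-key root_ca_key \
  --not-after 43800h  # roughly 5 years

# Install on the rebuilt CA host, then restart step-ca
sudo install -o step -g step -m 0400 intermediate_ca_key /etc/step-ca/secrets/
sudo install -o step -g step -m 0644 intermediate_ca.crt /etc/step-ca/certs/
sudo systemctl restart step-ca
```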
Step 6 is the awkward conversation: every service that pinned the old intermediate explicitly is broken until they update. This is why root pinning is the only correct pinning strategy for internal PKI.
Frequently Asked Questions
Can step-ca replace Let's Encrypt for public-facing certs?
No. step-ca issues certificates for internal trust where you control the trust stores. Public-facing certs require a publicly trusted CA. Use Let's Encrypt or your cloud provider's managed CA for anything internet-facing.
How does step-ca compare to HashiCorp Vault's PKI engine?
Vault PKI is fine if you already run Vault. step-ca is simpler if you don't, supports ACME natively (Vault requires a sidecar), and the Smallstep team focuses on PKI as their main product rather than a feature among many.
What's the minimum infrastructure for production step-ca?
One VM with 1 vCPU and 1GB RAM handles thousands of certs per day. For HA, run two instances behind a load balancer with a shared MySQL backend. The root key stays offline regardless.
How long should intermediate CAs live?
Two to five years is the sane range. Longer than five and you're not practicing the rotation muscle. Shorter than two and rotation becomes a constant chore. Rotate before expiry, not at expiry.
Does step-ca work with service mesh sidecars like Envoy or Linkerd?
Yes. Istio and Linkerd both consume SPIFFE/SPIRE-style identities, and step-ca issues X.509 SVID certs. For plain Envoy, point its SDS to a step-ca-backed secret renewer. The 24-hour lifetime story works particularly well here.
Wrapping Up
This step-ca tutorial covered the parts I wish were in one place when I built my first internal PKI: a two-tier CA with the root offline, ACME for Kubernetes self-enrollment, 24-hour lifetimes that make renewal anxiety go away, OIDC for humans, and a DR runbook for when things go sideways. The honest version of internal PKI is less about cryptographic purity and more about operational discipline: rotate often, monitor renewals not expiries, and keep the root somewhere a laptop reimage can't touch.
If you're running internal certificates across more than a handful of services and you're not sure what you have, CertPulse inventories your CA-issued certs alongside your public ones in one view. Either way, get the root off the laptop today.
This is why we built CertPulse
CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.
If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.