Running an internal CA with OpenSSL scripts breaks at scale. The threshold is roughly 10 services: at that point trust distribution becomes manual, the root key location gets lost, and intermediate expiries page you at 3am. This step-ca tutorial walks through running a private CA on Ubuntu 24.04 with ACME for self-enrollment, 24-hour certificate lifetimes, OIDC issuance for humans, and a DR runbook for intermediate key compromise. No PKI theory marathon. Just the commands, the config, and the gotchas I wish someone had told me before I ran one in production.
Why You Probably Need a Private CA (and Why Self-Signed Won't Cut It)
You need a private CA the moment trust distribution becomes manual, typically around 10 services. Self-signed certs and ad-hoc OpenSSL CAs collapse at that threshold because they lack four things a real internal PKI provides:
- Revocation — no CRL endpoint means a compromised service cert keeps working until expiry
- Automated issuance — manual signing creates bottlenecks
- Audit trail — no record of who issued what to whom
- Key custody — the root key ends up on someone's laptop in a file named `ca-key-FINAL-v2.pem`
The failure modes from running an OpenSSL-based ad-hoc CA are predictable. After auditing roughly 20 internal CAs, I've watched all of these happen on the same team in under a year:
- Root key on the same laptop as the OpenSSL scripts, which then gets reimaged
- No CRL endpoint, so a compromised service cert just keeps working
- Manual cert distribution over Slack DMs, causing trust store drift across hosts
- Three different "internal CAs" because nobody knew the other ones existed
According to a 2024 survey of platform teams from a private PKI vendor, 61% of internal CAs in mid-market companies had no documented owner. That tracks with what I see. The first sign you've outgrown OpenSSL: a developer asks "can you sign this?" and you have to find the laptop.
The honest test: if your dev environment, Kubernetes mesh traffic, and IoT prototype lab all need certs, you're already running a private CA. The only question is whether it's accidental or intentional. For background on why service-to-service auth motivates this, see our hands-on mTLS guide.
Architecture: Root CA, Intermediate CA, and Why You Never Touch the Root Again
A two-tier CA design pairs a long-lived offline root with an online intermediate that handles daily signing. If the intermediate gets compromised, you revoke it and issue a new one from the root. If the root gets compromised, you're rebuilding every trust store on the network — including the ones nobody documented.
The textbook two-tier design:
| Tier | Status | Validity | Role |
|---|---|---|---|
| Root CA | Offline (safe/HSM) | 10–20 years | Signs intermediates only |
| Intermediate CA | Online | 1–5 years | Signs leaf certs daily |
| Leaf certs | Distributed via ACME | Hours to weeks | Service/client identity |
What actually happens at most companies: the root key sits on the same VM as the intermediate, in /etc/step-ca/, encrypted with a password pinned in the ops Slack channel. In my experience auditing roughly 20 internal CAs, only 2 had the root truly offline — a 10% rate.
If you can't afford an HSM, the realistic compromise is a Yubikey storing the root signing key, locked in a desk drawer. step-ca supports PKCS#11 for this. The Yubikey costs $50 and changes the threat model dramatically: a laptop compromise no longer means root compromise. That alone is worth doing.
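For the Yubikey route, step-ca reads the signing key through its PKCS#11 support, configured via a `kms` block in ca.json. A minimal sketch; the module path, slot, and PIN are illustrative and vary by distro and key setup, so check Smallstep's cryptographic protection docs for your hardware:

```json
{
  "kms": {
    "type": "pkcs11",
    "uri": "pkcs11:module-path=/usr/lib/x86_64-linux-gnu/libykcs11.so;slot-id=0?pin-value=123456"
  }
}
```

In practice, keep the PIN out of ca.json; `pin-value` appears here only to show the URI shape.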
A step-ca Tutorial: Installing on Ubuntu and Bootstrapping the Root
Smallstep ships step-ca as a single Go binary plus a CLI, and installation on Ubuntu 24.04 takes about five minutes. The whole point of step-ca is that it handles the boring parts of PKI — provisioners, ACME, OIDC, OCSP — without you writing a single OpenSSL config file. Here's the bootstrap.
```shell
wget https://dl.smallstep.com/cli/docs-ca-install/latest/step-cli_amd64.deb
wget https://dl.smallstep.com/certificates/docs-ca-install/latest/step-ca_amd64.deb
sudo dpkg -i step-cli_amd64.deb step-ca_amd64.deb

export STEPPATH=/etc/step-ca
sudo -E step ca init \
  --name "Internal CA" \
  --dns ca.internal.example.com \
  --address :8443 \
  --provisioner admin@example.com \
  --deployment-type standalone
```
Two flags that matter and aren't obvious:
- Root validity override: `--root-validity 7305d` overrides the default 10-year root. Only use it if you've thought hard about crypto-agility. In year 8, when ECDSA P-256 is fine but P-384 is the new floor, you'll want to roll.
- Immediate key export: after init, export the root private key to your Yubikey (or HSM) immediately and wipe the on-disk copy:
```shell
step crypto key format --pkcs8 $(step path)/secrets/root_ca_key
# move to offline storage, then:
shred -u $(step path)/secrets/root_ca_key
```
The intermediate key stays online at `/etc/step-ca/secrets/intermediate_ca_key`, encrypted with a passphrase. Drop a systemd unit at `/etc/systemd/system/step-ca.service` that reads the passphrase from `/etc/step-ca/password.txt` (owned by step:step, mode 0400). Start it, then verify that `step ca health` returns `ok`.
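A minimal unit matching the layout above (sketch; adjust paths if your STEPPATH or install locations differ):

```ini
[Unit]
Description=step-ca internal certificate authority
After=network-online.target
Wants=network-online.target

[Service]
User=step
Group=step
Environment=STEPPATH=/etc/step-ca
ExecStart=/usr/bin/step-ca /etc/step-ca/config/ca.json --password-file /etc/step-ca/password.txt
Restart=on-failure

[Install]
WantedBy=multi-user.target
```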
ACME Provisioner: Letting Services Self-Enroll Like They Do With Let's Encrypt
The ACME provisioner in step-ca lets cert-manager, Caddy, Traefik, and certbot enroll against your internal CA exactly the way they enroll against Let's Encrypt. Add "type": "ACME", "name": "acme" to the provisioners list in ca.json, restart step-ca, and any ACME client pointed at https://ca.internal.example.com/acme/acme/directory works.
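The provisioner entry itself is tiny. A sketch of where it lands in ca.json (the `authority.provisioners` array already exists after `step ca init`; your other provisioners stay alongside it):

```json
{
  "authority": {
    "provisioners": [
      { "type": "ACME", "name": "acme" }
    ]
  }
}
```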
The hard part is split-horizon DNS. Challenge types compared:
| Challenge | Requirement | Best for |
|---|---|---|
| HTTP-01 | CA can reach the service on port 80; names resolve on internal DNS | Single-zone setups |
| DNS-01 | DNS provider supports ACME challenge records | Internal-only hostnames, RFC 1918 ranges |
For RFC 1918 ranges, DNS-01 against an internal authoritative DNS server (CoreDNS, PowerDNS) is what I run.
For cert-manager on Kubernetes, the ClusterIssuer that actually works:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: step-ca-internal
spec:
  acme:
    server: https://ca.internal.example.com/acme/acme/directory
    email: platform@example.com
    privateKeySecretRef:
      name: step-ca-account-key
    caBundle: <base64 of your root_ca.crt>
    solvers:
    - http01:
        ingress:
          class: nginx
```
The caBundle field is the part nobody documents clearly. Without it, cert-manager doesn't trust your CA and ACME requests fail with x509 verification errors. Bundle the root cert (not the intermediate) into the issuer spec. Renewals just work after that.
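For the DNS-01 path, only the solver block changes. A sketch using cert-manager's RFC 2136 solver against a TSIG-capable internal server such as PowerDNS or BIND; the nameserver address, key name, and secret reference are placeholders:

```yaml
solvers:
- dns01:
    rfc2136:
      nameserver: 10.0.0.53:53
      tsigKeyName: acme-update-key
      tsigAlgorithm: HMACSHA256
      tsigSecretSecretRef:
        name: tsig-secret
        key: tsig-secret-key
```

Note that stock CoreDNS does not accept RFC 2136 dynamic updates, so this solver fits the PowerDNS/BIND side of the setups mentioned above.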
For more on ACME at scale and where it falls over, our ACME protocol guide has the production failure modes I keep hitting.
Short-Lived Certificates and the End of Renewal Anxiety
Setting 24-hour certificate lifetimes with renewal every 8 hours eliminates 2am expiry pages and absorbs up to 16 hours of CA downtime before anything breaks. Compare that to Let's Encrypt's 90-day model where a 30-day outage is survivable but anxiety-inducing.
The math, laid out:
| Cert lifetime | Renewal trigger | Safety margin | Failure signal |
|---|---|---|---|
| 90 days (LE default) | 60 days remaining | 60 days | Slow-motion expiry crisis |
| 24 hours (step-ca short-lived) | 8 hours remaining | 16 hours | Immediate, loud |
Most teams' alert threshold is also 30 days, so on a 90-day cert the real window before someone wakes up is much shorter than it looks. With short-lived certificates, a failed renewal is loud and immediate.
step-ca's renew daemon handles this:
```shell
step ca renew --daemon --expires-in 8h /etc/ssl/svc.crt /etc/ssl/svc.key
```
Configure it as a systemd unit per service. The daemon respects clock skew up to 5 minutes by default. If your fleet has worse NTP than that, fix NTP first.
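One shape that works is a unit per service wrapping the renew daemon. A sketch; the service name, cert paths, and the reload hook are assumptions for illustration:

```ini
[Unit]
Description=Certificate renewer for svc
After=network-online.target

[Service]
ExecStart=/usr/bin/step ca renew --daemon --expires-in 8h \
  --exec "systemctl reload nginx" \
  /etc/ssl/svc.crt /etc/ssl/svc.key
Restart=always

[Install]
WantedBy=multi-user.target
```

The `--exec` hook reloads the consuming service after each successful renewal, so new certs are actually picked up.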
Tradeoffs to be honest about:
- Monitoring shift: from "days until expiry" to "renewal success rate over last hour"
- CA availability: becomes a hard dependency, so the CA needs its own HA story
- Embedded devices with bad clocks: struggle; longer lifetimes there are fine
The alerting model that survives this transition is covered in our alert fatigue post.
JWK and OIDC Provisioners for Humans and CI
JWK provisioners sign certs with a known key for break-glass and CI use; OIDC provisioners issue short-lived client certs after SSO authentication for daily human access. In my experience, roughly 40% of internal CA misuse comes from operators sharing JWK keys because OIDC wasn't set up.
OIDC provisioner config in ca.json:
```json
{
  "type": "OIDC",
  "name": "google-sso",
  "clientID": "xxxxx.apps.googleusercontent.com",
  "clientSecret": "yyyyy",
  "configurationEndpoint": "https://accounts.google.com/.well-known/openid-configuration",
  "admins": ["alex@example.com"],
  "domains": ["example.com"]
}
```
The `domains` field is the coarse filter: issuance is restricted to identities whose email falls under a listed domain. For finer control, use `"groups"` with Google Workspace groups or Okta groups.
The end-to-end flow for an engineer:
```shell
step ca certificate alex@example.com client.crt client.key \
  --provisioner google-sso
```
Browser opens, SSO happens, cert lands locally with a 16-hour validity. That cert authenticates to internal services that trust the CA. The outcome:
- No shared keys
- No Slack DMs with PEM files
- Full audit trail on the CA side showing who issued what
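Using the cert is then ordinary mTLS from the client side; the hostname and file paths are illustrative:

```shell
curl --cacert "$(step path)/certs/root_ca.crt" \
  --cert client.crt --key client.key \
  https://api.internal.example.com/healthz
```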
For CI pipelines minting certs for integration tests, JWK is fine since the CI system has its own identity. Bake the JWK provisioner password into your secrets manager, not the repo.
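A CI enrollment with the JWK provisioner might look like this; the provisioner name, SAN, and secret mount path are assumptions:

```shell
step ca certificate "ci-runner.internal.example.com" ci.crt ci.key \
  --provisioner ci-jwk \
  --provisioner-password-file /run/secrets/jwk_password \
  --not-after 2h
```

Short `--not-after` values keep CI certs useless shortly after the job finishes.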
Revocation That Actually Works: CRL, OCSP, and Just Letting Certs Expire
step-ca supports CRL and OCSP, but at 24-hour cert lifetimes revocation is mostly theater — an attacker who steals a service key has at most 24 hours before it dies on its own. The real revocation strategy at short lifetimes is "stop issuing new certs to the compromised identity and wait."
When revocation actually matters:
- Long-lived client certs — laptop certs, IoT device certs with 30+ day lifetimes
- Intermediate compromise — revoke from the root, rebuild trust
- Compliance contexts — frameworks requiring a documented revocation procedure
Enable CRLs in step-ca by adding `"crl": {"enabled": true}` to ca.json. Two practical problems:
- Clients need to actually check OCSP, which most don't by default
- OCSP stapling on internal services is even more rarely configured correctly
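If you do enable CRLs, the fuller config block looks roughly like this; field names are drawn from current step-ca documentation and worth verifying against your installed version:

```json
"crl": {
  "enabled": true,
  "generateOnRevoke": true,
  "cacheDuration": "24h"
}
```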
We covered this in why OCSP stapling is probably broken on half your endpoints.
The honest position from running this in production: budget your engineering time on shrinking cert lifetimes first, and on revocation infrastructure second. A 1-hour cert with no revocation is more secure than a 1-year cert with perfect OCSP.
Monitoring, Backup, and the Day Your CA Goes Down
step-ca exposes Prometheus metrics at /metrics; alert on issuance failures and renewal error rate, not certificate expiry — at 24h lifetimes, expiry alerts fire constantly and are meaningless. The metrics that matter:
- `step_ca_provisioner_sign_total` by status
- `step_ca_acme_finalize_duration`
- Time since last successful issuance per provisioner
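A starting-point Prometheus rule built on those metrics; the label names (`status`, `provisioner`) are assumptions to verify against your step-ca version's /metrics output:

```yaml
groups:
- name: step-ca
  rules:
  - alert: StepCAIssuanceFailures
    expr: rate(step_ca_provisioner_sign_total{status!="success"}[15m]) > 0
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "step-ca issuance failures via {{ $labels.provisioner }}"
```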
Backup strategy by asset:
| Asset | Storage location | Frequency / handling |
|---|---|---|
| Intermediate signing key | Secrets manager (Vault, AWS Secrets Manager, Bitwarden) | Encrypted at rest |
| step-ca database (badger or MySQL) | Snapshot storage | Every 6 hours |
| ca.json config | Git, with provisioner secrets templated out | On change |
| Root CA cert (public) | Every trust bundle on the network | At rollout |
DR drill numbers from running this: with the intermediate key in Vault and a snapshot of /etc/step-ca, rebuilding the CA on a fresh VM takes about 20 minutes. The Ansible role does most of it. The slowest step is verifying every downstream service is still issuing renewals correctly.
The runbook for intermediate key compromise, in order:
1. Stop step-ca on the compromised host. Don't restart.
2. Bring the root online (Yubikey, HSM, whatever).
3. Generate a new intermediate, sign it with the root.
4. Publish the new intermediate to your trust distribution channel (config management, Kubernetes secrets, Vault).
5. Push the CRL revoking the old intermediate.
6. Tell every service owner to force-renew. At 24h lifetimes, this happens naturally within a day, but you don't have a day if the attacker is active.
7. Rotate every leaf cert issued by the old intermediate. Audit logs from step-ca tell you which ones.
8. Write the post-mortem before anyone forgets the timeline.
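Steps 2 through 4 in step CLI terms, as a sketch; it assumes the root key is reachable on the offline machine, and the file names are illustrative:

```shell
# On the machine with access to the root key: sign a fresh intermediate
step certificate create "Internal CA Intermediate G2" \
  intermediate_ca.crt intermediate_ca_key \
  --profile intermediate-ca \
  --ca root_ca.crt --ca-key root_ca_key \
  --not-after 43800h  # roughly 5 years

# Install on the rebuilt CA host, then restart step-ca
sudo install -o step -g step -m 0400 intermediate_ca_key /etc/step-ca/secrets/
sudo install -o step -g step -m 0644 intermediate_ca.crt /etc/step-ca/certs/
sudo systemctl restart step-ca
```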
Step 6 is the awkward conversation: every service that pinned the old intermediate explicitly is broken until they update. This is why root pinning is the only correct pinning strategy for internal PKI.
Frequently Asked Questions
Can step-ca replace Let's Encrypt for public-facing certs?
No. step-ca issues certificates for internal trust where you control the trust stores. Public-facing certs require a publicly trusted CA. Use Let's Encrypt or your cloud provider's managed CA for anything internet-facing.
How does step-ca compare to HashiCorp Vault's PKI engine?
Vault PKI is fine if you already run Vault. step-ca is simpler if you don't, supports ACME natively (Vault requires a sidecar), and the Smallstep team focuses on PKI as their main product rather than a feature among many.
What's the minimum infrastructure for production step-ca?
One VM with 1 vCPU and 1GB RAM handles thousands of certs per day. For HA, run two instances behind a load balancer with a shared MySQL backend. The root key stays offline regardless.
How long should intermediate CAs live?
Two to five years is the sane range. Longer than five and you're not practicing the rotation muscle. Shorter than two and rotation becomes a constant chore. Rotate before expiry, not at expiry.
Does step-ca work with service mesh sidecars like Envoy or Linkerd?
Yes. Istio and Linkerd both consume SPIFFE/SPIRE-style identities, and step-ca issues X.509 SVID certs. For plain Envoy, point its SDS to a step-ca-backed secret renewer. The 24-hour lifetime story works particularly well here.
Wrapping Up
This step-ca tutorial covered the parts I wish were in one place when I built my first internal PKI: a two-tier CA with the root offline, ACME for Kubernetes self-enrollment, 24-hour lifetimes that make renewal anxiety go away, OIDC for humans, and a DR runbook for when things go sideways. The honest version of internal PKI is less about cryptographic purity and more about operational discipline: rotate often, monitor renewals not expiries, and keep the root somewhere a laptop reimage can't touch.
If you're running internal certificates across more than a handful of services and you're not sure what you have, CertPulse inventories your CA-issued certs alongside your public ones in one view. Either way, get the root off the laptop today.
This is why we built CertPulse
CertPulse connects to your AWS, Azure, and GCP accounts, enumerates every certificate, monitors your external endpoints, and watches Certificate Transparency logs. One dashboard for every cert. Alerts when auto-renewal fails. Alerts when certs approach expiry. Alerts when someone issues a cert for your domain that you didn't request.
If you're looking for complete certificate visibility without maintaining scripts, we can get you there in about 5 minutes.