Air-gap
TL;DR
- Air-gap-first is a design constraint, not a feature flag. Every runtime code path MUST be reachable from a disconnected environment. Network calls outside the cluster boundary MUST be opt-in, not the default.
- An embedded (or cluster-local sidecar) OCI registry MUST be the source of every container image the service references at runtime. The host OCI registry MAY be a fallback for connected deployments, but the service MUST NOT depend on it for first-boot success.
- Deployment bundles MUST carry a signed bill-of-materials (BOM). The loader MUST verify the signature against a trusted public key before extracting any payload. Unsigned bundles MUST be rejected.
- Local object storage SHOULD use a portable, S3-API-compatible target (RustFS, MinIO, SeaweedFS) so the same client code runs in both connected and disconnected contexts. Bucket discovery MUST tolerate the absence of public DNS.
- Service discovery for in-cluster dependencies SHOULD prefer a static seed list with health-checked fallbacks over DNS-only resolution. The Kubernetes API server is a reliable discovery surface; public DNS is not.
- A layered registry resolver MUST consult the local registry first, then escalate to a configured fallback registry, then fail closed. "Reach to the public internet on cache miss" MUST NOT be the default behavior.
Why this choice
The disconnected deployment is the most demanding operating environment a service will ever face, and it is also the environment that catches the largest class of foundation defects. A service that runs correctly in an air-gapped k3s cluster on a customer-managed appliance will also run correctly behind a corporate proxy, on a private cloud with no egress, and inside a partitioned region. The reverse is not true: a service designed for cloud-hosted runtime convenience routinely fails its first air-gap demo.
The choice of embedded OCI registry rather than "pull from upstream
on first boot" follows the air-gap-first principle from the Discipline
chapter (08-discipline.md). A service that depends on a public
registry for image hydration has a single point of failure that the
operator cannot remediate without restoring egress; a service that
ships its required images inside its own deployment bundle hydrates
into a cluster-local registry and survives the worst-case network
partition.
The choice of signed bundles follows from supply-chain hygiene. An unsigned bundle carries the trust of whoever happened to upload it; a signed bundle carries the trust of whoever holds the signing key. The operator MUST be able to verify provenance before extracting payload, and the verification MUST be a hard precondition for extraction, not a post-hoc audit step.
The choice of S3-compatible local object storage follows from forkability and air-gap discipline. The S3 API is the broadest object storage interface in the industry; clients written against it run against AWS S3, Cloudflare R2, MinIO, SeaweedFS, RustFS, GCS via gateway, or Azure Blob via gateway. The same client code in a guide-conformant service runs both in a connected cloud deployment and in an air-gapped appliance with a sidecar object store.
The choice of layered registry resolver follows from operator ergonomics. A two-layer resolver (local-first, fallback-second) gives the operator a single configuration knob to switch a deployment from "air-gapped" to "connected" without touching service code; the fail-closed third layer guarantees that misconfiguration cannot silently downgrade an air-gapped deployment to one that reaches the public internet.
Prescriptive guidance
Air-gap-first design
- A service MUST run end-to-end in a disconnected environment. The acceptance gate for "air-gap-first" is the smoke test "bring the service up in a Kubernetes cluster with all egress denied; every user-visible feature MUST work." Anything that fails the smoke test is a defect against this chapter.
- A service MUST NOT make an out-of-cluster network call on first boot unless that call is explicitly enabled by a configuration field whose default is disabled. "Phone home for version check," "fetch upstream image manifest," and "look up public DNS for the OIDC provider" are common violations.
- A service MUST emit a startup log line naming every out-of-cluster endpoint it intends to contact during normal operation. The operator MUST be able to grep that log line for the list of egress endpoints before authorizing the deployment.
- A service that requires a license check MUST support offline license files in addition to online activation. A service that only works with an online license server fails air-gap-first.
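The opt-in egress rule and the startup log line above can be sketched together. This is a minimal illustration, not a prescribed API: the `EgressEndpoint` type, the `egress_endpoints=` log key, and the example URLs are all hypothetical. The point it shows is that a Go `bool` field defaults to `false`, so an endpoint stays disabled unless configuration enables it.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// EgressEndpoint is a hypothetical config entry for one out-of-cluster call.
type EgressEndpoint struct {
	Name    string
	URL     string
	Enabled bool // zero value is false: egress stays off unless opted in
}

// enabledEgress returns the URLs the service may contact, sorted so the
// startup log line is stable across restarts.
func enabledEgress(endpoints []EgressEndpoint) []string {
	urls := []string{}
	for _, e := range endpoints {
		if e.Enabled {
			urls = append(urls, e.URL)
		}
	}
	sort.Strings(urls)
	return urls
}

// startupEgressLine renders one greppable line naming every egress endpoint.
func startupEgressLine(endpoints []EgressEndpoint) string {
	urls := enabledEgress(endpoints)
	if len(urls) == 0 {
		return "egress_endpoints=none"
	}
	return "egress_endpoints=" + strings.Join(urls, ",")
}

func main() {
	cfg := []EgressEndpoint{
		{Name: "version-check", URL: "https://updates.example.com"}, // left disabled
		{Name: "license", URL: "https://license.example.com", Enabled: true},
	}
	fmt.Println(startupEgressLine(cfg))
	// prints: egress_endpoints=https://license.example.com
}
```

An operator could then grep the startup log for `egress_endpoints=` before authorizing the deployment.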
Embedded OCI registry
- Every container image referenced at runtime MUST be available from a cluster-local OCI registry. The registry MAY run as an in-cluster Deployment, a sidecar StatefulSet, or an embedded process running on a control-plane node.
- The deployment bundle MUST ship every image the service references at runtime, including init containers, sidecars, and downstream operator-managed workloads. An image that is "usually pulled" from upstream MUST be vendored into the bundle, not assumed reachable.
- The registry SHOULD store its data on a persistent volume backed by S3-compatible object storage (covered in the S3-compatible object storage subsection of this chapter) so registry restarts do not require re-hydration from the bundle.
- Image references in Kubernetes manifests MUST use the cluster-local registry's resolvable hostname, not the upstream registry's hostname. Tools like `crane mutate` or Kustomize `images:` overlays MAY be used to rewrite references at bundle-build time.
- The registry MUST support both `linux/amd64` and `linux/arm64` manifests at minimum. Air-gapped appliance fleets routinely mix architectures, and a single-arch registry forces operators into ad-hoc workarounds.
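As an illustration of the hostname rewrite, the sketch below points an image reference at a cluster-local registry using only the Go standard library. It is an assumption-laden example, not a replacement for the tools named above: the function name `rewriteImageRef` and the hostname `registry.cluster.local:5000` are hypothetical, and the sketch keeps the repository path unchanged (it does not add Docker Hub's implicit `library/` prefix).

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteImageRef points an image reference at the cluster-local registry.
// The repository path, tag, and digest are preserved as written.
func rewriteImageRef(ref, localHost string) string {
	first, rest, hasSlash := strings.Cut(ref, "/")
	// Docker-style rule: the first path component names a registry host only
	// if it contains a dot or a port separator, or is exactly "localhost".
	isHost := hasSlash && (strings.ContainsAny(first, ".:") || first == "localhost")
	if isHost {
		return localHost + "/" + rest
	}
	// No registry host present: prepend the local one.
	return localHost + "/" + ref
}

func main() {
	const local = "registry.cluster.local:5000" // hypothetical hostname
	refs := []string{
		"nginx:latest",
		"docker.io/library/nginx:1.27",
		"ghcr.io/example/app:2.0",
	}
	for _, ref := range refs {
		fmt.Printf("%s -> %s\n", ref, rewriteImageRef(ref, local))
	}
}
```

Whether a bare `nginx:latest` should land at `nginx` or `library/nginx` in the local registry depends on how the bundle pushes its images; the rewrite and the push need to agree.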
Signed bundle verification
- Every deployment bundle MUST carry a signed BOM file (cosign-style signature over the manifest, sigstore-equivalent, or in-toto attestation). The signature MUST be verified by the loader before any payload extraction begins.
- The loader MUST reject bundles whose signature does not validate against a trusted public key. "Best-effort verify, warn on failure" MUST NOT be a supported mode in production.
- The trusted public key MUST be embedded in the loader binary or provided via a configuration secret. Fetching the public key from a network endpoint at verification time defeats the purpose of signing.
- A bundle MAY carry signatures from multiple authorities (vendor, customer, integrator). The loader MUST require at least the vendor signature; additional signature requirements (customer co-sign, for instance) MAY be enabled by configuration.
- The signature verification step MUST log the signing key fingerprint it accepted. Operators MUST be able to audit "which key signed this bundle" after the fact without re-running the verification.
- A bundle whose signature has expired (per the signing key's notAfter field) MUST be rejected. Long-lived bundles in stockpile MUST be re-signed before delivery, not given a pass.
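The verify-before-extract ordering can be sketched with the standard library. This example substitutes a raw Ed25519 signature over the BOM bytes for a cosign or in-toto envelope, so it illustrates the control flow only; `TrustStore`, `verifyBOM`, `loadBundle`, and `demoSign` are hypothetical names, and signature expiry is not modeled here.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"log"
)

// TrustStore holds public keys embedded in the loader or mounted from a secret.
type TrustStore []ed25519.PublicKey

var (
	errUntrusted    = errors.New("bundle signature did not validate against a trusted key")
	errNoTrustedKey = errors.New("no trusted key configured; refusing to load bundle")
)

// verifyBOM returns the SHA-256 fingerprint of the trusted key that signed bom.
func verifyBOM(bom, sig []byte, trusted TrustStore) (string, error) {
	if len(trusted) == 0 {
		return "", errNoTrustedKey // never warn-and-continue
	}
	for _, key := range trusted {
		// ed25519.Verify panics on a wrong-sized key, so check the size first.
		if len(key) == ed25519.PublicKeySize && ed25519.Verify(key, bom, sig) {
			sum := sha256.Sum256(key)
			return hex.EncodeToString(sum[:]), nil
		}
	}
	return "", errUntrusted
}

// loadBundle calls extract only after verification has succeeded.
func loadBundle(bom, sig []byte, trusted TrustStore, extract func() error) error {
	fp, err := verifyBOM(bom, sig, trusted)
	if err != nil {
		return err // extract is never reached on failure
	}
	log.Printf("bundle signature accepted key_fingerprint=sha256:%s", fp)
	return extract()
}

// demoSign creates a throwaway key pair so the example is runnable.
func demoSign(bom []byte) (ed25519.PublicKey, []byte) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		log.Fatal(err)
	}
	return pub, ed25519.Sign(priv, bom)
}

func main() {
	bom := []byte(`{"images":["registry.cluster.local:5000/app:1.0"]}`)
	pub, sig := demoSign(bom)
	extract := func() error { fmt.Println("extracting payload"); return nil }

	fmt.Println("valid bundle:", loadBundle(bom, sig, TrustStore{pub}, extract))
	tampered := append([]byte("x"), bom...)
	fmt.Println("tampered bundle:", loadBundle(tampered, sig, TrustStore{pub}, extract))
	fmt.Println("missing key:", loadBundle(bom, sig, nil, extract))
}
```

Logging the fingerprint at acceptance time is what lets an operator answer "which key signed this bundle" later without re-running verification.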
S3-compatible object storage
- Local object storage MUST present the S3 API. The choice of backend (RustFS, MinIO, SeaweedFS, ceph-rgw) is implementation detail; the service code MUST consume the storage exclusively through an S3 client (MinIO Go SDK, AWS SDK v2, or equivalent).
- The S3 client MUST be configured against a discovered endpoint, not a hardcoded URL. Endpoint discovery MUST work in both in-cluster contexts (Kubernetes Service ClusterIP) and out-of-cluster contexts (e.g., `kubectl port-forward` for local development).
- Presigned URLs MUST be issued through a client configured with the externally-reachable endpoint, not the in-cluster endpoint. A presigned URL minted against `clusterIP:9000` is unusable from a browser; the dual-client pattern (one internal client for data operations, one external-endpoint client for presigning) MUST be used.
- The S3 client MUST tolerate idle-connection closure by the storage backend. RustFS-class backends close kept-alive connections aggressively to bound resource usage; clients SHOULD disable HTTP keep-alives or set a short idle timeout to avoid "broken pipe" errors on multi-part uploads.
- Buckets MUST be created idempotently. A service MUST tolerate "bucket already exists" on startup; a service MUST NOT crash-loop on a pre-existing bucket. The MinIO SDK's `MakeBucket` error MUST be inspected for "already exists" before treating it as fatal.
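Two of the rules above, idempotent bucket creation and keep-alive handling, can be sketched without pulling in an SDK. The `bucketMaker` interface, `s3Error` type, and `fakeStore` are hypothetical stand-ins for a real S3 client; only the error codes `BucketAlreadyOwnedByYou` and `BucketAlreadyExists` come from the S3 API itself. Treating both codes as success assumes a single-tenant local store.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// s3Error is a stand-in for an SDK error that carries the S3 error code.
type s3Error struct{ Code string }

func (e *s3Error) Error() string { return "s3: " + e.Code }

// bucketMaker is the minimal client surface this sketch needs.
type bucketMaker interface {
	MakeBucket(name string) error
}

// ensureBucket creates the bucket and treats "already exists" as success.
func ensureBucket(c bucketMaker, name string) error {
	err := c.MakeBucket(name)
	if err == nil {
		return nil
	}
	var se *s3Error
	if errors.As(err, &se) &&
		(se.Code == "BucketAlreadyOwnedByYou" || se.Code == "BucketAlreadyExists") {
		return nil // pre-existing bucket is not a startup failure
	}
	return fmt.Errorf("create bucket %q: %w", name, err)
}

// newS3Transport disables keep-alives so a backend that closes idle
// connections cannot hand the client a dead socket mid-upload.
func newS3Transport() *http.Transport {
	return &http.Transport{DisableKeepAlives: true}
}

// fakeStore is an in-memory store used only to make the example runnable.
type fakeStore struct {
	buckets  map[string]bool
	failWith string // when set, every MakeBucket call fails with this code
}

func (f *fakeStore) MakeBucket(name string) error {
	if f.failWith != "" {
		return &s3Error{Code: f.failWith}
	}
	if f.buckets[name] {
		return &s3Error{Code: "BucketAlreadyOwnedByYou"}
	}
	f.buckets[name] = true
	return nil
}

func main() {
	store := &fakeStore{buckets: map[string]bool{}}
	fmt.Println("first start:", ensureBucket(store, "artifacts"))
	fmt.Println("restart:", ensureBucket(store, "artifacts"))
	denied := &fakeStore{failWith: "AccessDenied"}
	fmt.Println("denied:", ensureBucket(denied, "artifacts"))
	fmt.Println("keep-alives disabled:", newS3Transport().DisableKeepAlives)
}
```

With a real SDK, the same transport would be passed to both the internal client and the external-endpoint presigning client.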
Service discovery
- In-cluster service discovery SHOULD prefer the Kubernetes API server over public DNS. The API server is reachable in every air-gapped deployment by definition; public DNS is not.
- A discovery query MUST be allowed to fail gracefully. A service whose startup depends on discovering a peer SHOULD retry with backoff and publish a "degraded" health status until discovery succeeds, rather than crash-looping.
- Discovery results MUST be cached with a bounded TTL (typically 30 to 60 seconds). The cache MUST invalidate on connection errors so a re-discovery occurs on the next attempt.
- A discovered endpoint MUST include both a hostname/IP and a port. The port MUST come from the Service's port spec (named-port lookup), not from a hardcoded constant. A service that hardcodes "the port is 9000" breaks when the operator changes the port spec for compliance.
- Discovery configuration MAY accept an explicit override via environment variable (for example, `RUSTFS_ENDPOINT=...`). The override MUST take precedence over discovery so operators can short-circuit discovery during incident response.
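The override precedence, the bounded TTL, and invalidate-on-error can be combined in one small cache. This is a sketch under stated assumptions: `endpointCache`, `newEndpointCache`, and `defaultTTL` are hypothetical names, and the `discover` callback stands in for a Kubernetes API lookup that would read the ClusterIP and the named port from the Service spec.

```go
package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

const defaultTTL = 30 * time.Second // within the 30 to 60 second range above

// endpointCache caches one discovered "host:port" endpoint.
type endpointCache struct {
	mu       sync.Mutex
	discover func() (string, error) // stand-in for an API-server lookup
	env      func(string) string    // injectable for tests; defaults to os.Getenv
	ttl      time.Duration
	value    string
	expires  time.Time
}

func newEndpointCache(discover func() (string, error), ttl time.Duration) *endpointCache {
	return &endpointCache{discover: discover, env: os.Getenv, ttl: ttl}
}

// Endpoint returns the operator override if set, then a fresh cached value,
// and only then runs discovery.
func (c *endpointCache) Endpoint() (string, error) {
	if v := c.env("RUSTFS_ENDPOINT"); v != "" {
		return v, nil // override short-circuits discovery
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.value != "" && time.Now().Before(c.expires) {
		return c.value, nil
	}
	v, err := c.discover()
	if err != nil {
		return "", err // caller retries with backoff and reports "degraded"
	}
	c.value, c.expires = v, time.Now().Add(c.ttl)
	return v, nil
}

// Invalidate drops the cached value; call it after a connection error so the
// next Endpoint call re-discovers.
func (c *endpointCache) Invalidate() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value = ""
}

func main() {
	lookups := 0
	cache := newEndpointCache(func() (string, error) {
		lookups++
		return "10.43.0.17:9000", nil // illustrative ClusterIP and port
	}, defaultTTL)

	for i := 0; i < 3; i++ {
		ep, err := cache.Endpoint()
		fmt.Println(ep, err)
	}
	cache.Invalidate() // simulate a connection error
	_, _ = cache.Endpoint()
	fmt.Println("discovery lookups:", lookups)
}
```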
Layered registry resolver
- The registry resolver MUST consult layers in this order: (1) the cluster-local registry, (2) a configured fallback registry list, (3) fail closed. A configured fallback MAY be empty; the empty fallback MUST mean "fail closed at layer 1."
- Layer 1 (cluster-local) MUST be the only layer that runs by default. The fallback layer 2 MUST be opt-in via configuration.
- Layer 3 (fail closed) MUST never be silently replaced by "reach to a public default registry." A service that reaches `docker.io` after a cache miss violates this rule.
- The resolver MUST log which layer satisfied a request. Operators MUST be able to audit "did this image come from the cluster-local registry or from a fallback?" without re-running the resolution.
- A configured fallback registry MUST be authenticated with the same signed-bundle discipline as the cluster-local registry. A fallback MUST NOT trust an unsigned image just because layer 1 missed.
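The layer ordering above can be sketched as a small resolver. The `Resolver` type, the `Has` probe, and the registry hostnames are hypothetical; a real probe would query the registry for the manifest and, for fallbacks, also enforce signature checks, which this sketch leaves out.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// ErrNotFound is returned when every configured layer misses.
var ErrNotFound = errors.New("image not found in any configured registry layer")

// Resolver consults the cluster-local registry, then opt-in fallbacks,
// then fails closed. There is no implicit public default.
type Resolver struct {
	Local     string
	Fallbacks []string                        // empty unless the operator opts in
	Has       func(registry, ref string) bool // hypothetical manifest probe
}

// Resolve returns the registry that serves ref and the layer that matched.
func (r Resolver) Resolve(ref string) (registry string, layer int, err error) {
	if r.Has(r.Local, ref) {
		log.Printf("resolved image=%s layer=1 registry=%s", ref, r.Local)
		return r.Local, 1, nil
	}
	for _, fb := range r.Fallbacks {
		if r.Has(fb, ref) {
			log.Printf("resolved image=%s layer=2 registry=%s", ref, fb)
			return fb, 2, nil
		}
	}
	log.Printf("resolve failed image=%s layer=3 action=fail-closed", ref)
	return "", 3, ErrNotFound
}

func main() {
	// In-memory catalog standing in for real registries.
	catalog := map[string]map[string]bool{
		"registry.cluster.local:5000": {"app:1.0": true},
		"mirror.corp.example":         {"tool:2.0": true},
	}
	has := func(registry, ref string) bool { return catalog[registry][ref] }

	airGapped := Resolver{Local: "registry.cluster.local:5000", Has: has}
	connected := Resolver{
		Local:     "registry.cluster.local:5000",
		Fallbacks: []string{"mirror.corp.example"},
		Has:       has,
	}

	fmt.Println(airGapped.Resolve("app:1.0"))
	fmt.Println(airGapped.Resolve("tool:2.0")) // fails closed: no fallback configured
	fmt.Println(connected.Resolve("tool:2.0"))
}
```

Switching a deployment from air-gapped to connected changes only the `Fallbacks` list, which matches the single-knob design described earlier in this chapter.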
Cluster-local control-plane integration
- An air-gap-conformant service MUST integrate with the host Kubernetes control plane through a verified client (typed `client-go`, controller-runtime, or equivalent). The client MUST load configuration from the in-cluster service account by default, falling back to a kubeconfig path only when explicitly configured.
- The service MAY install custom resource definitions (CRDs) at startup. CRD installation MUST be idempotent: a re-deploy MUST NOT corrupt the existing CRD state. The Architecture chapter's Operator subsection (`01-architecture.md`) covers the reconciler pattern.
- Crossplane v2 GA'd in late 2025 with five hard breaking API changes. An air-gap-conformant service that ships Crossplane Compositions in its bundle MUST pin to a Crossplane v2 (or later) Provider version that is compatible with the cluster's Provider revision; bundle builds MUST fail closed when the Provider version disagrees with the Composition's API group.
Failure modes the pattern prevents
- First-boot egress. A service that pulls images from `docker.io` on first boot fails air-gap-first. The embedded registry + bundle pattern means the cluster has the images before the service needs them.
- Silent upstream substitution. A registry resolver that falls through to `docker.io` on cache miss can substitute an upstream image for a vendored one without operator awareness. The fail-closed third layer prevents this entirely.
- Unverifiable provenance. An unsigned bundle leaves the operator unable to answer "where did this come from." Signature verification upgrades the bundle from "binary blob" to "auditable artifact."
- DNS-only discovery brittleness. A service that resolves peer endpoints via `dig` or `nslookup` fails when public DNS is denied. The Kubernetes API server is a reachable discovery surface in every conformant deployment.
- Presigned URL endpoint mismatch. A presigned URL minted against the cluster-internal endpoint is unusable from a browser. The dual-client pattern keeps the data-plane endpoint and the presigning endpoint separable so the operator can configure each independently.
Reference Implementation: Pioneer
The donor codebase implements the S3-compatible client and RustFS service discovery patterns prescribed above in two files:
- `/home/ubuntu/pioneer/pkg/storage/s3.go`: the S3-compatible client wrapper around the MinIO Go SDK. The file declares a `Client` struct that holds two MinIO clients: one for in-cluster data operations and one configured against an external endpoint for minting presigned URLs (the dual-client pattern this chapter prescribes). The file also disables HTTP keep-alives at the transport layer to work around RustFS's aggressive idle-connection closure during multipart uploads (the "broken pipe" tolerance this chapter prescribes).
- `/home/ubuntu/pioneer/pkg/storage/discover.go`: the RustFS service-discovery helper. It resolves the in-cluster endpoint by querying the Kubernetes API server for the `rustfs-svc` Service in the `pioneer-system` namespace, then reads the `ClusterIP` plus the named `endpoint` port from the Service spec. The function accepts a kubeconfig path, falls back to the `KUBECONFIG` environment variable, and finally falls back to in-cluster config, which is the "kubeconfig-or-in-cluster" pattern this chapter prescribes for service discovery.
Adopters MAY copy the dual-client pattern from s3.go and the
discovery shape from discover.go directly; the imports
(github.com/minio/minio-go/v7, k8s.io/client-go/kubernetes) are
industry-standard and not project-specific.
Pinned versions
| Component | Pinned version | Source / notes |
|---|---|---|
| MinIO Go SDK | `github.com/minio/minio-go/v7` | S3-compatible client; rolling stable v7 line. |
| Kubernetes `client-go` | matched to the cluster's Kubernetes minor version, +/- 1 | Skew policy per Kubernetes upstream guidance. |
| RustFS | latest stable (2026-05 snapshot) | S3-compatible storage backend; alternative implementations include MinIO and SeaweedFS. |
| cosign / sigstore-equivalent | latest stable (2026-05 snapshot) | Bundle signing toolchain; verifier MUST be the binary embedded in the loader, not a remote service. |
| Crossplane | v2 GA (late 2025) | Five hard breaking API changes; bundle builds MUST fail closed when Composition API group disagrees with Provider revision. |
| Helm | v4 GA (November 2025) | Air-gap bundle MAY ship Helm charts; chart values overrides MUST use the cluster-local registry hostname. |
Snapshot date: 2026-05-08. Air-gap-relevant pins change rarely; quarterly reviews SHOULD re-confirm the Crossplane v2 Provider versions and the signing toolchain release line.
Pitfalls
- Pulling images on first boot. Even a single `image: nginx:latest` reference in a vendored manifest violates air-gap-first. Bundle-build tooling MUST sweep manifests for unvendored image references and fail the build.
- Skipping signature verification when the trusted key is missing. A loader that warns-and-continues when the trusted key is absent effectively disables signing. The loader MUST refuse to start without a trusted key.
- Reaching public DNS for "the OIDC provider." Identity providers configured by hostname imply public DNS. Air-gapped deployments MUST configure identity providers by ClusterIP or by static `hostAliases`.
- Single-client presigning. A MinIO client configured against the cluster-internal endpoint cannot mint a presigned URL the browser can reach. The dual-client pattern MUST be used; collapsing to one client is a recurring regression.
- DNS-only discovery. A service that resolves peers via `nslookup` rather than the Kubernetes API server fails in air-gapped clusters with restricted CoreDNS. API-server-based discovery is the portable default.
- Trusting fallback registries blindly. A configured fallback registry MUST be authenticated. A fallback that accepts unsigned images is a silent supply-chain regression.
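The manifest sweep named in the first pitfall can be approximated with a line-based scan. This is a rough sketch, not a YAML parser: the regular expression, the function name `unvendoredImages`, and the hostname are assumptions, and a production sweep would likely decode the manifests properly instead of matching text.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// imageLine matches `image: <ref>` keys, optionally as the first key of a
// list item and optionally quoted. It does not match `imagePullPolicy:` or
// Kustomize's `images:` key.
var imageLine = regexp.MustCompile(`^\s*(?:-\s*)?image:\s*["']?([^"'\s#]+)`)

// unvendoredImages returns every image reference in manifest that does not
// point at the cluster-local registry host.
func unvendoredImages(manifest, localHost string) []string {
	var bad []string
	sc := bufio.NewScanner(strings.NewReader(manifest))
	for sc.Scan() {
		m := imageLine.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		if !strings.HasPrefix(m[1], localHost+"/") {
			bad = append(bad, m[1])
		}
	}
	return bad
}

func main() {
	const local = "registry.cluster.local:5000" // hypothetical hostname
	manifest := `
containers:
  - name: web
    image: nginx:latest
    imagePullPolicy: IfNotPresent
  - name: app
    image: "registry.cluster.local:5000/app:1.0"
`
	bad := unvendoredImages(manifest, local)
	if len(bad) > 0 {
		fmt.Println("unvendored image references:", bad)
		// A bundle build would exit non-zero here to fail the build.
	}
}
```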
See also
- The Architecture chapter (`01-architecture.md`) for the Operator subsection that covers reconciler-driven CRD installation and Crossplane v2 / CAPI / Helm v4 integration in air-gapped clusters.
- The Data chapter (`02-data.md`) for object-storage-backed encryption patterns and the `(key_id, algorithm, ciphertext)` column shape that makes key rotation tractable in air-gapped deployments.
- The Infra and Tooling chapter (`04-infra-tooling.md`) for the Docker / k3s dev-loop discipline that surfaces air-gap defects before they reach customer appliances.
- The Security chapter (`06-security.md`) for the key-rotation pattern and the gosec / govulncheck CI gate that complement signed bundles.
- The Discipline chapter (`08-discipline.md`) for Principle 4 ("Air-gap first") that motivates this chapter.
- The decisions subdirectory ADR `0007-llms-txt-inclusion.md` (when published) covers documentation discoverability, which the air-gap pattern relies on for offline operator runbooks.