Observability
Observability is the property that a system's internal state can be inferred from its external outputs. The three pillars — logs, metrics, and traces — are different projections of the same operational truth; a production service MUST emit all three, and they MUST share correlation identifiers so an operator can pivot between them. This chapter standardises the libraries, wire formats, and process shapes that achieve those properties for Go services and TypeScript frontends.
TL;DR
- Logs MUST be structured JSON via `zerolog`. A thin `pkg/logging` wrapper MUST add correlation-id and component fields from context.
- Traces MUST be exported over OpenTelemetry OTLP gRPC. Service identity MUST be encoded in resource attributes (`service.name`, `service.version`, `deployment.environment`).
- Metrics MUST be scraped by Prometheus from a `/metrics` HTTP endpoint served by the OpenTelemetry meter provider's Prometheus exporter.
- When wiring pgx, the connection pool's `ConnConfig.Tracer` MUST be a composite that includes both `otelpgx` (distributed-trace spans) AND a custom DB query tracer (per-query metrics and structured logs). Picking one and dropping the other MUST NOT happen.
- Log shipping in environments without a sidecar SHOULD use Loki direct push; in environments with a sidecar, the application MUST emit JSON to stderr and let the sidecar handle ingest.
- The local dev compose stack MUST run Prometheus, Tempo, Loki, and Grafana so traces, metrics, and logs are visible without a production backend.
Why this choice
Three forces select this slate.
- Single ingest protocol. OTLP is the only protocol that carries traces, metrics, and (since OpenTelemetry 1.x) logs in one wire format. Adopting OTLP keeps every collector swap behind a configuration change.
- Per-tier specialism. A single library cannot be the best at structured logging, distributed tracing, and metric exposition. `zerolog` is the lowest-allocation structured logger in the Go ecosystem; OpenTelemetry is the de-facto trace standard; Prometheus is the de-facto metric scrape target. The composition lets each tier pick the best-of-niche tool without fighting the others.
- Composable instrumentation. The pgx tracer interface (and analogous interfaces in other clients) accepts a single tracer. Wrapping the tracer in a composite that fans out to OpenTelemetry and to a custom metrics/log tracer lets every database query emit a span, a counter increment, a duration histogram observation, and a structured log line from one instrumentation point.
External anchors:
- OpenTelemetry Specification: resource attributes, semantic conventions for HTTP / RPC / DB, exporter requirements.
- Prometheus Best Practices — Naming: metric and label naming conventions every exporter MUST follow.
- OpenTelemetry Logs Bridge API: the path for transporting logs over OTLP when a log collector is preferred to a Loki push.
- Grafana Tempo Documentation: trace storage receiver and Grafana datasource configuration.
Prescriptive
Structured logging with zerolog
- Every Go binary MUST emit logs as JSON to stderr. Stdout MUST be reserved for tool-style output (CLI subcommands that print structured results); long-running services MUST NOT print non-log output to stdout.
- The application MUST own a thin wrapper package (conventionally `pkg/logging`) that constructs a `zerolog.Logger` from configuration. The constructor MUST accept at least `level` (parsed via `zerolog.ParseLevel`), `format` (`json` or `text`/`console`), and a `component` label.
- `level` MUST default to `info` for production builds and SHOULD default to `debug` for builds detected as dev (for example, when a `<APP>_DEV_MODE=true` environment variable is set).
- `format=text` (the human-readable `zerolog.ConsoleWriter`) MUST be available for local dev. Production deployments MUST emit JSON.
- The wrapper MUST expose `WithCorrelationID(ctx, id)` and `CorrelationID(ctx)` helpers that store and retrieve a request correlation identifier on the context. A `Ctx(ctx, base)` helper MUST return a logger derived from `base` with `correlation_id` and `component` fields attached when those values are present on the context.
- HTTP and Connect-RPC middleware MUST inject a correlation ID into every request context. When the inbound request carries a `Traceparent` or `X-Correlation-ID` header, the middleware MUST adopt the inbound value; otherwise it MUST generate a new ULID or UUID.
- A `NewWithWriter` constructor MUST be available when the logger must participate in a `MultiLevelWriter`. Passing the `Logger` itself as an `io.Writer` to a multi-writer double-encodes lines because `Logger.Write` serialises to JSON.
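The context helpers and the edge middleware described above can be sketched with the standard library alone. This is a minimal sketch, not the wrapper package itself: the logger is left out, a random version 4 UUID is generated by hand, and adopting the raw `Traceparent` value (rather than extracting its trace ID) is an assumption. `CorrelationMiddleware`, `newUUIDv4`, and `observedID` are hypothetical names.

```go
package main

import (
	"context"
	"crypto/rand"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ctxKey is unexported so other packages cannot collide with this key.
type ctxKey int

const correlationIDKey ctxKey = iota

// WithCorrelationID returns a copy of ctx that carries id.
func WithCorrelationID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, correlationIDKey, id)
}

// CorrelationID returns the identifier stored on ctx, or "" when none is set.
func CorrelationID(ctx context.Context) string {
	id, _ := ctx.Value(correlationIDKey).(string)
	return id
}

// newUUIDv4 builds a random (version 4) UUID using only the standard library.
func newUUIDv4() string {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err) // the platform random source is unusable
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

// CorrelationMiddleware adopts an inbound identifier or generates a new one.
func CorrelationMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			id = r.Header.Get("Traceparent") // assumption: adopt the raw header value
		}
		if id == "" {
			id = newUUIDv4()
		}
		next.ServeHTTP(w, r.WithContext(WithCorrelationID(r.Context(), id)))
	})
}

// observedID sends one in-memory request through the middleware and reports
// the identifier the wrapped handler saw. It exists only to exercise the sketch.
func observedID(inbound string) string {
	var seen string
	h := CorrelationMiddleware(http.HandlerFunc(func(_ http.ResponseWriter, r *http.Request) {
		seen = CorrelationID(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if inbound != "" {
		req.Header.Set("X-Correlation-ID", inbound)
	}
	h.ServeHTTP(httptest.NewRecorder(), req)
	return seen
}
```

In a real wrapper, `Ctx(ctx, base)` would read `CorrelationID(ctx)` and attach it to the derived logger as the `correlation_id` field.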
OpenTelemetry SDK init
- Telemetry initialisation MUST live in a single package (conventionally `pkg/telemetry`) with a top-level `Init(ctx, cfg)` function that returns:
  - a `Shutdown(ctx) error` closure that flushes and shuts down both the tracer provider and the meter provider, and
  - an `http.Handler` for the Prometheus `/metrics` endpoint.
- The configuration struct MUST carry at least `ServiceName`, `ServiceVersion`, `Environment`, `Endpoint`, `Insecure`, and `SampleRate`. `Environment` MUST default to a `_ENVIRONMENT` environment variable lookup with a `development` fallback when the environment variable is absent.
- The tracer provider MUST attach resource attributes for `service.name`, `service.version`, and `deployment.environment`. It MUST use OTLP gRPC via `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc`.
- The sampler MUST be configurable via `SampleRate`. Production services SHOULD start at `1.0` (always sample) and reduce only when trace ingest volume becomes a cost problem; reducing prematurely hides defects.
- When `Endpoint` is empty, the tracer MUST fall back to a noop tracer; the meter provider MUST still serve Prometheus metrics. A service running without an OTLP collector MUST still expose metrics.
- `Init` MUST tear down a partially-initialised tracer when meter initialisation fails. Leaking a half-init'd provider into the process accumulates goroutines on retry.
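The teardown rule in the last bullet is easiest to see in isolation. The sketch below is an assumption-laden outline, not the telemetry package: the real tracer and meter initialisers are replaced by the hypothetical `initFunc` stand-in, the `http.Handler` return value is omitted, and the combined shutdown returns the first error it encounters.

```go
package main

import (
	"context"
	"errors"
)

// ShutdownFunc flushes and stops one telemetry provider.
type ShutdownFunc func(context.Context) error

// initFunc is a stand-in for a provider constructor such as a tracer or meter
// initialiser. It returns the provider's shutdown hook.
type initFunc func(context.Context) (ShutdownFunc, error)

// Init starts the tracer, then the meter. If the meter fails, the tracer that
// was already started is shut down before the error is returned.
func Init(ctx context.Context, initTracer, initMeter initFunc) (ShutdownFunc, error) {
	stopTracer, err := initTracer(ctx)
	if err != nil {
		return nil, err
	}
	stopMeter, err := initMeter(ctx)
	if err != nil {
		// Roll back the half-initialised state; if teardown also fails,
		// both errors are reported together.
		return nil, errors.Join(err, stopTracer(ctx))
	}
	return func(ctx context.Context) error {
		// Always attempt both shutdowns, then surface the first error.
		tracerErr := stopTracer(ctx)
		meterErr := stopMeter(ctx)
		if tracerErr != nil {
			return tracerErr
		}
		return meterErr
	}, nil
}

// tracerStoppedOnMeterFailure exercises the rollback path with fake providers.
func tracerStoppedOnMeterFailure() bool {
	stopped := false
	tracer := func(context.Context) (ShutdownFunc, error) {
		return func(context.Context) error { stopped = true; return nil }, nil
	}
	meter := func(context.Context) (ShutdownFunc, error) {
		return nil, errors.New("meter init failed")
	}
	_, err := Init(context.Background(), tracer, meter)
	return err != nil && stopped
}
```

Passing the initialisers in as functions keeps the rollback logic testable without an OTLP collector or a Prometheus registry.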
Prometheus exporter
- The meter provider MUST register the `go.opentelemetry.io/otel/exporters/prometheus` exporter so the OTel-recorded metrics surface at a single `/metrics` endpoint.
- The endpoint MUST be served by the binary's HTTP server at the path `/metrics`, on a port that is either the main API port (when the API is the canonical surface) or a dedicated admin port. Production services SHOULD use a dedicated admin port so the metrics endpoint is not exposed to public ingress.
- Metric and label names MUST follow Prometheus naming conventions: `snake_case`, units in the suffix (`_seconds`, `_bytes`, `_total` for counters), and no mismatch between the unit a metric records and the unit its suffix names.
- Histograms MUST declare bucket boundaries explicitly. The default buckets baked into client libraries are tuned for generic web traffic and SHOULD be overridden for domain-specific metrics (database query latency, controller reconcile duration).
- Cardinality MUST be bounded. Per-user, per-request-ID, and per-trace-ID labels MUST NOT be applied to metrics; those belong on traces. Acceptable label dimensions are endpoint, method, status class (`2xx`/`4xx`/`5xx`), and any low-cardinality enumerable domain identifier.
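Two of the rules above reduce to small helpers: collapsing a status code into a bounded label value, and declaring histogram boundaries explicitly. The following is a sketch only; `statusClass`, `bucketsAscending`, and the boundary values in `dbQueryBuckets` are illustrative assumptions, not measured recommendations.

```go
package main

// dbQueryBuckets are explicit histogram boundaries in seconds, chosen here
// purely for illustration of a database-latency histogram.
var dbQueryBuckets = []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5}

// statusClass collapses an HTTP status code into a bounded label value, so
// the label has a small fixed set of values regardless of traffic.
func statusClass(code int) string {
	switch {
	case code >= 100 && code < 200:
		return "1xx"
	case code >= 200 && code < 300:
		return "2xx"
	case code >= 300 && code < 400:
		return "3xx"
	case code >= 400 && code < 500:
		return "4xx"
	case code >= 500 && code < 600:
		return "5xx"
	default:
		return "unknown"
	}
}

// bucketsAscending reports whether boundaries are strictly increasing, which
// histogram implementations generally require.
func bucketsAscending(bounds []float64) bool {
	for i := 1; i < len(bounds); i++ {
		if bounds[i] <= bounds[i-1] {
			return false
		}
	}
	return true
}
```

A check like `bucketsAscending` fits naturally in a unit test next to the metric declaration, so a mistyped boundary fails the build rather than the dashboard.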
Composite pgx tracer
- The pgx pool's `ConnConfig.Tracer` field accepts a single `pgx.QueryTracer` value. Applications MUST construct a composite tracer that fans out to both:
  - `otelpgx.NewTracer()`: emits one span per query, with attributes for the SQL text, the database name, and the duration; produces the distributed-trace edges the OTel tracer needs.
  - A custom `DBQueryTracer`: records Prometheus histograms and counters (`db_query_duration_seconds`, `db_queries_total{status=...}`) and writes structured log lines on slow or failing queries.
- Picking one and dropping the other MUST NOT happen. Dropping `otelpgx` removes database edges from the distributed trace; dropping the custom tracer removes the metric histograms and log lines an operator uses for first-line triage.
- The composite SHOULD be implemented as a small adapter in the telemetry package: `NewCompositeTracer(tracers ...pgx.QueryTracer) pgx.QueryTracer`. Each tracer method (`TraceQueryStart`, `TraceQueryEnd`, `TraceConnectStart`, `TraceConnectEnd`, `TraceBatchStart`, `TraceBatchEnd`, `TracePrepareStart`, `TracePrepareEnd`) MUST fan out to every wrapped tracer. In pgx v5 only the query pair belongs to `pgx.QueryTracer`; the connect, batch, and prepare methods belong to the optional `pgx.ConnectTracer`, `pgx.BatchTracer`, and `pgx.PrepareTracer` interfaces, so the composite has to implement those as well and forward each call to the wrapped tracers that support it.
- The composite MUST be installed before the pool is opened. Setting `Tracer` after `pgxpool.NewWithConfig` has no effect on connections the pool has already opened, because pgx copies the value into per-connection state when each connection is established.
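The fan-out itself is a few lines. The sketch below shows the pattern for the query pair only and deliberately avoids importing pgx: `QueryTracer`, `QueryStart`, and `QueryEnd` are simplified local stand-ins (the real `pgx.QueryTracer` methods also take a `*pgx.Conn` and use `pgx.TraceQueryStartData` / `pgx.TraceQueryEndData`), and `recorder` is a fake tracer used to show the call order.

```go
package main

import "context"

// QueryStart and QueryEnd are simplified stand-ins for pgx's trace data types.
type QueryStart struct{ SQL string }
type QueryEnd struct{ Err error }

// QueryTracer mirrors the shape of pgx.QueryTracer without the *pgx.Conn
// argument, so the sketch compiles without third-party modules.
type QueryTracer interface {
	TraceQueryStart(ctx context.Context, data QueryStart) context.Context
	TraceQueryEnd(ctx context.Context, data QueryEnd)
}

// compositeTracer fans every call out to each wrapped tracer, in order.
type compositeTracer struct{ tracers []QueryTracer }

// NewCompositeTracer wraps any number of tracers behind one QueryTracer.
func NewCompositeTracer(tracers ...QueryTracer) QueryTracer {
	return &compositeTracer{tracers: tracers}
}

// TraceQueryStart threads the context through every tracer so each one can
// attach its own state (a span, a start time) for TraceQueryEnd to read.
func (c *compositeTracer) TraceQueryStart(ctx context.Context, data QueryStart) context.Context {
	for _, t := range c.tracers {
		ctx = t.TraceQueryStart(ctx, data)
	}
	return ctx
}

// TraceQueryEnd hands the accumulated context to every tracer.
func (c *compositeTracer) TraceQueryEnd(ctx context.Context, data QueryEnd) {
	for _, t := range c.tracers {
		t.TraceQueryEnd(ctx, data)
	}
}

// recorder is a fake tracer that notes which calls it received.
type recorder struct {
	name   string
	events *[]string
}

type recorderKey string

func (r recorder) TraceQueryStart(ctx context.Context, _ QueryStart) context.Context {
	*r.events = append(*r.events, r.name+":start")
	return context.WithValue(ctx, recorderKey(r.name), true)
}

func (r recorder) TraceQueryEnd(ctx context.Context, _ QueryEnd) {
	// Record the end only if the state stored at start is still on ctx.
	if ok, _ := ctx.Value(recorderKey(r.name)).(bool); ok {
		*r.events = append(*r.events, r.name+":end")
	}
}

// fanOutEvents runs one query lifecycle through a two-tracer composite.
func fanOutEvents() []string {
	var events []string
	t := NewCompositeTracer(
		recorder{name: "span", events: &events},
		recorder{name: "metrics", events: &events},
	)
	ctx := t.TraceQueryStart(context.Background(), QueryStart{SQL: "SELECT 1"})
	t.TraceQueryEnd(ctx, QueryEnd{})
	return events
}
```

The detail that matters is the context threading in `TraceQueryStart`: returning only the last tracer's context works, but returning the original context would discard the state every tracer stored for its end hook.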
Loki direct push
- Services running in environments without a log-shipping sidecar MAY push logs directly to Loki using the `loki-push-api` HTTP endpoint or the Grafana-supplied Go client.
- Services running with a sidecar (Fluent Bit, Vector, Grafana Alloy) MUST NOT push to Loki directly; they MUST emit JSON to stderr and let the sidecar handle ingest. Mixing both produces duplicate log lines in Loki.
- The push client MUST batch by time (every 1–5 seconds) AND by size (a cap such as 1 MiB) so a high-volume burst is not delivered line-by-line.
- The push client MUST be resilient to Loki downtime: in-memory buffer with a bounded size, drop-oldest on overflow, structured log line emitted to stderr on every dropped batch so the operator sees it.
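The batching and overflow rules can be sketched as a small in-memory structure, independent of any Loki client. This is an illustration under stated assumptions: `batcher` and its methods are hypothetical names, the type is not safe for concurrent use, the HTTP delivery step is omitted, and the tiny limits in `overflowDemo` stand in for a realistic cap such as 1 MiB.

```go
package main

// batcher groups log lines into batches capped by size and keeps a bounded
// queue of sealed batches, dropping the oldest when the queue is full.
type batcher struct {
	maxBatchBytes int // seal the current batch once it reaches this size
	maxPending    int // bound on sealed batches awaiting delivery
	current       []string
	currentBytes  int
	pending       [][]string
	dropped       int // batches discarded because delivery fell behind
}

func newBatcher(maxBatchBytes, maxPending int) *batcher {
	return &batcher{maxBatchBytes: maxBatchBytes, maxPending: maxPending}
}

// Add appends one line and seals the batch when the size cap is reached.
func (b *batcher) Add(line string) {
	b.current = append(b.current, line)
	b.currentBytes += len(line)
	if b.currentBytes >= b.maxBatchBytes {
		b.seal()
	}
}

// Tick seals whatever has accumulated. A real client would call it from a
// time.Ticker firing every 1 to 5 seconds.
func (b *batcher) Tick() { b.seal() }

// seal moves the current batch onto the pending queue and enforces the bound.
func (b *batcher) seal() {
	if len(b.current) == 0 {
		return
	}
	b.pending = append(b.pending, b.current)
	b.current, b.currentBytes = nil, 0
	if len(b.pending) > b.maxPending {
		b.pending = b.pending[1:] // drop-oldest
		b.dropped++               // a real client would also log this to stderr
	}
}

// overflowDemo fills a queue bounded at two batches of ten bytes each and
// reports the queue length, the drop count, and the oldest surviving line.
func overflowDemo() (pending, dropped int, oldest string) {
	b := newBatcher(10, 2)
	b.Add("aaaaaaaaaa") // sealed by size
	b.Add("bbbbbbbbbb") // sealed by size
	b.Add("cccccccccc") // sealed by size; queue full, the "a" batch is dropped
	b.Add("d")          // below the cap, stays in the current batch
	b.Tick()            // sealed by time; the "b" batch is dropped
	return len(b.pending), b.dropped, b.pending[0][0]
}
```

Dropping the oldest batch favours recent logs during an outage, which is usually what an operator wants when Loki comes back.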
Grafana / Prometheus / Tempo / Loki dev stack
- The local dev compose stack MUST run four containers:
  - `prom/prometheus`: scrapes `/metrics` from every dev binary's admin port. The scrape config MUST be a bind-mounted file so an engineer can add a new target without rebuilding the image.
  - `grafana/tempo`: receives OTLP traces from the OTLP gRPC exporter on port 4317. The default configuration uses the local filesystem backend for trace storage.
  - `grafana/loki`: receives log pushes on port 3100. In compose stacks without a sidecar, services push directly.
  - `grafana/grafana`: serves the dashboards. Datasources MUST be provisioned via `provisioning/datasources/*.yaml` so the stack boots with Tempo, Loki, and Prometheus already wired.
- The Grafana provisioning directory MUST be checked in. Engineers MUST NOT click datasources into Grafana by hand; the configuration MUST be reproducible from the repository.
- The OTLP endpoint MUST default to `localhost:4317` for dev. Dev builds MUST set `Insecure: true` on the tracer config because the Tempo container in the compose stack does not terminate TLS.
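The dev defaults above, together with the configuration fields listed under SDK init, might be gathered in one constructor. The sketch below is an assumption: `DevConfig` is a hypothetical helper, `APP_ENVIRONMENT` is a placeholder variable name chosen for illustration, and the lookup function is injected (in a real binary it would be `os.LookupEnv`) so the defaults can be tested without touching the process environment.

```go
package main

// Config carries the fields the chapter requires of the telemetry configuration.
type Config struct {
	ServiceName    string
	ServiceVersion string
	Environment    string
	Endpoint       string
	Insecure       bool
	SampleRate     float64
}

// DevConfig returns a configuration suited to the local compose stack.
// lookupEnv has the same signature as os.LookupEnv.
func DevConfig(serviceName, version string, lookupEnv func(string) (string, bool)) Config {
	env, ok := lookupEnv("APP_ENVIRONMENT") // hypothetical variable name
	if !ok || env == "" {
		env = "development" // fallback when the variable is absent
	}
	return Config{
		ServiceName:    serviceName,
		ServiceVersion: version,
		Environment:    env,
		Endpoint:       "localhost:4317", // Tempo's OTLP gRPC receiver in the dev stack
		Insecure:       true,             // the dev Tempo container does not terminate TLS
		SampleRate:     1.0,              // always sample
	}
}

// devConfigWithoutEnv shows the result when no environment variable is set.
func devConfigWithoutEnv() Config {
	return DevConfig("api", "v0.0.0-dev", func(string) (string, bool) { return "", false })
}
```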
/metrics endpoint is mandatory
- Every binary that runs as a long-lived process MUST expose `/metrics`. A binary that intentionally has no metrics yet MUST still expose the endpoint and return an empty-body 200 from the Prometheus handler so the scrape config can be authored once and re-used.
- The endpoint MUST NOT require authentication when bound to localhost or to a private dev-network admin port. In multi-tenant production environments where the endpoint is reachable, it MUST require an authentication token or be reachable only from the Prometheus scrape network.
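Both rules can be expressed as one admin mux that mounts the metrics handler and optionally wraps it in a token check. This is a sketch under assumptions: `newAdminMux`, `requireToken`, and `scrapeStatus` are hypothetical names, a bearer token in the `Authorization` header is one possible scheme, and `emptyMetrics` stands in for the Prometheus handler that telemetry initialisation would return.

```go
package main

import (
	"crypto/subtle"
	"net/http"
	"net/http/httptest"
)

// requireToken rejects requests whose bearer token does not match want.
func requireToken(want string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := r.Header.Get("Authorization")
		// Constant-time comparison avoids leaking the token through timing.
		if subtle.ConstantTimeCompare([]byte(got), []byte("Bearer "+want)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newAdminMux mounts the metrics handler at /metrics. An empty token leaves
// the endpoint open, which suits a localhost or private dev-network port.
func newAdminMux(metrics http.Handler, token string) *http.ServeMux {
	mux := http.NewServeMux()
	if token != "" {
		metrics = requireToken(token, metrics)
	}
	mux.Handle("/metrics", metrics)
	return mux
}

// emptyMetrics stands in for the handler of a binary that has no metrics yet:
// it answers 200 with an empty body.
var emptyMetrics = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// scrapeStatus performs one in-memory scrape and returns the HTTP status.
func scrapeStatus(serverToken, clientToken string) int {
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if clientToken != "" {
		req.Header.Set("Authorization", "Bearer "+clientToken)
	}
	rec := httptest.NewRecorder()
	newAdminMux(emptyMetrics, serverToken).ServeHTTP(rec, req)
	return rec.Code
}
```

In a service, the mux would be served on the dedicated admin port with its own `http.Server`, separate from the public API listener.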
Reference Implementation: Pioneer
Concrete files in the Pioneer donor codebase that implement the prescriptions above:
- Telemetry init: `/home/ubuntu/pioneer/pkg/telemetry/provider.go` defines `Config`, `Shutdown`, `InitResult`, and `Init(ctx, cfg) (*InitResult, error)`. It composes `InitTracer` and `InitMeter`; the returned `InitResult.MetricsHandler` is the Prometheus handler the binary mounts at `/metrics`. The `Shutdown` closure flushes both providers and returns the first error rather than masking it. When `Endpoint == ""` the tracer falls back to a noop and metrics still work, so services degrade gracefully when no OTLP collector is reachable.
- Logger: `/home/ubuntu/pioneer/pkg/logging/logger.go` exposes `New`, `NewWithWriter`, `WithCorrelationID`/`CorrelationID`, `WithComponent`/`Component`, and `Ctx`. The `format=text` path uses `zerolog.ConsoleWriter` and excludes timestamps so console output stays readable. The `Ctx` helper attaches both `correlation_id` and `component` fields when present on the context: middleware places them, and the application pulls them out at log time.
- Composite pgx tracer wiring: `/home/ubuntu/pioneer/internal/server/database/pool.go`. The `NewPool` function constructs `telemetry.NewCompositeTracer(otelpgx.NewTracer(), telemetry.NewDBQueryTracer())` and assigns it to `poolCfg.ConnConfig.Tracer` before calling `pgxpool.NewWithConfig`. This composition is the canonical example of fanning out one instrumentation point to both the trace and the metric/log tier.
Pinned versions
| Component | Version pinned | Rationale |
|---|---|---|
| `github.com/rs/zerolog` | v1.35.1 | Stable; lowest-allocation structured logger in the Go ecosystem. |
| `go.opentelemetry.io/otel` (SDK + API) | v1.43.0 | Latest stable; matches the donor's tracer/meter init pattern. |
| `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc` | v1.43.0 | OTLP gRPC trace exporter. |
| `go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc` | v1.43.0 | OTLP gRPC metric exporter. |
| `go.opentelemetry.io/otel/exporters/prometheus` | v0.65.0 | Pre-1.0 Prometheus exporter; tracks the OTel meter SDK closely. |
| `github.com/exaring/otelpgx` | v0.10.0 | pgx v5-compatible OTel tracer. |
| `github.com/prometheus/client_golang` | v1.23.2 | Used only when bypassing the OTel meter for stdlib-style metrics. |
| Prometheus (server) | v3.x | Latest stable; supports OTLP receive natively. |
| Grafana Tempo | v2.7.x | Default OTLP receiver on 4317; filesystem backend for dev. |
| Grafana Loki | v3.x | Direct push API stable. |
| Grafana | v11.x | Datasource provisioning schema stable. |
Pitfalls
- Forgetting one half of the composite pgx tracer. Wiring only `otelpgx` loses metric histograms and structured slow-query logs; wiring only the custom tracer loses distributed-trace edges. Applications MUST always compose both.
- Setting `Tracer` after pool creation. pgx copies the value into per-connection state when each connection is established, so connections that already exist never see the change. Set the tracer on the config BEFORE `pgxpool.NewWithConfig`.
- Per-user labels on metrics. Cardinality blow-up is among the most common Prometheus operational failures. Labels MUST be bounded to a small enumerable set; per-user identifiers belong on traces.
- Empty `service.version` resource attribute. Without a real version stamp, traces are unattributable when two release lines run side-by-side. The `-ldflags` block in the Air `cmd` (covered in `04-infra-tooling.md`) stamps `buildinfo.Version` precisely so the telemetry init can read it.
- Mixed log shipping (sidecar AND direct push). Loki receives duplicates and the operator sees doubled rates. Pick exactly one path per deployment environment.
- Mutating `level` without a restart. Log level MUST be settable at process boot from configuration. Hot-reloading the log level is a feature some applications support, but a mature, testable baseline MUST treat level as immutable per process.
- Mounting `/metrics` on the public API port. A public scrape endpoint leaks operational data. Production services SHOULD bind `/metrics` on a dedicated admin port reachable only from the scrape network.
- Using `Logger` directly as an `io.Writer` in a `MultiLevelWriter`. `Logger.Write` re-encodes the line as JSON, producing nested JSON in the output. Construct the logger with `NewWithWriter` and give the multi-writer the underlying writer instead.
- No correlation ID middleware. Logs from one request are scattered across the JSON stream; an operator cannot reconstruct a single request flow. Services MUST inject correlation IDs at the edge.
See also
- RFC 2119 keywords: every MUST/SHOULD/MAY in this chapter follows the canonical definitions.
- OpenTelemetry Specification: resource attributes, semantic conventions, exporter requirements.
- Prometheus Best Practices (Naming): metric and label conventions.
- Grafana Tempo Documentation.
- Grafana Loki Documentation.
- `exaring/otelpgx` README: pgx v5 OTel tracer used in the composite above.
- Chapter `02-data.md`: pgx pool construction, sqlc query interfaces, and goose migration handling.
- Chapter `04-infra-tooling.md`: Air `cmd` lines stamp the `service.version` that telemetry init reads.
- Chapter `06-security.md`: `/metrics` MUST be private when the service is in a multi-tenant environment.
- Chapter `08-discipline.md`: log-quality discipline (no PII, no secret values, structured fields rather than free text).
- Future ADRs: sampling strategy (head-based vs. tail-based) is a candidate ADR; sidecar-vs-direct log shipping is a candidate ADR.