Compute: EKS → k3s
The original plan called for EKS with a multi-account AWS Organization, 3 NAT Gateways, and ArgoCD for GitOps deployments. The AWS bill hit ~$30/day with zero production traffic. EKS control plane alone is $72/month, a hard floor you can't negotiate. NAT Gateways added $97/month. ArgoCD is powerful but heavy for a single node.
k3s on a t4g.small gives you the same Kubernetes API: same Helm charts, same kubectl, same RBAC, for $14/month. Single node, no HA. The migration path to EKS is a values.yaml change, not a rewrite. ARM64 (Graviton) gives better price-performance than x86. ArgoCD replaced by SSH + helm upgrade --wait. Simpler, sufficient, and the deploy fails if pods don't become ready. Total savings: $180+/month.
Build System: Bazel for Go
Alternatives considered: go build with golangci-lint, GoReleaser, or Mage. The payoff of Bazel: nogo runs static analysis at compile time (not as a separate lint step that developers skip), builds are hermetic (same output on every machine, every time), and cross-compilation to ARM64 is a --platforms flag. CI builds Go on an x86 runner and produces ARM64 Docker images for production. No emulation, no multi-arch builds, no QEMU.
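A minimal sketch of what the nogo wiring might look like, assuming analyzers from golang.org/x/tools; the exact dependency labels depend on how the external x/tools repo is generated in your workspace:

```starlark
# Root BUILD.bazel — nogo definition (analyzer list is illustrative)
load("@io_bazel_rules_go//go:def.bzl", "nogo")

nogo(
    name = "nogo",
    deps = [
        # Label suffixes depend on the repo's naming convention.
        "@org_golang_x_tools//go/analysis/passes/nilness:go_default_library",
        "@org_golang_x_tools//go/analysis/passes/unusedresult:go_default_library",
        "@org_golang_x_tools//go/analysis/passes/copylock:go_default_library",
    ],
    visibility = ["//visibility:public"],
)

# Registered via go_register_toolchains(nogo = "@//:nogo"), after which
# every go_library compile runs the analyzers; a finding fails the build.
# Cross-compilation is then a flag:
#   bazel build --platforms=@io_bazel_rules_go//go/toolchain:linux_arm64 //cmd/...
```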
Monorepo: Two Build Systems, One Repo
Go services use Bazel. TypeScript frontend uses Nx. The conventional wisdom says pick one. Bazel can build JavaScript. Nx has plugins for everything. But each tool is dominant in its own ecosystem for a reason. Bazel's rules_go and nogo integration is unmatched for Go: hermetic builds, compile-time static analysis, and zero-config cross-compilation. Nx's understanding of TypeScript project references, its computation cache, and its integration with Vite, Storybook, and Playwright is unmatched for frontend. Forcing one tool to do both means fighting the tool instead of shipping.
The conventional knock on this approach is setup complexity. Gazelle for BUILD file generation, gopackagesdriver to bridge gopls and Bazel for IDE support, Starlark macros for custom rules, Nx workspace config, TypeScript project references. With AI, that complexity flattens to near-zero. Claude generated the initial BUILD files, configured Gazelle's resolve directives, wired up gopackagesdriver, scaffolded the Nx workspace config with project references, and built the cross-build-system CI wiring. The setup that would normally take days of reading docs and fighting config took hours.
The hard part of a monorepo with generated code (proto types, mocks, instrumented wrappers) is telling the build system where things live. Gazelle can't see Bazel-generated outputs, so it needs explicit resolve directives. A handful of resolve directives in the root BUILD.bazel, two explicit and three regex, handles every generated import pattern:

# Root BUILD.bazel — Gazelle resolve directives for generated code
# gazelle:prefix github.com/example/project
# gazelle:go_naming_convention go_default_library
# Proto-generated Connect RPC stubs (explicit, since they have custom rules)
# gazelle:resolve go .../gen/yeet/v1/yeetv1connect //proto/yeet/v1:yeetv1connect
# gazelle:resolve go .../gen/yeet/v1/yeetv1connect/instrumented_yeetv1connect //proto/yeet/v1:instrumented_yeetv1connect
# All other proto-generated Go types (regex catches the rest)
# gazelle:resolve_regexp go \.com/.*/gen/(.+)$ //proto/$1:go_default_library
# Mock libraries generated by go_mock_library macro
# gazelle:resolve_regexp go \.com/.*/(backend/.*)/mock_(\w+) //$1:mock_$2
# Instrumented decorator libraries generated by go_instrumented_library macro
# gazelle:resolve_regexp go \.com/.*/(backend/.*)/instrumented_(\w+) //$1:instrumented_$2
# Proto BUILD files are hand-written (custom rules opaque to Gazelle)
# gazelle:exclude proto
Five directives. That's the entire cost of teaching Gazelle about three categories of generated code across the whole repo. Every new mock, every new instrumented wrapper, every new proto package is automatically resolvable. No per-package BUILD.bazel edits required.
The glue between Bazel and Nx is Protobuf. .proto files live in proto/ at the repo root. Bazel generates Go types via rules_proto. Nx triggers buf generate for TypeScript types. A CI workflow detects proto changes and runs both code generators, failing the PR if generated code is stale. The contract is the boundary: backend and frontend can evolve independently as long as the proto contract holds.
Alternatives considered: Turborepo (JavaScript-only, no Go story), Pants (strong Python/Go support but weaker TypeScript ecosystem), Bazel-for-everything (possible but rules_js is high-friction for a React + Vite + Storybook stack), and Nx-for-everything (no nogo equivalent, weaker hermeticity for Go). The two-tool approach adds the cost of two dependency graphs and two cache systems, but each tool operates in its strength zone.
API Contract: Connect RPC over REST
REST with OpenAPI gives you documentation. gRPC gives you a compiler but locks out browsers. Connect RPC gives you both: one set of .proto files generates the Go server, TypeScript client, and dev stubs. Browsers speak HTTP/JSON; services speak gRPC. A breaking schema change fails at compile time in both languages. The entire frontend was built and tested against generated dev stubs before a single line of backend existed. The trade-off is less ecosystem tooling than OpenAPI: no Swagger UI, fewer code generators. Worth it for a typed contract that can't drift.
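A hypothetical sketch of what one of those contract files might look like; the message and service names here are invented for illustration, not taken from the actual proto/ tree:

```protobuf
// proto/yeet/v1/recipe.proto — illustrative sketch of a shared contract
syntax = "proto3";

package yeet.v1;

message GetRecipeRequest {
  string id = 1;
}

message Recipe {
  string id = 1;
  string title = 2;
  repeated string ingredients = 3;
}

message GetRecipeResponse {
  Recipe recipe = 1;
}

// One definition; connect-go generates the server interface, connect-es
// the TypeScript client. Renaming a field or method breaks both builds.
service RecipeService {
  rpc GetRecipe(GetRecipeRequest) returns (GetRecipeResponse);
}
```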
Search: Typesense over Elasticsearch
The plan originally used Postgres full-text search. Plan v4 introduced Typesense for what Postgres FTS can't do: typo tolerance, prefix search, search-as-you-type, facet counts, highlighting, and synonyms. The features that define great recipe discovery. Elasticsearch was considered but needs upwards of 500MB of RAM for a corpus of thousands of documents. Typesense uses ~50MB, returns results in under a millisecond, and the entire schema is a single JSON definition. Postgres FTS remains as a fallback.
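A sketch of what that single JSON definition might contain; the field names are hypothetical, but the shape follows Typesense's collection schema format (faceted string fields, a numeric default sorting field):

```json
{
  "name": "recipes",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "description", "type": "string"},
    {"name": "ingredients", "type": "string[]"},
    {"name": "cuisine", "type": "string", "facet": true},
    {"name": "avg_rating", "type": "float"},
    {"name": "created_at", "type": "int64"}
  ],
  "default_sorting_field": "avg_rating"
}
```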
Job Queue: River over Redis/RabbitMQ
Image processing and search index sync need background jobs. The conventional choices like Sidekiq, Bull, and RabbitMQ all introduce new infrastructure. River uses Postgres as the queue. No new operational surface, and critically: transactional job enqueue. Insert the recipe and enqueue the index job in the same transaction. If the insert fails, the job never enqueues. The trade-off is coupling to Postgres, which is already the single point of failure anyway.
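In River itself the guarantee comes from enqueuing through the job client inside the same *sql.Tx that performs the business write (River exposes an InsertTx for this). The toy in-memory model below is not River's API; it exists only to show the invariant: both writes are buffered in the transaction, so a failed insert can never leave an orphaned job.

```go
package main

import (
	"errors"
	"fmt"
)

// DB and Tx are a toy stand-in for Postgres: writes buffered in a Tx
// become visible only on Commit.
type DB struct {
	Recipes []string
	Jobs    []string
}

type Tx struct {
	db      *DB
	recipes []string
	jobs    []string
}

func (db *DB) Begin() *Tx { return &Tx{db: db} }

func (tx *Tx) InsertRecipe(title string) error {
	if title == "" {
		return errors.New("title required")
	}
	tx.recipes = append(tx.recipes, title)
	return nil
}

func (tx *Tx) EnqueueIndexJob(title string) {
	tx.jobs = append(tx.jobs, "index:"+title)
}

func (tx *Tx) Commit() {
	tx.db.Recipes = append(tx.db.Recipes, tx.recipes...)
	tx.db.Jobs = append(tx.db.Jobs, tx.jobs...)
}

func createRecipe(db *DB, title string) error {
	tx := db.Begin()
	if err := tx.InsertRecipe(title); err != nil {
		return err // tx dropped: the buffered job vanishes with it
	}
	tx.EnqueueIndexJob(title)
	tx.Commit()
	return nil
}

func main() {
	db := &DB{}
	_ = createRecipe(db, "")        // fails: no recipe, no orphaned job
	_ = createRecipe(db, "Goulash") // succeeds: recipe and job together
	fmt.Println(len(db.Recipes), len(db.Jobs)) // 1 1
}
```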
Images: imgproxy + S3
Alternatives considered: pre-generated thumbnails (storage explosion, processing pipeline), Cloudinary/Imgix (per-transformation pricing, vendor lock-in), or Sharp in Node (ties image processing to the frontend). imgproxy reads from S3 and resizes on demand via signed URLs. WebP, AVIF, smart crop, ~30MB of RAM. The signed URL scheme means clients can't request arbitrary transformations. Only the API can mint valid URLs. IMDS provides S3 credentials automatically. No static keys. No thumbnails to pre-generate, no storage multiplication, no external service dependency.
Observability: Decoration over Bespoke
The conventional approach to observability: sprinkle span.Start(), metrics.Increment(), and logger.Info() through your business logic. Every method becomes half observability code, half actual logic. Untestable, inconsistent, and a maintenance nightmare.
This project uses a decorator pattern via Bazel codegen. A custom Starlark macro (go_instrumented_library) reads Go interfaces at build time and generates Instrumented* wrapper types that delegate each method through an Instrumentor: automatic span creation, duration metrics, and structured logging with zero boilerplate in the implementation:
# BUILD.bazel — declares which interfaces get instrumented
load("//tools/bazel:instrumentorgen.bzl", "go_instrumented_library")
go_instrumented_library(
name = "instrumented_store",
library = ":go_default_library",
interfaces = ["RecipeStore", "UserStore", "RatingStore"],
importpath = ".../instrumented_store",
)
# With per-method auth policies (auth + observability in one decorator):
go_instrumented_library(
name = "instrumented_controller",
library = ":go_default_library",
interfaces = ["RecipeController"],
auth = {
"RecipeController.Create": "authenticate",
"RecipeController.Update": "authorize_owner:recipe.ID",
"RecipeController.Delete": "authorize_owner:recipe.ID",
},
)
The generated code exists only in Bazel's output tree: never in the source tree, never in git. The instrumentorgen tool is pure stdlib Go (go/parser, go/types, text/template). No external dependencies. Gazelle auto-resolves instrumented imports via a single regex directive.
At the wiring site (cmd/*/main.go), each concrete implementation is wrapped with its generated decorator. Business logic packages never import observability. The result: every interface call is traced, metered, and logged consistently across 21K lines of Go, without a single manual span.Start().
Observability as Code: Datadog
Vendor-neutral instrumentation: OTel traces, Prometheus metrics, zap structured logs. DD Agent consumes all three. Zero DD SDK in app code. Dashboards, monitors, and SLOs are Pulumi resources, not clickops:
// infra/pulumi/stacks/observability/dashboards.ts
import * as datadog from "@pulumi/datadog";
// Service Overview — the RED dashboard (Rate, Errors, Duration)
const serviceOverview = new datadog.Dashboard("service-overview", {
title: "Eat or Yeet — Service Overview",
layoutType: "ordered",
widgets: [
// Request rate per service
timeseriesWidget("Requests / sec", [
"sum:trace.connectrpc.request{service:yeet-api}.as_rate()",
"sum:http.server.request{service:yeet-frontend}.as_rate()",
"sum:river.job.completed{service:yeet-worker}.as_rate()",
]),
// Error rate (% of total)
queryValueWidget("API Error Rate",
"sum:trace.connectrpc.request.errors{service:yeet-api}.as_rate() / " +
"sum:trace.connectrpc.request{service:yeet-api}.as_rate() * 100"
),
// Latency percentiles — P50, P95, P99
timeseriesWidget("API Latency", [
"p50:trace.connectrpc.request.duration{service:yeet-api}",
"p95:trace.connectrpc.request.duration{service:yeet-api}",
"p99:trace.connectrpc.request.duration{service:yeet-api}",
]),
],
});
// API Detail — per-endpoint breakdown from decorator metrics
const apiDetail = new datadog.Dashboard("api-detail", {
title: "Eat or Yeet — API Endpoints",
layoutType: "ordered",
widgets: [
// Every decorated method is a row — the naming convention makes this automatic
toplistWidget("Slowest Endpoints (P95)",
"p95:trace.connectrpc.request.duration{service:yeet-api} by {resource_name}"
),
// Error breakdown by RPC method + status code
timeseriesWidget("Errors by Endpoint", [
"sum:trace.connectrpc.request.errors{service:yeet-api} by {resource_name}",
]),
// Store-layer latency — catches query regressions before they hit users
timeseriesWidget("Store Layer (decorated)", [
"avg:store.recipe.GetByID.duration{service:yeet-api}",
"avg:store.recipe.List.duration{service:yeet-api}",
"avg:store.recipe.Search.duration{service:yeet-api}",
]),
],
});
// Data Layer — leading indicators
const dataLayer = new datadog.Dashboard("data-layer", {
title: "Eat or Yeet — Data Layer",
widgets: [
queryValueWidget("PG Pool Utilization",
"avg:db.pool.open_connections{service:yeet-api} / " +
"avg:db.pool.max_open{service:yeet-api} * 100"
),
timeseriesWidget("Redis Hit Rate", [
"sum:cache.hits{service:yeet-api}.as_rate()",
"sum:cache.misses{service:yeet-api}.as_rate()",
]),
timeseriesWidget("Typesense Query Latency", [
"p95:typesense.search.duration{service:yeet-api}",
]),
queryValueWidget("River Queue Depth",
"avg:river.queue.depth{service:yeet-worker}"
),
],
});
// infra/pulumi/stacks/observability/monitors.ts
// P1 — page immediately
const apiErrorRate = new datadog.Monitor("api-error-rate-p1", {
type: "query alert",
query: "sum(last_2m):sum:trace.connectrpc.request.errors{service:yeet-api}.as_rate() / " +
"sum:trace.connectrpc.request{service:yeet-api}.as_rate() * 100 > 5",
name: "[P1] API error rate > 5%",
message: `{{#is_alert}}
API error rate is {{value}}% — users are impacted.
Dashboard: https://app.datadoghq.com/dashboard/api-detail
First action: check recent deploys, then trace errors by resource_name.
@pagerduty-yeet
{{/is_alert}}`,
priority: 1,
});
const pgPoolExhaustion = new datadog.Monitor("pg-pool-p1", {
type: "query alert",
query: "avg(last_2m):avg:db.pool.open_connections{service:yeet-api} / " +
"avg:db.pool.max_open{service:yeet-api} * 100 > 90",
name: "[P1] Postgres connection pool > 90%",
message: "Pool exhaustion imminent. Check for long-running queries or connection leaks.",
priority: 1,
});
// P2 — investigate within the hour
const apiLatencyP99 = new datadog.Monitor("api-latency-p2", {
type: "query alert",
query: "avg(last_5m):p99:trace.connectrpc.request.duration{service:yeet-api} > 0.5",
name: "[P2] API P99 latency > 500ms",
message: "Latency regression. Check store-layer decorated metrics for the slow method.",
priority: 2,
});
const redisHitRate = new datadog.Monitor("redis-hitrate-p2", {
type: "query alert",
query: "avg(last_5m):sum:cache.hits{service:yeet-api}.as_rate() / " +
"(sum:cache.hits{service:yeet-api}.as_rate() + " +
"sum:cache.misses{service:yeet-api}.as_rate()) * 100 < 80",
name: "[P2] Redis hit rate < 80%",
message: "Cache effectiveness degraded. Check eviction policy or key expiry.",
priority: 2,
});
const riverQueueGrowing = new datadog.Monitor("river-queue-p2", {
type: "query alert",
query: "avg(last_10m):avg:river.queue.depth{service:yeet-worker} > 100",
name: "[P2] River queue depth growing",
message: "Jobs enqueuing faster than processing. Check worker pod status and job error rate.",
priority: 2,
});
// infra/pulumi/stacks/observability/rum.ts
// RUM — what server metrics can't see
const rumDashboard = new datadog.Dashboard("rum-frontend", {
title: "Eat or Yeet — Real User Monitoring",
widgets: [
// Core Web Vitals — the metrics Google ranks you on
queryValueWidget("LCP (target < 2.5s)",
"p75:rum.largest_contentful_paint{service:yeet-frontend}"
),
queryValueWidget("CLS (target < 0.1)",
"p75:rum.cumulative_layout_shift{service:yeet-frontend}"
),
queryValueWidget("INP (target < 200ms)",
"p75:rum.interaction_to_next_paint{service:yeet-frontend}"
),
// SSR vs hydration — where does the browser spend time?
timeseriesWidget("Page Load Breakdown", [
"avg:rum.time_to_first_byte{service:yeet-frontend}", // SSR
"avg:rum.dom_content_loaded{service:yeet-frontend}", // parse
"avg:rum.load_event{service:yeet-frontend}", // hydration complete
]),
// Connect RPC from the browser's perspective
timeseriesWidget("RPC Latency (client-side)", [
"p95:rum.resource.duration{service:yeet-frontend,resource_type:xhr}",
]),
// JS errors — hydration mismatches, runtime failures
timeseriesWidget("JS Errors", [
"sum:rum.error{service:yeet-frontend}.as_rate()",
]),
// Custom business metrics
timeseriesWidget("User Signals", [
"sum:rum.action{service:yeet-frontend,action_name:recipe_search}.as_rate()",
"sum:rum.action{service:yeet-frontend,action_name:recipe_view}.as_rate()",
"sum:rum.action{service:yeet-frontend,action_name:rating_submit}.as_rate()",
]),
],
});
// RUM alert — Core Web Vitals regression
const lcpRegression = new datadog.Monitor("lcp-regression-p2", {
type: "query alert",
query: "avg(last_15m):p75:rum.largest_contentful_paint{service:yeet-frontend} > 2500",
name: "[P2] LCP > 2.5s — Core Web Vitals failing",
message: "Largest Contentful Paint regressed. Check SSR performance, image loading, and font delivery.",
priority: 2,
});
Every dashboard, monitor, and SLO is a Pulumi resource. pulumi up provisions them. pulumi preview diffs them. No one edits a dashboard in the UI. Changes go through the same plan/review workflow as application code. When someone adds a new decorated interface method, the metrics exist automatically. When someone adds a new RUM action, it shows up in the user signals widget. The observability layer grows with the codebase.
Auth: Decorators over Middleware
The HTTP middleware layer extracts JWT claims from the Authorization header and attaches them to context, but it never rejects requests. It's permissive by design: unauthenticated requests proceed, and per-method decorators enforce what's required. This means public endpoints (recipe listing, search) need zero auth configuration. Only methods that appear in the auth map require a caller.
Two policies cover every case in the app. "authenticate" requires a valid caller — any logged-in user. "authorize_owner:expr" resolves the resource owner via an OwnerResolver interface and checks that the caller is the owner or an admin. The expr is a Go expression evaluated against the method's arguments, so recipe.ID and id both work depending on the method signature.
The result: auth policy is visible in BUILD.bazel files, reviewable in PRs, and enforced at compile time. A new controller method with no auth entry is public by default — an explicit, auditable decision. A method with the wrong policy fails at the ownership check, not silently. And the business logic in RecipeController never imports the auth package.