01 High-level Overview
Turborepo monorepo with three independent workspaces — apps/web (Next.js 15),
apps/api (NestJS 11), apps/worker (BullMQ + Puppeteer) — and a
shared package, packages/shared (DTOs, Zod schemas, types). Every app ships as its own
Docker image and deploys to a separate Fly.io app. Managed services provide Postgres, Redis,
object storage, email, and observability.
Frontend · Fly.io
- apps/web — App Router, standalone build
- Port 5174 inside container
- Rewrites /api/* → API via Fly private network
- Zustand (client state) + TanStack Query (server state)

API · Fly.io
- apps/api — REST under /api/*
- Port 3000; global prefix api
- Tenant isolation via Prisma extension + RLS
- Swagger UI (non-prod)

Worker · Fly.io
- apps/worker — no HTTP surface
- Queues: pdf, scan-upload, sms
- Fly volume autoinspect_worker_data at /repo/uploads
- System Chromium (apk) for PDF rendering

Shared Package
- DTOs + Zod schemas (single source of truth)
- Constants, enums, utils
- Dual build: ESM (dist/) + CJS (dist-cjs/)
- Only path that crosses workspace boundaries
02 Topology Diagram
Logical flow: client → Cloudflare edge → Fly web app → Fly API → Postgres / Valkey / R2. Worker pulls jobs from Valkey, renders PDFs via Chromium, stores artefacts in R2.
03 Runtime — Fly.io
Three Fly apps per environment (staging + production = 6 apps total). Region syd. Images built remotely via flyctl deploy --remote-only. Rolling strategy. Web/API auto-stop with min_machines_running = 1.
| App | Runtime | Ports | HTTP | Resources | Deploy |
|---|---|---|---|---|---|
| autoinspect-web-{env} | Next.js 15 standalone (node apps/web/server.js) | 5174 internal | Public HTTPS · force_https · auto_stop | shared-cpu-1x · 1 GB | Dockerfile.web · API_INTERNAL_URL baked at build |
| autoinspect-api-{env} | NestJS 11 (node dist/src/main) | 3000 internal | Public HTTPS · health /api/health/ready | shared-cpu-1x · 512 MB | Dockerfile.api · release = prisma migrate deploy |
| autoinspect-worker-{env} | Node + BullMQ + Chromium (node dist/main.js) | None (outbound only) | No HTTP surface · pgrep healthcheck | shared-cpu-1x · 1 GB + Fly volume | Dockerfile.worker · apk chromium · SIGTERM 60s drain |
Private networking
- Web → API over Fly 6PN using http://autoinspect-api-<env>.internal:3000
- Baked into Next.js rewrites via the API_INTERNAL_URL build-arg (non-negotiable — sketch below)
- Worker → Redis/Postgres/R2 all over public TLS endpoints (no Fly-internal data stores)
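A minimal sketch of how the rewrite wiring might look in next.config.ts, assuming the build-arg is surfaced as an env var at build time (everything beyond API_INTERNAL_URL itself is illustrative):

```ts
// next.config.ts — illustrative sketch, not the project's actual config.
import type { NextConfig } from 'next';

const nextConfig: NextConfig = {
  output: 'standalone', // produces the node apps/web/server.js entrypoint
  async rewrites() {
    return [
      {
        source: '/api/:path*',
        // In a Fly image this value is fixed at image build time, which is
        // why the --build-arg is non-negotiable; localhost is a dev fallback.
        destination: `${process.env.API_INTERNAL_URL ?? 'http://localhost:3000'}/api/:path*`,
      },
    ];
  },
};

export default nextConfig;
```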
Secrets & config
- flyctl secrets set per app (never committed)
- Local dev loads from root .env via set -a; . ../../.env; set +a
- Prisma auto-generates the client in postinstall (requires DATABASE_URL to be available)
04 Data Layer
Managed Postgres (Supabase, Sydney) is the source of truth. Prisma 6 owns the schema; migrations run as the Fly API release_command before rollout. Every domain table is tenant-scoped and protected by Row Level Security.
Postgres 17 · Supabase SYD
- 25+ Prisma models (Tenant, User, Inspection, ChecklistResponse, DamageMarker, MediaItem, Report, etc.)
- RLS policies enforce tenant_id = current_setting('app.tenant_id')
- Prisma Client Extension auto-injects tenant context on every query (sketch below)
- Full-text search (tsvector) on inspections/vehicles
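The extension pattern, sketched minimally — getTenantId() is a hypothetical accessor over the AsyncLocalStorage context described in §09, while $extends with a $allModels/$allOperations hook is Prisma's real client-extension API:

```ts
// tenant-prisma.extension.ts — illustrative sketch, not the project's actual code.
import { PrismaClient } from '@prisma/client';
import { getTenantId } from './tenant-context'; // hypothetical helper, see §09

export const prisma = new PrismaClient().$extends({
  query: {
    $allModels: {
      // Runs for every operation on every model.
      async $allOperations({ operation, args, query }) {
        const tenantId = getTenantId();
        // Inject the tenant scope on reads and bulk writes (list abridged);
        // models without a tenant_id column would need exclusion, and RLS
        // remains the backstop either way.
        if (['findMany', 'findFirst', 'count', 'updateMany', 'deleteMany'].includes(operation)) {
          (args as any).where = { ...(args as any).where, tenantId };
        }
        return query(args);
      },
    },
  },
});
```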
Redis / Valkey · Upstash
- BullMQ queues (pdf, scan-upload, sms)
- Rate-limit counters (express-rate-limit adapter)
- Short-lived cache (dashboard aggregates, feature flags)
- Queue depth gauge polled and surfaced via /api/health/ready
Object Storage · Cloudflare R2
- Inspection photos, vehicle photos, avatars, branding logos, PDFs
- Key format
{tenantId}/inspections/{id}/reports/{uuid}.pdf S3_PRESIGN_TTL_SECdefault 3600s — URL-only delivery; no direct reads- Separate bucket for backups with 90-day lifecycle (set on bucket)
05 Queues & Background Jobs
The API enqueues jobs via BullMQ; the worker pulls them over an outbound Redis connection. The IMPLEMENTED_QUEUES gate prevents spinning up BullMQ Workers for queues that have no processor yet (avoids burning Upstash quota on retries) — see the sketch after the table.
| Queue | Producer | Processor | Purpose | Notes |
|---|---|---|---|---|
| pdf | reports module | pdf.processor.ts | Render report HTML → PDF, store in R2, email link | Puppeteer + system Chromium; concurrency = 1 |
| scan-upload | media module | scan-upload.processor.ts | Post-upload processing: mime sniff, sharp resize, thumbnails | Uses @aws-sdk/client-s3 |
| sms | notifications | sms.processor.ts | Outbound SMS (future Twilio/ClickSend) | Provider not yet wired |
| email (planned) | notifications | — (not in IMPLEMENTED_QUEUES) | Transactional email via Resend | A6.1 task open |
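A sketch of the gate in a plain worker bootstrap — queue names and the pdf concurrency follow the table above; the processor-file registry and other concurrency values are illustrative:

```ts
// worker bootstrap — illustrative sketch of the IMPLEMENTED_QUEUES gate.
import { Worker } from 'bullmq';
import IORedis from 'ioredis';

// Only queues listed here get a Worker; jobs enqueued to unimplemented
// queues just wait in Redis instead of burning Upstash quota on retries.
const IMPLEMENTED_QUEUES = ['pdf', 'scan-upload', 'sms'] as const;

const connection = new IORedis(process.env.VALKEY_URL!, {
  maxRetriesPerRequest: null, // required by BullMQ workers
});

// Hypothetical registry of sandboxed processor files, keyed by queue name.
const processors: Record<string, string> = {
  pdf: `${__dirname}/pdf.processor.js`,
  'scan-upload': `${__dirname}/scan-upload.processor.js`,
  sms: `${__dirname}/sms.processor.js`,
};

for (const name of IMPLEMENTED_QUEUES) {
  // Sandboxed (file-based) processors keep a Chromium crash from taking the
  // whole process down; pdf runs serially per the table, 4 is an illustrative default.
  new Worker(name, processors[name], {
    connection,
    concurrency: name === 'pdf' ? 1 : 4,
  });
}
```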
06 Edge, DNS & Assets
Cloudflare DNS + Proxy
- Nameservers delegated at registrar (VentraIP / Cloudflare Registrar)
- Orange-cloud proxy in front of Fly .fly.dev hosts via CNAME
- WAF / bot management / TLS termination (Fly also terminates TLS — double TLS)
- DNS for Resend (SPF TXT, DKIM CNAMEs, DMARC TXT, bounce MX)
Cloudflare R2
- Buckets: autoinspect-pro-production, -staging, -backups
- Zero egress fees — critical for photo-heavy dealers
- Private bucket; access only via presigned URLs issued by API
- Lifecycle rules set manually in Cloudflare UI
07 Email & Observability
Resend
- Verified sending domain (SPF/DKIM/DMARC on CF DNS)
- Used by the API directly today; will migrate into the email worker queue (A6.1)
- Templates authored with React Email (planned)
New Relic
- APM on apps/api and apps/worker (newrelic package, newrelic-loader.ts)
- Browser RUM on apps/web (NEXT_PUBLIC_NEW_RELIC_*)
- Config via env — NEW_RELIC_NO_CONFIG_FILE=true
- Pino logs → stdout → Fly → New Relic log ingest
Logging
- nestjs-pino with pino-http — structured JSON logs (wiring sketch below)
- No console.log in committed code (enforced by lint)
- Pretty logs locally (pino-pretty), structured JSON in prod
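Sketch of the wiring using nestjs-pino's standard module options; LOG_LEVEL is an illustrative variable:

```ts
// app.module.ts (excerpt) — illustrative sketch of the logging setup above.
import { Module } from '@nestjs/common';
import { LoggerModule } from 'nestjs-pino';

@Module({
  imports: [
    LoggerModule.forRoot({
      pinoHttp: {
        level: process.env.LOG_LEVEL ?? 'info',
        // Pretty-print locally; in prod emit structured JSON to stdout,
        // which Fly forwards on to New Relic log ingest.
        transport:
          process.env.NODE_ENV !== 'production'
            ? { target: 'pino-pretty', options: { singleLine: true } }
            : undefined,
      },
    }),
  ],
})
export class AppModule {}
```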
08 CI/CD
| Workflow | Trigger | Does |
|---|---|---|
| ci.yml | PR + push to main | install · prisma generate · lint · type-check · security gates (tenant isolation, rate limits) · test · build |
| deploy-staging.yml | push to main | build · scripts/deploy.sh staging (api + worker + web in parallel) |
| deploy-prod.yml | tag v* / manual | manually approved prod deploy · smoke tests against /api/health/* + web→api proxy probe |
| backup-db.yml | weekly cron (Sun 02:00 UTC) | pg_dump → gzip → aws s3 cp to the R2 backups bucket (streaming, no disk) |
| k6-smoke.yml | manual / scheduled | k6 smoke + authenticated load test |
| nightly-regression.yml | nightly | regression suite |
scripts/deploy.sh
- Single source of truth for deploy flags
- Bakes --build-arg API_INTERNAL_URL per env (or web breaks with ECONNREFUSED)
- Parallel targets; exit code aggregates per-target status
- Reads token from /tmp/fly-token.sh locally or FLY_API_TOKEN in CI
Migrations on deploy
- Fly release_command in fly.api.toml runs prisma migrate deploy
- Executed in a one-off VM before rollout; failure aborts the deploy and the previous version keeps serving
- Never modify the database directly — migrations are the only path
09 Key Data Flows
Inspection photo upload
- Client requests a presigned PUT from POST /api/media/presign (sketch below)
- API issues a presigned URL to R2 (TTL 3600 s) and returns the object key
- Client PUTs the binary directly to R2 (bypasses Fly bandwidth)
- Client notifies POST /api/media/:id/finalize → API enqueues scan-upload
- Worker: mime sniff → sharp resize → thumbnails → update MediaItem row
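Steps 1–2 might look like this sketch — the media/ key segment and the function name are illustrative, while the presigner call is the standard @aws-sdk/s3-request-presigner API:

```ts
// media presign (sketch) — illustrative shape of the presign endpoint's core.
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';
import { randomUUID } from 'node:crypto';

const s3 = new S3Client({
  region: 'auto', // R2 convention
  endpoint: process.env.S3_ENDPOINT, // R2 account endpoint; dropped on AWS (§12)
  credentials: {
    accessKeyId: process.env.S3_ACCESS_KEY!,
    secretAccessKey: process.env.S3_SECRET_KEY!,
  },
});

// The key is prefixed with the tenant ID so presigned access stays tenant-scoped.
export async function presignUpload(tenantId: string, inspectionId: string, contentType: string) {
  const key = `${tenantId}/inspections/${inspectionId}/media/${randomUUID()}`; // media/ segment assumed
  const url = await getSignedUrl(
    s3,
    new PutObjectCommand({ Bucket: process.env.S3_BUCKET!, Key: key, ContentType: contentType }),
    { expiresIn: Number(process.env.S3_PRESIGN_TTL_SEC ?? 3600) },
  );
  return { url, key };
}
```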
Customer PDF report
- User clicks "Generate Report" → API enqueues a pdf job with inspection ID + tenant
- Worker fetches data (Prisma with tenant scope), renders HTML via template, Puppeteer → PDF (render sketch below)
- PDF uploaded to R2 at {tenantId}/inspections/{id}/reports/{uuid}.pdf
- Worker creates Report + ReportDelivery rows and triggers email
- Resend sends a share link containing an opaque token; the public viewer hits /api/public/report/:token
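The core of the render step, sketched under the assumption that the worker launches Puppeteer against the apk-installed Chromium via PUPPETEER_EXECUTABLE_PATH (§11):

```ts
// pdf.processor.ts (excerpt, sketch) — HTML → PDF via system Chromium.
import puppeteer from 'puppeteer';

export async function renderPdf(html: string): Promise<Buffer> {
  const browser = await puppeteer.launch({
    // Point at the apk-installed Chromium instead of a bundled download.
    executablePath: process.env.PUPPETEER_EXECUTABLE_PATH,
    args: ['--no-sandbox', '--disable-dev-shm-usage'], // common container flags
  });
  try {
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: 'networkidle0' });
    return Buffer.from(await page.pdf({ format: 'A4', printBackground: true }));
  } finally {
    await browser.close(); // always release Chromium, even on render failure
  }
}
```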
Tenant isolation on every request
- Passport JWT guard verifies the access token; attaches user + tenantId
- TenantContextMiddleware stores the tenant ID in AsyncLocalStorage (middleware sketch below)
- Prisma Client Extension injects where: { tenantId } on every query
- Postgres RLS is a defence-in-depth backstop if the extension ever misses
- Tenant isolation test (test:tenant) runs on every CI build as a security gate
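A sketch of the middleware — the store is opened per request, and the JWT guard (which runs after middleware in Nest's lifecycle) fills in the tenant ID once the token verifies; exact wiring is project-specific:

```ts
// tenant-context.ts — illustrative sketch, not the project's actual wiring.
import { Injectable, NestMiddleware } from '@nestjs/common';
import { AsyncLocalStorage } from 'node:async_hooks';
import type { Request, Response, NextFunction } from 'express';

type TenantStore = { tenantId?: string };
export const tenantAls = new AsyncLocalStorage<TenantStore>();

// Hypothetical accessor consumed by the Prisma extension in §04.
export const getTenantId = (): string | undefined => tenantAls.getStore()?.tenantId;

@Injectable()
export class TenantContextMiddleware implements NestMiddleware {
  use(_req: Request, _res: Response, next: NextFunction) {
    // Everything downstream (guards, handlers) runs inside this async context.
    // The JWT guard sets the value after verification, e.g.:
    //   tenantAls.getStore()!.tenantId = payload.tenantId
    tenantAls.run({}, () => next());
  }
}
```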
10 Security Model
Transport & edge
- TLS at Cloudflare + Fly (end-to-end HTTPS)
- helmet on the API with strict CSP (default-src 'none') — relaxed only in dev
- Refuses to boot if NODE_ENV != production on a deployed origin
- CORS origins from getAllowedWebOrigins()
AuthN / AuthZ
- Passport JWT (access 15 min) + httpOnly refresh cookie (7 d) scoped to /api/auth
- Non-httpOnly ai_session=1 used only by Next.js middleware for flash prevention
- RBAC via @Roles() + RolesGuard on every endpoint (sketch below)
- TOTP 2FA (otplib + qrcode)
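The @Roles()/RolesGuard pair in its standard NestJS shape — a sketch; the role strings and the user.role payload field are assumptions:

```ts
// roles.ts — standard NestJS metadata + guard pattern, sketched.
import { CanActivate, ExecutionContext, Injectable, SetMetadata } from '@nestjs/common';
import { Reflector } from '@nestjs/core';

export const ROLES_KEY = 'roles';
export const Roles = (...roles: string[]) => SetMetadata(ROLES_KEY, roles);

@Injectable()
export class RolesGuard implements CanActivate {
  constructor(private readonly reflector: Reflector) {}

  canActivate(ctx: ExecutionContext): boolean {
    // Handler-level @Roles() overrides class-level metadata.
    const required = this.reflector.getAllAndOverride<string[]>(ROLES_KEY, [
      ctx.getHandler(),
      ctx.getClass(),
    ]);
    if (!required?.length) return true; // endpoint without role metadata
    const { user } = ctx.switchToHttp().getRequest();
    return required.includes(user?.role); // user shape assumed from the JWT guard
  }
}
```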
Rate limiting
- Per-route limiters: login (10/min), refresh (30–60/min), public report (30–60/min), public approval
- express-rate-limit backed by Redis (sketch below)
- Tested in CI as a named security gate
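A sketch of the login limiter, assuming rate-limit-redis as the Redis adapter (the source only says "backed by Redis"):

```ts
// login-limiter.ts — illustrative sketch; adapter choice is an assumption.
import rateLimit from 'express-rate-limit';
import RedisStore from 'rate-limit-redis';
import IORedis from 'ioredis';

const redis = new IORedis(process.env.VALKEY_URL!);

export const loginLimiter = rateLimit({
  windowMs: 60_000,
  limit: Number(process.env.RATE_LIMIT_LOGIN ?? 10), // 10/min per the list above
  standardHeaders: true,
  legacyHeaders: false,
  store: new RedisStore({
    prefix: 'rl:login:', // hypothetical key prefix
    // ioredis exposes raw commands via call(); this is the adapter's contract.
    sendCommand: (...args: string[]) =>
      redis.call(args[0], ...args.slice(1)) as Promise<any>,
  }),
});
```

Keeping the counters in Redis (rather than in-process memory) matters here because the API runs on multiple Fly machines; a per-process limiter would multiply the effective budget.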
Tenant data
- Every domain table has tenant_id — no exceptions
- Prisma extension + Postgres RLS (belt and braces)
- Presigned URLs scoped per-tenant via key prefix
- Audit log records sensitive writes
11 Environment Variables (condensed)
| Category | Keys |
|---|---|
| Core | NODE_ENV, PORT, APP_URL, API_INTERNAL_URL (web build-arg), WEB_PORT, API_PORT |
| Auth | JWT_SECRET, JWT_REFRESH_SECRET, JWT_ACCESS_EXPIRY, JWT_REFRESH_EXPIRY, BCRYPT_ROUNDS |
| Database | DATABASE_URL (Supabase), BACKUP_DATABASE_URL (direct, for pg_dump) |
| Redis | VALKEY_URL, VALKEY_HOST/PORT/PASSWORD/DB |
| Object store | S3_ENDPOINT, S3_ACCESS_KEY, S3_SECRET_KEY, S3_BUCKET, S3_REGION, S3_PRESIGN_TTL_SEC |
| Backups (CI) | R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, R2_ACCOUNT_ID, R2_BACKUPS_BUCKET |
| Rate limits | RATE_LIMIT_LOGIN, RATE_LIMIT_REFRESH, RATE_LIMIT_PUBLIC_REPORT |
| External APIs | NHTSA_API_URL (VIN decode), Resend API key (secret) |
| Observability | NEW_RELIC_LICENSE_KEY, NEW_RELIC_APP_NAME, NEXT_PUBLIC_NEW_RELIC_* |
| Deploy | FLY_API_TOKEN (CI), PUPPETEER_EXECUTABLE_PATH (worker) |
12 Full AWS Migration Plan
Translate the current "Fly + Cloudflare + Supabase + Upstash + Resend" stack into a pure AWS equivalent. The application code requires near-zero changes — Prisma, ioredis, and the AWS S3 SDK already work against managed services. The work is mostly infrastructure, networking, and cost trade-offs.
Component-by-component mapping
| Current | AWS equivalent | Notes |
|---|---|---|
| Supabase Postgres | RDS for PostgreSQL (ap-southeast-2) | RDS Proxy as the PgBouncer replacement · automated backups + PITR · parameter group with rds.force_ssl=1 · Secrets Manager rotation |
| Cloudflare R2 | S3 (ap-southeast-2) | Buckets media-prod, media-staging, backups · Block Public Access on · lifecycle rules for backups (90-day Glacier → expire) · egress is no longer free — mitigate with CloudFront for public artefacts and a Gateway VPC Endpoint for S3 from ECS |
| Resend | SES (ap-southeast-2) | Move out of the sandbox (production access request) · configuration set for bounce/complaint events → SNS → SQS → event handler · DKIM via Route 53 · for richer templating/analytics, keep Resend |
| flyctl secrets | Secrets Manager | Referenced via the secrets block in the ECS task definition |
| flyctl deploy | ECS service deploy | aws ecs update-service (or a CDK deploy) driven from GitHub Actions |
| pg_dump → R2, 90-day lifecycle | EventBridge Scheduled Rule → ECS one-off task → S3 | S3 Object Lock (compliance mode) for ransomware-proof retention |
| Worker (Fargate or Lambda) | Lambda + @sparticuz/chromium on demand | Simpler, cheaper at low volume · shared scratch → EFS mount on Fargate if you need it (rarely necessary — S3 handles everything) |
| express-rate-limit backed by Redis | Keep it; add AWS WAF rate-based rules | Protect /api/auth/login and the public report endpoint before they hit the ALB |
| fly.*.toml files + scripts/deploy.sh | CDK/Terraform | Infrastructure definitions move out of per-app TOML |

Target AWS topology (ap-southeast-2)
Logical flow: client → Route 53 → CloudFront → ALB → ECS Fargate (web + api) → RDS / ElastiCache / S3; the worker service pulls jobs from ElastiCache and writes artefacts to S3.
Migration phases
Phase 1 — Foundation (no traffic)
- Pick an AWS account structure (single account OK for beta; Organizations + SSO for real).
- Write CDK/Terraform: VPC (3 AZs), NAT, S3 gateway endpoint, ECR repos, Secrets, Route 53 zone, ACM certs (us-east-1 for CloudFront).
- Provision RDS Postgres (empty) + ElastiCache Valkey (empty) in private subnets.
- Set up OIDC trust for GitHub Actions → AWS role (no static keys).
Phase 2 — Build + push images
- Duplicate Dockerfiles (already build-clean) and push to ECR per app.
- Author ECS task definitions:
secretsfrom Secrets Manager, CloudWatch log groups, healthcheck matching Fly (/api/health/ready). - Create ECS services behind ALB target groups (web + api). Worker service has no LB.
Phase 3 — Data migration
- Postgres: pg_dump | pg_restore from Supabase into RDS during a brief freeze, or use AWS DMS for near-zero-downtime CDC.
- Redis: no migration needed — BullMQ jobs drain naturally; flip VALKEY_URL at cutover.
- R2 → S3: aws s3 sync (or rclone sync) for existing media; same key layout.
Phase 4 — Cutover
- Freeze writes (maintenance page), run final delta-sync of DB + S3.
- Point the Route 53 record to CloudFront; lower the old Cloudflare DNS TTL 24 h in advance.
- Flip Fly web/api to a read-only 503 response (keeps the old stack paused but instantly restorable).
- Unfreeze. Monitor CloudWatch dashboards + error rates.
Phase 5 — Decommission
- After 7–14 days of clean operation: destroy Fly apps, cancel Supabase + Upstash + R2 (keep backup bucket if still referenced).
- Delete Cloudflare DNS records no longer needed (keep the zone if you want CF WAF in front of CloudFront as a second edge — possible but unusual).
- Tag all AWS resources app=autoinspect, env=prod|staging for cost attribution.
Code changes required
- None in the API — @aws-sdk/client-s3, ioredis, and @prisma/client already target AWS-native services.
- Remove the S3_ENDPOINT env value so the SDK uses the AWS default endpoints (sketch after this list).
- Swap resend calls for @aws-sdk/client-sesv2 if going all-in on SES (otherwise keep Resend).
- Replace flyctl commands in scripts/deploy.sh with aws ecs update-service or a CDK deploy.
- Puppeteer: no change on Fargate. If switching to Lambda, replace Puppeteer with @sparticuz/chromium + puppeteer-core.
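The endpoint switch can be illustrated with a client factory sketch — variable names follow §11; the conditional spreads are illustrative:

```ts
// s3-client.ts (sketch) — the only storage-related difference between R2 and
// AWS is whether a custom endpoint is supplied.
import { S3Client } from '@aws-sdk/client-s3';

export const s3 = new S3Client({
  region: process.env.S3_REGION ?? 'ap-southeast-2',
  // On R2, S3_ENDPOINT points at the account endpoint; unset it after the
  // migration and the SDK resolves the standard AWS S3 endpoint itself.
  ...(process.env.S3_ENDPOINT ? { endpoint: process.env.S3_ENDPOINT } : {}),
  // On ECS, credentials come from the task role; static keys only when set.
  ...(process.env.S3_ACCESS_KEY
    ? {
        credentials: {
          accessKeyId: process.env.S3_ACCESS_KEY,
          secretAccessKey: process.env.S3_SECRET_KEY!,
        },
      }
    : {}),
});
```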
Cost posture (rough, AUD/mo, beta scale)
| Line | Current stack | AWS equivalent | Notes |
|---|---|---|---|
| Compute (3 svc × 2 envs) | ~$30 Fly | $80–150 Fargate | AWS wins on scale, loses on small instances |
| Postgres | $0–40 Supabase | $70–120 RDS Multi-AZ | Multi-AZ doubles cost vs single-AZ Supabase |
| Redis | $0–10 Upstash | $20–50 ElastiCache Serverless | ElastiCache has a minimum baseline |
| Object storage | $0–5 R2 (no egress) | $5–30 S3 + egress | CloudFront + S3 GW endpoint mitigates egress |
| Email | $0–20 Resend | ~$0 SES (+ $0.10/1k) | SES cheapest at volume |
| Observability | $0 New Relic free | $10–40 CloudWatch | Or keep New Relic |
| Edge/DNS/WAF | $0 Cloudflare free | $10–30 CloudFront + WAF | AWS WAF is rule-count priced |
| Total ballpark | $30–100 | $200–400 | AWS is 2–4× at beta; flips around at medium scale |
Gotchas & recommendations
Don't skip
- S3 Gateway VPC Endpoint — without it every S3 call from ECS pays NAT egress.
- RDS Proxy — Fargate tasks churn connections quickly; Proxy behaves like PgBouncer and protects the DB.
- OIDC for GitHub Actions — never paste long-lived AWS keys into repo secrets.
- ACM in us-east-1 for CloudFront certs (CloudFront only reads us-east-1) and ap-southeast-2 for the ALB.
- SES production access — approval can take days; start early.
- Backup restore drill — practice pg_restore from a snapshot before cutover.
Things you can keep
- Cloudflare DNS in front of AWS is fine (orange-cloud → CloudFront). Two WAFs is overkill; pick one.
- Resend for transactional email — saves you SES sandbox/reputation work. Swap later if cost matters.
- New Relic works on ECS without changes. Only migrate to CloudWatch/X-Ray if you want single-vendor billing.
- GitHub Actions — keep as the CI/CD driver; AWS CodePipeline is optional.
- Docker images — unchanged. The Dockerfiles are portable.