AutoInspect PRO — Infrastructure Topology

Current production stack (Fly.io + Cloudflare + Supabase + Upstash + Resend + New Relic) and a full AWS migration plan.

01 High-level Overview

Turborepo monorepo with three independent workspaces — apps/web (Next.js 15), apps/api (NestJS 11), apps/worker (BullMQ + Puppeteer) — and a shared packages/shared (DTOs, Zod schemas, types). Every app ships as its own Docker image and deploys to a separate Fly.io app. Managed services provide Postgres, Redis, object storage, email, and observability.

Frontend Fly.io

Next.js 15 · React 19 · Tailwind v4 · next-intl
  • apps/web — App Router, standalone build
  • Port 5174 inside container
  • Rewrites /api/* → API via Fly private network
  • Zustand (client state) + TanStack Query (server state)

API Fly.io

NestJS 11 · Prisma 6 · Passport JWT · Helmet
  • apps/api — REST under /api/*
  • Port 3000; global prefix api
  • Tenant isolation via Prisma extension + RLS
  • Swagger UI (non-prod)

Worker Fly.io

BullMQ 5 · ioredis · Puppeteer + Chromium
  • apps/worker — no HTTP surface
  • Queues: pdf, scan-upload, sms
  • Fly volume autoinspect_worker_data at /repo/uploads
  • System Chromium (apk) for PDF rendering

Shared Package

@autoinspect/shared
  • DTOs + Zod schemas (single source of truth)
  • Constants, enums, utils
  • Dual build: ESM (dist/) + CJS (dist-cjs/)
  • Only path that crosses workspace boundaries

02 Topology Diagram

Logical flow: client → Cloudflare edge → Fly web app → Fly API → Postgres / Valkey / R2. Worker pulls jobs from Valkey, renders PDFs via Chromium, stores artefacts in R2.

[Diagram] Swimlanes, left to right:
  • Clients: dealership browser (PWA · Next.js client · JWT) and public report viewer (unauthenticated · share token)
  • Cloudflare edge: DNS + WAF (TLS · bot mgmt · proxy) and R2 (S3 API · photos · PDFs · backups)
  • Fly.io syd private network (.internal): autoinspect-web (Next.js :5174 · standalone · rewrites /api/* → api.internal), autoinspect-api (NestJS :3000 · helmet · CORS · rate limiting · RBAC · RLS · /api/health/{live,ready}), autoinspect-worker (BullMQ · Chromium · no HTTP · queues pdf / scan-upload / sms · Fly volume at /repo/uploads); API-side components: S3Service (presigned GET/PUT · sharp), ioredis cache/rate-limit client, BullMQ queues producer
  • Managed data: Supabase Postgres 17 (syd · Prisma · RLS · multi-tenant tenant_id), Upstash Redis (Valkey) (cache · BullMQ queues · rate-limit counters), Cloudflare R2 bucket (media · reports · backups · S3-compatible · zero egress), Prisma migrations via release_command on deploy
  • External services / CI&CD: Resend (transactional email · SPF/DKIM/DMARC on CF DNS), New Relic (APM for api + worker · browser RUM), NHTSA vPIC (public VIN decode), GitHub Actions (CI · deploy-staging · deploy-prod · weekly pg_dump → R2), flyctl + Fly secrets (per-app env injection), Turborepo cache (local + GH Actions), local dev via Docker Compose (postgres + valkey + MinIO S3-compat :9000 + Mailpit SMTP :1025 / UI :8025)
Arrows distinguish runtime request/data paths from control-plane/async integrations; the worker receives jobs via Redis.

03 Runtime — Fly.io

Three Fly apps per environment (staging + production = 6 apps total). Region syd. Images built remotely via flyctl deploy --remote-only. Rolling strategy. Web/API auto-stop with min_machines_running = 1.

| App | Runtime | Ports | HTTP | Resources | Deploy |
| --- | --- | --- | --- | --- | --- |
| autoinspect-web-{env} | Next.js 15 standalone (node apps/web/server.js) | 5174 internal | Public HTTPS · force_https · auto_stop | shared-cpu-1x · 1 GB | Dockerfile.web · API_INTERNAL_URL baked at build |
| autoinspect-api-{env} | NestJS 11 (node dist/src/main) | 3000 internal | Public HTTPS · health /api/health/ready | shared-cpu-1x · 512 MB | Dockerfile.api · release = prisma migrate deploy |
| autoinspect-worker-{env} | Node + BullMQ + Chromium (node dist/main.js) | None (outbound only) | No HTTP surface · pgrep healthcheck | shared-cpu-1x · 1 GB + Fly volume | Dockerfile.worker · apk chromium · SIGTERM 60s drain |

Private networking

  • Web → API over Fly 6PN using http://autoinspect-api-<env>.internal:3000
  • Baked into Next.js rewrites via API_INTERNAL_URL build-arg (non-negotiable)
  • Worker → Redis/Postgres/R2 all over public TLS endpoints (no Fly-internal data stores)
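The rewrite wiring can be sketched as a Next.js config fragment. This is illustrative, not the project's actual file: the env name API_INTERNAL_URL matches the build-arg above, but the fallback host is hypothetical.

```typescript
// Sketch of the Next.js rewrite that proxies /api/* over Fly 6PN.
// API_INTERNAL_URL is baked in at image build time; the fallback below
// is illustrative only.
const apiInternalUrl =
  process.env.API_INTERNAL_URL ?? 'http://autoinspect-api-staging.internal:3000';

const nextConfig = {
  output: 'standalone' as const,
  async rewrites() {
    return [
      { source: '/api/:path*', destination: `${apiInternalUrl}/api/:path*` },
    ];
  },
};
```

In the real repo an object like this is the default export of the web app's Next.js config. If the build-arg is missing, every proxied call fails with ECONNREFUSED, which is why deploy.sh treats it as mandatory.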

Secrets & config

  • flyctl secrets set per app (never committed)
  • Local dev loads from root .env via set -a; . ../../.env; set +a
  • Prisma auto-generates in postinstall (requires DATABASE_URL available)

04 Data Layer

Managed Postgres (Supabase, Sydney) is the source of truth. Prisma 6 owns the schema; migrations run as the Fly API release_command before rollout. Every domain table is tenant-scoped and protected by Row Level Security.

Postgres 17 · Supabase SYD

DATABASE_URL · PgBouncer pooled + direct
  • 25+ Prisma models (Tenant, User, Inspection, ChecklistResponse, DamageMarker, MediaItem, Report, etc.)
  • RLS policies enforce tenant_id = current_setting('app.tenant_id')
  • Prisma Client Extension auto-injects tenant context on every query
  • Full-text search (tsvector) on inspections/vehicles
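A minimal sketch of the injection step, written as a pure function rather than the actual Prisma extension API (the real extension reads the tenant from request context):

```typescript
// Merge a tenant filter into Prisma-style query args. The Client Extension
// applies the equivalent of this to every model query before it reaches
// Postgres, with RLS as the backstop if it ever misses.
type QueryArgs = { where?: Record<string, unknown>; [k: string]: unknown };

function withTenantScope(args: QueryArgs, tenantId: string): QueryArgs {
  return { ...args, where: { ...(args.where ?? {}), tenantId } };
}
```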

Redis / Valkey · Upstash

VALKEY_URL · ioredis
  • BullMQ queues (pdf, scan-upload, sms)
  • Rate-limit counters (express-rate-limit adapter)
  • Short-lived cache (dashboard aggregates, feature flags)
  • Queue depth gauge polled and surfaced via /api/health/ready

Object Storage · Cloudflare R2

S3-compatible · presigned GET
  • Inspection photos, vehicle photos, avatars, branding logos, PDFs
  • Key format {tenantId}/inspections/{id}/reports/{uuid}.pdf
  • S3_PRESIGN_TTL_SEC default 3600s — URL-only delivery; no direct reads
  • Separate bucket for backups with 90-day lifecycle (set on bucket)
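The documented key layout can be sketched as a helper (the function name is hypothetical; the real S3Service may build keys differently):

```typescript
import { randomUUID } from 'node:crypto';

// Builds a key of the form {tenantId}/inspections/{id}/reports/{uuid}.pdf.
// Prefixing every key with the tenant ID is what lets presigned URLs stay
// tenant-scoped.
function reportKey(tenantId: string, inspectionId: string): string {
  return `${tenantId}/inspections/${inspectionId}/reports/${randomUUID()}.pdf`;
}
```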

05 Queues & Background Jobs

API enqueues jobs via BullMQ; worker pulls them over the outbound Redis connection. IMPLEMENTED_QUEUES gate prevents spinning up Workers for queues without processors (avoids burning Upstash quota on retries).
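The gate can be sketched as follows; the names here are assumptions based on the description above, not the actual worker code:

```typescript
// Only queues that both appear in IMPLEMENTED_QUEUES and have a registered
// processor get a BullMQ Worker; anything else is skipped so undeliverable
// jobs never spin-retry against the Upstash request quota.
const IMPLEMENTED_QUEUES = ['pdf', 'scan-upload', 'sms'] as const;

type Processor = (jobData: unknown) => Promise<void>;

function workersToStart(processors: Record<string, Processor>): string[] {
  return Object.keys(processors).filter((name) =>
    (IMPLEMENTED_QUEUES as readonly string[]).includes(name),
  );
}
```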

| Queue | Producer | Processor | Purpose | Notes |
| --- | --- | --- | --- | --- |
| pdf | reports module | pdf.processor.ts | Render report HTML → PDF, store in R2, email link | Puppeteer + system Chromium; concurrency=1 |
| scan-upload | media module | scan-upload.processor.ts | Post-upload processing: mime sniff, sharp resize, thumbnails | Uses @aws-sdk/client-s3 |
| sms | notifications | sms.processor.ts | Outbound SMS (future Twilio/ClickSend) | Provider not yet wired |
| email (planned) | notifications | — (not in IMPLEMENTED_QUEUES) | Transactional email via Resend | A6.1 task open |

06 Edge, DNS & Assets

Cloudflare DNS + Proxy

  • Nameservers delegated at registrar (VentraIP / Cloudflare Registrar)
  • Orange-cloud proxy in front of Fly .fly.dev hosts via CNAME
  • WAF / bot mgmt / TLS termination (Fly also terminates TLS — double TLS)
  • DNS for Resend (SPF TXT, DKIM CNAMEs, DMARC TXT, bounce MX)

Cloudflare R2

  • Buckets: autoinspect-pro-production, -staging, -backups
  • Zero egress — critical for photo-heavy dealers
  • Private bucket; access only via presigned URLs issued by API
  • Lifecycle rules set manually in Cloudflare UI

07 Email & Observability

Resend

  • Verified sending domain (SPF/DKIM/DMARC on CF DNS)
  • Used by API directly today; will migrate into email worker queue (A6.1)
  • Templates authored with React Email (planned)

New Relic

  • APM on apps/api and apps/worker (newrelic package, newrelic-loader.ts)
  • Browser RUM on apps/web (NEXT_PUBLIC_NEW_RELIC_*)
  • Config via env — NEW_RELIC_NO_CONFIG_FILE=true
  • Pino logs → stdout → Fly → New Relic log ingest

Logging

  • nestjs-pino with pino-http — structured JSON logs
  • No console.log in committed code (enforced by lint)
  • Pretty logs locally (pino-pretty), structured in prod

08 CI/CD

| Workflow | Trigger | Does |
| --- | --- | --- |
| ci.yml | PR + push to main | install · prisma generate · lint · type-check · security gates (tenant isolation, rate limits) · test · build |
| deploy-staging.yml | push to main | build · scripts/deploy.sh staging (api + worker + web in parallel) |
| deploy-prod.yml | tag v* / manual | manual-approved prod deploy · smoke tests against /api/health/* + web→api proxy probe |
| backup-db.yml | weekly cron (Sun 02:00 UTC) | pg_dump → gzip → aws s3 cp to R2 backups bucket (streaming, no disk) |
| k6-smoke.yml | manual / scheduled | k6 smoke + authenticated load test |
| nightly-regression.yml | nightly | regression suite |

scripts/deploy.sh

  • Single source of truth for deploy flags
  • Bakes --build-arg API_INTERNAL_URL per env (or web breaks with ECONNREFUSED)
  • Parallel targets; exit code aggregates per-target status
  • Reads token from /tmp/fly-token.sh locally or FLY_API_TOKEN in CI

Migrations on deploy

  • Fly release_command on fly.api.toml runs prisma migrate deploy
  • Executed in a one-off VM before rollout; failure aborts deploy, previous version keeps serving
  • Never modify the database directly — migrations are the only path

09 Key Data Flows

Inspection photo upload
  1. Client requests presigned PUT from POST /api/media/presign
  2. API issues presigned URL to R2 (TTL 3600s) and returns object key
  3. Client PUTs binary directly to R2 (bypasses Fly bandwidth)
  4. Client notifies POST /api/media/:id/finalize → API enqueues scan-upload
  5. Worker: mime sniff → sharp resize → thumbnails → update MediaItem row
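Step 5's mime sniff can be sketched with standard magic-byte checks. The signatures below are the standard JPEG/PNG/WebP headers; the actual processor may instead use a library such as file-type.

```typescript
// Identify an image by magic bytes instead of trusting the client's
// Content-Type header (an uploaded "photo.jpg" could be anything).
function sniffImageMime(buf: Uint8Array): string | null {
  // JPEG: FF D8 FF
  if (buf.length >= 3 && buf[0] === 0xff && buf[1] === 0xd8 && buf[2] === 0xff) {
    return 'image/jpeg';
  }
  // PNG: 89 50 4E 47
  if (buf.length >= 4 && buf[0] === 0x89 && buf[1] === 0x50 && buf[2] === 0x4e && buf[3] === 0x47) {
    return 'image/png';
  }
  // WebP: RIFF....WEBP ("WEBP" at byte offset 8)
  if (buf.length >= 12 && buf[8] === 0x57 && buf[9] === 0x45 && buf[10] === 0x42 && buf[11] === 0x50) {
    return 'image/webp';
  }
  return null;
}
```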
Customer PDF report
  1. User clicks "Generate Report" → API enqueues pdf job with inspection ID + tenant
  2. Worker fetches data (Prisma with tenant scope), renders HTML via template, Puppeteer → PDF
  3. PDF uploaded to R2 at {tenantId}/inspections/{id}/reports/{uuid}.pdf
  4. Worker creates Report + ReportDelivery rows and triggers email
  5. Resend sends a share link containing an opaque token; public viewer hits /api/public/report/:token
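One plausible shape for the opaque token in step 5 (hypothetical; the source does not specify how tokens are stored): issue a random value, persist only its hash, and embed the raw value in the share link.

```typescript
import { randomBytes, createHash } from 'node:crypto';

// Generate an opaque, unguessable token plus the hash that would be stored
// on the Report/ReportDelivery row. The public viewer presents the raw
// token; the API hashes it and looks up the row.
function issueShareToken(): { token: string; tokenHash: string } {
  const token = randomBytes(32).toString('base64url'); // 256 bits of entropy
  const tokenHash = createHash('sha256').update(token).digest('hex');
  return { token, tokenHash };
}
```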
Tenant isolation on every request
  1. Passport JWT guard verifies access token; attaches user + tenantId
  2. TenantContextMiddleware stores tenant ID in AsyncLocalStorage
  3. Prisma Client Extension injects where: { tenantId } on every query
  4. Postgres RLS is a defence-in-depth backstop if the extension ever misses
  5. Tenant isolation test (test:tenant) runs on every CI build as a security gate
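Steps 2 and 3 rely on Node's AsyncLocalStorage. A minimal sketch of the mechanism (function names are illustrative, not the middleware's real API):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Each request runs inside tenantContext.run(...), so any code on that
// async path (including the Prisma extension) can read the tenant ID
// without it being threaded through every call.
const tenantContext = new AsyncLocalStorage<{ tenantId: string }>();

function runWithTenant<T>(tenantId: string, fn: () => T): T {
  return tenantContext.run({ tenantId }, fn);
}

function currentTenantId(): string {
  const store = tenantContext.getStore();
  if (!store) throw new Error('No tenant context: request was not scoped');
  return store.tenantId;
}
```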

10 Security Model

Transport & edge

  • TLS at Cloudflare + Fly (end-to-end HTTPS)
  • helmet on API with strict CSP (default-src 'none') — relaxed only in dev
  • API refuses to boot on a deployed origin unless NODE_ENV=production
  • CORS origins from getAllowedWebOrigins()

AuthN / AuthZ

  • Passport JWT (access 15m) + httpOnly refresh cookie (7d) scoped to /api/auth
  • Non-httpOnly ai_session=1 used only by Next.js middleware for flash prevention
  • RBAC via @Roles() + RolesGuard on every endpoint
  • TOTP 2FA (otplib + qrcode)

Rate limiting

  • Per-route limiters: login (10/min), refresh (30-60/min), public report (30-60/min), public approval
  • express-rate-limit backed by Redis
  • Tested in CI as a named security gate
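The counters behave like a fixed window per key. An in-memory sketch follows; the production limiter is express-rate-limit with a Redis store, so counters survive restarts and are shared across machines:

```typescript
// Fixed-window limiter: the first hit in a window resets the count, and
// requests beyond `limit` within `windowMs` are rejected. Keyed by
// e.g. client IP or user ID.
class FixedWindowLimiter {
  private hits = new Map<string, { windowStart: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now = Date.now()): boolean {
    const entry = this.hits.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { windowStart: now, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}
```

With the documented login limit this would be `new FixedWindowLimiter(10, 60_000)` keyed by client IP.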

Tenant data

  • Every domain table has tenant_id — no exceptions
  • Prisma extension + Postgres RLS (belt + braces)
  • Presigned URLs scoped per-tenant via key prefix
  • Audit log records sensitive writes

11 Environment Variables (condensed)

| Category | Keys |
| --- | --- |
| Core | NODE_ENV, PORT, APP_URL, API_INTERNAL_URL (web build-arg), WEB_PORT, API_PORT |
| Auth | JWT_SECRET, JWT_REFRESH_SECRET, JWT_ACCESS_EXPIRY, JWT_REFRESH_EXPIRY, BCRYPT_ROUNDS |
| Database | DATABASE_URL (Supabase), BACKUP_DATABASE_URL (direct, for pg_dump) |
| Redis | VALKEY_URL, VALKEY_HOST/PORT/PASSWORD/DB |
| Object store | S3_ENDPOINT, S3_ACCESS_KEY, S3_SECRET_KEY, S3_BUCKET, S3_REGION, S3_PRESIGN_TTL_SEC |
| Backups (CI) | R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, R2_ACCOUNT_ID, R2_BACKUPS_BUCKET |
| Rate limits | RATE_LIMIT_LOGIN, RATE_LIMIT_REFRESH, RATE_LIMIT_PUBLIC_REPORT |
| External APIs | NHTSA_API_URL (VIN decode), Resend API key (secret) |
| Observability | NEW_RELIC_LICENSE_KEY, NEW_RELIC_APP_NAME, NEXT_PUBLIC_NEW_RELIC_* |
| Deploy | FLY_API_TOKEN (CI), PUPPETEER_EXECUTABLE_PATH (worker) |
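A fail-fast boot check over the core keys can be sketched as below. The real apps validate env with the Zod schemas in @autoinspect/shared; this list of required keys is a subset chosen for illustration.

```typescript
// Return the names of required variables that are missing or empty, so the
// process can refuse to start with a clear error instead of failing later.
const REQUIRED_VARS = ['DATABASE_URL', 'VALKEY_URL', 'JWT_SECRET', 'S3_BUCKET'] as const;

function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_VARS.filter((key) => !env[key]);
}
```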

12 Full AWS Migration Plan

Translate the current "Fly + Cloudflare + Supabase + Upstash + Resend" stack into a pure AWS equivalent. The application code requires near-zero changes — Prisma, ioredis, and the AWS S3 SDK already work against managed services. The work is mostly infrastructure, networking, and cost trade-offs.

Component-by-component mapping

Runtime
  Now: Fly.io · 3 apps × 2 envs, Docker images, built-in TLS, private 6PN
  AWS: ECS Fargate behind an ALB. One service per workspace (web, api, worker); two clusters (staging/prod) or one cluster with env-scoped services. One task definition per image. Worker runs as a headless service (no target group).
Postgres
  Now: Supabase Postgres 17 in SYD, PgBouncer pooled, PITR on the paid plan
  AWS: Amazon RDS for PostgreSQL 17, Multi-AZ in ap-southeast-2. RDS Proxy replaces PgBouncer. Automated backups + PITR. Parameter group with rds.force_ssl=1. Secrets Manager rotation.
Redis / Valkey
  Now: Upstash Redis (Valkey) · serverless, TLS, pay-per-request
  AWS: Amazon ElastiCache for Valkey (Serverless) or Redis OSS. Private subnets only; security group admits the ECS tasks. Encryption in transit and at rest.
Object storage
  Now: Cloudflare R2 · zero egress, S3 API, three buckets
  AWS: Amazon S3 (ap-southeast-2) buckets: media-prod, media-staging, backups. Block Public Access on. Lifecycle rules for backups (90-day Glacier → expire). Egress is no longer free — mitigate with CloudFront for public artefacts and a Gateway VPC Endpoint for S3 traffic from ECS.
CDN / WAF / DNS
  Now: Cloudflare DNS + proxy + WAF
  AWS: Route 53 (DNS) + CloudFront (CDN in front of ALB and S3) + AWS WAF + ACM (TLS certs: us-east-1 for CloudFront, ap-southeast-2 for the ALB). AWS Shield Standard is included.
Email
  Now: Resend (SPF/DKIM/DMARC)
  AWS: Amazon SES in ap-southeast-2. Request production access to leave the sandbox. Configuration set for bounce/complaint events → SNS → SQS → event handler. DKIM via Route 53. For richer templating/analytics, keep Resend.
Observability
  Now: New Relic APM + browser RUM
  AWS: Option A (keep): New Relic works fine on ECS, no changes. Option B (full AWS): CloudWatch Logs + X-Ray (ADOT instrumentation for NestJS) + CloudWatch RUM. Container Insights for ECS metrics.
Secrets & config
  Now: flyctl secrets
  AWS: Secrets Manager (JWT, DB URL, API keys) + SSM Parameter Store (non-secret config). Injected into Fargate via the secrets block in the task definition.
CI/CD
  Now: GitHub Actions → flyctl deploy
  AWS: Keep GitHub Actions. Build images, push to Amazon ECR, update the ECS service with a new task definition. Use OIDC federation (no long-lived AWS keys). Optional: mirror into CodePipeline + CodeDeploy for blue/green.
Backups
  Now: Weekly GH Action: pg_dump → R2, 90-day lifecycle
  AWS: RDS automated snapshots + PITR (primary). Keep the pg_dump cron as an EventBridge scheduled rule → one-off ECS task → S3 with Object Lock (compliance mode) for ransomware-proof retention.
PDF worker (Chromium)
  Now: Fly worker with apk chromium + Fly volume
  AWS: Option A: Fargate task with the same Dockerfile; works today. Option B: AWS Lambda container image + @sparticuz/chromium on demand; simpler and cheaper at low volume. For shared scratch, mount EFS on Fargate if needed (rarely necessary; S3 handles the artefacts).
Queues
  Now: BullMQ on Upstash
  AWS: Keep BullMQ on ElastiCache Valkey (minimal change). Native alternative: SQS standard queues + EventBridge for scheduling. SQS gives you DLQs and visibility-timeout semantics for free, but requires rewriting producers and consumers.
Rate limiting
  Now: express-rate-limit backed by Redis
  AWS: Keep the application-level limiters. Add AWS WAF rate-based rules at the edge (CloudFront) to protect /api/auth/login and the public report endpoint before traffic reaches the ALB.
IaC
  Now: fly.*.toml files + scripts/deploy.sh
  AWS: AWS CDK (TypeScript) or Terraform. One stack per environment: VPC, subnets, ALB, ECS services, RDS, ElastiCache, S3, CloudFront, Route 53, Secrets. The Dockerfiles are reused untouched.

Target AWS topology (ap-southeast-2)

[Diagram] Edge (global): Route 53 (DNS, alias to CloudFront) → CloudFront + AWS WAF (TLS via ACM · rate-based rules), ACM certificates (us-east-1 + ap-southeast-2), AWS Shield Standard (DDoS baseline). Inside the ap-southeast-2 VPC: public subnets (2 AZ) hold the Application Load Balancer (HTTPS :443 · health checks); private ECS Fargate subnets run the web (Next.js, target group), api (NestJS, target group) and worker (BullMQ, no LB) services, pulling web/api/worker images from Amazon ECR, with Secrets Manager (JWT · DB · API keys) and Parameter Store (non-secret config); private data subnets hold RDS Postgres 17 (Multi-AZ · RDS Proxy), ElastiCache Valkey Serverless, and optional EFS worker scratch. Regional services: Amazon S3 (media · reports · backups, reached via an S3 VPC gateway endpoint so no NAT egress for S3), Amazon SES (SPF/DKIM via Route 53), CloudWatch + X-Ray (logs · metrics · traces), EventBridge Scheduler (weekly backups · digests), optional Lambda for on-demand Chromium PDFs.

Migration phases

Phase 1 — Foundation (no traffic)

  • Pick an AWS account structure (single account OK for beta; Organizations + SSO for real).
  • Write CDK/Terraform: VPC (3 AZs), NAT, S3 gateway endpoint, ECR repos, Secrets, Route 53 zone, ACM certs (us-east-1 for CloudFront).
  • Provision RDS Postgres (empty) + ElastiCache Valkey (empty) in private subnets.
  • Set up OIDC trust for GitHub Actions → AWS role (no static keys).

Phase 2 — Build + push images

  • Duplicate Dockerfiles (already build-clean) and push to ECR per app.
  • Author ECS task definitions: secrets from Secrets Manager, CloudWatch log groups, healthcheck matching Fly (/api/health/ready).
  • Create ECS services behind ALB target groups (web + api). Worker service has no LB.

Phase 3 — Data migration

  • Postgres: pg_dump | pg_restore from Supabase into RDS during a brief freeze, or use AWS DMS for near-zero-downtime CDC.
  • Redis: no migration needed — BullMQ jobs drain naturally; flip VALKEY_URL at cutover.
  • R2 → S3: aws s3 sync (or rclone sync) for existing media; same key layout.

Phase 4 — Cutover

  • Freeze writes (maintenance page), run final delta-sync of DB + S3.
  • Point Route 53 at CloudFront; the old Cloudflare DNS TTL is lowered 24 h in advance.
  • Flip Fly web/api to serve maintenance 503s (keeps the old stack paused but instantly restorable).
  • Unfreeze. Monitor CloudWatch dashboards + error rates.

Phase 5 — Decommission

  • After 7–14 days of clean operation: destroy Fly apps, cancel Supabase + Upstash + R2 (keep backup bucket if still referenced).
  • Delete Cloudflare DNS records no longer needed (keep the zone if you want CF WAF in front of CloudFront as a second edge — possible but unusual).
  • Tag all AWS resources app=autoinspect, env=prod|staging for cost attribution.

Code changes required

  • None in the API — @aws-sdk/client-s3, ioredis, and @prisma/client already target AWS-native services.
  • Remove the S3_ENDPOINT env value so the SDK uses AWS default endpoints.
  • Swap resend calls for @aws-sdk/client-sesv2 if going all-in on SES (otherwise keep Resend).
  • Replace flyctl commands in scripts/deploy.sh with aws ecs update-service or a CDK deploy.
  • Puppeteer: no change on Fargate. If switching to Lambda, replace Puppeteer with @sparticuz/chromium + puppeteer-core.

Cost posture (rough, AUD/mo, beta scale)

| Line | Current stack | AWS equivalent | Notes |
| --- | --- | --- | --- |
| Compute (3 svc × 2 envs) | ~$30 Fly | $80–150 Fargate | AWS wins at scale, loses on small instances |
| Postgres | $0–40 Supabase | $70–120 RDS Multi-AZ | Multi-AZ doubles cost vs single-AZ Supabase |
| Redis | $0–10 Upstash | $20–50 ElastiCache Serverless | ElastiCache has a minimum baseline |
| Object storage | $0–5 R2 (no egress) | $5–30 S3 + egress | CloudFront + S3 GW endpoint mitigates egress |
| Email | $0–20 Resend | ~$0 SES (+ $0.10/1k) | SES cheapest at volume |
| Observability | $0 New Relic free | $10–40 CloudWatch | Or keep New Relic |
| Edge/DNS/WAF | $0 Cloudflare free | $10–30 CloudFront + WAF | AWS WAF is rule-count priced |
| Total ballpark | $30–100 | $200–400 | AWS is 2–4× at beta; the gap closes, then flips, at medium scale |

Gotchas & recommendations

Don't skip

  • S3 Gateway VPC Endpoint — without it every S3 call from ECS pays NAT egress.
  • RDS Proxy — Fargate tasks churn connections quickly; Proxy behaves like PgBouncer and protects the DB.
  • OIDC for GitHub Actions — never paste long-lived AWS keys into repo secrets.
  • ACM in us-east-1 for CloudFront certs (CloudFront only reads us-east-1) and ap-southeast-2 for the ALB.
  • SES production access — approval can take days; start early.
  • Backup restore drill — practice pg_restore from a snapshot before cutover.

Things you can keep

  • Cloudflare DNS in front of AWS is fine (orange-cloud → CloudFront). Two WAFs is overkill; pick one.
  • Resend for transactional email — saves you SES sandbox/reputation work. Swap later if cost matters.
  • New Relic works on ECS without changes. Only migrate to CloudWatch/X-Ray if you want single-vendor billing.
  • GitHub Actions — keep as the CI/CD driver; AWS CodePipeline is optional.
  • Docker images — unchanged. The Dockerfiles are portable.