AutoInspect PRO — Infrastructure Topology

Current production stack (Fly.io + Cloudflare + Supabase + Upstash + Resend + New Relic) and a full AWS migration plan.

01 High-level Overview

Turborepo monorepo with three independent workspaces — apps/web (Next.js 15), apps/api (NestJS 11), apps/worker (BullMQ + Puppeteer) — and a shared packages/shared (DTOs, Zod schemas, types). Every app ships as its own Docker image and deploys to a separate Fly.io app. Managed services provide Postgres, Redis, object storage, email, and observability.

Frontend Fly.io

Next.js 15 · React 19 · Tailwind v4 · next-intl
  • apps/web — App Router, standalone build
  • Port 5174 inside container
  • Rewrites /api/* → API via Fly private network
  • Zustand (client state) + TanStack Query (server state)

API Fly.io

NestJS 11 · Prisma 6 · Passport JWT · Helmet
  • apps/api — REST under /api/*
  • Port 3000; global prefix api
  • Tenant isolation via Prisma extension + RLS
  • Swagger UI (non-prod)

Worker Fly.io

BullMQ 5 · ioredis · Puppeteer + Chromium
  • apps/worker — no HTTP surface
  • Queues: pdf, scan-upload, sms
  • Fly volume autoinspect_worker_data at /repo/uploads
  • System Chromium (apk) for PDF rendering

Shared Package

@autoinspect/shared
  • DTOs + Zod schemas (single source of truth)
  • Constants, enums, utils
  • Dual build: ESM (dist/) + CJS (dist-cjs/)
  • Only path that crosses workspace boundaries

02 Topology Diagram

Logical flow: client → Cloudflare edge → Fly web app → Fly API → Postgres / Valkey / R2. Worker pulls jobs from Valkey, renders PDFs via Chromium, stores artefacts in R2.

[Diagram] Swimlanes, left to right:
  • Clients: dealership browser (PWA · Next.js client · JWT) and public report viewer (unauthenticated · share token)
  • Cloudflare edge: DNS + WAF (TLS · bot mgmt · proxy) and R2 (S3 API · photos · PDFs · backups)
  • Fly.io syd private network (.internal): autoinspect-web (Next.js :5174 · standalone · rewrites /api/* → api.internal), autoinspect-api (NestJS :3000 · helmet · CORS · rate limiting · RBAC · RLS · /api/health/{live,ready}), autoinspect-worker (BullMQ · Chromium · no HTTP · queues pdf / scan-upload / sms · Fly volume at /repo/uploads); API-side components: S3Service (presigned GET/PUT · sharp), ioredis cache/rate-limit client, BullMQ queues producer
  • Managed data: Supabase Postgres 17 (syd · Prisma · RLS · multi-tenant tenant_id), Upstash Redis (Valkey) (cache · BullMQ queues · rate-limit counters), Cloudflare R2 bucket (media · reports · backups · S3-compatible · zero egress), Prisma migrations via release_command on deploy
  • External services / CI&CD: Resend (transactional email · SPF/DKIM/DMARC on CF DNS), New Relic (APM for api + worker · browser RUM), NHTSA vPIC (public VIN decode), GitHub Actions (CI · deploy-staging · deploy-prod · weekly pg_dump → R2), flyctl + Fly secrets (per-app env injection), Turborepo cache (local + GH Actions), local dev via Docker Compose (postgres + valkey + MinIO S3-compat :9000 + Mailpit SMTP :1025 / UI :8025)
Arrows distinguish runtime request/data paths from control-plane/async integrations; the worker receives jobs via Redis.

03 Runtime — Fly.io

Three Fly apps per environment (staging + production = 6 apps total). Region syd. Images built remotely via flyctl deploy --remote-only. Rolling strategy. Web/API auto-stop with min_machines_running = 1.

| App | Runtime | Ports | HTTP | Resources | Deploy |
| --- | --- | --- | --- | --- | --- |
| autoinspect-web-{env} | Next.js 15 standalone (node apps/web/server.js) | 5174 internal | Public HTTPS · force_https · auto_stop | shared-cpu-1x · 1 GB | Dockerfile.web · API_INTERNAL_URL baked at build |
| autoinspect-api-{env} | NestJS 11 (node dist/src/main) | 3000 internal | Public HTTPS · health /api/health/ready | shared-cpu-1x · 512 MB | Dockerfile.api · release = prisma migrate deploy |
| autoinspect-worker-{env} | Node + BullMQ + Chromium (node dist/main.js) | None (outbound only) | No HTTP surface · pgrep healthcheck | shared-cpu-1x · 1 GB + Fly volume | Dockerfile.worker · apk chromium · SIGTERM 60s drain |

Private networking

  • Web → API over Fly 6PN using http://autoinspect-api-<env>.internal:3000
  • Baked into Next.js rewrites via API_INTERNAL_URL build-arg (non-negotiable)
  • Worker → Redis/Postgres/R2 all over public TLS endpoints (no Fly-internal data stores)
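The rewrite wiring can be sketched as a Next.js config fragment. This is illustrative, not the project's actual file: the env name API_INTERNAL_URL matches the build-arg above, but the fallback host is hypothetical.

```typescript
// Sketch of the Next.js rewrite that proxies /api/* over Fly 6PN.
// API_INTERNAL_URL is baked in at image build time; the fallback below
// is illustrative only.
const apiInternalUrl =
  process.env.API_INTERNAL_URL ?? 'http://autoinspect-api-staging.internal:3000';

const nextConfig = {
  output: 'standalone' as const,
  async rewrites() {
    return [
      { source: '/api/:path*', destination: `${apiInternalUrl}/api/:path*` },
    ];
  },
};
```

In the real repo an object like this is the default export of the web app's Next.js config. If the build-arg is missing, every proxied call fails with ECONNREFUSED, which is why deploy.sh treats it as mandatory.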

Secrets & config

  • flyctl secrets set per app (never committed)
  • Local dev loads from root .env via set -a; . ../../.env; set +a
  • Prisma auto-generates in postinstall (requires DATABASE_URL available)

04 Data Layer

Managed Postgres (Supabase, Sydney) is the source of truth. Prisma 6 owns the schema; migrations run as the Fly API release_command before rollout. Every domain table is tenant-scoped and protected by Row Level Security.

Postgres 17 · Supabase SYD

DATABASE_URL · PgBouncer pooled + direct
  • 25+ Prisma models (Tenant, User, Inspection, ChecklistResponse, DamageMarker, MediaItem, Report, etc.)
  • RLS policies enforce tenant_id = current_setting('app.tenant_id')
  • Prisma Client Extension auto-injects tenant context on every query
  • Full-text search (tsvector) on inspections/vehicles
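A minimal sketch of the injection step, written as a pure function rather than the actual Prisma extension API (the real extension reads the tenant from request context):

```typescript
// Merge a tenant filter into Prisma-style query args. The Client Extension
// applies the equivalent of this to every model query before it reaches
// Postgres, with RLS as the backstop if it ever misses.
type QueryArgs = { where?: Record<string, unknown>; [k: string]: unknown };

function withTenantScope(args: QueryArgs, tenantId: string): QueryArgs {
  return { ...args, where: { ...(args.where ?? {}), tenantId } };
}
```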

Redis / Valkey · Upstash

VALKEY_URL · ioredis
  • BullMQ queues (pdf, scan-upload, sms)
  • Rate-limit counters (express-rate-limit adapter)
  • Short-lived cache (dashboard aggregates, feature flags)
  • Queue depth gauge polled and surfaced via /api/health/ready

Object Storage · Cloudflare R2

S3-compatible · presigned GET
  • Inspection photos, vehicle photos, avatars, branding logos, PDFs
  • Key format {tenantId}/inspections/{id}/reports/{uuid}.pdf
  • S3_PRESIGN_TTL_SEC default 3600s — URL-only delivery; no direct reads
  • Separate bucket for backups with 90-day lifecycle (set on bucket)
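The documented key layout can be sketched as a helper (the function name is hypothetical; the real S3Service may build keys differently):

```typescript
import { randomUUID } from 'node:crypto';

// Builds a key of the form {tenantId}/inspections/{id}/reports/{uuid}.pdf.
// Prefixing every key with the tenant ID is what lets presigned URLs stay
// tenant-scoped.
function reportKey(tenantId: string, inspectionId: string): string {
  return `${tenantId}/inspections/${inspectionId}/reports/${randomUUID()}.pdf`;
}
```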

05 Queues & Background Jobs

API enqueues jobs via BullMQ; worker pulls them over the outbound Redis connection. IMPLEMENTED_QUEUES gate prevents spinning up Workers for queues without processors (avoids burning Upstash quota on retries).
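The gate can be sketched as follows; the names here are assumptions based on the description above, not the actual worker code:

```typescript
// Only queues that both appear in IMPLEMENTED_QUEUES and have a registered
// processor get a BullMQ Worker; anything else is skipped so undeliverable
// jobs never spin-retry against the Upstash request quota.
const IMPLEMENTED_QUEUES = ['pdf', 'scan-upload', 'sms'] as const;

type Processor = (jobData: unknown) => Promise<void>;

function workersToStart(processors: Record<string, Processor>): string[] {
  return Object.keys(processors).filter((name) =>
    (IMPLEMENTED_QUEUES as readonly string[]).includes(name),
  );
}
```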

| Queue | Producer | Processor | Purpose | Notes |
| --- | --- | --- | --- | --- |
| pdf | reports module | pdf.processor.ts | Render report HTML → PDF, store in R2, email link | Puppeteer + system Chromium; concurrency=1 |
| scan-upload | media module | scan-upload.processor.ts | Post-upload processing: mime sniff, sharp resize, thumbnails | Uses @aws-sdk/client-s3 |
| sms | notifications | sms.processor.ts | Outbound SMS (future Twilio/ClickSend) | Provider not yet wired |
| email (planned) | notifications | — (not in IMPLEMENTED_QUEUES) | Transactional email via Resend | A6.1 task open |

06 Edge, DNS & Assets

Cloudflare DNS + Proxy

  • Nameservers delegated at registrar (VentraIP / Cloudflare Registrar)
  • Orange-cloud proxy in front of Fly .fly.dev hosts via CNAME
  • WAF / bot mgmt / TLS termination (Fly also terminates TLS — double TLS)
  • DNS for Resend (SPF TXT, DKIM CNAMEs, DMARC TXT, bounce MX)

Cloudflare R2

  • Buckets: autoinspect-pro-production, -staging, -backups
  • Zero egress — critical for photo-heavy dealers
  • Private bucket; access only via presigned URLs issued by API
  • Lifecycle rules set manually in Cloudflare UI

07 Email & Observability

Resend

  • Verified sending domain (SPF/DKIM/DMARC on CF DNS)
  • Used by API directly today; will migrate into email worker queue (A6.1)
  • Templates authored with React Email (planned)

New Relic

  • APM on apps/api and apps/worker (newrelic package, newrelic-loader.ts)
  • Browser RUM on apps/web (NEXT_PUBLIC_NEW_RELIC_*)
  • Config via env — NEW_RELIC_NO_CONFIG_FILE=true
  • Pino logs → stdout → Fly → New Relic log ingest

Logging

  • nestjs-pino with pino-http — structured JSON logs
  • No console.log in committed code (enforced by lint)
  • Pretty logs locally (pino-pretty), structured in prod

08 CI/CD

| Workflow | Trigger | Does |
| --- | --- | --- |
| ci.yml | PR + push to main | install · prisma generate · lint · type-check · security gates (tenant isolation, rate limits) · test · build |
| deploy-staging.yml | push to main | build · scripts/deploy.sh staging (api + worker + web in parallel) |
| deploy-prod.yml | tag v* / manual | manual-approved prod deploy · smoke tests against /api/health/* + web→api proxy probe |
| backup-db.yml | weekly cron (Sun 02:00 UTC) | pg_dump → gzip → aws s3 cp to R2 backups bucket (streaming, no disk) |
| k6-smoke.yml | manual / scheduled | k6 smoke + authenticated load test |
| nightly-regression.yml | nightly | regression suite |

scripts/deploy.sh

  • Single source of truth for deploy flags
  • Bakes --build-arg API_INTERNAL_URL per env (or web breaks with ECONNREFUSED)
  • Parallel targets; exit code aggregates per-target status
  • Reads token from /tmp/fly-token.sh locally or FLY_API_TOKEN in CI

Migrations on deploy

  • Fly release_command on fly.api.toml runs prisma migrate deploy
  • Executed in a one-off VM before rollout; failure aborts deploy, previous version keeps serving
  • Never modify the database directly — migrations are the only path

09 Key Data Flows

Inspection photo upload
  1. Client requests presigned PUT from POST /api/media/presign
  2. API issues presigned URL to R2 (TTL 3600s) and returns object key
  3. Client PUTs binary directly to R2 (bypasses Fly bandwidth)
  4. Client notifies POST /api/media/:id/finalize → API enqueues scan-upload
  5. Worker: mime sniff → sharp resize → thumbnails → update MediaItem row
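Step 5's mime sniff can be sketched with standard magic-byte checks. The signatures below are the standard JPEG/PNG/WebP headers; the actual processor may instead use a library such as file-type.

```typescript
// Identify an image by magic bytes instead of trusting the client's
// Content-Type header (an uploaded "photo.jpg" could be anything).
function sniffImageMime(buf: Uint8Array): string | null {
  // JPEG: FF D8 FF
  if (buf.length >= 3 && buf[0] === 0xff && buf[1] === 0xd8 && buf[2] === 0xff) {
    return 'image/jpeg';
  }
  // PNG: 89 50 4E 47
  if (buf.length >= 4 && buf[0] === 0x89 && buf[1] === 0x50 && buf[2] === 0x4e && buf[3] === 0x47) {
    return 'image/png';
  }
  // WebP: RIFF....WEBP ("WEBP" at byte offset 8)
  if (buf.length >= 12 && buf[8] === 0x57 && buf[9] === 0x45 && buf[10] === 0x42 && buf[11] === 0x50) {
    return 'image/webp';
  }
  return null;
}
```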
Customer PDF report
  1. User clicks "Generate Report" → API enqueues pdf job with inspection ID + tenant
  2. Worker fetches data (Prisma with tenant scope), renders HTML via template, Puppeteer → PDF
  3. PDF uploaded to R2 at {tenantId}/inspections/{id}/reports/{uuid}.pdf
  4. Worker creates Report + ReportDelivery rows and triggers email
  5. Resend sends a share link containing an opaque token; public viewer hits /api/public/report/:token
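One plausible shape for the opaque token in step 5 (hypothetical; the source does not specify how tokens are stored): issue a random value, persist only its hash, and embed the raw value in the share link.

```typescript
import { randomBytes, createHash } from 'node:crypto';

// Generate an opaque, unguessable token plus the hash that would be stored
// on the Report/ReportDelivery row. The public viewer presents the raw
// token; the API hashes it and looks up the row.
function issueShareToken(): { token: string; tokenHash: string } {
  const token = randomBytes(32).toString('base64url'); // 256 bits of entropy
  const tokenHash = createHash('sha256').update(token).digest('hex');
  return { token, tokenHash };
}
```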
Tenant isolation on every request
  1. Passport JWT guard verifies access token; attaches user + tenantId
  2. TenantContextMiddleware stores tenant ID in AsyncLocalStorage
  3. Prisma Client Extension injects where: { tenantId } on every query
  4. Postgres RLS is a defence-in-depth backstop if the extension ever misses
  5. Tenant isolation test (test:tenant) runs on every CI build as a security gate
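Steps 2 and 3 rely on Node's AsyncLocalStorage. A minimal sketch of the mechanism (function names are illustrative, not the middleware's real API):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Each request runs inside tenantContext.run(...), so any code on that
// async path (including the Prisma extension) can read the tenant ID
// without it being threaded through every call.
const tenantContext = new AsyncLocalStorage<{ tenantId: string }>();

function runWithTenant<T>(tenantId: string, fn: () => T): T {
  return tenantContext.run({ tenantId }, fn);
}

function currentTenantId(): string {
  const store = tenantContext.getStore();
  if (!store) throw new Error('No tenant context: request was not scoped');
  return store.tenantId;
}
```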

10 Security Model

Transport & edge

  • TLS at Cloudflare + Fly (end-to-end HTTPS)
  • helmet on API with strict CSP (default-src 'none') — relaxed only in dev
  • API refuses to boot on a deployed origin unless NODE_ENV=production
  • CORS origins from getAllowedWebOrigins()

AuthN / AuthZ

  • Passport JWT (access 15m) + httpOnly refresh cookie (7d) scoped to /api/auth
  • Non-httpOnly ai_session=1 used only by Next.js middleware for flash prevention
  • RBAC via @Roles() + RolesGuard on every endpoint
  • TOTP 2FA (otplib + qrcode)

Rate limiting

  • Per-route limiters: login (10/min), refresh (30-60/min), public report (30-60/min), public approval
  • express-rate-limit backed by Redis
  • Tested in CI as a named security gate
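The counters behave like a fixed window per key. An in-memory sketch follows; the production limiter is express-rate-limit with a Redis store, so counters survive restarts and are shared across machines:

```typescript
// Fixed-window limiter: the first hit in a window resets the count, and
// requests beyond `limit` within `windowMs` are rejected. Keyed by
// e.g. client IP or user ID.
class FixedWindowLimiter {
  private hits = new Map<string, { windowStart: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now = Date.now()): boolean {
    const entry = this.hits.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { windowStart: now, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}
```

With the documented login limit this would be `new FixedWindowLimiter(10, 60_000)` keyed by client IP.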

Tenant data

  • Every domain table has tenant_id — no exceptions
  • Prisma extension + Postgres RLS (belt + braces)
  • Presigned URLs scoped per-tenant via key prefix
  • Audit log records sensitive writes

11 Environment Variables (condensed)

| Category | Keys |
| --- | --- |
| Core | NODE_ENV, PORT, APP_URL, API_INTERNAL_URL (web build-arg), WEB_PORT, API_PORT |
| Auth | JWT_SECRET, JWT_REFRESH_SECRET, JWT_ACCESS_EXPIRY, JWT_REFRESH_EXPIRY, BCRYPT_ROUNDS |
| Database | DATABASE_URL (Supabase), BACKUP_DATABASE_URL (direct, for pg_dump) |
| Redis | VALKEY_URL, VALKEY_HOST/PORT/PASSWORD/DB |
| Object store | S3_ENDPOINT, S3_ACCESS_KEY, S3_SECRET_KEY, S3_BUCKET, S3_REGION, S3_PRESIGN_TTL_SEC |
| Backups (CI) | R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, R2_ACCOUNT_ID, R2_BACKUPS_BUCKET |
| Rate limits | RATE_LIMIT_LOGIN, RATE_LIMIT_REFRESH, RATE_LIMIT_PUBLIC_REPORT |
| External APIs | NHTSA_API_URL (VIN decode), Resend API key (secret) |
| Observability | NEW_RELIC_LICENSE_KEY, NEW_RELIC_APP_NAME, NEXT_PUBLIC_NEW_RELIC_* |
| Deploy | FLY_API_TOKEN (CI), PUPPETEER_EXECUTABLE_PATH (worker) |
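A fail-fast boot check over the core keys can be sketched as below. The real apps validate env with the Zod schemas in @autoinspect/shared; this list of required keys is a subset chosen for illustration.

```typescript
// Return the names of required variables that are missing or empty, so the
// process can refuse to start with a clear error instead of failing later.
const REQUIRED_VARS = ['DATABASE_URL', 'VALKEY_URL', 'JWT_SECRET', 'S3_BUCKET'] as const;

function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_VARS.filter((key) => !env[key]);
}
```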

12 Full AWS Migration Plan

Translate the current "Fly + Cloudflare + Supabase + Upstash + Resend" stack into a pure AWS equivalent. The application code requires near-zero changes — Prisma, ioredis, and the AWS S3 SDK already work against managed services. The work is mostly infrastructure, networking, and cost trade-offs.

Component-by-component mapping

Runtime
  Now: Fly.io · 3 apps × 2 envs, Docker images, built-in TLS, private 6PN
  AWS: ECS Fargate behind an ALB. One service per workspace (web, api, worker); two clusters (staging/prod) or one cluster with env-scoped services. One task definition per image. Worker runs as a headless service (no target group).
Postgres
  Now: Supabase Postgres 17 in SYD, PgBouncer pooled, PITR on the paid plan
  AWS: Amazon RDS for PostgreSQL 17, Multi-AZ in ap-southeast-2. RDS Proxy replaces PgBouncer. Automated backups + PITR. Parameter group with rds.force_ssl=1. Secrets Manager rotation.
Redis / Valkey
  Now: Upstash Redis (Valkey) · serverless, TLS, pay-per-request
  AWS: Amazon ElastiCache for Valkey (Serverless) or Redis OSS. Private subnets only; security group admits the ECS tasks. Encryption in transit and at rest.
Object storage
  Now: Cloudflare R2 · zero egress, S3 API, three buckets
  AWS: Amazon S3 (ap-southeast-2) buckets: media-prod, media-staging, backups. Block Public Access on. Lifecycle rules for backups (90-day Glacier → expire). Egress is no longer free — mitigate with CloudFront for public artefacts and a Gateway VPC Endpoint for S3 traffic from ECS.
CDN / WAF / DNS
  Now: Cloudflare DNS + proxy + WAF
  AWS: Route 53 (DNS) + CloudFront (CDN in front of ALB and S3) + AWS WAF + ACM (TLS certs: us-east-1 for CloudFront, ap-southeast-2 for the ALB). AWS Shield Standard is included.
Email
  Now: Resend (SPF/DKIM/DMARC)
  AWS: Amazon SES in ap-southeast-2. Request production access to leave the sandbox. Configuration set for bounce/complaint events → SNS → SQS → event handler. DKIM via Route 53. For richer templating/analytics, keep Resend.
Observability
  Now: New Relic APM + browser RUM
  AWS: Option A (keep): New Relic works fine on ECS, no changes. Option B (full AWS): CloudWatch Logs + X-Ray (ADOT instrumentation for NestJS) + CloudWatch RUM. Container Insights for ECS metrics.
Secrets & config
  Now: flyctl secrets
  AWS: Secrets Manager (JWT, DB URL, API keys) + SSM Parameter Store (non-secret config). Injected into Fargate via the secrets block in the task definition.
CI/CD
  Now: GitHub Actions → flyctl deploy
  AWS: Keep GitHub Actions. Build images, push to Amazon ECR, update the ECS service with a new task definition. Use OIDC federation (no long-lived AWS keys). Optional: mirror into CodePipeline + CodeDeploy for blue/green.
Backups
  Now: Weekly GH Action: pg_dump → R2, 90-day lifecycle
  AWS: RDS automated snapshots + PITR (primary). Keep the pg_dump cron as an EventBridge scheduled rule → one-off ECS task → S3 with Object Lock (compliance mode) for ransomware-proof retention.
PDF worker (Chromium)
  Now: Fly worker with apk chromium + Fly volume
  AWS: Option A: Fargate task with the same Dockerfile; works today. Option B: AWS Lambda container image + @sparticuz/chromium on demand; simpler and cheaper at low volume. For shared scratch, mount EFS on Fargate if needed (rarely necessary; S3 handles the artefacts).
Queues
  Now: BullMQ on Upstash
  AWS: Keep BullMQ on ElastiCache Valkey (minimal change). Native alternative: SQS standard queues + EventBridge for scheduling. SQS gives you DLQs and visibility-timeout semantics for free, but requires rewriting producers and consumers.
Rate limiting
  Now: express-rate-limit backed by Redis
  AWS: Keep the application-level limiters. Add AWS WAF rate-based rules at the edge (CloudFront) to protect /api/auth/login and the public report endpoint before traffic reaches the ALB.
IaC
  Now: fly.*.toml files + scripts/deploy.sh
  AWS: AWS CDK (TypeScript) or Terraform. One stack per environment: VPC, subnets, ALB, ECS services, RDS, ElastiCache, S3, CloudFront, Route 53, Secrets. The Dockerfiles are reused untouched.

Target AWS topology (ap-southeast-2)

[Diagram] Edge (global): Route 53 (DNS, alias to CloudFront) → CloudFront + AWS WAF (TLS via ACM · rate-based rules), ACM certificates (us-east-1 + ap-southeast-2), AWS Shield Standard (DDoS baseline). Inside the ap-southeast-2 VPC: public subnets (2 AZ) hold the Application Load Balancer (HTTPS :443 · health checks); private ECS Fargate subnets run the web (Next.js, target group), api (NestJS, target group) and worker (BullMQ, no LB) services, pulling web/api/worker images from Amazon ECR, with Secrets Manager (JWT · DB · API keys) and Parameter Store (non-secret config); private data subnets hold RDS Postgres 17 (Multi-AZ · RDS Proxy), ElastiCache Valkey Serverless, and optional EFS worker scratch. Regional services: Amazon S3 (media · reports · backups, reached via an S3 VPC gateway endpoint so no NAT egress for S3), Amazon SES (SPF/DKIM via Route 53), CloudWatch + X-Ray (logs · metrics · traces), EventBridge Scheduler (weekly backups · digests), optional Lambda for on-demand Chromium PDFs.

Migration phases

Phase 1 — Foundation (no traffic)

  • Pick an AWS account structure (single account OK for beta; Organizations + SSO for real).
  • Write CDK/Terraform: VPC (3 AZs), NAT, S3 gateway endpoint, ECR repos, Secrets, Route 53 zone, ACM certs (us-east-1 for CloudFront).
  • Provision RDS Postgres (empty) + ElastiCache Valkey (empty) in private subnets.
  • Set up OIDC trust for GitHub Actions → AWS role (no static keys).

Phase 2 — Build + push images

  • Duplicate Dockerfiles (already build-clean) and push to ECR per app.
  • Author ECS task definitions: secrets from Secrets Manager, CloudWatch log groups, healthcheck matching Fly (/api/health/ready).
  • Create ECS services behind ALB target groups (web + api). Worker service has no LB.

Phase 3 — Data migration

  • Postgres: pg_dump | pg_restore from Supabase into RDS during a brief freeze, or use AWS DMS for near-zero-downtime CDC.
  • Redis: no migration needed — BullMQ jobs drain naturally; flip VALKEY_URL at cutover.
  • R2 → S3: aws s3 sync (or rclone sync) for existing media; same key layout.

Phase 4 — Cutover

  • Freeze writes (maintenance page), run final delta-sync of DB + S3.
  • Point Route 53 at CloudFront; the old Cloudflare DNS TTL is lowered 24 h in advance.
  • Flip Fly web/api to serve maintenance 503s (keeps the old stack paused but instantly restorable).
  • Unfreeze. Monitor CloudWatch dashboards + error rates.

Phase 5 — Decommission

  • After 7–14 days of clean operation: destroy Fly apps, cancel Supabase + Upstash + R2 (keep backup bucket if still referenced).
  • Delete Cloudflare DNS records no longer needed (keep the zone if you want CF WAF in front of CloudFront as a second edge — possible but unusual).
  • Tag all AWS resources app=autoinspect, env=prod|staging for cost attribution.

Code changes required

  • None in the API — @aws-sdk/client-s3, ioredis, and @prisma/client already target AWS-native services.
  • Remove the S3_ENDPOINT env value so the SDK uses AWS default endpoints.
  • Swap resend calls for @aws-sdk/client-sesv2 if going all-in on SES (otherwise keep Resend).
  • Replace flyctl commands in scripts/deploy.sh with aws ecs update-service or a CDK deploy.
  • Puppeteer: no change on Fargate. If switching to Lambda, replace Puppeteer with @sparticuz/chromium + puppeteer-core.

Cost posture (rough, AUD/mo, beta scale)

| Line | Current stack | AWS equivalent | Notes |
| --- | --- | --- | --- |
| Compute (3 svc × 2 envs) | ~$30 Fly | $80–150 Fargate | AWS wins at scale, loses on small instances |
| Postgres | $0–40 Supabase | $70–120 RDS Multi-AZ | Multi-AZ doubles cost vs single-AZ Supabase |
| Redis | $0–10 Upstash | $20–50 ElastiCache Serverless | ElastiCache has a minimum baseline |
| Object storage | $0–5 R2 (no egress) | $5–30 S3 + egress | CloudFront + S3 GW endpoint mitigates egress |
| Email | $0–20 Resend | ~$0 SES (+ $0.10/1k) | SES cheapest at volume |
| Observability | $0 New Relic free | $10–40 CloudWatch | Or keep New Relic |
| Edge/DNS/WAF | $0 Cloudflare free | $10–30 CloudFront + WAF | AWS WAF is rule-count priced |
| Total ballpark | $30–100 | $200–400 | AWS is 2–4× at beta; the gap closes, then flips, at medium scale |

Gotchas & recommendations

Don't skip

  • S3 Gateway VPC Endpoint — without it every S3 call from ECS pays NAT egress.
  • RDS Proxy — Fargate tasks churn connections quickly; Proxy behaves like PgBouncer and protects the DB.
  • OIDC for GitHub Actions — never paste long-lived AWS keys into repo secrets.
  • ACM in us-east-1 for CloudFront certs (CloudFront only reads us-east-1) and ap-southeast-2 for the ALB.
  • SES production access — approval can take days; start early.
  • Backup restore drill — practice pg_restore from a snapshot before cutover.

Things you can keep

  • Cloudflare DNS in front of AWS is fine (orange-cloud → CloudFront). Two WAFs is overkill; pick one.
  • Resend for transactional email — saves you SES sandbox/reputation work. Swap later if cost matters.
  • New Relic works on ECS without changes. Only migrate to CloudWatch/X-Ray if you want single-vendor billing.
  • GitHub Actions — keep as the CI/CD driver; AWS CodePipeline is optional.
  • Docker images — unchanged. The Dockerfiles are portable.