Recommended Architecture

Use an always-on lightweight control plane + an ephemeral Vast GPU worker

Cloudflare Worker: tiny gateway at api.example.com
Durable Object: state store
Vast API
Ephemeral Vast GPU Instance
Cloudflared Named Tunnel: sam-origin.example.com
FastAPI SAM Service
SAM 3 Runtime
Object Storage (R2 / S3)

Key Design Decisions

Stable public endpoint lives at Cloudflare, not on the GPU host
GPU worker is created on demand and destroyed after idle time
Named Cloudflare Tunnel provides a stable origin hostname
Persist media and results in object storage; keep session metadata separately
Design the backend adapter so you can swap Mac / MPS or Linux / CUDA later
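
The "created on demand and destroyed after idle time" decision can be sketched as a small idle tracker. This is a minimal illustration under stated assumptions: the class name `IdleTracker`, the 600-second timeout, and the rule that in-flight work blocks destruction are hypothetical and not taken from the deck.

```python
from dataclasses import dataclass


@dataclass
class IdleTracker:
    """Tracks the last activity time and in-flight work for one GPU worker."""
    idle_timeout_s: float          # hypothetical setting, e.g. 600 seconds
    last_activity: float = 0.0
    in_flight: int = 0

    def request_started(self, now: float) -> None:
        self.in_flight += 1
        self.last_activity = now

    def request_finished(self, now: float) -> None:
        self.in_flight = max(0, self.in_flight - 1)
        self.last_activity = now

    def should_destroy(self, now: float) -> bool:
        """True only when nothing is running and the idle window has elapsed."""
        if self.in_flight > 0:
            return False
        return now - self.last_activity >= self.idle_timeout_s


tracker = IdleTracker(idle_timeout_s=600)
tracker.request_started(now=1000.0)
tracker.request_finished(now=1030.0)
print(tracker.should_destroy(now=1200.0))  # checked inside the idle window
print(tracker.should_destroy(now=1700.0))  # checked after the idle window
```

In the architecture above, the control plane would run a check like this on a timer and call the Vast API to destroy the instance when it returns true.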

Ask Codex To

Draft the Dockerfile and FastAPI skeleton
Define the Worker proxy contract and error states
Create Terraform or scripts for repeatable setup

Prototype on Mac, Deploy on Vast

Use the M3 MacBook Air for local experimentation, but treat Vast.ai as the production path

Local Mac prototype

Best path on macOS: Hugging Face Transformers on MPS
If image-only segmentation is enough, MLX is worth testing
Good for single-user demos, API prototyping, and prompt experiments
Avoid the official Meta repo as the primary Mac path because its Mac/MPS support involves more friction
Do not expect strong concurrent video tracking on a fanless M3 Air

Remote Vast deployment

Better fit for real GPU workloads and heavier image / video jobs
Lets you keep a stable public endpoint through Cloudflare
Much easier to evolve later into Linux + CUDA production
Use ephemeral workers to reduce idle cost
Recommended for stateful SAM 3 video sessions

Request Lifecycle

How the system behaves on cold starts, warm requests, and long-running video sessions.

1. Client calls api.example.com
2. Cloudflare Worker checks state (warming_up, ready, draining, off, failed)
3. If no worker exists: create a Vast instance through the Vast API (provision on demand)
4. GPU instance starts FastAPI + cloudflared
5. Worker marks origin ready and proxies the request to the origin
6. FastAPI runs the SAM job and stores outputs in R2 / S3
7. Client polls job status or receives the completed result

Warm Path

If the worker is already ready, skip provisioning and proxy immediately.

Video Session Rule

Keep per-video state alive for the whole session; do not treat video tracking as isolated stateless requests.

Worker States

warming_up
ready
draining
off
failed
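
The five states listed above can be sketched as a small state machine. The state names come from the deck; the allowed transitions and the gateway action chosen for each state are assumptions made for illustration, and the production version would live in the Cloudflare Worker rather than in Python.

```python
from enum import Enum


class WorkerState(str, Enum):
    OFF = "off"
    WARMING_UP = "warming_up"
    READY = "ready"
    DRAINING = "draining"
    FAILED = "failed"


# Assumed edges: the deck names the states but does not define the transitions.
ALLOWED = {
    WorkerState.OFF: {WorkerState.WARMING_UP},
    WorkerState.WARMING_UP: {WorkerState.READY, WorkerState.FAILED},
    WorkerState.READY: {WorkerState.DRAINING, WorkerState.FAILED},
    WorkerState.DRAINING: {WorkerState.OFF, WorkerState.FAILED},
    WorkerState.FAILED: {WorkerState.OFF},
}


def transition(current: WorkerState, target: WorkerState) -> WorkerState:
    """Return the new state, or raise if the edge is not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def gateway_action(state: WorkerState) -> str:
    """What the gateway does with an incoming request in each state (assumed)."""
    if state is WorkerState.READY:
        return "proxy"                      # warm path
    if state is WorkerState.OFF:
        return "provision_and_return_202"   # cold start
    if state is WorkerState.WARMING_UP:
        return "return_202"                 # already provisioning
    return "return_503"                     # draining or failed


state = transition(WorkerState.OFF, WorkerState.WARMING_UP)
print(state.value, gateway_action(state))
```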

Ask Codex To

Generate the state machine for the Worker
Implement job polling and 202 responses
Add retry logic and health checks

Step 1 — Build the GPU Worker Container

Pre-bake dependencies so cold starts are dominated by instance startup, not environment setup

1. What goes into the image

Python runtime, FastAPI, uvicorn, SAM dependencies
cloudflared binary
healthcheck and startup scripts
model weights or a fast first-run download strategy
logging and metrics hooks

2. Container behavior

Start FastAPI on localhost:8000
Start cloudflared tunnel with token-based auth
Bind the API port to localhost only; outside traffic arrives through the tunnel
Serve readiness and liveness endpoints
Keep one inference worker process to avoid duplicate model memory
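
The difference between the liveness and readiness endpoints can be sketched as a pure function. This is a framework-free illustration: `RuntimeStatus` and `probe` are hypothetical names, and treating "model loaded and tunnel connected" as the readiness condition is an assumption. The paths `/health` and `/ready` would be thin FastAPI routes around the same logic.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RuntimeStatus:
    """Hypothetical flags that the startup script would update."""
    process_alive: bool = True
    model_loaded: bool = False
    tunnel_connected: bool = False


def probe(path: str, status: RuntimeStatus) -> tuple[int, dict]:
    """Return (HTTP status code, JSON body) for a probe path."""
    if path == "/health":
        # Liveness: the process is up, even if the model is still loading.
        code = 200 if status.process_alive else 503
        return code, {"alive": status.process_alive}
    if path == "/ready":
        # Readiness (assumed rule): safe to proxy traffic to this origin.
        ready = status.model_loaded and status.tunnel_connected
        body = {
            "ready": ready,
            "model_loaded": status.model_loaded,
            "tunnel_connected": status.tunnel_connected,
        }
        return (200 if ready else 503), body
    return 404, {"error": "not found"}


status = RuntimeStatus()
print(probe("/health", status))
print(probe("/ready", status))
```

Keeping the two probes separate lets the control plane tell "still warming up" apart from "dead", which maps onto the warming_up and failed states.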

3. Recommended files

/app
├── /api
├── /sam_runtime
├── /workers
├── /scripts
├── Dockerfile
├── entrypoint.sh
├── healthcheck.py
└── requirements.txt

Tip: bake as much as possible into the image — large downloads are the biggest cold-start tax.

Ask Codex To

Write the Dockerfile and entrypoint
Implement /health and /ready endpoints
Add structured logging and env-based config
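
As a starting point for the env-based config task, the settings can be loaded once at startup into an immutable object. A minimal sketch: every variable name (`SAM_HOST`, `SAM_PORT`, `SAM_DEVICE`, `SAM_MAX_CONCURRENT_JOBS`, `SAM_LOG_LEVEL`) and every default other than port 8000 is a hypothetical choice, not something the deck specifies.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    device: str
    max_concurrent_jobs: int
    log_level: str


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        host=env.get("SAM_HOST", "127.0.0.1"),        # local-only bind
        port=int(env.get("SAM_PORT", "8000")),
        device=env.get("SAM_DEVICE", "cuda"),         # e.g. "cuda", "mps"
        max_concurrent_jobs=int(env.get("SAM_MAX_CONCURRENT_JOBS", "1")),
        log_level=env.get("SAM_LOG_LEVEL", "INFO").upper(),
    )


settings = load_settings({"SAM_DEVICE": "mps", "SAM_LOG_LEVEL": "debug"})
print(settings)
```

Passing the mapping explicitly keeps the loader testable without touching the real process environment.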

Step 4 — Build the FastAPI SAM Service

Model the API around the core SAM image and video task families

Internal Service Modules

Router and request validation
Session manager for image and video sessions
Backend adapter: Transformers / MPS, MLX, or CUDA runtime
Background job worker for long-running tasks
Result formatter for masks, overlays, and metadata
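
The backend adapter module can be sketched as a small interface plus a registry, so the router never imports a specific runtime. The names `SamBackend`, `StubBackend`, and `get_backend`, and the single `segment_image` method, are hypothetical; the stub returns placeholder data instead of calling a real SAM runtime.

```python
from typing import Any, Protocol


class SamBackend(Protocol):
    """Interface the service codes against (assumed shape)."""
    name: str

    def segment_image(self, image: Any, prompt: str) -> dict: ...


class StubBackend:
    """Placeholder that stands in for a real Transformers/MLX/CUDA runtime."""

    def __init__(self, name: str) -> None:
        self.name = name

    def segment_image(self, image: Any, prompt: str) -> dict:
        return {"backend": self.name, "prompt": prompt, "masks": []}


# Factories are lazy so that heavy runtimes are only built when selected.
_REGISTRY = {
    "mps": lambda: StubBackend("transformers-mps"),
    "mlx": lambda: StubBackend("mlx"),
    "cuda": lambda: StubBackend("cuda"),
}


def get_backend(kind: str) -> SamBackend:
    try:
        factory = _REGISTRY[kind]
    except KeyError:
        raise ValueError(f"unknown backend: {kind!r}") from None
    return factory()


backend = get_backend("mps")
print(backend.segment_image(image=None, prompt="red car"))
```

Swapping Mac / MPS for Linux / CUDA then becomes a configuration change rather than a code change in the router.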

Recommended Endpoints

POST /v1/pcs/image
POST /v1/pcs/video/sessions
POST /v1/pcs/video/{session_id}/prompts
POST /v1/pcs/video/{session_id}/run
POST /v1/pvs/image
POST /v1/pvs/video/sessions
POST /v1/pvs/video/{session_id}/annotations
POST /v1/pvs/video/{session_id}/propagate
GET /v1/jobs/{job_id}

Return Formats

COCO RLE masks by default
Optional PNG overlay URLs
boxes, scores, object_ids, prompt_to_obj_ids
semantic segmentation when available
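
To make the default mask format concrete, here is a pure-Python sketch of uncompressed COCO-style run-length encoding: pixels are read in column-major order, and the counts alternate between runs of 0 and runs of 1, starting with 0. The function names are hypothetical, and the compressed string variant that libraries such as pycocotools emit is not shown.

```python
def encode_rle(mask):
    """Encode a 2D list of 0/1 values as uncompressed COCO-style RLE."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    counts = []
    prev = 0   # runs start with zeros, so a leading 1 yields a first count of 0
    run = 0
    for x in range(w):          # column-major traversal
        for y in range(h):
            v = 1 if mask[y][x] else 0
            if v != prev:
                counts.append(run)
                run = 0
                prev = v
            run += 1
    counts.append(run)
    return {"size": [h, w], "counts": counts}


def decode_rle(rle):
    """Inverse of encode_rle; returns a 2D list of 0/1 values."""
    h, w = rle["size"]
    flat = []
    value = 0
    for n in rle["counts"]:
        flat.extend([value] * n)
        value = 1 - value
    return [[flat[x * h + y] for x in range(w)] for y in range(h)]


mask = [
    [0, 1, 1],
    [0, 0, 1],
]
rle = encode_rle(mask)
print(rle)
print(decode_rle(rle) == mask)
```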

Concurrency Rule

On a single GPU worker, keep one process and gate expensive jobs with a small semaphore.
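
The concurrency rule can be sketched with `asyncio.Semaphore`: all requests share one process, and only `limit` expensive jobs run at a time. The job body here is a short sleep standing in for GPU inference, and the function name `run_jobs` and the limit of 2 are illustrative assumptions.

```python
import asyncio


async def run_jobs(n_jobs: int, limit: int):
    """Run n_jobs coroutines with at most `limit` inside the gate at once."""
    gate = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def job(i: int) -> int:
        nonlocal running, peak
        async with gate:                 # waits here when the gate is full
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)    # stand-in for GPU inference
            running -= 1
            return i

    results = await asyncio.gather(*(job(i) for i in range(n_jobs)))
    return results, peak


results, peak = asyncio.run(run_jobs(n_jobs=6, limit=2))
print(results, "peak concurrency:", peak)
```

Real inference calls are blocking, so in a FastAPI service they would typically be pushed to a thread (for example with `asyncio.to_thread`) while still holding the semaphore.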

Ask Codex To

Generate Pydantic models and OpenAPI docs
Implement the backend adapter interface
Create job queue and result serialization

Step 5 — Storage, State, and Job Handling

Persist the right things, keep hot session state in the right place, and avoid long blocking requests

1. Object storage (R2 / S3)

Upload raw media, overlays, completed masks, logs, and exports
Return signed URLs or internal URLs to the client
Use predictable object naming by user, session, and job IDs

2. Metadata / state store

Store instance_id, worker state, session metadata, and job metadata
A Durable Object is enough for control-plane state; SQLite or Redis can back the FastAPI service
Cache image embeddings when repeated prompts are likely
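
Embedding caching with a simple eviction rule can be sketched as a least-recently-used cache keyed by a hash of the image bytes. The class name `EmbeddingCache`, the SHA-256 key, and the LRU policy are assumptions; `compute` is a placeholder for the real image encoder.

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    """LRU cache: repeated prompts on the same image reuse one embedding."""

    def __init__(self, max_items: int, compute) -> None:
        self.max_items = max_items
        self.compute = compute          # placeholder for the image encoder
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._items:
            self._items.move_to_end(key)        # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        value = self.compute(image_bytes)
        self._items[key] = value
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)     # evict least recently used
        return value


cache = EmbeddingCache(max_items=2, compute=lambda b: len(b))
cache.get(b"image-a")
cache.get(b"image-a")
print("hits:", cache.hits, "misses:", cache.misses)
```

On a GPU worker the item limit would normally be sized by memory rather than by count, since embeddings for large images are not small.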

3. Background execution

Queue video propagation and longer jobs
Return 202 + job_id instead of holding the HTTP request open
Poll GET /v1/jobs/{job_id} for status and artifacts
Destroy the GPU worker after the idle timeout expires
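
The 202-then-poll pattern above can be sketched with an in-memory job store. The status names (`queued`, `running`, `succeeded`, `failed`), the response fields, and the class names are assumptions; a real service would persist the records in SQLite or Redis and run the job in a background worker.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    status: str = "queued"      # assumed states: queued, running, succeeded, failed
    artifacts: list = field(default_factory=list)
    error: str | None = None


class JobStore:
    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}

    def submit(self) -> tuple[int, dict]:
        """Accept a job and answer immediately with 202 and a poll location."""
        job = Job(job_id=uuid.uuid4().hex)
        self._jobs[job.job_id] = job
        body = {"job_id": job.job_id, "status": job.status,
                "poll": f"/v1/jobs/{job.job_id}"}
        return 202, body

    def start(self, job_id: str) -> None:
        self._jobs[job_id].status = "running"

    def finish(self, job_id: str, artifacts: list) -> None:
        job = self._jobs[job_id]
        job.status, job.artifacts = "succeeded", list(artifacts)

    def fail(self, job_id: str, error: str) -> None:
        job = self._jobs[job_id]
        job.status, job.error = "failed", error

    def poll(self, job_id: str) -> tuple[int, dict]:
        job = self._jobs.get(job_id)
        if job is None:
            return 404, {"error": "unknown job"}
        return 200, {"job_id": job.job_id, "status": job.status,
                     "artifacts": job.artifacts, "error": job.error}


store = JobStore()
code, body = store.submit()
store.start(body["job_id"])
store.finish(body["job_id"], ["masks/frame_0001.json"])
print(code, store.poll(body["job_id"]))
```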

Session Rule

For video, maintain per-session inference state until the session is explicitly ended or times out.
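
The session rule can be sketched as a manager that keeps per-video state until the session is closed or goes untouched for longer than a timeout. The class name, the method names, and the use of an idle timeout measured from the last touch are assumptions; `state` stands in for whatever inference state the SAM runtime returns.

```python
class VideoSessionManager:
    """Holds per-session inference state until closed or timed out."""

    def __init__(self, ttl_s: float) -> None:
        self.ttl_s = ttl_s
        self._sessions = {}     # session_id -> (state, last_used)

    def open(self, session_id: str, state, now: float) -> None:
        self._sessions[session_id] = (state, now)

    def touch(self, session_id: str, now: float):
        """Return the state and refresh the timer; KeyError if gone or expired."""
        state, last_used = self._sessions[session_id]
        if now - last_used > self.ttl_s:
            del self._sessions[session_id]
            raise KeyError(session_id)
        self._sessions[session_id] = (state, now)
        return state

    def close(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)

    def sweep(self, now: float) -> list:
        """Drop every expired session and return the removed IDs."""
        expired = [sid for sid, (_, last) in self._sessions.items()
                   if now - last > self.ttl_s]
        for sid in expired:
            del self._sessions[sid]
        return expired


mgr = VideoSessionManager(ttl_s=300)
mgr.open("vid-1", state={"frames_seen": 0}, now=0.0)
print(mgr.touch("vid-1", now=120.0))
print(mgr.sweep(now=1000.0))
```

Because the state lives in process memory, every request for a session has to reach the same worker, which is one reason the deck recommends a single inference process.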

Operational Default

Lower resolution by default on Mac or constrained GPUs; make full quality opt-in.

Ask Codex To

Write the storage abstraction
Implement job records and status transitions
Add cache and eviction rules

Execution Plan for You + Codex

Implement the system in small phases, using AI heavily for scaffolding and repeatable setup

1. API contract

Have Codex define endpoints, payloads, job states, and Pydantic models.

2. Local dev mode

Ask Codex to create a local FastAPI + MPS adapter so you can test the webapp contract on your Mac.

3. Worker container

Generate the Dockerfile, entrypoint, health checks, and SAM runtime wrapper.

4. Cloudflare control plane

Generate the Worker, state machine, and wake / proxy logic.

5. Vast automation

Generate scripts for searching offers, creating instances, polling readiness, and destroying idle workers.
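
The "polling readiness" part of these scripts can be sketched as a generic wait loop with exponential backoff. It deliberately takes the check as a callable instead of calling any Vast or Cloudflare API, so nothing about those APIs is assumed; the function name, delays, and timeout values are illustrative.

```python
import time


def wait_until_ready(check, timeout_s: float, initial_delay_s: float = 1.0,
                     max_delay_s: float = 15.0,
                     sleep=time.sleep, clock=time.monotonic) -> int:
    """Call check() until it returns True; return the number of attempts.

    Raises TimeoutError when the next wait would pass the deadline.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts
        if clock() + delay > deadline:
            raise TimeoutError(f"not ready after {attempts} attempts")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)   # exponential backoff with a cap


# Example with a check that succeeds immediately, so no real waiting happens.
print(wait_until_ready(lambda: True, timeout_s=5))
```

In practice `check` would request the origin's readiness endpoint through the tunnel, and a timeout would move the worker to the failed state.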

6. Ops hardening

Add auth, CORS, structured logs, request IDs, retries, and monitoring.
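
Structured logs with request IDs can be sketched with the standard `logging` module alone. The JSON field names, the logger name `sam-service`, and the `x-request-id` header are assumptions made for illustration.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via logger.info(..., extra={"request_id": ...}); None if absent.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def make_logger(stream, name: str = "sam-service") -> logging.Logger:
    logger = logging.getLogger(name)
    logger.handlers.clear()             # avoid duplicate handlers on re-creation
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def ensure_request_id(headers: dict) -> str:
    """Reuse the caller's request ID if present, otherwise mint a new one."""
    return headers.get("x-request-id") or uuid.uuid4().hex


buffer = io.StringIO()
log = make_logger(buffer)
rid = ensure_request_id({"x-request-id": "req-123"})
log.info("job accepted", extra={"request_id": rid})
print(buffer.getvalue().strip())
```

Passing the same ID from the Cloudflare Worker to the FastAPI origin makes it possible to follow one request across both sets of logs.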

Final Recommendation

Prototype locally on the Mac, keep the API contract portable, and use Cloudflare + Vast.ai for the real hosted architecture.