Recommended Architecture
Use an always-on lightweight control plane + an ephemeral Vast GPU worker
- Cloudflare Worker: tiny gateway at api.example.com
- Durable Object: state store
- Vast API
- Ephemeral Vast GPU Instance
- Cloudflared Named Tunnel: sam-origin.example.com
- FastAPI SAM Service
- SAM 3 Runtime
- Object Storage (R2 / S3)
Key Design Decisions
- Stable public endpoint lives at Cloudflare, not on the GPU host
- GPU worker is created on demand and destroyed after idle time
- Named Cloudflare Tunnel provides a stable origin hostname
- Persist media and results in object storage; keep session metadata separately
- Design the backend adapter so you can swap Mac / MPS or Linux / CUDA later
Ask Codex To
- Draft the Dockerfile and FastAPI skeleton
- Define the Worker proxy contract and error states
- Create Terraform or scripts for repeatable setup
Prototype on Mac, Deploy on Vast
Use the M3 MacBook Air for local experimentation, but treat Vast.ai as the production path
Local Mac prototype
- Best path on macOS: Hugging Face Transformers on MPS
- If image-only segmentation is enough, MLX is worth testing
- Good for single-user demos, API prototyping, and prompt experiments
- Avoid the official Meta repo as the primary Mac path; its Mac/MPS support involves more friction
- Do not expect strong concurrent video tracking on a fanless M3 Air
Remote Vast deployment
- Better fit for real GPU workloads and heavier image / video jobs
- Lets you keep a stable public endpoint through Cloudflare
- Much easier to evolve later into Linux + CUDA production
- Use ephemeral workers to reduce idle cost
- Recommended for stateful SAM 3 video sessions
Request Lifecycle
How the system behaves on cold starts, warm requests, and long-running video sessions.
Flow: provision on demand via the Vast API, proxy to origin, then report status / completion.
Warm Path
If the worker is already ready, skip provisioning and proxy immediately.
Video Session Rule
Keep per-video state alive for the whole session; do not treat video tracking as isolated stateless requests.
Worker States
- warming_up
- ready
- draining
- off
- failed
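A minimal sketch of how the states above could be wired into a state machine. The deployed gateway would live in the Cloudflare Worker; Python is used here only to illustrate the logic. The transition table, the `gateway_action` helper, and the 202 / 503 status choices are assumptions, since the source lists the states but not the edges between them.

```python
from enum import Enum


class WorkerState(str, Enum):
    OFF = "off"
    WARMING_UP = "warming_up"
    READY = "ready"
    DRAINING = "draining"
    FAILED = "failed"


# Assumed transition table: which state may follow which.
ALLOWED = {
    WorkerState.OFF: {WorkerState.WARMING_UP},
    WorkerState.WARMING_UP: {WorkerState.READY, WorkerState.FAILED},
    WorkerState.READY: {WorkerState.DRAINING, WorkerState.FAILED},
    WorkerState.DRAINING: {WorkerState.OFF, WorkerState.FAILED},
    WorkerState.FAILED: {WorkerState.OFF},
}


def transition(current, target):
    """Return the new state, or raise if the edge is not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def gateway_action(state):
    """Map the worker state to what the gateway does with an incoming request.

    Returns (action, status); status is None when the origin's own
    response is passed through unchanged.
    """
    if state is WorkerState.READY:
        return ("proxy", None)        # warm path: forward immediately
    if state is WorkerState.OFF:
        return ("provision", 202)     # cold path: start a worker, ask the client to retry
    if state is WorkerState.WARMING_UP:
        return ("wait", 202)          # a worker is already on its way
    return ("reject", 503)            # draining or failed
```

Keeping the table as data makes it easy to review the legal edges in one place and to port them to the Worker later.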
Ask Codex To
- Generate the state machine for the Worker
- Implement job polling and 202 responses
- Add retry logic and health checks
Step 1 — Build the GPU Worker Container
Pre-bake dependencies so cold starts are dominated by instance startup, not environment setup
What goes into the image
- Python runtime, FastAPI, uvicorn, SAM dependencies
- cloudflared binary
- healthcheck and startup scripts
- model weights or a fast first-run download strategy
- logging and metrics hooks
Container behavior
- Start FastAPI on localhost:8000
- Start cloudflared tunnel with token-based auth
- Bind the API port to localhost only; outside traffic reaches it through the tunnel
- Write readiness and liveness endpoints
- Keep one inference worker process to avoid duplicate model memory
Recommended files
/app
├── /api
├── /sam_runtime
├── /workers
├── /scripts
├── Dockerfile
├── entrypoint.sh
├── healthcheck.py
└── requirements.txt
Tip: bake as much as possible into the image — large downloads are the biggest cold-start tax.
Ask Codex To
- Write the Dockerfile and entrypoint
- Implement /health and /ready endpoints
- Add structured logging and env-based config
Step 4 — Build the FastAPI SAM Service
Model the API around the core SAM image and video task families
Internal Service Modules
- Router and request validation
- Session manager for image and video sessions
- Backend adapter: Transformers / MPS, MLX, or CUDA runtime
- Background job worker for long-running tasks
- Result formatter for masks, overlays, and metadata
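One possible shape for the backend adapter, sketched with a `typing.Protocol` so that MPS, MLX, and CUDA backends can be swapped behind one interface. All names here (`SamBackend`, `SegmentationResult`, `EchoBackend`, `get_backend`) are hypothetical, and the stand-in backend loads no model.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class SegmentationResult:
    masks: list = field(default_factory=list)
    boxes: list = field(default_factory=list)
    scores: list = field(default_factory=list)
    object_ids: list = field(default_factory=list)


class SamBackend(Protocol):
    """Interface every runtime (Transformers / MPS, MLX, CUDA) would implement."""

    name: str

    def segment_image(self, image_bytes: bytes, prompt: str) -> SegmentationResult:
        ...


class EchoBackend:
    """Stand-in backend that returns an empty result; useful for contract tests."""

    name = "echo"

    def segment_image(self, image_bytes, prompt):
        return SegmentationResult()


# Registry keyed by a config value; real backends would be added here.
_REGISTRY = {"echo": EchoBackend}


def get_backend(name: str) -> SamBackend:
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None
```

Usage: the router calls `get_backend(...)` once at startup with an env-based setting and never imports a specific runtime directly.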
Recommended Endpoints
- POST /v1/pcs/image
- POST /v1/pcs/video/sessions
- POST /v1/pcs/video/{session_id}/prompts
- POST /v1/pcs/video/{session_id}/run
- POST /v1/pvs/image
- POST /v1/pvs/video/sessions
- POST /v1/pvs/video/{session_id}/annotations
- POST /v1/pvs/video/{session_id}/propagate
- GET /v1/jobs/{job_id}
Return Formats
- COCO RLE masks by default
- Optional PNG overlay URLs
- boxes, scores, object_ids, prompt_to_obj_ids
- semantic segmentation when available
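To make the default mask format concrete, here is a pure-Python sketch of uncompressed COCO-style RLE: run lengths in column-major order, starting with a run of zeros, plus a `size` of `[height, width]`. The function names are illustrative. A real service would more likely rely on pycocotools, which also produces the compressed string form of `counts`.

```python
def rle_encode(mask):
    """Encode a binary mask (list of rows of 0/1) as uncompressed COCO-style RLE."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    counts = []
    previous = 0  # by convention the first run counts background pixels
    run = 0
    for x in range(width):        # column-major traversal
        for y in range(height):
            value = 1 if mask[y][x] else 0
            if value != previous:
                counts.append(run)
                run = 0
                previous = value
            run += 1
    counts.append(run)
    return {"size": [height, width], "counts": counts}


def rle_decode(rle):
    """Inverse of rle_encode: rebuild the list-of-rows mask."""
    height, width = rle["size"]
    flat = []
    value = 0
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    return [[flat[x * height + y] for x in range(width)] for y in range(height)]
```

For the 2x2 mask `[[0, 1], [1, 1]]` the column-major pixel order is 0, 1, 1, 1, so the encoded counts are `[1, 3]`.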
Concurrency Rule
On a single GPU worker, keep one process and gate expensive jobs with a small semaphore.
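A small sketch of the gating idea using `asyncio.Semaphore`. The limit of 1 and the `fake_inference` coroutine are assumptions for illustration; the demo records peak concurrency to show that jobs never overlap.

```python
import asyncio

MAX_CONCURRENT_GPU_JOBS = 1  # assumed value; tune to available GPU memory


async def run_gated(semaphore, job):
    """Run one expensive job while holding a semaphore slot."""
    async with semaphore:
        return await job()


async def demo():
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_GPU_JOBS)
    active = 0
    peak = 0

    async def fake_inference():
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for GPU work
        active -= 1
        return "done"

    results = await asyncio.gather(
        *(run_gated(semaphore, fake_inference) for _ in range(4))
    )
    return results, peak
```

Real inference calls are usually blocking, so in practice they would be moved off the event loop (for example with `asyncio.to_thread`) while still holding the semaphore.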
Ask Codex To
- Generate Pydantic models and OpenAPI docs
- Implement the backend adapter interface
- Create job queue and result serialization
Step 5 — Storage, State, and Job Handling
Persist the right things, keep hot session state in the right place, and avoid long blocking requests
Object storage (R2 / S3)
- Upload raw media, overlays, completed masks, logs, and exports
- Return signed URLs or internal URLs to the client
- Use predictable object naming by user, session, and job IDs
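One way to get predictable object names is a single key-building helper. The `users/.../sessions/.../jobs/...` layout and the allowed character sets below are assumptions, not something the source prescribes.

```python
import re

_ID = re.compile(r"[A-Za-z0-9_-]+")
_ARTIFACT = re.compile(r"[A-Za-z0-9_-][A-Za-z0-9._-]*")


def object_key(user_id, session_id, job_id, artifact):
    """Build a storage key such as users/u1/sessions/s1/jobs/j1/masks.json."""
    for label, value in (
        ("user_id", user_id),
        ("session_id", session_id),
        ("job_id", job_id),
    ):
        if not _ID.fullmatch(value):
            raise ValueError(f"{label} contains unsupported characters: {value!r}")
    # Artifact names may contain dots but must not start with one.
    if not _ARTIFACT.fullmatch(artifact):
        raise ValueError(f"artifact name not allowed: {artifact!r}")
    return f"users/{user_id}/sessions/{session_id}/jobs/{job_id}/{artifact}"
```

Rejecting slashes and leading dots keeps client-supplied IDs from escaping their prefix, which matters once signed URLs are issued per key.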
Metadata / state store
- Store instance_id, worker state, session metadata, and job metadata
- Durable Object is enough for the control plane; SQLite or Redis can back the service
- Cache image embeddings when repeated prompts are likely
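A minimal sketch of an embedding cache with least-recently-used eviction, assuming embeddings are keyed by an image ID and the item limit is chosen to fit memory. The class name and the default of 8 items are illustrative.

```python
from collections import OrderedDict


class EmbeddingCache:
    """Small LRU cache: the least recently used embedding is evicted first."""

    def __init__(self, max_items=8):
        self._items = OrderedDict()
        self._max_items = max_items

    def get(self, image_id):
        if image_id not in self._items:
            return None
        self._items.move_to_end(image_id)  # mark as most recently used
        return self._items[image_id]

    def put(self, image_id, embedding):
        self._items[image_id] = embedding
        self._items.move_to_end(image_id)
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)  # drop the oldest entry
```

Counting items is the simplest rule; a byte-size budget would track GPU or host memory more closely.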
Background execution
- Queue video propagation and longer jobs
- Return 202 + job_id instead of holding the HTTP request open
- Poll GET /v1/jobs/{job_id} for status and artifacts
- Destroy the GPU worker after the idle timeout expires
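The 202-then-poll pattern can be sketched with an in-memory job store. The status names (`queued`, `running`, `succeeded`, `failed`) and the response shapes are assumptions; a real service would persist job records in the metadata store rather than a dict.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Job:
    job_id: str
    status: str = "queued"
    artifacts: dict = field(default_factory=dict)
    error: Optional[str] = None


class JobStore:
    def __init__(self):
        self._jobs = {}

    def submit(self):
        """Register a job and return the (status_code, body) for a 202 response."""
        job = Job(job_id=uuid.uuid4().hex)
        self._jobs[job.job_id] = job
        return 202, {"job_id": job.job_id, "status": job.status}

    def mark_running(self, job_id):
        self._jobs[job_id].status = "running"

    def complete(self, job_id, artifacts):
        job = self._jobs[job_id]
        job.status = "succeeded"
        job.artifacts = dict(artifacts)

    def fail(self, job_id, error):
        job = self._jobs[job_id]
        job.status = "failed"
        job.error = error

    def poll(self, job_id):
        """What the job-status endpoint would return."""
        job = self._jobs.get(job_id)
        if job is None:
            return 404, {"error": "unknown job"}
        return 200, {
            "job_id": job.job_id,
            "status": job.status,
            "artifacts": job.artifacts,
            "error": job.error,
        }
```

The client keeps polling until the status is terminal, then follows the artifact URLs.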
Session Rule
For video, maintain per-session inference state until the session is explicitly ended or times out.
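A sketch of the session rule as an in-memory manager with an idle timeout. The class name, the 600-second default, and the injected `clock` parameter are assumptions; the stored state is opaque here and would be the runtime's per-video tracking state in practice.

```python
import time


class VideoSessionManager:
    """Keeps per-video inference state until the session ends or idles out."""

    def __init__(self, ttl_seconds=600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._sessions = {}  # session_id -> (state, last_used)

    def open(self, session_id, state):
        self._sessions[session_id] = (state, self._clock())

    def get(self, session_id):
        """Return the state and refresh the idle timer; KeyError if missing or expired."""
        state, last_used = self._sessions[session_id]
        if self._clock() - last_used > self._ttl:
            del self._sessions[session_id]
            raise KeyError(session_id)
        self._sessions[session_id] = (state, self._clock())
        return state

    def end(self, session_id):
        """Explicit end of session; safe to call twice."""
        self._sessions.pop(session_id, None)

    def evict_expired(self):
        """Drop idle sessions and return their IDs so GPU memory can be freed."""
        now = self._clock()
        expired = [
            sid for sid, (_, used) in self._sessions.items() if now - used > self._ttl
        ]
        for sid in expired:
            del self._sessions[sid]
        return expired
```

Because the state lives in the worker process, the gateway must keep routing a session to the same worker for its whole lifetime.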
Operational Default
Lower resolution by default on Mac or constrained GPUs; make full quality opt-in.
Ask Codex To
- Write the storage abstraction
- Implement job records and status transitions
- Add cache and eviction rules
Execution Plan for You + Codex
Implement the system in small phases, using AI heavily for scaffolding and repeatable setup
API contract
Have Codex define endpoints, payloads, job states, and Pydantic models.
Local dev mode
Ask Codex to create a local FastAPI + MPS adapter so you can test the webapp contract on your Mac.
Worker container
Generate the Dockerfile, entrypoint, health checks, and SAM runtime wrapper.
Cloudflare control plane
Generate the Worker, state machine, and wake / proxy logic.
Vast automation
Generate scripts for searching offers, creating instances, polling readiness, and destroying idle workers.
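The readiness-polling part can be sketched without touching any provider API. Here `check` is a hypothetical callable (for example, one that requests the worker's readiness endpoint and returns True on success); the delays and timeout are assumed defaults, and `sleep` and `clock` are injectable so the loop can be tested without waiting.

```python
import time


def wait_until_ready(
    check,
    timeout_s=600.0,
    initial_delay_s=2.0,
    max_delay_s=30.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Call check() with exponential backoff until it returns True.

    Returns the number of attempts made; raises TimeoutError if the next
    wait would pass the deadline.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts
        if clock() + delay > deadline:
            raise TimeoutError(f"worker not ready after {attempts} attempts")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off, but cap the wait
```

On timeout the caller would mark the worker as failed and destroy the instance so it does not keep billing.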
Ops hardening
Add auth, CORS, structured logs, request IDs, retries, and monitoring.
Prototype locally on the Mac, keep the API contract portable, and use Cloudflare + Vast.ai for the real hosted architecture.