Auto-Scaling Supabase Studio on GCP with Managed Instance Groups
In previous posts, we covered the database, networking, and security layers of our self-hosted Supabase setup. Now let's talk about the application layer: how we deploy and scale Supabase Studio.
Our goals:
- Auto-scaling — Handle traffic spikes without manual intervention
- Auto-healing — Replace unhealthy instances automatically
- Zero-downtime updates — Deploy new versions without interruption
- Cost efficiency — Scale down when traffic is low
What is Supabase Studio?
Supabase Studio is the web-based admin interface for managing your Supabase/PostgreSQL database. It includes:
- SQL Editor — Run queries directly
- Table Editor — Visual database management
- API Documentation — Auto-generated from your schema
- Auth Management — User administration
It's open source and runs as a Docker container, making it perfect for self-hosting.
The architecture
┌─────────────────────────────────────────────────────┐
│ Global Load Balancer │
│ (SSL Termination) │
└───────────────────────┬─────────────────────────────┘
│
┌───────────────────────▼─────────────────────────────┐
│ Regional Instance Group │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Zone A │ │ Zone B │ │
│ │ ┌────────┐ │ │ ┌────────┐ │ │
│ │ │ VM │ │ │ │ VM │ │ │
│ │ │ Studio │ │ │ │ Studio │ │ │
│ │ └────────┘ │ │ └────────┘ │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Cloud SQL (Private IP) │
└─────────────────────────────────────────────────────┘
Instance Template
The foundation of auto-scaling is the instance template. It defines what each VM looks like:
this.instanceTemplate = new gcp.compute.InstanceTemplate(
`${resourceName}-template`,
{
namePrefix: `${resourceName}-template-`,
machineType: machineType,
region: region,
tags: ["http-server", "allow-ssh"],
disks: [
{
sourceImage: "projects/cos-cloud/global/images/family/cos-stable",
autoDelete: true,
boot: true,
diskSizeGb: 50,
diskType: "hyperdisk-balanced",
},
],
networkInterfaces: [
{
network: args.networking.vpc.id,
subnetwork: args.networking.subnet.id,
// No external IP - instances are private
},
],
serviceAccount: {
email: this.serviceAccount.email,
scopes: ["cloud-platform"],
},
metadataStartupScript: startupScript,
metadata: {
"google-logging-enabled": "true",
"google-monitoring-enabled": "true",
},
shieldedInstanceConfig: {
enableSecureBoot: false, // COS needs this off for Docker
enableVtpm: true,
enableIntegrityMonitoring: true,
},
},
{
parent: this,
dependsOn: [args.database.instance, args.database.database, args.database.user],
},
);
Key decisions
Container-Optimized OS
We use cos-stable (Container-Optimized OS). It's:
- Minimal and secure
- Optimized for running Docker
- Kept up to date with automatic OS updates
- Maintained by Google
No external IP
Instances don't have public IPs. They access the internet through Cloud NAT (for pulling Docker images) and are accessed through the load balancer.
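The Cloud NAT piece itself lives in the networking layer from the earlier post. For reference, here is a minimal sketch of what it looks like in Pulumi (resource names are illustrative, not the exact ones from our stack):
const router = new gcp.compute.Router("studio-nat-router", {
  region: region,
  network: args.networking.vpc.id,
});
const nat = new gcp.compute.RouterNat("studio-nat", {
  region: region,
  router: router.name,
  // Let GCP allocate the NAT IPs and NAT every subnet in the VPC.
  natIpAllocateOption: "AUTO_ONLY",
  sourceSubnetworkIpRangesToNat: "ALL_SUBNETWORKS_ALL_IP_RANGES",
});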
Hyperdisk Balanced
For the boot disk, we use hyperdisk-balanced. It offers better IOPS than standard persistent disk, which helps with container startup times.
Dependencies
The template depends on the database being ready. No point starting Studio if it can't connect to PostgreSQL.
The startup script
This is where the magic happens. When a VM boots, it runs this script to configure and start Supabase Studio:
#!/bin/bash
set -euo pipefail
exec > >(tee /var/log/supabase-startup.log) 2>&1
echo "Starting Supabase Studio setup at $(date)"
# Wait for Docker
for i in {1..30}; do
if docker info > /dev/null 2>&1; then
break
fi
echo "Waiting for Docker... ($i/30)"
sleep 2
done
# Environment variables (replaced at deploy time)
DB_HOST="__DB_HOST__"
DB_NAME="__DB_NAME__"
DB_PASSWORD="__DB_PASSWORD__"
AUTH_USER="__AUTH_USER__"
AUTH_PASSWORD="__AUTH_PASSWORD__"
ANON_KEY="__ANON_KEY__"
SERVICE_ROLE_KEY="__SERVICE_ROLE_KEY__"
JWT_SECRET="__JWT_SECRET__"
# Create network for containers
docker network create supabase-network 2>/dev/null || true
# Start postgres-meta (API for database introspection)
docker run -d \
--name postgres-meta \
--restart always \
--network supabase-network \
-e "PG_META_HOST=0.0.0.0" \
-e "PG_META_PORT=8080" \
-e "PG_META_DB_HOST=${DB_HOST}" \
-e "PG_META_DB_PORT=5432" \
-e "PG_META_DB_NAME=${DB_NAME}" \
-e "PG_META_DB_USER=postgres" \
-e "PG_META_DB_PASSWORD=${DB_PASSWORD}" \
supabase/postgres-meta:latest
# Wait for postgres-meta
for i in {1..30}; do
if curl -s http://localhost:8080/health > /dev/null; then
break
fi
echo "Waiting for postgres-meta... ($i/30)"
sleep 2
done
# Start Supabase Studio
docker run -d \
--name studio \
--restart always \
--network supabase-network \
-p 3000:3000 \
-e "STUDIO_PG_META_URL=http://postgres-meta:8080" \
-e "SUPABASE_URL=http://localhost:8000" \
-e "SUPABASE_PUBLIC_URL=http://localhost:8000" \
-e "SUPABASE_ANON_KEY=${ANON_KEY}" \
-e "SUPABASE_SERVICE_KEY=${SERVICE_ROLE_KEY}" \
-e "AUTH_JWT_SECRET=${JWT_SECRET}" \
-e "DEFAULT_ORGANIZATION_NAME=Acme Corp" \
-e "DEFAULT_PROJECT_NAME=Production" \
-e "NEXT_PUBLIC_SITE_URL=http://localhost:3000" \
-e "NEXT_ANALYTICS_BACKEND_PROVIDER=postgres" \
supabase/studio:latest
echo "Supabase Studio setup complete at $(date)"Secret injection
Notice the __DB_HOST__, __DB_PASSWORD__, etc. placeholders. These are replaced at deployment time:
const startupScript = pulumi.all([args.secrets, args.database.privateIp]).apply(([s, dbHost]) => {
return startupScriptTemplate
.replace(/__DB_HOST__/g, dbHost)
.replace(/__DB_NAME__/g, s.infra.postgresDb)
.replace(/__DB_PASSWORD__/g, s.infra.postgresPassword)
.replace(/__AUTH_USER__/g, s.studio.authUser)
.replace(/__AUTH_PASSWORD__/g, s.studio.authPassword)
.replace(/__ANON_KEY__/g, s.studio.anonKey)
.replace(/__SERVICE_ROLE_KEY__/g, s.studio.serviceRoleKey)
.replace(/__JWT_SECRET__/g, s.studio.jwtSecret);
});
This keeps the secret values out of the checked-in script template and out of the Pulumi source, while still injecting them into the startup script at deploy time.
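One optional hardening step (a sketch, not something the code above does): wrap the rendered script in pulumi.secret() so the interpolated values are also encrypted in Pulumi state. Keep in mind the script still ends up in instance metadata, so project IAM remains the real boundary for who can read it.
// Hypothetical: mark the rendered script as a secret so it is encrypted in the state file.
const secretStartupScript = pulumi.secret(startupScript);
// Pass secretStartupScript to metadataStartupScript instead of the plain value.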
Two containers
We run two containers:
- postgres-meta — Provides the API that Studio uses for database introspection
- studio — The actual Supabase Studio web UI
They communicate over a Docker network.
Health checks
Health checks are critical for auto-healing. If an instance fails the health check, it gets replaced.
this.healthCheck = new gcp.compute.HealthCheck(
`${resourceName}-health-check`,
{
name: `${resourceName}-health-check`,
checkIntervalSec: 10,
timeoutSec: 5,
healthyThreshold: 2,
unhealthyThreshold: 3,
httpHealthCheck: {
port: 3000,
requestPath: "/api/platform/profile",
},
},
{ parent: this },
);
Configuration explained
| Setting | Value | Meaning |
|---|---|---|
| checkIntervalSec | 10 | Check every 10 seconds |
| timeoutSec | 5 | Request must respond in 5 seconds |
| healthyThreshold | 2 | 2 consecutive successes = healthy |
| unhealthyThreshold | 3 | 3 consecutive failures = unhealthy |
With these settings, a broken instance is flagged within roughly 30 seconds (3 failures × 10-second interval), and a recovering one is marked healthy again after about 20 seconds of consecutive passes.
The health endpoint
We hit /api/platform/profile on port 3000. This endpoint:
- Returns 200 if Studio is running and connected to the database
- Returns an error if something's wrong
Choosing the right health endpoint matters. A simple / might return 200 even if the database connection is broken. We want to verify the full stack is working.
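If an instance keeps flapping, it helps to see the probe results themselves. Compute health checks can log health-state transitions; a sketch of enabling that (we did not need it in the final setup, and the resource name is illustrative):
const healthCheckWithLogs = new gcp.compute.HealthCheck("studio-health-check-debug", {
  checkIntervalSec: 10,
  timeoutSec: 5,
  healthyThreshold: 2,
  unhealthyThreshold: 3,
  httpHealthCheck: { port: 3000, requestPath: "/api/platform/profile" },
  // Writes an entry to Cloud Logging whenever an endpoint changes health state.
  logConfig: { enable: true },
});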
Managed Instance Group
The instance group manages the VMs:
this.instanceGroupManager = new gcp.compute.RegionInstanceGroupManager(
`${resourceName}-mig`,
{
name: `${resourceName}-mig`,
region: region,
baseInstanceName: resourceName,
targetSize: minInstances,
distributionPolicyZones: [`${region}-a`, `${region}-b`],
versions: [{ instanceTemplate: this.instanceTemplate.selfLinkUnique }],
namedPorts: [{ name: "http", port: 3000 }],
autoHealingPolicies: {
healthCheck: this.healthCheck.id,
initialDelaySec: 300,
},
updatePolicy: {
type: "PROACTIVE",
minimalAction: "REPLACE",
maxSurgeFixed: 2,
maxUnavailableFixed: 0,
replacementMethod: "SUBSTITUTE",
},
},
{ parent: this },
);
Regional distribution
We spread instances across two zones (region-a and region-b). If one zone has issues, the other keeps serving traffic.
Auto-healing
The autoHealingPolicies configuration:
- Uses our health check to monitor instances
- initialDelaySec: 300 — Wait 5 minutes before checking (startup takes time)
- Unhealthy instances are automatically terminated and replaced
Update policy
The updatePolicy controls how new versions roll out:
| Setting | Value | Effect |
|---|---|---|
| type | PROACTIVE | Apply updates immediately |
| minimalAction | REPLACE | Create new instances (don't just restart) |
| maxSurgeFixed | 2 | Create up to 2 extra instances during update |
| maxUnavailableFixed | 0 | Never have fewer than target instances |
| replacementMethod | SUBSTITUTE | Delete old, create new (vs. recreate in-place) |
This gives us zero-downtime deployments:
- New instances are created with the new template
- Once healthy, traffic shifts to them
- Old instances are terminated
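The versions field on the MIG also supports canary-style rollouts. A hedged sketch, assuming a hypothetical second template called canaryTemplate that you want on exactly one instance while the rest stay on the stable template:
// Inside the RegionInstanceGroupManager args, replacing the single-version list above:
versions: [
  { name: "stable", instanceTemplate: this.instanceTemplate.selfLinkUnique },
  {
    name: "canary",
    instanceTemplate: canaryTemplate.selfLinkUnique, // hypothetical second template
    targetSize: { fixed: 1 },
  },
],
Once the canary looks healthy, promoting it is just a matter of making it the only version in the list.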
Autoscaler
The autoscaler adjusts instance count based on load:
this.autoscaler = new gcp.compute.RegionAutoscaler(
`${resourceName}-autoscaler`,
{
name: `${resourceName}-autoscaler`,
region: region,
target: this.instanceGroupManager.id,
autoscalingPolicy: {
minReplicas: minInstances,
maxReplicas: maxInstances,
cooldownPeriod: 60,
cpuUtilization: { target: 0.7 },
},
},
{ parent: this },
);
Scaling policy
- minReplicas — Never go below this (1 for dev, 2 for prod)
- maxReplicas — Never exceed this (keeps costs bounded)
- cooldownPeriod — Wait 60 seconds between scaling decisions
- cpuUtilization.target — Scale up when average CPU exceeds 70%
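If scale-in ever feels too aggressive (instances disappearing right after a short spike), the policy also accepts scale-in controls. A sketch of what adding one to the autoscalingPolicy above might look like (values are illustrative):
autoscalingPolicy: {
  minReplicas: minInstances,
  maxReplicas: maxInstances,
  cooldownPeriod: 60,
  cpuUtilization: { target: 0.7 },
  // Remove at most one instance per 10-minute window, no matter how far CPU drops.
  scaleInControl: {
    maxScaledInReplicas: { fixed: 1 },
    timeWindowSec: 600,
  },
},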
Environment differences
| Setting | Dev | Prod |
|---|---|---|
| minReplicas | 1 | 2 |
| maxReplicas | 1 | 2 |
For dev, we fix at 1 instance (cost savings). For prod, we ensure at least 2 for high availability.
Backend service
The backend service connects the load balancer to the instance group:
this.backendService = new gcp.compute.BackendService(
`${resourceName}-backend`,
{
name: `${resourceName}-backend`,
protocol: "HTTP",
portName: "http",
timeoutSec: 30,
healthChecks: this.healthCheck.id,
securityPolicy: securityPolicy?.selfLink,
backends: [
{
group: this.instanceGroupManager.instanceGroup,
balancingMode: "UTILIZATION",
capacityScaler: 1.0,
},
],
logConfig: { enable: true, sampleRate: 1.0 },
},
{ parent: this },
);
Connection draining
When an instance is being removed, GCP waits for in-flight requests to finish before terminating it. That drain window is configured separately on the backend service (connectionDrainingTimeoutSec); the timeoutSec: 30 above is the request timeout, i.e. how long the load balancer waits for a backend to respond before giving up.
Logging
logConfig: { enable: true, sampleRate: 1.0 } logs every request. In production, you might reduce sampleRate to save on logging costs.
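For example, a per-environment sample rate might look like this (a sketch; config.environment is the same flag used elsewhere in the stack):
logConfig: {
  enable: true,
  // Log every request in dev, but only a 10% sample in prod to keep Cloud Logging costs down.
  sampleRate: config.environment === "prod" ? 0.1 : 1.0,
},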
Load balancer
We covered the load balancer in detail in the infrastructure post. Key points:
- Global IP — Single static IP for DNS
- SSL termination — Google-managed certificate
- HTTP to HTTPS redirect — All traffic forced to HTTPS
this.loadBalancer = new LoadBalancer(
`${resourceName}-lb`,
{
name: resourceName,
domain: args.domain,
backendService: this.backendService,
},
{ parent: this },
);
Putting it all together
Here's the complete service component:
const supabaseStudio = new SupabaseStudioService("supabase-studio", {
name: "supabase-studio",
projectId: config.projectId,
region: config.region,
domain: `db.${config.environment}.example.com`,
networking: networking,
database: database,
secrets: secrets,
machineType: "c4d-standard-2",
minInstances: config.environment === "prod" ? 2 : 1,
maxInstances: config.environment === "prod" ? 2 : 1,
vpnPublicIp: vpn?.publicIp,
});
Machine type
We use c4d-standard-2:
- 2 vCPUs
- 8 GB RAM
- Part of the general-purpose C4D series (AMD-based)
This is plenty for Supabase Studio. The containers are lightweight.
Deployment workflow
When we run pulumi up:
- Template changes? If the startup script or machine config changed, a new template is created
- Rolling update — New instances are created with the new template
- Health check passes — Traffic shifts to new instances
- Old instances terminated — Previous version instances are deleted
The whole process takes 5-10 minutes and requires no manual intervention.
Monitoring
Built-in metrics
With google-monitoring-enabled: true, we get the following out of the box (see the alert sketch after this list):
- CPU utilization
- Memory usage
- Disk I/O
- Network traffic
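These metrics are easy to alert on. A hedged sketch of a CPU alert defined with Pulumi (the threshold and notification wiring are placeholders, not what we actually run):
const cpuAlert = new gcp.monitoring.AlertPolicy("studio-high-cpu", {
  displayName: "Supabase Studio: sustained high CPU",
  combiner: "OR",
  conditions: [
    {
      displayName: "Instance CPU above 85% for 5 minutes",
      conditionThreshold: {
        filter:
          'resource.type = "gce_instance" AND metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        comparison: "COMPARISON_GT",
        thresholdValue: 0.85,
        duration: "300s",
        aggregations: [{ alignmentPeriod: "60s", perSeriesAligner: "ALIGN_MEAN" }],
      },
    },
  ],
  // notificationChannels: [...], // assumes channels created elsewhere
});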
Custom health checks
For deeper monitoring, we could add:
- Database connection latency
- Container startup time
- API response times
Logging
Container logs go to Cloud Logging via google-logging-enabled: true. We can:
- Search and filter logs
- Set up alerts on error patterns
- Export to BigQuery for analysis
Common issues and fixes
1. Startup too slow
If instances take too long to start, they might fail health checks during boot. Solutions:
- Increase initialDelaySec in the auto-healing policy
- Optimize Docker image pulls (use regional mirrors)
- Pre-pull images in a base image
2. Memory pressure
If instances run out of memory:
- Increase machine type
- Add swap (not ideal but works)
- Optimize container memory limits
3. Database connection pool exhaustion
Each Studio instance opens connections to the database. With many instances:
- Monitor max_connections in PostgreSQL
- Consider connection pooling (PgBouncer)
- Limit concurrent Studio instances
4. Slow health check response
If the health endpoint is slow:
- Check database connection latency
- Increase health check timeout
- Optimize the health endpoint
Cost optimization
Right-size instances
Start small. Monitor CPU and memory. Scale up only if needed.
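If dev doesn't need the same headroom as prod, the machine type can be parameterized the same way the instance counts are. A one-line sketch (the e2-small fallback is illustrative, not part of our setup):
machineType: config.environment === "prod" ? "c4d-standard-2" : "e2-small",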
Use committed use discounts
For predictable workloads, committed use discounts save 20-50%.
Preemptible/Spot instances
For dev environments, preemptible instances cost 60-80% less. They can be terminated with 30 seconds notice, but for non-production, that's usually fine.
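A sketch of what opting dev instances into Spot pricing could look like in the instance template (we have not enabled this; field values are illustrative):
// Inside the InstanceTemplate args, applied only to dev:
scheduling:
  config.environment === "dev"
    ? {
        provisioningModel: "SPOT",
        preemptible: true,
        automaticRestart: false, // required for Spot/preemptible VMs
      }
    : undefined,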
What's next
In the final post of this series, we'll preview our plans for multi-region read replicas — bringing the database closer to users around the world.
Need help deploying auto-scaling applications on GCP? Get in touch — we've built infrastructure for everything from MVPs to enterprise workloads.
