How do I scale MatCraft for large workloads?

MatCraft is designed to scale horizontally. Here is how to handle increasing numbers of users, campaigns, and evaluations.

Scaling the API Layer

The FastAPI backend is stateless — all session state is stored in the database and Redis. To handle more concurrent users:

```bash
# Scale API instances behind a load balancer
docker compose up -d --scale api=4
```

Each API instance can handle approximately 500 concurrent connections (including WebSocket). For 100 concurrent dashboard users, 2-3 instances are sufficient.

Key considerations:

  • Use sticky sessions (session affinity) at the load balancer so each WebSocket connection stays pinned to a single instance.
  • All instances share the same database and Redis, so no state synchronization is needed (illustrated in the sketch below).
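
In practice, statelessness means any instance can serve any request. Below is a minimal sketch of the pattern, assuming the redis-py client and a hypothetical `session:{id}` key scheme rather than MatCraft's actual session model:

```python
# Sketch only: hypothetical session lookup showing why API
# instances need no local state. Assumes redis-py and a
# "session:{id}" key scheme, which may differ from MatCraft's.
import json

import redis
from fastapi import FastAPI, HTTPException

app = FastAPI()
r = redis.Redis(host="redis", port=6379, decode_responses=True)

@app.get("/session/{session_id}")
def get_session(session_id: str):
    # Any replica can answer: the state lives in Redis, not in
    # process memory, so the load balancer can pick freely.
    raw = r.get(f"session:{session_id}")
    if raw is None:
        raise HTTPException(status_code=404, detail="unknown session")
    return json.loads(raw)
```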

Scaling Workers

Workers are the primary bottleneck for campaign throughput. Each worker process handles one task at a time:

```bash
# Scale horizontally (more worker containers)
docker compose up -d --scale worker=8

# Scale vertically (more processes per worker)
celery -A materia.tasks worker --concurrency=8
```

Rules of thumb:

  • Each campaign iteration occupies one worker process for 2-30 seconds.
  • For N campaigns running concurrently, you need approximately N/2 worker processes, since campaigns spend part of their time waiting between iterations. For example, 16 concurrent campaigns need roughly 8 worker processes.
  • CPU-bound tasks (CMA-ES, surrogate training): Set concurrency equal to the number of CPU cores.
  • GPU-bound tasks (large surrogate training): Set concurrency to 1 per GPU.
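
These rules assume a worker picks up a new task only when it is actually free. A short sketch of Celery settings that enforce this for long-running tasks follows; the broker URL and values are illustrative, not MatCraft's shipped configuration:

```python
# Illustrative Celery settings for long-running (2-30 s) tasks.
from celery import Celery

# Broker URL is an example; use your deployment's Redis address.
app = Celery("materia.tasks", broker="redis://redis:6379/0")

# Fetch one task at a time so a busy worker does not hoard tasks
# that an idle worker could start immediately.
app.conf.worker_prefetch_multiplier = 1
# Acknowledge after completion so a task is re-queued if its
# worker dies mid-run.
app.conf.task_acks_late = True
```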

Priority Queues

For mixed workloads, use priority queues to ensure interactive requests are not blocked by batch jobs:

```python
# In task definitions
@app.task(queue="high_priority")
def train_surrogate(campaign_id: str):
    ...

@app.task(queue="low_priority")
def export_results(campaign_id: str):
    ...
```

```bash
# Dedicated workers per queue
celery -A materia.tasks worker --queues=high_priority --concurrency=4
celery -A materia.tasks worker --queues=low_priority --concurrency=2
```
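
Queue assignment can also be centralized with Celery's task_routes setting instead of the per-task `queue=` argument; the dotted paths below assume the tasks above live in materia/tasks.py:

```python
# Central routing table, equivalent to tagging each task at its
# definition. Continues the module above, where `app` is the
# Celery application object.
app.conf.task_routes = {
    "materia.tasks.train_surrogate": {"queue": "high_priority"},
    "materia.tasks.export_results": {"queue": "low_priority"},
}
```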

Scaling the Database

PostgreSQL is typically the last component to become a bottleneck. Strategies for scaling:

  1. Connection pooling: Use PgBouncer between the application and PostgreSQL. This reduces the number of direct database connections from (API instances + worker processes) × connections per process to a manageable pool; see the sketch after this list.
  2. Read replicas: Route read-heavy dashboard queries (listing campaigns, viewing results) to a read replica; write operations (creating candidates, storing measurements) go to the primary. This is also shown in the sketch below.
  3. Partitioning: For very large deployments (>1M candidates), partition the candidates table by campaign ID:
```sql
CREATE TABLE candidates (
    id UUID NOT NULL,
    campaign_id UUID NOT NULL,
    ...
    PRIMARY KEY (id, campaign_id)  -- partition key must be part of the PK
) PARTITION BY HASH (campaign_id);

-- One table per hash bucket; repeat for remainders 1 through 3
CREATE TABLE candidates_p0 PARTITION OF candidates
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
```
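
For the first two strategies, here is a minimal sketch of the application side, assuming SQLAlchemy, hypothetical DSNs, and a `campaigns` table; it illustrates the routing idea, not MatCraft's actual database layer:

```python
# Sketch only: route writes through PgBouncer, reads to a replica.
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# With PgBouncer in transaction-pooling mode, let it do the
# pooling: NullPool avoids stacking a second pool on top.
primary = create_engine(
    "postgresql+psycopg2://matcraft@pgbouncer:6432/matcraft",
    poolclass=NullPool,
)
# Read-heavy dashboard queries go straight to the replica.
replica = create_engine(
    "postgresql+psycopg2://matcraft@replica:5432/matcraft",
    poolclass=NullPool,
)

def list_campaigns():
    # Pure read, so the replica can serve it.
    with replica.connect() as conn:
        return conn.execute(text("SELECT id, name FROM campaigns")).all()
```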

Auto-Scaling

On Kubernetes or ECS, configure horizontal pod auto-scaling based on:

  • API pods: Scale on CPU utilization (target: 60%) or request rate.
  • Worker pods: Scale on Redis queue depth (target: <10 pending tasks per worker).
```yaml
# Kubernetes HPA example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: matcraft-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: matcraft-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
```
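
Kubernetes does not expose redis_queue_depth on its own: an exporter has to publish it, and an adapter (for example prometheus-adapter) has to surface it as an External metric. A minimal sketch of such an exporter, assuming the Redis broker's default layout where each Celery queue is a Redis list named after the queue, and the prometheus_client library; the exporter itself is hypothetical:

```python
# Hypothetical queue-depth exporter for the HPA metric above.
import time

import redis
from prometheus_client import Gauge, start_http_server

QUEUES = ["celery", "high_priority", "low_priority"]
depth = Gauge("redis_queue_depth", "Pending Celery tasks", ["queue"])
r = redis.Redis(host="redis", port=6379)

if __name__ == "__main__":
    start_http_server(9100)  # serves /metrics for Prometheus to scrape
    while True:
        for q in QUEUES:
            # Each Celery queue on a Redis broker is a list; LLEN
            # gives the number of pending tasks.
            depth.labels(queue=q).set(r.llen(q))
        time.sleep(15)
```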
