Does MatCraft support GPU acceleration?

Yes. MatCraft supports GPU acceleration for surrogate model training, which can provide a 5-20x speedup for large datasets and complex model architectures.

When GPU Helps

GPU acceleration is most beneficial when:

  • Large datasets: More than 1,000 training points. For smaller datasets, the overhead of moving data to GPU memory outweighs the computation benefit.
  • Large models: Hidden layers with 256+ neurons or 3+ layers. Small models (e.g., [64, 64]) train fast enough on CPU.
  • Frequent retraining: Active learning campaigns retrain the surrogate every iteration. Faster training means faster campaign completion.
  • Ensemble surrogates: Training 5-10 independent models benefits significantly from GPU parallelism.
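As a rough rule of thumb, the thresholds above can be folded into a small helper. This is a hypothetical sketch (`choose_device` is not part of the MatCraft API); at runtime the availability flag would come from `torch.cuda.is_available()`:

```python
def choose_device(n_points: int, hidden_layers: list, cuda_available: bool) -> str:
    """Pick a training device from the rough thresholds above.

    Hypothetical helper, not part of MatCraft; pass
    torch.cuda.is_available() as cuda_available at runtime.
    """
    large_data = n_points > 1_000
    large_model = max(hidden_layers) >= 256 or len(hidden_layers) >= 3
    if cuda_available and (large_data or large_model):
        return "cuda"
    return "cpu"

print(choose_device(100, [64, 64], cuda_available=True))            # → cpu
print(choose_device(10_000, [256, 256, 128], cuda_available=True))  # → cuda
```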

Installation

Install the GPU-enabled version:

```bash
pip install matcraft[gpu]
```

This installs PyTorch with CUDA support. Verify GPU availability:

```python
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```

Configuration

Enable GPU training in your campaign config:

```yaml
surrogate:
  type: mlp
  device: cuda        # "cpu" or "cuda" (default: auto-detect)
  hidden_layers: [256, 256, 128]
  epochs: 500
  batch_size: 256     # Larger batch sizes benefit more from GPU
```
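For context on the `batch_size` comment: each epoch runs one forward/backward pass per batch, so larger batches mean fewer, larger GPU kernel launches. For example, with 10,000 training points and `batch_size: 256`:

```python
import math

n_points, batch_size = 10_000, 256
print(math.ceil(n_points / batch_size))  # → 40 batches per epoch
```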

Or in Python:

```python
from matcraft.surrogate import MLPSurrogate

surrogate = MLPSurrogate(
    hidden_layers=[256, 256, 128],
    device="cuda",
    batch_size=256,
)
```

Docker with GPU

Use the NVIDIA Container Toolkit:

```bash
# Install NVIDIA Container Toolkit
# (See: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

# Run with GPU support
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
```

The GPU compose override:

```yaml
# docker-compose.gpu.yml
services:
  worker:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      MATCRAFT_DEVICE: cuda
```
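Outside Docker, the same `MATCRAFT_DEVICE` variable can be honoured with a small shim. This is a sketch; the variable name comes from the compose override above, and the CPU fallback is an assumption, not documented MatCraft behaviour:

```python
import os

def device_from_env(default: str = "cpu") -> str:
    # MATCRAFT_DEVICE as set in the compose override; fall back when unset
    return os.environ.get("MATCRAFT_DEVICE", default)

os.environ["MATCRAFT_DEVICE"] = "cuda"
print(device_from_env())  # → cuda
```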

Supported GPUs

| GPU | Status | Notes |
|-----|--------|-------|
| NVIDIA A100 | Fully supported | Best performance, recommended for production |
| NVIDIA V100 | Fully supported | Good performance |
| NVIDIA T4 | Fully supported | Cost-effective for cloud (AWS g4dn instances) |
| NVIDIA RTX 3090/4090 | Supported | Good for local development |
| NVIDIA RTX 3060/4060 | Supported | Adequate for small to medium models |
| AMD GPUs (ROCm) | Experimental | Requires PyTorch ROCm build |
| Apple M1/M2/M3 (MPS) | Experimental | Use `device: mps` |
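The preference order implied by the table (CUDA first, then the experimental Apple MPS backend, then CPU) can be sketched as a pure function. `pick_backend` is a hypothetical helper, not MatCraft API; at runtime the two flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`:

```python
def pick_backend(cuda_ok: bool, mps_ok: bool) -> str:
    """Backend preference from the table above (hypothetical helper)."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"  # experimental, per the table
    return "cpu"

print(pick_backend(False, True))  # → mps
```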

Performance Benchmarks

Training time for an MLP surrogate with [256, 256, 128] architecture, 200 epochs:

| Dataset Size | CPU (i7-12700) | GPU (RTX 3090) | Speedup |
|--------------|----------------|----------------|---------|
| 100 points | 2.1s | 1.8s | 1.2x |
| 1,000 points | 8.3s | 1.9s | 4.4x |
| 10,000 points | 45.2s | 3.1s | 14.6x |
| 100,000 points | 412s | 21.4s | 19.3x |
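The speedup column is simply the ratio of the CPU and GPU timings, which makes the diminishing returns at small dataset sizes easy to see:

```python
benchmarks = {  # points: (cpu_seconds, gpu_seconds), from the table above
    100: (2.1, 1.8),
    1_000: (8.3, 1.9),
    10_000: (45.2, 3.1),
    100_000: (412.0, 21.4),
}
for n, (cpu, gpu) in benchmarks.items():
    print(f"{n:>7,} points: {cpu / gpu:.1f}x")
```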

For typical active learning campaigns with <1,000 data points, CPU training is fast enough. GPU becomes valuable for large-scale campaigns or when running many campaigns concurrently.

Cloud GPU Instances

Recommended cloud instances for GPU workers:

  • AWS: g4dn.xlarge (T4, $0.52/hr) for cost-effective training
  • GCP: n1-standard-4 with T4 ($0.35/hr + $0.35/hr GPU)
  • Azure: Standard_NC4as_T4_v3 (T4, $0.53/hr)