Why Serverless ML
Serverless platforms like AWS Lambda are attractive for low-latency inference at unpredictable scale: you pay only for actual compute time, not for idle GPU hours on an always-on instance.
Use cases that benefit from serverless ML:
- Sporadic inference requests (not continuous streaming)
- Cost-sensitive applications
- Auto-scaling requirements
- Multi-tenant SaaS with variable load
Challenges
The main challenges of deploying PyTorch to Lambda:
- Package size: zip deployments are capped at 250MB unzipped (50MB zipped), well below a typical PyTorch install
- Cold start: Model loading can take 5-10 seconds
- Memory: Limited to 10GB maximum
- CPU-only: No GPU support (yet)
The Solution
Use container images (up to 10GB) with multi-stage builds and quantization.
```dockerfile
# Multi-stage Dockerfile for PyTorch Lambda
FROM public.ecr.aws/lambda/python:3.11 AS builder

# Install dependencies into a staging directory
RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip install -r requirements.txt -t /asset

# Production stage
FROM public.ecr.aws/lambda/python:3.11

# Copy only runtime dependencies
COPY --from=builder /asset /var/task

# Copy model artifacts and inference code (the handler loads from ./model)
COPY model/ /var/task/model/
COPY handler.py /var/task/
CMD ["handler.lambda_handler"]
```
Inference Handler
```python
import torch
import json
from transformers import AutoTokenizer, AutoModel

# Global model (loaded once per container)
MODEL = None
TOKENIZER = None

def load_model():
    global MODEL, TOKENIZER
    if MODEL is None:
        MODEL = AutoModel.from_pretrained(
            "./model",
            torch_dtype=torch.float16  # Half precision
        )
        MODEL.eval()
        TOKENIZER = AutoTokenizer.from_pretrained("./model")
    return MODEL, TOKENIZER

def lambda_handler(event, context):
    model, tokenizer = load_model()
    input_text = json.loads(event['body'])['text']

    # Tokenize and run inference
    inputs = tokenizer(input_text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    return {
        'statusCode': 200,
        'body': json.dumps({
            'embeddings': outputs.last_hidden_state.tolist()
        })
    }
```
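Before pushing the image, it is worth exercising the handler locally with a fake event. A minimal smoke test; the event shape assumes the function sits behind an API Gateway-style proxy that delivers a JSON body:

```python
import json
from handler import lambda_handler

# Simulated proxy event -- the payload shape is an assumption about the front end
event = {"body": json.dumps({"text": "serverless inference with pytorch"})}

response = lambda_handler(event, None)
embeddings = json.loads(response["body"])["embeddings"]
print(response["statusCode"], len(embeddings[0]))  # 200, number of tokens in the sequence
```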
Optimization Techniques
1. Model Quantization
```python
import torch
from torch.quantization import quantize_dynamic

# Dynamic quantization (reduces size by 75%)
model_quantized = quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)
torch.save(model_quantized.state_dict(), 'model_quantized.pt')
```
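A quantized state_dict only loads back into a model that has been quantized the same way, so the load path has to re-apply quantize_dynamic before calling load_state_dict. A sketch of that load path, reusing the paths from the examples above:

```python
import torch
from torch.quantization import quantize_dynamic
from transformers import AutoModel

# Rebuild the architecture, quantize it identically, then load the quantized weights
model = AutoModel.from_pretrained("./model")
model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
model.load_state_dict(torch.load("model_quantized.pt"))
model.eval()
```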
2. Provisioned Concurrency
Mitigate cold starts by keeping containers warm:
```bash
# Provisioned concurrency attaches to a published version or alias (here, an alias named "prod")
aws lambda put-provisioned-concurrency-config \
  --function-name ml-inference \
  --qualifier prod \
  --provisioned-concurrent-executions 5
```
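The same setting can be managed from a deployment script. A minimal boto3 sketch; the alias name is a placeholder:

```python
import boto3

lambda_client = boto3.client("lambda")

# Keep 5 execution environments initialized for the "prod" alias (placeholder name)
lambda_client.put_provisioned_concurrency_config(
    FunctionName="ml-inference",
    Qualifier="prod",
    ProvisionedConcurrentExecutions=5,
)
```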
3. Lazy Loading
Load model weights only on the first invocation and cache them in module-level globals, so warm invocations skip the load entirely; the load_model function in the handler above already follows this pattern.
With these optimizations, we achieved sub-200ms inference latency at 1/10th the cost of dedicated GPU instances.
Cost Comparison
| Deployment | Cost per 1M inferences | Cold start |
|---|---|---|
| EC2 g4dn.xlarge | $150 | N/A (always on) |
| Lambda (optimized) | $12 | ~500ms |
| SageMaker Serverless | $45 | ~2s |
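As a rough sanity check on the Lambda figure: per-invocation cost is memory (GB) × billed duration (s) × the GB-second price, plus the per-request charge. The memory setting and duration below are assumptions consistent with the latency quoted above; the prices are the published us-east-1 x86 Lambda rates at the time of writing:

```python
# Back-of-envelope Lambda cost per 1M inferences (assumed configuration)
MEMORY_GB = 3.5                    # assumed memory setting
DURATION_S = 0.2                   # assumed billed duration per request (~200 ms)
GB_SECOND_PRICE = 0.0000166667     # x86 price per GB-second (us-east-1)
REQUEST_PRICE = 0.20 / 1_000_000   # per-request charge

cost_per_million = 1_000_000 * (MEMORY_GB * DURATION_S * GB_SECOND_PRICE + REQUEST_PRICE)
print(f"${cost_per_million:.2f} per 1M inferences")  # ≈ $11.87
```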