Why Serverless ML
Serverless platforms like AWS Lambda are attractive for low-latency inference at unpredictable scale: you pay only for actual compute time, not for idle GPU hours on an always-on instance.
Use cases that benefit from serverless ML:
- Sporadic inference requests (not continuous streaming)
- Cost-sensitive applications
- Auto-scaling requirements
- Multi-tenant SaaS with variable load
Challenges
The main challenges of deploying PyTorch to Lambda:
- Package size: zip deployments are capped at 250MB unzipped (50MB zipped), well below a typical PyTorch install
- Cold start: Model loading can take 5-10 seconds
- Memory: Limited to 10GB maximum
- CPU-only: No GPU support (yet)
The Solution
Use container images (up to 10GB) with multi-stage builds and quantization.
```dockerfile
# Multi-stage Dockerfile for PyTorch Lambda
FROM public.ecr.aws/lambda/python:3.11 AS builder

# Install dependencies into a staging directory
RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip install -r requirements.txt -t /asset

# Production stage
FROM public.ecr.aws/lambda/python:3.11

# Copy only runtime dependencies
COPY --from=builder /asset /var/task

# Copy model artifacts and inference code (the handler loads from ./model)
COPY model/ /var/task/model/
COPY handler.py /var/task/
CMD ["handler.lambda_handler"]
```
Inference Handler
```python
import torch
import json
from transformers import AutoTokenizer, AutoModel

# Global model (loaded once per container)
MODEL = None
TOKENIZER = None

def load_model():
    global MODEL, TOKENIZER
    if MODEL is None:
        MODEL = AutoModel.from_pretrained(
            "./model",
            torch_dtype=torch.float16  # Half precision
        )
        MODEL.eval()
        TOKENIZER = AutoTokenizer.from_pretrained("./model")
    return MODEL, TOKENIZER

def lambda_handler(event, context):
    model, tokenizer = load_model()
    input_text = json.loads(event['body'])['text']

    # Tokenize and run inference
    inputs = tokenizer(input_text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    return {
        'statusCode': 200,
        'body': json.dumps({
            'embeddings': outputs.last_hidden_state.tolist()
        })
    }
```
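Before pushing the image, it is worth exercising the handler locally with a fake event. A minimal smoke test; the event shape assumes the function sits behind an API Gateway-style proxy that delivers a JSON body:

```python
import json
from handler import lambda_handler

# Simulated proxy event -- the payload shape is an assumption about the front end
event = {"body": json.dumps({"text": "serverless inference with pytorch"})}

response = lambda_handler(event, None)
embeddings = json.loads(response["body"])["embeddings"]
print(response["statusCode"], len(embeddings[0]))  # 200, number of tokens in the sequence
```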
Optimization Techniques
1. Model Quantization
```python
import torch
from torch.quantization import quantize_dynamic

# Dynamic quantization (reduces size by 75%)
model_quantized = quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)
torch.save(model_quantized.state_dict(), 'model_quantized.pt')
```
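A quantized state_dict only loads back into a model that has been quantized the same way, so the load path has to re-apply quantize_dynamic before calling load_state_dict. A sketch of that load path, reusing the paths from the examples above:

```python
import torch
from torch.quantization import quantize_dynamic
from transformers import AutoModel

# Rebuild the architecture, quantize it identically, then load the quantized weights
model = AutoModel.from_pretrained("./model")
model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
model.load_state_dict(torch.load("model_quantized.pt"))
model.eval()
```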
2. Provisioned Concurrency
Mitigate cold starts by keeping containers warm:
```bash
# Provisioned concurrency attaches to a published version or alias (here, an alias named "prod")
aws lambda put-provisioned-concurrency-config \
  --function-name ml-inference \
  --qualifier prod \
  --provisioned-concurrent-executions 5
```
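The same setting can be managed from a deployment script. A minimal boto3 sketch; the alias name is a placeholder:

```python
import boto3

lambda_client = boto3.client("lambda")

# Keep 5 execution environments initialized for the "prod" alias (placeholder name)
lambda_client.put_provisioned_concurrency_config(
    FunctionName="ml-inference",
    Qualifier="prod",
    ProvisionedConcurrentExecutions=5,
)
```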
3. Lazy Loading
Load model weights only on the first invocation and cache them in module-level globals, so warm invocations skip the load entirely; the load_model function in the handler above already follows this pattern.
With these optimizations, we achieved sub-200ms inference latency at 1/10th the cost of dedicated GPU instances.
Cost Comparison
| Deployment | Cost per 1M inferences | Cold start |
|---|---|---|
| EC2 g4dn.xlarge | $150 | N/A (always on) |
| Lambda (optimized) | $12 | ~500ms |
| SageMaker Serverless | $45 | ~2s |
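As a rough sanity check on the Lambda figure: per-invocation cost is memory (GB) × billed duration (s) × the GB-second price, plus the per-request charge. The memory setting and duration below are assumptions consistent with the latency quoted above; the prices are the published us-east-1 x86 Lambda rates at the time of writing:

```python
# Back-of-envelope Lambda cost per 1M inferences (assumed configuration)
MEMORY_GB = 3.5                    # assumed memory setting
DURATION_S = 0.2                   # assumed billed duration per request (~200 ms)
GB_SECOND_PRICE = 0.0000166667     # x86 price per GB-second (us-east-1)
REQUEST_PRICE = 0.20 / 1_000_000   # per-request charge

cost_per_million = 1_000_000 * (MEMORY_GB * DURATION_S * GB_SECOND_PRICE + REQUEST_PRICE)
print(f"${cost_per_million:.2f} per 1M inferences")  # ≈ $11.87
```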