AWS Infrastructure for AI Workloads: The Complete Setup

AWS for AI: Choosing the Right Architecture

AWS offers three distinct paths for AI workloads: managed AI APIs via Amazon Bedrock, managed ML infrastructure via SageMaker, and self-managed compute on EC2 GPU instances. Choosing the wrong path costs money and engineering time. The right choice depends on whether you're calling hosted foundation models, deploying custom models, or running fine-tuned open-source models.

This guide covers the architecture decisions and concrete configurations for each path, plus the cross-cutting concerns (networking, storage, monitoring, cost) that apply to all three.

Option 1: Amazon Bedrock for Foundation Models

Bedrock gives you API access to Anthropic Claude, Meta Llama, Mistral, Amazon Titan, and others without managing any infrastructure. It is the right choice for teams that want to build AI features without becoming ML infrastructure operators.

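import json

import boto3

# Runtime client for model invocation (separate from the 'bedrock' control-plane client)
bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

# Anthropic models on Bedrock take a Messages API request body
response = bedrock.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 1024,
        'messages': [{'role': 'user', 'content': 'Hello'}]
    })
)

# The response body is a stream; read and decode it once
result = json.loads(response['body'].read())
print(result['content'][0]['text'])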

Bedrock pricing is per-token with no infrastructure overhead. For variable or bursty workloads, this is usually cheaper than keeping dedicated GPU instances running, because you pay nothing while idle. The tradeoff is less control over latency and only limited ability to customize the model.
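To make the per-token versus dedicated-instance tradeoff concrete, here is a minimal break-even sketch. The prices and token volumes are placeholder assumptions for illustration, not current AWS rates; substitute the figures from the pricing pages for your model and instance type.

```python
def bedrock_monthly_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Per-token billing: cost scales with the tokens actually processed."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


def dedicated_monthly_cost(hourly_rate, instance_count=1, hours=730):
    """Always-on instances: cost is fixed regardless of traffic."""
    return hourly_rate * instance_count * hours


# Hypothetical prices, chosen only to illustrate the comparison
PRICE_IN_PER_1K = 0.003    # USD per 1,000 input tokens (assumption)
PRICE_OUT_PER_1K = 0.015   # USD per 1,000 output tokens (assumption)
GPU_HOURLY = 1.20          # USD per instance-hour (assumption)

light = bedrock_monthly_cost(5_000_000, 1_000_000, PRICE_IN_PER_1K, PRICE_OUT_PER_1K)
heavy = bedrock_monthly_cost(500_000_000, 100_000_000, PRICE_IN_PER_1K, PRICE_OUT_PER_1K)
dedicated = dedicated_monthly_cost(GPU_HOURLY)

print(f"light traffic on Bedrock: ${light:,.2f}/month")
print(f"heavy traffic on Bedrock: ${heavy:,.2f}/month")
print(f"one dedicated instance:   ${dedicated:,.2f}/month")
```

Under these assumed numbers, light traffic is far cheaper per-token while sustained heavy traffic crosses over. A hosted foundation model and a self-hosted open model are not equivalent in quality, so treat the comparison as a cost signal only.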

Option 2: SageMaker for Custom Model Deployment

SageMaker is the right choice when you have a fine-tuned model or a specific open-source model that is not available on Bedrock. It handles the undifferentiated heavy lifting of model serving: auto-scaling, health checks, A/B testing, and monitoring.

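import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Resolves the IAM role when run inside SageMaker (notebook or Studio);
# elsewhere, pass a role ARN explicitly
role = sagemaker.get_execution_role()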

model = HuggingFaceModel(
    model_data='s3://my-models/mistral-7b-finetuned.tar.gz',
    role=role,
    transformers_version='4.37.0',
    pytorch_version='2.1.0',
    py_version='py310',
    env={'HF_TASK': 'text-generation'}
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.2xlarge',
    endpoint_name='mistral-7b-production'
)
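Once the endpoint is up, requests to a Hugging Face text-generation container are typically JSON objects with `inputs` and `parameters` keys. The sketch below builds such a request and parses a response. The helper names (`build_generation_request`, `extract_text`), the prompt, and the default parameter values are illustrative assumptions, and the response shape should be confirmed against your own endpoint.

```python
def build_generation_request(prompt, max_new_tokens=256, temperature=0.7):
    """Build a request body in the {"inputs", "parameters"} shape."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": True,          # sampling must be on for temperature to apply
            "temperature": temperature,
            "return_full_text": False,  # return only the completion, not the prompt
        },
    }


def extract_text(response):
    """Pull generated text out of a [{"generated_text": ...}] style response."""
    if (
        isinstance(response, list)
        and response
        and isinstance(response[0], dict)
        and "generated_text" in response[0]
    ):
        return response[0]["generated_text"]
    raise ValueError(f"Unexpected response shape: {response!r}")


payload = build_generation_request("Summarize our refund policy in one sentence.")

# With a live endpoint, the call would look like:
#   result = predictor.predict(payload)
#   print(extract_text(result))
```

Keeping request construction and response parsing in plain functions makes them testable without a deployed endpoint.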

Option 3: EC2 GPU Instances for Self-Managed Inference

When you need maximum control (custom serving configuration, specific frameworks, or direct GPU access), self-managed EC2 is the answer. Use the AWS Deep Learning AMIs, which come pre-installed with CUDA, PyTorch, and common ML libraries.

Instance Type Selection

g4dn.xlarge (T4 GPU, 16GB VRAM): Cost-effective at ~$0.53/hr; fits 7B parameter models, though FP16 weights (~14GB) leave little headroom, so quantization helps
g5.2xlarge (A10G GPU, 24GB VRAM): Good for high-throughput 7B, or for 13B models when quantized (13B FP16 weights alone need ~26GB)
p3.2xlarge (V100, 16GB VRAM): Older, but can still be cost-effective for training jobs
p4d.24xlarge (8x A100, 320GB VRAM total): For 70B+ models or distributed training
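A rough way to sanity-check these pairings is to estimate memory from parameter count and precision. The sketch below assumes weights dominate memory and adds a flat 20% overhead for activations and KV cache. That overhead factor is an assumption, and real usage depends on batch size and context length. The VRAM figures are the ones listed above.

```python
# Approximate bytes needed per parameter at each precision
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# (instance type, total VRAM in GB), ordered smallest to largest
INSTANCES = [
    ("g4dn.xlarge", 16),
    ("p3.2xlarge", 16),
    ("g5.2xlarge", 24),
    ("p4d.24xlarge", 320),  # spread across 8 GPUs; needs model parallelism
]


def estimate_vram_gb(params_billion, precision="fp16", overhead=1.2):
    """Estimate VRAM as weights times a flat overhead factor (assumed 20%)."""
    return params_billion * BYTES_PER_PARAM[precision] * overhead


def smallest_fit(params_billion, precision="fp16"):
    """Return the first listed instance whose VRAM covers the estimate."""
    needed = estimate_vram_gb(params_billion, precision)
    for name, vram_gb in INSTANCES:
        if vram_gb >= needed:
            return name
    return None  # nothing in the list is large enough


for size, precision in [(7, "fp16"), (7, "int8"), (13, "int4"), (70, "fp16")]:
    print(f"{size}B {precision}: ~{estimate_vram_gb(size, precision):.1f} GB "
          f"-> {smallest_fit(size, precision)}")
```

This is only a first filter. Benchmark with your actual serving stack before committing to an instance type.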

S3 for Model Storage

Store model weights in S3 with versioning enabled so you can roll back to a previous set of weights. Use Intelligent-Tiering for cost optimization on models that are used infrequently, and S3 Standard for production models that are loaded regularly.

# Download model weights from S3 to local disk
aws s3 sync s3://my-models/mistral-7b-v0.1/ /tmp/mistral-7b/ --region us-east-1

# Or use the AWS CLI to upload a local fine-tuned model
aws s3 cp ./fine-tuned-model/ s3://my-models/mistral-7b-finetuned-v2/ --recursive
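The versioning and storage-class choices described in this section can also be applied from Python. This is a minimal sketch that takes an S3 client as a parameter so it can be exercised without AWS credentials. The helper names, bucket, key, and file path are hypothetical.

```python
def enable_versioning(s3_client, bucket):
    """Turn on object versioning so earlier model weights can be restored."""
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )


def upload_model(s3_client, local_path, bucket, key, production=True):
    """Upload a model artifact, picking the storage class by usage pattern."""
    # Production models are loaded often; rarely used ones go to Intelligent-Tiering
    storage_class = "STANDARD" if production else "INTELLIGENT_TIERING"
    s3_client.upload_file(
        local_path,
        bucket,
        key,
        ExtraArgs={"StorageClass": storage_class},
    )
    return storage_class


# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   enable_versioning(s3, "my-models")
#   upload_model(s3, "./model.tar.gz", "my-models", "archive/model.tar.gz", production=False)
```

Passing the client in, instead of creating it inside the functions, keeps the helpers easy to test with a stub.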

VPC and Networking for AI Workloads

Keep AI compute in private subnets. Route outbound internet access (for model downloads and external API calls) through a NAT Gateway. Use VPC endpoints for S3 and Bedrock so that traffic stays on the AWS network, which avoids NAT data processing charges and improves security.

Create a dedicated VPC for AI workloads with private subnets in 2+ AZs
Use VPC endpoints for S3, SageMaker, and Bedrock
Place a load balancer in a public subnet and inference instances in private subnets
Use security groups to restrict GPU instance access to only your application layer
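As a sketch of the endpoint setup in the checklist above, the function below builds the parameter sets for one gateway endpoint (S3) and two interface endpoints (Bedrock runtime and SageMaker runtime). The function name and all resource IDs are hypothetical placeholders. Each resulting dict is intended to be passed to the EC2 `create_vpc_endpoint` call.

```python
def vpc_endpoint_specs(region, vpc_id, route_table_ids, subnet_ids, security_group_ids):
    """Build create_vpc_endpoint parameters for S3, Bedrock, and SageMaker."""
    # S3 uses a gateway endpoint, attached to route tables
    specs = [{
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "VpcEndpointType": "Gateway",
        "RouteTableIds": list(route_table_ids),
    }]
    # Bedrock and SageMaker runtimes use interface endpoints, placed in subnets
    for service in ("bedrock-runtime", "sagemaker.runtime"):
        specs.append({
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            "VpcEndpointType": "Interface",
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
            "PrivateDnsEnabled": True,  # lets SDK default hostnames resolve privately
        })
    return specs


specs = vpc_endpoint_specs(
    region="us-east-1",
    vpc_id="vpc-EXAMPLE",                      # placeholder ID
    route_table_ids=["rtb-EXAMPLE"],           # placeholder ID
    subnet_ids=["subnet-EXAMPLE-A", "subnet-EXAMPLE-B"],
    security_group_ids=["sg-EXAMPLE"],
)

# With boto3 and credentials available:
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   for spec in specs:
#       ec2.create_vpc_endpoint(**spec)
```

In practice this is usually expressed in infrastructure-as-code; the Python form is shown only to make the endpoint types and service names explicit.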

Cost Optimization Strategies

Spot instances for batch inference: 60-90% savings over on-demand for non-time-sensitive workloads
Savings Plans: Commit to a 1- or 3-year term for steady, always-on inference to save roughly 30-40%
Auto-scaling with scale-to-zero: SageMaker Serverless Inference scales to zero when idle (it runs on CPU only, so it suits smaller models)
Right-sizing: Start with the smallest instance type that fits your model; benchmark before over-provisioning
Model quantization: INT8 or INT4 quantized models run on smaller instances with minimal quality loss
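To see what the discount ranges above mean in dollars, the sketch below applies them to the ~$0.53/hr g4dn.xlarge rate quoted earlier. The 200-hour monthly batch workload is an assumed figure for illustration, and actual Spot and Savings Plan discounts vary by instance type, region, and time.

```python
def discounted_hourly(on_demand_hourly, discount):
    """Apply a fractional discount (0.6 means 60% off) to an hourly rate."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_hourly * (1 - discount)


def monthly_cost(hourly_rate, hours):
    """Cost of running one instance for the given number of hours."""
    return hourly_rate * hours


ON_DEMAND_HOURLY = 0.53  # g4dn.xlarge rate quoted in this guide
BATCH_HOURS = 200        # hypothetical monthly batch workload

scenarios = [
    ("on-demand", 0.0),
    ("savings plan (30% off)", 0.30),
    ("savings plan (40% off)", 0.40),
    ("spot (60% off)", 0.60),
    ("spot (90% off)", 0.90),
]

for label, discount in scenarios:
    rate = discounted_hourly(ON_DEMAND_HOURLY, discount)
    print(f"{label:<24} ${rate:.3f}/hr  ${monthly_cost(rate, BATCH_HOURS):.2f}/month")
```

Spot capacity can be reclaimed at short notice, so these savings apply only to jobs that tolerate interruption and can checkpoint or retry.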

The cheapest AWS AI architecture is almost always Bedrock for foundation model API calls combined with spot EC2 for any batch processing jobs. Reserve SageMaker for cases where you genuinely need managed model hosting with SLAs, and self-managed EC2 only when you need configuration that managed services cannot provide.
