Compute Expansion - AWS Spot

Meeting notes from 2020.12.16

  1. Instance flexibility is important.

  2. A proof of concept (POC) would be fundable.

Price: up to 90% discount compared with on-demand.
Can scale to significant capacity.

Workloads should be built to handle sudden termination with only a two-minute warning.
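
A minimal sketch of how a worker might watch for the two-minute warning, assuming it runs on an EC2 spot instance and polls the instance metadata service (IMDSv2) at spot/instance-action. The function names, the 5-second poll interval, and the on_notice callback are illustrative assumptions, not something discussed in the meeting.

```python
import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the raw interruption notice, or None if none is pending."""
    # IMDSv2 requires a short-lived session token before any read.
    token_req = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        req = urllib.request.Request(
            IMDS + "/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no interruption is scheduled
            return None
        raise


def seconds_until_interruption(body, now):
    """Seconds between `now` (timezone-aware) and the time in the notice."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return (when.replace(tzinfo=timezone.utc) - now).total_seconds()


def wait_for_interruption(on_notice, fetch=fetch_instance_action,
                          poll_seconds=5.0, sleep=time.sleep):
    """Poll until a notice appears, then hand the parsed notice to on_notice."""
    while True:
        body = fetch()
        if body is not None:
            on_notice(json.loads(body))
            return
        sleep(poll_seconds)
```

The fetch and sleep parameters are injectable so the loop can be exercised off EC2; on_notice is where a workload would checkpoint and stop accepting new work.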

Workloads should also be flexible in time and region.

  • spot pool = instance type/size * availability zone
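
As a small illustration of the pool definition above, the sketch below enumerates pools as the cross product of instance types and availability zones. The instance types and zone names are made-up examples.

```python
from itertools import product


def spot_pools(instance_types, availability_zones):
    """Each (instance type, availability zone) pair is one spot pool."""
    return list(product(instance_types, availability_zones))


pools = spot_pools(
    ["m5.large", "m5a.large", "m4.large"],  # illustrative instance types
    ["us-east-1a", "us-east-1b"],           # illustrative zones
)
print(len(pools))  # 6
```

Adding one more instance type or zone grows the number of pools multiplicatively, which is why instance flexibility matters.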

Auto Scaling group (ASG): supports mixed instance types and purchase options (on-demand and spot)

  • ASG features lifecycle hooks, termination policies, etc.

Allocation strategies

  • N lowest priced

  • capacity optimized (preferred)

EC2 instance rebalance recommendation: a new signal sent when a spot instance is at elevated risk of interruption

  • allows for proactive rebalances of workloads to a deeper pool

  • capacity rebalancing on Spot: when a rebalance signal is received, the ASG launches replacement instances and later terminates the old ones
    Enabled on the AutoScalingGroup with CapacityRebalance: true
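
A sketch of what the ASG settings above might look like as a request for boto3's create_auto_scaling_group, assuming an existing launch template. The helper name, the all-spot distribution (zero on-demand), and MinSize of 0 are assumptions for illustration only.

```python
def asg_request(name, launch_template, instance_types, subnet_ids, max_size):
    """Build keyword arguments for an all-spot, capacity-optimized ASG."""
    return {
        "AutoScalingGroupName": name,
        "MinSize": 0,
        "MaxSize": max_size,
        # Subnets in several availability zones widen the set of spot pools.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # Launch replacements when a rebalance recommendation arrives.
        "CapacityRebalance": True,
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template,
                    "Version": "$Latest",
                },
                # One override per acceptable instance type.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 0,
                "OnDemandPercentageAboveBaseCapacity": 0,
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    }


# Usage (requires boto3 and AWS credentials; all names are hypothetical):
#   import boto3
#   request = asg_request("spot-poc", "spot-poc-template",
#                         ["m5.large", "m5a.large"], ["subnet-a", "subnet-b"], 10)
#   boto3.client("autoscaling").create_auto_scaling_group(**request)
```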

Batch

  1. Recommend that you choose "Spot capacity optimized" for Batch

ECS

  1. Fargate Spot is available.

  2. Overprovisioning on spot with capacity providers.

    • capacity providers map to ASGs, which allows ECS to manage the ASGs
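
For the Fargate case, a capacity provider strategy might be built as in the sketch below and passed as capacityProviderStrategy to the boto3 ECS create_service or run_task call. The helper name and the default weights are assumptions; whether any on-demand base is needed was still an open question in the meeting.

```python
def capacity_provider_strategy(on_demand_base=0, on_demand_weight=0, spot_weight=1):
    """Build a strategy that prefers Fargate Spot, with optional on-demand."""
    strategy = [{"capacityProvider": "FARGATE_SPOT", "weight": spot_weight}]
    if on_demand_base or on_demand_weight:
        # Only one provider in a strategy may define a base.
        strategy.insert(0, {
            "capacityProvider": "FARGATE",
            "base": on_demand_base,
            "weight": on_demand_weight,
        })
    return strategy


# All spot, no on-demand base:
print(capacity_provider_strategy())
# One on-demand task as a base, the rest weighted 1:3 toward spot:
print(capacity_provider_strategy(on_demand_base=1, on_demand_weight=1, spot_weight=3))
```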

EKS

  1. After creating the cluster with CloudFormation, create a nodegroup on EKS

    • cluster autoscaler requires a nodegroup

      • instance types within a nodegroup should be the same size (CPU/memory)

    • separate spot and on-demand in different nodegroups

  2. Handling interruptions

    • identify

    • 2-minute notification

    • taint

    • drain

    • replace

  3. Self-managed nodegroups need a DaemonSet to handle spot interruptions

    • managed nodegroups already have this, so those are recommended

  4. Horizontal Pod Autoscaler (HPA)

    • should run on the on-demand nodegroup

  5. Cluster Autoscaler (CA)

    • should run on the on-demand nodegroup

  6. Taints applied at node level

    • a node won't accept any pods that do not tolerate its taints

  7. Node affinity: allows you to constrain which nodes your pod is eligible to be scheduled on based on labels on the node

  8. Can use both taints and node affinity to control whether containers go to spot or on-demand nodes
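
The taint and node affinity items above could combine into a pod manifest like the sketch below. It assumes managed nodegroups, which label nodes with eks.amazonaws.com/capacityType (SPOT or ON_DEMAND). The taint key and value ("spot" / "true") and the helper name are hypothetical and would have to match whatever taint is actually applied to the spot nodegroup.

```python
import json


def spot_pod_manifest(name, image, taint_key="spot", taint_value="true"):
    """Pod that tolerates a (hypothetical) spot taint and requires spot nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{"name": name, "image": image}],
            # Toleration: lets the pod land on nodes carrying the spot taint.
            "tolerations": [{
                "key": taint_key,
                "operator": "Equal",
                "value": taint_value,
                "effect": "NoSchedule",
            }],
            # Node affinity: restricts the pod to nodes labeled as spot.
            "affinity": {"nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{"matchExpressions": [{
                        "key": "eks.amazonaws.com/capacityType",
                        "operator": "In",
                        "values": ["SPOT"],
                    }]}],
                },
            }},
        },
    }


# kubectl accepts JSON manifests, so the output can be piped to `kubectl apply -f -`.
print(json.dumps(spot_pod_manifest("worker", "busybox"), indent=2))
```

The toleration alone only permits scheduling on spot nodes; the affinity is what keeps the pod off on-demand nodes.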

Spot instance advisor

  1. 3-month trailing statistics

  2. Aim for best practices, flexibility, diversification.

Spot Blueprints

  1. Just released; gives you templates for deployment via CloudFormation, Terraform, etc.

Questions

  1. ECS - would that require any on-demand base capacity?

    1. They will get back to me, but believe the answer is no

    2. Base on-demand capacity for EKS could likely be small; perhaps could try micro instances?

  2. Next step for me is to evaluate the choices and decide on a deployment approach.

Other notes

  1. Need to do some major work on the manager code to better support signal handling, preemption, retries, etc.
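
A rough sketch of the kind of signal handling and retry logic the manager code might need. It does not reflect the existing manager code: all names are hypothetical, and progress is kept in an in-memory set only for illustration. In a real deployment progress would need to be persisted externally and the retry would run on a replacement instance.

```python
import signal
import threading

# Set when the process is asked to stop (for example SIGTERM during a drain).
stop_requested = threading.Event()


class Preempted(Exception):
    """Raised when work stops early because termination was requested."""


def install_sigterm_handler():
    """Call once at startup, from the main thread."""
    signal.signal(signal.SIGTERM, lambda signum, frame: stop_requested.set())


def run_items(items, process, done):
    """Process items not yet in `done`, stopping between items if asked to."""
    for item in items:
        if item in done:
            continue
        if stop_requested.is_set():
            raise Preempted(f"stopped before item {item!r}")
        process(item)
        done.add(item)  # record progress so a retry skips finished items


def run_with_retries(items, process, max_attempts=3):
    """Re-run after preemption, resuming from the recorded progress."""
    done = set()
    for _ in range(max_attempts):
        try:
            run_items(items, process, done)
            return done
        except Preempted:
            # Stand-in for "rescheduled elsewhere": clear the flag and resume.
            stop_requested.clear()
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Checking the flag only between items keeps each item atomic, so items should be small enough to finish well within the two-minute warning.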