I built a Machine Learning Platform on AWS after passing the SAP-C01 exam: Infrastructure and Software layers

Properly preparing for and passing the SAP-C01 exam gave me the answers… But what are the questions?… The job of an architect is to make sure everything he creates is designed to withstand the weight placed upon it. Such perfection is unreachable without asking the right questions.

Salah REKIK
8 min read · Jul 5, 2020
Photo by Evan Dennis on Unsplash

This is the second slice of my journey in building a machine learning (ML) platform on AWS and a continuation of the high-level overview presented in the first article.

In this part, I am going to study in detail the first two layers of the ML platform: the infrastructure and software layers.

1 | So, what are the questions?

A good design for the infrastructure and software layers must address several challenges:

  • How to guarantee high availability?
  • Is it possible to have a scalable design?
  • How to solve the capacity planning riddle?
  • How to ensure data resiliency?
  • What about platform security?

Addressing all of these challenges is a nightmare when dealing with an on-premises platform. However, thanks to the continuous improvement of cloud services, it is becoming more and more achievable, with less operational overhead.

And if that were not enough of a challenge, today’s companies are looking for ways to get things done well and in a short period of time…

2 | What are the answers then?

To answer these challenges, I chose to:

  • Use AWS managed services as much as possible: these services are here to save time. Since I want to keep some control over the infrastructure, I did not go for a serverless design.
  • Use containerization and microservices: I believe this kind of architecture is one of the most scalable and easiest to maintain nowadays.

Here is a high-level design of the Infrastructure and Software layers of the ML platform:

Infrastructure and Software layers of the Machine Learning Platform, by the author

One of the main pillars of this design is EKS (Elastic Kubernetes Service), a managed AWS service. Let us see how this design addresses each of the listed challenges.

2.1 | High availability

AWS guarantees the high availability of its managed services. That is why I am using AWS managed services a lot: Route 53, NAT Gateway, FSx for Lustre, EKS, and more.

To guarantee high availability for the whole design, at least two Availability Zones (AZs) are required. These AZs have the same shape, with slightly different resource types.
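As a minimal illustration of this multi-AZ layout, the boto3 sketch below creates one private subnet per AZ in an existing VPC. The region, VPC ID, and CIDR blocks are hypothetical placeholders, not values from the actual platform.

```python
import boto3

# Region, VPC ID, and CIDR blocks are illustrative placeholders.
ec2 = boto3.client("ec2", region_name="eu-west-1")
VPC_ID = "vpc-0123456789abcdef0"
PRIVATE_SUBNETS = {
    "eu-west-1a": "10.0.1.0/24",
    "eu-west-1b": "10.0.2.0/24",
}

# One private subnet per AZ, so every tier of the platform
# can be spread over at least two Availability Zones.
for az, cidr in PRIVATE_SUBNETS.items():
    subnet = ec2.create_subnet(VpcId=VPC_ID, CidrBlock=cidr, AvailabilityZone=az)
    print(az, "->", subnet["Subnet"]["SubnetId"])
```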

2.2 | Scalable design

Auto Scaling groups (ASGs) are a good candidate for solving this challenge. ASGs are used in two places:

  • Auto Scaling group for bastion hosts: because bastion hosts are the only way to access the platform, I would advise an ASG with a minimum capacity of two instances (a minimal sketch follows this list). Some people choose a minimum capacity of one instance to minimize cost, arguing that bastion hosts are only used for administration, so a few minutes of downtime is not a big deal. But with a minimum capacity of one instance, we only guarantee business continuity, not high availability.
  • Cluster Autoscaler for the EKS clusters: this is a great feature supported by AWS. The Cluster Autoscaler “automatically adjusts the number of nodes in your cluster when pods fail to launch due to lack of resources or when nodes in the cluster are underutilized and their pods can be rescheduled onto other nodes in the cluster” ¹.
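Here is a minimal boto3 sketch of such a bastion ASG, assuming a launch template already exists; the group name, launch template name, and subnet IDs are hypothetical.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# Minimum capacity of two instances spread over two public subnets (one per AZ),
# so losing a single AZ does not cut off access to the platform.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="ml-platform-bastion-asg",           # hypothetical name
    LaunchTemplate={
        "LaunchTemplateName": "bastion-launch-template",      # hypothetical template
        "Version": "$Latest",
    },
    MinSize=2,
    MaxSize=4,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-public-a,subnet-public-b",      # one public subnet per AZ (placeholders)
)
```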

2.3 | Capacity planning

Having at least two EKS clusters is highly recommended. I have seen companies (like Babylon Health²) launch many more EKS clusters, each containing one EC2 instance type: this makes it easy to adapt the platform to the use case landing on it.

One example of using these two clusters is to dedicate one cluster to heavy jobs such as training the machine learning models (with P3 EC2 instances here), and the other cluster to the models’ inference phase (with the less expensive P2 EC2 instances).
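As a sketch of this split, the snippet below uses boto3 managed node groups, one possible way to pin an instance type to each cluster. The cluster names, subnets, and node role ARN are hypothetical, and the exact instance sizes would depend on the workload.

```python
import boto3

eks = boto3.client("eks", region_name="eu-west-1")

# Hypothetical split: one cluster sized for training, one for inference.
NODE_GROUPS = [
    {"cluster": "ml-training-cluster",  "name": "training-nodes",  "instance": "p3.2xlarge"},
    {"cluster": "ml-inference-cluster", "name": "inference-nodes", "instance": "p2.xlarge"},
]

for ng in NODE_GROUPS:
    eks.create_nodegroup(
        clusterName=ng["cluster"],
        nodegroupName=ng["name"],
        instanceTypes=[ng["instance"]],
        scalingConfig={"minSize": 1, "maxSize": 5, "desiredSize": 2},
        subnets=["subnet-private-a", "subnet-private-b"],            # private subnets, one per AZ (placeholders)
        nodeRole="arn:aws:iam::123456789012:role/eks-node-role",     # hypothetical node role ARN
    )
```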

2.4 | Data resiliency

Yet another challenge covered by the AWS managed services. S3 and FSx provide high levels of reliability and are among the safest ways to store data. Some best practices for data resiliency in S3 are lifecycle configuration, enabling versioning, and using cross-region replication.
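A minimal boto3 sketch of two of these practices, versioning and a lifecycle rule, is shown below; the bucket name and retention periods are hypothetical.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "ml-platform-training-data"  # hypothetical bucket name

# Versioning protects against accidental overwrites and deletions.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Lifecycle rule: move non-current object versions to Glacier after 30 days,
# then expire them after one year.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-versions",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "NoncurrentVersionTransitions": [
                {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
            ],
            "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
        }]
    },
)
```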

In this architecture, I used FSx for Lustre, following Amazon’s recommendation for this type of platform: an FSx for Lustre file system integrates well with S3 and provides very high performance for accessing the data, which machine learning platforms really need. For example, during the model training phase, the training data can be copied from S3 to FSx only once.

Since the FSx file system is shared by the different worker nodes in the EKS cluster, these worker nodes can access the data very efficiently from the file system instead of making a call all the way to S3.
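Below is a minimal boto3 sketch of such an FSx for Lustre file system linked to an S3 bucket; the bucket, subnet, and capacity are hypothetical placeholders.

```python
import boto3

fsx = boto3.client("fsx", region_name="eu-west-1")

# ImportPath lazily loads objects from the S3 bucket into the file system;
# ExportPath is where results can be written back to S3.
fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,                                  # GiB; smallest Lustre size
    SubnetIds=["subnet-private-a"],                        # private subnet of the training cluster (placeholder)
    LustreConfiguration={
        "ImportPath": "s3://ml-platform-training-data",
        "ExportPath": "s3://ml-platform-training-data/results",
    },
)
```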

2.5 | Platform security

This is the most fun part. It pushed me to deeply investigate EKS to find answers:

  • For a higher level of security, I tried to hide the resources as much as possible. So, all the worker nodes of the EKS clusters are in private subnets. Obviously, a NAT Gateway in a public subnet is a must in each AZ, as instances sometimes need to access the internet to download patches or to reach a container registry like ECR or DockerHub.
    Deploying the EKS cluster with private and public subnets is the recommended way according to AWS.
  • I chose to have private API access for Kubernetes: with this setting, all the traffic to the Kubernetes API server stays inside the VPC, and the API is inaccessible from outside this VPC.
    “When you enable endpoint private access for your cluster, Amazon EKS creates a Route 53 private hosted zone on your behalf and associates it with your cluster’s VPC. This private hosted zone is managed by Amazon EKS, and it doesn’t appear in your account’s Route 53 resources. In order for the private hosted zone to properly route traffic to your API server, your VPC must have enableDnsHostnames and enableDnsSupport set to true, and the DHCP options set for your VPC must include AmazonProvidedDNS in its domain name servers list.” ³
    The Kubernetes API can then only be accessed from bastion hosts, which must be created inside the same VPC.
    Some specific configuration of the bastion hosts must be considered to allow this connectivity (see the sketch after this list):
    - Allow ingress traffic coming from bastion hosts on port 443 in the Amazon EKS control plane security group.
    - Role-based access control (RBAC) for Kubernetes must know the user or the IAM role used by those bastion hosts.
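The boto3 sketch below illustrates the two API-related steps, restricting the Kubernetes API endpoint to the VPC and opening port 443 from the bastion hosts; the cluster name and security group IDs are hypothetical, and the RBAC mapping (done in Kubernetes itself) is not shown.

```python
import boto3

eks = boto3.client("eks", region_name="eu-west-1")
ec2 = boto3.client("ec2", region_name="eu-west-1")

# 1. Keep the Kubernetes API server endpoint private to the VPC.
eks.update_cluster_config(
    name="ml-training-cluster",                            # hypothetical cluster name
    resourcesVpcConfig={
        "endpointPrivateAccess": True,
        "endpointPublicAccess": False,
    },
)

# 2. Allow HTTPS (443) from the bastion hosts' security group
#    into the EKS control plane security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-controlplane",                             # control plane security group (placeholder)
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": "sg-bastion"}],   # bastion security group (placeholder)
    }],
)
```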

3 | But how did I get these answers?

Simply by studying the services used in depth. Let us zoom into the EKS cluster and see what is happening inside.

Zoom inside the EKS cluster, by the author

When deploying an EKS cluster with private and public subnets, and enabling private API access, the resulting cluster structure is as follows:

  • EKS Control Plane: the master part of the cluster. These are EC2 nodes running the brainy side of EKS to keep the Kubernetes cluster in a consistent state. They run the etcd key-value store, used as the Kubernetes backing store for all cluster data, as well as kube-apiserver and kube-scheduler.
    These nodes run in a separate AWS managed VPC.
  • Worker nodes with their Cluster Autoscaler in a private subnet.
  • A NAT Gateway in a public subnet to enable worker nodes’ egress traffic.
  • A private hosted zone: “A private hosted zone is a container that holds information about how you want Amazon Route 53 to respond to DNS queries for a domain and its subdomains within one or more VPCs that you create with the Amazon VPC service.” ⁴
    This zone is created by EKS when private API access is enabled. Worker nodes and the EKS Control Plane use the private hosted zone’s records to communicate with each other.
  • An Elastic Network Interface (ENI) is provisioned in addition to the private hosted zone. This ENI is the network connection between the Control Plane and the worker nodes. It supports traffic such as kubectl exec sessions, logs, and proxy data flows (a small verification sketch follows this list).
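As a small verification sketch, the snippet below uses boto3 to check the cluster’s endpoint configuration and the two VPC DNS attributes required for the private hosted zone to resolve; the cluster name is hypothetical.

```python
import boto3

eks = boto3.client("eks", region_name="eu-west-1")
ec2 = boto3.client("ec2", region_name="eu-west-1")

CLUSTER = "ml-training-cluster"  # hypothetical cluster name

# Confirm that only private access to the API server endpoint is enabled.
vpc_config = eks.describe_cluster(name=CLUSTER)["cluster"]["resourcesVpcConfig"]
print("private access:", vpc_config["endpointPrivateAccess"])
print("public access:", vpc_config["endpointPublicAccess"])

# The private hosted zone only resolves if both DNS attributes are enabled on the VPC.
vpc_id = vpc_config["vpcId"]
for attribute, key in (("enableDnsSupport", "EnableDnsSupport"),
                       ("enableDnsHostnames", "EnableDnsHostnames")):
    value = ec2.describe_vpc_attribute(VpcId=vpc_id, Attribute=attribute)
    print(attribute, "=", value[key]["Value"])
```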

4 | Sorry! But I need to see more details…

The next diagram provides more details about the components inside the EKS Control Plane and the worker nodes.

We can see how AWS deploys the EKS Control Plane across multiple AZs for high availability.
We can also observe the exact path a request from a worker node follows when communicating with the API server.
Notice the inside of a worker node as well: Kubernetes components such as kubelet and kube-proxy are deployed on each worker node.

Detailed view of the EKS cluster, by the author

5 | Pieces of advice

When I tried to create this architecture in my AWS account, I followed the AWS Management Console guide⁵, and it was not easy at all. There are a lot of tags to add, a lot of connectivity issues to take care of, especially when the private API is used, and a lot of IAM roles to manage.

A better way is to use the eksctl command-line utility⁶. It is definitely the easiest way to deploy EKS: it takes care of all those tagging and IAM role steps, which is really great and saves a lot of time.

Conclusion

In this article, I designed the infrastructure and software layers of the machine learning platform. This architecture is based on AWS managed services. In order to build it, I tried to follow best practices such as hiding resources in private subnets for security purposes.

Every step in thinking through this design was meant to solve one of the challenges faced when building a platform: high availability, scalability, capacity planning, data resiliency, and platform security.

Fully understanding the services used is crucial to properly solving these challenges; that is why I gave a detailed technical view of the inside of the AWS EKS service, which is the pillar of this design.

In the next article, I will talk about the third layer of this machine learning platform: The Framework Layer.

If you have any questions, please reach out to me on LinkedIn.

Note: The software stack presented in this part will be completed in the next article with other AWS services to fully cover the framework layer’s capabilities.

[1] https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html

[2] https://www.youtube.com/watch?v=ULlqukKVKBo

[3] https://docs.aws.amazon.com/eks/latest/userguide/cluster-endpoint.html

[4] https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/hosted-zones-private.html

[5] https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html

[6] https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html

