I built a Machine Learning Platform on AWS after passing the SAP-C01 exam: Framework layer

Architecting is all about how things fit together… How an object that performs a function can also be a work of art… That’s exactly what a framework is, a fascinating work of art!

Salah REKIK

This is the third piece of my journey in building a machine learning (ML) platform on AWS. It continues the high-level overview presented in the first article as well as the Infrastructure and Software layers demystified in the second one.

In this part, I am going to study the third layer of the ML platform: the framework layer.

1 | So, why a framework?

By definition, a framework is “an abstraction […] providing generic functionality.” ¹ In other words, it is a high-level layer that hides the tricky details of the platform’s software stack and exposes user-friendly functionalities.

Amazon understood this very well. That is why many new AWS services are released every year. Each new service is not only well integrated with the existing ones but also abstracts away their use.
One example is using AWS Glue to run an ETL job instead of launching and managing a whole EMR cluster.

I have seen companies (like ABEJA²) build a machine learning platform by efficiently combining AWS services. And it works…

But what if there were only one entry point, one abstraction layer from which a data scientist accesses the machine learning platform and builds his pipeline for his ML model without worrying about all these AWS services?

That is the reason behind the framework layer, and that is what makes it a work of art!

Uber mastered it with Michelangelo³, Netflix did it with Metaflow⁴. As for me… I am trying in this article…

2 | How am I doing it?

By simply following in the footsteps of these leaders. Three main sources of inspiration contributed to my final design: Uber³, Comcast⁵, and one book: Machine Learning with Apache Spark Quick Start Guide by Jillur Quddus⁶.

In his book, Jillur Quddus stated: “We can represent a data insights platform as a series of logical layers, where each layer provides a distinct functional capability. When we combine these layers, we form a reference logical architecture for a data insights platform.”⁶
And that is what I did: I started by laying out the logical layers, then the functional capabilities of each layer, and finally the granular functions of each capability.

Here is the result:

Logical Architecture of the Machine Learning Platform, by the author

In this logical architecture, the machine learning framework sits quietly on top of the infrastructure and software layers detailed in the previous article. Its job is to fit six layers together: Data Storage, Data & Model preparation, Model Operationalization, Model Serving, Governance & Security, and Management, Administration & Orchestration.

Let us understand each one of these layers:

2.1 | Data Storage

The ML Framework needs four main types of storage:

  • Collected Data Storage: this storage is responsible for holding the data in its raw as well as its cleansed format. As seen in the first article, this storage is maintained and governed by another data platform. The framework here is just a consumer of this data.
  • Features Storage: Uber’s team insisted on the importance of such storage. That is because, they “[…] found that many modeling problems at Uber use identical or similar features, and there is substantial value in enabling teams to share features between their own projects and for teams in different organizations to share features with each other.” ³
    Two types of feature stores are needed:
    - Offline Features: offline features are computed based on the historical data and they are updated every few hours or once a day. One example of these features built by Uber’s team: “restaurant’s average meal preparation time over the last seven days.” ³
    - Online Features: these features are generated in near-real-time and computed by a near-real-time processing job applied to streaming data. One example of these features: “restaurant’s average meal preparation time over the last one hour.” ³
  • Models Storage: trained models are stored in a repository with their metadata for future use. These metadata are very important for tracing the history of the model. “It’s all about Metadata,” as the Comcast team stated⁵. Every update of the trained model must result in a new model version, which should be kept in sync with its metadata version.
    When a model is ready, it gets packaged and stored in the Packaged Models repository.
  • Monitoring Storage: two data storages are essential for monitoring:
    - Predictions: a sample set of the predictions produced by the ML model. Without these predictions, it would be impossible to spot the model’s underperformance over time.
    - Performance Metrics: to guarantee continuous service and good model performance, some metrics must be monitored continuously. One example is monitoring the inference time when the model is served in real time and checking whether it respects the Service Level Agreement (SLA).

2.2 | Data & Model preparation

This layer provides three functional capabilities:

  • Data Exploration: during this step, the Data Scientist begins to understand and cleanse his data. Then, he starts brainstorming to extract the features that seem to answer his use case.
  • Model Training: this is a back-and-forth process between selecting an ML model and developing it in order to shortlist the most performant one.
  • Model Evaluation: some parameters cannot be learned from the data; they are called hyperparameters. This step is meant to optimize them by testing different hyperparameter combinations on the chosen model.

2.3 | Model Operationalization

This layer is responsible for packaging the model properly and deploying it efficiently. The model package includes the trained model, the feature extractor, and, as a best practice, a sample data set that helps check the model’s performance once deployed. Different deployment strategies will be discussed in the next article.

2.4 | Model Serving

This layer makes it possible for the ML model to interact with the real world by providing two functional capabilities:

  • Model Exposition: two modes of model exposition may be used:
    - Batch Serving: this mode of serving is used when the ML model is applied to a large input, like a week’s history of songs chosen by the user, in order to recommend the right songs for the next week. The predictions of this model do not go straight to the user; instead, they go to a database for future use.
    - Real-time Serving: in this case, the user interacts directly with the model. He sends an API request containing the input data and receives a prediction.
  • Model Monitoring: because human behavior is unpredictable, the model’s performance may change over time. Hence the need for continuous monitoring of the model. This monitoring does not only apply to the accuracy of predictions, but also to the overall serving performance, such as the respect of SLAs.

2.5 | Governance & Security

This is one of the most important layers of the Framework: it must control access to the platform and make sure everything is secured and properly governed.

2.6 | Management, Administration & Orchestration

It is crucial to trace every single interaction with the platform. That is why this layer must provide strong Audit and Logging capabilities.
Plus, the previously listed layers need to be glued together in order to produce a pipeline. That is where the Workflow Management capability comes in handy.

3 | How to implement this logical architecture?

Jillur Quddus relied on the open-source community to implement his data insights platform. No wonder, as open-source projects with active communities “lower the barriers to adoption and collaboration, allowing people to spread and improve projects quickly.”⁷

So, I followed his lead. My mission, should I choose to accept it, was to find an open-source project that integrates well with AWS and with the infrastructure choices I made in the previous article.

Along with AWS services, I chose Kubeflow⁸ as the framework for this machine learning platform, for three reasons:

  • It is a pluggable framework: it is possible to extend its capabilities by implementing new plugins. This is one of the main reasons why Kubeflow has grown so fast.
  • Kubeflow integrates well with some AWS services and these integrations are getting better with time.
  • Kubeflow is tightly integrated with Kubernetes and is even considered “the machine learning toolkit for Kubernetes.”⁹

Here is the final implementation of the machine learning framework:

Implementation of the Machine Learning Framework, by the author

Let us view, layer by layer, the implementation of the machine learning framework and discuss some technical challenges that could be faced when trying to deploy such a framework.

3.1 | Data Storage

Because it is highly durable and reliable, S3 is used extensively in this layer. As explained in the last post, Amazon FSx for Lustre is used to share the training set between worker nodes and get better data access performance.
The online feature store should provide fast access to the data, as it is used by the real-time served model. That is why DynamoDB is one of the best choices for this storage.
For the offline feature store, S3 is enough, as it is easily queryable by services like Athena.
Trained models can be easily fetched from S3 as well. These models are stored in a versioned repository with their metadata, which is kept in a versioned JSON/XML file.
Elastic Container Registry (ECR) is used for hosting the Docker images of the packaged models.
As for predictions and performance metrics, they are stored in S3.
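
To make this more concrete, here is a minimal sketch, in Python with boto3, of how the framework could write and read online features in DynamoDB and query offline features in S3 through Athena. All table, database, and bucket names are hypothetical.

```python
import boto3

# Table, database, and bucket names below are hypothetical.
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
online_features = dynamodb.Table("online-feature-store")

# Write a near-real-time feature produced by the streaming job.
online_features.put_item(
    Item={
        "entity_id": "restaurant-42",
        "feature_name": "avg_meal_prep_time_1h",
        "value": "12.4",
        "updated_at": "2020-07-13T10:00:00Z",
    }
)

# Low-latency lookup of the same feature at inference time.
item = online_features.get_item(
    Key={"entity_id": "restaurant-42", "feature_name": "avg_meal_prep_time_1h"}
).get("Item")

# Offline features live in S3 and can be queried with Athena to build training sets.
athena = boto3.client("athena", region_name="eu-west-1")
athena.start_query_execution(
    QueryString="SELECT * FROM offline_features WHERE computation_date = DATE '2020-07-12'",
    QueryExecutionContext={"Database": "feature_store"},
    ResultConfiguration={"OutputLocation": "s3://ml-platform-athena-results/"},
)
```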

3.2 | Data & Model preparation

  • Data Exploration capability: because this capability is the Data Scientists’ playground, Jupyter notebooks are strongly present here. Kubeflow comes with a dashboard that simplifies the use of Jupyter. From the dashboard, the user can launch a JupyterHub server and then launch Jupyter notebooks.
    Another great Kubeflow feature is multi-user isolation: based on namespaces, the cluster admin can allow or deny a user’s access to a Jupyter notebook.
  • Model Training capability: Kubeflow integrates with many machine learning frameworks, such as TensorFlow and PyTorch, by defining Kubernetes custom resources. TFJob, for example, is a Kubernetes custom resource used to run TensorFlow training jobs on the cluster (a minimal submission sketch follows this list).
    Plus, it is possible to harness all the power of the GPUs available in the EKS cluster by running a distributed training job: MPI Job (a Kubeflow component) spreads the model’s code across the nodes of the cluster and uses Horovod (another Uber masterpiece) to handle all the exchanges of parameters between nodes during training.
  • Model Evaluation capability: Kubeflow comes with another component to handle this feature: Katib. “Katib runs several training jobs (known as trials) within each hyperparameter tuning job (experiment). Each trial tests a different set of hyperparameter configurations. At the end of the experiment, Katib outputs the optimized values for the hyperparameters.” ¹⁰
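
As a concrete illustration of the Model Training capability, here is a minimal sketch that submits a TFJob custom resource with the official Kubernetes Python client. The image, namespace, and data locations are hypothetical, and the manifest fields should be checked against the TFJob version deployed on the cluster.

```python
from kubernetes import client, config

# Minimal TFJob manifest; image, namespace, and replica counts are illustrative.
tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "demand-model-training", "namespace": "data-science"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                # TFJob expects the training container to be named "tensorflow".
                                "name": "tensorflow",
                                "image": "<account>.dkr.ecr.eu-west-1.amazonaws.com/demand-model:latest",
                                "args": ["python", "/opt/train.py", "--data", "s3://curated-data/training/"],
                            }
                        ]
                    }
                },
            }
        }
    },
}

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="data-science", plural="tfjobs", body=tfjob
)
```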

3.3 | Model Operationalization

Docker, installed on each worker node, is used to create a Docker image containing all the pieces of the puzzle the model needs to function correctly.
The model is then deployed as a Docker container.
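
To give an idea of what ends up inside that Docker image, here is a minimal sketch of a Python model wrapper following Seldon Core’s Python wrapper convention (Seldon Core is introduced in the next section). The class name, file paths, and the use of joblib are illustrative assumptions.

```python
# Model.py: this file is baked into the Docker image together with the trained
# model and the feature extractor; Seldon Core's Python wrapper convention only
# requires a class exposing a predict() method.
import joblib


class Model:
    def __init__(self):
        # Artifacts shipped inside the image (paths are illustrative).
        self.feature_extractor = joblib.load("/app/feature_extractor.joblib")
        self.model = joblib.load("/app/model.joblib")

    def predict(self, X, features_names=None):
        # Apply the same feature extraction used at training time,
        # then return the model's predictions.
        features = self.feature_extractor.transform(X)
        return self.model.predict_proba(features)
```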

3.4 | Model Serving

Seldon Core, along with the Ambassador API Gateway, solves the riddle of real-time serving by exposing a REST API.
Multiple deployment strategies are possible with Seldon Core.
One more time, this is feasible thanks to the extensibility of Kubernetes and Kubeflow: a SeldonDeployment Kubernetes custom resource is used in this case.
Kubeflow’s TensorFlow Batch Prediction component is used for batch serving.
Following in the footsteps of Babylon Health¹¹, I chose Grafana and Prometheus for monitoring the model.
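
As an illustration of real-time serving, here is a minimal sketch of a client calling the model through the Ambassador API Gateway using Seldon Core’s REST prediction protocol. The host, namespace (serving), and deployment name (demand-model) are hypothetical.

```python
import requests

# Host, namespace ("serving"), and deployment name ("demand-model") are hypothetical.
AMBASSADOR_HOST = "https://ml-platform.example.com"
ENDPOINT = f"{AMBASSADOR_HOST}/seldon/serving/demand-model/api/v1.0/predictions"

# Seldon Core's REST protocol wraps the input features in a "data" payload.
payload = {"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}}

response = requests.post(ENDPOINT, json=payload, timeout=2)  # keep the SLA in mind
response.raise_for_status()
print(response.json()["data"]["ndarray"])  # the model's predictions
```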

3.5 | Governance & Security

This layer, as always, is the most fun part, and it is quite challenging:

  • Access Control: to authenticate users, AWS Single Sign-On and AWS Directory Service are used, especially by companies already running Active Directory (AD), Active Directory Federation Services (ADFS), or a Lightweight Directory Access Protocol (LDAP) directory for identity and access management on-premises.
    However, to integrate with Kubernetes, AWS IAM Authenticator for Kubernetes, a third-party solution, must be installed and used in addition to these two.
    IAM Authenticator for Kubernetes is a tool that maps AWS IAM credentials to Kubernetes identities. A detailed user guide of this tool is available here.
    The following diagram shows how these three solutions work together. Notice the mapping between IAM roles and RBAC roles done by the AWS IAM Authenticator inside the EKS cluster.
Detailed view of the Machine Learning Framework’s access control capability, by the author
  • Giving the right privileges to the right microservice: in the world of microservices and containerized applications, where multiple microservices share the same worker node, isolating these services and giving each one the least privileges it needs to function properly is not that easy.
    Obviously, creating one IAM role per worker node is not a solution. Why? Simply because that role would have to be the union of all the IAM roles needed by every service deployed on that same worker node.
    Sadly, this would give services far more privileges than they need.
    Two possible solutions:
    - Use IAM Roles for Service Accounts from AWS: such a role is assigned to a pod. Multiple pods can coexist on the same worker node, and each pod gets its own isolated set of permissions.
    Unfortunately, Kubeflow is not properly integrated with this feature yet.
    - Use kube2iam: yet another third-party solution. The idea behind it is “to redirect the traffic that is going to the ec2 metadata API for docker containers to a container running on each instance, make a call to the AWS API to retrieve temporary credentials, and return these to the caller. Other calls will be proxied to the EC2 metadata API.” ¹² The project’s GitHub page explains very well how to use this solution and the different configurations to consider (enabling host networking with hostNetwork: true, proxying the traffic going to the EC2 metadata API at 169.254.169.254, considerations for IAM roles).
    This is the solution recommended by Kubeflow for the moment.
  • Secrets Management: as simple as it sounds, Kubernetes secrets can be used for managing the platform’s secrets, but on their own they are not strong enough: secret values are merely base64-encoded, which is an encoding, not an encryption.
    A stronger approach is to encrypt these secrets with AWS KMS (a minimal sketch of the underlying KMS calls follows this section).
    The following diagram comes straight from the amazing EKS Workshop provided by Amazon.
Encrypting Secrets with AWS Key Management Service (KMS) Keys, from EKS Workshop

Take a look here for a detailed encryption/decryption process.
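
Note that with EKS, envelope encryption of Kubernetes secrets is handled by the cluster itself once a KMS key is attached to it; the minimal sketch below only illustrates the underlying KMS encrypt/decrypt calls with boto3. The key alias and the secret value are hypothetical.

```python
import boto3

kms = boto3.client("kms", region_name="eu-west-1")

# The key alias and the secret value are hypothetical; on EKS, a key like this one
# is attached to the cluster so that Kubernetes secrets are envelope-encrypted
# before they reach etcd.
ciphertext = kms.encrypt(
    KeyId="alias/ml-platform-secrets",
    Plaintext=b"super-secret-database-password",
)["CiphertextBlob"]

# Decryption does not need the KeyId: KMS resolves it from the ciphertext itself.
plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
```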

3.6 | Management, Administration & Orchestration

To automate the process of preparing the model, the Kubeflow Pipelines component allows us to design a workflow with simple Python code. This code contains the steps of the workflow, where each step defines a set of parameters: code location, data location, output location, etc.
Each step of the workflow, also called a pipeline component, is deployed as a Docker container.
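
Here is a minimal sketch of what such a pipeline definition could look like with the Kubeflow Pipelines (kfp) SDK. The pipeline name, container images, and S3 locations are hypothetical; the real pipeline would chain all the steps described in this series.

```python
import kfp
from kfp import dsl


# Pipeline name, container images, and S3 locations are illustrative.
@dsl.pipeline(
    name="demand-model-pipeline",
    description="Prepare features, then train the model",
)
def demand_model_pipeline(data_location: str = "s3://curated-data/training/"):
    prepare = dsl.ContainerOp(
        name="prepare-features",
        image="<account>.dkr.ecr.eu-west-1.amazonaws.com/prepare-features:latest",
        arguments=["--input", data_location, "--output", "s3://feature-store/offline/"],
    )
    train = dsl.ContainerOp(
        name="train-model",
        image="<account>.dkr.ecr.eu-west-1.amazonaws.com/train-model:latest",
        arguments=["--features", "s3://feature-store/offline/", "--model-output", "s3://models/demand/"],
    )
    # Each step runs as its own Docker container; this call only orders them.
    train.after(prepare)


if __name__ == "__main__":
    # Submit the pipeline to the Kubeflow Pipelines endpoint configured for the cluster.
    kfp.Client().create_run_from_pipeline_func(demand_model_pipeline, arguments={})
```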

CloudTrail and CloudWatch are used respectively for auditing and logging purposes.

4 | Wait… But haven’t you heard about SageMaker?!

Sure! SageMaker is a great Machine Learning as a Service (MLaaS) platform and it provides a lot of good features.
However, we do not own it, and more importantly, we cannot extend its capabilities.

The greatest thing I see about owning a machine learning platform is that, by definition, I am able to optimize its design and extend its capabilities with whatever new feature I dreamed about yesterday!

And the greatest news is: Amazon, one step ahead as always, recently launched Amazon SageMaker Components for Kubeflow Pipelines¹³, which makes it possible to use SageMaker’s advanced features while owning the kind of platform designed in this series of articles.

Two use cases seem to be efficiently solved thanks to this new project (a minimal sketch follows the list below):

  • Absorb a temporary burst of computing capacity by launching jobs directly on SageMaker.
  • Optimize the computing cost by launching SageMaker jobs on Spot Instances.

For more details about this new project, take a look here.
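
As a rough sketch of how this could look in practice, the snippet below loads the SageMaker training component published in the kubeflow/pipelines repository and uses it inside a pipeline. The component URL, parameter names, and values are illustrative and should be verified against the component definition.

```python
import kfp
from kfp import components, dsl

# The SageMaker training component is published in the kubeflow/pipelines repository;
# the URL, parameter names, and values below are illustrative and should be checked
# against the component definition.
sagemaker_train_op = components.load_component_from_url(
    "https://raw.githubusercontent.com/kubeflow/pipelines/master/components/aws/sagemaker/train/component.yaml"
)


@dsl.pipeline(name="burst-training-on-sagemaker")
def burst_training_pipeline():
    sagemaker_train_op(
        region="eu-west-1",
        image="<account>.dkr.ecr.eu-west-1.amazonaws.com/train-model:latest",
        instance_type="ml.p3.2xlarge",
        instance_count=1,
        spot_instance=True,  # use Spot capacity to optimize the training cost
        model_artifact_path="s3://models/demand/",
        role="arn:aws:iam::<account>:role/sagemaker-execution-role",
    )
```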

Conclusion

In this article, I tried to design the framework layer of the machine learning platform.
After defining the logical architecture of the Framework, I tried to implement it by using a combination of AWS services and open source projects: Kubeflow, Seldon Core, Ambassador API Gateway, Docker, Grafana, and Prometheus.

To clarify the implementation of the machine learning framework, I gave technical details about its functional capabilities and went deeper into entertaining aspects like security.

Finally, I discussed possible integrations with Amazon SageMaker and explained why I am not using it as a whole platform.

In the next article, I will talk about the fourth layer of this machine learning platform: Use Cases Layer.

If you have any questions, please reach out to me on LinkedIn.

[1] https://en.wikipedia.org/wiki/Software_framework

[2] https://www.youtube.com/watch?v=66h7DrOEF5k

[3] https://eng.uber.com/michelangelo-machine-learning-platform/

[4] https://docs.metaflow.org/introduction/what-is-metaflow

[5] https://www.youtube.com/watch?v=V__qXSXms9w

[6] https://www.oreilly.com/library/view/machine-learning-with/9781789346565/35fe3bc5-a06a-4885-815b-be347137c3ba.xhtml

[7] https://opensource.guide/starting-a-project/

[8] https://www.kubeflow.org/

[9] https://ubuntu.com/blog/kubernetes-for-data-science-meet-kubeflow

[10] https://www.kubeflow.org/docs/components/hyperparameter-tuning/overview/

[11] https://www.youtube.com/watch?v=ULlqukKVKBo

[12] https://github.com/jtblin/kube2iam

[13] https://aws.amazon.com/blogs/machine-learning/introducing-amazon-sagemaker-components-for-kubeflow-pipelines/

Salah REKIK

Passionate Data Architect with progressive experience in building a big data platform at @CreditAgricole and a machine learning platform on AWS