Version: 3.2

Katonic MLOps Platform Architecture

Overview

Katonic runs in a Kubernetes cluster that consists of a set of master nodes, worker nodes dedicated to hosting Katonic platform services, worker nodes dedicated to hosting compute workloads, and worker nodes dedicated to storage. A load balancer regulates connections from users.

The Katonic application distributes its workloads across three types of worker nodes:

Platform nodes

These are Kubernetes worker nodes that run platform-specific components. These components provide the user interfaces, the Katonic API server, orchestration, metadata, and supporting services.
These nodes are the most important in a cluster because they keep all the essential components of the Katonic MLOps platform running.
No other workloads can be assigned to these nodes, and the number of platform nodes required in a cluster is fixed.

Compute nodes

These are Kubernetes worker nodes that handle all user workloads. This is where users’ data science, engineering, and machine learning workflows are executed.
As the name suggests, compute nodes handle all the computation needs of the users, so they require substantial resources and are usually needed in larger numbers.
A compute node failure does not take the platform offline, so compute nodes are of lower priority than platform nodes.
The workloads assigned to the compute nodes are dynamic, so these nodes are usually auto-scaled to meet demand.

Storage nodes

Unlike a cloud deployment of Kubernetes, where the cloud provider handles data storage for the platform, an on-premises system needs its own internal storage solution.
Storage nodes run Katonic’s storage solution, which works with the Kubernetes cluster and the Katonic platform.
These nodes are reserved exclusively for the pods and deployments of the storage services and are considered very important.
The number of storage nodes depends on the storage capacity and fault tolerance required.

All workloads in the Katonic application run as containerized processes, orchestrated by Kubernetes. Kubernetes is an industry-standard container orchestration system. It was launched by Google and has broad community and vendor support, including managed offerings from all major cloud providers.

Typically, Katonic customers provision and manage their own Kubernetes cluster, into which they install Katonic. The Katonic Platform is built to be flexible and supports many different deployment environments for your Katonic application layer. After you have deployed the platform, you can add all of your different worker layers, no matter how they are hosted.

Cloud Architecture

In a cloud deployment, Katonic runs in a Kubernetes cluster with a set of master nodes, a set of worker nodes dedicated to hosting Katonic platform services, and a set of worker nodes dedicated to hosting compute workloads. Outside the cluster are a durable blob storage system and a load balancer that regulates connections from users.

[Figure: Cloud architecture diagram]

On-Prem Architecture

In an on-premises deployment of the Katonic Platform, the workloads of the system are the same as in a cloud deployment. In addition to the Platform and Compute nodes of a cloud deployment, an on-premises deployment has two additional node types: Master and Storage.

[Figure: On-prem architecture diagram]

Services

The platform comes pre-deployed with proprietary and third-party open-source tools and libraries that are exposed as application services and managed using Kubernetes. Users can view and manage relevant services from the platform dashboard using a self-service model. (Note that some services that don't require user intervention aren't visible in the dashboard.) The platform has two types of managed application services: platform services and compute services. The functionality provided by each type is described below.

Platform services

This service layer contains the Katonic API server, Pipelines, the Keycloak authentication service, and the metadata services that Katonic uses to provide reproducibility and collaboration features. MongoDB stores application object metadata, Git manages code and file versioning, and the Docker registry is used by Katonic Environments. All of these services run on platform nodes. The service layer also contains the dedicated master nodes for the Kubernetes cluster. These are the default services and can't be deleted by users, but service administrators can disable or restart them and modify some service configurations.

Compute services

The execution layer is where Katonic launches and manages ephemeral pods that run user workloads. These pods may host Pipeline Runs, Model APIs, Apps, Workspaces, and Docker image builds. They run on compute nodes.

Pre-Deployed Services

The following software packages, services, and tools are pre-deployed as part of the version 3.2 platform installation:

Service	Namespace	Type	Description
Istio Ingress	istio-system	Networking Service	Istio extends Kubernetes to establish a programmable, application-aware network using the powerful Envoy service proxy.
Katonic UI	application	UI Component	The platform's graphical user interface.
Katonic API Server	application	API Server	The Katonic Platform’s API server and endpoint.
Katonic File manager	default	Service	The distributed file storage and management system (MinIO).
Keycloak	keycloak	Identity Provider	Keycloak is an enterprise-grade open-source authentication service. Katonic uses Keycloak to store user identities and properties, and optionally for identity brokering or identity federation to SSO systems and identity providers.
MLflow	mlflow	Service	MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
MLflow-MinIO	mlflow	Service	The distributed file storage and management system (MinIO).
MongoDB	application	Database	MongoDB is an open-source document database. Katonic uses MongoDB to store Katonic entities, like projects, users, and organizations. Katonic stores the structure of these entities in MongoDB, but underlying data is stored separately in encrypted blob storage.
Monitoring (monitoring)	monitoring	Logging and Monitoring	A platform service for monitoring application services and gathering performance statistics and additional data. The gathered data is visualized on Grafana dashboards using the platform's Grafana services.
Pipelines (pipelines)	kubeflow	Service	The Google Kubeflow Pipelines open-source framework for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.
Postgres Database	application	Database	Postgres is an open-source relational database system. Katonic uses Postgres to store Keycloak data on user identities and attributes. In addition to the Keycloak data, it stores experimentation metadata and serves as the offline metadata database for the Feature Store.
Redis	application	Database	Redis is an open-source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. It stores Katonic deployment ML metadata and is also used for the online Feature Store.

User Accounts

Katonic uses Keycloak to manage user accounts. Keycloak supports the following modes of authentication to Katonic.

Local accounts

When using local accounts, anyone with network access to the Katonic application may create a Katonic account. Users supply a username, password, and email address on the signup page to create a Katonic-managed account. Katonic administrators can track, manage, and deactivate these accounts through the application. Katonic can be configured with multi-factor authentication and password requirements through Keycloak.

Learn more about Keycloak administration

Identity federation

Keycloak can be configured to integrate with an Active Directory (AD) or LDAP(S) identity provider (IdP). When identity federation is enabled, local account creation is disabled and Keycloak will authenticate users against identities in the external IdP and retrieve configurable properties about those users for Katonic usernames and email addresses.

Learn more about Keycloak identity federation

Identity brokering

Keycloak can be configured to broker authentication between Katonic and an external authentication or SSO system. When identity brokering is enabled, Katonic will redirect users in the authentication flow to a SAML, OAuth, or OIDC service for authentication. Following authentication in the external service, the user is routed back to Katonic with a token containing user properties.

Learn more about Keycloak identity brokering

Hardware Configurations

The platform is available in two configurations, which differ in a variety of aspects, including the performance capacity, footprint, storage size, and scale capabilities:

Proof of Concept

A cluster implementation with a single platform node and a single compute node. This configuration is designed mainly for proofs of concept and evaluations and doesn't include high availability (HA) or performance testing.

Operational Cluster

A scalable cluster implementation that is composed of a standard set of three master nodes, a set of worker nodes dedicated to hosting Katonic platform services, and a set of worker nodes dedicated to hosting compute workloads. This configuration is designed to achieve superior performance that enables real-time execution of analytics, machine learning (ML), and artificial intelligence (AI) applications in a production pipeline.

Set up compatible Kubernetes infrastructure for Katonic

Cluster Requirements

You can deploy Katonic into a Kubernetes cluster that meets the following requirements.

General requirements

Kubernetes 1.19+

Katonic 3.2 has been validated with Kubernetes 1.19–1.21.
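As a rough illustration of the version requirement, the sketch below checks whether a Kubernetes version string falls inside the validated 1.19–1.21 range. The function names are hypothetical and are not part of any Katonic tooling; the version string is assumed to look like the `gitVersion` value Kubernetes reports (for example `v1.20.4`).

```python
import re
from typing import Tuple

# Range taken from the text above: Katonic 3.2 is validated with 1.19 through 1.21.
VALIDATED_MIN = (1, 19)
VALIDATED_MAX = (1, 21)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.20.4' or '1.21.5-gke.100'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_validated_version(version: str) -> bool:
    """Return True if the version lies within the validated range (inclusive)."""
    return VALIDATED_MIN <= parse_major_minor(version) <= VALIDATED_MAX
```

A version newer than 1.21 still satisfies the "1.19+" minimum, so a `False` result here only means the version is outside the range the release was validated with.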

Cluster permissions

Katonic needs permission to install and configure pods in the cluster through the Katonic installer. The installer is delivered as a containerized Python utility that runs Ansible using a kubeconfig that provides service account access to the cluster.

Namespaces

No namespace configuration is necessary prior to installation. Katonic creates its dedicated namespaces as part of the installation.
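After installation, one way to sanity-check the result is to compare the namespaces present in the cluster with those listed in the services table earlier in this document. The sketch below is an assumption-laden illustration, not part of the Katonic installer: it parses the JSON that `kubectl get namespaces -o json` produces, and the helper name `missing_namespaces` is hypothetical.

```python
import json
from typing import List

# Namespaces that appear in the pre-deployed services table of this document.
EXPECTED_NAMESPACES = {
    "istio-system",
    "application",
    "default",
    "keycloak",
    "mlflow",
    "monitoring",
    "kubeflow",
}


def missing_namespaces(kubectl_json: str) -> List[str]:
    """Return the expected namespaces absent from `kubectl get namespaces -o json` output."""
    doc = json.loads(kubectl_json)
    present = {item["metadata"]["name"] for item in doc.get("items", [])}
    return sorted(EXPECTED_NAMESPACES - present)
```

The output could be captured with, for example, `kubectl get namespaces -o json > namespaces.json` and passed to the function as a string.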

Node pool requirements

Katonic requires a minimum of three node pools: one to host the Katonic Platform, one to host Compute workloads, and one for storage. Additional optional pools can be added to provide specialized execution hardware for some Compute workloads.

Master pool requirements

Boot Disk: 128GB
Min Nodes: 1
Max Nodes: 3
Spec: 2 CPU / 8GB

Platform pool requirements

Boot Disk: 128GB
Min Nodes: 2
Max Nodes: 3
Spec: 4 CPU / 16GB
Labels: agentpool=platform

Compute pool requirements

Boot Disk: 128GB
Recommended Min Nodes: 1
Max Nodes: Set as necessary to meet demand and resourcing needs.
Recommended min spec: 8 CPU / 32GB
Labels: agentpool=compute

Storage pool requirements

Boot Disk: 128GB
Recommended Min Nodes: 3
Max Nodes: Set as necessary to meet demand and resourcing needs
Recommended min spec: 2 CPU / 8GB
Labels: agentpool=storage

Optional GPU compute pool

Boot Disk: 400GB
Recommended Min Nodes: 0
Max Nodes: Set as necessary to meet demand and resourcing needs
Recommended min spec: 8 CPU / 16GB / one or more NVIDIA GPU devices
Nodes must be pre-configured with the appropriate NVIDIA driver and nvidia-docker2, and the default Docker runtime must be set to nvidia.
Labels: agentpool=gpu
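The labeled pool requirements above can be expressed as data and checked against a node inventory. The following is a minimal sketch under stated assumptions: the `Node` and `PoolRequirement` types and the `check_pools` function are hypothetical, the inventory is assumed to have been collected separately, and the master pool is left out because no label is specified for it. For the compute, storage, and GPU pools the figures are the recommended minimums from this document.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PoolRequirement:
    min_nodes: int
    min_cpu: int
    min_memory_gb: int


# Keys are the values of the `agentpool` label listed above.
REQUIREMENTS: Dict[str, PoolRequirement] = {
    "platform": PoolRequirement(min_nodes=2, min_cpu=4, min_memory_gb=16),
    "compute": PoolRequirement(min_nodes=1, min_cpu=8, min_memory_gb=32),
    "storage": PoolRequirement(min_nodes=3, min_cpu=2, min_memory_gb=8),
    "gpu": PoolRequirement(min_nodes=0, min_cpu=8, min_memory_gb=16),
}


@dataclass(frozen=True)
class Node:
    name: str
    agentpool: str  # value of the node's `agentpool` label
    cpu: int
    memory_gb: int


def check_pools(nodes: List[Node]) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    for pool, req in REQUIREMENTS.items():
        members = [n for n in nodes if n.agentpool == pool]
        if len(members) < req.min_nodes:
            problems.append(
                f"{pool}: found {len(members)} node(s), need at least {req.min_nodes}"
            )
        for node in members:
            if node.cpu < req.min_cpu or node.memory_gb < req.min_memory_gb:
                problems.append(
                    f"{pool}: node {node.name} is below "
                    f"{req.min_cpu} CPU / {req.min_memory_gb}GB"
                )
    return problems
```

Boot disk size and GPU driver setup are not covered by this sketch and would need separate checks.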

Cluster networking

Katonic relies on Kubernetes network policies to manage secure communication between pods in the cluster. Network policies are implemented by the network plugin, so your cluster must use a networking solution that supports NetworkPolicy, such as Calico.
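To see what a NetworkPolicy object looks like, the sketch below builds the standard Kubernetes "default deny all ingress" policy as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. This is a generic Kubernetes example, not one of the policies Katonic installs; the function name and the `netpol-test` namespace are placeholders chosen for illustration.

```python
import json
from typing import Any, Dict


def default_deny_ingress(namespace: str) -> Dict[str, Any]:
    """Build a NetworkPolicy that blocks all ingress to every pod in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Listing Ingress with no ingress rules denies all inbound traffic.
            "policyTypes": ["Ingress"],
        },
    }


if __name__ == "__main__":
    print(json.dumps(default_deny_ingress("netpol-test"), indent=2))
```

Applying such a policy in a scratch namespace and confirming that pod-to-pod traffic is then blocked is one way to tell whether the installed network plugin actually enforces NetworkPolicy; a plugin without support accepts the object but does not enforce it.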

Ingress and SSL

Katonic must be configured to serve from a specific FQDN, and DNS for that name must resolve to the address of an SSL-terminating load balancer with a valid certificate. The load balancer must forward incoming connections on ports 80 and 443 to port 80 on all nodes in the Platform pool. The load balancer must also support WebSocket connections.
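The load balancer requirements can be summarized as a small checklist. The sketch below encodes them against a hypothetical `LoadBalancerConfig` description; real load balancers expose these settings in provider-specific ways, so the field names here are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LoadBalancerConfig:
    forwarding: Dict[int, int]  # listener port -> target port on Platform nodes
    terminates_tls: bool        # SSL is terminated at the load balancer
    websocket_support: bool     # WebSocket connections are supported


def check_load_balancer(cfg: LoadBalancerConfig) -> List[str]:
    """Return a list of unmet requirements; an empty list means the config fits."""
    problems: List[str] = []
    for listener in (80, 443):
        target = cfg.forwarding.get(listener)
        if target is None:
            problems.append(f"no listener on port {listener}")
        elif target != 80:
            problems.append(f"port {listener} forwards to {target}, expected 80")
    if not cfg.terminates_tls:
        problems.append("load balancer does not terminate SSL")
    if not cfg.websocket_support:
        problems.append("load balancer does not support WebSocket connections")
    return problems
```

Certificate validity and DNS resolution for the FQDN are outside the scope of this sketch.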
