Katonic MLOps Platform Architecture
Overview
Katonic operates within a Kubernetes cluster, utilizing specific worker nodes for different purposes. Platform nodes handle essential components, compute nodes execute data workflows, deployment nodes manage model and app deployments, and storage nodes handle data storage. All workloads run as containerized processes orchestrated by Kubernetes, providing flexibility for deployment in various environments.
The Katonic platform runs its workloads on four types of worker nodes:
1. Platform nodes
- These are Kubernetes worker nodes that run platform-specific components: the user interfaces, the Katonic API server, orchestration, metadata, and supporting services.
- These nodes have the highest priority in a cluster because they keep all the essential components of the Katonic MLOps platform running.
- No other workloads can be assigned to these nodes, and the number of platform nodes required in a cluster is fixed.
2. Compute nodes
- These are Kubernetes worker nodes that process users' data. This is where users' data science, engineering, and machine learning workflows are executed.
- As the name suggests, compute nodes handle all of the users' computation needs, so they require substantial resources and are typically deployed in larger numbers.
- Compute node failures do not take the platform offline, so compute nodes have lower priority than platform nodes.
- The workloads assigned to compute nodes are dynamic, so these nodes are usually autoscaled to meet demand.
3. Deployment nodes
- These are Kubernetes worker nodes that handle all model API and app deployments.
- This set of nodes is separated from the other nodes to ensure that enough compute resources are available for model APIs and app deployments.
- Deployment node failures do not take the platform offline, so deployment nodes have lower priority than platform nodes.
4. Storage nodes
- Unlike cloud deployments of Kubernetes, where the cloud provider handles data storage for the platform, on-premise systems need an internal storage solution.
- Storage nodes run Katonic's storage solution, which works with the Kubernetes cluster and the Katonic platform.
- These nodes are reserved exclusively for the pods and deployments of the storage services and are considered very important.
- The number of storage nodes depends on the storage capacity and fault tolerance required.
The Katonic application leverages containerized processes orchestrated by Kubernetes, an industry-standard container orchestration system. Developed by Google, Kubernetes enjoys widespread support from communities and vendors, including managed services offered by major cloud providers. Typically, Katonic customers take charge of provisioning and managing their Kubernetes cluster, enabling them to install Katonic within it.
The Katonic Platform offers remarkable flexibility, accommodating various deployment environments for the application layer. Once the platform is deployed, users can seamlessly add different worker layers regardless of their hosting methods.
Cloud Architecture
In the cloud, Katonic operates within a Kubernetes cluster consisting of master nodes, worker nodes specifically allocated for hosting Katonic platform services, and worker nodes dedicated to handling compute workloads. Outside the cluster, there is a reliable blob storage system, and a load balancer responsible for managing user connections.
On-Premise Architecture
When deploying the Katonic Platform on-premise, the workloads mirror those of a cloud deployment. Alongside the platform and compute nodes found in a cloud system, on-premise deployments introduce two additional node types: master nodes and storage nodes. Master nodes serve a crucial role in managing the cluster, while storage nodes are dedicated to handling storage-related tasks within the on-premise environment.
Services
The Katonic platform offers a variety of services, comprising both proprietary and third-party open-source tools and libraries. These services are exposed as application services and managed through Kubernetes. Users can access and manage these services from the platform dashboard, following a self-service model. Note that certain services, which operate without user intervention, may not be visible in the dashboard. The platform consists of three types of managed application services, each serving specific functions:
Platform Services:
The platform services encompass the Katonic API server, Pipelines, Keycloak authentication service, and metadata services. These services enable features such as reproducibility and collaboration. Application object metadata is stored in MongoDB, code and file versioning are managed by Git, and the Docker registry is utilized by Katonic Environments. These services are hosted on platform nodes, which also include dedicated master nodes for the Kubernetes cluster. While users cannot delete default services, service administrators have the ability to disable or restart them and modify certain configurations.
Compute Services:
The compute services function as the execution layer where Katonic launches and manages ephemeral pods for user workloads. These pods handle various tasks such as Pipeline Runs, Workspaces, and docker image builds. Compute nodes are responsible for running these pods.
Deployment Services:
Similar to compute services, deployment services operate as the execution layer where Katonic launches and manages ephemeral pods for user workloads. These pods primarily handle Model APIs, Apps, and docker image builds. Deployment nodes are dedicated to running these pods.
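The split between compute and deployment services described above can be summarized as a small lookup. This is only an illustrative sketch: the workload names come from the two paragraphs above, while `WORKLOADS_BY_POOL` and `pools_for` are hypothetical names, not part of any Katonic API.

```python
# Workload types per node pool, as listed in the service descriptions above.
# The dictionary and function names are illustrative assumptions.
WORKLOADS_BY_POOL = {
    "compute": {"Pipeline Runs", "Workspaces", "docker image builds"},
    "deployment": {"Model APIs", "Apps", "docker image builds"},
}


def pools_for(workload: str) -> list[str]:
    """Return the node pools whose services run the given workload type."""
    return sorted(
        pool for pool, kinds in WORKLOADS_BY_POOL.items() if workload in kinds
    )


print(pools_for("Workspaces"))
print(pools_for("docker image builds"))
```

Docker image builds appear under both pools because the text lists them for both compute and deployment services.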
Pre-Deployed Services
The following software packages, services, and tools are pre-deployed as part of the version 4.5 platform installation:
Service | Namespace | Type | Description |
---|---|---|---|
Istio Ingress | istio-system | Networking Service | Istio extends Kubernetes to establish a programmable, application-aware network using the powerful Envoy service proxy. |
Katonic UI | application | UI Component | The platform's graphical user interface. |
Katonic API Server | application | API Server | The Katonic Platform's API server and endpoint. |
Katonic File manager | default | Service | The distributed file storage and management system (MinIO). |
Keycloak | keycloak | Identity Provider | Keycloak is an enterprise-grade open-source authentication service. Katonic uses Keycloak to store user identities and properties, and optionally for identity brokering or identity federation to SSO systems and identity providers. |
MLflow | mlflow | Service | MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. |
MLflow-MinIO | mlflow | Service | The distributed file storage and management system (MinIO). |
MongoDB | application | Database | MongoDB is an open-source document database. Katonic uses MongoDB to store Katonic entities, like projects, users, and organizations. Katonic stores the structure of these entities in MongoDB, but underlying data is stored separately in encrypted blob storage. |
Monitoring (monitoring) | monitoring | Logging and Monitoring | A platform service for monitoring application services and gathering performance statistics and additional data. The gathered data is visualized on Grafana dashboards using the platform's Grafana services. |
Pipelines (pipelines) | kubeflow | Service | The Google Kubeflow Pipelines open-source framework for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers. |
Postgres Database | application | Database | Postgres is an open-source relational database system. Katonic uses Postgres as a storage system for Keycloak data on user identities and attributes. In addition to the Keycloak data, it stores experimentation metadata and is also used as the offline metadata database for the Feature store. |
Redis | application | Database | Redis is an open-source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. It stores Katonic deployment ML metadata and is also used for the online Feature store. |
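To see which namespaces the installation populates, the table can be regrouped by namespace. The sketch below copies the service and namespace columns from the table above; the helper name `services_by_namespace` is an assumption made for illustration.

```python
from collections import defaultdict

# (service, namespace) pairs copied from the table above.
PRE_DEPLOYED = [
    ("Istio Ingress", "istio-system"),
    ("Katonic UI", "application"),
    ("Katonic API Server", "application"),
    ("Katonic File manager", "default"),
    ("Keycloak", "keycloak"),
    ("MLflow", "mlflow"),
    ("MLflow-MinIO", "mlflow"),
    ("MongoDB", "application"),
    ("Monitoring", "monitoring"),
    ("Pipelines", "kubeflow"),
    ("Postgres Database", "application"),
    ("Redis", "application"),
]


def services_by_namespace(rows):
    """Group service names by the namespace they are deployed into."""
    grouped = defaultdict(list)
    for service, namespace in rows:
        grouped[namespace].append(service)
    return dict(grouped)


for namespace, services in services_by_namespace(PRE_DEPLOYED).items():
    print(f"{namespace}: {', '.join(services)}")
```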
User Accounts
Katonic uses Keycloak to manage user accounts. Keycloak supports the following modes of authentication to Katonic:
Local accounts
With local accounts, anyone with network access to the Katonic application can create a Katonic account. During signup, users provide a username, password, and email address to create a Katonic-managed account. Katonic administrators can monitor, manage, and deactivate these accounts through the application. Katonic can also be configured, through Keycloak, to require multi-factor authentication and enforce password criteria.
Learn more about Keycloak Administration.
Identity federation
Keycloak can be configured to integrate with an Active Directory (AD) or LDAP(S) identity provider (IdP). When identity federation is enabled, local account creation is disabled and Keycloak will authenticate users against identities in the external IdP and retrieve configurable properties about those users for Katonic usernames and email addresses.
Learn more about Keycloak Identity Federation.
Identity brokering
Keycloak can be configured to broker authentication between Katonic and an external authentication or SSO system. When identity brokering is enabled, Katonic will redirect users in the authentication flow to a SAML, OAuth, or OIDC service for authentication. Following authentication in the external service, the user is routed back to Katonic with a token containing user properties.
Learn more about Keycloak Identity Brokering.
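As a rough illustration of a "token containing user properties", the sketch below assumes an OIDC flow in which the token is a JWT, and reads the claims from its payload. The demo token, the user values, and the helper names are all made up for the example. The code does not verify the signature, so it is suitable for inspection only, never for authentication decisions.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Return the claims in a JWT payload WITHOUT verifying the signature."""
    segments = token.split(".")
    if len(segments) != 3:
        raise ValueError("a JWT has three dot-separated segments")
    payload = segments[1]
    # base64url data in JWTs omits padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def _encode_segment(obj: dict) -> str:
    """Demo helper: base64url-encode a JSON object without padding."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Build an unsigned demo token with made-up user properties.
demo_token = ".".join([
    _encode_segment({"alg": "none"}),
    _encode_segment({"preferred_username": "jdoe", "email": "jdoe@example.com"}),
    "",
])

claims = decode_jwt_payload(demo_token)
print(claims["preferred_username"], claims["email"])
```

`preferred_username` and `email` are standard OIDC claims; which properties an actual deployment receives depends on how the external provider and Keycloak are configured.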
Hardware Configurations
The platform is available in two configurations, which differ in a variety of aspects, including the performance capacity, footprint, storage size, and scale capabilities:
Operational Cluster
This configuration is intended for operational clusters, including those that require high availability (HA) or are used for performance testing. It is designed to deliver the performance needed for real-time execution of analytics, machine learning (ML), and artificial intelligence (AI) applications in a production pipeline.
Set up compatible Kubernetes infrastructure for Katonic
Cluster Requirements
You can deploy Katonic into a Kubernetes cluster that meets the following requirements.
General requirements
- Kubernetes version: Katonic 4.5 has been validated with Kubernetes version 1.27.
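A pre-installation check against the validated version could look like the sketch below. It assumes the version string has the form reported by the Kubernetes API server (for example `v1.27.4`); the function names are illustrative and not part of the Katonic installer.

```python
import re

# Kubernetes minor version that Katonic 4.5 has been validated with (see above).
VALIDATED_MINOR = (1, 27)


def parse_k8s_version(git_version: str) -> tuple[int, int]:
    """Extract (major, minor) from a string such as 'v1.27.4'."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def is_validated(git_version: str) -> bool:
    """True if the cluster runs the validated major.minor release."""
    return parse_k8s_version(git_version) == VALIDATED_MINOR


for version in ("v1.27.4", "v1.28.0"):
    print(version, is_validated(version))
```

The check compares only the major and minor components, so patch releases and vendor suffixes of 1.27 are treated as matching.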
Cluster permissions
Katonic needs permission to install and configure pods in the cluster through the Katonic installer. The installer is delivered as a containerized Python utility that runs Ansible using a kubeconfig that provides service account access to the cluster.
Namespaces
No namespace configuration is necessary prior to installation. Katonic creates the dedicated namespaces as part of the installation.
Node pool requirements
Katonic requires a minimum of three node pools: one to host the Katonic Platform, one to host Compute workloads, and one for storage. Additional optional pools can be added to provide specialized execution hardware for some Compute workloads.
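One way to check the three-pool minimum is to classify nodes by the labels given in the pool requirements below and report which required pools are absent. This is a sketch under the assumption that node labels are available as plain dictionaries; `pool_of` and `missing_pools` are hypothetical helper names.

```python
from typing import Optional

# Label keys taken from the pool requirements in this section.
POOL_LABEL = "katonic.ai/node-pool"
STORAGE_LABEL = "node-role.kubernetes.io/storage-node"
REQUIRED_POOLS = {"platform", "compute", "storage"}


def pool_of(labels: dict[str, str]) -> Optional[str]:
    """Return the pool a node belongs to, based on its labels."""
    if STORAGE_LABEL in labels:
        return "storage"
    return labels.get(POOL_LABEL)


def missing_pools(nodes: list[dict[str, str]]) -> set[str]:
    """Return the required pools that no node in the list belongs to."""
    present = {pool_of(labels) for labels in nodes}
    return REQUIRED_POOLS - present


nodes = [
    {POOL_LABEL: "platform"},
    {POOL_LABEL: "compute"},
]
print(missing_pools(nodes))
```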
Master pool requirements
Boot Disk: Min 128GB
Min Nodes: 1
Max Nodes: 3
Spec: 2 CPU / 8GB
Platform pool requirements
Boot Disk: Min 128GB
Min Nodes: 2
Max Nodes: 3
Spec: 4 CPU / 16GB
Taints: katonic.ai/node-pool=platform:NoSchedule
Labels: katonic.ai/node-pool=platform
Compute pool requirements
Boot Disk: Min 128GB
Recommended Min Nodes: 1
Max Nodes: Set as necessary to meet demand and resourcing needs
Recommended min spec: 8 CPU / 32GB
Labels: katonic.ai/node-pool=compute
Deployment pool requirements
Boot Disk: Min 128GB
Recommended Min Nodes: 1
Max Nodes: Set as necessary to meet demand and resourcing needs
Recommended min spec: 8 CPU / 32GB
Taints: katonic.ai/node-pool=deployment:NoSchedule
Labels: katonic.ai/node-pool=deployment
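A pod can only land on a tainted pool if it both selects the pool's label and tolerates its taint. The sketch below derives the standard Kubernetes `nodeSelector` and `tolerations` fields from the taint and label strings listed for the platform and deployment pools. The helper names are assumptions; Katonic sets these fields itself for its own workloads.

```python
def parse_taint(taint: str) -> dict:
    """Split 'key=value:effect' (value optional) into its parts."""
    key_value, _, effect = taint.partition(":")
    key, _, value = key_value.partition("=")
    return {"key": key, "value": value, "effect": effect}


def toleration_for(taint: str) -> dict:
    """Build a Kubernetes toleration that matches the given taint."""
    parts = parse_taint(taint)
    toleration = {"key": parts["key"], "effect": parts["effect"]}
    if parts["value"]:
        toleration.update(operator="Equal", value=parts["value"])
    else:
        # A taint without a value is matched with the Exists operator.
        toleration["operator"] = "Exists"
    return toleration


def scheduling_fields(label: str, taint: str) -> dict:
    """Return the pod spec fields that pin a pod to a labeled, tainted pool."""
    key, _, value = label.partition("=")
    return {
        "nodeSelector": {key: value},
        "tolerations": [toleration_for(taint)],
    }


print(scheduling_fields(
    "katonic.ai/node-pool=deployment",
    "katonic.ai/node-pool=deployment:NoSchedule",
))
```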
Storage pool requirements
Boot Disk: Min 128GB
Recommended Min Nodes: 3
Max Nodes: Set as necessary to meet demand and resourcing needs
Recommended min spec: 2 CPU / 8GB
External disks with required storage size.
Taints: node.kubernetes.io/storage-node:NoSchedule
Labels: node-role.kubernetes.io/storage-node=ceph
Optional GPU compute pool
Boot Disk: recommended 512GB
Recommended Min Nodes: 0
Max Nodes: Set as necessary to meet demand and resourcing needs
Recommended min spec: 8 CPU / 16GB / one or more NVIDIA GPU devices
Nodes must be pre-configured with the appropriate NVIDIA driver and nvidia-docker2, and the default Docker runtime must be set to nvidia.
Taints: nvidia.com/gpu=katonic-{GPU type}
Labels: katonic.ai/node-pool=gpu-{GPU type}
Note: Examples of GPU type values are v100, A30, and A100.
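The per-pool figures above can be collected into a table and used to check candidate nodes. This is a sketch only: the numbers are copied from the pool requirements (several of which are recommendations rather than hard limits), and `PoolSpec` and `shortfalls` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSpec:
    min_cpu: int
    min_memory_gb: int
    min_boot_disk_gb: int


# Values copied from the pool requirements above; for the compute, deployment,
# storage, and GPU pools the CPU/memory figures are recommended minimums.
POOL_SPECS = {
    "master": PoolSpec(2, 8, 128),
    "platform": PoolSpec(4, 16, 128),
    "compute": PoolSpec(8, 32, 128),
    "deployment": PoolSpec(8, 32, 128),
    "storage": PoolSpec(2, 8, 128),
    "gpu": PoolSpec(8, 16, 512),
}


def shortfalls(pool: str, cpu: int, memory_gb: int, boot_disk_gb: int) -> list[str]:
    """List the ways a node falls short of the spec for its pool."""
    spec = POOL_SPECS[pool]
    problems = []
    if cpu < spec.min_cpu:
        problems.append(f"cpu {cpu} < {spec.min_cpu}")
    if memory_gb < spec.min_memory_gb:
        problems.append(f"memory {memory_gb}GB < {spec.min_memory_gb}GB")
    if boot_disk_gb < spec.min_boot_disk_gb:
        problems.append(f"boot disk {boot_disk_gb}GB < {spec.min_boot_disk_gb}GB")
    return problems


print(shortfalls("compute", cpu=4, memory_gb=32, boot_disk_gb=100))
```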
Cluster networking
Katonic relies on Kubernetes Network Policies to manage secure communication between pods in the cluster. Network policies are implemented by the network plugin, so your cluster must use a networking solution that supports NetworkPolicy, such as Calico.
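To make the NetworkPolicy dependency concrete, the sketch below builds a generic default-deny ingress policy as a manifest dictionary. It is not one of Katonic's own policies, and the policy name and target namespace are chosen only for illustration. On a cluster whose network plugin does not implement NetworkPolicy, such an object is accepted by the API server but has no effect.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Build a NetworkPolicy that blocks all ingress to pods in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Listing Ingress with no ingress rules denies all inbound traffic.
            "policyTypes": ["Ingress"],
        },
    }


print(json.dumps(default_deny_ingress("application"), indent=2))
```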
Ingress and SSL
Katonic must be configured to serve from a specific FQDN, and DNS for that name must resolve to the address of an SSL-terminating load balancer with a valid certificate. The load balancer must forward incoming connections on ports 80 and 443 to port 80 on all nodes in the Platform pool. This load balancer must support WebSocket connections.
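The port mapping described above can be written out as a list of forwarding rules. The sketch assumes the platform node addresses are known; the addresses and the `forwarding_rules` helper are hypothetical, and the actual configuration depends on the load balancer in use.

```python
LISTEN_PORTS = (80, 443)  # ports exposed by the load balancer
TARGET_PORT = 80          # port on every Platform pool node


def forwarding_rules(platform_nodes: list[str]) -> list[tuple[int, str, int]]:
    """Return (listen_port, node_address, target_port) for every mapping."""
    return [
        (listen, node, TARGET_PORT)
        for listen in LISTEN_PORTS
        for node in platform_nodes
    ]


# Example addresses only.
for rule in forwarding_rules(["10.0.0.11", "10.0.0.12"]):
    print(rule)
```

Because SSL terminates at the load balancer, both listening ports map to plain port 80 on the platform nodes.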