Katonic MLOps

Node pool requirements

The AKS cluster must have at least two node pools that produce worker nodes with the following specifications and distinct node labels, and it might include an optional GPU pool:

| SR NO. | POOL | MIN-MAX | VM | LABELS | TAINTS |
|---|---|---|---|---|---|
| 1 | Platform | 3-4 | Standard_DS3_v2 or Standard_D8ads_v5 | katonic.ai/node-pool=platform | katonic.ai/node-pool=platform:NoSchedule |
| 2 | Compute | 1-10 | Standard_D8s_v3 or Standard_D8ads_v5 | katonic.ai/node-pool=compute | |
| 3 | Deployment | 1-10 | Standard_D8s_v3 or Standard_D8ads_v5 | katonic.ai/node-pool=deployment | katonic.ai/node-pool=deployment:NoSchedule |
| 4 | GPU (Optional) | 0-5 | Standard_NC6s_v3 | katonic.ai/node-pool=gpu-{GPU-type} | nvidia.com/gpu=gpu-{GPU-type}:NoSchedule |
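
If you prepare the node pools yourself rather than letting the installer create them, the labels and taints in the table map onto flags of az aks nodepool add. The following is a minimal sketch, not the installer's own procedure: the function name add_katonic_pool, the AZ override variable, and the resource names my-rg and my-aks are hypothetical. The pool name and the label value are passed separately because a label such as gpu-{GPU-type} contains a hyphen, which AKS does not accept in Linux node pool names.

```shell
# Sketch: add one autoscaled AKS node pool carrying a Katonic label and,
# optionally, a taint. add_katonic_pool and the AZ variable are hypothetical
# helpers for this example; the az flags themselves are standard.
add_katonic_pool() {
  rg=$1; cluster=$2; pool=$3; label=$4; vm_size=$5; min=$6; max=$7; taint=${8:-}
  # Rebuild the positional parameters as the az argument list.
  set -- aks nodepool add \
    --resource-group "$rg" --cluster-name "$cluster" \
    --name "$pool" --node-vm-size "$vm_size" \
    --enable-cluster-autoscaler --min-count "$min" --max-count "$max" \
    --labels "katonic.ai/node-pool=$label"
  # The compute pool has no taint, so the flag is added only when one is given.
  if [ -n "$taint" ]; then
    set -- "$@" --node-taints "$taint"
  fi
  "${AZ:-az}" "$@"
}

# Preview the commands for two rows of the table without calling Azure.
AZ=echo add_katonic_pool my-rg my-aks platform platform Standard_D8ads_v5 3 4 \
  "katonic.ai/node-pool=platform:NoSchedule"
AZ=echo add_katonic_pool my-rg my-aks compute compute Standard_D8s_v3 1 10
```

Leaving AZ unset runs the real az command, so it is worth previewing with AZ=echo first.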

Note: Instance Type and Region Considerations
The cost of virtual machines varies according to the chosen instance type and region. You are encouraged to select from the mentioned instance types based on your specific requirements and budgetary considerations.

Please note that the following regions are not supported for deployment:

  • Brazil South
  • Germany Central
  • Germany Northeast
  • Austria East
  • Denmark East
  • Italy South
  • Italy Central
  • East India
  • New Zealand North
  • New Zealand Southeast
  • Central Indonesia
  • East Indonesia
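
A deployment script can refuse an unsupported region before any resources are created. The sketch below is only an illustration: is_unsupported_region is a hypothetical helper, and it matches the region display names exactly as written in the list above (the displayName values printed by az account list-locations), not the short names such as centralindia used in the configuration file.

```shell
# Sketch: return success (0) when a region display name is on the
# unsupported list above. is_unsupported_region is a hypothetical helper.
is_unsupported_region() {
  case "$1" in
    "Brazil South"|"Germany Central"|"Germany Northeast"|"Austria East"|\
    "Denmark East"|"Italy South"|"Italy Central"|"East India"|\
    "New Zealand North"|"New Zealand Southeast"|\
    "Central Indonesia"|"East Indonesia")
      return 0 ;;
    *)
      return 1 ;;
  esac
}

region="Central India"
if is_unsupported_region "$region"; then
  echo "$region is not supported for Katonic deployment"
else
  echo "$region is not on the unsupported list"
fi
```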

Note: The {GPU-type} placeholder stands for the GPU model, for example v100, A30, or A100.

Note: When backup_enabled = True, set compute_nodes.min_count to 2.

If you want Katonic to run with some components deployed as highly available ReplicaSets, you must use 2 availability zones. Every compute node pool you use must have a corresponding node pool in each availability zone used by the other node pools. Setting up an isolated node pool in a single zone can cause volume affinity issues.

To run the node pools across multiple availability zones, create duplicate node pools in each zone with the same configuration, including the same labels, so that pods are scheduled in the zone where the required ephemeral volumes are available.

Additional node pools with distinct katonic.ai/node-pool labels can be added to make other instance types available for Katonic executions.

The Katonic installer can set up all node pool and zone configurations for the Katonic platform.

Azure Platform-Node Specifications

Platform nodes in Azure cloud deployments must meet the following hardware requirements, according to the deployment type:

| SR NO. | COMPONENT | SPECIFICATION |
|---|---|---|
| 1 | Node count | Min 3 |
| 2 | Instance type | Standard_DS3_v2 or Standard_D8ads_v5 |
| 3 | vCPUs | 4 |
| 4 | Memory | 14 GB |
| 5 | Boot disk size | 128 GB |

Azure Compute-Node Specifications

Compute nodes in Azure cloud deployments of the Katonic platform must use one of the instance types listed below. Choose the type that best fits your requirements. Azure Kubernetes Service (AKS) is also supported for application nodes with these instance types. The Katonic platform requires at least 1 compute node for the Katonic Data Science version. For specification details of each type, refer to the Azure documentation.

Note: Supported compute node configurations

  • Standard_D8ads_v5 (default configuration)
  • Standard_D8s_v3
  • Standard_D16s_v3
  • Standard_D32s_v3
  • Standard_D48s_v3
  • Standard_D64s_v3
  • Boot Disk: 128 GB

Azure Deployment-Node Specifications

Deployment nodes in Azure cloud deployments of the Katonic platform must use one of the instance types listed below. Choose the type that best fits your requirements. Azure Kubernetes Service (AKS) is also supported for application nodes with these instance types. The Katonic platform requires at least 1 deployment node for the teams version. For specification details of each type, refer to the Azure documentation.

Note: Supported deployment node configurations

  • Standard_D8ads_v5 (default configuration)
  • Standard_D8s_v3
  • Standard_D16s_v3
  • Standard_D32s_v3
  • Standard_D48s_v3
  • Standard_D64s_v3
  • Boot Disk: 128 GB

Azure GPU-Node Specifications

GPU nodes in Azure cloud deployments of the Katonic platform must use one of the instance types listed below. Choose the type that best fits your requirements. Azure Kubernetes Service (AKS) is also supported for application nodes with these instance types. For specification details of each type, refer to the Azure documentation.

Note: Supported GPU node configurations

  • NCv3-series (GPU optimized)
  • Boot Disk: 512 GB

Additional node pools can be added with distinct katonic.ai/node-pool labels to make other instance types available for Katonic executions.

Prerequisites

To install and configure Katonic in your Azure account you must have:

  1. Quay credentials from Katonic.

  2. A PEM-encoded public key certificate for your domain and the private key associated with that certificate.

  3. An Azure subscription with enough quota to create:

  • At least 4 Standard_D8s_v3 or Standard_D8ads_v5 VMs.
  • NC6s_v3 or similar SKU VMs, if you want to use GPUs.

  4. A Linux (Ubuntu/Debian) machine, prepared with the following steps:

    a. Create a Resource Group in Azure

    az group create --name <RESOURCE_GROUP> --location <LOCATION>

    Note: You can get a list of all available locations by running the following command:

    az account list-locations

    You need to pass the name of the resource group later to the Katonic-installer.

    b. Create a Linux (Ubuntu/Debian) machine with 4 GB RAM and 2 vCPUs. Skip this step if you already have a machine with these specifications.

    Note: After the platform is deployed successfully, the VM can be deleted.

    c. Switch to the root user inside the machine.

    d. Install Azure CLI version 2.35.0 or later and log in to your Azure account with the az login command, using a user that has the Contributor role on the subscription.

    Note: On Debian-based machines, follow the Azure CLI installation instructions for v2.35+.

    e. If your Azure account has multiple tenants, use the following command to get your subscription ID.

    az account list --output table

    Save the subscription ID; you will need to pass it to the Katonic installer later.

To install the Katonic Platform MLOps version, follow the steps below:

1. Access the jump host and run az login.

2. Log in to Quay with the credentials described in the Prerequisites section above.

docker login quay.io

3. Retrieve the Katonic installer image from Quay.

docker pull quay.io/katonic/katonic-installer:v4.5

4. Create a directory.

mkdir katonic
cd katonic

5. Add the PEM-encoded public key certificate and private key to the directory.

Put the PEM-encoded public key certificate for your domain (with the .crt extension) and the private key associated with that certificate (with the .key extension) inside the current directory (katonic).
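
Before running the installer, it can help to confirm that the certificate and the private key actually belong together. The sketch below compares the public key derived from each file. It is not part of the Katonic installer: cert_matches_key is a hypothetical helper, it assumes the openssl CLI is installed, and it assumes the private key is not passphrase-protected.

```shell
# Sketch: check that a PEM certificate and a private key belong together by
# comparing the public key derived from each. cert_matches_key is a
# hypothetical helper; it assumes the openssl CLI is installed.
cert_matches_key() {
  cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 1
  key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 1
  # Succeed only when both extractions produced the same non-empty key.
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}

# Usage (file names are placeholders):
#   cert_matches_key example.com.crt example.com.key && echo "certificate and key match"
```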

6. The Katonic installer can deploy the Katonic Platform MLOps version in two ways:

  1. Create a private AKS cluster and deploy the Katonic Platform MLOps version on it.
  2. Install the Katonic Platform MLOps version on an existing private AKS cluster.

1. Creating AKS and deploying the Katonic MLOps Platform

Initialize the installer application to generate a template configuration file named katonic.yml.

docker run -it --rm --name generating-yaml -v $(pwd):/install quay.io/katonic/katonic-installer:v4.5 init azure katonic_mlops deploy_kubernetes private

Edit the configuration file with all necessary details about the target cluster, storage systems, and hosting domain. Read the following configuration reference:

| SR NO. | PARAMETER | DESCRIPTION | VALUE |
|---|---|---|---|
| 1 | katonic_platform_version | Katonic Platform version; set by default. | katonic_mlops |
| 2 | deploy_on | Cloud platform on which Katonic is to be deployed. | Azure |
| 3 | private_cluster | Set to "True" for a private cluster. | False |
| 4 | enable_exposing_genai_applications_to_internet | Set to "True" to expose GenAI applications to the internet. | False |
| 5 | public_domain_for_genai_applications | Public FQDN of the domain for GenAI applications that will be exposed to the internet. | e.g. public-chatbots.google.com |
| 6 | internal_loadbalancer | Set to "True" to use a private IP for the load balancer. | False |
| 7 | create_k8s_cluster | Set to False if the Kubernetes cluster already exists. If True, the installer creates the Kubernetes cluster on the given cloud platform. | True |
| 8 | kubernetes_version | AKS version (1.27 and above are supported). | e.g. 1.28.5 |
| 9 | cluster_name | Name of the cluster. | e.g. katonic-mlops-platform-v4-5 |
| 10 | resource_group_name | Azure resource group name. | e.g. my-resource-group |
| 11 | resource_group_location | Azure resource group location. | e.g. centralindia |
| 12 | azure_subscription_id | Azure subscription ID. | |
| 13 | vnet_name | Name of the VNet created for the private cluster. | |
| 14 | aks_subnet_name | Name of the subnet created for the private cluster. | |
| 15 | platform_nodes.instance_type | Platform node VM size. | e.g. Standard_D8ads_v5 |
| 16 | platform_nodes.min_count | Minimum number of platform nodes; should be 2. Note: A minimum of 3 platform nodes is required to install Superset or Airbyte. | e.g. 2 |
| 17 | platform_nodes.max_count | Maximum number of platform nodes. | e.g. 4 |
| 18 | platform_nodes.os_disk_size | Platform node OS disk size. | e.g. 128 GB |
| 19 | compute_nodes.instance_type | Compute node VM size. | e.g. Standard_D8ads_v5 |
| 20 | compute_nodes.min_count | Minimum number of compute nodes; should be 1. | e.g. 1 |
| 21 | compute_nodes.max_count | Maximum number of compute nodes. | e.g. 4 |
| 22 | compute_nodes.os_disk_size | Compute node OS disk size. | e.g. 128 GB |
| 23 | deployment_nodes.instance_type | Deployment node VM size. | e.g. Standard_D8ads_v5 |
| 24 | deployment_nodes.min_count | Minimum number of deployment nodes; should be 1. | e.g. 1 |
| 25 | deployment_nodes.max_count | Maximum number of deployment nodes; should be greater than deployment_nodes.min_count. | e.g. 4 |
| 26 | deployment_nodes.os_disk_size | Deployment node OS disk size. | e.g. 128 GB |
| 27 | gpu_enabled | Add a GPU node pool. | True or False |
| 28 | gpu_nodes.instance_type | GPU node VM size. | e.g. Standard_NC6s_v3 |
| 29 | gpu_nodes.gpu_type | Type of GPU available on the machine. | e.g. v100, k80 |
| 30 | gpu_nodes.min_count | Minimum number of GPU nodes. | e.g. 1 |
| 31 | gpu_nodes.max_count | Maximum number of GPU nodes. | e.g. 2 |
| 32 | gpu_nodes.os_disk_size | GPU node OS disk size. | e.g. 512 GB |
| 33 | gpu_nodes.gpu_vRAM | GPU node vRAM size. | |
| 34 | gpu_nodes.gpus_per_node | Number of GPUs per node. | |
| 35 | enable_gpu_workspace | Set to True if you want to use GPU Workspace. | True or False |
| 36 | storage_class_type.Premium_LRS | Set to "True" to use "Premium_LRS" as the storage class type instead of "StandardSSD_LRS". | True or False |
| 37 | shared_storage.create | Create shared storage. | True or False |
| 38 | private_bucket_limit | Private bucket size. | e.g. 10GB |
| 39 | minio_storage | Storage for MinIO. | e.g. 250 |
| 40 | workspace_timeout_interval | Workspace timeout interval, in hours. | e.g. 1 |
| 41 | backup_enabled | Enable backups. | True or False |
| 42 | backup_schedule | Backup schedule. | 0 0 1 * * |
| 43 | backup_expiration | Backup expiration. | 2160h0m0s |
| 44 | use_custom_domain | Set to True to host the Katonic platform on your custom domain. Skip if use_katonic_domain: True. | True or False |
| 45 | custom_domain_name | A valid domain name. | e.g. katonic.tesla.com |
| 46 | use_katonic_domain | Set to True to host the Katonic platform on the Katonic MLOps Platform domain. Skip if use_custom_domain: True. | True or False |
| 47 | katonic_domain_prefix | One word, lowercase letters only, no special characters. | e.g. tesla |
| 48 | enable_pre_checks | Set to True to run the pre-checks. | True / False |
| 49 | enable_accelerator | Set to "True" to enable accelerators. | False |
| 50 | enable_playground | Set to "True" to enable the playground. | False |
| 51 | AD_Group_Management | Set to "True" to enable sign-in using Azure AD. | False |
| 52 | AD_CLIENT_ID | Client ID of the app registered for SSO in the client's Azure or other identity provider. | |
| 53 | AD_CLIENT_SECRET | Client secret of the app registered for SSO in the client's Azure or other identity provider. | |
| 54 | AD_AUTH_URL | Authorization URL endpoint of the app registered for SSO. | |
| 55 | AD_TOKEN_URL | Token URL endpoint of the app registered for SSO. | |
| 56 | quay_username | Username for Quay. | |
| 57 | quay_password | Password for Quay. | |
| 58 | adminUsername | Email for the admin user. | e.g. john@katonic.ai |
| 59 | adminPassword | Password for the admin user. | At least 1 special character, 1 upper-case letter, 1 lower-case letter, and a minimum of 8 characters |
| 60 | adminFirstName | Admin first name. | e.g. john |
| 61 | adminLastName | Admin last name. | e.g. musk |
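
The adminPassword rules can be checked before the installer runs, which avoids a failed deployment late in the process. This is a minimal sketch under stated assumptions: valid_admin_password is a hypothetical helper, not something the installer provides, and "special character" is taken here to mean any non-alphanumeric character.

```shell
# Sketch: check a candidate adminPassword against the rules in the table:
# at least 8 characters, one upper-case letter, one lower-case letter, and
# one special character. valid_admin_password is a hypothetical helper, and
# "special" is assumed to mean any non-alphanumeric character.
valid_admin_password() {
  pw=$1
  [ "${#pw}" -ge 8 ] || return 1
  case "$pw" in *[[:upper:]]*) ;; *) return 1 ;; esac
  case "$pw" in *[[:lower:]]*) ;; *) return 1 ;; esac
  case "$pw" in *[![:alnum:]]*) ;; *) return 1 ;; esac
  return 0
}

if valid_admin_password 'Example#Pass1'; then
  echo "password meets the policy"
else
  echo "password does not meet the policy"
fi
```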

Installing the Katonic Platform MLOps version

docker run -it --rm --name install-katonic -v /root/.azure:/root/.azure -v $(pwd):/inventory quay.io/katonic/katonic-installer:v4.5

This will start a container and deploy the entire platform.

2. Deploying Katonic Platform MLOps version on existing Private AKS

The steps are similar to those in Installing the Katonic Platform with Azure Kubernetes Service. Edit the configuration file with all necessary details about the target cluster, storage systems, and hosting domain. The configuration reference below lists the only parameters required when installing the Katonic MLOps platform on an existing AKS cluster.

Prerequisites

You will need to create a storage class named kfs. Refer to the Azure → Dynamic Block Storage section of the main documentation for instructions on how to create the storage class.

Initialize the installer application to generate a template configuration file named katonic.yml.

docker run -it --rm --name generating-yaml -v $(pwd):/install quay.io/katonic/katonic-installer:v4.5 init azure katonic_mlops kubernetes_already_exists private

| SR NO. | PARAMETER | DESCRIPTION | VALUE |
|---|---|---|---|
| 1 | katonic_platform_version | Katonic Platform version; set by default. | katonic_mlops |
| 2 | deploy_on | Cloud platform on which the Katonic MLOps platform is deployed. | Azure |
| 3 | cluster_name | Name of the cluster you deployed. | e.g. katonic-mlops-platform-v4-5 |
| 4 | private_cluster | Set to "True" for a private cluster. | False |
| 5 | enable_exposing_genai_applications_to_internet | Set to "True" to expose GenAI applications to the internet. | False |
| 6 | public_domain_for_genai_applications | Public FQDN of the domain for GenAI applications that will be exposed to the internet. | e.g. public-chatbots.google.com |
| 7 | internal_loadbalancer | Set to "True" to use a private IP for the load balancer. | False |
| 8 | resource_group_name | Resource group name of your cluster. | e.g. my-resource-group |
| 9 | resource_group_location | Resource group location of your cluster. | e.g. centralindia |
| 10 | azure_subscription_id | Azure subscription ID. | |
| 11 | vnet_name | Name of the VNet created for the private cluster. | |
| 12 | aks_subnet_name | Name of the subnet created for the private cluster. | |
| 13 | private_bucket_limit | Private bucket size. | e.g. 10GB |
| 14 | minio_storage | Amount of storage required in the file manager /16. | e.g. 20Gi |
| 15 | workspace_timeout_interval | Workspace timeout interval, in hours. | e.g. 1 |
| 16 | backup_enabled | Enable backups. | True or False |
| 17 | backup_schedule | Backup schedule. | 0 0 1 * * |
| 18 | backup_expiration | Backup expiration. | 2160h0m0s |
| 19 | use_custom_domain | Set to True to host the Katonic platform on your custom domain. Skip if use_katonic_domain: True. | True or False |
| 20 | custom_domain_name | A valid domain name. | e.g. katonic.tesla.com |
| 21 | use_katonic_domain | Set to True to host the Katonic platform on the Katonic MLOps Platform domain. Skip if use_custom_domain: True. | True or False |
| 22 | katonic_domain_prefix | One word, lowercase letters only, no special characters. | e.g. tesla |
| 23 | enable_pre_checks | Set to True to run the pre-checks. | True / False |
| 24 | enable_accelerator | Set to "True" to enable accelerators. | False |
| 25 | enable_playground | Set to "True" to enable the playground. | False |
| 26 | AD_Group_Management | Set to "True" to enable sign-in using Azure AD. | False |
| 27 | AD_CLIENT_ID | Client ID of the app registered for SSO in the client's Azure or other identity provider. | |
| 28 | AD_CLIENT_SECRET | Client secret of the app registered for SSO in the client's Azure or other identity provider. | |
| 29 | AD_AUTH_URL | Authorization URL endpoint of the app registered for SSO. | |
| 30 | AD_TOKEN_URL | Token URL endpoint of the app registered for SSO. | |
| 31 | quay_username | Username for Quay. | |
| 32 | quay_password | Password for Quay. | |
| 33 | adminUsername | Email for the admin user. | e.g. john@katonic.ai |
| 34 | adminPassword | Password for the admin user. | At least 1 special character, 1 upper-case letter, 1 lower-case letter, and a minimum of 8 characters |
| 35 | adminFirstName | Admin first name. | e.g. john |
| 36 | adminLastName | Admin last name. | e.g. musk |
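
The katonic_domain_prefix rule (one word, lowercase letters only) is easy to get wrong with a hyphen or a capital letter. The sketch below shows one way to check a value before editing katonic.yml. valid_domain_prefix is a hypothetical helper, and it assumes that only the ASCII letters a to z are allowed, which the table implies but does not state outright.

```shell
# Sketch: accept a katonic_domain_prefix only if it is a non-empty word made
# of the ASCII letters a-z. valid_domain_prefix is a hypothetical helper.
# The body runs in a subshell so the LC_ALL change does not leak out.
valid_domain_prefix() (
  LC_ALL=C
  case "$1" in
    "" | *[!a-z]*) false ;;
    *) true ;;
  esac
)

for prefix in tesla Tesla my-team; do
  if valid_domain_prefix "$prefix"; then
    echo "$prefix: ok"
  else
    echo "$prefix: rejected"
  fi
done
```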

Installing the Katonic Platform MLOps version

docker run -it --rm --name install-katonic -v /root/.azure:/root/.azure -v $(pwd):/inventory quay.io/katonic/katonic-installer:v4.5

Installation Verification

The installation process can take up to 45 minutes to complete. The installer outputs verbose logs, prints the commands needed to get kubectl access to the deployed cluster, and surfaces any errors it encounters. After installation, use the following commands to check whether all applications are in a running state.

cd /root/katonic
az aks get-credentials --resource-group $(cat /root/katonic/katonic.yml | grep resource_group_name | awk '{print $2}') --name $(cat /root/katonic/katonic.yml | grep cluster_name | awk '{print $2}')
kubectl get pods --all-namespaces

This will show the status of all pods being created by the installation process. If you see any pods enter a crash loop or hang in a non-ready state, you can get logs from that pod by running:

kubectl logs $POD_NAME --namespace $NAMESPACE_NAME
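
On a cluster with many pods, the full listing is hard to scan for problems. The following sketch filters it down to pods that are crashing, pending, or running but not ready. unhealthy_pods is a hypothetical helper, and it assumes kubectl's default table output for --all-namespaces, where the columns are NAMESPACE, NAME, READY, and STATUS in that order.

```shell
# Sketch: read the output of "kubectl get pods --all-namespaces" on stdin and
# print NAMESPACE/NAME, READY, and STATUS for every pod that is not Completed
# and is either not Running or Running with fewer ready containers than
# expected. unhealthy_pods is a hypothetical helper.
unhealthy_pods() {
  awk 'NR > 1 {
    split($3, ready, "/")
    if ($4 != "Completed" && ($4 != "Running" || ready[1] != ready[2]))
      print $1 "/" $2, $3, $4
  }'
}

# Usage on a live cluster:
#   kubectl get pods --all-namespaces | unhealthy_pods
```

An empty result means every pod is either fully ready or has completed.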

If the installation completes successfully, you should see a message that says:

TASK [platform-deployment : Credentials to access Katonic MLOps Platform] *******************************
ok: [localhost] => {
"msg": [
"Platform Domain: $domain_name",
"Username: $adminUsername",
"Password: $adminPassword"
]
}

However, the application will only be accessible via HTTPS at that FQDN if you have configured DNS for the name to point to an ingress load balancer with the appropriate SSL certificate that forwards traffic to your platform nodes.

Test and troubleshoot

To verify the successful installation of Katonic, perform the following tests:

  • If you encounter a 500 or 502 error, connect to your cluster and run the following command:

    kubectl rollout restart deploy nodelog-deploy -n application
  • If you have any file manager-related issues:

    kubectl rollout restart sts minio
    kubectl rollout status sts minio
    kubectl rollout restart deploy minio-console
    kubectl rollout status deploy minio-console
  • Log in to the Katonic application and ensure that all the navigation panel options are operational. If this test fails, please verify that Keycloak was set up properly.

  • Create a new project and launch a Jupyter/JupyterLab workspace. If this test fails, please check that the default environment images have been loaded in the cluster.

  • Publish an app with Flask or Shiny. If this test fails, please verify that the environment images have Flask and Shiny installed.

Deleting the Katonic Platform from Azure

To delete the Katonic Platform from your Azure account, delete its resource group.
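
With the Azure CLI, deleting the resource group might look like the sketch below. delete_katonic_rg and the AZ override variable are hypothetical names for this example, and my-resource-group is a placeholder. Deleting a resource group removes every resource inside it, so check that the group holds nothing you still need. The sketch leaves out the --yes flag on purpose, so az asks for confirmation.

```shell
# Sketch: delete the resource group that holds the Katonic deployment.
# delete_katonic_rg and the AZ variable are hypothetical; az group delete
# prompts for confirmation because --yes is not passed.
delete_katonic_rg() {
  if [ -z "${1:-}" ]; then
    echo "usage: delete_katonic_rg <RESOURCE_GROUP>" >&2
    return 2
  fi
  "${AZ:-az}" group delete --name "$1"
}

# Preview the command without contacting Azure.
AZ=echo delete_katonic_rg my-resource-group
```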