Version: 4.5

Katonic MLOps

Node pool requirements

The GKE cluster must have at least three node pools with the following specifications and distinct node labels:

| SR NO. | POOL | MIN-MAX | VM | LABELS | TAINTS |
|---|---|---|---|---|---|
| 1 | Platform | 1-4 (With HA) 2-4 (Without HA) | c2-standard-4 | katonic.ai/node-pool=platform | katonic.ai/node-pool=platform:NoSchedule |
| 2 | Compute | 1-10 | c2-standard-8 | katonic.ai/node-pool=compute | |
| 3 | Deployment | 1-10 | c2-standard-8 | katonic.ai/node-pool=deployment | katonic.ai/node-pool=deployment:NoSchedule |
| 4 | GPU (Optional) | 0-5 | Required VM type | katonic.ai/node-pool=gpu-{GPU-type} | nvidia.com/gpu=present:NoSchedule |

Note: The GPU type can be, for example, v100, A30, or A100.

Note: When backup_enabled = True, then compute_nodes.min_count should be set to 2.
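
When the node pools are added to an existing GKE cluster by hand, the labels and taints from the table above can be set at pool creation time. The following is a minimal sketch, not the installer's own logic: the cluster name, region, and the helper `node_pool_cmd` are placeholder assumptions, and the function only prints the `gcloud` command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch: print a `gcloud container node-pools create` command for one pool.
# CLUSTER_NAME and GCP_REGION are placeholders (assumptions), not values
# mandated by Katonic.
CLUSTER_NAME="katonic-mlops-platform-v4-5"
GCP_REGION="us-east1"

# usage: node_pool_cmd POOL MACHINE_TYPE MIN MAX DISK_GB [TAINT]
node_pool_cmd() {
  pool=$1; machine=$2; min=$3; max=$4; disk=$5; taint=${6:-}
  cmd="gcloud container node-pools create $pool"
  cmd="$cmd --cluster=$CLUSTER_NAME --region=$GCP_REGION"
  cmd="$cmd --machine-type=$machine --disk-size=$disk"
  cmd="$cmd --enable-autoscaling --min-nodes=$min --max-nodes=$max"
  cmd="$cmd --node-labels=katonic.ai/node-pool=$pool"
  # The compute pool has no taint in the table, so the flag is optional.
  if [ -n "$taint" ]; then
    cmd="$cmd --node-taints=$taint"
  fi
  printf '%s\n' "$cmd"
}

node_pool_cmd platform c2-standard-4 2 4 128 "katonic.ai/node-pool=platform:NoSchedule"
node_pool_cmd compute c2-standard-8 1 10 128
node_pool_cmd deployment c2-standard-8 1 10 128 "katonic.ai/node-pool=deployment:NoSchedule"
```

On regional clusters, gcloud applies `--min-nodes` and `--max-nodes` per zone, so the counts may need adjusting to stay within the ranges in the table.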

GCP Platform-Node Specifications

Platform nodes in GCP cloud deployments must meet the following hardware specifications:

| Component | Specification |
|---|---|
| Node count | min 2 |
| Instance type | c2-standard-4 |
| vCPUs | 4 |
| Memory | 16 GB |
| Boot disk size | 128 GB |

GCP Compute-Node Specifications

Compute nodes in GCP cloud installations of the Katonic platform must use one of the instance types listed below.

Choose the type that best fits your requirements. GCP GKE (Google Kubernetes Engine) is also supported for application nodes, using the instance types listed below. For specification details for each type, refer to the GCP documentation.

Note: Supported compute node configurations

  • c2-standard-8
  • c2-standard-16
  • c2-standard-32
  • Boot Disk: Min 128GB

GCP Deployment-Node Specifications

Deployment nodes in GCP cloud installations of the Katonic platform must use one of the instance types listed below.

Choose the type that best fits your requirements. GCP GKE (Google Kubernetes Engine) is also supported for application nodes, using the instance types listed below. The Katonic platform requires at least 1 deployment node for the community version. For specification details for each type, refer to the GCP documentation.

Note: Supported deployment node configurations

  • c2-standard-8
  • c2-standard-16
  • c2-standard-32
  • Boot Disk: Min 128GB

GCP GPU-Node Specifications

The GPU node pool is currently supported by Katonic-installer version 4.5.

Choose the instance type that best fits your requirements. Google Kubernetes Engine (GKE) is also supported for application nodes, using the instance types provided by Google Cloud. For specification details for each type, refer to the GCP documentation.

Note: Supported GPU node configurations

  • Boot disk size = Min 512GB
  • Label = katonic.ai/node-pool=gpu-{gpu-type}
  • Taints = nvidia.com/gpu=present:NoSchedule

Note: The GPU type can be, for example, v100, A30, or A100.

Additional node pools with distinct katonic.ai/node-pool labels can be added to make other instance types available for Katonic executions.

Katonic Platform Installation

General completion time: 45 minutes

Installation process

The Katonic platform runs on Kubernetes. To simplify the deployment and configuration of Katonic services, Katonic provides an install automation tool called the Katonic-installer, which deploys Katonic into your compatible cluster. The Katonic-installer is an Ansible role delivered in a Docker container and can be run locally.

Prerequisites

To install and configure Katonic in your GCP account you must have:

  • quay.io credentials from Katonic.

  • A GCP project with enough quota to create:

    • At least 2 c2-standard-4 machines for platform nodes and at least 1 c2-standard type machine for compute nodes
  • A Linux (Ubuntu/Debian) machine, prepared with the following steps:

    a. The machine needs 4 GB of RAM, 2 vCPUs, and a 50 GB boot disk.

    b. While creating the VM, select the service account (Katonic) in the Identity and API access section. Skip to step c if you already have a machine with the given specifications.

    Note: After the platform is deployed successfully, the VM can be deleted.

    c. Switch to the root user inside the machine.

    d. The gcloud CLI must be installed and logged in to your GCP project and service account using the gcloud init command.

    Commands for installing gcloud CLI:

    apt-get install snapd -y
    snap install google-cloud-cli --classic

    Commands to login using gcloud CLI:

    gcloud init
    gcloud auth application-default login
  • The following tools should be installed:

To install the Katonic Platform MLOps version, follow the steps below:

1. Log in to Quay with the credentials described in the prerequisites section above.

docker login quay.io

2. Retrieve the Katonic installer image from Quay.

docker pull quay.io/katonic/katonic-installer:v4.5

3. Create a directory.

mkdir katonic
cd katonic

4. Add the PEM-encoded public key certificate and private key to the directory.

Put the PEM encoded public key certificate (having extension .crt) for your domain and private key associated with the given certificate (having extension .key) inside the current directory (katonic).
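
Before continuing, it can help to confirm that the certificate and key actually belong together. The sketch below is an optional check that is not part of the Katonic-installer: it compares the public key embedded in the certificate with the one derived from the private key, using standard `openssl` subcommands. The function name `cert_matches_key` and the file names `example.crt` and `example.key` are placeholders, and the key is assumed to be unencrypted.

```shell
#!/bin/sh
# Sketch: confirm that a PEM certificate and private key belong together by
# comparing the public keys extracted from each.
# usage: cert_matches_key CERT_FILE KEY_FILE
cert_matches_key() {
  cert=$1; key=$2
  # Both files must exist and be readable before openssl is invoked.
  [ -r "$cert" ] && [ -r "$key" ] || return 1
  cert_pub=$(openssl x509 -in "$cert" -noout -pubkey 2>/dev/null) || return 1
  key_pub=$(openssl pkey -in "$key" -pubout 2>/dev/null) || return 1
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}

# example.crt and example.key are placeholder file names.
if cert_matches_key example.crt example.key; then
  echo "certificate and key match"
else
  echo "certificate and key are missing or do not match"
fi
```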

5. The Katonic Installer can deploy the Katonic Platform MLOps version in two ways:

  1. Creating GKE and deploying the Katonic Platform MLOps version.
  2. Install Katonic Platform MLOps version on existing GKE.

1. Creating GKE and deploying the Katonic MLOps Platform

Initialize the installer application to generate a template configuration file named katonic.yml.

docker run -it --rm --name generating-yaml -v $(pwd):/install quay.io/katonic/katonic-installer:v4.5 init gcp katonic_mlops deploy_kubernetes private 

Edit the configuration file with all necessary details about the target cluster, storage systems, and hosting domain. Refer to the following configuration reference:

| PARAMETER | DESCRIPTION | VALUE |
|---|---|---|
| katonic_platform_version | Katonic platform version; set by default. | katonic_mlops |
| deploy_on | Cluster to be deployed on | GCP |
| create_k8s_cluster | Must be set to True | True |
| private_cluster | Set "True" when opting for a private cluster | False |
| enable_exposing_genai_applications_to_internet | Set "True" if opting to expose genai applications to the internet | False |
| public_domain_for_genai_applications | Public FQDN of the domain for genai applications that will be exposed to the internet | eg. public-chatbots.google.com |
| control_plane_authorized_networks | List of allowed IP ranges (CIDR) for control plane access | |
| internal_loadbalancer | Set "True" when opting for an internal load balancer | False |
| gke_k8s_version | GKE version | eg. 1.28.8-gke.1095000 (1.27 and above versions supported) |
| cluster_name | Name of the cluster to be created | eg. katonic-mlops-platform-v4-5 |
| gcp_region | GCP region name | eg. us-east1 |
| gcp_project_id | Set your GCP project ID | eg. ardent-timm-1000678 |
| service_account_id | Set the email ID of the created service account | eg. katonic-main@ardent-timm-1000678.iam.gserviceaccount.com |
| vpc_name | Enter the name of the VPC created for the private cluster | |
| subnet_name | Enter the name of the subnet created for the private cluster | |
| high_availability | | True or False |
| zone_1 | | eg. us-east1-b |
| zone_2 | | eg. us-east1-c |
| platform_nodes.instance_type | Platform node VM size | eg. c2-standard-4 |
| platform_nodes.min_count | Minimum number of platform nodes; should be 2 | eg. 2 |
| platform_nodes.max_count | Maximum number of platform nodes; should be greater than the platform nodes min count | eg. 3 |
| platform_nodes.os_disk_size | Platform nodes OS disk size | eg. 128 GB |
| compute_nodes.instance_type | Compute node VM size | eg. c2-standard-8 |
| compute_nodes.min_count | Minimum number of compute nodes; should not be less than 1 | eg. 1 |
| compute_nodes.max_count | Maximum number of compute nodes; should be greater than the compute nodes min count | eg. 3 |
| compute_nodes.os_disk_size | Compute nodes OS disk size | eg. 128 GB |
| deployment_nodes.instance_type | Deployment node VM size | eg. c2-standard-8 |
| deployment_nodes.min_count | Minimum number of deployment nodes; should be 1 | eg. 1 |
| deployment_nodes.max_count | Maximum number of deployment nodes; should be greater than the deployment nodes min count | eg. 4 |
| deployment_nodes.os_disk_size | Deployment nodes OS disk size | eg. 128 GB |
| gpu_enabled | Add a GPU node pool | True or False |
| gpu_nodes.instance_type | GPU node VM size | eg. n1-standard-1 |
| gpu_machine_type | Type of machine you need | eg. nvidia-tesla-p4 |
| gpu_nodes.gpu_type | Enter the GPU type available on the machine | eg. v100, k80 |
| gpu_nodes.gpu_count | | eg. 2 |
| gpu_nodes.min_count | Minimum number of GPU nodes | eg. 1 |
| gpu_nodes.max_count | Maximum number of GPU nodes | eg. 2 |
| gpu_nodes.disk_size | Enter the GPU nodes OS disk size | eg. 512 GB |
| gpu_nodes.gpu_vRAM | Enter the GPU node RAM size | |
| gpu_nodes.gpus_per_node | Enter the number of GPUs per node | |
| enable_gpu_workspace | Set to True if you want to use GPU Workspace | True or False |
| shared_storage_create | Create the Filestore storage class (kfs-shared) | True or False |
| private_bucket_limit | Set the private bucket size. | eg. 10GB |
| minio_storage | Set the value to amount of storage required in file manager /16 | eg. 20Gi |
| workspace_timeout_interval | Set the timeout interval in hours | eg. 1 |
| backup_enabled | Enables the backup | True or False |
| backup_schedule | Schedule of the backup | 0 0 1 * * |
| backup_expiration | Expiration of the backup | 2160h0m0s |
| use_custom_domain | Set this to True if you want to host the Katonic platform on your custom domain. Skip if use_katonic_domain: True | True or False |
| custom_domain_name | Expects a valid domain. | eg. katonic.tesla.com |
| use_katonic_domain | Set this to True if you want to host the Katonic platform on the Katonic MLOps Platform domain. Skip if use_custom_domain: True | True or False |
| katonic_domain_prefix | One word expected, with no special characters and all lowercase letters | eg. tesla |
| enable_pre_checks | Set this to True if you want to perform the pre-checks | True / False |
| enable_accelerator | Set "True" to enable accelerators | False |
| enable_playground | Set "True" to enable playground | False |
| AD_Group_Management | Set "True" to enable signing in using Azure AD | False |
| AD_CLIENT_ID | Client ID of the app registered for SSO in the client's Azure or Identity Provider | |
| AD_CLIENT_SECRET | Client Secret of the app registered for SSO in the client's Azure or any other Identity Provider | |
| AD_AUTH_URL | Authorization URL endpoint of the app registered for SSO. | |
| AD_TOKEN_URL | Token URL endpoint of the app registered for SSO. | |
| quay_username | Username for quay | |
| quay_password | Password for quay | |
| adminUsername | Email for the admin user | eg. john@katonic.ai |
| adminPassword | Password for the admin user | at least 1 special character, at least 1 upper case letter, at least 1 lower case letter, minimum 8 characters |
| adminFirstName | Admin first name | eg. john |
| adminLastName | Admin last name | eg. musk |
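
The node-count rules in the reference (each max_count greater than its min_count, platform min_count of 2, and compute min_count of 2 when backup_enabled is True) can be checked before the installer is run. The sketch below is a hypothetical helper, not part of the Katonic-installer. It takes the values as arguments instead of parsing katonic.yml, because the exact layout of that file is not shown here.

```shell
#!/bin/sh
# Sketch: check node-count rules from the configuration reference.
# usage: check_counts POOL MIN MAX REQUIRED_MIN
# Prints a message and returns 1 when a rule is violated.
check_counts() {
  pool=$1; min=$2; max=$3; required=$4
  if [ "$min" -lt "$required" ]; then
    echo "$pool: min_count $min is below the required minimum $required"
    return 1
  fi
  if [ "$max" -le "$min" ]; then
    echo "$pool: max_count $max must be greater than min_count $min"
    return 1
  fi
  echo "$pool: ok"
}

backup_enabled=True
# With backups enabled, the compute pool needs at least 2 nodes.
if [ "$backup_enabled" = "True" ]; then
  compute_required=2
else
  compute_required=1
fi

check_counts platform_nodes 2 3 2
check_counts compute_nodes 2 3 "$compute_required"
check_counts deployment_nodes 1 4 1
```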

Installing the Katonic Platform MLOps version

docker run -it --rm --name install-katonic -v /root/.config:/root/.config -v $(pwd):/inventory quay.io/katonic/katonic-installer:v4.5

2. Deploying Katonic Platform MLOps version on existing GKE

The steps are similar to those for creating GKE and deploying the Katonic MLOps Platform. Edit the configuration file with all the details about the target cluster, storage systems, and hosting domain. The configuration reference in this section lists the only parameters required when installing the Katonic MLOps platform on an existing GKE cluster.

Prerequisites

You will need to create a storage class named kfs. Please refer to the main GCP documentation → Dynamic Block Storage for instructions on how to create the storage class.

Initialize the installer application to generate a template configuration file named katonic.yml.

docker run -it --rm --name generating-yaml -v $(pwd):/install quay.io/katonic/katonic-installer:v4.5 init gcp katonic_mlops kubernetes_already_exists private

| Parameter | Description | Value |
|---|---|---|
| katonic_platform_version | Katonic platform version; set by default. | katonic_mlops |
| deploy_on | Cluster to be deployed on | GCP |
| cluster_name | Enter the name of the cluster you deployed | eg. katonic-mlops-platform-v4-5 |
| private_cluster | Set "True" when opting for a private cluster | False |
| enable_exposing_genai_applications_to_internet | Set "True" if opting to expose genai applications to the internet | False |
| public_domain_for_genai_applications | Public FQDN of the domain for genai applications that will be exposed to the internet | eg. public-chatbots.google.com |
| control_plane_authorized_networks | List of allowed IP ranges (CIDR) for control plane access. | |
| internal_loadbalancer | Set "True" when opting for an internal load balancer | False |
| gcp_region | GCP region name | eg. us-east1 |
| gcp_project_id | Set your GCP project ID | eg. ardent-timm-1000678 |
| vpc_name | Enter the name of the VPC created for the private cluster | |
| subnet_name | Enter the name of the subnet created for the private cluster | |
| private_bucket_limit | Set the private bucket size. | eg. 10GB |
| minio_storage | Set the value to amount of storage required in file manager /16 | eg. 20Gi |
| workspace_timeout_interval | Set the timeout interval in hours | eg. 1 |
| backup_enabled | Enables the backup | True or False |
| backup_schedule | Schedule of the backup | 0 0 1 * * |
| backup_expiration | Expiration of the backup | 2160h0m0s |
| use_custom_domain | Set this to True if you want to host the Katonic platform on your custom domain. Skip if use_katonic_domain: True | True or False |
| custom_domain_name | Expects a valid domain. | eg. katonic.tesla.com |
| use_katonic_domain | Set this to True if you want to host the Katonic platform on the Katonic MLOps Platform domain. Skip if use_custom_domain: True | True or False |
| katonic_domain_prefix | One word expected, with no special characters and all lowercase letters | eg. tesla |
| enable_pre_checks | Set this to True if you want to perform the pre-checks | True / False |
| enable_accelerator | Set "True" to enable accelerators | False |
| enable_playground | Set "True" to enable playground | False |
| AD_Group_Management | Set "True" to enable signing in using Azure AD | False |
| AD_CLIENT_ID | Client ID of the app registered for SSO in the client's Azure or Identity Provider | |
| AD_CLIENT_SECRET | Client Secret of the app registered for SSO in the client's Azure or any other Identity Provider | |
| AD_AUTH_URL | Authorization URL endpoint of the app registered for SSO. | |
| AD_TOKEN_URL | Token URL endpoint of the app registered for SSO. | |
| quay_username | Username for quay | |
| quay_password | Password for quay | |
| adminUsername | Email for the admin user | eg. john@katonic.ai |
| adminPassword | Password for the admin user | at least 1 special character, at least 1 upper case letter, at least 1 lower case letter, minimum 8 characters |
| adminFirstName | Admin first name | eg. john |
| adminLastName | Admin last name | eg. musk |
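
The adminPassword rules in the reference can be checked locally before the value is written into katonic.yml. This is a minimal sketch under one stated assumption: "special character" is taken to mean any character that is not a letter or a digit, which may be broader or narrower than what the installer accepts. The function name `valid_admin_password` and the sample passwords are illustrative only.

```shell
#!/bin/sh
# Sketch: check a candidate adminPassword against the listed rules
# (8+ characters, one upper-case, one lower-case, one special character).
# Assumption: "special" means any non-alphanumeric character.
# usage: valid_admin_password PASSWORD
valid_admin_password() {
  pw=$1
  [ "${#pw}" -ge 8 ] || return 1
  case $pw in *[[:upper:]]*) ;; *) return 1 ;; esac
  case $pw in *[[:lower:]]*) ;; *) return 1 ;; esac
  case $pw in *[![:alnum:]]*) ;; *) return 1 ;; esac
  return 0
}

# Sample values for illustration only; do not reuse them.
for candidate in 'Katonic@2024' 'weakpass'; do
  if valid_admin_password "$candidate"; then
    echo "$candidate: accepted"
  else
    echo "$candidate: rejected"
  fi
done
```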

Installing Katonic Platform MLOps version

docker run -it --rm --name install-katonic -v /root/.config:/root/.config -v $(pwd):/inventory quay.io/katonic/katonic-installer:v4.5

Installation Verification

The installation process can take up to one hour to complete fully. The installer outputs verbose logs, including the commands needed to get kubectl access to the deployed cluster, and surfaces any errors it encounters. After installation, you can use the following command to check whether all applications are running.

kubectl get pods --all-namespaces

This will show the status of all pods being created by the installation process. If you see any pods enter a crash loop or hang in a non-ready state, you can get logs from that pod by running:

kubectl logs $POD_NAME --namespace $NAMESPACE_NAME
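
On a cluster with many pods, it can be easier to filter the listing down to the pods that need attention. The sketch below is one way to do that, assuming the default column layout of `kubectl get pods --all-namespaces --no-headers` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). The helper name `unhealthy_pods` and the sample rows are made up for illustration.

```shell
#!/bin/sh
# Sketch: list pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin and
# prints "namespace/pod status" for each pod that needs a closer look.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Example with captured sample output (namespaces and pod names are made up):
unhealthy_pods <<'EOF'
application   nodelog-deploy-7c9d   1/1   Running            0     5m
application   minio-0               0/1   CrashLoopBackOff   4     5m
kube-system   init-job-abc12        0/1   Completed          0     9m
EOF
```

Against a live cluster, the same filter can be fed with `kubectl get pods --all-namespaces --no-headers | unhealthy_pods`. A pod can report Running while its containers are not yet ready, so the READY column is still worth checking by hand.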

If the installation completes successfully, you should see a message that says:

TASK [platform-deployment : Credentials to access Katonic MLOps Platform] *******************************
ok: [localhost] => {
    "msg": [
        "Platform Domain: $domain_name",
        "Username: $adminUsername",
        "Password: $adminPassword"
    ]
}

However, the application will only be accessible via HTTPS at that FQDN if you have configured DNS for the name to point to an ingress load balancer with the appropriate SSL certificate that forwards traffic to your platform nodes.

Test and troubleshoot

To verify the successful installation of Katonic, perform the following tests:

  • If you encounter a 500 or 502 error, connect to your cluster and execute the following command:

    kubectl rollout restart deploy nodelog-deploy -n application
  • If you have any file manager-related issues:

    kubectl rollout restart sts minio
    kubectl rollout status sts minio
    kubectl rollout restart deploy minio-console
    kubectl rollout status deploy minio-console
  • Log in to the Katonic application and ensure that all the navigation panel options are operational. If this test fails, please verify that Keycloak was set up properly.

  • Create a new project and launch a Jupyter/JupyterLab workspace. If this test fails, please check that the default environment images have been loaded in the cluster.

  • Publish an app with Flask or Shiny. If this test fails, please verify that the environment images have Flask and Shiny installed.

Deleting the Katonic platform from GCP

When you start the installation, a platform deletion script is created in your current directory. To delete the platform, run the script:

./gcp-cluster-delete.sh