Katonic MLOps
Node pool requirements
Katonic requires a minimum of three node pools: one to host the Katonic Platform, one to host Compute workloads, and one for storage. Additional optional pools can be added to provide specialized execution hardware for some Compute workloads.
Master pool requirements
Boot Disk: Min 128GB
Min Nodes: 1
Max Nodes: 3
Spec: 2 CPU / 8GB
Nodes must support the Advanced Vector Extensions (AVX) instruction set, because some platform services rely on these CPU capabilities: SSE4.2, AVX, AVX2, AVX-512
Platform pool requirements
Boot Disk: Min 128GB
Min Nodes: 2
Max Nodes: 3
Spec: 4 CPU / 16GB
Taints: katonic.ai/node-pool=platform:NoSchedule
Labels: katonic.ai/node-pool=platform
Nodes must support the Advanced Vector Extensions (AVX) instruction set, because some platform services rely on these CPU capabilities: SSE4.2, AVX, AVX2, AVX-512
Compute pool requirements
Boot Disk: Min 128GB
Recommended Min Nodes: 1
Max Nodes: Set as necessary to meet demand and resourcing needs
Recommended min spec: 8 CPU / 32GB
Labels: katonic.ai/node-pool=compute
Nodes must support the Advanced Vector Extensions (AVX) instruction set, because some platform services rely on these CPU capabilities: SSE4.2, AVX, AVX2, AVX-512
> **Note**: When **backup_enabled = True**, then **compute_nodes.min_count** should be set to **2**.
Deployment pool requirements
Boot Disk: Min 128GB
Recommended Min Nodes: 1
Max Nodes: Set as necessary to meet demand and resourcing needs
Recommended min spec: 8 CPU / 32GB
Taints: katonic.ai/node-pool=deployment:NoSchedule
Labels: katonic.ai/node-pool=deployment
Nodes must support the Advanced Vector Extensions (AVX) instruction set, because some platform services rely on these CPU capabilities: SSE4.2, AVX, AVX2, AVX-512
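As a minimal sketch, assuming nodes are labeled and tainted by hand with kubectl rather than by a node-pool provisioner, the helper below prints the commands for one node. The function name `pool_commands` and the node name `worker-1` are hypothetical; the label and taint keys are the ones listed in the pool requirements above.

```shell
# pool_commands NODE POOL: print the kubectl commands that would apply the
# pool label (and, for tainted pools, the taint). Nothing is executed here.
pool_commands() {
  node="$1"; pool="$2"
  echo "kubectl label nodes $node katonic.ai/node-pool=$pool --overwrite"
  case "$pool" in
    platform|deployment)
      # Only these two pools list a NoSchedule taint in the requirements.
      echo "kubectl taint nodes $node katonic.ai/node-pool=$pool:NoSchedule --overwrite"
      ;;
  esac
}

# Example (hypothetical node name): review the output first, then pipe it
# to sh to actually run the commands.
#   pool_commands worker-1 platform
#   pool_commands worker-1 platform | sh
```

Printing the commands instead of running them keeps the step reviewable before anything changes on the cluster.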
Optional GPU compute pool
Boot Disk: recommended 512GB
Recommended Min Nodes: 0
Max Nodes: Set as necessary to meet demand and resourcing needs
Recommended min Spec: 8 CPU / 16GB / One or more Nvidia GPU Device
Nodes must be pre-configured with the appropriate Nvidia driver and Nvidia-docker2, and the default Docker runtime must be set to Nvidia.
Taints: nvidia.com/gpu=gpu-{GPU-type}
Labels: katonic.ai/node-pool=gpu-{GPU-type}
Nodes must support the Advanced Vector Extensions (AVX) instruction set, because some platform services rely on these CPU capabilities: SSE4.2, AVX, AVX2, AVX-512
Note: {GPU-type} stands for the GPU model, for example v100, A30, or A100.
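The AVX requirement that applies to every pool can be checked on a Linux node before installation. The sketch below is an illustration only: the function name `has_required_flags` and the `REQUIRED_FLAGS` list are assumptions, and the flag names are the ones Linux reports in /proc/cpuinfo (`sse4_2`, `avx`, `avx2`; AVX-512 appears as `avx512f` and related flags). Adjust the list to the instruction sets your deployment actually needs.

```shell
# Assumed list of required flags; add avx512f if your workloads need AVX-512.
REQUIRED_FLAGS="sse4_2 avx avx2"

# has_required_flags FLAGS: succeed only if every required flag is present.
# FLAGS is a space-separated list, as found on the "flags" line of /proc/cpuinfo.
has_required_flags() {
  for required in $REQUIRED_FLAGS; do
    case " $1 " in
      *" $required "*) ;;                        # flag present, keep checking
      *) echo "missing: $required"; return 1 ;;  # report the first missing flag
    esac
  done
  echo "ok"
}

# Usage on a Linux node:
#   has_required_flags "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

Matching on space-delimited words avoids treating `avx2` as proof that `avx` is present.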
Katonic Platform Installation
Installation process
The Katonic platform runs on Kubernetes. To simplify the deployment and configuration of Katonic services, Katonic provides an install automation tool called the katonic-installer that will deploy Katonic into your compatible cluster. The katonic-installer is a Python application delivered in a Docker container, and can be run locally or as a job inside the target cluster.
Prerequisites
The install automation tools are delivered as a Docker image, and must run on an installation workstation that meets the following requirements:
Access to quay.io and credentials for an installation service account with access to the Katonic installer image and upstream image repositories. Throughout these instructions, these credentials will be referred to as $QUAY_USERNAME and $QUAY_PASSWORD. Contact your Katonic account team if you need new credentials.
The hosting cluster must have access to the following domains through the Internet to retrieve component and dependency images for online installation:
Alternatively, you can configure the katonic-installer to point to a private Docker registry and application registry for offline installation. Please reach out to your account manager if you would like an offline/private installation.
1. Create a new directory for the installation.
mkdir katonic
cd katonic
2. Custom certificates
The Katonic Platform is accessed over HTTPS. To secure it with custom certificates, you need to provide the two files listed below.
- PEM-encoded public key certificate (file name must end with the .crt extension).
- The private key associated with the given certificate (file name must end with the .key extension).
Put these files in the katonic directory.
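Before continuing, it can help to confirm that both certificate files are in place. This is a minimal sketch that only checks file names; the function name `check_cert_files` is hypothetical, and it does not validate the certificate contents or that the key matches the certificate.

```shell
# check_cert_files DIR: confirm DIR holds at least one .crt and one .key file.
check_cert_files() {
  dir="${1:-.}"
  status=0
  for ext in crt key; do
    # ls fails when the glob matches nothing, which is the case we report.
    if ! ls "$dir"/*."$ext" >/dev/null 2>&1; then
      echo "no .$ext file found in $dir"
      status=1
    fi
  done
  return $status
}

# Usage from inside the katonic directory:
#   check_cert_files .
```

If openssl is installed, running `openssl x509 -in <your-file>.crt -noout` is one way to check that the certificate is readable as PEM; the file name here is a placeholder.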
3. Pull the katonic-installer image
- Log in to quay.io with the credentials described in the Prerequisites section.
docker login quay.io
Find the image URI for the version of the katonic-installer you want to use from the release notes.
Pull the image to your local machine.
docker pull quay.io/katonic/katonic-installer:v4.4
4. Initialize
Initialize the installer application to generate a template configuration file named katonic.yml.
Note: This command must be run inside the katonic directory.
docker run -it --rm --name generating-yaml -v $(pwd):/install quay.io/katonic/katonic-installer:v4.4 init on-premise katonic_mlops kubernetes_already_exists public
Edit the configuration file with all necessary details about the target cluster, storage systems, and hosting domain. Read the configuration reference for more information about available keys, and consult the configuration examples for guidance on getting started.
PARAMETER | DESCRIPTION | VALUE |
---|---|---|
katonic_platform_version | Katonic Platform version; set by default. | katonic_mlops |
deploy_on | Environment on which Katonic MLOps is deployed. | On-Premise |
private_bucket_limit | Size of the private bucket. | eg. 10GB |
minio_storage | Set the value to amount of storage required in file manager /16 | eg. 20Gi |
workspace_timeout_interval | Workspace timeout interval, in hours. | eg. 1 |
backup_enabled | Enables backups (the on-premise Katonic installer supports only AWS S3 buckets). | True or False |
s3_bucket_name | Name of the S3 bucket | katonic-backup |
s3_bucket_region | Region of the S3 bucket | us-east-1 |
backup_schedule | Backup schedule (cron expression) | 0 0 1 * * |
backup_expiration | Expiration of the backup | 2160h0m0s |
use_custom_domain | Enables a custom domain name | True or False |
custom_domain_name | Custom domain name | eg. app.katonic.ai |
use_katonic_domain | Enables a Katonic domain name | True or False |
katonic_domain_prefix | Katonic domain name prefix | eg. tesla |
AD_Group_Management | Set "True" to enable sign-in using Azure AD | False |
AD_CLIENT_ID | Client ID of App registered for SSO in client's Azure or Identity Provider | |
AD_CLIENT_SECRET | Client Secret of App registered for SSO in client's Azure or any other Identity Provider | |
AD_AUTH_URL | Authorization URL endpoint of app registered for SSO. | |
AD_TOKEN_URL | Token URL endpoint of app registered for SSO. | |
quay_username | Username for quay | |
quay_password | Password for quay | |
adminUsername | Email for admin user | eg. john@katonic.ai |
adminPassword | Password for admin user | minimum 8 characters, with at least 1 special character, 1 upper case letter, and 1 lower case letter |
adminFirstName | Admin first name | eg. john |
adminLastName | Admin last name | eg. musk |
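The adminPassword rule in the table can be checked locally before the value goes into katonic.yml. The sketch below is not part of the installer: the function name `check_admin_password` is hypothetical, and it assumes that "special character" means any character that is not a letter or digit, which may be broader than what the platform accepts.

```shell
# check_admin_password PW: print "ok" or the first rule PW fails.
check_admin_password() {
  pw="$1"
  # Minimum length of 8 characters.
  [ "${#pw}" -ge 8 ] || { echo "too short"; return 1; }
  # At least one upper case letter.
  case "$pw" in *[[:upper:]]*) ;; *) echo "needs an upper case letter"; return 1 ;; esac
  # At least one lower case letter.
  case "$pw" in *[[:lower:]]*) ;; *) echo "needs a lower case letter"; return 1 ;; esac
  # At least one non-alphanumeric character (assumed meaning of "special").
  case "$pw" in *[![:alnum:]]*) ;; *) echo "needs a special character"; return 1 ;; esac
  echo "ok"
}

# Usage (single quotes keep the shell from interpreting special characters):
#   check_admin_password 'Passw0rd!'
```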
5. Installing Katonic MLOps Platform
docker run -it --rm --name install-katonic -v /root/.kube:/root/.kube -v $(pwd):/inventory quay.io/katonic/katonic-installer:v4.4
Installation Verification
The installation process can take up to 45 minutes to complete. The installer outputs verbose logs, including the commands needed to get kubectl access to the deployed cluster, and surfaces any errors it encounters. After installation, you can use the following command to check whether all applications are in a running state.
kubectl get pods --all-namespaces
This will show the status of all pods being created by the installation process. If you see any pods enter a crash loop or hang in a non-ready state, you can get logs from that pod by running:
kubectl logs $POD_NAME --namespace $NAMESPACE_NAME
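On a cluster with many pods, it can be easier to list only the ones that need attention. The filter below is a sketch, not part of the installer: the function name `unhealthy_pods` is hypothetical, and it assumes the default column layout of `kubectl get pods --all-namespaces --no-headers` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# unhealthy_pods: read `kubectl get pods --all-namespaces --no-headers` output
# on stdin and print "namespace/name status" for pods that need attention.
unhealthy_pods() {
  awk '{
    if ($4 == "Completed") next      # finished jobs are expected, skip them
    split($3, ready, "/")            # READY column, e.g. "1/2"
    # Flag pods that are not Running, or Running with containers not ready.
    if ($4 != "Running" || ready[1] != ready[2])
      print $1 "/" $2, $4
  }'
}

# Usage against a live cluster:
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

Each reported namespace and pod name can then be passed to the kubectl logs command shown above.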
If the installation completes successfully, you should see a message that says:
TASK [platform-deployment : Credentials to access Katonic MLOps Platform] *******************************
ok: [localhost] => {
"msg": [
"Platform Domain: $domain_name",
"Username: $adminUsername",
"Password: $adminPassword"
]
}
However, the application will only be accessible via HTTPS at that FQDN if you have configured DNS for the name to point to an istio ingress load balancer with the appropriate SSL certificate that forwards traffic to your platform nodes.
Post Installation Steps
Domain
You can identify a domain for your cluster. This allows you to use any domain as the location for the cluster. For example, you could set the domain for the cluster as katonic.company.com.
For this option to work, you will need to set the required DNS routing rules between the domain and the IP address of the cluster after the katonic-installer has finished running.
You will need to create a CNAME/A record for *.<your_domain> with the IP address of the auto scaler for the cluster. Make sure you include the wildcard.
The domain is the same domain you entered as <your_domain> in the katonic-installer.
To get the IP address of the cluster, run the following command after the platform has been deployed:
kubectl get svc istio-ingressgateway -n istio-system | awk '{print $4}' | tail -n +2
Test and troubleshoot
To verify the successful installation of Katonic, perform the following tests:
If you encounter a 500 or 502 error, connect to your cluster and execute the following command:
kubectl rollout restart deploy nodelog-deploy -n application
If you have any file manager-related issues:
kubectl rollout restart sts minio
kubectl rollout status sts minio
kubectl rollout restart deploy minio-console
kubectl rollout status deploy minio-console
Log in to the Katonic application and ensure that all the navigation panel options are operational. If this test fails, please verify that Keycloak was set up properly.
Create a new project and launch a Jupyter/JupyterLab workspace. If this test fails, please check that the default environment images have been loaded in the cluster.
Publish an app with Flask or Shiny. If this test fails, please verify that the environment images have Flask and Shiny installed.