nix-ml-ops is a collection of flake parts for setting up a machine learning development environment and deploying machine learning jobs and services onto cloud platforms.
See the options documentation for all available options.
Here is an example of the ml-ops config. In the example, `"${"key"}"` denotes a name picked by the user, while `"key"` denotes a union, i.e. one of several available choices. These notations are used for documentation purposes only; in the Nix language both are identical to a plain `key`.
```nix
{
  ml-ops = {
    # Common volumes shared between devcontainer and jobs
    common.volumeMounts."nfs"."${"/mnt/ml-data/"}" = {
      server = "my-server.example.com";
      path = "/ml_data";
    };

    # Common environment variables shared between devcontainer and jobs
    common.environmentVariables = { };

    # Environment variables in addition to ml-ops.common.environmentVariables
    devcontainer.environmentVariables = { };
    devcontainer.volumeMounts = {
      # Volumes in addition to ml-ops.common.volumeMounts
      "emptyDir"."${"/mnt/my-temporary-data/"}" = {
        medium = "Memory";
      };
    };

    # TODO: Support elastic
    # jobs."${"training"}".resources."elastic"
    # This is the configuration for single-node training or static distributed
    # training, not for elastic distributed training
    jobs."${"training"}".resources."static".accelerators."A100" = 16;
    jobs."${"training"}".resources."static".cpus = 32;
    jobs."${"training"}".run = ''
      torchrun ...
    '';

    # Environment variables in addition to ml-ops.common.environmentVariables
    jobs."${"training"}".environmentVariables = {
      HF_DATASETS_IN_MEMORY_MAX_SIZE = "200000000";
    };

    # Volumes in addition to ml-ops.common.volumeMounts
    jobs."${"training"}".volumeMounts = { };

    jobs."${"training"}".launchers."${"my-aks-launcher"}"."kubernetes".imageRegistry.host = "us-central1-docker.pkg.dev/ml-solutions-371721/training-images";
    jobs."${"training"}".launchers."${"my-aks-launcher"}"."kubernetes".namespace = "default";
    jobs."${"training"}".launchers."${"my-aks-launcher"}"."kubernetes".aks = {
      cluster = "mykubernetescluster";
      resourceGroup = "myresourcegroup";
      registryName = "mycontainerregistry";
    };
    # TODO: Other types of launcher
    # jobs."${"training"}".launchers."${"my-aws-ec2-launcher"}"."skypilot" = { ... };

    # Extra package available in both the runtime and development environments:
    pythonEnvs."pep508".common.extraPackages."${"peft"}"."buildPythonPackage".src = peft-src;

    # Extra packages available in the development environment only:
    pythonEnvs."pep508".development.extraPackages = { };

    # TODO: Support poetry projects:
    # pythonEnvs."poetry" = { ... };
  };
}
```
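The `ml-ops` options above live inside a flake module. As a rough sketch of how such a config might be wired into a `flake.nix` via flake-parts (the module attribute `flakeModules.default` and the exact wiring are assumptions, not confirmed by this README):

```nix
{
  inputs.nix-ml-ops.url = "github:Preemo-Inc/nix-ml-ops";

  outputs = inputs@{ nix-ml-ops, ... }:
    # Assumes nix-ml-ops exposes a flake-parts input and a flake module;
    # check the real attribute names with `nix flake show`.
    nix-ml-ops.inputs.flake-parts.lib.mkFlake { inherit inputs; } {
      imports = [ nix-ml-ops.flakeModules.default ]; # hypothetical attribute
      systems = [ "x86_64-linux" ];
      ml-ops = {
        # ... the configuration shown above ...
      };
    };
}
```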
Then, run the following command to start the job:

```
nix build .#training-my-aks-launcher-helmUpgrade
```

The command will internally do the following things:
- Build an image including a Python script with the environment of dependencies specified in `requirements.txt`.
- Push the image to the Azure Container Registry `mycontainerregistry.azurecr.io`.
- Create a Helm chart including a job to run the image.
- Upgrade the Helm chart on the AKS cluster `mykubernetescluster` in the resource group `myresourcegroup`.
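After the Helm upgrade, the job runs as an ordinary Kubernetes workload, so it can be inspected with standard `kubectl` commands. For example, assuming the `default` namespace configured above (the pod name placeholder is hypothetical):

```
# List the pods created by the job
kubectl --namespace default get pods
# Follow the training logs of a specific pod
kubectl --namespace default logs -f <training-pod-name>
```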
This repository also includes packages to build VM images for a NixOS-based devserver.
```
nix build .#devserver-gce
nix build .#devserver-amazon
# Azure Image Generation 1
nix build .#devserver-azure
# Azure Image Generation 2
nix build .#devserver-hyperv
```

Note that KVM must be enabled on the devserver. See this document for enabling KVM on GCE.
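For reference, GCE exposes KVM (nested virtualization) through a special license attached to the VM image; one documented approach is to create the image with the VMX license (the image, disk, and zone names below are placeholders):

```
# Create an image with nested virtualization enabled (placeholder names)
gcloud compute images create my-nested-vm-image \
  --source-disk=my-disk --source-disk-zone=us-central1-a \
  --licenses="https://www.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx"
```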
The following steps are also required on Debian to install the KVM kernel modules:

```
sudo apt-get install qemu-kvm
sudo tee -a /etc/nix/nix.conf <<EOF
extra-experimental-features = nix-command flakes
extra-system-features = kvm
EOF
```

Then upload the image:

```
nix run .#upload-devserver-gce-image
```

Note that in order to upload the built image, the `nix run` command must be executed in a GCP VM whose service account has permission to upload images, or after a successful `gcloud auth login`.
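If the upload succeeds, the image should appear in the project's image list; for example (the name filter below is an assumption about how the image is named):

```
gcloud compute images list --filter="name~devserver"
```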
```
nix run .#upload-devserver-azure-image
nix run .#upload-devserver-azure-hyperv
```

Note that in order to upload the built image, the `nix run` command must be executed in an Azure VM whose identity has permission to upload images, or after a successful `az login`.
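Similarly, an uploaded Azure image can be checked with the Azure CLI, assuming it lands as a managed image (the resource group name is reused from the example above and may differ in your setup):

```
az image list --resource-group myresourcegroup --output table
```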
If you have already checked out this repository, run the following commands in the work tree.

For a VM on GCE:

```
sudo nixos-rebuild switch --flake .#devserverGce
```

For an Azure VM:

```
sudo nixos-rebuild switch --flake .#devserverAzure
```

Or, from an arbitrary path, run:

```
sudo nixos-rebuild switch --flake github:Preemo-Inc/nix-ml-ops#devserverGce
```

or

```
sudo nixos-rebuild switch --flake github:Preemo-Inc/nix-ml-ops#devserverAzure
```
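To see which devserver configurations and other outputs the flake provides, you can list them with:

```
nix flake show github:Preemo-Inc/nix-ml-ops
```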