From Notebooks to Nodes: Scaling Machine Learning on K8s
Machine learning (ML) often starts as something simple—perhaps a single Jupyter notebook on a local machine. In the early stages, this might be enough to train small models or experiment with data. However, as your workload grows, so does the need for computational power, distributed training, version control, reproducibility, and the ability to quickly iterate and deploy models. This is where Kubernetes (K8s) steps in, providing a powerful platform to manage growth effortlessly.
In this blog post, we will explore how to move from the simplicity of local notebooks to production-grade deployments at scale on Kubernetes. Whether you’re just starting out in your ML journey or looking to leverage the advanced features of K8s for production workloads, this comprehensive guide will help you navigate the key concepts, tools, and best practices.
Table of Contents
- Introduction to Containers and Kubernetes
- Why Run Machine Learning on Kubernetes?
- Fundamental Kubernetes Concepts for ML
- From Notebook to Docker Container
- Deploying Your ML Service on Kubernetes
- Managing Data and Storage
- Scaling Strategies
- Basic Example: Deploying a Simple Model on K8s
- Intermediate Concepts: CI/CD and Reproducibility
- Advanced Concepts: GPUs, Hyperparameter Tuning, and Auto-Scaling
- Orchestration Frameworks: Kubeflow and Beyond
- Best Practices and Future Trends
- Conclusion
Introduction to Containers and Kubernetes
Before diving into the mechanics of running machine learning workloads on Kubernetes, let’s do a brief recap of the foundational technologies.
Containers: The Building Blocks
A container is a packaged unit of software that bundles together code, libraries, environments, and all dependencies into a single artifact. Containers ensure that your application runs the same way in every environment.
Key benefits for ML:
- Consistency: Your model runs reliably across different development, testing, and production environments.
- Portability: Containers can run on any machine that has a container runtime (such as Docker).
- Isolation: Each container is isolated from the rest of the system, making dependency management easier.
Kubernetes: The Container Orchestrator
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Once you have your machine learning model packaged in a container, Kubernetes can help you:
- Distribute replicas of your container across multiple nodes.
- Handle failover and high availability.
- Automatically scale based on resource usage.
- Provide rolling updates to avoid downtime.
Kubernetes is especially important for machine learning workflows that require distributed training or large-scale inference.
Why Run Machine Learning on Kubernetes?
Data scientists and ML engineers often use frameworks like scikit-learn, TensorFlow, or PyTorch on a single machine or a specialized cluster. While that works for development and experimentation, scaling up poses significant hurdles:
- Resource Allocation: Different models require different resources. K8s can dynamically allocate resources (CPU, GPU, memory) to containers.
- Rapid Experimentation: Launching and tearing down environments is simpler when using container deployments, enabling faster iterations.
- Autoscaling: You can automatically scale your application—or model inference service—up or down based on metrics.
- Standardization: With K8s, your architecture becomes more standardized, reducing complexities in multi-tenant or multi-team scenarios.
In short, Kubernetes provides the orchestration needed to handle both small and large-scale requirements for modern machine learning.
Fundamental Kubernetes Concepts for ML
Before getting started, let’s explore the elementary concepts in Kubernetes that matter most to machine learning workloads.
Kubernetes Object | Description | Typical Use in ML |
---|---|---|
Pod | The smallest deployable unit in K8s. A Pod can contain one or more containers. | Hosting a training job, inference microservice, or data preprocessing step. |
Deployment | Manages stateless Pods. It can scale and roll out updates. | Hosting model services (inference APIs) in a resilient, autoscaling way. |
Service | Exposes Pods to the network or other Pods inside the cluster. | Providing stable endpoints for inference requests. |
StatefulSet | Manages stateful Pods, maintaining a sticky identity for each Pod. | Running distributed frameworks like Spark, or advanced ML pipelines needing state. |
Job/CronJob | Creates Pods to run a batch job to completion (once or on a schedule). | Model training or data processing tasks that must be completed. |
Volume (PVC) | Provides persistent storage for Pods through Persistent Volume Claims (PVCs). | Storing training data, model weights, or logs that must persist. |
Understanding these concepts helps prepare you for more sophisticated deployments.
From Notebook to Docker Container
Step 1: Conceptual Shift
Data scientists often develop initial prototypes in Jupyter notebooks. While notebooks are great for exploration, they’re not ideal for production. Instead, production-ready code is usually packaged as a Python script or an installable module. This code then gets containerized.
Step 2: Converting Notebook to a Script
Let’s imagine you have a simple notebook, train_model.ipynb, that trains a linear regression model on some dataset. To convert it to a script:
import pandas as pd
from sklearn.linear_model import LinearRegression
import joblib

# Example data
data = pd.read_csv("data.csv")
X = data[["feature1", "feature2"]]
y = data["target"]

model = LinearRegression()
model.fit(X, y)

joblib.dump(model, "linear_regression_model.pkl")
print("Model trained and saved!")
Your code no longer relies on Jupyter’s interactive environment. Instead, it runs independently as a Python script.
Step 3: Creating a Dockerfile
A Dockerfile describes the environment needed to run your code. For instance:
# Start from a lightweight Python image
FROM python:3.9-slim

# Set a working directory
WORKDIR /app

# Copy your requirements file first for caching benefits
COPY requirements.txt /app/

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of your application
COPY . /app/

# Run the training script by default
CMD ["python", "train_model.py"]
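The Dockerfile copies a requirements.txt that isn’t shown above; a minimal version for this training script might look like the following (pin exact versions as you see fit):

pandas
scikit-learn
joblib
flask  # only needed later, for the serving example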
Step 4: Building and Testing the Image
Build and tag the image:
docker build -t my-ml-app:latest .
Run a container locally to test:
docker run --rm my-ml-app:latest
Once you’ve confirmed everything works, you’re prepared to push this image to a container registry (like Docker Hub or a private registry) for deployment to Kubernetes.
Deploying Your ML Service on Kubernetes
Container Registry
To deploy to Kubernetes, your image must be accessible from the cluster, which implies pushing it to a container registry:
docker tag my-ml-app:latest myregistry.com/my-ml-app:latest
docker push myregistry.com/my-ml-app:latest
Creating Kubernetes Manifests
Once your image is in a registry, you can define a Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: ml-container
        image: myregistry.com/my-ml-app:latest
        ports:
        - containerPort: 80
You can expose your Deployment as a Service:
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP
Apply these configurations to your cluster:
kubectl apply -f ml-model-deployment.yaml
kubectl apply -f ml-model-service.yaml
Confirming the Deployment
Use kubectl get pods and kubectl get services to verify that your Pods are running and your Service is exposed internally (or externally, if configured).
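To dig a bit deeper, a few other standard kubectl commands are handy; the names below match the manifests from this section:

# Wait for the rollout to finish and inspect the Pods behind the Deployment
kubectl rollout status deployment/ml-model-deployment
kubectl describe deployment ml-model-deployment

# Forward a local port to the Service for a quick smoke test
kubectl port-forward service/ml-model-service 8080:80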
Managing Data and Storage
Machine learning workloads often require large datasets or need to produce outputs (models, logs, metrics) that must persist beyond container lifespans. Kubernetes abstracts storage using Volumes and Persistent Volume Claims (PVCs).
Using Persistent Volume (PV) and Persistent Volume Claim (PVC)
- Define a Persistent Volume: An interface to the actual physical storage (local or network-attached).
- Create a PVC: A request for storage that references a PV.
- Attach PVC to a Pod: The Pod then sees the persistent disk as part of its filesystem.
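In many clusters, the PV in step 1 is provisioned dynamically by a StorageClass, so you rarely write it by hand. For completeness, a minimal static PersistentVolume might look like this sketch (the hostPath backend is for illustration only):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /mnt/training-data  # illustration only; use networked storage in real clusters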
Here’s an example PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
You can mount this PVC in the Pod:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      containers:
      - name: ml-training-container
        image: myregistry.com/ml-training-image:latest
        volumeMounts:
        - name: training-data-volume
          mountPath: /app/data
      volumes:
      - name: training-data-volume
        persistentVolumeClaim:
          claimName: training-data-pvc
Mounting a PVC for training data ensures that large datasets are locally available for ML tasks and remain persistent even if the container restarts.
Scaling Strategies
Kubernetes offers autoscaling capabilities to match your resource usage. This can be particularly helpful for inference workloads that see spikes in demand.
Horizontal Pod Autoscaler (HPA)
The Horizontal Pod Autoscaler uses metrics such as CPU, memory, or custom metrics (e.g., requests per second) to adjust the number of running Pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
When CPU usage exceeds 70%, Kubernetes automatically scales up the Pods in ml-model-deployment; when usage drops again, it scales them back down.
Distributed Training
Distributed training usually requires more advanced frameworks like Horovod or PyTorch’s Distributed Data Parallel. Kubernetes can schedule multiple containers across different nodes and allow them to communicate via a Service. This is more advanced and typically involves specialized custom resource definitions (CRDs) offered by tools like Kubeflow.
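As a taste of what those CRDs look like, here is a sketch of a PyTorchJob from the Kubeflow training operator; the image name is a placeholder, and the operator must already be installed in the cluster:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch  # the operator expects this container name
            image: myregistry.com/ddp-training-image:latest
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: myregistry.com/ddp-training-image:latest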
Basic Example: Deploying a Simple Model on K8s
Let’s walk through a more concrete but simplified example of deploying a pre-trained model (e.g., a scikit-learn regression model) behind a REST API in a Flask app.
1. Flask Prediction App
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the model at startup
model = joblib.load("linear_regression_model.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json
    features = data.get("features", [])
    if len(features) == 2:  # expecting 2 features
        prediction = model.predict([features])
        return jsonify({"prediction": prediction.tolist()})
    else:
        return jsonify({"error": "Invalid input"}), 400

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)
2. Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app/
CMD ["python", "app.py"]
3. Kubernetes Manifests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-ml-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flask-ml
  template:
    metadata:
      labels:
        app: flask-ml
    spec:
      containers:
      - name: flask-ml-container
        image: myregistry.com/flask-ml-image:latest
        ports:
        - containerPort: 80
apiVersion: v1
kind: Service
metadata:
  name: flask-ml-service
spec:
  selector:
    app: flask-ml
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
  type: ClusterIP
After applying these YAML files, your model is accessible within the cluster at the flask-ml-service:80 endpoint. For external traffic, you’d configure an Ingress or use a LoadBalancer-type Service.
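A minimal Ingress for that external route might look like the sketch below, assuming an ingress controller (such as NGINX) is installed; the hostname is a placeholder:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flask-ml-ingress
spec:
  rules:
  - host: ml.example.com  # placeholder hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: flask-ml-service
            port:
              number: 80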
Testing with cURL or Python
Within the cluster or via port-forwarding, you can test your endpoint:
curl -X POST -H "Content-Type: application/json" \
  -d '{"features": [3.2, 1.5]}' \
  http://flask-ml-service/predict
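For the Python option, a small requests-based client could look like this sketch, assuming you’ve port-forwarded the Service to localhost:

import requests

# Assumes: kubectl port-forward service/flask-ml-service 8080:80
response = requests.post(
    "http://localhost:8080/predict",
    json={"features": [3.2, 1.5]},
    timeout=10,
)
print(response.json())  # e.g. {"prediction": [...]}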
Intermediate Concepts: CI/CD and Reproducibility
Continuous Integration and Continuous Deployment
Automating the build and deployment process is essential for agility. A typical flow might be:
- Commit to Git triggers a CI pipeline.
- Tests are run automatically.
- Docker image is built and pushed to the registry.
- Kubernetes manifests are updated or a Helm chart is deployed to your K8s cluster.
Tools such as Jenkins, GitLab CI, or GitHub Actions can facilitate these steps.
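As a rough sketch rather than a drop-in pipeline, a GitHub Actions workflow covering the build-and-push steps might look like this; registry login and cluster credentials are intentionally omitted:

name: build-and-deploy
on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    # Registry authentication (e.g., docker/login-action) would go here
    - name: Build and push image
      run: |
        docker build -t myregistry.com/my-ml-app:${{ github.sha }} .
        docker push myregistry.com/my-ml-app:${{ github.sha }}

    # Assumes kubeconfig credentials were configured in a previous step
    - name: Roll out the new image
      run: kubectl set image deployment/ml-model-deployment ml-container=myregistry.com/my-ml-app:${{ github.sha }}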
Reproducibility and Version Control
In machine learning:
- Data versioning matters as much as code versioning. Tools like DVC (Data Version Control) or MLflow can help track changes.
- Model versioning is critical for proper A/B testing and rollbacks. Store your models in a central registry with distinct version tags.
Using GitOps strategies (e.g., Argo CD or Flux) ensures your cluster’s state is always derived from what is declared in source control.
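To make the data-versioning point concrete, a minimal DVC workflow for the data.csv used earlier could look like this sketch (the remote bucket is a placeholder):

# Initialize DVC in an existing Git repository
dvc init

# Configure a remote for the actual data (placeholder bucket)
dvc remote add -d storage s3://my-ml-bucket/dvc-store

# Track the dataset; Git only stores the small .dvc pointer file
dvc add data.csv
git add data.csv.dvc .gitignore
git commit -m "Track training data with DVC"

# Upload the data itself to the remote
dvc push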
Advanced Concepts: GPUs, Hyperparameter Tuning, and Auto-Scaling
Using GPUs in Kubernetes
Training deep learning models often requires GPU acceleration. When you have GPU-enabled nodes, Kubernetes can schedule containers that request GPU resources.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-training-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-training
  template:
    metadata:
      labels:
        app: gpu-training
    spec:
      containers:
      - name: gpu-trainer
        image: nvidia/cuda:11.0-base
        resources:
          limits:
            nvidia.com/gpu: 1
This example requests a single GPU. K8s, with the NVIDIA device plugin, ensures pods requiring GPUs are placed on GPU-enabled nodes.
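Once the device plugin is running, you can confirm that a node actually advertises GPU capacity (the node name below is a placeholder):

kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"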
Hyperparameter Tuning with Kubernetes Jobs
Hyperparameter tuning can be done by spawning multiple Jobs, each testing a different hyperparameter set:
apiVersion: batch/v1
kind: Job
metadata:
  name: hyperparam-job
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: myregistry.com/training-image:latest
        command: ["python", "train.py", "--lr=0.01", "--batch-size=64"]
      restartPolicy: Never
  backoffLimit: 0
By programmatically generating multiple Job manifests, you can run them in parallel to sweep over a parameter search space.
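One straightforward way to do that generation is a short script that writes one Job manifest per hyperparameter combination; this is a sketch using PyYAML and the same placeholder training image as above:

import itertools
import yaml  # pip install pyyaml

learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64]

for i, (lr, bs) in enumerate(itertools.product(learning_rates, batch_sizes)):
    job = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"hyperparam-job-{i}"},
        "spec": {
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": "myregistry.com/training-image:latest",
                        "command": ["python", "train.py", f"--lr={lr}", f"--batch-size={bs}"],
                    }],
                },
            },
        },
    }
    # Write each manifest; apply them all with `kubectl apply -f .`
    with open(f"hyperparam-job-{i}.yaml", "w") as f:
        yaml.safe_dump(job, f)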
Autoscaling with Custom Metrics
Kubernetes also allows for scaling based on custom metrics, such as the rate of inference requests or queue depth. With the help of metrics adapters and a system like Prometheus, you can define custom rules.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_requests_per_second
      target:
        type: AverageValue
        averageValue: "50"
When the average number of inference requests per second across Pods exceeds 50, Kubernetes automatically spins up more Pods.
Orchestration Frameworks: Kubeflow and Beyond
Kubernetes is a strong foundation, but ML workflows often require even more specialized tooling. That’s where frameworks like Kubeflow, Argo, or MLflow on Kubernetes come in.
Kubeflow
Kubeflow aims to make running machine learning workflows on Kubernetes straightforward. It includes:
- Kubeflow Pipelines: A platform for building and deploying scalable ML workflows.
- TFJob, PyTorchJob, MXNetJob: Custom resources for distributed training.
- Central Dashboard: To monitor pipelines, notebooks, and experiments.
MLflow on K8s
MLflow is another popular tool for tracking experiments, storing models, and managing the entire model lifecycle. It can be installed on Kubernetes, allowing you to store training artifacts and serve models in a standardized manner.
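As a small illustration of the tracking side, the snippet below logs a toy run to a hypothetical in-cluster MLflow tracking Service; the URI is an assumption about how you’ve exposed MLflow:

import mlflow
from sklearn.linear_model import LinearRegression

# Hypothetical in-cluster tracking endpoint; adjust to your MLflow Service
mlflow.set_tracking_uri("http://mlflow-service:5000")

# Tiny toy model so the example is self-contained
X, y = [[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0]
model = LinearRegression().fit(X, y)

with mlflow.start_run():
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_metric("train_r2", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")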
Argo Workflow
Argo is a workflow engine for Kubernetes, allowing you to define multi-step data and ML pipelines. Similar to Kubeflow Pipelines, you define each stage as a set of containers that run in sequence or in parallel.
Orchestration Tool | Strengths | Typical Use Cases |
---|---|---|
Kubeflow | End-to-end ML platform on K8s, focuses on distributed training and pipelines. | Large-scale pipelines, end-to-end ML lifecycle. |
Argo Workflow | General-purpose workflow engine. | Complex data processing or multi-step workflows. |
MLflow | Experiment tracking, model registry, and serving. | All stages of ML development and deployment with tracking. |
Best Practices and Future Trends
- Minimize Image Size: Large Docker images slow down deployment and scaling. Use slim images and ensure your Dockerfile only installs necessary dependencies.
- Enable Logging and Monitoring: Integrate your ML workloads with logging (e.g., EFK stack) and monitoring (Prometheus + Grafana). This gives visibility into performance and resource usage.
- Secure Your Pipeline: Implement security best practices, including scanning container images for vulnerabilities, using read-only file systems, and restricting permissions with Pod Security Policies or Pod Security Admission.
- Use Helm or Kustomize: For more advanced or repeated deployments, Helm or Kustomize templates reduce YAML duplication and confusion.
- Look Ahead to Serverless ML: Serverless offerings like Knative or vendor-specific solutions (e.g., AWS Fargate) could simplify the operational overhead. In some cases, you simply need to run your code without managing the underlying cluster.
Conclusion
Transitioning your machine learning workflows from local notebooks to production-ready pipelines on Kubernetes is a massive step up in maturity. With containers providing consistency and Kubernetes offering robust orchestration, you can:
- Easily scale training and inference workloads.
- Manage multi-tenant environments for different data scientists and teams.
- Integrate logs, monitoring, and advanced workflows like hyperparameter tuning or distributed training.
- Lay the groundwork for future trends like serverless ML.
While the initial ramp-up in complexity might feel steep, the benefits in scalability, reproducibility, and efficiency are worth it. Whether you choose to simply orchestrate your Dockerized Python scripts or leverage full-fledged platforms like Kubeflow, the Kubernetes ecosystem is ready to handle your machine learning ambitions at scale.
By consistently refining your processes—adopting CI/CD, versioning data, integrating advanced autoscaling metrics, and using specialized frameworks—you’ll move beyond the days of single-machine notebooks. Your ML journey can then fully leverage the power and reliability of Kubernetes, turning ideas into reliable, scalable production services. Happy scaling!