From Notebooks to Nodes: Scaling Machine Learning on K8s
Machine learning (ML) often starts as something simple—perhaps a single Jupyter notebook on a local machine. In the early stages, this might be enough to train small models or experiment with data. However, as your workload grows, so does the need for computational power, distributed training, version control, reproducibility, and the ability to quickly iterate and deploy models. This is where Kubernetes (K8s) steps in, providing a powerful platform to manage growth effortlessly.
In this blog post, we will explore how to move from the simplicity of local notebooks to production-grade deployments at scale on Kubernetes. Whether you’re just starting out in your ML journey or looking to leverage the advanced features of K8s for production workloads, this comprehensive guide will help you navigate the key concepts, tools, and best practices.
Table of Contents
- Introduction to Containers and Kubernetes
- Why Run Machine Learning on Kubernetes?
- Fundamental Kubernetes Concepts for ML
- From Notebook to Docker Container
- Deploying Your ML Service on Kubernetes
- Managing Data and Storage
- Scaling Strategies
- Basic Example: Deploying a Simple Model on K8s
- Intermediate Concepts: CI/CD and Reproducibility
- Advanced Concepts: GPUs, Hyperparameter Tuning, and Auto-Scaling
- Orchestration Frameworks: Kubeflow and Beyond
- Best Practices and Future Trends
- Conclusion
Introduction to Containers and Kubernetes
Before diving into the mechanics of running machine learning workloads on Kubernetes, let’s do a brief recap of the foundational technologies.
Containers: The Building Blocks
A container is a packaged unit of software that bundles together code, libraries, environments, and all dependencies into a single artifact. Containers ensure that your application runs the same way in every environment.
Key benefits for ML:
- Consistency: Your model runs reliably across different development, testing, and production environments.
- Portability: Containers can run on any machine that has a container runtime (such as Docker).
- Isolation: Each container is isolated from the rest of the system, making dependency management easier.
Kubernetes: The Container Orchestrator
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Once you have your machine learning model packaged in a container, Kubernetes can help you:
- Distribute replicas of your container across multiple nodes.
- Handle failover and high availability.
- Automatically scale based on resource usage.
- Provide rolling updates to avoid downtime.
Kubernetes is especially important for machine learning workflows that require distributed training or large-scale inference.
Why Run Machine Learning on Kubernetes?
Data scientists and ML engineers often use frameworks like scikit-learn, TensorFlow, or PyTorch on a single machine or a specialized cluster. While that works for development and experimentation, scaling up poses significant hurdles:
- Resource Allocation: Different models require different resources. K8s can dynamically allocate resources (CPU, GPU, memory) to containers.
- Rapid Experimentation: Launching and tearing down environments is simpler when using container deployments, enabling faster iterations.
- Autoscaling: You can automatically scale your application—or model inference service—up or down based on metrics.
- Standardization: With K8s, your architecture becomes more standardized, reducing complexities in multi-tenant or multi-team scenarios.
In short, Kubernetes provides the orchestration needed to handle both small and large-scale requirements for modern machine learning.
Fundamental Kubernetes Concepts for ML
Before getting started, let’s explore the elementary concepts in Kubernetes that matter most to machine learning workloads.
Kubernetes Object | Description | Typical Use in ML |
---|---|---|
Pod | The smallest deployable unit in K8s. A Pod can contain one or more containers. | Hosting a training job, inference microservice, or data preprocessing step. |
Deployment | Manages stateless Pods. It can scale and roll out updates. | Hosting model services (inference APIs) in a resilient, autoscaling way. |
Service | Exposes Pods to the network or other Pods inside the cluster. | Providing stable endpoints for inference requests. |
StatefulSet | Manages stateful Pods, maintaining a sticky identity for each Pod. | Running distributed frameworks like Spark, or advanced ML pipelines needing state. |
Job/CronJob | Creates Pods to run a batch job to completion (once or on a schedule). | Model training or data processing tasks that must be completed. |
Volume (PVC) | Provides persistent storage for Pods through Persistent Volume Claims (PVCs). | Storing training data, model weights, or logs that must persist. |
Understanding these concepts helps prepare you for more sophisticated deployments.
From Notebook to Docker Container
Step 1: Conceptual Shift
Data scientists often develop initial prototypes in Jupyter notebooks. While notebooks are great for exploration, they’re not ideal for production. Instead, production-ready code is usually packaged as a Python script or an installable module. This code then gets containerized.
Step 2: Converting Notebook to a Script
Let’s imagine you have a simple notebook, train_model.ipynb, that trains a linear regression model on some dataset. To convert it to a script:
import pandas as pd
from sklearn.linear_model import LinearRegression
import joblib

# Example data
data = pd.read_csv("data.csv")
X = data[["feature1", "feature2"]]
y = data["target"]

model = LinearRegression()
model.fit(X, y)

joblib.dump(model, "linear_regression_model.pkl")
print("Model trained and saved!")
Your code no longer relies on Jupyter’s interactive environment. Instead, it runs independently as a Python script.
Step 3: Creating a Dockerfile
A Dockerfile describes the environment needed to run your code. For instance:
# Start from a lightweight Python image
FROM python:3.9-slim

# Set a working directory
WORKDIR /app

# Copy your requirements file first for caching benefits
COPY requirements.txt /app/

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of your application
COPY . /app/

# Run the training script by default
CMD ["python", "train_model.py"]
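The Dockerfile copies a requirements.txt that isn’t shown above; a minimal version for this training script might look like the following (pin exact versions as you see fit):

pandas
scikit-learn
joblib
flask  # only needed later, for the serving example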
Step 4: Building and Testing the Image
Build and tag the image:
docker build -t my-ml-app:latest .
Run a container locally to test:
docker run --rm my-ml-app:latest
Once you’ve confirmed everything works, you’re prepared to push this image to a container registry (like Docker Hub or a private registry) for deployment to Kubernetes.
Deploying Your ML Service on Kubernetes
Container Registry
To deploy to Kubernetes, your image must be accessible from the cluster, which implies pushing it to a container registry:
docker tag my-ml-app:latest myregistry.com/my-ml-app:latest
docker push myregistry.com/my-ml-app:latest
Creating Kubernetes Manifests
Once your image is in a registry, you can define a Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: ml-container
        image: myregistry.com/my-ml-app:latest
        ports:
        - containerPort: 80
You can expose your Deployment as a Service:
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP
Apply these configurations to your cluster:
kubectl apply -f ml-model-deployment.yaml
kubectl apply -f ml-model-service.yaml
Confirming the Deployment
Use kubectl get pods and kubectl get services to verify that your Pods are running and your Service is exposed internally (or externally, if configured).
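To dig a bit deeper, a few other standard kubectl commands are handy; the names below match the manifests from this section:

# Wait for the rollout to finish and inspect the Pods behind the Deployment
kubectl rollout status deployment/ml-model-deployment
kubectl describe deployment ml-model-deployment

# Forward a local port to the Service for a quick smoke test
kubectl port-forward service/ml-model-service 8080:80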
Managing Data and Storage
Machine learning workloads often require large datasets or need to produce outputs (models, logs, metrics) that must persist beyond container lifespans. Kubernetes abstracts storage using Volumes and Persistent Volume Claims (PVCs).
Using Persistent Volume (PV) and Persistent Volume Claim (PVC)
- Define a Persistent Volume: An interface to the actual physical storage (local or network-attached).
- Create a PVC: A request for storage that references a PV.
- Attach PVC to a Pod: The Pod then sees the persistent disk as part of its filesystem.
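In many clusters, the PV in step 1 is provisioned dynamically by a StorageClass, so you rarely write it by hand. For completeness, a minimal static PersistentVolume might look like this sketch (the hostPath backend is for illustration only):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /mnt/training-data  # illustration only; use networked storage in real clusters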
Here’s an example PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
You can mount this PVC in the Pod:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      containers:
      - name: ml-training-container
        image: myregistry.com/ml-training-image:latest
        volumeMounts:
        - name: training-data-volume
          mountPath: /app/data
      volumes:
      - name: training-data-volume
        persistentVolumeClaim:
          claimName: training-data-pvc
Mounting a PVC for training data ensures that large datasets are locally available for ML tasks and remain persistent even if the container restarts.
Scaling Strategies
Kubernetes offers autoscaling capabilities to match your resource usage. This can be particularly helpful for inference workloads that see spikes in demand.
Horizontal Pod Autoscaler (HPA)
The Horizontal Pod Autoscaler uses metrics such as CPU, memory, or custom metrics (e.g., requests per second) to adjust the number of running Pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
When CPU usage exceeds 70%, Kubernetes automatically scales up the Pods in ml-model-deployment; when usage drops again, it scales them back down.
Distributed Training
Distributed training usually requires more advanced frameworks like Horovod or PyTorch’s Distributed Data Parallel. Kubernetes can schedule multiple containers across different nodes and allow them to communicate via a Service. This is more advanced and typically involves specialized custom resource definitions (CRDs) offered by tools like Kubeflow.
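As a taste of what those CRDs look like, here is a sketch of a PyTorchJob from the Kubeflow training operator; the image name is a placeholder, and the operator must already be installed in the cluster:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch  # the operator expects this container name
            image: myregistry.com/ddp-training-image:latest
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: myregistry.com/ddp-training-image:latest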
Basic Example: Deploying a Simple Model on K8s
Let’s walk through a more concrete but simplified example of deploying a pre-trained model (e.g., a scikit-learn regression model) behind a REST API in a Flask app.
1. Flask Prediction App
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the model at startup
model = joblib.load("linear_regression_model.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json
    features = data.get("features", [])
    if len(features) == 2:  # expecting 2 features
        prediction = model.predict([features])
        return jsonify({"prediction": prediction.tolist()})
    else:
        return jsonify({"error": "Invalid input"}), 400

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)
2. Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app/
CMD ["python", "app.py"]
3. Kubernetes Manifests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-ml-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flask-ml
  template:
    metadata:
      labels:
        app: flask-ml
    spec:
      containers:
      - name: flask-ml-container
        image: myregistry.com/flask-ml-image:latest
        ports:
        - containerPort: 80
apiVersion: v1
kind: Service
metadata:
  name: flask-ml-service
spec:
  selector:
    app: flask-ml
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
  type: ClusterIP
After applying these YAML files, your model is accessible within the cluster at the flask-ml-service:80 endpoint. For external traffic, you’d configure an Ingress or use a LoadBalancer-type Service.
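A minimal Ingress for that external route might look like the sketch below, assuming an ingress controller (such as NGINX) is installed; the hostname is a placeholder:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flask-ml-ingress
spec:
  rules:
  - host: ml.example.com  # placeholder hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: flask-ml-service
            port:
              number: 80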
Testing with cURL or Python
Within the cluster or via port-forwarding, you can test your endpoint:
curl -X POST -H "Content-Type: application/json" \
  -d '{"features": [3.2, 1.5]}' \
  http://flask-ml-service/predict
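For the Python option, a small requests-based client could look like this sketch, assuming you’ve port-forwarded the Service to localhost:

import requests

# Assumes: kubectl port-forward service/flask-ml-service 8080:80
response = requests.post(
    "http://localhost:8080/predict",
    json={"features": [3.2, 1.5]},
    timeout=10,
)
print(response.json())  # e.g. {"prediction": [...]}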
Intermediate Concepts: CI/CD and Reproducibility
Continuous Integration and Continuous Deployment
Automating the build and deployment process is essential for agility. A typical flow might be:
- Commit to Git triggers a CI pipeline.
- Tests are run automatically.
- Docker image is built and pushed to the registry.
- Kubernetes manifests are updated or a Helm chart is deployed to your K8s cluster.
Tools such as Jenkins, GitLab CI, or GitHub Actions can facilitate these steps.
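As a rough sketch rather than a drop-in pipeline, a GitHub Actions workflow covering the build-and-push steps might look like this; registry login and cluster credentials are intentionally omitted:

name: build-and-deploy
on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    # Registry authentication (e.g., docker/login-action) would go here
    - name: Build and push image
      run: |
        docker build -t myregistry.com/my-ml-app:${{ github.sha }} .
        docker push myregistry.com/my-ml-app:${{ github.sha }}

    # Assumes kubeconfig credentials were configured in a previous step
    - name: Roll out the new image
      run: kubectl set image deployment/ml-model-deployment ml-container=myregistry.com/my-ml-app:${{ github.sha }}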
Reproducibility and Version Control
In machine learning:
- Data versioning matters as much as code versioning. Tools like DVC (Data Version Control) or MLflow can help track changes.
- Model versioning is critical for proper A/B testing and rollbacks. Store your models in a central registry with distinct version tags.
Using GitOps strategies (e.g., Argo CD or Flux) ensures your cluster’s state is always derived from what is declared in source control.
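To make the data-versioning point concrete, a minimal DVC workflow for the data.csv used earlier could look like this sketch (the remote bucket is a placeholder):

# Initialize DVC in an existing Git repository
dvc init

# Configure a remote for the actual data (placeholder bucket)
dvc remote add -d storage s3://my-ml-bucket/dvc-store

# Track the dataset; Git only stores the small .dvc pointer file
dvc add data.csv
git add data.csv.dvc .gitignore
git commit -m "Track training data with DVC"

# Upload the data itself to the remote
dvc push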
Advanced Concepts: GPUs, Hyperparameter Tuning, and Auto-Scaling
Using GPUs in Kubernetes
Training deep learning models often requires GPU acceleration. When you have GPU-enabled nodes, Kubernetes can schedule containers that request GPU resources.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-training-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-training
  template:
    metadata:
      labels:
        app: gpu-training
    spec:
      containers:
      - name: gpu-trainer
        image: nvidia/cuda:11.0-base
        resources:
          limits:
            nvidia.com/gpu: 1
This example requests a single GPU. K8s, with the NVIDIA device plugin, ensures pods requiring GPUs are placed on GPU-enabled nodes.
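Once the device plugin is running, you can confirm that a node actually advertises GPU capacity (the node name below is a placeholder):

kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"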
Hyperparameter Tuning with Kubernetes Jobs
Hyperparameter tuning can be done by spawning multiple Jobs, each testing a different hyperparameter set:
apiVersion: batch/v1
kind: Job
metadata:
  name: hyperparam-job
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: myregistry.com/training-image:latest
        command: ["python", "train.py", "--lr=0.01", "--batch-size=64"]
      restartPolicy: Never
  backoffLimit: 0
By programmatically generating multiple Job manifests, you can run them in parallel to sweep over a parameter search space.
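One straightforward way to do that generation is a short script that writes one Job manifest per hyperparameter combination; this is a sketch using PyYAML and the same placeholder training image as above:

import itertools
import yaml  # pip install pyyaml

learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64]

for i, (lr, bs) in enumerate(itertools.product(learning_rates, batch_sizes)):
    job = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"hyperparam-job-{i}"},
        "spec": {
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": "myregistry.com/training-image:latest",
                        "command": ["python", "train.py", f"--lr={lr}", f"--batch-size={bs}"],
                    }],
                },
            },
        },
    }
    # Write each manifest; apply them all with `kubectl apply -f .`
    with open(f"hyperparam-job-{i}.yaml", "w") as f:
        yaml.safe_dump(job, f)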
Autoscaling with Custom Metrics
Kubernetes also allows for scaling based on custom metrics, such as the rate of inference requests or queue depth. With the help of metrics adapters and a system like Prometheus, you can define custom rules.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_requests_per_second
      target:
        type: AverageValue
        averageValue: "50"
When the average number of inference requests per second across Pods exceeds 50, Kubernetes automatically spins up more Pods.
Orchestration Frameworks: Kubeflow and Beyond
Kubernetes is a strong foundation, but ML workflows often require even more specialized tooling. That’s where frameworks like Kubeflow, Argo, or MLflow on Kubernetes come in.
Kubeflow
Kubeflow aims to make running machine learning workflows on Kubernetes straightforward. It includes:
- Kubeflow Pipelines: A platform for building and deploying scalable ML workflows.
- TFJob, PyTorchJob, MXNetJob: Custom resources for distributed training.
- Central Dashboard: To monitor pipelines, notebooks, and experiments.
MLflow on K8s
MLflow is another popular tool for tracking experiments, storing models, and managing the entire model lifecycle. It can be installed on Kubernetes, allowing you to store training artifacts and serve models in a standardized manner.
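As a small illustration of the tracking side, the snippet below logs a toy run to a hypothetical in-cluster MLflow tracking Service; the URI is an assumption about how you’ve exposed MLflow:

import mlflow
from sklearn.linear_model import LinearRegression

# Hypothetical in-cluster tracking endpoint; adjust to your MLflow Service
mlflow.set_tracking_uri("http://mlflow-service:5000")

# Tiny toy model so the example is self-contained
X, y = [[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0]
model = LinearRegression().fit(X, y)

with mlflow.start_run():
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_metric("train_r2", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")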
Argo Workflow
Argo is a workflow engine for Kubernetes, allowing you to define multi-step data and ML pipelines. Similar to Kubeflow Pipelines, you define each stage as a set of containers that run in sequence or in parallel.
Orchestration Tool | Strengths | Typical Use Cases |
---|---|---|
Kubeflow | End-to-end ML platform on K8s, focuses on distributed training and pipelines. | Large-scale pipelines, end-to-end ML lifecycle. |
Argo Workflow | General-purpose workflow engine. | Complex data processing or multi-step workflows. |
MLflow | Experiment tracking, model registry, and serving. | All stages of ML development and deployment with tracking. |
Best Practices and Future Trends
- Minimize Image Size: Large Docker images slow down deployment and scaling. Use slim images and ensure your Dockerfile only installs necessary dependencies.
- Enable Logging and Monitoring: Integrate your ML workloads with logging (e.g., EFK stack) and monitoring (Prometheus + Grafana). This gives visibility into performance and resource usage.
- Secure Your Pipeline: Implement security best practices, including scanning container images for vulnerabilities, using read-only file systems, and restricting permissions with Pod Security Policies or Pod Security Admission.
- Use Helm or Kustomize: For more advanced or repeated deployments, Helm or Kustomize templates reduce YAML duplication and confusion.
- Look Ahead to Serverless ML: Serverless offerings like Knative or vendor-specific solutions (e.g., AWS Fargate) could simplify the operational overhead. In some cases, you simply need to run your code without managing the underlying cluster.
Conclusion
Transitioning your machine learning workflows from local notebooks to production-ready pipelines on Kubernetes is a massive step up in maturity. With containers providing consistency and Kubernetes offering robust orchestration, you can:
- Easily scale training and inference workloads.
- Manage multi-tenant environments for different data scientists and teams.
- Integrate logs, monitoring, and advanced workflows like hyperparameter tuning or distributed training.
- Lay the groundwork for future trends like serverless ML.
While the initial ramp-up in complexity might feel steep, the benefits in scalability, reproducibility, and efficiency are worth it. Whether you choose to simply orchestrate your Dockerized Python scripts or leverage full-fledged platforms like Kubeflow, the Kubernetes ecosystem is ready to handle your machine learning ambitions at scale.
By consistently refining your processes—adopting CI/CD, versioning data, integrating advanced autoscaling metrics, and using specialized frameworks—you’ll move beyond the days of single-machine notebooks. Your ML journey can then fully leverage the power and reliability of Kubernetes, turning ideas into reliable, scalable production services. Happy scaling!