Deploying Granite LLMs on OpenShift with Ollama: A Technical Deep Dive
Large Language Models (LLMs) are revolutionizing how we interact with technology, but deploying and managing them can be complex. Ollama simplifies this by providing an easy-to-use framework for running open-source LLMs locally or within your infrastructure. When combined with a robust container orchestration platform like OpenShift, you gain scalability, manageability, and enterprise-grade features for your LLM deployments.
This guide will walk you through deploying Ollama on OpenShift, pulling IBM's Granite model (specifically granite3.3:2b), and interacting with it via its API. We'll leverage standard Kubernetes objects orchestrated by OpenShift to create a resilient and accessible LLM service.
Why Ollama on OpenShift?
- Simplified LLM Management: Ollama abstracts away the complexities of model loading, GPU management (if available, though not explicitly configured in this basic setup), and API serving.
- Control and Privacy: Hosting LLMs within your OpenShift cluster gives you complete control over your data and model usage, crucial for sensitive applications.
- Scalability & Resilience: OpenShift provides robust mechanisms for scaling your Ollama deployment and ensuring high availability.
- Resource Management: Fine-grained control over CPU, memory, and storage resources allocated to your LLM service.
- Integration with MLOps Pipelines: OpenShift can be the foundation for more complex MLOps pipelines involving LLMs.
Step 1: Defining the OpenShift Resources for Ollama
The following manifest installs Ollama by creating a namespace, a persistent volume claim for model storage, a deployment, a service, and a route. Save it as a file named ollama.yaml.
---
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-storage
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
          volumeMounts:
            - mountPath: /.ollama
              name: ollama-storage
      restartPolicy: Always
      volumes:
        - name: ollama-storage
          persistentVolumeClaim:
            claimName: ollama-storage
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
    - port: 80
      name: http
      targetPort: http
      protocol: TCP
---
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: ollama-route
  namespace: ollama
  annotations:
    openshift.io/host.generated: 'true'
spec:
  to:
    kind: Service
    name: ollama
    weight: 100
  port:
    targetPort: http
  tls:
    termination: edge
  wildcardPolicy: None
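Note that this manifest does not request GPUs or set CPU and memory requests/limits on the Ollama container. If you want the fine-grained resource control mentioned earlier, one option is to set requests and limits on the deployment after it is created; the values below are illustrative placeholders, not sizing recommendations:

oc set resources deployment/ollama -n ollama \
  --requests=cpu=2,memory=4Gi \
  --limits=cpu=4,memory=8Gi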
Step 2: Applying the Manifests
Use the oc CLI to apply these manifests to your OpenShift cluster:
oc apply -f ollama.yaml
The output will be similar to the following:
namespace/ollama created
persistentvolumeclaim/ollama-storage created
deployment.apps/ollama created
service/ollama created
route.route.openshift.io/ollama-route created
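Before pulling a model, it is worth confirming that the deployment has rolled out and the Ollama pod is running, for example:

oc rollout status deployment/ollama -n ollama
oc get pods -n ollama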
Then retrieve the generated route hostname:
oc get route ollama-route -n ollama -o jsonpath='{.spec.host}'
Use the value of the route hostname retrieved above in place of {OLLAMA_HOST} in the commands that follow.
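To avoid copying the hostname by hand, you can also capture it in a shell variable and substitute it into the later curl commands yourself (the variable name OLLAMA_HOST is simply the convention used in this guide):

OLLAMA_HOST=$(oc get route ollama-route -n ollama -o jsonpath='{.spec.host}')
echo "Ollama is reachable at https://${OLLAMA_HOST}"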
Step 3: Pulling the Granite Model
Use the /api/pull endpoint to download the Granite 3.3 model (granite3.3:2b):
curl https://{OLLAMA_HOST}/api/pull -d '{
"model": "granite3.3:2b"
}'
{"status":"pulling manifest"}
{"status":"pulling ac71e9e32c0b","digest":"sha256:ac71e9e32c0bea919b409c5918f69ca74339854b0319c5065e4e9fb6d95c4852","total":1545303328,"completed":1545303328}
{"status":"pulling 3da071a01bbe","digest":"sha256:3da071a01bbe5a1aa1e9766149ff67ed2b232f63d55e6ed50e3777b74536a67f","total":6560,"completed":6560}
{"status":"pulling 4a99a6dd617d","digest":"sha256:4a99a6dd617d9f901f29fe91925d5032600fcd78f315a9fa78c1667c950a3a5f","total":11332,"completed":11332}
{"status":"pulling f9ed27df66e9","digest":"sha256:f9ed27df66e9a0484b0bc04ae1cbcea5a2a0216ad2b0b673a63b9b8a120d06f1","total":417,"completed":417}
{"status":"verifying sha256 digest"}
{"status":"writing manifest"}
{"status":"success"}
Step 4: Verifying the Model List
To confirm the model has been successfully downloaded and is available to Ollama, query the /api/tags endpoint:
curl https://{OLLAMA_HOST}/api/tags
You will see output similar to the following:
{"models":[{"name":"granite3.3:2b","model":"granite3.3:2b","modified_at":"2025-05-06T03:40:21.607095246Z","size":1545321637,"digest":"07bd1f170855240f9e162bf54ea494a8bc1c73d8cbd1365d7fccbeb7d2504947","details":{"parent_model":"","format":"gguf","family":"granite","families":["granite"],"parameter_size":"2.5B","quantization_level":"Q4_K_M"}}]}
Step 5: Interacting with the Granite Model
Now, let's send a prompt to the Granite model using Ollama's OpenAI-compatible chat completions endpoint (/v1/chat/completions).
curl --request POST \
--url https://{OLLAMA_HOST}/v1/chat/completions \
--header 'Content-Type: application/json' \
--data '{
"model": "granite3.3:2b",
"messages": [
{
"role": "user",
"content": "What is Capital of India?"
}
],
"stream": false
}'
You will see output similar to the following:
{"id":"chatcmpl-931","object":"chat.completion","created":1746503021,"model":"granite3.3:2b","system_fingerprint":"fp_ollama","choices":[{"index":0,"message":{"role":"assistant","content":"The capital of India is New Delhi. It's been the de facto capital since 1911, although it was proclaimed as the ceremonial capital long before."},"finish_reason":"stop"}],"usage":{"prompt_tokens":51,"completion_tokens":40,"total_tokens":91}}
Best Practices and Next Steps
- Scaling: Use OpenShift’s Horizontal Pod Autoscaler to scale Ollama pods based on CPU or memory usage (see the example command after this list).
- Security: Implement role-based access control (RBAC) and network policies to restrict access to the Ollama service.
- Monitoring: Use OpenShift’s monitoring tools or integrate with Prometheus to track model performance and resource usage.
- Fine-Tuning: Explore IBM’s InstructLab to fine-tune Granite models with enterprise-specific data for better task-specific performance.
- Web Interface: Deploy Open WebUI alongside Ollama for a user-friendly interface to interact with the Granite model.
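As a starting point for the scaling item above, a CPU-based autoscaler can be created with a single command; the thresholds here are illustrative rather than recommended values, and autoscaling on CPU only works once the deployment has resource requests set:

oc autoscale deployment/ollama -n ollama --min=1 --max=3 --cpu-percent=75

Also keep in mind that the ReadWriteOnce volume used in this guide can only be mounted by pods scheduled on a single node, so scaling beyond one replica may require a shared (ReadWriteMany) volume or per-pod model pulls.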
Conclusion
Deploying IBM Granite models on OpenShift with Ollama combines the power of advanced LLMs with the scalability and security of a Kubernetes-based platform. Ollama’s ease of use, OpenAI-compatible APIs, and support for Granite models make it an ideal choice for enterprise AI deployments. By following this guide, you can set up a production-ready environment, test the model’s capabilities, and integrate it with your applications.