Deploying Granite Model in OpenShift using Ollama

By ANUJ BAHUGUNA posted Tue May 06, 2025 02:52 PM

  

Deploying Granite LLMs on OpenShift with Ollama: A Technical Deep Dive

Large Language Models (LLMs) are revolutionizing how we interact with technology, but deploying and managing them can be complex. Ollama simplifies this by providing an easy-to-use framework for running open-source LLMs locally or within your infrastructure. When combined with a robust container orchestration platform like OpenShift, you gain scalability, manageability, and enterprise-grade features for your LLM deployments.

This guide will walk you through deploying Ollama on OpenShift, pulling IBM's Granite model (specifically granite3.3:2b), and interacting with it via its API. We'll leverage standard Kubernetes objects orchestrated by OpenShift to create a resilient and accessible LLM service.

Why Ollama on OpenShift?

  • Simplified LLM Management: Ollama abstracts away the complexities of model loading, GPU management (if available, though not explicitly configured in this basic setup), and API serving.

  • Control and Privacy: Hosting LLMs within your OpenShift cluster gives you complete control over your data and model usage, crucial for sensitive applications.

  • Scalability & Resilience: OpenShift provides robust mechanisms for scaling your Ollama deployment and ensuring high availability.

  • Resource Management: Fine-grained control over CPU, memory, and storage resources allocated to your LLM service.

  • Integration with MLOps Pipelines: OpenShift can be the foundation for more complex MLOps pipelines involving LLMs.

Step 1: Defining the OpenShift Resources for Ollama

The following manifest installs Ollama (it creates the namespace, persistent storage, Deployment, Service, and Route). Save it as a file named ollama.yaml.

---
apiVersion: v1
kind: Namespace
metadata:
  name: ollama

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-storage
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
        volumeMounts:
        - mountPath: /.ollama
          name: ollama-storage
      restartPolicy: Always
      volumes:
      - name: ollama-storage
        persistentVolumeClaim:
          claimName: ollama-storage

---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
  - port: 80
    name: http
    targetPort: http
    protocol: TCP

---

kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: ollama-route
  namespace: ollama
  annotations:
    openshift.io/host.generated: 'true'
spec:
  to:
    kind: Service
    name: ollama
    weight: 100
  port:
    targetPort: http
  tls:
    termination: edge
  wildcardPolicy: None
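
The Deployment above does not set resource requests or limits for the ollama container. As a minimal sketch (the values are illustrative assumptions; size them for your model and cluster), you could add a resources block under the container, and optionally request a GPU if your cluster exposes NVIDIA GPUs through the GPU Operator:

        resources:
          requests:
            cpu: "2"
            memory: 4Gi
          limits:
            cpu: "4"
            memory: 8Gi
            # Uncomment if NVIDIA GPUs are exposed via the device plugin / GPU Operator
            # nvidia.com/gpu: "1"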

Step 2: Applying the Manifests

Use the oc CLI to apply these manifests to your OpenShift cluster:

oc apply -f ollama.yaml

The output will be similar to the following:

namespace/ollama created
persistentvolumeclaim/ollama-storage created
deployment.apps/ollama created
service/ollama created
route.route.openshift.io/ollama-route created
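
Before pulling a model, you can wait for the Ollama pod to become ready:

oc rollout status deployment/ollama -n ollama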

Then retrieve the generated route hostname:

oc get route ollama-route -n ollama -o jsonpath='{.spec.host}'

Use the route hostname retrieved above in place of {OLLAMA_HOST} in the commands that follow.
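
For convenience, you can capture the hostname in a shell variable (assuming a Bash-like shell) and substitute its value wherever {OLLAMA_HOST} appears:

OLLAMA_HOST=$(oc get route ollama-route -n ollama -o jsonpath='{.spec.host}')
echo ${OLLAMA_HOST}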

Step 3: Pulling the Granite Model

Use the API call below to pull the granite3.3:2b model. Pull progress is streamed back as JSON:

curl https://{OLLAMA_HOST}/api/pull -d '{
  "model": "granite3.3:2b"
}'

{"status":"pulling manifest"}
{"status":"pulling ac71e9e32c0b","digest":"sha256:ac71e9e32c0bea919b409c5918f69ca74339854b0319c5065e4e9fb6d95c4852","total":1545303328,"completed":1545303328}
{"status":"pulling 3da071a01bbe","digest":"sha256:3da071a01bbe5a1aa1e9766149ff67ed2b232f63d55e6ed50e3777b74536a67f","total":6560,"completed":6560}
{"status":"pulling 4a99a6dd617d","digest":"sha256:4a99a6dd617d9f901f29fe91925d5032600fcd78f315a9fa78c1667c950a3a5f","total":11332,"completed":11332}
{"status":"pulling f9ed27df66e9","digest":"sha256:f9ed27df66e9a0484b0bc04ae1cbcea5a2a0216ad2b0b673a63b9b8a120d06f1","total":417,"completed":417}
{"status":"verifying sha256 digest"}
{"status":"writing manifest"}
{"status":"success"}

Step 4: Verifying the Model List

To confirm the model has been successfully downloaded and is available to Ollama, query the /api/tags endpoint:

curl https://{OLLAMA_HOST}/api/tags

You will see output similar to the following:

{"models":[{"name":"granite3.3:2b","model":"granite3.3:2b","modified_at":"2025-05-06T03:40:21.607095246Z","size":1545321637,"digest":"07bd1f170855240f9e162bf54ea494a8bc1c73d8cbd1365d7fccbeb7d2504947","details":{"parent_model":"","format":"gguf","family":"granite","families":["granite"],"parameter_size":"2.5B","quantization_level":"Q4_K_M"}}]}

Step 5: Interacting with the Granite Model

Now, let's send a prompt to the Granite model using Ollama's OpenAI-compatible chat completions endpoint (/v1/chat/completions).

curl --request POST \
  --url https://{OLLAMA_HOST}/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
	"model": "granite3.3:2b",
	"messages": [
		{
			"role": "user",
			"content": "What is Capital of India?"
		}
	],
	"stream": false
}'

You will see output similar to the following:

{"id":"chatcmpl-931","object":"chat.completion","created":1746503021,"model":"granite3.3:2b","system_fingerprint":"fp_ollama","choices":[{"index":0,"message":{"role":"assistant","content":"The capital of India is New Delhi. It's been the de facto capital since 1911, although it was proclaimed as the ceremonial capital long before."},"finish_reason":"stop"}],"usage":{"prompt_tokens":51,"completion_tokens":40,"total_tokens":91}}


Best Practices and Next Steps

  • Scaling: Use OpenShift’s Horizontal Pod Autoscaler to scale Ollama pods based on CPU or memory usage; a minimal HPA sketch follows this list.
  • Security: Implement role-based access control (RBAC) and network policies to restrict access to the Ollama service.
  • Monitoring: Use OpenShift’s monitoring tools or integrate with Prometheus to track model performance and resource usage.
  • Fine-Tuning: Explore IBM’s InstructLab to fine-tune Granite models with enterprise-specific data for better task-specific performance.
  • Web Interface: Deploy Open WebUI alongside Ollama for a user-friendly interface to interact with the Granite model.
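
As a starting point for the scaling item above, here is a minimal Horizontal Pod Autoscaler sketch targeting the ollama Deployment on CPU utilization. The thresholds and replica counts are illustrative assumptions, and the HPA needs CPU requests set on the container (see the resources sketch in Step 1) to compute utilization. Note also that the ReadWriteOnce PVC defined earlier may prevent additional replicas from mounting the model storage, so consider ReadWriteMany storage before scaling out.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama
  namespace: ollama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75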

Conclusion

Deploying IBM Granite models on OpenShift with Ollama combines the power of advanced LLMs with the scalability and security of a Kubernetes-based platform. Ollama’s ease of use, OpenAI-compatible APIs, and support for Granite models make it an ideal choice for enterprise AI deployments. By following this guide, you can set up a production-ready environment, test the model’s capabilities, and integrate it with your applications.