How to deploy a serverless Hugging Face LLM to IBM Cloud Code Engine

By Enrico Regge posted Mon February 05, 2024 04:18 AM

  

Written by

Finn Fassnacht (finnfassnacht@gmail.com)
Corporate Student @ IBM

Michael Magrian (michael.magrian@ibm.com)
Software Engineer @ IBM

Enrico Regge (enrico.regge@de.ibm.com)
Senior Software Developer @ IBM

Introduction

Recently, there has been an explosion of popularity around AI, particularly Large Language Models (LLMs), due to their ability to interact with text in a human-like way and perform various language tasks with remarkable accuracy. IBM’s strategic product for bringing AI and data together is IBM watsonx™. Watsonx covers a wide range of use cases - model inferencing, model training, model governance and so on - in other words, it’s an extensive platform providing AI and data solutions with the well-known IBM business standards, including governance, security, compliance and high availability. Further information about watsonx is available at https://www.ibm.com/watsonx.

While IBM watsonx provides a rich set of features to serve your own private, governed models with an extensive and powerful set of capabilities, we all know that sometimes we just need a tool to get something done quickly and easily, without much overhead. So if you want to “just” serve an open-source model offered by a provider like Hugging Face to fulfil your use case, this blog article will demonstrate how you can do that … in a serverless fashion, using the IBM Cloud. In a few simple steps, you will learn how to interact with a model and how to set it up so it can be accessed from anywhere.

The IBM Cloud Code Engine service makes it easy to run workloads, small or large in scale, with an intuitive user experience that lets you get started quickly.

In this blog post, we'll provide a step-by-step guide to deploy NLP models from Hugging Face to Code Engine in order to publish a web application to translate any text from German to English. We'll also include code snippets to help you follow along.

For this tutorial, we'll be using Python 3 as our programming language, and our NLP model of choice will be the LLM OPUS-MT-de-en, which has been trained on the OPUS dataset. It was crafted by the Language Technology Research Group at the University of Helsinki and provides the ability to translate texts from German to English. In addition, we’ll briefly showcase how to wrap the Python-based API into a modern web application based on Next.js and the Carbon Design System.

By following the steps outlined in this blog post, you’ll learn how to use IBM Cloud to provide AI-based translation capabilities served through a web application hosted on Code Engine. If you would like to fast-forward to the solution, please take a look at the source code sample at https://github.com/IBM/CodeEngine/tree/main/llm-translator-app.

Prerequisites

  • Basic knowledge about Hugging Face

  • Basic knowledge of container images

  • Knowledge of IBM Cloud Code Engine and how to deploy a basic app (find out more here)

  • An IBM Cloud account with sufficient privileges to create and manage resources

Get your Model

Before we can begin translating text using the OPUS-MT-de-en model, we need to obtain the model files. They can be downloaded with Git from the Helsinki-NLP model repository on Hugging Face. Once downloaded, we can begin working with the model.

  1. Make sure that you have installed the Git CLI and Git LFS (Large File Storage). If you use Homebrew as a package manager on macOS, you can run these commands in the terminal:

    $ brew install git
    $ brew install git-lfs
    $ git lfs install
  2. Create a new folder to add our code in and navigate into it. Either use your file explorer or execute these commands:

    $ mkdir code-engine-translator-llm
    $ cd code-engine-translator-llm
  3. Git clone the model. Bear in mind that this will take a while:

    $ git clone https://huggingface.co/Helsinki-NLP/opus-mt-de-en
  4. Delete unnecessary files:

    $ (cd opus-mt-de-en && rm README.md rust_model.ot && rm -rf .git)

Now that you have downloaded the OPUS-MT-de-en model files, we can begin using the model to translate text on your local machine.

How to Run Your Model

Before we can start using our model to translate text, we need to ensure that we have all the necessary packages installed.

To do that, we need to install PyTorch and the transformers library.

  1. Install PyTorch for CPUs by running the following command:

    $ pip3 install torch --index-url https://download.pytorch.org/whl/cpu
  2. Install the transformers library by running the following commands:

    $ pip3 install transformers
    $ pip3 install sentencepiece
    $ pip3 install sacremoses
  3. Once these packages are installed, create a new Python file called “main.py” in the same directory where you issued the git clone operation to download the OPUS-MT-de-en model files. This file will be used to write the code for generating text using the model. Here's some sample code you can use to translate any German text:

    # Import the pipeline module, which bundles all files in your model directory together
    from transformers import pipeline 
    
    # Specify the task and the model directory
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
    
    # Translate text "Hallo, wie geht es dir?"
    res = translator("Hallo, wie geht es dir?")
    
    # Print the generated text
    print(res[0]["translation_text"])
  4. Save and execute the file. Congratulations, you just translated some text with your local LLM (Large Language Model)!

    $ python3 main.py 
    
    Hello, how are you?
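
As a side note, the pipeline also accepts a list of sentences, so you can translate several texts in a single call. Here is a minimal sketch (the example sentences are made up for illustration):

# Import the pipeline helper from the transformers library
from transformers import pipeline

# Load the translation pipeline for German to English
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

# Translate several sentences in a single call
sentences = ["Guten Morgen!", "Das Wetter ist heute schön."]
results = translator(sentences)

# Each result is a dict containing the key "translation_text"
for sentence, result in zip(sentences, results):
    print(sentence, "->", result["translation_text"])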

Create an HTTP server and deploy it on IBM Cloud Code Engine

To interact with the Large Language Model (LLM) in the context of a web application, we need to set up a server that can handle requests and return generated text from the model. This can be achieved by creating a simple HTTP server on port 8080 and serving a single route that accepts prompts and returns the corresponding translated text. 

Adjust the "main.py" by replacing it with the following lines of code:

from flask import Flask, request, jsonify
from transformers import pipeline 
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S')
log = logging.getLogger(__name__)

# Init the model (https://huggingface.co/Helsinki-NLP/opus-mt-de-en)
log.debug("pipeline de->en init ...")
de_to_en_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en") 
log.debug("pipeline de->en init [done]")

app = Flask(__name__)

# Set up a route at /api/translate/de-en
@app.route('/api/translate/de-en', methods=['POST'])
def translate_to_en():
    log.debug(">")
    data = request.json
    # run the translation operation
    res = de_to_en_translator(data["text"])
    # return text back to user
    log.debug("< done")
    return jsonify({"translated":(res[0]["translation_text"])})

# Start the Flask server
port = '8080'
if __name__ == "__main__":
    app.run(host='0.0.0.0', port=int(port))

To run a web application in Python, we chose to use the web framework Flask. In order to consume it as a dependency within a container, create a file called "requirements.txt" in the root directory of your project with the following content:

Flask
transformers
sacremoses
sentencepiece

To start the server locally, install the dependencies and run the main.py:

$ pip3 install -r requirements.txt

$ python3 main.py

Open a second terminal to play around with the locally running HTTP server:

$ curl -H 'Content-Type: application/json' \
       -d '{ "text":"Hallo, wie geht es dir?" }' \
       -X POST http://localhost:8080/api/translate/de-en

{"translated":"Hello, how are you?"}

Now that we know how to install and use our model locally, let's talk about how we can run it on IBM Cloud Code Engine.

Code Engine enables developers to easily deploy source code or container images as a web app, batch job or cloud function. For our use case, we’ll choose to deploy the LLM as an app, which has a URL for incoming requests. Furthermore, the number of running instances of an application is automatically scaled up or down (to zero) based on the incoming workload. If you are interested in learning more about the different workload concepts that Code Engine offers, you’ll find useful information in the product documentation article "Planning for Code Engine".

  • Note: Each container has a bit of storage called ephemeral storage, which acts like a small hard drive for that container. We can use this storage to store our model; however, it's important to note that when a container is terminated, its ephemeral storage is gone as well. Given that LLMs can get quite large, including them in your application can introduce challenges with regards to image pull times or exceeding your storage quota. Specialised models like the one we use in this example, however, take up significantly less storage, which makes it feasible to include them in the container image.

  • On another note, you likely hear a lot of talk about running LLMs on hardware with GPU support. While Code Engine does not offer GPU resources, this is not much of an issue, since the model we are using here is not very large and delivers quick response times on the hardware we provide.

Create a file, called ".ceignore", in the root directory of your project with the following content. If you’re interested to learn more about how to configure your image builds, please see our product documentation page "Planning your build": 

# Ignore the locally downloaded Hugging Face model folder
opus-mt-de-en/

# Ignore the Python cache
__pycache__

Create file, called "Dockerfile", in the root directory of your project (note that "Dockerfile" has no file extension) with the following content:

# Use a prebuilt pytorch image
FROM docker.io/cnstark/pytorch:2.0.1-py3.10.11-ubuntu22.04

# Install Git on top
RUN apt update \
    && apt install -y git

# Define the working dir
WORKDIR /app

# Download the model during the container build operation
RUN git clone https://huggingface.co/Helsinki-NLP/opus-mt-de-en \
    && (cd opus-mt-de-en && rm README.md rust_model.ot && rm -rf .git)

# Copy and install requirements
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

# Copy your main code file
COPY main.py main.py

# Command to start the app within the container
CMD ["python3","main.py"]

Note: As a base image we chose a PyTorch image, which provides the proper Python and PyTorch runtime environment suitable for running models on CPU hardware.
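
If you have a container runtime such as Docker installed locally, you can optionally build and run the image on your machine before handing the build over to Code Engine (the image tag "translator-llm" is just an example):

# Build the image locally and run it, exposing port 8080
$ docker build -t translator-llm .
$ docker run --rm -p 8080:8080 translator-llm

With the container running, the curl command from the previous section against http://localhost:8080 should work the same way.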

Next, we’ll utilize the IBM Cloud CLI to work with IBM Cloud Code Engine. If you don’t have the "IBM Cloud Code Engine CLI" configured yet, you’ll find useful instructions in the linked documentation page.

# Choose a proper region as well as resource group as part of the login operation
$ ibmcloud login 

# In case you already have a project, run "ibmcloud ce project select --name my-translator-project" instead
$ ibmcloud ce project create --name my-translator-project

# Deploy the application 
$ ibmcloud ce app create \
    --name my-translator \
    --build-source . \
    --build-size xlarge \
    --ephemeral-storage 4G \
    --memory 4G \
    --cpu 2 \
    --port 8080 \
    --min-scale 0 \
    --scale-down-delay 600

To help you understand what we are passing in as our configuration, here are a few explanations:

  • build-source & build-size: Since we are creating an image based on local source code, we signal that by setting the "build-source" to our file context. To read more about this topic, visit "Deploying your app from local source code with the CLI" in our documentation. The selection of size for the uild will depend on your code bundle size. In this case we chose "xlarge". While we don’t have much code, we do include model data.

  • ephemeral-storage: It is important for our use case to set "ephemeral-storage": our application, including the model, would exceed the default of 400 MB, resulting in an error.

  • cpu & memory: We want to provide enough resources to handle a good amount of requests simultaneously. Also keep in mind that the "ephemeral-storage" can be set to the maximum value of what we have as "memory". In our documentation we provide a list of "Supported memory and CPU combinations".

  • min-scale: With it set to 0, the application scales down to zero instances when it is not being used, reducing costs. More information on application scaling can be found in the documentation page "Configuring application scaling".

  • scale-down-delay: One downside of "min-scale: 0" is that the application takes some time to scale up again after it has been idle. To mitigate that somewhat, "scale-down-delay" postpones the scale-down, in our case by 600 seconds.
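
Once the create command has finished, you can inspect the deployed application, including its public URL, with the CLI:

# Show the details of the deployed app, including its public URL
$ ibmcloud ce app get --name my-translator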

After Code Engine is finished deploying your app, you can play around with your model online.

$ curl -H 'Content-Type: application/json' \
       -d '{ "text":"Hallo, wie geht es dir?" }' \
       -X POST https://my-translator.<random-id>.<region>.codeengine.appdomain.cloud/api/translate/de-en 

{"translated":"Hello, how are you?"}

Develop a modern web application based on Python Flask and Next.js

While our hands-on tutorial ends here, we do want to give you an outlook on how you could further extend your application by combining Python, the most popular programming language for interacting with AI models, and Next.js, a React-based framework for building modern web applications.

To get this rolling, you can use the Next.js Flask Starter, which creates a Next.js web application backed by a Python Flask backend API. As an alternative, you can just fast-forward and fork our Code Engine sample https://github.com/IBM/CodeEngine/tree/main/llm-translator-app 😉

Simple translator application deployed on IBM Cloud Code Engine

Conclusion

In this blog post, we explored how to deploy LLMs from Hugging Face to IBM Cloud Code Engine with a few simple steps. We provided sample code for running the model locally, wrapping it in a Python API server, and deploying it as a web application on Code Engine.

With the steps outlined in this post, you should have a clear understanding of how to deploy open-source LLMs to Code Engine. With a few small tweaks, you should be able to adjust this source code to cover your own transformer-based use case.

So go ahead, try it for yourself. To get started, head on over to https://www.ibm.com/products/code-engine.
