watsonx.ai


Deploying custom foundation models to watsonx.ai

By Saloni Saluja posted Thu March 28, 2024 01:17 PM

  


Since its release, IBM watsonx.ai has been enabling businesses to train, validate, tune, and deploy AI models in a fraction of the time and with a fraction of the data. Starting with the watsonx.ai 1.1.4 software release, you can do even more with the enterprise studio: upload and deploy your own custom foundation models.

There are many reasons to import a custom foundation model, all driven by the unique needs of your organization. Ultimately, it comes down to needing a specific foundation model that is optimal for the task at hand but resides outside of watsonx.ai. For instance, you may need support for a language that is not currently available in the watsonx.ai foundation model library. Or, your organization may have invested resources to fine-tune a model to optimize it for your specific industry or business domain. This "bring your own model" approach provides greater flexibility in how you select and use the right foundation model for your specific generative AI use case.

In this article, we will describe how to install and deploy a custom foundation model in watsonx.ai (version 1.1.4 or later) using the example of a foundation model from Hugging Face. 

 

What kind of model can I deploy? 

 

A custom foundation model:  

  • must be built with an architecture supported by watsonx.ai (see the supported architectures in the watsonx.ai documentation) 

  • can be the original base version or a fine-tuned version 

  • must include a config.json file  

The watsonx.ai documentation provides a table of the architectures, quantization methods, and parallel tensors that are supported for each architecture. 

You must check the architecture type of a model before registering the model with watsonx.ai. In this article, we use the Falcon-40b model obtained from Hugging Face as an example. To check the architecture type for that model, follow these steps: 

  1. Open the Falcon-40b model on the Hugging Face website and click Files and Versions. 

  2. Open the config.json file and check the model_type field. For this model, the value shows that it is built with the supported Falcon architecture.   
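
If you prefer to check from a terminal, you can fetch the raw config.json and inspect the model_type field directly. A minimal sketch, assuming the publicly hosted tiiuae/falcon-40b repository on Hugging Face (the exact value depends on the repository revision):

    curl -s https://huggingface.co/tiiuae/falcon-40b/raw/main/config.json | grep model_type
    # Illustrative output: "model_type": "falcon",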
     
     

Deployment process overview  

 

Whether you are deploying a model obtained from Hugging Face or a model from your own environment, the steps are similar. In this article, we will walk through the deployment process for a model obtained from Hugging Face. Hugging Face is one of the largest repositories of open-source models and a place where you can find a wide variety of foundation models. If you select a model from Hugging Face that suits your needs, you can manually import and deploy it into watsonx.ai, provided it is based on one of the supported architecture types. You can then use the Prompt Lab or the API to inference the model and embed it into your generative AI applications, just as you would any other model that is available off the shelf with watsonx.ai. 

 

The variations for deploying a model from your own environment are described in the product documentation. In this example, we will install the Falcon-40b model obtained from Hugging Face. 

 

To deploy a custom foundation model, you must have administrator rights on the underlying cluster where watsonx.ai is installed.  

 

The steps to deploy a custom model are divided between system administrator tasks and watsonx.ai user tasks, as follows: 

Admin tasks: 

  • Task 1: Prepare the model and upload it to PVC storage 

  • Task 2: Register the model with watsonx.ai 

watsonx.ai user tasks: 

  • Task 3: Create the deployment for the custom model 

  • Task 4: Prompt the custom foundation model 

Prerequisites   

Before you start deploying your custom model to watsonx.ai, make sure you follow these guidelines:  

  • You must have Git Large File Storage (Git LFS) installed on your cluster. To install Git LFS, follow the steps described in the GitHub guide to installing Git LFS (a quick verification is shown after this list). 

  • You must get the digest for the fmaas-runtime-wisdom-ansible image from the case bundle ibm-watsonx-ai-ifm/inventory/watsonxaiifm/resources.yaml. 

  • If your model is hosted on Hugging Face, you must have a Hugging Face account. To create a new account, go to the Hugging Face website. After creating a new account, generate a new Hugging Face token. To generate a token, see Hugging Face's guide to creating a token. 
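
Before you continue, it can help to confirm that Git LFS is set up on the machine where you will clone the model. A minimal sketch (output varies by installed version):

    git lfs version      # prints the installed Git LFS version
    git lfs install      # initializes Git LFS for your user; safe to re-run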

 

How to deploy a custom foundation model to watsonx.ai  

Step 1: Prepare to deploy a custom foundation model 

 

In this first task, you will set up storage, clone the model, prepare the model, create the PVC, and run the job to upload the model to the PVC storage.  

 

To set up storage and upload a model that is located on Hugging Face, follow these steps: 

 

  1. Find the name of the model on the Hugging Face website.

  2. Set up basic environment variables: 

    export MODEL_NAME="<Hugging Face model name>" 
    export HF_TOKEN="<Hugging Face token>"
    export MODEL_PATH="<Path to the directory where you want to download your model>"
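    For example, to download the Falcon-40b model used in this article (illustrative values; substitute your own token and target directory):

    export MODEL_NAME="tiiuae/falcon-40b"
    export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"
    export MODEL_PATH="/root/models/falcon-40b"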
  3. Clone the model: 

    git clone --no-checkout https://<Hugging Face username>:<Hugging Face token>@huggingface.co/${MODEL_NAME} 

After cloning the model, check the name of the created folder. 

  4. Navigate to the folder that contains the model and then check the model size:

    cd <folder that contains the cloned model> 
    git lfs ls-files -s
    Example output:
    root@wmlubntu1:~/falcon-40b# git lfs ls-files -s 
    1c4b989693 - pytorch_model-00001-of-00002.bin (10 GB)
    11822397cd - pytorch_model-00002-of-00002.bin (4.5 GB)

    Calculate the total size of the model and add a 100% buffer to the result. For example, if the model size is 14.5 GB, the size of the PVC that you create is 29 GB. 

  5. Create the pvc.yaml file that contains the details of a PVC to upload your model to: 

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: <pvc-name>
      namespace: <namespace-name>
    spec:
      storageClassName: <storage-class>
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: <model size + buffer>Gi

    For example:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-copy-pvc
      namespace: cpd-instance
    spec:
      storageClassName: ocs-storagecluster-cephfs
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 29Gi
  6. Create the PVC in your cluster:

    oc create -f pvc.yaml

  • After creating the PVC, wait for two minutes and then run this command to verify that the PVC is bound:

    oc get pvc <pvc-name> -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.phase}'
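  • Expected output once the claim is bound (illustrative):
    Bound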
  7. Encode your Hugging Face token to base64:

    echo -n ${HF_TOKEN} | base64
  8. Create the secret.yaml file: 

    apiVersion: v1
    kind: Secret
    metadata:
      name: <secret-name>
      namespace: ${PROJECT_CPD_INST_OPERANDS}
    type: Opaque
    data:
      TOKEN: <base64-encoded Hugging Face token>
  9. Create the secret in your cluster:

    oc apply -f secret.yaml
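
  • Optionally, verify that the secret exists (a quick check):

    oc get secret <secret-name> -n ${PROJECT_CPD_INST_OPERANDS}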

 

  10. Get the digest for the fmaas-runtime-wisdom-ansible image: 

    oc get pods -n ${PROJECT_CPD_INST_OPERATORS} | grep ibm-cpd-watsonx-ai-ifm-operator 
    oc exec -it <ibm-cpd-watsonx-ai-ifm-operator-pod-from-last-cmd> -n ${PROJECT_CPD_INST_OPERATORS} -- /bin/bash
    cat /opt/ansible/8.4.0/digests.yaml | grep -A3 fmaas_runtime_wisdom_ansible_image:

  • For example:

      fmaas_runtime_wisdom_ansible_image:
        name: fmaas-runtime-wisdom-ansible@sha256
        tag: 2cc673a6066cab686eed7b8e6998bc453a49be4ccb993674ab0ff81f099f807f
        tag_metadata: 0.39.0_ubi9_py311_cpd 

  11. Create a job that downloads the model from Hugging Face and then (if needed) converts it to the safetensors and fast-tokenizer formats:

    First, create the job.yaml file:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: <job-name>
      namespace: ${PROJECT_CPD_INST_OPERANDS}
    spec:
      template:
        spec:
          containers:
          - name: models-convertor
            image: cp.icr.io/cp/cpd/fmaas-runtime-wisdom-ansible@sha256:<tag>
            env:
            - name: MODEL_PATH
              value: <path where the PVC is mounted, for example /model>
            - name: MODEL_NAME
              value: <Hugging Face model name from step 2>
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: <secret-name>
                  key: TOKEN
            command: ["/bin/sh", "-c"]
            args:
            - |
              huggingface-cli login --token ${HF_TOKEN}
              huggingface-cli download ${MODEL_NAME} --local-dir ${MODEL_PATH} --cache-dir ${MODEL_PATH}
              text-generation-server convert-to-safetensors ${MODEL_PATH}
              text-generation-server convert-to-fast-tokenizer ${MODEL_PATH}
            volumeMounts:
            - mountPath: /model
              name: byom-model
          restartPolicy: Never
          volumes:
          - name: byom-model
            persistentVolumeClaim:
              claimName: <pvc-name>
  • For example: 

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: model-copy-job
      namespace: cpd-instance
    spec:
      template:
        spec:
          containers:
          - name: models-convertor
            image: cp.icr.io/cp/cpd/fmaas-runtime-wisdom-ansible@sha256:2cc673a6066cab686eed7b8e6998bc453a49be4ccb993674ab0ff81f099f807f
            env:
            - name: MODEL_PATH
              value: "/model"
            - name: MODEL_NAME
              value: "google/mt5-base"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: TOKEN
            command: ["/bin/sh", "-c"]
            args:
            - |
              huggingface-cli login --token ${HF_TOKEN}
              huggingface-cli download ${MODEL_NAME} --local-dir ${MODEL_PATH} --cache-dir ${MODEL_PATH}
              text-generation-server convert-to-safetensors ${MODEL_PATH}
              text-generation-server convert-to-fast-tokenizer ${MODEL_PATH}
            volumeMounts:
            - mountPath: /model
              name: byom-model
          restartPolicy: Never
          volumes:
          - name: byom-model
            persistentVolumeClaim:
              claimName: model-copy-pvc 
  12. Create and run the job in your cluster: 

    oc apply -f job.yaml

  • Verify that the job was created: 

    oc get job <job-name> -n <pvc-namespace>

  • Expected output: 

    NAME         COMPLETIONS   DURATION   AGE
    <job-name>   1/1           xx         xx

  • Check the job status: 

    oc get job <job-name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}'

  • Expected output:

    True
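
  • If the job does not reach the Complete status, you can inspect the logs of the pod created by the job to see where the download or conversion failed (a general troubleshooting sketch):

    oc logs job/<job-name> -n <namespace>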
     

Step 2: Register the model with watsonx.ai 

Once you have uploaded your custom foundation model, you must register the model to make it available with watsonx.ai. 

To register the model with watsonx.ai, follow these steps: 

  1. Log in to OpenShift and then edit the CR: 

    oc edit Watsonxaiifm
  2. Add a model entry under spec.custom_foundation_models and enter the following details: 

  • For example:

        apiVersion: watsonxaiifm.cpd.ibm.com/v1beta1
        kind: Watsonxaiifm
        metadata:
          name: watsonxaiifm-cr
        ......
        spec:
          ignoreForMaintenance: false
          .......
          custom_foundation_models:
          - model_id: example_model_70b
            location:
              pvc_name: example_model_pvc
            tags:
            - example_model
            - 70b
            parameters:
            - name: dtype
              default: float16
              options:
              - float16
              - bfloat16
            - name: max_batch_size
              default: 256
              min: 16
              max: 512
            - name: max_concurrent_requests
              default: 64
              min: 0
              max: 128
            - name: max_sequence_length
              default: 2048
              min: 256
              max: 8192
            - name: max_new_tokens
              default: 2048
              min: 512
              max: 4096
          - model_id: example_model_13b
            location:
              pvc_name: example_model_pvc_13b
Step 3: Create the deployment for your custom model 

 

After the model is stored and registered, you can add the model asset to a space, and then create an online deployment.  

 

Note: The steps in this section describe how to create the deployment using API commands. To create the deployment from a deployment space, see Creating a deployment from space.  

 

To create a deployment programmatically, you must first get the model asset ID and then create the deployment.  

View available custom foundation models 

Begin by listing the custom foundation models that are available for deployment in watsonx.ai. To view the list by using the watsonx.ai API, run this command: 

 

curl --location 'https://<cluster_url>/ml/v4/custom_foundation_models' \
--header "Authorization: Bearer $TOKEN"

 

Example output: 

{
    "first": {
        "href": "/ml/v4/custom_foundation_models?limit=100"
    },
    "limit": 100,
    "resources": [
        {
            "model_id": "example_model_13b",
            "parameters": [
                {
                    "default": "float16",
                    "display_name": "Data Type",
                    "name": "dtype",
                    "options": [
                        "float16",
                        "bfloat16"
                    ],
                    "type": "string"
                },
                {
                    "default": 256,
                    "display_name": "Max Batch Size",
                    "name": "max_batch_size",
                    "type": "number"
                },
                {
                    "default": 1024,
                    "display_name": "Max Concurrent Requests",
                    "name": "max_concurrent_requests",
                    "type": "number"
                },
                {
                    "default": 2048,
                    "display_name": "Max New Tokens",
                    "name": "max_new_tokens",
                    "type": "number"
                },
                {
                    "default": 2048,
                    "display_name": "Max Sequence Length",
                    "name": "max_sequence_length",
                    "type": "number"
                }
            ]
        },
        {
            "model_id": "example_model_70b",
            "parameters": [
                {
                    "default": "float16",
                    "display_name": "Data Type",
                    "name": "dtype",
                    "options": [
                        "float16",
                        "bfloat16"
                    ],
                    "type": "string"
                },
                {
                    "default": 256,
                    "display_name": "Max Batch Size",
                    "max": 512,
                    "min": 16,
                    "name": "max_batch_size",
                    "type": "number"
                },
                {
                    "default": 64,
                    "display_name": "Max Concurrent Requests",
                    "max": 128,
                    "min": 0,
                    "name": "max_concurrent_requests",
                    "type": "number"
                },
                {
                    "default": 2048,
                    "display_name": "Max New Tokens",
                    "max": 4096,
                    "min": 512,
                    "name": "max_new_tokens",
                    "type": "number"
                },
                {
                    "default": 2048,
                    "display_name": "Max Sequence Length",
                    "max": 8192,
                    "min": 256,
                    "name": "max_sequence_length",
                    "type": "number"
                }
            ],
            "tags": [
                "example_model",
                "70b"
            ]
        }
    ],
    "total_count": 2
}

 

Note: When you access the model list programmatically, the response includes all the parameters that you can set for each model. For models deployed through the UI, the parameters are available when you create the online deployment. See the description of parameters for custom foundation models. 

The next step is to create the custom foundation model asset. You can create a model asset in two contexts: project context and space context.  

  • If you create the model asset in a project context, you can import the model to your project and then promote it to a deployment space. 

  • If you create the model asset in a space context, you can import the model and then deploy it online. A model deployed from a space is also accessible from the Prompt Lab in the project scope. 

 

To create a model asset for your custom foundation model in space context, run this code: 

curl -X POST "https://<cluster_url>/ml/v4/models?version=2024-01-29" \
-H "Authorization: Bearer $TOKEN" \
-H "content-type: application/json" \
--data '{
            "name": "<a meaningful name>",
            "space_id": "<your space id>",
            "foundation_model": {
              "model_id": "<your model id>"
            },
            "type": "custom_foundation_model_1.0",
            "software_spec": {
              "name": "watsonx-cfm-caikit-1.0"
            }
        }'

 

Note: The model type must be custom_foundation_model_1.0. The software specification name must be watsonx-cfm-caikit-1.0. You cannot customize the software specification. 
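
The response to this call includes the ID of the new model asset, which you will need when you create the deployment. Assuming the response follows the usual watsonx.ai v4 layout, where the new asset ID is returned under metadata.id, and that jq is available, you can capture it like this (an illustrative sketch):

curl -s -X POST "https://<cluster_url>/ml/v4/models?version=2024-01-29" \
-H "Authorization: Bearer $TOKEN" \
-H "content-type: application/json" \
--data '{"name": "<a meaningful name>", "space_id": "<your space id>", "foundation_model": {"model_id": "<your model id>"}, "type": "custom_foundation_model_1.0", "software_spec": {"name": "watsonx-cfm-caikit-1.0"}}' \
| jq -r '.metadata.id'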

Create online deployment

When the custom foundation model asset has been created, you are ready to create the online deployment.  

Here is an example that creates a deployment with some of the parameters overridden. The inline // comments are explanatory only; remove them before running the command. 

curl -X POST "https://<cluster_url>/ml/v4/deployments?version=2024-01-29" \
-H "Authorization: Bearer $TOKEN" \
-H "content-type: application/json" \
--data '{
  "asset":{
    "id": "<your custom foundation model asset id>"   // WML custom foundation model asset
  },
  "online":{
    "parameters":{
      "serving_name":"test_custom_fm",
      "foundation_model": {
           "max_sequence_length": 4096
      }
    }
  },
  "hardware_spec": {                        // Only one of "id" or "name" must be set.
    "id": "<your custom hardware spec id>",
    "num_nodes": 1
  },
  "description": "Testing deployment using custom foundation model",
  "name":"custom_fm_deployment",
  "project_id": "<your project id>"         // Either "project_id" or "space_id". Only one is allowed.
}'


View the status for a deployment 

You can view the status for an existing deployment by running this command: 

curl -X GET "https://<cluster_url>/ml/v4/deployments/<your deployment ID>?version=2024-01-29&project_id=<your project ID>" \
-H "Authorization: Bearer $TOKEN"

Note: The deployed_asset_type is returned as custom_foundation_model. 

Step 4: Run a prompt using your custom model  

Now that you have stored and deployed your custom foundation model, you can start using it. You can use the Prompt Lab to prompt the model and generate responses or create a prompt programmatically. 

To run a prompt with the custom model using the API, run this code:  

curl -X POST "https://<cluster_url>/ml/v1/deployments/<your deployment ID>/text/generation?version=2024-01-29" \
-H "Authorization: Bearer $TOKEN" \
-H "content-type: application/json" \
--data '{
 "input": "Hello, what is your name",
 "parameters": {
    "max_new_tokens": 200,
    "min_new_tokens": 20
 }
}'
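
The generated text is returned in the response body. Assuming the response follows the usual watsonx.ai text generation layout, where the output appears under results[0].generated_text, you can extract just the text with jq (an illustrative sketch):

curl -s -X POST "https://<cluster_url>/ml/v1/deployments/<your deployment ID>/text/generation?version=2024-01-29" \
-H "Authorization: Bearer $TOKEN" \
-H "content-type: application/json" \
--data '{"input": "Hello, what is your name", "parameters": {"max_new_tokens": 200, "min_new_tokens": 20}}' \
| jq -r '.results[0].generated_text'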

 Congratulations, your model is now fully deployed and ready to use! 

Summary 

By deploying a custom foundation model to watsonx.ai, you are able to work with a model that best fits your project and business needs. A custom model can be any model that is built with an architecture supported by watsonx.ai, which greatly expands your options and flexibility in terms of the models that best fit your specific use case.   

In this article, we covered the steps necessary to install and deploy a custom foundation model, using the example of a foundation model from Hugging Face. To learn about other deployment options, see the Deploying foundation model documentation. 


#watsonx.ai
#MachineLearning
#PromptLab
#GenerativeAI
