Get started with using IBM embedding models with watsonx.data

By Katherine Ciaravalli posted Mon July 01, 2024 03:34 PM

IBM now has several supported embedding models available with watsonx.ai. Embedding models are encoder-only foundation models that create text embeddings. A text embedding encodes the meaning of a sentence or passage in an array of numbers known as a vector.
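To make the idea of a vector concrete, here is a minimal sketch using made-up three-dimensional vectors (real embeddings from these models have hundreds of dimensions). The vectors below are invented for illustration and are not output from a watsonx.ai model. The sketch uses Euclidean (L2) distance, the same metric chosen for the index later in this tutorial, to show that passages with similar meaning should end up close together.

```python
import math

# Toy vectors, invented for illustration only (not real model output).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# math.dist computes the Euclidean (L2) distance between two points.
near = math.dist(cat, kitten)
far = math.dist(cat, invoice)

# A smaller distance means the two texts are closer in meaning.
print(near < far)
```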

The following embedding models are available in watsonx.ai:

slate-30m-english-rtrvr
slate-125m-english-rtrvr
all-minilm-l12-v2
multilingual-e5-large

IBM watsonx.data can now store and search vectors using the embedded open source Milvus service. Before loading content into the watsonx.data vector database, you need to create embeddings for it with an embedding model such as the ones listed above. Here is a short tutorial with Python code snippets to get you started:

Establish your watsonx.ai Embeddings object:
- Note: you will need a watsonx.ai API key to access these models

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import Embeddings
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames as EmbedParams
from ibm_watsonx_ai.foundation_models.utils.enums import EmbeddingTypes

embed_params = {
    # Maximum number of tokens kept per input; longer inputs are truncated.
    # 512 is assumed here as the model's input limit, check the model card.
    EmbedParams.TRUNCATE_INPUT_TOKENS: 512,
    EmbedParams.RETURN_OPTIONS: {
        'input_text': True
    }
}

embedding = Embeddings(
    model_id=EmbeddingTypes.IBM_SLATE_30M_ENG,
    params=embed_params,
    credentials=Credentials(
        api_key="<YOUR watsonx.ai CLOUD API KEY>",
        url="https://us-south.ml.cloud.ibm.com"),
    # Inside a watsonx.ai project notebook, insert the project token first;
    # that makes the `project` object used here available.
    project_id=project.project_context.projectID
)

Establish a list of chunked text you want to create embeddings for:
- Note: this text can come from other data stores connected to watsonx.data, from files, or from websites. The embed_documents function used below expects the text as a list of strings.

chunked_text = ["sample chunked text",
                "these chunks can be short or long",
                "the recommended chunking size can depend ",
                "on the type of embedding model",
                "typically chunks are between ",
                "100 and 500 characters long "]
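If your source text is one long string, a small helper can produce a list like the one above. The following is a minimal sketch, not the method used in the original post: chunk_text and its max_chars parameter are hypothetical names, and it simply packs whole words into chunks of at most 500 characters. Real pipelines often split on sentences or paragraphs instead.

```python
def chunk_text(text, max_chars=500):
    """Pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars is kept whole as its own chunk.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            # The word still fits, or it starts a new (possibly oversized) chunk.
            current = candidate
        else:
            # The chunk is full: store it and start a new one with this word.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "Embedding models turn text into vectors. " * 30
sample_chunks = chunk_text(sample, max_chars=500)
print(len(sample_chunks), max(len(c) for c in sample_chunks))
```

The resulting list of strings can be passed wherever chunked_text is used in this tutorial.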

Create the embeddings for the chunks of text using the Embeddings method “embed_documents”:

embedding_vectors = embedding.embed_documents(texts=chunked_text)
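Before moving on, it can help to confirm that you got one vector per chunk and that each vector has the dimension your Milvus collection expects (384 in this tutorial's schema). The helper below is a hypothetical sketch, not part of the watsonx.ai SDK, and it is shown with toy data so that it runs on its own.

```python
def check_embeddings(texts, vectors, expected_dim=384):
    """Raise ValueError if the vectors do not line up with the texts.

    Returns the number of vectors when everything is consistent.
    """
    if len(texts) != len(vectors):
        raise ValueError(
            f"expected {len(texts)} vectors, got {len(vectors)}")
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has dimension {len(vec)}, expected {expected_dim}")
    return len(vectors)


# Toy stand-ins for chunked_text and embedding_vectors.
toy_texts = ["first chunk", "second chunk"]
toy_vectors = [[0.0] * 384, [0.1] * 384]
print(check_embeddings(toy_texts, toy_vectors))
```

In the tutorial itself the call would be check_embeddings(chunked_text, embedding_vectors).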

Establish a connection to the watsonx.data Milvus service:

from pymilvus import connections, FieldSchema, DataType, Collection, CollectionSchema

# milvus_service is assumed to be a dict holding the connection details
# of your watsonx.data Milvus service
url = milvus_service['host']
port = milvus_service['port']
apikey = milvus_service['password']
apiuser = 'ibmlhapikey'

connections.connect(alias="default",
                    host=url,
                    port=port,
                    user=apiuser,
                    password=apikey,
                    secure=True)

Create a new collection to store your content and embeddings:

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),  # Primary key
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=6000),
    FieldSchema(name="source_title", dtype=DataType.VARCHAR, max_length=200),
    # dim must match the output dimension of the embedding model
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
]

schema = CollectionSchema(fields, "collection_description")

my_collection = Collection("collection_name", schema)

# Create index
index_params = {
    'metric_type': 'L2',
    'index_type': 'IVF_FLAT',
    'params': {'nlist': 2048}
}

my_collection.create_index(field_name="embedding", index_params=index_params)

Load the data and embeddings into the watsonx.data vector database:

my_collection = Collection("collection_name")

# Columns in schema order; the auto-generated "id" field is omitted
data = [
    chunked_text,
    chunks_title,  # list of short titles describing the associated source of the chunk, same length as chunked_text
    embedding_vectors
]

out = my_collection.insert(data)  # load the data into the Milvus service
my_collection.flush()  # ensure the data is persisted
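The insert expects three lists of equal length, in the same order as the non-auto fields of the schema. The sketch below shows one way to build and sanity-check those columns before inserting. build_insert_columns is a hypothetical helper, the sample title is invented, and measuring the VARCHAR limits in UTF-8 bytes is an assumption about how Milvus counts max_length.

```python
def build_insert_columns(chunks, titles, vectors,
                         max_content_bytes=6000, max_title_bytes=200):
    """Return [content, source_title, embedding] columns after basic checks."""
    if not (len(chunks) == len(titles) == len(vectors)):
        raise ValueError("chunks, titles, and vectors must have the same length")
    for chunk, title in zip(chunks, titles):
        # Limits mirror the max_length values in the collection schema.
        if len(chunk.encode("utf-8")) > max_content_bytes:
            raise ValueError("a chunk exceeds the 'content' field max_length")
        if len(title.encode("utf-8")) > max_title_bytes:
            raise ValueError("a title exceeds the 'source_title' field max_length")
    return [list(chunks), list(titles), list(vectors)]


# Toy stand-ins; in the tutorial these would be chunked_text and embedding_vectors.
toy_chunks = ["sample chunked text", "these chunks can be short or long"]
toy_titles = ["Example source document"] * len(toy_chunks)  # invented title
toy_vectors = [[0.0] * 384 for _ in toy_chunks]

columns = build_insert_columns(toy_chunks, toy_titles, toy_vectors)
print(len(columns), len(columns[0]))
```

The returned list has the same shape as the data list passed to insert above.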

Check to ensure your data has been properly loaded:

basic_collection = Collection("collection_name") 

basic_collection.num_entities

Congratulations! You have now used the watsonx.ai embedding model slate-30m-english-rtrvr to create embeddings and loaded them into the watsonx.data vector database using Python.

Your content is now chunked, vectorized, and ready to be searched before the results are used to prompt a large language model.
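As a possible next step, a search could look like the sketch below. It is an illustration under stated assumptions rather than part of the original tutorial: search_chunks is a hypothetical helper, it expects the pymilvus Collection and the watsonx.ai Embeddings object created earlier, and the nprobe value of 10 is an arbitrary starting point. The query is embedded with the same model, then compared against the stored vectors using the L2 metric the index was built with.

```python
def search_chunks(collection, embedder, question, top_k=3):
    """Return (distance, content, source_title) tuples for the closest chunks."""
    # Embed the question with the same model used for the stored chunks.
    query_vector = embedder.embed_query(text=question)

    # A collection must be loaded into memory before it can be searched.
    collection.load()

    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["content", "source_title"],
    )

    # results[0] holds the hits for the single query vector.
    return [
        (hit.distance, hit.entity.get("content"), hit.entity.get("source_title"))
        for hit in results[0]
    ]
```

With the objects from this tutorial, the call would be search_chunks(my_collection, embedding, "your question here"). The returned content strings can then be placed into a prompt.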

For more information on IBM embedding models and slate-30m-english-rtrvr, check out these sources:

- https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models-embed.html?context=wx#ibm-provided

- https://ibm.github.io/watsonx-ai-python-sdk/fm_embeddings.html

- https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-slate-30m-english-rtrvr-model-card.html?context=wx&audience=wdp
