
A method of capacity planning for Watson Discovery

By TAKESHI INAGAKI posted Sat June 19, 2021 08:52 AM

  

Determining the system resources you need, such as the number of CPU cores, from a performance requirement, such as data throughput, is not a trivial task. To find out how many CPU cores a system needs to reach a required throughput or, inverting the problem, to predict the throughput of a given number of CPU cores, we run performance measurement tests. However, there is a wide variety of hardware configurations, and observed performance is affected by many unknown factors, so we cannot run measurements for every possible case. The purpose of this post is to explain, using Watson Discovery as an example, a method for learning the performance characteristics of software applied to your own data with a minimum number of measurements.

 

Watson Discovery processes a large number of ingested documents and generates search indexes from them. Indexing throughput varies with the content of the documents, for example the size of the text data in each file and the distribution of words in the text. For that reason, we recommend measuring performance with your own data to determine the precise size of the system needed to meet your throughput requirement.

 

Now we focus on the relation between the number of CPU cores and the data throughput of document indexing on Watson Discovery. Observed throughput is affected by many factors of the system used for the measurement. How can we extract this relation from a single observation? To answer this question, let’s consider another one. If a job takes two hours on a single CPU core, can we expect it to complete within one hour on two CPU cores? The Watson Discovery ingestion process is designed to scale out and process multiple documents in parallel using multiple CPU cores, so the answer is yes. In that case, we can use the number W = [average number of cores utilized] × [duration of document ingestion] as a measure of the workload of the job. The expected duration of document ingestion can then be calculated as [duration] = W ÷ [number of cores].
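
The workload arithmetic can be sketched in a few lines of Python. This is only an illustration of the formulas W = [average cores utilized] × [duration] and [duration] = W ÷ [number of cores]; the function names and the choice of hours as the time unit are assumptions made here, not part of Watson Discovery.

```python
def workload(avg_cores_utilized: float, duration_hours: float) -> float:
    """Return W in core-hours: average cores utilized times ingestion duration."""
    if avg_cores_utilized <= 0 or duration_hours <= 0:
        raise ValueError("cores and duration must be positive")
    return avg_cores_utilized * duration_hours


def expected_duration(w_core_hours: float, cores: float) -> float:
    """Return the expected duration in hours, assuming ingestion scales linearly with cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return w_core_hours / cores


if __name__ == "__main__":
    # The example from the text: a job that takes two hours on a single core.
    w = workload(avg_cores_utilized=1.0, duration_hours=2.0)
    print(w)                        # 2.0 core-hours
    print(expected_duration(w, 2))  # 1.0 hour on two cores
```

The estimate holds only while the linear scale-out assumption holds; a measurement taken when something other than CPU limits ingestion would not transfer this way.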

 

The duration of document ingestion increases linearly with the number of documents ingested or, in other words, with the total size of the data ingested. This gives us another number, C = W ÷ [total data size of ingested documents], which is constant for document ingestion tasks on Watson Discovery as long as the collection configuration, such as the enrichments enabled for the user’s document data, is fixed. This constant makes it possible to calculate the number of CPU cores required to process a given size of ingested data within a fixed duration as [number of cores] = C × [data size] ÷ [duration]. Alternatively, you can calculate the duration of data ingestion for a fixed number of CPU cores and a given data size as [duration] = C × [data size] ÷ [number of cores].
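
As a minimal sketch of how these formulas might be applied, the Python below derives C from one measurement and then sizes a larger ingestion job. All numbers are made up for illustration and are not measured Watson Discovery results; the function names, the units (hours and GB), and the rounding up to whole cores are assumptions made here.

```python
import math


def cost_per_unit(avg_cores: float, duration_hours: float, data_size_gb: float) -> float:
    """Return C in core-hours per GB: C = W / data size, where W = cores x duration."""
    if min(avg_cores, duration_hours, data_size_gb) <= 0:
        raise ValueError("all inputs must be positive")
    return (avg_cores * duration_hours) / data_size_gb


def required_cores(c: float, data_size_gb: float, duration_hours: float) -> int:
    """Return cores needed: C x data size / duration, rounded up to a whole core."""
    return math.ceil(c * data_size_gb / duration_hours)


def estimated_duration(c: float, data_size_gb: float, cores: int) -> float:
    """Return the estimated duration in hours: C x data size / cores."""
    return c * data_size_gb / cores


if __name__ == "__main__":
    # Hypothetical measurement: 6 GB ingested in 3 hours at 4 cores average utilization.
    c = cost_per_unit(avg_cores=4.0, duration_hours=3.0, data_size_gb=6.0)
    print(c)  # 2.0 core-hours per GB

    # Hypothetical target: ingest 30 GB within 8 hours.
    cores = required_cores(c, data_size_gb=30.0, duration_hours=8.0)
    print(cores)                                 # 8 (7.5 rounded up)
    print(estimated_duration(c, 30.0, cores))    # 7.5 hours
```

Because C depends on the documents and on the collection configuration, it should be measured again whenever either changes.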

 


#BuildwithWatsonApps
#EmbeddableAI