Part III : Pattern Induction - Best Practices for Highlighting Examples and Providing Expert Feedback

By Bikalpa Neupane posted Wed November 10, 2021 02:52 PM

  

If you are here, you have probably already read our previous blogs (Part I and Part II) on how to perform pattern induction using Watson Discovery. Here are some general guidelines to follow for a better experience:

1. Highlight examples with relatively few (at most 6) tokens. The tool does not handle patterns of arbitrary length: the time the system needs to learn depends heavily on pattern length, and longer patterns slow the process down. As a result, the tool currently supports only patterns of up to 6 tokens, so for this beta version we recommend highlighting examples of at most 6 tokens to get the best results.

Additionally, it is important to understand Pattern Induction’s tokenization behavior, that is, how it defines the boundaries of a token or a word. Like other text-based AI tools, Pattern Induction has its own definition of a token. Token boundaries depend on the language of the input documents. This release focuses on English, so the examples here are for English documents. In general, token boundaries are determined by whitespace. For instance, the phrase

revenue: 10.5 million dollars

is composed of five tokens: |revenue| |:| |10.5| |million| |dollars|. Ordinary English words are separated by whitespace. The numeric amount is its own token, and each symbol is also its own token. Consider the following phrase:

AMD32: Performance Review

There are five tokens in total, since the system splits “AMD32:” into three tokens: |AMD| |32| |:|. Pattern Induction breaks text into tokens along runs of consecutive numerical characters, runs of consecutive alphabetical characters, and individual symbol characters. Note that “32” here is a plain run of digits, not a numeric amount written with a decimal point or with commas (as in large numbers, e.g., “4,927,535”).
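The tokenization rules above can be approximated with a short regular expression. The following Python sketch is only an illustration and not Pattern Induction's actual tokenizer: the names `tokenize` and `within_token_limit`, the regular expression itself, and the choice to keep comma-grouped numbers as a single token are all assumptions made for this example.

```python
import re
from typing import List

# Rough approximation of the behavior described above (an assumption, not the
# tool's real implementation). Alternatives are tried left to right:
#   1. a numeric amount, optionally with comma groups and a decimal part
#   2. a run of consecutive ASCII letters
#   3. a single non-whitespace symbol character
TOKEN_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z0-9]")

MAX_TOKENS = 6  # the length limit stated in guideline 1


def tokenize(text: str) -> List[str]:
    """Split English text into approximate Pattern Induction tokens."""
    return TOKEN_RE.findall(text)


def within_token_limit(text: str, limit: int = MAX_TOKENS) -> bool:
    """Return True if the example has at most `limit` tokens."""
    return len(tokenize(text)) <= limit


print(tokenize("revenue: 10.5 million dollars"))
# ['revenue', ':', '10.5', 'million', 'dollars']
print(tokenize("AMD32: Performance Review"))
# ['AMD', '32', ':', 'Performance', 'Review']
```

A check like `within_token_limit` can be run over candidate examples before highlighting them, to spot any that are likely too long for the tool.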

2. Highlight examples that belong to the same concept. For instance, when you highlight the following examples:

revenue: 10 million dollars 

income: 3.2 thousand dollars

we do not recommend highlighting the following at the same time:

5 December 2025 

If you want to capture a different concept (e.g., dates vs revenue in this case), you can start a new Pattern Induction session and create a separate extractor for the new concept by providing appropriate examples.

3. Highlight examples with similar patterns. To illustrate pattern similarity, suppose you highlight the following two examples:

  • revenue: 10 million dollars
  • revenue: $15.5 thousand

The pattern can be described at a high level: extract the token “revenue” followed by a colon and a currency amount. Pattern Induction will learn a rule that captures such text in the rest of the documents. However, you must avoid highlighting text that does not match the pattern of the already-highlighted examples, such as:

revenue and income in 2010: an estimate of 10.5 million dollars

In this example, the underlying pattern is much longer and more complex than the patterns derived from the shorter examples. The longer pattern requires a set of tokens that includes both “revenue” and “income”, followed by the year and a colon. Next comes a set of tokens for “an estimate of”, and finally the example ends with the currency amount.

Providing such examples makes it challenging for Pattern Induction to create a rule that generalizes both shorter examples and longer examples. Moreover, the number of tokens is significantly larger than in the other two shorter examples. It is important to note that the system works best when the highlighted examples have roughly the same length (in terms of tokens). Although the system employs heuristics to learn rules capturing examples of slightly different lengths, it is generally difficult to infer rules if there is a big difference in the length of examples.
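As a rough way to spot examples whose length is out of line with the rest, you can count tokens and compare each example against the typical length. The sketch below is an illustration only: the tokenizer is an approximation of the behavior described in guideline 1, and `length_outliers` with its `tolerance` of 2 tokens is a hypothetical helper and threshold, not something Pattern Induction exposes.

```python
import re
from statistics import median
from typing import List

# Approximate tokenizer (assumption): numeric amounts, letter runs, single symbols.
TOKEN_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z0-9]")


def token_count(text: str) -> int:
    return len(TOKEN_RE.findall(text))


def length_outliers(examples: List[str], tolerance: int = 2) -> List[str]:
    """Return examples whose token count differs from the median by more than `tolerance`."""
    if not examples:
        return []
    counts = [token_count(e) for e in examples]
    typical = median(counts)
    return [e for e, n in zip(examples, counts) if abs(n - typical) > tolerance]


examples = [
    "revenue: 10 million dollars",   # 5 tokens
    "revenue: $15.5 thousand",       # 5 tokens
    "revenue and income in 2010: an estimate of 10.5 million dollars",  # 12 tokens
]
print(length_outliers(examples))
# ['revenue and income in 2010: an estimate of 10.5 million dollars']
```

Here the two short examples have 5 tokens each, while the long one has 12, which is also well above the 6-token limit.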

While we advise highlighting examples with similar patterns, we would also like to emphasize that Pattern Induction is capable of learning variations of a pattern, which are small tweaks to it. Variations are mostly at the token level, as illustrated by the following variation of the above pattern, which uses the token “income”:

income: 3.2 thousand dollars

4. When the texts you want to capture come in several variations of a pattern, you must highlight an example for each variation. Pattern Induction can learn only the patterns present in the examples you provide. So, if you wish to extract texts containing either “revenue” or “income”, you must highlight one example for each variation:

  • revenue: 10 million dollars
  • income: 3.2 thousand dollars

Note that these examples share a similar pattern, in which the currency amount appears after the token “revenue” or “income”.

5. Highlight examples that are missing from the examples extracted by Pattern Induction. After Pattern Induction extracts texts, inspect the extracted examples in the review pane. If you notice that an intended extraction is missing, you must highlight it in your document so that Pattern Induction learns not to miss it in the following iterations. For instance, if Pattern Induction extracted all the texts with the token “income” but none with the token “revenue”, double-check whether you have highlighted an example containing “revenue”. Pattern Induction lists all the examples it learns from in the review pane, where you can confirm whether an example variation is missing.
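The check described above, whether every variation you care about has at least one highlighted example, can be sketched in a few lines of Python. This is a hypothetical helper for your own bookkeeping, not part of Watson Discovery: `missing_variations` and its keyword-based notion of a "variation" are assumptions made for illustration.

```python
import re
from typing import Iterable, List


def missing_variations(highlighted: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Return the keywords that do not occur as a word in any highlighted example."""
    # Collect every alphabetic word seen across the highlighted examples.
    seen = set()
    for example in highlighted:
        seen.update(re.findall(r"[A-Za-z]+", example.lower()))
    return [k for k in keywords if k.lower() not in seen]


highlighted = ["income: 3.2 thousand dollars"]
print(missing_variations(highlighted, ["revenue", "income"]))
# ['revenue']
```

In this case the result suggests highlighting an example that contains “revenue” before the next iteration.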

6. Reject an extracted example that is only partially correct, even if it contains the desired substring. While it may be tempting to accept extracted examples that are partially correct, you must reject them. Rejecting partially correct extractions guides Pattern Induction to generate much more specific rules that match the desired extractions. In the following extracted examples, Pattern Induction incorrectly includes the currency amount from the sentence following the desired extraction, as shown in the red text.



Although it contains a substring that is the desired extraction, the entire extraction is incorrect. In this case, Pattern Induction generated a somewhat general rule that may have caused it to also extract the currency amounts in the following sentence. To guide the learner to a more specific rule, you must reject the extracted example, which you can do directly in the review pane:

income: $4 billion. 6 billion dollars → Reject through the review pane
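The decision rule in this guideline is exact match, not containment. The following Python sketch illustrates the difference; `review_decision` is a hypothetical function written for this post, since the actual accept and reject actions are performed by hand in the review pane.

```python
def review_decision(extracted: str, desired: str) -> str:
    """Accept only when the extraction is exactly the desired text."""
    return "accept" if extracted.strip() == desired.strip() else "reject"


desired = "income: $4 billion"
extracted = "income: $4 billion. 6 billion dollars"

# The extraction contains the desired substring...
print(desired in extracted)                 # True
# ...but it is still rejected because it is not an exact match.
print(review_decision(extracted, desired))  # reject
print(review_decision(desired, desired))    # accept
```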

We understand that patterns that require understanding word meanings or semantics can be challenging for the current version of Pattern Induction to learn. Currently, Pattern Induction creates dictionaries based on similar patterns in the tokens. In the future, we plan to enable users to upload their own dictionaries that Pattern Induction can use when generating patterns. Stay tuned!

Authors: Dr Maeda Hanafi, Dr Yannis Katsis, Dr Yunyao Li, Dr Bikalpa Neupane

