NLU - Entities & Relations - Distribution of Training Data

1. NLU - Entities & Relations - Distribution of Training Data

Alessandro Vignazia
Posted Mon April 04, 2022 09:57 AM

Hi there,

Should we train our custom natural language understanding model (for extracting entities & relations) in Watson Knowledge Studio either:

a) To mirror how the data will look at inference (production run) time in the real world, where some document formats / types and data attributes appear far more often than others, e.g. a particular competitor's insurance schedule formats vs a smaller competitor's quotations; OR

b) With "balanced" training data, so that the number and type of documents are effectively equal across the competitors, document types (quotation, schedule, endorsement, etc.) and ultimately mentions (annotations)?

If "b" above is the correct approach (according to our research), must we use real-world training data, or can we create synthetic data? If we do the latter, must we ensure that 1) the data attributes themselves are varied (patterns etc.) and 2) the document types, formats and competitors are again equally balanced? I.e. we cannot create 300 dummy documents all for one competitor document type...

Thank you
Alessandro

------------------------------
Alessandro Vignazia
------------------------------

#BuildwithWatsonApps
#EmbeddableAI
2. RE: NLU - Entities & Relations - Distribution of Training Data

Daniel Toczala
Posted Mon April 04, 2022 10:06 AM

Alessandro,

Some very good questions. In terms of the mix of data to use for training, in general, you need to have something that reflects reality. In terms of distribution of broad types of data, it is best to approximate what a real distribution of data might be. In terms of the data to use, you should always look to use real-world data. Synthetic data always leads to poorer results. You need some good examples from each broad category, and you need to make sure that you make provisions to retrain your model, and get user feedback on the model performance from your end-users.
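One way to check the advice above before training is to compare the mix of document types in the training set against the mix expected in production, and to flag any type with too few examples. The following is only a minimal sketch: the document type names, the expected shares, the `min_examples` threshold and the `distribution_gaps` helper are all hypothetical and are not part of Watson Knowledge Studio.

```python
from collections import Counter


def distribution_gaps(train_labels, expected_share, min_examples=20):
    """Compare training-set shares with expected production shares per category.

    train_labels   -- one label per training document (e.g. its document type)
    expected_share -- assumed production share per category (fractions of 1.0)
    min_examples   -- hypothetical floor for "enough good examples"
    """
    counts = Counter(train_labels)
    total = sum(counts.values())
    report = {}
    for category, target in expected_share.items():
        n = counts.get(category, 0)
        actual = n / total if total else 0.0
        report[category] = {
            "count": n,
            "actual_share": round(actual, 3),
            "expected_share": target,
            "too_few_examples": n < min_examples,
        }
    return report


# Hypothetical training set and assumed production mix.
train = ["schedule"] * 70 + ["quotation"] * 25 + ["endorsement"] * 5
expected = {"schedule": 0.7, "quotation": 0.2, "endorsement": 0.1}

report = distribution_gaps(train, expected, min_examples=10)
for category, row in report.items():
    print(category, row)
```

A category flagged with `too_few_examples` is a candidate for collecting more real documents, even if its share already roughly matches production.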

------------------------------
Daniel Toczala
Community Leader and Customer Success Manager - Watson
dtoczala@us.ibm.com
------------------------------

3. RE: NLU - Entities & Relations - Distribution of Training Data

Alessandro Vignazia
Posted Mon April 04, 2022 10:23 AM

Hi Daniel,

Many thanks for your reply - much appreciated.

Just to echo my understanding of what you are saying:

1) Use Real Data for better results;
2) Approximate inference (run) time data distribution;
3) Ensure sufficient good examples for each key "type";
4) Be prepared to retrain the model based on testing / feedback from end users to shore up "issues"...

Thanks again
Alessandro

------------------------------
Alessandro Vignazia
------------------------------

4. RE: NLU - Entities & Relations - Distribution of Training Data

Alessandro Vignazia
Posted Sat May 21, 2022 10:14 AM

Hi Daniel,

Hope you are well.

I have a question I am hoping you can provide some insight into...

We are trying to use relations to extract multiple rows from an ingested document.

We are using a "class" entity and then multiple attributes / entities which relate to the "class" entity.

Our class entity is "PolicyHolder" and it has multiple attributes / entities related to it e.g. First_Name, Last_Name, Mobile_Number, Address etc.

Currently we use a single relation for relating all the entities to the PolicyHolder class, e.g. "PolicyHolder_HasA", and our WKS model is at a 49% F1 score for relations after training on around 100 documents of fewer than 100 words each.

Would it improve our relation F1 score if we named each relation uniquely with respect to the related entity / attribute? E.g. for First_Name the relation could be called "PolicyHolder_First_Name" and for Last_Name perhaps "PolicyHolder_Last_Name" etc. I.e. would this allow the WKS NLU model to differentiate the relations better?
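To make the two naming schemes concrete, they can be written as plain data and converted into each other, so the same annotated documents could be used to train and compare both variants. This is only an illustrative sketch: the triple layout `(relation, source type, target type)` and the helpers `to_specific` and `to_generic` are assumptions made for this example, not a WKS export format or API.

```python
# Hypothetical annotation triples: (relation, source entity type, target entity type)
GENERIC = "PolicyHolder_HasA"


def to_specific(triples):
    """Rename the generic relation to one relation name per target entity type."""
    return [
        (f"PolicyHolder_{target}" if rel == GENERIC else rel, source, target)
        for rel, source, target in triples
    ]


def to_generic(triples):
    """Collapse per-attribute relation names back into the single generic relation."""
    out = []
    for rel, source, target in triples:
        if source == "PolicyHolder" and rel == f"PolicyHolder_{target}":
            rel = GENERIC
        out.append((rel, source, target))
    return out


annotations = [
    (GENERIC, "PolicyHolder", "First_Name"),
    (GENERIC, "PolicyHolder", "Last_Name"),
    (GENERIC, "PolicyHolder", "Mobile_Number"),
]

specific = to_specific(annotations)
for triple in specific:
    print(triple)
```

In this sketch the specific relation name is fully determined by the target entity type, so the conversion is lossless in both directions. Whether either scheme scores better is something only a measurement on held-out documents can show.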

Thank you
Alessandro

5. RE: NLU - Entities & Relations - Distribution of Training Data

Daniel Toczala
Posted Mon May 23, 2022 06:06 PM

Alessandro,

Some more good questions from you. I have found that the more "general" you can be, the better your results tend to be. So it's usually easier for a model to just recognize a person's name, rather than try to have it recognize a specific person's name (like buyer name, seller name, etc.). With that being said, I have also seen instances where customers have had models become very good at picking out specific names or instances of things, based on the location and context of the thing in question. A good example is determining legal parties in a legal document: since they are always presented in the same way, a model can easily tell who is who based on position in the document.

This is all very interesting but doesn't help you and your situation much. My advice is to try it and see if your results improve. Like so many things in machine learning and AI models, your specific situation could have some specific things that cause it to be more suitable to one type of model, or one base ontology, over another. You need to experiment and see what the measurable impacts are of the different approaches. You mention the F1 score here, so I assume that you are doing some blind testing of these models. I also suggest that customers keep a few of the "this approach wasn't as good" approaches, because as you move further along in your implementation, you may find that a particular approach improves over time. Automate the testing; that way you always have a handful of "candidate models" that you are assessing as you move forward with your implementation.
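The automated comparison of candidate models described above could look roughly like the sketch below. It scores each candidate's predicted relations against a blind (held-out) gold set using precision, recall and F1, then ranks the candidates. The candidate names, the gold and predicted relations, and the helpers `prf1` and `rank_candidates` are all made up for illustration; in practice the predictions would come from running each trained model over the same held-out documents.

```python
def prf1(gold, predicted):
    """Return (precision, recall, F1) for two collections of relation tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def rank_candidates(gold, predictions_by_model):
    """Sort candidate models by F1 on the same gold set, best first."""
    scores = {name: prf1(gold, preds)[2] for name, preds in predictions_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical gold relations: (document id, source entity type, target entity type)
gold = {
    ("doc1", "PolicyHolder", "First_Name"),
    ("doc1", "PolicyHolder", "Last_Name"),
    ("doc2", "PolicyHolder", "Address"),
    ("doc2", "PolicyHolder", "Mobile_Number"),
}

# Hypothetical predictions from two candidate models (not real results).
predictions = {
    "candidate_a": {
        ("doc1", "PolicyHolder", "First_Name"),
        ("doc2", "PolicyHolder", "Address"),
    },
    "candidate_b": {
        ("doc1", "PolicyHolder", "First_Name"),
        ("doc1", "PolicyHolder", "Last_Name"),
        ("doc2", "PolicyHolder", "Address"),
        ("doc2", "PolicyHolder", "Email"),
    },
}

ranking = rank_candidates(gold, predictions)
for name, f1 in ranking:
    print(f"{name}: F1 = {f1:.3f}")
```

Keeping the gold set fixed and rerunning this after every retraining gives a comparable score history for each candidate, including the ones that did not look best at first.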

------------------------------
Daniel Toczala
Community Leader and Customer Success Manager - Watson
dtoczala@us.ibm.com
------------------------------

6. RE: NLU - Entities & Relations - Distribution of Training Data

Alessandro Vignazia
Posted Tue May 24, 2022 05:16 AM

Hi Daniel,

Thanks so much, I appreciate your advice!

Sorry, but you answered another question we are also grappling with, probably because I ramble. Apologies.

My question above is specifically about the RELATIONS between the entities. Would you advise that we use uniquely named RELATIONS between an Entity Class and its Entity Attributes to achieve higher F1 scores on RELATION extraction? Currently we use the same RELATION name for all attributes of a single class, i.e. the POLICY_HOLDER class has a "HAS_A" relation to First_Name and a "HAS_A" relation to Last_Name. Or should we rather give the POLICY_HOLDER class a "HAS_A_First_Name" relation and a "HAS_A_Last_Name" relation?

Many thanks again for your inputs sir

Regards
Alessandro

7. RE: NLU - Entities & Relations - Distribution of Training Data

Joel Gregory
Posted Mon September 26, 2022 09:45 AM

Just to echo my understanding of what you are saying:

1) Use Real Data for better results;
2) Approximate inference (run) time data distribution;
3) Ensure sufficient good examples for each key "type";
4) Be prepared to retrain the model based on testing / feedback from end users to shore up "issues"...

------------------------------
Joel Gregory
------------------------------

8. RE: NLU - Entities & Relations - Distribution of Training Data

micle loginsd
Posted Fri April 07, 2023 09:04 AM

You can also get more information

------------------------------
micle loginsd
------------------------------
