Embeddable AI

  • 1.  NLU - Entities & Relations - Distribution of Training Data

    Posted Mon April 04, 2022 09:57 AM
    Hi there,

    Should we train our custom natural language understanding model (for the extraction of entities & relations) in Watson Knowledge Studio either:

    a) To match how the data will look at inference (production run) time in the real world, where some document formats / types and data attributes will appear far more often than others, e.g. a particular competitor's insurance schedule formats vs. a smaller competitor's quotations; OR

    b) With "balanced" training data, so that the number and type of examples are effectively equal across the competitors, document types (quotation, schedule, endorsement, etc.) and ultimately mentions (annotations)?

    If "b" above is the correct approach (according to our research), must we use real-world training data, or can we create synthetic data? If we do the latter, must we ensure that 1) the data attributes themselves are varied (patterns etc.) and 2) the document types, formats and competitors are equally balanced again? I.e. we cannot create 300 dummy documents all for one competitor document type...

    Thank you
    Alessandro

    ------------------------------
    Alessandro Vignazia
    ------------------------------

    #BuildwithWatsonApps
    #EmbeddableAI


  • 2.  RE: NLU - Entities & Relations - Distribution of Training Data

    Posted Mon April 04, 2022 10:06 AM
    Alessandro,

      Some very good questions.  In terms of the mix of data to use for training, in general, you need to have something that reflects reality.  In terms of distribution of broad types of data, it is best to approximate what a real distribution of data might be.  In terms of the data to use, you should always look to use real-world data.  Synthetic data always leads to poorer results.  You need some good examples from each broad category, and you need to make sure that you make provisions to retrain your model, and get user feedback on the model performance from your end-users.
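
    As a rough illustration of "approximate the real distribution, but keep some good examples from each broad category", here is a minimal Python sketch. The category names, the counts and the `min_per_category` floor are all hypothetical and only illustrate the idea; nothing here is part of Watson Knowledge Studio.

```python
def training_quota(production_counts, total_docs, min_per_category=5):
    """Allocate training documents per category in proportion to the
    expected production mix, with a floor so rare categories are not starved."""
    total = sum(production_counts.values())
    quota = {}
    for category, count in production_counts.items():
        proportional = round(total_docs * count / total)
        # The floor can push the overall total slightly above total_docs.
        quota[category] = max(min_per_category, proportional)
    return quota


# Hypothetical production mix: (competitor, document type) -> documents seen.
production_counts = {
    ("CompetitorA", "schedule"): 700,
    ("CompetitorA", "quotation"): 200,
    ("CompetitorB", "quotation"): 80,
    ("CompetitorB", "endorsement"): 20,
}

quota = training_quota(production_counts, total_docs=100, min_per_category=5)
for category, n in quota.items():
    print(category, n)
```

    The floor is a judgment call: it trades a small departure from the production mix for enough examples of the rare categories to learn from.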

    ------------------------------
    Daniel Toczala
    Community Leader and Customer Success Manager - Watson
    dtoczala@us.ibm.com
    ------------------------------



  • 3.  RE: NLU - Entities & Relations - Distribution of Training Data

    Posted Mon April 04, 2022 10:23 AM
    Hi Daniel,

    Many thanks for your reply - much appreciated.

    Just to echo my understanding of what you are saying: 

    1) Use Real Data for better results; 
    2) Approximate inference (run) time data distribution;
    3) Ensure sufficient good examples for each key "type"; 
    4) Be prepared to retrain the model based on testing / feedback from end users to shore up "issues"...

    Thanks again
    Alessandro

    ------------------------------
    Alessandro Vignazia
    ------------------------------



  • 4.  RE: NLU - Entities & Relations - Distribution of Training Data

    Posted Sat May 21, 2022 10:14 AM

    Hi Daniel,

     

    Hope you are well.

     

    I have a question I am hoping you can provide some insight into...

     

    We are trying to use relations to extract multiple rows from an ingested document.

    We are using a "class" entity and then multiple attributes / entities which relate to the "class" entity.

    Our class entity is "PolicyHolder" and it has multiple attributes / entities related to it e.g. First_Name, Last_Name, Mobile_Number, Address etc.

    Currently we use a single relation to relate all the entities to the PolicyHolder class, e.g. "PolicyHolder_HasA", and our WKS model is at a 49% F1 score for relations after training on around 100 documents of fewer than 100 words each.

    Would it improve our relation F1 score if we named each relation uniquely with respect to the related entity / attribute, e.g. "PolicyHolder_First_Name" for First_Name, "PolicyHolder_Last_Name" for Last_Name, etc.? I.e. would this allow the WKS NLU model to differentiate the relations better?
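
    To make the two naming schemes concrete: in this setup the unique relation name is fully determined by the target entity type, so one set of annotations can be converted mechanically into the other and both schemes can be trained and scored on the same documents. The sketch below assumes a hypothetical `(source_type, relation_type, target_type)` tuple representation; it is not the Watson Knowledge Studio export format.

```python
GENERIC_RELATION = "PolicyHolder_HasA"


def to_specific(relation):
    """Rewrite a generic relation as a per-attribute relation name.

    `relation` is a hypothetical (source_type, relation_type, target_type) tuple.
    Relations that do not use the generic name are returned unchanged.
    """
    source_type, relation_type, target_type = relation
    if relation_type == GENERIC_RELATION:
        return (source_type, f"PolicyHolder_{target_type}", target_type)
    return relation


generic = [
    ("PolicyHolder", "PolicyHolder_HasA", "First_Name"),
    ("PolicyHolder", "PolicyHolder_HasA", "Last_Name"),
]
specific = [to_specific(r) for r in generic]
print(specific)
```

    Whether the model scores better with one scheme or the other is an empirical question that this conversion does not answer; it only keeps the comparison like-for-like.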

    Thank you
    Alessandro

     






  • 5.  RE: NLU - Entities & Relations - Distribution of Training Data

    Posted Mon May 23, 2022 06:06 PM
    Alessandro,

      Some more good questions from you.  I have found that the more "general" you can be, the better your results tend to be.  So it's usually easier for a model to just recognize a person's name, rather than try to have it recognize a specific person's name (like buyer name, seller name, etc.).  With that being said, I have also seen instances where customers have had models become very good at picking out specific names or instances of things, based on the location and context of the thing in question.  A good example is determining legal parties in a legal document: since they are always presented in the same way, a model can easily tell who is who based on position in the document.

      This is all very interesting but doesn't help you and your situation much.  My advice is to try it and see if your results improve.  Like so many things in machine learning and AI models, your specific situation could have some specific things that cause it to be more suitable to one type of model, or one base ontology, over another.  You need to experiment and see what the measurable impacts are of the different approaches.  You mention the F1 score here, so I assume that you are doing some blind testing of these models.  I also suggest that customers keep a few of the "this approach wasn't as good" approaches, because as you move further along in your implementation, you may find that a particular approach improves over time.  Automate the testing; that way you always have a handful of "candidate models" that you are assessing as you move forward with your implementation.
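
      A minimal sketch of what automated testing over a handful of candidate models might look like, in Python. It assumes each candidate's gold and predicted relations on a blind test set are available as sets of hashable tuples; the tuple layout, the candidate names and the data are made up for illustration, and the F1 here is a plain exact-match score rather than whatever Watson Knowledge Studio reports.

```python
def relation_f1(gold, predicted):
    """Exact-match F1 between two collections of relation tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rank_candidates(results_by_model):
    """results_by_model maps a model name to a (gold, predicted) pair.

    Each candidate carries its own gold set so that models built on
    different type systems can be scored in their own label scheme.
    """
    scores = {
        name: relation_f1(gold, predicted)
        for name, (gold, predicted) in results_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up blind-test data: (document id, source entity type, target entity type).
gold = {
    ("doc1", "PolicyHolder", "First_Name"),
    ("doc1", "PolicyHolder", "Last_Name"),
    ("doc2", "PolicyHolder", "First_Name"),
    ("doc2", "PolicyHolder", "Address"),
}
pred_a = {
    ("doc1", "PolicyHolder", "First_Name"),
    ("doc1", "PolicyHolder", "Last_Name"),
    ("doc2", "PolicyHolder", "Mobile_Number"),
}
pred_b = {
    ("doc1", "PolicyHolder", "First_Name"),
    ("doc1", "PolicyHolder", "Last_Name"),
    ("doc2", "PolicyHolder", "First_Name"),
    ("doc2", "PolicyHolder", "Mobile_Number"),
}

ranking = rank_candidates({
    "candidate_a": (gold, pred_a),
    "candidate_b": (gold, pred_b),
})
for name, score in ranking:
    print(f"{name}: F1 = {score:.3f}")
```

      Re-running a script like this after every retraining keeps the weaker candidates in view, so an approach that improves over time is not lost.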


    ------------------------------
    Daniel Toczala
    Community Leader and Customer Success Manager - Watson
    dtoczala@us.ibm.com
    ------------------------------



  • 6.  RE: NLU - Entities & Relations - Distribution of Training Data

    Posted Tue May 24, 2022 05:16 AM

    Hi Daniel,

     

    Thanks so much – I appreciate your advice !

     

    Sorry, but you answered another question we are also grappling with, probably because I ramble. Apologies.

     

    My original question in this thread is specifically related to the RELATIONS between the entities, i.e. would you advise that we use uniquely named RELATIONS between an Entity Class and its Entity Attributes to achieve higher F1 scores on RELATION extraction? Currently we use the same RELATION name for all attributes of a single class, i.e. the POLICY_HOLDER class has a "HAS_A" relation to First_Name and a "HAS_A" relation to Last_Name. Should we rather give the POLICY_HOLDER class a "HAS_A_First_Name" relation and a "HAS_A_Last_Name" relation?

     

    Many thanks again for your inputs sir

     

    Regards

    Alessandro

     






  • 7.  RE: NLU - Entities & Relations - Distribution of Training Data

    Posted Mon September 26, 2022 09:45 AM
    Just to echo my understanding of what you are saying: 

    1) Use Real Data for better results; 
    2) Approximate inference (run) time data distribution;
    3) Ensure sufficient good examples for each key "type"; 
    4) Be prepared to retrain the model based on testing / feedback from end users to shore up "issues"...

    ------------------------------
    Joel Gregory
    ------------------------------



  • 8.  RE: NLU - Entities & Relations - Distribution of Training Data

    Posted Fri April 07, 2023 09:04 AM