Hi there,
Should we train our custom natural language understanding model (for extracting entities and relations) in Watson Knowledge Service using either:
a) Data distributed as it will be at inference (production) time in the real world, where some document formats/types and data attributes appear far more often than others, e.g. a particular competitor's insurance schedule formats vs a smaller competitor's quotations; OR
b) "Balanced" training data, so that the number and type of examples are effectively equal across competitors, document types (quotation, schedule, endorsement, etc.) and ultimately mentions (annotations)?
If "b" above is the correct approach (as our research suggests), must we use real-world training data, or can we create synthetic data? If we do the latter, must we ensure that 1) the data attributes themselves are varied (patterns etc.) and 2) the document types, formats and competitors are again equally balanced? I.e. we cannot create 300 dummy documents all for one competitor's document type.
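To make the two options concrete, here is a rough Python sketch of what I mean by "as per production" versus "balanced" sampling. The competitor names, document counts and function names are all invented for illustration, and nothing here is a Watson API; it only shows how each option would pick documents and where synthetic data would be needed to fill a gap.

```python
import random
from collections import Counter

# Hypothetical corpus: one (competitor, document type) tag per document.
# Names and counts are made up purely for illustration.
corpus = (
    [("CompetitorA", "schedule")] * 300
    + [("CompetitorA", "quotation")] * 80
    + [("CompetitorB", "quotation")] * 15
    + [("CompetitorB", "endorsement")] * 5
)

def sample_as_is(docs, n, seed=0):
    """Option a: plain random sample, so each stratum keeps roughly its real-world share."""
    rng = random.Random(seed)
    return rng.sample(docs, n)

def sample_balanced(docs, per_stratum, seed=0):
    """Option b: take up to `per_stratum` documents from every (competitor, type) pair."""
    rng = random.Random(seed)
    by_stratum = {}
    for doc in docs:
        by_stratum.setdefault(doc, []).append(doc)
    picked = []
    for members in by_stratum.values():
        picked.extend(rng.sample(members, min(per_stratum, len(members))))
    return picked

target = 20
natural = Counter(sample_as_is(corpus, n=80))
balanced = Counter(sample_balanced(corpus, per_stratum=target))

# Strata that cannot reach the target from real documents alone;
# this is the gap that synthetic documents would have to fill.
shortfall = {s: target - c for s, c in balanced.items() if c < target}

print("as-is:    ", dict(natural))
print("balanced: ", dict(balanced))
print("shortfall:", shortfall)
```

With these made-up counts, the balanced sample runs short for the smaller competitor's quotations and endorsements, which is exactly where the synthetic data question in point 2) comes in.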
Thank you
Alessandro
------------------------------
Alessandro Vignazia
------------------------------