Setting up an ADP classification model correctly is critical: the chosen document types drive the extraction model, which in turn determines what data is extracted during document processing. Following best practices during classification setup helps ensure accurate and consistent data extraction.
Additionally, the steps outlined below can be applied in cases of misclassification to enhance both classification accuracy and the quality of the extracted results.
Break Down Large Projects
When working with a large set of document types or many sample documents, project size can quickly exceed the supported limits. As mentioned in the documentation, ADP only supports project export up to 1 GB, and a larger project can lead to operational bottlenecks.
To avoid such issues:
- Split large projects into smaller, more focused projects.
- Group similar document types or documents together logically.
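As a rough planning aid, the sketch below measures how much disk space each document type's samples take and greedily packs types into groups that stay under the limit. It is only an offline estimate: the folder layout, the function names, and the idea that sample size approximates export size are all assumptions, not ADP behavior, and the resulting groups should still be reviewed so that similar document types stay together.

```python
import os

# 1 GB export limit cited above; treating it as 1024**3 bytes is an assumption.
EXPORT_LIMIT_BYTES = 1024 ** 3


def folder_size_bytes(root):
    """Sum the sizes of all regular files under root (hypothetical sample folder)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


def plan_projects(type_sizes, limit=EXPORT_LIMIT_BYTES):
    """Greedily pack document types (largest first) into groups under the limit.

    type_sizes maps a document type name to its total sample size in bytes.
    A single type larger than the limit still gets its own group.
    """
    groups = []
    current, current_size = [], 0
    ordered = sorted(type_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for doc_type, size in ordered:
        if current and current_size + size > limit:
            groups.append(current)
            current, current_size = [], 0
        current.append(doc_type)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

For example, `type_sizes` could be built by calling `folder_size_bytes` once per document type folder before passing the result to `plan_projects`.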
Train Your Model Incrementally
Instead of training your classification model with all document types at once, adopt an incremental training strategy:
- Start with a smaller set of document types.
- Train the model and evaluate the results.
- Gradually add more types over time, fine-tuning as you go.
This helps you identify issues early and correct ontology or data quality problems before the project scales.
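To evaluate each increment, it can help to look at results per document type rather than at a single overall score. The sketch below assumes you have collected (actual type, predicted type) pairs from a test run by some means; the function names and the input format are hypothetical and not part of ADP.

```python
from collections import Counter


def per_type_recall(pairs):
    """Return {document_type: fraction of its documents classified correctly}.

    pairs is a list of (actual_type, predicted_type) tuples.
    """
    totals, correct = Counter(), Counter()
    for actual, predicted in pairs:
        totals[actual] += 1
        if actual == predicted:
            correct[actual] += 1
    return {doc_type: correct[doc_type] / totals[doc_type] for doc_type in totals}


def top_confusions(pairs, n=3):
    """Return the n most frequent (actual, predicted) misclassifications."""
    confusions = Counter((a, p) for a, p in pairs if a != p)
    return confusions.most_common(n)
```

A type whose recall drops after new types are added, or a pair that appears repeatedly in `top_confusions`, is a good candidate for an ontology or sample review before adding more types.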
Ensure Ontology Integrity
The project ontology also plays a central role in classification accuracy. Check the following:
- Each document type has unique and clearly defined key-value pairs (KVPs).
- Class definitions do not overlap and are not ambiguous.
Consistent and clean ontology design leads to better model learning and easier debugging.
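One way to spot overlapping definitions is to compare the KVP names of every pair of document types. The sketch below computes a Jaccard overlap score on a plain dictionary; the dictionary format and the example names are assumptions made for illustration, not an ADP export format.

```python
from itertools import combinations


def kvp_overlap(ontology):
    """Score every pair of document types by how many KVP names they share.

    ontology maps a document type name to a list of its KVP names.
    Returns (type_a, type_b, jaccard_score, shared_names) rows,
    highest overlap first.
    """
    report = []
    for (a, keys_a), (b, keys_b) in combinations(sorted(ontology.items()), 2):
        set_a, set_b = set(keys_a), set(keys_b)
        union = set_a | set_b
        shared = set_a & set_b
        score = len(shared) / len(union) if union else 0.0
        report.append((a, b, score, sorted(shared)))
    return sorted(report, key=lambda row: row[2], reverse=True)


# Hypothetical ontology used only to illustrate the input shape.
example_ontology = {
    "Invoice": ["invoice_number", "total_amount", "due_date"],
    "PurchaseOrder": ["po_number", "total_amount", "due_date"],
    "Receipt": ["total_amount", "store_name"],
}
overlap_report = kvp_overlap(example_ontology)
```

Pairs near the top of the report share most of their KVPs and are worth reviewing. A high score does not by itself mean the types should be merged, only that their definitions may be hard for the model to tell apart.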
Maintain a Balanced Train-Test Split
A good practice is to use a 70:30 train-test split ratio:
- 70% of your sample documents go into training.
- 30% are reserved for testing.
This balance helps ensure the model has enough data to learn effectively while still leaving enough unseen data to evaluate performance. If certain document types perform poorly, consider:
- Adding more sample documents to those document types.
- Ensuring the document type is not underrepresented in the training set.
If performance is still unsatisfactory, you can try shifting the split to 80:20 to enlarge the training portion, especially when you have limited samples.
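The sketch below shows one way to plan such a split offline so that every document type keeps roughly the same ratio, which avoids a type ending up underrepresented in training by chance. It works on a plain list of (document id, document type) pairs; the function name and input format are assumptions, and how documents are actually assigned inside ADP is not shown here.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_fraction=0.7, seed=0):
    """Split (document_id, document_type) pairs per type into train and test.

    Each document type is shuffled and split separately so both sets keep
    approximately the same type proportions.
    """
    by_type = defaultdict(list)
    for doc_id, doc_type in samples:
        by_type[doc_type].append(doc_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for doc_type in sorted(by_type):
        docs = by_type[doc_type][:]
        rng.shuffle(docs)
        cut = round(len(docs) * train_fraction)
        # With two or more samples, keep at least one on each side.
        if len(docs) > 1:
            cut = min(max(cut, 1), len(docs) - 1)
        train += [(d, doc_type) for d in docs[:cut]]
        test += [(d, doc_type) for d in docs[cut:]]
    return train, test
```

Calling `stratified_split(samples, train_fraction=0.8)` gives the 80:20 variant without changing anything else.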
Address Type Imbalance
Classification models can be biased toward types with more examples. To combat this:
- Identify document types with fewer samples.
- Add more training documents for those document types whenever possible.
- Consider techniques like oversampling or data augmentation if additional real data is not available.
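A minimal sketch of the first and last steps, assuming the training samples are available as (document id, document type) pairs: it flags types with far fewer samples than the largest type, then applies simple random oversampling by repeating minority documents. The names and the 0.5 threshold are illustrative assumptions. Oversampling should be applied to the training set only, so that the test set still reflects real, unseen documents.

```python
import random
from collections import Counter


def find_underrepresented(labels, ratio=0.5):
    """Return {type: count} for types with fewer than ratio * largest count."""
    counts = Counter(labels)
    if not counts:
        return {}
    largest = max(counts.values())
    return {t: n for t, n in counts.items() if n < ratio * largest}


def oversample(samples, seed=0):
    """Repeat randomly chosen minority documents until all types match the largest.

    samples is a non-empty list of (document_id, document_type) pairs.
    """
    rng = random.Random(seed)
    by_type = {}
    for doc_id, doc_type in samples:
        by_type.setdefault(doc_type, []).append(doc_id)

    target = max(len(docs) for docs in by_type.values())
    balanced = []
    for doc_type, docs in by_type.items():
        extra = [rng.choice(docs) for _ in range(target - len(docs))]
        balanced += [(d, doc_type) for d in docs + extra]
    return balanced
```

Because duplicated documents add no new information, real additional samples remain preferable when they can be obtained.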
Conclusion
By carefully managing project size, training incrementally, maintaining a clean ontology, balancing the train-test split, and addressing type imbalance, you can significantly improve ADP classification model performance and long-term scalability.
Acknowledgments