I have a few current NLP projects which depend on non-trivial parsing of PDFs. Working with PDFs as a data source can be complicated: the format is notoriously complex, and reading the PDF specification to troubleshoot problems can be mind-bendingly difficult!
There are so many open source projects available for parsing PDFs that it becomes difficult just to decide which one to try first.
I'm super curious:
which PDF parsing libraries has your team used in production and found most useful? Many thanks for any and all suggestions.
Some that I've worked with include:
Each has its pros and cons, most definitely. For example, some are good at extracting the URLs linked from text, others extract more (or less) of the text on a page, while others tend to preserve the structure of a document.
The latter is particularly interesting, especially when you're working with research publications such as open access journal articles. The general idea is that parts of a paper contain different "qualities" of text: a phrase in a title or methods section has different significance than the same phrase used in a bibliography. Some of these libraries incorporate machine learning models to help classify the text and preserve more of a document's "structure" -- which in turn is valuable when preparing features from the parsed text to train ML models.
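To give a feel for why dedicated libraries matter here, below is a minimal sketch of "naive" text extraction against a tiny, hand-built, uncompressed PDF (a hypothetical stand-in, not a real document). It scrapes the string operands of the `Tj` (show text) operator with a regex -- which works on this toy input, but real-world PDFs compress their content streams, use `TJ` arrays, escape sequences, and custom font encodings, so this approach breaks down almost immediately:

```python
import re

# A minimal, uncompressed single-page PDF assembled by hand for
# illustration only -- real PDFs are rarely this simple.
minimal_pdf = b"""%PDF-1.4
1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj
3 0 obj << /Type /Page /Parent 2 0 R /Contents 4 0 R >> endobj
4 0 obj << /Length 44 >>
stream
BT /F1 12 Tf 72 720 Td (Hello, PDF!) Tj ET
endstream
endobj
trailer << /Root 1 0 R >>
%%EOF"""

def naive_extract_text(pdf_bytes: bytes) -> list:
    # Pull the string operands of the Tj (show text) operator out of
    # uncompressed content streams. This ignores compression, TJ arrays,
    # escapes, and font encodings -- exactly the gaps that libraries
    # like pdfminer.six exist to fill.
    return [m.decode("latin-1")
            for m in re.findall(rb"\((.*?)\)\s*Tj", pdf_bytes)]

print(naive_extract_text(minimal_pdf))  # ['Hello, PDF!']
```

Note that even this toy extractor says nothing about *where* the text appeared on the page or what role it played in the document -- titles, body text, and bibliography entries all come out as an undifferentiated list, which is why structure-preserving parsers are so valuable.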
------------------------------
Paco Nathan
------------------------------
#GlobalAIandDataScience #GlobalDataScience