Global AI and Data Science

  • 1.  PDF parsing

    Posted Wed February 26, 2020 06:27 PM
    I have a few current NLP projects that depend on non-trivial parsing of PDFs.  Working with PDFs as a data source can be complicated: the format began as a proprietary one (it has since been standardized as ISO 32000), and reading the specification to troubleshoot problems can be mind-bendingly difficult!

    There are so many open source projects available for parsing PDFs that it is difficult just to work out which to try first.

    I'm super curious: which PDF parsing libraries has your team used in production and found most useful?
    Many thanks for any and all suggestions.


    Some that I've worked with include:
    Each has its pros and cons, most definitely.  For example, some are good at extracting the URLs linked to text, others extract more (or less) of the text on a page, while others tend to preserve the structure of a document.

    The latter is particularly interesting, especially when you're working with research publications such as open access journal articles. The general idea is that parts of a paper contain different "qualities" of text: a phrase in a title or methods section has different significance than the same phrase used in a bibliography.  Some of these libraries incorporate machine learning models to help classify the text and preserve more of a document's "structure" -- which in turn is valuable for preparing features to train ML models, based on the parsed text.
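
    To make the "qualities of text" idea concrete, here is a minimal sketch, assuming the parser has already split a paper into (section label, text) pairs. The section labels, the weights, and the function name `weighted_term_counts` are all hypothetical and chosen purely for illustration; they don't come from any particular PDF library.

```python
from collections import Counter

# Hypothetical weights: how much a term "counts" in each section.
# These numbers are illustrative assumptions, not tuned values.
SECTION_WEIGHTS = {
    "title": 3.0,
    "abstract": 2.0,
    "methods": 1.5,
    "body": 1.0,
    "bibliography": 0.2,
}


def weighted_term_counts(sections, default_weight=1.0):
    """Sum token counts, scaling each token by the weight of the
    section it came from. `sections` is an iterable of (label, text)."""
    scores = Counter()
    for label, text in sections:
        weight = SECTION_WEIGHTS.get(label, default_weight)
        for raw in text.lower().split():
            token = raw.strip(".,;:()\"'")
            if token:  # skip tokens that were only punctuation
                scores[token] += weight
    return scores


# The same word scores differently depending on where it appears.
parsed = [
    ("title", "Parsing PDF documents"),
    ("bibliography", "Smith, J. Parsing PDF files. 2019."),
]
scores = weighted_term_counts(parsed)
print(scores.most_common(3))
```

    With a structure-preserving parser, scores like these could feed directly into feature preparation; with a parser that only returns flat text, every occurrence would have to be weighted the same.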

    ------------------------------
    Paco Nathan
    ------------------------------

    #GlobalAIandDataScience
    #GlobalDataScience


  • 2.  RE: PDF parsing

    Posted Mon March 09, 2020 03:18 AM
    I use the Corpus Conversion Service (https://www.ibm.com/blogs/research/2018/08/corpus-conversion-service/), or sometimes MS Word with a PowerShell script that opens the PDF directly (yes, it works) and saves it as HTML.

    ------------------------------
    Marc Fiammante
    DE
    IBM
    Nice
    ------------------------------



  • 3.  RE: PDF parsing

    Posted Wed March 11, 2020 12:49 PM
    Many thanks, Marc!  That's great to learn about. We'll look into the IBM service for our project.

    I hadn't realized that MS Word supports PDF => HTML conversion. That's also really good to know!

    Paco


    ------------------------------
    Paco Nathan
    ------------------------------