Global AI and Data Science

  • 1.  PDF Processing with Python

    Posted Wed July 10, 2019 10:03 AM
    As a data scientist, you cannot limit yourself to a single data format. PDFs are a good source of data, and many organizations release their data only as PDFs. As AI grows, we need more data for prediction and classification, so ignoring PDFs as a data source would be a mistake. PDF processing is somewhat difficult, but the API covered in the article linked below makes it easier.
    Check out my article for more information; it includes a quick demonstration in Python:
    https://medium.com/@ahmedkhemiri24/pdf-preprocessing-with-python-19829752af9f


    ------------------------------
    Ahmed Khemiri
    ------------------------------

    #GlobalAIandDataScience
    #GlobalDataScience


  • 2.  RE: PDF Processing with Python

    Posted Mon July 22, 2019 11:52 AM
    Thanks for the info, Ahmed; very informative article. We've found that a large number of the documents we use in our sports history research have never been digitized. We've been working on digitizing much of the historical literature from the 1800s and plan to use AI to analyze the documents once that is complete.

    ------------------------------
    Ben Noelle
    Founder
    RetroSeasons.com
    ------------------------------



  • 3.  RE: PDF Processing with Python

    Posted Mon November 27, 2023 08:48 AM

    There are many libraries in Python that can be used to process PDF files, including operations such as reading, editing, merging, splitting, and converting. Here are some commonly used Python PDF processing libraries:

    PyPDF2: PyPDF2 is a pure Python PDF library that can split, merge, crop, and transform pages of PDF files. It can also add custom data and password protection, as well as extract text and metadata from PDFs.
    pdfplumber: pdfplumber is based on the pdfminer library and can easily extract text, graphics, and metadata from PDFs. It also supports table extraction and visual debugging.
    PDFMiner: PDFMiner is a toolkit for extracting and processing text, graphics, and metadata from PDF documents. It supports multiple languages and encodings, and can customize the parsing and processing of various elements of PDF files.
    PyMuPDF: PyMuPDF (also known as fitz) is a powerful PDF processing library that supports multiple file formats, including PDF, XPS, OpenXPS, CBZ, EPUB, and HTML. It can read, edit, annotate, convert, and print PDF files, and also supports OCR text recognition and image processing.
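
    As a rough illustration of the text extraction mentioned above, here is a minimal sketch using PyPDF2. It assumes PyPDF2 2.0 or later (the PdfReader API). The function names clean_extracted_text and extract_pdf_text and the file name report.pdf are hypothetical, chosen only for this example. The cleanup step is one simple heuristic, not a complete preprocessing pipeline.

```python
import re
from typing import List


def clean_extracted_text(raw: str) -> str:
    """Normalize raw text pulled out of a PDF page (illustrative heuristic)."""
    # Rejoin words split by a hyphen at a line break: "proces-\nsing" -> "processing".
    # Limitation: a genuine compound such as "well-\nknown" also loses its hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat blank lines as paragraph breaks; unwrap single line breaks within a paragraph.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


def extract_pdf_text(path: str) -> List[str]:
    """Return one cleaned text string per page of the PDF at `path`."""
    # Imported lazily so the cleanup helper works without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 >= 2.0

    reader = PdfReader(path)
    # Scanned, image-only pages usually yield an empty string here.
    return [clean_extracted_text(page.extract_text() or "") for page in reader.pages]


# Example usage ("report.pdf" is a hypothetical file name):
#   for number, text in enumerate(extract_pdf_text("report.pdf"), start=1):
#       print(f"--- page {number} ---")
#       print(text)
```

    Pages that come back empty are usually scanned images, which need an OCR step rather than plain text extraction.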



    ------------------------------
    钟 张
    ------------------------------