PDF Processing with Python


1. PDF Processing with Python

Ahmed Khemiri
Posted Wed July 10, 2019 10:03 AM

As a data scientist, you cannot restrict yourself to a single data format. PDFs are a good source of data, and many organizations release their data only as PDFs. As AI grows, we need more data for prediction and classification, so ignoring PDFs as a data source would be a mistake. PDF processing is somewhat difficult, but the API described in the article below makes it easier.

Check out my article for more information; it includes a quick demonstration in Python.

https://medium.com/@ahmedkhemiri24/pdf-preprocessing-with-python-19829752af9f

------------------------------
Ahmed Khemiri
------------------------------

#GlobalAIandDataScience
#GlobalDataScience
2. RE: PDF Processing with Python

Ben Noelle
Posted Mon July 22, 2019 11:52 AM

Thanks for the info, Ahmed; very informative article. We've found that a large number of the documents we use in our sports history research have never been digitized. We've been working on digitizing much of the historic literature from the 1800s and plan to use AI to analyze the documents once that is complete.

------------------------------
Ben Noelle
Founder
RetroSeasons.com
------------------------------

3. RE: PDF Processing with Python

钟张
Posted Mon November 27, 2023 08:48 AM

There are many libraries in Python that can be used to process PDF files, including operations such as reading, editing, merging, splitting, and converting. Here are some commonly used Python PDF processing libraries:

PyPDF2: PyPDF2 is a pure-Python PDF library that can split, merge, crop, and transform the pages of PDF files. It can also add custom data and password protection, and extract text and metadata from PDFs. Development of the library has since continued under the name pypdf.
pdfplumber: pdfplumber is built on pdfminer.six, the maintained fork of PDFMiner, and can easily extract text, graphics, and metadata from PDFs. It also supports table extraction and visual debugging.
PDFMiner: PDFMiner is a toolkit for extracting and processing text, graphics, and metadata from PDF documents. It supports multiple languages and encodings, and lets you customize how the various elements of a PDF file are parsed and processed.
PyMuPDF: PyMuPDF (also known by its import name, fitz) is a powerful PDF processing library that supports multiple file formats, including PDF, XPS, OpenXPS, CBZ, EPUB, and HTML. It can read, edit, annotate, convert, and print PDF files, and also supports OCR (through Tesseract) and image processing.
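
As a rough illustration of the text-extraction use case mentioned above, here is a minimal sketch. It assumes the pypdf package (the successor to PyPDF2) is installed. The function names extract_page_texts and clean_text are hypothetical helpers made up for this example, not part of any library, and the cleanup rules are only one plausible choice.

```python
import re


def extract_page_texts(pdf_path):
    """Return one text string per page of the PDF at pdf_path."""
    # Imported lazily so clean_text() below can be used without pypdf.
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() may yield nothing for scanned, image-only pages;
    # those need OCR instead, so fall back to an empty string here.
    return [page.extract_text() or "" for page in reader.pages]


def clean_text(raw):
    """Tidy raw extracted text for downstream analysis."""
    # Rejoin words that were hyphenated across a line break.
    # Caveat: this also merges genuine compounds split at a line end
    # (for example "well-" + "known" becomes "wellknown").
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces and newlines into single spaces.
    return re.sub(r"\s+", " ", joined).strip()
```

For a file such as report.pdf (a placeholder name), the two helpers could be combined as `pages = [clean_text(t) for t in extract_page_texts("report.pdf")]`. Extraction quality depends heavily on how the PDF was produced, so it is worth comparing a few pages against the original before relying on the output.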

------------------------------
钟张
------------------------------
