Global AI and Data Science


Train, tune and distribute models with generative AI and machine learning capabilities


PDF Processing with Python

1. PDF Processing with Python

Ahmed Khemiri
Posted Wed July 10, 2019 10:03 AM

As a data scientist, you cannot limit yourself to a single data format. PDFs are a good source of data, and many organizations release their data only as PDFs. As AI grows, we need more data for prediction and classification, so ignoring PDFs as a data source would be a mistake. PDF processing is somewhat difficult, but the API covered in the article linked below makes it easier.

Check out my article for more information. It includes a quick demonstration in Python.

https://medium.com/@ahmedkhemiri24/pdf-preprocessing-with-python-19829752af9f

------------------------------
Ahmed Khemiri
------------------------------

#GlobalAIandDataScience
#GlobalDataScience
2. RE: PDF Processing with Python

Ben Noelle
Posted Mon July 22, 2019 11:52 AM

Thanks for the info, Ahmed, very informative article. We've found that a large number of the documents we use in our sports history research have never been digitized. We've been working on digitizing much of the historic literature from the 1800s and plan to use AI to analyze the documents once that is complete.

------------------------------
Ben Noelle
Founder
RetroSeasons.com
------------------------------

3. RE: PDF Processing with Python

钟张
Posted Mon November 27, 2023 08:48 AM

There are many libraries in Python that can be used to process PDF files, including operations such as reading, editing, merging, splitting, and converting. Here are some commonly used Python PDF processing libraries:

PyPDF2: PyPDF2 is a pure Python PDF library (its development now continues under the name pypdf) that can split, merge, crop, and transform pages of PDF files. It can also add custom data and password protection, as well as extract text and metadata from PDFs.
pdfplumber: pdfplumber is built on pdfminer.six and can easily extract text, graphics, and metadata from PDFs. It also supports table extraction and visual debugging.
PDFMiner: PDFMiner is a toolkit for extracting and processing text, graphics, and metadata from PDF documents. It supports multiple languages and encodings, and lets you customize how the various elements of a PDF file are parsed and processed.
PyMuPDF: PyMuPDF (imported as fitz) is a powerful PDF processing library that supports multiple file formats, including PDF, XPS, OpenXPS, CBZ, EPUB, and HTML. It can read, edit, annotate, and convert PDF files, and also supports OCR text recognition and image processing.
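
As a rough illustration of two of the libraries listed above, here is a minimal sketch of per-page text extraction with PyPDF2 and table extraction with pdfplumber. It assumes PyPDF2 2.0 or later (where PdfReader is available) and an installed pdfplumber. The helper names (normalize_text, extract_text_pypdf2, extract_tables_pdfplumber) are made up for this example and are not part of either library.

```python
from typing import List, Optional


def normalize_text(raw: Optional[str]) -> str:
    """Collapse runs of whitespace; treat missing page text as an empty string."""
    return " ".join((raw or "").split())


def extract_text_pypdf2(path: str) -> List[str]:
    """Return one cleaned string per page (assumes PyPDF2 >= 2.0 is installed)."""
    # Imported lazily so this module still loads when PyPDF2 is absent.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return [normalize_text(page.extract_text()) for page in reader.pages]


def extract_tables_pdfplumber(path: str) -> List[list]:
    """Return every table pdfplumber detects, each as a list of rows."""
    # Imported lazily for the same reason as above.
    import pdfplumber

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns a (possibly empty) list of tables per page.
            tables.extend(page.extract_tables())
    return tables
```

Calling extract_text_pypdf2("report.pdf") (the file name is a placeholder) would give a list with one string per page. Scanned PDFs without a text layer will typically yield empty strings here and need OCR first.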

------------------------------
钟张
------------------------------
