Scope of pypdf

What features should pypdf have and which features will it never have?

pypdf aims to simplify working with PDF documents. Its core tasks are:

Document manipulation: Splitting, merging, cropping, and transforming the pages of PDF files
Data extraction: Extracting text and metadata from PDF documents
Security: Decrypting and encrypting PDF documents

Typical indicators that pypdf should do something:

The moonshot extensions are features we would like to have, but are currently not able to add (PRs are welcome 😉)

Belongs in user code

Here are a few indicators that a feature belongs in user code rather than in pypdf:

The use-case is very specific. Most people will not encounter the same need.
It can be done without knowledge of the PDF specification.
It cannot be done without non-PDF domain knowledge, such as anything that is specific to your industry.

While the list of things pypdf will not do is infinitely long, a few topics are requested repeatedly.

These topics are out of scope and will never be part of pypdf:

Optical Character Recognition (OCR): OCR extracts text from images, which is very different from the kind of text extraction pypdf does. Note that PDF documents can contain images; in a scanned document, the whole page is an image. Some scanners run OCR automatically and add a text layer behind the scanned page, and pypdf can use that layer if it’s present. As a rule of thumb: if you cannot select or copy the text in a PDF viewer, it’s likely an image. A noteworthy open-source OCR project is Tesseract.
Format conversion: Converting DOCX or HTML to PDF, or PDF to those formats. You might want to have a look at pdfkit and similar projects instead.

Out of scope for the moment, but might be added if there are enough contributors:

Digital Signature Support (reference ticket): Cryptography is complicated. It’s important to get it right. pypdf currently doesn’t have enough active contributors to properly add digital signature support. For the moment, pyhanko seems to be the best choice.
PDF Generation from Scratch: pypdf can manipulate existing PDF documents, add annotations, combine / split / crop / transform. It can add blank pages. But if you want to generate invoices, you might want to have a look at reportlab / fpdf2 or document conversion tools like pdfkit.
Replacing words within a PDF: Extracting text from PDF is hard. Replacing text in a reliable way is even harder. For example, one word might be split into multiple tokens. Hence, it’s not a simple “search and replace” in some cases.
(Not) Extracting headers/footers/page numbers: While you can apply heuristics, there is no way to make them work for every document. PDF documents simply don’t contain information about which text is a header, footer, or page number.
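
To illustrate why replacing words is not a simple search and replace, consider how a PDF content stream can show text: the `TJ` operator takes an array in which string fragments are interleaved with numeric position adjustments. The snippet below is a simplified model in plain Python, not pypdf code; the operand values and the helper names `naive_replace` and `visible_text` are invented for illustration.

```python
# Simplified model of a TJ operand: text fragments interleaved with
# numeric kerning adjustments. The values are made up for this example.
tj_operand = ["Inv", -20, "oice ", 15, "No. 42"]


def naive_replace(operand: list, old: str, new: str) -> list:
    """Apply str.replace to every text fragment separately."""
    return [
        item.replace(old, new) if isinstance(item, str) else item
        for item in operand
    ]


def visible_text(operand: list) -> str:
    """Join the text fragments the way a reader perceives them."""
    return "".join(item for item in operand if isinstance(item, str))


replaced = naive_replace(tj_operand, "Invoice", "Receipt")

# The word is visible to the reader, yet no single fragment contains it,
# so the fragment-wise replacement leaves the operand unchanged.
word_is_visible = "Invoice" in visible_text(tj_operand)
operand_unchanged = replaced == tj_operand
```

A reliable replacement would have to rewrite several fragments at once and account for the adjustments between them, which is part of what makes the feature hard.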

It’s also worth pointing out that pypdf is designed to be a library. It is not an application. That has several implications:

Execution: pypdf cannot be executed directly; it can only be called from within a program written by a pypdf user. In contrast, an application is executed on its own.
Dependencies: pypdf should have a minimal set of dependencies and only restrict them where it is strictly necessary. In contrast, applications should be installed in environments which are isolated from other applications. They can pin their dependencies.

If you’re looking for a way to work with PDF files from the shell, you should either write a script that uses pypdf or use pdfly.