Developer Intro

pypdf is a library and hence its users are developers. This document is not for the users, but for people who want to work on pypdf itself.

Installing Requirements

pip install -r requirements/dev.txt

Running Tests

See testing pypdf with pytest

The sample-files git submodule

The sample-files submodule exists because we want to keep the pypdf repository small while still having an extensive test suite. Those two goals conflict.

The resources folder should contain a select set of core examples that cover most cases we typically want to test for. The sample-files submodule can cover many more edge cases, such as the behavior with larger files and files from different PDF producers.

In order to get the sample-files folder, you need to execute:

git submodule update --init
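
Code that depends on the submodule can check for it before running. The following is a minimal sketch, not part of pypdf itself: the helper name sample_files_available and the assumption that the checkout lives in a sample-files directory directly under the repository root are illustrative. It relies on git leaving an empty directory in place of a submodule that has not been initialized.

```python
from pathlib import Path


def sample_files_available(repo_root: Path) -> bool:
    """Return True if the sample-files directory exists and is non-empty.

    Hypothetical helper for illustration; pypdf does not ship this function.
    """
    sample_dir = repo_root / "sample-files"
    # An uninitialized submodule shows up as an empty directory,
    # so checking that the directory exists is not enough.
    return sample_dir.is_dir() and any(sample_dir.iterdir())
```

If the function returns False for your checkout, running git submodule update --init as described above should populate the folder.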

Tools: git and pre-commit

Git is a command line application for version control. If you don’t know it, you can play ohmygit to learn it.

GitHub is the service where the pypdf project is hosted. While git is free and open source, GitHub is a commercial service owned by Microsoft, although it is free to use in a lot of cases.

pre-commit is a command line application that uses git hooks to automatically execute code. This allows you to avoid style issues and other code quality issues. After you have run pre-commit install once in your local copy of pypdf, it will be executed automatically whenever you run git commit.

Commit Messages

Having a clean commit message helps people to quickly understand what the commit is about, without actually looking at the changes. The first line of the commit message is used to auto-generate the CHANGELOG. For this reason, the format should be:

PREFIX: DESCRIPTION

BODY

The PREFIX can be:

  • SEC: Security improvements. Typically a fix for a possible infinite loop.

  • BUG: A bug was fixed. Likely there are one or more related issues. In that case, write Closes #123 in the BODY, where 123 is the issue number on GitHub. It would be absolutely amazing if you could write a regression test in those cases, that is, a test that would fail without the fix. A bug is always an issue for pypdf users; fixing test code or CI is not considered a bug fix here.

  • ENH: A new feature! Describe in the body what it can be used for.

  • DEP: A deprecation. Either marking something as “this is going to be removed” or actually removing it.

  • PI: A performance improvement. This could also be a reduction in the file size of PDF files generated by pypdf.

  • ROB: A robustness change. Dealing better with broken PDF files.

  • DOC: A documentation change.

  • TST: Adding or adjusting tests.

  • DEV: Developer experience improvements, e.g. pre-commit or setting up CI.

  • MAINT: Quite a lot of different stuff that fits no other prefix. Refactorings are a typical example. Performance improvements are the most interesting changes of this kind, but they have their own prefix (PI) above.

  • STY: A style change. Something that makes pypdf code more consistent. Typically a small change. It could also be better error messages for end users.

The prefix is used to generate the CHANGELOG. Every PR must have exactly one prefix. If you feel that several match, take the first one in this list that applies to your PR.
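
The format and the tie-breaking rule can be sketched in a few lines of Python. This is an illustration only, not the tooling pypdf uses to build the CHANGELOG: the names PREFIXES, parse_first_line, and pick_prefix are hypothetical, and the sketch assumes the prefix is followed by a colon and a single space.

```python
import re

# Prefixes in the order listed above; earlier entries win when several match.
PREFIXES = ["SEC", "BUG", "ENH", "DEP", "PI", "ROB", "DOC", "TST", "DEV", "MAINT", "STY"]

# "PREFIX: DESCRIPTION" with a non-empty description (assumed exact spacing).
_FIRST_LINE = re.compile(r"^(?P<prefix>[A-Z]+): (?P<description>\S.*)$")


def parse_first_line(message: str) -> tuple[str, str]:
    """Split the first line of a commit message into (prefix, description).

    Raises ValueError if the line does not follow the PREFIX: DESCRIPTION
    format or uses a prefix that is not in the list.
    """
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    match = _FIRST_LINE.match(first_line)
    if match is None or match["prefix"] not in PREFIXES:
        raise ValueError(f"Invalid first line: {first_line!r}")
    return match["prefix"], match["description"]


def pick_prefix(candidates: list[str]) -> str:
    """Return the candidate that appears first in the prefix list."""
    return min(candidates, key=PREFIXES.index)
```

For example, a change that both fixes a bug and updates documentation would be labeled BUG, because BUG comes before DOC in the list.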

Pull Request Size

Smaller Pull Requests (PRs) are preferred as it’s typically easier to merge them. For example, if you have some typos, a few code-style changes, a new feature, and a bug-fix, that could be 3 or 4 PRs.

A PR must be complete. That means if you introduce a new feature it must be finished within the PR and have a test for that feature.

Benchmarks

We need to keep an eye on performance and thus we have a few benchmarks.

See py-pdf.github.io/pypdf/dev/bench