How pypdf writes PDF files

pypdf uses PdfWriter to write PDF files. pypdf has PdfObject and several subclasses with the write_to_stream method. The PdfWriter.write method uses the write_to_stream methods of the referenced objects.

The PdfWriter.write_stream method has the following core steps:

  1. _sweep_indirect_references: This step ensures that any circular references to objects are correctly handled. It adds the object reference numbers of any circularly referenced objects to an external reference map, so that self-page-referencing trees can reference the correct new object location, rather than copying in a new copy of the page object.

  2. Write the File Header and Body with _write_pdf_structure: In this step, the PDF header and objects are written to the output stream. This includes the PDF version (e.g., %PDF-1.7) and the objects that make up the content of the PDF, such as pages, annotations, and form fields. The locations (byte offsets) of these objects are stored for later use in generating the xref table.

  3. Write the Cross-Reference Table with _write_xref_table: Using the stored object locations, this step generates and writes the cross-reference table (xref table) to the output stream. The cross-reference table contains the byte offsets for each object in the PDF file, allowing for quick random access to objects when reading the PDF.

  4. Write the File Trailer with _write_trailer: The trailer is written to the output stream in this step. The trailer contains essential information, such as the number of objects in the PDF, the location of the root object (Catalog), and the Info object containing metadata. The trailer also specifies the location of the xref table.

How others do it

Looking at altrnative software designs and implementations can help to improve our choices.

fpdf2

fpdf2 has a PDFObject class with a serialize method which roughly maps to pypdf.PdfObject.write_to_stream. Some other similarities include:

pdfrw

pdfrw, in contrast, seems to work more with the standard Python objects (bool, float, string) and not wrap them in custom objects, if possible. It still has:

The core classes of pdfrw are PdfReader and PdfWriter