How pypdf writes PDF files
pypdf uses PdfWriter to write PDF files. pypdf has PdfObject and several subclasses with the write_to_stream method. The PdfWriter.write method uses the write_to_stream methods of the referenced objects.
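The idea that every object knows how to serialize itself can be illustrated with a minimal sketch. The classes below (ToyObject, ToyName, ToyNumber, ToyDictionary) and the serialize helper are hypothetical stand-ins, not pypdf's actual classes; only the method name write_to_stream is taken from pypdf, and its real signature may differ.

```python
import io


class ToyObject:
    """Hypothetical base class: every object can write itself to a stream."""

    def write_to_stream(self, stream):
        raise NotImplementedError


class ToyName(ToyObject):
    def __init__(self, name):
        self.name = name

    def write_to_stream(self, stream):
        # PDF names start with a slash, e.g. /Type
        stream.write(b"/" + self.name.encode("ascii"))


class ToyNumber(ToyObject):
    def __init__(self, value):
        self.value = value

    def write_to_stream(self, stream):
        stream.write(str(self.value).encode("ascii"))


class ToyDictionary(ToyObject):
    def __init__(self, items):
        self.items = items  # maps ToyName -> ToyObject

    def write_to_stream(self, stream):
        # A container delegates to the write_to_stream methods of its children.
        stream.write(b"<<\n")
        for key, value in self.items.items():
            key.write_to_stream(stream)
            stream.write(b" ")
            value.write_to_stream(stream)
            stream.write(b"\n")
        stream.write(b">>")


def serialize(obj):
    """Hypothetical helper: collect the bytes an object writes."""
    buffer = io.BytesIO()
    obj.write_to_stream(buffer)
    return buffer.getvalue()


page_like = ToyDictionary(
    {ToyName("Type"): ToyName("Page"), ToyName("Rotate"): ToyNumber(90)}
)
print(serialize(page_like).decode("ascii"))
```

The writer only needs to call write_to_stream on the top-level objects; nested objects are reached through delegation.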
The PdfWriter.write_stream method has the following core steps:
1. _sweep_indirect_references: This step ensures that any circular references to objects are correctly handled. It adds the object reference numbers of any circularly referenced objects to an external reference map, so that self-page-referencing trees can reference the correct new object location, rather than copying in a new copy of the page object.
2. Write the File Header and Body with _write_pdf_structure: In this step, the PDF header and objects are written to the output stream. This includes the PDF version (e.g., %PDF-1.7) and the objects that make up the content of the PDF, such as pages, annotations, and form fields. The locations (byte offsets) of these objects are stored for later use in generating the xref table.
3. Write the Cross-Reference Table with _write_xref_table: Using the stored object locations, this step generates and writes the cross-reference table (xref table) to the output stream. The cross-reference table contains the byte offsets for each object in the PDF file, allowing for quick random access to objects when reading the PDF.
4. Write the File Trailer with _write_trailer: The trailer is written to the output stream in this step. The trailer contains essential information, such as the number of objects in the PDF, the location of the root object (Catalog), and the Info object containing metadata. The trailer also specifies the location of the xref table.
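The header, body, xref table, and trailer steps above can be sketched as a small standalone function. This is a simplified illustration of the file layout, not pypdf's implementation: the function write_toy_pdf and its parameters are hypothetical, objects are passed in as already-serialized bytes, and the reference-sweeping step, the Info object, and encryption are left out.

```python
import io


def write_toy_pdf(stream, objects, root_id):
    """Write `objects` (a list of serialized bodies; object ids are 1-based)."""
    # Header: the PDF version line.
    stream.write(b"%PDF-1.7\n")

    # Body: write each object and remember the byte offset where it starts.
    offsets = []
    for object_id, body in enumerate(objects, start=1):
        offsets.append(stream.tell())
        stream.write(b"%d 0 obj\n" % object_id)
        stream.write(body)
        stream.write(b"\nendobj\n")

    # Cross-reference table: one 20-byte entry per object, plus entry 0,
    # which heads the list of free objects.
    xref_location = stream.tell()
    stream.write(b"xref\n")
    stream.write(b"0 %d\n" % (len(objects) + 1))
    stream.write(b"%010d %05d f \n" % (0, 65535))
    for offset in offsets:
        stream.write(b"%010d %05d n \n" % (offset, 0))

    # Trailer: the object count (/Size), the catalog (/Root), and finally
    # the location of the xref table after the startxref keyword.
    stream.write(b"trailer\n")
    stream.write(
        b"<<\n/Size %d\n/Root %d 0 R\n>>\n" % (len(objects) + 1, root_id)
    )
    stream.write(b"startxref\n%d\n%%%%EOF\n" % xref_location)
    return offsets, xref_location


# Usage: a catalog and an empty page tree, written to an in-memory buffer.
toy_objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
]
buffer = io.BytesIO()
offsets, xref_location = write_toy_pdf(buffer, toy_objects, root_id=1)
data = buffer.getvalue()
```

Recording each offset with stream.tell() before writing the object is what makes the later xref step possible without a second pass over the output.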
How others do it
Looking at alternative software designs and implementations can help to improve our choices.
fpdf2
fpdf2 has a PDFObject class with a serialize method which roughly maps to pypdf.PdfObject.write_to_stream.
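The rough correspondence between the two designs can be sketched with two toy classes. Both classes and the adapter function are hypothetical and are not code from fpdf2 or pypdf; the sketch assumes that a serialize-style method returns the serialized text, while a write_to_stream-style method writes bytes into a stream supplied by the caller.

```python
import io


class SerializeStyleObject:
    """Hypothetical object that returns its serialized form as text."""

    def __init__(self, content):
        self.content = content

    def serialize(self):
        return self.content


class StreamStyleObject:
    """Hypothetical object that writes its serialized form into a stream."""

    def __init__(self, content):
        self.content = content

    def write_to_stream(self, stream):
        stream.write(self.content.encode("latin-1"))


def write_serialized(obj, stream):
    # Adapter: turn a serialize-style object into a stream write.
    stream.write(obj.serialize().encode("latin-1"))


first = io.BytesIO()
write_serialized(SerializeStyleObject("<< /Type /Page >>"), first)

second = io.BytesIO()
StreamStyleObject("<< /Type /Page >>").write_to_stream(second)
```

In this sketch both styles end up with the same bytes; the difference is only in who owns the output buffer.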
Some other similarities include:
pdfrw
pdfrw, in contrast, seems to work more with standard Python objects (bool, float, string) and, where possible, does not wrap them in custom objects. It still has: