How pypdf parses PDF files
Finding and reading the cross-reference tables / trailer: The cross-reference table (xref table) is a table of byte offsets that indicate the locations of objects within the file. The trailer provides additional information such as the root object (Catalog) and the Info object containing metadata.
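As a rough illustration of this step, the sketch below builds a tiny PDF in memory, locates the `startxref` pointer at the end of the file, and reads a classic xref table plus the `/Root` entry of the trailer. It is not pypdf's implementation: the helper names (`build_minimal_pdf`, `find_startxref`, `read_xref_table`, `read_root_reference`) are hypothetical, only the Python standard library is used, and real files may also contain cross-reference streams, incremental updates, or damaged offsets that this sketch ignores.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny PDF in memory, recording each object's byte offset."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always a free entry
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        raise ValueError("startxref not found")
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        raise ValueError("malformed startxref")
    return int(match.group(1))


def read_xref_table(data: bytes, xref_pos: int) -> dict:
    """Map object numbers to byte offsets for the in-use ('n') entries."""
    lines = data[xref_pos:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("no classic xref table at this offset")
    table = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        first, count = (int(x) for x in lines[i].split())  # subsection header
        for k in range(count):
            offset, _generation, kind = lines[i + 1 + k].split()
            if kind == b"n":
                table[first + k] = int(offset)
        i += 1 + count
    return table


def read_root_reference(data: bytes, xref_pos: int) -> tuple:
    """Return (object number, generation) of the trailer's /Root entry."""
    trailer = data[data.index(b"trailer", xref_pos):]
    match = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", trailer)
    if match is None:
        raise ValueError("trailer has no /Root entry")
    return int(match.group(1)), int(match.group(2))


pdf = build_minimal_pdf()
xref_pos = find_startxref(pdf)
offsets = read_xref_table(pdf, xref_pos)
root = read_root_reference(pdf, xref_pos)
print(offsets, root)
```

Reading from the end of the file first is what makes random access possible: once the offsets are known, any object can be located without scanning the whole document.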
Parsing the objects: After locating the xref table and the trailer, pypdf proceeds to parse the objects in the PDF. Objects in a PDF can be of various types such as dictionaries, arrays, streams, and simple data types (e.g., integers, strings). pypdf parses these objects and stores them in
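To make the object types above more concrete, here is a minimal sketch of a parser for a small subset of PDF syntax (dictionaries, arrays, names, numbers, simple literal strings, booleans, null, and indirect references), together with a cache so each indirect object is parsed only once. Everything in it (`TOKEN`, `Ref`, `parse`, `ObjectStore`) is hypothetical and written for illustration. It does not reflect how pypdf structures its own parser, and it skips streams, hex strings, nested parentheses, and comments.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for an indirect reference such as "2 0 R".
Ref = namedtuple("Ref", "num gen")

TOKEN = re.compile(
    rb"<<|>>|\[|\]"                    # dictionary and array delimiters
    rb"|/[^\s<>\[\]/()]*"              # names
    rb"|\((?:[^()\\]|\\.)*\)"          # literal strings without nesting
    rb"|[+-]?(?:\d+\.\d*|\.\d+|\d+)"   # reals and integers
    rb"|[A-Za-z]+"                     # keywords: R, true, false, null
)


def parse(tokens, pos=0):
    """Parse one object starting at tokens[pos]; return (value, next_pos)."""
    tok = tokens[pos]
    if tok == b"<<":
        result, pos = {}, pos + 1
        while tokens[pos] != b">>":
            key = tokens[pos].decode()
            result[key], pos = parse(tokens, pos + 1)
        return result, pos + 1
    if tok == b"[":
        items, pos = [], pos + 1
        while tokens[pos] != b"]":
            item, pos = parse(tokens, pos)
            items.append(item)
        return items, pos + 1
    if tok.startswith(b"/"):
        return tok.decode(), pos + 1
    if tok.startswith(b"("):
        return tok[1:-1].decode("latin-1"), pos + 1
    if tok in (b"true", b"false"):
        return tok == b"true", pos + 1
    if tok == b"null":
        return None, pos + 1
    # "2 0 R" is one indirect reference, not two integers and a keyword.
    if (tok.isdigit() and pos + 2 < len(tokens)
            and tokens[pos + 1].isdigit() and tokens[pos + 2] == b"R"):
        return Ref(int(tok), int(tokens[pos + 1])), pos + 3
    text = tok.decode()
    return (float(text) if "." in text else int(text)), pos + 1


class ObjectStore:
    """Parse indirect objects on demand and remember the results."""

    def __init__(self, data, offsets):
        self.data, self.offsets = data, offsets
        self.cache = {}

    def get(self, num, gen=0):
        key = (num, gen)
        if key not in self.cache:
            start = self.offsets[num]
            end = self.data.index(b"endobj", start)
            body = self.data[start:end].split(b"obj", 1)[1]
            self.cache[key], _ = parse(TOKEN.findall(body))
        return self.cache[key]


data = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 /Title (Hi) >>\nendobj\n"
)
offsets = {1: 0, 2: data.index(b"2 0 obj")}
store = ObjectStore(data, offsets)
root = store.get(1)
pages = store.get(*root["/Pages"])  # follow the indirect reference
print(root, pages)
```

Keeping references as lightweight placeholders and resolving them through a cache avoids parsing the same object twice and copes with objects that refer to each other.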
Decoding content streams: The content of a PDF is typically stored in content streams, which are sequences of PDF operators and operands. pypdf decodes these content streams by applying filters (e.g., LZWDecode) specified in the stream’s dictionary. This is only done when the object is requested via
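The sketch below illustrates the idea of lazy, filter-driven decoding using only the Python standard library. It is an assumption-laden example, not pypdf code: `LazyStream`, `DECODERS`, and `ascii_hex_decode` are hypothetical names, and the filters shown are `FlateDecode` and `ASCIIHexDecode` (both defined by the PDF specification) rather than `LZWDecode`, because the standard library has no LZW codec.

```python
import binascii
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: hex digits, optional whitespace, '>' as end marker."""
    digits = b"".join(data.split(b">", 1)[0].split())
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is treated as if followed by 0
    return binascii.unhexlify(digits)


# Hypothetical registry mapping filter names to decoder functions.
DECODERS = {
    "/FlateDecode": zlib.decompress,
    "/ASCIIHexDecode": ascii_hex_decode,
}


class LazyStream:
    """Hold the raw stream bytes and decode them only on first access."""

    def __init__(self, dictionary, raw):
        self.dictionary = dictionary
        self.raw = raw
        self._decoded = None

    def get_data(self) -> bytes:
        if self._decoded is None:
            filters = self.dictionary.get("/Filter", [])
            if isinstance(filters, str):  # a single filter may be given as a bare name
                filters = [filters]
            data = self.raw
            for name in filters:  # filters are listed in decoding order
                if name not in DECODERS:
                    raise NotImplementedError(f"unsupported filter {name}")
                data = DECODERS[name](data)
            self._decoded = data
        return self._decoded


# A short content stream: operands come first, operators (BT, Tf, Td, Tj, ET) follow.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
raw = binascii.hexlify(zlib.compress(content)) + b">"
stream = LazyStream(
    {"/Filter": ["/ASCIIHexDecode", "/FlateDecode"], "/Length": len(raw)}, raw
)
print(stream.get_data())
```

Deferring the work until `get_data()` is called means that opening a large file does not pay the decompression cost for streams that are never read.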
References: sections of the PDF 1.7 specification
7.5 File Structure
7.5.4 Cross-Reference Table
7.8 Content Streams and Resources