The PdfReader Class

class pypdf.PdfReader(stream: Union[str, IO, Path], strict: bool = False, password: Union[None, str, bytes] = None)[source]

Bases: object

Initialize a PdfReader object.

This operation can take some time, as the PDF stream’s cross-reference tables are read into memory.

Parameters

stream – A file object, or any object that supports the standard read and seek methods of a file object. May also be a string or Path giving the path to a PDF file.
strict – Whether the user should be warned of all problems. When True, some correctable problems also become fatal. Defaults to False.
password – Password used to decrypt the PDF file at initialization. If None, the file is not decrypted. Defaults to None.

property pdf_header: str

The first 8 bytes of the file.

This is typically something like '%PDF-1.6' and can be used to detect if the file is actually a PDF file and which version it is.
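As a rough illustration of that check, the sketch below parses a header string of the form shown above. The helper name `parse_pdf_version` is hypothetical and not part of pypdf; it is meant to be called with the value of `reader.pdf_header`.

```python
from typing import Optional


def parse_pdf_version(header: str) -> Optional[str]:
    """Return the version part of a header such as '%PDF-1.6', or None.

    Hypothetical helper, not part of pypdf; pass it reader.pdf_header.
    """
    prefix = "%PDF-"
    if not header.startswith(prefix):
        return None  # does not look like a PDF header
    version = header[len(prefix):].strip()
    return version or None  # an empty version is treated as "unknown"
```

A None result suggests the file is not a PDF, or that its header is damaged.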

property metadata: Optional[DocumentInformation]: Retrieve the PDF file’s document information dictionary, if it exists. Note that some PDF files use metadata streams instead of docinfo dictionaries, and these metadata streams will not be accessed by this function.
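Because the property may be None, callers usually need a guard before reading individual entries. A minimal sketch, assuming `reader` is a PdfReader and that the returned DocumentInformation exposes `title`, `author` and `producer` attributes; the helper name `describe_metadata` is hypothetical:

```python
from typing import Dict, Optional


def describe_metadata(reader) -> Dict[str, Optional[str]]:
    """Collect a few docinfo entries, tolerating a missing dictionary.

    Hypothetical helper; it only reads the documented `metadata` property.
    """
    info = reader.metadata
    if info is None:
        return {}  # the file has no document information dictionary
    # Assumed attributes of DocumentInformation; each may be None
    # when the corresponding entry is absent.
    return {
        "title": info.title,
        "author": info.author,
        "producer": info.producer,
    }
```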

getDocumentInfo() → Optional[DocumentInformation][source]: Deprecated since version 1.28.0: Use the attribute metadata instead.

property documentInfo: Optional[DocumentInformation]: Deprecated since version 1.28.0.

Use the attribute metadata instead.

property xmp_metadata: Optional[XmpInformation]: XMP (Extensible Metadata Platform) data.

getXmpMetadata() → Optional[XmpInformation][source]: Deprecated since version 1.28.0: Use the attribute xmp_metadata instead.

property xmpMetadata: Optional[XmpInformation]: Deprecated since version 1.28.0.

Use the attribute xmp_metadata instead.

getNumPages() → int[source]: Deprecated since version 1.28.0: Use len(reader.pages) instead.

property numPages: int: Deprecated since version 1.28.0.

Use len(reader.pages) instead.

getPage(pageNumber: int) → PageObject[source]: Deprecated since version 1.28.0: Use reader.pages[page_number] instead.

property namedDestinations: Dict[str, Any]: Deprecated since version 1.28.0.

Use named_destinations instead.

property named_destinations: Dict[str, Any]: A read-only dictionary that maps names to Destination objects.

get_fields(tree: Optional[TreeObject] = None, retval: Optional[Dict[Any, Any]] = None, fileobj: Optional[Any] = None) → Optional[Dict[str, Any]][source]

Extract field data if this PDF contains interactive form fields.

The tree and retval parameters are for recursive use.

Parameters

tree – Used internally for recursion.
retval – Used internally for recursion.
fileobj – A file object (usually a text file) to write a report to on all interactive form fields found.

Returns

A dictionary where each key is a field name, and each value is a Field object. By default, the mapping name is used for keys. None if form data could not be located.
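Since the return value can be None, a caller typically normalises it before use. The sketch below does that and reduces each Field to its value. It assumes `reader` is a PdfReader and that Field objects expose a `value` attribute; `field_values` is a hypothetical name.

```python
from typing import Any, Dict


def field_values(reader) -> Dict[str, Any]:
    """Map each form field name to its current value.

    Hypothetical helper built on the documented get_fields() method.
    """
    fields = reader.get_fields()
    if fields is None:
        return {}  # no form data could be located
    # Assumption: each Field object has a `value` attribute.
    return {name: field.value for name, field in fields.items()}
```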

getFields(tree: Optional[TreeObject] = None, retval: Optional[Dict[Any, Any]] = None, fileobj: Optional[Any] = None) → Optional[Dict[str, Any]][source]: Deprecated since version 1.28.0: Use get_fields() instead.

get_form_text_fields(full_qualified_name: bool = False) → Dict[str, Any][source]

Retrieve form fields from the document with textual data.

The key is the name of the form field, the value is the content of the field.

If the document contains multiple form fields with the same name, the second and following will get the suffix .2, .3, …

Set full_qualified_name to True to get the fully qualified name of each field.
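One possible use of this method is to list only the text fields that have been filled in. A minimal sketch, assuming `reader` is a PdfReader; the helper name `filled_text_fields` is hypothetical:

```python
from typing import Any, Dict


def filled_text_fields(reader, full_qualified_name: bool = False) -> Dict[str, Any]:
    """Return the text fields whose content is neither None nor empty.

    Hypothetical helper built on the documented get_form_text_fields().
    """
    text_fields = reader.get_form_text_fields(full_qualified_name=full_qualified_name)
    # Keep only entries with a truthy value (drops None and "").
    return {name: value for name, value in text_fields.items() if value}
```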

getFormTextFields() → Dict[str, Any][source]: Deprecated since version 1.28.0: Use get_form_text_fields() instead.

getNamedDestinations(tree: Optional[TreeObject] = None, retval: Optional[Any] = None) → Dict[str, Any][source]: Deprecated since version 1.28.0: Use named_destinations instead.

property outline: List[Union[Destination, List[Union[Destination, List[Destination]]]]]: Read-only property for the outline (i.e., a collection of ‘outline items’ which are also known as ‘bookmarks’) present in the document.
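The nested list type can be awkward to consume directly. The sketch below flattens it into (depth, title) pairs. It assumes that a nested list holds the children of the item just before it and that each Destination exposes a `title` attribute; `walk_outline` is a hypothetical name.

```python
from typing import Any, Iterator, List, Tuple


def walk_outline(items: List[Any], depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, title) for every outline item, depth-first.

    Hypothetical helper; call it as walk_outline(reader.outline).
    """
    for item in items:
        if isinstance(item, list):
            # Assumption: a nested list contains the children of the
            # preceding item, so descend one level.
            yield from walk_outline(item, depth + 1)
        else:
            yield depth, item.title
```

Printing each title indented by its depth gives a simple table of contents.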

property outlines: List[Union[Destination, List[Union[Destination, List[Destination]]]]]: Deprecated since version 2.9.0.

Use outline instead.

getOutlines(node: Optional[DictionaryObject] = None, outline: Optional[Any] = None) → List[Union[Destination, List[Union[Destination, List[Destination]]]]][source]: Deprecated since version 1.28.0: Use outline instead.

property threads: Optional[pypdf.generic._data_structures.ArrayObject]: Read-only property for the list of threads; see §8.3.2 of the PDF 1.7 specification. It is an array of dictionaries with “/F” and “/I” properties, or None if there are no articles.

get_page_number(page: PageObject) → int[source]

Retrieve the page number of a given PageObject.

Parameters: page – The page whose number is wanted. Should be an instance of PageObject.
Returns: The page number, or -1 if the page is not found.

getPageNumber(page: PageObject) → int[source]: Deprecated since version 1.28.0: Use get_page_number() instead.

get_destination_page_number(destination: Destination) → int[source]

Retrieve the page number of a given Destination object.

Parameters: destination – The destination whose page number is wanted.
Returns: The page number, or -1 if the page is not found.
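Combined with named_destinations, this method can resolve every named destination to a page number. A minimal sketch, assuming `reader` is a PdfReader; the helper name `named_destination_pages` is hypothetical:

```python
from typing import Dict


def named_destination_pages(reader) -> Dict[str, int]:
    """Map each named destination to the page number it points at.

    Hypothetical helper; destinations whose page cannot be found are skipped.
    """
    pages: Dict[str, int] = {}
    for name, destination in reader.named_destinations.items():
        number = reader.get_destination_page_number(destination)
        if number != -1:  # -1 signals that the page was not found
            pages[name] = number
    return pages
```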

getDestinationPageNumber(destination: Destination) → int[source]: Deprecated since version 1.28.0: Use get_destination_page_number() instead.

property pages: List[PageObject]: Read-only property that emulates a list of Page objects.
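The deprecated getNumPages() and getPage() calls map onto this property as `len(reader.pages)` and `reader.pages[page_number]`. The sketch below combines both in a bounds-checked lookup; `page_or_none` is a hypothetical name and `reader` is assumed to be a PdfReader.

```python
def page_or_none(reader, page_number: int):
    """Return reader.pages[page_number], or None if the index is out of range.

    Hypothetical helper; it uses only len() and indexing on reader.pages.
    """
    if 0 <= page_number < len(reader.pages):
        return reader.pages[page_number]
    return None
```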

property page_labels: List[str]

A list of labels for the pages in this document.

This property is read-only. The labels are in the order that the pages appear in the document.
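Because the labels follow page order, the position of a label in the list identifies its page. The sketch below builds a lookup from label to position. `label_to_index` is a hypothetical name, and the positions are simply list indices starting at 0.

```python
from typing import Dict


def label_to_index(reader) -> Dict[str, int]:
    """Map each page label to the index of the first page carrying it.

    Hypothetical helper; it only reads the documented page_labels property.
    """
    mapping: Dict[str, int] = {}
    for index, label in enumerate(reader.page_labels):
        # Labels are not guaranteed to be unique; keep the first occurrence.
        mapping.setdefault(label, index)
    return mapping
```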

property page_layout: Optional[str]

Get the page layout currently being used.

Valid `layout` values
/NoLayout	Layout explicitly not specified
/SinglePage	Show one page at a time
/OneColumn	Show one column at a time
/TwoColumnLeft	Show pages in two columns, odd-numbered pages on the left
/TwoColumnRight	Show pages in two columns, odd-numbered pages on the right
/TwoPageLeft	Show two pages at a time, odd-numbered pages on the left
/TwoPageRight	Show two pages at a time, odd-numbered pages on the right

getPageLayout() → Optional[str][source]: Deprecated since version 1.28.0: Use page_layout instead.

property pageLayout: Optional[str]: Deprecated since version 1.28.0.

Use page_layout instead.

property page_mode: Optional[typing_extensions.Literal[/UseNone, /UseOutlines, /UseThumbs, /FullScreen, /UseOC, /UseAttachments]]

Get the page mode currently being used.

Valid `mode` values
/UseNone	Do not show outline or thumbnails panels
/UseOutlines	Show outline (aka bookmarks) panel
/UseThumbs	Show page thumbnails panel
/FullScreen	Fullscreen view
/UseOC	Show Optional Content Group (OCG) panel
/UseAttachments	Show attachments panel

getPageMode() → Optional[Literal[/UseNone, /UseOutlines, /UseThumbs, /FullScreen, /UseOC, /UseAttachments]][source]: Deprecated since version 1.28.0: Use page_mode instead.

property pageMode: Optional[typing_extensions.Literal[/UseNone, /UseOutlines, /UseThumbs, /FullScreen, /UseOC, /UseAttachments]]: Deprecated since version 1.28.0.

Use page_mode instead.

get_object(indirect_reference: Union[int, IndirectObject]) → Optional[PdfObject][source]

getObject(indirectReference: IndirectObject) → Optional[PdfObject][source]: Deprecated since version 1.28.0: Use get_object() instead.

read_object_header(stream: IO) → Tuple[int, int][source]

readObjectHeader(stream: IO) → Tuple[int, int][source]: Deprecated since version 1.28.0: Use read_object_header() instead.

cache_get_indirect_object(generation: int, idnum: int) → Optional[PdfObject][source]

cacheGetIndirectObject(generation: int, idnum: int) → Optional[PdfObject][source]: Deprecated since version 1.28.0: Use cache_get_indirect_object() instead.

cache_indirect_object(generation: int, idnum: int, obj: Optional[PdfObject]) → Optional[PdfObject][source]

cacheIndirectObject(generation: int, idnum: int, obj: Optional[PdfObject]) → Optional[PdfObject][source]: Deprecated since version 1.28.0: Use cache_indirect_object() instead.

read(stream: IO) → None[source]

read_next_end_line(stream: IO, limit_offset: int = 0) → bytes[source]: Deprecated since version 2.1.0.

readNextEndLine(stream: IO, limit_offset: int = 0) → bytes[source]: Deprecated since version 1.28.0.

decrypt(password: Union[str, bytes]) → PasswordType[source]

When using an encrypted / secured PDF file with the PDF Standard encryption handler, this function will allow the file to be decrypted. It checks the given password against the document’s user password and owner password, and then stores the resulting decryption key if either password is correct.

It does not matter which password was matched. Both passwords provide the correct decryption key that will allow the document to be used with this library.

Parameters: password – The password to match.
Returns: A PasswordType.
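A typical calling pattern checks is_encrypted first and then inspects the result of decrypt(). The sketch below assumes that PasswordType is an integer enum whose "not decrypted" member equals 0; that detail is not stated in this reference, so treat it as an assumption. `unlock` is a hypothetical name.

```python
def unlock(reader, password: str) -> bool:
    """Try to decrypt `reader`; return True if it is usable afterwards.

    Hypothetical helper built on the documented is_encrypted and decrypt().
    """
    if not reader.is_encrypted:
        return True  # nothing to decrypt
    result = reader.decrypt(password)
    # Assumption: the returned PasswordType is falsy (0) when the password
    # matched neither the user password nor the owner password.
    return bool(result)
```

Either password is sufficient, so the helper does not distinguish between a user and an owner match.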

decode_permissions(permissions_code: int) → Dict[str, bool][source]

property is_encrypted: bool: Read-only boolean property showing whether this PDF file is encrypted. Note that this property, if true, will remain true even after the decrypt() method is called.

getIsEncrypted() → bool[source]: Deprecated since version 1.28.0: Use is_encrypted instead.

property isEncrypted: bool: Deprecated since version 1.28.0.

Use is_encrypted instead.

property xfa: Optional[Dict[str, Any]]