The PdfReader Class

Bases: PdfDocCommon

Initialize a PdfReader object.

This operation can take some time, as the PDF stream’s cross-reference tables are read into memory.

Parameters:

stream – A File object or an object that supports the standard read and seek methods similar to a File object. Could also be a string representing a path to a PDF file.
strict – Determines whether the user should be warned of all problems; it also causes some correctable problems to be fatal. Defaults to False.
password – Decrypt PDF file at initialization. If the password is None, the file will not be decrypted. Defaults to None.
root_object_recovery_limit – The maximum number of objects to query for recovering the Root object in non-strict mode. To disable this security measure, pass None.

strict: bool = False

flattened_pages: list[PageObject] | None = None

resolved_objects: dict[tuple[Any, Any], PdfObject | None]: Storage of parsed PDF objects.

close() → None: Close the stream if opened in __init__ and clear memory.

property root_object: DictionaryObject: Provide access to “/Root”. Standardized with PdfWriter.

property pdf_header: str

The first 8 bytes of the file.

This is typically something like '%PDF-1.6' and can be used to detect if the file is actually a PDF file and which version it is.

property xmp_metadata: XmpInformation | None: XMP (Extensible Metadata Platform) data.

get_object(indirect_reference: int | IndirectObject) → PdfObject | None

read_object_header(stream: IO[Any]) → tuple[int, int]

cache_get_indirect_object(generation: int, idnum: int) → PdfObject | None

cache_indirect_object(generation: int, idnum: int, obj: PdfObject | None) → PdfObject | None

read(stream: IO[Any]) → None

Read and process the PDF stream, extracting necessary data.

Parameters: stream – The PDF file stream.

decrypt(password: str | bytes) → PasswordType

When using an encrypted / secured PDF file with the PDF Standard encryption handler, this function will allow the file to be decrypted. It checks the given password against the document’s user password and owner password, and then stores the resulting decryption key if either password is correct.

It does not matter which password was matched. Both passwords provide the correct decryption key that will allow the document to be used with this library.

Parameters: password – The password to match.
Returns: An indicator of whether the document was decrypted and whether it was the owner password or the user password that matched.

property is_encrypted: bool

Read-only boolean property showing whether this PDF file is encrypted.

Note that this property, if true, will remain true even after the decrypt() method is called.

add_form_topname(name: str) → DictionaryObject | None

Add a top level form that groups all form fields below it.

Parameters: name – text string of the “/T” attribute of the created object
Returns: The created object. None means no object was created.

rename_form_topname(name: str) → DictionaryObject | None

Rename the top level form field that groups all form fields below it.

Parameters: name – text string of the “/T” field of the created object
Returns: The modified object. None means no object was modified.

property are_permissions_valid: bool | None

Whether the /Perms integrity check passed for this document.

For AES-256 encrypted documents (R=5/R=6), the /Perms field is an encrypted copy of the permissions that can be verified independently. Returns False if this check fails (the /P permissions may have been tampered with).

Returns None if the document is not encrypted or has not yet been decrypted via decrypt(). Returns True for non-AES-256 encryption (no /Perms to check).

property attachment_list: Generator[EmbeddedFile, None, None]: Iterable of attachment objects.

property attachments: Mapping[str, list[bytes]]: Mapping of attachment filenames to their content.

decode_permissions(permissions_code: int) → NoReturn: Take the permissions as an integer, return the allowed access.

get_destination_page_number(destination: Destination) → int | None

Retrieve page number of a given Destination object.

Parameters: destination – The destination whose page number is wanted.
Returns: The page number, or None if the page is not found.

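get_fields(tree: TreeObject | None = None, retval: dict[Any, Any] | None = None, fileobj: Any | None = None, stack: list[PdfObject] | None = None) → dict[str, Any] | None

Extract field data if this PDF contains interactive form fields.

The tree, retval, and stack parameters are for recursive use.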

Parameters:

tree – Current object to parse.
retval – In-progress list of fields.
fileobj – A file object (usually a text file) to write a report to on all interactive form fields found.
stack – List of already parsed objects.

Returns:

A dictionary where each key is a field name, and each value is a Field object. By default, the mapping name is used for keys. None if form data could not be located.

get_form_text_fields(full_qualified_name: bool = False) → dict[str, Any]

Retrieve form fields from the document with textual data.

Parameters:

full_qualified_name – If True, use the fully qualified field name as the key.

Returns:

A dictionary. The key is the name of the form field, the value is the content of the field.

If the document contains multiple form fields with the same name, the second and following will get the suffix .2, .3, …

get_named_dest_root() → ArrayObject

get_num_pages() → int

Calculate the number of pages in this PDF file.

Returns: The number of pages of the parsed PDF file.
Raises: PdfReadError – If restrictions prevent this action.

get_page(page_number: int) → PageObject

Retrieve a page by number from this PDF file. Most of the time .pages[page_number] is preferred.

Parameters: page_number – The page number to retrieve (pages begin at zero).
Returns: A PageObject instance.

get_page_number(page: PageObject) → int | None

Retrieve page number of a given PageObject.

Parameters: page – The page whose number is wanted. Should be an instance of PageObject.
Returns: The page number, or None if the page is not found.

get_pages_showing_field(field: Field | PdfObject | IndirectObject) → list[PageObject]

Provide the list of pages on which the field is shown.

Parameters:

field – Field Object, PdfObject or IndirectObject referencing a Field

Returns:

List of pages:
- Empty list: The field has no widgets attached (either a hidden field or an ancestor field).
- Single-page list: Page where the widget is present (most common).
- Multi-page list: Field with multiple kid widgets (for example, radio buttons or a field repeated on multiple pages).

property metadata: DocumentInformation | None

Retrieve the PDF file’s document information dictionary, if it exists.

Note that some PDF files use metadata streams instead of document information dictionaries, and these metadata streams will not be accessed by this function.

property named_destinations: dict[str, Destination]: A read-only dictionary which maps names to destinations.

property open_destination: None | Destination | TextStringObject | ByteStringObject

Property to access the opening destination (/OpenAction entry in the PDF catalog). It returns None if the entry does not exist or is not set.

Raises: Exception – If a destination is invalid.

property outline: list[Destination | list[Destination | list[Destination]]]: Read-only property for the outline present in the document (i.e., a collection of ‘outline items’ which are also known as ‘bookmarks’).

property page_labels: list[str]

A list of labels for the pages in this document.

This property is read-only. The labels are in the order that the pages appear in the document.

property page_layout: str | None

Get the page layout currently being used.

Valid `layout` values
/NoLayout	Layout explicitly not specified
/SinglePage	Show one page at a time
/OneColumn	Show one column at a time
/TwoColumnLeft	Show pages in two columns, odd-numbered pages on the left
/TwoColumnRight	Show pages in two columns, odd-numbered pages on the right
/TwoPageLeft	Show two pages at a time, odd-numbered pages on the left
/TwoPageRight	Show two pages at a time, odd-numbered pages on the right

property page_mode: Literal['/UseNone', '/UseOutlines', '/UseThumbs', '/FullScreen', '/UseOC', '/UseAttachments'] | None

Get the page mode currently being used.

Valid `mode` values
/UseNone	Do not show outline or thumbnails panels
/UseOutlines	Show outline (aka bookmarks) panel
/UseThumbs	Show page thumbnails panel
/FullScreen	Fullscreen view
/UseOC	Show Optional Content Group (OCG) panel
/UseAttachments	Show attachments panel

property pages: list[PageObject]: Property that emulates a list of PageObject instances. It supports retrieving a single page by index or a range of pages by slice.

Note

For PdfWriter only: Provides the capability to remove a page or a range of pages from the list (using the del operator). Remember: Only the page entry is removed, as the objects beneath can be used elsewhere. To remove them completely, provided they are not used anywhere else, write to a buffer or temporary file and then load it into a new PdfWriter.

remove_page(page: int | PageObject | IndirectObject, clean: bool = False) → None

Remove page from pages list.

Parameters:

page –
- int: Page number to be removed.
- PageObject: page to be removed. If the page appears many times only the first one will be removed.
- IndirectObject: Reference to page to be removed.
clean – Replace PageObject with NullObject to prevent annotations or destinations from referencing a detached page.

property threads: ArrayObject | None

Read-only property for the list of threads.

See §12.4.3 from the PDF 1.7 or 2.0 specification.

It is an array of dictionaries, each with a “/F” entry (the first bead in the thread) and an “/I” entry (a thread information dictionary containing information about the thread, such as its title, author, and creation date). The property is None if there are no articles.

Since PDF 2.0 it can also contain an indirect reference to a metadata stream containing information about the thread, such as its title, author, and creation date.

property user_access_permissions: UserAccessPermissions | None: Get the user access permissions for encrypted documents. Returns None if not encrypted.

Warning

For AES-256 encrypted documents (R=5/R=6), the returned permissions are derived from the /P field, which is only trustworthy if the /Perms integrity check passed. Check are_permissions_valid to verify.

property viewer_preferences: ViewerPreferences | None: Returns the existing ViewerPreferences as an overloaded dictionary.

property xfa: dict[str, Any] | None

class pypdf.PasswordType(*values)

Bases: IntEnum

NOT_DECRYPTED = 0

USER_PASSWORD = 1

OWNER_PASSWORD = 2