Interactions with PDF Forms
Reading form fields
from pypdf import PdfReader
reader = PdfReader("form.pdf")
fields = reader.get_form_text_fields()
fields == {"key": "value", "key2": "value2"}
# You can also get all fields:
fields = reader.get_fields()
Filling out forms
from pypdf import PdfReader, PdfWriter
reader = PdfReader("form.pdf")
writer = PdfWriter()
page = reader.pages[0]
fields = reader.get_fields()
writer.append(reader)
writer.update_page_form_field_values(
writer.pages[0],
{"fieldname": "some filled in text"},
auto_regenerate=False,
)
with open("filled-out.pdf", "wb") as output_stream:
writer.write(output_stream)
Generally speaking, you will always want to use auto_regenerate=False
. The
parameter is True
by default for legacy compatibility, but this flags the PDF
processor to recompute the field’s rendering, and may trigger a “save changes”
dialog for users who open the generated PDF.
Some notes about form fields and annotations
PDF forms have a dual-nature approach about the fields:
Within the root object, an
/AcroForm
structure exists. Inside it you could find (optional):some global elements (Fonts, Resources,…)
some global flags (like
/NeedAppearances
(set/cleared withauto_regenerate
parameter inupdate_page_form_field_values()
) that indicates if the reading program should re-render the visual fields upon document launch)/XFA
that houses a form in XDP format (very specific XML that describes the form rendered by some viewers); the/XFA
form overrides the page content/Fields
that houses an array of indirect references that reference the upper Field Objects (roots)
Within the page
/Annots
, you will spot/Widget
annotations that define the visual rendering.
To flesh out this overview:
The core specific properties of a field are:
/FT
: The field type (Button, Text, Choice, or Signature)./T
: The partial field name./V
: The field’s value, whose format varies depending on the field type./DV
: The default value to which the field reverts when a reset-form action is executed.
In order to streamline readability, Field Objects and Widget Objects can be fused housing all properties.
Fields can be organised hierarchically, id est one field can be placed under another. In such instances, the
/Parent
will have an IndirectObject providing Bottom-Up links and/Kids
is an array carrying IndirectObjects for Top-Down navigation; Widget Objects are still required for visual rendering. To call upon them, use the fully qualified field name (where all the individual names of the parent objects are separated by.
)For instance take two (visual) fields both called city, but attached below sender and receiver; the corresponding full names will be sender.city and receiver.city.
When a field is repeated on multiple pages, the Field Object will have many Widget Objects in
/Kids
. These objects are pure widgets, containing no field specific data.If Fields stores only hidden values, no Widgets are required.
In pypdf fields are extracted from the /Fields
array:
from pypdf import PdfReader
reader = PdfReader("form.pdf")
fields = reader.get_fields()
from pypdf import PdfReader
from pypdf.constants import AnnotationDictionaryAttributes
reader = PdfReader("form.pdf")
fields = []
for page in reader.pages:
for annot in page.annotations:
annot = annot.get_object()
if annot[AnnotationDictionaryAttributes.Subtype] == "/Widget":
fields.append(annot)
However, while similar, there are some very important differences between the two above blocks of code. Most importantly, the first block will return a list of Field objects, whereas the second will return more generic dictionary-like objects. The objects lists will mostly reference the same object in the underlying PDF, meaning you’ll find that obj_taken_fom_first_list.indirect_reference == obj_taken_from _second_list.indirect_reference
. Field objects are generally more ergonomic, as the exposed data can be accessed via clearly named properties. However, the more generic dictionary-like objects will contain data that the Field object does not expose, such as the Rect (the widget’s position on the page). Therefore the correct approach depends on your use case.
However, it is also important to note that the two lists do not always refer to the same underlying PDF object. For example, if the form contains radio buttons, you will find that reader.get_fields()
will get the parent object (the group of radio buttons) whereas page.annotations
will return all the child objects (the individual radio buttons).
Note
Remember that fields are not stored in pages; if you use add_page()
the field structure is not copied. It is recommended to use .append()
with the proper parameters instead.
In case of missing field objects in /Fields
, writer.reattach_fields()
will parse page(s) annotations and will reattach them. This fix cannot guess intermediate fields and will not report fields using the same name.
Identify pages where fields are used
In order to ease locating page fields you can use get_pages_showing_field
of PdfReader or PdfWriter. This method accepts a field object, a PdfObject that represents a field (as extracted from _root_object["/AcroForm"]["/Fields"]
). The method returns a list of pages, because a field can have multiple widgets as mentioned previously (e.g. radio buttons or text displayed on multiple pages).
The page numbers can then be retrieved as usual by using page.page_number
.