Extract Images
Note
In order to use the following code you need to install optional dependencies, see installation guide.
Every page of a PDF document can contain an arbitrary number of images. The names of the files may not be unique.
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
for i, image_file_object in enumerate(page.images):
file_name = "out-image-" + str(i) + "-" + image_file_object.name
image_file_object.image.save(file_name)
Other images
Some other objects can contain images, such as stamp annotations.
You can extract the image from the annotation with the following code:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
im = (
reader.pages[0]["/Annots"][4]["/Parent"]
.get_object()["/AP"]["/N"]["/Resources"]["/XObject"]["/Im4"]
.decode_as_image()
)
im.save("out-annotation-image.png")
Error handling
Iterating over page.images directly will raise an exception on the first issue.
If you expect some more or less broken PDF files, but still want to retrieve as many images as possible,
consider making this a multistep process:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
for page in reader.pages:
for name in page.images.keys():
try:
# Try to retrieve actual image.
image = page.images[name]
except Exception as exception:
# Handle exceptions.
pass