PdfFileReader Python example

In this Python tutorial, we will discuss what PyPDF2 is, walk through the various methods of PdfFileReader, and work through PdfFileReader Python examples.

We will learn about the PdfFileReader class and methods. It is the class from the PyPDF2 module that is widely used to access & manipulate PDF files in Python.

PyPDF2 Python Library

  • Python is used for a wide variety of purposes & has libraries & classes for all kinds of tasks. One of these tasks is reading text from a PDF in Python.
  • PyPDF2 offers classes that help us read, merge, and write PDF files.
    • PdfFileReader is used to perform all the operations related to reading a file.
    • PdfFileMerger is used to merge multiple pdf files together.
    • PdfFileWriter is used to perform write operations on pdf.
  • All of these classes have various functions that help a programmer control & perform operations on a pdf.
  • The PyPDF2 release covered here went without updates for a long time after the Python 3.5 era, but it is still used to work with PDFs. In this tutorial, we will cover the PdfFileReader class in detail & point out which functions are deprecated or broken.

Read: PdfFileMerger Python examples

Install pypdf2 in python

To use the PyPDF2 library in Python, we first need to install it. Run the command below to install the PyPDF2 module on your system.

pip install PyPDF2
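
Before running the examples, it can help to confirm that the install worked. The following is a minimal sketch that uses only the standard library; the helper name is_installed is our own and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can find a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Print a hint instead of failing later with ModuleNotFoundError.
if not is_installed("PyPDF2"):
    print("PyPDF2 is missing; run: pip install PyPDF2")
```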

After reading this tutorial, you will have complete knowledge of each function in PdfFileReader class. Also, we will be demonstrating the examples for each function in PdfFileReader class.

PdfFileReader in Python

  • PdfFileReader in Python offers functions that help in reading & inspecting a pdf file. Using these functions you can look up document information, form fields, the page count, individual pages, the page mode, etc.
  • The first step is to import the PyPDF2 module:
import PyPDF2
  • The next step is to open the pdf file & keep the resulting file object in a variable. The second argument, rb, means read binary. We have used a pdf file named ‘sample’ & it is stored in the same directory as the main program.
pdfObj = open('sample', 'rb')
  • Next, the PdfFileReader class is used to read the file object that we just opened. It also accepts a few more arguments.
PyPDF2.PdfFileReader(
    stream, 
    strict=True, 
    warndest=None, 
    overwriteWarnings=True
    )
  • Here is the explanation of all four arguments:
    • stream: The file object that holds the pdf file (or a path to the file). In our case it is pdfObj.
    • strict: Determines whether the user is warned about all problems found while reading the pdf; it also makes some otherwise correctable problems fatal. By default it is True; set it to False to be more lenient.
    • warndest: Destination for logging warnings (default is sys.stderr).
    • overwriteWarnings: Determines whether to override Python’s warnings.py module with a custom implementation (default is True).
  • Here is the implementation of all the code mentioned above.
[Screenshot: PdfFileReader in Python]
  • This picture shows three things:
    1. The files are listed on the left side. ‘sample’ is the pdf file that we have used in this program.
    2. All the above code is in the centre.
    3. The terminal shows an error from our first attempt to run the program, so we installed the PyPDF2 module. When we run the program again nothing is printed, because so far we have only opened the file for reading.
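
Because the file is opened in binary mode, its first bytes can be checked before it is handed to PdfFileReader. Here is a small sketch under the assumption that the PDF header sits at the very start of the file; the helper name looks_like_pdf is ours, not a PyPDF2 function.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF header bytes."""
    # 'rb' yields bytes, so the comparison uses a bytes literal.
    # The with block closes the file even if reading fails.
    with open(path, "rb") as stream:
        return stream.read(5) == b"%PDF-"
```

Calling looks_like_pdf('sample') first can give a clearer message than a parser error when the wrong file is picked up.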

Read: PdfFileWriter Python Examples

PdfFileReader python example

In this section, we will cover all the functions of PdfFileReader class. Our approach would be to explain the function in the simplest way & to demonstrate an example for each. So let us see a few PdfFileReader python examples.

Get PDF information using PdfFileReader in Python

PdfFileReader provides a documentInfo property which gives us the information about a PDF file in Python.

  • It retrieves the pdf document information in a dictionary format, if it exists.
  • documentInfo is a property, not a method, so calling it as documentInfo() raises TypeError: 'DocumentInformation' object is not callable
  • If you are seeing the above error, simply remove the () from documentInfo.

Example:

Here is an example that uses the documentInfo property.

Code Snippet:

In this code, we have displayed the information of sample.pdf in Python.

import PyPDF2

pdfObj = open('sample', 'rb')

reader = PyPDF2.PdfFileReader(
    pdfObj,
    strict=True, 
    warndest=None, 
    overwriteWarnings=True
    )
print(reader.documentInfo)

Output:

In this output, you can notice that the information of sample.pdf is displayed in a dictionary format.

[Screenshot: PyPDF2 documentInfo]
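
Since documentInfo behaves like a dictionary whose keys start with a slash (for example /Title), it can be tidied into a plain dict. This is a minimal sketch; the helper name info_to_dict is our own, and it assumes reader is a PdfFileReader object created as shown above.

```python
def info_to_dict(reader):
    """Copy the document information into a plain dict with clean keys."""
    info = reader.documentInfo  # property: no parentheses
    if info is None:
        # Some PDFs carry no document information at all.
        return {}
    # Drop the leading "/" from each key and turn the values into strings.
    return {key.lstrip("/"): str(value) for key, value in info.items()}
```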

Get PDF information of a specific page using PdfFileReader in Python

PdfFileReader provides a method getDestinationPageNumber(destination) which returns the page number that a given destination in the PDF points to.

  • It takes a Destination object, not a page number, & returns the number of the page that the destination refers to, or -1 if the page is not found.
  • If you want to see the content of a particular page, use getPage() instead, which is covered later in this tutorial.
  • It is helpful only if you already have a destination, for example one returned by getNamedDestinations() or taken from the outline of the document.
  • The PyPDF2 release used here has a few bugs & broken functions. In our experience this works reliably only with python3.5 or below.
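
As a rough sketch of how this method is meant to be used, the named destinations of a document (see getNamedDestinations() later in this tutorial) can each be passed to getDestinationPageNumber(). The helper name named_destination_pages is our own, and given the bugs mentioned above the result may differ between versions.

```python
def named_destination_pages(reader):
    """Map each named destination to the page number it points to."""
    pages = {}
    for name, destination in reader.getNamedDestinations().items():
        # -1 is returned when the page cannot be found.
        pages[name] = reader.getDestinationPageNumber(destination)
    return pages
```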

Get field data from PDF using PdfFileReader in Python

PdfFileReader provides a method getFields(tree=None, retval=None, fileobj=None) which extracts field data from an interactive PDF in Python.

  • tree & retval parameters are for recursive use.
  • fileobj is an optional file object (usually a text file) to which a report on all the fields found is written.
  • This function extracts field data if the PDF contains interactive form fields.
  • Interactive forms are those in which users can fill in the information.
  • These interactive pdfs may not work if downloaded directly from the browser, so we have mentioned the python code below that can download interactive pdfs in a working state.

Code Snippet:

You can find the interactive forms over the internet & these can be downloaded by using the given code. Simply provide the path of the interactive pdf file. In our case, we are downloading it from https://royalegroupnyc.com

import urllib.request

pdf_path = "https://royalegroupnyc.com/wp-content/uploads/seating_areas/sample_pdf.pdf"

def download_file(download_url, filename):
    # Fetch the PDF and save it as <filename>.pdf in the current directory.
    with urllib.request.urlopen(download_url) as response:
        with open(filename + ".pdf", 'wb') as file:
            file.write(response.read())

# Save as interactivepdf.pdf, the name the reading code below opens.
download_file(pdf_path, "interactivepdf")

Here is the code snippet to read the interactive PDF in python.

import PyPDF2

pdfObj = open('interactivepdf.pdf', 'rb')

reader = PyPDF2.PdfFileReader(
    pdfObj,
    strict=True, 
    warndest=None, 
    overwriteWarnings=True
    )

print(reader.getFields())

Output:

In this output, you can notice that all the information is fetched in a dictionary format. If the PDF does not contain interactive fields, None is returned.

[Screenshot: PyPDF2 getFields, reading the interactive PDF in Python]
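
The dictionary returned by getFields() holds one entry per field with many details. If only the values are of interest, a small sketch like the following can reduce it to name and value pairs. The helper name field_values is ours; it assumes each field entry is dictionary-like and stores its value under the /V key, as the PDF format does.

```python
def field_values(reader):
    """Return {field name: field value} for an interactive PDF."""
    fields = reader.getFields()
    if fields is None:
        # getFields() returns None when the PDF has no form fields.
        return {}
    return {name: field.get("/V") for name, field in fields.items()}
```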

Get text data from fields in PDF using PdfFileReader in Python

PdfFileReader provides a method getFormTextFields() to extract text data from the interactive PDF in Python.

  • This function is used to retrieve the text data that is provided by the user in the interactive PDF in Python.
  • The data is returned in a dictionary format.
  • In case you are seeing the error TypeError: 'NoneType' object is not iterable, it means the pdf does not contain interactive form fields.
  • The major difference between getFields() and getFormTextFields() is that getFields() returns all the information about every field, whereas getFormTextFields() returns only the text fields together with the values entered in the interactive pdf.

Code Snippet:

In this code, we have used this function in the last line where it is displaying the output.

import PyPDF2

pdfObj = open('interactivepdf.pdf', 'rb')

reader = PyPDF2.PdfFileReader(
    pdfObj,
    strict=True, 
    warndest=None, 
    overwriteWarnings=True
    )

print(reader.getFormTextFields())

Output:

In this output, you can notice in the terminal section that Name has value None. This means that no value is passed in the PDF.

[Screenshot: PdfFileReader example]
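
The TypeError mentioned above can be avoided by checking getFields() first. This is a minimal sketch; the wrapper name safe_form_text_fields is our own and is not part of PyPDF2.

```python
def safe_form_text_fields(reader):
    """Return the text fields of a PDF, or {} if it has no form fields."""
    if reader.getFields() is None:
        # Without this check getFormTextFields() raises
        # TypeError: 'NoneType' object is not iterable
        return {}
    return reader.getFormTextFields()
```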

Get to the named Destinations in PDF using PdfFileReader in Python

PdfFileReader provides a method getNamedDestinations(tree=None, retval=None) to get the named destinations of a PDF in Python.

  • This function is used to retrieve the named destinations present in the document.
  • It returns an empty dictionary if no named destination is found.

Code Snippet:

In this code, this function is used in the last line. It is displaying the named destination present in the Smallpdf.pdf.

import PyPDF2

pdfObj = open('Smallpdf.pdf', 'rb')

reader = PyPDF2.PdfFileReader(
    pdfObj,
    strict=True, 
    warndest=None, 
    overwriteWarnings=True
    )

print(reader.getNamedDestinations())

Output:

In this output, you can notice on the terminal that empty curly braces are returned. It means that named destination is not present in Smallpdf.pdf.

[Screenshot: PyPDF2 getNamedDestinations]

Get the total page count of PDF using PdfFileReader in Python

PdfFileReader provides a method getNumPages() which returns the total number of pages in the PDF file in Python.

  • This function calculates the total number of pages in the PDF file.
  • The same value is also available as the numPages property.

Code Snippet:

In this code, this function is used in the last line, where it displays the page count of ‘sample.pdf’.

import PyPDF2

pdfObj = open('sample', 'rb')

reader = PyPDF2.PdfFileReader(
    pdfObj,
    strict=True, 
    warndest=None, 
    overwriteWarnings=True
    )

print(reader.getNumPages())

Output:

In this output, you can see the result in the terminal: sample.pdf has a total of 8 pages.

[Screenshot: PyPDF2 getNumPages, total number of pages from PDF in Python]
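
The page count is also useful for validating a page number before it is used. getPage(), covered later in this tutorial, counts pages from 0, so the sketch below converts a 1-based page number into that index. The helper name page_index is our own.

```python
def page_index(reader, page_number):
    """Convert a 1-based page number to a 0-based index, with a range check."""
    total = reader.getNumPages()
    if not 1 <= page_number <= total:
        raise ValueError("page %d is outside 1..%d" % (page_number, total))
    return page_number - 1
```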

Get outlines in the PDF using PdfFileReader in Python

PdfFileReader provides a method getOutlines(node=None, outlines=None) which retrieves the outline of the PDF file in Python.

  • This function retrieves the outline present in the PDF file.
  • In other words, it returns a nested list of destinations.
  • The outline is the tree of bookmarks that a PDF viewer shows as a table of contents. It is not the same thing as the markings, also called annotations, that people add while reviewing a pdf.
  • The PyPDF2 release used here has things that are broken. In our experience the outlines function is one of them & stopped working properly after python3.5.
  • We tried it but it didn’t work for us. The output was empty even after adding outlines to the pdf.
  • We will update the blog once we find a solution.
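
For a PDF where getOutlines() does return data, the nested list can be flattened for printing. The following is only a sketch: the helper name flatten_outlines is ours, and it assumes every non-list item is a destination object with a title attribute. We could not confirm it against a real outline because of the problem described above.

```python
def flatten_outlines(outlines, depth=0):
    """Turn the nested outline list into (depth, title) pairs."""
    flat = []
    for item in outlines:
        if isinstance(item, list):
            # A nested list holds the children of the previous entry.
            flat.extend(flatten_outlines(item, depth + 1))
        else:
            flat.append((depth, item.title))
    return flat
```

It would be called as flatten_outlines(reader.getOutlines()).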

Jump to a specific page of PDF using PdfFileReader in Python

PdfFileReader provides a method getPage(pageNumber) which allows us to see the content of a specific page.

  • This function returns the page at the provided page number as a PageObject. Page numbers start at 0.
  • To extract the content in a readable format we have to use a function with the name extractText().
  • extractText() is a function from the PageObject class. Using this function we can read the content of the pdf.

Code Snippet:

In this code, we are using a single-page pdf file with the name Smallpdf.pdf. In the last line of the code we have passed page number 0 as an argument & applied the extractText() function to display the content.

import PyPDF2

pdfObj = open('Smallpdf.pdf', 'rb')

reader = PyPDF2.PdfFileReader(
    pdfObj,
    strict=True, 
    warndest=None, 
    overwriteWarnings=True
    )

print(reader.getPage(0).extractText())

Output:

In this output, you can see that the data from page 0, that is the first page, is displayed on the screen. The content is in a human-readable format.

[Screenshot: PyPDF2 getPage]
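
For a pdf with more than one page, getPage() can be combined with getNumPages() to read every page in turn. This is a minimal sketch; the helper name extract_all_text is our own, and the quality of the text depends on how the pdf was produced.

```python
def extract_all_text(reader):
    """Return a list with the extracted text of every page, in order."""
    texts = []
    # Pages are numbered from 0 to getNumPages() - 1.
    for index in range(reader.getNumPages()):
        texts.append(reader.getPage(index).extractText())
    return texts
```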

Get Page Mode of PDF using PdfFileReader in Python

PdfFileReader provides a method getPageMode() which returns the page mode of a PDF in Python.

  • This function is used to get the page mode, or None if the PDF does not set one.
  • There are various valid page modes:
Modes              Usage
/UseNone           Do not show outlines or thumbnail panel
/UseOutlines       Show outline panel
/UseThumbs         Show page thumbnails
/FullScreen        Go fullscreen
/UseOC             Show optional content group (OCG)
/UseAttachments    Show attachment panel
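
The value returned by getPageMode() can be turned into a readable description with a lookup table. The sketch below spells the mode names with a capital U, as the PDF format does; the names PAGE_MODES and describe_page_mode are our own.

```python
PAGE_MODES = {
    "/UseNone": "Do not show outlines or thumbnail panel",
    "/UseOutlines": "Show outline panel",
    "/UseThumbs": "Show page thumbnails",
    "/FullScreen": "Go fullscreen",
    "/UseOC": "Show optional content group (OCG)",
    "/UseAttachments": "Show attachment panel",
}


def describe_page_mode(reader):
    """Return a readable description of the page mode of a PDF."""
    mode = reader.getPageMode()
    if mode is None:
        return "No page mode set"
    return PAGE_MODES.get(mode, "Unknown page mode: " + mode)
```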

Get Page Layout of the PDF using PdfFileReader in Python

PdfFileReader provides a method getPageLayout() which returns the page layout of a PDF in Python.

  • It gets the page layout of the PDF, or None if the PDF does not set one.
  • There are various valid layouts:
Layouts            Usage
/NoLayout          Layout explicitly not specified
/SinglePage        Show one page at a time
/OneColumn         Show one column at a time
/TwoColumnLeft     Show pages in two columns, odd-numbered pages on the left
/TwoColumnRight    Show pages in two columns, odd-numbered pages on the right
/TwoPageLeft       Show two pages at a time, odd-numbered pages on the left
/TwoPageRight      Show two pages at a time, odd-numbered pages on the right
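
One way to use the layout value is to check whether a pdf asks the viewer to show pages side by side. This is a small sketch; the names TWO_UP_LAYOUTS and shows_two_pages are our own.

```python
TWO_UP_LAYOUTS = {
    "/TwoColumnLeft",
    "/TwoColumnRight",
    "/TwoPageLeft",
    "/TwoPageRight",
}


def shows_two_pages(reader):
    """Return True if the PDF requests a side-by-side page layout."""
    # getPageLayout() returns None when no layout is set, which is not
    # in the set above, so the result is False in that case.
    return reader.getPageLayout() in TWO_UP_LAYOUTS
```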

Get Encryption information of the PDF using PdfFileReader in Python

PdfFileReader provides a property isEncrypted which allows us to check whether the PDF file is encrypted in Python.

  • It shows whether the PDF is encrypted or not.
  • The value is a boolean (True/False). isEncrypted is a property, so use it without parentheses.
  • If isEncrypted is True, it will remain True even after the file has been decrypted with decrypt().
  • In the below picture you can see that Smallpdf.pdf is not encrypted, which is why the output is False.
[Screenshot: PyPDF2 isEncrypted]
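
Because isEncrypted stays True after decryption, it is the return value of decrypt() that tells us whether a password worked. Here is a minimal sketch, assuming decrypt() returns 0 when the password does not match; the helper name unlock is our own.

```python
def unlock(reader, password):
    """Return True if the PDF is readable, decrypting it when needed."""
    if not reader.isEncrypted:  # property: no parentheses
        return True
    # decrypt() returns 0 when the password does not match.
    return reader.decrypt(password) != 0
```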


With this, we have completed the Python PdfFileReader class and its functions. Two of the functions, getOutlines() and getDestinationPageNumber(), are deprecated or broken.

Here are the PdfFileReader Python examples that we covered:

  • Get PDF information using PdfFileReader in Python
  • How to get PDF information of a specific page using PdfFileReader in Python
  • Get field data from PDF using PdfFileReader in Python
  • How to get text data from fields in PDF using PdfFileReader in Python
  • Get to the named Destinations in PDF using PdfFileReader in Python
  • How to get the total page count of PDF using PdfFileReader in Python
  • How to get outlines in the PDF using PdfFileReader in Python
  • Jump to a specific page of PDF using PdfFileReader in Python
  • How to get Page Mode of PDF using PdfFileReader in Python
  • Get Page Layout of the PDF using PdfFileReader in Python
  • How to get Encryption information of the PDF using PdfFileReader in Python