The PdfFileReader Class: PyPDF2 Documentation

While working with PDF files in Python, I realized how useful the PyPDF2 library is, especially its PdfFileReader class. Whether you want to extract text, read metadata, or work with pages, PdfFileReader makes it simple. Over my 10+ years as a Python developer, this tool has been invaluable whenever I needed to automate PDF processing.

In this tutorial, I’ll walk you through how to use PdfFileReader with practical examples. If you’re dealing with PDFs in your projects, such as reports, invoices, or other documents, this guide will help you get started quickly and efficiently.

Let’s dive in!


What is PdfFileReader?

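PdfFileReader is a class from the PyPDF2 library that allows you to read and extract information from PDF files. Since PyPDF2 3.0 the class is named PdfReader, and using the old PdfFileReader name raises a deprecation error. It supports operations like: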

Accessing the number of pages
Extracting text from pages
Reading document metadata
Accessing page dimensions and more

It’s a pure Python library, so no extra dependencies are required, and it works well on all platforms.
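
The list above mentions page dimensions, which the examples in this tutorial don't otherwise show. Here is a minimal sketch, assuming PyPDF2 3.x, where each page object exposes a mediabox whose width and height are measured in PDF points (72 points per inch by default). The helper names page_size_inches and print_first_page_size are my own for illustration, not part of PyPDF2.

```python
POINTS_PER_INCH = 72  # default PDF user space: 72 points per inch


def page_size_inches(page):
    """Return (width, height) of a page in inches, rounded to 2 decimals.

    `page` is expected to look like a PyPDF2 page object, i.e. to have a
    `mediabox` attribute with `width` and `height` given in points.
    """
    width = float(page.mediabox.width) / POINTS_PER_INCH
    height = float(page.mediabox.height) / POINTS_PER_INCH
    return round(width, 2), round(height, 2)


def print_first_page_size(path):
    """Usage sketch: needs PyPDF2 installed and a real PDF at `path`."""
    # Imported here so page_size_inches() itself has no dependency
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    print(page_size_inches(reader.pages[0]))
```

Calling print_first_page_size("USA_Economic_Report.pdf") would print the size of the first page of the sample file used later in this tutorial.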

How to Install PyPDF2

Before we dive into the code, you need to install the PyPDF2 package if you haven’t already.

pip install PyPDF2

This command installs the latest release from PyPI, which belongs to the 3.x series. In 3.x the reader class is imported as PdfReader.
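
If you are not sure which PyPDF2 release is installed, a quick check like the one below can tell you which class name to import. This is only a sketch: it relies on the standard importlib.metadata module (Python 3.8+), and the two helper function names are my own.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_pypdf2_version():
    """Return the installed PyPDF2 version string, or None if it is missing."""
    try:
        return version("PyPDF2")
    except PackageNotFoundError:
        return None


def legacy_names_removed(version_string):
    """True for PyPDF2 3.0 and later, where PdfFileReader raises an error."""
    major = int(version_string.split(".")[0])
    return major >= 3


found = installed_pypdf2_version()
if found is None:
    print("PyPDF2 is not installed")
elif legacy_names_removed(found):
    print(f"PyPDF2 {found}: import PdfReader")
else:
    print(f"PyPDF2 {found}: PdfFileReader is still available")
```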


Basic PdfFileReader Example: Reading Text from a PDF

Let me show you a simple example where we open a PDF file and extract text from its first page. For this example, imagine you have a PDF named USA_Economic_Report.pdf containing economic data.

from PyPDF2 import PdfReader

# Path to your PDF file
file_path = "USA_Economic_Report.pdf"

# Open and read the PDF
with open(file_path, 'rb') as file:
    reader = PdfReader(file)
    num_pages = len(reader.pages)
    print(f"Number of pages: {num_pages}")

    # Print text of the first page
    first_page = reader.pages[0]
    text = first_page.extract_text()
    print("Text from first page:\n", text)


What’s happening here?

We open the PDF file in binary read mode ('rb').
We create a PdfReader object (the current name of the PdfFileReader class) to interact with the file.
We check how many pages the PDF contains.
We extract text from the first page using extract_text().

This method works well for most text-based PDFs.

Extract Text from All Pages

Often, you want to process the entire document. Here’s how I loop through all pages to extract text:

from PyPDF2 import PdfReader  # called PdfFileReader before PyPDF2 3.0

with open('USA_Economic_Report.pdf', 'rb') as file:
    reader = PdfReader(file)

    # reader.pages behaves like a list of page objects
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text()
        print(f'--- Page {page_num} ---')
        print(text)


This loop goes through each page and prints its text content. It’s handy when you want to analyze or store the entire PDF content.
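
To store the content instead of printing it, the same loop can feed a text file. The sketch below assumes a reader object shaped like the one above (a pages sequence whose items have extract_text()). The function names collect_page_texts and save_as_text are hypothetical helpers, not PyPDF2 API.

```python
from pathlib import Path


def collect_page_texts(reader):
    """Return a list with the extracted text of every page, in order."""
    # Image-only pages can yield an empty result, so normalise it to ""
    return [page.extract_text() or "" for page in reader.pages]


def save_as_text(reader, out_path):
    """Write all page texts to `out_path`, separated by page markers.

    Returns the number of pages written.
    """
    parts = []
    for number, text in enumerate(collect_page_texts(reader), start=1):
        parts.append(f"--- Page {number} ---\n{text}")
    Path(out_path).write_text("\n\n".join(parts), encoding="utf-8")
    return len(parts)
```

With PyPDF2 installed, save_as_text(PdfReader("USA_Economic_Report.pdf"), "report.txt") would write one marked section per page.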

Read PDF Metadata

PDF files often contain metadata like author, creation date, and title. You can access this information easily:

from PyPDF2 import PdfReader  # called PdfFileReader before PyPDF2 3.0

with open('USA_Economic_Report.pdf', 'rb') as file:
    reader = PdfReader(file)
    # metadata replaces getDocumentInfo(); it is None when the PDF has no metadata
    info = reader.metadata

    print('PDF Metadata:')
    for key, value in (info or {}).items():
        print(f'{key}: {value}')

Metadata can be useful for cataloging documents or verifying their origin.
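
For cataloging, it often helps to turn the metadata into a plain dictionary first. PDF metadata keys usually look like "/Author" or "/Title", so this sketch strips the leading slash and lowercases the key. The function name metadata_to_record is my own, and the sample values in the usage note are made up.

```python
def metadata_to_record(info):
    """Turn PDF metadata into a plain dict with lowercase keys.

    `info` is any mapping shaped like PyPDF2's metadata, whose keys
    look like "/Author" or "/Title". None (no metadata) gives {}.
    """
    if not info:
        return {}
    record = {}
    for key, value in info.items():
        clean_key = str(key).lstrip("/").lower()
        record[clean_key] = str(value)
    return record
```

For example, metadata_to_record({"/Author": "Jane Doe"}) returns {"author": "Jane Doe"}. With a real file you would pass reader.metadata.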


Check if a PDF is Encrypted

Sometimes PDFs are encrypted and require a password to open. Here’s how you can check and handle that:

from PyPDF2 import PdfReader  # called PdfFileReader before PyPDF2 3.0

with open('Confidential_USA_Report.pdf', 'rb') as file:
    reader = PdfReader(file)

    if reader.is_encrypted:
        print('PDF is encrypted. Trying to decrypt...')
        # If you know the password, provide it here
        # decrypt() returns a falsy value when the password is wrong
        if reader.decrypt('your_password_here'):
            print('Decryption successful!')
            # Now you can read pages as usual
            page = reader.pages[0]
            print(page.extract_text())
        else:
            print('Failed to decrypt PDF.')
    else:
        print('PDF is not encrypted.')

This is essential when working with protected documents.
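
In a longer script, I prefer to wrap this check in a small helper that fails loudly instead of printing. The following is a sketch only: unlock is a hypothetical name, and it assumes a reader with the is_encrypted attribute and decrypt() method found in PyPDF2 3.x.

```python
def unlock(reader, password):
    """Decrypt `reader` in place if needed and return it.

    Raises ValueError when the password is rejected, so callers do not
    go on to read pages from a document that is still locked.
    """
    if not reader.is_encrypted:
        return reader
    # decrypt() returns a falsy value when the password does not match
    if not reader.decrypt(password):
        raise ValueError("wrong password, PDF is still encrypted")
    return reader
```

After reader = unlock(PdfReader(path), "your_password_here"), the rest of the code can read pages without repeating the encryption check.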

Combine PdfFileReader with PdfFileWriter

In many cases, you might want to read a PDF, modify it, or extract certain pages. PdfFileReader works together with PdfFileWriter for such tasks (in PyPDF2 3.x the two classes are named PdfReader and PdfWriter). Here’s a quick example to extract pages 2 to 4 into a new PDF:

# PdfReader and PdfWriter were called PdfFileReader and PdfFileWriter before PyPDF2 3.0
from PyPDF2 import PdfReader, PdfWriter

with open('USA_Economic_Report.pdf', 'rb') as infile:
    reader = PdfReader(infile)
    writer = PdfWriter()

    # Extract pages 2 to 4 (page indices 1 to 3)
    for page_num in range(1, 4):
        page = reader.pages[page_num]
        writer.add_page(page)

    # Write the extracted pages to a new PDF
    with open('Extracted_Pages.pdf', 'wb') as outfile:
        writer.write(outfile)

print('Pages 2 to 4 extracted successfully.')

This method is useful for creating summaries or splitting large reports.
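
The off-by-one between page numbers (starting at 1) and page indices (starting at 0) is an easy place to slip, and the example above fails with an IndexError if the PDF has fewer than four pages. Here is a hedged sketch of a reusable version. Both function names are hypothetical, and extract_range assumes PyPDF2 3.x is installed.

```python
def page_indices(first, last, total_pages):
    """Convert an inclusive 1-based page range into 0-based indices."""
    if first < 1 or last < first:
        raise ValueError(f"invalid page range {first}-{last}")
    if last > total_pages:
        raise ValueError(f"document has only {total_pages} pages")
    return list(range(first - 1, last))


def extract_range(src_path, dst_path, first, last):
    """Usage sketch: copy pages first..last of src_path into dst_path."""
    # Imported here so page_indices() can be used without PyPDF2
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in page_indices(first, last, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as outfile:
        writer.write(outfile)
```

Here page_indices(2, 4, 10) gives [1, 2, 3], the same indices used in the example above.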


Tips from My Experience

Text extraction quality depends on the PDF: Some PDFs are scanned images and require OCR tools instead.
Always open files in binary mode ('rb'): PDFs are binary data, so text mode will not read them correctly.
Use context managers (with statements): They ensure files close properly.
Check for encryption: Many official documents are password-protected.
Keep your PDF library updated: PyPDF2 3.0.x is the final PyPDF2 series, and development continues under the pypdf package name.
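
For the first tip, one rough way to spot scanned pages is to look for pages where text extraction returns nothing. This is a heuristic sketch, not a reliable detector: pages_without_text and its min_chars parameter are my own names, and the reader is assumed to behave like the PdfReader objects used earlier.

```python
def pages_without_text(reader, min_chars=1):
    """Return 1-based numbers of pages whose extracted text is (nearly) empty.

    Such pages are often scanned images, though an intentionally blank page
    looks the same, so treat the result as a hint rather than proof.
    """
    suspects = []
    for number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").strip()
        if len(text) < min_chars:
            suspects.append(number)
    return suspects
```

Pages reported by this helper are candidates for an OCR tool rather than extract_text().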

By using PdfFileReader, you can automate many PDF-related tasks, saving time and effort. Whether you’re processing financial reports, government documents, or any PDF files, these techniques will get you started.

If you want to dive deeper, explore the official PyPDF2 documentation for advanced features like merging PDFs, rotating pages, and adding annotations.

I hope you found this tutorial helpful. Feel free to try out the examples with your own PDFs and see how easy it is to work with PDFs in Python!


Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers.
