Convert PDF to Word (Docx) in Python

I was working on a client project where I needed to extract data from a set of PDF reports and share them in Word format. The challenge was that the reports were only available in PDF, and manually copying the content into Word was taking hours.

That’s when I decided to automate the process using Python. In this tutorial, I’ll show you how I converted PDF files into Word (Docx) documents with just a few lines of code.

I’ll walk you through multiple methods, explain the pros and cons of each, and share some practical tips I learned from my experience. By the end of this guide, you’ll be able to quickly convert PDFs into editable Word files with Python.

Methods to Convert PDF to Word in Python

I’ve often found myself in situations where clients send contracts, invoices, or reports only in PDF format. While PDFs are great for sharing, editing them isn’t always easy.

Word (Docx) files, on the other hand, are much easier to edit and collaborate on. That’s why automating the conversion process saves a lot of time, especially when dealing with multiple files.

Here are some real-world use cases where this comes in handy:

Converting government forms or tax documents into editable Word files.
Extracting data from business reports for further editing.
Creating editable templates from PDF brochures.

1 – Convert PDF to Docx using pdf2docx

The first method I recommend is using the pdf2docx library. It’s a simple and reliable tool that converts PDFs into Word documents while keeping most of the formatting intact.

Installation: Open your terminal or command prompt and install the library:

pip install pdf2docx

This installs the package we’ll use to handle the conversion.

Example Code

from pdf2docx import Converter

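# Define the file paths (example names; use your own)
pdf_file = "annual_report.pdf"
docx_file = "annual_report.docx"

# Create a converter for the source PDF
cv = Converter(pdf_file)

# Convert all pages to Word
cv.convert(docx_file, start=0, end=None)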

# Close the converter
cv.close()

print("PDF successfully converted to Word!")

This script converts the entire PDF into a Word document. The best part is that it preserves text, tables, and even some images.

Convert Only Specific Pages: Sometimes, I only need a few pages from a large PDF. Here’s how you can do that:

from pdf2docx import Converter

pdf_file = "annual_report.pdf"
docx_file = "summary_pages.docx"

cv = Converter(pdf_file)

# Convert only pages 2 to 5 (start is a 0-based index; end is not included)
cv.convert(docx_file, start=1, end=5)
cv.close()

print("Selected pages converted to Word!")

This is useful when dealing with long reports where only a portion is needed. It keeps the file size smaller and saves processing time.
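Because pdf2docx counts pages from zero and, as far as I can tell from its documentation, leaves the end page out, it is easy to be off by one. Here is a small sketch of a helper that turns ordinary page numbers into the start and end values used above. The name page_range_to_indices is my own and is not part of pdf2docx.

```python
def page_range_to_indices(first_page, last_page):
    """Map 1-based, inclusive page numbers to a 0-based start and an exclusive end."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return first_page - 1, last_page


# Pages 2 to 5 of the report, as a reader would count them
start, end = page_range_to_indices(2, 5)
print(start, end)  # 1 5
```

You could then call cv.convert(docx_file, start=start, end=end) instead of hard-coding the numbers.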

2 – Convert PDF to Word using PyPDF2 and python-docx

Another way I’ve handled PDF to Word conversion is by combining PyPDF2 with python-docx.
This method extracts text from the PDF and writes it into a Word document.

It doesn’t preserve formatting as well as pdf2docx, but it’s useful when you only need the raw text.

Installation

pip install PyPDF2 python-docx

These two libraries give us the ability to read PDFs and create Word files. Now let’s see them in action.

Example Code

import PyPDF2
from docx import Document

# Open the PDF file
with open("meeting_notes.pdf", "rb") as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    # Create a new Word document
    doc = Document()

    # Extract text page by page
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text()
        doc.add_paragraph(f"Page {page_num}")
        doc.add_paragraph(text if text else "[No text found]")
        doc.add_page_break()

    # Save the Word document
    doc.save("meeting_notes.docx")

print("PDF text extracted and saved to Word!")

This code goes through each page of the PDF and extracts the text. It then writes the text into a Word document with page breaks.

When to Use This Method

When you only care about text, not formatting.
For text-based PDFs where formatting doesn’t matter (scanned PDFs have no text layer and need the OCR method below).
For quick extraction of meeting notes, invoices, or contracts.

3 – Convert Scanned PDFs to Word using OCR (Tesseract)

Sometimes, PDFs are just scanned images. In such cases, the above methods won’t work because there’s no actual text to extract.

That’s when I use OCR (Optical Character Recognition) with pytesseract, pdf2image, and Pillow.
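A rough way to tell which kind of PDF you have is to check how much text the text-based extraction from method 2 returns. The sketch below assumes you already have one extracted string (or None) per page. The name looks_scanned and the min_chars threshold are my own assumptions, so tune them for your files.

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True when the extracted text is too sparse to be a real text layer."""
    total = sum(len((text or "").strip()) for text in page_texts)
    # Require an average of at least min_chars characters per page
    return total < min_chars * max(len(page_texts), 1)


# Text-based PDF: each page yields plenty of characters
print(looks_scanned(["Quarterly revenue grew by 12 percent.", "Costs stayed flat this year."]))  # False

# Scanned PDF: extraction returns nothing useful
print(looks_scanned(["", None, "  "]))  # True
```

If the check returns True, OCR is probably the right route. Otherwise one of the faster methods above should do.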

Installation

pip install pytesseract pillow pdf2image python-docx

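You’ll also need to install Tesseract OCR on your system, and pdf2image needs Poppler to render the PDF pages as images:

Windows: Tesseract Installer, plus a Poppler build added to your PATH
macOS: brew install tesseract poppler
Linux: sudo apt-get install tesseract-ocr poppler-utils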

Example Code

import pytesseract
from pdf2image import convert_from_path
from docx import Document

# Convert PDF pages to images
pages = convert_from_path("scanned_invoice.pdf")

# Create a Word document
doc = Document()

# Process each page with OCR
for i, page in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page)
    doc.add_paragraph(f"Page {i}")
    doc.add_paragraph(text if text else "[No text detected]")
    doc.add_page_break()

# Save the Word document
doc.save("scanned_invoice.docx")

print("Scanned PDF converted to editable Word file!")

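This method takes longer, but it is the only one of the three that works on image-only PDFs. It turns scanned pages into editable text in Word, although the original layout is not preserved and OCR mistakes are possible.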
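One thing I’d watch for: OCR output tends to carry stray whitespace, and depending on the Tesseract version it may end with a form feed character, so the "text if text else" check above might never reach its fallback. Here is a minimal sketch of a cleanup step. clean_ocr_text is a hypothetical helper of mine, not part of pytesseract.

```python
def clean_ocr_text(raw, fallback="[No text detected]"):
    """Drop non-printable characters (keeping newlines and tabs) and trim whitespace."""
    kept = "".join(
        ch for ch in (raw or "")
        if ch in "\n\t" or ch.isprintable()
    )
    cleaned = kept.strip()
    return cleaned if cleaned else fallback


# Typical OCR output with a trailing form feed
print(clean_ocr_text("Invoice #1024\nTotal: $250.00\n\x0c"))

# A blank page that only produced whitespace and a form feed
print(clean_ocr_text("\n \x0c"))
```

Inside the loop, you would pass the result of pytesseract.image_to_string(page) through this function before calling doc.add_paragraph.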

Choose the Right Method

Here’s a quick comparison based on my experience:

Method	Best For	Pros	Cons
pdf2docx	General PDF to Word conversion	Preserves formatting, easy to use	Sometimes struggles with complex layouts
PyPDF2 + python-docx	Extracting plain text	Lightweight, simple	No formatting, images ignored
OCR (pytesseract)	Scanned PDFs	Works on images, extracts text	Slower, requires Tesseract installation

Practical Tips I’ve Learned

Always check the converted Word file manually, especially for contracts or financial documents.
For batch processing, wrap the code in a loop to handle multiple PDFs.
Keep backups of original PDFs in case formatting doesn’t transfer perfectly.
Use OCR only when necessary, since it’s slower than text-based extraction.
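To make the batch tip concrete, here is a sketch of a loop over a folder of PDFs. It assumes you wrap one of the methods above in a function that takes a PDF path and a Word path. convert_folder and convert_one are hypothetical names of mine, not part of any of the libraries.

```python
from pathlib import Path


def convert_folder(folder, convert_one):
    """Call convert_one(pdf_path, docx_path) for every PDF directly inside folder."""
    converted, failed = [], []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        # Write the Word file next to the PDF, with the same base name
        docx_path = pdf_path.with_suffix(".docx")
        try:
            convert_one(str(pdf_path), str(docx_path))
            converted.append(docx_path.name)
        except Exception as error:
            # Record the failure and keep going with the remaining files
            failed.append((pdf_path.name, str(error)))
    return converted, failed
```

With pdf2docx, convert_one could be a small function that creates a Converter, calls convert, and closes it, as in method 1. Returning the failures separately makes it easier to see which files still need a manual check.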

Conclusion

Converting PDF to Word in Python has saved me countless hours of manual work. In this tutorial, I showed you three different methods:

Using pdf2docx for reliable formatting.
Using PyPDF2 and python-docx for plain text extraction.
Using OCR for scanned PDFs.

Each method has its own strengths, and the one you choose depends on your specific use case.
I hope you found this guide helpful; now you can confidently convert PDFs into editable Word documents using Python.

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers.
