I was working on a client project where I needed to extract data from a set of PDF reports and share them in Word format. The challenge was that the reports were only available in PDF, and manually copying the content into Word was taking hours.
That’s when I decided to automate the process using Python. In this tutorial, I’ll show you how I converted PDF files into Word (Docx) documents with just a few lines of code.
I’ll walk you through multiple methods, explain the pros and cons of each, and share some practical tips I learned from my experience. By the end of this guide, you’ll be able to quickly convert PDFs into editable Word files with Python.
Methods to Convert PDF to Word in Python
I’ve often found myself in situations where clients send contracts, invoices, or reports only in PDF format. While PDFs are great for sharing, editing them isn’t always easy.
Word (Docx) files, on the other hand, are much easier to edit and collaborate on. That’s why automating the conversion process saves a lot of time, especially when dealing with multiple files.
Here are some real-world use cases where this comes in handy:
- Converting government forms or tax documents into editable Word files.
- Extracting data from business reports for further editing.
- Creating editable templates from PDF brochures.
1 – Convert PDF to Docx using pdf2docx
The first method I recommend is using the pdf2docx library. It’s a simple and reliable tool that converts PDFs into Word documents while keeping most of the formatting intact.
Installation: Open your terminal or command prompt and install the library:
```bash
pip install pdf2docx
```

This installs the package we’ll use to handle the conversion.
Example Code
```python
from pdf2docx import Converter

# Define the file paths (placeholder names; replace with your own files)
pdf_file = "input.pdf"
docx_file = "output.docx"

# Open the PDF
cv = Converter(pdf_file)

# Convert all pages to Word
cv.convert(docx_file, start=0, end=None)

# Close the converter
cv.close()

print("PDF successfully converted to Word!")
```

This script converts the entire PDF into a Word document. The best part is that it preserves text, tables, and even some images.
Convert Only Specific Pages: Sometimes, I only need a few pages from a large PDF. Here’s how you can do that:
```python
from pdf2docx import Converter

pdf_file = "annual_report.pdf"
docx_file = "summary_pages.docx"

cv = Converter(pdf_file)

# Convert only pages 2 to 5 (start is 0-based, end is exclusive)
cv.convert(docx_file, start=1, end=5)

cv.close()

print("Selected pages converted to Word!")
```

This is useful when dealing with long reports where only a portion is needed. It keeps the file size smaller and saves processing time.
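Because `start` counts from zero and `end` is excluded, it is easy to be off by one page. Here is a minimal sketch of a helper that takes page numbers the way you read them in a PDF viewer (1-based, inclusive). The names `page_range_to_slice` and `convert_page_range` are my own for illustration and are not part of pdf2docx.

```python
def page_range_to_slice(first_page, last_page):
    """Map a 1-based, inclusive page range to a 0-based start and an exclusive end."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return first_page - 1, last_page


def convert_page_range(pdf_file, docx_file, first_page, last_page):
    """Convert pages first_page..last_page (as shown in a PDF viewer) to Word."""
    # Imported here so the helper above can be used without pdf2docx installed
    from pdf2docx import Converter

    start, end = page_range_to_slice(first_page, last_page)
    cv = Converter(pdf_file)
    try:
        cv.convert(docx_file, start=start, end=end)
    finally:
        cv.close()  # close the PDF even if the conversion fails
```

With this in place, `convert_page_range("annual_report.pdf", "summary_pages.docx", 2, 5)` should behave like the example above.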
2 – Convert PDF to Word using PyPDF2 and python-docx
Another way I’ve handled PDF to Word conversion is by combining PyPDF2 with python-docx.
This method extracts text from the PDF and writes it into a Word document.
It doesn’t preserve formatting as well as pdf2docx, but it’s useful when you only need the raw text.
Installation
```bash
pip install PyPDF2 python-docx
```

These two libraries give us the ability to read PDFs and create Word files. Now let’s see them in action.
Example Code
```python
import PyPDF2
from docx import Document

# Create a new Word document
doc = Document()

# Open the PDF file and keep it open while the pages are read
with open("meeting_notes.pdf", "rb") as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    # Extract text page by page
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text()
        doc.add_paragraph(f"Page {page_num}")
        doc.add_paragraph(text if text else "[No text found]")
        doc.add_page_break()

# Save the Word document
doc.save("meeting_notes.docx")
print("PDF text extracted and saved to Word!")
```

This code goes through each page of the PDF and extracts the text. It then writes the text into a Word document with page breaks.
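One thing to watch for: text extracted from a PDF can contain control characters, and in my experience python-docx rejects those with a `ValueError` because they are not valid in XML. A minimal sketch of a cleanup step is below. The helper name `clean_for_docx` is my own, and the character range is an assumption based on what XML 1.0 disallows.

```python
import re

# C0 control characters that XML 1.0 does not allow
# (tab, newline, and carriage return are kept)
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_docx(text):
    """Remove control characters so the text is safe to pass to add_paragraph()."""
    return _INVALID_XML_CHARS.sub("", text or "")
```

Inside the extraction loop you could then write `doc.add_paragraph(clean_for_docx(text) or "[No text found]")`.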
When to Use This Method
- When you only care about text, not formatting.
- For text-based (not scanned) PDFs where formatting doesn’t matter.
- For quick extraction of meeting notes, invoices, or contracts.
3 – Convert Scanned PDFs to Word using OCR (Tesseract)
Sometimes, PDFs are just scanned images. In such cases, the above methods won’t work because there’s no actual text to extract.
That’s when I use OCR (Optical Character Recognition) with pytesseract, pdf2image, and Pillow.
Installation
```bash
pip install pytesseract pillow pdf2image
```

You’ll also need to install Tesseract OCR on your system (pdf2image additionally needs Poppler installed to read PDFs):
- Windows: download and run the Tesseract installer
- macOS: brew install tesseract
- Linux: sudo apt-get install tesseract-ocr
Example Code
```python
import pytesseract
from pdf2image import convert_from_path
from docx import Document

# Convert PDF pages to images
pages = convert_from_path("scanned_invoice.pdf")

# Create a Word document
doc = Document()

# Process each page with OCR
for i, page in enumerate(pages, start=1):
    # strip() drops surrounding whitespace, including the form feed Tesseract may append
    text = pytesseract.image_to_string(page).strip()
    doc.add_paragraph(f"Page {i}")
    doc.add_paragraph(text if text else "[No text detected]")
    doc.add_page_break()

# Save the Word document
doc.save("scanned_invoice.docx")
print("Scanned PDF converted to editable Word file!")
```

This method takes longer, but it is the only one of the three that works on image-only PDFs. It gives you editable text in Word, though the original layout is not preserved and accuracy depends on the quality of the scan.
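OCR output usually keeps the hard line breaks of the scanned page, so each visual line ends with a newline. If you would rather have one Word paragraph per block of text, here is a minimal sketch that treats blank lines as paragraph boundaries. The helper name `ocr_text_to_paragraphs` is my own, and the blank-line rule is an assumption that may not suit every scan.

```python
import re


def ocr_text_to_paragraphs(text):
    """Split OCR text on blank lines and join the wrapped lines of each block."""
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Join the hard-wrapped lines of one block into a single line
        joined = " ".join(line.strip() for line in block.splitlines() if line.strip())
        if joined:
            paragraphs.append(joined)
    return paragraphs
```

In the loop above, you could then call `doc.add_paragraph(p)` for each `p` in `ocr_text_to_paragraphs(text)` instead of adding the raw text as one paragraph.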
Choose the Right Method
Here’s a quick comparison based on my experience:
| Method | Best For | Pros | Cons |
|---|---|---|---|
| pdf2docx | General PDF to Word conversion | Preserves formatting, easy to use | Sometimes struggles with complex layouts |
| PyPDF2 + python-docx | Extracting plain text | Lightweight, simple | No formatting, images ignored |
| OCR (pytesseract) | Scanned PDFs | Works on images, extracts text | Slower, requires Tesseract installation |
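If you are not sure whether a PDF is scanned, one rough check is to try plain text extraction first and see how much text comes back. The sketch below is only a heuristic: the function name `needs_ocr` and the threshold of 20 characters per page are assumptions of mine, so tune them for your own documents.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Return True when the extracted text is too sparse for a text-based PDF."""
    if not page_texts:
        return True
    # extract_text() can return None or whitespace for image-only pages
    total_chars = sum(len((text or "").strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page
```

You would feed it the per-page results of text extraction, for example `needs_ocr([page.extract_text() for page in reader.pages])` with a PyPDF2 reader, and fall back to the OCR method only when it returns `True`.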
Practical Tips I’ve Learned
- Always check the converted Word file manually, especially for contracts or financial documents.
- For batch processing, wrap the code in a loop to handle multiple PDFs.
- Keep backups of original PDFs in case formatting doesn’t transfer perfectly.
- Use OCR only when necessary, since it’s slower than text-based extraction.
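For the batch processing tip, here is a minimal sketch of what such a loop could look like with pdf2docx. The function names `docx_path_for` and `convert_folder` and the folder layout are my own assumptions for illustration.

```python
from pathlib import Path


def docx_path_for(pdf_path, output_dir):
    """Return the .docx path in output_dir that matches a PDF file name."""
    return Path(output_dir) / (Path(pdf_path).stem + ".docx")


def convert_folder(input_dir, output_dir):
    """Convert every PDF in input_dir to a Word file in output_dir."""
    # Imported here so the path helper works without pdf2docx installed
    from pdf2docx import Converter

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        docx_path = docx_path_for(pdf_path, output_dir)
        try:
            cv = Converter(str(pdf_path))
            try:
                cv.convert(str(docx_path))
            finally:
                cv.close()
        except Exception as exc:
            # Report the failure and continue with the remaining files
            print(f"Skipped {pdf_path.name}: {exc}")
```

Calling `convert_folder("reports", "converted")` would write one Word file per PDF and leave the originals untouched, which also covers the backup tip.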
Conclusion
Converting PDF to Word in Python has saved me countless hours of manual work. In this tutorial, I showed you three different methods:
- Using pdf2docx for reliable formatting.
- Using PyPDF2 and python-docx for plain text extraction.
- Using OCR for scanned PDFs.
Each method has its own strengths, and the one you choose depends on your specific use case.
I hope you found this guide helpful; now you can confidently convert PDFs into editable Word documents using Python.

I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working on Python, machine learning, and artificial intelligence for the last 5 years. During this time I have gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, and scikit-learn, working for clients in the United States, Canada, the United Kingdom, Australia, and New Zealand.