Convert PDF file to Docx in Python

Recently, we needed to build a Python application that converts PDF files to Word documents (.docx files). For this task, we used the pdf2docx package. However, during the implementation, we got the error “AttributeError: 'Page' object has no attribute 'rotationMatrix'. Did you mean: 'rotation_matrix'?”.

So, in this Python tutorial, we will understand how to solve this error and convert a PDF file to Docx in Python. Here is the list of topics that we are going to discuss.

  • Prerequisite for “Convert PDF file to Docx in Python”
  • Convert PDF file to Docx in Python Error
  • Convert PDF file to Docx in Python using Converter()
  • Convert PDF file to Docx in Python using parse()

Prerequisite for “Convert PDF file to Docx in Python”

Checking Python Version

Before we start converting PDF to Docx, we need to make sure that Python is properly installed on our system. We can check the Python version by running the following command in the command prompt or terminal.

python --version

For this tutorial, we are using the Windows operating system, so we run the command in the Windows Command Prompt. Here is the result of the above command.

Check Python version in CMD

So, from the output, you can observe that we are using Python version 3.10.2.
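The same check can also be done from inside Python, which is handy when several interpreters are installed. The following is a minimal sketch; the helper names python_version_string and meets_minimum are our own, and the minimum version (3, 8) is only a placeholder assumption, not a documented requirement of pdf2docx.

```python
import sys


def python_version_string():
    """Return the running interpreter's version as 'major.minor.micro'."""
    info = sys.version_info
    return f"{info.major}.{info.minor}.{info.micro}"


def meets_minimum(minimum=(3, 8)):
    """Check the running interpreter against a (major, minor) tuple.

    The default of (3, 8) is a placeholder chosen for this sketch.
    """
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Running Python", python_version_string())
    print("Meets the assumed minimum:", meets_minimum())
```

Running this script with the interpreter that will execute the conversion confirms that the virtual environment and the system Python are the ones we expect.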


Installing pdf2docx package

The next prerequisite is the pdf2docx package. This Python library uses PyMuPDF, the Python binding for MuPDF, to extract data from PDF files and interpret their layout. It then uses the python-docx library to create the Word document.

Here, python-docx is another useful library that is generally used for generating and editing Microsoft Word (.docx) files.

The pdf2docx package is a third-party package, so before using it, we need to install it on our system or in a virtual environment.

In the following steps, we will create a virtual environment and then use the pip command to install the pdf2docx package in it.

Here is the command to create a virtual environment in Python.

python -m venv myapp

In the above command, myapp is the name of the virtual environment. You can choose any other name instead.

The next step is to activate the virtual environment. On Windows, we use the following command for this task (on macOS or Linux, the equivalent is source myapp/bin/activate).

myapp\Scripts\activate

Once the virtual environment is activated, its name appears at the start of the command prompt.

Convert PDF file to Docx in Python Example

Now, we are ready to install the pdf2docx package in our myapp virtual environment. For this task, we will use the following pip command.

pip install pdf2docx

Once we run the above command, it will install all the required packages related to this pdf2docx package.

Important:

In our case, pip installed pdf2docx version 0.5.3, and the error that we are going to resolve occurs in this version. If you have installed a different version, you may not get this error, or you may get a different one.
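Since the error depends on the installed version, it helps to confirm which version pip actually installed. Here is a minimal sketch using the standard importlib.metadata module (available from Python 3.8); the helper name installed_version is our own and not part of pdf2docx.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pdf2docx")
    if found is None:
        print("pdf2docx is not installed in this environment")
    else:
        print("pdf2docx version:", found)
```

Run it inside the activated virtual environment so that it reports the package installed there and not a system-wide copy.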


Convert PDF file to Docx in Python 

Once we have installed the pdf2docx package, we are ready to use it in Python to convert a PDF file to a Word document with a .docx extension. The pdf2docx package offers 2 ways to do this. The first uses the Converter() class from the package, and the second uses the parse() function.

Let us discuss each method with an example in Python.

Using Converter() class

  • The Converter() class utilizes PyMuPDF to read the specified PDF file and fetch page-by-page raw layout data, which includes text, images, and their associated properties.
  • After this, it examines the document’s layout at the level of page headers, footers, and margins.
  • Next, it parses the page layout into a docx structure. Finally, it uses “python-docx” to generate the docx file.

Let us understand how to use this Converter() class to convert a PDF to a word document in Python.

# Importing the Converter() class
from pdf2docx import Converter

# Specifying the pdf & docx files
pdf_file = 'muscles.pdf'
docx_file = 'muscles.docx'


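cv_obj = None
try:
    # Converting PDF to Docx
    cv_obj = Converter(pdf_file)
    cv_obj.convert(docx_file)

except Exception:
    print('Conversion Failed')

else:
    print('File Converted Successfully')

finally:
    # Closing the PDF file even if the conversion failed
    if cv_obj is not None:
        cv_obj.close()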

In the above code, we first imported the Converter() class from the pdf2docx module. After this, we defined 2 variables that hold the file name and path of the PDF file we want to convert and of the resulting Word file. In this example, the muscles.pdf file is in the same working directory as the Python file.

Next, we created an object of the Converter() class named cv_obj, passing the pdf_file variable as an argument. Then we called the object's convert() method with the docx_file variable as an argument to convert the file.

Finally, the close() method closes the PDF file. When we run the Python program, it creates a new docx file named muscles.docx that contains the data from the PDF file.
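When the Word file should simply sit next to the PDF under the same name, the output path can be derived instead of typed by hand. The sketch below uses only the standard pathlib module; the helper name docx_path_for is our own and not part of pdf2docx.

```python
from pathlib import Path


def docx_path_for(pdf_file):
    """Return the same path as pdf_file, but with a .docx extension."""
    return str(Path(pdf_file).with_suffix(".docx"))


if __name__ == "__main__":
    pdf_file = "muscles.pdf"
    # The result can be passed to convert() in place of a hand-written name
    print(docx_path_for(pdf_file))
```

Deriving the name this way avoids typos when many files are converted with the same script.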

Convert PDF file to Docx in Python Error

If you have also installed pdf2docx version 0.5.3, there is a high chance that the conversion fails at your end as well and the program prints “Conversion Failed”. This happens because an exception is raised during execution and the try-except block catches it.

If we remove the try-except block and then execute the program, it shows the following error.

AttributeError: 'Page' object has no attribute 'rotationMatrix'. Did you mean: 'rotation_matrix'?

To resolve the above error, we need to follow these steps.

  • First, go to the site-packages folder inside the virtual environment directory. At our end, the path is “D:\project\myapp\Lib\site-packages”. If you have not used a virtual environment, go to the following path instead: C:\users\UserName\appdata\roaming\python\python310\site-packages

In the second path, replace UserName and python310 with your own username and Python directory.

  • Next, open the pdf2docx directory, go to the page directory, and open the RawPage.py file.
  • In the file, go to line 279, which contains Element.set_rotation_matrix(self.fitz_page.rotationMatrix).
  • Replace rotationMatrix with rotation_matrix, then save and close the file.
Convert PDF file to Docx in Python Error
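The manual steps above can also be scripted. The following is a minimal sketch, assuming that RawPage.py sits at pdf2docx/page/RawPage.py as described in the steps; the function names locate_rawpage and patch_rotation_matrix are our own, and the script keeps a .bak copy of the file before changing it.

```python
from importlib.util import find_spec
from pathlib import Path

OLD_NAME = "rotationMatrix"
NEW_NAME = "rotation_matrix"


def locate_rawpage():
    """Return the expected path of pdf2docx/page/RawPage.py, or None
    if pdf2docx is not installed in the current environment."""
    spec = find_spec("pdf2docx")
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at pdf2docx/__init__.py
    return Path(spec.origin).parent / "page" / "RawPage.py"


def patch_rotation_matrix(path):
    """Replace rotationMatrix with rotation_matrix in the given file.

    Returns True if the file was changed, False if there was nothing to do.
    """
    path = Path(path)
    source = path.read_text(encoding="utf-8")
    if OLD_NAME not in source:
        return False
    # Keep a backup next to the original before overwriting it
    path.with_name(path.name + ".bak").write_text(source, encoding="utf-8")
    path.write_text(source.replace(OLD_NAME, NEW_NAME), encoding="utf-8")
    return True


if __name__ == "__main__":
    target = locate_rawpage()
    if target is None or not target.exists():
        print("pdf2docx/page/RawPage.py was not found in this environment")
    elif patch_rotation_matrix(target):
        print("Patched", target)
    else:
        print("Nothing to patch in", target)
```

Because the script edits an installed package, the change is lost whenever pdf2docx is reinstalled or upgraded, just like the manual edit.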

After implementing the above steps, we can run the previous example again to convert the muscles.pdf file into the muscles.docx file. Here is the command prompt output when we run the Python program.

Example of Convert PDF file to Docx in Python


Using parse() function

Besides the Converter() class, we can also use the parse() function from the pdf2docx module. This function converts a PDF file into a Word document directly, in a single call.

Here is the syntax of the parse() function.

parse(pdf_file_path, docx_file_path, start=page_no, end=page_no)

The parse() function accepts the following 4 arguments.

  • The pdf_file_path argument defines the file name and path of the PDF file that we want to convert.
  • The docx_file_path argument defines the file name and path of the resulting Word file.
  • The start parameter specifies the page of the PDF file where the conversion starts.
  • Finally, the end parameter specifies the page where the conversion ends, so that only the pages in the specified range are converted.
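As a rough illustration of the start and end parameters, the sketch below converts only a range of pages. It assumes that pdf2docx counts pages from zero and treats end as exclusive, which is the default behaviour as far as we know, so verify it against your installed version. The helper names page_range_kwargs and convert_pages are our own and not part of pdf2docx.

```python
def page_range_kwargs(first_page, last_page):
    """Translate 1-based, inclusive page numbers into start/end values.

    Assumption: start is a zero-based index and end is exclusive.
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return {"start": first_page - 1, "end": last_page}


def convert_pages(pdf_file, docx_file, first_page, last_page):
    """Convert only the pages first_page..last_page of pdf_file."""
    # Imported here so the helper above also works without pdf2docx installed
    from pdf2docx import parse

    parse(pdf_file, docx_file, **page_range_kwargs(first_page, last_page))


if __name__ == "__main__":
    # Pages 2 and 3 would map to start=1, end=3 under the assumption above
    print(page_range_kwargs(2, 3))
```

For example, convert_pages('muscles.pdf', 'muscles_part.docx', 2, 3) would convert the second and third pages only, where muscles_part.docx is just an example output name.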

Next, to understand the above syntax, let us run an example in Python. The code for the example is given below.

# Importing the parse() function
from pdf2docx import parse

# Specifying the pdf & docx files
pdf_file = 'muscles.pdf'
docx_file = 'new_muscles.docx'

try:
    # Converting PDF to Docx
    parse(pdf_file, docx_file)
    
except Exception:
    print('Conversion Failed')
    
else:
    print('File Converted Successfully')

In the above example, we first imported the parse() function from the pdf2docx package. After this, we defined two variables, just like in the previous example, to hold the file name and path of the PDF and docx files.

Next, we called the parse() function, where the first argument is the pdf_file variable representing the PDF file and the second argument is the docx_file variable representing the docx file.

Moreover, we kept this code within a try-except-else block to handle any exception that is raised. When the conversion succeeds, it converts the muscles.pdf file and generates the new_muscles.docx file.


So, in this Python tutorial, we understood how to convert a PDF file to Docx in Python. Here is the list of topics that we have covered in this tutorial.

  • Prerequisite for “Convert PDF file to Docx in Python”
  • Convert PDF file to Docx in Python Error
  • Convert PDF file to Docx in Python using Converter()
  • Convert PDF file to Docx in Python using parse()