Extract text from PDF Python + Useful Examples

This Python tutorial explains how to extract text from PDF files in Python, with every example demonstrated in a Python Tkinter GUI. I will also show a PDF to Word converter that we developed using Python.

Also, we will check:

  • Copy text from pdf file in Python
  • How to extract text from pdf file using Python Tkinter
  • Delete text from pdf file in Python
  • How to copy text from pdf file images in Python
  • Can’t copy text from pdf
  • How to copy text from pdf to word in Python
  • Copy text from pdf online
  • Remove text from pdf online
  • How to select text from pdf in Python


Python copy text from pdf file

  • In this section, we will learn how to copy text from PDF files using Python, demonstrating everything with Python Tkinter. We assume that you have already installed the PyPDF2 and Tkinter modules on your system.
  • The process of copying text in Python Tkinter is divided into two parts:
    • In the first part, we extract the text from the PDF using the PyPDF2 module in Python.
    • In the second part, we copy the text to the clipboard using the clipboard_clear() and clipboard_append() methods available in Python Tkinter.

Here is the code to read and extract data from the PDF using the PyPDF2 module in Python.

reader = PdfFileReader(filename)
pageObj = reader.getNumPages()
for page_count in range(pageObj):
    page = reader.getPage(page_count)
    page_data = page.extractText()
    
  • In the first line, we create a ‘reader’ variable that holds a PdfFileReader object for the PDF. Here filename refers to the name of the file with its path.
  • In the second line, we fetch the total number of pages present in the PDF file.
  • In the third line, the loop starts and iterates over every page in the PDF file.
  • Every time the loop runs, it fetches one page and extracts the text on that page into page_data.
  • In this way, we can extract the text out of the PDF using the PyPDF2 module in Python.
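
The loop above can be wrapped in a small reusable function. This is a minimal sketch rather than a definitive implementation: the names extract_pages and extract_text are hypothetical helpers introduced here, and the calls assume a PyPDF2 release that still provides PdfFileReader, getNumPages(), getPage(), and extractText() (these names were removed in PyPDF2 3.0).

```python
def extract_pages(reader):
    """Return a list holding the extracted text of each page, in order.

    `reader` is assumed to be an already created PdfFileReader object.
    """
    texts = []
    for page_number in range(reader.getNumPages()):
        page = reader.getPage(page_number)
        texts.append(page.extractText())
    return texts


def extract_text(filename):
    """Open the PDF at `filename` and return all of its text as one string."""
    # Imported here so extract_pages() can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    with open(filename, "rb") as pdf_file:
        # Extraction happens while the file is still open.
        return "\n".join(extract_pages(PdfFileReader(pdf_file)))
```

Calling extract_text() with the path of a PDF such as Grades.pdf would return the text of all pages joined by newlines, ready to be inserted into a Text box.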

Here is the code to copy text using Python Tkinter.

 ws.withdraw()
 ws.clipboard_clear()
 ws.clipboard_append(content)
 ws.update()
 ws.destroy()
  • Here, ws is the master window.
  • The first line of code removes the window from the screen without destroying it.
  • The second line of code clears anything that was already on the clipboard.
  • The third line of code copies the content to the clipboard. Here content can be replaced with the text you want to copy.
  • It is important that the text remains copied even after the window is closed. To ensure that, we call the update() function.
  • The last line of code simply destroys the window. This step is optional; remove it if you don’t want the window to be closed.
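
The five calls above can be grouped into one helper so that closing the window becomes a choice. This is only a sketch: copy_to_clipboard and its close_after parameter are hypothetical names introduced here, and window is assumed to be a Tk root such as ws.

```python
def copy_to_clipboard(window, content, close_after=False):
    """Place `content` on the clipboard using a Tk window.

    When `close_after` is True the window is hidden first and destroyed
    at the end, mirroring the snippet above.
    """
    if close_after:
        window.withdraw()            # hide the window without destroying it
    window.clipboard_clear()         # drop whatever was copied before
    window.clipboard_append(content) # copy the new text
    window.update()                  # process pending events before any destroy
    if close_after:
        window.destroy()
```

With a real window this would be called as copy_to_clipboard(ws, content, close_after=True) to get the same behaviour as the original snippet, or with the default close_after=False to keep the window open and visible.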

Code Snippet:

Here is the code of a small project that includes everything that we have learned so far. This project is a GUI-based program created using Python Tkinter to copy text from a PDF.

from PyPDF2 import PdfFileReader
from tkinter import *
from tkinter import filedialog

ws = Tk()
ws.title('PythonGuides')
ws.geometry('400x300')
ws.config(bg='#D9653B')

def choose_pdf():
      filename = filedialog.askopenfilename(
            initialdir = "/",   # for Linux and Mac users
          # initialdir = "C:/",   for windows users
            title = "Select a File",
            filetypes = (("PDF files","*.pdf*"),("all files","*.*")))
      if filename:
          return filename


def read_pdf():
    filename = choose_pdf()
    if not filename:
        return  # the dialog was cancelled, nothing to read
    reader = PdfFileReader(filename)
    pageObj = reader.getNumPages()
    for page_count in range(pageObj):
        page = reader.getPage(page_count)
        page_data = page.extractText()
        textbox.insert(END, page_data)

def copy_pdf_text():
    content = textbox.get(1.0, "end-1c")
    ws.withdraw()
    ws.clipboard_clear()
    ws.clipboard_append(content)
    ws.update()
    ws.destroy()


textbox = Text(
    ws,
    height=13,
    width=40,
    wrap='word',
    bg='#D9BDAD'
)
textbox.pack(expand=True)

Button(
    ws,
    text='Choose Pdf File',
    padx=20,
    pady=10,
    bg='#262626',
    fg='white',
    command=read_pdf
).pack(expand=True, side=LEFT, pady=10)

Button(
    ws,
    text="Copy Text",
    padx=20,
    pady=10,
    bg='#262626',
    fg='white',
    command=copy_pdf_text
).pack(expand=True, side=LEFT, pady=10)


ws.mainloop()

Output:

In this output, we have used the Python Tkinter Text box to show the text of the PDF file. The user clicks the Choose Pdf File button and, using the file dialog box in Python Tkinter, navigates to and selects the PDF file on the computer.

The text is displayed in the Text box immediately. From here, the user can copy the text simply by clicking the Copy Text button. The copied text can then be pasted anywhere, as usual.


This is how to copy text from PDF file in Python.

Extract text from pdf Python

  • In this section, we will learn how to extract text from a PDF using Python Tkinter. The PyPDF2 module in Python offers the extractText() method, which we can use to extract the text from a PDF in Python.
  • In the previous section, where we demonstrated how to copy the text in Python Tkinter, we used the extractText() method to display the text on the screen.
  • Here is the code from the previous section to extract text from a PDF using the PyPDF2 module in Python Tkinter.
 reader = PdfFileReader(filename)
 pageObj = reader.getNumPages()
 for page_count in range(pageObj):
     page = reader.getPage(page_count)
     page_data = page.extractText()
     
  • In the first line of code, we create an object of PdfFileReader. Here filename is the name of the PDF file with the complete path.
  • In the second line of code, we collect the total number of pages available in the PDF file. This number is then used in the loop.
  • In the third line of code, we start a for loop that iterates over all the pages present in the PDF file. For example, if the PDF has 10 pages, the loop runs 10 times.
  • Each time the loop runs, it stores the current page in the ‘page’ variable, so over the course of the loop ‘page’ holds each page of the PDF in turn.
  • By applying the extractText() method to the ‘page’ variable, we extract the text of that page in a human-readable format.
  • All the text displayed here comes from the extractText() method of the PyPDF2 module in Python. For the source code, please refer to the previous section.

This is how to extract text from a PDF in Python.
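
The snippets in this tutorial use the older PyPDF2 names. PyPDF2 3.0 and later removed PdfFileReader, getNumPages(), getPage(), and extractText() in favour of PdfReader, the pages sequence, and extract_text(). A minimal sketch of the same extraction with the newer names follows; the function names are hypothetical and chosen only for this illustration.

```python
def extract_pages_new_api(reader):
    """Return the text of every page of a PdfReader-style object."""
    # reader.pages behaves like a sequence of page objects.
    return [page.extract_text() for page in reader.pages]


def extract_text_new_api(filename):
    """Read the PDF at `filename` with PyPDF2 3.x and return its text."""
    # Imported here so the helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(filename)
    return "\n".join(extract_pages_new_api(reader))
```

If the Tkinter programs in this tutorial raise an error about PdfFileReader on a recent installation, swapping in these calls inside read_pdf() is one possible way to adapt them.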

Read: Upload a File in Python Tkinter

Delete text from pdf file in Python

Below is the complete code to delete text from a PDF file in Python. The extracted text is loaded into an editable Text box, where unwanted text can be deleted.

from PyPDF2 import PdfFileReader
from tkinter import *
from tkinter import filedialog

ws = Tk()
ws.title('PythonGuides')
ws.geometry('400x300')
ws.config(bg='#D9653B')

path = []

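def save_pdf():
    # The Text widget returns a plain str, which has no output() method,
    # so (as an assumed fix) the edited text is written to a .txt file next
    # to the PDF. Writing a new PDF would need a separate PDF-writing library.
    if not path:
        return  # no PDF has been opened yet
    content = textbox.get(1.0, "end-1c")
    with open(path[-1] + ".txt", "w", encoding="utf-8") as out:
        out.write(content)


def saveas_pdf():
    pass    # ToDo: not implemented yet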

def choose_pdf():
      global path
      filename = filedialog.askopenfilename(
            initialdir = "/",   # for Linux and Mac users
          # initialdir = "C:/",   for windows users
            title = "Select a File",
            filetypes = (("PDF files","*.pdf*"),("all files","*.*")))
      if filename:
          path.append(filename)
          return filename


def read_pdf():
    filename = choose_pdf()
    if not filename:
        return  # the dialog was cancelled, nothing to read
    reader = PdfFileReader(filename)
    pageObj = reader.getNumPages()
    for page_count in range(pageObj):
        page = reader.getPage(page_count)
        page_data = page.extractText()
        textbox.insert(END, page_data)


def copy_pdf_text():
    content = textbox.get(1.0, "end-1c")
    ws.withdraw()
    ws.clipboard_clear()
    ws.clipboard_append(content)
    ws.update()
    ws.destroy()

fmenu = Menu(
    master=ws,
    bg='#D9653B',
    relief=GROOVE
)
ws.config(menu=fmenu)

file_menu = Menu(
    fmenu,
    tearoff=False
)
fmenu.add_cascade(
    label="File", menu=file_menu
)
file_menu.add_command(
    label="Open", 
    command=read_pdf
)
file_menu.add_command(
    label="Save",
    command=save_pdf    
)

file_menu.add_command(
    label="Save as",
    command=None    # ToDo
)

file_menu.add_separator()

file_menu.add_command(
    label="Exit",
    command=ws.destroy
)

textbox = Text(
    ws,
    height=13,
    width=40,
    wrap='word',
    bg='#D9BDAD'
)
textbox.pack(expand=True)

Button(
    ws,
    text='Choose Pdf File',
    padx=20,
    pady=10,
    bg='#262626',
    fg='white',
    command=read_pdf
).pack(expand=True, side=LEFT, pady=10)

Button(
    ws,
    text="Copy Text",
    padx=20,
    pady=10,
    bg='#262626',
    fg='white',
    command=copy_pdf_text
).pack(expand=True, side=LEFT, pady=10)


ws.mainloop()
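
The program above relies on the user deleting text by hand in the Text box. Deleting a known phrase from the extracted text could also be scripted. The sketch below is an assumption about how that might look, not part of the original program; delete_text and its ignore_case parameter are hypothetical names.

```python
import re


def delete_text(text, unwanted, ignore_case=False):
    """Return `text` with every occurrence of `unwanted` removed."""
    if not unwanted:
        return text  # nothing to delete
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape() makes the phrase match literally, not as a pattern.
    return re.sub(re.escape(unwanted), "", text, flags=flags)
```

For example, delete_text(page_data, "Confidential") would strip that word from a page before it is inserted into the Text box.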

Python extract text from image

  • Reading or copying text from an image is an advanced process that typically relies on a Machine Learning algorithm.
  • Each language has its own way of writing its alphabet, so the algorithm needs a dataset of letters and words, in varied handwriting and typefaces, for the specific language that appears in the image.
  • Once trained on this dataset, the Machine Learning algorithm identifies the text in the image by matching the patterns of the letters.
  • OCR (Optical Character Recognition) is the name of this technique for identifying characters in images. In Python it is available through libraries such as pytesseract, a wrapper around the Tesseract OCR engine.
  • The Python extract text from image topic will be covered in our Machine Learning section.

Can’t copy text from pdf in Python

In this section, we share common problems that occur while reading a PDF using Python Tkinter. If you can’t copy text from a PDF in Python, check the points below.

  • If the PDF is being used by another process, you can’t copy text from it.
  • If you see a message saying that text can’t be copied from the PDF, double-check that the file you selected is a valid, readable PDF.

These are the common situations in which users can’t copy text from a PDF. If you face any other issue, please leave it in the comments below.
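
Two further causes are worth ruling out in code: an encrypted PDF must be decrypted before its text can be read, and a scanned PDF may contain only page images with no text to extract. The sketch below checks for both. diagnose_pdf and the strings it returns are hypothetical, and reader is assumed to be a PdfFileReader from a PyPDF2 release that still has the isEncrypted attribute.

```python
def diagnose_pdf(reader):
    """Return a short label describing why text might not be extractable."""
    if reader.isEncrypted:
        return "encrypted"  # needs decrypting before extractText() works
    for page_number in range(reader.getNumPages()):
        text = reader.getPage(page_number).extractText()
        if text.strip():
            return "ok"     # at least one page has extractable text
    return "no text found"  # possibly a scanned, image-only PDF
```

A result of "no text found" suggests the file needs OCR rather than PyPDF2.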

Read Python Tkinter drag and drop

How to copy text from pdf to word in Python

  • To copy text from a PDF to a Word file using Python, we use the pdf2docx module.
  • pdf2docx converts a PDF document to a Word file using Python. This Word file can then be opened with applications like Microsoft Word, Libre Office, and WPS.
  • The first step in this process is to install the pdf2docx module. You can install it with pip on any operating system.
pip install pdf2docx
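
Once the module is installed, the conversion itself can be tried without any GUI. The sketch below uses the Converter class from pdf2docx; the helper names default_docx_path and pdf_to_word are hypothetical, and saving the Word file next to the PDF is an assumption made for this example.

```python
from pathlib import Path


def default_docx_path(pdf_path):
    """Return the PDF path with its extension replaced by .docx."""
    return str(Path(pdf_path).with_suffix(".docx"))


def pdf_to_word(pdf_path, docx_path=None):
    """Convert the whole PDF to a Word file and return the Word file path."""
    # Imported here so default_docx_path() works without pdf2docx installed.
    from pdf2docx import Converter

    docx_path = docx_path or default_docx_path(pdf_path)
    converter = Converter(pdf_path)
    try:
        # start=0 and end=None cover every page of the document.
        converter.convert(docx_path, start=0, end=None)
    finally:
        converter.close()
    return docx_path
```

Calling pdf_to_word("Grades.pdf") would be expected to write Grades.docx in the same folder.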

Code Snippet:

This code shows how a PDF can be converted to a Word file using Python Tkinter.

from tkinter import *
from tkinter import filedialog
import pdf2docx


path = []

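def convert_toword():
    # Nothing to convert until a PDF has been chosen.
    if not path or not path[0]:
        return
    # Ask only for the target file name; pdf2docx creates the file itself.
    docx_name = filedialog.asksaveasfilename(
        defaultextension=".docx",
        filetypes=(("WORD files", "*.docx"), ("all files", "*.*")),
    )
    if not docx_name:
        return  # the save dialog was cancelled
    pdf2docx.parse(
        pdf_file=path[0],
        docx_file=docx_name,
        start=0,
        end=None,
    )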

def choose_file():
    global path
    path.clear()
    filename = filedialog.askopenfilename(
            initialdir = "/",   # for Linux and Mac users
          # initialdir = "C:/",    for windows users
            title = "Select a File",
            filetypes = (("PDF files","*.pdf*"),("all files","*.*")))
    path.append(filename)

ws = Tk()
ws.title('PythonGuides')
ws.geometry('400x300')
ws.config(bg='#F2E750')         

choose_btn = Button(
    ws,
    text='Choose File',
    padx=20,
    pady=10,
    bg='#344973',  
    fg='white',
    command=choose_file
)
choose_btn.pack(expand=True, side=LEFT)

convert_btn = Button(
    ws,
    text='Convert to Word',
    padx=20,
    pady=10,
    bg='#344973',
    fg='white',
    command=convert_toword
)
convert_btn.pack(expand=True, side=LEFT)

ws.mainloop()

Output:

This is the main screen of the application. The user chooses the PDF file by clicking the Choose File button. Once a file is selected, the user clicks the Convert to Word button and picks a name and location for the Word file in the save dialog.

fig 1: main screen of the application

Fig 2 shows the file dialog that appears when the user clicks the Choose File button. Here the user has selected Grades.pdf.

fig 2: selecting PDF

Fig 3 shows the save file dialog box. The user is saving the file with the .docx extension.

Fig 3: converting to word

Fig 4 shows the conversion of PDF to Word file. In this case, you can see the word document is created with the name updatedGrades.docx. This name is provided by the user in Fig 3.

fig 4: file converted to word document

This is how to copy text from pdf to word in Python.

Read: Create a game using Python Pygame

How to Select text from pdf file in Python

  • In this section, we will learn how to select text from a PDF using Python, demonstrating everything with Python Tkinter. We assume that you have already installed the PyPDF2 and Tkinter modules on your system.
  • The process of selecting text in Python Tkinter is divided into two parts:
    • In the first part, we extract the text from the PDF using the PyPDF2 module in Python.
    • In the second part, we select text from the extracted text.

Here is the code to read and extract data from the PDF using the PyPDF2 module in Python:

reader = PdfFileReader(filename)
pageObj = reader.getNumPages()
for page_count in range(pageObj):
    page = reader.getPage(page_count)
    page_data = page.extractText()
    
  • In the first line, we create a ‘reader’ variable that holds a PdfFileReader object for the PDF. Here filename refers to the name of the file with its path.
  • In the second line, we fetch the total number of pages present in the PDF file.
  • In the third line, the loop starts and iterates over every page in the PDF file.
  • Every time the loop runs, it fetches one page and extracts the text on that page into page_data.
  • In this way, we can extract the text out of the PDF using the PyPDF2 module in Python.
  • Once the text has been extracted into the Text box, you can select it by pressing the left mouse button and dragging over the text.
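
A program can also read back whatever the user has selected. In a Tkinter Text widget the current selection is marked with the built-in "sel" tag, so the sketch below checks that tag and returns the selected text. get_selected_text is a hypothetical helper name, and text_widget is assumed to be a Text widget such as textbox.

```python
def get_selected_text(text_widget):
    """Return the text currently selected in a Tkinter Text widget.

    An empty string is returned when nothing is selected.
    """
    # tag_ranges("sel") is empty when there is no selection.
    if not text_widget.tag_ranges("sel"):
        return ""
    return text_widget.get("sel.first", "sel.last")
```

In the program below, a button whose command calls get_selected_text(textbox) could, for instance, copy only the selected portion instead of the whole text.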

Code Snippet:

Here is the code of a small project that shows the extraction of text from a PDF. This project is a GUI-based program created using Python Tkinter to implement the selection of text from a PDF.

from PyPDF2 import PdfFileReader
from tkinter import *
from tkinter import filedialog

ws = Tk()
ws.title('PythonGuides')
ws.geometry('400x300')
ws.config(bg='#D9653B')

def choose_pdf():
      filename = filedialog.askopenfilename(
            initialdir = "/",   # for Linux and Mac users
          # initialdir = "C:/",   for windows users
            title = "Select a File",
            filetypes = (("PDF files","*.pdf*"),("all files","*.*")))
      if filename:
          return filename


def read_pdf():
    filename = choose_pdf()
    if not filename:
        return  # the dialog was cancelled, nothing to read
    reader = PdfFileReader(filename)
    pageObj = reader.getNumPages()
    for page_count in range(pageObj):
        page = reader.getPage(page_count)
        page_data = page.extractText()
        textbox.insert(END, page_data)

def copy_pdf_text():
    content = textbox.get(1.0, "end-1c")
    ws.withdraw()
    ws.clipboard_clear()
    ws.clipboard_append(content)
    ws.update()
    ws.destroy()


textbox = Text(
    ws,
    height=13,
    width=40,
    wrap='word',
    bg='#D9BDAD'
)
textbox.pack(expand=True)

Button(
    ws,
    text='Choose Pdf File',
    padx=20,
    pady=10,
    bg='#262626',
    fg='white',
    command=read_pdf
).pack(expand=True, side=LEFT, pady=10)

Button(
    ws,
    text="Copy Text",
    padx=20,
    pady=10,
    bg='#262626',
    fg='white',
    command=copy_pdf_text
).pack(expand=True, side=LEFT, pady=10)


ws.mainloop()

Output:

In this output, we have used the Python Tkinter Text box to show the text of the PDF file. The user clicks the Choose Pdf File button and, using the file dialog box in Python Tkinter, navigates to and selects the PDF file on the computer.

The text is displayed in the Text box immediately. From here, the user can copy all of it simply by clicking the Copy Text button, and the copied text can be pasted anywhere, as usual. The user can also select any portion of the text and use it as needed.


How to Convert Pdf to word Python pypdf2

Now, it is time to develop a tool: a PDF to Word converter using Python.

In this section, we have created software to convert a PDF to Word in Python. PyPDF2 reads the details of the uploaded PDF, and pdf2docx is imported for the conversion itself. This software can be used as a minor project using Python Tkinter.

from PyPDF2 import PdfFileReader
from tkinter import *
from tkinter import filedialog
import pdf2docx

f = ("Times", "15", "bold")

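selected_pdf = []   # holds the path of the most recently uploaded PDF

def export_toword():
    # Convert the uploaded PDF to a .docx file chosen in a save dialog.
    if not selected_pdf:
        return  # nothing has been uploaded yet
    docx_file = filedialog.asksaveasfilename(
        defaultextension=".docx",
        filetypes=(("Word files", "*.docx"), ("all files", "*.*")))
    if docx_file:
        pdf2docx.parse(pdf_file=selected_pdf[0], docx_file=docx_file)


def browseFiles():
    filename = filedialog.askopenfilename(
        initialdir = "/",
        title = "Select a File",
        filetypes = (("PDF files","*.pdf*"),("all files","*.*")))
    if not filename:
        return None  # the dialog was cancelled
    selected_pdf[:] = [filename]   # remember the path for export_toword()
    fname = filename.split('/')
    upload_confirmation_lbl.configure(text=fname[-1])
    process(filename)
    return filename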
    

def process(filename):
    # Read the document details while the file is still open.
    with open(filename, 'rb') as pdf_file:
        pdf = PdfFileReader(pdf_file)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
    fname = filename.split('/')
    # Each value goes to the label in the same grid row as its caption.
    right1.config(text=f'{information.author}')      # row 1: Author
    right2.config(text=f'{information.producer}')    # row 2: Producer
    right3.config(text=f'{fname[-1]}')               # row 3: Information about
    right4.config(text=f'{information.subject}')
    right5.config(text=f'{information.title}')
    right6.config(text=f'{number_of_pages}')

      

ws = Tk()
ws.title('PythonGuides')
ws.geometry('800x800')


upload_frame = Frame(
    ws,
    padx=5,
    pady=5
    )
upload_frame.pack(pady=10)

upload_btn = Button(
    upload_frame,
    text='UPLOAD PDF FILE',
    padx=20,
    pady=20,
    bg='#f74231',
    fg='white',
    command=browseFiles
)
upload_btn.pack(expand=True)
upload_confirmation_lbl = Label(
    upload_frame,
    pady=10,
    fg='green'
)
upload_confirmation_lbl.pack()

description_frame = Frame(
    ws,
    padx=10,
    pady=10
)
description_frame.pack()

right1 = Label(
    description_frame,
)
right2 = Label(
    description_frame,
)
right3 = Label(
    description_frame,
)
right4 = Label(
    description_frame,
)
right5 = Label(
    description_frame,
)
right6 = Label(
    description_frame
)

left1 = Label(
    description_frame,
    text='Author: ',
    padx=5,
    pady=5,
    font=f
    
)
left2 = Label(
    description_frame,
    text='Producer: ',
    padx=5,
    pady=5,
    font=f
)

left3 = Label(
    description_frame,
    text='Information about: ',
    padx=5,
    pady=5,
    font=f
)

left4 = Label(
    description_frame,
    text='Subject: ',
    padx=5,
    pady=5,
    font=f
)

left5 = Label(
    description_frame,
    text='Title: ',
    padx=5,
    pady=5,
    font=f
)

left6 = Label(
    description_frame,
    text='Number of pages: ',
    padx=5,
    pady=5,
    font=f
)

left1.grid(row=1, column=0, sticky=W)
left2.grid(row=2, column=0, sticky=W)
left3.grid(row=3, column=0, sticky=W)
left4.grid(row=4, column=0, sticky=W)
left5.grid(row=5, column=0, sticky=W)
left6.grid(row=6, column=0, sticky=W)

right1.grid(row=1, column=1)
right2.grid(row=2, column=1)
right3.grid(row=3, column=1)
right4.grid(row=4, column=1)
right5.grid(row=5, column=1)
right6.grid(row=6, column=1)

export_frame = LabelFrame(
    ws,
    text="Export File As",
    padx=10,
    pady=10,
    font=f

)
export_frame.pack(expand=True, fill=X)
to_text_btn = Button(
    export_frame,
    text="TEXT FILE",
    command=None,
    pady=20,
    font=f,
    bg='#00ad8b',
    fg='white'
)
to_text_btn.pack(side=LEFT, expand=True, fill=BOTH)

to_word_btn = Button(
    export_frame,
    text="WORD FILE",
    command=export_toword,
    pady=20,
    font=f,
    bg='#00609f',
    fg='white'
)
to_word_btn.pack(side=LEFT, expand=True, fill=BOTH)


ws.mainloop()

Output:

This is a multi-purpose application. It displays brief information about the selected PDF and offers export to a Word file and to a text file. Note that in the code above the TEXT FILE button is still a placeholder, since its command is set to None.
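
The TEXT FILE button in the code above has no command attached. One possible way to fill that gap is sketched below. export_to_text is a hypothetical function, and it assumes the page texts have already been extracted into a list of strings (for example with extractText() on each page).

```python
def export_to_text(page_texts, txt_path):
    """Write the extracted page texts to a UTF-8 text file.

    Pages are separated by a blank line. Returns the path written to.
    """
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write("\n\n".join(page_texts))
    return txt_path
```

A callback for the button could collect the page texts from the uploaded PDF, ask for a file name with filedialog.asksaveasfilename(), and pass both to this function.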


This is the PDF to Word converter developed using Python.

The above code will help to solve the below problems:

  • pypdf2 convert pdf to word
  • pypdf2 convert pdf to docx
  • Convert pdf to docx in python
  • Convert pdf to docx using python
  • Convert pdf to text file using python
  • How to convert pdf to word in python
  • How to convert pdf to word file in python

This is how to convert a PDF to Word in Python, with PyPDF2 reading the PDF details and pdf2docx performing the conversion.


In this tutorial, we have learned how to extract text from PDF in Python. Also, we have covered these topics:

  • Python copy text from pdf file
  • Extract text from pdf Python
  • Delete text from pdf file in Python
  • Python extract text from image
  • Can’t copy text from pdf in Python
  • How to copy text from pdf to word in Python
  • How to Select text from pdf file in Python
  • How to Convert Pdf to word Python pypdf2