Extracting text from common file formats in Python

This week I was working with a digital archive of historical documents. As part of that work, I've been experimenting with different ways of extracting text from common file formats using Python, trying to find methods that are reliable and fast enough for the things I need to do.

This post is a note on the best methods I have found so far. It may be helpful to others doing similar things. If you know of better ways of doing any of the things shown here, please do let me know, especially if they are faster or more reliable. I will update this post as I find more and better methods.

Bear in mind these are minimal examples. You'll want to add error handling and other checks appropriate to your dataset, especially if you're processing a large number of documents.

1. Setup

These are the packages you need to install to run all the examples shown below, although you probably won't need them all. The import statements in each example make it clear which packages that example depends on. Just be aware that PyMuPDF is imported as fitz, and that pytesseract is only a wrapper around the Tesseract OCR engine, so you will also need Tesseract itself installed on your system to run the scanned PDF example.


pip install PyMuPDF pdfplumber Pillow pytesseract textract extract-msg

2. Text PDFs

For modern PDFs that contain encoded text data, you can use PyMuPDF to extract the text.


import fitz

def extract_text_pdf(pdf_path):
    pdf = fitz.open(pdf_path)
    pages = []
    for page in pdf:
        page_text = page.get_text()
        pages.append(page_text)
    pdf.close()
    text = '\n\n'.join(pages)
    return text

text = extract_text_pdf('example.pdf')

This is the fastest method I have found so far, but note the following point from the PyMuPDF documentation.

The output will be plain text as it is coded in the document. No effort is made to prettify in any way. Specifically for PDF, this may mean output not in usual reading order, unexpected line breaks and so forth.

The package contains tools that allow for more granular control over how the text is extracted. You can get the text as a collection of blocks with positional information and use that to make inferences about the page layout. A full treatment is beyond the scope of this post, but the sketch below shows the general idea.
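
This is only a minimal sketch rather than something I rely on: it keeps the text blocks and sorts them top-to-bottom, then left-to-right, which is a simple heuristic that won't cope with multi-column layouts. The extract_text_blocks name is just for illustration.


import fitz

def extract_text_blocks(pdf_path):
    pdf = fitz.open(pdf_path)
    pages = []
    for page in pdf:
        # Each block is a tuple:
        # (x0, y0, x1, y1, text, block_no, block_type)
        blocks = page.get_text('blocks')
        # Keep only text blocks (type 0) and sort them
        # top-to-bottom, then left-to-right
        text_blocks = [b for b in blocks if b[6] == 0]
        text_blocks.sort(key=lambda b: (b[1], b[0]))
        page_text = '\n'.join(b[4] for b in text_blocks)
        pages.append(page_text)
    pdf.close()
    text = '\n\n'.join(pages)
    return text

text = extract_text_blocks('example.pdf')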

An alternative is to use the pdfplumber package, which works in a similar way. In my testing, this did a better job of preserving the positional order of page elements like headers and footers, but it also took longer than PyMuPDF to process the same document. If you care more about positional fidelity than speed, this may be a better option. You'll just have to try both and see which works best for you.


import pdfplumber

def extract_text_pdf(pdf_path):
    pdf = pdfplumber.open(pdf_path)
    pages = []
    for page in pdf.pages:
        # extract_text() can return None for pages with no text,
        # so fall back to an empty string
        page_text = page.extract_text() or ''
        pages.append(page_text)
    pdf.close()
    text = '\n\n'.join(pages)
    return text

text = extract_text_pdf('example.pdf')

3. Scanned PDFs

Some PDFs don't contain text data. Instead they are a collection of images from scans of a physical document. The text shown in these images can be extracted with optical character recognition. Use PyMuPDF to extract the images from a scanned PDF, then use Pillow and Tesseract to extract the text from the images. This is inevitably a slower process than extracting encoded text.


import fitz
import io

from PIL import Image
from pytesseract import image_to_string

def extract_text_scanned_pdf(pdf_path, verbose=False):
    
    pdf = fitz.open(pdf_path)
    pages = []

    # Process each page in turn
    for page_num, page in enumerate(pdf, start=1):
        
        # Get a list of images in this page
        pdf_image_list = page.get_images()
        
        if verbose:
            num_images = len(pdf_image_list)
            image_noun = 'image' if num_images == 1 else 'images'
            print('Extracting {0} {1} from page {2}'.format(
                num_images, image_noun, page_num))
        
        # Process each image in turn
        for pdf_image in pdf_image_list:
            
            # Get external reference and extract bytes
            pdf_image_xref = pdf_image[0]
            base_image = pdf.extract_image(pdf_image_xref)
            base_image_bytes = base_image['image']
            
            # Read image and extract text
            image = Image.open(io.BytesIO(base_image_bytes))
            page_text = image_to_string(image, lang='eng')
            pages.append(page_text)

    pdf.close()
    text = '\n\n'.join(pages)
    return text
                    
text = extract_text_scanned_pdf('example_scanned.pdf', verbose=True)

4. Word documents

You can use textract to extract the text of a Word document with a single line of code. The package can handle both .doc and .docx files, and a number of other file formats too. By default it will infer which parser to use based on the file extension at the end of the file path.

However, I would recommend setting the file extension explicitly in the function call, because asking textract to infer the file format directly from the file path can sometimes fail. In my case, some of the file paths in the Azure container I had mounted in DBFS had trailing spaces, which prevented textract from properly inferring the file extension.

The following code snippet shows how to extract text from an old-fashioned Word document with a .doc file extension.


import textract

doc_text = textract.process('example.doc', extension='doc').decode('utf-8')
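
The same call should work for modern .docx files too; the example below just assumes a file called example.docx and sets the extension accordingly.


import textract

docx_text = textract.process('example.docx', extension='docx').decode('utf-8')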

5. Emails

You can use textract to extract the contents of email files in both .eml and .msg format. However, I ran into some problems because it tries to figure out the input encoding of the text for you, and if that fails there is no way to handle the encoding errors, which prevents you from reading the file at all.

5.1. eml files

The following function takes the same approach to extracting text from .eml files that textract uses behind the scenes, but it lets you specify an input encoding and an errors argument, which is passed to Python's open function and controls how any encoding errors are handled.


from email.parser import Parser as EmailParser

def extract_text_eml(eml_path, encoding='utf-8', errors='strict'):

    with open(eml_path, encoding=encoding, errors=errors) as stream:    
        parser = EmailParser()
        message = parser.parse(stream)

    text_content = []
    for part in message.walk():
        if part.get_content_type().startswith('text/plain'):
            text_content.append(part.get_payload())

    text = '\n\n'.join(text_content)
    return text

eml_text = extract_text_eml('example.eml', errors='ignore')

5.2. msg files

In theory, there is no need to guess the encoding of .msg files, because each file should specify its own encoding. However, textract could not find an encoding in some of the .msg files I was working with, and that prevented it from extracting the text. The following function uses the msg-extractor package directly, which lets you set the encoding explicitly.


from extract_msg import Message

def extract_text_msg(msg_path, encoding='utf-8'):
    msg = Message(msg_path, overrideEncoding=encoding)
    text = msg.body
    return text

msg_text = extract_text_msg('example.msg', encoding='cp1252')

6. Changelog

This article was originally posted on 18 December 2021, but it is a living document. It is intended to serve as a reference for useful text extraction methods and a record of some of the issues to bear in mind when doing this kind of work. Because this article may be updated from time to time, changes are recorded here.

  • 2021-12-18 – Initial post containing text extraction methods for text PDFs, scanned PDFs, and Word documents in .doc and .docx formats.
  • 2022-01-07 – Added text extraction methods for emails in .eml and .msg format.
  • 2022-01-11 – Updated text extraction methods for PDFs so that they return a single string of text rather than a list of per-page strings.

Let me know if you have any suggestions for additional methods or improvements.