Extracting text from common file formats in Python
This week I was working with a digital archive of historical documents. As part of that work, I've been experimenting with different ways of extracting text from common file formats using Python, trying to find methods that are reliable and fast enough for the things I need to do.
This post is a note on the best methods I have found so far. It may be helpful to others doing similar things. If you know of better ways of doing any of the things shown here, please do let me know, especially if they are faster or more reliable. I will update this post as I find more and better methods.
Bear in mind these are minimal examples. You'll want to add error handling and other checks appropriate to your dataset, especially if you're processing a large number of documents.
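For example, when looping over a batch of documents, something like the following sketch keeps one unreadable file from halting the whole run (extract_func here stands for any of the extraction functions shown later in this post):

import logging

def extract_all(paths, extract_func):
    # Extract text from each file, logging failures instead of stopping
    results = {}
    for path in paths:
        try:
            results[path] = extract_func(path)
        except Exception:
            logging.exception('Failed to extract text from %s', path)
            results[path] = None
    return results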
1. Setup
These are the packages you need to install to run all the examples shown below, but you probably don't need them all: the import statements in each example make it clear which packages it relies on. Just be aware that PyMuPDF is imported as fitz, and that pytesseract is only a wrapper around the Tesseract OCR engine, which must be installed on your system separately.
pip install PyMuPDF pdfplumber Pillow pytesseract textract extract-msg
2. Text PDFs
For modern PDFs that contain encoded text data, you can use PyMuPDF to extract the text.
import fitz

def extract_text_pdf(pdf_path):
    pdf = fitz.open(pdf_path)
    pages = []
    # Extract the text from each page in turn
    for page in pdf:
        page_text = page.get_text()
        pages.append(page_text)
    pdf.close()
    # Combine the pages into a single string
    text = '\n\n'.join(pages)
    return text

text = extract_text_pdf('example.pdf')
This is the fastest method I have found so far, but note the following point from the PyMuPDF documentation.
The output will be plain text as it is coded in the document. No effort is made to prettify in any way. Specifically for PDF, this may mean output not in usual reading order, unexpected line breaks and so forth.
The package contains tools that allow for more granular control of how the text is extracted. You can get the text as a collection of blocks with positional information and use that to make inferences about the page layout. But that is beyond the scope of this post.
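For reference, a minimal sketch of that block-based approach might look something like the following; the sorting heuristic (top-to-bottom, then left-to-right) is just an assumption and will need adjusting for things like multi-column layouts.

import fitz

def extract_text_blocks(pdf_path):
    pdf = fitz.open(pdf_path)
    pages = []
    for page in pdf:
        # Each block is a tuple: (x0, y0, x1, y1, text, block_no, block_type)
        blocks = page.get_text('blocks')
        # Sort blocks top-to-bottom, then left-to-right, to approximate reading order
        blocks.sort(key=lambda block: (block[1], block[0]))
        pages.append('\n'.join(block[4] for block in blocks))
    pdf.close()
    return '\n\n'.join(pages)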
An alternative is to use the pdfplumber package, which works in a similar way. In my testing, this did a better job of preserving the positional order of page elements like headers and footers, but it also took longer than PyMuPDF to process the same document. If you care more about positional fidelity than speed, this may be a better option. You'll just have to try both and see which works best for you.
import pdfplumber

def extract_text_pdf(pdf_path):
    pdf = pdfplumber.open(pdf_path)
    pages = []
    # Extract the text from each page in turn
    for page in pdf.pages:
        page_text = page.extract_text()
        pages.append(page_text)
    pdf.close()
    # Combine the pages into a single string
    text = '\n\n'.join(pages)
    return text

text = extract_text_pdf('example.pdf')
3. Scanned PDFs
Some PDFs don't contain text data. Instead they are a collection of images from scans of a physical document. The text shown in these images can be extracted with optical character recognition. Use PyMuPDF to extract the images from a scanned PDF, then use Pillow and Tesseract to extract the text from the images. This is inevitably a slower process than extracting encoded text.
import fitz
import io
from PIL import Image
from pytesseract import image_to_string

def extract_text_scanned_pdf(pdf_path, verbose=False):
    pdf = fitz.open(pdf_path)
    pages = []
    # Process each page in turn
    for page_num, page in enumerate(pdf, start=1):
        # Get a list of images in this page
        pdf_image_list = page.get_images()
        if verbose:
            num_images = len(pdf_image_list)
            image_noun = 'image' if num_images == 1 else 'images'
            print('Extracting {0} {1} from page {2}'.format(
                num_images, image_noun, page_num))
        # Process each image in turn
        for pdf_image in pdf_image_list:
            # Get external reference and extract bytes
            pdf_image_xref = pdf_image[0]
            base_image = pdf.extract_image(pdf_image_xref)
            base_image_bytes = base_image['image']
            # Read image and extract text
            image = Image.open(io.BytesIO(base_image_bytes))
            page_text = image_to_string(image, lang='eng')
            pages.append(page_text)
    pdf.close()
    text = '\n\n'.join(pages)
    return text

text = extract_text_scanned_pdf('example_scanned.pdf', verbose=True)
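If a scanned page isn't stored as a single clean image (for example, it has been split into strips), an alternative I haven't needed myself is to render each whole page with PyMuPDF and run OCR on the rendered page instead. A rough sketch:

import fitz
from PIL import Image
from pytesseract import image_to_string

def extract_text_rendered_pdf(pdf_path):
    pdf = fitz.open(pdf_path)
    pages = []
    for page in pdf:
        # Render the page at roughly 300 dpi (72 dpi is the PDF default)
        pixmap = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
        image = Image.frombytes('RGB', (pixmap.width, pixmap.height), pixmap.samples)
        pages.append(image_to_string(image, lang='eng'))
    pdf.close()
    return '\n\n'.join(pages)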
4. Word documents
You can use textract to extract the text of a Word document with a single line of code. The package can handle both .doc and .docx files, and a number of other file formats too. By default it will infer which parser to use based on the file extension at the end of the file path.
However, I would recommend setting the file extension explicitly in the function call, because asking textract to infer the file format directly from the file path can sometimes fail. In my case, some of the file paths in the Azure container I had mounted in DBFS had trailing spaces, which prevented textract from properly inferring the file extension.
The following code snippet shows how to extract text from an old-fashioned Word document with a .doc file extension.
import textract
doc_text = textract.process('example.doc', extension='doc').decode('utf-8')
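The same pattern should work for modern .docx files; just set the extension to match.

docx_text = textract.process('example.docx', extension='docx').decode('utf-8')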
5. Emails
You can use textract to extract the contents of email files in both .eml and .msg format. However, I ran into some problems because it tries to figure out the input encoding of the text for you, and if this fails there is no way to handle the encoding errors, which prevents you from reading the file at all.
5.1. eml files
The following function takes the same approach to extracting text from .eml files that textract uses behind the scenes, but it lets you specify the input encoding explicitly, and it adds an errors argument that is passed straight to Python's open function to control how any encoding errors are handled.
from email.parser import Parser as EmailParser

def extract_text_eml(eml_path, encoding='utf-8', errors='strict'):
    with open(eml_path, encoding=encoding, errors=errors) as stream:
        parser = EmailParser()
        message = parser.parse(stream)
    # Keep only the plain text parts of the message
    text_content = []
    for part in message.walk():
        if part.get_content_type().startswith('text/plain'):
            text_content.append(part.get_payload())
    text = '\n\n'.join(text_content)
    return text

eml_text = extract_text_eml('example.eml', errors='ignore')
5.2. msg files
In theory, there is no need to guess the encoding of .msg files, because each file should specify its own encoding. However, textract could not find an encoding in some of the .msg files I was working with, and that prevented it from extracting the text. This function uses the extract-msg package directly, which lets you set the encoding explicitly.
from extract_msg import Message

def extract_text_msg(msg_path, encoding='utf-8'):
    msg = Message(msg_path, overrideEncoding=encoding)
    text = msg.body
    return text

msg_text = extract_text_msg('example.msg', encoding='cp1252')
6. Changelog
This article was originally posted on 18 December 2021, but it is a living document. It is intended to serve as a reference for useful text extraction methods and a record of some of the issues to bear in mind when doing this kind of work. Because this article may be updated from time to time, changes are recorded here.
- 2021-12-18 – Initial post containing text extraction methods for text PDFs, scanned PDFs, and Word documents in .doc and .docx formats.
- 2022-01-07 – Added text extraction methods for emails in .eml and .msg format.
- 2022-01-11 – Updated text extraction methods for PDFs so that they return a single string of text rather than a list of strings, one for each page.
Let me know if you have any suggestions for additional methods or improvements.