PyCon X


2nd - 5th May 2019

Text Extraction from PDFs made Easy

Overview: PDF is one of the most widely used digital document formats. It is used to present and exchange information reliably, independent of software, hardware, and operating system. Extracting data from PDF files can be tricky because of the complicated formats, fonts, and layouts stored in them.

Text extraction from PDFs is a common problem across industries. I will show you how to extract data from any number of PDF files using just one Python tool: regular expressions (regex). I encountered a similar problem at my company and used regular expressions to automate an otherwise strenuous task.
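As a rough illustration of the idea, here is a minimal sketch that pulls a few fields out of text with Python's built-in re module. It assumes the text of each PDF has already been obtained as a plain string by some earlier step; the file names, the invoice layout, and the field names are all hypothetical and are not taken from the talk itself.

```python
import re

# Hypothetical sample text, standing in for whatever a PDF-to-text step produces.
pages = {
    "invoice_001.pdf": "Invoice No: INV-1001\nDate: 03/05/2019\nTotal: 1,250.00 EUR",
    "invoice_002.pdf": "Invoice No: INV-1002\nDate: 04/05/2019\nTotal: 980.50 EUR",
}

# Named groups keep the extracted fields self-describing.
# The layout (labels and their order) is an assumption made for this sketch.
FIELDS = re.compile(
    r"Invoice No:\s*(?P<number>[A-Z]+-\d+)\s+"
    r"Date:\s*(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"Total:\s*(?P<total>[\d,]+\.\d{2})"
)


def extract_fields(text):
    """Return a dict of matched fields, or None if the text has a different layout."""
    match = FIELDS.search(text)
    return match.groupdict() if match else None


# The same pattern is applied to every document, however many there are.
records = {name: extract_fields(text) for name, text in pages.items()}

for name, record in records.items():
    print(name, record)
```

Because one compiled pattern is reused for every document, adding more files only means adding more entries to the input; documents that do not follow the assumed layout come back as None and can be reviewed by hand.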

1. Who is this talk for?

  • Anyone who needs to extract information from unstructured data at their workplace or university.

  • Anyone who wants to learn regular expressions in a fun and easy manner.

2. What background knowledge or experience should the audience have?

  • This is a beginner’s talk. The basics of regex will be built from the ground up during the talk.

3. What will the audience learn after attending the talk?

  • The audience will get a crisp overview of regular expressions, which can help with a wide spectrum of tasks, from wrangling data to qualifying and categorizing it.

  • I’ll walk through a use case and demo based on a problem I encountered at my workplace, giving attendees a clearer sense of how far regex can be applied.
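
To give a flavour of the "qualifying and categorizing" side, the sketch below labels lines of text by which patterns they contain. The category names, the patterns, and the sample lines are hypothetical and deliberately simplified; real-world e-mail or currency formats would need stricter expressions.

```python
import re

# Hypothetical categories and patterns, chosen only for illustration.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "date": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "amount": re.compile(r"\b\d+(?:\.\d{2})?\s?(?:EUR|USD)\b"),
}


def categorize(line):
    """Return the names of every pattern that is found somewhere in the line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


lines = [
    "Contact: jane.doe@example.com",
    "Due on 05/05/2019, amount 120.00 EUR",
    "Thank you for your business",
]

labels = [categorize(line) for line in lines]

for line, found in zip(lines, labels):
    print(found, "<-", line)
```

A line can carry several labels or none at all, which makes this a simple first pass for sorting extracted text before any field-level extraction.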


Scheduled on Sunday 5 May at 09:45.


  1. Hi Aakriti, well-written proposal!
    — Roberto Polli,
