Command line utilities such as grep and ack-grep are great for searching plain-text files for patterns matching a specified regular expression. But have you ever tried using these utilities to search for patterns in a PDF file? Well, don’t! You will not get any result as these tools cannot read PDF files; they only read plain-text files.
pdfgrep, as the name suggests, is a small command line utility that makes it possible to search for text in a PDF file without opening the file. It is very fast, typically faster than the search provided by PDF document viewers. A key difference between grep and pdfgrep is that pdfgrep operates on pages, whereas grep operates on lines. pdfgrep also prints a single line multiple times if more than one match is found on that line. Let’s look at how exactly to use the tool.
For Ubuntu and other Linux distros based on Ubuntu, it is pretty simple:
sudo apt install pdfgrep
For other distros, pass pdfgrep as the package name to your package manager, and that should get it installed. You can also check out the project’s GitLab page, in case you want to play around with the code.
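As a rough sketch of what that looks like on a few common distros, the snippet below prints (but does not run) an install command for the first package manager it finds. It assumes the package is named pdfgrep in each repository, which is worth confirming for your system.

```shell
# Print (not run) an install command for the first package manager found.
# Assumption: the package is called "pdfgrep" in each repository.
if command -v apt >/dev/null 2>&1; then
    install_cmd="sudo apt install pdfgrep"
elif command -v dnf >/dev/null 2>&1; then
    install_cmd="sudo dnf install pdfgrep"
elif command -v pacman >/dev/null 2>&1; then
    install_cmd="sudo pacman -S pdfgrep"
else
    install_cmd="(unknown package manager: search its repositories for pdfgrep)"
fi
echo "$install_cmd"
```

Copy the printed command into your terminal once you are happy with it.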
Now that you have the tool installed, let’s go for a test run. The pdfgrep command takes this format:
pdfgrep [OPTION...] PATTERN [FILE...]
OPTION is a list of extra flags that modify the command’s behavior, such as -i or --ignore-case, two spellings of the same flag that ignore case distinctions between the specified pattern and the text matching it in the file.
PATTERN is just an extended regular expression.
FILE is just the name of the file, if it is in the same working directory, or the path to the file.
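To make the format concrete, here is a minimal sketch that puts the three parts together. The file name report.pdf and the pattern are placeholders of my own, not from any particular document; -n is the pdfgrep option that prefixes each match with its page number. The guard around the call is only there so the snippet does nothing harmful if pdfgrep or the file is missing.

```shell
# Hypothetical file name and pattern used for illustration; replace with your own.
pdf="report.pdf"
pattern="queue"

# Only run the search if pdfgrep is installed and the file exists.
if command -v pdfgrep >/dev/null 2>&1 && [ -f "$pdf" ]; then
    # OPTION = -n (show page numbers), PATTERN = $pattern, FILE = $pdf
    pdfgrep -n "$pattern" "$pdf" || echo "No matches found"
else
    echo "Skipping: pdfgrep or $pdf is missing" >&2
fi
```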
I ran the command on the official Python 3.6 documentation. The following image is the result.
The red highlights indicate all the places the word “queue” was encountered. Passing -i as an option to the command included matches of the word “Queue.” Remember, the case does not matter when -i is passed as an option.
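If you only want to know how many times a word shows up regardless of case, a small helper like the one below may be handy. The function name count_matches is my own invention; it combines -i with -c, the pdfgrep option that prints the number of matches instead of the matching lines.

```shell
# Hypothetical helper: count case-insensitive matches of a pattern in a PDF.
# Usage: count_matches PATTERN FILE
# Returns non-zero without searching if pdfgrep or the file is missing.
count_matches() {
    command -v pdfgrep >/dev/null 2>&1 || return 127
    [ -f "$2" ] || return 2
    # -i ignores case, -c prints the match count instead of the matches.
    pdfgrep -i -c "$1" "$2"
}
```

Calling count_matches queue followed by the path to your PDF should count “queue” and “Queue” alike.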
pdfgrep has quite a number of interesting options to use. However, I’ll cover only a few here.
The full list of supported options can be found in the man pages or in the pdfgrep online documentation. Don’t forget pdfgrep can search multiple files at the same time, in case you’re working with some bulk files. The default match highlight color can be changed by altering the GREP_COLORS environment variable.
The next time you think of opening up a PDF file to search for anything, think of using pdfgrep. The tool comes in handy and will save you time.
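Here is a hedged sketch that combines both ideas: searching every PDF in the current directory and changing the highlight color. It assumes GREP_COLORS follows the grep-style field syntax, so that mt=01;32 means bold green matches; check the man page if the colors come out differently on your system. The pattern and the ./*.pdf glob are placeholders.

```shell
# Assumption: "mt=01;32" sets matched text to bold green, as in grep.
export GREP_COLORS='mt=01;32'

# Only search if pdfgrep is installed and at least one PDF is present here.
if command -v pdfgrep >/dev/null 2>&1 && ls ./*.pdf >/dev/null 2>&1; then
    # -H prints the file name with each match, -n the page number;
    # --color=always forces highlighting even when output is piped.
    pdfgrep -H -n --color=always "queue" ./*.pdf || echo "No matches found"
else
    echo "Skipping: pdfgrep is not installed or no PDFs are here" >&2
fi
```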