Indexing and searching PDFs in WordPress

Relevanssi Premium users have asked for PDF indexing since day one, and version 2.0 finally introduces this feature. Coming up with a method that is fast and reliable hasn’t been easy, but we’re pretty proud of what we have now. Our PDF indexer doesn’t tax your server, because it runs as a service on a separate server.

Which PDF files can you index?

Since Relevanssi is a WordPress search, it operates on WordPress posts (including all the different post types). To have Relevanssi index your PDFs, they need to be WordPress posts. Fortunately, that’s really simple: just upload your PDF files to the Media library, and they become posts with the post type attachment.

Relevanssi can only parse and read PDF files that contain text. If the PDF file is all images, it cannot be read. An easy way to check is to try to select the text in a PDF reader. If you can select the text, Relevanssi can read it. If you can’t, the text is stored as an image (for example, a scanned document that hasn’t been OCR processed), and Relevanssi can’t read it.
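If you have a pile of files to check, the select-the-text test can be roughly approximated with a script. The sketch below is not part of Relevanssi. It assumes the open source pdftotext command-line tool is installed on your own computer, and the function names and the 20-character threshold are arbitrary choices made for illustration.

```python
import shutil
import subprocess


def looks_like_text(extracted: str, min_chars: int = 20) -> bool:
    """Return True if the extracted output holds a meaningful amount of text.

    The threshold is an assumption: image-only PDFs usually produce
    nothing but whitespace and form feed characters.
    """
    visible = "".join(extracted.split())  # drop all whitespace, incl. form feeds
    return len(visible) >= min_chars


def pdf_has_text(path: str) -> bool:
    """Run pdftotext on a local file and check whether real text comes out."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed")
    # Passing "-" as the output file makes pdftotext write to standard output.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return looks_like_text(result.stdout)
```

A file for which `pdf_has_text()` returns False is likely a scanned document and would need OCR processing before it can be indexed.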

There may be some restrictions on the PDF size. We are not sure where the limits are, but it’s fairly safe to expect trouble when indexing 100-megabyte PDF files with complex structures. 100 pages of plain text is easy.

How does the PDF indexing work?

Relevanssi PDF indexing is a two-step process. First, the PDF content is read and stored in a custom field (_relevanssi_pdf_content). This alone does not index the PDF content – it just makes it available for future indexing and ensures you don’t have to read the PDF contents many times.

The second step is the actual indexing. Here Relevanssi offers two different methods. You can choose to index the attachment post type, in which case the search results will include the PDF attachment posts. The other method is to index the PDF content for the parent post of the attachment, in which case the search results will include the post the PDF is attached to.
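To make the two steps concrete, here is a minimal conceptual sketch. It is not Relevanssi’s actual code (Relevanssi is a WordPress plugin written in PHP). Posts are modelled as plain dictionaries, the function names are hypothetical, and skipping PDFs without a parent post is an assumption made for the example. Only the custom field name comes from the plugin.

```python
PDF_FIELD = "_relevanssi_pdf_content"


def store_pdf_content(attachment: dict, text: str) -> None:
    """Step 1: keep the extracted text in a custom field on the attachment."""
    attachment.setdefault("meta", {})[PDF_FIELD] = text


def build_index_entries(attachment: dict, target: str = "attachment") -> list:
    """Step 2: decide which post the stored text is indexed for.

    target="attachment": the attachment post itself shows up in results.
    target="parent": the post the PDF is attached to shows up instead.
    """
    text = attachment.get("meta", {}).get(PDF_FIELD, "")
    if not text:
        return []  # step 1 has not run, so there is nothing to index
    if target == "parent":
        post_id = attachment.get("post_parent")
        if not post_id:
            return []  # assumption: unattached PDFs are skipped in this mode
    else:
        post_id = attachment["ID"]
    # One (post ID, word) pair per distinct word, in alphabetical order.
    return [(post_id, word) for word in sorted(set(text.lower().split()))]
```

The point of the split is that the expensive part (reading the PDF) happens once, while the cheap part (deciding where the words go) can be repeated whenever the index is rebuilt.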

The PDF content is not read on your own server, so even sites on shared hosting can reliably process larger files. The files are sent to Relevanssiservices.com, which is a DigitalOcean Droplet hosted in the USA. There the files are processed with Spatie’s pdf-to-text, which uses the open source pdftotext.

While we really don’t care what’s inside the PDF files you index on our server, the server needs to make working copies of the files. The copies are removed after use. It is still possible that someone with access to the server could see your files: besides our own staff, I assume DigitalOcean staff may be able to access the contents of the server. If your files are really sensitive and confidential, it is best not to index them with our service.

How to index a single PDF? How does Relevanssi see a PDF file? What about errors?

Go to the attachment edit page. You can get to the edit page from the Media library: click an attachment, then click “Edit more details”.

Attachment details
Click “Edit more details” to see the Relevanssi PDF controls for individual attachments.

That will take you to the attachment edit page. There you will find the Relevanssi PDF controls. To read the PDF contents, click the “Index PDF contents” button. If everything goes well, the page will reload and a “PDF Content” text box will appear, showing you the PDF content as seen by the Relevanssi PDF extractor. If there’s a problem, a “PDF Error” box will appear with the error message.

Relevanssi PDF controls
Relevanssi PDF controls for a PDF file that has been successfully read.
Relevanssi PDF error
A typical error message: a PDF file that doesn’t have any text, just images, will result in this error.

Indexing PDF files in bulk

Attachments tab
Relevanssi attachments tab for reading the PDF contents in bulk.

To read in all the PDF files on your site, go to the Attachments tab on the Relevanssi settings page. There you will find the tools for reading all the PDF files at once. The process will take a while if you have lots of PDF files, but it requires just a single click of a button and some patience.

For more information about the Attachments tab, see Installing Relevanssi and adjusting the settings in Relevanssi User Manual.

Searching the PDF content

Once the PDF content is read in and you’ve indexed the content – either by indexing the attachment post type or for the parent post – searches will automatically target the PDF content.

You can use phrases for PDF content as well: wrap the search query in quotes, like this: "search phrase", and only posts and PDF files containing that exact phrase will be returned.
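For the curious, the idea behind phrase matching can be sketched in a few lines. This is only an illustration of the concept, not how Relevanssi implements it, and both function names are made up for the example. It pulls the quoted phrases out of a query and checks that each one appears in the content as a run of whole words.

```python
import re


def extract_phrases(query: str) -> list:
    """Return the quoted phrases in a query, lowercased and space-normalised."""
    found = re.findall(r'"([^"]+)"', query)
    phrases = [" ".join(p.lower().split()) for p in found]
    return [p for p in phrases if p]  # ignore quotes with nothing inside


def contains_all_phrases(query: str, content: str) -> bool:
    """True if every quoted phrase occurs in the content as whole words."""
    # Collapse whitespace so a line break inside PDF text does not split a phrase.
    text = " ".join(content.lower().split())
    return all(
        re.search(r"\b" + re.escape(phrase) + r"\b", text)
        for phrase in extract_phrases(query)
    )
```

With this sketch, the query "search phrase" matches content containing "A search phrase in a PDF", but not content that only has the two words in reverse order or as parts of longer words.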

Custom field excerpts
Enable this option to create excerpts from custom field content, including the PDF file contents which are stored in a custom field.

To see PDF contents in Relevanssi-generated excerpts, you can check the “Use custom fields for excerpts” option (because PDF content is stored in custom fields). Do note that generating excerpts is the slowest part of searching to start with, and if your PDFs have lots of text, enabling this option may make the search slower. You can try and see how it works; especially with word-based excerpt lengths, the process may not be too slow.
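As a rough idea of what a word-based excerpt involves, here is a small sketch. It is a simplified illustration under our own assumptions, not Relevanssi’s excerpt code: the function name, the default length and the punctuation handling are all invented for the example. It finds the first occurrence of the search term and cuts a window of whole words around it.

```python
def word_excerpt(content: str, term: str, length: int = 30) -> str:
    """Return up to `length` words of content, centred on the first hit."""
    words = content.split()
    # Compare lowercased words with surrounding punctuation stripped.
    cleaned = [w.lower().strip('.,;:!?"()') for w in words]
    try:
        hit = cleaned.index(term.lower())
    except ValueError:
        hit = 0  # term not found: fall back to the start of the text
    start = max(0, hit - length // 2)
    end = min(len(words), start + length)
    excerpt = " ".join(words[start:end])
    # Mark the ends that were cut off.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(words) else ""
    return prefix + excerpt + suffix
```

The longer the PDF text, the more words there are to scan for each search result, which is why long documents can slow excerpt generation down.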

Highlighted PDF content
And here we go, with highlighted PDF content found in searches by Relevanssi!
