
Indexing and searching PDFs in WordPress


Relevanssi Premium users have asked for PDF indexing since day one, and version 2.0 finally introduces this feature. Coming up with a method that is fast and reliable hasn’t been easy, but we’re pretty proud of what we have now. Our PDF indexer doesn’t tax your server as it runs as a service on a separate server.

Which PDF files can you index?

Since Relevanssi is a WordPress search, Relevanssi operates on WordPress posts (including all the different post types). So, in order to have Relevanssi index your PDFs, they need to be WordPress posts. That’s fortunately really simple: just upload your PDF files to the Media library, and they become posts with the post type of attachment.

Relevanssi can only parse and read PDF files that contain text. If the PDF file is all images, it cannot be read. An easy way to check is to try to select the text in a PDF reader. If you can select the text, Relevanssi can read it, but if you can’t, the text is stored as an image (for example a scanned document that hasn’t been OCR processed), and Relevanssi can’t read it.

There may be some restrictions on the PDF size. We are not sure where the limits are, but it’s fairly safe to expect trouble when indexing 100-megabyte PDF files with complex structures. 100 pages of plain text is easy.

How does the PDF indexing work?

Relevanssi PDF indexing is a two-step process. First, the PDF content is read and stored in a custom field (_relevanssi_pdf_content). This alone does not index the PDF content – it just makes it available for future indexing and ensures you don’t have to read the PDF contents many times.
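Because the extracted text lives in an ordinary custom field, you can inspect it with standard WordPress functions. The following is a minimal sketch, not part of Relevanssi: the function names `rlv_example_preview` and `rlv_example_pdf_preview` are made up for illustration, and only the field name `_relevanssi_pdf_content` comes from the description above.

```php
<?php
/**
 * Pure helper (hypothetical name): shorten extracted PDF text
 * to at most $max_words words, collapsing whitespace.
 */
function rlv_example_preview(string $text, int $max_words = 20): string {
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) <= $max_words) {
        return implode(' ', $words);
    }
    return implode(' ', array_slice($words, 0, $max_words)) . '...';
}

/**
 * Return a short preview of the stored PDF text for an attachment,
 * or an empty string if nothing has been read yet (hypothetical name).
 */
function rlv_example_pdf_preview(int $attachment_id): string {
    if (!function_exists('get_post_meta')) {
        return ''; // Not running inside WordPress.
    }
    // The third argument "true" asks for a single value instead of an array.
    $content = get_post_meta($attachment_id, '_relevanssi_pdf_content', true);
    return is_string($content) ? rlv_example_preview($content) : '';
}
```

An empty result only tells you that the first step (reading) has not produced content for that attachment; it says nothing about whether the content has been indexed.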

The second step is the actual indexing. Here Relevanssi offers two different methods. You can choose to index the attachment post type, in which case the search results will include the PDF attachment posts. The other method is to index the PDF content for the parent post of the attachment, in which case the search results will include the post the PDF is attached to.

The PDF content reading is not done on your own server, which ensures that even sites on shared hosting can reliably read even larger files. The files are sent to Relevanssiservices.com, which is a Digital Ocean Droplet hosted in the USA. There the files are processed with Spatie’s pdf-to-text, which uses the open source pdftotext.

While we really don’t care what’s inside the PDF files you index on our server, the server needs to make working copies of the files. The copies are removed after use. It is possible that someone outside our staff with access to the server could see your files (I assume Digital Ocean staff may be able to access the contents of the server). If your files are sensitive or confidential, it is best not to index them with our service.

How to index a single PDF? How does Relevanssi see a PDF file? What about errors?

Go to the attachment edit page. You can get to the edit page from the Media library: click an attachment, then click “Edit more details”.

Attachment details
Click “Edit more details” to see the Relevanssi PDF controls for individual attachments.

That will take you to the attachment edit page. There you will find the Relevanssi PDF controls. To read the PDF contents, click the “Index PDF contents” button. If everything goes well, the page will reload and a “PDF Content” text box will appear, showing you the PDF content as seen by the Relevanssi PDF extractor. If there’s a problem, a “PDF Error” box will appear with the error message.

Relevanssi PDF controls
Relevanssi PDF controls for a PDF file that has been successfully read.
Relevanssi PDF error
A typical error message: a PDF file that doesn’t have any text, just images, will result in this error.

Indexing PDF files in bulk

Attachments tab
Relevanssi attachments tab for reading the PDF contents in bulk.

To read in all the PDF files on your site, go to the Attachments tab on the Relevanssi settings page. There you will find the tools for reading all the PDF files at once. The process will take a while if you have lots of PDF files, but it requires just a single click of a button and some patience.
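To get an idea of how much work the bulk read still has left, you can list the PDF attachments that have no stored content yet. This is a rough sketch using the standard WordPress `get_posts()` query arguments; the function names are hypothetical, and it assumes that a PDF which has not been read (or could not be read) simply lacks the `_relevanssi_pdf_content` custom field.

```php
<?php
/**
 * Build query arguments for PDF attachments without stored PDF content.
 * Hypothetical helper; the arguments are standard WP_Query parameters.
 */
function rlv_example_unread_pdf_args(): array {
    return array(
        'post_type'      => 'attachment',
        'post_status'    => 'inherit',          // Attachments use the "inherit" status.
        'post_mime_type' => 'application/pdf',
        'posts_per_page' => -1,                 // No paging: fetch all matches.
        'fields'         => 'ids',              // Only IDs are needed here.
        'meta_query'     => array(
            array(
                'key'     => '_relevanssi_pdf_content',
                'compare' => 'NOT EXISTS',
            ),
        ),
    );
}

/**
 * Return the IDs of PDF attachments that have no stored content.
 */
function rlv_example_unread_pdf_ids(): array {
    if (!function_exists('get_posts')) {
        return array(); // Not running inside WordPress.
    }
    return get_posts(rlv_example_unread_pdf_args());
}
```

On a site with thousands of attachments, fetching everything in one query may be slow; in that case it is probably better to page through the results.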

For more information about the Attachments tab, see Installing Relevanssi and adjusting the settings in Relevanssi User Manual.

Searching the PDF content

Once the PDF content is read in and you’ve indexed the content – either by indexing the attachment post type or for the parent post – searches will automatically target the PDF content.

You can use phrases for PDF content as well: wrap the search query in quotes, like this: "search phrase", and only posts and PDF files containing that exact phrase will be returned.

Custom field excerpts
Enable this option to create excerpts from custom field content, including the PDF file contents which are stored in a custom field.

To see PDF contents in Relevanssi-generated excerpts, check the “Use custom fields for excerpts” option (the PDF content is stored in a custom field). Note that generating excerpts is already the slowest part of searching, and if your PDFs have lots of text, enabling this option may make the search slower. Try it and see how it works; with word-based excerpt lengths in particular, the process may not be too slow.

Highlighted PDF content
And here we go, with highlighted PDF content found in searches by Relevanssi!

I’m still not seeing PDFs in search results

Make sure your search results are not restricted by post type. The PDF attachment posts have the post type attachment. If your search is being restricted to the post post type, you won’t see any PDFs in the results. The restriction usually comes from your theme. It may be a theme setting, or a hidden input field in the search form.

If you want to make sure attachments are included, you can add this function to theme functions.php:

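// Force the search to cover posts, pages and attachments,
// overriding any post type restriction set by the theme.
add_filter('relevanssi_modify_wp_query', 'rlv_force_post_product');
function rlv_force_post_product($query) {
    // Comma-separated list of the post types to search.
    $query->query_vars['post_types'] = 'post,page,attachment';
    return $query;
}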

Adjust the list of post types to suit your needs.
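If you would rather keep whatever restriction the theme sets and only add attachments to it, something along these lines might work. This is an untested sketch with hypothetical function names; it assumes the restriction arrives as a comma-separated string in the same `post_types` query variable used above, and it leaves the query alone when no restriction is set.

```php
<?php
/**
 * Pure helper (hypothetical name): add "attachment" to a comma-separated
 * post type list. An empty list or "any" means no restriction, so those
 * are returned unchanged.
 */
function rlv_example_add_attachment(string $post_types): string {
    $types = array_filter(array_map('trim', explode(',', $post_types)), 'strlen');
    if (empty($types) || in_array('any', $types, true)) {
        return $post_types;
    }
    if (!in_array('attachment', $types, true)) {
        $types[] = 'attachment';
    }
    return implode(',', $types);
}

/**
 * Filter callback: extend an existing post type restriction instead of
 * replacing it.
 */
function rlv_example_include_attachments($query) {
    $current = $query->query_vars['post_types'] ?? '';
    if (is_string($current) && $current !== '') {
        $query->query_vars['post_types'] = rlv_example_add_attachment($current);
    }
    return $query;
}

// Register the filter only when running inside WordPress.
if (function_exists('add_filter')) {
    add_filter('relevanssi_modify_wp_query', 'rlv_example_include_attachments');
}
```

Use either this or the simpler function, not both; if your theme passes the restriction in some other form, this sketch will not catch it.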

21 comments Indexing and searching PDFs in WordPress

  1. We’re implementing Relevanssi on our development site which is password protected. When we index our PDFs (relative links) we get the following error message: “PDF Processor error: Not a valid URL.”

    We can’t figure out why we’re getting this error message because the paths are indeed valid. Is it because our dev site is password protected? If so, how can we work around that?

    1. Gwinn, that error message appears when PHP can’t validate your URL as a valid URL. Generally a protected PDF gives an “Empty PDF file. Is the file publicly available?” error. However, since the site is password protected, that’s still your problem right there.

      On the Attachments tab, check the “Upload PDF files” option, and that should fix it.

  2. Hi!
    Pre sale question.
    I have a bunch of PDF files which I need to be searchable. However, I don’t want the results to come up in a normal WP search, only on the page containing the PDFs. Is this possible? Maybe some kind of backend filtering, like “search only the PDFs on this page”?

    1. John, Relevanssi doesn’t care where the actual files are hosted, but they must appear as attachment posts in the WordPress Media library, otherwise Relevanssi can’t access them.

        1. John, I have no idea. It’d have to go the other way around, I think: you’d have to upload the PDF to Media library, then have it hosted somewhere else. It works like that with Amazon S3, but I’m afraid it might not be as easy with Google Drive.

  3. I am attempting to index PDFs on my site (using Relevanssi premium) and every time it attempts to process the contents of a PDF file it instead returns…

    PDF Processor error: Key XXXXXXX is not valid.

    If I attempt to mass index them I get the same error for each PDF except it also returns the attachment ID.

    1. Jason, first of all, you should never post your API key in public: it’ll allow anyone to use your license.

      Second, you’re seeing the message because, well, that’s how it is: your key is not valid. Your license expired in January, and you haven’t renewed it. In order to have access to the PDF content reading, you need to have a valid license. You can renew your license here.

      1. I had no idea that was the API key! It was just a cryptic error message as far as I knew.
        I also didn’t know that the license was up. I will have our department renew the license. Will that solve the PDF parsing issue though or is it a problem with the site configuration?

  4. Hello Mikko,
    a question before buying.

    I have many PDF spare parts catalogs. If a catalog is indexed and the search term is, for example, a manufacturer number that was found in the document, is it possible to make this manufacturer number “clickable” (Add to Cart)?

    Best regards
    Mario

    1. Mario, that is something that’s up to your theme. Relevanssi doesn’t control what happens on your search results page, that’s the responsibility of your theme.

    1. Scott, you can have it both ways, depending on how Relevanssi is set up. The default is to link to the attachment page, but it’s a single checkbox to change it.

  5. Hi Mikko,
    I am indexing the PDFs in bulk, but after 4 1/2 hours, the progress bar area has been stuck on “Time remaining: less than a minute” for almost an hour. This is the second time I have tried. Last time it stayed like that for a few hours, and when I refreshed the screen, the state of the index showed none of the PDFs read in. I know you said the bulk feature requires patience, but is this common? There are about 500 PDFs.
    Thanks in advance!

    1. No, it’s not common. The indexing processes should always respond in a couple of minutes. If nothing happens in five minutes, something is definitely wrong. If you can use the EU server, I’d recommend trying that when you’re having problems, as it seems more reliable than the US server. I’ve rebooted the US server now and it seems to work again, but for some reason it’s down more often than the EU server, even though the two should be identical in all regards except the location.

      1. Thanks for getting back to me Mikko. I connected through VPN in Europe and it worked like a charm. Thanks man.
