
Indexing and searching PDFs in WordPress


Relevanssi Premium users have asked for PDF indexing since day one, and version 2.0 finally introduces this feature. Coming up with a method that is fast and reliable hasn’t been easy, but we’re pretty proud of what we have now. Our PDF indexer doesn’t tax your server as it runs as a service on a separate server.

Which PDF files can you index?

Since Relevanssi is a WordPress search, Relevanssi operates on WordPress posts (including all the different post types). So, in order to have Relevanssi index your PDFs, they need to be WordPress posts. That’s fortunately really simple: just upload your PDF files to the Media library, and they become posts with the post type of attachment.

Relevanssi can only parse and read PDF files that contain text. If the PDF file is all images, it cannot be read. An easy way to check is to try to select the text in a PDF reader. If you can select the text, Relevanssi can read it, but if you can’t, the text is stored as an image (for example a scanned document that hasn’t been OCR processed), and Relevanssi can’t read it.

There may be some restrictions on the PDF size. We are not sure where the limits are, but it’s fairly safe to expect trouble when indexing 100-megabyte PDF files with complex structures. 100 pages of plain text is easy.

What about other attachments?

Yes! Relevanssi can handle lots of different formats. Our server uses Apache Tika to handle the files, so that gives us a huge variety of supported formats. Most important document formats are covered: Word documents, Open Office documents, RTFs and so on.

How does the attachment indexing work?

Relevanssi attachment indexing is a two-step process. First, the attachment content is read and stored in a custom field (_relevanssi_pdf_content). This alone does not index the attachment content – it just makes it available for future indexing and ensures you don’t have to read the attachment contents many times.
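
As a rough illustration of what the first step leaves behind, the stored text can be inspected from PHP. This is only a sketch, assuming it runs inside WordPress: the rlv_example_* function names are made up for illustration, and the only detail taken from the description above is the custom field name. get_post_meta() is the standard WordPress function for reading custom fields.

```php
<?php
// Sketch only: the rlv_example_* names are hypothetical, not part of Relevanssi.

/**
 * Decides whether a stored custom field value contains usable text.
 */
function rlv_example_has_content( $stored ) {
    return is_string( $stored ) && '' !== trim( $stored );
}

/**
 * Returns the stored attachment text, or null if nothing has been read yet.
 * Requires WordPress, because it calls get_post_meta().
 */
function rlv_example_get_attachment_content( $attachment_id ) {
    // With the third argument set to true, a missing field comes back as ''.
    $stored = get_post_meta( $attachment_id, '_relevanssi_pdf_content', true );
    return rlv_example_has_content( $stored ) ? $stored : null;
}
```

A null result from rlv_example_get_attachment_content() would suggest that the attachment content has not been read yet, so there is nothing for the second step to index.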

The second step is the actual indexing. Here Relevanssi offers two different methods. You can choose to index the attachment post type, in which case the search results will include the attachment posts. The other method is to index the attachment content for the parent post of the attachment, in which case the search results will include the post the file is attached to.

The attachment content reading is not done on your own server, which ensures that even sites on shared hosting can reliably read even larger files. The files are sent to Relevanssiservices.com, which is a Digital Ocean Droplet hosted either in the USA or EU (you can choose from the settings which server you want to use). There the files are processed with Tika.

While we really don’t care what’s inside the files you index on our server, the server needs to make working copies of the files. The copies are removed after use. It is possible that someone with access to the server could see your files. If your files are really sensitive and confidential, it is best not to index them with our service.

How to index a single file? How does Relevanssi see a PDF file? What about errors?

Go to the attachment edit page. You can get to the edit page from the Media library: click an attachment, then click “Edit more details”.

Attachment details
Click “Edit more details” to see the Relevanssi PDF controls for individual attachments.

That will take you to the attachment edit page. There you will find the Relevanssi attachment controls. To read the file contents, click the “Index attachment contents” button. If everything goes well, the page will reload and an “Attachment Content” text box will appear, showing you the file content as seen by the Relevanssi extractor. If there’s a problem, an “Attachment Error” box will appear with the error message.

Relevanssi PDF controls
Relevanssi PDF controls for a PDF file that has been successfully read.

Relevanssi PDF error
A typical error message: a PDF file that doesn’t have any text, just images, will result in this error.

Indexing files in bulk

Attachments tab
Relevanssi attachments tab for reading the PDF contents in bulk.

To read in all the attachment files on your site, go to the Attachments tab on the Relevanssi settings page. There you will find the tools for reading all the files at once. If you have lots of files, the process will take a while, but it requires just a single click of a button and some patience.

For more information about the Attachments tab, see Installing Relevanssi and adjusting the settings in Relevanssi User Manual.

Searching the attachment content

Once the attachment content is read in and you’ve indexed the content – either by indexing the attachment post type or for the parent post – searches will automatically target the attachment content.

You can use phrases for PDF content as well. Wrap the search query in quotes, like this: "search phrase". Only posts and attachment files containing that exact phrase will be returned.

Custom field excerpts
Enable this option to create excerpts from custom field content, including the PDF file contents which are stored in a custom field.

To see attachment contents in Relevanssi-generated excerpts, you can check the “Use custom fields for excerpts” option (because attachment content is stored in custom fields). Do note that generating excerpts is the slowest part of searching to start with, and if your PDFs have lots of text, enabling this option may make the search slower. You can try and see how it works; especially with word-based excerpt lengths, the process may not be too slow.

Highlighted PDF content
And here we go, with highlighted PDF content found in searches by Relevanssi!

I’m still not seeing attachment files in search results

Make sure your search results are not restricted by post type. The attachment posts have the post type attachment. If your search is being restricted to the post post type, you won’t see any files in the results. The restriction usually comes from your theme. It may be a theme setting, or a hidden input field in the search form.

If you want to make sure attachments are included, you can add this function to your theme’s functions.php:

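// Override the post type restriction for Relevanssi searches.
add_filter('relevanssi_modify_wp_query', 'rlv_force_post_product');
function rlv_force_post_product($query) {
    // Comma-separated list of the post types to search.
    $query->query_vars['post_types'] = "post,page,attachment";
    return $query;
}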

Adjust the list of post types to suit your needs.
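
If you would rather keep whatever post types are already set and only add attachments to them, a variation along these lines might work. This is a sketch under an assumption: it expects the restriction to arrive in the same post_types query variable that the snippet above sets, which may not be true for every theme. The rlv_example_* names are made up for illustration.

```php
<?php
// Sketch only: the rlv_example_* names are hypothetical, not part of Relevanssi.

/**
 * Returns a comma-separated post type list that includes 'attachment'.
 * Accepts either an array or a comma-separated string.
 */
function rlv_example_with_attachment( $current ) {
    $types = is_array( $current ) ? $current : explode( ',', (string) $current );
    // Trim whitespace and drop empty entries.
    $types = array_filter( array_map( 'trim', $types ), 'strlen' );
    if ( ! in_array( 'attachment', $types, true ) ) {
        $types[] = 'attachment';
    }
    return implode( ',', $types );
}

/**
 * Assumption: the restriction is in the 'post_types' query variable.
 */
function rlv_example_keep_types_add_attachment( $query ) {
    $current = isset( $query->query_vars['post_types'] ) ? $query->query_vars['post_types'] : '';
    // Leave unrestricted searches alone, so they are not narrowed by accident.
    if ( ! empty( $current ) && 'any' !== $current ) {
        $query->query_vars['post_types'] = rlv_example_with_attachment( $current );
    }
    return $query;
}

// Register the filter only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'relevanssi_modify_wp_query', 'rlv_example_keep_types_add_attachment' );
}
```

The list handling is kept in its own function so it can be tried out without WordPress. Use either this variation or the snippet above, not both.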

43 comments Indexing and searching PDFs in WordPress

  1. We’re implementing Relevanssi on our development site which is password protected. When we index our PDFs (relative links) we get the following error message: “PDF Processor error: Not a valid URL.”

    We can’t figure out why we’re getting this error message because the paths are indeed valid. Is it because our dev site is password protected? If so, how can we work around that?

    1. Gwinn, that error message appears when PHP can’t validate your URL as a valid URL. Generally a protected PDF gives an “Empty PDF file. Is the file publicly available?” error. However, since the site is password protected, that’s still your problem right there.

      On the Attachments tab, check the “Upload PDF files” option, and that should fix it.

  2. Hi!
    Pre sale question.
    I have a bunch of PDF files which I need to be searchable. However, I don’t want the results to come up in a normal WP search, only on the page containing the PDFs. Is this possible? Maybe some kind of backend filtering > search only PDFs on this page, or so?

    1. John, Relevanssi doesn’t care where the actual files are hosted, but they must appear as attachment posts in the WordPress Media library, otherwise Relevanssi can’t access them.

        1. John, I have no idea. It’d have to go the other way around, I think: you’d have to upload the PDF to Media library, then have it hosted somewhere else. It works like that with Amazon S3, but I’m afraid it might not be as easy with Google Drive.

  3. I am attempting to index PDFs on my site (using Relevanssi premium) and every time it attempts to process the contents of a PDF file it instead returns…

    PDF Processor error: Key XXXXXXX is not valid.

    If I attempt to mass index them I get the same error for each PDF except it also returns the attachment ID.

    1. Jason, first of all, you should never post your API key in public: it’ll allow anyone to use your license.

      Second, you’re seeing the message because, well, that’s how it is: your key is not valid. Your license expired in January, and you haven’t renewed it. In order to have access to the PDF content reading, you need to have a valid license. You can renew your license here.

      1. I had no idea that was the API key! It was just a cryptic error message as far as I knew.
        I also didn’t know that the license was up. I will have our department renew the license. Will that solve the PDF parsing issue though or is it a problem with the site configuration?

  4. Hello Mikko,
    a question before buying.

    I have many PDF spare parts catalogs. If the catalog is indexed, the search term e.g. is a manufacturer number and this was found in the document – is it possible to make this manufacturer number then “clickable”. (Add to Cart)

    Best regards
    Mario

    1. Mario, that is something that’s up to your theme. Relevanssi doesn’t control what happens on your search results page, that’s the responsibility of your theme.

    1. Scott, you can have it both ways, depending on how Relevanssi is set up. The default is to link to the attachment page, but it’s a single checkbox to change it.

  5. Hi Mikko,
    I am indexing the PDFs in bulk, but after 4 1/2 hours, the progress bar area has been stuck on “Time remaining: less than a minute” for almost an hour. This is the second time I have tried. Last time it stayed like that for a few hours, and I refreshed the screen only to see the state of the index with none of the PDFs read in. I know you said the bulk feature requires patience, but is this common? There are about 500 PDFs.
    Thanks in advance!

    1. No, it’s not common. The indexing processes should always respond in a couple of minutes. If nothing happens in five minutes, something is definitely wrong. If you’re having problems and you can use the EU server, I’d recommend trying that, as it seems more reliable than the US server. I’ve rebooted the US server now and it seems to work again, but for some reason it’s more often down than the EU server, even though the two should be identical in all regards except the location.

      1. Thanks for getting back to me Mikko. I connected through VPN in Europe and it worked like a charm. Thanks man.

  6. I’m running Relevanssi premium on a multisite and getting the following error when I try to index pdf files:
    PDF Processor error: Key 0 is not valid.
    Assuming from the above thread that I don’t have the api key set.
    When I go to the Overview tab in Settings, there is no box to enter the api key. Where can I enter the key?
    Please help

    1. L Dixon, are you on multisite? If so, the API key is entered in the Network settings (and don’t worry about it: it looks like it’s not saved because of a bug, but it is).

  7. Is it possible to index PDF in a subfolder of the uploads folder? i.e wp-content/uploads/delightful-downloads?

    I’ve tried amending the post types to add a new type, but anything in the above folder is not indexed.

    Many thanks

    1. Steven, that would depend on how they are structured in the WP database. If they are attachment posts that appear in the wp_posts database, Relevanssi can index the PDFs, no matter where the files actually are. But if there’s no matching attachment post in the database, Relevanssi doesn’t even know the file exists.

  8. I have added a number of pdf files into my Media Library, and added a title and description for each. They are now accessible via URLs like http://www.amphibianark.org/wp-content/uploads/2018/07/A-process-for-assessing-and-prioritizing-species-conservation-needs-going-beyond-the-red-list.pdf I have included attachments in the indexes and have built the indexes.

    However, when I search for anything contained within the PDFs, it is not found. Only strings within the title, filename or description are appearing in my results, but not actual content from the PDFs.

    The file types are showing as “application/pdf”, and the attachment error message is “PDF Processor error: Empty attachment file. Is the file publicly available?”.

    Can you offer any suggestions on how I can resolve this please?

  9. Hello

    I’m using the Relevanssi FREE to include the information filled in from the meta fields in the custom search results (custom type posts).

    The problem is that I would like to include the name of the attachments as results to be found in this search, but from what I could see the attachment field for attachments stores only the attachment ID, so that I can not find it by name in the search results.

    Can you give me some way to include the name of the attachment in the results of this custom search, without having to fill separately / manually in a text field the part. (the end user would not fill this)

    The premium version extensions could help?

    1. Ok, in that case you need a relevanssi_content_to_index filter that for each parent post goes through the attachments attached to that post and adds the names of the attachments to the content of the parent post.

      1. Okay, it’s great to know that it’s possible.
        But I don’t know how to customize this function.
        If you can develop this function for me, email me the costs and how we could arrange it.

        thank you so much

          1. Hello Mikko

            Updating on the function: I made an adaptation in the code to work with the meta (File Advanced) attachments of the famous metabox.io plugin. It has been running perfectly, and much lighter and faster than indexing the attachments in standard wordpress mode.

            I sent the function to the official plugin library and it was already approved and praised by some users in the group: https://github.com/wpmetabox/library/pull/3

            Thank you again.

  10. Hi Mikko,

    First off, love the plugin.

    I am indexing a bunch of attachment files and for some reason, some of them (about 10%) are giving me this error: “Attachment ID XX: Empty attachment file. Is the file publicly available?”

    The files, as far as I can tell are available publicly (they show up on the website) and they have text content (ie not images as a pdf). When I go to the individual file and click “Index Attachment Content” it works, but I get the error when reading / indexing the attachments in bulk.

    Any ideas?

    Thanks!

  11. Hi Mikko,

    I am attempting to index PDFs on my site (using Relevanssi premium) and every time it attempts to process the contents of a PDF file it instead returns with the same attachment error message:
    cURL error 28: Operation timed out after 45001 milliseconds with 0 bytes received

    What to do?!

    And: Do I have to index each PDF in my media library one after the other, or is it possible to do them all together?!

    Thank you, Thomas

    1. Thomas, if possible, use the EU server. For some reason it works more reliably than the US server, even though both run exactly the same code. This problem is caused when the indexing server is down. I’ve rebooted the US server now and it should work, but the EU server is more reliable.
