PDF indexing

Relevanssi has a PDF content indexing feature in the works. It is currently in beta testing, and for now it is offered as a separate plugin. Once the beta testing is over, the feature will be merged into Relevanssi Premium core and will be available to all Relevanssi Premium users.

Download the plugin file here.

How does it work

The PDF indexing feature uses an external server to extract the text from PDFs. When a PDF is indexed, the URL of the file is sent to the Relevanssi PDF indexer at https://www.relevanssiservices.com/, which extracts the text and returns it to WordPress. There the text is stored in the attachment metadata, in the field _relevanssi_pdf_content.

Relevanssi then reads the text from that field when indexing. There are two ways to index attachments: on their own, or as part of the posts they are attached to. In the first case, searching for the PDF content returns the attachment page in the results; in the second case, the parent post is returned.
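The difference between the two modes can be illustrated with a small model. This is only a sketch in Python, not the plugin's actual code: the Post class, build_index and search are hypothetical names made up for the illustration, and the only identifier taken from the text above is the _relevanssi_pdf_content field. It also assumes every attachment has a parent post.

```python
from dataclasses import dataclass, field
from typing import Optional

# Meta field named in the text above; everything else here is hypothetical.
PDF_CONTENT_FIELD = "_relevanssi_pdf_content"


@dataclass
class Post:
    post_id: int
    content: str = ""
    parent_id: Optional[int] = None  # set only for attachments in this model
    meta: dict = field(default_factory=dict)


def build_index(posts, index_with_parent):
    """Return {post_id: searchable text} for one of the two indexing modes."""
    by_id = {p.post_id: p for p in posts}
    # Regular posts are always indexed with their own content.
    index = {p.post_id: p.content for p in posts if p.parent_id is None}
    for p in posts:
        pdf_text = p.meta.get(PDF_CONTENT_FIELD)
        if p.parent_id is None or not pdf_text:
            continue  # not an attachment, or no extracted PDF text
        if index_with_parent and p.parent_id in by_id:
            # Mode 2: the PDF text becomes part of the parent post.
            merged = index.get(p.parent_id, "") + " " + pdf_text
            index[p.parent_id] = merged.strip()
        else:
            # Mode 1: the attachment is indexed on its own.
            index[p.post_id] = pdf_text
    return index


def search(index, term):
    """Return the sorted ids of posts whose indexed text contains the term."""
    needle = term.lower()
    return sorted(pid for pid, text in index.items() if needle in text.lower())


posts = [
    Post(1, content="Annual report landing page"),
    Post(2, parent_id=1, meta={PDF_CONTENT_FIELD: "quarterly revenue figures"}),
]
print(search(build_index(posts, index_with_parent=False), "revenue"))
print(search(build_index(posts, index_with_parent=True), "revenue"))
```

In this model, a search for "revenue" finds post 2 (the attachment) when attachments are indexed on their own, and post 1 (the parent) when they are indexed with their parent.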

There’s an alternative method where the PDF files are uploaded to the processing server. That method can be enabled from the Settings > Relevanssi PDF page.

Usage

Individual attachments can be indexed from the attachment edit page. There you can also see the PDF content as Relevanssi sees it, along with any error message from the PDF indexer (the error message is stored in the _relevanssi_pdf_error custom field).
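How the two custom fields relate can be sketched as follows. This is an illustrative Python model, not the plugin's implementation: store_extraction_result and the extract_text callable are hypothetical, and the assumption that a successful run clears the error field (and a failed run clears the content field) is mine, not something the plugin documents.

```python
# Field names taken from the text above.
PDF_CONTENT_FIELD = "_relevanssi_pdf_content"
PDF_ERROR_FIELD = "_relevanssi_pdf_error"


def store_extraction_result(meta, extract_text, pdf_url):
    """Call a (hypothetical) extractor and record its outcome in meta.

    meta         -- dict standing in for the attachment's custom fields
    extract_text -- callable taking a URL and returning text, or raising
    Returns True if text was stored, False if an error was stored.
    """
    try:
        text = extract_text(pdf_url)
    except Exception as exc:
        # Assumption: on failure only the error message is kept.
        meta[PDF_ERROR_FIELD] = str(exc)
        meta.pop(PDF_CONTENT_FIELD, None)
        return False
    # Assumption: on success any earlier error message is cleared.
    meta[PDF_CONTENT_FIELD] = text
    meta.pop(PDF_ERROR_FIELD, None)
    return True
```

Passing the extractor in as a callable keeps the sketch independent of how the text is actually fetched from the indexing server.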

There’s a bulk indexer in Settings > Relevanssi PDF. There you can also choose whether attachments are indexed with their parent page (in either case the attachment metadata will contain the attachment content).

With the current version of Relevanssi Premium, searching for PDF content won’t work if the “Custom fields to index” setting is empty. This will be fixed in the next version. In the meantime, you can add something to the field, for example “_relevanssi_pdf_content”. (This is also added automatically, but the automatic addition doesn’t work if the field is empty.)

You can also use WP-CLI to index all the PDFs. The command is wp relevanssi_pdf index_pdfs.

Terms & Conditions

The PDFs you index will be read and handled by a server hosted by DigitalOcean in the USA. The server has to store a working copy of the file while processing it, but won’t keep a copy of the file after use.

This feature is only available to users with an active, valid Relevanssi Premium license. Once the license expires, the service becomes unavailable. The indexed PDF content will remain in your site even if the license expires.
