
Indexing embedded PDFs for the parent post

Relevanssi can automatically index PDF content for the parent post if the PDF (or other attachment) is attached to the parent post in WordPress. However, that’s not always the case. Sometimes the PDF is added to the page with a PDF embedder plugin, which doesn’t create a connection between the posts in WordPress. Relevanssi then doesn’t know the PDF is embedded in the post and cannot index the PDF contents for the parent post.

Most PDF embedder plugins use a shortcode to place the PDF viewer on a page. To get Relevanssi to index the embedded PDF contents for the parent post, you need to establish a connection between the PDF and the post based on the URL in the shortcode.

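The same code works with different PDF embedders; you only have to adjust the regex to match the shortcode used by the plugin. The code looks up the PDF text that Relevanssi stores in the _relevanssi_pdf_content custom field of the attachment, so the PDF must be in the Media Library and Relevanssi must already have read its contents.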

PDF.js Viewer Shortcode

PDF.js Viewer Shortcode uses the pdfjs-viewer shortcode, with the URL of the file in the url parameter.

add_filter( 'relevanssi_content_to_index', 'rlv_pdfjs_content', 10, 2 );
function rlv_pdfjs_content( $content, $post ) {
    // Find the URL of every PDF embedded with the [pdfjs-viewer] shortcode.
    $m = preg_match_all( '/\[pdfjs-viewer url=["\'](.*?)["\']/', $post->post_content, $matches );
    if ( $m ) {
        global $wpdb;
        $upload_dir = wp_upload_dir();
        foreach ( $matches[1] as $pdf ) {
            // Convert the URL to a path relative to the uploads folder, as stored in _wp_attached_file.
            $pdf_url     = ltrim( str_replace( $upload_dir['baseurl'], '', urldecode( $pdf ) ), '/' );
            // Fetch the PDF text Relevanssi has stored for the matching attachment.
            $pdf_content = $wpdb->get_var( $wpdb->prepare( "SELECT meta_value FROM $wpdb->postmeta WHERE meta_key = '_relevanssi_pdf_content' AND post_id IN ( SELECT post_id FROM $wpdb->postmeta WHERE meta_key = '_wp_attached_file' AND meta_value = %s )", $pdf_url ) );
            // The space keeps the last word of the post and the first word of the PDF apart.
            $content    .= ' ' . $pdf_content;
        }
    }
    return $content;
}

PDF Embedder

PDF Embedder uses the same method, so the only change is the name of the shortcode:

add_filter( 'relevanssi_content_to_index', 'rlv_pdfembedder_content', 10, 2 );
function rlv_pdfembedder_content( $content, $post ) {
    // Find the URL of every PDF embedded with the [pdf-embedder] shortcode.
    $m = preg_match_all( '/\[pdf-embedder url=["\'](.*?)["\']/', $post->post_content, $matches );
    if ( $m ) {
        global $wpdb;
        $upload_dir = wp_upload_dir();
        foreach ( $matches[1] as $pdf ) {
            // Convert the URL to a path relative to the uploads folder, as stored in _wp_attached_file.
            $pdf_url     = ltrim( str_replace( $upload_dir['baseurl'], '', urldecode( $pdf ) ), '/' );
            // Fetch the PDF text Relevanssi has stored for the matching attachment.
            $pdf_content = $wpdb->get_var( $wpdb->prepare( "SELECT meta_value FROM $wpdb->postmeta WHERE meta_key = '_relevanssi_pdf_content' AND post_id IN ( SELECT post_id FROM $wpdb->postmeta WHERE meta_key = '_wp_attached_file' AND meta_value = %s )", $pdf_url ) );
            // The space keeps the last word of the post and the first word of the PDF apart.
            $content    .= ' ' . $pdf_content;
        }
    }
    return $content;
}

Wonderplugin PDF Embed

Wonderplugin PDF Embed uses a similar method; the URL of the attachment is in the attribute src.

add_filter( 'relevanssi_content_to_index', 'rlv_wonderpdf_content', 10, 2 );
function rlv_wonderpdf_content( $content, $post ) {
    // Find the URL of every PDF embedded with the [wonderplugin_pdf] shortcode.
    $m = preg_match_all( '/\[wonderplugin_pdf src=["\'](.*?)["\']/', $post->post_content, $matches );
    if ( $m ) {
        global $wpdb;
        $upload_dir = wp_upload_dir();
        foreach ( $matches[1] as $pdf ) {
            // Convert the URL to a path relative to the uploads folder, as stored in _wp_attached_file.
            $pdf_url     = ltrim( str_replace( $upload_dir['baseurl'], '', urldecode( $pdf ) ), '/' );
            // Fetch the PDF text Relevanssi has stored for the matching attachment.
            $pdf_content = $wpdb->get_var( $wpdb->prepare( "SELECT meta_value FROM $wpdb->postmeta WHERE meta_key = '_relevanssi_pdf_content' AND post_id IN ( SELECT post_id FROM $wpdb->postmeta WHERE meta_key = '_wp_attached_file' AND meta_value = %s )", $pdf_url ) );
            // The space keeps the last word of the post and the first word of the PDF apart.
            $content    .= ' ' . $pdf_content;
        }
    }
    return $content;
}
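
The three URL-based snippets above differ only in the shortcode name and the attribute that holds the URL. As a rough sketch of what adjusting the regex amounts to, the matching and the URL-to-path conversion could be pulled into one helper. The function name rlv_embedded_pdf_paths() and its parameters are made up for this example and are not part of Relevanssi; unlike the snippets above, the pattern here also accepts the attribute when it is not the first one in the shortcode.

```php
<?php
// Omit the opening PHP tag when adding this to functions.php.

/**
 * Hypothetical helper, not part of Relevanssi.
 *
 * Finds the PDF URLs in the given shortcode and returns them as paths
 * relative to the uploads folder, the format of _wp_attached_file.
 *
 * @param string $post_content   The post content to search.
 * @param string $shortcode      The shortcode tag, e.g. 'pdf-embedder'.
 * @param string $attribute      The attribute holding the URL, e.g. 'url'.
 * @param string $upload_baseurl The 'baseurl' value from wp_upload_dir().
 *
 * @return array The relative paths, or an empty array if nothing matched.
 */
function rlv_embedded_pdf_paths( $post_content, $shortcode, $attribute, $upload_baseurl ) {
    // Match the shortcode tag, then the attribute anywhere before the closing bracket.
    $pattern = '/\[' . preg_quote( $shortcode, '/' ) . '\b[^\]]*?\b'
        . preg_quote( $attribute, '/' ) . '=["\'](.*?)["\']/';

    if ( ! preg_match_all( $pattern, $post_content, $matches ) ) {
        return array();
    }

    $paths = array();
    foreach ( $matches[1] as $url ) {
        // Same conversion as in the snippets above.
        $paths[] = ltrim( str_replace( $upload_baseurl, '', urldecode( $url ) ), '/' );
    }
    return $paths;
}
```

Inside a filter function, a call like rlv_embedded_pdf_paths( $post->post_content, 'wonderplugin_pdf', 'src', $upload_dir['baseurl'] ) would then stand in for the preg_match_all() call and the ltrim() line, and each returned path can go to the same database query.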

3D Flipbook

3D Flipbook takes the flipbook post ID as the shortcode parameter. The attachment post ID is stored in the 3dfb_data post meta of the flipbook post:

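add_filter( 'relevanssi_content_to_index', 'rlv_3dflipbook_content', 10, 2 );
function rlv_3dflipbook_content( $content, $post ) {
    // Find the flipbook post ID in every [3d-flip-book] shortcode.
    $m = preg_match_all( '/\[3d-flip-book.*?id=["\'](.*?)["\']/', $post->post_content, $matches );
    if ( $m ) {
        global $wpdb;
        foreach ( $matches[1] as $flipbook_id ) {
            // The flipbook post meta holds the post ID of the PDF attachment.
            $data = get_post_meta( $flipbook_id, '3dfb_data', true );
            if ( empty( $data['post_ID'] ) ) {
                continue; // No attachment is recorded for this flipbook.
            }
            // Fetch the PDF text Relevanssi has stored for the attachment.
            $pdf_content = $wpdb->get_var(
                $wpdb->prepare(
                    "SELECT meta_value FROM $wpdb->postmeta WHERE meta_key = '_relevanssi_pdf_content' AND post_id = %d",
                    $data['post_ID']
                )
            );
            // The space keeps the last word of the post and the first word of the PDF apart.
            $content    .= ' ' . $pdf_content;
        }
    }
    return $content;
}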

Excerpts

To get excerpts from the PDF content, you can use the same function with the relevanssi_excerpt_content filter hook, like this:

add_filter( 'relevanssi_excerpt_content', 'rlv_pdfjs_content', 10, 2 );

This includes the PDF content in excerpt-building. There’s a performance cost, so test whether including the content slows down the search too much.

One option is to read the PDF content to a custom field in the relevanssi_content_to_index hook and then use the data in the custom field in excerpt-building, which may be faster.
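
A minimal sketch of that custom field approach could look like the following. It assumes the rlv_pdfjs_content() function from earlier in this post is defined; the meta key _rlv_embedded_pdf_text and the two function names are made up for this example. Whether this is actually faster on a given site is something to measure, not a given.

```php
<?php
// Sketch only. '_rlv_embedded_pdf_text' is a hypothetical meta key, and
// rlv_pdfjs_content() is the function defined earlier in this post.
// Omit the opening PHP tag when adding this to functions.php.

// Indexing: collect the PDF text once and store it in a custom field.
function rlv_cache_pdf_content( $content, $post ) {
    // Called with an empty string, the function returns only the PDF text.
    $pdf_text = rlv_pdfjs_content( '', $post );
    // update_post_meta() unslashes its input, so wp_slash() keeps backslashes intact.
    update_post_meta( $post->ID, '_rlv_embedded_pdf_text', wp_slash( $pdf_text ) );
    return $content . ' ' . $pdf_text;
}

// Excerpts: read the stored text instead of repeating the database lookups.
function rlv_cached_pdf_excerpt( $content, $post ) {
    $pdf_text = get_post_meta( $post->ID, '_rlv_embedded_pdf_text', true );
    return $pdf_text ? $content . ' ' . $pdf_text : $content;
}

// The guard only lets this file load outside WordPress, for example in a test.
if ( function_exists( 'add_filter' ) ) {
    // Register rlv_cache_pdf_content in place of the direct rlv_pdfjs_content hook.
    add_filter( 'relevanssi_content_to_index', 'rlv_cache_pdf_content', 10, 2 );
    add_filter( 'relevanssi_excerpt_content', 'rlv_cached_pdf_excerpt', 10, 2 );
}
```

With this in place, the add_filter() calls that hook rlv_pdfjs_content directly should be removed, otherwise the PDF text is added twice. The custom field is only filled when a post is indexed, so existing posts need a reindex before their excerpts can show PDF content.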

5 comments on “Indexing embedded PDFs for the parent post”

  1. Good morning, I’m trying to index embedded PDFs as in your tutorial “Indexing embedded PDFs for the parent post”, but I’m having some problems. I’m using a Relevanssi permanent access subscription and my PDF embedder is 3D FlipBook.

  2. Hi there – great plugin. I’m trying to get this to work with PDF Embedder Premium, any chance you could provide the code for this as well?

    I have the site PW protected during development. Thanks in advance!

    1. Dave, at least the PDF Embedder free version is straightforward. I’ve added that to the post. If the Premium version does something different, I’d need to know what; I don’t have any access to the Premium version.
