Indexing embedded PDFs for the parent post

Relevanssi can automatically index PDF content for the parent post if the PDF (or other attachment) is attached to the parent post in WordPress. However, that’s not always the case. Sometimes the PDF is added to the page using an embed, and that doesn’t create a connection between the posts in WordPress. Thus, Relevanssi doesn’t know the PDF is embedded in the post and cannot index the PDF contents for the parent post.

One such case is the PDF.js Viewer Shortcode plugin. It uses a shortcode to embed a PDF viewer on the page, but creates no connection between the posts.

It’s still possible to index the PDF contents for the parent post; it just takes some hacking. Add the function below to your theme’s functions.php. When a post is indexed, the function finds the pdfjs-viewer shortcodes in the post, grabs the PDF URLs from the shortcodes, finds the attachment posts based on the URLs and adds their PDF content to the content indexed for the parent post.

The same code works with other PDF embedders; you only have to adjust the regex to match the shortcode used by the plugin.
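As a rough sketch of what adjusting the regex means, the pattern can be built from the shortcode tag and the name of the attribute that holds the URL. The helper name rlv_extract_shortcode_urls() is made up for this example and is not part of Relevanssi, and the sketch assumes the attribute value is wrapped in double quotes.

```php
/**
 * Hypothetical helper (not part of Relevanssi): returns the values of one
 * attribute from every occurrence of a shortcode in the given content.
 * Assumes the attribute value is wrapped in double quotes.
 */
function rlv_extract_shortcode_urls( $post_content, $tag, $attribute ) {
    // [tag, whitespace, optionally other attributes, then attribute="value".
    $pattern = '/\[' . preg_quote( $tag, '/' ) . '\s(?:[^\]]*?\s)?'
        . preg_quote( $attribute, '/' ) . '="([^"]*)"/';
    if ( ! preg_match_all( $pattern, $post_content, $matches ) ) {
        return array(); // No matching shortcodes in this content.
    }
    return $matches[1]; // The captured attribute values.
}
```

With a helper like this, switching to another plugin only means changing the arguments, for example rlv_extract_shortcode_urls( $post->post_content, 'wonderplugin_pdf', 'src' ). Unlike the fixed patterns used in this post, this pattern does not require the URL to be the first attribute of the shortcode.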

PDF.js Viewer Shortcode

add_filter( 'relevanssi_content_to_index', 'rlv_pdfjs_content', 10, 2 );
function rlv_pdfjs_content( $content, $post ) {
    // Capture the url attribute of every pdfjs-viewer shortcode in the post.
    $m = preg_match_all( '/\[pdfjs-viewer url="(.*?)"/', $post->post_content, $matches );
    if ( $m ) {
        global $wpdb;
        $upload_dir = wp_upload_dir();
        foreach ( $matches[1] as $pdf ) {
            // Strip the uploads base URL to get the path stored in _wp_attached_file.
            $pdf_url     = ltrim( str_replace( $upload_dir['baseurl'], '', urldecode( $pdf ) ), '/' );
            // Fetch the PDF content Relevanssi has stored for the matching attachment.
            $pdf_content = $wpdb->get_var( $wpdb->prepare( "SELECT meta_value FROM $wpdb->postmeta WHERE meta_key = '_relevanssi_pdf_content' AND post_id IN ( SELECT post_id FROM $wpdb->postmeta WHERE meta_key = '_wp_attached_file' AND meta_value = %s )", $pdf_url ) );
            // Separate with a space so that words don't run together.
            $content    .= ' ' . $pdf_content;
        }
    }
    return $content;
}
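The lookup depends on the URL-to-path step: WordPress stores the attachment file path relative to the uploads directory in the _wp_attached_file custom field, so the uploads base URL has to be stripped from the embed URL before the comparison. Here is that step in isolation, as a small sketch. The function name rlv_url_to_attached_file() and the example.com URLs are assumptions made for illustration.

```php
/**
 * Hypothetical helper (not part of Relevanssi): converts an embedded PDF URL
 * to the uploads-relative path that is compared against _wp_attached_file.
 */
function rlv_url_to_attached_file( $pdf_url, $uploads_baseurl ) {
    // Decode first so that e.g. %20 in the shortcode URL becomes a space.
    $relative = str_replace( $uploads_baseurl, '', urldecode( $pdf_url ) );
    // Drop the leading slash left over after removing the base URL.
    return ltrim( $relative, '/' );
}

// Example with made-up URLs.
$rlv_example_path = rlv_url_to_attached_file(
    'https://example.com/wp-content/uploads/2020/01/annual%20report.pdf',
    'https://example.com/wp-content/uploads'
);
```

If the URL in the shortcode doesn't begin with the same base URL that wp_upload_dir() returns (for instance http instead of https), nothing is stripped and the lookup finds no attachment, so that is worth checking when the PDF content is missing from the index.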

Wonderplugin PDF Embed

Wonderplugin PDF Embed works in a similar way; the URL of the attachment is in the src attribute.

add_filter( 'relevanssi_content_to_index', 'rlv_wonderpdf_content', 10, 2 );
function rlv_wonderpdf_content( $content, $post ) {
    // Capture the src attribute of every wonderplugin_pdf shortcode in the post.
    $m = preg_match_all( '/\[wonderplugin_pdf src="(.*?)"/', $post->post_content, $matches );
    if ( $m ) {
        global $wpdb;
        $upload_dir = wp_upload_dir();
        foreach ( $matches[1] as $pdf ) {
            // Strip the uploads base URL to get the path stored in _wp_attached_file.
            $pdf_url     = ltrim( str_replace( $upload_dir['baseurl'], '', urldecode( $pdf ) ), '/' );
            // Fetch the PDF content Relevanssi has stored for the matching attachment.
            $pdf_content = $wpdb->get_var( $wpdb->prepare( "SELECT meta_value FROM $wpdb->postmeta WHERE meta_key = '_relevanssi_pdf_content' AND post_id IN ( SELECT post_id FROM $wpdb->postmeta WHERE meta_key = '_wp_attached_file' AND meta_value = %s )", $pdf_url ) );
            // Separate with a space so that words don't run together.
            $content    .= ' ' . $pdf_content;
        }
    }
    return $content;
}

3D Flipbook

3D Flipbook has the flipbook post ID as the shortcode parameter, and the attachment post ID can be found in the post meta for the flipbook post:

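add_filter( 'relevanssi_content_to_index', 'rlv_3dflipbook_content', 10, 2 );
function rlv_3dflipbook_content( $content, $post ) {
    // Capture the flipbook post ID from every 3d-flip-book shortcode in the post.
    $m = preg_match_all( '/\[3d-flip-book.*?id="(.*?)"/', $post->post_content, $matches );
    if ( $m ) {
        global $wpdb;
        foreach ( $matches[1] as $flipbook_id ) {
            // The flipbook post meta holds the ID of the PDF attachment post.
            $data = get_post_meta( $flipbook_id, '3dfb_data', true );
            if ( ! is_array( $data ) || empty( $data['post_ID'] ) ) {
                continue; // No attachment post is linked to this flipbook.
            }
            $pdf_content = $wpdb->get_var(
                $wpdb->prepare(
                    "SELECT meta_value FROM $wpdb->postmeta WHERE meta_key = '_relevanssi_pdf_content' AND post_id = %d",
                    $data['post_ID']
                )
            );
            // Separate with a space so that words don't run together.
            $content    .= ' ' . $pdf_content;
        }
    }
    return $content;
}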

Excerpts

To get excerpts from the PDF content, use the same function with the relevanssi_excerpt_content filter hook. The function takes two parameters, so the number of accepted arguments must be set to 2, like this:

add_filter( 'relevanssi_excerpt_content', 'rlv_pdfjs_content', 10, 2 );

This will include the PDF content in excerpt-building. It comes with a performance cost, because the shortcode parsing and the database lookup run for each post in the search results, so test whether it slows down the search too much on your site.

One option is to store the PDF content in a custom field in the relevanssi_content_to_index hook and then use the custom field data in excerpt-building; that may be faster.
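A minimal sketch of that custom field approach could look like the following. It assumes the rlv_pdfjs_content() function from the PDF.js Viewer Shortcode section is already in place. The custom field name and both function names are made up for this example, and the function_exists() check is there only so that the snippet can also be loaded outside WordPress.

```php
// Hypothetical custom field name, chosen for this example.
define( 'RLV_PDF_CACHE_FIELD', '_rlv_embedded_pdf_content' );

/**
 * Indexing time: stores the embedded PDF content in a custom field of the
 * parent post. Assumes rlv_pdfjs_content() from this post is defined.
 */
function rlv_cache_pdf_content( $content, $post ) {
    // Passing an empty string returns only the PDF content for this post.
    $pdf_content = rlv_pdfjs_content( '', $post );
    // wp_slash() because update_post_meta() unslashes the value it receives.
    update_post_meta( $post->ID, RLV_PDF_CACHE_FIELD, wp_slash( $pdf_content ) );
    return $content; // The content to index is left unchanged here.
}

/**
 * Excerpt-building time: appends the stored PDF content with one custom
 * field read instead of the shortcode parsing and database lookup.
 */
function rlv_cached_pdf_excerpt_content( $content, $post ) {
    $cached = get_post_meta( $post->ID, RLV_PDF_CACHE_FIELD, true );
    return $cached ? $content . ' ' . $cached : $content;
}

if ( function_exists( 'add_filter' ) ) {
    // Priority 11: runs after the indexing function registered at priority 10.
    add_filter( 'relevanssi_content_to_index', 'rlv_cache_pdf_content', 11, 2 );
    add_filter( 'relevanssi_excerpt_content', 'rlv_cached_pdf_excerpt_content', 10, 2 );
}
```

With this in place, rlv_pdfjs_content() would not be hooked to relevanssi_excerpt_content at all. The custom field is only filled in when a post is indexed, so posts need to be reindexed before their excerpts can include PDF content.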

