Boosting shorter posts with higher keyword density

By default, Relevanssi tends to prefer longer posts. The default TF × IDF weights Relevanssi uses simply count the term frequency, i.e. how many times a word appears in the post. That favours longer posts, as the search term usually appears in them more often. However, a 500-word post with 15 search term appearances might well be a better match for the search than a 2000-word post with 20 search term appearances, as the density is much higher in the shorter post (3% versus 1%).

One way to make Relevanssi give a boost to shorter posts is to add a consideration for the document length to the calculations. Adding this function, hooked to the relevanssi_match filter hook, to your site will include inverse document length in the weights:

add_filter( 'relevanssi_match', 'rlv_inverse_document_length', 10, 2 );
function rlv_inverse_document_length( $match, $idf ) {
    global $relevanssi_post_idl;
    if ( isset( $relevanssi_post_idl[ $match->doc ] ) ) {
        $idl = $relevanssi_post_idl[ $match->doc ];
    } else {
        $current_post_object = relevanssi_get_post( $match->doc );
        $minimum_doc_length  = 5000; // in characters
        if ( ! $current_post_object ) {
            $idl = 1;
        } else {
            $post_length = max( $minimum_doc_length, strlen( $current_post_object->post_content ) );
            $idl         = $minimum_doc_length / $post_length;
        }
        $relevanssi_post_idl[ $match->doc ] = $idl;
    }
    // Scaling the weight directly gives the same result as dividing by tf × idf
    // and multiplying back, without risking a division by zero.
    $match->weight *= $idl;
    return $match;
}

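The function determines the post length (in characters, not in words, because counting words is slower) and then calculates the ratio between the minimum post length chosen in the function and the current post length. Here the minimum is set to 5000 characters, which means that all posts are considered at least 5000 characters long: posts at or below that length get a multiplier of 1, and longer posts get a multiplier that goes down from 1 towards 0 as the post gets longer. A 10,000-character post, for example, gets a multiplier of 0.5. Note that strlen() counts bytes, so content with multibyte characters counts as somewhat longer than its character count.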
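To get a feel for the multiplier before adding the filter to a site, the ratio can be tried outside WordPress. This is only a minimal sketch: rlv_example_idl() is a hypothetical helper made up for this illustration, not a Relevanssi function, and it assumes the same 5000-character minimum as the function above.

```php
<?php
/**
 * Hypothetical helper for illustration only, not part of Relevanssi.
 * Returns the length multiplier for a post of $length characters.
 */
function rlv_example_idl( int $length, int $minimum_doc_length = 5000 ): float {
    // Posts shorter than the minimum are treated as being exactly that long.
    $effective_length = max( $minimum_doc_length, $length );
    return $minimum_doc_length / $effective_length;
}

// Print the multiplier for a few sample post lengths.
foreach ( array( 1200, 5000, 10000, 20000 ) as $length ) {
    printf( "%d characters: multiplier %.2f\n", $length, rlv_example_idl( $length ) );
}
```

The two shorter posts keep their full weight (multiplier 1), while the 10,000-character post is halved and the 20,000-character post is cut to a quarter. Raising the minimum makes the penalty start later; lowering it makes the penalty hit more posts.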

In effect, this gives a relative boost to shorter posts: posts under the minimum length keep their full weight, while very long posts that rank high just for being long are pushed down.

The version above only considers post content. It does not count the attachment content. This version includes that. Use it instead of the function above, not in addition to it, as both use the same function name:

add_filter( 'relevanssi_match', 'rlv_inverse_document_length', 10, 2 );
function rlv_inverse_document_length( $match, $idf ) {
    global $relevanssi_post_idl;
    if ( isset( $relevanssi_post_idl[ $match->doc ] ) ) {
        $idl = $relevanssi_post_idl[ $match->doc ];
    } else {
        $current_post_object = relevanssi_get_post( $match->doc );
        $minimum_doc_length  = 5000; // in characters
        if ( ! $current_post_object ) {
            $idl = 1;
        } else {
            $post_content = $current_post_object->post_content;
            $pdf_content  = get_post_meta( $current_post_object->ID, '_relevanssi_pdf_content', true );
            $content      = $post_content . ' ' . $pdf_content;
            $post_length  = max( $minimum_doc_length, strlen( $content ) );
            $idl          = $minimum_doc_length / $post_length;
        }
        $relevanssi_post_idl[ $match->doc ] = $idl;
    }
    // Scaling the weight directly gives the same result as dividing by tf × idf
    // and multiplying back, without risking a division by zero.
    $match->weight *= $idl;
    return $match;
}
