Relevanssi and languages

Relevanssi is language-agnostic in itself. It does not know any language and doesn’t care about which language the site uses.

However, there are a few things that you need to consider when using Relevanssi in languages other than English.

Characters: use UTF-8

As long as your site uses UTF-8 characters, Relevanssi can handle just about anything you throw at it – you can even search for emojis. UTF-8 is the standard in WordPress, and you generally don’t have to worry about it.

Words: bad news for Chinese and Japanese

While Relevanssi can read Chinese, Japanese and many other scripts without problems, these languages don’t separate words with spaces, and that is a problem for Relevanssi.

Relevanssi works by splitting the posts into words at spaces and then counting how many times those words appear. Since Chinese and Japanese texts don’t have spaces separating words, Relevanssi can’t do this.
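To make the problem concrete, here is a minimal sketch of whitespace tokenization. It is a toy illustration and not Relevanssi's actual tokenizer: the function name rlv_demo_count_words is made up for this example, and the code assumes the mbstring extension is available.

```php
<?php
/**
 * Toy illustration only, not Relevanssi's actual tokenizer:
 * split text at whitespace and count how often each token appears.
 *
 * @param string $text The text to tokenize.
 *
 * @return array Token => number of occurrences.
 */
function rlv_demo_count_words( $text ) {
    // Lowercase the text, then split it at runs of whitespace.
    $text   = mb_strtolower( trim( $text ), 'UTF-8' );
    $tokens = preg_split( '/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY );
    return array_count_values( $tokens );
}

// English: spaces separate the words, so the counts are meaningful.
$english = rlv_demo_count_words( 'the cat saw the other cat' );
// array( 'the' => 2, 'cat' => 2, 'saw' => 1, 'other' => 1 )

// Chinese: no spaces, so the whole sentence becomes a single token.
$chinese = rlv_demo_count_words( '我喜欢猫我喜欢狗' );
// array( '我喜欢猫我喜欢狗' => 1 )
```

With only one token per sentence, there are no per-word counts to base the weights on.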

As a result, Relevanssi can search for Chinese and Japanese characters or character sequences, especially if you enable one-character words and inside-word matching in Relevanssi settings. Still, since the weights for the posts are essentially random, the results won’t be of high quality.

Unfortunately, making the search work well in Chinese, Japanese and other languages with similar characteristics requires advanced linguistics and is far beyond our capabilities.

Update 25.11.2020: Matthew Wang has suggested using a Chinese language segmentation tool like phpjieba. If you have the jieba() function installed on your site, you can use it for tokenizing Chinese text like this:

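add_filter( 'relevanssi_remove_punctuation', 'rlv_use_jieba' );
function rlv_use_jieba( $string ) {
    // jieba() comes from the phpjieba extension; leave the text alone without it.
    if ( ! function_exists( 'jieba' ) ) {
        return $string;
    }
    $words = jieba( $string, 1, 1500 );
    // Join the segments with spaces so Relevanssi sees separate words.
    return is_array( $words ) ? implode( ' ', $words ) : $string;
}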

For Japanese, there’s Limelight.

Did you mean suggestions: limited to Latin characters

While Relevanssi can search Arabic, Russian or other non-Latin character sets, the “Did you mean” suggestions in Relevanssi Premium only support Latin characters by default.

These suggestions work by modifying the search term in different ways, adding or removing letters. Relevanssi makes these modifications with the Latin alphabet (mainly the English alphabet, with a few extra umlauts thrown in), which restricts the Premium “Did you mean” feature to text in the Latin alphabet.
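The following toy sketch shows why the alphabet matters. It is not Relevanssi Premium's actual code: the function name rlv_demo_candidates is made up, and the sketch only covers single-letter deletions and insertions. It assumes the mbstring extension is available.

```php
<?php
/**
 * Toy sketch, not Relevanssi Premium's actual code: build spelling
 * candidates by deleting one letter from the word or inserting one
 * letter from the alphabet.
 *
 * @param string $word     The search term.
 * @param string $alphabet The letters that may be inserted.
 *
 * @return array The candidate words.
 */
function rlv_demo_candidates( $word, $alphabet ) {
    // Split the alphabet into single UTF-8 characters.
    $letters    = preg_split( '//u', $alphabet, -1, PREG_SPLIT_NO_EMPTY );
    $length     = mb_strlen( $word, 'UTF-8' );
    $candidates = array();

    for ( $i = 0; $i <= $length; $i++ ) {
        $head = mb_substr( $word, 0, $i, 'UTF-8' );
        $tail = mb_substr( $word, $i, null, 'UTF-8' );

        // Deletion: drop the character at position $i.
        if ( $i < $length ) {
            $candidates[] = $head . mb_substr( $tail, 1, null, 'UTF-8' );
        }
        // Insertion: add each alphabet letter at position $i.
        foreach ( $letters as $letter ) {
            $candidates[] = $head . $letter . $tail;
        }
    }
    return array_values( array_unique( $candidates ) );
}

$latin = rlv_demo_candidates( 'serch', 'abcdefghijklmnopqrstuvwxyz' );
// Contains 'search': inserting "a" fixes the typo.

$russian = rlv_demo_candidates( 'поск', 'abcdefghijklmnopqrstuvwxyz' );
// Does not contain 'поиск': the missing "и" is not in the Latin alphabet.
```

No amount of adding Latin letters can repair a misspelt Russian word, so the candidates never match anything in the index.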

The simpler “Did you mean” feature in the free version of Relevanssi should work with most character sets, as it bases the suggestions on users’ earlier searches, but it’s less reliable in other ways.

Relevanssi has a filter hook relevanssi_didyoumean_alphabet for replacing the alphabet used. Here are some replacement alphabets:

Russian

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'; } );

Arabic

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'ابتثجحخدذرزسشصضطظعغفقكلمنهويآإأؤئى'; } );

Polish

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'aąbcćdeęfghijklłmnńoóprsśtuwyzźż'; } );

Vietnamese

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'aáàâậăằảbcdđeẹêệềghiịklmnoóọôộơớợpqrstuụủưựứvxyỹ'; } );

Hebrew

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'אבגדהוזחטיכלמנסעפצקרשתםןףךץ'; } );
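On a site that mixes languages, extending the alphabet may work better than replacing it. The sketch below makes one assumption: that the hook passes the current alphabet to the callback as its first argument, as WordPress filters normally do. The function name rlv_demo_extend_alphabet is made up for this example.

```php
<?php
/**
 * Append the Polish-specific letters to whatever alphabet the filter
 * passes in, dropping duplicates. Assumption: the hook passes the
 * current alphabet as the first argument.
 *
 * @param string $alphabet The alphabet received from the filter.
 *
 * @return string The alphabet with the extra letters appended.
 */
function rlv_demo_extend_alphabet( $alphabet ) {
    $extra = 'ąćęłńóśźż';
    // Split into single UTF-8 characters so multibyte letters stay intact.
    $letters = preg_split( '//u', $alphabet . $extra, -1, PREG_SPLIT_NO_EMPTY );
    return implode( '', array_unique( $letters ) );
}

// Guarded so the snippet also runs outside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'relevanssi_didyoumean_alphabet', 'rlv_demo_extend_alphabet' );
}
```

Keep in mind that a longer alphabet presumably means more candidate words to check for each suggestion.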

Stemming and suffix stripping

Relevanssi Premium includes a simple English-language stemmer that changes word forms to more basic forms to make the searching less dependent on exact word form.

To enable the English stemmer, add this to your site and rebuild the index:

add_filter( 'relevanssi_stemmer', 'relevanssi_simple_english_stemmer' );
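To show the general idea of suffix stripping, here is a toy callback for the same hook. It assumes that relevanssi_stemmer passes a single word to the callback and expects the stemmed word back. The function name rlv_demo_suffix_stripper and the suffix list are made up for illustration and are far too crude for real use.

```php
<?php
/**
 * Toy suffix stripper, for illustration only.
 *
 * @param string $word The word to stem.
 *
 * @return string The word with the first matching suffix removed.
 */
function rlv_demo_suffix_stripper( $word ) {
    // Hypothetical suffix list; a real stemmer needs language-specific rules.
    $suffixes = array( 'ing', 'ed', 's' );

    foreach ( $suffixes as $suffix ) {
        $stem_length = mb_strlen( $word, 'UTF-8' ) - mb_strlen( $suffix, 'UTF-8' );
        // Only strip when at least three characters of stem remain.
        if ( $stem_length >= 3 && mb_substr( $word, $stem_length, null, 'UTF-8' ) === $suffix ) {
            return mb_substr( $word, 0, $stem_length, 'UTF-8' );
        }
    }
    return $word;
}

// Guarded so the snippet also runs outside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'relevanssi_stemmer', 'rlv_demo_suffix_stripper' );
}
```

With this callback, "searching" and "searched" both end up as "search", so a search for one form can match the other. As with the English stemmer, the index would need rebuilding after a change like this.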

Other languages:

These simple stemmers are not very good, though, so I recommend using a proper Snowball stemmer. It’s available as an add-on plugin and is slightly harder to set up, but the results are better, and the plugin supports over a dozen languages.

Get the Snowball Stemmer add-on plugin here.

Arabic diacritics

You can improve the Relevanssi Arabic support by removing diacritics with this function. Add this to your site:

add_filter( 'relevanssi_remove_punctuation', 'rlv_arabic_remap', 9 );

/**
 * Remove Arabic diacritics.
 *
 * @param string $a The text to remove punctuation from.
 *
 * @return string The same text with punctuation and diacritics removed.
 */
function rlv_arabic_remap( $a ) {
    $remap = array(
        'إ' => 'ا',
        'آ' => 'ا',
        'أ' => 'ا',
        'ئ' => 'ى',
        'ة' => 'ه',
        'ؤ' => 'و',
        'ـ' => '',
    );

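    // Unicode ranges to strip; the "u" flag makes the patterns UTF-8 aware.
    $diacritics = array(
        '~[\x{0600}-\x{061F}]~u', // Number signs, punctuation and annotation marks.
        '~[\x{063B}-\x{063F}]~u', // Extended letter forms.
        '~[\x{064B}-\x{065E}]~u', // Combining vowel marks and related signs.
        '~[\x{066A}-\x{06FF}]~u', // Percent sign, separators and the rest of the Arabic block.
    );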

    $a = preg_replace( $diacritics, '', $a );
    $a = str_replace( array_keys( $remap ), array_values( $remap ), $a );

    return $a;
}

After adding the code, make sure you rebuild the index. This function will remove the diacritics and map some characters to their simpler forms in the index and user searches, enabling the search to find more results.
