Relevanssi and languages

Relevanssi itself is language-agnostic. It does not know any language and doesn’t care which language the site uses.

However, there are a few things to consider when using Relevanssi in languages other than English.

Characters: use UTF-8

As long as your site uses UTF-8 characters, Relevanssi can handle just about anything you throw at it; you can even search for emoji. UTF-8 is the standard in WordPress and something you generally don’t have to worry about.

Words: bad news for Chinese and Japanese

While Relevanssi can read Chinese, Japanese and many other characters without problems, the lack of distinct words in these languages is a problem for Relevanssi.

Relevanssi works by splitting the posts into words at spaces and then counting how many times those words appear. Since Chinese and Japanese texts don’t have spaces separating words, Relevanssi can’t do this.
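To see why missing spaces matter, here is a minimal sketch of splitting text at whitespace and counting the resulting words. It is not Relevanssi’s actual indexing code, and rlv_demo_count_words() is a made-up helper used only for illustration.

```php
<?php
/**
 * Illustrative only: split text at whitespace and count each word.
 * rlv_demo_count_words() is a hypothetical helper, not a Relevanssi function.
 */
function rlv_demo_count_words( $text ) {
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split( '/\s+/u', trim( $text ), -1, PREG_SPLIT_NO_EMPTY );
    // Map each distinct word to the number of times it appears.
    return array_count_values( $words );
}

// English: every word becomes its own countable token.
$english = rlv_demo_count_words( 'search engines index words and count words' );
// $english['words'] is 2.

// Chinese: there are no spaces, so the whole sentence is one "word" seen once.
$chinese = rlv_demo_count_words( '我喜欢搜索引擎' );
// count( $chinese ) is 1.
```

With the English sentence, the counts say something useful about what the text is about. With the Chinese sentence, the only "word" is the sentence itself, so there is nothing meaningful to count.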

As a result, Relevanssi can search for Chinese and Japanese characters or character sequences, especially if you enable one-character words and inside-word matching in the Relevanssi settings. However, since the weights for the posts are essentially random, the results won’t be of high quality.

Unfortunately, making the search work well in Chinese, Japanese and other languages with similar characteristics requires advanced linguistics and is far beyond our capabilities.

Update 25.11.2020: Matthew Wang has suggested using a Chinese language segmentation tool like phpjieba. If you have the jieba() function installed on your site, you can use it for tokenizing Chinese text like this:

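add_filter( 'relevanssi_remove_punctuation', 'rlv_use_jieba' );
function rlv_use_jieba( $string ) {
    // jieba() comes from the phpjieba extension; leave the text untouched if it is missing.
    if ( ! function_exists( 'jieba' ) ) {
        return $string;
    }
    $words = jieba( $string, 1, 1500 );
    // Join the segmented words with spaces so Relevanssi can split on them.
    return is_array( $words ) ? implode( ' ', $words ) : $string;
}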

Did you mean suggestions: limited to Latin characters

While Relevanssi can do searches with Arabic, Russian or other non-Latin character sets, the “Did you mean” suggestions in Relevanssi Premium only support Latin characters.

These suggestions work by taking the search term and modifying it in different ways by adding or removing letters. These modifications are made with the Latin alphabet (mostly the English alphabet, with a few extra umlauts thrown in). This restricts the Premium “Did you mean” feature to text in the Latin alphabet.
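The following is a rough sketch of the idea, not the actual Premium code: it builds every variant of a word that is one added or removed letter away, using a given alphabet. The helper name rlv_demo_variants() is hypothetical, and how Relevanssi Premium chooses among the candidates is not shown here.

```php
<?php
/**
 * Illustrative only: generate one-edit variants of a word from a given alphabet.
 * rlv_demo_variants() is a hypothetical helper, not Relevanssi Premium's code.
 */
function rlv_demo_variants( $word, $alphabet ) {
    // Split into characters, not bytes, so multibyte alphabets work.
    $letters  = preg_split( '//u', $alphabet, -1, PREG_SPLIT_NO_EMPTY );
    $chars    = preg_split( '//u', $word, -1, PREG_SPLIT_NO_EMPTY );
    $length   = count( $chars );
    $variants = array();

    for ( $i = 0; $i <= $length; $i++ ) {
        $head = implode( '', array_slice( $chars, 0, $i ) );
        $tail = implode( '', array_slice( $chars, $i ) );

        // Removal: drop the character at position $i.
        if ( $i < $length ) {
            $variants[] = $head . implode( '', array_slice( $chars, $i + 1 ) );
        }
        // Insertion: add each letter of the alphabet at position $i.
        foreach ( $letters as $letter ) {
            $variants[] = $head . $letter . $tail;
        }
    }

    return array_values( array_unique( $variants ) );
}

// With a Latin alphabet, the misspelling "serch" yields "search" as a candidate.
$latin = rlv_demo_variants( 'serch', 'abcdefghijklmnopqrstuvwxyz' );

// The same alphabet cannot repair a Cyrillic typo: "поск" never becomes "поиск".
$cyrillic = rlv_demo_variants( 'поск', 'abcdefghijklmnopqrstuvwxyz' );
```

Because new letters can only come from the alphabet passed in, a Latin alphabet can never produce a corrected Cyrillic or Arabic word. The same function finds the correction as soon as it is given an alphabet for the right script.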

The simpler “Did you mean” feature in the free version of Relevanssi should work with most character sets, as it’s based on the searches made by users, but it’s less reliable in other ways.

We’re looking into improving this in a future version of Relevanssi to better support non-Latin alphabets. The next version of Relevanssi will include a filter that lets users replace the default alphabet with something else. If you wish to try this before the release, please contact us.

Here are some replacement alphabets for future use:

Russian

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'; } );

Arabic

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'ابتثجحخدذرزسشصضطظعغفقكلمنهويآإأؤئى'; } );

Polish

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'aąbcćdeęfghijklłmnńoóprsśtuwyzźż'; } );

Vietnamese

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'aáàâậăằảbcdđeẹêệềghiịklmnoóọôộơớợpqrstuụủưựứvxyỹ'; } );

Stemming and suffix stripping

Relevanssi Premium includes a simple English-language stemmer that changes word forms to more basic forms, in order to make the searching less dependent on exact word form.

To enable the English stemmer, add this to your theme functions.php and rebuild the index:

add_filter( 'relevanssi_stemmer', 'relevanssi_simple_english_stemmer' );
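For a sense of what suffix stripping involves, here is a minimal sketch of a custom function hooked to the same filter. It is not the stemmer shipped with Relevanssi Premium: rlv_demo_strip_suffix() is a hypothetical name, the suffix list is deliberately tiny, and it assumes the filter passes one word and expects the stemmed word back.

```php
<?php
/**
 * Illustrative only: strip a few common English suffixes from a word.
 * rlv_demo_strip_suffix() is a hypothetical helper, not the Premium stemmer.
 */
function rlv_demo_strip_suffix( $word ) {
    // Longer suffixes come first so "es" is tried before "s".
    $suffixes = array( 'ing', 'ed', 'es', 's' );
    foreach ( $suffixes as $suffix ) {
        $stem_length = strlen( $word ) - strlen( $suffix );
        // Only strip when at least three characters of stem remain.
        if ( $stem_length >= 3 && substr( $word, -strlen( $suffix ) ) === $suffix ) {
            return substr( $word, 0, $stem_length );
        }
    }
    return $word;
}

// Assumption: the filter passes one word and expects the stemmed word back.
// The guard lets the sketch run outside WordPress as well.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'relevanssi_stemmer', 'rlv_demo_strip_suffix' );
}
```

With this sketch, "searching", "searched" and "searches" all reduce to "search", while a short word like "bus" is left alone. It uses byte-based string functions, so it only suits plain ASCII suffixes.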

Other languages:

These simple stemmers are not very good, though, so I recommend using a proper Snowball stemmer. It’s available as an add-on plugin and is slightly harder to set up, but the results are better and the plugin supports over a dozen languages.

Get the Snowball Stemmer add-on plugin here.

Arabic diacritics

The Relevanssi Arabic support can be improved by removing diacritics with this function. Add this to your theme functions.php:

add_filter( 'relevanssi_remove_punctuation', 'rlv_arabic_remap', 9 );

/**
 * Remove Arabic diacritics.
 *
 * @param string $a The text to remove punctuation from.
 *
 * @return string The same text with punctuation and diacritics removed.
 */
function rlv_arabic_remap( $a ) {
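    // Map letter variants to their simpler forms; the tatweel (ـ) is dropped.
    $remap = array(
        'إ' => 'ا',
        'آ' => 'ا',
        'أ' => 'ا',
        'ئ' => 'ى',
        'ة' => 'ه',
        'ؤ' => 'و',
        'ـ' => '',
    );

    $diacritics = array(
        '~[\x{0600}-\x{061F}]~u', // Signs, punctuation and annotation marks.
        '~[\x{063B}-\x{063F}]~u', // Extended letters outside the basic alphabet.
        '~[\x{064B}-\x{065E}]~u', // Vowel marks and other combining diacritics.
        '~[\x{066A}-\x{06FF}]~u', // Punctuation, Quranic marks, extended letters and digits (e.g. Persian, Urdu).
    );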

    $a = preg_replace( $diacritics, '', $a );
    $a = str_replace( array_keys( $remap ), array_values( $remap ), $a );

    return $a;
}

After adding the code, make sure you rebuild the index. This will remove the diacritics and map some characters to their simpler forms in the index and in user searches, enabling the search to find more results.
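As a small standalone illustration of the diacritic removal, the sketch below applies just one of the patterns from the function above to a single vowelled word. The sample word is my own choice, and the \u{...} escapes (PHP 7 or later) are used only to make each combining mark visible.

```php
<?php
// Illustrative only: strip Arabic vowel marks (U+064B to U+065E) from one word.
// Letters: meem, hah, meem, dal; marks: damma, fatha, shadda, fatha.
$with_marks = "\u{0645}\u{064F}\u{062D}\u{064E}\u{0645}\u{0651}\u{064E}\u{062F}";
$plain      = preg_replace( '~[\x{064B}-\x{065E}]~u', '', $with_marks );

// $plain now holds only the four base letters.
$expected = "\u{0645}\u{062D}\u{0645}\u{062F}";
var_dump( $plain === $expected ); // bool(true)
```

A post written with vowel marks and a search typed without them end up as the same string, which is why the index needs rebuilding after the filter is added.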
