utf8 && locale

Today Perl supports Unicode and various charsets, but writing code that deals with our national language is still difficult. It's never that easy, and questions about this hot topic keep emerging. Jan sent his question to our mailing list some time ago, and lately I've read that a similar question was discussed on the CZ/SK Debian mailing list. So let's see what's going on with an example:

#!/usr/bin/perl 

my @pole=qw(šiška marek ucho čaj žička);

@pole=sort(@pole);           # sort array

foreach my $a (@pole) {
    $a =~ s/\W//g;           # remove all non word characters
    print "$a \n";           # print the array field
}

open my $handle, ">file.txt" or die "Can't write to file.txt: $!";
print $handle "\Uěščřžabcd\E\n";     # write upper cased string
close $handle;

This will produce the following output (the last line is the content of file.txt):

marek 
ucho 
aj 
ika 
ika 

ěščřžABCD

Many things went wrong with the previous script and the results are not really what we expected. čaj, šiška and žička are at the end and weren't sorted properly, ěščřž was left intact instead of being upper cased, and the letters with diacritics were not treated as word characters.

Now let's understand what went wrong and how to fix the script.

Teaching Perl the alphabet

First, Perl isn't handling the strings properly: neither the sorting nor the pattern matching treated all characters as letters. This is because not all written languages use the same alphabet, and even languages that share letters may not sort them in the same order. Perl doesn't know in which language the strings are written, so it can't handle them properly. To fix this we need to tell Perl to use a specific locale (think of it as an alphabet); then it can tell which characters are letters and what their correct order is. This is done with use locale;, which makes Perl follow the locale configured in the environment (for example through LANG or LC_ALL). Once this is added to the script, the output looks a bit better:

aj 
marek 
ika 
ucho 
ika 
ěščřžABCD
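
As a rough illustration of the step above, the following sketch selects a locale explicitly instead of relying on the environment. The locale name cs_CZ.UTF-8 is an assumption (it has to be installed on the system), and the call to POSIX::setlocale is an addition that the original script does not contain. The resulting order depends on the locale that is actually active, so no output is shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_ALL);

# Assumption: a Czech UTF-8 locale is installed under this name.
my $wanted = 'cs_CZ.UTF-8';
my $active = setlocale(LC_ALL, $wanted);
warn "Locale $wanted is not available; keeping the current locale\n"
    unless defined $active;

# The words are plain byte strings here (this sketch has no "use utf8").
my @words = qw(šiška marek ucho čaj žička);

my @sorted;
{
    use locale;             # inside this block, sort follows LC_COLLATE
    @sorted = sort @words;
}

print "$_\n" for @sorted;
```

If setlocale returns undef, the script keeps running with whatever locale was active at startup, so the order may fall back to plain byte order.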

Parsing source code in different languages

The previous section showed how to teach Perl the alphabet. The problem is that Perl still doesn't know how to read our text. It can read strings properly from files and STDIN, but not from its own source code. This is because, by default, Perl does not treat the script as UTF-8: it reads the bytes of string literals as single-byte characters (effectively Latin-1). Thus Perl can't properly parse the strings in the source code. To fix this we need to tell Perl which encoding the source code is in. This can be done in three ways:

By adding use utf8;. Perl will then read the source file as a UTF-8 file, accepting accented letters and other characters defined in Unicode.

By adding use encoding 'utf8';. This pragma has the advantage of accepting other encodings, so the source file could be written in some other encoding. Note that the encoding pragma has been deprecated since Perl 5.18, and recent Perl releases no longer support its default mode.

By writing the characters with Perl escape sequences for all non-ASCII characters. For instance, to write Á use the escape sequence "\x{C1}", which corresponds to the Unicode character LATIN CAPITAL LETTER A WITH ACUTE. This is quite portable, as the source file stays in ASCII, but it's tedious to read and maintain.
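
To make the relation between these options concrete, here is a small sketch that compares a UTF-8 literal with its escaped forms. It assumes the file itself is saved as UTF-8; the variable names are only illustrative, and the \N{...} form is an extra spelling not mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                   # the literal below is stored as UTF-8 in this file
use charnames ':full';      # enables \N{...} on older Perls as well

my $literal = "Á";                                       # needs "use utf8"
my $escaped = "\x{C1}";                                  # pure ASCII source
my $named   = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";   # pure ASCII source

# All three spellings produce the same one-character string.
print "same\n" if $literal eq $escaped and $escaped eq $named;
print length($literal), "\n";                            # prints 1
```

Without use utf8; the first literal would be two bytes long and the comparison would fail, which is exactly the problem described above.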

Once the source file is read with the proper encoding the output strings are properly displayed.

čaj 
marek 
šiška 
ucho 
žička 

Wide character in print at ./utf8-test.pl line 18.
ĚŠČŘŽABCD

I/O in UTF-8

Now the strings are printed as we expected. But we still have an issue with printing UTF-8 strings, because STDOUT is not expecting UTF-8. This can be fixed in several ways:

The first solution is to add use encoding 'utf8';. This tells Perl that the script (the source file) is encoded in UTF-8 and, more importantly for this example, sets the PerlIO layers of STDIN and STDOUT to UTF-8. In fact, use encoding 'utf8'; and use utf8; are almost the same, except that the former also sets the encoding of STDIN and STDOUT (STDERR is left unchanged). The POD for utf8 states: In case you are wondering: yes, "use encoding 'utf8';" works much the same as "use utf8;".

The second alternative is to change the encoding of STDOUT explicitly with binmode(STDOUT, ':encoding(utf8)'). This works for any file handle, not only STDOUT, and might be the best solution after all, because use encoding leaves all other PerlIO handles untouched. If UTF-8 strings are printed to a file, Perl will still complain about Wide character in print, and the only fix is to set the encoding of that handle's PerlIO layer manually.

Finally, the encoding of a PerlIO layer can be set when the file is opened, using the three-argument form of open. For instance, open my $handle, "<:utf8", "file" or die "Can't read file: $!"; opens the file for reading as UTF-8.
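
The last two options can be sketched together as follows. The temporary file created with File::Temp is only there to keep the example self-contained, and the stricter layer name :encoding(UTF-8) is used in place of :utf8; both are choices made for this sketch, not something the original script does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # this file is saved as UTF-8
use File::Temp qw(tempfile);

# Set the layer on an already open handle.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical scratch file, removed automatically at exit.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Set the layer while opening the file for writing.
open my $out, '>:encoding(UTF-8)', $path or die "Can't write to $path: $!";
print $out "ěščřž\n";
close $out;

# Read the text back through a decoding layer.
open my $in, '<:encoding(UTF-8)', $path or die "Can't read $path: $!";
my $line = <$in>;
close $in;
chomp $line;

my $size = -s $path;
print length($line), " characters, $size bytes on disk\n";
```

Each of the five letters takes two bytes in UTF-8, so together with the newline the file holds 11 bytes while the decoded string has 5 characters.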

Once we tell Perl to handle all I/O in UTF-8 all is fine:

čaj 
marek 
šiška 
ucho 
žička 

ĚŠČŘŽABCD

To summarize, the final test script looks like this:

#!/usr/bin/perl 

use warnings;
use strict;

# Sort strings properly and match all letters
use locale;

# Read the source file as UTF-8 and set STDOUT and STDIN to UTF-8
use encoding 'utf8';


# We can write letters in the source code in UTF-8 thanks to "use encoding"
my @pole=qw(šiška marek ucho čaj žička);

# Sort knows our alphabet thanks to "use locale"
@pole=sort(@pole);           # sort array

foreach my $a (@pole) {
    # The pattern matching works fine thanks to "use locale"
    $a =~ s/\W//g;           # remove all non word characters
    print "$a \n";           # print the array field
}

# Write in UTF-8 into the file
open my $handle, ">:utf8", "file.txt" or die "Can't write to file.txt: $!";
# We can convert to upper case thanks to "use locale"
print $handle "\Uěščřžabcd\E\n";     # write upper cased string
close $handle;
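
Since the encoding pragma is deprecated in newer Perl releases, a variant of the script without it may be useful. The sketch below is one possible way to do that, not the script from this article: it assumes the file is saved as UTF-8, uses the open pragma to set the I/O layers, and swaps use locale for the core module Unicode::Collate with its default (not Czech-specific) ordering. It prints to STDOUT only and does not write file.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # source file is UTF-8
use open qw(:std :encoding(UTF-8));     # STDIN, STDOUT, STDERR and default open() layers
use Unicode::Collate;                   # core module, swapped in for "use locale"

my @pole = qw(šiška marek ucho čaj žička);

# Default Unicode collation; accents only break ties between base letters.
my @sorted = Unicode::Collate->new->sort(@pole);

for my $word (@sorted) {
    (my $clean = $word) =~ s/\W//g;     # accented letters match \w in decoded strings
    print "$clean\n";
}

my $upper = uc "ěščřžabcd";             # case mapping follows Unicode rules
print "$upper\n";
```

For these five words the default collation gives čaj, marek, šiška, ucho, žička, followed by ĚŠČŘŽABCD. A full Czech ordering (for example the special position of "ch") would need a locale or a tailored collator.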

Notes

If you want to write Pod with characters outside Latin-1, use the following directive to set the encoding:

=encoding utf-8

Links: utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code, locale - Perl pragma to use and avoid POSIX locales for built-in operations, perlunitut - Perl Unicode Tutorial, perluniintro - Perl Unicode introduction, perlunicode - Unicode support in Perl

29. Jun 2008
Jozef

03. Aug 2008
Emmanuel