Localizing GNU/Linux Platform for Sinhala Language and Sri Lanka

by Anuradha Ratnaweera

Virtusa Corporation and Lanka Linux User Group

1. Introduction

2. Issues

The Sinhala script is sometimes classified as a `complex script', along with other `Indic' scripts. The fact that some modifiers are rendered before their base character complicates matters further.

Present standards (SLS 1134 and Unicode) assign a character to each vowel, consonant and modifier. This makes certain things such as collation simple, but because a modifier can have more than one shape (i.e., glyph) associated with it, and one glyph can stand for more than one character, the rendering engine has to do some extra work.

Examples: ko and kra
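
The two examples can be made concrete by listing the characters behind them. The Python sketch below assumes the usual Unicode encodings of the two syllables (ka followed by the vowel sign o for `ko', and ka, al-lakuna, ZWJ, ra for `kra'). The names KO, KRA and describe are chosen only for this illustration.

```python
import unicodedata

# Assumed encodings of the two example syllables.
KO = "\u0d9a\u0ddc"               # ka + vowel sign o
KRA = "\u0d9a\u0dca\u200d\u0dbb"  # ka + al-lakuna + ZWJ + ra

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for label, text in (("ko", KO), ("kra", KRA)):
    print(label)
    for code, name in describe(text):
        print("   ", code, name)
```

In `ko', a single vowel sign is drawn partly before and partly after the base consonant. In `kra', four stored characters are shown as one combined shape.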

3. GNU C Library

3.1. Locale

The GNU C Library implements locales as defined by POSIX and other related standards. Each locale is associated with a language, a country and, optionally, a character encoding. The Sinhala language (si) locale for Sri Lanka (LK) with UTF-8 encoding is denoted by si_LK.UTF-8.
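
As a rough illustration of how such a name breaks down, the following Python sketch splits a locale name into its language, country and encoding parts. The LocaleName type and the parse_locale_name function are hypothetical helpers written for this example, and names carrying an @modifier suffix are not handled.

```python
from typing import NamedTuple, Optional

class LocaleName(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]

def parse_locale_name(name: str) -> LocaleName:
    """Split a name such as 'si_LK.UTF-8' into its three parts."""
    base, _, codeset = name.partition(".")
    language, _, territory = base.partition("_")
    # Missing parts are reported as None rather than as empty strings.
    return LocaleName(language, territory or None, codeset or None)

print(parse_locale_name("si_LK.UTF-8"))
```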

Each locale contains attributes related to a language and a country, including currency formats and symbols, number formats, day and month names and date formats, paper sizes, ways of writing names and addresses, phone codes and formats, and measurement systems.

Properly internationalized programs query the locale database for even the most trivial matters, such as day or month names. Therefore, changing the locale makes all such programs behave according to the relevant country, language and encoding.

Here is an example: the abbreviated names of the days of the week (from /usr/share/i18n/locales/si_LK):

% Abbreviated weekday names (%a)
abday       "<U0D89>";"<U0DC3>";/
            "<U0D85>";"<U0DB6>";/
            "<U0DB6><U0DCA><U200D><U0DBB>";"<U0DC3><U0DD2>";/
            "<U0DC3><U0DD9>"
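
The <Uxxxx> notation names characters by their Unicode code points. A small Python sketch, with a hypothetical helper decode_symbolic, can turn the entries above back into readable text:

```python
import re

# The seven abday entries from the extract above.
ABDAY = [
    "<U0D89>", "<U0DC3>", "<U0D85>", "<U0DB6>",
    "<U0DB6><U0DCA><U200D><U0DBB>", "<U0DC3><U0DD2>", "<U0DC3><U0DD9>",
]

def decode_symbolic(text: str) -> str:
    """Replace every <Uxxxx> symbol with the character it names."""
    return re.sub(
        r"<U([0-9A-Fa-f]{4,8})>",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

print([decode_symbolic(day) for day in ABDAY])
```

The fifth entry contains al-lakuna followed by ZWJ, so it is meant to be displayed as a single conjunct shape.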

Once a locale is created, it has to be `compiled' by running localedef or a higher level tool such as locale-gen on Debian.

Here is an example which shows the effect of setting the locale on the date program.

% date
Thu Mar 10 18:05:04 LKT 2005
% export LC_ALL=si_LK.UTF-8
% date
2005 මාර්තු 10 වැනි බ්‍රහස්පතින්දා 18:04:52 +0600
%
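
A program can make the same switch from inside its own code. The sketch below is only an illustration in Python, using the standard locale and time modules. The format_date helper is hypothetical, and what it prints for si_LK.UTF-8 depends on whether that locale has been compiled on the system; if it has not, the previous locale is kept.

```python
import locale
import time

def format_date(locale_name: str, when: time.struct_time) -> str:
    """Format a date under locale_name, falling back to the current locale."""
    previous = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
    except locale.Error:
        # The requested locale is not available; keep the previous one.
        pass
    try:
        return time.strftime("%A %d %B %Y", when)
    finally:
        locale.setlocale(locale.LC_ALL, previous)  # always restore

moment = time.strptime("2005-03-10", "%Y-%m-%d")
print(format_date("C", moment))
print(format_date("si_LK.UTF-8", moment))
```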

Aliases for locales can be defined in /etc/locale.alias for convenience as follows:

si       si_LK.UTF-8
si_LK    si_LK.UTF-8
sinhala  si_LK.UTF-8

The Sinhala locale for Sri Lanka has been submitted to the GNU C Library Bugzilla.

3.2. Message Translation

Internationalized programs don't just print messages. Instead, they look up the proper translation in the message database.

The GNU C Library provides two ways of translating messages. Of these, the Uniforum approach, or the gettext family of functions, is the more popular on GNU/Linux systems, and we provide only gettext catalogues so far.

Changing environment variables such as LC_ALL has a similar effect on internationalized applications: all the strings that have equivalents in the gettext catalogue are displayed in translation.
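
The lookup itself can be sketched with Python's standard gettext module, which reads the same MO catalogues. The domain name and the load_translator helper below are placeholders invented for this example and do not belong to any real package. With fallback=True, a missing catalogue simply leaves the messages untranslated.

```python
import gettext

# Placeholder values for illustration only.
DOMAIN = "example_sinhala_app"
LOCALE_DIR = "/usr/share/locale"

def load_translator(languages):
    """Return a gettext function for the first language with a catalogue."""
    catalogue = gettext.translation(
        DOMAIN, localedir=LOCALE_DIR, languages=languages, fallback=True
    )
    return catalogue.gettext

_ = load_translator(["si"])
# Translated if a catalogue for DOMAIN is installed, otherwise unchanged.
print(_("Close"))
```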

A message catalogue is made by first creating a PO file and then compiling it to an MO file using the msgfmt tool. A typical PO file looks like this:

msgid "Close"
msgstr "වසන්න"

msgid "Copy"
msgstr "පිටපත් කරන්න"

msgid "Contents"
msgstr "පටුන"
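
Conceptually, a catalogue is a table from original strings to translated ones. The Python sketch below reads the entries above into a dictionary to show that structure. It is a deliberately minimal reader written for this example (parse_po is a hypothetical name) and is not a substitute for msgfmt: comments, escapes, multi-line strings and plural forms are ignored.

```python
import re

PO_TEXT = '''
msgid "Close"
msgstr "වසන්න"

msgid "Copy"
msgstr "පිටපත් කරන්න"

msgid "Contents"
msgstr "පටුන"
'''

# One entry: a msgid line immediately followed by a msgstr line.
ENTRY = re.compile(r'^msgid "(.*)"\nmsgstr "(.*)"$', re.MULTILINE)

def parse_po(text: str) -> dict:
    """Return a mapping from each msgid to its msgstr."""
    return dict(ENTRY.findall(text))

catalogue = parse_po(PO_TEXT)
# Strings without an entry fall back to the original message.
print(catalogue.get("Close", "Close"))
print(catalogue.get("Paste", "Paste"))
```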

Compiled MO files for Sinhala are placed in /usr/share/locale/si/LC_MESSAGES/. Generally, each program or library has an MO file associated with it. Here is an example of gedit with the locales C and si_LK.UTF-8 respectively.

4. X Window System

The X Window System also has a locale system that is almost independent of the C Library. However, programs based on the GTK and Qt libraries have their own rendering engines and use the locales and translation catalogues in the C library. Therefore, it is not necessary to add a separate si_LK locale to X.

However, the X Window System will not switch to a locale that it doesn't know about. The common workaround is to `bind' such locales to en_US.UTF-8. We have submitted patches to both X.Org and XFree86 to do this for si_LK.UTF-8.

The relevant files are compose.dir, locale.alias and locale.dir in /usr/X11R6/lib/X11/locale/. Here is an extract from locale.dir:

en_US.UTF-8/XLC_LOCALE:        sh_YU.UTF-8
en_US.UTF-8/XLC_LOCALE:        si_LK.UTF-8
en_US.UTF-8/XLC_LOCALE:        sk_SK.UTF-8
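
Each line of the extract pairs a locale database file with a locale name that is bound to it. The Python sketch below reads the extract into a dictionary to make the binding explicit. The parse_locale_dir helper is hypothetical, and treating lines that start with # as comments is an assumption about the file format.

```python
LOCALE_DIR_EXTRACT = """\
en_US.UTF-8/XLC_LOCALE:        sh_YU.UTF-8
en_US.UTF-8/XLC_LOCALE:        si_LK.UTF-8
en_US.UTF-8/XLC_LOCALE:        sk_SK.UTF-8
"""

def parse_locale_dir(text: str) -> dict:
    """Map each locale name to the locale database file it is bound to."""
    bindings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        database, locale_name = line.split()
        bindings[locale_name] = database.rstrip(":")
    return bindings

print(parse_locale_dir(LOCALE_DIR_EXTRACT)["si_LK.UTF-8"])
```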

5. Fonts

Almost all `complex' scripts use OpenType fonts. OpenType is an extension of Apple's TrueType font format.

For Sinhala GNU/Linux, we created an OpenType font using outlines from the Sinhala LaTeX project. These outlines were originally developed by Yannis Haralambous.

Pattern substitution was added to the font using FontForge (formerly known as PfaEdit). Basic glyphs were left in the Sinhala Unicode block, but ligatures (combinations) were added to the end of the font without assigning code points (-1). Here is an example (`nu'):

6. GNOME, GTK and Pango

GNU Network Object Model Environment (GNOME) is an application development framework, which is commonly known for its desktop environment.

GNOME is built on top of the GIMP Toolkit (GTK), originally developed for the GNU Image Manipulation Program (GIMP).

6.1. Pango Rendering

GTK / GNOME uses a library called Pango to `shape' strings, i.e., to convert strings written in different languages into sequences of glyphs (shapes) from fonts.

The original patch submitted to Pango simply added Sinhala to the Indic OpenType rendering module. However, it tried to create conjuncts implicitly, even when ZWJ was not present. A fix was submitted and is available from Pango 1.8.1 onwards.
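
The rule behind the fix can be stated in a few lines. The Python sketch below is not Pango's code; it only illustrates the rule under a simplifying assumption, namely that a conjunct is requested exactly where al-lakuna is immediately followed by ZWJ. The function name is hypothetical.

```python
AL_LAKUNA = "\u0dca"  # Sinhala sign al-lakuna
ZWJ = "\u200d"        # zero width joiner

def explicit_conjunct_positions(text: str) -> list:
    """Return the indexes of al-lakuna characters that are followed by ZWJ."""
    return [
        index
        for index in range(len(text) - 1)
        if text[index] == AL_LAKUNA and text[index + 1] == ZWJ
    ]

with_zwj = "\u0d9a\u0dca\u200d\u0dbb"  # ka + al-lakuna + ZWJ + ra
without_zwj = "\u0d9a\u0dca\u0dbb"     # ka + al-lakuna + ra

print(explicit_conjunct_positions(with_zwj))     # [1]
print(explicit_conjunct_positions(without_zwj))  # []
```

Under this rule the second string is left as two separate letters, which is the behaviour the fix restores.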

6.2. Keyboard Input

GTK supports `input method' modules, which can be selected by right-clicking on any widget that accepts text input.

6.3. Mozilla Firefox

7. Qt and KDE

8. OpenOffice and ICU

9. Spell Checking

10. Conclusions