Using the Scorpio Translation system

Posted by Dave Redfern (Writer), Tutorials on 07 Sep 2009 @ 22:54

Introduction

You've made the choice (or it was made for you) that your current site, or your next project, has to support several languages. You've done your research and found plenty of material out there, but not all of it is as easy as it appears. Some questions that may have crossed your mind could include:

In this tutorial we will look at answering most of these questions, and at how the Scorpio Framework can assist you.

First a little background on the translation system.

Like most developers, we behind Scorpio had to research the multiple-language systems available in PHP. The main requirements were flexibility and support for PHP5. There are numerous options, including PEAR extensions as well as the gettext() PHP extension; in fact, there are so many that it can be difficult to know where to begin. Ultimately, quite a lot of time was spent looking into several solutions, with the Zend Framework proving to be the most useful implementation. The Scorpio translation system is thus based on Zend_Translate, with a few additional patches, updates and tweaks to accommodate Scorpio.

The primary benefit of the Zend system is the presence of multiple adaptors for a variety of translation formats, including gettext, simple array / INI files, as well as XLIFF and other XML formats. Some of these formats are quite complex, others relatively simple.

With that behind us, the first decision to make is which format your language files will be stored in.

Choosing a storage format

The Scorpio translation system supports several storage formats as has already been mentioned, but what are the pros and cons of these? Is there any one that is "better" than the others? After having reviewed them all from a point of view of writing test cases, these are the main pros and cons that we've found for each adaptor.

Array

Pros
  • Probably the simplest format
  • Stored in a native PHP array
  • Easy for PHP developers to work with
  • Easily cached by APC or other opcode cache
Cons
  • Not friendly for a non-developer to work with
  • Files must always be stored in UTF-8, easy to forget
  • Only allows for a 1:1 mapping e.g. English to Spanish
  • No dev tools to aid translation
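
For reference, an Array language file is simply a PHP file that returns a map of source text to translated text, which is the layout the Zend_Translate array adaptor reads. The sketch below assumes a Spanish file; the file name, the strings and the translateOrSource() helper are made up for illustration and are not part of Scorpio.

```php
<?php
// Hypothetical file: es/messages.php (name and strings are illustrative).
// The file must be saved as UTF-8 so accented characters survive.
$translations = [
    'Hello'                => 'Hola',
    'Goodbye'              => 'Adiós',
    'Your basket is empty' => 'Su cesta está vacía',
];

// A real language file would normally end with: return $translations;
// Here the map is used directly so the sketch runs on its own.
function translateOrSource(array $map, string $text): string
{
    // Fall back to the source text when no translation exists.
    return $map[$text] ?? $text;
}

echo translateOrSource($translations, 'Hello'), "\n";
echo translateOrSource($translations, 'Unknown string'), "\n";
```

The 1:1 limitation is visible here: each file maps one source language to exactly one target language.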

CSV

Pros
  • Second simplest format
  • Minimal processing overhead (PHP natively supports CSV files)
Cons
  • Not friendly for a non-developer to work with
  • Same UTF-8 concerns as with Array
  • No dev tools to aid translation

Gettext

Pros
  • Probably the widest used format
  • Part of the normal Unix language tools
  • CLI commands for text gathering
  • gettext is a native PHP function
Cons
  • Not friendly for a non-developer to work with
  • gettext is not always enabled in PHP
  • Language files must be 'compiled' into a binary format before being used
  • No dev tools to aid translation

INI

Pros
  • Relatively simple format (some caveats)
  • Minimal processing overhead (PHP natively supports INI files)
Cons
  • Not friendly for a non-developer to work with
  • Same UTF-8 concerns as with Array
  • Only allows for a 1:1 mapping
  • Constraints on source key (e.g. keys cannot contain spaces)
  • No dev tools to aid translation

QT (also known as TS format)

Pros
Cons
  • Not friendly for a non-developer to work with if the language tools cannot be used
  • Only allows for a 1:1 mapping
  • XML processing overhead

TBX

Pros
Cons
  • Extremely complex XML format for all those relationships
  • Documentation is VERY verbose
  • Not to be used by a non-developer
  • Status of GUI tools is unknown
  • Very verbose XML format requires a hefty amount of processing

TMX

Pros
  • Open standard managed by the Localisation Industry Standards Association
  • Relatively simple XML format
  • Allows multiple translations in a single file
  • Well documented
Cons
  • Not friendly for a non-developer to work with
  • Status of GUI tools is unknown
  • XML based format has processing performance penalties
  • Generally used in the CAT industry (according to LISA:TMX page)

Xliff

Pros
  • Widely used in other PHP frameworks (e.g. Symfony)
  • Relatively simple XML format
  • Several GUIs for editing and creating translations
  • Managed by an open group (OASIS)
Cons
  • Only allows for a 1:1 mapping
  • Must have a stated source and target language
  • XML based format has processing performance penalties

XMLTM

Pros
  • Open standard managed by LISA
  • Designed to work with other formats including transformation into them
  • Supports many advanced concepts to aid translation
Cons
  • Relatively new (only standardised in 2007)
  • Complex XML format
  • Not friendly for a non-developer to work with
  • Status of GUI tools unknown

That covers the available formats. Which one to use is up to the development team and whoever will be performing the translations. If your company is already using one of the above, it is a good idea to continue doing so.

If after reading and reviewing the documentation you are still not sure, then the Scorpio recommendation is something like:

QT / Xliff > Gettext > Array

Both QT and Xliff have decent GUIs that aid translating text. This means that if you have to translate in-house, it is easy to train other users and they do not need to look at raw file data. Gettext is the next preferred option as it is widely used, but be aware that the PO files do need to be compiled before they can be used. Gettext supports some fairly advanced features, but they have to be coded into the PO files before compiling. The last option is the straight PHP array: it is the simplest format with minimal processing, but it is not ideal for a non-coder to edit as values need to be correctly escaped, etc. (this is still a consideration with the other formats, but less of an issue with a GUI).

It is not recommended to use INI files, as there are many constraints on the key value: you cannot use spaces, only A-Z, 0-9 and a handful of punctuation marks (,.-_). Scorpio does have an exporter for INI that will build new keys, which must then be used in your templates in place of the actual text strings.
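
To give a feel for what building an INI-safe key involves, here is a minimal sketch. It is not Scorpio's exporter; the buildIniKey() function and its rules (upper-case letters, digits and underscores only, with a hash suffix on long keys) are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper, NOT Scorpio's exporter: turns display text into a key
// that uses only A-Z, 0-9 and underscore, which is safe in an INI file.
function buildIniKey(string $text, int $maxLength = 40): string
{
    $key = strtoupper(trim($text));
    // Collapse every run of disallowed characters into one underscore.
    $key = preg_replace('/[^A-Z0-9]+/', '_', $key);
    $key = trim($key, '_');
    // Keep keys short; append a hash so truncated keys stay unique.
    if (strlen($key) > $maxLength) {
        $key = substr($key, 0, $maxLength) . '_' . substr(md5($text), 0, 8);
    }
    return $key;
}

echo buildIniKey('Hello, world!'), "\n"; // HELLO_WORLD
```

The cost is readability: a template containing HELLO_WORLD says far less to the person editing it than the original sentence did.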

Building templates for translation

With the storage format chosen, the next step is to build the templates with translation in mind. For the sake of this tutorial, we will be looking at a purely MVC driven website using the Scorpio framework and the built-in Smarty layer.

The default view renderer in Scorpio is a customised extension of Smarty, a PHP template engine. Scorpio enhances the functionality of Smarty by adding several custom functions and wrappers. In particular, there is a pre-filter that is registered automatically during initialisation. This pre-filter allows text that is marked up correctly to be run through the translation system - if enabled on the site.

The default markup for Smarty templates is to use a pair of 't' tags: {t}{/t}. This can be configured in the site config file to some other letter or phrase.

Using the tag is straightforward: whenever you write text that is shown to the end-user, place the tags around the whole block of text. Note that this should not include display HTML (strong and em are permissible but strongly discouraged). The block can include other Smarty functions, Smarty formatting modifiers and variables. For example, you may say hello to the currently logged-in user:

Before:
Hello {$oUser->getName()}

After:
{t}Hello {$oUser->getName()}{/t}
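
Conceptually, the pre-filter swaps the text inside each tag pair for its translation before Smarty compiles the template. The sketch below shows that idea on a plain string; it is not Scorpio's actual filter, and the translatePrefilter() function and the in-memory translation map are assumptions made so the example runs without Smarty.

```php
<?php
// Hypothetical pre-filter; Scorpio's real filter is not shown in this article.
// It rewrites {t}...{/t} blocks in the raw template source before Smarty
// compiles it, leaving any Smarty code inside the block intact.
function translatePrefilter(string $source, array $translations, string $tag = 't'): string
{
    $quoted  = preg_quote($tag, '/');
    $pattern = '/\{' . $quoted . '\}(.*?)\{\/' . $quoted . '\}/s';

    return preg_replace_callback($pattern, function (array $match) use ($translations) {
        $text = $match[1];
        // Unknown strings fall back to the original text.
        return $translations[$text] ?? $text;
    }, $source);
}

// Single quotes keep {$oUser->getName()} as literal template code.
$translations = ['Hello {$oUser->getName()}' => 'Hola {$oUser->getName()}'];
$template     = '<p>{t}Hello {$oUser->getName()}{/t}</p>';

echo translatePrefilter($template, $translations), "\n";
// <p>Hola {$oUser->getName()}</p>
```

Because the whole block, Smarty variables included, is the look-up key, the translated string carries the same variables and Smarty still evaluates them afterwards.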

What about images?

Images are problematic, as what looks great in English or the native script may be unreadable when translated. At any rate, the image text will need translating and the custom image will need to be stored in a language-specific location. You can then access the path using the Locale that is registered in the request object.

Your image path may then be something like:
<img src="{$themeimages}/{$oLocale->getLocale()}/welcome.png" />

As a suggestion: keep text in images to a minimum. It becomes difficult to manage, each new language will need new images, and any change to the text will need to be made to ALL of the images. A better idea is to use CSS sprites and have real text instead.

Once all your templates are in order and all the text is marked up, the next step is building the language resource file.

Building a language resource file

Depending on your chosen language format, this next step will be either easy or pretty difficult. If you are using Array, INI, CSV, Gettext, Xliff, QT or TMX, then you are in luck and can use the built-in extract tool in the main Scorpio CLI tool.

If you are using TBX or XMLTM, you will need to build one of the other formats first and then transform that into TBX or XMLTM - something beyond the scope of the Scorpio framework and this article.

Presuming that you have one of the supported formats, the next step is to run the extraction utility over the website. This is part of the core scorpio.php CLI tool that is located in the /tools folder of the main distribution. To run it you will need a correctly configured PHP CLI binary (Windows and Linux are both supported - as is Cygwin on Windows). You should be able to run: php -i and see the PHP version information before proceeding.

The extraction tool has a lot of options, so before running it you should review the help information. This can be accessed by running from within /tools:

php scorpio.php help extract i18n

As we are using Smarty and a website this means we will need to specify the following parameters:

Then if you did not use {t}{/t} as the markup, you will also need to specify

With all those specified hit enter and the website language data will be extracted into the specified format and saved into a file under the website folder in /libraries/data/[LOCALE]/. From there you can check the contents, edit the strings for translation or send the files to be translated.

Note: once this step has been done you should NOT change the text in the templates. If you do, the language data will need to be re-extracted as the look-up keys will have changed. Alternatively: you can manually edit the language files.
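
The reason the keys change is that the marked-up text itself is the look-up key. A minimal sketch of the extraction idea makes this visible; it is not the scorpio.php tool, and the extractStrings() function and the sample templates are hypothetical.

```php
<?php
// Hypothetical extractor, not the scorpio.php tool: collects every unique
// string wrapped in the translation tag from a set of template sources.
function extractStrings(array $templates, string $tag = 't'): array
{
    $quoted  = preg_quote($tag, '/');
    $pattern = '/\{' . $quoted . '\}(.*?)\{\/' . $quoted . '\}/s';
    $found   = [];

    foreach ($templates as $source) {
        if (preg_match_all($pattern, $source, $matches)) {
            foreach ($matches[1] as $text) {
                // The source text is the look-up key; start untranslated.
                $found[$text] = $text;
            }
        }
    }
    return $found;
}

$templates = [
    '<h1>{t}Welcome{/t}</h1><p>{t}Hello {$name}{/t}</p>',
    '<a href="/">{t}Welcome{/t}</a>',
];
$strings = extractStrings($templates);
echo count($strings), "\n"; // 2
```

Change "Welcome" to "Welcome back" in a template and the old key no longer matches anything in the language file, hence the need to re-extract or edit the files by hand.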

During development it is likely a good idea to provide at least one translated file - even if it is just a partial translation for testing.

Testing the site

Now on to the (exciting?) part of testing that the translation is working. We've marked up our site, extracted the language data and (perhaps) made some partial translations.

First step - visit the site again in development and... chances are you are seeing all the {t}{/t} text (at worst) or you get an error that there is no such tag.

This is because internationalisation is disabled by default and needs to be configured in the site config.xml file. Open it and add a new section: i18n. The following parameters will need to be set:

Your config may look something like the following example:

<section name="i18n" override="1">
    <option name="active" value="true" />
    <option name="identifier" value="t" />
    <option name="defaultLanguage" value="en" />
    <option name="adaptor" value="xliff" />
    <option name="adaptorOptions" value="disableNotices=true|scan=directory" />
</section>

adaptorOptions are documented in the API docs for translateAdaptor, but they are also listed below:

  • clear: clears already loaded data when adding new files
  • scan: searches for translation files using the SEARCH_LOCALE constants
  • locale: the actual set locale to use
  • ignore: ignore files with this character, default .
  • disableNotices: disable trigger notices if no translation found
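
In the example config, the adaptorOptions value packs several name=value pairs into one string separated by pipes. The sketch below shows one way such a string could be split into an options array; it is an assumption about the format based only on that example, and parseAdaptorOptions() is a hypothetical function, not Scorpio's own parser.

```php
<?php
// Hypothetical parser for an adaptorOptions value such as
// "disableNotices=true|scan=directory"; Scorpio's own parsing may differ.
function parseAdaptorOptions(string $value): array
{
    $options = [];
    foreach (explode('|', $value) as $pair) {
        if (strpos($pair, '=') === false) {
            continue; // skip empty or malformed entries
        }
        list($name, $raw) = explode('=', $pair, 2);
        $name = trim($name);
        $raw  = trim($raw);
        // Convert the literal strings "true"/"false" into booleans.
        if ($raw === 'true' || $raw === 'false') {
            $options[$name] = ($raw === 'true');
        } else {
            $options[$name] = $raw;
        }
    }
    return $options;
}

var_export(parseAdaptorOptions('disableNotices=true|scan=directory'));
```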

With the config updated, try the site again. You should now see your page without any markup tags and with the translated text; better yet - all those Smarty variables, modifiers and function calls have still been made!

Why is that?

The language filtering is handled BEFORE the template is executed meaning that anything within the {t}{/t} tags will still be executed as Smarty code. This allows date formatting, currency formatting and other things to be done in the translation data without needing any additional transformations in the view layer.

Performance considerations

The final topic of this tutorial is performance. Obviously, adding translation is going to have an impact on site performance, and the more text there is to translate, the higher this impact will be. Certain formats are inherently 'better' for performance than others, at least at the actual translation level, but thankfully in Scorpio, so long as Smarty is being used, the impact is largely removed.

As previously mentioned, the translation in Smarty is handled before the templates are compiled, i.e. the translated text in the requested language is actually compiled into the Smarty template cache files. The upshot is that once the page has been translated, the translation does not need to happen again unless the cache is cleared.

What is happening behind the scenes?

When a template translation is requested, Scorpio automatically passes in the locale to the render and compile functions. This is automatically added to the compile and cache ids during the compile and render phases.

The benefit of this is a compiled template cache for each language, already prepared in whatever language was requested. Your visitors need never know, and Smarty is saved from having to prefilter all the text on every request. Better still, the rendered pages can be fully cached per language, eliminating even the need to compile.
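
As a rough illustration of how a locale can be folded into those ids, consider the sketch below. The localeCacheId() helper, the id format and the 'home' page id are assumptions; Scorpio builds its ids internally and its exact format is not described here.

```php
<?php
// Hypothetical helper showing how a locale can be folded into the ids that
// are handed to Smarty; Scorpio's actual id format is not documented here.
function localeCacheId(string $pageId, string $locale): string
{
    // Putting the locale first keeps all pages of one language together.
    return $locale . '|' . $pageId;
}

$locale    = 'es_ES';
$cacheId   = localeCacheId('home', $locale);
$compileId = $locale; // one compiled copy of each template per language

// With a Smarty instance these would be passed as:
//   $smarty->fetch('home.tpl', $cacheId, $compileId);
echo $cacheId, "\n"; // es_ES|home
```

Each distinct compile id yields its own compiled copy of a template, which is why the translated text only has to be filtered once per language.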

Of course, pre-filling the cache is important, especially if there is a lot of text or you have very large (or numerous) translation resources.

Finishing up

That is the translation system covered in a very broad manner, from the choice of format through to implementation, extraction and testing. The extraction tool can also find and extract strings from PHP libraries or any source that has been appropriately marked up, so you could make your entire application stack multi-lingual.

Of course there is one aspect so far avoided; namely the question of where do you allow the end-user to select their language and how to pass this information around?

Well there are various methods. Scorpio does have a distributor plugin for resolving the locale during distributor start-up. The supported methods are:

Which you choose is up to you, and it is definitely something you should consider as you build your application. While Scorpio does allow for site.com/LANG/..., it is not properly supported (at the time of writing); a better option would be to use sub-domains for each language and set the default language to the required language.

The ideal preference (if not sub-domains) is to offer language preference either via a cookie (long term storage) or as a sign-up preference to a user record so it can be loaded into the session.
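
A minimal sketch of that preference order might look like the following. It is not Scorpio's distributor plugin; the resolveLocale() function, the 'locale' key name and the list of supported languages are all hypothetical.

```php
<?php
// Hypothetical resolver, not Scorpio's distributor plugin: picks a locale
// from the user's session, then a cookie, then falls back to the default.
function resolveLocale(array $session, array $cookies, array $supported, string $default): string
{
    $candidates = [$session['locale'] ?? null, $cookies['locale'] ?? null];
    foreach ($candidates as $candidate) {
        // Only accept values from the known list; cookies are user input.
        if (is_string($candidate) && in_array($candidate, $supported, true)) {
            return $candidate;
        }
    }
    return $default;
}

$supported = ['en', 'es', 'fr'];
echo resolveLocale([], ['locale' => 'es'], $supported, 'en'), "\n"; // es
echo resolveLocale([], ['locale' => 'xx'], $supported, 'en'), "\n"; // en
```

In a real request the first two arguments would be $_SESSION and $_COOKIE; validating against a fixed list matters because the resolved value may end up in file paths such as the language-specific image folder.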

A few other notes:

In the translations you should always supply one that is the same as that included in the template - this ensures that there is always a default and - more importantly - allows for the default to be updated without having to change the templates.

Using the Smarty compile and cache ids per language generates a LOT of cache data. Be sure that there is sufficient disk space for each language cache.

Remember that translation should not be seen as a "bolt-on" to the site - you should make every effort to design it in from the start. It can take a long time to get text translated (e.g. from a bureau) and you will need the staff to support queries in those languages you add.

Finally: please feel free to leave comments and suggestions in the comments or contact the project via the methods on the contact page.
