Tuesday, December 27, 2011

Cross-platform multilingual support in PHP for the lazy programmer inside you

If you are like me, you dread adding multilingual support to applications.  It isn't that I do not like the various, rich cultures of our world, but rather that it is such a pain in the neck to implement multilingual support into an existing application from a programming perspective.

One of the problems stems from the fact that nearly all programming languages are written in English, limited to the basic ASCII character set, and targeted at English-speaking people.  Sure, I know of one programming language in Polish and a few others in other spoken languages, but non-English programming languages are few and far between.  It really has nothing to do with America vs. whoever.  It has more to do with settling on something we can use to get work done on a computer:  English happens to fit nicely into a single byte (or less) per character and seems to be one of a handful of what I call "common trade languages".  That is, if you want to conduct business across international borders, it helps a lot to know some English.  Regardless of the reason, programming languages are written in English and likely will continue to be for years to come.

Of course, for native English-speaking programmers, multilingual support is perceived as a difficulty - suddenly we have to think very differently in ways we aren't accustomed to thinking.  And, if this is your first time writing a multilingual app, you might search around to see what the "standard" is.  There isn't one, and people are all over the place on what the "best" method is.  If you write a lot of PHP like me, 'gettext' crops up BUT it has too many issues, including not being thread-safe and not necessarily being available on the target web host.

What most programmers are looking for is a good strategy.  What I'm going to share with you is my "lazy man" approach that works on all PHP installations and with how I prefer to develop software.  If you need working PHP code, you'll be able to find it in the next release of Admin Pack.

First, I implement the application how I normally would:  In English with all strings inline.  This doesn't work so well for some languages, such as C/C++ (not impossible though), but it works great for PHP.  Also, be sure the application uses some form of Unicode.  UTF-8 is pretty much universally supported.

Next, when the situation for multilingual support arises, I implement a set of functions that introduce a "language stack" and load in languages in a specific order.  This is something of my own creation that allows progressive fallback to my built-in English strings.  English is at the very bottom of the stack at the "system" layer, a "default" language is at the next layer, and then a "user-specific" language is at the top.  Now, let's say a function called TranslateStr() exists.  TranslateStr() first checks to see if a string matches in the "user-specific" language mapping.  If it doesn't exist, it checks the "default" language mapping.  If it still doesn't exist, it falls back to the English string.  Of course, this approach can create some "interesting" results where Chinese, French, and English all are displayed in the same user interface, but that's a translation issue, not a programming issue.  As a lazy programmer, you shouldn't care.  The key here is that something gets displayed even if it is in the wrong language.  A lot of multilingual systems don't have progressive fallback despite being really easy to implement.  The TranslateStr() function is then applied liberally throughout the application wherever English strings appear.  The function can be equipped with a "this string doesn't have a translation at all" notification method so that translators can be made aware of strings that need translation work to be done.
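
Here is a minimal sketch of the language stack described above.  It is not the Admin Pack code.  Only TranslateStr() is named in this post; the LangStack class, SetLanguageLayer(), and the sample French/German strings are made up for illustration.  English is never stored in a mapping because the inline string itself is the bottom of the stack.

```php
<?php
// Sketch only:  LangStack and SetLanguageLayer() are hypothetical names.
class LangStack
{
    // Layers are searched in this order.  English (the "system" layer) is
    // not stored here because the inline string itself is the final fallback.
    public static $layers = array("user" => array(), "default" => array());

    // English strings that had no translation in any layer.
    public static $missing = array();
}

// Loads a mapping of English string => translated string into a layer.
function SetLanguageLayer($layer, array $map)
{
    LangStack::$layers[$layer] = $map;
}

// Returns the first translation found, walking from the top of the stack
// down.  Falls back to the original English string.
function TranslateStr($str)
{
    foreach (LangStack::$layers as $map)
    {
        if (isset($map[$str]))  return $map[$str];
    }

    // "This string doesn't have a translation at all" notification hook.
    LangStack::$missing[$str] = true;

    return $str;
}

SetLanguageLayer("default", array("Save" => "Enregistrer", "Cancel" => "Annuler"));
SetLanguageLayer("user", array("Save" => "Speichern"));

echo TranslateStr("Save") . "\n";    // Speichern (user-specific layer)
echo TranslateStr("Cancel") . "\n";  // Annuler (default layer)
echo TranslateStr("Delete") . "\n";  // Delete (English fallback)
```

After a page has run, the keys of LangStack::$missing are the strings a translator still needs to look at.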

Full static strings are easy one-to-one translations.  Dynamically-constructed strings are much harder to translate.  In PHP, functions like sprintf() allow a string to be built differently based on the format specifiers in the first argument.  Positional specifiers (e.g. '%1$s') let a translator reorder the arguments to suit the word order of the target language - the lazy English strings only need '%s' though.  This sort of functionality makes some string translations easier.  Multilingual support gets messy when you discover that currency, numbers, and dates/times are displayed differently in different countries and that some languages are read in different directions.  If you want to be lazy, just do your best to allow a translator to create a "mostly correct" translation.  The human brain will adapt to and implicitly cover over most mistakes with only mild irritation.  It helps a lot if the entire application is designed for "possible future multilingual support" so that it only takes a day or two to deploy a complete multilingual solution.
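
To show the sprintf() point in action, here is a small sketch.  FormatCopyMessage() is a hypothetical helper, and the German format string is only an illustration that I have not had a translator check.  The English format uses plain '%s' while the translated one uses positional specifiers to swap the argument order.

```php
<?php
// Hypothetical helper:  the format string would normally come back from a
// translation lookup.  The arguments are always passed in the same order.
function FormatCopyMessage($format, $file, $folder)
{
    return sprintf($format, $file, $folder);
}

// Lazy English string:  plain '%s' is enough.
echo FormatCopyMessage('Copied %s to %s.', 'a.txt', 'Backup') . "\n";
// Copied a.txt to Backup.

// Illustrative translation:  '%2$s' and '%1$s' reorder the arguments
// without touching the calling code.  Single quotes keep PHP from treating
// '$s' as a variable.
echo FormatCopyMessage('In %2$s wurde %1$s kopiert.', 'a.txt', 'Backup') . "\n";
// In Backup wurde a.txt kopiert.
```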

At this point, you are probably wondering how to create the actual translations themselves.  If you are building a PHP-based web application, multilingual support is done lazily with IANA language codes (e.g. 'en-us') - which can be used with HTTP_ACCEPT_LANGUAGE parsing - and storing the translations into PHP files.  But you don't want translators messing around with raw PHP files - trust me, they don't understand software development and you'll spend more time fixing the problems they create than anything else.  A decent approach is to capture the output of var_export() using ob_start(), ob_get_contents(), and ob_end_clean() and create a web-based editor where they can manage the translation files that way.  Be sure to protect the translation editor - the last thing you need is the German translation of your web application spouting stuff about certain "enlargement medicine" to your users because some script kiddie found your insecure translation editor.  I'll probably figure out how to make one of these for general-purpose use, so you can sit back and wait for it (remember - when it comes to multilingual stuff, you want to be as lazy as possible).
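
A rough sketch of the two pieces mentioned above:  picking language codes out of the Accept-Language header and turning a translation mapping into a PHP file.  ParseAcceptLanguage() and ExportLanguageMap() are hypothetical names and the file layout is my assumption.  Passing true as the second argument to var_export() makes it return the string instead of printing it.

```php
<?php
// Hypothetical helper:  parses an Accept-Language header value into
// lowercase language codes ordered from most to least preferred.
function ParseAcceptLanguage($header)
{
    $langs = array();
    foreach (explode(",", $header) as $pos => $item)
    {
        $parts = explode(";", trim($item));
        $code = strtolower(trim($parts[0]));

        // Only accept simple codes such as 'en' or 'en-us'.  The code may
        // end up in a filename, so anything else (including '*') is dropped.
        if (!preg_match('/^[a-z]{1,8}(-[a-z0-9]{1,8})*$/', $code))  continue;

        // Default quality is 1.0 unless a 'q=' parameter says otherwise.
        $q = 1.0;
        for ($x = 1; $x < count($parts); $x++)
        {
            $param = trim($parts[$x]);
            if (strncasecmp($param, "q=", 2) == 0)  $q = (float)substr($param, 2);
        }
        if ($q <= 0)  continue;

        $langs[] = array("code" => $code, "q" => $q, "pos" => $pos);
    }

    // Highest quality first.  Ties keep the order from the header.
    usort($langs, function ($a, $b) {
        if ($a["q"] != $b["q"])  return ($a["q"] < $b["q"] ? 1 : -1);

        return $a["pos"] - $b["pos"];
    });

    $result = array();
    foreach ($langs as $lang)  $result[] = $lang["code"];

    return $result;
}

// Hypothetical helper:  turns a mapping into PHP source that can be saved
// to a file and later loaded with:  $map = include $filename;
function ExportLanguageMap(array $map)
{
    return "<?php\nreturn " . var_export($map, true) . ";\n";
}

print_r(ParseAcceptLanguage("fr-CH, fr;q=0.9, en;q=0.8, *;q=0.5"));
echo ExportLanguageMap(array("Save" => "Speichern"));
```

In a real request the header value would come from $_SERVER['HTTP_ACCEPT_LANGUAGE'], which may not be set at all, so check with isset() first.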

Speaking of security, multilingual support has been the source of countless security headaches and several major security exploits have been the result of not thinking through possibilities.  Make doubly-sure you don't introduce exploitable code as you deploy multilingual support into your application.  The lazy approach allows you to carefully inspect the application as you go along, so you might end up fixing other vulnerabilities too.

Some programmers may observe a slight decrease in performance and increased memory usage when the multilingual support routines are added.  These are unfortunate side-effects of multilingual support.  But keep in mind that multilingual support is sometimes a necessary evil to keep users happy.  You'll be able to get away with single-language-only applications more often than not.

And that is pretty much it.  Enjoy being a lazy programmer!

1 comment:

  1. Hey, cool blog, loving it.

    I did this kind of thing (I called it language inheritance) a couple of years ago in a Python webapp I made. I really loved the way it works: if there is no translation, fall back...

    Anyway, I just thought I'd mention that var_export()'s second argument allows it to return a string, so you don't have to use the output buffer. I used the output buffer for a few years before I noticed the second argument.
