t-fischer.net

ASCII Transliteration without ICU or iconv

2019-08-09, 20:18, by Thomas Fischer

So far, most of my blog postings that appeared on Planet KDE were release announcements for KBibTeX. Still, I had always planned to write more about what happens on the development side of KBibTeX. Well, here comes my first try to shed light on KBibTeX’s inner workings …

Active development of KBibTeX happens in its master branch. Other branches are created from time to time, mostly for bug fixing, i. e. allowing bug reporters to compile and test a bug fix before the change is merged into master or a release branch. Speaking of release branches, those get forked from master every one to three years. At the time of writing, the most recent release branch is kbibtex/0.9. Actual releases, including alpha or beta releases, are tagged on those release branches.

KBibTeX is developed on Linux; personally I use the master branch on Gentoo Linux and Arch Linux. KBibTeX compiles and runs on Windows with the help of Craft (master better than kbibtex/0.9). It is on my mental TODO list to configure a free Windows-based continuous integration service to build binary packages and installers for Windows; suggestions and support are welcome. Craft supports macOS, too, to some extent, so I gave KBibTeX a shot on this operating system (I happen to have access to an old Mac from time to time). Running Craft and installing packages caused some trouble, as macOS is the least tested platform for Craft. Also, it seems to be more difficult to find documentation on how to solve compilation or linking problems on macOS than it is for Windows (let alone Linux). However, with the help of the residents in #kde-craft and related IRC channels, I was eventually able to start compiling KBibTeX on macOS (big thanks!).

The main issue that came up when crafting KBibTeX on macOS was the problem of linking against ICU (International Components for Unicode). This library is shipped on macOS as it is used in many other projects, but seemingly even if you install Xcode, you don’t get any headers or other development files. Installing a different ICU version via Craft doesn’t seem to work either. However, I am no macOS expert, so I may have gotten the details wrong …

While discussing in Craft’s IRC channel how to get KBibTeX installed on macOS despite its dependency on ICU, I got asked why KBibTeX needs to use ICU in the first place, given that Qt ships QTextCodec which covers most text encoding needs. My particular need is to transliterate a given Unicode text like ‘äåツ’ into a 7-bit ASCII representation. Among other things, this is used to rewrite identifiers for BibTeX entries, from whatever the user wrote or an imported BibTeX file contained, into a 7-bit ASCII representation that is as close as possible to the original. 7-bit ASCII is usually the lowest common denominator supported on all systems, so this reduces issues if the file is fed into an ancient bibtex or shared with people using a different encoding or keyboard layout.

Such a transliteration is also useful in other scenarios, for example when filenames are supposed to be based on a person’s name but still must be transcribed into ASCII to be accessible on any filesystem and for any user irrespective of keyboard layout. For example, if a filename needs to have some resemblance to the Scandinavian name ‘Ångström’, the name’s transliteration could be ‘Angstrom’, thus a file could be named Angstrom.txt.

So, if ICU is not available, what are the alternatives? Before I adopted ICU for the transliteration task, I had used iconv. Now, my first plan to avoid hard-depending on ICU was to test for both ICU and iconv during the configuration phase (i. e. when cmake runs) and use ICU if available and fall back to iconv if no ICU was available. Depending on the chosen alternative, paths and defines (to enable or disable specific code via #ifdefs) were set. See commit 2726f14ee9afd525c4b4998c2497ca34d30d4d9f for the implementation.

However, using iconv has some disadvantages which motivated my original move to ICU:

There are different iconv implementations out there and not all support transliteration.
The result of a transliteration may depend on the current locale. For example, ‘ä’ may get transliterated to either ‘a’ or ‘ae’.
Typical iconv implementations know fewer Unicode symbols than ICU. Results are acceptable for European or Latin-based scripts, but for everything else you far too often get ‘?’ back.

Is there a third option? Actually, yes. Qt’s Unicode code supports only the first 2¹⁶ symbols anyway, so it is technically feasible to maintain a mapping from Unicode character (essentially a number between 0 and 65535) to a short ASCII string like AE for ‘Æ’ (0x00C6). This mapping can be built offline with the help of a small program that does link against ICU, queries this library for a transliteration for every Unicode code point from 0 to 65535, and prints out a C/C++ source code fragment containing the mapping (almost like in the good old days with X PixMaps). This source code fragment can be included into KBibTeX to enable transliteration without depending on either ICU or iconv on the machines where KBibTeX is compiled or run. Disadvantages include the need to drag along this mapping as well as to update it from time to time in order to keep up with updates in ICU’s own transliteration mappings. See commit 82e15e3e2856317bde0471836143e6971ef260a9 where the mapping got introduced as the third option.

The solution I eventually settled on is to still test for ICU during the configuration phase and make use of it in KBibTeX as I did before. However, in case no ICU is available, the offline-generated mapping will be used to offer essentially the same functionality. Switching between both alternatives happens at compile time; both code paths are separated by #ifdefs.
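To illustrate the compile-time switch, here is a minimal sketch and not KBibTeX’s actual code. The macro name HAVE_ICU, the function names toAscii and lookupGenerated, and the ICU transliterator ID are assumptions made for this example; the fallback branch covers only a handful of code points as a stand-in for the generated mapping.

```cpp
#include <string>

// HAVE_ICU is a hypothetical macro; the build system would define it if ICU was found.
#ifdef HAVE_ICU
#include <memory>
#include <unicode/translit.h>
#include <unicode/unistr.h>

std::string toAscii(const std::u16string &text)
{
    UErrorCode status = U_ZERO_ERROR;
    // The transliterator ID is an assumption, not necessarily what KBibTeX uses
    std::unique_ptr<icu::Transliterator> trans(icu::Transliterator::createInstance(
        icu::UnicodeString::fromUTF8("Any-Latin; Latin-ASCII"), UTRANS_FORWARD, status));
    if (U_FAILURE(status) || !trans)
        return std::string();
    icu::UnicodeString buffer(text.data(), static_cast<int32_t>(text.size()));
    trans->transliterate(buffer); // rewrites the buffer in place
    std::string result;
    buffer.toUTF8String(result);
    return result;
}
#else
// Stand-in for the offline-generated mapping; only a few code points are covered here.
static const char *lookupGenerated(char16_t c)
{
    switch (c) {
    case 0x00C4: case 0x00C5: return "A";
    case 0x00C6: return "AE";
    case 0x00E4: case 0x00E5: return "a";
    case 0x00F6: return "o";
    default: return "?";
    }
}

std::string toAscii(const std::u16string &text)
{
    std::string result;
    for (const char16_t c : text) {
        if (c < 0x80)
            result += static_cast<char>(c); // already plain ASCII
        else
            result += lookupGenerated(c);
    }
    return result;
}
#endif
```

Callers only ever see toAscii, so the rest of the code base does not need to know which of the two branches was compiled in.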

Support for iconv has been dropped as it became the least complete solution (see commit 47485312293de32595146637c96784f83f01111e).

Now, what does this generated mapping look like? In order to minimize the data structure’s size I came up with the following approach: First, there is a string called const char *unidecode_text that contains each occurring plain ASCII representation only once, for example only one single a that can be used for ‘a’, ‘ä’, ‘å’, etc. This string is about 28800 characters long for 65536 Unicode code points where a code point’s ASCII representation may be several characters long. So, quite efficient.

Second, there is an array const unsigned int unidecode_pos[] that holds a number for each of the 65536 Unicode code points. Each number contains both a position and a length telling which substring to extract from unidecode_text to get the ASCII representation. As the observed ASCII representations’ lengths never exceed 31, the array’s unsigned ints contain the representations’ lengths in their lower (least significant) five bits, the remaining more significant bits contain the positions. For example, to get the ASCII representation for ‘Ä’, use the following approach:

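const char16_t unicode = 0x00C4; ///< 'A' with two dots above (diaeresis)
const int pos = unidecode_pos[unicode] >> 5; ///< upper bits: offset into unidecode_text
const int len = unidecode_pos[unicode] & 31; ///< lower five bits: length of the ASCII representation
char *ascii = strndup(unidecode_text + pos, len); ///< heap-allocated copy, release with free()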

If you want to create a QString object, use this instead of the last line above:

const QString ascii = QString::fromLatin1(unidecode_text + pos, len);

If you stepped through this code with a debugger, you would see that unidecode_pos[unicode] has the value 876481 (this value may change if the generated source code changes). Thus, pos becomes 27390 and len becomes 1. Indeed, and not surprisingly, unidecode_text holds the character A at this position. BTW, the value 876481 is not just used for ‘Ä’, but also for ‘À’ or ‘Â’, for example.
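The packing can be double-checked, and the way a generator might produce such a table can be sketched, as follows. All names here (packEntry, entryPos, entryLen, GeneratedTable, buildTable) are made up for illustration, the transliterate callback merely stands in for the ICU query, and reusing an existing substring via a simple search is my assumption about one way to store each representation only once; the real generator may work differently.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Only the bit layout is taken from the text: length in the lower five bits, position above.
constexpr unsigned int LengthBits = 5;
constexpr unsigned int LengthMask = (1u << LengthBits) - 1; // 31

constexpr unsigned int packEntry(unsigned int pos, unsigned int len) { return (pos << LengthBits) | len; }
constexpr unsigned int entryPos(unsigned int entry) { return entry >> LengthBits; }
constexpr unsigned int entryLen(unsigned int entry) { return entry & LengthMask; }

// The worked example from the text: 876481 == 27390 * 32 + 1
static_assert(entryPos(876481u) == 27390u, "position of 'A'");
static_assert(entryLen(876481u) == 1u, "length of 'A'");
static_assert(packEntry(27390u, 1u) == 876481u, "round trip");

struct GeneratedTable {
    std::string text;              // plays the role of unidecode_text
    std::vector<unsigned int> pos; // plays the role of unidecode_pos
};

// Builds a table for the code points 0 .. count-1.
GeneratedTable buildTable(unsigned int count, const std::function<std::string(char16_t)> &transliterate)
{
    GeneratedTable table;
    table.pos.reserve(count);
    for (unsigned int cp = 0; cp < count; ++cp) {
        const std::string ascii = transliterate(static_cast<char16_t>(cp));
        assert(ascii.size() <= LengthMask); // the length must fit into five bits
        // Reuse an existing occurrence so that, for example, 'A' is stored only once
        std::string::size_type at = table.text.find(ascii);
        if (at == std::string::npos) {
            at = table.text.size();
            table.text += ascii;
        }
        table.pos.push_back(packEntry(static_cast<unsigned int>(at), static_cast<unsigned int>(ascii.size())));
    }
    return table;
}
```

With 32-bit unsigned ints, five bits for the length leave 27 bits for the position, which is far more than needed for a text of about 28800 characters.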

The above solution can be easily adjusted to work with plain C99 or modern C++. It is in no way specific to Qt or KDE, so it might, for example, allow musl (a libc implementation) to implement a //TRANSLIT feature in their iconv implementation (I have not checked their code to see whether that is possible at all).
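As a rough idea of what a Qt-free variant could look like, here is a sketch in standard C++ that transliterates a whole UTF-16 string. The function name asciiFromTable and its parameters are hypothetical: text and pos correspond to unidecode_text and unidecode_pos, and count is the number of entries in pos (65536 for the full generated table). Returning ‘?’ for code units outside the table is an arbitrary choice made for this example.

```cpp
#include <cstddef>
#include <string>

// Looks up every UTF-16 code unit of 'input' in a table with the layout described above.
// Surrogate halves are looked up like any other code unit, as the table only covers
// the first 65536 code points.
std::string asciiFromTable(const std::u16string &input, const char *text,
                           const unsigned int *pos, std::size_t count)
{
    std::string result;
    for (const char16_t c : input) {
        if (c >= count) {
            result += '?'; // not covered by the (possibly truncated) table
            continue;
        }
        const unsigned int entry = pos[c];
        // Upper bits hold the offset into 'text', lower five bits hold the length
        result.append(text + (entry >> 5), entry & 31);
    }
    return result;
}
```

With the real generated arrays, a call would look like asciiFromTable(input, unidecode_text, unidecode_pos, 65536).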

This posting is available via Gemini at gemini://gemini.t-fischer.net/post/ascii-transliteration-without-icu-or-iconv.gmi.
