ASCII Transliteration without ICU or iconv
2019-08-09, 20:18, by Thomas Fischer
So far, most of my blog postings that appeared on Planet KDE were release announcements for KBibTeX. Still, I had always planned to write more about what happens on the development side of KBibTeX. Well, here comes my first try to shed light on KBibTeX’s internal working …
Active development of KBibTeX happens in its master branch. Other branches are created from time to time, mostly for bug fixing, i. e. allowing bug reporters to compile and test a bug fix before the change is merged into master or a release branch. Speaking of release branches, those get forked from master every one to three years. At the time of writing, the most recent release branch is kbibtex/0.9. Actual releases, including alpha or beta releases, are tagged on those release branches.
KBibTeX is developed on Linux; personally I use the master branch on Gentoo Linux and Arch Linux. KBibTeX compiles and runs on Windows with the help of Craft (master better than kbibtex/0.9). It is on my mental TODO list to configure a free Windows-based continuous integration service to build binary packages and installers for Windows; suggestions and support are welcome. Craft supports macOS, too, to some extent, so I gave KBibTeX a shot on this operating system (I happen to have access to an old Mac from time to time). Running Craft and installing packages caused some trouble, as macOS is the least tested platform for Craft. Also, it seems to be more difficult to find documentation on how to solve compilation or linking problems on macOS than it is for Windows (let alone Linux). However, with the help of the residents in #kde-craft and related IRC channels, I was eventually able to start compiling KBibTeX on macOS (big thanks!).
The main issue that came up when crafting KBibTeX on macOS was the problem of linking against ICU (International Components for Unicode). This library is shipped on macOS as it is used in many other projects, but seemingly even if you install Xcode, you don’t get any headers or other development files. Installing a different ICU version via Craft doesn’t seem to work either. However, I am no macOS expert, so I may have gotten the details wrong …
While discussing in Craft’s IRC channel how to get KBibTeX installed on macOS despite its dependency on ICU, I was asked why KBibTeX needs ICU in the first place, given that Qt ships QTextCodec, which covers most text encoding needs. My particular need is to transliterate a given Unicode text like ‘äåツ’ into a 7-bit ASCII representation. This is used, among other things, to rewrite identifiers for BibTeX entries: whatever the user wrote or an imported BibTeX file contained is turned into the closest possible 7-bit ASCII representation (which is usually the lowest common denominator supported on all systems). This reduces issues if the file is fed into an ancient bibtex or shared with people using a different encoding or keyboard layout.
Such a transliteration is also useful in other scenarios, for example if filenames are supposed to be based on a person’s name but still must be transcribed into ASCII to be accessible on any filesystem and for any user irrespective of keyboard layout. For example, if a filename needs to bear some resemblance to the Scandinavian name ‘Ångström’, the name’s transliteration could be ‘Angstrom’, thus a file could be named Angstrom.txt.
So, if ICU is not available, what are the alternatives? Before I adopted ICU for the transliteration task, I had used iconv. Now, my first plan to avoid hard-depending on ICU was to test for both ICU and iconv during the configuration phase (i. e. when cmake runs) and use ICU if available and fall back to iconv if no ICU was available. Depending on the chosen alternative, paths and defines (to enable or disable specific code via #ifdefs) were set. See commit 2726f14ee9afd525c4b4998c2497ca34d30d4d9f for the implementation.
However, using iconv has some disadvantages which motivated my original move to ICU:
- There are different iconv implementations out there and not all support transliteration.
- The result of a transliteration may depend on the current locale. For example, ‘ä’ may get transliterated to either ‘a’ or ‘ae’.
- Typical iconv implementations know fewer Unicode symbols than ICU. Results are acceptable for European or Latin-based scripts, but for everything else you far too often get ‘?’ back.
Is there a third option? Actually, yes. Qt’s Unicode code supports only the first 2^16 symbols anyway, so it is technically feasible to maintain a mapping from Unicode character (essentially a number between 0 and 65535) to a short ASCII string like AE for ‘Æ’ (0x00C6). This mapping can be built offline with the help of a small program that does link against ICU, queries this library for a transliteration for every Unicode code point from 0 to 65535, and prints out a C/C++ source code fragment containing the mapping (almost like in the good old days with X PixMaps). This source code fragment can be included into KBibTeX to enable transliteration without depending on either ICU or iconv on the machines where KBibTeX is compiled or run. Disadvantages include the need to drag along this mapping as well as to update it from time to time in order to keep up with updates in ICU’s own transliteration mappings. See commit 82e15e3e2856317bde0471836143e6971ef260a9 where the mapping got introduced as the third option.
The solution I eventually settled on is to still test for ICU during the configuration phase and make use of it in KBibTeX as I did before. However, in case no ICU is available, the offline-generated mapping will be used to offer essentially the same functionality. Switching between both alternatives is a compile-time thing; both code paths are separated by #ifdefs.
Support for iconv has been dropped as it became the least complete solution (see commit 47485312293de32595146637c96784f83f01111e).
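The compile-time switch described above might look roughly like the following sketch. The define HAVE_ICU, the function name toAscii, and the tiny switch statement standing in for the generated mapping are assumptions made for illustration and are not taken from KBibTeX's sources; the transliterator ID "Any-Latin; Latin-ASCII" is likewise just one plausible choice.

```cpp
#include <cassert>
#include <string>

#ifdef HAVE_ICU // hypothetical define, set by the build system if ICU was found
#include <memory>
#include <unicode/translit.h>
#include <unicode/unistr.h>

// ICU code path: let the library do the transliteration.
std::string toAscii(const std::u16string &text)
{
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Transliterator> trans(icu::Transliterator::createInstance(
        icu::UnicodeString("Any-Latin; Latin-ASCII"), UTRANS_FORWARD, status));
    if (U_FAILURE(status) || !trans)
        return std::string();
    icu::UnicodeString buffer(text.data(), static_cast<int32_t>(text.size()));
    trans->transliterate(buffer);
    std::string result;
    buffer.toUTF8String(result);
    return result;
}
#else
// Fallback code path: a hand-written stand-in for the offline-generated
// mapping, covering only a few characters for demonstration purposes.
std::string toAscii(const std::u16string &text)
{
    std::string result;
    for (char16_t c : text) {
        if (c < 0x80) {
            result += static_cast<char>(c); // plain ASCII stays as it is
            continue;
        }
        switch (c) {
        case 0x00C4: case 0x00C5: result += 'A'; break;  // Ä, Å
        case 0x00E4: case 0x00E5: result += 'a'; break;  // ä, å
        case 0x00F6: result += 'o'; break;               // ö
        case 0x00C6: result += "AE"; break;              // Æ
        default: result += '?'; break;                   // unknown to this demo
        }
    }
    return result;
}
#endif
```

Callers only ever see toAscii, so the rest of the program does not need to know which of the two code paths was compiled in.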
Now, what does this generated mapping look like? In order to minimize the data structure’s size I came up with the following approach: First, there is a string called const char *unidecode_text that contains every occurring plain ASCII representation once, for example only one single a that can be used for ‘a’, ‘ä’, ‘å’, etc. This string is about 28800 characters long for 65536 Unicode code points where a code point’s ASCII representation may be several characters long. So, quite efficient.
Second, there is an array const unsigned int unidecode_pos[] that holds a number for each of the 65536 Unicode code points. Each number contains both a position and a length telling which substring to extract from unidecode_text to get the ASCII representation. As the observed ASCII representations’ lengths never exceed 31, the array’s unsigned ints contain the representations’ lengths in their lower (least significant) five bits; the remaining more significant bits contain the positions. For example, to get the ASCII representation for ‘Ä’, use the following approach:
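const char16_t unicode = 0x00C4; ///< 'A' with two dots above (diaeresis)
const int pos = unidecode_pos[unicode] >> 5; ///< upper bits: position in unidecode_text
const int len = unidecode_pos[unicode] & 31; ///< lower five bits: length
/// strndup (POSIX, declared in <string.h>) allocates a copy that must be released with free()
const char *ascii = strndup(unidecode_text + pos, len);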
If you want to create a QString object, use this instead of the last line above:
const QString ascii = QString::fromLatin1(unidecode_text + pos, len);
If you went through this code step-by-step with a debugger, you would see that unidecode_pos[unicode] has value 876481 (this value may change if the generated source code changes). Thus, pos becomes 27390 and len becomes 1. Indeed and not surprisingly, in unidecode_text at this position is the character A. BTW, value 876481 is not just used for ‘Ä’, but also for ‘À’ or ‘Â’, for example.
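To make the packing scheme more tangible, here is a minimal, self-contained sketch in standard C++ that builds a miniature table and reads it back. All names (PackedTable, buildTable, lookup, transliterate, demoMapping) are made up for this example, the mapping covers only a handful of characters, and reusing text through a substring search is just one simple way to store every representation once; KBibTeX's actual generator may work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical miniature of the generated data (names are not from KBibTeX).
struct PackedTable {
    std::string text;               // plays the role of unidecode_text
    std::vector<std::uint32_t> pos; // plays the role of unidecode_pos
};

// Build the table from a code point -> ASCII mapping.
PackedTable buildTable(const std::map<char16_t, std::string> &mapping)
{
    PackedTable table;
    table.pos.assign(65536, 0); // 0 means position 0, length 0: empty result
    for (const auto &entry : mapping) {
        const std::string &ascii = entry.second;
        if (ascii.empty() || ascii.size() > 31)
            continue; // the length must fit into five bits
        std::size_t at = table.text.find(ascii); // reuse text already stored
        if (at == std::string::npos) {
            at = table.text.size();
            table.text += ascii;
        }
        // Position in the upper bits, length in the lower five bits.
        table.pos[entry.first] = static_cast<std::uint32_t>((at << 5) | ascii.size());
    }
    return table;
}

// Unpack position and length again and extract the substring.
std::string lookup(const PackedTable &table, char16_t c)
{
    const std::uint32_t packed = table.pos[c];
    return table.text.substr(packed >> 5, packed & 31);
}

// Transliterate a whole UTF-16 string, one code unit at a time.
std::string transliterate(const PackedTable &table, const std::u16string &input)
{
    std::string result;
    for (char16_t c : input)
        result += lookup(table, c);
    return result;
}

// Demo mapping: printable ASCII maps to itself, plus a few accented letters.
std::map<char16_t, std::string> demoMapping()
{
    std::map<char16_t, std::string> mapping;
    for (char16_t c = 0x20; c < 0x7F; ++c)
        mapping[c] = std::string(1, static_cast<char>(c));
    mapping[0x00C0] = "A";  // À
    mapping[0x00C4] = "A";  // Ä
    mapping[0x00C5] = "A";  // Å
    mapping[0x00C6] = "AE"; // Æ
    mapping[0x00E4] = "a";  // ä
    mapping[0x00F6] = "o";  // ö
    return mapping;
}
```

With this demo mapping, transliterate(buildTable(demoMapping()), u"\u00C5ngstr\u00F6m") yields Angstrom, and the table entries for ‘Ä’ and ‘À’ hold the same packed number because both point at the same single A in the text.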
The above solution can be easily adjusted to work with plain C99 or modern C++. It is in no way specific to Qt or KDE, so it might, for example, serve musl (a libc implementation) as a way to implement a //TRANSLIT feature in its iconv implementation (I have not checked their code to see whether that is possible at all).
This posting is available via Gemini at gemini://gemini.t-fischer.net/post/ascii-transliteration-without-icu-or-iconv.gmi.