Pleasures of Tibetan input and typesetting with TeX

Many years ago I decided to learn Tibetan (and the necessary Sanskrit), and enrolled in Tibetology at the university in Vienna. Since then I have mostly forgotten Tibetan for complete lack of practice, save for regular consultations with a friend on how to typeset Tibetan. In the future I will write a lengthy article for TUGboat on this topic, but here I want to concentrate on a single case, the character ཨྰོཾ. Since you might have a hard time getting it rendered correctly in your browser, here is an image of it.

In former times we used ctib to typeset Tibetan, but the only font that was usable with it is a bit clumsy, and several much better fonts are now available. Furthermore, always using transliterated input instead of the original Tibetan text might be a problem for people outside academic Tibetan studies. Since these better fonts are TrueType/OpenType fonts, using one of them requires a switch to either XeTeX or LuaTeX.

It turned out that the font my friend currently uses, Jomolhari, although beautiful, is practically unusable. My guess is that the necessary tables in the TrueType font are not correctly initialized, but this is detective work for later. I tried several other fonts, in particular DDC Uchen, Noto Serif Tibetan, Yagpo Uni, and Qomolangma, and obtained much better results. Complicated compounds like རྣྲི་ rendered properly in both LuaTeX (in the standard version as well as the HarfBuzz-based version LuaHBTeX) and XeTeX. Others, like ཨོྂ་, rendered properly only with Noto Serif Tibetan.

But there was one combination that escaped correct typesetting: ཨྰོཾ་, which is the famous OM with a subjoined a-chung. It turned out that it worked with LuaLaTeX, but with LuaHBLaTeX (with HarfBuzz rendering actually enabled) and XeTeX the output was a separate OM followed by a dotted circle with a subjoined a-chung.

The puzzling thing was that it was rendered correctly in my terminal, and even in my browser; the reason was that the font used there is Yagpo Uni. Unfortunately, other combinations rendered badly with Yagpo Uni, which is surprising because Yagpo Uni seems to be one of the most complete fonts, with practically all stacks (not only the standard ones) precomposed as glyphs in the font.

With the help of the luaotfload package maintainers we could dissect the problem:

  • The Unicode sequence was U+0F00 U+0FB0, that is, the pre-composed OM followed by the subjoined a-chung.
  • This sequence was input via EWTS (M17N) by entering oM+', which somehow feels natural to use.
  • HarfBuzz decomposes U+0F00 into U+0F68 U+0F7C U+0F7E according to the ccmp table of the font.
  • HarfBuzz either does not recognize U+0F00 as a glyph that can get further consonants subjoined, or the decomposed sequence followed by the subjoined a-chung (U+0F68 U+0F7C U+0F7E U+0FB0) cannot be rendered correctly. Thus HarfBuzz renders the subjoined a-chung separately under the famous dotted circle.
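
To make the steps above concrete, here is a minimal Python sketch that lists the code points of the typed sequence and then applies the decomposition described in the third bullet. This is not HarfBuzz code: the mapping `CCMP_LIKE` and the helpers `describe` and `decompose` are hypothetical names, and the mapping is simply hard-coded from the description above rather than read from any font.

```python
import unicodedata

# Hypothetical stand-in for the font's ccmp decomposition described above:
# U+0F00 (OM) -> U+0F68 (A), U+0F7C (vowel sign O), U+0F7E (rjes su nga ro).
CCMP_LIKE = {"\u0f00": "\u0f68\u0f7c\u0f7e"}


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


def decompose(text):
    """Apply the ccmp-like mapping to every character of text."""
    return "".join(CCMP_LIKE.get(ch, ch) for ch in text)


typed = "\u0f00\u0fb0"            # what entering oM+' produced
after_ccmp = decompose(typed)     # what the shaper works on afterwards

print("typed:")
for line in describe(typed):
    print(" ", line)
print("after decomposition:")
for line in describe(after_ccmp):
    print(" ", line)
```

The second listing ends with the subjoined a-chung after the vowel sign and the rjes su nga ro, which is the sequence from the last bullet that ends up under the dotted circle.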

With this in mind, it was easy to come up with an input sequence that achieves the correct output: a+'oM, which generates the Unicode sequence

  U+0F68 TIBETAN LETTER A
  U+0FB0 TIBETAN SUBJOINED LETTER -A
  U+0F7C TIBETAN VOWEL SIGN O
  U+0F7E TIBETAN SIGN RJES SU NGA RO

which in turn is correctly rendered by HarfBuzz, and thus by both XeTeX and LuaHBTeX with HarfBuzz rendering enabled.
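
When comparing input orders like these, it can help to spell the code points out in the TeX source so that no input method or editor is involved. XeTeX and LuaTeX accept the `^^^^xxxx` notation (four carets followed by four lowercase hex digits) for this. The following Python sketch generates that notation; `to_tex_carets` is a hypothetical helper name, and the example string assumes the code points appear in the order in which a, +', o, M are typed.

```python
def to_tex_carets(text):
    """Spell every character of text as ^^^^xxxx (four lowercase hex digits)."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # The Tibetan block lies in the BMP; anything above is not handled here.
            raise ValueError(f"U+{cp:04X} is outside the BMP; not handled in this sketch")
        parts.append(f"^^^^{cp:04x}")
    return "".join(parts)


# Assumed typing order of a+'oM: letter A, subjoined a-chung, vowel O, rjes su nga ro.
print(to_tex_carets("\u0f68\u0fb0\u0f7c\u0f7e"))
# The pre-composed variant produced by oM+'.
print(to_tex_carets("\u0f00\u0fb0"))
```

Pasting both outputs into the same document, set in the same font, gives a side-by-side test of the two orders.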

Here are some further comments concerning HarfBuzz-based rendering (luahbtex-dev, xetex, libreoffice) versus luatex:

  • Yagpo Uni has lots of pre-composed compounds, but is rather picky about how they are input. The above letter renders correctly when input as oM+' (that is U+0F00 U+0FB0), but incorrectly when input as a+'oM.
  • Noto Serif Tibetan accepts most input from EWTS correctly, but oM+' does not work, while a+'oM does (the opposite of Yagpo Uni!).
  • Concerning ཨོྂ་ (input o~M`), only Noto Serif Tibetan and Qomolangma work; unfortunately Yagpo Uni is broken here. I guess a different input format is necessary for this.
  • Qomolangma does not use the correct size of a-chung in ligatures that are more complex than OM.
  • Jomolhari seems to be completely broken with respect to normal input.

All this tells me that there is really a huge difference in what fonts expect, in how to input, and in how to render, and with endangered minority languages such as Tibetan the situation seems to be much worse.

Concerning the rendering of ཨྰོཾ་, I am still not sure where to send a bug report. One could easily catch this order of characters in the EWTS M17N input definition, but that would fix it only for EWTS and leave out other areas, in particular LibreOffice rendering as well as any other HarfBuzz-based application. So I would rather see it fixed in HarfBuzz (and I am currently testing code). But all the other problems that appear are still unsolved.
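
For illustration only, the input-side workaround mentioned above could be sketched in Python as a text filter. This is neither the M17N definition (which has its own format) nor the HarfBuzz fix. The names `find_om_stacks` and `rewrite_om_stacks` are hypothetical. Treating U+0F90 to U+0FBC (the Tibetan subjoined consonants) as "what can follow OM" is my assumption, as is the target order, which is taken to be the typing order of a+'oM.

```python
import re

OM = "\u0f00"
# Assumption: "subjoined letters" means the Tibetan subjoined consonants,
# U+0F90..U+0FBC.
SUBJOINED = "\u0f90-\u0fbc"
PATTERN = re.compile(f"{OM}([{SUBJOINED}]+)")


def find_om_stacks(text):
    """Return the start offsets of OM followed by one or more subjoined letters."""
    return [m.start() for m in PATTERN.finditer(text)]


def rewrite_om_stacks(text):
    """Rewrite OM + subjoined letters as A + subjoined letters + O + rjes su nga ro."""
    return PATTERN.sub(
        lambda m: "\u0f68" + m.group(1) + "\u0f7c\u0f7e",
        text,
    )


sample = "\u0f00\u0fb0\u0f0b"  # OM, subjoined a-chung, tsheg
print(find_om_stacks(sample))
print([f"U+{ord(c):04X}" for c in rewrite_om_stacks(sample)])
```

A plain OM without anything subjoined is left untouched, so the filter only affects the problematic combination.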

6 Responses

  1. behdad says:

    If you want to see something fixed in HarfBuzz, you need to report it in the HarfBuzz issue tracker on GitHub.

  2. Chris Fynn says:

    I suggest you try entering Tibetan characters in the order in which they would normally be written (headline consonant first, then going down, and finally any above-base vowel marks and signs going up):
    1. Tibetan letter A U+0F68; 2. Tibetan subjoined letter -A U+0F71; 3. Tibetan vowel sign O U+0F7C; 4. Tibetan mark rjes su nga ro U+0F7E. This is the order in which the glyphs for these characters would normally be written (or input) by a Tibetan.

    • Hi Chris,
      thanks for Jomolhari, this is very much appreciated.
      And indeed, the failure was with using U+0FB0 instead of U+0F71. I blindly took the input from my colleague and didn’t check back whether U+0FB0 is correct, which it isn’t.
      Using your sequence correctly displays the ligature. Thanks a lot!

      But since you are here, I had also problems with rnri where the input I would expect is U+0F62 U+0FA3 U+0FB2 U+0F72. This input renders correctly in Noto Tibetan but broken in Jomolhari. Is there again something wrong with the input order?

  3. Chris Fynn says:

    Hi Norbert.

    Again, the input expected is the order in which Tibetan would normally be written. In other words, the sequence U+0F63 U+0FA3 U+0FB2 would come (or would normally be written) first, then the sequence U+0F72 U+0FA3.

    Tibetans and Bhutanese are usually taught to write and to spell out in this order – which is why the OT look-ups in Jomolhari were ordered like this.

    In a ‘stack’ the headline consonant character is written first then other (sub-joined) consonants going down next, any subjoined vowel marks, then any vowel marks above the consonant stack and finally any marks like U+0F7E, U+0F82 or U+0F83.

    The Dzongkha keyboard layout found in Windows, Linux (X86) was designed for input like this.

    BTW Jomolhari was designed at a time when there was little support for OpenType layout in operating systems and applications. While it has OpenType look-up tables which map from the basic Unicode characters in the Tibetan block (U+0F00->U+0FFF), it also contains pre-composed stacks which are mapped to the Private Use Area characters following the Chinese National Standards; GB/T 20542-2006 “Tibetan coded character set Extension -A ” (which uses the range U+F300-U+F8FF); and GB/T 22238-2008 “Tibetan coded character set Extension -B ” (which uses the range U+F0000-U+163F); (so far, Jomolhari only partially supports extension B).

    So the OT lookups in Jomolhari map from normal Tibetan Unicode to these pre-composed character stacks (the glyphs for which are also mapped to these PUA characters), following these Chinese standards. This also allowed Tibetan Unicode text to be converted from ‘straight’ Unicode to this Chinese encoding and displayed in applications which had (at the time) no support for OpenType. Andrew West’s Babel Pad is one utility that can be used to make the conversion.
    Also see: https://sites.google.com/view/chrisfynn/home/tibetanscriptfonts/standardization/precomposedtibetan-parta and https://sites.google.com/view/chrisfynn/home/tibetanscriptfonts/standardization/precomposed-tibetan-part-b.

    Some other Tibetan fonts (starting with the Microsoft Himalaya font) use more complex OpenType features and lookups to build Tibetan stacks ‘on the fly’, and these fonts allow almost any sequence of Tibetan characters (including nonsense combinations). There are advantages and disadvantages to each method. For example, while building stacks from parts ‘on the fly’ supports almost any sequence of Tibetan characters, the results are not always aesthetically pleasing.

    I do need to extend the Jomolhari font – and have finished an awful lot of work towards this – but never enough time or resources to complete all that I want to do.
