Pleasures of Tibetan input and typesetting with TeX
Many years ago I decided to learn Tibetan (and the necessary Sanskrit), and enrolled in Tibetology studies at the university in Vienna. Since then I have forgotten most of my Tibetan for complete lack of practice, save for regular consultations with a friend on how to typeset Tibetan. In the future I will write a lengthy article for TUGboat on this topic, but here I want to concentrate on a single case, the character ཨྰོཾ. Since you might have a hard time getting it rendered correctly in your browser, here is an image of it.
In former times we used ctib to typeset Tibetan, but the only font that was usable with it is a bit clumsy, and there are several much better fonts available now. Furthermore, always using transliterated input instead of the original Tibetan text might be a problem for people outside academic Tibetology. Since these better fonts are (obviously) TrueType/OpenType fonts, using one of them requires a switch to either XeTeX or LuaTeX.
It turned out that the font my friend currently uses, Jomolhari, although it is beautiful, is practically unusable. My guess is that the necessary tables in the TrueType font are not correctly initialized, but this is detective work for later. I tried several other fonts, in particular DDC Uchen, Noto Serif Tibetan, Yagpo Uni, and Qomolangma, and obtained much better results. Complicated compounds like རྣྲི་ rendered properly in both LuaTeX (in the standard version as well as the HarfBuzz-based version LuaHBTeX) and XeTeX. Others, like ཨོྂ་, rendered properly only with Noto Serif Tibetan.
But there was one combination that escaped correct typesetting: ཨྰོཾ་, which is the famous OM with a subjoined a-chung. It turned out that it worked with LuaLaTeX, but with LuaHBLaTeX (with HarfBuzz rendering actually enabled) and XeTeX the output was a separate OM followed by a dotted circle with a subjoined a-chung.
The puzzling thing was that it was rendered correctly in my terminal, and even in my browser. The reason was that both use the font Yagpo Uni. Unfortunately, other combinations rendered badly with Yagpo Uni, which is surprising because Yagpo Uni seems to be one of the most complete fonts, with practically all stacks (not only the standard ones) precomposed as glyphs in the font.
With help of the luaotfload package maintainers we could dissect the problem:
- The Unicode sequence was U+0F00 U+0FB0, which is the pre-composed OM followed by the subjoined a-chung
- This sequence was input via EWTS (M17N) by entering oM+', which somehow feels natural to use
- HarfBuzz decomposes U+0F00 into U+0F68 U+0F7C U+0F7E according to the ccmp table of the font
- HarfBuzz either does not recognize U+0F00 as a glyph that can get further consonants subjoined, or the decomposed sequence followed by the subjoined a-chung, U+0F68 U+0F7C U+0F7E U+0FB0, cannot be rendered correctly; thus HarfBuzz renders the subjoined a-chung separately under the famous dotted circle
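When chasing a problem like this it helps to look at the code points rather than at the rendered glyphs. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe` is made up for this illustration. It prints the two sequences named in the list above: the one that was typed, and the one that remains once U+0F00 has been decomposed.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# Written with escapes so the script does not depend on how the editor
# renders Tibetan.
typed = "\u0F00\u0FB0"                      # result of entering oM+'
after_decomposition = "\u0F68\u0F7C\u0F7E\u0FB0"  # U+0F00 decomposed, then a-chung

for label, text in (("typed", typed), ("after decomposition", after_decomposition)):
    print(label)
    for line in describe(text):
        print("  " + line)
```

In the second sequence the subjoined letter comes after the vowel sign and the anusvara instead of directly after the base letter, which fits the observation that the shaper does not attach it to the stack.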
Having this in mind, it was easy to come up with a correct input sequence to achieve the correct output:
a+'oM, which generates the Unicode sequence
U+0F68 TIBETAN LETTER A
U+0FB0 TIBETAN SUBJOINED LETTER -A
U+0F7C TIBETAN VOWEL SIGN O
U+0F7E TIBETAN SIGN RJES SU NGA RO
in which the subjoined a-chung directly follows the base letter. This sequence is rendered correctly by HarfBuzz, and thus by both XeTeX and LuaHBTeX with HarfBuzz rendering enabled.
Here are some further comments concerning HarfBuzz-based rendering (luahbtex-dev, xetex, libreoffice) versus luatex:
- Yagpo Uni has lots of pre-composed compounds, but is rather picky about how they are input. The above letter renders correctly when input as oM+' (U+0F00 U+0FB0), but incorrectly when input as a+'oM
- Noto Serif Tibetan accepts most input from EWTS correctly, but oM+' does not work, while a+'oM does (the opposite of Yagpo Uni!)
- Concerning ཨོྂ་ (input o~M`), only Noto Serif Tibetan and Qomolangma work; unfortunately Yagpo Uni is broken here. I guess a different input format is necessary for this.
- Qomolangma does not use the correct size of a-chung in ligatures that are more complex than OM
- Jomolhari seems to be completely broken with respect to normal input
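Comparisons like the ones above are easier to repeat with a generated test sheet. The following is only a sketch: it writes a small LuaLaTeX document that sets each sample in each font, once with fontspec's `Renderer=Node` and once with `Renderer=HarfBuzz`. The function name `test_sheet` is made up, and the font family names are assumptions; they must match the names under which the fonts are installed on your system, and fontspec stops with an error for a font it cannot find. The `Renderer=HarfBuzz` option also assumes a TeX installation in which `lualatex` runs on LuaHBTeX.

```python
# Font names as used in the text; the installed family names may differ.
FONTS = ["DDC Uchen", "Noto Serif Tibetan", "Yagpo Uni", "Qomolangma", "Jomolhari"]
RENDERERS = ["Node", "HarfBuzz"]
# The typed sequence and its decomposed form, as discussed above.
SAMPLES = ["\u0F00\u0FB0", "\u0F68\u0F7C\u0F7E\u0FB0"]


def codepoints(text):
    """Spell text as 'U+XXXX U+XXXX ...' for use as a heading."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def test_sheet(fonts=FONTS, renderers=RENDERERS, samples=SAMPLES):
    """Return the source of a LuaLaTeX document comparing fonts and renderers."""
    lines = [r"\documentclass{article}", r"\usepackage{fontspec}", r"\begin{document}"]
    for sample in samples:
        lines.append(r"\section*{" + codepoints(sample) + "}")
        for font in fonts:
            for renderer in renderers:
                # Label in the default font, sample in the font under test.
                lines.append(
                    rf"\noindent {font} ({renderer}): "
                    rf"{{\fontspec{{{font}}}[Renderer={renderer},Script=Tibetan]{sample}}}\par"
                )
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("tibetan-test.tex", "w", encoding="utf-8") as handle:
        handle.write(test_sheet())
```

Running `lualatex tibetan-test.tex` on the result should then show, line by line, which font and renderer combinations produce the dotted circle. Further samples can be added to `SAMPLES` as escaped code point strings.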
All this tells me that there are really huge differences in what fonts expect, in how text is input, and in how it is rendered, and for an endangered minority language like Tibetan the situation seems to be much worse.
Concerning the rendering of ཨྰོཾ་ I am still not sure where to send a bug report: one could easily catch this order of characters in the EWTS M17N input definition, but that would fix it only for EWTS input, and other areas would be left out, in particular LibreOffice rendering as well as any other HarfBuzz-based application. So I would rather see it fixed in HarfBuzz (and I am currently testing code). But all the other problems that appear are still unsolved.
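Until this is fixed at the right level, a document-side workaround is to respell the problematic sequence before the text reaches the typesetter. The sketch below is not the HarfBuzz fix mentioned above, only a preprocessing step with a made-up function name. It assumes that the working spelling is the one typed as a+'oM, with the subjoined a-chung directly after the base letter, and it handles only this one stack.

```python
# Pre-composed OM followed by subjoined a-chung, as produced by oM+'.
PROBLEMATIC = "\u0F00\u0FB0"
# Base letter A, subjoined a-chung, vowel sign O, anusvara (typed as a+'oM).
RESPELLED = "\u0F68\u0FB0\u0F7C\u0F7E"


def respell_om_achung(text):
    """Replace every U+0F00 U+0FB0 pair by the fully decomposed stack."""
    return text.replace(PROBLEMATIC, RESPELLED)


if __name__ == "__main__":
    sample = "\u0F00\u0FB0\u0F0B"  # the syllable followed by a tsheg
    fixed = respell_om_achung(sample)
    print(" ".join(f"U+{ord(ch):04X}" for ch in fixed))
```

A plain OM without a subjoined letter is left untouched, so text that already renders correctly is not changed. Note that, according to the observations above, the respelled form is the one that Yagpo Uni renders incorrectly, so the replacement should only be applied for fonts that need it.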