Writing Japanese in LaTeX : Part 2 – Characters and Encodings
After the very brief introduction to Japanese typesetting and the available options in the previous blog post, let us now dive into some technical details concerning character sets and encodings. These are things one already has to think about when using non-ASCII characters in, e.g., European languages, but even more so when writing Japanese (or any other ideographic writing system with thousands of glyphs). We will also solve the eternal mystery of the ¥ that appears in many places instead of the \ in Japanese documents.
First of all, be reminded that I am not an expert in this area, so many of my explanations here are simplifications of the real situation, often incomplete, and maybe sometimes plain wrong (hint: leave a comment!).
Character Sets and Encodings
In the ASCII world, everything was simple: to represent a letter like “A” in a computer, a code was assigned (65), and that number was at the same time the byte value stored. So in the ASCII world, character set and encoding coincide.
The problems started as soon as non-ASCII languages asked for inclusion. At this stage it was still about defining character set and encoding in one go: Latin1 was defined as containing certain characters, like “Ö”, and code points were assigned at the same time (214 for Ö). This worked because the 256 possible byte values were enough to cover each of the alphabets in question.
But with thousands and tens of thousands of kanji, this is no longer possible. What is needed is a separation: on the one hand a standardization of the characters, assigning each kanji a unique number (code point), and on the other hand a way to encode these numbers into bytes efficiently.
Character set standards
In practice there are only two relevant sources of character set standardization in Japan:
- JIS – in practice several Japanese Industrial Standards. For example JIS X 0201, the Japanese version of ASCII that adds 64 half-width katakana characters, or JIS X 0208, the most common kanji character set, containing 6,879 characters: 6,355 kanji and 524 other characters.
- UCS – the Universal Character Set contains more than one hundred thousand abstract characters and tries to include all writing systems in history. There is only a slight hiccup when it comes to Japanese, and that is the Han unification, where similar ideographs from different languages were mapped to the same code point.
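The Han unification is visible directly in the Unicode character names: a unified ideograph carries no per-language information, only a shared code point. A quick, purely illustrative check in Python:

```python
import unicodedata

# 日 is a single "unified" code point shared by Japanese, Chinese,
# and Korean text; its Unicode name reveals no language.
print(unicodedata.name("日"))  # CJK UNIFIED IDEOGRAPH-65E5
```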
The other question is how to represent these characters in a computer system, i.e., in bits and bytes. Encoding standards are often closely tied to character set standards, which creates some confusion.
When it comes to TeX (and probably all computing nowadays), the following encoding mechanisms are in common use. The first three (Shift JIS, JIS X 0202, EUC-JP) deal with character sets defined by JIS standards (at various levels), while the last one (UTF-8) is an encoding for the Unicode character set standard.
- SJIS or Shift JIS – an encoding for the character sets of JIS X 0201 and JIS X 0208 that is backward compatible with the former. Thus the encoding plays nicely with ASCII and the old industrial standard JIS X 0201 that was used in many computers, but it has some peculiarities that make it interact badly with general-purpose parsers (think: a byte that looks like an “a” might be a real “a”, or part of a multi-byte encoded glyph!).
- JIS X 0202 – often referred to simply as JIS when used for data in JIS X 0208. This encoding is well suited for transmission over channels that are only 7-bit capable, like email. Old versions of Emacs (and maybe even current ones?) used this encoding internally.
- EUC(-JP) – Extended Unix Code, actually a family of encodings for ISO 2022 compliant character sets. This encoding was dominant on Unix-like computer systems in Japan for many years.
- UTF-8 – the standard encoding for UCS, part of the Unicode Standard, currently used on practically all modern operating systems.
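The differences between these four encodings are easy to see by encoding the same string with each of them, for example in Python (codec names as used by Python’s standard library):

```python
text = "日本語"
for codec in ("utf-8", "shift_jis", "iso2022_jp", "euc_jp"):
    data = text.encode(codec)
    # iso2022_jp is the longest: it wraps the text in escape
    # sequences (1b 24 42 ... 1b 28 42) to switch character sets.
    print(f"{codec:11} {len(data):2} bytes: {data.hex(' ')}")
```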
Finally, let us look at an example, namely the following file:
Language Ä á 日本語 ひらがな
and look at the bytes it produces in different encodings. The following listings give the hex values of the above text.
In UTF-8 we see that the first line is the one-byte encoding, i.e., ASCII, of “Language”. The second line shows that the accented characters are encoded in two bytes each (separated by the space 0x20). Finally, both the kanji on the third line and the hiragana on the fourth line are encoded in three bytes each.
4c 61 6e 67 75 61 67 65
c3 84 20 c3 a1
e6 97 a5 e6 9c ac e8 aa 9e
e3 81 b2 e3 82 89 e3 81 8c e3 81 aa
In Shift JIS we see some interesting effects: First, as expected, the ASCII part – the first line – remains the same. But the second line exhibits a problem: the umlaut Ä is not included in the underlying character set, so Shift JIS cannot represent it. What my recode program did here is replace Ä with “A (0x22 0x41). The á, on the other hand, is in the character set of Shift JIS and is encoded in three bytes: 0x81 0x4c 0x61. Here you see the parser problem mentioned above: 0x4c = L and 0x61 = a, the same “La” that occurs at the beginning of the first line. So a simple byte search for “La” will report two occurrences, which is wrong.
The kanji on the third line and the hiragana on the fourth line each take up only two bytes.
4c 61 6e 67 75 61 67 65
22 41 20 81 4c 61
93 fa 96 7b 8c ea
82 d0 82 e7 82 aa 82 c8
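The false match described above can be reproduced with a naive byte-level search over the Shift JIS data (the hex dump from above; Python is used purely as an illustration):

```python
# The Shift JIS bytes of the example file (see the hex dump above)
data = bytes.fromhex(
    "4c 61 6e 67 75 61 67 65 22 41 20 81 4c 61 "
    "93 fa 96 7b 8c ea 82 d0 82 e7 82 aa 82 c8"
)
# A byte-level search finds "La" twice: once in "Language",
# once inside the multi-byte sequence 81 4c 61.
print(data.count(b"La"))                     # 2
# Decoding first and then searching gives the correct answer.
print(data.decode("shift_jis").count("La"))  # 1
```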
A very similar problem as in Shift JIS arises in ISO-2022-JP: As Ä cannot be encoded, it is replaced with “A by my recode command. The á, on the contrary, takes up 9 bytes! And both the kanji and the hiragana are encoded in a stateful, non-fixed-width way, with escape sequences (the 0x1b bytes) switching between character sets.
4c 61 6e 67 75 61 67 65
22 41 20 1b 24 42 21 2d 1b 28 42 61
1b 24 42 46 7c 4b 5c 38 6c 1b 28 42
1b 24 42 24 52 24 69 24 2c 24 4a 1b 28 42
Since EUC-JP also covers the much bigger JIS X 0212 character set, the Ä is not lost. Both diacritical Latin characters on line two are encoded in 3 bytes each. The kanji on line three and the hiragana on the last line are all encoded in two bytes each.
4c 61 6e 67 75 61 67 65
8f aa a3 20 8f ab a1
c6 fc cb dc b8 ec
a4 d2 a4 e9 a4 ac a4 ca
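This, too, can be checked with Python’s codecs (again purely as an illustration; the `euc_jp` codec includes the JIS X 0212 supplementary set):

```python
# Unlike Shift JIS, EUC-JP can also reach JIS X 0212, so Ä survives;
# it is encoded in three bytes (the single-shift byte 0x8f plus two more).
print("Ä".encode("euc_jp").hex(" "))
# The kanji and hiragana take two bytes each.
print("日本語".encode("euc_jp").hex(" "))  # c6 fc cb dc b8 ec
```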
From the above you can easily see why UTF-8 was initially not well received in Japan: all documents would grow in size due to the longer encoding per character, while the other encodings yield smaller files. Or maybe it was just the usual sysadmin attitude of “never change a running system” that made the switch to UTF-8 a long-running project in Japan.
And what am I using?
This is a good question, and the answer depends mostly on your operating system and how old it is. Here is some guidance:
- Linux: in most cases you will have UTF-8 as encoding; check the output of the locale command, in particular the value of LC_CTYPE, which in my case is en_US.utf8, indicating both the language and the encoding.
- MacOS: older versions (before Mac OS X) used Apple’s version of Shift JIS, called MacJapanese. From Mac OS X on (as far as I know) Macs use Unicode/UTF-8.
- Windows: again, older versions (before XP?) used Microsoft’s version of Shift JIS, called Code Page 932. Newer versions use Unicode internally, although Code Page 932 may still linger as the legacy code page for Japanese.
- BSD: as far as I know, UTF-8
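If you are unsure, the preferred system encoding can also be queried from a script; in Python, for example:

```python
import locale
import sys

# The encoding the system prefers for text files (e.g. "UTF-8"
# on most modern systems).
print(locale.getpreferredencoding())
# Python 3 itself always uses utf-8 as its internal default.
print(sys.getdefaultencoding())
```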
If you start writing Japanese text, you should be aware of which character encoding you are using, because it largely determines which engine you can use. The most widely accepted encoding is UTF-8: newer versions of pTeX also accept UTF-8, upTeX, XeTeX, and LuaTeX expect UTF-8 natively, and for BXcjkjatype you need UTF-8 input, too.
But be aware that many old files shared with you will be in one or another Shift JIS encoding; as is, these can only be compiled with pLaTeX, or, after conversion to UTF-8, with any engine.
But wait, you didn’t tell us about the ¥ versus \ thingy
From the Shift JIS Wikipedia page:
The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set’s backslash and tilde respectively.
That means that in Shift JIS, the ¥ sign is encoded with the same code as the backslash in ASCII. Now, TeX expects 0x5C as the escape character by default, and that is the reason why, on computers using Shift JIS as their base encoding (older Windows, older Macs, any Unix with the locale set to Shift JIS), TeX commands show up as ¥hfill etc.
After this somewhat hard-core entry, we will return to a hands-on tutorial in the next blog post, writing Hello World-like documents for all the engines.
Enjoy, and please leave remarks, suggestions for improvements, and corrections of my misunderstandings here!