NetHack and Unicode
Tags: nethack internals portability terminal libuncursed | Written by Alex Smith, 2015-01-09
Recently, the NetHack 3 series devteam have asked about how to bring Unicode to NetHack. Likewise, NetHack 4's Unicode support is somewhat lacking; it handles Unicode on output but not on input.
This blog post looks at the current situation, and at the various possibilities for resolving it.
Unicode on the map
One of the most common places to see Unicode output is when rendering the map. Most NetHack variants now do this, because it's one of the more portable ways of displaying the line-drawing characters which are commonly used for walls.
The simplest approach is to treat Unicode as an alternative to IBMgraphics or DECgraphics. This is what two of the most popular variants (UnNetHack and nethack.alt.org) do. The basic idea is to store everything in an 8-bit character set such as IBMgraphics (code page 437) internally, then convert to Unicode just before output.
The big advantage of this method is that it's only a very minimal change to the game core. (In fact, there is probably no need to change the game core at all when doing this. When using a rendering library such as libuncursed, the library will do the Unicode translation if necessary for the terminal, and the game core can think entirely in terms of code page 437 or the like.) The big disadvantage is that it's not very customizable; there are more than 256 possible renderings that might need to be drawn on the map ("glyphs" in the terminology of NetHack 3.4.3, or "tilekeys" in the terminology of NetHack 4), so the game core needs to either think in more than 8 bits (in which case it may as well use Unicode directly), or else artificially restrict the set of possible renderings.
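To make the translate-on-output idea concrete, here's a rough sketch (the table is abridged and the names are invented for illustration; this isn't code from any of the variants mentioned):

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the translate-on-output approach, assuming the game core
     * stores code page 437 bytes internally.  The table only shows a few
     * entries; a real one covers all 256. */
    static const uint32_t cp437_to_unicode[256] = {
        [0xB3] = 0x2502,  /* │ vertical wall   */
        [0xC4] = 0x2500,  /* ─ horizontal wall */
        [0xDA] = 0x250C,  /* ┌ wall corner     */
        [0xF0] = 0x2261,  /* ≡ iron bars       */
        [0xF4] = 0x2320,  /* ⌠ fountain        */
        [0xFA] = 0x00B7,  /* · floor of a room */
    };

    /* Convert one internal byte to a Unicode codepoint just before output. */
    static uint32_t output_codepoint(unsigned char c)
    {
        if (c < 0x80)
            return c;                       /* plain ASCII passes through */
        return cp437_to_unicode[c] ? cp437_to_unicode[c] : 0xFFFD;
    }

    int main(void)
    {
        printf("fountain -> U+%04X\n", (unsigned)output_codepoint(0xF4));
        return 0;
    }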
NetHack 4 uses a somewhat different method. The game core mostly doesn't deal with what map elements look like at all (actually, it is aware of a default ASCII rendering for each tilekey, but the only intended use of this is to draw the map in the dumplog when a character dies). The game core communicates with the interface in terms of "API keys" (which have a 1 to 1 correspondence with tilekeys, but are spelled differently for historical NitroHack-related reasons). The interface translates the API keys to tilekeys, and then looks up the appropriate renderings in the current tileset. When playing tiles, the tileset will contain tile images; when playing on a terminal (or fake terminal), the tileset will contain Unicode characters together with information on how to draw them (color, background, underlining, etc.).
Here's an excerpt from one of the NetHack 4 tileset files:
    iron bars: cyan '≡'
    fountain: blue '⌠'
    the floor of a room: bgblack gray regular '·'
    sub unlit the floor of a room: bgblack darkgray regular '·'
The information for the floor of a room specifies all the information that might be necessary, because it's drawn on the bottommost map layer: it has a black background, no underlining, is gray (or darkgray if unlit), and is drawn with a · character. Iron bars are not on the bottommost map layer, so they preserve background and underlining, but override the color to cyan and the character used to ≡.
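The layering can be pictured with a small invented sketch (the struct and the KEEP sentinel are made up for the example; NetHack 4's real cchar representation differs in detail):

    #define KEEP (-1)   /* "preserve whatever the layer below had" */

    struct rendering {
        int fgcolor;    /* KEEP or a foreground colour */
        int bgcolor;    /* KEEP or a background colour */
        int underline;  /* KEEP, 0 or 1                */
        int ch;         /* KEEP or a Unicode codepoint */
    };

    /* Draw an upper layer (e.g. iron bars) on top of a lower one
     * (e.g. the floor of a room), field by field. */
    struct rendering merge_layers(struct rendering below, struct rendering above)
    {
        struct rendering out = below;
        if (above.fgcolor   != KEEP) out.fgcolor   = above.fgcolor;
        if (above.bgcolor   != KEEP) out.bgcolor   = above.bgcolor;
        if (above.underline != KEEP) out.underline = above.underline;
        if (above.ch        != KEEP) out.ch        = above.ch;
        return out;
    }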
I'm generally pretty pleased with the NetHack 4 approach here; it's enabled things like per-branch colors for walls implemented entirely in the tileset, without needing to get the game core involved at all (beyond telling the interface which branch is being rendered). The drawback is that it's quite complex (using separate binaries basecchar and tilecompile to generate the tilesets).
There's one other problem, too, and this is the ; (farlook) and / (whatis) commands. Understanding the problem is easier with a quick history lesson.
NetHack's predecessor Hack had a map screen that looks very similar to NetHack's. However, far fewer objects existed in the game: in fact, it was possible to assign a different ASCII character to each of what NetHack 4 now calls a tilekey. So, for example, if you wanted to know what the d or $ in the starting room represented, you'd press / ("tell what this symbol represents" according to Hack's documentation), and then d or $ respectively.
The output looks like this:
    d       a dog
    $       a pile, pot or chest of gold
There is no ; command in Hack; none was necessary, because the same letter always represented the same thing. This came with many gameplay limitations, though. For example, it is impossible to determine whether a dog in Hack is tame or not, and all potions look the same unless they're in your inventory or you're standing on them.
NetHack added many more features to the game: in particular, many more monsters than exist in Hack. It typically distinguishes between the monsters using colour, something which is most obvious where dragons are concerned (a red D is a baby or adult red dragon, a green D is a baby or adult green dragon, and so on).
The / command in NetHack still supports the Hack method of operation:
    Specify unknown object by cursor? [ynq] (q) n
    Specify what? (type the word) D
    D       a dragon--More--
    More info? [yn] (n) n
However, as we can see, it's got a lot more complicated. The main reason for this is that telling the / command that we see a D is insufficient to get full information about it. Telling the game that the D is red would be one way to get more information, but even then, this would run into the problem with tame monsters that Hack had. NetHack thus allows the object to be specified by pointing it out with the cursor:
    Specify unknown object by cursor? [ynq] (q) y
    Please move the cursor to an unknown object.--More--
    (For instructions type a ?)
    D       a dragon (tame red dragon called example)--More--
    More info? [yn] (n) n
We now have all the information about the red D, rather than just information on what a D represents generally. For compatibility with Hack, though, we're told that D is a dragon before the game tells us about this specific dragon, something that is IMO just confusing and should probably be removed.
There are a lot of extraneous prompts here, so the ; command was introduced to do the same thing as the / command while making the most common choice for each question:

    Pick an object.
    D       a dragon (tame red dragon called example)
Farlooking with ; is very common in normal NetHack play, but the / command is almost unused nowadays. (Its most common use is to chain multiple farlooks by accepting each one with the , key, something that does not work with the ; command due to arbitrary restrictions. I've removed these restrictions for NetHack 4.)
NetHack 4 also adds two other methods of farlooking: moving the cursor over an object during a location prompt (either that of ;, or in any other command); and moving the mouse pointer over an object, when using a tall enough terminal (so that there's space beneath the map to say what the object is).
Anyway, the big offender here is the very first character of the farlook output. Here's the output from vanilla NetHack with DECgraphics:
    Pick an object.
    └       a wall
Oops. Our non-ASCII characters have leaked into pline, which is part of the game core.
This problem can be seen as a pretty minor one, because IMO the output format from ; and / is dubious anyway. Here's what NetHack 4 does in the same situations:
    Pick an object.
    A red dragon on the floor of a room.

    Pick an object.
    A wall.
NetHack 4's tiles mode can render the floor beneath the dragon, so farlook gives the same information (so that ASCII players do not end up at a disadvantage compared to tiles players as a result of layered memory).
The major change, though, is that the "this is a D, which means dragon" bit of the output has been removed entirely, because the game core doesn't know what a red dragon looks like; the tileset might be rendering it as a red underlined D (the default), but the tileset could also render it as any other character, or as a tiles image, or the like.
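A rough sketch of how such output can be produced without the core knowing about appearances at all (the two description functions here are hypothetical placeholders, not NetHack 4's real API):

    #include <stdio.h>

    /* Hypothetical name lookups; the descriptions come from monster and
     * terrain data, never from any character set. */
    const char *monster_description(int key);   /* e.g. "red dragon" */
    const char *terrain_description(int key);   /* e.g. "the floor of a room" */

    /* Builds "A red dragon on the floor of a room." from names alone. */
    void farlook_describe(char *buf, size_t bufsize, int mon, int terrain)
    {
        snprintf(buf, bufsize, "A %s on %s.",
                 monster_description(mon), terrain_description(terrain));
    }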
So my conclusion here is: the game core shouldn't be using Unicode for the map, because it shouldn't be using any character set for the map. Let the interface sort that out. This means you have to change the / and ; commands, but they were in need of a change as it is, and there's no way to make them work with tiles anyway. (Besides, hardly anyone knows how to type a └ to give it as an argument to /.)
Unicode in strings
Apart from the map, the other place where Unicode might potentially be useful is in strings inside the game: character names, fruit names, monster names, and perhaps messages printed by the game (currently NetHack is only officially in English, but the occasional non-ASCII character crops up even in English, e.g. "naïve"; interestingly, these spellings are dying out in favour of non-accented ones, perhaps due to the use of computers).
This is not currently a problem that most variants handle; for example, NetHack 4 currently uses the same text input routines as NitroHack, which disallow non-ASCII characters.
Implementing this involves three separate problems. One is reading input from the user, but this is not really the concern of the game core; reading Unicode needs to be done differently for each windowport anyway. Another is storing the strings in memory; this is the problem that nhmall's rgrn post is talking about. The remaining problem is processing such strings in situations such as makeplural and the wish parser.
There are three real options for storing the strings in memory, and the choice also affects how easy it is to rewrite the string handling functions:
long/int32_t/char32_t. These are all 32-bit-wide types. This gives us the "UTF-32" encoding of Unicode, which stores each codepoint in one 32-bit unit. (Unicode codepoints go from 0 to 1114111 inclusive, meaning that 16 bits is not enough to store the whole of Unicode; 32-bit types are the next-largest that are widely supported.)

The main advantage of this is that it maximises the amount of code that we'd expect to continue to work, given that one int32_t in Unicode acts quite similarly to one char in ASCII. However, there are various caveats:

- There are multiple different ways to express a 32-bit type in C. long has been around forever, but is often more than 32 bits (which might or might not be a problem depending on the circumstances). int32_t is C99, and might not be supported by some compilers that are particularly slow at updating to modern standards (after all, C99 was only released 15 years ago). char32_t is C11, and has the advantage that it's possible to write a char32_t * string literal:

      const char32_t *string = U"→ this is Unicode ←";

  It's unclear which of these types would be the best representation. (It would be possible for compile-time configuration to choose between long and int32_t; however, you lose most of the benefit of char32_t if you make it configurable, because then you can't use string literals.)

- When using ASCII, you can normally just disallow control characters; placing a newline or escape or the like in an object name isn't something that it's reasonable to support. Thus, you can safely assume that all characters in ASCII are one em wide on the screen (at least when using a fixed-width font; most NetHack windowports do). Unicode has legitimate uses for zero-width characters (combining characters, direction overrides, and the like). As such, either you'd sacrifice the ability to render these characters correctly, or else you'd need a more complex function for counting string width (losing most of the benefits of UTF-32 in the first place).

- Lookup tables would stop working (a 256-entry lookup table is sensible; a 1114111-entry lookup table less so). In particular, this means that strstri would need a complete rewrite. (That said, the only use of strstri, as far as I know, is for counting Elbereths, and Elbereth is pure ASCII.)
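As a minimal sketch of how similar such code stays to the existing char handling (u32len is an invented name, and, per the second caveat above, it counts codepoints rather than screen columns):

    #include <stddef.h>
    #include <uchar.h>          /* char32_t, C11 */

    /* strlen() analogue for UTF-32 strings: counts 32-bit units, which is
     * the number of codepoints but NOT the display width; combining
     * characters and other zero-width codepoints still need a separate,
     * more complex width function. */
    size_t u32len(const char32_t *s)
    {
        size_t n = 0;
        while (s[n])
            n++;
        return n;
    }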
wchar_t. This is a system-dependent type designed for holding Unicode characters. In practice, it is 32 bits wide on Linux and 16 bits wide on Windows; I haven't tested other operating systems.

The huge advantage of wchar_t is that it's been in the C standard longer than any other form of Unicode support, and is very widely supported by now. For example, almost every compiler accepts the following form of string literal, not just C11 compilers:

    const wchar_t *string = L"→ this is Unicode ←";

Additionally, wchar_t has the best library support of any of the options being considered here: there are wchar_t concatenation functions, length functions, substring search functions, and the like. (wchar_t is the option I chose in libuncursed, incidentally, mostly for compatibility with curses, but it's not an awful choice in its own right.) This means that if you only wanted to get a program working on Linux, a wchar_t would be outright superior to a char32_t.

The Windows API also requires that wchar_t is used for all Unicode input and output. It handles characters outside the 16-bit range by representing them as UTF-16, for backwards compatibility.

The drawbacks:

- Being a system-specific type, a wchar_t cannot be placed into a save file directly if you want that save file to be portable between platforms. This is not a huge problem: struct padding also differs between platforms, so as you have to pack and unpack the structures manually anyway, you can convert your wchar_ts to something else upon save.

- On Windows, a wchar_t is not large enough to hold all Unicode characters: it misses out on the "astral plane" characters above codepoint 65535. In the case of libuncursed, I didn't worry about this too much because the purpose of libuncursed is to produce lowest-common-denominator terminal output, and astral plane characters don't render correctly on many terminals anyway. The NetHack 3 series has more of a tradition of being able to configure the game to take advantage of unusual features that your terminal has, so it's more of a problem there.
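To show what that library support looks like in practice, here's a toy example (not NetHack code); note that the setlocale call is needed for the non-ASCII characters to survive the trip to the terminal:

    #include <locale.h>
    #include <wchar.h>

    int main(void)
    {
        /* Without this, wide output is converted using the "C" locale and
         * the non-ASCII characters fail to print on most systems. */
        setlocale(LC_ALL, "");

        wchar_t buf[64] = L"→ this is ";
        wcscat(buf, L"Unicode ←");               /* concatenation    */

        if (wcsstr(buf, L"Unicode"))             /* substring search */
            wprintf(L"%ls (%zu wide characters)\n", buf, wcslen(buf));
        return 0;
    }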
char or unsigned char, encoded as UTF-8. This is a multibyte encoding which represents ASCII as ASCII, and other Unicode characters as sequences of non-ASCII characters.

One obvious advantage here is that NetHack uses ASCII char * string literals anyway, meaning that this would reduce the amount of code that needed to be touched as far as possible: a UTF-8 string literal in C11 is written u8"→ this is Unicode ←", but if there are no non-ASCII characters (and there usually aren't), you can omit the u8 and get portability to old compilers too.

The drawbacks:

- UTF-8 is a multibyte encoding in which different characters have different widths, so a specialized string width function is absolutely required in cases like engraving (which cares a lot about the width of a string). It doesn't make sense for it to take twice as long to engrave éééé as it does to engrave eeee. (There's a sketch of such a function after this list.)

- Because UTF-8 is equivalent to ASCII in most simple cases, but not once non-ASCII characters start being used, the Unicode code would get a lot less testing than in the other cases here: the ASCII case (common) is different from the non-ASCII case (rare). This means that using the wrong string width function, or the like, might not be spotted for several months.

- The handling of UTF-8 in the standard library is mostly based on the locale conversion functions, which are user-configurable. This works fine if the user has configured them to use UTF-8, but not if they're using some legacy encoding. Trying to get functions like wcstombs to behave is often harder than just hand-rolling them yourself, meaning that UTF-8 basically has no viable library support.
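Here's the kind of specialised width function that first drawback calls for, as a sketch (utf8_width is an invented name); it makes the simplifying assumption that every codepoint is one column wide, ignoring combining and double-width characters:

    #include <stddef.h>

    /* Counts codepoints rather than bytes by skipping UTF-8 continuation
     * bytes (those of the form 10xxxxxx).  Under the one-column-per-
     * codepoint assumption, this is the string's display width. */
    size_t utf8_width(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }
    /* utf8_width("\xc3\xa9\xc3\xa9\xc3\xa9\xc3\xa9") == 4: "éééé" is eight
     * bytes but four characters, so engraving it need not take twice as
     * long as engraving "eeee". */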
As usual, there does not seem to be any obvious choice here. int32_t, char32_t, wchar_t, and char in UTF-8 all seem like somewhat viable options.
There are also some things to watch out for regardless of the encoding used. For example, the character name is often used as part of the filename, and Unicode in filenames is a pretty nonportable topic in its own right.
wchar_t is what the leaked NetHack code uses, incidentally. This looks like it's the best option unless the problem with the astral planes on Windows is a dealbreaker, but it may well be. (Having to use UTF-16 with wchar_t is something of a disaster; it gives you pretty much all the drawbacks listed above at once.) Before writing this, I was in favour of UTF-8, but now I'm more dubious about it; I think I'd rather have a widespread change in which, if something breaks, it breaks obviously, than a change in which everything appears to work and then breaks much later.
I'm not planning to implement Unicode input in NetHack 4 within the next couple of months, because there are other more urgent things to do first. However, I'll likely implement it eventually, and most likely I'll use wchar_t when I do so (it's possible that I'll use a 32-bit type, though that would mean backwards-incompatible changes to libuncursed). For the NetHack 3 series, a compile-time choice between long, int32_t, and char32_t would fit in best with the typical philosophy of that codebase (perhaps alongside a macro that either expands to a U prefix on a string literal, or to a call to an ASCII-to-UTF-32 conversion function), but the other choices don't seem that bad either.
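A sketch of what that might look like (the names HAVE_CHAR32_T, glyphchar, UNI and ascii_to_u32 are all invented for illustration, and long could stand in for int32_t in the fallback branch):

    #ifdef HAVE_CHAR32_T            /* C11 compiler available */
    # include <uchar.h>
    typedef char32_t glyphchar;
    /* UNI("foo") becomes the real UTF-32 literal U"foo". */
    # define UNI(s) U ## s
    #else                           /* older compiler: fall back to int32_t */
    # include <stdint.h>
    typedef int32_t glyphchar;
    /* Hypothetical runtime conversion from an ASCII string literal. */
    extern const glyphchar *ascii_to_u32(const char *);
    # define UNI(s) ascii_to_u32(s)
    #endif

(In the fallback branch UNI is a runtime call, so it can't appear in static initialisers; that's the cost of making the type configurable, as noted earlier.)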
If you have any comments or suggestions, let me know, either directly via email, or by posting comments on news aggregators that link to this blog post; I'll be reading and responding there, and summarizing the bulk of the sentiment I hear for the DevTeam. Perhaps there's some important point that everyone's missing that will make the choice obvious.