This blog post is also published at nethack4.org.

NetHack and Unicode

Tags: nethack internals portability terminal libuncursed | Written by Alex Smith, 2015-01-09

Recently, the NetHack 3 series devteam have asked about how to bring Unicode to NetHack. Likewise, NetHack 4's Unicode support is somewhat lacking; it handles Unicode on output but not on input.

This blog post looks at the current situation, and at the various possibilities for resolving it.

Unicode on the map

One of the most common places to see Unicode output is when rendering the map. Most NetHack variants now do this, because it's one of the more portable ways of displaying the line-drawing characters which are commonly used for walls.

The simplest approach is to treat Unicode as an alternative to IBMgraphics or DECgraphics. This is what two of the most popular variants (UnNetHack and nethack.alt.org) do. The basic idea is to store everything in an 8-bit character set such as IBMgraphics (code page 437) internally, then convert to Unicode just before output.

The big advantage of this method is that it requires only a very minimal change to the game core. (In fact, there is probably no need to change the game core at all when doing this. When using a rendering library such as libuncursed, the library will do the Unicode translation if necessary for the terminal, and the game core can think entirely in terms of code page 437 or the like.) The big disadvantage is that it's not very customizable; there are more than 256 possible renderings that might need to be drawn on the map ("glyphs" in the terminology of NetHack 3.4.3, or "tilekeys" in the terminology of NetHack 4), so the game core needs to either think in more than 8 bits (in which case it may as well use Unicode directly), or else artificially restrict the set of possible renderings.
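As a rough illustration of the "store 8 bits, convert just before output" idea, here is a minimal Python sketch. It is not taken from any variant's source: render_row is a hypothetical name, and Python's built-in cp437 codec stands in for whatever translation table a real interface or rendering library would use.

```python
# Sketch only: the game core keeps map cells as code page 437 bytes, and
# the interface converts them to Unicode at the last moment.
# render_row is a hypothetical helper, not part of any NetHack variant.

def render_row(cp437_row: bytes) -> str:
    """Convert one row of 8-bit map cells into a Unicode string."""
    return cp437_row.decode("cp437")

# Top-left corner, two horizontal walls, top-right corner, as CP437 bytes.
row = bytes([0xDA, 0xC4, 0xC4, 0xBF])

# Show the resulting code points rather than the characters themselves,
# so the output does not depend on the terminal's encoding.
print([f"U+{ord(c):04X}" for c in render_row(row)])
# prints ['U+250C', 'U+2500', 'U+2500', 'U+2510']
```

The limitation described above is visible in the types: each cell is a single byte, so at most 256 distinct renderings can ever come out of this function.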

NetHack 4 uses a somewhat different method. The game core mostly doesn't deal with what map elements look like at all (actually, it is aware of a default ASCII rendering for each tilekey, but the only intended use of this is to draw the map in the dumplog when a character dies). The game core communicates with the interface in terms of "API keys" (which have a 1 to 1 correspondence with tilekeys, but are spelled differently for historical NitroHack-related reasons). The interface translates the API keys to tilekeys, and then looks up the appropriate renderings in the current tileset. When playing tiles, the tileset will contain tile images; when playing on a terminal (or fake terminal), the tileset will contain Unicode characters together with information on how to draw them (color, background, underlining, etc.).

Here's an excerpt from one of the NetHack 4 tileset files:

iron bars: cyan '≡'
fountain: blue '⌠'
the floor of a room: bgblack gray regular '·'
sub unlit the floor of a room: bgblack darkgray regular '·'

The entry for the floor of a room specifies everything that might be necessary, because it's drawn on the bottommost map layer: it has a black background, no underlining, is gray (or darkgray if unlit), and is drawn with a · character. Iron bars are not on the bottommost map layer, so they preserve background and underlining, but override the color to cyan and the character used to ≡.
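The layering rule can be sketched as a merge from the bottom layer upward. This is only an illustration of the idea in Python: the dictionary representation, the attribute names and the compose function are assumptions made for the example, not the actual NetHack 4 tileset format or code.

```python
# Hypothetical in-memory form of two tileset entries. The real tileset
# format and its compiler are not reproduced here.
FLOOR = {"char": "\u00b7", "fg": "gray", "bg": "black", "underline": False}
IRON_BARS = {"char": "\u2261", "fg": "cyan"}  # bg and underline left unset

def compose(layers):
    """Merge renderings from the bottommost layer upward.

    A higher layer overrides only the attributes it specifies; anything
    it leaves out is inherited from the layers beneath it.
    """
    cell = {}
    for layer in layers:
        cell.update(layer)
    return cell

# Iron bars drawn over a room floor: cyan U+2261 on the floor's background.
cell = compose([FLOOR, IRON_BARS])
```

Here cell ends up with the character and color from the iron bars, and the black background and lack of underlining from the floor.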

I'm generally pretty pleased with the NetHack 4 approach here; it's enabled things like per-branch colors for walls implemented entirely in the tileset, without needing to get the game core involved at all (beyond telling the interface which branch is being rendered). The drawback is that it's quite complex (using separate binaries basecchar and tilecompile to generate the tilesets).

There's one other problem, too, and this is the ; (farlook) and / (whatis) commands. Understanding the problem is easier with a quick history lesson.

NetHack's predecessor Hack had a map screen that looks very similar to NetHack's. However, there were far fewer objects in the game: in fact, it was possible to assign a different ASCII character to each of what NetHack 4 now calls a tilekey. So, for example, if you wanted to know what the d or $ in the starting room represented, you'd press / ("tell what this symbol represents" according to Hack's documentation), and then d or $ respectively. The output looks like this:

d       a dog
$       a pile, pot or chest of gold

There is no ; command in Hack; none was necessary, because the same letter always represented the same thing. This came with many gameplay limitations, though. For example, it is impossible to determine whether a dog in Hack is tame or not, and all potions look the same unless they're in your inventory or you're standing on them.

NetHack added many more features to the game: in particular, many more monsters than exist in Hack. It typically distinguishes between the monsters using colour, something which is most obvious where dragons are concerned (a red D is a baby or adult red dragon, a green D is a baby or adult green dragon, and so on).

The / command in NetHack still supports the Hack method of operation:

Specify unknown object by cursor? [ynq] (q) n
Specify what? (type the word) D
D       a dragon--More--
More info? [yn] (n) n

However, as we can see, it's got a lot more complicated. The main reason for this is that telling the / command that we see a D is insufficient to get full information about it. Telling the game that the D is red would be one way to get more information, but even then, this would run into the problem with tame monsters that Hack had. NetHack thus allows the object to be specified via pointing it out with the cursor:

Specify unknown object by cursor? [ynq] (q) y
Please move the cursor to an unknown object.--More--
(For instructions type a ?)
D       a dragon (tame red dragon called example)--More--
More info? [yn] (n) n

We now have all the information about the red D, rather than just information on what a D represents generally. For compatibility with Hack, though, we're told that D is a dragon before the game tells us about this specific dragon, something that is IMO just confusing and should probably be removed.

There are a lot of extraneous prompts here, so the ; command was introduced to do the same thing as the / command but making the most common choice for each question:

Pick an object.
D       a dragon (tame red dragon called example)

Farlooking with ; is very common in normal NetHack play, but the / command is almost unused nowadays. (Its most common use is to chain multiple farlooks via accepting with ,, something that does not work with the ; command due to arbitrary restrictions. I've removed these restrictions for NetHack 4.)

NetHack 4 also adds two other methods of farlooking: moving the cursor over an object during a location prompt (either that of ;, or in any other command); and moving the mouse pointer over an object, when using a high terminal (so that there's space beneath the map to say what the object is).

Anyway, the big offender here is the very first character on the farlook output. Here's the output from vanilla NetHack with DECgraphics:

Pick an object.
└       a wall

Oops. Our non-ASCII characters have leaked into pline, which is part of the game core.

This problem can be seen as a pretty minor one, because IMO the output format from ; and / is dubious anyway. Here's what NetHack 4 does in the same situations:

Pick an object.  A red dragon on the floor of a room.
Pick an object.  A wall.

NetHack 4 tiles can render the floor beneath the dragon, so farlook gives the same information (so that ASCII players do not have a disadvantage compared to tiles players as a result of layered memory). The major change, though, is that the "this is a D, which means dragon" bit of the output has been removed entirely, because the game core doesn't know what a red dragon looks like; the tileset might be rendering it as a red underlined D (the default), but the tileset could also render it as any other character, or as a tiles image, or the like.

So my conclusion here is: the game core shouldn't be using Unicode for the map because it shouldn't be using any character set for the map. Let the interface sort that out. This means you have to change the / and ; commands, but they were in need of a change as it is, and there's no way to make them work with tiles anyway. (Besides, hardly anyone knows how to type a └ to give it as an argument to /.)

Unicode in strings

Apart from the map, the other place where Unicode might potentially be useful is in strings inside the game: character names, fruit names, monster names, and perhaps messages printed by the game (currently NetHack is only officially in English, but the occasional non-ASCII character crops up even in English, e.g. "naïve"; interestingly, these spellings are dying out in favour of non-accented ones, perhaps due to the use of computers).

This is not currently a problem that most variants handle; for example, NetHack 4 currently uses the same text input routines as NitroHack, which disallow non-ASCII characters.

In order to implement this, there are three separate problems. One is reading input from the user, but this is not really the concern of the game core; reading Unicode needs to be done differently for each windowport anyway. Another is storing the strings in memory; this is the problem that nhmall's rgrn post is talking about. The remaining problem is processing such strings in situations such as makeplural and the wish parser.

When storing the strings in memory, there are three real options, which also affect how easy it is to rewrite string handling functions:

As usual, there does not seem to be any obvious choice here. int32_t, char32_t, wchar_t, and char in UTF-8 all seem like somewhat viable options.

There are also some things to watch out for regardless of the encoding used. For example, the character name is often used as part of the filename, and Unicode in filenames is a pretty nonportable topic in its own right.

wchar_t is what the leaked NetHack code uses, incidentally. This looks like it's the best option unless the problem with the astral planes on Windows is a dealbreaker, but it may well be. (Having to use UTF-16 with wchar_t is something of a disaster; it gives you pretty much all the drawbacks listed above at once.) Before writing this, I was in favour of UTF-8, but now I'm more dubious about it; I think I'd rather have a widespread change in which anything that breaks does so obviously than a change in which everything appears to work and then breaks much later.
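Two of the failure modes mentioned here can be made concrete with a small Python sketch. It is an illustration rather than NetHack code: the sample strings and the cuts_cleanly helper are invented for the example, and Python's codecs stand in for C buffers of char (UTF-8), 16-bit wchar_t (UTF-16, as on Windows) and a 32-bit type.

```python
# Assumption: this string stands in for any user-supplied text, such as a
# character name or fruit name.
name = "na\u00efve"              # 5 code points; U+00EF is i with diaeresis
utf8 = name.encode("utf-8")

code_points = len(name)          # 5
utf8_bytes = len(utf8)           # 6: the accented letter takes two bytes

def cuts_cleanly(data: bytes, limit: int) -> bool:
    """Report whether truncating UTF-8 data to `limit` bytes keeps it valid."""
    try:
        data[:limit].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A character outside the Basic Multilingual Plane (an "astral" character).
clef = "\U0001d11e"              # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2   # 2: a surrogate pair
utf32_units = len(clef.encode("utf-32-le")) // 4   # 1
```

Code that treats UTF-8 bytes as characters passes every ASCII-only test and only misbehaves once someone enters an accented name, for instance when a fixed-size buffer cuts the string after three bytes (cuts_cleanly(utf8, 3) is False). The 16-bit case has the same shape: one unit per character holds until an astral character arrives and needs two.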

I'm not planning to implement Unicode input in NetHack 4 within the next couple of months, because there are other more urgent things to do first. However, I'll likely implement it eventually, and most likely I'll use wchar_t when I do so (it's possible that I'll use a 32-bit type, though that would mean backwards-incompatible changes to libuncursed). For the NetHack 3 series, a compile-time choice between long, int32_t, and char32_t would fit best with the typical philosophy of that codebase (perhaps alongside a macro that expands either to a U prefix on a string literal, or to a call to an ASCII-to-UTF-32 conversion function), but the other choices don't seem that bad either.

If you have any comments or suggestions, let me know, either directly via email, or by posting comments on news aggregators that link to this blog post; I'll be looking and responding there, and summarizing the bulk of the sentiment I hear about for the DevTeam. Perhaps there's some important point that everyone's missing that will make the choice obvious.