NetHack and Unicode
Tags: nethack internals portability terminal libuncursed | Written by Alex Smith, 2015-01-09
Recently, the NetHack 3 series devteam have asked about how to bring Unicode to NetHack. Likewise, NetHack 4's Unicode support is somewhat lacking; it handles Unicode on output but not on input.
This blog post looks at the current situation, and at the various possibilities for resolving it.
Unicode on the map
One of the most common places to see Unicode output is when rendering the map. Most NetHack variants now do this, because it's one of the more portable ways of displaying the line-drawing characters which are commonly used for walls.
The simplest approach is to treat Unicode as an alternative to IBMgraphics or DECgraphics. This is what two of the most popular variants (UnNetHack and nethack.alt.org) do. The basic idea is to store everything in an 8-bit character set such as IBMgraphics (code page 437) internally, then convert to Unicode just before output.
The big advantage of this method is that it's only a very minimal change to the game core. (In fact, there is probably no need to change the game core at all when doing this. When using a rendering library such as libuncursed, the library will do the Unicode translation if necessary for the terminal, and the game core can think entirely in terms of code page 437 or the like.) The big disadvantage is that it's not very customizable; there are more than 256 possible renderings that might need to be drawn on the map ("glyphs" in the terminology of NetHack 3.4.3, or "tilekeys" in the terminology of NetHack 4), so the game core needs to either think in more than 8 bits (in which case it may as well use Unicode directly), or else artificially restrict the set of possible renderings.
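To make the translate-on-output idea concrete, here's a rough sketch (the table is abridged and the names are invented for illustration; this isn't code from any of the variants mentioned):

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the translate-on-output approach, assuming the game core
     * stores code page 437 bytes internally.  The table only shows a few
     * entries; a real one covers all 256. */
    static const uint32_t cp437_to_unicode[256] = {
        [0xB3] = 0x2502,  /* │ vertical wall   */
        [0xC4] = 0x2500,  /* ─ horizontal wall */
        [0xDA] = 0x250C,  /* ┌ wall corner     */
        [0xF0] = 0x2261,  /* ≡ iron bars       */
        [0xF4] = 0x2320,  /* ⌠ fountain        */
        [0xFA] = 0x00B7,  /* · floor of a room */
    };

    /* Convert one internal byte to a Unicode codepoint just before output. */
    static uint32_t output_codepoint(unsigned char c)
    {
        if (c < 0x80)
            return c;                       /* plain ASCII passes through */
        return cp437_to_unicode[c] ? cp437_to_unicode[c] : 0xFFFD;
    }

    int main(void)
    {
        printf("fountain -> U+%04X\n", (unsigned)output_codepoint(0xF4));
        return 0;
    }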
NetHack 4 uses a somewhat different method. The game core mostly doesn't deal with what map elements look like at all (actually, it is aware of a default ASCII rendering for each tilekey, but the only intended use of this is to draw the map in the dumplog when a character dies). The game core communicates with the interface in terms of "API keys" (which have a 1 to 1 correspondence with tilekeys, but are spelled differently for historical NitroHack-related reasons). The interface translates the API keys to tilekeys, and then looks up the appropriate renderings in the current tileset. When playing tiles, the tileset will contain tile images; when playing on a terminal (or fake terminal), the tileset will contain Unicode characters together with information on how to draw them (color, background, underlining, etc.).
Here's an excerpt from one of the NetHack 4 tileset files:
    iron bars: cyan '≡'
    fountain: blue '⌠'
    the floor of a room: bgblack gray regular '·'
    sub unlit the floor of a room: bgblack darkgray regular '·'
The information for the floor of a room specifies all the information that might be necessary, because it's drawn on the bottommost map layer: it has a black background, no underlining, is gray (or darkgray if unlit), and is drawn with a · character. Iron bars are not on the bottommost map layer, so they preserve background and underlining, but override the color to cyan and the character used to ≡.
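The layering can be pictured with a small invented sketch (the struct and the KEEP sentinel are made up for the example; NetHack 4's real cchar representation differs in detail):

    #define KEEP (-1)   /* "preserve whatever the layer below had" */

    struct rendering {
        int fgcolor;    /* KEEP or a foreground colour */
        int bgcolor;    /* KEEP or a background colour */
        int underline;  /* KEEP, 0 or 1                */
        int ch;         /* KEEP or a Unicode codepoint */
    };

    /* Draw an upper layer (e.g. iron bars) on top of a lower one
     * (e.g. the floor of a room), field by field. */
    struct rendering merge_layers(struct rendering below, struct rendering above)
    {
        struct rendering out = below;
        if (above.fgcolor   != KEEP) out.fgcolor   = above.fgcolor;
        if (above.bgcolor   != KEEP) out.bgcolor   = above.bgcolor;
        if (above.underline != KEEP) out.underline = above.underline;
        if (above.ch        != KEEP) out.ch        = above.ch;
        return out;
    }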
I'm generally pretty pleased with the NetHack 4 approach here; it's enabled things like per-branch colors for walls implemented entirely in the tileset, without needing to get the game core involved at all (beyond telling the interface which branch is being rendered). The drawback is that it's quite complex (using separate binaries basecchar and tilecompile to generate the tilesets).
There's one other problem, too, and this is the ; (farlook) and / (whatis) commands. Understanding the problem is easier with a quick history lesson.
NetHack's predecessor Hack had a map screen that looks very similar to NetHack's. However, far fewer objects existed in the game: in fact, it was possible to assign a different ASCII character to each of what NetHack 4 now calls a tilekey. So, for example, if you wanted to know what the d or $ in the starting room represented, you'd press / ("tell what this symbol represents" according to Hack's documentation), and then d or $ respectively.
The output looks like this:
    d       a dog
    $       a pile, pot or chest of gold
There is no ; command in Hack; none was necessary, because the same letter always represented the same thing. This came with many gameplay limitations, though. For example, it is impossible to determine whether a dog in Hack is tame or not, and all potions look the same unless they're in your inventory or you're standing on them.
NetHack added many more features to the game: in particular, many more monsters than exist in Hack. It typically distinguishes between the monsters using colour, something which is most obvious where dragons are concerned (a red D is a baby or adult red dragon, a green D is a baby or adult green dragon, and so on).
The / command in NetHack still supports the Hack method of operation:
    Specify unknown object by cursor? [ynq] (q) n
    Specify what? (type the word) D
    D       a dragon--More--
    More info? [yn] (n) n
However, as we can see, it's got a lot more complicated. The main reason for this is that telling the / command that we see a D is insufficient to get full information about it. Telling the game that the D is red would be one way to get more information, but even then, this would run into the problem with tame monsters that Hack had. NetHack thus allows the object to be specified by pointing it out with the cursor:
    Specify unknown object by cursor? [ynq] (q) y
    Please move the cursor to an unknown object.--More--
    (For instructions type a ?)
    D       a dragon (tame red dragon called example)--More--
    More info? [yn] (n) n
We now have all the information about the red D, rather than just information on what a D represents generally. For compatibility with Hack, though, we're told that D is a dragon before the game tells us about this specific dragon, something that is IMO just confusing and should probably be removed.
There are a lot of extraneous prompts here, so the ; command was introduced to do the same thing as the / command while making the most common choice for each question:

    Pick an object.
    D       a dragon (tame red dragon called example)
Farlooking with ; is very common in normal NetHack play, but the / command is almost unused nowadays. (Its most common use is to chain multiple farlooks by accepting each one with the , key, something that does not work with the ; command due to arbitrary restrictions. I've removed these restrictions for NetHack 4.)
NetHack 4 also adds two other methods of farlooking: moving the cursor over an object during a location prompt (either that of ;, or in any other command); and moving the mouse pointer over an object, when using a tall enough terminal (so that there's space beneath the map to say what the object is).
Anyway, the big offender here is the very first character of the farlook output. Here's the output from vanilla NetHack with DECgraphics:
    Pick an object.
    └       a wall
Oops. Our non-ASCII characters have leaked into pline, which is part of the game core.
This problem can be seen as a pretty minor one, because IMO the output format from ; and / is dubious anyway. Here's what NetHack 4 does in the same situations:
    Pick an object.
    A red dragon on the floor of a room.

    Pick an object.
    A wall.
NetHack 4's tiles mode can render the floor beneath the dragon, so farlook gives the same information (so that ASCII players do not end up at a disadvantage compared to tiles players as a result of layered memory).
The major change, though, is that the "this is a D, which means dragon" bit of the output has been removed entirely, because the game core doesn't know what a red dragon looks like; the tileset might be rendering it as a red underlined D (the default), but the tileset could also render it as any other character, or as a tiles image, or the like.
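A rough sketch of how such output can be produced without the core knowing about appearances at all (the two description functions here are hypothetical placeholders, not NetHack 4's real API):

    #include <stdio.h>

    /* Hypothetical name lookups; the descriptions come from monster and
     * terrain data, never from any character set. */
    const char *monster_description(int key);   /* e.g. "red dragon" */
    const char *terrain_description(int key);   /* e.g. "the floor of a room" */

    /* Builds "A red dragon on the floor of a room." from names alone. */
    void farlook_describe(char *buf, size_t bufsize, int mon, int terrain)
    {
        snprintf(buf, bufsize, "A %s on %s.",
                 monster_description(mon), terrain_description(terrain));
    }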
So my conclusion here is: the game core shouldn't be using Unicode for the map, because it shouldn't be using any character set for the map. Let the interface sort that out. This means you have to change the / and ; commands, but they were in need of a change as it is, and there's no way to make them work with tiles anyway. (Besides, hardly anyone knows how to type a └ to give it as an argument to /.)
Unicode in strings
Apart from the map, the other place where Unicode might potentially be useful is in strings inside the game: character names, fruit names, monster names, and perhaps messages printed by the game (currently NetHack is only officially in English, but the occasional non-ASCII character crops up even in English, e.g. "naïve"; interestingly, these spellings are dying out in favour of non-accented ones, perhaps due to the use of computers).
This is not currently a problem that most variants handle; for example, NetHack 4 currently uses the same text input routines as NitroHack, which disallow non-ASCII characters.
Implementing this involves three separate problems. One is reading input from the user, but this is not really the concern of the game core; reading Unicode needs to be done differently for each windowport anyway. Another is storing the strings in memory; this is the problem that nhmall's rgrn post is talking about. The remaining problem is processing such strings in situations such as makeplural and the wish parser.
There are three real options for storing the strings in memory, and the choice also affects how easy it is to rewrite the string handling functions:
long/int32_t/char32_t. These are all 32-bit-wide types. This gives us the "UTF-32" encoding of Unicode, which stores each codepoint in one 32-bit unit. (Unicode codepoints go from 0 to 1114111 inclusive, meaning that 16 bits is not enough to store the whole of Unicode; 32-bit types are the next-largest that are widely supported.)

The main advantage of this is that it maximises the amount of code that we'd expect to continue to work, given that one int32_t in Unicode acts quite similarly to one char in ASCII. However, there are various caveats:

- There are multiple different ways to express a 32-bit type in C. long has been around forever, but is often more than 32 bits (which might or might not be a problem depending on the circumstances). int32_t is C99, and might not be supported by some compilers that are particularly slow at updating to modern standards (after all, C99 was only released 15 years ago). char32_t is C11, and has the advantage that it's possible to write a char32_t * string literal:

      const char32_t *string = U"→ this is Unicode ←";

  It's unclear which of these types would be the best representation. (It would be possible for compile-time configuration to choose between long and int32_t; however, you lose most of the benefit of char32_t if you make it configurable, because then you can't use string literals.)

- When using ASCII, you can normally just disallow control characters; placing a newline or escape or the like in an object name isn't something that it's reasonable to support. Thus, you can safely assume that all characters in ASCII are one em wide on the screen (at least when using a fixed-width font; most NetHack windowports do). Unicode has legitimate uses for zero-width characters (combining characters, direction overrides, and the like). As such, either you'd sacrifice the ability to render these characters correctly, or else you'd need a more complex function for counting string width (losing most of the benefits of UTF-32 in the first place).

- Lookup tables would stop working (a 256-entry lookup table is sensible; a 1114111-entry lookup table less so). In particular, this means that strstri would need a complete rewrite. (That said, the only use of strstri, as far as I know, is for counting Elbereths, and Elbereth is pure ASCII.)
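As a minimal sketch of how similar such code stays to the existing char handling (u32len is an invented name, and, per the second caveat above, it counts codepoints rather than screen columns):

    #include <stddef.h>
    #include <uchar.h>          /* char32_t, C11 */

    /* strlen() analogue for UTF-32 strings: counts 32-bit units, which is
     * the number of codepoints but NOT the display width; combining
     * characters and other zero-width codepoints still need a separate,
     * more complex width function. */
    size_t u32len(const char32_t *s)
    {
        size_t n = 0;
        while (s[n])
            n++;
        return n;
    }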
wchar_t. This is a system-dependent type designed for holding Unicode characters. In practice, it is 32 bits wide on Linux and 16 bits wide on Windows; I haven't tested other operating systems.

The huge advantage of wchar_t is that it's been in the C standard longer than any other form of Unicode support, and is very widely supported by now. For example, almost every compiler accepts the following form of string literal, not just C11 compilers:

    const wchar_t *string = L"→ this is Unicode ←";

Additionally, wchar_t has the best library support of any of the options being considered here: there are wchar_t concatenation functions, length functions, substring search functions, and the like. (wchar_t is the option I chose in libuncursed, incidentally, mostly for compatibility with curses, but it's not an awful choice in its own right.) This means that if you only wanted to get a program working on Linux, a wchar_t would be outright superior to a char32_t.

The Windows API also requires that wchar_t is used for all Unicode input and output. It handles characters outside the 16-bit range by representing them as UTF-16, for backwards compatibility.

The drawbacks:

- Being a system-specific type, a wchar_t cannot be placed into a save file directly if you want that save file to be portable between platforms. This is not a huge problem: struct padding also differs between platforms, so as you have to pack and unpack the structures manually anyway, you can convert your wchar_ts to something else upon save.

- On Windows, a wchar_t is not large enough to hold all Unicode characters: it misses out on the "astral plane" characters above codepoint 65535. In the case of libuncursed, I didn't worry about this too much because the purpose of libuncursed is to produce lowest-common-denominator terminal output, and astral plane characters don't render correctly on many terminals anyway. The NetHack 3 series has more of a tradition of being able to configure the game to take advantage of unusual features that your terminal has, so it's more of a problem there.
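To show what that library support looks like in practice, here's a toy example (not NetHack code); note that the setlocale call is needed for the non-ASCII characters to survive the trip to the terminal:

    #include <locale.h>
    #include <wchar.h>

    int main(void)
    {
        /* Without this, wide output is converted using the "C" locale and
         * the non-ASCII characters fail to print on most systems. */
        setlocale(LC_ALL, "");

        wchar_t buf[64] = L"→ this is ";
        wcscat(buf, L"Unicode ←");               /* concatenation    */

        if (wcsstr(buf, L"Unicode"))             /* substring search */
            wprintf(L"%ls (%zu wide characters)\n", buf, wcslen(buf));
        return 0;
    }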
char or unsigned char, encoded as UTF-8. This is a multibyte encoding which represents ASCII as ASCII, and other Unicode characters as sequences of non-ASCII characters.

One obvious advantage here is that NetHack uses ASCII char * string literals anyway, meaning that this would reduce the amount of code that needed to be touched as far as possible: a UTF-8 string literal in C11 is written u8"→ this is Unicode ←", but if there are no non-ASCII characters (and there usually aren't), you can omit the u8 and get portability to old compilers too.

The drawbacks:

- UTF-8 is a multibyte encoding in which different characters have different widths, so a specialized string width function is absolutely required in cases like engraving (which cares a lot about the width of a string). It doesn't make sense for it to take twice as long to engrave éééé as it does to engrave eeee. (There's a sketch of such a function after this list.)

- Because UTF-8 is equivalent to ASCII in most simple cases, but not once non-ASCII characters start being used, the Unicode code would get a lot less testing than in the other cases here: the ASCII case (common) is different from the non-ASCII case (rare). This means that using the wrong string width function, or the like, might not be spotted for several months.

- The handling of UTF-8 in the standard library is mostly based on the locale conversion functions, which are user-configurable. This works fine if the user has configured them to use UTF-8, but not if they're using some legacy encoding. Trying to get functions like wcstombs to behave is often harder than just hand-rolling them yourself, meaning that UTF-8 basically has no viable library support.
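Here's the kind of specialised width function that first drawback calls for, as a sketch (utf8_width is an invented name); it makes the simplifying assumption that every codepoint is one column wide, ignoring combining and double-width characters:

    #include <stddef.h>

    /* Counts codepoints rather than bytes by skipping UTF-8 continuation
     * bytes (those of the form 10xxxxxx).  Under the one-column-per-
     * codepoint assumption, this is the string's display width. */
    size_t utf8_width(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }
    /* utf8_width("\xc3\xa9\xc3\xa9\xc3\xa9\xc3\xa9") == 4: "éééé" is eight
     * bytes but four characters, so engraving it need not take twice as
     * long as engraving "eeee". */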
As usual, there does not seem to be any obvious choice here. int32_t, char32_t, wchar_t, and char in UTF-8 all seem like somewhat viable options.
There are also some things to watch out for regardless of the encoding used. For example, the character name is often used as part of the filename, and Unicode in filenames is a pretty nonportable topic in its own right.
wchar_t is what the leaked NetHack code uses, incidentally. This looks like it's the best option unless the problem with the astral planes on Windows is a dealbreaker, but it may well be. (Having to use UTF-16 with wchar_t is something of a disaster; it gives you pretty much all the drawbacks listed above at once.) Before writing this, I was in favour of UTF-8, but now I'm more dubious about it; I think I'd rather have a widespread change in which, if something breaks, it breaks obviously, than a change in which everything appears to work and then breaks much later.
I'm not planning to implement Unicode input in NetHack 4 within the next couple of months, because there are other more urgent things to do first. However, I'll likely implement it eventually, and most likely I'll use wchar_t when I do so (it's possible that I'll use a 32-bit type, though that would mean backwards-incompatible changes to libuncursed). For the NetHack 3 series, a compile-time choice between long, int32_t, and char32_t would fit in best with the typical philosophy of that codebase (perhaps alongside a macro that either expands to a U prefix on a string literal, or to a call to an ASCII-to-UTF-32 conversion function), but the other choices don't seem that bad either.
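A sketch of what that might look like (the names HAVE_CHAR32_T, glyphchar, UNI and ascii_to_u32 are all invented for illustration, and long could stand in for int32_t in the fallback branch):

    #ifdef HAVE_CHAR32_T            /* C11 compiler available */
    # include <uchar.h>
    typedef char32_t glyphchar;
    /* UNI("foo") becomes the real UTF-32 literal U"foo". */
    # define UNI(s) U ## s
    #else                           /* older compiler: fall back to int32_t */
    # include <stdint.h>
    typedef int32_t glyphchar;
    /* Hypothetical runtime conversion from an ASCII string literal. */
    extern const glyphchar *ascii_to_u32(const char *);
    # define UNI(s) ascii_to_u32(s)
    #endif

(In the fallback branch UNI is a runtime call, so it can't appear in static initialisers; that's the cost of making the type configurable, as noted earlier.)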
If you have any comments or suggestions, let me know, either directly via email, or by posting comments on news aggregators that link to this blog post; I'll be reading and responding there, and summarizing the bulk of the sentiment I hear for the DevTeam. Perhaps there's some important point that everyone's missing that will make the choice obvious.