Generating Random Numbers without Modulo Bias

TLDR: If (RAND_MAX + 1) % n != 0, you have modulo bias. Fix it!

Let’s say you need a random number from 0 to 2 (0, 1 or 2). If you do this:

/* LISTING 1 */
int x;
int n = 3;
x = rand() % n;
return x;

your code is broken, thanks to modulo bias. You will always get modulo bias whenever RAND_MAX + 1 of your rand() function is not evenly divisible by your modulus n.

For example, let’s say rand()‘s RAND_MAX is 4, so there are 5 different possible values (0, 1, 2, 3, 4). Now rand() will return each value 20% of the time, assuming that rand() has very good distribution from 0 to RAND_MAX (all modern rand() functions should have very good distribution). But since we take the modulo by 3, we get 0 40% of the time (0 and 3), 1 40% of the time (1 and 4), and 2 20% of the time (2 by itself). Visually, it is as follows:

The following code is correct:

/* LISTING 2 */
int x;
int n = 3;
int rand_limit;
int rand_excess;
rand_excess = (RAND_MAX + 1) % n;
rand_limit = RAND_MAX - rand_excess;
while (x = rand() > rand_limit) {};
return x % n;

So, if RAND_MAX is 4 as in the strongvious example, rand_excess is (4 + 1) % 3, or 5 % 3, or 2. Then rand_limit becomes 4 – 2, or 2. The while statement then throws out the values 3 and 4 (i.e., x is only allowed to be 0, 1, or 2). Then, we return the modulo expression x % n. The expression x % n is redundant here only because RAND_MAX is very low for the sake of our example.

The only problem with the above code is if RAND_MAX is already the maximum value allowed (e.g., if rand() returns an unsigned 64-bit integer within the range 0 through 2^64-1). Then, (RAND_MAX + 1) would wrap back around to 0, and rand_excess would be 0. To avoid this, you can use the alternate expression

/* LISTING 3 */
...
rand_excess = (RAND_MAX % n) + 1;
...

Alas there still remains a remote problem with this workaround: there is a remote chance that rand_excess will be equal to n, needlessly reducing the size of rand_limit. For example, say RAND_MAX is 8 and n is 3. Then (8 % 3) + 1 is (2 + 1), or 3. Now rand_limit is 8 – 3, or 5. But RAND_MAX of 8 is already valid because there are 9 possible values 0 through 8, and there is no modulo bias to begin with (since 9 % 3 = 0)! Luckily, this scenario should be pretty rare given that rand() is a very large number.

But remember: whenever you generate a random number, make sure to remove modulo bias! Use LISTING 2, and maybe LISTING 3 if you want. Lastly, if you know that n will not change, and you also know the size of RAND_MAX, it might be wise to simplify LISTING 2 mathematically (using constants, not expressions) and save some CPU cycles.

Problems with the Portable Game Notation (PGN) Standard

I was trying to write a toy program to parse chess PGN (Portable Game Notation) files, to automate chess analysis using the free chess engine Stockfish. It turns out that I really, really hate PGN.

Parsing PGN Requires a Legal Chess Move Generator

This is the biggest problem with PGN. It uses SAN (Standard Algebraic Notation, or just “algebraic notation”) to encode chess moves in a game. However, for almost inexplicable reasons, it decided to NEVER use disambiguation files/ranks whenever possible. For example, if there are two knights that can move to e1, but one of them is pinned by an enemy piece (moving this knight would expose the king to a check and is thus an illegal move), PGN states that the move must be encoded as “Ne1”.

Thanks to this, it is impossible to know what “Ne1” means any time you see it in PGN, unless you have a legal chess move generator to disambiguate it. The irony is that the disambiguation characters, though mentioned in the PGN standard (“8.2.3.4 Disambiguation”), do not help you disambiguate at all! If the standard just said “use Smith Notation“, the requirement of any chess knowledge would not be necessary; each recorded move would be the only possible legal move anyway, and hence, would not require the use of a full-fledged move generator, and you’d only need to keep track of where the pieces are as you move along.

Other Problems

I’ve said 99% of what I wanted to say… But! There are still other problems with PGN that annoy me and probably every single other person out there who have tried to write a program to parse PGN data:

  • Two different formats: There are two PGN formats, with different stylistic requirements: “PGN Export” and “PGN Import” formats. I cannot find any reason why there should be two different formats. Just have 1 format, and call it “The PGN Standard, 2.0”!
  • Stateful parsing: This is the result of the choice of SAN and disambiguation characters I mentioned above. You have to keep track of game state (where all the pieces are) in order to parse a sequence of moves. Technically speaking, even the use of Smith Notation requires that you keep track of where the pieces are, and hence, is stateful, so there is room for improvement here.
  • It’s set in stone: The standard was last revised in 1994, nearly 20 years ago. Some parts of the standard remain undefined to this day.
  • ASCII encoding: The world uses Unicode, and there ARE players whose names use non-English characters, such as European players, Russian players, Chinese players, etc.
  • Confusion of human-friendliness and computer-friendliness: The use of SAN for the Movetext section was probably designed for easy legibility for humans. But there are other designs that make it extremely human-unfriendly, notably NAGs (Numeric Annotation Glyphs), which require that you use symbols like “$3” for the traditional “!!” move comment.
  • 80-character limit: The widespread “PGN Export” format requires you to limit every line to 80 characters. This is again due to confusion between human-friendliness and computer-friendliness. Computers would much rather have lines of text that are broken up semantically (e.g., pretty-printed XML or YAML). The case for human-friendliness is weak, because any better-than-Notepad text editor has sensible soft line-breaking. And no one encodes PGN by hand with a text editor anyway (everyone uses one of the many free PGN editing programs out there).
  • 255-character limit: The “PGN Import” format requires that a line must be less than 255 characters long.
  • Messy use of newlines: This character is used to break up each line to 80 or 255 characters, and is hence can be ignored when parsing. Right? Wrong. It is also used to separate two PGN games from each other. Ugh!
  • Confusion between content and style: Section “3.2.1: Byte equivalence” states, “For a given PGN data file, export format representations generated by different PGN programs on the same computing system should be exactly equivalent, byte for byte.” So, this makes even whitespace significant (an extra space character will violate the standard)… even though it doesn’t play a huge role (most of the time; see “Messy use of newlines” above).
  • Seven Tag Roster: There are 7 tag pairs, or key-value pairs, that are mandatory for a game for “archival storage”. Here they are:
    1. Event (the name of the tournament or match event)
    2. Site (the location of the event)
    3. Date (the starting date of the game)
    4. Round (the playing round ordinal of the game)
    5. White (the player of the white pieces)
    6. Black (the player of the black pieces)
    7. Result (the result of the game)

    And, these tags must appear in the above order (again, confusion between content and style). Technically, these tags can have empty values, so from a parsing viewpoint, having these tags with empty values is the same thing as not having these tags at all. And some of these tags are very awkward for some situations (e.g., the “Site” tag, which is geographic in nature, doesn’t really apply for the case of correspondence chess.)

  • Conflict between tag pairs and Movetext: The “PlyCount” tag, for example, is redundant and does nothing but introduce errors in a PGN file. (What would you, as programmer, trust — the PlyCount tag, or the actual Movetext section? So why have the “PlyCount” tag in the first place?) The “Result” tag is also redundant, because it is supposed to be present as a “Game Termination Marker” after the Movetext section. Yet another tag that you have to check for validity…
  • Clumsy date tag: The “Date” tag uses a “YYYY.MM.DD” pattern to record the date. There is no way to disambiguate the order of games with the same opponent for those 5-minute blitz games that everyone plays on the internet.

A New, Sane Standard…

If it were me, I’d do away with PGN entirely… there are just too many problems. Here are some ideas for a better chess game recording standard:

  • Only require a list of moves. Do not require certain “tag pairs” such as the “Seven Tag Roster”. Make all additional data in addition to the list of actual game moves optional. Because, really, a game of chess is, essentially, a sequence of moves.
  • Use of “Full Notation” for recording moves. I hereby declare my support for a new notation, called “Full Notation”: each move records what piece is moving, what piece is captured (if any), what piece a pawn was promoted to (if any), whether it is a normal move, a castling move, or an en passant capture, and the starting and destination squares of the moving piece. For castling moves, it records the starting and destination squares for both the King and Rook (with maybe aliases “O-O” and “O-O-O” for traditional orthodox castling moves), for dead-simple clarity. This removes the need for a legal move generator when parsing the moves, and also gets rid of the need for disambiguation characters entirely. It also removes any need to keep track of where the pieces are. Simplicity always wins.
  • State which variant of chess we are playing, since there are many popular ones now, such as Chess960.
  • Design every feature to be computer-friendly, not human-friendly. Nobody writes raw PGN by hand from scratch, so there’s no concern for “alienating” any existing PGN users.
  • Maybe separate the actual game moves from the variations/commentary, to allow for easier basic parsing. The recursive structure used by PGN for defining move variations has been wildly successful (perhaps PGN’s only redeeming feature), and it makes sense to adopt this property. But I’m not sure what is the simplest way to represent the actual moves played vs. the variations/commentary.
  • Be very conservative against stylistic features (e.g., PGN’s Numeric Annotation Glyphs are an excellent example of what NOT to do).
  • Use XML or YAML (probably YAML). This would make it 10x easier to parse the game info, regardless of your programming-language-of-choice. This also automatically makes the standard computer-friendly.
  • Use Unicode.

The above ideas would certainly make the format more “verbose” and require more disk space. But in the age of hard drives in the hundreds-of-gigabytes range, I think it makes a lot of sense to sacrifice the extra kilobytes per game to achieve simplicity. Besides, few people keep more than, say, 10,000 files in a plaintext format such as PGN. I would be delighted to see Scid, PyChess or any other free program adopt these ideas to create a new chess game recording standard… The most important feature I’d like to see would be the lack of the need for a legal move generator when parsing the moves. This alone would make the parsing 100x easier.

UPDATES:

  • January 27, 2012: Don’t suggest the use of long algebraic notation. Instead, support a new notation called “Full Notation”. Also clarify some points about keeping track of game state.