Compression for chess game data... what to use?

The place for codemasters or beginners to talk about programming any language for the Spectrum.
Post Reply
Nomad
Manic Miner
Posts: 600
Joined: Thu Dec 28, 2017 12:38 pm

Compression for chess game data... what to use?

Post by Nomad »

Was going to pick the brains of you fellas. I looked at some compression code - but its got to the point of analysis paralysis :lol: I could do with some outside opinions on this.

I was thinking of going with zx7.

The crap chess game database - I want to compress the data on disk, then just use the ram disk to load the data in. when needed.

Code: Select all

Event "World Cup"]
[Site "?"]
[Date "2013.08.12"]
[Round "?"]
[White "Duda, Jan-Krzysztof"]
[Black "Ivanchuk, Vassily"]
[ECO "C13"]
[WhiteElo "2534"]
[BlackElo "2731"]
[Result "0-1"]

1. e4 e6 2. d4 d5 3. Nc3 Nf6 4. Bg5 dxe4 5. Nxe4 Nbd7 6. Nf3 h6 7. Nxf6+ Nxf6 8. Be3 Bd6 9. Bd3 b6 10. Qe2 Bb7 11. O-O-O Nd5 12. Kb1 Qf6 13. Nd2 O-O 14. Ne4 Qe7 15. c3 Bf4 16. g3 Bxe3 17. fxe3 e5 18. Rhe1 Rad8 19. Nd2 Nf6 20. Ba6 Ba8 21. Qc4 Rfe8 22. Bb5 c6 23. Bxc6 Rc8 24. d5 Nxd5 25. Qxd5 Rxc6 26. Qb5 Rd8 27. Ne4 Rcc8 28. Qa4 Qe6 29. Qc2 Qg4 30. Nf2 Qh5 31. h4 Qf3 32. g4 Rxd1+ 33. Rxd1 Qxe3 34. Qf5 Rf8 35. Rd7 Bf3 36. Nd3 Qg1+ 37. Nc1 Bxg4 

0-1
So the majority of the file is going to be data like this. What would be the best compressor to use where the priority is data compression over speed? I want to include as many games as possible on the disk.

I been trying to come up with a better way to organize the data but I figured in the end just keep the data in a standard format. When I think of the effort its going to take to re-encode the file format to something custom its probably not worth it.

The plan was a user would consult a printed index of games, then pick the disk image with the game of interest and then crap chess would decompress the data and play through the game on the screen.
User avatar
Mike Davies
Microbot
Posts: 137
Joined: Mon Nov 13, 2017 10:11 am

Re: Compression for chess game data... what to use?

Post by Mike Davies »

The Moves:

There are 64 squares on a chess board, so you can number them 0 to 63. That's 6 bits. Every move can be recorded into a to/from pair, so two 6 bit pairs. 12 bits. I think one of the correspondence chess standard notations is in a similar integer/integer pair. Castling is recording the king move.

The original chessbase format (CBF) compressed even further, but it's harder to implement. They have an algorithm that calculates all the possible moves for the side to move, it's ordered in a specific order. Then the move is stored as the index number of that move in the list of moves. You could probably record the move in 8 bits or less (not sure what's the maximum number of legal moves possible in a position), rather than the explicit 12 bits above.

Might be worth digging around to find the article that details the old chessbase (CBF, CBI) -- the guy reverse engineered the format when he wrote the CBUTILS scripts about 25 years ago? Worth it because Chessbase started life on the Atari ST, back in the days when memory was scarce and disks expensive, so there might be a few tricks there worth incorporating.
Last edited by Mike Davies on Mon Mar 05, 2018 5:54 pm, edited 3 times in total.
User avatar
Mike Davies
Microbot
Posts: 137
Joined: Mon Nov 13, 2017 10:11 am

Re: Compression for chess game data... what to use?

Post by Mike Davies »

The PGN Header, I'd suggest an array for Event, and an array for player, so you can replace those header fields with index references (or token). Then each header can probably be represented as a fixed-length piece of data, since what is left is simple numbers, and something to encode the result of the game (White win, Draw, Black win, Game in progress/unknown).
Ralf
Rick Dangerous
Posts: 2279
Joined: Mon Nov 13, 2017 11:59 am
Location: Poland

Re: Compression for chess game data... what to use?

Post by Ralf »

Do you want to compress just chess game or chess game with metadata like players, elo rankings and so on?


In a game as others say you move from one field (64 combinations=6 bits) to another field. Some special move would be pawn promotion which requires info about new piece (4 combinations=2 bits).

But I see some possible tricks. Actually in many cases there is just one possible move to destination field, you cannot move from every field there.

So:

1) store destination field: 6 bits
2) if move is obvious then no start field and next move
if move is not obvious then start field


It would acutally be an compression algorithm having chess rules built into it :)
Nomad
Manic Miner
Posts: 600
Joined: Thu Dec 28, 2017 12:38 pm

Re: Compression for chess game data... what to use?

Post by Nomad »

Thanks for the info on chessbase, I will have a look into this later.

The technique sounds clever, its like they have encoded the move evaluation rather than the move itself.

Ralf I would like to include the meta data - I guess it could be in a separate file with a linked index. I was trying to keep things simple. I wanted to have the printed index well to make it look a bit more flashy :lol: I guess its not needed on the file if it's already in the printed index...

I think there are some problem with the 6 bit method because you have things like under promotion of pawns, figuring out en passant capture rule & queen side castling stuff the more obscure moves/rules. But for the vast majority of cases the moves can be told in 6 bits. But there needs to be a few more bits for flags to indicate unusual situations.

I was freaking out thinking I would need an extra 4 bits just to cover the promotion/under promotion of the pawn. But I guess just having one bit set. then when that bit is set the interpreter looks at the next byte for the information on the piece to promote to.

I am being a potato - I could have it be 7 bits. Have the last bit as a weirdness flag. If the weirdness flag is set then the next byte is special and tells the interpreter what to do. That way there is only a 1 bit extra hit per ply. but we avoid the edge case situations.

I think the data would need to be interpreted to do it properly. As the operation is pretty slow (only showing the move one at a time on the screen I guess that would not be an issue at all).

Another optimization.. pawn promotion is only ever going to happen on rank 1 or rank 8.. otherwise we don't need to worry about pawn promotion or underpromotion rule.

Probably the same for en passant capture rule. (it has a limited area of effect)

The evaluation on what to do you need to start many moves before but from the board move routine perspective its only on those ranks where it makes sense to worry.
User avatar
Mike Davies
Microbot
Posts: 137
Joined: Mon Nov 13, 2017 10:11 am

Re: Compression for chess game data... what to use?

Post by Mike Davies »

Nomad wrote: Tue Mar 06, 2018 1:21 am The technique sounds clever, its like they have encoded the move evaluation rather than the move itself.

As far as I can remember, the order of moves is strictly defined, something like all King moves first, then all Queen moves, then all Rook moves (don't remember if there's an order to which rook first),... etc. There's a definite, repeatable set of steps that depends on knowing how pieces move (King's can't move into check, or pinned pieces cannot move if that exposes the king to check). But not dependent on positional evaluation.

I'm struggling to find this reverse-engineered CBF format, but have a look for the utility CBASCII -- it's used to covert to and from CBF and PGN. The source code should be available, it's in C, so you could use that to unravel the steps in generating the move list of a position in the correct order.
Ralf
Rick Dangerous
Posts: 2279
Joined: Mon Nov 13, 2017 11:59 am
Location: Poland

Re: Compression for chess game data... what to use?

Post by Ralf »

There is no need of extra compexity for castling and en passant.

You write down castling as king's move. It's either E1-G1 or E1-C1. In both cases it's move by 2 fields so cannot be confused with standard king's move.
En Passant - you just write start and destination field and it's obvious what happened.

One comment - castling and en passant are easy when you need to write them down but make a lot of problems when you create a chess engine ;)
You need to examine game history to know if such move is possible and not justy check current position - if king moves and later returns to start position then casting is no more possible.
Nomad
Manic Miner
Posts: 600
Joined: Thu Dec 28, 2017 12:38 pm

Re: Compression for chess game data... what to use?

Post by Nomad »

I will have a hunt on the chess programming wiki about chessbase. Might be some links their.

Getting the engine to understand the under promotion and stuff like en passant rule. You are right that is pita. :lol: I was surprised how many early engines didn't bother to implement this.
Nomad
Manic Miner
Posts: 600
Joined: Thu Dec 28, 2017 12:38 pm

Re: Compression for chess game data... what to use?

Post by Nomad »

Found it, the reversed engineered chessbase file format.

http://rybkaforum.net/cgi-bin/rybkaforu ... 2#pid26942
Spoiler
This could be interesting. (For some reason, the forum plain refuses to let me post anything with "0xcc" and then a colon in it, so I mangled the original .txt file.)


The CTG (Chessbase book) format
Unofficial specification, version 0.2 (incomplete).
This document is placed in the public domain by its author.

0. Structure

Each Chessbase book consists of four files:

- The .INI file, which contains some auxilliary information. Being a text
file, it's mostly self-explanatory and will not be treated here.
- The .CTG file, which contains all the real data. This is described in
chapter 1.
- The .CTB file, which is unknown but conjectured to be some kind of bitmap
over free pages in the .CTG file. Only described very loosely, in chapter
2.
- The .CTO file, which contains a lookup table into the CTG file for fast
lookups. The format of this is only partially known, but is described in
chapter 3.

1. The CTG file

1.1. General conventions

The following conventions apply:

- All integers are stored in big-endian format, whether they are 2, 3 or 4
bytes long.
- Whenever a length of a block is stored, the number generally includes the
length byte itself.
- There is no alignment of values, except that pages are aligned on
4096-byte boundaries.

The CTG file is divided into 4096 byte pages.

1.2. Header page

The first page is the header page which contains some general information
about the file:

- At offset 28, a 4-byte value is stored, denoting the number of games
"stored" in the field. There is also some other (unknown) data in the
first 32 bytes.
- At offset 1024 (0x400), 32 unknown bytes are stored.
- At offset 2048 (0x800), 32 unknown bytes are stored.

The other bytes in the header page are unused.

1.3. Data pages

After the first page, all the other pages store positions. The first data
page (starting at offset 4096) is denoted "page 0", the next is "page 1",
etc.. The page numbers are used in the CTO file.

In general, what is looked up in the CTG format is positions, although
possible moves from each position are also stored. This enables the format to
deal with transpositions. Each page stores one or more positions in random
order. It is not known how Chessbase decides what page to put each position
in, but it is conjectured to be based on hash value (see the section about
the CTO format below).

Each data page starts with a brief header:

- At offset 0 in the page, a 2-byte integer is stored, telling the number
of positions stored in the page.
- At offset 2, a 2-byte integer is stored, giving the number of bytes
stored in the page (including the four header bytes). Note that after
this, there may be random junk data, so be sure to stop reading the
page after reaching the given number of bytes.

At offset 4, the first position entry starts. Position entries then follow
directly after each other until the end of the page (ie. the number of bytes
specified in the header).

1.4. Position storage

Note that this specification distinguished between position entries and
positions. A position is a chess position, with information about castling
rights and en passant move. (The CTG format does not store any information
about the 50-move rule or move history.) A position entry contains a position
and book information about the position:

The position entry contains a position, the book moves, and some information
associated with the position. (In general, most of the information that seem
to be associated with moves in the Chessbase GUI actually stems from looking
it up in the resulting position.)

1.4.1. Symmetry

Every position is stored as if white is to move. If black is to move,
invert the white/black pieces and flip the board vertically (so a1 <-> a8, a2
<-> a7, etc.). Due to the symmetry of chess this will give the same result.

After this inversion (if any), check if the white king is in the left half
of the board (columns one through four). If it is, and neither side has any
castling rights remaining, flip the board horizontally (so a1 <-> h1,
a2 <-> h2, etc.). Naturally, the en passant square is also flipped.

1.4.2. Header byte

Each position entry begins with a header byte:

- The lower five bits (x & 0x1f) describe the length of the position
itself, in bytes, including the header byte.
- If the sixth bit (x & 0x20) is set, the position contains en passant
information. This happens if and only if an en passant capture is
possible.
- If the seventh bit (x & 0x40) is set, the position contains castling
information. If not, neither side can castle. (The bit is never set
if neither side can castle.)
- The eighth bit (x & 0x80) is unused.

1.4.3. Chess board

After the header byte, the position itself is encoded. The position is stored
using a Huffman encoding and as such uses a variable number of bits. The
bit-packing convention is the usual one, with the first bit in the MSB (0x80)
of the first byte, any any excess bits at the end being set to zero,

The position is encoded starting at the a1 square, then the a2 square, up to
a8, b1, b2, etc.. Each square is encoded as follows:

Code: Select all

  0       Empty square
  11c     Pawn (c=1 means black, c=0 means white)
  1011c   Rook
  1010c   Bishop
  1001c   Knight
  10001c  Queen
  10000c  King
1.4.4. Alignment padding bits

At this point, padding bits may be inserted. The rules are somewhat complex:
First, if the position ended exactly on a byte, there are no castling rights
and no en passant column, insert a full zero byte at this point, and ignore
the rest of this section.

Otherwise, define a number N as follows, based on whether there is an en
passant square (E=1, otherwise E=0) and if castling is stored (C=1, otherwise
C=0):

Code: Select all

  E=0, C=0: N=8
  E=0, C=1: N=4
  E=1, C=0: N=3
  E=1, C=1: N=7
If there are not exactly N bits left in the current byte, insert zero bits
as needed (possibly as many as seven) until there is. This alignment scheme
ensures that the last data bit ends exactly on a byte boundary.

1.4.5. En passant column

If en passant information is stored (see 1.4.2), it follows (still bitpacked)
immediately after the padding. Three bits store the column, where 000=a,
001=b, 010=c, etc..

1.4.6. Castling rights

If castling rights are stored (see 1.4.2), it follows (still bitpacked) at
this point. The castling rights are stored as four bits, storing in turn
black kingside castling rights (1 = can castle, 0 = can not castle), black
queenside castling rights, and then the same for white.

1.4.7. Position end

At this point, the bit-packing ends. As mentioned in section 1.4.4, this
should be exactly on a byte boundary, so no further padding should be needed.

1.4.8. Book moves

After the position, any book moves (excluding unplayed transpositions) are
stored. First, there is a header byte telling the number bytes used for book
moves (including the header byte itself). Each move is stored in two bytes,
and as such the header byte contains the value (2 * the number of moves) + 1.

The first byte of each move contains the move itself, and the second byte
contains the move annotation, if any. The encoding of moves is slightly
unusual -- in general, a look-up table is needed to decode the move
information. Every possible byte can then be looked up into the combination
(piece, relative movement).

The "piece" is in general something like "knight 2". Here, the board is
searched for a white piece of the appropriate type in the same ordering as
the board storage: If a white knight is on a1 it is always "knight 1" (since
no squares are before a1 in this ordering).

The "relative movement" is the number of squares the piece is to move from
its current position, for instance "two squares forward and one to the right"
for a typical knight move. Note that the board is taken to wrap around on all
four sides, so "forward seven squares" is exactly the same as "backwards one
square" and will be encoded using the same byte. Long and short castling have
their own bytes. (It is not known whether the format can represent castling
in Chess960.)

Note that in this encoding, some legal but obscure moves can not be stored.
In particular, promotions are automatically assumed to be to queen, so an
underpromotion can simply not be stored. Also, there are no moves for "queen
5" or above, so if there are more than four queens on the board, moves for
the fifth can not be entered. (In the case of underpromotion, however, the
GUI actually stores the position as "pawn N forward one square" as usual, but
when trying to look up the position later, it cannot find a position with a
queen on the 8th rank, and thus will simply hide the move in the GUI.) In
general, this should not be a problem for an opening book.

An incomplete table of move encoding follows. This encompasses all possible
moves for the sixteen (white) pieces that are initially found on a chess
board, as well as for an extra queen, which should be sufficient for most
books, especially as underpromotion is not supported. A later version may
include the moves for queen 3, rook 3, etc..

Code: Select all

  0x00: Pawn 5    f1 r1
  0x01: Knight 2  b1 l2
  0x03: Queen 2   r2
  0x04: Pawn 2    f1
  0x05: Queen 1   f1
  0x06: Pawn 4    f1 l1
  0x08: Queen 2   r4
  0x09: Bishop 2  f6 r6
  0x0a: King      b1
  0x0c: Pawn 1    f1 l1
  0x0d: Bishop 1  f3 r3
  0x0e: Rook 2    r3
  0x0f: Knight 1  b1 l2
  0x12: Bishop 1  f7 r7
  0x13: King      f1
  0x14: Pawn 8    f1 r1
  0x15: Bishop 1  f5 r5
  0x18: Pawn 7    f1
  0x1a: Queen 2   f6
  0x1b: Bishop 1  f1 l1
  0x1d: Bishop 2  f7 r7
  0x21: Rook 2    r7
  0x22: Bishop 2  f2 l2
  0x23: Queen 2   f6 r6
  0x24: Pawn 8    f1 l1
  0x26: Bishop 1  f7 l7
  0x27: Pawn 3    f1 l1
  0x28: Queen 1   f5 r5
  0x29: Queen 1   r6
  0x2a: Knight 2  b2 r1
  0x2d: Pawn 6    f1 r1
  0x2e: Bishop 1  f1 r1
  0x2f: Queen 1   r1
  0x30: Knight 2  b2 l1
  0x31: Queen 1   r3
  0x32: Bishop 2  f5 r5
  0x34: Knight 1  f2 r1
  0x36: Knight 1  f1 r2
  0x37: Queen 1   f4
  0x38: Queen 2   f4 l4
  0x39: Queen 1   r5
  0x3a: Bishop 1  f6 r6
  0x3b: Queen 2   f5 l5
  0x3c: Bishop 1  f5 l5
  0x41: Queen 2   f5 r5
  0x42: Queen 1   f7 l7
  0x44: King      b1 r1
  0x45: Queen 1   f3 r3
  0x4a: Pawn 8    f2
  0x4b: Queen 1   f5 l5
  0x4c: Knight 2  f2 r1
  0x4d: Queen 2   f1
  0x50: Rook 1    f6
  0x52: Rook 1    r6
  0x54: Bishop 2  f1 l1
  0x55: Pawn 3    f1
  0x5c: Pawn 7    f1 r1
  0x5f: Pawn 5    f2
  0x61: Queen 1   f6 r6
  0x62: Pawn 2    f2
  0x63: Queen 2   f7 l7
  0x66: Bishop 1  f3 l3
  0x67: King      f1 r1
  0x69: Rook 2    f7
  0x6a: Bishop 1  f4 r4
  0x6b: O-O
  0x6e: Rook 1    r5
  0x6f: Queen 2   f7 r7
  0x72: Bishop 2  f7 l7
  0x74: Queen 1   r2
  0x79: Bishop 2  f6 l6
  0x7a: Rook 1    f3
  0x7b: Rook 2    f6
  0x7c: Pawn 3    f1 r1
  0x7d: Rook 2    f1
  0x7e: Queen 1   f3 l3
  0x7f: Rook 1    r1
  0x80: Queen 1   f6 l6
  0x81: Rook 1    f1
  0x82: Pawn 6    f1 l1
  0x85: Knight 1  f2 l1
  0x86: Rook 1    r7
  0x87: Rook 1    f5
  0x8a: Knight 1  b2 r1
  0x8b: Pawn 1    f1 r1
  0x8c: King      b1 l1
  0x8e: Queen 2   f2 l2
  0x8f: Queen 1   r7
  0x92: Queen 2   f1 r1
  0x94: Queen 1   f3
  0x96: Pawn 2    f1 r1
  0x97: King      l1
  0x98: Rook 1    r3
  0x99: Rook 1    f4
  0x9a: Queen 1   f6
  0x9b: Pawn 3    f2
  0x9d: Queen 1   f2
  0x9f: Bishop 2  f4 l4
  0xa0: Queen 2   f3
  0xa2: Queen 1   f2 r2
  0xa3: Pawn 8    f1
  0xa5: Rook 2    f5
  0xa9: Rook 2    r2
  0xab: Queen 2   f6 l6
  0xad: Rook 2    r4
  0xae: Queen 2   f3 r3
  0xb0: Queen 2   f4
  0xb1: Pawn 6    f2
  0xb2: Bishop 1  f6 l6
  0xb5: Rook 2    r5
  0xb7: Queen 1   f5
  0xb9: Bishop 2  f3 r3
  0xbb: Pawn 5    f1
  0xbc: Queen 2   r5
  0xbd: Queen 2   f2
  0xbe: King      r1
  0xc1: Bishop 1  f2 r2
  0xc2: Bishop 2  f2 r2
  0xc3: Bishop 1  f2 l2
  0xc4: Rook 2    r1
  0xc5: Rook 2    f4
  0xc6: Queen 2   f5
  0xc7: Pawn 7    f1 l1
  0xc8: Pawn 7    f2
  0xc9: Queen 2   f7
  0xca: Bishop 2  f3 l3
  0xcb: Pawn 6    f1
  0xcc  Bishop 2  f5 l5
  0xcd: Rook 1    r2
  0xcf: Pawn 4    f1
  0xd1: Pawn 2    f1 l1
  0xd2: Knight 2  f1 r2
  0xd3: Knight 2  f1 l2
  0xd7: Queen 1   f1 l1
  0xd8: Rook 2    r6
  0xd9: Queen 1   f2 l2
  0xda: Knight 1  b2 l1
  0xdb: Pawn 1    f2
  0xde: Pawn 5    f1 l1
  0xdf: King      f1 l1
  0xe0: Knight 2  b1 r2
  0xe1: Rook 1    f7
  0xe3: Rook 2    f3
  0xe5: Queen 1   r4
  0xe6: Pawn 4    f2
  0xe7: Queen 1   f4 r4
  0xe8: Rook 1    f2
  0xe9: Knight 1  b1 r2
  0xeb: Pawn 4    f1 r1
  0xec: Pawn 1    f1
  0xed: Queen 1   f7 r7
  0xee: Queen 2   f1 l1
  0xef: Rook 1    r4
  0xf0: Queen 2   r7
  0xf1: Queen 1   f1 r1
  0xf3: Knight 2  f2 l1
  0xf4: Rook 2    f2
  0xf5: Bishop 2  f1 r1
  0xf6: O-O-O
  0xf7: Knight 1  f1 l2
  0xf8: Queen 2   r1
  0xf9: Queen 2   f6
  0xfa: Queen 2   r3
  0xfb: Queen 2   f2 r2
  0xfd: Queen 1   f7
  0xfe: Queen 2   f3 l3
After the byte describing the move, the annotation for that move is stored as
a single byte:

Code: Select all

  0x00: No annotation
  0x01: !
  0x02: ?
  0x03: !!
  0x04: ??
  0x05: !?
  0x06: ?!
  0x08: Only move
  0x16: Zugzwang
Note that the other annotations (=, Unclear, =+, etc.) belong to the position
and not the move, as described in section 1.4.13.

1.4.9. Position statistics

After all moves and annotations have been encoded, the statistics for this
position are stored:

- First, a three-byte integer describing the number of games for this
position. This is the number reported in parenthesises after "N = 12345"
in Chessbase. (The "real" N number is the sum of the number of wins,
draws and losses.)
- Then, three bytes describing the number of wins for white (1-0). Remember
that the position is always stored from white's perspective, so if black
is to move, one will need to swap the numbers for 1-0 and 0-1.
- Then, three bytes describing the number of wins for black (0-1).
- Then, three bytes describing the number of draws (1/2).

The percentages and performance percentages are calculated from these
numbers, seemingly with correct rounding.

1.4.10. Unknown integer

After the statistics, an unknown 4-byte integer is stored. It is almost
always zero, but sometimes another small number (typically in the single
digits).

1.4.11. Ratings

After the unknown integer, rating information for this position is stored.
There are two ratings. Both are stored first as a number of games (this is
the number shown after the "-" to the lower right in Chessbase) in three
bytes, and then the sum of the ratings in four bytes. To get the average
rating (the number shown before the "-"), divide the second number by the
first number (guarding for division by zero).

The performance rating is calculated from these ratings and the position
statistics.

1.4.12. Engine recommendations

After the ratings, a single byte is stored for the position's
recommendations. The possibilities are:

0x00: No special preference
0x40: Don't play in tournaments (red in the GUI)
0x80: Main move (green in the GUI)

Other values have been spotted (0x04, 0a10, 0x14, 0x20, 0x60, 0xa0), without
knowing exactly what they are. It may be that this is actually a bit field.

Again, note that this value belongs to the position entry, not the move, so
when checking a move's desirability, one will need to look up the resulting
position and check this byte.

After the engine recommendation, a second unknown byte is stored. This may be
a weight somehow, or it may be something else.

1.4.13. Position commentary

Finally for the position, the commentary is stored as a single byte. The
following values are legal:

Code: Select all

  0x00: None
  0x0b: =
  0x0d: Unclear
  0x0e: =+
  0x0f: +=
  0x10: -/+
  0x11: +/-
  0x13: +-
  0x20: Development adv.
  0x24: Initiative
  0x28: With attack
  0x2c: Compensation
  0x84: Counterplay
  0x8a: Zeitnot
  0x92: Novelty
2. The CTB file

Unknown, except that it may be a bit field of some sort over the pages in the
CTG file. However, two values are of special interest:

- The lower storage bound, a 4-byte integer stored at offset 4.
- The upper storage bound, a 4-byte integer stored at offset 8.

The usage of these bounds will be clear from section 3.

3. The CTO file

Note that the CTO file is not as well understood as the CTG file. As such,
the information presented here is mainly guesswork, and may be wrong.

The CTO file is an index into the CTG file, for rapid lookup. It is believed
to be a simple hash table, with no forms of overflow, so only a single lookup
in the CTO file is required to find the right page in the CTG file for a
given position. (It is guessed that if a given page in the CTG file becomes
too big, the number of pages is increased and all the positions moved
around.)

The CTO file starts with sixteen header bytes. At file offset 4, a 4-byte
integer is stored, describing the number of entries in the CTO file. (This
seems to neither be a prime nor a power of two.)

After the header, four bytes are stored for each entry. When looking up a
position, the position is hashed, and the hash value is reduced relative to
the number of index entries. The appropriate four-byte value in the CTO file
is then looked up based on the hash. The number is a page number in the
CTG file, which is then read and searched for the position. (If the page in
the CTG file does not actually contain the position, the GUI seems to simply
fail the lookup, so there appears to be no provision for traditional
overflowing.) -1 (ff ff ff ff) means that no positions with this (reduced)
hash value exists. More than one hash value may point to the same page in the
CTG file.

The code for computing a 32-bit hash from a position (which is encoded
exactly as in the CTG file, including the header byte) is the equivalent
of this C code:

Code: Select all

  unsigned int tbl[] = {
          0x3100d2bf, 0x3118e3de, 0x34ab1372, 0x2807a847,
          0x1633f566, 0x2143b359, 0x26d56488, 0x3b9e6f59,
          0x37755656, 0x3089ca7b, 0x18e92d85, 0x0cd0e9d8,
          0x1a9e3b54, 0x3eaa902f, 0x0d9bfaae, 0x2f32b45b,
          0x31ed6102, 0x3d3c8398, 0x146660e3, 0x0f8d4b76,
          0x02c77a5f, 0x146c8799, 0x1c47f51f, 0x249f8f36,
          0x24772043, 0x1fbc1e4d, 0x1e86b3fa, 0x37df36a6,
          0x16ed30e4, 0x02c3148e, 0x216e5929, 0x0636b34e,
          0x317f9f56, 0x15f09d70, 0x131026fb, 0x38c784b1,
          0x29ac3305, 0x2b485dc5, 0x3c049ddc, 0x35a9fbcd,
          0x31d5373b, 0x2b246799, 0x0a2923d3, 0x08a96e9d,
          0x30031a9f, 0x08f525b5, 0x33611c06, 0x2409db98,
          0x0ca4feb2, 0x1000b71e, 0x30566e32, 0x39447d31,
          0x194e3752, 0x08233a95, 0x0f38fe36, 0x29c7cd57,
          0x0f7b3a39, 0x328e8a16, 0x1e7d1388, 0x0fba78f5,
          0x274c7e7c, 0x1e8be65c, 0x2fa0b0bb, 0x1eb6c371
  };

  unsigned int gen_hash(signed char *ptr, unsigned len)
  {
          signed hash = 0;
          short tmp = 0;
          int i;

          for (i = 0; i < len; ++i) {
                  signed char ch = *ptr++;
                  tmp += ((0x0f - (ch & 0x0f)) << 2) + 1;
                  hash += tbl[tmp & 0x3f];
                  tmp += ((0xf0 - (ch & 0xf0)) >> 2) + 1;
                  hash += tbl[tmp & 0x3f];
          }
          return hash;
  }
Note that this code may not be 64-bit clean, as it relies on 32-bit overflow
behavior of the addition to the hash variable.

The reduction of this 32-bit hash to a value in the CTO file is not
completely known. The following C code is believed to approximate the
lookup, and is worst-case O(log n) (usually much better) in the number of
index entries:

Code: Select all

  for (int n = 0; n < 0x7fffffff; n = 2*n + 1) {
        unsigned c = (hash & n) + n;

        if (c < lower_bound)
                continue;

        search_page(c);

        if (c >= upper_bound)
                break;
  }
Here, lower_bound and upper_bound are the values found in the CTB file as
described in section 2. search_page() indicates that the given value should
be looked up in the CTO file, and the resulting page in the CTG file (if
any) should be searched for the given position. ChessBase itself seems to
have more conditional logic around search_page(), but unless you want
to make new CTG files yourself this uncertainity should not be a big
problem.
Nomad
Manic Miner
Posts: 600
Joined: Thu Dec 28, 2017 12:38 pm

Re: Compression for chess game data... what to use?

Post by Nomad »

Here is the followup reverse engineering research

http://www.talkchess.com/forum/viewtopi ... at&start=0
Spoiler
Thanks, I didn't know about that one. It does actually contain some common information. Okay, so here we go.


File contents
=============
This text file contains the result of many hours work trying to figure out how the Chessbase CBH file format (and related files) work. Some file extensions, such as the search booster and the archive format CBV, are completely ignored for now. All data was derived from ChessBase 9.0.

CBH - Contains the list of games in the database and pointers to location in the other files.
CBG - The actual game data (ie sequence of moves)
CBA - Game annotations
CBP - Player data
CBT - Tournament data
CBC - Annotator data
CBS - Source data
CBE - Team data
CBM - ?
CBJ - ? Contains game->team mappings among other things
CBB - ?

CBH file
========
46 byte header, 46 byte records

Header
------
Ofs 0, Len 6 - ?? 00 00 2C 00 2E 01 if created with cb9, 00 00 24 00 2E 01 if created with cblight
Ofs 6, Len 4 - Number of games+1
Ofs 12, Len 4 - ?? Can be 2
Ofs 16, Len 4 - ?? Can be 3
Ofs 20, Len 4 - ?? Can be 3
Ofs 40, Len 4 - Either number of games+1 (?), or 0 (?)

Other bytes zero?

Game record
-----------
Ofs 0, Len 1 - bit 0: 1 = game (can sometimes be 0 if guiding text is 1?)
bit 1: 1 = guiding text
bit 7: 1 = marked as deleted
Ofs 1, Len 4 - Position in .cbg file where the game data starts (or guiding text)
Ofs 5, Len 4 - Position in .cba file where annotations start (0 = no annotations)
Ofs 9, Len 3 - White player number (0 = first player in .cbp file)
Ofs 12, Len 3 - Black player number (0 = first player in .cbp file)
Ofs 15, Len 3 - Tournament number (0 = first tournament in .cbt file)
Ofs 18, Len 3 - Annotator number (0 = first annotator in .cbc file)
Ofs 21, Len 3 - Source number (0 = first source in .cbs file)
Ofs 24, Len 3 - year & month & day
bit 0-4 = Day (0 = unset)
bit 5-8 = Month (0 = unset)
bit 9-20 = Year (0 = unset)

Ofs 27, Len 1 - 3 = line
2 = 1-0
1 = ½-½
0 = 0-1
6 = +:-
5 = =:=
4 = -:+
7 = 0-0
Ofs 28, Len 1 - Line evaluation (only set if byte above is 3)
0 = unset
18 = +- (white is winning)
16 = +/- (white has a clear advantage)
14 = +/= (white has a slight advantage)
11 = = (the position is equal)
13 = The position is unclear
15 = -/= (black has a slight advantage)
17 = -/+ (black has a clear advantage)
19 = -+ (black is winning)
0x92 = theoretical novelty
0x2C = with compensation for the material
0x84 = with counterplay
0x24 = with initiative
0x28 = with attack

Ofs 29, Len 1 - round (0 = unset)
Ofs 30, Len 1 - subround (0 = unset)
Ofs 31, Len 2 - white rating (0 = unset)
Ofs 33, Len 2 - black rating (0 = unset)

Ofs 35, Len 2 - bit 0-6: Sub ECO (00-99)
bit 7-15: ECO (0 = unset, 1 = A00, 500 = E99)

Ofs 37, Len 2 - Medals:
bit 0: Best game
bit 1: Decided tournament
bit 2: Model game (opening plan)
bit 3: Novelty
bit 4: Pawn structure
bit 5: Strategy
bit 6: Tactics
bit 7: With attack
bit 8: Sacrifice
bit 9: Defense
bit 10: Material
bit 11: Piece play
bit 12: Endgame
bit 13: Tactical blunder
bit 14: Strategical blunder
bit 15: User

Ofs 39 bit 0: 1 = Contains critical position
bit 1: 1 = Contains correspondence header
Ofs 40 bit 0: 1 = Contains embedded audio
bit 1: 1 = Contains embedded picture
bit 2: 1 = Contains embedded video
bit 3: 1 = Contains game quotation
bit 4: 1 = Contains path structure
bit 5: 1 = Contains piece path
Ofs 41: bit 1: 1 = Contains training annotation

Ofs 42, Len 1 - bit 0: P (game starts from a position different from the initial one)
bit 1: v (contains variations)
bit 2: c (contains commentary)
bit 3: s (contains symbols)
bit 4: The game is annotated with colored squares
bit 5: The game is annotated with arrows
bit 7: The game contains time notifications (??)

Ofs 43 ?

Ofs 44, Len 1 - bit 0-1: 1=V, 2=r, 3=R (many variations; bit 1 in ofs 42 must be set)
v = [1,50] variation moves, V = [51,300] variation moves,
r = [301,1000] variation moves, R = [1001,]
bit 2: C (many commentaries; bit 2 in ofs 42 must be set)
bit 3: S (many symbols; bit 3 in ofs 42 must be set)
bit 4: The game is annotated with color squares in at least 10 positions
bit 5: The game is annotated with arrows in at least 10 positions

Ofs 45, Len 1 - number of moves in main variant. If 255, the actual number of
moves is checked in the game data

If guiding text:

Ofs 0-4 same as above
Ofs 5?
Ofs 7, Len 3 - Tournament Id
Ofs 10, Len 3 - Source Id
Ofs 13, Len 3 - Annotator Id
Ofs 16, Len 1 - Round
Ofs 17, Len 1 - ? 0
Ofs 18, Len 1 - bit 2: Contains stream (wmv)
Ofs 19, Len 1 - bit 0: Contains audio link
bit 1: Contains image link
bit 2: Contains video link

CBG Format
==========
26 byte header. Actual game data pointed out from CBH file.

Ofs Type Len Description
----------------------------
0 short 2 Pointer to first game (?)
2 int 4 Pointer to end of file (= file length)
6 int 4 ? Accumulated value of game bytes length changed
14 int 4 Pointer to end of file (= file length) (same?)
22 int 4 ? Accumulated value of game bytes length changed

[At game start]

+0 int 4 bit 0-29: Game size, in number of bytes (including this, excluding trash bytes)
bit 30: If set, the game doesn't start with the initial position.
bit 31: If set, the data is not encoded (iff guiding text?)
+4 byte (28) If game does not start with initial position, here that position follows.
? byte ? Game data (see below), ends with 0C (=pop position)

[For guiding texts]

+0 int 4 bit 0-29: Game size, in number of bytes (including this, excluding trash bytes)
bit 30: If set, the game doesn't start with the initial position.
bit 31: If set, the data is not encoded (iff guiding text?)
+4 word 2 ? (1?) (little-endian)
+6 word 2 Number of titles (can be 0) (little endian)

For each title

+0 word 2 Language (0=english, 1=german, 2=france, 3=spain, 4=italian, 5=dutch) (little endian)
+2 word 2 Length of title (little endian)
+4 string ? Title

Game setup format
-----------------
Ofs 0, Len 1 - Always 1?
Ofs 1, Len 1 - Bit 0-3: En passant file (0 = no en passant, 1 = a, 8 = h)
Bit 4: 0=white to move, 1=black to move
Ofs 2, Len 1 - Bit 0: White O-O-O possible
Bit 1: White O-O possible
Bit 2: Black O-O-O possible
Bit 3: Black O-O possible
Ofs 3, Len 1 - Next move number
Ofs 4, Len 24 - Piece setup as a bit stream (starting with MSB at ofs 4),
using these bit codes:

0 = Empty square
10001 = White king
10010 = White queen
10011 = White knight
10100 = White bishop
10101 = White rook
10110 = White pawn
11001 = Black king
11010 = Black queen
11011 = Black knight
11100 = Black bishop
11101 = Black rook
11110 = Black pawn
Ofs 28 - Start of moves

The board is then scanned from a1 to a8, b1 to b8 etc. The first piece
of each type find is the initial first piece, etc (including pawns, of both colors).

Game data
=========
The moves in a game are represented as a stream of bytes. Most moves are represented as a single byte, describing the relative movement of a piece. Pieces are described in terms such as "first rook" (meaning the rook that for white starts on a1 and for black on a8), "second rook" etc.

When decoding the byte stream, first decrease the byte value with the number of moves so far processed in the game data (minus 0 for the first move). The special codes for "multiple byte moves" and begin/end of variations don't count. The move count should not be restarted when a variation end. Then translate the byte according to the translation table below.

All pawn moments are seen from the point of view of the player to move. Pawn captures include en passant.

When a "first piece" is captured, the "second piece" becomes the new "first piece" for that player, and the "third piece" becomes the "second piece". When a "second piece" is cpatured, the old "third piece" becomes the "second piece". A "fourth piece" (or higher) never "promotes" and its movements are always represented with multiple bytes. Pawns are never adjusted in this way, so the e-pawn remains as the "fifth pawn" throughout the game.

Multiple byte moves describes a move by denoting the start square and the destination square. This is done with 6 bits each. Bit 0-5 in the second byte is the start square (a1=0, a2=1, h8=63), bit 6-7 in the second byte and 0-3 in the first is the destination square. For pawn promotions, bit 4-5 is also used (0=queen, 1=rook, 2=bishop, 3=knight). Any type of move can be described as a multipe byte move, but it's typically only done when moving a fourth piece or doing a pawn promotion.

Single byte move translation table
----------------------------------
Note: These bytes look random, but they have actually been "encrypted" using a static array of 256 bytes that can be found at position 0x7AE4A8 in CBase9.exe and position 0x9D6530 in CBase10.exe (it starts with 0xAA, 0x49, 0x39, 0xD8, 0x5D, 0xC2, 0xB1, 0xB2). By reverse looking up these values, the new value will correspond to the order the are listed here.

Special
-------
AA = null move

King
----
49 = y+1
39 = x+1, y+1
D8 = x+1
5D = x+1, y+7
C2 = y+7
B1 = x+7, y+7
B2 = x+7
47 = x+7, y+1
76 = 0-0
B5 = 0-0-0

First queen
-----------
A5 = y+1
B8 = y+2
CB = y+3
53 = y+4
7F = y+5
6B = y+6
8D = y+7
79 = x+1
BE = x+2
EB = x+3
21 = x+4
99 = x+5
D2 = x+6
57 = x+7
4D = x+1, y+1
B4 = x+2, y+2
BF = x+3, y+3
62 = x+4, y+4
BD = x+5, y+5
24 = x+6, y+6
96 = x+7, y+7
A7 = x+1, y+7
48 = x+2, y+6
28 = x+3, y+5
6E = x+4, y+4 (not primarly used)
2F = x+5, y+3
5A = x+6, y+2
18 = x+7, y+1

First rook (a1/a8 at start)
----------
4E = y+1
F8 = y+2
43 = y+3
D7 = y+4
63 = y+5
9C = y+6
E6 = y+7
2E = x+1
C6 = x+2
26 = x+3
88 = x+4
30 = x+5
61 = x+6
6F = x+7

Second rook (h1/h8 at start)
-----------
14 = y+1
A9 = y+2
68 = y+3
EE = y+4
FB = y+5
77 = y+6
E2 = y+7
A6 = x+1
05 = x+2
8B = x+3
A1 = x+4
98 = x+5
32 = x+6
52 = x+7

First bishop (c1/c8 at start)
------------
02 = x+1, y+1
97 = x+2, y+2
E1 = x+3, y+3
41 = x+4, y+4
C3 = x+5, y+5
7C = x+6, y+6
E4 = x+7, y+7
06 = x+1, y+7
B7 = x+2, y+6
55 = x+3, y+5
D9 = x+4, y+4 (not primarily used)
2C = x+5, y+3
AE = x+6, y+2
37 = x+7, y+1

Second bishop (f1/f8 at start)
-------------
F6 = x+1, y+1
3F = x+2, y+2
08 = x+3, y+3
93 = x+4, y+4
73 = x+5, y+5
5E = x+6, y+6
78 = x+7, y+7
35 = x+1, y+7
F2 = x+2, y+6
6D = x+3, y+5
71 = x+4, y+4 (not primarly used)
A2 = x+5, y+3
F3 = x+6, y+2
16 = x+7, y+1

First knight (b1/b8 at start)
------------
58 = x+2, y+1
3D = x+1, y+2
FA = x-1, y+2
E9 = x-2, y+1
BA = x-2, y-1
D4 = x-1, y-2
DD = x+1, y-2
4A = x+2, y-1

Second knight (g1/g8 at start)
-------------
C4 = x+2, y+1
0E = x+1, y+2
FE = x-1, y+2
5F = x-2, y+1
75 = x-2, y-1
07 = x-1, y-2
89 = x+1, y-2
34 = x+2, y-1

a2/a7-pawn
------------
2D = one step forward
C1 = two steps forward
8E = capture right
F5 = capture left

b2/b7-pawn
------------
64 = one step forward
17 = two steps forward
70 = capture right
A4 = capture left

c2/c7-pawn
------------
7B = one step forward
DA = two steps forward
E0 = capture right
85 = capture left

d2/d7-pawn
------------
C5 = one step forward
0B = two steps forward
90 = capture right
F9 = capture left

e2/e7-pawn
------------
84 = one step forward
FF = two steps forward
15 = capture right
36 = capture left

f2/f7-pawn
------------
09 = one step forward
9E = two steps forward
7D = capture right
DE = capture left

g2/g7-pawn
------------
BB = one step forward
DF = two steps forward
BC = capture right
3A = capture left

h2/h7-pawn
------------
12 = one step forward
33 = two steps forward
13 = capture right
19 = capture left

Second queen
------------
E5 = y+1
94 = y+2
50 = y+3
11 = y+4
EA = y+5
31 = y+6
01 = y+7
5C = x+1
95 = x+2
CA = x+3
D3 = x+4
1D = x+5
7E = x+6
EF = x+7
44 = x+1, y+1
80 = x+2, y+2
A0 = x+3, y+3
1F = x+4, y+4
83 = x+5, y+5
00 = x+6, y+6
4B = x+7, y+7
67 = x+1, y+7
20 = x+2, y+6
5B = x+3, y+5
2A = x+4, y+4 (not primarly used)
92 = x+5, y+3
B6 = x+6, y+2
60 = x+7, y+1

Third queen
-----------
1A = y+1
42 = y+2
0F = y+3
0D = y+4
B0 = y+5
D1 = y+6
23 = y+7
F0 = x+1
7A = x+2
54 = x+3
4F = x+4
F4 = x+5
A8 = x+6
72 = x+7
E7 = x+1, y+1
40 = x+2, y+2
38 = x+3, y+3
59 = x+4, y+4
87 = x+5, y+5
E8 = x+6, y+6
6C = x+7, y+7
86 = x+1, y+7
04 = x+2, y+6
F1 = x+3, y+5
8C = x+4, y+4 (not primarly used)
CE = x+5, y+3
6A = x+6, y+2
DB = x+7, y+1

Third rook
----------
81 = y+1
82 = y+2
9A = y+3
1B = y+4
9D = y+5
0A = y+6
2B = y+7
8F = x+1
CD = x+2
ED = x+3
10 = x+4
74 = x+5
69 = x+6
D6 = x+7

Third bishop
------------
51 = x+1, y+1
B9 = x+2, y+2
45 = x+3, y+3
3B = x+4, y+4
56 = x+5, y+5
91 = x+6, y+6
FD = x+7, y+7
AB = x+1, y+7
66 = x+2, y+6
3E = x+3, y+5
46 = x+4, y+4 (not primarily used)
B3 = x+5, y+3
FC = x+6, y+2
C8 = x+7, y+1

Third knight
------------
9B = x+2, y+1
C0 = x+1, y+2
E3 = x-1, y+2
A3 = x-2, y+1
AC = x-2, y-1
C9 = x-1, y-2
EC = x+1, y-2
27 = x+2, y-1

Special
-------
29 = multiple byte move to follow
9F = dummy - skip and don't count thi move. Used by earlier CB verions as padding!?
25 = unused
C7 = unused
CC = unused
65 = unused
4C = unused
D5 = unused
1E = unused
CF = unused
03 = unused
8A = unused
AF = unused
F7 = unused
AD = unused
3C = unused
D0 = unused
22 = unused
1C = unused
DC = push position (start of variant)
0C = pop position (end of variant)


CBA file (annotation data)
==========================
26 byte header. Actual annotation data pointer out from CBH file.

Game start
----------
Ofs 0, Len 3 - The game ID this is the annotation for (little-endian)
Ofs 4?
Ofs 10, Len 4 - The number of bytes used for annotations in this game, starting at ofs 0 (big-endian)
Ofs 14 - start of annotation data

Annotation data
---------------
Ofs 0, Len 3 - Position in gaming sequence (0 = start position, 1 = after first move, -1 = game general) for this annotation (little-endian)
Ofs 3, Len 1 - Type of annotation:
0x02 = text after move
0x03 = symbols
0x18 = critical position
0x82 = text before move
0x14 = pawn structure (ofs 6 = 3?! TODO)
0x15 = piece path (ofs 6-7 = 03 2b? 03 0d? TODO)
0x13 = game quotation
0x22 = medal
0x23 = variation color
0x10 = sound
0x20 = video
0x11 = picture
0x09 = training annotation (TODO)
0x61 = correspondence header (always move -1?)
0x19 = correspondence move (header must exist?)
Ofs 4, Len 2 - Number of bytes for this annotation (from ofs 0, little-endian).


Symbols:
Ofs 6, Len 1 - Move comment (! ? !? ?! !! ?? Zugzwang Only mode)
Ofs 7, Len 1 - Position evaluation (optional byte, 0 if missing)
Ofs 8, Len 1 - Prefix (optional byte, 0 if missing)

Text:
Ofs 6, Len 1 - 0?
Ofs 7, Len 1 - Language 0 = All, 0x2A = English, 0x35 = Deutsch, 0x31 = France, 0x2B = Spain, 0x46 = Italian, 0x67 = Dutch, 0x75 = Portugese, 0x00 = Pol (bug??)
Ofs 8, Len ? - Text (remaining bytes) 0x9E = diagram (denoted "{#}")

Critical Position:
Ofs 6, Len 1 - 1 = Critical opening position
2 = Critical middlegame position
3 = Critical endgame position

Game Quotation:
Not done yet

Medal:
Ofs 6, Len 4 - Bit 0-15 (bit 0 = 1 in ofs 9) sets medals according to medal table below

Variation Color
Ofs 6, Len 1 - 0?
Ofs 7, Len 1 - blue component
Ofs 8, Len 1 - green component
Ofs 9, Len 1 - red component

Move comments
-------------
1 = !
2 = ?
3 = !!
4 = ??
5 = !?
6 = ?!
0x16 = Zugzwang
8 = Only move

Position evaluations
--------------------
0x12 = +-
0x10 = +/-
0x0E = +/=
0x0B = =
0x0D = unclear
0x0F = =/+
0x11 = -/+
0x13 = -+
0x92 = theoretical novelty
0x2C = with compensation for the material
0x84 = with counterplay
0x24 = with initiative
0x28 = with attack
0x20 = development advantage
0x8A = zeitnot

Prefix
-----
0x91 = Editorial annotation
0x8E = better is
0x8F = worse is
0x90 = Equivalent is
0x8C = with the idea
0x8D = directed against

CBP file (player data)
======================
28 byte header, 67 byte records

Header
------
Ofs 0, Len 4 - Number of player records in file (little endian)
Ofs 4, Len 4 - The root record
Ofs 8, Len 4 - The number 1234567890 (??? reserved for future use perhaps?)
Ofs 12, Len 4 - The size of a record (excluding the 9 bytes making up the binary tree structure)
Ofs 16, Len 4 - The first deleted record (-1 if nothing deleted)
Ofs 20, Len 4 - Number of existing records (little endian)
Ofs 24, Len 4 - ? unused ? Always 0?

Binary tree sorts players in alphabetical order (last name, first name).
An AVL tree is used to make sure the tree is balanced.

Record
------
Ofs 0, Len 4 - -999 if record is deleted. Otherwise left child node (-1 if missing) (little endian)
Ofs 4, Len 4 - Right child node (-1 if missing). If record is deleted, contains next deleted record (or -1 if last) (little endian)
Ofs 8, Len 1 - Height of right subtree minus height of left subtree. Used to balance the AVL tree. Usually -1, 0 or 1, but may become -2 or 2 temporarily. 0 for deleted records (unused)
Ofs 9, Len 30 - Last name
Ofs 39, Len 20 - First name
Ofs 59, Len 4 - Number of games with this player? (little endian)
Ofs 63, Len 4 - First game id with this player? (little endian)

All other player information is fetched from the Player Encyclopedia.



CBT file (tournament data)
==========================
28 byte header, 99 byte records

Same header format at CBP
Binary tree sorts tournament in alphabetical order

Ofs 0, Len 2 - 19 FC == deleted?? FF FF == exist??

Ofs 9, Len 40 - Title (zero terminated string)
Ofs 49, Len 30 - Place (zero terminated string)
Ofs 79, Len 3 - Date (little endian)
bit 0-4 = Day (0 = unset)
bit 5-8 = Month (0 = unset)
bit 9-20 = Year (0 = unset)
Ofs 83, Len 1 - bit 0-4 game type
0 = unset
1 = game
2 = match
3 = tourn
4 = swiss
5 = team
6 = k.o.
7 = simul
8 = schev
bit 5: 1 = blitz
bit 6: 1 = rapid
bit 7: 1 = corresp
Ofs 85, Len 1 - Nation (0=unset, AFG=0x01, ..., GBR=0xF7) (complete list in Cbase9.exe between 0x6ba154 and 0x0x6ba510, in reverse order)
Ofs 87, Len 1 - Category (0-99, 0=unset)
Ofs 88, Len 1 - bit 0: ?? Almost always the same as bit 1, but there are exceptions.
bit 1: complete
bit 2: board points
Ofs 89, Len 1 - Number of rounds (0-255, 0=unset)

Ofs 91, Len 4 - Number of games with this tournament? (little endian)
Ofs 95, Len 4 - First game id with this tournament? (little endian)



CBC file (annotator data)
=========================
28 byte header, 62 byte records

Same header format at CBP
Binary tree used to sort annotators in alphabetical order

Ofs 9, Len 45 - Annotator name (not zero terminated!?)
Ofs 54, Len 4 - Number of games with this annotator? (little endian)
Ofs 58, Len 4 - First game id with this annotator? (little endian)



CBS file (source data)
======================
28 byte header, 68 byte records

Same header format at CBP
Binary tree used to sort sources in alphabetical order

Ofs 9, Len 25 - Title
Ofs 34, Len 16 - Publisher
Ofs 50, Len 3 - Publication (little endian)
Ofs 54, Len 3 - Date (little endian)
Ofs 58, Len 1 - Version
Ofs 59, Len 1 - Quality (1 = high, 2 = normal, 3 = low, 0 = unset but treated as low)
Ofs 60, Len 4 - Number of games with this source? (little endian)
Ofs 64, Len 4 - First game id with this source? (little endian)



CBE file (team data)
====================
28 byte header, 72 byte records

Same header format at CBP
Binary tree used to sort teams in alphabetical order
A somewhat recent invention, not available in cblight which probably
explains why the CBP file lacks pointers to teams.

Ofs 9, Len 45 - Title
Ofs 54, Len 2 - Team Number - Chessbase crashes if this value is large. 5000 works, but not 9999.
Ofs 58, Len 1 - Bit 0: Season (if set, year is year/year+1, otherwise just year)
Ofs 59, Len 2 - Year
Ofs 63, Len 1 - Nation (same numbering as in CBT)
Ofs 64, Len 4 - Number of games with this team? (little endian)
Ofs 68, Len 4 - First game id with this team? (little endian)
User avatar
Mike Davies
Microbot
Posts: 137
Joined: Mon Nov 13, 2017 10:11 am

Re: Compression for chess game data... what to use?

Post by Mike Davies »

That looks like the reverse engineering of the "new" Chessbase format (CBH and it's plethora of files). I was thinking of the original Chessbase format (just the CBF + CBI files).

Ah here's the document I was thinking of:
https://hwiegman.home.xs4all.nl/filefor ... HSBASE.TXT (Section 2.4: Halfmoves and variants)

Sorry it took me some time to track this down again, been a long long time since I last read this.
Nomad
Manic Miner
Posts: 600
Joined: Thu Dec 28, 2017 12:38 pm

Re: Compression for chess game data... what to use?

Post by Nomad »

That is very useful. Thanks. One interesting feature that probably needs to be patched :lol:

Its in the game header.

Code: Select all

  		0     sbyte   GY (signed byte)
                year of game = GY+1900.
                (e.g. -29 = $E3 => 1871)
                if GY = 127 then no year specified
2027 = no year specified

That disassembly is very well done.
User avatar
Mike Davies
Microbot
Posts: 137
Joined: Mon Nov 13, 2017 10:11 am

Re: Compression for chess game data... what to use?

Post by Mike Davies »

Found this article talking about various chess move compression while trawling on a different project:

https://triplehappy.wordpress.com/2015/ ... mpression/
Nomad
Manic Miner
Posts: 600
Joined: Thu Dec 28, 2017 12:38 pm

Re: Compression for chess game data... what to use?

Post by Nomad »

Thank you, that post is very interesting.
Post Reply