Graphemes are a good way to make your tree shorter: use a grapheme per node instead of a character per node, so do use graphemes.

PROSM wrote: ↑Fri Feb 16, 2018 8:24 am ...
However, your idea about phonemes raises an interesting point, as you could instead use graphemes (how the phonemes are spelt). There are rather more graphemes than phonemes, since there are often several ways to spell each phoneme, but it should help the compression ratio significantly, as there are many three- or four-letter graphemes that reduce down to a single token. This allows us to encode a word like through as th r ough. With conventional ASCII, this word would take 7 bytes, but using graphemes it only takes 3 bytes, one for each grapheme token.
Coupled with Huffman coding, significant space savings could be made, as more frequent tokens will get shorter codes, reducing the space taken even further.
...
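The tokenization step in the quote above can be sketched as a greedy longest-match split. The tiny grapheme inventory and the function names here are illustrative assumptions, not a real English grapheme list:

```python
# Toy grapheme inventory, longest tried first; a real one would be much larger.
GRAPHEMES = ["ough", "th", "sh", "ch", "ee", "r", "s", "t"]

def tokenize(word):
    """Greedily split a word into the longest matching graphemes."""
    tokens = []
    i = 0
    while i < len(word):
        for g in sorted(GRAPHEMES, key=len, reverse=True):
            if word.startswith(g, i):
                tokens.append(g)
                i += len(g)
                break
        else:
            tokens.append(word[i])  # no grapheme matched: fall back to one char
            i += 1
    return tokens

print(tokenize("through"))  # ['th', 'r', 'ough'] -> 3 tokens instead of 7 bytes
```

With one byte per token, the 7-letter word costs 3 bytes, matching the ratio claimed above.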
But I would suggest a different approach for the tree, given the premise of a lot of text, but not that many distinct words in it.
I would build the tree (or set of trees, or sequences) in a different way.
I would define a leaf of a tree (or the start of a sequence) as the beginning of a word; the word is constructed by following the nodes up to the root.
So you will have many roots; for example, one root would hold all words that end in "th". Of course, you only populate the words you actually use in your texts.
So to encode a text, you keep a sequence of word IDs (how you encode these is up to you, to get maximum compression or the minimum number of bits). And since words that end in the same grapheme will probably sit close together in memory, you can link a node to its parent with relative distances, using variable bit sizes for the length if needed.
When decoding, you just grab the first word ID (typically the address of the leaf node that corresponds to that word), and you crawl up its tree until you reach its root, producing the required word by concatenating the node contents as you traverse upward.
Add a space, grab the next word ID, and repeat until your text is done.
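The whole scheme above can be sketched in a few lines: each node holds one grapheme and a pointer to its parent, the root is the shared word ending, and a "word ID" is just that word's leaf. All class and variable names here are my assumptions, not the poster's actual layout:

```python
class Node:
    def __init__(self, grapheme, parent):
        self.grapheme = grapheme   # token stored at this node
        self.parent = parent       # None at a root (the shared word ending)
        self.children = {}         # grapheme -> child Node

roots = {}  # last grapheme -> root Node: one tree per word ending

def insert(graphemes):
    """Add a word (as a grapheme list), walking from its end to its start."""
    rev = list(reversed(graphemes))
    node = roots.setdefault(rev[0], Node(rev[0], None))
    for g in rev[1:]:
        node = node.children.setdefault(g, Node(g, node))
    return node  # the leaf: this word's ID

def decode_word(leaf):
    """Rebuild a word by concatenating graphemes from leaf up to the root."""
    parts = []
    while leaf is not None:
        parts.append(leaf.grapheme)
        leaf = leaf.parent
    return "".join(parts)

through = insert(["th", "r", "ough"])
bough = insert(["b", "ough"])  # shares the "ough" root with "through"
text = " ".join(decode_word(w) for w in [through, bough])
print(text)  # "through bough"
```

Note that "through" and "bough" store the "ough" nodes only once, which is where the space saving over a plain word list comes from.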
The largest part of the data will surely be the multitude of word trees, but adding extra text should then cost very little, at most 2 bytes per word no matter how long the word is. And if you compress these sequences of IDs, you can have very compact and fast text decompression.
You can make optimizations for plurals, like "tree" and "trees", so that you don't have to create a whole new node sequence just to get the plural of a word.
You can take this idea one level further by building sequences of sentences (commonly used sequences of word IDs), if you find that enough of them repeat to justify another structure.
I hope I managed to make it clear.
NOTE: you should optimize for data size here, not code size, since the data size will overwhelm any code size needed for this.
TIP: you could use a single bit, similar to some UTF encodings, to mark whether the next byte is part of the current grapheme (0) or not (1). When this bit changes (to 1), the next byte holds the start of the relative position of the parent node, which can again be variable-bit encoded to minimize size.
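A closely related, common form of this trick is the base-128 varint, where the top bit of each byte says whether another byte of the same value follows. The bit polarity here (1 = more bytes follow) is the usual varint convention rather than the exact assignment in the tip, and the function names are mine:

```python
def encode_varint(n):
    """Encode a non-negative int: 7 payload bits per byte, MSB = 'more follows'."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(b)         # final byte of this value
            return bytes(out)

def decode_varint(data, pos=0):
    """Return (value, next position) starting at data[pos]."""
    value, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return value, pos

enc = encode_varint(300)   # b'\xac\x02': 300 fits in two bytes
print(decode_varint(enc))  # (300, 2)
```

Small relative offsets to a parent node then cost a single byte, and only rare, distant parents pay for extra bytes.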