Challenge: Get a real Spectrum to show the longest word in English

The place for codemasters or beginners to talk about programming any language for the Spectrum.
Post Reply
User avatar
PeteProdge
Bugaboo
Posts: 3588
Joined: Mon Nov 13, 2017 9:03 am

Challenge: Get a real Spectrum to show the longest word in English

Post by PeteProdge »

The longest word in UK English is 189,819 letters long, so that's almost 186K in bytes if stored uncompressed. It's a real word, apparently, being the chemical name of 'titin' - the largest known protein.

Obviously, there is no official cassette-based Spectrum computer that could show the word from an uncompressed source, so this would be a major challenge in text compression.

Yes, if we include the +3, then you could get it onto +3 disk and have 164K left over to tell a short story around the word. Maybe even have a text adventure that includes it? I mean, don't get the poor user to type it in as part of a necessary command, it takes three and a half hours to actually say the word out loud, goodness knows how much longer you'd be there tapping it in and don't tell me you wouldn't make a typo or two (thousand).

But yeah, is it possible, using the power of compression, to get this word fitting in one cassette load on a 128K Spectrum? Or even a 48K Spectrum? Or even... if you're very very good at this... a 16K?

It's pointless, but think of the kudos you'd get for overcoming this challenge. And it'd be a nice CSSCGC entry.
Reheated Pixels - a combination of retrogaming, comedy and factual musing, is here!
New video: Nine ZX Spectrum magazine controversies - How Crash, Your Sinclair and Sinclair User managed to offend the world!
User avatar
Stefan
Manic Miner
Posts: 809
Joined: Mon Nov 13, 2017 9:51 pm
Location: Belgium
Contact:

Re: Challenge: Get a real Spectrum to show the longest word in English

Post by Stefan »

7 zip compresses it to 1986 (coincidence?) bytes, so that is going to be a doddle. :lol:
User avatar
MustardTiger
Microbot
Posts: 122
Joined: Tue May 02, 2023 8:05 pm

Re: Challenge: Get a real Spectrum to show the longest word in English

Post by MustardTiger »

It'd be funny to get the Currah speech to read it out. LET s$="....." etc
User avatar
R-Tape
Site Admin
Posts: 6409
Joined: Thu Nov 09, 2017 11:46 am

Re: Challenge: Get a real Spectrum to show the longest word in English

Post by R-Tape »

Sounds fun. I think I'll have a bash at this.

ZX0 had a good long think about it, and got 191664 txt file down to 10778 bytes. One way would be modding ZX0 to decompress and print on the fly. So unless I'm oversimplifying it'll be easy on a 48K. There's still some way to go for a 16K though.

There could be other ways too, as it's a systematic word based on the amino acids the protein is made from. In total there are 24 amino acids, plus 2 different stems at the very end.

Methionyl 0
threonyl 1
glutaminyl 2
arginyl 3
tyrosyl 4
glutamyl 5
seryl 6
leucyl 7
phenylalanyl 8
alanyl 9
glutaminyl 10
lysyl 11
glycyl 12
valyl 13
prolyl 14
aspartyl 15
isoleucyl 16
aspartyl 17
asparaginyl 18
cysteinyl 19
tryptophyl 20
histidyl 21
acetyl 22
titin 23 *edit this one is the protein
tryptophyl 24
...and the two different words at the end
serx 25
leucine 26
User avatar
Mpk
Dynamite Dan
Posts: 1008
Joined: Tue Feb 09, 2021 8:10 am

Re: Challenge: Get a real Spectrum to show the longest word in English

Post by Mpk »

MustardTiger wrote: Mon Oct 16, 2023 5:05 pm It'd be funny to get the Currah speech to read it out. LET s$="....." etc
Even better, via the 48k speaker so it sounds like the Meteor Storm voice for 3 hours.
equinox
Dynamite Dan
Posts: 1052
Joined: Mon Oct 08, 2018 1:57 am
Location: SE England

Re: Challenge: Get a real Spectrum to show the longest word in English

Post by equinox »

Stefan wrote: Mon Oct 16, 2023 4:42 pm 7 zip compresses it to 1986 (coincidence?) bytes, so that is going to be a doddle. :lol:
Can you decompress and display it as well, though?
You'd need "streaming" decompression as you can't use hard disk as a scratch space (unless you're clever with a DivIDE?)...

(I've never heard of any implementation of ZIP (de)compression on a Speccy. Possibly not worth it, as I imagine ZIP was designed for machines with more memory available.)
Last edited by equinox on Mon Oct 16, 2023 7:15 pm, edited 1 time in total.
User avatar
p13z
Manic Miner
Posts: 611
Joined: Sun Feb 17, 2019 10:41 pm
Location: UK
Contact:

Re: Challenge: Get a real Spectrum to show the longest word in English

Post by p13z »

Mpk wrote: Mon Oct 16, 2023 6:31 pm Even better, via the 48k speaker so it sounds like the Meteor Storm voice for 3 hours.
This is a great* little speech synth, that sits at 47500. This needs to be done.
https://spectrumcomputing.co.uk/entry/2 ... ynthesiser
Using @R-Tape 's method. It could recite it section by section (using appropriately phonic strings for the synth).
It would be pretty tortuous to listen to. Make a horrific YouTube video.
User avatar
R-Tape
Site Admin
Posts: 6409
Joined: Thu Nov 09, 2017 11:46 am

Re: Challenge: Get a real Spectrum to show the longest word in English

Post by R-Tape »

Just having one byte per amino acid would need 128K. But one byte could have two amino acids and that would fit into a 48K.

Image

Here's the assembly for the above program that contains the string with the amino acids in order if anyone wants a play. (I've inserted a db 255 and END within the string so it can be assembled into a 48K.)

The ZX0 method should work best though. Something for another night.
User avatar
ParadigmShifter
Manic Miner
Posts: 670
Joined: Sat Sep 09, 2023 4:55 am

Re: Challenge: Get a real Spectrum to show the longest word in English

Post by ParadigmShifter »

If there's only 27 root words you can huffman encode it by token (which you can decompress on the fly) - each token would be a root word. And you can use A-Z and 1 other character for that.

can't be arsed working out how many bits you would need though :) Depends on the relative frequency of each token (worst case - they all occur same number of times).

or you could tokenise it then LZW compress the token list (which I believe ZX0 is a variant of)
User avatar
ketmar
Manic Miner
Posts: 713
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Challenge: Get a real Spectrum to show the longest word in English

Post by ketmar »

simple bit-based entropy coder managed to compress it to 13710 bytes using tokenizing approach R-Tape suggested above. decompressing this requires about 270 bytes of memory (not counting the tokens), and 32-bit multiply. as the codec is bit-based, we can decompress on the fly.

C test program.
Spoiler

Code: Select all

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define VWAD_COMPILE_DECOMPRESSOR
#define VWAD_OKUNUSED

#define VWADWR_ERR_ARGS  (-1)
#define VWADWR_ERR_MEM  (-1)

#define vwadwr_uint64  uint64_t
#define vwadwr_uint  uint32_t
#define vwadwr_ushort  uint16_t
#define vwadwr_ubyte  uint8_t
#define vwadwr_bool  int

#define intbool_t  int
#define int_true  (1)
#define int_false  (0)

#define vwadwr_memman  void *

#define xalloc(mm_,sz_)  malloc(sz_)
#define xfree(mm_,ptr_)  free(ptr_)

#define vwad__builtin_trap  __builtin_trap


// ////////////////////////////////////////////////////////////////////////// //
static const char *words[] = {
  "methionyl",
  "threonyl",
  "glutaminyl",
  "arginyl",
  "tyrosyl",
  "glutamyl",
  "seryl",
  "leucyl",
  "phenylalanyl",
  "alanyl",
  "lysyl",
  "glycyl",
  "valyl",
  "prolyl",
  "aspartyl",
  "isoleucyl",
  "asparaginyl",
  "cysteinyl",
  "histidyl",
  "acetyl",
  "tryptophyl",
  "titin",
  "ethionyl",
  "serx",
  "isoleucine",
  NULL
};


// ////////////////////////////////////////////////////////////////////////// //
typedef struct {
  vwadwr_uint x1, x2;
  vwadwr_ubyte *dest;
  int dpos, dend;
} EntEncoder;

typedef struct {
  vwadwr_uint x1, x2, x;
  const vwadwr_ubyte *src;
  int spos, send;
} EntDecoder;

typedef struct {
  vwadwr_ushort p1, p2;
} Predictor;


typedef struct {
  Predictor predBits[2 * 32];
  int ctxBits;
} FiveBitsPPM;



// ////////////////////////////////////////////////////////////////////////// //
static void EncInit (EntEncoder *ee, void *outbuf, vwadwr_uint obsize) {
  ee->x1 = 0; ee->x2 = 0xFFFFFFFFU;
  ee->dest = (vwadwr_ubyte *)outbuf; ee->dpos = 0;
  ee->dend = obsize;
}

static void EncEncode (EntEncoder *ee, int p, intbool_t bit) {
  vwadwr_uint xmid = ee->x1 + (vwadwr_uint)(((vwadwr_uint64)((ee->x2 - ee->x1) & 0xffffffffU) * (vwadwr_uint)p) >> 17);
  if (bit) ee->x2 = xmid; else ee->x1 = xmid + 1;
  while ((ee->x1 ^ ee->x2) < (1u << 24)) {
    if (ee->dpos < ee->dend) {
      ee->dest[ee->dpos] = (vwadwr_ubyte)(ee->x2 >> 24); ee->dpos += 1;
    } else {
      ee->dpos = 0x7fffffff - 8;
    }
    ee->x1 <<= 8;
    ee->x2 = (ee->x2 << 8) + 255;
  }
}

static void EncFlush (EntEncoder *ee) {
  if (ee->dpos + 4 <= ee->dend) {
    ee->dest[ee->dpos] = (vwadwr_ubyte)(ee->x2 >> 24); ee->x2 <<= 8; ee->dpos += 1;
    ee->dest[ee->dpos] = (vwadwr_ubyte)(ee->x2 >> 24); ee->x2 <<= 8; ee->dpos += 1;
    ee->dest[ee->dpos] = (vwadwr_ubyte)(ee->x2 >> 24); ee->x2 <<= 8; ee->dpos += 1;
    ee->dest[ee->dpos] = (vwadwr_ubyte)(ee->x2 >> 24); ee->x2 <<= 8; ee->dpos += 1;
  } else {
    ee->dpos = 0x7fffffff - 8;
  }
}


// ////////////////////////////////////////////////////////////////////////// //
static VWAD_OKUNUSED vwadwr_ubyte DecGetByte (EntDecoder *ee) {
  if (ee->spos < ee->send) {
    return ee->src[ee->spos++];
  } else {
    ee->spos = 0x7fffffff;
    return 0;
  }
}

static void DecInit (EntDecoder *ee, const void *inbuf, vwadwr_uint insize) {
  ee->x1 = 0; ee->x2 = 0xFFFFFFFFU;
  ee->src = (const vwadwr_ubyte *)inbuf; ee->spos = 0; ee->send = insize;
  ee->x = DecGetByte(ee);
  ee->x = (ee->x << 8) + DecGetByte(ee);
  ee->x = (ee->x << 8) + DecGetByte(ee);
  ee->x = (ee->x << 8) + DecGetByte(ee);
}

static intbool_t DecDecode (EntDecoder *ee, int p) {
  vwadwr_uint xmid = ee->x1 + (vwadwr_uint)(((vwadwr_uint64)((ee->x2 - ee->x1) & 0xffffffffU) * (vwadwr_uint)p) >> 17);
  intbool_t bit = (ee->x <= xmid);
  if (bit) ee->x2 = xmid; else ee->x1 = xmid + 1;
  while ((ee->x1 ^ ee->x2) < (1u << 24)) {
    ee->x1 <<= 8;
    ee->x2 = (ee->x2 << 8) + 255;
    ee->x = (ee->x << 8) + DecGetByte(ee);
  }
  return bit;
}


// ////////////////////////////////////////////////////////////////////////// //
static void PredInit (Predictor *pp) {
  pp->p1 = 1 << 15; pp->p2 = 1 << 15;
}

static int PredGetP (Predictor *pp) {
  return (int)((vwadwr_uint)pp->p1 + (vwadwr_uint)pp->p2);
}

static void PredUpdate (Predictor *pp, intbool_t bit) {
  if (bit) {
    pp->p1 += ((~((vwadwr_uint)pp->p1)) >> 3) & 0b0001111111111111;
    pp->p2 += ((~((vwadwr_uint)pp->p2)) >> 6) & 0b0000001111111111;
  } else {
    pp->p1 -= ((vwadwr_uint)pp->p1) >> 3;
    pp->p2 -= ((vwadwr_uint)pp->p2) >> 6;
  }
}

static int PredGetPAndUpdate (Predictor *pp, intbool_t bit) {
  int p = PredGetP(pp);
  PredUpdate(pp, bit);
  return p;
}

static intbool_t PredGetBitDecodeUpdate (Predictor *pp, EntDecoder *dec) {
  int p = PredGetP(pp);
  intbool_t bit = DecDecode(dec, p);
  PredUpdate(pp, bit);
  return bit;
}


// ////////////////////////////////////////////////////////////////////////// //
static void FiveBitsPPMInit (FiveBitsPPM *ppm) {
  for (vwadwr_uint f = 0; f < (vwadwr_uint)sizeof(ppm->predBits) / (vwadwr_uint)sizeof(ppm->predBits[0]); f += 1) {
    PredInit(&ppm->predBits[f]);
  }
  ppm->ctxBits = 0;
}

static void FiveBitsPPMEncode (FiveBitsPPM *ppm, EntEncoder *enc, int bt) {
  int c2 = 1;
  int cofs = ppm->ctxBits * 32;
  if (bt < 0 || bt > 0x1f) abort();
  ppm->ctxBits = (bt >= 0x10);
  for (int f = 0; f < 5; f += 1) {
    intbool_t bit = (bt&0x10) != 0; bt <<= 1;
    int p = PredGetPAndUpdate(&ppm->predBits[cofs + c2], bit);
    EncEncode(enc, p, bit);
    c2 += c2; if (bit) c2 += 1;
  }
}

static vwadwr_ubyte FiveBitsPPMDecode (FiveBitsPPM *ppm, EntDecoder *dec) {
  vwadwr_uint n = 1;
  int cofs = ppm->ctxBits * 32;
  do {
    intbool_t bit = PredGetBitDecodeUpdate(&ppm->predBits[cofs + n], dec);
    n += n; if (bit) n += 1;
  } while (n < 0x20);
  n -= 0x20;
  ppm->ctxBits = (n >= 0x10);
  return (vwadwr_ubyte)n;
}


int main () {
  FILE *fi = fopen("the_longest_word.txt", "r");
  char *src = malloc(1024 * 1024);
  ssize_t rd = fread(src, 1, 1024 * 1024, fi);
  fclose(fi);
  printf("%d bytes read.\n", (int)rd);
  src[rd] = 0;
  for (char *s = src; *s; s += 1) {
    if (*s >= 'A' && *s <= 'Z') *s += 32;
    else if (*s < 'a' || *s > 'z') abort();
  }

  char *dest = malloc(1024 * 1024);
  EntEncoder enc;
  EncInit(&enc, dest, 1024 * 1024);

  FiveBitsPPM ppm;
  FiveBitsPPMInit(&ppm);

  int tokens = 0;
  char *text = src;
  while (*text) {
    int idx = 0;
    const char **xword = words;
    while (*xword != NULL && strncmp(text, *xword, strlen(*xword)) != 0) {
      idx += 1; xword += 1;
    }
    if (*xword == NULL) {
      printf("not found. processed %u bytes\n", (unsigned)(ptrdiff_t)(text - src));
      int left = 32;
      while (*text && left != 0) {
        putchar(*text); text += 1; left -= 1;
      }
      putchar('\n');
      abort();
    }
    if (idx > 0x1e) abort();
    FiveBitsPPMEncode(&ppm, &enc, idx);
    text += strlen(*xword);
    tokens += 1;
  }
  FiveBitsPPMEncode(&ppm, &enc, 0x1f);
  EncFlush(&enc);

  printf("compressed to: %d (%d tokens)\n", enc.dpos, tokens);

  char *upk = malloc(1024 * 1024);

  EntDecoder dec;
  DecInit(&dec, dest, enc.dpos);

  FiveBitsPPMInit(&ppm);
  vwadwr_ubyte bt = FiveBitsPPMDecode(&ppm, &dec);
  char *dpos = upk;
  while (bt != 0x1f) {
    strcpy(dpos, words[bt]);
    dpos += strlen(words[bt]);
    bt = FiveBitsPPMDecode(&ppm, &dec);
  }
  *dpos = 0;
  printf("unpacked %u bytes\n", (unsigned)strlen(upk));

  if (strcmp(upk, src) != 0) {
    printf("unpacker failed!\n");
    abort();
  }

  return 0;
}
p.s.: put C code under the spoiler too, because the paste will disappear later, and i HAET posts where the code was available, and now it isn't. ;-)

p.p.s.: bit-based entropy codec is PD from Matt Mahoney, i believe. i… lifted it from some PD compressor, and using for years. it is your bog-standard arithmetic/range coder, only it works strictly with one bit. so it has order-1 per-bit prediction model (actually, slightly more compilcated, but meh). and it works surprisingly good for such small amout of code.
TheMartian
Microbot
Posts: 100
Joined: Wed Feb 03, 2021 5:18 am

Re: Challenge: Get a real Spectrum to show the longest word in English

Post by TheMartian »

The really, really important thing is line 20:

10 LOAD"" CODE: RANDOMIZE USR 32768: REM DECOMPRESS AND PRINT
20 PRINT: PRINT "von Ulm"
User avatar
ketmar
Manic Miner
Posts: 713
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Challenge: Get a real Spectrum to show the longest word in English

Post by ketmar »

btw, i wrote Z80 decompressor (but lost the source, lol). it was about 1.5 kb of code, but i didn't even tried to optimise anything, just stitched the code together, closely following the C source. so it's about 14.5-15kb, with on-the-fly decompression. if we disable interrupts, and use scr$ as data area, it would prolly be possible to stuff it into 16K Speccy. ;-)
Post Reply