Understanding lzma-302eos

Hoël Bézier Tue, 23 Aug 2022 13:15:53 -0700

Hi,

I’ve recently decided to learn the Hare language, and figured that implementing lzip support for it would be a good way to start.

By reading the IETF draft on the lzip format and the source code of lzd, and filling the gaps with the Wikipedia page on LZMA, I managed to grasp most of what happens when decompressing lzip. Compression will certainly be another challenge, but I’ll deal with that when I get to it.

One thing (amongst many!) that I fail to figure out is why the range decoder skips the first five bytes of the LZMA stream. This happens in the Range_decoder constructor in the lzd code:


```cpp
  Range_decoder() : member_pos( 6 ), code( 0 ), range( 0xFFFFFFFFU )
    {
    for( int i = 0; i < 5; ++i ) code = ( code << 8 ) | get_byte();
    }
```

This is also confirmed by the IETF draft:
   The range encoder produces a first 0 byte that must be ignored by the
   range decoder.  This is done by shifting 5 bytes in the
   initialization of 'code' instead of 4.

This tells me why it skips five bytes instead of four, but I cannot understand why it needs to skip four bytes in the first place. I guess I’m missing some more general knowledge about range encoding, which is why I’m sending this email in the hope that some of you might enlighten me.

On a side note, this code snippet shows that the first five bytes are used to update code, which according to the IETF draft is the current point in the range, while range itself is not updated. I don’t understand why, which tells me I do not properly understand what these variables represent. Any insight on that matter is welcome too.
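
To make my reading of that loop concrete, here is a minimal sketch of what I think it computes. The helper name init_code and the buffer argument are mine (hypothetical, not part of lzd); it reads from an array instead of calling get_byte(), and it assumes code is a 32-bit unsigned integer:

```cpp
#include <cstdint>

// Hypothetical helper, not part of lzd: mirrors the constructor loop,
// but reads the five bytes from a buffer instead of calling get_byte().
// Assumption: 'code' is a 32-bit unsigned integer.
std::uint32_t init_code( const std::uint8_t * const buf )
  {
  std::uint32_t code = 0;
  // Each step shifts 'code' left by one byte and appends the next byte.
  // After five steps the first byte has been shifted out of the 32-bit
  // value, so only bytes 2 to 5 remain in 'code'.
  for( int i = 0; i < 5; ++i ) code = ( code << 8 ) | buf[i];
  return code;
  }
```

If this is right, then feeding it the bytes 00 12 34 56 78 gives a code of 0x12345678, and the value of the first byte makes no difference to the result.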


Thanks,
Hoël

