Hi,
I ran some benchmarks on my laptop:
https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=16656821&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16656821
For a random read workload, varying chunk size:
Chunk size    Time
64k           25:20
64k           25:33
32k           20:01
16k           19:19
16k           19:14
8k            16:51
4k            15:39
Ariel
On Thu, Oct 18, 2018, at 2:55 PM, Ariel Weisberg wrote:
> Hi,
>
> For those who were asking about the performance impact of block size on
> compression I wrote a microbenchmark.
>
> https://pastebin.com/RHDNLGdC
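>
> The shape of the benchmark is roughly the following, a simplified
> sketch of what is in the pastebin, assuming JMH and lz4-java on the
> classpath. The real harness also covers Snappy, LZ4 high, and the
> other chunk sizes, and its input data is more realistic than the
> random bytes used here:
>
> import java.util.concurrent.ThreadLocalRandom;
> import net.jpountz.lz4.LZ4Compressor;
> import net.jpountz.lz4.LZ4Factory;
> import org.openjdk.jmh.annotations.Benchmark;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.State;
>
> @State(Scope.Thread)
> public class CompressBenchSketch
> {
>     static final LZ4Compressor COMPRESSOR =
>         LZ4Factory.fastestInstance().fastCompressor();
>
>     // One 16k chunk of input plus a worst-case sized output buffer
>     final byte[] src = randomBytes(16 * 1024);
>     final byte[] dest = new byte[COMPRESSOR.maxCompressedLength(16 * 1024)];
>
>     static byte[] randomBytes(int size)
>     {
>         byte[] bytes = new byte[size];
>         ThreadLocalRandom.current().nextBytes(bytes);
>         return bytes;
>     }
>
>     @Benchmark
>     public int benchCompressLZ4Fast16k()
>     {
>         // Return the compressed length so JMH can't dead-code the call
>         return COMPRESSOR.compress(src, 0, src.length, dest, 0);
>     }
> }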
>
> [java] Benchmark                                               Mode  Cnt          Score           Error  Units
> [java] CompactIntegerSequenceBench.benchCompressLZ4Fast8k     thrpt   15  305518114.172 ± 11043705.883  ops/s
> [java] CompactIntegerSequenceBench.benchCompressLZ4Fast16k    thrpt   15  331190055.685 ±  8079758.044  ops/s
> [java] CompactIntegerSequenceBench.benchCompressLZ4Fast32k    thrpt   15  353024925.655 ±  7980400.003  ops/s
> [java] CompactIntegerSequenceBench.benchCompressLZ4Fast64k    thrpt   15  365664477.654 ± 10083336.038  ops/s
> [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast8k   thrpt   15  727725713.128 ±  4252436.331  ops/s
> [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast16k  thrpt   15  688369529.911 ± 25620873.933  ops/s
> [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast32k  thrpt   15  703635848.895 ±  5296941.704  ops/s
> [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast64k  thrpt   15  695537044.676 ± 17400763.731  ops/s
>
> To summarize, compression is 8.5% slower and decompression is 1%
> faster at the smaller chunk sizes. This measures only the direct cost
> of compression/decompression, not the much larger win from less often
> decompressing data we don't need.
>
> I didn't test decompression of Snappy and LZ4 high, but I did test
> compression.
>
> Snappy:
> [java] CompactIntegerSequenceBench.benchCompressSnappy8k   thrpt  2  186040175.059  ops/s
> [java] CompactIntegerSequenceBench.benchCompressSnappy16k  thrpt  2  196574766.116  ops/s
> [java] CompactIntegerSequenceBench.benchCompressSnappy32k  thrpt  2  198538643.844  ops/s
> [java] CompactIntegerSequenceBench.benchCompressSnappy64k  thrpt  2  194600497.613  ops/s
>
> LZ4 high compressor:
> [java] CompactIntegerSequenceBench.bench8k   thrpt  2  32254619.594  ops/s
> [java] CompactIntegerSequenceBench.bench16k  thrpt  2  20822947.578  ops/s
> [java] CompactIntegerSequenceBench.bench32k  thrpt  2  12037342.253  ops/s
> [java] CompactIntegerSequenceBench.bench64k  thrpt  2   6782534.469  ops/s
>
> LZ4 high is the one instance where block size mattered a lot. It's a
> bit suspicious that throughput scales almost exactly 1:1 with block
> size, but I couldn't spot a bug in the benchmark.
>
> Compression ratios with LZ4 fast for the text of Alice in Wonderland
> were:
>
> Chunk size 8192, ratio 0.709473
> Chunk size 16384, ratio 0.667236
> Chunk size 32768, ratio 0.634735
> Chunk size 65536, ratio 0.607208
>
> By way of comparison, I also ran deflate with maximum compression:
>
> Chunk size 8192, ratio 0.426434
> Chunk size 16384, ratio 0.402423
> Chunk size 32768, ratio 0.381627
> Chunk size 65536, ratio 0.364865
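>
> For anyone who wants to reproduce the ratio numbers, this is roughly
> how they can be computed. A sketch, assuming lz4-java on the
> classpath; the file path is illustrative:
>
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import net.jpountz.lz4.LZ4Compressor;
> import net.jpountz.lz4.LZ4Factory;
>
> public class RatioSketch
> {
>     public static void main(String[] args) throws Exception
>     {
>         byte[] text = Files.readAllBytes(Paths.get("alice.txt"));
>         LZ4Compressor lz4 = LZ4Factory.fastestInstance().fastCompressor();
>         for (int chunk = 8192; chunk <= 65536; chunk *= 2)
>         {
>             byte[] dest = new byte[lz4.maxCompressedLength(chunk)];
>             long compressed = 0;
>             // Compress the text one chunk at a time, as the sstable would
>             for (int off = 0; off < text.length; off += chunk)
>             {
>                 int len = Math.min(chunk, text.length - off);
>                 compressed += lz4.compress(text, off, len, dest, 0);
>             }
>             System.out.printf("Chunk size %d, ratio %f%n",
>                               chunk, compressed / (double) text.length);
>         }
>     }
> }
>
> The deflate numbers fall out of the same loop using
> java.util.zip.Deflater at Deflater.BEST_COMPRESSION instead of LZ4.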
>
> Ariel
>
> On Thu, Oct 18, 2018, at 5:32 AM, Benedict Elliott Smith wrote:
> > FWIW, I’m not -0, just think that long after the freeze date a change
> > like this needs a strong mandate from the community. I think the change
> > is a good one.
> >
> > > On 17 Oct 2018, at 22:09, Ariel Weisberg <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > It's really not appreciably slower compared to the decompression we
> > > are going to do, which is going to take several microseconds.
> > > Decompression is also going to be faster overall: we'll decompress
> > > data we don't need less often, and the decompression itself may speed
> > > up since smaller chunks may fit better in a higher-level cache. I ran
> > > a microbenchmark comparing the representations.
> > >
> > > https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=16653988&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16653988
> > >
> > > Fetching a long from memory: 56 nanoseconds
> > > Compact integer sequence:    80 nanoseconds
> > > Summing integer sequence:    165 nanoseconds
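> > >
> > > To give a sense of the shape of such a representation, here is a
> > > sketch of the general idea (not the actual code on the ticket).
> > > Chunk offsets increase monotonically, and each compressed chunk is at
> > > most chunk_length plus a small overhead, so instead of a full 8-byte
> > > long per chunk we can store an 8-byte base per group of chunks plus a
> > > small per-chunk delta:
> > >
> > > final class CompactChunkOffsets
> > > {
> > >     private static final int GROUP = 16;
> > >     private final long[] bases;  // one full offset per GROUP chunks
> > >     private final int[] deltas; // each chunk's offset within its group
> > >
> > >     CompactChunkOffsets(long[] offsets)
> > >     {
> > >         bases = new long[(offsets.length + GROUP - 1) / GROUP];
> > >         deltas = new int[offsets.length];
> > >         for (int i = 0; i < offsets.length; i++)
> > >         {
> > >             if (i % GROUP == 0)
> > >                 bases[i / GROUP] = offsets[i];
> > >             deltas[i] = (int) (offsets[i] - bases[i / GROUP]);
> > >         }
> > >     }
> > >
> > >     long offset(int chunk)
> > >     {
> > >         return bases[chunk / GROUP] + deltas[chunk];
> > >     }
> > > }
> > >
> > > This naive layout costs about 4.5 bytes per chunk instead of 8; the
> > > 37% figure quoted below implies a tighter encoding than this.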
> > >
> > > Currently we have one +1 from Kurt to change the representation and
> > > possibly a -0 from Benedict. That's not really enough to make an
> > > exception to the code freeze. If you want it to happen (or not), you
> > > need to speak up; otherwise only the default will change.
> > >
> > > Regards,
> > > Ariel
> > >
> > > On Wed, Oct 17, 2018, at 6:40 AM, kurt greaves wrote:
> > >> I think if we're going to drop it to 16k, we should invest in the
> > >> compact sequencing as well. Just lowering it to 16k could have a
> > >> painful impact on anyone running low-memory nodes, but if we can do
> > >> it without the memory impact I don't think there's any reason to wait
> > >> another major version to implement it.
> > >>
> > >> Having said that, we should probably benchmark the two representations
> > >> Ariel has come up with.
> > >>
> > >> On Wed, 17 Oct 2018 at 20:17, Alain RODRIGUEZ <[email protected]> wrote:
> > >>
> > >>> +1
> > >>>
> > >>> I would guess a lot of C* clusters/tables have this option set to
> > >>> the default value, and not many of them need to read such big chunks
> > >>> of data.
> > >>> I believe this will greatly limit disk overreads for a fair amount
> > >>> (a big majority?) of new users. It seems fair enough to change this
> > >>> default value, and I also think 4.0 is a nice place to do this.
> > >>>
> > >>> Thanks for taking care of this, Ariel, and for making sure there is
> > >>> a consensus here as well,
> > >>>
> > >>> C*heers,
> > >>> -----------------------
> > >>> Alain Rodriguez - [email protected]
> > >>> France / Spain
> > >>>
> > >>> The Last Pickle - Apache Cassandra Consulting
> > >>> http://www.thelastpickle.com
> > >>>
> > >>> Le sam. 13 oct. 2018 à 08:52, Ariel Weisberg <[email protected]> a
> > >>> écrit :
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> This would only impact new tables; existing tables would get their
> > >>>> chunk_length_in_kb from the existing schema. It's something we
> > >>>> record in a system table.
> > >>>>
> > >>>> I have an implementation of a compact integer sequence that only
> > >>>> requires 37% of the memory required today. So we could do this with
> > >>>> only slightly more than doubling the memory used. I'll post that to
> > >>>> the JIRA soon.
> > >>>>
> > >>>> Ariel
> > >>>>
> > >>>> On Fri, Oct 12, 2018, at 1:56 AM, Jeff Jirsa wrote:
> > >>>>>
> > >>>>>
> > >>>>> I think 16k is a better default, but it should only affect new tables.
> > >>>>> Whoever changes it, please make sure you think about the upgrade path.
> > >>>>>
> > >>>>>
> > >>>>>> On Oct 12, 2018, at 2:31 AM, Ben Bromhead <[email protected]> wrote:
> > >>>>>>
> > >>>>>> This is something that's bugged me for ages; tbh the performance
> > >>>>>> gain for most use cases far outweighs the increase in memory usage,
> > >>>>>> and I would even be in favor of changing the default now and
> > >>>>>> optimizing the storage cost later (if it's found to be worth it).
> > >>>>>>
> > >>>>>> For some anecdotal evidence: 4kb is usually what we end up setting
> > >>>>>> it to. 16kb feels more reasonable given the memory impact, but what
> > >>>>>> would be the point if, practically, most folks set it to 4kb anyway?
> > >>>>>>
> > >>>>>> Note that chunk_length will largely depend on your read sizes, but
> > >>>>>> 4k is the floor for most physical devices in terms of their block
> > >>>>>> size.
> > >>>>>>
> > >>>>>> +1 for making this change in 4.0 given the small size of the change
> > >>>>>> and the large improvement to new users' experience (as long as we
> > >>>>>> are explicit in the documentation about memory consumption).
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <[email protected]>
> > >>>> wrote:
> > >>>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> This is regarding
> > >>>> https://issues.apache.org/jira/browse/CASSANDRA-13241
> > >>>>>>>
> > >>>>>>> This ticket has languished for a while. IMO it's too late in 4.0
> > >>>>>>> to implement a more memory-efficient representation for compressed
> > >>>>>>> chunk offsets. However, I don't think we should put out another
> > >>>>>>> release with the current 64k default, as it's pretty unreasonable.
> > >>>>>>>
> > >>>>>>> I propose that we lower the value to 16kb. 4k might never be the
> > >>>>>>> correct default anyway, as there is a cost to compression, and 16k
> > >>>>>>> will still be a large improvement.
> > >>>>>>>
> > >>>>>>> Benedict and Jon Haddad are both +1 on making this change for
> > >>>>>>> 4.0. In the past there has been some consensus about reducing
> > >>>>>>> this value, although maybe with more memory efficiency.
> > >>>>>>>
> > >>>>>>> The napkin math for what this costs is:
> > >>>>>>> "If you have 1TB of uncompressed data, with 64k chunks that's 16M
> > >>>>>>> chunks at 8 bytes each (128MB).
> > >>>>>>> With 16k chunks, that's 512MB.
> > >>>>>>> With 4k chunks, it's 2G.
> > >>>>>>> Per terabyte of data (pre-compression)."
> > >>>>>>>
> > >>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
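> > >>>>>>>
> > >>>>>>> That math is easy to sanity-check; a quick sketch:
> > >>>>>>>
> > >>>>>>> public class NapkinMath
> > >>>>>>> {
> > >>>>>>>     public static void main(String[] args)
> > >>>>>>>     {
> > >>>>>>>         // One 8-byte offset per chunk, per 1TB of uncompressed data
> > >>>>>>>         long terabyte = 1L << 40;
> > >>>>>>>         for (int chunkKb : new int[]{ 64, 16, 4 })
> > >>>>>>>         {
> > >>>>>>>             long chunks = terabyte / (chunkKb * 1024L);
> > >>>>>>>             System.out.printf("%dk chunks: %dM entries, %d MB%n",
> > >>>>>>>                               chunkKb, chunks >> 20, (chunks * 8) >> 20);
> > >>>>>>>         }
> > >>>>>>>     }
> > >>>>>>> }
> > >>>>>>> // Prints 16M entries / 128 MB at 64k, 64M / 512 MB at 16k,
> > >>>>>>> // and 256M / 2048 MB at 4k.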
> > >>>>>>>
> > >>>>>>> By way of comparison, memory mapping the files has a similar cost
> > >>>>>>> of 8 bytes per 4k page. Multiple mappings make this more expensive.
> > >>>>>>> With a default of 16kb this would be 4x less expensive than memory
> > >>>>>>> mapping a file. I only mention this to give a sense of the costs we
> > >>>>>>> are already paying; I am not saying they are directly related.
> > >>>>>>>
> > >>>>>>> I'll wait a week for discussion and, if there is consensus, make
> > >>>>>>> the change.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Ariel
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>> Ben Bromhead
> > >>>>>> CTO | Instaclustr <https://www.instaclustr.com/>
> > >>>>>> +1 650 284 9692
> > >>>>>> Reliability at Scale
> > >>>>>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]