Re: Improved DeletionTime serialization to reduce disk size

2023-06-25 Thread Berenguer Blasi

Thanks for the replies.

I intend to javadoc the sstable format in detail someday, and more 
improvements might come up then, along with the vint encoding mentioned 
here. But unless somebody volunteers to do that in 5.0, is anybody 
against my trying to get the original proposal (1-byte flags for 
sentinel values) in?


Regards


Distant future people will not be happy about this, I can already tell 
you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a 
background thread.


LOL




On 23/6/23 15:44, Josh McKenzie wrote:
If we’re doing this, why don’t we delta encode a vint from some 
per-sstable minimum value? I’d expect that to commonly compress to a 
single byte or so.

+1 to this approach.

Distant future people will not be happy about this, I can already 
tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a 
background thread.


On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:

It's a possibility, though I haven't coded and benchmarked such an
approach, and I don't think I would have the time before the freeze to
take advantage of the sstable format change opportunity.

Still, it's something that can be explored later. If we can shave a few
extra % then that would always be great, imo.
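For reference, the delta-plus-vint idea could be sketched roughly like this (illustrative only; the class and method names are made up here and this is not Cassandra's actual VIntCoding):

```java
import java.io.ByteArrayOutputStream;

// Sketch: store each timestamp as an unsigned LEB128-style varint of its
// offset from a per-sstable minimum, so values close to the minimum take
// one byte instead of a fixed 8-byte long.
public class DeltaVintSketch
{
    static byte[] encode(long value, long sstableMin)
    {
        long delta = value - sstableMin; // assumed non-negative
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((delta & ~0x7FL) != 0)
        {
            out.write((int) ((delta & 0x7F) | 0x80)); // 7 payload bits + continuation bit
            delta >>>= 7;
        }
        out.write((int) delta);
        return out.toByteArray();
    }

    static long decode(byte[] bytes, long sstableMin)
    {
        long delta = 0;
        int shift = 0;
        for (byte b : bytes)
        {
            delta |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return sstableMin + delta;
    }

    public static void main(String[] args)
    {
        long min = 1687500000000000L; // per-sstable minimum, microseconds
        byte[] encoded = encode(min + 100, min); // a value near the minimum
        System.out.println(encoded.length);               // 1 byte instead of 8
        System.out.println(decode(encoded, min) == min + 100);
    }
}
```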

On 23/6/23 13:57, Benedict wrote:
> If we’re doing this, why don’t we delta encode a vint from some 
per-sstable minimum value? I’d expect that to commonly compress to a 
single byte or so.

>
>> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko wrote:

>>
>> Distant future people will not be happy about this, I can already 
tell you now.

>>
>> Sounds like a reasonable improvement to me however.
>>
>>> On 23 Jun 2023, at 07:22, Berenguer Blasi wrote:

>>>
>>> Hi all,
>>>
>>> DeletionTime.markedForDeleteAt is a long of microseconds since the 
Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 
years. We can either shed the 8th byte, for reduced IO and disk, or we 
can encode some sentinel values (such as LIVE) as flags there. That 
would mean reading and writing 1 byte instead of 12 (8-byte 
markedForDeleteAt long + 4-byte localDeletionTime int). Yes, we already 
avoid serializing DeletionTime (DT) entirely at _row_ level in sstables, 
but not at _partition_ level, and it is also serialized in the index, 
metadata, etc.

>>>
>>> So here's a POC: 
https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh 
(1) to evaluate the impact of the new alg (2). It's tested here against 
70% and 30% LIVE DTs to see how we perform:

>>>
>>>  [java] Benchmark                                 (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads    70PcLive         NC              avgt   15  0.331 ± 0.001  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads    70PcLive         OA              avgt   15  0.335 ± 0.004  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads    30PcLive         NC              avgt   15  0.334 ± 0.002  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads    30PcLive         OA              avgt   15  0.340 ± 0.008  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites   70PcLive         NC              avgt   15  0.337 ± 0.006  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites   70PcLive         OA              avgt   15  0.340 ± 0.004  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites   30PcLive         NC              avgt   15  0.339 ± 0.004  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites   30PcLive         OA              avgt   15  0.343 ± 0.016  ns/op

>>>
>>> That was ByteBuffer-backed, to test the impact of the extra 
bit-level operations. But what would be the impact of an end-to-end test 
against disk?

>>>
>>>  [java] Benchmark                                     (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score       Error  Units
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT   RAM             70PcLive         NC              avgt   15   605236.515 ± 19929.058  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT   RAM             70PcLive         OA              avgt   15   586477.039 ±  7384.632  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT   RAM             30PcLive         NC              avgt   15   937580.311 ± 30669.647  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT   RAM             30PcLive         OA              avgt   15   914097.770 ±  9865.070  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT   Disk            70PcLive         NC              avgt   15  1314417.207 ± 37879.012  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT   Disk            70PcLive         OA              avgt   15   805256.345 ± 15471.587  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT   Disk            30PcLive         NC              avgt   15  1583239.011 ± 50104.245  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT   Disk            30PcLive         OA              avgt   15  1439605.006 ± 64342.510  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2ESerializeDT     RAM             70PcLive         NC              avgt   15   295711.217 ±  5432.507  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2ESerializeDT     RAM             70PcLive         OA              avgt   15   305282.827 ±  1906.841  ns/op

[DISCUSS] When to run CheckStyle and other verifications

2023-06-25 Thread Jacek Lewandowski
Hi,


The context is that we currently have 3 checks in the build:

- CheckStyle
- Eclipse-Warnings
- RAT


CheckStyle and RAT are executed with almost every target we run: build,
jar, test, test-some, testclasslist, etc. On the other hand,
Eclipse-Warnings is executed automatically only with the artifacts target.


Checkstyle currently uses some caching, so subsequent reruns without
cleaning the project validate only the modified files.


Both CIs - Jenkins and CircleCI - force running all checks.


I want to discuss whether you are ok with extracting all checks into
their own distinct target and not running them automatically with the
targets which devs usually run locally. In particular:



   - "build", "jar", and all "test" targets would not trigger CheckStyle,
   RAT, or Eclipse-Warnings
   - A new target "check" would trigger CheckStyle, RAT, and
   Eclipse-Warnings
   - The new "check" target would be run along with the "artifacts" target
   on Jenkins-CI, and as a separate build step in CircleCI
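
As a sketch, assuming the existing per-check targets are named
checkstyle, rat-check, and eclipse-warnings (the actual names in
build.xml may differ), the aggregate target could be as simple as:

```xml
<!-- Hypothetical aggregate target: run all static checks in one go. -->
<target name="check"
        depends="checkstyle,rat-check,eclipse-warnings"
        description="Run CheckStyle, RAT and Eclipse-Warnings"/>
```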


The rationale for that change is:

   - Running all the checks together would be more consistent, but running
   all of them automatically with build and test targets could waste time when
   we develop something locally, frequently rebuilding and running tests.
   - On the other hand, it would be more consistent if the build did what
   we want - as a dev, when prototyping, I don't want to be forced to run
   analysis (and potentially fix issues) whenever I want to build a project or
   just run a single test.
   - There are ways to avoid running checks automatically by specifying
   some build properties, though the discussion here is about the default
   behavior
   - On the flip side, if one wants to run the checks along with a specified
   target, they could add the "check" target to the command line


The rationale for keeping the checks running automatically with every
target is to reduce the likelihood of not running the checks locally before
pushing the branch and being surprised by failing CI soon after starting
the build.


That could be fixed by running checks in a pre-push Git hook. There are
some benefits of this compared to the current behavior:

   - the checks would be run automatically only once
   - they would be triggered even for those devs who do everything in IDE
   and do not even touch Ant commands directly
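
Such a hook could be as small as this sketch (assuming the aggregate
target is called "check"; this is not an existing script in the repo):

```shell
#!/bin/sh
# Hypothetical .git/hooks/pre-push: run the checks once before pushing.
ant check || {
    echo "Checks failed; aborting push. Run 'ant check' to reproduce." >&2
    exit 1
}
```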


Checks can take time; to optimize that, they could be restricted locally
to verify only the modified files, in the same way as we currently
determine the tests to be repeated for CircleCI.


Thanks
- - -- --- -  -
Jacek Lewandowski