Ah, OK, thanks. This brings up another question: how did the HexStrings generator
code path even get called?
When I saw these results, I was using the following test table:
CREATE TABLE testtable (
    partition_key text,
    clustering_column text,
    value text,
    PRIMARY KEY (partition_key, clustering_column)
);
According to StressProfile.java, any column of type TEXT should use the Strings
generator.
However, my data looks suspiciously as though the HexStrings generator
was used instead.
First, the generated strings included control characters such as SUB (\x1A) and
BEL (\x07). However, the Strings generator code looks like it restricts the
characters to the printable range.
Second, the result I documented previously (that the characters are normally
distributed, but the strings are not) matches the implementation of
HexStrings.
Do you know why this might be the case?
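To make the second point concrete, here is a rough Python sketch of the two behaviors I am comparing. It is not the cassandra-stress code: the function names (clamp_gauss, per_character, per_string, simulate) are my own, the clamped Gaussian is only an approximation of gaussian(1..100), and seeding a second generator from the draw is just one assumed way to get "one draw per whole string".

```python
import random
from collections import Counter

HEX = "0123456789abcdef"

def clamp_gauss(rng, lo, hi):
    """Draw an integer from a roughly bell-shaped distribution on [lo, hi].

    Assumption: mean at the midpoint, range spanning about six standard
    deviations, out-of-range draws rejected and retried.
    """
    mean = (lo + hi) / 2
    stdev = (hi - lo) / 6
    while True:
        x = round(rng.gauss(mean, stdev))
        if lo <= x <= hi:
            return x

def per_character(rng, length):
    """Behavior I observed: every character is its own draw from the distribution."""
    return "".join(HEX[clamp_gauss(rng, 0, 15)] for _ in range(length))

def per_string(rng, length, population):
    """Behavior I expected: one draw picks the whole string.

    The draw is used as a seed, so the same draw always yields the same string.
    """
    seed = clamp_gauss(rng, 1, population)
    value_rng = random.Random(seed)
    return "".join(value_rng.choice(HEX) for _ in range(length))

def simulate(rows=10000, length=32, population=100, seed=0):
    """Return frequency counts of whole strings under each behavior."""
    rng = random.Random(seed)
    observed = Counter(per_character(rng, length) for _ in range(rows))
    expected = Counter(per_string(rng, length, population) for _ in range(rows))
    return observed, expected

if __name__ == "__main__":
    observed, expected = simulate()
    print("distinct strings, per-character draws:", len(observed))
    print("distinct strings, per-string draws:   ", len(expected))
```

With per-string draws the number of distinct values can never exceed the population size, and their frequencies follow the bell curve. With per-character draws almost every generated string is distinct, even though a histogram of the individual characters is bell-shaped.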
Thanks,
-Saleil
From: [email protected] At: 12/12/18 18:09:14 To: Saleil Bhat (BLOOMBERG/ 731
LEX), [email protected]
Subject: Re: cassandra-stress HexStrings generator
Yes, I’m pretty sure you understood correctly (I wrote most of this, but it’s
been a long time so I cannot remember much for certain).
It should be implemented like the Strings generator. It looks like both
HexStrings and HexBytes are incorrect, and have been for a long time.
> On 12 Dec 2018, at 22:27, Saleil Bhat (BLOOMBERG/ 731 LEX)
> <[email protected]> wrote:
>
> Hi,
>
> I have a question about the behavior of the HexStrings value generator in the
> cassandra-stress tool, particularly concerning its population/identity
> distribution.
>
>
> Per the discussion in JIRA item CASSANDRA-6146 concerning the stress YAML
> profile, the population field in a columnspec “represents the total unique
> population distribution of that column across rows.”
>
>
> I interpreted this to mean that if I specify some distribution 'F' for a
> column, then the probability of occurrence for each potential value of that
> column is given by 'F'.
>
> So, for example, if I provided the following columnspec for a text column:
>   - name: fake_column
>     size: fixed(32)
>     population: gaussian(1..100)
> and then generated a large amount of data according to this specification,
> I would expect there to be 100 distinct values for ‘fake_column’, and that a
> histogram of the frequency of occurrence of each value would be roughly
> bell-shaped.
>
>
>
> However, the current implementation of the HexStrings generator deviates from
> this expectation. In the current implementation, each CHARACTER in the string
> is drawn from F, rather than the string as a whole. Therefore, if you plot a
> histogram of the frequency of occurrence of each character, you get a
> bell-shaped curve, but the distribution of whole strings (the actual column
> values) is something else.
>
>
> My question is, is this the desired behavior for string columns? Was my
> expectation/interpretation incorrect? If so, can anyone give some insight as to
> why strings are designed to behave this way and what the use case is for this
> behavior?
>
> Thanks,
> -Saleil