Re: Inclusive/exclusive endpoints when compacting token ranges

2022-08-08 Thread Berenguer Blasi

+1 to new flags also from me

On 26/7/22 18:39, Andrés de la Peña wrote:
I think that's right, using a closed range makes sense to consume the 
data provided by "sstablemetadata", which also provides closed ranges. 
Especially because with half-open ranges we couldn't compact an sstable 
with a single big partition, of which we might know only the token but 
not the partition key.


So probably we should just add documentation about both -st and -et 
being inclusive, and live with a different meaning of -st in repair 
and compact.


Also, the reason why this is so confusing in the test that started the 
discussion is that those closed token ranges are internally 
represented as "Range" objects, which are half-open by 
definition. So we should document those methods, and maybe do some 
minor changes to avoid the use of "Range" to silently represent 
closed token ranges.
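
To make the distinction concrete, here is a small illustration in plain
Python (not Cassandra code; the token values are made up) of why a
half-open interpretation could not select an sstable whose minimum and
maximum token are the same:

    def in_closed_range(token, start, end):
        # [start, end]: both endpoints included, as nodetool compact behaves today.
        return start <= token <= end

    def in_half_open_range(token, start, end):
        # (start, end]: start excluded, the repair-style interpretation.
        return start < token <= end

    # An sstable holding a single big partition: min and max token coincide.
    min_token = max_token = 42

    print(in_closed_range(42, min_token, max_token))     # True  -> sstable selected
    print(in_half_open_range(42, min_token, max_token))  # False -> (42, 42] selects nothing

(Note that org.apache.cassandra.dht.Range treats a range whose endpoints
are equal as covering the whole ring, which is the "compact all the tokens"
behaviour mentioned further down the thread, yet another possible reading.)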


On Tue, 26 Jul 2022 at 16:27, Jeremiah D Jordan wrote:


Reading the responses here and taking a step back, I think the
current behavior of nodetool compact is probably the correct
behavior.  The main use case I can see for using nodetool compact
is someone wants to take some sstable and compact it with all the
overlapping sstables.  So you run “sstablemetadata” on the sstable
and get the min and max tokens, and then you pass those in to
nodetool compact.  In that case you do want the closed range.
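
As a rough sketch of that workflow (the sstable path is a placeholder, and
the exact token labels printed by sstablemetadata differ between versions,
so the parsing below is only an assumption to adjust to your output):

    import re
    import subprocess

    def token_bounds(sstable_path):
        # Run sstablemetadata and pull the first/last token out of its output.
        # The "First token"/"Last token" labels are an assumption; check your version.
        out = subprocess.run(["sstablemetadata", sstable_path],
                             capture_output=True, text=True, check=True).stdout
        tokens = re.findall(r"(?:First|Last) token:\s*(-?\d+)", out)
        return tokens[0], tokens[1]

    start, end = token_bounds("/var/lib/cassandra/data/ks/tbl/nb-1-big-Data.db")
    # Because -st/-et are closed (inclusive), passing min/max back verbatim
    # selects that sstable together with everything overlapping it.
    subprocess.run(["nodetool", "compact", "-st", start, "-et", end, "ks", "tbl"],
                   check=True)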

This is different from running repair where you get the tokens
from the nodes/nodetool ring, and those node-level token range
ownerships are half-open, going from “token owned by node a” to
“token owned by node b”.

So my initial thought/gut reaction that it should work like repair
is misleading, because you don’t get the tokens from the same
place you get them when running repair.

Making the command line options more explicit and documented does
seem like it could be useful.

-Jeremiah Jordan


On Jul 26, 2022, at 9:16 AM, Derek Chen-Becker wrote:

+1 to new flags. A released, albeit undocumented, behavior is
still a contract with the end user. Flags (and documentation)
seem like the right path to address the situation.

Cheers,

Derek

On Tue, Jul 26, 2022 at 7:28 AM Benedict Elliott Smith wrote:


I think a change like this could be dangerous for a lot of
existing automation built atop nodetool.

I’m not sure this change is worthwhile. I think it would be
better to introduce e.g. -ste and -ete for “start token
exclusive” and “end token exclusive” so that users can opt-in
to whichever scheme they prefer for their tooling, without
breaking existing users.

> On 26 Jul 2022, at 14:22, Brandon Williams wrote:
>
> +1, I think that makes the most sense.
>
> Kind Regards,
> Brandon
>
> On Tue, Jul 26, 2022 at 8:19 AM J. D. Jordan wrote:
>>
>> I like the third option, especially if it makes it
consistent with repair, which has supported ranges for longer, and
I would guess most people assume the compact ranges work
the same as the repair ranges.
>>
>> -Jeremiah Jordan
>>
>>> On Jul 26, 2022, at 6:49 AM, Andrés de la Peña wrote:
>>>
>>> 
>>> Hi all,
>>>
>>> CASSANDRA-17575 has detected that token ranges in
nodetool compact are interpreted as closed on both sides. For
example, the command "nodetool compact -st 10 -et 50" will
compact the tokens in [10, 50]. This way of interpreting
token ranges is unusual since token ranges are usually
half-open, and I think that in the previous example one would
expect that the compacted tokens would be in (10, 50]. That's
for example the way nodetool repair works, and indeed the
class org.apache.cassandra.dht.Range is always half-open.
>>>
>>> It's worth mentioning that, unlike nodetool
repair, the help and doc for nodetool compact don't specify
whether the supplied start/end tokens are inclusive or exclusive.
>>>
>>> I think that ideally nodetool compact should interpret
the provided token ranges as half-open, to be consistent with
how token ranges are usually interpreted. However, this would
change the way the tool has worked until now. This change
might be problematic for existing users relying on the old
behaviour. That would be especially severe for the case where
the begin and end tokens are the same, because interpreting
[x, x] would compact a single token, whereas I think that
interpreting (x, x] would compact all the tokens. As for
compacting ranges including multiple tokens, I think the
change wouldn't be so bad, since probably the supplied to

dtests to reproduce the schema disagreement

2022-08-08 Thread Cheng Wang via dev
Hello,

I am working on improving the schema disagreement issue. I need some dtests
which can reproduce the schema disagreement.  Anyone know if there are any
existing tests for that? Or something similar?

Thanks
Cheng


Re: dtests to reproduce the schema disagreement

2022-08-08 Thread Brandon Williams
If you simply do a lot of schema changes quickly without waiting for
agreement, that should get you there.
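
For example, a minimal sketch with the Python driver (node addresses,
keyspace and table are placeholders), disabling the driver's own wait for
schema agreement so the changes land as fast as the coordinators accept them:

    from cassandra.cluster import Cluster

    # max_schema_agreement_wait <= 0 stops the driver from pausing after each
    # DDL statement until all nodes agree on the schema version.
    cluster = Cluster(["127.0.0.1", "127.0.0.2"], max_schema_agreement_wait=0)
    session = cluster.connect("ks")  # assumes the keyspace already exists

    for i in range(50):
        # DDL has no routing key, so the driver spreads it across coordinators.
        session.execute(f"ALTER TABLE t ADD col_{i} int")

    cluster.shutdown()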

Kind Regards,
Brandon

On Mon, Aug 8, 2022 at 5:08 PM Cheng Wang via dev wrote:
>
> Hello,
>
> I am working on improving the schema disagreement issue. I need some dtests 
> which can reproduce the schema disagreement.  Anyone know if there are any 
> existing tests for that? Or something similar?
>
> Thanks
> Cheng


Re: dtests to reproduce the schema disagreement

2022-08-08 Thread Cheng Wang via dev
Thank you for the reply, Brandon! It is helpful!

I was thinking of creating a cluster with 2 nodes and having two concurrent
CREATE TABLE statements running. But the test will be flaky as there is no
guarantee that the query runs before the schema agreement has been reached.
Any ideas for that?

Thanks,
Cheng

On Mon, Aug 8, 2022 at 3:19 PM Brandon Williams  wrote:

> If you simply do a lot of schema changes quickly without waiting for
> agreement, that should get you there.
>
> Kind Regards,
> Brandon


Re: dtests to reproduce the schema disagreement

2022-08-08 Thread Jeff Jirsa
Which (of the many) schema disagreement issue(s)?



On Mon, Aug 8, 2022 at 3:29 PM Cheng Wang via dev wrote:

> Thank you for the reply, Brandon! It is helpful!
>
> I was thinking of creating a cluster with 2 nodes and having two
> concurrent CREATE TABLE statements running. But the test will be flaky as
> there is no guarantee that the query runs before the schema agreement has
> been reached.
> Any ideas for that?
>
> Thanks,
> Cheng


Re: dtests to reproduce the schema disagreement

2022-08-08 Thread Cheng Wang via dev
Jeff,

The issue I was trying to address is that when two CREATE TABLE
queries run on two coordinator nodes concurrently, we might end up with
2 schema versions that never get resolved automatically, because the
table id is a random TimeUUID.



On Mon, Aug 8, 2022 at 3:54 PM Jeff Jirsa  wrote:

> Which (of the many) schema disagreement issue(s)?


Re: dtests to reproduce the schema disagreement

2022-08-08 Thread Jeff Jirsa
I see. Then yes, make a cluster with at least 2 hosts, run the CREATE TABLE
on them at the same time. If you use the pause injection framework, you can
probably pause threads after the CFID is generated but before it's
broadcast.

If you make the CFID deterministic, you can avoid the race, but can run
into problems if you create/drop/create (a node that was down during the
drop may resurrect data)

If you leave the CFID non-deterministic, the only way you're going to get
safety is a global ordering or transactional system, which more or less
reduces down to https://issues.apache.org/jira/browse/CASSANDRA-10699

Now, there are some things you can do to minimize risk along the way - you
could try to hunt down all of the possible races where in-memory state and
on-disk state diverge, create signals/log messages / warnings to make it
easier to detect, etc. But I'd be worried that any partial fixes will
complicate 10699 (either make the merge worse, or be outright removed
later), so it may be worth floating your proposed fix before you invest a
ton of time on it.
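
A sketch of that repro with the Python driver against a two-node ccm cluster
(addresses, keyspace and table are placeholders; without a pause injection,
hitting the window between CFID generation and broadcast stays
timing-dependent, so expect to loop or inject a pause to make it reliable):

    import threading
    from cassandra.cluster import Cluster
    from cassandra.policies import WhiteListRoundRobinPolicy

    def create_table(contact_point, results, idx):
        # Pin each thread to its own coordinator so the two CREATE TABLEs race.
        cluster = Cluster([contact_point],
                          load_balancing_policy=WhiteListRoundRobinPolicy([contact_point]),
                          max_schema_agreement_wait=0)
        session = cluster.connect("ks")
        session.execute("CREATE TABLE IF NOT EXISTS racy (id int PRIMARY KEY)")
        # Record the schema version this coordinator believes in.
        row = session.execute("SELECT schema_version FROM system.local").one()
        results[idx] = row.schema_version
        cluster.shutdown()

    results = [None, None]
    threads = [threading.Thread(target=create_table, args=(ip, results, i))
               for i, ip in enumerate(["127.0.0.1", "127.0.0.2"])]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # With a random CFID and an unlucky (or injected) pause, the two versions
    # differ and stay different; that mismatch is what the dtest should assert on.
    print(results)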

On Mon, Aug 8, 2022 at 3:57 PM Cheng Wang  wrote:

> Jeff,
>
> The issue I was trying to address is when there are two CREATE TABLE
> queries running on two coordinator nodes concurrently, it might end up with
> 2 schema versions and they would never get resolved automatically because
> table id is random TimeUUID.


Re: dtests to reproduce the schema disagreement

2022-08-08 Thread Cheng Wang via dev
Hi Jeff,

Thank you for your reply! Yes, we are working on generating a deterministic
CFID at table creation time. We will also most likely block the pattern of
drop and re-create to avoid the data resurrection issue, once we identify all
the potential risks with the deterministic id.
That's why I want to create some dtests that reproduce the schema
disagreement issue and show that the deterministic table id can avoid it.
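
One common way to make the id deterministic, sketched here purely as an
illustration (not the actual patch), is a name-based UUID derived from the
keyspace-qualified table name, so every coordinator computes the same id for
the same CREATE TABLE:

    import uuid

    # Any fixed namespace works as long as every node uses the same one;
    # this particular constant is made up for the illustration.
    CFID_NAMESPACE = uuid.UUID("b4a19a10-0000-0000-0000-000000000000")

    def deterministic_cfid(keyspace: str, table: str) -> uuid.UUID:
        # UUIDv5 hashes the name, so identical (keyspace, table) pairs always
        # yield the identical id, whichever coordinator runs the DDL.
        return uuid.uuid5(CFID_NAMESPACE, f"{keyspace}.{table}")

    print(deterministic_cfid("ks", "racy"))
    # The trade-off Jeff mentions: after DROP and re-CREATE the id is the same,
    # so sstables left over from the dropped table can resurrect old data.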

Thanks
Cheng

On Mon, Aug 8, 2022 at 4:46 PM Jeff Jirsa  wrote:

> I see. Then yes, make a cluster with at least 2 hosts, run the CREATE
> TABLE on them at the same time. If you use the pause injection framework,
> you can probably pause threads after the CFID is generated but before it's
> broadcast.
>
> If you make the CFID deterministic, you can avoid the race, but can run
> into problems if you create/drop/create (a node that was down during the
> drop may resurrect data)
>
> If you leave the CFID non-deterministic, the only way you're going to get
> safety is a global ordering or transactional system, which more or less
> reduces down to https://issues.apache.org/jira/browse/CASSANDRA-10699
>
> Now, there are some things you can do to minimize risk along the way - you
> could try to hunt down all of the possible races where in-memory state and
> on-disk state diverge, create signals/log messages / warnings to make it
> easier to detect, etc. But I'd be worried that any partial fixes will
> complicate 10699 (either make the merge worse, or be outright removed
> later), so it may be worth floating your proposed fix before you invest a
> ton of time on it.


Re: dtests to reproduce the schema disagreement

2022-08-08 Thread Konstantin Osipov via dev
* Cheng Wang via dev  [22/08/09 09:43]:

> I am working on improving the schema disagreement issue. I need some dtests
> which can reproduce the schema disagreement.  Anyone know if there are any
> existing tests for that? Or something similar?

CASSANDRA-10250 is a good start.

-- 
Konstantin Osipov, Moscow, Russia