Re: secondary index table - tombstones surviving compactions

2018-05-31 Thread Roman Bielik
Hi Jordan,

thank you for accepting this as an issue.
I will follow the ticket.

Best regards,
Roman


On 30 May 2018 at 11:40, Jordan West  wrote:

> Hi Roman,
>
> I was able to reproduce the issue you described. I filed
> https://issues.apache.org/jira/browse/CASSANDRA-14479. More details there.
>
> Thanks for reporting!
> Jordan
>
>
> On Wed, May 23, 2018 at 12:06 AM, Roman Bielik <
> roman.bie...@openmindnetworks.com> wrote:
>
> > Hi,
> >
> > I apologise for the late response; I wanted to run some further tests so
> > I could provide more information to you.
> >
> > @Jeff, no, I don't set the "only_purge_repaired_tombstones" option, so it
> > should be at its default: false. And no, I don't run repairs during the
> > tests.
> >
> > @Eric, I understand that rapid deletes/inserts are something of an
> > antipattern; nevertheless, I'm not experiencing any problems with them
> > (except for the secondary indices).
> >
> > Update: I ran a new test in which I explicitly delete the indexed columns
> > first, and then delete the whole row at the end.
> > Surprisingly, this test scenario works fine: using nodetool flush +
> > compact (to expedite the test) always seems to purge the index table.
> > So that's great, because I seem to have found a workaround. On the other
> > hand, could this indicate a bug in Cassandra - a leaking index table?
> >
> > Test details:
> > Create table with LeveledCompactionStrategy;
> > 'tombstone_compaction_interval': 60; gc_grace_seconds=60
> > There are two indexed columns for comparison: column1, column2
> > Insert keys {1..x} with random values in column1 & column2
> > Delete {key:column2} (but not column1)
> > Delete {key}
> > Repeat n-times from the inserts
> > Wait 1 minute
> > nodetool flush
> > nodetool compact (sometimes compact  
> > nodetool cfstats
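
The test steps above can be sketched in CQL roughly as follows (the
keyspace/table names and column types are illustrative assumptions; the
original test uses the Thrift interface with compact storage):

```cql
-- Schema with the aggressive tombstone settings from the test details
CREATE TABLE ks.t (
    key     int PRIMARY KEY,
    column1 text,
    column2 text
) WITH COMPACT STORAGE
  AND compaction = { 'class': 'LeveledCompactionStrategy',
                     'tombstone_compaction_interval': 60 }
  AND gc_grace_seconds = 60;

CREATE INDEX t_column1_idx ON ks.t (column1);
CREATE INDEX t_column2_idx ON ks.t (column2);

-- Per iteration, for each key k in 1..x (values are random):
INSERT INTO ks.t (key, column1, column2) VALUES (:k, :r1, :r2);
DELETE column2 FROM ks.t WHERE key = :k;  -- delete one indexed column only
DELETE FROM ks.t WHERE key = :k;          -- then delete the whole row
```

After repeating n times and waiting out gc_grace, the test runs nodetool
flush, nodetool compact, and nodetool cfstats as listed above.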
> >
> > What I observe is that the data table is empty, the column2 index table
> > is also empty, and the column1 index table has a non-zero (leaked) "space
> > used" and "estimated rows".
> >
> > Roman
> >
> >
> > On 18 May 2018 at 16:13, Jeff Jirsa  wrote:
> >
> > > This would matter for the base table, but would be less likely for the
> > > secondary index, where the partition key of the index table is the
> > > indexed value from the base row.
> > >
> > > Roman: there’s a config option related to only purging repaired
> > tombstones
> > > - do you have that enabled ? If so, are you running repairs?
> > >
> > > --
> > > Jeff Jirsa
> > >
> > >
> > > > On May 18, 2018, at 6:41 AM, Eric Stevens  wrote:
> > > >
> > > > The answer to Question 3 is "yes."  One of the more subtle points
> > > > about tombstones is that Cassandra won't remove them during
> > > > compaction if there is a bloom filter on any SSTable on that replica
> > > > indicating that it contains the same partition (not primary) key,
> > > > even if the tombstone is older than gc_grace and would otherwise be
> > > > a candidate for cleanup.
> > > >
> > > > If you're recycling partition keys, your tombstones may never be
> > > > able to be cleaned up, because in this scenario there is a high
> > > > probability that an SSTable not involved in that compaction also
> > > > contains the same partition key, and so compaction cannot have
> > > > confidence that it's safe to remove the tombstone (it would have to
> > > > fully materialize every record in the compaction, which is too
> > > > expensive).
> > > >
> > > > In general it is an antipattern in Cassandra to write to a given
> > > partition
> > > > indefinitely for this and other reasons.
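
The purge rule Eric describes can be illustrated with a small Python
sketch (illustrative pseudologic only, not Cassandra's actual code; all
names here are invented for the example):

```python
# Illustrative sketch of the purge rule described above: a tombstone past
# gc_grace is only droppable when no SSTable *outside* the compaction may
# still hold the same partition key.

class SSTable:
    def __init__(self, keys):
        self.keys = set(keys)

    def may_contain(self, key):
        # Stand-in for a bloom filter lookup (real filters allow false
        # positives, which is why overlap forces keeping the tombstone).
        return key in self.keys

def tombstone_droppable(key, age, gc_grace, compacting, all_sstables):
    """An expired tombstone is droppable only if no SSTable outside the
    compaction may still contain the same partition key."""
    if age <= gc_grace:
        return False  # still within the grace period
    outside = [s for s in all_sstables if s not in compacting]
    return not any(s.may_contain(key) for s in outside)

a = SSTable({"k1"})
b = SSTable({"k1", "k2"})
c = SSTable({"k3"})

# Compacting only 'a': SSTable 'b' (outside the compaction) may contain
# "k1", so the tombstone survives even though it is past gc_grace.
print(tombstone_droppable("k1", 700, 600, [a], [a, b, c]))     # False
# Once 'b' joins the compaction, no outside SSTable claims "k1".
print(tombstone_droppable("k1", 700, 600, [a, b], [a, b, c]))  # True
```

Recycling partition keys keeps re-creating exactly this overlap, which is
why the tombstones in question can survive indefinitely.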
> > > >
> > > > On Fri, May 18, 2018 at 2:37 AM Roman Bielik <
> > > > roman.bie...@openmindnetworks.com> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> I have a Cassandra 3.11 table (with compact storage) and I'm using
> > > >> secondary indices with rather unique data stored in the indexed
> > > >> columns. There are many inserts and deletes, so in order to avoid
> > > >> tombstones piling up I'm re-using primary keys from a pool (which
> > > >> works fine).
> > > >> I'm aware that this design pattern is not ideal, but for now I cannot
> > > >> change it easily.
> > > >>
> > > >> The problem is, the size of 2nd index tables keeps growing (filled
> > with
> > > >> tombstones) no matter what.
> > > >>
> > > >> I tried some aggressive configuration (just for testing) in order to
> > > >> expedite the tombstone removal, but with little to zero effect:
> > > >> COMPACTION = { 'class': 'LeveledCompactionStrategy',
> > > >> 'unchecked_tombstone_compaction': 'true',
> > > >> 'tombstone_compaction_interval': 600 }
> > > >> gc_grace_seconds = 600
> > > >>
> > > >> I'm aware that materialized views could perhaps provide a solution
> > > >> to this, but I'm bound to the Thrift interface, so I cannot use
> > > >> them.
> > > >>
> > > >> Questions:
> > > >> 1. Is there something I'm missing? How come compaction does not
> > > >> remove the obsolete index entries/tombstones from the secondary
> > > >> index tables? Can I trigger the cleanup manually somehow?
> > > >> I have tried nodetool f

REMINDER: Apache EU Roadshow 2018 in Berlin is less than 2 weeks away!

2018-05-31 Thread sharan

Hello Apache Supporters and Enthusiasts

This is a reminder that our Apache EU Roadshow in Berlin is less than 
two weeks away and we need your help to spread the word. Please let your 
work colleagues, friends and anyone interested in attending know about 
our Apache EU Roadshow event.


We have a great schedule including tracks on Apache Tomcat, Apache HTTP 
Server, Microservices, Internet of Things (IoT) and Cloud Technologies. 
You can find more details at the link below:


https://s.apache.org/0hnG

Ticket prices will be going up on 8th June 2018, so please make sure 
that you register soon if you want to beat the price increase. 
https://foss-backstage.de/tickets


Remember that registering for the Apache EU Roadshow also gives you 
access to FOSS Backstage so you can attend any talks and workshops from 
both conferences. And don’t forget that our Apache Lounge will be open 
throughout the whole conference as a place to meet up, hack and relax.


We look forward to seeing you in Berlin!

Thanks
Sharan Foga,  VP Apache Community Development

http://apachecon.com/
@apachecon

PLEASE NOTE: You are receiving this message because you are subscribed 
to a user@ or dev@ list of one or more Apache Software Foundation projects.


Planning to port cqlsh to Python 3 (CASSANDRA-10190)

2018-05-31 Thread Patrick Bannister
I propose porting cqlsh and cqlshlib to Python 3. End-of-life for Python 2.7
is currently planned for 1 January 2020. We should prepare to port the tool
to a version of Python that will be officially supported.

I'm seeking input on three questions:
- Should we port it to straight Python 3, or Python 2/3 cross compatible?
- How much more testing is needed?
- Can we wait until after 4.0 for this?

I have an implementation to go with my proposal. In parallel with getting
the dtest cqlsh_tests working again, I ported cqlsh.py and cqlshlib to
Python 3. It passes almost all of the dtests and the unittests, so it's in
pretty good shape, although it's not 100% done (more on that below).

*Python 3 or 2/3 cross compatible?* There are plenty of examples of Python
libraries that are compatible with both Python 2 and Python 3 (notably the
Cassandra Python driver), so I think this is achievable. The question is,
do we want to pay the price of cross compatibility? If we write cqlsh to be
2/3 cross compatible, we'll carry a long term technical debt to maintain
that feature. The value of continuing to support Python 2 will diminish
over time. However, a cross compatible implementation may ease the
transition for some users, especially if there are users who have made
significant custom modifications to the Python 2.7 implementation of cqlsh,
so I think we must at least consider the question.
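
As a concrete illustration of that long-term cost, 2/3 cross-compatible
code typically has to route every bytes/text boundary through shims like
the following (an illustrative sketch, not actual cqlsh code; the helper
names are invented):

```python
# Typical Python 2/3 cross-compatibility shims: __future__ imports plus a
# helper that normalizes bytes vs. text at every boundary.
from __future__ import print_function, unicode_literals
import sys

PY3 = sys.version_info[0] >= 3
text_type = str if PY3 else unicode  # noqa: F821 (py2-only name)

def ensure_text(value, encoding="utf-8"):
    """Decode bytes to text so the rest of the code sees one string type."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

print(isinstance(ensure_text(b"SELECT * FROM t;"), text_type))  # True on 2 and 3
```

Every call site that touches I/O, sockets, or the driver needs this kind
of normalization, which is the maintenance debt being weighed here.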

*What additional testing is needed before we could release it?* I used
coverage.py to check on the code coverage of our existing dtest cqlsh_tests
and cqlshlib unittests. There are several blind spots in our current
testing that should be addressed before we release a port of cqlsh. Details
of this are available on JIRA ticket CASSANDRA-10190 in the attachment
coverage_notes.txt.
Beyond that, I've made no efforts to test on platforms other than Ubuntu
and CentOS, so Windows testing is needed if we're making efforts to support
Windows. It would also be preferable for some real users to try out the
port before it replaces the Python 2.7 cqlsh in a release.

Besides this, there are a couple of test failures I'm still trying to
figure out, notably tests involving user defined map types (a task made
more interesting by Python's general lack of support for immutable map
types).
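
For context on that last point, the Python 3 standard library offers only
approximations of an immutable map (a sketch of common workarounds, not
cqlsh code):

```python
# MappingProxyType gives a read-only *view* of a dict, but no frozen,
# hashable dict type exists in the stdlib.
from types import MappingProxyType

frozen = MappingProxyType({"a": 1, "b": 2})
print(frozen["a"])  # reads work like a normal dict

try:
    frozen["a"] = 99  # writes are rejected
except TypeError as err:
    print("immutable:", err)

# A mappingproxy is not hashable, so for use as a dict key or set member a
# sorted tuple of items is a common workaround:
hashable_key = tuple(sorted({"a": 1, "b": 2}.items()))
print(hashable_key)  # (('a', 1), ('b', 2))
```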

*Can we wait until after 4.0 for this?* I don't think it's reasonable to
try to release this with 4.0 given the current consensus around a feature
freeze in the next few months. My feeling is that our testers and
committers are already very busy with the currently planned changes for
4.0. I recommend planning toward a release to occur after 4.0. If we run up
against Python 2.7 EOL before we can cut the next release, we could
consider releasing a ported cqlsh independently, for installation through
distutils or pip.

Patrick Bannister