Re: March 2015 QA retrospective

Benedict Elliott Smith Fri, 10 Apr 2015 16:10:58 -0700

>
> CASSANDRA-8459 <https://issues.apache.org/jira/browse/CASSANDRA-8459>
> "autocompaction" on reads can prevent memtable space reclaimation
>
> Can you link a ticket to CASSANDRA-9012 and characterize in a way we can
> try and implement how to make sufficiently large partitions, over
> sufficiently large periods of time?


> Maybe also enumerate the other permutations where this matters, like
> secondary indexes and the access patterns (scans).
>

Does this really qualify for its own ticket? This should just be one of
many configurations for stress' part in the new tests. We should perhaps
have an aggregation ticket where we ensure we enumerate the configuration
data points we've met that need to be covered. But, IMO at least, a
methodical exhaustive approach should be undertaken separately, and only be
corroborated against such a list to ensure it was done sufficiently well.
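
A minimal sketch of that corroboration step, assuming the configuration space is modelled as named dimensions with a few values each. The dimension names and values here are hypothetical placeholders, not an agreed list; the point is only that an exhaustive enumeration can be checked against the data points collected from past tickets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the dimension names and values are hypothetical,
// not an agreed list of stress configurations.
final class ConfigMatrix {

    // Builds every combination of one value per dimension (Cartesian product).
    static List<Map<String, String>> enumerate(Map<String, List<String>> dimensions) {
        List<Map<String, String>> result = new ArrayList<Map<String, String>>();
        result.add(new LinkedHashMap<String, String>());
        for (Map.Entry<String, List<String>> dimension : dimensions.entrySet()) {
            List<Map<String, String>> next = new ArrayList<Map<String, String>>();
            for (Map<String, String> partial : result) {
                for (String value : dimension.getValue()) {
                    Map<String, String> extended = new LinkedHashMap<String, String>(partial);
                    extended.put(dimension.getKey(), value);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    // True if some enumerated configuration includes every setting of a known data point.
    static boolean covers(List<Map<String, String>> configs, Map<String, String> dataPoint) {
        for (Map<String, String> config : configs)
            if (config.entrySet().containsAll(dataPoint.entrySet()))
                return true;
        return false;
    }

    // Hypothetical dimensions, loosely based on the cases discussed in this thread.
    static Map<String, List<String>> exampleDimensions() {
        Map<String, List<String>> dimensions = new LinkedHashMap<String, List<String>>();
        dimensions.put("partitionSize", Arrays.asList("small", "large"));
        dimensions.put("secondaryIndex", Arrays.asList("off", "on"));
        dimensions.put("access", Arrays.asList("point", "scan"));
        return dimensions;
    }

    public static void main(String[] args) {
        List<Map<String, String>> configs = enumerate(exampleDimensions());
        Map<String, String> seen = new LinkedHashMap<String, String>();
        seen.put("partitionSize", "large");
        seen.put("access", "scan");
        System.out.println(configs.size() + " configurations; data point covered: "
                           + covers(configs, seen));
    }
}
```

A data point that `covers` rejects would indicate a dimension or value missing from the methodical enumeration.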


>
> CASSANDRA-8619 <https://issues.apache.org/jira/browse/CASSANDRA-8619> -
> using CQLSSTableWriter gives ConcurrentModificationException
>
> OK. I don't think the original fix meets our new definition of done since
> there was insufficient coverage, and in this case no regression test. To be
> done you would have to either implement the coverage or file a JIRA to add
> it.
>
> Can you file a ticket with as much detail as you can on what the test
> might look like and link it to CASSANDRA-9012?
>
>
Well, the goal posts have shifted a smidgen since then :)

I've already filed CASSANDRA-9163 and CASSANDRA-9164 (the former I have
linked to CASSANDRA-9012). These problems would trivially be caught by any
kind of randomized long testing of these utilities, basically.
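
For illustration, a minimal sketch of the general shape of randomized long testing: a seeded stream of random operations is applied both to the thing under test and to a trivially correct model, and any divergence is reported. It uses a plain `java.util.Map` as a stand-in; the class and method names are hypothetical, and a real test for CQLSSTableWriter or other tools would need its own operations and model.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Random;
import java.util.TreeMap;

// Illustrative sketch only: a Map stands in for the utility under test.
final class RandomizedModelTest {

    // Applies `steps` random put/remove/get operations to both the map under
    // test and a simple model. Returns the index of the first read that
    // diverges, `steps` if only the final contents differ, or -1 if all agree.
    static int run(long seed, int steps, Map<Integer, Integer> underTest) {
        Random random = new Random(seed); // seeded so failures can be replayed
        Map<Integer, Integer> model = new HashMap<Integer, Integer>();
        for (int i = 0; i < steps; i++) {
            int key = random.nextInt(64); // small key space forces collisions
            switch (random.nextInt(3)) {
                case 0:
                    int value = random.nextInt();
                    underTest.put(key, value);
                    model.put(key, value);
                    break;
                case 1:
                    underTest.remove(key);
                    model.remove(key);
                    break;
                default:
                    if (!Objects.equals(underTest.get(key), model.get(key)))
                        return i;
            }
        }
        return underTest.equals(model) ? -1 : steps;
    }

    public static void main(String[] args) {
        long seed = System.nanoTime();
        int failedAt = run(seed, 1000000, new TreeMap<Integer, Integer>());
        // Always report the seed: it is the only thing needed to reproduce a failure.
        System.out.println("seed=" + seed + " firstDivergence=" + failedAt);
    }
}
```

Logging the seed on every run is the important design choice, since it turns a rare random failure into a deterministic regression test.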

This does raise an interesting, but probably not significant downside to
the new approach: I fixed this ticket because somebody mentioned to me that
it was hurting them, and I saw a quick and easy fix. The testing would not
be quick and easy, so I am unlikely to volunteer to patch quick fixes in
the new world order. This will certainly lead to higher quality bug fixes,
but it may lead to fewer of them, and fewer instances of volunteer work to
help people out, because the overhead eats too much into the work you're
actually responsible for. This may lead to bug fixing being seen as much
more of a chore than it already can be. I don't say this to discourage the
new approach; it is just a thought that occurs to me off the back of this
specific discussion.


> CASSANDRA-8668 <https://issues.apache.org/jira/browse/CASSANDRA-8668> We
> don't enforce offheap memory constraints; regression introduced by 7882
>
> We need to note somewhere that the kitchen sink test needs to insert large
> columns. How would it detect that the constraint was violated?


It would fall over with an OOM.


> I am starting to think we need a google doc for kitchen sink test wish
> listing and design discussion rather than scattering bits about it in JIRA.
>

Agreed.



> CASSANDRA-8719 <https://issues.apache.org/jira/browse/CASSANDRA-8719>
> Using thrift HSHA with offheap_objects appears to corrupt data
>
> Can you file a ticket for having the kitchen sink tests be configurable to
> run against all client access paths? Linked to 9012 for now?
>

This only requires unit testing or dtests to be run this way. However, for
the kitchen sink tests this is just another dimension in the configuration
state space, which IMO should be addressed as a whole methodically. Perhaps
we should file a central JIRA, or the Google doc you suggested, for
tracking all of these data points?


> CASSANDRA-8726 <https://issues.apache.org/jira/browse/CASSANDRA-8726>
> throw OOM in Memory if we fail to allocate OOM
>
> Can you create a ticket for this? I think that testing each allocation is
> not realistic in the sense that they don't fail in isolation. The JVM
> itself can ruin our day in OOM conditions as well. There is also heap OOM
> vs native memory OOM. It's worth some thought as to what the best bang for
> the buck testing strategy is going to be.
>

That's a bit of a different scope to the original problem, since in those
instances the VM explicitly throws an OOM. We can fault injection test both
of these scenarios, though, and I've already filed CASSANDRA-9165 for this.
I have commented on the ticket so that these scenarios are amongst those
explicitly considered when we address it, but I expect the scope of that
ticket to be very broad, and probably introduce its own entire class of
subtickets.
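
As a rough illustration of the fault injection idea for the allocation case, here is a minimal sketch. It assumes a hypothetical `Allocator` interface that, like malloc, returns 0 on failure; `FailingAllocator` and `CheckedMemory` are invented names for this example and are not Cassandra classes.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical interface: returns a native address, or 0 if allocation failed.
interface Allocator {
    long allocate(long bytes);
}

// Test-only wrapper that injects a failure on the Nth call (1-based).
final class FailingAllocator implements Allocator {
    private final Allocator delegate;
    private final long failOnCall;
    private final AtomicLong calls = new AtomicLong();

    FailingAllocator(Allocator delegate, long failOnCall) {
        this.delegate = delegate;
        this.failOnCall = failOnCall;
    }

    @Override
    public long allocate(long bytes) {
        // Simulate the allocator failing by returning 0 instead of an address.
        return calls.incrementAndGet() == failOnCall ? 0L : delegate.allocate(bytes);
    }
}

// The behaviour under test: a zero address must surface as an OutOfMemoryError
// rather than being used as if it were a valid pointer.
final class CheckedMemory {
    static long allocateOrThrow(Allocator allocator, long bytes) {
        long peer = allocator.allocate(bytes);
        if (peer == 0L)
            throw new OutOfMemoryError("failed to allocate " + bytes + " bytes");
        return peer;
    }
}
```

A test would wrap a stub allocator, sweep `failOnCall` across each call site reached by a workload, and assert that every injected failure surfaces as an OOM followed by a clean shutdown.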


> Thanks,
> Ariel
>
> On Fri, Apr 10, 2015 at 8:04 AM, Benedict Elliott Smith <
> [email protected]> wrote:
>
> > TL;DR: "Kitchen sink" (aggressive randomised stress with subsystem
> > correctness) tests; commitlog/memtable isolated correctness stress testing;
> > improved tool/utility testing; internal structural changes to prevent
> > occurrence (delivered); fault injection testing. Filed #916[1-5]
> >
> > CASSANDRA-7704 <https://issues.apache.org/jira/browse/CASSANDRA-7704> Benedict
> > FileNotFoundException during STREAM-OUT triggers 100% CPU usage Streaming
> >
> > This particular class of bug should be near impossible, due to structural
> > changes beginning with 7705. For testing such an uncommon race condition,
> > we would hope it to be exhibited eventually by our kitchen sink aggressive
> > testing, but it would be a very uncommon event.
> >
> > CASSANDRA-8383 <https://issues.apache.org/jira/browse/CASSANDRA-8383>
> > Benedict Memtable
> > flush may expire records from the commit log that are in a later memtable
> > No regression test, no follow up ticket. Could/should this have been
> > reproducible as an actual bug?
> >
> > As stated on the ticket, we need to introduce rigorous randomized testing
> > of the commit log's correctness, both in isolation and in conjunction with
> > memtable flushing. This is not a trivial undertaking. Whether or not it
> > integrates with our kitchen sink tests is an open question, but I think
> > that might be difficult. I've filed #9162 to track this.
> >
> > CASSANDRA-8429 <https://issues.apache.org/jira/browse/CASSANDRA-8429>
> > Benedict
> > Some keys unreadable during compaction
> >
> > Running stress in CI would have caught this, and we're going to do that.
> >
> > CASSANDRA-8459 <https://issues.apache.org/jira/browse/CASSANDRA-8459>
> > Benedict
> > "autocompaction" on reads can prevent memtable space reclaimation
> >
> > Kitchen sink tests with sufficiently large partitions written over a
> > sufficiently large period of time. Same risk present for e.g. secondary
> > indexes, so aggressive coverage of these, including scans etc., is important.
> >
> > CASSANDRA-8499 <https://issues.apache.org/jira/browse/CASSANDRA-8499>
> > Benedict
> > Ensure SSTableWriter cleans up properly after failure
> > Testing error paths? Any way to test things in a loop to detect leaks?
> >
> > This kind of leak is now reported, and autocorrected for, so detection is
> > much easier. However, fault injection testing (if we can find a good way for
> > license compliance) as I started in CASSANDRA-8568 would help a lot also.
> >
> > CASSANDRA-8513 <https://issues.apache.org/jira/browse/CASSANDRA-8513>
> > Benedict
> > SSTableScanner may not acquire reference, but will still release it when
> > closed
> > This had a user visible component, what test could have caught it before
> > release?
> >
> > Again, this cannot happen now, due to internal structural changes to
> > prevent it.
> >
> > CASSANDRA-8619 <https://issues.apache.org/jira/browse/CASSANDRA-8619>
> > Benedict
> > using CQLSSTableWriter gives ConcurrentModificationException
> >
> > Some better testing of our tools and utilities. The fix for this introduced
> > its own bug, by the looks of it, which we also did not catch. Better
> > (randomized long testing) coverage of these tools would help in both fixing
> > it and ensuring it doesn't return again.
> >
> > CASSANDRA-8632 <https://issues.apache.org/jira/browse/CASSANDRA-8632>
> > Benedict
> > cassandra-stress only generating a single unique row
> >
> > This was caught prior to release by developer use, which is currently the
> > only QA we have for stress. Some basic testing would certainly be helpful,
> > but there is a tension between getting stress to do useful things, and
> > testing that it does so, since there are finite resources available to us.
> > The utility is currently probably more pressing, given the eyes it gets
> > when it is used. With more complex validation arriving, in conjunction with
> > performance profile histories and its generally being employed as a dev
> > tool, it should somewhat self test (major changes in performance profiles
> > should be explicable else investigated, and critical mistakes should often
> > lead to failed validation, or to users noticing a problem), and I expect
> > this will have to suffice for the interim.
> >
> >
> > CASSANDRA-8668 <https://issues.apache.org/jira/browse/CASSANDRA-8668>
> > Benedict We don't enforce offheap memory constraints; regression
> > introduced by 7882
> >
> > This would have been easily found with a kitchen sink test that was
> > inserting large columns. We should probably also have some specific tests
> > for ensuring the allocation tracking is exactly correct (by inspecting the
> > whole object graph independently, and reconciling the values), but this is
> > fiddly and of low immediate yield.
> >
> > CASSANDRA-8719 <https://issues.apache.org/jira/browse/CASSANDRA-8719>
> > Benedict
> > Using thrift HSHA with offheap_objects appears to corrupt data
> >
> > *Untested configuration before release, this would be straightforward if we
> > ran with it?*
> > Spot on.
> >
> > CASSANDRA-8726 <https://issues.apache.org/jira/browse/CASSANDRA-8726>
> > Benedict
> > throw OOM in Memory if we fail to allocate OOM
> >
> > Kind of tricky to induce an OOM; in general we consider an OOM to put C*
> > into an unstable state as well, so correct behaviour is just to shut down,
> > making it potentially tricky to test all avenues that could throw OOM.
> > Possibly the best route is to modify the byte code to corrupt the return
> > value to zero for each possible avenue we can reach it by, and confirm that
> > shutdown occurs safely.
> >
>
