If you guys are still seeing the problem, it would be good to have a JIRA written up, as all the ones linked were fixed in 2017 and 2015. CASSANDRA-13700 was found during our testing, and we haven't seen any other issues since fixing it.
-Jeremiah

> On Oct 22, 2018, at 10:12 PM, Sankalp Kohli <kohlisank...@gmail.com> wrote:
>
> No worries... I mentioned the issue, not the JIRA number.
>
>> On Oct 22, 2018, at 8:01 PM, Jeremiah D Jordan <jerem...@datastax.com> wrote:
>>
>> Sorry, maybe my spam filter got them or something, but I have never seen a
>> JIRA number mentioned in the thread before this one. I just looked back
>> through again to make sure, and this is the first email I have with one.
>>
>> -Jeremiah
>>
>>> On Oct 22, 2018, at 9:37 PM, sankalp kohli <kohlisank...@gmail.com> wrote:
>>>
>>> Here are some of the JIRAs which are marked fixed but did not actually fix
>>> the issue. We have tried fixing this with several patches. Maybe it will be
>>> fixed when Gossip is rewritten (CASSANDRA-12345). I should find or create a
>>> new JIRA, as this issue still exists.
>>> https://issues.apache.org/jira/browse/CASSANDRA-10366
>>> https://issues.apache.org/jira/browse/CASSANDRA-10089 (related to it)
>>>
>>> Also, the quote you are using was written as a follow-on email; I had
>>> already said which bug I was referring to:
>>>
>>> "Say you restarted all instances in the cluster and status for some host
>>> goes missing. Now when you start a host replacement, the new host won't
>>> learn about the host whose status is missing and the view of this host
>>> will be wrong."
>>>
>>> - CASSANDRA-10366
>>>
>>> On Mon, Oct 22, 2018 at 7:22 PM Sankalp Kohli <kohlisank...@gmail.com> wrote:
>>>
>>>> I will send the JIRAs for the bug which we thought we had fixed but which
>>>> still exists.
>>>>
>>>> Have you done any correctness testing on top of all these tests? Have you
>>>> run the tests on 1000-instance clusters?
>>>>
>>>> It is great that you have done these tests, and I am hoping the gossiping
>>>> snitch is good. Also, was there any Gossip bug fixed post 3.0? Maybe I am
>>>> seeing a bug which has since been fixed.
>>>>
>>>>> On Oct 22, 2018, at 7:09 PM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>>>>>
>>>>> Do you have a specific gossip bug that you have seen recently which caused
>>>>> a problem that would make this happen? Do you have a specific JIRA in
>>>>> mind? "We can't remove this because what if there is a bug" doesn't seem
>>>>> like a good enough reason to me. If that were a reason, we would never
>>>>> make any changes to anything.
>>>>>
>>>>> I think many people have seen PFS actually cause real problems, whereas
>>>>> with GPFS the issue being talked about is predicated on some theoretical
>>>>> gossip bug happening.
>>>>>
>>>>> In the past year at DataStax we have done a lot of testing on 3.0 and 3.11
>>>>> around adding nodes, adding DCs, replacing nodes, replacing racks, and
>>>>> replacing DCs, all while using GPFS, and as far as I know we have not seen
>>>>> any "lost" rack/DC information during such testing.
>>>>>
>>>>> -Jeremiah
>>>>>
>>>>>> On Oct 22, 2018, at 5:46 PM, sankalp kohli <kohlisank...@gmail.com> wrote:
>>>>>>
>>>>>> We will have similar issues with Gossip, but this will create more issues
>>>>>> as more things will rely on Gossip.
>>>>>>
>>>>>> I agree PFS should be removed, but I don't see how it can be with issues
>>>>>> like these, unless someone proves that it won't cause any issues.
>>>>>>
>>>>>> On Mon, Oct 22, 2018 at 2:21 PM Paulo Motta <pauloricard...@gmail.com> wrote:
>>>>>>
>>>>>>> I can understand keeping PFS for historical/compatibility reasons, but if
>>>>>>> gossip is broken I think you will have similar ring view problems during
>>>>>>> replace/bootstrap that would still occur with the use of PFS (such as
>>>>>>> missing tokens, since those are propagated via gossip), so that doesn't
>>>>>>> seem like a strong reason to keep it around.
>>>>>>>
>>>>>>> With PFS it's pretty easy to shoot yourself in the foot if you're not
>>>>>>> careful enough to keep identical files across nodes and to update them
>>>>>>> when adding nodes/DCs, so it seems to be less foolproof than other
>>>>>>> snitches. While the rejection of verbs to invalid replicas on trunk could
>>>>>>> address the concerns raised by Jeremy, this would only happen after the
>>>>>>> new node joins the ring, so you would need to re-bootstrap the node and
>>>>>>> lose all the work done in the original bootstrap.
>>>>>>>
>>>>>>> Perhaps one good reason to use PFS is the ability to easily package it
>>>>>>> across multiple nodes, as pointed out by Sean Durity on CASSANDRA-10745
>>>>>>> (which is also its Achilles' heel). To keep this ability, we could make
>>>>>>> GPFS compatible with the cassandra-topology.properties file, but reading
>>>>>>> only the dc/rack info about the local node (a rough sketch of this idea
>>>>>>> follows below).
>>>>>>>
>>>>>>> On Mon, Oct 22, 2018 at 16:58, sankalp kohli <kohlisank...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, it will happen. I am worried that DC or rack info can go missing in
>>>>>>>> the same way.
>>>>>>>>
>>>>>>>> On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <pauloricard...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>> the new host won't learn about the host whose status is missing and
>>>>>>>>>> the view of this host will be wrong.
>>>>>>>>>
>>>>>>>>> Won't this happen even with PropertyFileSnitch, as the token(s) for
>>>>>>>>> this host will be missing from gossip/system.peers?
>>>>>>>>>
>>>>>>>>> On Sat, Oct 20, 2018 at 00:34, Sankalp Kohli <kohlisank...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Say you restarted all instances in the cluster and status for some
>>>>>>>>>> host goes missing. Now when you start a host replacement, the new host
>>>>>>>>>> won't learn about the host whose status is missing and the view of
>>>>>>>>>> this host will be wrong.
>>>>>>>>>>
>>>>>>>>>> PS: I will be happy to be proved wrong, as I can then also start using
>>>>>>>>>> the Gossip snitch :)
>>>>>>>>>>
>>>>>>>>>>> On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Do you mean to say that during host replacement there may be a time
>>>>>>>>>>> when the old->new host isn't fully propagated and therefore wouldn't
>>>>>>>>>>> yet be in all system tables?
>>>>>>>>>>>
>>>>>>>>>>>> On Oct 17, 2018, at 4:20 PM, sankalp kohli <kohlisank...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> This is not the case during host replacement, correct?
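A minimal sketch of Paulo's suggestion above: a GPFS-style snitch that could still read cassandra-topology.properties, but only to look up the local node's own dc/rack, ignoring every other entry. The class and method names are hypothetical (not actual Cassandra APIs); only the "ip=DC:RACK" / "default=DC:RACK" entry format is the one the existing file already uses.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.util.Properties;

// Hypothetical helper, not an actual Cassandra class: reads
// cassandra-topology.properties but only resolves the local node's dc/rack.
public class LocalOnlyTopologySketch
{
    public static final class DcRack
    {
        public final String dc;
        public final String rack;
        DcRack(String dc, String rack) { this.dc = dc; this.rack = rack; }
    }

    // Entries look like "10.0.0.1=DC1:RAC1", plus an optional "default=DC1:RAC1".
    // We only ever look up localAddress, so stale entries for other nodes cannot
    // influence this node's view of anyone else.
    public static DcRack localDcRack(String propertiesPath, InetAddress localAddress) throws IOException
    {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(propertiesPath))
        {
            props.load(in);
        }

        String value = props.getProperty(localAddress.getHostAddress());
        if (value == null)
            value = props.getProperty("default");   // fall back to the default entry, if present
        if (value == null)
            return null;                            // caller would then use cassandra-rackdc.properties instead

        String[] parts = value.split(":");
        return new DcRack(parts[0].trim(), parts.length > 1 ? parts[1].trim() : "UNKNOWN_RACK");
    }
}

Restricting the lookup to the local entry is what would keep the packaging convenience without reintroducing the PFS problem: a stale entry for some other node in a bundled file could no longer change how this node places anyone else, since everyone else's dc/rack would still arrive via gossip.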
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <jeremiah.jor...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> As long as we are correctly storing such things in the system tables
>>>>>>>>>>>>> and reading them out of the system tables when we do not have the
>>>>>>>>>>>>> information from gossip yet, it should not be a problem. (As far as I
>>>>>>>>>>>>> know GPFS does this, but I have not done extensive code diving or
>>>>>>>>>>>>> testing to make sure all edge cases are covered there.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jeremiah
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <kohlisank...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Will GossipingPropertyFileSnitch not be vulnerable to Gossip bugs
>>>>>>>>>>>>>> where we lose hostId or some other fields when we restart C* for
>>>>>>>>>>>>>> large clusters (~1000 instances)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We should, but the 4.0 features that log/reject verbs to invalid
>>>>>>>>>>>>>>> replicas solve a lot of the concerns here.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Jeff Jirsa
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We have had PropertyFileSnitch for a long time, even though
>>>>>>>>>>>>>>>> GossipingPropertyFileSnitch is effectively a superset of what it
>>>>>>>>>>>>>>>> offers and is much less error prone. There are some unexpected
>>>>>>>>>>>>>>>> behaviors when things aren't configured correctly with PFS. For
>>>>>>>>>>>>>>>> example, if you replace nodes in one DC and add those nodes to
>>>>>>>>>>>>>>>> that DC's property files but not the other DCs' property files,
>>>>>>>>>>>>>>>> the resulting problems aren't very straightforward to
>>>>>>>>>>>>>>>> troubleshoot.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We could try to improve the resilience, fail-fast error checking,
>>>>>>>>>>>>>>>> and error reporting of PFS, but honestly, why wouldn't we
>>>>>>>>>>>>>>>> deprecate and remove PropertyFileSnitch? Are there reasons why
>>>>>>>>>>>>>>>> GPFS wouldn't be sufficient to replace it?
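To make the failure mode Jeremy describes above concrete, here is a toy illustration (made-up addresses, with in-memory maps standing in for each node's copy of cassandra-topology.properties; this is not the real PropertyFileSnitch code) of how two nodes holding different copies of the file end up with conflicting views of a replacement node:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

// Toy model only: with PFS every node answers "which DC is this peer in?"
// from its own copy of the topology file, so stale copies silently disagree.
public class StaleTopologyIllustration
{
    // One node's private view of the cluster topology (ip -> "dc:rack").
    static String dcOf(Map<String, String> topologyFile, InetAddress peer)
    {
        // Peers missing from the file fall through to the "default" entry,
        // which is where the surprising placement decisions come from.
        String entry = topologyFile.getOrDefault(peer.getHostAddress(),
                                                 topologyFile.getOrDefault("default", "UNKNOWN_DC:UNKNOWN_RACK"));
        return entry.split(":")[0];
    }

    public static void main(String[] args) throws UnknownHostException
    {
        InetAddress replacement = InetAddress.getByName("10.1.0.9"); // new node that replaced a dead one in DC1

        // DC1 operators updated their copy of the file...
        Map<String, String> dc1Copy = new HashMap<>();
        dc1Copy.put("10.1.0.9", "DC1:RAC1");
        dc1Copy.put("default", "DC1:RAC1");

        // ...but the copy shipped to DC2 nodes was never touched.
        Map<String, String> dc2Copy = new HashMap<>();
        dc2Copy.put("default", "DC2:RAC1");

        System.out.println("DC1 node thinks 10.1.0.9 is in: " + dcOf(dc1Copy, replacement)); // DC1
        System.out.println("DC2 node thinks 10.1.0.9 is in: " + dcOf(dc2Copy, replacement)); // DC2 (wrong)
    }
}

With GPFS each node publishes only its own dc/rack (read from its local cassandra-rackdc.properties) via gossip, so a stale file on one node cannot make it disagree about where some other node lives, which is exactly the cross-node inconsistency the snippet above shows for PFS.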
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org