Re: URGENT: CASSANDRA-14092 causes Data Loss
Hi, I think CASSANDRA-14227 has been pending for a long time now. Though the data loss issue was addressed in CASSANDRA-14092, Cassandra users are still prevented from using long TTLs (20+ years), as the maximum expiration timestamp that can be represented by the storage engine is 2038-01-19T03:14:06+00:00 (due to the encoding of localExpirationTime as an int32). As per the JIRA comments, the fix seems relatively simple. Considering the high impact/returns and the relatively little effort involved, are there any plans to prioritize this fix for upcoming releases?

Thanks Anuj

On Saturday, 27 January, 2018, 8:35:20 PM IST, Anuj Wadehra wrote:

Hi Paulo, Thanks for coming out with the emergency hot fix!! The patch will help many Cassandra users in saving their precious data. I think the criticality and urgency of the bug is very high. How can we make sure that as many Cassandra users as possible are alerted about the silent deletion problem? What are the formal ways of working for broadcasting such critical alerts? I still see that the JIRA is marked as a "Major" defect and not a "Blocker". What could be worse for a database than irrecoverable silent deletion of successfully inserted data? I hope you understand.

Thanks Anuj

On Fri, 26 Jan 2018 at 18:57, Paulo Motta wrote:

> I have serious concerns regarding reducing the TTL to 15 yrs. The patch will immediately break all existing applications in Production which are using 15+ yrs TTL.

In order to prevent applications from breaking, I will update the patch to automatically set the maximum TTL to '03:14:08 UTC 19 January 2038' when it overflows and log a warning as an initial measure. We will work on extending this limit or lifting this limitation, probably for the 3.0+ series due to the large-scale compatibility changes required on lower versions, but community patches are always welcome. Companies that cannot upgrade to a version with the proper fix will need to work around this limitation in some other way: run a batch job to delete old data periodically, perform deletes with timestamps in the future, etc.

> If it's a 32 bit timestamp, can't we just save/read localDeletionTime as
> unsigned int?

The proper fix will likely be along these lines, but this involves many changes throughout the codebase where localDeletionTime is consumed, plus extensive testing, reviewing, etc., so we're now looking into an emergency hot fix to prevent silent data loss while the permanent fix is not in place.

2018-01-26 6:27 GMT-02:00 Anuj Wadehra :

> Hi Jeff,
> One correction in my last message: "it may be more feasible to SUPPORT (not extend) the 20 year limit in Cassandra in 2.1/2.2".
> I completely agree that the existing 20 years of TTL support is okay for older versions.
>
> If I have understood your last message correctly, the upcoming patches are along the following lines:
>
> 1. New patches shall be released for 2.1, 2.2 and 3.x.
> 2. The patches for 2.1 & 2.2 would support the existing 20 year TTL limit and ensure that there is no data loss when 20 years is set as the TTL.
> 3. The patches for 2.1 and 2.2 are unlikely to update the sstable format.
> 4. The 3.x patches may even remove the 20 year TTL constraint (and extend TTL support beyond 2038).
>
> I think that the JIRA priority should be increased from "Major" to "Blocker", as the JIRA may cause unexpected data loss. Also, all impacted versions should be included in the JIRA. This will attract the due attention of all Cassandra users.
> Thanks Anuj
>
> On Friday 26 January 2018, 12:47:18 PM IST, Anuj Wadehra wrote:
>
> Hi Jeff,
>
> Thanks for the prompt action!
> I agree that patching an application MAY have a shorter life cycle than patching Cassandra in production. But, in the interest of the larger Cassandra user community, we should put in our best effort to avoid breaking all the affected applications in production. We should also consider that updating business logic as per the new 15 year TTL constraint may have business implications for many users. I have a limited understanding of the complexity of the code patch, but it may be more feasible to extend the 20 year limit in Cassandra in 2.1/2.2 rather than asking all impacted users to do an immediate business logic adaptation. Moreover, now that we officially support Cassandra 2.1 & 2.2 until the 4.0 release and provide critical fixes for 2.1, it becomes even more reasonable to provide this extremely critical patch for 2.1 & 2.2 (unless it's absolutely impossible). Many users still run Cassandra 2.1 and 2.2 in their most critical production systems.
>
> Thanks
> Anuj
>
> On Friday 26 January 2018, 11:06:30 AM IST, Jeff Jirsa wrote:
>
> We'll get patches out. They almost certainly aren't going to ch
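For illustration, a minimal standalone sketch of the "cap on overflow" behaviour Paulo describes above (cap the expiration time at the maximum representable value and log a warning) might look like the following. This is not the actual CASSANDRA-14092 hot fix; the class and method names, the exact cap value and the warning text are illustrative assumptions.

import java.util.concurrent.TimeUnit;

// Minimal sketch of capping an overflowing TTL, NOT the actual Cassandra patch.
public final class TtlOverflowCapSketch
{
    // Maximum localExpirationTime representable by a signed 32-bit epoch-seconds field.
    static final int MAX_EXPIRATION_EPOCH_SECONDS = Integer.MAX_VALUE; // 2038-01-19T03:14:07Z

    // Returns the expiration time in epoch seconds, capped at the 2038 limit.
    static int computeLocalExpirationTime(long nowMillis, int ttlSeconds)
    {
        long expiration = TimeUnit.MILLISECONDS.toSeconds(nowMillis) + ttlSeconds;
        if (expiration > MAX_EXPIRATION_EPOCH_SECONDS)
        {
            System.err.println("WARN: requested TTL overflows the maximum representable "
                    + "expiration timestamp; capping at 2038-01-19T03:14:07Z");
            return MAX_EXPIRATION_EPOCH_SECONDS;
        }
        return (int) expiration;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis();
        int oneDay = 24 * 60 * 60;
        int twentyYears = 20 * 365 * 24 * 60 * 60;
        System.out.println(computeLocalExpirationTime(now, oneDay));
        System.out.println(computeLocalExpirationTime(now, twentyYears));
    }
}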
Impact of removing compactions_in_progress folder
Often we face errors on Cassandra start regarding unfinished compactions, particularly when Cassandra was abruptly shut down. The problem gets resolved when we delete the /var/lib/cassandra/data/system/compactions_in_progress folder. Does deletion of this folder have any impact on the integrity of data or any other aspect?

Thanks Anuj Wadehra
Drawbacks of Major Compaction now that Automatic Tombstone Compaction Exists
Recently we faced an issue where every repair operation caused the addition of hundreds of SSTables (CASSANDRA-9146). In order to bring the situation under control and make sure reads were not impacted, we were left with no option but to run a major compaction to ensure that thousands of tiny SSTables were compacted.

Queries:

Does major compaction have any drawback now that automatic tombstone compaction exists (implemented in 1.2 via the tombstone_threshold sub-property, CASSANDRA-3442)? I understand that the huge SSTable created after a major compaction won't be compacted with new data any time soon, but is that a problem if purged data is removed via automatic tombstone compaction?

If major compaction results in a huge file, say 500 GB, what are its drawbacks?

If one big SSTable is a problem, is there any way of solving it? We tried running sstablesplit after the major compaction to split the big SSTable, but as the new SSTables were of the same size, they were compacted back into a single huge SSTable once Cassandra was started after executing sstablesplit.

Thanks Anuj Wadehra
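To make the question concrete, here is a minimal standalone sketch of the kind of check automatic tombstone compaction is based on: an SSTable whose estimated droppable-tombstone ratio exceeds tombstone_threshold becomes a candidate for a single-SSTable compaction. This is an illustrative model, not the actual compaction-strategy code; the 0.2 default and the helper names are assumptions.

// Sketch of the tombstone_threshold idea from CASSANDRA-3442 (not Cassandra source).
public final class TombstoneCompactionCheck
{
    // Assumed default threshold; configurable per table via the compaction sub-properties.
    private static final double TOMBSTONE_THRESHOLD = 0.2;

    // An SSTable becomes a candidate for single-SSTable compaction when its
    // estimated ratio of droppable tombstones exceeds the threshold.
    static boolean eligibleForTombstoneCompaction(long droppableTombstones, long totalCells)
    {
        if (totalCells == 0)
            return false;
        double ratio = (double) droppableTombstones / totalCells;
        return ratio > TOMBSTONE_THRESHOLD;
    }

    public static void main(String[] args)
    {
        // Even a huge SSTable left by a major compaction would be rewritten on its
        // own once enough of its data expires or is deleted and gc_grace passes.
        System.out.println(eligibleForTombstoneCompaction(30, 100)); // true
        System.out.println(eligibleForTombstoneCompaction(10, 100)); // false
    }
}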
Re: Drawbacks of Major Compaction now that Automatic Tombstone Compaction Exists
I haven't got much response regarding this on the user list, so posting it on the dev list too.

Thanks Anuj Wadehra

Sent from Yahoo Mail on Android

From: "Anuj Wadehra"
Date: Tue, 14 Apr, 2015 at 7:05 am
Subject: Drawbacks of Major Compaction now that Automatic Tombstone Compaction Exists

Recently we faced an issue where every repair operation caused the addition of hundreds of SSTables (CASSANDRA-9146). In order to bring the situation under control and make sure reads were not impacted, we were left with no option but to run a major compaction to ensure that thousands of tiny SSTables were compacted.

Queries:

Does major compaction have any drawback now that automatic tombstone compaction exists (implemented in 1.2 via the tombstone_threshold sub-property, CASSANDRA-3442)? I understand that the huge SSTable created after a major compaction won't be compacted with new data any time soon, but is that a problem if purged data is removed via automatic tombstone compaction?

If major compaction results in a huge file, say 500 GB, what are its drawbacks?

If one big SSTable is a problem, is there any way of solving it? We tried running sstablesplit after the major compaction to split the big SSTable, but as the new SSTables were of the same size, they were compacted back into a single huge SSTable once Cassandra was started after executing sstablesplit.

Thanks Anuj Wadehra
Repair with -pr and -local after CASSANDRA-7450
Hi, This is regarding the execution of repair -pr in a local DC. CASSANDRA-7313 disabled using -pr with the -local option. Later, CASSANDRA-7450 allowed it. But when I look at the code of Cassandra 2.0.13, I see that using -pr with -local is still illegal. How can we run a repair with the -pr and local DC options in 2.0.13/2.0.14? We don't want to run a full repair on each node of a DC. Moreover, we don't want to incur cross-DC repair.

public void forceKeyspaceRepairPrimaryRange(final String keyspaceName, boolean isSequential, boolean isLocal, final String... columnFamilies) throws IOException
{
    // primary range repair can only be performed for whole cluster.
    // NOTE: we should omit the param but keep API as is for now.
    if (isLocal)
    {
        throw new IllegalArgumentException("You need to run primary range repair on all nodes in the cluster.");
    }

    forceKeyspaceRepairRange(keyspaceName, getLocalPrimaryRanges(keyspaceName), isSequential ? RepairParallelism.SEQUENTIAL : RepairParallelism.PARALLEL, false, columnFamilies);
}

Thanks Anuj
Contribution to Cassandra Community and Branching Strategy
Hi, I want to submit patches for Cassandra JIRA tickets. I have some questions:

1. As per http://wiki.apache.org/cassandra/HowToContribute, we need to clone trunk and provide a patch against it. So, I need to understand how such a patch is going to be merged into 2.0.x and 2.1.x?
2. Where can I find the detailed branching strategy followed by Cassandra?
3. How do I indicate that I am looking into a JIRA? Should I just attach a patch when it is ready?

Thanks Anuj Wadehra
Re: Repair with -pr and -local after CASSANDRA-7450
Ok. So, does that mean that -pr is not usable in 2.0.x unless you are willing to pay the additional cost of cross-DC repair (which I think is not practical)?

Thanks Anuj

Sent from Yahoo Mail on Android

From: "Yuki Morishita"
Date: Tue, 19 May, 2015 at 1:06 am
Subject: Re: Repair with -pr and -local after CASSANDRA-7450

CASSANDRA-7450 is for version 2.1.1 and higher. So it is not available in 2.0.x.

On Mon, May 18, 2015 at 1:43 PM, Anuj Wadehra wrote:
> Hi,
> This is regarding the execution of repair -pr in a local DC. CASSANDRA-7313 disabled using -pr with the -local option. Later, CASSANDRA-7450 allowed it. But when I look at the code of Cassandra 2.0.13, I see that using -pr with -local is still illegal.
> How can we run a repair with the -pr and local DC options in 2.0.13/2.0.14? We don't want to run a full repair on each node of a DC. Moreover, we don't want to incur cross-DC repair.
>
> public void forceKeyspaceRepairPrimaryRange(final String keyspaceName, boolean isSequential, boolean isLocal, final String... columnFamilies) throws IOException
> {
>     // primary range repair can only be performed for whole cluster.
>     // NOTE: we should omit the param but keep API as is for now.
>     if (isLocal)
>     {
>         throw new IllegalArgumentException("You need to run primary range repair on all nodes in the cluster.");
>     }
>
>     forceKeyspaceRepairRange(keyspaceName, getLocalPrimaryRanges(keyspaceName), isSequential ? RepairParallelism.SEQUENTIAL : RepairParallelism.PARALLEL, false, columnFamilies);
> }
>
> Thanks Anuj

--
Yuki Morishita
t:yukim (http://twitter.com/yukim)
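For what it's worth, my understanding of the idea behind CASSANDRA-7450 (an assumption, not a quote of the 2.1.1 implementation) is that with -pr -local a node repairs only the primary ranges it owns when the ring is restricted to its own datacenter. Below is a simplified standalone model of that idea; it ignores vnodes and real replication strategies, and all names are invented.

import java.util.*;

// Simplified model of "primary ranges restricted to one DC"; NOT Cassandra source.
public final class LocalDcPrimaryRangeSketch
{
    static final class Node
    {
        final String name, dc;
        final long token;
        Node(String name, String dc, long token) { this.name = name; this.dc = dc; this.token = token; }
    }

    // Primary range of 'self' when only nodes in self's DC are considered:
    // the interval (previous DC-local token, self's token].
    static long[] primaryRangeWithinDc(List<Node> ring, Node self)
    {
        List<Node> dcRing = new ArrayList<>();
        for (Node n : ring)
            if (n.dc.equals(self.dc))
                dcRing.add(n);
        dcRing.sort(Comparator.comparingLong(n -> n.token));
        int i = dcRing.indexOf(self);
        long prev = dcRing.get((i - 1 + dcRing.size()) % dcRing.size()).token;
        return new long[] { prev, self.token };
    }

    public static void main(String[] args)
    {
        List<Node> ring = Arrays.asList(new Node("dc1-a", "DC1", 0), new Node("dc2-a", "DC2", 10),
                                        new Node("dc1-b", "DC1", 20), new Node("dc2-b", "DC2", 30));
        // dc1-b's DC-local primary range is (0, 20], even though dc2-a owns (0, 10] cluster-wide.
        System.out.println(Arrays.toString(primaryRangeWithinDc(ring, ring.get(2))));
    }
}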
Is deletion of compactions_in_progress files safe?
I need to understand how the compaction algorithm works at a high level.

1. What is the significance of the system.compactions_in_progress table and the files in the compactions_in_progress directory?
2. We are on 2.0.3. We frequently face scenarios where Cassandra fails to restart with an exception regarding unfinished compactions. The problem is how to deal with such a scenario till we upgrade. When we delete the files in the compactions_in_progress folder, we are able to start Cassandra. Is that a SAFE thing to do in PRODUCTION? Are there any better alternatives? We are concerned about data integrity.

Thanks Anuj
Behavior of nodetool stop compaction
Firing the nodetool stop command prints a CompactionInterruptedException stack trace.

1. The stack trace gives the impression of a compaction being killed forcefully; is nodetool stop COMPACTION a clean way to interrupt an ongoing MAJOR compaction?
2. How does the logic work? When nodetool stop is fired to stop MINOR compactions, are all minor compactions temporarily suspended and resumed automatically afterwards, or is the work done by in-progress compactions discarded?

Thanks Anuj
Re: Behavior of nodetool stop compaction
org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    ... 3 more

Thanks Anuj Wadehra

On Monday, 25 May 2015 2:44 PM, Jason Wee wrote:

Hello, could you paste the exception and also show what is the cassandra version running?

jason

On Sun, May 24, 2015 at 2:12 AM, Anuj Wadehra wrote:
> Firing nodetool stop command prints CompactionInterruptedException stacktrace.
>
> 1. Exception stacktrace gives an impression of killing a compaction forcefully, is nodetool stop COMPACTION a clean way to interrupt an ongoing MAJOR compaction?
>
> 2. How the logic works? When nodetool stop is fired to stop MINOR compactions, are we temporarily suspending all minor compactions and these will resume automatically afterwards or the work done by in progress compactions is discarded ?
>
> Thanks
> Anuj
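As far as I understand it (this is an assumption based on the stack trace above, not a quote of the Cassandra source), nodetool stop works cooperatively: it sets a stop flag on the running compaction tasks, each task checks the flag as it iterates and throws CompactionInterruptedException, the partially written output is discarded, and the input SSTables stay untouched; interrupted minor compactions are simply re-selected by the compaction strategy later rather than resumed. A tiny standalone sketch of that pattern, with invented names:

// Sketch of cooperative compaction interruption; NOT the actual CompactionManager code.
public final class StopCompactionSketch
{
    static final class CompactionInterruptedException extends RuntimeException
    {
        CompactionInterruptedException(String msg) { super(msg); }
    }

    static final class CompactionTask
    {
        private volatile boolean stopRequested = false;

        void stop() { stopRequested = true; }   // what "nodetool stop" would trigger

        void run(int totalPartitions)
        {
            int written = 0;
            for (int p = 0; p < totalPartitions; p++)
            {
                if (stopRequested)
                    // Partial output would be deleted here; source SSTables are untouched.
                    throw new CompactionInterruptedException("Compaction interrupted after " + written + " partitions");
                written++; // stand-in for merging and writing one partition
            }
        }
    }

    public static void main(String[] args) throws Exception
    {
        CompactionTask task = new CompactionTask();
        Thread t = new Thread(() -> {
            try { task.run(Integer.MAX_VALUE); }
            catch (CompactionInterruptedException e) { System.out.println(e.getMessage()); }
        });
        t.start();
        Thread.sleep(100);
        task.stop();
        t.join();
    }
}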
Re: Versioning policy?
Hi Jonathan, Thanks for the crisp communication regarding the tick tock release & EOL. I think its worth considering some points regarding EOL policy and it would be great if you can share your thoughts on below points: 1. EOL of a release should be based on "most stable"/"production ready" version date rather than "GA" date of subsequent major releases. 2. I think we should have "Formal EOL Announcement" on Apache Cassandra website. 3. "Formal EOL Announcement" should come at least 6 months before the EOL, so that users get reasonable time to upgrade. 4. EOL Policy (even if flexible) should be stated on Apache Cassandra website EOL thread on users mailing list ended with the conclusion of raising a Wishlist JIRA but I think above points are more about working on policy and processes rather than just a wish list. ThanksAnuj Sent from Yahoo Mail on Android On Thu, 14 Jan, 2016 at 10:57 pm, Jonathan Ellis wrote: Hi Maciek, First let's talk about the tick-tock series, currently 3.x. This is pretty simple: outside of the regular monthly releases, we will release fixes for critical bugs against the most recent bugfix release, the way we did recently with 3.1.1 for CASSANDRA-10822 [1]. No older tick-tock releases will be patched. Now, we also have three other release series currently being supported: 2.1.x: supported with critical fixes only until 4.0 is released, projected in November 2016 [2] 2.2.x: maintained until 4.0 is released 3.0.x: maintained for 6 months after 4.0, i.e. projected until May 2017 I will add this information to the releases page [3]. [1] https://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/201512.mbox/%3CCAKkz8Q3StqRFHfMgCMRYaaPdg+HE5N5muBtFVt-=v690pzp...@mail.gmail.com%3E [2] 4.0 will be an ordinary tick-tock release after 3.11, but we will be sunsetting deprecated features like Thrift so bumping the major version seems appropriate [3] http://cassandra.apache.org/download/ On Sun, Jan 10, 2016 at 9:29 PM, Maciek Sakrejda wrote: > There was a discussion recently about changing the Cassandra EOL policy on > the users list [1], but it didn't really go anywhere. I wanted to ask here > instead to clear up the status quo first. What's the current versioning > policy? The tick-tock versioning blog post [2] states in passing that two > major releases are maintained, but I have not found this as an official > policy stated anywhere. For comparison, the Postgres project lays this out > very clearly [3]. To be clear, I'm not looking for any official support, > I'm just asking for clarification regarding the maintenance policy: if a > critical bug or security vulnerability is found in version X.Y.Z, when can > I expect it to be fixed in a bugfix patch to that major version, and when > do I need to upgrade to the next major version. > > [1]: http://www.mail-archive.com/user@cassandra.apache.org/msg45324.html > [2]: http://www.planetcassandra.org/blog/cassandra-2-2-3-0-and-beyond/ > [3]: http://www.postgresql.org/support/versioning/ > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder, http://www.datastax.com @spyced
Repair when a replica is Down
Hi,

We are on 2.0.14, RF=3 in a 3 node cluster, and we use repair -pr. Recently, we observed that repair -pr fails on all nodes if a node is down. Then I found the JIRA https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-2290 where an intentional decision was taken to abort the repair if a replica is down. I need to understand the reasoning behind aborting the repair instead of proceeding with the available replicas.

I have the following concerns with the approach. We say that we have a fault-tolerant Cassandra system such that we can afford a single node failure, because RF=3 and we read/write at QUORUM. But when a node goes down and we are not sure how much time will be needed to restore it, the entire system health is in question, as gc_grace_period is approaching and we are not able to run repair -pr on any of the nodes. Then there is a dilemma:

Whether to remove the faulty node well before the gc grace period so that we get enough time to save the data by repairing the other two nodes? This may cause massive streaming, which may be unnecessary if we are able to bring back the faulty node before the gc grace period.

OR wait and hope that the issue will be resolved before gc grace time and we will have some buffer to run repair -pr on all nodes.

OR increase the gc grace period temporarily. Then we should have capacity planning to accommodate the extra storage needed for the extra gc grace that may be needed in node failure scenarios.

I also need to understand the recommended approach for maintaining a fault-tolerant system which can handle such node failures without hiccups.

Thanks Anuj
Re: Repair when a replica is Down
Hi I have intentionally posted this message to the dev mailing list instead of users list because its regarding a conscious design decision taken regarding a bug and I feel that dev team is the most appropriate team who could respond to it. Please let me know if there are better ways to get it addressed. ThanksAnuj Sent from Yahoo Mail on Android On Fri, 15 Jan, 2016 at 11:36 pm, Anuj Wadehra wrote: Hi We are on 2.0.14,RF=3 in a 3 node cluster. We use repair -pr . Recently, we observed that repair -pr for all nodes fails if a node is down. Then I found the JIRA https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-2290 where an intentional decision was taken to abort the repair if a replica is down. I need to understand the reasoning behind aborting the repair instead of proceeding with available replicas. I have following concerns with the approach: We say that we have a fault tolerant Cassandra system such that we can afford single node failure because RF=3 and we read/write at QUORUM.But when a node goes down and we are not sure how much time will be needed to restore the node, entire system health is in question as gc_grace_period is approaching and we are not able to run repair -pr on any of the nodes. Then there is a dilemma:Whether to remove the faulty node well before gc grace period so that we get enough time to save data by repairing other two nodes? This may cause massive streaming which may be unnecessary if we are able to bring back the faulty node before gc grace period. OR Wait and hope that the issue will be resolved before gc grace time and we will have some buffer to run repair -pr on all nodes. OR Increase the gc grace period temporarily. Then we should have capacity planning to accomodate the extra storage needed for extra gc grace that may be needed in case of node failure scenarios. I need to understand the recommeded approach too for maintaing a fault tolerant system which can handle such node failures without hiccups. ThanksAnuj
Re: Versioning policy?
Hi Jonathan It would be really nice if you could share your thoughts on the four points raised regarding the Cassandra EOL process. I think similar things happen for other open source products and it would be really nice if we could streamline such things for Apache Cassandra. ThanksAnuj Sent from Yahoo Mail on Android On Thu, 14 Jan, 2016 at 11:28 pm, Anuj Wadehra wrote: Hi Jonathan, Thanks for the crisp communication regarding the tick tock release & EOL. I think its worth considering some points regarding EOL policy and it would be great if you can share your thoughts on below points: 1. EOL of a release should be based on "most stable"/"production ready" version date rather than "GA" date of subsequent major releases. 2. I think we should have "Formal EOL Announcement" on Apache Cassandra website. 3. "Formal EOL Announcement" should come at least 6 months before the EOL, so that users get reasonable time to upgrade. 4. EOL Policy (even if flexible) should be stated on Apache Cassandra website EOL thread on users mailing list ended with the conclusion of raising a Wishlist JIRA but I think above points are more about working on policy and processes rather than just a wish list. ThanksAnuj Sent from Yahoo Mail on Android On Thu, 14 Jan, 2016 at 10:57 pm, Jonathan Ellis wrote: Hi Maciek, First let's talk about the tick-tock series, currently 3.x. This is pretty simple: outside of the regular monthly releases, we will release fixes for critical bugs against the most recent bugfix release, the way we did recently with 3.1.1 for CASSANDRA-10822 [1]. No older tick-tock releases will be patched. Now, we also have three other release series currently being supported: 2.1.x: supported with critical fixes only until 4.0 is released, projected in November 2016 [2] 2.2.x: maintained until 4.0 is released 3.0.x: maintained for 6 months after 4.0, i.e. projected until May 2017 I will add this information to the releases page [3]. [1] https://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/201512.mbox/%3CCAKkz8Q3StqRFHfMgCMRYaaPdg+HE5N5muBtFVt-=v690pzp...@mail.gmail.com%3E [2] 4.0 will be an ordinary tick-tock release after 3.11, but we will be sunsetting deprecated features like Thrift so bumping the major version seems appropriate [3] http://cassandra.apache.org/download/ On Sun, Jan 10, 2016 at 9:29 PM, Maciek Sakrejda wrote: > There was a discussion recently about changing the Cassandra EOL policy on > the users list [1], but it didn't really go anywhere. I wanted to ask here > instead to clear up the status quo first. What's the current versioning > policy? The tick-tock versioning blog post [2] states in passing that two > major releases are maintained, but I have not found this as an official > policy stated anywhere. For comparison, the Postgres project lays this out > very clearly [3]. To be clear, I'm not looking for any official support, > I'm just asking for clarification regarding the maintenance policy: if a > critical bug or security vulnerability is found in version X.Y.Z, when can > I expect it to be fixed in a bugfix patch to that major version, and when > do I need to upgrade to the next major version. > > [1]: http://www.mail-archive.com/user@cassandra.apache.org/msg45324.html > [2]: http://www.planetcassandra.org/blog/cassandra-2-2-3-0-and-beyond/ > [3]: http://www.postgresql.org/support/versioning/ > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder, http://www.datastax.com @spyced
Re: Versioning policy?
I was not referring to Enterprise support here. When I said Open source "product" by mistake, I was just referring to some other Apache open source projects like Apache Cassandra where you get EOL announcements, info etc on the main Apache web site. I think all four points are very relevant in context of an Open source project and thats why I wanted to thoughts on these points. ThanksAnuj Sent from Yahoo Mail on Android On Sun, 17 Jan, 2016 at 1:43 am, Michael Kjellman wrote: Correct, this is an open source project. If you want a Enterprise support story Datastax has an Enterprise option for you. > On Jan 16, 2016, at 11:19 AM, Anuj Wadehra wrote: > > Hi Jonathan > > It would be really nice if you could share your thoughts on the four points > raised regarding the Cassandra EOL process. I think similar things happen for > other open source products and it would be really nice if we could streamline > such things for Apache Cassandra. > > ThanksAnuj > > Sent from Yahoo Mail on Android > > On Thu, 14 Jan, 2016 at 11:28 pm, Anuj Wadehra >wrote: Hi Jonathan, > Thanks for the crisp communication regarding the tick tock release & EOL. > I think its worth considering some points regarding EOL policy and it would > be great if you can share your thoughts on below points: > 1. EOL of a release should be based on "most stable"/"production ready" > version date rather than "GA" date of subsequent major releases. > 2. I think we should have "Formal EOL Announcement" on Apache Cassandra > website. > 3. "Formal EOL Announcement" should come at least 6 months before the EOL, so > that users get reasonable time to upgrade. > 4. EOL Policy (even if flexible) should be stated on Apache Cassandra website > > EOL thread on users mailing list ended with the conclusion of raising a > Wishlist JIRA but I think above points are more about working on policy and > processes rather than just a wish list. > > ThanksAnuj > > > > Sent from Yahoo Mail on Android > > On Thu, 14 Jan, 2016 at 10:57 pm, Jonathan Ellis wrote: >Hi Maciek, > > First let's talk about the tick-tock series, currently 3.x. This is pretty > simple: outside of the regular monthly releases, we will release fixes for > critical bugs against the most recent bugfix release, the way we did > recently with 3.1.1 for CASSANDRA-10822 [1]. No older tick-tock releases > will be patched. > > Now, we also have three other release series currently being supported: > > 2.1.x: supported with critical fixes only until 4.0 is released, projected > in November 2016 [2] > 2.2.x: maintained until 4.0 is released > 3.0.x: maintained for 6 months after 4.0, i.e. projected until May 2017 > > I will add this information to the releases page [3]. > > [1] > https://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/201512.mbox/%3CCAKkz8Q3StqRFHfMgCMRYaaPdg+HE5N5muBtFVt-=v690pzp...@mail.gmail.com%3E > [2] 4.0 will be an ordinary tick-tock release after 3.11, but we will be > sunsetting deprecated features like Thrift so bumping the major version > seems appropriate > [3] http://cassandra.apache.org/download/ > >> On Sun, Jan 10, 2016 at 9:29 PM, Maciek Sakrejda wrote: >> >> There was a discussion recently about changing the Cassandra EOL policy on >> the users list [1], but it didn't really go anywhere. I wanted to ask here >> instead to clear up the status quo first. What's the current versioning >> policy? 
The tick-tock versioning blog post [2] states in passing that two >> major releases are maintained, but I have not found this as an official >> policy stated anywhere. For comparison, the Postgres project lays this out >> very clearly [3]. To be clear, I'm not looking for any official support, >> I'm just asking for clarification regarding the maintenance policy: if a >> critical bug or security vulnerability is found in version X.Y.Z, when can >> I expect it to be fixed in a bugfix patch to that major version, and when >> do I need to upgrade to the next major version. >> >> [1]: http://www.mail-archive.com/user@cassandra.apache.org/msg45324.html >> [2]: http://www.planetcassandra.org/blog/cassandra-2-2-3-0-and-beyond/ >> [3]: http://www.postgresql.org/support/versioning/ > > > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder, http://www.datastax.com > @spyced >
Re: Versioning policy?
HAPPY to see that Apache Cassandra web site has been updated to include EOL information :) Thanks !!! I have some queries on the updated content: 1. Earlier, Apache web site always used to show 2 Cassandra versions - one which is "most stable" (production-ready) and other for development use. Now, I can't see that "most stable"(production-ready) tag on any version. Site says "The latest tick-tock release is 3.2" As per the tick-tock logic, Does that mean 3.1 is the latest most stable Cassandra version available today as it had bug-fixes for 3.0 ? Is 3.1 production ready? If NO, then how would production users on earlier releases get indication on their next upgrade version e.g. 3.x or 2.2 ? 2. I am assuming that going forward EOL announcements will be published at the Apache web site before hand just like some other Apache projects do. Is that assumption valid? It will certainly help to get such insights before hand on Apache site so that community users can prepare their upgrade road map. ThanksAnuj On Sunday, 17 January 2016 12:48 AM, Anuj Wadehra wrote: Hi Jonathan It would be really nice if you could share your thoughts on the four points raised regarding the Cassandra EOL process. I think similar things happen for other open source products and it would be really nice if we could streamline such things for Apache Cassandra. ThanksAnuj Sent from Yahoo Mail on Android On Thu, 14 Jan, 2016 at 11:28 pm, Anuj Wadehra wrote: Hi Jonathan, Thanks for the crisp communication regarding the tick tock release & EOL. I think its worth considering some points regarding EOL policy and it would be great if you can share your thoughts on below points: 1. EOL of a release should be based on "most stable"/"production ready" version date rather than "GA" date of subsequent major releases. 2. I think we should have "Formal EOL Announcement" on Apache Cassandra website. 3. "Formal EOL Announcement" should come at least 6 months before the EOL, so that users get reasonable time to upgrade. 4. EOL Policy (even if flexible) should be stated on Apache Cassandra website EOL thread on users mailing list ended with the conclusion of raising a Wishlist JIRA but I think above points are more about working on policy and processes rather than just a wish list. ThanksAnuj Sent from Yahoo Mail on Android On Thu, 14 Jan, 2016 at 10:57 pm, Jonathan Ellis wrote: Hi Maciek, First let's talk about the tick-tock series, currently 3.x. This is pretty simple: outside of the regular monthly releases, we will release fixes for critical bugs against the most recent bugfix release, the way we did recently with 3.1.1 for CASSANDRA-10822 [1]. No older tick-tock releases will be patched. Now, we also have three other release series currently being supported: 2.1.x: supported with critical fixes only until 4.0 is released, projected in November 2016 [2] 2.2.x: maintained until 4.0 is released 3.0.x: maintained for 6 months after 4.0, i.e. projected until May 2017 I will add this information to the releases page [3]. 
[1] https://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/201512.mbox/%3CCAKkz8Q3StqRFHfMgCMRYaaPdg+HE5N5muBtFVt-=v690pzp...@mail.gmail.com%3E [2] 4.0 will be an ordinary tick-tock release after 3.11, but we will be sunsetting deprecated features like Thrift so bumping the major version seems appropriate [3] http://cassandra.apache.org/download/ On Sun, Jan 10, 2016 at 9:29 PM, Maciek Sakrejda wrote: > There was a discussion recently about changing the Cassandra EOL policy on > the users list [1], but it didn't really go anywhere. I wanted to ask here > instead to clear up the status quo first. What's the current versioning > policy? The tick-tock versioning blog post [2] states in passing that two > major releases are maintained, but I have not found this as an official > policy stated anywhere. For comparison, the Postgres project lays this out > very clearly [3]. To be clear, I'm not looking for any official support, > I'm just asking for clarification regarding the maintenance policy: if a > critical bug or security vulnerability is found in version X.Y.Z, when can > I expect it to be fixed in a bugfix patch to that major version, and when > do I need to upgrade to the next major version. > > [1]: http://www.mail-archive.com/user@cassandra.apache.org/msg45324.html > [2]: http://www.planetcassandra.org/blog/cassandra-2-2-3-0-and-beyond/ > [3]: http://www.postgresql.org/support/versioning/ > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder, http://www.datastax.com @spyced
Re: Repair when a replica is Down
Thanks Tyler!! I understand that we need to consider a node as lost when it's down for gc grace and bootstrap it. My question is more about the JIRA https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-2290 where an intentional decision was taken to abort the repair if a single replica is down. Precisely, I need to understand the reasoning behind aborting the repair instead of proceeding with the available replicas. As it is related to a specific fix, I thought that the developers involved in the decision could better explain the reasoning. So, I posted it on the dev list first.

Consider a scenario where I have a 20 node cluster, RF=5, Read/Write Quorum, gc grace period=20. My cluster is fault tolerant and it can afford a 2 node failure. Suddenly, one node goes down due to some hardware issue. It's 10 days since my node went down, none of the 19 nodes are being repaired, and now it's decision time. I am not sure how soon the issue will be fixed, maybe 8 days before gc grace, so I shouldn't remove the node early and add a node back, as it would cause unnecessary streaming. At the same time, if I don't remove the failed node, my entire system health would be in question and it would be a panic situation, as no data got repaired in the last 10 days and gc grace is approaching. I need sufficient time to repair 19 nodes. What looked like a fault-tolerant system which can afford a 2 node failure required urgent attention and manual decision making when a single node went down.

Why can't we just go ahead and repair the remaining replicas if some replicas are down? If the failed node comes up before the gc grace period, we would run repair to fix inconsistencies, and otherwise we would discard its data and bootstrap. I think that would be a really robust, fault-tolerant system.

Thanks Anuj

On Tue, 19 Jan, 2016 at 9:44 pm, Tyler Hobbs wrote:

On Fri, Jan 15, 2016 at 12:06 PM, Anuj Wadehra wrote:
> Increase the gc grace period temporarily. Then we should have capacity planning to accomodate the extra storage needed for extra gc grace that may be needed in case of node failure scenarios.

I would do this. Nodes that are down for longer than gc_grace_seconds should not re-enter the cluster, because they may contain data that has been deleted and the tombstone has already been purged (repairing doesn't change this). Bringing them back up will result in "zombie" data.

Also, I do think that the user mailing list is a better place for the first round of this conversation.

--
Tyler Hobbs
DataStax <http://datastax.com/>
Re: Repair when a replica is Down
There is a JIRA Issue https://issues.apache.org/jira/browse/CASSANDRA-10446 . But its open with Minor prority and type as Improvement. I think its a very valid concern for all and especially for users who have bigger clusters. More of an issue related with Design decision rather than an improvement. Can we change its priority so that it gets appropriate attention? ThanksAnuj On Tue, 19 Jan, 2016 at 10:35 pm, Tyler Hobbs wrote: On Tue, Jan 19, 2016 at 10:44 AM, Anuj Wadehra wrote: Consider a scenario where I have a 20 node clsuter, RF=5, Read/Write Quorum, gc grace period=20. My cluster is fault tolerant and it can afford 2 node failure. Suddenly, one node goes down due to some hardware issue. Its 10 days since my node is down, none of the 19 nodes are being repaired and now its decision time. I am not sure how soon issue would be fixed may be 8 days before gc grace, so I shouldnt remove node early and add node back as it would cause unnecessary streaming. At the same time, if I dont remove the failed node, my entire system health would be in question and it would be a panic situation as no data got repaired in last 10 days and gc grace is approaching. I need sufficient time to repair 19 nodes. What looked like a fault tolerant system which can afford 2 node failure, required urgent attention and manual decision making when a single node went down. Why cant we just go ahead and repair remaining replicas if some replicas are down? If failed node comes up before gc grace period, we would run repair to fix inconsistencies and otheriwse we would discard data and bootstrap. I think that would be a really robust fault tolerant system. That makes sense. It seems like having the option to ignore down replicas during repair could be at least somewhat helpful, although it may be tricky to decide how this should interact with incremental repairs. If there isn't a jira ticket for this already, can you open one with the scenario above? -- Tyler Hobbs DataStax
Re: Repair when a replica is Down
Hi Tyler, I think the scenario needs some correction. 20 node clsuter, RF=5, Read/Write Quorum, gc grace period=20. If a node goes down, repair -pr would fail on 4 nodes maintaining replicas and full repair would fail on even greater no.of number of nodes but not 19. Please confirm. Anyways the system health would get impacted as multiple nodes are not repairing with a single node failure. ThanksAnujSent from Yahoo Mail on Android On Tue, 19 Jan, 2016 at 10:48 pm, Anuj Wadehra wrote: There is a JIRA Issue https://issues.apache.org/jira/browse/CASSANDRA-10446 . But its open with Minor prority and type as Improvement. I think its a very valid concern for all and especially for users who have bigger clusters. More of an issue related with Design decision rather than an improvement. Can we change its priority so that it gets appropriate attention? ThanksAnuj On Tue, 19 Jan, 2016 at 10:35 pm, Tyler Hobbs wrote: On Tue, Jan 19, 2016 at 10:44 AM, Anuj Wadehra wrote: Consider a scenario where I have a 20 node clsuter, RF=5, Read/Write Quorum, gc grace period=20. My cluster is fault tolerant and it can afford 2 node failure. Suddenly, one node goes down due to some hardware issue. Its 10 days since my node is down, none of the 19 nodes are being repaired and now its decision time. I am not sure how soon issue would be fixed may be 8 days before gc grace, so I shouldnt remove node early and add node back as it would cause unnecessary streaming. At the same time, if I dont remove the failed node, my entire system health would be in question and it would be a panic situation as no data got repaired in last 10 days and gc grace is approaching. I need sufficient time to repair 19 nodes. What looked like a fault tolerant system which can afford 2 node failure, required urgent attention and manual decision making when a single node went down. Why cant we just go ahead and repair remaining replicas if some replicas are down? If failed node comes up before gc grace period, we would run repair to fix inconsistencies and otheriwse we would discard data and bootstrap. I think that would be a really robust fault tolerant system. That makes sense. It seems like having the option to ignore down replicas during repair could be at least somewhat helpful, although it may be tricky to decide how this should interact with incremental repairs. If there isn't a jira ticket for this already, can you open one with the scenario above? -- Tyler Hobbs DataStax
Re: Repair when a replica is Down
Actually I have not checked how repair -pr abort logic is implemented in code. So irrespective of repair pr or full repair scenarios, problem can be stated as follows: 20 node cluster, RF=5, Read/Write Quorum, gc grace period=20. If a node goes down, 1/20 th of data for which the failed node was responsible(owner) cant be repaired as 1 out of 5 replicas is down. This will put entire system health in question just because of single node failure. ThanksAnuj Sent from Yahoo Mail on Android On Tue, 19 Jan, 2016 at 11:12 pm, Anuj Wadehra wrote: Hi Tyler, I think the scenario needs some correction. 20 node clsuter, RF=5, Read/Write Quorum, gc grace period=20. If a node goes down, repair -pr would fail on 4 nodes maintaining replicas and full repair would fail on even greater no.of number of nodes but not 19. Please confirm. Anyways the system health would get impacted as multiple nodes are not repairing with a single node failure. ThanksAnujSent from Yahoo Mail on Android On Tue, 19 Jan, 2016 at 10:48 pm, Anuj Wadehra wrote: There is a JIRA Issue https://issues.apache.org/jira/browse/CASSANDRA-10446 . But its open with Minor prority and type as Improvement. I think its a very valid concern for all and especially for users who have bigger clusters. More of an issue related with Design decision rather than an improvement. Can we change its priority so that it gets appropriate attention? ThanksAnuj On Tue, 19 Jan, 2016 at 10:35 pm, Tyler Hobbs wrote: On Tue, Jan 19, 2016 at 10:44 AM, Anuj Wadehra wrote: Consider a scenario where I have a 20 node clsuter, RF=5, Read/Write Quorum, gc grace period=20. My cluster is fault tolerant and it can afford 2 node failure. Suddenly, one node goes down due to some hardware issue. Its 10 days since my node is down, none of the 19 nodes are being repaired and now its decision time. I am not sure how soon issue would be fixed may be 8 days before gc grace, so I shouldnt remove node early and add node back as it would cause unnecessary streaming. At the same time, if I dont remove the failed node, my entire system health would be in question and it would be a panic situation as no data got repaired in last 10 days and gc grace is approaching. I need sufficient time to repair 19 nodes. What looked like a fault tolerant system which can afford 2 node failure, required urgent attention and manual decision making when a single node went down. Why cant we just go ahead and repair remaining replicas if some replicas are down? If failed node comes up before gc grace period, we would run repair to fix inconsistencies and otheriwse we would discard data and bootstrap. I think that would be a really robust fault tolerant system. That makes sense. It seems like having the option to ignore down replicas during repair could be at least somewhat helpful, although it may be tricky to decide how this should interact with incremental repairs. If there isn't a jira ticket for this already, can you open one with the scenario above? -- Tyler Hobbs DataStax
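To put rough numbers on the corrected scenario above, here is a small standalone model. It assumes a single DC, a single token per node and SimpleStrategy-style placement where each primary range is replicated on the next RF-1 nodes clockwise; this is a simplifying assumption, not how NetworkTopologyStrategy or vnodes actually place replicas. It counts how many live nodes cannot complete repair -pr because the down node holds a replica of their primary range, which matches the earlier 4-node correction.

import java.util.*;

// Simplified model: 20 nodes, RF=5, SimpleStrategy-like placement; NOT Cassandra source.
public final class DownReplicaRepairSketch
{
    public static void main(String[] args)
    {
        int nodes = 20, rf = 5, downNode = 7;
        List<Integer> blocked = new ArrayList<>();
        for (int primary = 0; primary < nodes; primary++)
        {
            if (primary == downNode)
                continue; // the down node cannot run repair at all
            // Node 'primary''s range is replicated on itself and the next rf-1 nodes clockwise.
            for (int r = 0; r < rf; r++)
            {
                if ((primary + r) % nodes == downNode)
                {
                    blocked.add(primary);
                    break;
                }
            }
        }
        // Prints the rf-1 live nodes whose "repair -pr" would abort: [3, 4, 5, 6]
        System.out.println("Live nodes whose repair -pr aborts: " + blocked);
    }
}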
Re: 3.1 status?
I agree with the thought of not recommending any production ready version. If something is not production ready, it should ideally be release candidate and when GA happens, it should implicitly mean stable as it is assumed that the GA is only done for production ready releases. ThanksAnuj Sent from Yahoo Mail on Android On Wed, 20 Jan, 2016 at 11:03 am, Jonathan Ellis wrote: On Tue, Jan 19, 2016 at 11:17 PM, Jack Krupansky wrote: > It's great to see clear support status marked on the 3.0.x and 2.x releases > on the download page now. A couple more questions... > > 1. What is the support and stability status of 3.1 and 3.2 (as opposed to > 3.2.1)? Are they "for non-production development only"? Are they considered > "stable"? The page should say. > I disagree that the page should make a recommendation here, but see below. 2. Is there simply no "stable" release for 3.x, or is the latest tick-tock > release by definition considered "stable"? > If you want to have that mental box, then I would put the most recent bug fix release in it. (3.1.1 will be going back on the download page soon; removing it was an oversight.) > 3. The first paragraph says "If a critical bug is found, a patch will be > released against the most recent bug fix release", but in fact the latest > critical patch (3.2.1) is against a feature release, not a bug fix release. > Should that simply say "... against the most recent tick-tock release" > regardless of whether it was an even (feature) or odd (bug fix) release? > Case by case basis. In this instance, the bug that prompted the release was a new regression, so there was no need to patch 3.1. (And no, I don't want to belabor the syntax on the download page to spell this out in minute detail.) -- Jonathan Ellis Project Chair, Apache Cassandra co-founder, http://www.datastax.com @spyced
EOL for COMPACT STORAGE
Hi, Is there any plan to completely phase out the COMPACT STORAGE table format such that backward compatibility is broken in a future release?

Thanks Anuj
Re: EOL for COMPACT STORAGE
I would appreciate it if someone from the dev team could reply.

Thanks Anuj

Sent from Yahoo Mail on Android

On Sun, 31 Jan, 2016 at 7:23 pm, Anuj Wadehra wrote:

Hi, Is there any plan to completely phase out the COMPACT STORAGE table format such that backward compatibility is broken in a future release?

Thanks Anuj
Criteria for upgrading to 3.x releases in PROD
Hi, The tick-tock release strategy in 3.x was a good initiative to ensure frequent & stable releases. While odd releases are supposed to get all the bug fixes and should be the most stable, many people like me, who got used to the comforting "production ready/stable" tag on the Apache website, are still reluctant to take the latest 3.x odd releases into production. I think the hesitation is somewhat justified, as processes often take time to mature. So here I would like to ask the experts, the people who know the ground situation and who actively develop and manage it: considering the current scenario, what should be a reasonable criterion for taking 3.x releases into production?

Thanks Anuj
Re: Criteria for upgrading to 3.x releases in PROD
Can someone help me with this one? ThanksAnuj Sent from Yahoo Mail on Android On Sun, 10 Apr, 2016 at 11:07 PM, Anuj Wadehra wrote: Hi, Tick-Tock release strategy in 3.x was a good intiative to ensure frequent & stable releases. While odd releases are supposed to get all the bug fixes and should be most stable, many people like me, who got used to the comforting "production ready/stable" tag on Apache website, are still reluctant to take latest 3.x odd releases into production. I think the hesitation is somewhat justified as processes often take time to mature. So here I would like to ask the experts, people who know the ground situation, people who actively develop it and manage it. Considering the current scenario, What should be a resonable criteria for taking 3.x releases in production? ThanksAnuj
Re: Criteria for upgrading to 3.x releases in PROD
Hi All,

For the last several months, the "most stable version" question pops up on the user mailing list and people get all sorts of responses/suggestions:

If you are conservative go for x, if adventurous y.
If you have a good risk appetite go for x, else y.
If you want features go for x, else y.

Unfortunately, all of the above responses don't help many users; they only reinforce the low confidence in the latest releases. Who wants to be adventurous in production? Who wants to test his risk appetite in production? And who would want features over stability in production? Not many, I am sure.

So my question is: would it be a wise decision to mention the "most stable/production ready" version (as it used to be before 3.x) on the Apache website till the tick-tock release strategy evolves and matures? That will somewhat contradict the tick-tock philosophy of stable odd releases, but it would be more realistic, as every big change needs time to stabilise. It is slightly unfair if users are kept in a confused state till the strategy matures and starts delivering solid stable builds. I think the question is more appropriate for the dev list, so I have kept it here.

Thanks Anuj

Sent from Yahoo Mail on Android

On Mon, 11 Apr, 2016 at 11:39 PM, Aleksey Yeschenko wrote:

The answer will depend on how conservative you are.

The most conservative choice overall would be to go with the 2.2.x line.

3.0.x if you want the new nice and shiny 3.0 things, but can tolerate some risk (the branch has a lot of relatively new core code, and hasn't yet been tried out by as many users as the 2.x branch had).

The latest odd 3.x if you want the shiniest (3.5 to be released soon, with features like the new SASI secondary indexes support). Also, there hasn't yet been that much divergence between 3.0.x and 3.x, so risk levels are around the same, so long as you limit yourself to only the features present in 3.0.x.

Either way, make sure to properly test whatever release you go for in staging first, as Michael says, and you'll be alright.

--
AY

On 11 April 2016 at 18:42:31, Anuj Wadehra (anujw_2...@yahoo.co.in.invalid) wrote:

Can someone help me with this one?

Thanks Anuj

Sent from Yahoo Mail on Android

On Sun, 10 Apr, 2016 at 11:07 PM, Anuj Wadehra wrote:

Hi, The tick-tock release strategy in 3.x was a good initiative to ensure frequent & stable releases. While odd releases are supposed to get all the bug fixes and should be the most stable, many people like me, who got used to the comforting "production ready/stable" tag on the Apache website, are still reluctant to take the latest 3.x odd releases into production. I think the hesitation is somewhat justified, as processes often take time to mature. So here I would like to ask the experts, the people who know the ground situation and who actively develop and manage it: considering the current scenario, what should be a reasonable criterion for taking 3.x releases into production?

Thanks Anuj
Re: Criteria for upgrading to 3.x releases in PROD
I am sorry but here, I am not expecting thousands to decide a stable version for my use case. I have a serious question about publishing some info on the Apache website. As dev list has active contributors, I posted it here. If not this forum, Whats the best way to put your suggestions regarding Apache content and initiate a meaningful and conclusive discussion thread? ThanksAnuj Sent from Yahoo Mail on Android On Mon, 18 Apr, 2016 at 11:27 PM, Michael Kjellman wrote: This is best for the users list. Test the releases yourself and then decide when it's ready for your use case, ops team, and organization. This is a personal decision and not one for *thousands* of others on this mailing list to make for you. best, kjellman > On Apr 18, 2016, at 10:54 AM, Anuj Wadehra > wrote: > > Hi All, > For last several months, the "most stable version" question pops up on the > user mailing list and then people get all sorts of responses/suggestions.. > If you are conservative go for x if adventurous y.. > If you have good risk appetite go for x else y.. > If you want features go for x else y.. > > Unfortunately, all above responses dont help many users..but only reinforce > the low confidence in latest releases.Who wants to be adventurous in > Production? Who wants to test his risk appetite in Production? And who would > want features for stability in Production? Not many..I am sure. > So my question is: > Would it be a wise decision to mention the "most stable/production ready" > version (as it used to be before 3.x) on the Apache website till tick-tock > release strategy evolves and matures? > That will somewhat contradict the tick-tock philosphy of stable odd releases >but would be more realistic as every big change needs time to stabilise. Its >slightly unfair, if users are kept in confused state till the strategy matures >and starts delivering solid stable builds. > I think the question is more appropriate in dev list so I have kept it here. > ThanksAnuj > Sent from Yahoo Mail on Android > > On Mon, 11 Apr, 2016 at 11:39 PM, Aleksey Yeschenko >wrote: The answer will depend on how conservative you are. > > The most conservative choice overall would be to go with the 2.2.x line. > > 3.0.x if you want to the new nice and shiny 3.0 things, but can tolerate some > risk (the branch has a lot of relatively new core code, and hasn’t yet been > tried out by as many users as the 2.x branch had). > > The latest odd 3.x if you want the shiniest (3.5 to be released soon, with > features like the new SASI secondary indexes support). Also, there hasn’t yet > been that much divergence between 3.0.x and 3.x, so risk levels are around > the same, so long as you limit yourself to only the features present in 3.0.x. > > Either way, make sure to properly test whatever release you go for in staging > first, as Michael says, and you’ll be alright. > > -- > AY > > On 11 April 2016 at 18:42:31, Anuj Wadehra (anujw_2...@yahoo.co.in.invalid) > wrote: > > Can someone help me with this one? > ThanksAnuj > > Sent from Yahoo Mail on Android > > On Sun, 10 Apr, 2016 at 11:07 PM, Anuj Wadehra wrote: > Hi, > Tick-Tock release strategy in 3.x was a good intiative to ensure frequent & > stable releases. While odd releases are supposed to get all the bug fixes and > should be most stable, many people like me, who got used to the comforting > "production ready/stable" tag on Apache website, are still reluctant to take > latest 3.x odd releases into production. 
I think the hesitation is somewhat > justified as processes often take time to mature. > So here I would like to ask the experts, people who know the ground > situation, people who actively develop it and manage it. Considering the > current scenario, What should be a resonable criteria for taking 3.x releases > in production? > > > ThanksAnuj > > > > >
Re: Criteria for upgrading to 3.x releases in PROD
Hi All, Let me reiterate, my question is not about selecting right Cassandra for me. The intent is to get dev community response on below question. Question: Would it be a wise decision to mention the "most stable/production ready" version (as it used to be before 3.x) on the Apache website till tick-tock release strategy evolves and matures? Drivers for posting above info on website: I have read all the posts/forums and realized that there is no absolute answer for selecting Production Ready Cassandra version one should use..Even now, people often hesitate to recommend latest releases for Prod and go back to 2.1 and 2.2..In every suggestion there are too many ifs..like I said...if you want features x..if u want rock solid y..if you are adventurous zno offense but who would not want a rock solid version for Production? Who would want features for stability in Prod? And who would want to take risks in Prod? The stability of a release should NOT depend my risk appetite and use case..if some version of 2.1 or 2.2 or 3.0.x is stable for production why not put that info until tick-tock matures? Please realize that everyone goes for thorough testing before upgrading but the scope of application testing cant uncover most critical bugs..Community guidance and a bigger picture on stability can help the community until tick-tock matures and we deliver stable production ready releases. ThanksAnuj Sent from Yahoo Mail on Android On Tue, 19 Apr, 2016 at 3:01 AM, Carlos Rolo wrote: My blog post regarding this: https://www.pythian.com/blog/cassandra-version-production/ There is a choice for everyone, and explained. Regards, Carlos Juzarte Rolo Cassandra Consultant / Datastax Certified Architect / Cassandra MVP Pythian - Love your data rolo@pythian | Twitter: @cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo <http://linkedin.com/in/carlosjuzarterolo>* Mobile: +351 91 891 81 00 | Tel: +1 613 565 8696 x1649 www.pythian.com On Mon, Apr 18, 2016 at 7:12 PM, Anuj Wadehra < anujw_2...@yahoo.co.in.invalid> wrote: > I am sorry but here, I am not expecting thousands to decide a stable > version for my use case. I have a serious question about publishing some > info on the Apache website. As dev list has active contributors, I posted > it here. If not this forum, Whats the best way to put your suggestions > regarding Apache content and initiate a meaningful and conclusive > discussion thread? > > ThanksAnuj > > Sent from Yahoo Mail on Android > > On Mon, 18 Apr, 2016 at 11:27 PM, Michael Kjellman< > mkjell...@internalcircle.com> wrote: This is best for the users list. > Test the releases yourself and then decide when it's ready for your use > case, ops team, and organization. This is a personal decision and not one > for *thousands* of others on this mailing list to make for you. > > best, > kjellman > > > On Apr 18, 2016, at 10:54 AM, Anuj Wadehra > wrote: > > > > Hi All, > > For last several months, the "most stable version" question pops up on > the user mailing list and then people get all sorts of > responses/suggestions.. > > If you are conservative go for x if adventurous y.. > > If you have good risk appetite go for x else y.. > > If you want features go for x else y.. > > > > Unfortunately, all above responses dont help many users..but only > reinforce the low confidence in latest releases.Who wants to be adventurous > in Production? Who wants to test his risk appetite in Production? And who > would want features for stability in Production? Not many..I am sure. 
> > So my question is: > > Would it be a wise decision to mention the "most stable/production > ready" version (as it used to be before 3.x) on the Apache website till > tick-tock release strategy evolves and matures? > > That will somewhat contradict the tick-tock philosphy of stable odd > releases but would be more realistic as every big change needs time to > stabilise. Its slightly unfair, if users are kept in confused state till > the strategy matures and starts delivering solid stable builds. > > I think the question is more appropriate in dev list so I have kept it > here. > > ThanksAnuj > > Sent from Yahoo Mail on Android > > > > On Mon, 11 Apr, 2016 at 11:39 PM, Aleksey Yeschenko > wrote: The answer will depend on how conservative you are. > > > > The most conservative choice overall would be to go with the 2.2.x line. > > > > 3.0.x if you want to the new nice and shiny 3.0 things, but can tolerate > some risk (the branch has a lot of relatively new core code, and hasn’t yet > been tried out by as many users as the 2.x branch had). > > > > The latest odd 3.x if you want the shiniest (3.5 to be released soon, >
Re: Criteria for upgrading to 3.x releases in PROD
Jonathan, I understand you point. In my perspective, people in production usually prefer stability over features and would always want at least emergency fix releases if not fully supported versions.I am glad that today we have such releases which are very stable and not yet EOL. Its just that users are tempted to use latest odd releases as per the tick-tock strategy highlighted on the website and then probably fallback to previous ones after discussing stable versions on various forums. I just wanted to make their decisions simpler :) I agree with you - Every thing cant be white and black..stable and unstable..At the same..I feel.. most of the time there would be a single stable release which is not EOL. Thanks for your time. Anuj Sent from Yahoo Mail on Android On Tue, 19 Apr, 2016 at 7:06 AM, Jonathan Ellis wrote: Anuj, The problem is that this question defies a simplistic answer like "version X is the most stable" (are you willing to use unsupported releases? what about emergency-fix-only? what features can you not live without?) so we're intentionally resisting the urge to oversimplify the situation. On Mon, Apr 18, 2016 at 8:25 PM, Anuj Wadehra < anujw_2...@yahoo.co.in.invalid> wrote: > Hi All, > Let me reiterate, my question is not about selecting right Cassandra for > me. The intent is to get dev community response on below question. > Question: > Would it be a wise decision to mention the "most stable/production > ready" version (as it used to be before 3.x) on the Apache website till > tick-tock release strategy evolves and matures? > > Drivers for posting above info on website: > I have read all the posts/forums and realized that there is no absolute > answer for selecting Production Ready Cassandra version one should > use..Even now, people often hesitate to recommend latest releases for Prod > and go back to 2.1 and 2.2..In every suggestion there are too many > ifs..like I said...if you want features x..if u want rock solid y..if you > are adventurous zno offense but who would not want a rock solid > version for Production? Who would want features for stability in Prod? And > who would want to take risks in Prod? > The stability of a release should NOT depend my risk appetite and use > case..if some version of 2.1 or 2.2 or 3.0.x is stable for production why > not put that info until tick-tock matures? > > Please realize that everyone goes for thorough testing before upgrading > but the scope of application testing cant uncover most critical > bugs..Community guidance and a bigger picture on stability can help the > community until tick-tock matures and we deliver stable production ready > releases. > > > > ThanksAnuj > Sent from Yahoo Mail on Android > > On Tue, 19 Apr, 2016 at 3:01 AM, Carlos Rolo wrote: > My blog post regarding this: > > https://www.pythian.com/blog/cassandra-version-production/ > > There is a choice for everyone, and explained. > > Regards, > > Carlos Juzarte Rolo > Cassandra Consultant / Datastax Certified Architect / Cassandra MVP > > Pythian - Love your data > > rolo@pythian | Twitter: @cjrolo | Linkedin: * > linkedin.com/in/carlosjuzarterolo > <http://linkedin.com/in/carlosjuzarterolo>* > Mobile: +351 91 891 81 00 | Tel: +1 613 565 8696 x1649 > www.pythian.com > > On Mon, Apr 18, 2016 at 7:12 PM, Anuj Wadehra < > anujw_2...@yahoo.co.in.invalid> wrote: > > > I am sorry but here, I am not expecting thousands to decide a stable > > version for my use case. I have a serious question about publishing some > > info on the Apache website. 
As dev list has active contributors, I posted > > it here. If not this forum, Whats the best way to put your suggestions > > regarding Apache content and initiate a meaningful and conclusive > > discussion thread? > > > > ThanksAnuj > > > > Sent from Yahoo Mail on Android > > > > On Mon, 18 Apr, 2016 at 11:27 PM, Michael Kjellman< > > mkjell...@internalcircle.com> wrote: This is best for the users list. > > Test the releases yourself and then decide when it's ready for your use > > case, ops team, and organization. This is a personal decision and not one > > for *thousands* of others on this mailing list to make for you. > > > > best, > > kjellman > > > > > On Apr 18, 2016, at 10:54 AM, Anuj Wadehra > > wrote: > > > > > > Hi All, > > > For last several months, the "most stable version" question pops up on > > the user mailing list and then people get all sorts of > > responses/suggestions.. > > > If you are conservative go for x if adventurous y.. > > > If you have good risk appetite go
Re: Criteria for upgrading to 3.x releases in PROD
Jack, The question was about publishing "most stable" release on Apache website as it done before 3.x. Regarding your comments, I still feel adventure cant happen in production systems. And you should certainly test every release before upgrading but you woulf not like to upgrade to latest releases based on your limited testing. I feel that you cant do exhaustive testing of the database and can easily miss critical corner cases which may trigger in production. But its just my perspective of looking at things. People may think differently. Thanks All of you for your comments !! ThanksAnuj Sent from Yahoo Mail on Android On Sun, 24 Apr, 2016 at 1:28 AM, Jack Krupansky wrote: Is the question whether a new application can go into production with 3.x, or whether an existing application in production with 2.x.y should be upgraded to 3.x? For the latter, a "If it ain't broke, don't fix it" philosophy is best. And if there are critical bug fixes needed, simply upgrade the 2.x line that you are already on. Or if your production is on 3.0.x, upgrade to 3.0.x+k. For the former, we aren't hearing people hollering that 3.x is crap, so it is reasonably safe for a new app going into production, subject to your own testing. Given the relative stability of 3.x due to the tick-tock and "trunk always releasable" strategies, users are no longer faced with the kind of wild instabilities of the past. Ultimately, stability really is subjective and in the eye of the beholder - how conservative or adventurous are you and your organization. Sure, maybe 2.2.x is more stable in some abstract sense, but for a new app, why start so far behind the curve? In fact, for a new app you should be trying to take advantage of new features and performance improvements, like materialized views, SASI, and wide rows coming soon. In the past, upgrading from 2.x to 2.y was a big deal. That just isn't a problem with upgrading from 3.x to 3.y. At least in theory, and again, nobody has been hollering about having problems doing that. For EOL, you will have to judge for yourself how long it may take your organization to carefully migrate a production 2.x system to 3.x somewhere down the road. No need to rush, but don't wait until the last minute either. And I suspect that you won't even want to think about upgrading 2.x to 4.x - IOW, upgrade to 3.x well before 3.x EOL. -- Jack Krupansky On Sat, Apr 23, 2016 at 3:28 PM, Anuj Wadehra < anujw_2...@yahoo.co.in.invalid> wrote: > Jonathan, > I understand you point. In my perspective, people in production usually > prefer stability over features and would always want at least emergency fix > releases if not fully supported versions.I am glad that today we have such > releases which are very stable and not yet EOL. Its just that users are > tempted to use latest odd releases as per the tick-tock strategy > highlighted on the website and then probably fallback to previous ones > after discussing stable versions on various forums. I just wanted to make > their decisions simpler :) I agree with you - Every thing cant be white and > black..stable and unstable..At the same..I feel.. most of the time there > would be a single stable release which is not EOL. > Thanks for your time. > > > Anuj > Sent from Yahoo Mail on Android > > On Tue, 19 Apr, 2016 at 7:06 AM, Jonathan Ellis > wrote: Anuj, > > The problem is that this question defies a simplistic answer like "version > X is the most stable" (are you willing to use unsupported releases? what > about emergency-fix-only? 
what features can you not live without?) so > we're intentionally resisting the urge to oversimplify the situation. > > On Mon, Apr 18, 2016 at 8:25 PM, Anuj Wadehra < > anujw_2...@yahoo.co.in.invalid> wrote: > > > Hi All, > > Let me reiterate, my question is not about selecting right Cassandra for > > me. The intent is to get dev community response on below question. > > Question: > > Would it be a wise decision to mention the "most stable/production > > ready" version (as it used to be before 3.x) on the Apache website till > > tick-tock release strategy evolves and matures? > > > > Drivers for posting above info on website: > > I have read all the posts/forums and realized that there is no absolute > > answer for selecting Production Ready Cassandra version one should > > use..Even now, people often hesitate to recommend latest releases for > Prod > > and go back to 2.1 and 2.2..In every suggestion there are too many > > ifs..like I said...if you want features x..if u want rock solid y..if you > > are adventurous zno offense but who would not want a rock solid > > version for Production? Who would want features for stability in Prod? >
Re: [Proposal] Mandatory comments
I think this is very basic. If its not followed till now, we should do that now on.Just a suggestion. So,there should be a rule and may be a code review checklist point to verify that "quality" of comments is ok and comments are sufficient. Regarding, high level comments, I feel that they are wonderful and often you can effortlessly get the low level design by reading them. The only drawback is their maintenance. Maintenance of "big picture" when code starts drifiting from it is tough. You will make an important change in class X and should take care that the big picture is updated at some other place. One may debate that such instances would be rare. ThanksAnuj Sent from Yahoo Mail on Android On Tue, 3 May, 2016 at 12:26 AM, Sylvain Lebresne wrote: On Mon, May 2, 2016 at 7:16 PM, Jonathan Ellis wrote: > What I'd like to see is more comments like the one in StreamSession: > something that can give me the "big picture" for a piece of functionality. > I wholeheartedly agree that we need more of those. I don't think though that we need a single kind of comment, nor even that we're lacking on a single kind. > > I wonder if focusing on class-based comments might miss an opportunity > here. I don't meant to imply any exclusive focusing by my suggestion. I'm constantly seeing classes that not well explained and methods that make complex and undocumented assumptions, so I'm very much convinced improvements are needed on that front. Without, again, invalidating the equal need for big picture comments. > > Is this a case for package-level javadoc, and organizing our class > hierarchy better along those lines? > I agree that this would probably be the best place for those bit-picture documentation. I'd be totally fine adding on top of the rule I suggested another one that says: - If you create a new package, you should have a package level javadoc that describe the big picture of what that package is about. I do want to note that I'm trying to focus the discussion here on a few simple concrete points we could hopefully easily agree on so that we improve our ways moving forward and I'd personally love to focus on that first. That won't fix existing code by itself, but the optimistic in me hopes that if we get more consistent quality of comments in new code, our inconfort with the lack of comments in old code will grow and we'll end up fixing it naturally over time. -- Sylvain > > On Mon, May 2, 2016 at 11:26 AM, Sylvain Lebresne > wrote: > > > There could be disagreement on this, but I think that we are in general > not > > very good at commenting Cassandra code and I would suggest that we make a > > collective effort to improve on this. And to help with that goal, I would > > like > > to suggest that we add the following rule to our code style guide > > (https://wiki.apache.org/cassandra/CodeStyle): > > - Every public class and method must have a quality javadoc. That > > javadoc, on > > top of describing what the class/method does, should call particular > > attention to the assumptions that the class/method does, if any. > > > > And of course, we'd also start enforcing that rule by not +1ing code > unless > > it > > adheres to this rule. > > > > Note that I'm not pretending this arguably simplistic rule will magically > > make > > comments perfect, it won't. It's still up to us to write good and > complete > > comments, and it doesn't even talk about comments within methods that are > > important too. But I think some basic rule here would be beneficial and > > that > > one feels sensible to me. 
> > > > Looking forward to other's opinions and feedbacks on this proposal. > > > > -- > > Sylvain > > > > > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder, http://www.datastax.com > @spyced >
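For concreteness, the kind of comment the proposal asks for might look like the sketch below. The class, method and package names are invented for illustration and are not actual Cassandra code; the point is only the shape of the javadoc (behaviour plus assumptions) and the package-level "big picture" comment.

    // Hypothetical illustration of the proposed rule: class- and method-level javadoc
    // that states behaviour and assumptions, plus a package-level "big picture" comment.

    /**
     * Tracks the state of a single streaming session with one peer.
     *
     * Assumptions: instances are confined to a single thread, and {@link #complete()}
     * is called exactly once, after every file transfer has been acknowledged.
     */
    public class StreamSessionState
    {
        /**
         * Marks the session complete and releases its resources.
         *
         * Assumes all transfers have already been acknowledged; calling this while
         * transfers are still in flight leaves the session in an undefined state.
         */
        public void complete()
        {
            // ...
        }
    }

    // The "big picture" documentation would live in a package-info.java file, e.g.:
    /**
     * High-level description of what a streaming session is, the states it moves
     * through, and which classes own which responsibilities.
     */
    package org.apache.cassandra.streaming.example;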
Possible Bug: bucket_low has no effect in STCS
Hi,

I am trying to understand the algorithm of STCS. As per my current understanding of the code, there seems to be no impact of setting bucket_low in the STCS compaction algorithm. Moreover, I see some scope for optimization. I would appreciate it if some designer could correct me or confirm that it's a bug, so that I can raise a JIRA.

Details -- The getBuckets() method of SizeTieredCompactionStrategy sorts sstables by size in ascending order and then iterates over them one by one to associate them with an existing/new bucket. When iterating sstables in ascending order of size, I can't find ANY single scenario where the current sstable in the outer loop iteration is below the oldAverageSize of any existing bucket. The current sstable being iterated will ALWAYS be greater than or equal to the oldAverageSize of ALL existing buckets, as ALL previous sstables in existing buckets were smaller than or equal in size to the sstable being iterated. So, there is NO scenario where size > (oldAverageSize * bucketLow) and size < oldAverageSize, so the bucket_low property never comes into play, no matter what value you set for it.

Also, while iterating over sstables (sortedfiles) by size in ascending order, there is no point iterating over all existing buckets. We could just start from the LAST bucket where the previous sstable was associated. The oldAverageSize of ALL other buckets will NEVER allow the sstable being iterated.

for (Entry<Long, List<T>> entry : buckets.entrySet()) {...}

Thanks
Anuj
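For illustration, below is a minimal Java sketch of the bucketing loop described above. It is simplified and uses invented names; it is not the actual SizeTieredCompactionStrategy source. Because the sizes arrive in ascending order, every candidate size is already greater than or equal to each bucket's running average, so the bucketLow half of the condition never decides anything: running the sketch with bucketLow = 0.5 and bucketLow = 0.9 produces identical buckets.

    import java.util.*;

    // Simplified sketch of an STCS-style getBuckets() (illustrative only): sstable sizes
    // are sorted ascending and matched against each bucket's running average size.
    public class StcsBucketSketch
    {
        static List<List<Long>> getBuckets(List<Long> sizes, double bucketLow, double bucketHigh, long minSSTableSize)
        {
            Collections.sort(sizes); // ascending, as in the real implementation
            Map<Long, List<Long>> buckets = new HashMap<>(); // key = current average size of the bucket
            for (long size : sizes)
            {
                boolean placed = false;
                for (Map.Entry<Long, List<Long>> entry : new ArrayList<>(buckets.entrySet()))
                {
                    long oldAverageSize = entry.getKey();
                    // Because sizes are iterated in ascending order, size >= oldAverageSize always
                    // holds, so the "size > oldAverageSize * bucketLow && size < oldAverageSize"
                    // part of this condition can never be the reason a bucket is chosen.
                    if ((size > oldAverageSize * bucketLow && size < oldAverageSize * bucketHigh)
                        || (size < minSSTableSize && oldAverageSize < minSSTableSize))
                    {
                        List<Long> bucket = entry.getValue();
                        bucket.add(size);
                        long newAverageSize = (oldAverageSize * (bucket.size() - 1) + size) / bucket.size();
                        buckets.remove(oldAverageSize);
                        buckets.put(newAverageSize, bucket);
                        placed = true;
                        break;
                    }
                }
                if (!placed)
                {
                    List<Long> bucket = new ArrayList<>();
                    bucket.add(size);
                    buckets.put(size, bucket);
                }
            }
            return new ArrayList<>(buckets.values());
        }

        public static void main(String[] args)
        {
            List<Long> sizes = Arrays.asList(100L, 110L, 120L, 400L, 420L, 2000L);
            System.out.println(getBuckets(new ArrayList<>(sizes), 0.5, 1.5, 50L));
            System.out.println(getBuckets(new ArrayList<>(sizes), 0.9, 1.5, 50L)); // same buckets
        }
    }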
Re: Possible Bug: bucket_low has no effect in STCS
Can any developer confirm the issue? ThanksAnuj Sent from Yahoo Mail on Android On Mon, 13 Jun, 2016 at 11:15 PM, Anuj Wadehra wrote: Hi, I am trying to understand the algorithm of STCS. As per my current understanding of the code, there seems to be no impact of setting bucket_low in the STCS compaction algorithm. Moreover, I see some optimization. I would appreciate if some designer can correct me or confirm that it's a bug sonthat I can raise a JIRA. Details -- getBuckets() method of SizeTieredCompactionStrategy sorts sstables by size in ascending order and then iterates over them one by one to associate them to an existing/new bucket. When, iterating sstables in ascending order of size, I can't find ANY single scenario where the current sstable in the outer loop iteration is below the oldAverageSize of any existing bucket. Current sstable being iterated will ALWAYS be greater than/equal to the oldAverageSize of ALL existing buckets as ALL previous sstables in existing buckets were smaller/equal in size to the sstable being iterated. So, there is NO scenario when size > (oldAverageSize * bucketLow) and size < oldAverageSize, so bucket_low property never comes into play no matter what value you set for it. Also, while iteraitng over sstables (sortedfiles) by size in ascending order, there is no point iterating over all existing buckets. We could just start from the LAST bucket where previous sstable was associated. oldAverageSize of ALL other buckets will NEVER allow the sstable being iterated. for (Entry> entry : buckets.entrySet()) {...} Thanks Anuj
Re: Possible Bug: bucket_low has no effect in STCS
Should I raise JIRA ?? Or some develiper with knowledge of STCS could confirm the bug ?? Anuj Sent from Yahoo Mail on Android On Tue, 14 Jun, 2016 at 12:52 PM, Anuj Wadehra wrote: Can any developer confirm the issue? ThanksAnuj Sent from Yahoo Mail on Android On Mon, 13 Jun, 2016 at 11:15 PM, Anuj Wadehra wrote: Hi, I am trying to understand the algorithm of STCS. As per my current understanding of the code, there seems to be no impact of setting bucket_low in the STCS compaction algorithm. Moreover, I see some optimization. I would appreciate if some designer can correct me or confirm that it's a bug sonthat I can raise a JIRA. Details -- getBuckets() method of SizeTieredCompactionStrategy sorts sstables by size in ascending order and then iterates over them one by one to associate them to an existing/new bucket. When, iterating sstables in ascending order of size, I can't find ANY single scenario where the current sstable in the outer loop iteration is below the oldAverageSize of any existing bucket. Current sstable being iterated will ALWAYS be greater than/equal to the oldAverageSize of ALL existing buckets as ALL previous sstables in existing buckets were smaller/equal in size to the sstable being iterated. So, there is NO scenario when size > (oldAverageSize * bucketLow) and size < oldAverageSize, so bucket_low property never comes into play no matter what value you set for it. Also, while iteraitng over sstables (sortedfiles) by size in ascending order, there is no point iterating over all existing buckets. We could just start from the LAST bucket where previous sstable was associated. oldAverageSize of ALL other buckets will NEVER allow the sstable being iterated. for (Entry> entry : buckets.entrySet()) {...} Thanks Anuj
Re: [VOTE] Release Apache Cassandra 3.8
Hi Michael,

Just found an issue in CHANGES.txt: the entry "Add cross-DC latency metrics (CASSANDRA-11596)" should be "Track message latency across DCs (CASSANDRA-11569)".

Thanks
Anuj

Sent from Yahoo Mail on Android

On Thu, 21 Jul, 2016 at 3:18 AM, Michael Shuler wrote: I propose the following artifacts for release as 3.8. sha1: c3ded0551f538f7845602b27d53240cd8129265c Git: http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/3.8-tentative Artifacts: https://repository.apache.org/content/repositories/orgapachecassandra-1123/org/apache/cassandra/apache-cassandra/3.8/ Staging repository: https://repository.apache.org/content/repositories/orgapachecassandra-1123/ The debian packages are available here: http://people.apache.org/~mshuler/ The vote will be open for 72 hours (longer if needed). [1]: http://goo.gl/oGNH0i (CHANGES.txt) [2]: http://goo.gl/KjMtUn (NEWS.txt) [3]: https://goo.gl/TxVLKo (3.8 Test Summary)
Re: A proposal to move away from Jira-centric development
Hi,

I think tracking things in a single tool would be better than splitting them across the mailing list and JIRA. To make feature JIRAs easier to comprehend, we could close every JIRA discussion with an attached design proposal (mandatory). Once the design is frozen and complete, one can start with the implementation. I am not sure which JIRA customizations are possible, but it would be good if we could customize JIRA tickets to keep discussions isolated from the approved design (within a single JIRA ticket). I personally find it tough to go through long JIRA discussions just to understand the final design concluded for a problem/feature. Discussing initial thoughts about pain areas, improvements etc. can be done on the dev mailing list.

Thanks
Anuj

On Mon, 15 Aug, 2016 at 7:52 PM, Jonathan Ellis wrote: A long time ago, I was a proponent of keeping most development discussions on Jira, where tickets can be self contained and the threadless nature helps keep discussions from getting sidetracked. But Cassandra was a lot smaller then, and as we've grown it has become necessary to separate out the signal (discussions of new features and major changes) from the noise of routine bug reports. I propose that we take advantage of the dev list to perform that separation. Major new features and architectural improvements should be discussed first here, then when consensus on design is achieved, moved to Jira for implementation and review. I think this will also help with the problem when the initial idea proves to be unworkable and gets revised substantially later after much discussion. It can be difficult to figure out what the conclusion was, as review comments start to pile up afterwards. Having that discussion on the list, and summarizing on Jira, would mitigate this. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder, http://www.datastax.com @spyced
Re: Broader community involvement in 4.0 (WAS Re: Rough roadmap for 4.0)
Hi,

We need to understand that time is precious for all of us. Even if a developer has intentions to contribute, he may take months to contribute his first patch, or maybe longer. Some common entry barriers are:

1. Difficult to identify low hanging fruit. 30 JIRA comments on a ticket and a newcomer is LOST, even though the exact fix may be much simpler.
2. Dead JIRA discussions with no clue on the current status of the ticket.
3. No response on newly raised JIRAs. Response time to validate/reject the problem is important. Should I pick it? Is it really a bug? Maybe some expert can confirm it first and then I can pick it up.
4. Ping-pong JIRAs: you read 10 comments on a ticket, then see duplicates and related tickets, then read 30 more comments, and so on until you land back on the same JIRA, which is not concluded yet.

Possible solutions for the above 4 points:
A. Add a new JIRA field to crisply summarize what conclusive discussion has taken place till now, what the status of the current JIRA is, the proposed/feasible solution, etc.
B. Mark low hanging fruit regularly.
C. Validate/reject newly reported JIRAs on priority. Use the dev list to validate/reject the issue before logging the JIRA?
D. Make sure that duplicates are real, proven duplicates.

5. Insufficient code comments.
Solution: Code comments should be a mandatory part of the code review checklist. It makes reviews faster and encourages people to understand the flow and fix things on their own.
6. Insufficient design documentation.
Solution: Detailed documentation for at least new features, so that people are comfortable with the design. Reading English and understanding diagrams/flows is much simpler than just jumping into the code upfront.
7. No/little formal communication on active development and the way forward.
Solution: What about a monthly summary of new/hot/critical JIRAs and new feature development (with JIRA links so that topics of interest are accessible)?

Thanks
Anuj

On Thu, 10 Nov, 2016 at 7:09 AM, Nate McCall wrote: I like the idea of a goal-based approach. I think that would make coming to a consensus a bit easier particularly if a larger number of people are involved. On Tue, Nov 8, 2016 at 8:04 PM, Dikang Gu wrote: > My 2 cents. I'm wondering is it a good idea to have some high level goals > for the major release? For example, the goals could be something like: > 1. Improve the scalability/reliability/performance by X%. > 2. Add Y new features (feature A, B, C, D...). > 3. Fix Z known issues (issue A, B, C, D...). > > I feel If we can have the high level goals, it would be easy to pick the > jiras to be included in the release. > > Does it make sense? > > Thanks > Dikang. > > On Mon, Nov 7, 2016 at 11:22 AM, Oleksandr Petrov < > oleksandr.pet...@gmail.com> wrote: > >> Recently there was another discussion on documentation and comments [1] >> >> On one hand, documentation and comments will help newcomers to familiarise >> themselves with the codebase. On the other - one may get up to speed by >> reading the code and adding some docs. Such things may require less >> oversight and can play some role in improving diversity / increasing an >> amount of involved people. >> >> Same thing with tests. There are some areas where tests need some >> refactoring / improvements, or even just splitting them from one file to >> multiple. It's a good way to experience the process and get involved into >> discussion. 
>> >> For that, we could add some issues with subtasks (just a few for starters) >> or even just a wiki page with a doc/test wishlist where everyone could add >> a couple of points. >> >> Docs and tests could be used in addition to lhf issues, helping people, >> having comprehensive and quick process and everything else that was >> mentioned in this thread. >> >> Thank you. >> >> [1] >> http://mail-archives.apache.org/mod_mbox/cassandra-dev/201605.mbox/% >> 3ccakkz8q088ojbvhycyz2_2eotqk4y-svwiwksinpt6rr9pop...@mail.gmail.com%3E >> >> On Mon, Nov 7, 2016 at 5:38 PM Aleksey Yeschenko >> wrote: >> >> > Agreed. >> > >> > -- >> > AY >> > >> > On 7 November 2016 at 16:38:07, Jeff Jirsa (jeff.ji...@crowdstrike.com) >> > wrote: >> > >> > ‘Accepted’ JIRA status seems useful, but would encourage something more >> > explicit like ‘Concept Accepted’ or similar to denote that the concept is >> > agreed upon, but the actual patch itself may not be accepted yet. >> > >> > /bikeshed. >> > >> > On 11/7/16, 2:56 AM, "Ben Slater" wrote: >> > >> > >Thanks Dave. The shepherd concept sounds a lot like I had in mind (and a >> > >better name). >> > > >> > >One other thing I noted from the Mesos process - they have an “Accepted” >> > >jira status that comes after open and means “at least one Mesos >> developer >> > >thought that the ideas proposed in the issue are worth pursuing >> further”. >> > >Might also be something to consider as part of a process like this? >> > > >> > >Cheers >> > >Ben >> > > >> > >On Mon, 7 Nov 2016 at 09:37 Dave Lester wrote: >> > > >
Re: Broader community involvement in 4.0 (WAS Re: Rough roadmap for 4.0)
Thanks for the information Jeremy. My main concern is around making JIRAs easy to understand. I am not sure how community feels about it. But, I have personally observed that long discussion thread on JIRAs is not user friendly for someone trying to understand the ticket or may be trying to contribute to the discussion/fix . I strongly feel that there should be a better way e.g. a summary field in JIRA which filters out the discussions, arguments, solutions etc.and just crisply summarizes the problem, solution under discussion and the current status. Sometimes description of the defect is not sufficient. For a new comer trying to understand a JIRA, this summary would be a good start to understand the problem upfront and then if you want to go into details, you can understand the long JIRA thread. Also, some JIRAs are in dead state and you don't get a clue what's the current status after so much discussion over the ticket? Some JIRAs are neither rejected nor validated, so even if its a bug, some people would be reluctant to pick JIRAs which have not been validated yet. ThanksAnuj On Friday, 11 November 2016 1:40 AM, Jeremy Hanna wrote: Regarding low hanging fruit, on the How To Contribute page [1] we’ve tried to keep a list of lhf tickets [2] linked to help people get started. They are usually good starting points and don’t require much context. I rarely see duplicates from lhf tickets. Regarding duplicates, in my experience those who resolve tickets as duplicates are generally pretty good. I think the safest bet to start is to look at How To Contribute page and the lhf labeled tickets. [1] https://wiki.apache.org/cassandra/HowToContribute <https://wiki.apache.org/cassandra/HowToContribute> [2] https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+=+12310865+AND+labels+=+lhf+AND+status+!=+resolved <https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+=+12310865+AND+labels+=+lhf+AND+status+!=+resolved> > On Nov 10, 2016, at 12:06 PM, Anuj Wadehra > wrote: > > > Hi, > > We need to understand that time is precious for all of us. Even if a > developer has intentions to contribute, he may take months to contribute his > first patch or may be longer. Some common entry barriers are: > 1. Difficult to identify low hanging fruits. 30 JIRA comments on a ticket and > a new comer is LOST, even though the exact fix may be much simpler. > 2. Dead JIRA discussions with no clue on the current status of the ticket. > 3. No response on new JIRAs raised. Response time to validate/reject the > problem is important. Should I pick? Is it really a bug? Maybe some expert > can confirm it first and then I can pick it.. > 4.Ping Pong JIRAs: Your read 10 comments of a ticket then see duplicates and > related ones..then read 30 more comments and then so on till you land up on > same JIRA which is not concluded yet. > Possible Solution for above 4 points: > A. Add a new JIRA field to crisply summarize what conclusive discussion has > taken place till now ,what's the status of current JIRA, proposed/feasible > solution etc. > B. Mark low hanging fruits regularly. > C. Validate/Reject newly reported JIRAs on priority. Using dev list to > validate/reject the issue before logging the JIRA?? > D. Make sure that duplicates are real proven duplicates. > > 5. Insufficient code comments. > Solution: Coding comments should be a mandatory part of code review > checklist. 
It makes reviews faster and encourage people to understand the > flow and fix things on their own. > 6. Insufficient Design documentation. > Solution:Detailed documentation for at least new features so that people are > comfortable with the design. Reading English and understanding diagrams/flows > is much simpler than just jumping into the code upfront. > 7. No/Little formal communication on active development and way forward. > Solution: What about a monthly summary of New/Hot/critical JIRAs and new > feature development (with JIRA links so that topics of interest are > accessible)? > > ThanksAnuj > > > On Thu, 10 Nov, 2016 at 7:09 AM, Nate McCall wrote: I >like the idea of a goal-based approach. I think that would make > coming to a consensus a bit easier particularly if a larger number of > people are involved. > > On Tue, Nov 8, 2016 at 8:04 PM, Dikang Gu wrote: >> My 2 cents. I'm wondering is it a good idea to have some high level goals >> for the major release? For example, the goals could be something like: >> 1. Improve the scalability/reliability/performance by X%. >> 2. Add Y new features (feature A, B, C, D...). >> 3. Fix Z known issues (issue A, B, C, D...). >> >> I feel If we can have the high
Re: Wrapping up tick-tock
Hi, Now that we are rethinking versioning and release frequency, there exists an opportunity to make life easier for Cassandra users. How often mailing lists are discussing: "Which Cassandra version is stable for production?"OR"Is x version stable?" Your release version should indicate your confidence on the stability of the release , is it a bug fix or a feature release, are there any breaking changes or not. +1 semver and alpha/beta/GA releases So that you dont find every second Cassandra user asking about the latest stable Cassandra version. Thanks Anuj On Sat, 14 Jan, 2017 at 1:04 AM, Jeff Jirsa wrote: Mick proposed it (semver) in one of the release proposals, and I dropped the ball on sending out the actual "vote on which release plan we want to use" email, because I messed up and got busy. On Fri, Jan 13, 2017 at 11:26 AM, Russell Bradberry wrote: > Has any thought been given to SemVer? > > http://semver.org/ > > -Russ > > On 1/13/17, 1:57 PM, "Jason Brown" wrote: > > It's fine to limit the minimum time between major releases to six > months, > but I do not think we should force a major just because n months have > passed. I think we should up the major only when we have significant > (possibly breaking) changes/features. It would seem odd to have a 6.0 > that's basically the same as 4.0 (in terms of features and > protocol/format > compatibility). > > Thoughts? > > On Wed, Jan 11, 2017 at 1:58 AM, Stefan Podkowinski > wrote: > > > I honestly don't understand the release cadence discussion. The 3.x > branch > > is far from production ready. Is this really the time to plan the > next > > major feature releases on top of it, instead of focusing to > stabilize 3.x > > first? Who knows how long that would take, even if everyone would > > exclusively work on bug fixing (which I think should happen). > > > > On Tue, Jan 10, 2017 at 4:29 PM, Jonathan Haddad > > wrote: > > > > > I don't see why it has to be one extreme (yearly) or another > (monthly). > > > When you had originally proposed Tick Tock, you wrote: > > > > > > "The primary goal is to improve release quality. Our current > major “dot > > > zero” releases require another five or six months to make them > stable > > > enough for production. This is directly related to how we pile > features > > in > > > for 9 to 12 months and release all at once. The interactions > between the > > > new features are complex and not always obvious. 2.1 was no > exception, > > > despite DataStax hiring a full tme test engineering team > specifically for > > > Apache Cassandra." > > > > > > I agreed with you at the time that the yearly cycle was too long > to be > > > adding features before cutting a release, and still do now. > Instead of > > > elastic banding all the way back to a process which wasn't working > > before, > > > why not try somewhere in the middle? A release every 6 months > (with > > > monthly bug fixes for a year) gives: > > > > > > 1. long enough time to stabilize (1 year vs 1 month) > > > 2. not so long things sit around untested forever > > > 3. only 2 releases (current and previous) to do bug fix support at > any > > > given time. > > > > > > Jon > > > > > > On Tue, Jan 10, 2017 at 6:56 AM Jonathan Ellis > > wrote: > > > > > > > Hi all, > > > > > > > > We’ve had a few threads now about the successes and failures of > the > > > > tick-tock release process and what to do to replace it, but they > all > > died > > > > out without reaching a robust consensus. 
> > > > > > > > In those threads we saw several reasonable options proposed, but > from > > my > > > > perspective they all operated in a kind of theoretical fantasy > land of > > > > testing and development resources. In particular, it takes > around a > > > > person-week of effort to verify that a release is ready. That > is, > > going > > > > through all the test suites, inspecting and re-running failing > tests to > > > see > > > > if there is a product problem or a flaky test. > > > > > > > > (I agree that in a perfect world this wouldn’t be necessary > because > > your > > > > test ci is always green, but see my previous framing of the > perfect > > world > > > > as a fantasy land. It’s also worth noting that this is a common > > problem > > > > for large OSS projects, not necessarily something to beat > ourselves up > > > > over, but in any case, that's our reality right now.) > > > > > > > > I submit that any process that assumes a monthly release cadence > is not > > > > realistic from a resourcing standpoint for this validation. > Notably, > > we > > > > have struggled to marshal this for 3.10 for two months now. > > > > > > > > Therefore, I suggest first that we collectively roll up our > sleeves to > > > vet > > >
Restore Snapshot
Hi,

I am curious to know how people practically use snapshot restore, given that a snapshot restore may lead to inconsistent reads until a full repair is run on the node being restored (if you have dropped mutations in your cluster).

Example:
9 am - snapshot taken on all 3 nodes
10 am - mutation dropped on node 3
11 am - snapshot restored on node 1

Now the data is only on node 2 (if we are writing at quorum), and we will observe inconsistent reads till we repair node 1.

If you restore the snapshot with join_ring=false, repair the node, and then join the restored node once the repair completes, the node will not lead to inconsistent reads, but it will miss the writes made while it is being repaired: simply booting the node with join_ring=false also stops writes from being pushed to the node (unlike bootstrap with join_ring=false, where writes are pushed to the node being bootstrapped). Thus you would need another full repair to bring the data on the node restored via snapshot in sync with the other nodes.

It's hard to believe that a simple snapshot restore scenario is still broken and people are not complaining. So, I thought of asking the community members: how do you practically use snapshot restore while addressing the read inconsistency issue?

Thanks
Anuj

Sent from Yahoo Mail on Android
Re: Restore Snapshot
I mistakenly posted this on the dev mailing list. Please ignore. Posting it on the user mailing list. :)

Thanks
Anuj

Sent from Yahoo Mail on Android

On Tue, Jun 27, 2017 at 7:01 PM, Anuj Wadehra wrote: Hi, I am curious to know how people practically use snapshot restore, given that a snapshot restore may lead to inconsistent reads until a full repair is run on the node being restored (if you have dropped mutations in your cluster). Example: 9 am - snapshot taken on all 3 nodes; 10 am - mutation dropped on node 3; 11 am - snapshot restored on node 1. Now the data is only on node 2 (if we are writing at quorum), and we will observe inconsistent reads till we repair node 1. If you restore the snapshot with join_ring=false, repair the node, and then join the restored node once the repair completes, the node will not lead to inconsistent reads, but it will miss the writes made while it is being repaired: simply booting the node with join_ring=false also stops writes from being pushed to the node (unlike bootstrap with join_ring=false, where writes are pushed to the node being bootstrapped). Thus you would need another full repair to bring the data on the node restored via snapshot in sync with the other nodes. It's hard to believe that a simple snapshot restore scenario is still broken and people are not complaining. So, I thought of asking the community members: how do you practically use snapshot restore while addressing the read inconsistency issue? Thanks Anuj Sent from Yahoo Mail on Android
URGENT: CASSANDRA-14092 causes Data Loss
Hi, For all those people who use MAX TTL=20 years for inserting/updating data in production, https://issues.apache.org/jira/browse/CASSANDRA-14092 can silently cause irrecoverable Data Loss. This seems like a certain TOP MOST BLOCKER to me. I think the category of the JIRA must be raised to BLOCKER from Major. Unfortunately, the JIRA is still "Unassigned" and no one seems to be actively working on it. Just like any other critical vulnerability, this vulnerability demands immediate attention from some very experienced folks to bring out an Urgent Fast Track Patch for all currently Supported Cassandra versions 2.1,2.2 and 3.x. As per my understanding of the JIRA comments, the changes may not be that trivial for older releases. So, community support on the patch is very much appreciated. Thanks Anuj
Re: URGENT: CASSANDRA-14092 causes Data Loss
Hi Jeremiah,

Validation is on the TTL value, not on (system_time + TTL). You can test it with the example below. The insert is successful, the overflow happens silently and the data is lost:

create table test(name text primary key,age int);
insert into test(name,age) values('test_20yrs',30) USING TTL 630720000;
select * from test where name='test_20yrs';

 name | age
------+-----

(0 rows)

insert into test(name,age) values('test_20yr_plus_1',30) USING TTL 630720001;
InvalidRequest: Error from server: code=2200 [Invalid query] message="ttl is too large. requested (630720001) maximum (630720000)"

Thanks
Anuj

On Friday 26 January 2018, 12:11:03 AM IST, J. D. Jordan wrote: Where is the dataloss? Does the INSERT operation return successfully to the client in this case? From reading the linked issues it sounds like you get an error client side. -Jeremiah > On Jan 25, 2018, at 1:24 PM, Anuj Wadehra > wrote: > > Hi, > > For all those people who use MAX TTL=20 years for inserting/updating data in > production, https://issues.apache.org/jira/browse/CASSANDRA-14092 can > silently cause irrecoverable Data Loss. This seems like a certain TOP MOST > BLOCKER to me. I think the category of the JIRA must be raised to BLOCKER > from Major. Unfortunately, the JIRA is still "Unassigned" and no one seems to > be actively working on it. Just like any other critical vulnerability, this > vulnerability demands immediate attention from some very experienced folks to > bring out an Urgent Fast Track Patch for all currently Supported Cassandra > versions 2.1,2.2 and 3.x. As per my understanding of the JIRA comments, the > changes may not be that trivial for older releases. So, community support on > the patch is very much appreciated. > > Thanks > Anuj - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
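To see where the overflow happens, here is a small illustrative Java sketch; it is not the actual Cassandra storage-engine code, just 32-bit arithmetic showing that adding a 20-year TTL to the current time wraps past 2^31 - 1 (19 January 2038) and goes negative, which matches the silent expiry shown in the cqlsh example above.

    // Illustrative sketch (not actual Cassandra code): a 20-year TTL added to the
    // current time overflows a signed 32-bit expiration timestamp once the result
    // passes 2^31 - 1, i.e. 19 January 2038.
    public class TtlOverflowSketch
    {
        public static void main(String[] args)
        {
            int nowInSeconds = (int) (System.currentTimeMillis() / 1000L);
            int twentyYearTtl = 20 * 365 * 24 * 60 * 60;            // 630,720,000 seconds
            int localExpirationTime = nowInSeconds + twentyYearTtl; // silently wraps negative

            System.out.println("now                 = " + nowInSeconds);
            System.out.println("ttl                 = " + twentyYearTtl);
            System.out.println("localExpirationTime = " + localExpirationTime);
            // A wrapped (negative) expiration time lies in the past, so the inserted
            // row appears to be already expired, as in the cqlsh example above.
            System.out.println("overflowed          = " + (localExpirationTime < 0));
        }
    }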
Re: URGENT: CASSANDRA-14092 causes Data Loss
Hi Paulo, Thanks for looking into the issue on priority. I have serious concerns regarding reducing the TTL to 15 yrs.The patch will immediately break all existing applications in Production which are using 15+ yrs TTL. And then they would be stuck again until all the logic in Production software is modified and the software is upgraded immediately. This may take days. Such heavy downtime is generally not acceptable for any business. Yes, they will not have silent data loss but they would not be able to do any business either. I think the permanent fix must be prioritized and put on extremely fast track. This is a certain Blocker and the impact could be enormous--with and without the 15 year short-term patch. And believe me --there are plenty such business use cases where you use very long TTLs such as 20 yrs for compliance and other reasons. Thanks Anuj On Friday 26 January 2018, 4:57:13 AM IST, Michael Kjellman wrote: why are people inserting data with a 15+ year TTL? sorta curious about the actual use case for that. > On Jan 25, 2018, at 12:36 PM, horschi wrote: > > The assertion was working fine until yesterday 03:14 UTC. > > The long term solution would be to work with a long instead of a int. The > serialized seems to be a variable-int already, so that should be fine > already. > > If you change the assertion to 15 years, then applications might fail, as > they might be setting a 15+ year ttl. > > regards, > Christian > > On Thu, Jan 25, 2018 at 9:19 PM, Paulo Motta > wrote: > >> Thanks for raising this. Agreed this is bad, when I filed >> CASSANDRA-14092 I thought a write would fail when localDeletionTime >> overflows (as it is with 2.1), but that doesn't seem to be the case on >> 3.0+ >> >> I propose adding the assertion back so writes will fail, and reduce >> the max TTL to something like 15 years for the time being while we >> figure a long term solution. >> >> 2018-01-25 18:05 GMT-02:00 Jeremiah D Jordan : >>> If you aren’t getting an error, then I agree, that is very bad. Looking >> at the 3.0 code it looks like the assertion checking for overflow was >> dropped somewhere along the way, I had only been looking into 2.1 where you >> get an assertion error that fails the query. >>> >>> -Jeremiah >>> >>>> On Jan 25, 2018, at 2:21 PM, Anuj Wadehra >> wrote: >>>> >>>> >>>> Hi Jeremiah, >>>> Validation is on TTL value not on (system_time+ TTL). You can test it >> with below example. Insert is successful, overflow happens silently and >> data is lost: >>>> create table test(name text primary key,age int); >>>> insert into test(name,age) values('test_20yrs',30) USING TTL 63072; >>>> select * from test where name='test_20yrs'; >>>> >>>> name | age >>>> --+- >>>> >>>> (0 rows) >>>> >>>> insert into test(name,age) values('test_20yr_plus_1',30) USING TTL >> 630720001;InvalidRequest: Error from server: code=2200 [Invalid query] >> message="ttl is too large. requested (630720001) maximum (63072)" >>>> ThanksAnuj >>>> On Friday 26 January 2018, 12:11:03 AM IST, J. D. Jordan < >> jeremiah.jor...@gmail.com> wrote: >>>> >>>> Where is the dataloss? Does the INSERT operation return successfully >> to the client in this case? From reading the linked issues it sounds like >> you get an error client side. 
>>>> >>>> -Jeremiah >>>> >>>>> On Jan 25, 2018, at 1:24 PM, Anuj Wadehra >> wrote: >>>>> >>>>> Hi, >>>>> >>>>> For all those people who use MAX TTL=20 years for inserting/updating >> data in production, https://issues.apache.org/jira/browse/CASSANDRA-14092 >> can silently cause irrecoverable Data Loss. This seems like a certain TOP >> MOST BLOCKER to me. I think the category of the JIRA must be raised to >> BLOCKER from Major. Unfortunately, the JIRA is still "Unassigned" and no >> one seems to be actively working on it. Just like any other critical >> vulnerability, this vulnerability demands immediate attention from some >> very experienced folks to bring out an Urgent Fast Track Patch for all >> currently Supported Cassandra versions 2.1,2.2 and 3.x. As per my >> understanding of the JIRA comments, the changes may not be that trivial for >> older releases. So, community support on the patch is very much appreciated. >>
Re: URGENT: CASSANDRA-14092 causes Data Loss
Hi Jeff, Thanks for the prompt action! I agree that patching an application MAY have a shorter life cycle than patching Cassandra in production. But, in the interest of the larger Cassandra user community, we should put our best effort to avoid breaking all the affected applications in production. We should also consider that updating business logic as per the new 15 year TTL constraint may have business implications for many users. I have a limited understanding about the complexity of the code patch, but it may be more feasible to extend the 20 year limit in Cassandra in 2.1/2.2 rather than asking all impacted users to do an immediate business logic adaptation. Moreover, now that we officially support Cassandra 2.1 & 2.2 until 4.0 release and provide critical fixes for 2.1, it becomes even more reasonable to provide this extremely critical patch for 2.1 & 2.2 (unless its absolutely impossible). Still, many users use Cassandra 2.1 and 2.2 in their most critical production systems. Thanks Anuj On Friday 26 January 2018, 11:06:30 AM IST, Jeff Jirsa wrote: We’ll get patches out. They almost certainly aren’t going to change the sstable format for old versions (unless whoever writes the patch makes a great argument for it), so there’s probably not going to be post-2038 ttl support for 2.1/2.2. For those old versions, we can definitely make it not lose data, but we almost certainly aren’t going to make the ttl go past 2038 in old versions. More importantly, any company trying to do 20 year ttl’s that’s waiting for a patched version should start by patching their app to not write invalid ttls - your app release cycle is almost certainly faster than db patch / review / test / release / validation, and you can avoid the data loss application side by calculating the ttl explicitly. It’s not the best solution, but it beats doing nothing, and we’re not rushing out a release in less than a day (we haven’t even started a vote, and voting window is 72 hours for members to review and approve or reject the candidate). -- Jeff Jirsa > On Jan 25, 2018, at 9:07 PM, Jeff Jirsa wrote: > > Patches welcome. > > -- > Jeff Jirsa > > >> On Jan 25, 2018, at 8:15 PM, Anuj Wadehra >> wrote: >> >> Hi Paulo, >> >> Thanks for looking into the issue on priority. I have serious concerns >> regarding reducing the TTL to 15 yrs.The patch will immediately break all >> existing applications in Production which are using 15+ yrs TTL. And then >> they would be stuck again until all the logic in Production software is >> modified and the software is upgraded immediately. This may take days. Such >> heavy downtime is generally not acceptable for any business. Yes, they will >> not have silent data loss but they would not be able to do any business >> either. I think the permanent fix must be prioritized and put on extremely >> fast track. This is a certain Blocker and the impact could be enormous--with >> and without the 15 year short-term patch. >> >> And believe me --there are plenty such business use cases where you use very >> long TTLs such as 20 yrs for compliance and other reasons. >> >> Thanks >> Anuj >> >> On Friday 26 January 2018, 4:57:13 AM IST, Michael Kjellman >> wrote: >> >> why are people inserting data with a 15+ year TTL? sorta curious about the >> actual use case for that. >> >>> On Jan 25, 2018, at 12:36 PM, horschi wrote: >>> >>> The assertion was working fine until yesterday 03:14 UTC. >>> >>> The long term solution would be to work with a long instead of a int. 
The >>> serialized seems to be a variable-int already, so that should be fine >>> already. >>> >>> If you change the assertion to 15 years, then applications might fail, as >>> they might be setting a 15+ year ttl. >>> >>> regards, >>> Christian >>> >>> On Thu, Jan 25, 2018 at 9:19 PM, Paulo Motta >>> wrote: >>> >>>> Thanks for raising this. Agreed this is bad, when I filed >>>> CASSANDRA-14092 I thought a write would fail when localDeletionTime >>>> overflows (as it is with 2.1), but that doesn't seem to be the case on >>>> 3.0+ >>>> >>>> I propose adding the assertion back so writes will fail, and reduce >>>> the max TTL to something like 15 years for the time being while we >>>> figure a long term solution. >>>> >>>> 2018-01-25 18:05 GMT-02:00 Jeremiah D Jordan : >>>>> If you aren’t getting an error, then I agree, that is very bad. Looking >>>> at the 3.0 code it looks like the asser
Re: URGENT: CASSANDRA-14092 causes Data Loss
Hi Jeff, One correction in my last message: "it may be more feasible to SUPPORT (not extend) the 20 year limit in Cassandra in 2.1/2.2". I completely agree that the existing 20 years TTL support is okay for older versions. If I have understood your last message correctly, upcoming patches are on following lines : 1. New Patches shall be released for 2.1, 2.2 and 3.x.2. The patches for 2.1 & 2.2 would support the existing 20 year TTL limit and ensure that there is no data loss when 20 year is set as TTL.3. The patches for 2.1 and 2.2 are unlikely to update the sstable format. 4. 3.x patches may even remove the 20 year TTL constraint (and extend TTL support beyond 2038). I think that the JIRA priority should be increased from "Major" to "Blocker" as the JIRA may cause unexpected data loss. Also, all impacted versions should be included in the JIRA. This will attract the due attention of all Cassandra users. ThanksAnuj On Friday 26 January 2018, 12:47:18 PM IST, Anuj Wadehra wrote: Hi Jeff, Thanks for the prompt action! I agree that patching an application MAY have a shorter life cycle than patching Cassandra in production. But, in the interest of the larger Cassandra user community, we should put our best effort to avoid breaking all the affected applications in production. We should also consider that updating business logic as per the new 15 year TTL constraint may have business implications for many users. I have a limited understanding about the complexity of the code patch, but it may be more feasible to extend the 20 year limit in Cassandra in 2.1/2.2 rather than asking all impacted users to do an immediate business logic adaptation. Moreover, now that we officially support Cassandra 2.1 & 2.2 until 4.0 release and provide critical fixes for 2.1, it becomes even more reasonable to provide this extremely critical patch for 2.1 & 2.2 (unless its absolutely impossible). Still, many users use Cassandra 2.1 and 2.2 in their most critical production systems. Thanks Anuj On Friday 26 January 2018, 11:06:30 AM IST, Jeff Jirsa wrote: We’ll get patches out. They almost certainly aren’t going to change the sstable format for old versions (unless whoever writes the patch makes a great argument for it), so there’s probably not going to be post-2038 ttl support for 2.1/2.2. For those old versions, we can definitely make it not lose data, but we almost certainly aren’t going to make the ttl go past 2038 in old versions. More importantly, any company trying to do 20 year ttl’s that’s waiting for a patched version should start by patching their app to not write invalid ttls - your app release cycle is almost certainly faster than db patch / review / test / release / validation, and you can avoid the data loss application side by calculating the ttl explicitly. It’s not the best solution, but it beats doing nothing, and we’re not rushing out a release in less than a day (we haven’t even started a vote, and voting window is 72 hours for members to review and approve or reject the candidate). -- Jeff Jirsa > On Jan 25, 2018, at 9:07 PM, Jeff Jirsa wrote: > > Patches welcome. > > -- > Jeff Jirsa > > >> On Jan 25, 2018, at 8:15 PM, Anuj Wadehra >> wrote: >> >> Hi Paulo, >> >> Thanks for looking into the issue on priority. I have serious concerns >> regarding reducing the TTL to 15 yrs.The patch will immediately break all >> existing applications in Production which are using 15+ yrs TTL. 
And then >> they would be stuck again until all the logic in Production software is >> modified and the software is upgraded immediately. This may take days. Such >> heavy downtime is generally not acceptable for any business. Yes, they will >> not have silent data loss but they would not be able to do any business >> either. I think the permanent fix must be prioritized and put on extremely >> fast track. This is a certain Blocker and the impact could be enormous--with >> and without the 15 year short-term patch. >> >> And believe me --there are plenty such business use cases where you use very >> long TTLs such as 20 yrs for compliance and other reasons. >> >> Thanks >> Anuj >> >> On Friday 26 January 2018, 4:57:13 AM IST, Michael Kjellman >> wrote: >> >> why are people inserting data with a 15+ year TTL? sorta curious about the >> actual use case for that. >> >>> On Jan 25, 2018, at 12:36 PM, horschi wrote: >>> >>> The assertion was working fine until yesterday 03:14 UTC. >>> >>> The long term solution would be to work with a long instead of a int. The >>> serialized seems to be a variable-int already, so that should be fine >>> al
Re: URGENT: CASSANDRA-14092 causes Data Loss
Hi Paulo, Thanks for coming out with the Emergency Hot Fix!! The patch will help many Cassandra users in saving their precious data. I think the criticality and urgency of the bug is too high. How can we make sure that maximum Cassandra users are alerted about the silent deletion problem? What are formal ways of working for broadcasting such critical alerts? I still see that the JIRA is marked as a "Major" defect and not a "Blocker". What worst can happen to a database than irrecoverable silent deletion of successfully inserted data. I hope you understand. ThanksAnuj On Fri, 26 Jan 2018 at 18:57, Paulo Motta wrote: > I have serious concerns regarding reducing the TTL to 15 yrs.The patch will immediately break all existing applications in Production which are using 15+ yrs TTL. In order to prevent applications from breaking I will update the patch to automatically set the maximum TTL to '03:14:08 UTC 19 January 2038' when it overflows and log a warning as a initial measure. We will work on extending this limit or lifting this limitation, probably for the 3.0+ series due to the large scale compatibility changes required on lower versions, but community patches are always welcome. Companies that cannot upgrade to a version with the proper fix will need to workaround this limitation in some other way: do a batch job to delete old data periodically, perform deletes with timestamps in the future, etc. > If its a 32 bit timestamp, can't we just save/read localDeletionTime as > unsinged int? The proper fix will likely be along these lines, but this involve many changes throughout the codebase where localDeletionTime is consumed and extensive testing, reviewing, etc, so we're now looking into a emergency hot fix to prevent silent data loss while the permanent fix is not in place. 2018-01-26 6:27 GMT-02:00 Anuj Wadehra : > Hi Jeff, > One correction in my last message: "it may be more feasible to SUPPORT (not > extend) the 20 year limit in Cassandra in 2.1/2.2". > I completely agree that the existing 20 years TTL support is okay for older > versions. > > If I have understood your last message correctly, upcoming patches are on > following lines : > > 1. New Patches shall be released for 2.1, 2.2 and 3.x.2. The patches for 2.1 > & 2.2 would support the existing 20 year TTL limit and ensure that there is > no data loss when 20 year is set as TTL.3. The patches for 2.1 and 2.2 are > unlikely to update the sstable format. > 4. 3.x patches may even remove the 20 year TTL constraint (and extend TTL > support beyond 2038). > I think that the JIRA priority should be increased from "Major" to "Blocker" > as the JIRA may cause unexpected data loss. Also, all impacted versions > should be included in the JIRA. This will attract the due attention of all > Cassandra users. > ThanksAnuj > On Friday 26 January 2018, 12:47:18 PM IST, Anuj Wadehra > wrote: > > Hi Jeff, > > Thanks for the prompt action! I agree that patching an application MAY have a > shorter life cycle than patching Cassandra in production. But, in the > interest of the larger Cassandra user community, we should put our best > effort to avoid breaking all the affected applications in production. We > should also consider that updating business logic as per the new 15 year TTL > constraint may have business implications for many users. 
I have a limited understanding about the complexity of the code patch, but it may be more feasible to extend the 20 year limit in Cassandra in 2.1/2.2 rather than asking all impacted users to do an immediate business logic adaptation. Moreover, now that we officially support Cassandra 2.1 & 2.2 until the 4.0 release and provide critical fixes for 2.1, it becomes even more reasonable to provide this extremely critical patch for 2.1 & 2.2 (unless it's absolutely impossible). Many users still run Cassandra 2.1 and 2.2 in their most critical production systems.

Thanks
Anuj

On Friday 26 January 2018, 11:06:30 AM IST, Jeff Jirsa wrote:

> We’ll get patches out. They almost certainly aren’t going to change the sstable format for old versions (unless whoever writes the patch makes a great argument for it), so there’s probably not going to be post-2038 TTL support for 2.1/2.2. For those old versions, we can definitely make it not lose data, but we almost certainly aren’t going to make the TTL go past 2038 in old versions.
>
> More importantly, any company trying to do 20-year TTLs that’s waiting for a patched version should start by patching their app to not write invalid TTLs - your app release cycle is almost certainly faster than db patch / review / test / release / validation
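A rough sketch of the application-side mitigation Jeff describes, clamping the requested TTL so the computed expiration never crosses the 2038 limit (illustrative only; the helper and constant names are assumptions, not an official Cassandra API):

    import java.time.Instant;

    public final class SafeTtl {
        // Largest expiration the storage engine can represent today:
        // Integer.MAX_VALUE seconds since the epoch = 2038-01-19T03:14:07Z.
        private static final long MAX_EXPIRATION_EPOCH_SECONDS = Integer.MAX_VALUE;

        // Returns the requested TTL (seconds), reduced if now + ttl would overflow.
        public static int clampTtlSeconds(long requestedTtlSeconds) {
            long now = Instant.now().getEpochSecond();
            long maxSafeTtl = MAX_EXPIRATION_EPOCH_SECONDS - now;
            return (int) Math.min(requestedTtlSeconds, maxSafeTtl);
        }
    }

The clamped value can then be bound to a statement such as INSERT ... USING TTL ?, so writes keep succeeding (with a shorter effective TTL) until a patched Cassandra version is in place.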
Design Proposal for Auditing feature in Cassandra
Hi, Apache Cassandra doesn't provide an auditing feature. As database auditing is critical for any production-level database like Apache Cassandra, our team is keen on designing & implementing this feature in Apache Cassandra. I have submitted the design proposal for the "Database Auditing" feature under the JIRA: https://issues.apache.org/jira/browse/CASSANDRA-12151 . Can some of you please review the proposal and share your feedback?
Thanks
Anuj
Run Mixed Workload using two instances on one node
Hi,

We are trying to decouple our Reporting DB from OLTP. Need urgent help on the feasibility of the proposed solution for PRODUCTION.

Use Case: Currently, our OLTP and Reporting application and DB are the same. Some CFs are used for both OLTP and Reporting while others are used solely for Reporting. Every business transaction synchronously updates the main OLTP CF and asynchronously updates the other Reporting CFs.

Problem Statement:
1. Decouple Reporting and OLTP such that Reporting load can't impact OLTP performance.
2. Scaling of the Reporting and OLTP modules must be independent.
3. The OLTP client should not update all Reporting CFs. We generate data records on the file system/shared disk; Reporting should use these records to create the Reporting DB.
4. Small customers may do OLTP and Reporting on the same 3-node cluster. Bigger customers can be given an option to have dedicated OLTP and Reporting nodes. So, a standard hardware box should be usable for 3 deployments (OLTP, Reporting, or OLTP+Reporting).

Note: Reporting is ad-hoc, may involve full table scans, and does not involve analytics. Data size is huge: 2 TB (OLTP+Reporting) per node.

Hardware: Standard deployment is a 3-node cluster, with each node having 24 cores, 64 GB RAM, and 6 x 400 GB SSDs in RAID5.

Proposed Solution:
1. Split OLTP and Reporting clients into two application components.
2. For small deployments where more than 3 nodes are not required (a rough config sketch for this split follows at the end of this message):
   A. Install 2 Cassandra instances on each node, one for OLTP and the other for Reporting.
   B. To distribute I/O load 2:1, remove RAID5 (as Cassandra offers replication) and assign 4 disks as JBOD for OLTP and 2 disks for Reporting.
   C. RAM is abundant and often under-utilized, so assign 8 GB to each of the 2 Cassandra instances.
   D. To make sure that Reporting is not able to overload the CPU, tune concurrent_reads and concurrent_writes.
   The OLTP client will only write to the OLTP DB and generate the DB record. The Reporting client will poll the FS and populate the Reporting DB in the required format.
3. Larger customers can have Reporting clients and DB on dedicated physical nodes with all resources.

Key Questions:
Is it OK to run 2 Cassandra instances on one node in a production system and limit CPU usage, disk I/O, and RAM as suggested above?
Any other solution for the above problem statement?

Thanks
Anuj
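To make the resource split in option 2 concrete, here is a rough sketch of how the two instances' cassandra.yaml files might differ (the ports, paths, and thread-pool values are illustrative assumptions, not recommendations; heap size and JMX port for each instance would be set separately in cassandra-env.sh, e.g. MAX_HEAP_SIZE="8G"):

    # OLTP instance (4 disks, larger thread pools)
    cluster_name: 'oltp_cluster'
    data_file_directories:
        - /data/disk1/cassandra
        - /data/disk2/cassandra
        - /data/disk3/cassandra
        - /data/disk4/cassandra
    commitlog_directory: /data/disk1/commitlog
    native_transport_port: 9042
    storage_port: 7000
    concurrent_reads: 32
    concurrent_writes: 32

    # Reporting instance (2 disks, throttled thread pools to protect OLTP)
    cluster_name: 'reporting_cluster'
    data_file_directories:
        - /data/disk5/cassandra
        - /data/disk6/cassandra
    commitlog_directory: /data/disk5/commitlog
    native_transport_port: 9043
    storage_port: 7001
    concurrent_reads: 8
    concurrent_writes: 8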