Re: [DISCUSSION] Cassandra's code style and source code analysis

2023-08-01 Thread Miklosovic, Stefan
I think we might wait for Accord and transactional metadata as the last big 
contributions in 5.0 (if I have not forgotten something) and then we can just 
polish it all just before the release. There will still be some room to do 
housekeeping like this after these patches land. It is not like Accord will be 
in trunk on Monday and we release on Tuesday ...


From: Maxim Muzafarov 
Sent: Monday, July 31, 2023 23:05
To: dev@cassandra.apache.org
Subject: Re: [DISCUSSION] Cassandra's code style and source code analysis

Hello everyone,


It's been a long time since the last discussion about the import order
code style, so I want to give these changes a chance as all the major
JIRA issues have already landed on the release branch so we won't
affect anyone. I'd be happy to find any reviewers who are interested
in helping with the next steps :-) I've updated the changes to reflect
the latest checkstyle work, so here they are:

https://issues.apache.org/jira/browse/CASSANDRA-17925
https://github.com/apache/cassandra/pull/2108


The changes look scary at first glance, but they're actually quite
simple and in line with what we've discussed above. We can divide
all the affected files into two parts: the update of the code
style configuration files (checkstyle + IDE configs), and the update
of all the sources to match the code style.

In short:

- the "import order" hotkey will work regardless of which IDE you are using;
- the checkstyle, IDEA, Eclipse and NetBeans configurations have been
updated;
- the AvoidStarImport checkstyle rule is applied as well.

The import order we've agreed upon:

java.*
[blank line]
javax.*
[blank line]
com.*
[blank line]
net.*
[blank line]
org.*
[blank line]
org.apache.cassandra.*
[blank line]
all other imports
[blank line]
static all other imports
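
As a rough illustration of the ordering above, the sketch below sorts a list of import statements into those groups. It is written in Python purely for brevity and is not part of the patch or of checkstyle; the names `GROUP_PREFIXES`, `group_index` and `organize_imports` are hypothetical. It assumes that the longest matching prefix wins (so `org.apache.cassandra.*` is not swallowed by `org.*`), that every static import goes into the final group, and that imports are sorted alphabetically within a group.

```python
# Hypothetical sketch of the agreed import order; not the checkstyle implementation.
GROUP_PREFIXES = ["java.", "javax.", "com.", "net.", "org.", "org.apache.cassandra."]
OTHER_GROUP = len(GROUP_PREFIXES)       # "all other imports"
STATIC_GROUP = len(GROUP_PREFIXES) + 1  # every static import, placed last


def group_index(import_line):
    """Return the position of the group that an import statement belongs to."""
    body = import_line.strip().rstrip(";")
    body = body[len("import "):].strip()
    if body.startswith("static "):
        return STATIC_GROUP
    # Assumption: the longest matching prefix decides the group.
    matches = [prefix for prefix in GROUP_PREFIXES if body.startswith(prefix)]
    if not matches:
        return OTHER_GROUP
    return GROUP_PREFIXES.index(max(matches, key=len))


def organize_imports(import_lines):
    """Sort imports by group, then alphabetically, with a blank line between groups."""
    ordered = sorted(set(import_lines), key=lambda line: (group_index(line), line))
    result = []
    previous_group = None
    for line in ordered:
        current_group = group_index(line)
        if previous_group is not None and current_group != previous_group:
            result.append("")  # the "[blank line]" separator between groups
        result.append(line)
        previous_group = current_group
    return result
```

Calling `organize_imports` on an unordered list of `import ...;` lines returns the same lines regrouped, with an empty string wherever the agreed layout has a blank line.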

On Mon, 27 Feb 2023 at 13:26, Maxim Muzafarov  wrote:
>
> > I suppose it can be easy for the existing feature branches if they have a 
> > single commit. Don't we need to adjust each commit for multi-commit feature 
> > branches?
>
> It depends on how feature branches are maintained and developed, I
> guess. My thoughts here are that the IDE's hotkeys should just work to
> resolve any code-style issues that arise during rebase/maintenance.
> I'm not talking about enforcing all our code-style rules but giving
> developers good flexibility. The classes import order rule might be a
> good example here.
>
> On Wed, 22 Feb 2023 at 21:27, Jacek Lewandowski
>  wrote:
> >
> > I suppose it can be easy for the existing feature branches if they have a 
> > single commit. Don't we need to adjust each commit for multi-commit feature 
> > branches?
> >
> > On Wed, 22 Feb 2023 at 19:48, Maxim Muzafarov wrote:
> >>
> >> Hello everyone,
> >>
> >> I have created an issue CASSANDRA-18277 that may help us move forward
> >> with code style changes. It only affects the way we store the IntelliJ
> >> code style configuration and has no effect on any releases, current or
> >> otherwise, so it should be safe to merge. Once the issue is
> >> resolved, every developer who checks out a release branch will use the
> >> same code style stored in that branch. This in turn makes rebasing a
> >> big change like the import order [1] a really straightforward matter
> >> (by pressing Ctrl + Opt + O in their local branch to organize
> >> imports).
> >>
> >> See:
> >>
> >> Move the IntelliJ Idea code style and inspections configuration to the
> >> project's root .idea directory
> >> https://issues.apache.org/jira/browse/CASSANDRA-18277
> >>
> >>
> >>
> >> [1] https://issues.apache.org/jira/browse/CASSANDRA-17925
> >>
> >> On Wed, 25 Jan 2023 at 13:05, Miklosovic, Stefan
> >>  wrote:
> >> >
> >> > Thank you Maxim for doing this.
> >> >
> >> > It is nice to see this effort materialized in a PR.
> >> >
> >> > I would wait until bigger chunks of work are committed to trunk (like 
> >> > CEP-15) to not collide too much. I would say we can postpone doing this 
> >> > until the last weeks before the actual 5.0 release, so we would not clash 
> >> > with any work people would like to include in 5.0. This can go in 
> >> > anytime, basically.
> >> >
> >> > Are people on the same page?
> >> >
> >> > Regards
> >> >
> >> > 
> >> > From: Maxim Muzafarov 
> >> > Sent: Monday, January 23, 2023 19:46
> >> > To: dev@cassandra.apache.org
> >> > Subject: Re: [DISCUSSION] Cassandra's code style and source code analysis
> >> >
> >> > Hello everyone,
> >> >
> >> > You can find the changes here:
> >> > https://issues.apache.org/jira/brow

[CMWG] Agenda and call details for Aug. 2

2023-08-01 Thread Melissa Logan
Join us for the August Cassandra Marketing Working Group tomorrow,
Wednesday, August 2 at 8:00 AM PST.

Agenda

   - Sharing results from the Cassandra user survey (Patrick McFadin)
   - Planet Cassandra Contributors (Patrick McFadin)
   - Update: MVP Program (Melissa Logan)
   - Update: Checklist for Cassandra events (Melissa Logan)
   - Update: How to do calendar better (Melissa Logan)


***NEW ZOOM***
https://us02web.zoom.us/j/82210868338?pwd=V3hrV3BUd2duVU5mVkE4RWhBNDZ3Zz09

Wiki:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=240883297
Subscribe: marketing-subscr...@cassandra.apache.org
Discuss: #cassandra-events (ASF Slack)

See you then.
Melissa


Re: Tokenization and SAI query syntax

2023-08-01 Thread Caleb Rackliffe
Here are some additional bits of prior art, if anyone finds them useful:


The Stratio Lucene Index -
https://github.com/Stratio/cassandra-lucene-index#examples

Stratio was the reason C* added the "expr" functionality. They embedded
something similar to ElasticSearch JSON, which probably isn't my favorite
choice, but it's there.


The ElasticSearch match query syntax -
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html

Again, not my favorite. It's verbose, and probably too powerful for us.


ElasticSearch's documentation for the basic Lucene query syntax -
https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html#query-string-syntax

One idea is to take the basic Lucene index, which it seems we already have
some support for, and feed it to "expr". This is nice for two reasons:

1.) People can just write Lucene queries if they already know how.
2.) No changes to the grammar.

Lucene has distinct concepts of filtering and querying, and this is kind of
the latter. I'm not sure how, for example, we would want "expr" to interact
w/ filters on other column indexes in vanilla CQL space...
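
To make the difference between term matching and substring matching concrete, here is a toy Python sketch of index-time and query-time analysis. It is neither Lucene nor SAI code: `STOPWORDS`, `analyze` and `term_matches` are hypothetical names, and the analyzer only lowercases, splits on non-alphanumeric characters and drops the three stopwords quoted later in this thread.

```python
import re

# Assumption: a tiny stopword list taken from the examples quoted in the thread.
STOPWORDS = {"a", "is", "the"}


def analyze(text):
    """Toy analyzer: lowercase, split into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def term_matches(column_value, term):
    """True if every analyzed token of the query term is an indexed token of the column."""
    query_tokens = analyze(term)
    indexed_tokens = set(analyze(column_value))
    return bool(query_tokens) and all(token in indexed_tokens for token in query_tokens)
```

With this sketch, `term_matches("The Quick Brown Fox", "QUICK")` is true even though `"QUICK"` is not a substring of the stored value, while `term_matches("The Quick Brown Fox", "row")` is false even though `"row"` is a substring of it. That gap is why substring-flavoured operator names can mislead.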


On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie  wrote:

> `column CONTAINS term`. Contains is used by both Java and Python for
> substring searches, so at least some users will be surprised by term-based
> behavior.
>
> I wonder whether users are in their "programming language" headspace or in
> their "querying a database" headspace when interacting with CQL? i.e. this
> would only present confusion if we expected users to be thinking in the
> idioms of their respective programming languages. If they're thinking in
> terms of SQL, MATCHES would probably end up confusing them a bit since it
> doesn't match the general structure of the MATCH operator.
>
> That said, I also think CONTAINS loses something important that you allude
> to here Jonathan:
>
> with corresponding query-time tokenization and analysis.  This means that
> the query term is not always a substring of the original string!  Besides
> obvious transformations like lowercasing, you have things like
> PhoneticFilter available as well.
>
> So to me, neither MATCHES nor CONTAINS are particularly great candidates.
>
> So +1 to the "I don't actually hate it" sentiment on:
>
> `column : term`. Inspired by Lucene’s syntax
>
>
> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
>
>
> I have a strong preference not to use the name of an SQL operator, since
> it precludes us later providing the SQL standard operator to users.
>
> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
>
>
> On 24 Jul 2023, at 13:34, Andrés de la Peña  wrote:
>
> 
> `column = term` is definitely problematic because it creates an
> ambiguity when the queried column belongs to the primary key. For some
> queries we wouldn't know whether the user wants a primary key query using
> regular equality or an index query using the analyzer.
>
> `term_matches(column, term)` seems quite clear and hard to misinterpret,
> but it's quite long to write and its implementation will be challenging
> since we would need a bunch of special casing around SelectStatement and
> functions.
>
> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to
> evoke different behaviours to what they would have.
>
> `column LIKE :term:` seems a bit redundant compared to just using `column
> : term`, and we are still introducing a new symbol.
>
> I think I like `column : term` the most, because it's brief, it's similar
> to Lucene's equivalent syntax, and it doesn't seem to clash with any other
> meaning that I can think of.
>
> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis  wrote:
>
> Hi all,
>
> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
> aligning around phase 2 features.
>
> In particular, we need to nail down the syntax for doing non-exact string
> matches.  We have a proof of concept that includes full Lucene analyzer and
> filter functionality – just the text transformation pieces, none of the
> storage parts – which is the gold standard in this space.  For example, the
> StandardAnalyzer [1] lowercases all terms and removes stopwords (common
> words like “a”, “is”, “the” that are usually not useful to search
> against).  Lucene also has classes that offer stemming, special case
> handling for email, and many languages besides English [2].
>
> What syntax should we use to express “rows whose analyzed tokens match
> this search term?”
>
> The syntax must be clear that we want to look for this term within the
> column data using the configured index with corresponding query-time
> tokenization and analysis.  This means that the query term is not always a
> substring of the original string!  Besides obvious transformations like
> lowercasing, you have things like PhoneticFilter available as well.
>
> Here are my thoughts on some of the options:
>
> `column = term`.  This is what the POC does to

Raw results from User Survey

2023-08-01 Thread Patrick McFadin
Thanks to everyone who participated in this survey. We had enough
responses to give it legitimacy: 220 responses!

I wanted to get the raw results out first so everyone can participate with
the full picture. I'll work on a blog post to post on the Apache web site
after this is done.

Graphs (easy read)
https://docs.google.com/document/d/1Rbg-VP4Xdvgp8EKNczkqfhFYeKwfc_ZmMW0c5Gol9pk/edit?usp=sharing

Anonymized spreadsheet of responses (make your own graphs)
https://docs.google.com/spreadsheets/d/1pjhpjID5sEW4Vcff8tq0Atbcq8Cds18pXorM4CQcStk/edit?usp=sharing

I'll be giving a bit more discussion in the Cassandra marketing meeting
tomorrow if you want to come hear my thoughts.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=240883297

Now, what surprised you in the results?

Patrick