I've been chatting a bit w/ Caleb about this offline and poking around to
better educate myself.

> using functions (ignoring the implementation complexity) at least removes
> ambiguity.

This. Plus, using functions lets us kick the can down the road a bit in terms
of landing on an integrated grammar we agree on. It seems to me there's a
tension between:

1. "SQL-like" (i.e. postgres-like)
2. "Indexing and Search domain-specific-like" (i.e. Lucene syntax which, as
   Benedict points out, doesn't really jell w/ what we have in CQL at this
   point), and
3. ??? Some other YOLO CQL / C* specific thing where we go our own road

I don't think we're really going to know what our indexing feature set is
going to look like, or the shape it's going to take, for a while, so backing
ourselves into any of the 3 corners above right now feels very premature to me.

So I'm coming around to the expr / method call approach to preserve that
flexibility. It's maximally explicit and preserves optionality at the expense
of being clunky. For now.
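
To make the ambiguity concrete, here's a toy Python model of how the function names Caleb floated ("token_match", "phrase_match") could differ from plain "=". This is a sketch only, not SAI or Lucene code: `analyze`, the stopword list, the regex tokenizer, and the AND semantics for multi-term queries are all assumptions I made up for illustration.

```python
import re

# Hypothetical stopword list; a stand-in for whatever the real analyzer drops.
STOPWORDS = {"a", "is", "the", "in"}


def analyze(text, tokenize=True):
    """Toy analyzer: lowercase, then optionally split into terms and drop stopwords."""
    lowered = text.lower()
    if not tokenize:
        # A non-tokenizing analyzer keeps the whole value as a single term.
        return [lowered]
    return [t for t in re.findall(r"\w+", lowered) if t not in STOPWORDS]


def token_match(value, query, tokenize=True):
    """True if every analyzed query term is among the value's analyzed terms.

    AND semantics are an assumption here; OR would be just as defensible.
    """
    terms = set(analyze(value, tokenize))
    return all(q in terms for q in analyze(query, tokenize))


def phrase_match(value, query):
    """True if the analyzed query terms appear contiguously and in order."""
    terms, wanted = analyze(value), analyze(query)
    n = len(wanted)
    return n > 0 and any(
        terms[i:i + n] == wanted for i in range(len(terms) - n + 1)
    )


row = "John Smith landed in Virginia"

print(row == "John Smith")                             # plain equality
print(token_match(row, "smith john"))                  # order-insensitive terms
print(phrase_match(row, "smith john"))                 # order matters for phrases
print(phrase_match(row, "landed Virginia"))            # not a substring of row
print(token_match(row, "john smith", tokenize=False))  # analyzer doesn't tokenize
```

The last two calls are the interesting ones: with the stopword "in" dropped, a phrase can match even though it isn't a substring of the stored value, and the same "token_match" call stops matching once the analyzer doesn't tokenize. Explicit function names at least tell the reader which of these behaviors they asked for.
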
On Mon, Aug 7, 2023, at 4:00 PM, Caleb Rackliffe wrote:
> > I do not think we should start using lucene syntax for it, it will make
> > people think they can do everything else lucene allows.
>
> I'm sure we won't be supporting everything Lucene allows, but this is going
> to evolve. Right off the bat, if you introduce support for tokenization and
> filtering, someone is, for example, going to ask for phrase queries. ("John
> Smith landed in Virginia" is tokenized, but someone wants to match exactly on
> "John Smith".) The whole point of the Vector project is to do relevance,
> right? Are we going to do term boosting? Do we need queries like "field:
> quick brown +fox -news" where fox must be present, news cannot be present,
> and quick and brown increase relevance?
>
> SASI uses "=" and "LIKE" in a way that assumes the user understands the
> tokenization scheme in use on the target field. I understand that's a bit
> ambiguous.
>
> If we object to allowing expr embedding of a subset of the Lucene syntax, I
> can't imagine we're okay w/ then jamming a subset of that syntax into the
> main CQL grammar.
>
> If we want to do this in non-expr CQL space, I think using functions
> (ignoring the implementation complexity) at least removes ambiguity.
> "token_match", "phrase_match", "token_like", "=", and "LIKE" would all be
> pretty clear, although there may be other problems. For instance, what
> happens when I try to use "token_match" on an indexed field whose analyzer
> does not tokenize? We obviously can't use the index, so we'd be reduced to
> requiring a filtering query, but maybe that's fine. My point is that, if
> we're going to make write and read analyzers symmetrical, there's really no
> way to make the semantics of our queries totally independent of analysis.
> (ex. "field : foo bar" behaves differently w/ read tokenization than it does
> without. It could even be an OR or AND query w/ tokenization, depending on
> our defaults.)
>
> On Mon, Aug 7, 2023 at 12:55 PM Atri Sharma <[email protected]> wrote:
>> Why not start with SQLish operators supported by many databases (LIKE and
>> CONTAINS)?
>>
>> On Mon, Aug 7, 2023 at 10:01 PM J. D. Jordan <[email protected]>
>> wrote:
>>>
>>> I am also -1 on directly exposing Lucene-like syntax here. Besides being
>>> ugly, SAI is not Lucene. I do not think we should start using Lucene syntax
>>> for it; it will make people think they can do everything else Lucene allows.
>>>
>>>> On Aug 7, 2023, at 5:13 AM, Benedict <[email protected]> wrote:
>>>>
>>>>
>>>> I’m strongly opposed to :
>>>>
>>>> It is very dissimilar to our current operators. CQL is already not the
>>>> prettiest language, but let’s not make it a total mish mash.
>>>>
>>>>
>>>>
>>>>
>>>>> On 7 Aug 2023, at 10:59, Mike Adamson <[email protected]> wrote:
>>>>>
>>>>> I am also in agreement with 'column : token' in that 'I don't hate it'
>>>>> but I'd like to offer an alternative to this in 'column HAS token'. HAS
>>>>> is currently not a keyword that we use so wouldn't cause any brain
>>>>> conflicts.
>>>>>
>>>>> While I don't hate ':' I have a particular dislike of the lucene search
>>>>> syntax because of its terseness and lack of easy readability.
>>>>>
>>>>> Saying that, I'm happy to go with ':' if that is the decision.
>>>>>
>>>>> On Fri, 4 Aug 2023 at 00:23, Jon Haddad <[email protected]>
>>>>> wrote:
>>>>>> Assuming SAI is a superset of SASI, and we were to set up something so
>>>>>> that SASI indexes auto convert to SAI, this gives even more weight to my
>>>>>> point regarding how differing behavior for the same syntax can lead to
>>>>>> issues. Imo the best case scenario results in the user not even
>>>>>> noticing their indexes have changed.
>>>>>>
>>>>>> A (maybe better?) alternative is to add a flag to the index
>>>>>> configuration for "compatibility mode", which might address the concerns
>>>>>> around using an equality operator when it actually is a partial match.
>>>>>>
>>>>>> For what it's worth, I'm in agreement that = should mean full equality
>>>>>> and not token match.
>>>>>>
>>>>>> On 2023/08/03 03:56:23 Caleb Rackliffe wrote:
>>>>>> > For what it's worth, I'd very much like to completely remove SASI from
>>>>>> > the codebase for 6.0. The only remaining functionality gaps at the
>>>>>> > moment are LIKE (prefix/suffix) queries and its limited tokenization
>>>>>> > capabilities, both of which already have SAI Phase 2 Jiras.
>>>>>> >
>>>>>> > On Wed, Aug 2, 2023 at 7:20 PM Jeremiah Jordan <[email protected]>
>>>>>> > wrote:
>>>>>> >
>>>>>> > > SASI just uses “=” for the tokenized equality matching, which is the
>>>>>> > > exact thing this discussion is about changing/not liking.
>>>>>> > >
>>>>>> > > > On Aug 2, 2023, at 7:18 PM, J. D. Jordan
>>>>>> > > > <[email protected]> wrote:
>>>>>> > > >
>>>>>> > > > I do not think LIKE actually applies here. LIKE is used for
>>>>>> > > > prefix, contains, or suffix searches in SASI depending on the
>>>>>> > > > index type.
>>>>>> > > >
>>>>>> > > > This is about exact matching of tokens.
>>>>>> > > >
>>>>>> > > >> On Aug 2, 2023, at 5:53 PM, Jon Haddad
>>>>>> > > >> <[email protected]> wrote:
>>>>>> > > >>
>>>>>> > > >> Certain bits of functionality also already exist on the SASI
>>>>>> > > >> side of things, but I'm not sure how much overlap there is.
>>>>>> > > >> Currently, there's a LIKE keyword that handles token matching,
>>>>>> > > >> although it seems to have some differences from the feature set
>>>>>> > > >> in SAI.
>>>>>> > > >>
>>>>>> > > >> That said, there seems to be enough of an overlap that it would
>>>>>> > > >> make sense to consider using LIKE in the same manner, doesn't
>>>>>> > > >> it? I think it would be a little odd if we have different syntax
>>>>>> > > >> for different indexes.
>>>>>> > > >>
>>>>>> > > >> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
>>>>>> > > >>
>>>>>> > > >> I think one complication here is that there seems to be a desire,
>>>>>> > > >> that I very much agree with, to expose as much of the underlying
>>>>>> > > >> flexibility of Lucene as possible. If it means we use Caleb's
>>>>>> > > >> suggestion, I'd ask that the queries that SASI and SAI both
>>>>>> > > >> support use the same syntax, even if it means there are two ways
>>>>>> > > >> of writing the same query. To use Caleb's example, this would
>>>>>> > > >> mean supporting both LIKE and the `expr` column.
>>>>>> > > >>
>>>>>> > > >> Jon
>>>>>> > > >>
>>>>>> > > >>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
>>>>>> > > >>> Here are some additional bits of prior art, if anyone finds them
>>>>>> > > useful:
>>>>>> > > >>>
>>>>>> > > >>>
>>>>>> > > >>> The Stratio Lucene Index -
>>>>>> > > >>> https://github.com/Stratio/cassandra-lucene-index#examples
>>>>>> > > >>>
>>>>>> > > >>> Stratio was the reason C* added the "expr" functionality. They
>>>>>> > > >>> embedded
>>>>>> > > >>> something similar to ElasticSearch JSON, which probably isn't my
>>>>>> > > favorite
>>>>>> > > >>> choice, but it's there.
>>>>>> > > >>>
>>>>>> > > >>>
>>>>>> > > >>> The ElasticSearch match query syntax -
>>>>>> > > >>>
>>>>>> > > >>> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
>>>>>> > > >>>
>>>>>> > > >>> Again, not my favorite. It's verbose, and probably too powerful
>>>>>> > > >>> for us.
>>>>>> > > >>>
>>>>>> > > >>>
>>>>>> > > >>> ElasticSearch's documentation for the basic Lucene query syntax -
>>>>>> > > >>>
>>>>>> > > >>> https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html#query-string-syntax
>>>>>> > > >>>
>>>>>> > > >>> One idea is to take the basic Lucene index, which it seems we
>>>>>> > > >>> already
>>>>>> > > have
>>>>>> > > >>> some support for, and feed it to "expr". This is nice for two
>>>>>> > > >>> reasons:
>>>>>> > > >>>
>>>>>> > > >>> 1.) People can just write Lucene queries if they already know
>>>>>> > > >>> how.
>>>>>> > > >>> 2.) No changes to the grammar.
>>>>>> > > >>>
>>>>>> > > >>> Lucene has distinct concepts of filtering and querying, and this
>>>>>> > > >>> is
>>>>>> > > kind of
>>>>>> > > >>> the latter. I'm not sure how, for example, we would want "expr"
>>>>>> > > >>> to
>>>>>> > > interact
>>>>>> > > >>> w/ filters on other column indexes in vanilla CQL space...
>>>>>> > > >>>
>>>>>> > > >>>
>>>>>> > > >>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie
>>>>>> > > >>>> <[email protected]>
>>>>>> > > wrote:
>>>>>> > > >>>>
>>>>>> > > >>>> `column CONTAINS term`. Contains is used by both Java and
>>>>>> > > >>>> Python for
>>>>>> > > >>>> substring searches, so at least some users will be surprised by
>>>>>> > > term-based
>>>>>> > > >>>> behavior.
>>>>>> > > >>>>
>>>>>> > > >>>> I wonder whether users are in their "programming language"
>>>>>> > > >>>> headspace
>>>>>> > > or in
>>>>>> > > >>>> their "querying a database" headspace when interacting with
>>>>>> > > >>>> CQL? i.e.
>>>>>> > > this
>>>>>> > > >>>> would only present confusion if we expected users to be
>>>>>> > > >>>> thinking in
>>>>>> > > the
>>>>>> > > >>>> idioms of their respective programming languages. If they're
>>>>>> > > >>>> thinking
>>>>>> > > in
>>>>>> > > >>>> terms of SQL, MATCHES would probably end up confusing them a bit
>>>>>> > > since it
>>>>>> > > >>>> doesn't match the general structure of the MATCH operator.
>>>>>> > > >>>>
>>>>>> > > >>>> That said, I also think CONTAINS loses something important that
>>>>>> > > >>>> you
>>>>>> > > allude
>>>>>> > > >>>> to here Jonathan:
>>>>>> > > >>>>
>>>>>> > > >>>> with corresponding query-time tokenization and analysis. This
>>>>>> > > >>>> means
>>>>>> > > that
>>>>>> > > >>>> the query term is not always a substring of the original string!
>>>>>> > > Besides
>>>>>> > > >>>> obvious transformations like lowercasing, you have things like
>>>>>> > > >>>> PhoneticFilter available as well.
>>>>>> > > >>>>
>>>>>> > > >>>> So to me, neither MATCHES nor CONTAINS are particularly great
>>>>>> > > candidates.
>>>>>> > > >>>>
>>>>>> > > >>>> So +1 to the "I don't actually hate it" sentiment on:
>>>>>> > > >>>>
>>>>>> > > >>>> `column : term`. Inspired by Lucene’s syntax
>>>>>> > > >>>>
>>>>>> > > >>>>
>>>>>> > > >>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
>>>>>> > > >>>>
>>>>>> > > >>>>
>>>>>> > > >>>> I have a strong preference not to use the name of an SQL
>>>>>> > > >>>> operator,
>>>>>> > > since
>>>>>> > > >>>> it precludes us later providing the SQL standard operator to
>>>>>> > > >>>> users.
>>>>>> > > >>>>
>>>>>> > > >>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
>>>>>> > > >>>>
>>>>>> > > >>>>
>>>>>> > > >>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña
>>>>>> > > >>>>> <[email protected]>
>>>>>> > > wrote:
>>>>>> > > >>>>
>>>>>> > > >>>>
>>>>>> > > >>>> `column = term` is definitely problematic because it creates
>>>>>> > > >>>> an ambiguity when the queried column belongs to the primary
>>>>>> > > >>>> key. For some queries we wouldn't know whether the user wants
>>>>>> > > >>>> a primary key query using regular equality or an index query
>>>>>> > > >>>> using the analyzer.
>>>>>> > > >>>>
>>>>>> > > >>>> `term_matches(column, term)` seems quite clear and hard to
>>>>>> > > misinterpret,
>>>>>> > > >>>> but it's quite long to write and its implementation will be
>>>>>> > > challenging
>>>>>> > > >>>> since we would need a bunch of special casing around
>>>>>> > > >>>> SelectStatement
>>>>>> > > and
>>>>>> > > >>>> functions.
>>>>>> > > >>>>
>>>>>> > > >>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they
>>>>>> > > >>>> seem
>>>>>> > > to
>>>>>> > > >>>> evoke different behaviours to what they would have.
>>>>>> > > >>>>
>>>>>> > > >>>> `column LIKE :term:` seems a bit redundant compared to just
>>>>>> > > >>>> using
>>>>>> > > `column
>>>>>> > > >>>> : term`, and we are still introducing a new symbol.
>>>>>> > > >>>>
>>>>>> > > >>>> I think I like `column : term` the most, because it's brief,
>>>>>> > > >>>> it's
>>>>>> > > similar
>>>>>> > > >>>> to the equivalent Lucene's syntax, and it doesn't seem to clash
>>>>>> > > >>>> with
>>>>>> > > other
>>>>>> > > >>>> different meanings that I can think of.
>>>>>> > > >>>>
>>>>>> > > >>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis
>>>>>> > > >>>>> <[email protected]>
>>>>>> > > wrote:
>>>>>> > > >>>>
>>>>>> > > >>>> Hi all,
>>>>>> > > >>>>
>>>>>> > > >>>> With phase 1 of SAI wrapping up, I’d like to start the ball
>>>>>> > > >>>> rolling on
>>>>>> > > >>>> aligning around phase 2 features.
>>>>>> > > >>>>
>>>>>> > > >>>> In particular, we need to nail down the syntax for doing
>>>>>> > > >>>> non-exact
>>>>>> > > string
>>>>>> > > >>>> matches. We have a proof of concept that includes full Lucene
>>>>>> > > analyzer and
>>>>>> > > >>>> filter functionality – just the text transformation pieces,
>>>>>> > > >>>> none of
>>>>>> > > the
>>>>>> > > >>>> storage parts – which is the gold standard in this space. For
>>>>>> > > example, the
>>>>>> > > >>>> StandardAnalyzer [1] lowercases all terms and removes stopwords
>>>>>> > > (common
>>>>>> > > >>>> words like “a”, “is”, “the” that are usually not useful to
>>>>>> > > >>>> search
>>>>>> > > >>>> against). Lucene also has classes that offer stemming, special
>>>>>> > > >>>> case
>>>>>> > > >>>> handling for email, and many languages besides English [2].
>>>>>> > > >>>>
>>>>>> > > >>>> What syntax should we use to express “rows whose analyzed
>>>>>> > > >>>> tokens match
>>>>>> > > >>>> this search term?”
>>>>>> > > >>>>
>>>>>> > > >>>> The syntax must be clear that we want to look for this term
>>>>>> > > >>>> within the
>>>>>> > > >>>> column data using the configured index with corresponding
>>>>>> > > >>>> query-time
>>>>>> > > >>>> tokenization and analysis. This means that the query term is
>>>>>> > > >>>> not
>>>>>> > > always a
>>>>>> > > >>>> substring of the original string! Besides obvious
>>>>>> > > >>>> transformations
>>>>>> > > like
>>>>>> > > >>>> lowercasing, you have things like PhoneticFilter available as
>>>>>> > > >>>> well.
>>>>>> > > >>>>
>>>>>> > > >>>> Here are my thoughts on some of the options:
>>>>>> > > >>>>
>>>>>> > > >>>> `column = term`. This is what the POC does today and it’s super
>>>>>> > > confusing
>>>>>> > > >>>> to overload = to mean something other than exact equality. I
>>>>>> > > >>>> am not
>>>>>> > > a fan.
>>>>>> > > >>>>
>>>>>> > > >>>> `column LIKE term` or `column LIKE %term%`. The closest SQL
>>>>>> > > >>>> operator,
>>>>>> > > but
>>>>>> > > >>>> neither the wildcarded nor unwildcarded syntax matches the
>>>>>> > > >>>> semantics
>>>>>> > > of
>>>>>> > > >>>> term-based search.
>>>>>> > > >>>>
>>>>>> > > >>>> `column MATCHES term`. I rather like this one, although Mike
>>>>>> > > >>>> points
>>>>>> > > out
>>>>>> > > >>>> that “match” has a meaning in the context of regular
>>>>>> > > >>>> expressions that
>>>>>> > > could
>>>>>> > > >>>> cause confusion here.
>>>>>> > > >>>>
>>>>>> > > >>>> `column CONTAINS term`. Contains is used by both Java and
>>>>>> > > >>>> Python for
>>>>>> > > >>>> substring searches, so at least some users will be surprised by
>>>>>> > > term-based
>>>>>> > > >>>> behavior.
>>>>>> > > >>>>
>>>>>> > > >>>> `term_matches(column, term)`. Postgresql FTS makes you use
>>>>>> > > >>>> functions
>>>>>> > > like
>>>>>> > > >>>> this for everything. It’s pretty clunky, and we would need to
>>>>>> > > >>>> make
>>>>>> > > the
>>>>>> > > >>>> amazingly hairy SelectStatement even hairier to handle “use a
>>>>>> > > >>>> function
>>>>>> > > >>>> result in a predicate” like this.
>>>>>> > > >>>>
>>>>>> > > >>>> `column : term`. Inspired by Lucene’s syntax. I don’t actually
>>>>>> > > >>>> hate
>>>>>> > > it.
>>>>>> > > >>>>
>>>>>> > > >>>> `column LIKE :term:`. Stick with the LIKE operator but add a new
>>>>>> > > symbol to
>>>>>> > > >>>> indicate term matching. Arguably more SQL-ish than a new bare
>>>>>> > > >>>> symbol
>>>>>> > > >>>> operator.
>>>>>> > > >>>>
>>>>>> > > >>>> [1]
>>>>>> > > >>>>
>>>>>> > > https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
>>>>>> > > >>>> [2]
>>>>>> > > >>>> https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>>>>>> > > >>>>
>>>>>> > > >>>> --
>>>>>> > > >>>> Jonathan Ellis
>>>>>> > > >>>> co-founder, http://www.datastax.com
>>>>>> > > >>>> @spyced
>>>>>> > > >>>>
>>>>>> > > >>>>
>>>>>> > > >>>>
>>>>>> > > >>>
>>>>>> > >
>>>>>> >
>>>>>
>>>>>
>>>>> --
>>>>> *Mike Adamson*
>>>>> Engineering
>>>>> +1 650 389 6000 <tel:16503896000> | datastax.com
>>>>> <https://www.datastax.com/>
>>
>>
>> --
>> Regards,
>> Atri
>> Apache Concerted