Re: [DISCUSSION] New dependencies for SAI CEP-7

2022-12-13 Thread Mike Adamson
>
> Can you talk more about why?  There are several ways to do random testing
> in-tree ATM, so wondering why we need another one


I can see one mechanism for random testing in-tree. That is the Simulator
but that seems primarily involved in the random orchestration of
operations. My apologies if I have simplified its significance. Apart from
that, I can only see different usages of Random in unit tests. I admit I
have not looked beyond this at dtests.

The random testing in SAI is more focussed on the behaviour of the
low-level index structures and flow of data to / from these. Using randomly
generated values in tests has proved invaluable in highlighting edge
conditions in the code. The library above was added only to provide us
with a rich set of random generators. I am happy to look at removing this
library if its inclusion is contentious.


On Mon, 12 Dec 2022 at 19:41, David Capwell  wrote:

> com.carrotsearch.randomizedtesting.randomizedtesting-runner 2.1.2 - test
> dependency
>
>
> Can you talk more about why?  There are several ways to do random testing
> in-tree ATM, so wondering why we need another one
>
>
> On Dec 8, 2022, at 6:51 AM, Mike Adamson  wrote:
>
> Hi,
>
> I wanted to discuss the addition of the following dependencies for CEP-7.
> The dependencies are:
>
> org.apache.lucene.lucene-core 7.5.0
> org.apache.lucene.lucene-analyzers-common 7.5.0
> com.carrotsearch.randomizedtesting.randomizedtesting-runner 2.1.2 - test
> dependency
>
> Lucene is an Apache project so is licensed under the Apache License 2.0.
> Carrotsearch is not an Apache project but is also licensed under the
> Apache License 2.0.
>
> We are also removing the dependency
> on com.github.rholder.snowball-stemmer. This library is used by SASI
> stemming filters but a later version of the same library is available in
> the lucene libraries.
>
> Does anyone have any concerns about these changes?
>
> Mike Adamson
>
>
>

-- 
Mike Adamson
Engineering

+1 650 389 6000 | datastax.com <https://www.datastax.com/>


Re: [DISCUSSION] New dependencies for SAI CEP-7

2022-12-14 Thread Mike Adamson
Thanks for your detailed response to this. I am definitely not fixed on
using carrot for this so am happy to look at a replacement. I wasn't aware
of the addition of QuickTheories or CassandraGenerators. A combination of
these could easily supply the functionality we need for the SAI testing.
The Generators could definitely replace the functionality in
SAIRandomizedTest.

I will take a look at these and see if we can work without the carrot
generators and will report back in a couple of days on this thread if I can
do this easily.

As an aside, Caleb and I have already spoken about adding support to Harry
for SAI and using this for more large-scale randomized testing of SAI.

On Tue, 13 Dec 2022 at 18:24, Josh McKenzie  wrote:

> Whatever we decide on, let's make sure we document it so newcomers on the
> project (or really anyone new to property based testing) can better
> discover those things.
>
> https://cassandra.apache.org/_/development/testing.html
>
> On Tue, Dec 13, 2022, at 1:08 PM, David Capwell wrote:
>
> Speaking to Caleb in Slack, so putting the main comments I have there here…
>
> I am not -1 on this new dependency, but more asking what we should use for
> random testing moving forward…. ATM we have the following:
>
> 1) QuickTheories - I feel like I am the only user at this point…
> 2) 1-off - many reinvent random testing for a specific class; using
> Random, ThreadLocalRandom, UUID.randomUUID(), and lang3 classes (such
> as org.apache.commons.lang3.RandomUtils)
> 3) Harry - even though the main API is for cluster testing, this is built
> on-top of random generation so could be used for low level random testing
> (just less fleshed out for this use-case)
> 4) Simulator - same as Harry, built on top of a random generator and not
> fleshed out for low level random testing
>
> Another reason I ask is that I have developed a fuzz test for Accord
> that generates random valid CQL statements to make sure we “do the right
> thing”, and I have been struggling with the questions “where do I put
> this?” and “what random do I use?”. I built this on QuickTheories because
> I have a lot of utilities for building all supported Tables and Types, so
> it is really quick to bootstrap, and every other random testing option we
> have is less fleshed out… so if we add yet another random testing library,
> what “should” we be using? Do we build on top of it to get to the same
> level QuickTheories is at
> (see org.apache.cassandra.utils.Generators,
> org.apache.cassandra.utils.CassandraGenerators,
> and org.apache.cassandra.utils.AbstractTypeGenerators)?
>
> On Dec 13, 2022, at 9:21 AM, Caleb Rackliffe 
> wrote:
>
> We need random generators no matter what for these tests, so I think what
> we need to decide is whether to continue to use Carrot or migrate those to
> QuickTheories, along the lines of what we have now in
> org.apache.cassandra.utils.Generators.
>
> When it comes to a library like this, the thing I would optimize for is
> how much it already provides (and therefore how much we need to write and
> maintain ourselves). If you look at something like NumericTypeSortingTest
> in the 18058 branch <https://github.com/maedhroz/cassandra/pull/6>, it's
> pretty compact w/ Carrot's RandomizedTest in use, but I suppose it could
> also use IntegersDSL from QT...
>
> (Not that it matters, but just for reference, we do use
> com.carrotsearch.hppc already.)
>

Re: [DISCUSSION] New dependencies for SAI CEP-7

2022-12-14 Thread Mike Adamson
I have had a look at whether we could use QuickTheories in our
randomized testing and have come to the following conclusions:

Pros:
1) It has a very rich set of random generators out of the box.
2) It has a very powerful mechanism for generating customised randomized
datasets.
3) It is very pluggable within the constraints of its framework.

Cons:
1) The framework has to be used in a very specific way in order for it to
work. It does not allow for subsets of the framework to be used in
isolation.
2) The code hasn't been touched for 3 years. This is an observation as much
as anything but it does not appear to be being maintained at the moment.

The carrotsearch generators use a seeded Random to generate their values so
are also repeatable. The library also provides a very rich set of random
generators that can be used in isolation from any other part of the
framework. The project is also being actively maintained.

As such I would prefer to keep using the carrotsearch generators. I have
made a change to the SAI testing that removes our usage of RandomizedTest
from the library and have stuck to just using the lower level random
generators. We already had a Randomization class in our test framework that
provided a lot of the RandomizedTest functionality (primarily the reporting
on failed tests of the random seed and the reuse of seeds) so using both
made no sense.
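
To illustrate the seed handling described above (repeatable values from a
seeded generator, reporting the seed when a test fails, and reusing a seed to
replay a run), here is a minimal language-neutral sketch in Python. It is not
the SAI Randomization class or the carrotsearch API; run_randomized and
sorted_round_trip are hypothetical names invented for this example.

```python
import random


def run_randomized(test_body, seed=None):
    """Run test_body(rng) with a seeded generator; report the seed on failure."""
    # Pick a fresh seed unless the caller wants to replay a previous run.
    if seed is None:
        seed = random.SystemRandom().getrandbits(64)
    rng = random.Random(seed)
    try:
        test_body(rng)
    except AssertionError as err:
        # Surface the seed so the failing run can be reproduced exactly.
        raise AssertionError(
            f"randomized test failed with seed {seed}: {err}"
        ) from err
    return seed


def sorted_round_trip(rng):
    # Hypothetical property: sorting random ints gives a non-decreasing list
    # of the same length.
    size = rng.randint(0, 100)
    values = [rng.randint(-2**31, 2**31 - 1) for _ in range(size)]
    ordered = sorted(values)
    assert len(ordered) == len(values)
    assert all(a <= b for a, b in zip(ordered, ordered[1:]))
```

Calling run_randomized(sorted_round_trip) returns the seed it used, and
passing that value back as seed= replays the same sequence of values.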


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-13 Thread Mike Adamson
is valuable enough to cut a major based on that, we do
>>> it.
>>>
>>> ~Josh
>>>
>>> On Fri, Mar 3, 2023, at 7:37 PM, German Eichberger via dev wrote:
>>>
>>> Hi,
>>>
>>> We shouldn't release just for release's sake. Are there enough new
>>> features and are they working well enough (quality!)?
>>>
>>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I
>>> would advocate to delay until this has sufficient quality to be in
>>> production.
>>>
>>> Just because something is released doesn't mean anyone is gonna use it.
>>> To add some operator perspective: Every time there is a new release we need
>>> to decide
>>> 1) are we supporting it
>>> 2) which other release can we deprecate
>>>
>>> and potentially migrate people - which is also a tough sell if there are
>>> no significant features and/or breaking changes.  So from my perspective
>>> less frequent releases are better - after all we haven't gotten around to
>>> support 4.1 🙂
>>>
>>> The 5.0 release is also coupled with deprecating 3.11, which is what a
>>> significant number of people are using - given 4.1 took longer, I am not
>>> sure how many people are assuming that 5.0 will be delayed and haven't made
>>> plans (OpenJDK support for 8 is longer than for Java 17 🙂). So being a
>>> bit more deliberate with releasing 5.0 and having a longer beta phase are
>>> all things we should consider.
>>>
>>> My 2cts,
>>> German
>>>
>>> *From:* Benedict 
>>> *Sent:* Wednesday, March 1, 2023 5:59 AM
>>> *To:* dev@cassandra.apache.org 
>>> *Subject:* [EXTERNAL] Re: [DISCUSS] Next release date
>>>
>>>
>>> It doesn’t look like we agreed to a policy of annual branch dates, only
>>> annual releases and that we would schedule this for 4.1 based on 4.0’s
>>> branch date. Given this was the reasoning proposed I can see why folk would
>>> expect this would happen for the next release. I don’t think there was a
>>> strong enough commitment here to be bound by it, if we think different
>>> maths would work better.
>>>
>>> I recall the goal for an annual cadence was to ensure we don’t have
>>> lengthy periods between releases like 3.x to 4.0, and to try to reduce the
>>> pressure certain contributors might feel to hit a specific release with a
>>> given feature.
>>>
>>> I think it’s better to revisit these underlying reasons and check how
>>> they apply than to pick a mechanism and stick to it too closely.
>>>
>>> The last release was quite recent, so we aren’t at risk of slow releases
>>> here. Similarly, there are some features that the *project* would probably
>>> benefit from landing prior to release, if this doesn’t push release back
>>> too far.
>>>
>>>
>>>
>>>
>>>
>>> On 1 Mar 2023, at 13:38, Mick Semb Wever  wrote:
>>>
>>> 
>>>
>>> My thoughts don't touch on CEPs inflight.
>>>
>>>
>>>
>>>
>>> For the sake of broadening the discussion, additional questions I think
>>> worthwhile to raise are…
>>>
>>> 1. What third parties, or other initiatives, are invested and/or
>>> working against the May deadline? and what are their views on changing it?
>>>   1a. If we push branching back to September, how confident are we that
>>> we'll get to GA before the December Summit?
>>> 2. What CEPs look like not landing by May that we consider a must-have
>>> this year?
>>>   2a. Is it just tail-end commits in those CEPs that won't make it? Can
>>> these land (with or without a waiver) during the alpha phase?
>>>   2b. If the final components to specified CEPs are not
>>> approved/appropriate to land during alpha, would it be better if the
>>> project commits to a one-off half-year release later in the year?
>>>
>>>
>>>



Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-16 Thread Mike Adamson
Sorry, I realised that I hadn't included any completion date for CEP-7. At
the current time we are looking at completion mid to end of April.

Mike

On Mon, 13 Mar 2023 at 11:34, Mike Adamson  wrote:

> CEP-7 Storage Attached Index is in review with ~430 files and ~70k LOC.
> The bulk of the project is in 3 main patches. The first patch (in-memory
> index and query path) is merged to the feature branch CASSANDRA-16052 and
> the second patch (on-disk write and literal / string index) is in review.
>
> Mike
>
> On Thu, 9 Mar 2023 at 09:13, Branimir Lambov  wrote:
>
>> CEPs 25 (trie-indexed sstables) and 26 (unified compaction strategy)
>> should both be ready for review by mid-April.
>>
>> Both are around 10k LOC, fairly isolated, and in need of a committer to
>> review.
>>
>> Regards,
>> Branimir
>>
>> On Mon, Mar 6, 2023 at 11:25 AM Benjamin Lerer  wrote:
>>
>>> Sorry, I realized that when I started the discussion I probably did not
>>> frame it enough as I see that it is now going into different directions.
>>> The concerns I am seeing are:
>>> 1) Too little time between releases is inefficient from both a
>>> development perspective and a user perspective: from a development
>>> point of view because we lack the time to deliver some features, and from
>>> a user perspective because they cannot keep up with the upgrades.
>>> 2) Some features are so anticipated (Accord being the one mentioned)
>>> that people would prefer to delay the release to make sure that it is
>>> available as soon as possible.
>>> 3) We do not know how long we need to go from the freeze to GA. We hope
>>> for 2 months but our last experience was 6 months. So delaying the release
>>> could mean not releasing this year.
>>> 4) For people doing marketing it is really hard to promote a product
>>> when you do not know when the release will come and what features might be
>>> there.
>>>
>>> All those concerns are probably even made worse by the fact that we do
>>> not have a clear visibility on where we are.
>>>
>>> Should we clarify that part first by getting an idea of the status of
>>> the different CEPs and other big pieces of work? From there we could agree
>>> on some timeline for the freeze. We could then discuss how to make the
>>> time from freeze to GA predictable.
>>>
>>>
>>>
>>> On Sat, 4 Mar 2023 at 18:14, Josh McKenzie wrote:
>>>
>>>> (for convenience's sake, I'm referring to both Major and Minor semver
>>>> releases as "major" in this email)
>>>>
>>>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I
>>>> would advocate to delay until this has sufficient quality to be in
>>>> production.
>>>>
>>>> This approach can be pretty unpredictable in this domain; often
>>>> unforeseen things come up in implementation that can give you a long tail
>>>> on something being production ready. For the record - I don't intend to
>>>> single Accord out *at all* on this front, quite the opposite given how
>>>> much rigor's gone into the design and implementation. I'm just thinking
>>>> from my personal experience: everything I've worked on, overseen, or
>>>> followed closely on this codebase always has a few tricks up its sleeve
>>>> along the way to having edge-cases stabilized.
>>>>
>>>> Much like on some other recent topics, I think there's a nuanced middle
>>>> ground where we take things on a case-by-case basis. Some factors that have
>>>> come up in this thread that resonated with me:
>>>>
>>>> For a given potential release date 'X':
>>>> 1. How long has it been since the last release?
>>>> 2. How long do we expect qualification to take from a "freeze" (i.e. no
>>>> new improvement or features, branch) point?
>>>> 3. What body of merged production ready work is available?
>>>> 4. What body of new work do we have high confidence will be ready
>>>> within Y time?
>>>>
>>>> I think it's worth defining a loose "minimum bound and upper bound" on
>>>> release cycles we want to try and stick with barring extenuating
>>>> circumstances. For instance: try not to release sooner than maybe 10 months
>>>> out from a prior major, and try not to release later than 18 months out
>>>> from 

[DISCUSS] Introduce DATABASE as an alternative to KEYSPACE

2023-04-04 Thread Mike Adamson
Hi,

I'd like to propose that we add DATABASE to the CQL grammar as an
alternative to KEYSPACE.

Background: While TABLE was introduced as an alternative to COLUMNFAMILY
in the grammar, we have kept KEYSPACE as the container name for a group of
tables. Nearly all traditional SQL databases use DATABASE as the container
name for a group of tables, so it would make sense for Cassandra to adopt
this naming as well.

KEYSPACE would be kept in the grammar but we would update some logging and
documentation to encourage use of the new name.
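
As a rough illustration of what "an alternative to KEYSPACE" could mean in
practice, the toy sketch below (in Python, purely for brevity) treats
DATABASE as a synonym for KEYSPACE in the leading keywords of a DDL
statement. This is an assumption made for illustration only: the real change
would be made in the CQL grammar itself, and normalize_ddl and the alias
table are hypothetical names.

```python
import re

# Hypothetical alias table: DATABASE is accepted wherever KEYSPACE is.
_ALIASES = {"DATABASE": "KEYSPACE"}

# Matches the verb and the object keyword at the start of a DDL statement.
_DDL = re.compile(r"^\s*(CREATE|ALTER|DROP)\s+(\w+)", re.IGNORECASE)


def normalize_ddl(statement):
    """Rewrite the object keyword of a DDL statement to its canonical form."""
    match = _DDL.match(statement)
    if not match:
        return statement  # not a CREATE / ALTER / DROP statement
    canonical = _ALIASES.get(match.group(2).upper())
    if canonical is None:
        return statement  # keyword has no alias, e.g. TABLE
    start, end = match.span(2)
    return statement[:start] + canonical + statement[end:]
```

Only the keyword directly after CREATE, ALTER or DROP is rewritten, so a
table that happens to be named "database" is left untouched.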

Mike Adamson



Re: [DISCUSS] Introduce DATABASE as an alternative to KEYSPACE

2023-04-06 Thread Mike Adamson
My apologies. I started this discussion off the back of a usability
discussion around new user accessibility to Cassandra and the premise that
there is an initial steep learning curve for new users, including those
who have worked for a long time in the traditional DBMS field.

Given that motivation, TABLEGROUP doesn't sit well because user types /
functions / indexes etc. are not strictly tables, and it is also yet
another Cassandra-only term.

NAMESPACE could work but its different usage in other systems could be
just as confusing to new users.

And, I certainly don't think having multiple names for the same thing just
to satisfy different parties is a good idea at all.

I'm quite happy to leave things as they are if that is the consensus.

On Thu, 6 Apr 2023 at 14:16, Josh McKenzie  wrote:

> KEYSPACE is fine. If we want to introduce a standard nomenclature like
> DATABASE that’s also fine. Inventing brand new ones is not fine, there’s no
> benefit.
>
> I'm with Benedict in principle, with Aleksey in practice; I think KEYSPACE
> and SCHEMA are actually fine enough.
>
> If and when we get to any kind of multi-tenancy, having a more
> metaphorical abstraction that users are familiar with like these becomes
> more valuable; it's pretty clear that things in different keyspaces,
> different databases, or even different schemas could have different access
> rules, resourcing, etc from one another.
>
> While the off-the-cuff logical TABLEGROUP thing is a *literal* statement
> about what the thing is, it'd be another unique term to us; we have enough
> things in our system where we've charted our own path. My personal .02 is
> we don't need to go adding more. :)
>
> On Thu, Apr 6, 2023, at 8:54 AM, Mick Semb Wever wrote:
>
>
> … but that should be a different discussion about how we evolve config.
>
>
>
> I disagree. Nomenclature being difficult can benefit from holistic and
> forward thinking.
> Sure you can label this off-topic if you like, but I value our discuss
> threads being collaborative in an open-mode. Sometimes the best idea is on
> the tail end of a sequence of bad and/or unpopular ideas.
>
>
>
>
>
>



Re: [POLL] Vector type for ML

2023-05-04 Thread Mike Adamson
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
> think VECTOR should be used to simply imply non-null, as this would be very
> unintuitive. More logical would be NONNULL, if this is the only condition
> being applied. Alternatively for arrays we could default to NONNULL and
> later introduce NULLABLE if we want to permit nulls.
>

I have a small issue with not having a specific VECTOR tag on the
data type. The driver behind adding this datatype is the HNSW index that is
being added to consume this data. If we have a generic array datatype, what
is the expectation going to be for users who create an index on it? The
HNSW index will support only floats initially, so we would have to reject
any non-float arrays if an attempt was made to create an HNSW index on them.
While there is no problem with doing this, there would be a problem if, in
the future, we allow indexing of arrays in the same way that we index
collections. In that case we would need to have the user select what
type of index they want at creation time.

Can I add another proposal: that we allow a VECTOR or DENSE (this is a
well-known term in the ML space) keyword that could be used when the array is
going to be used for ML workloads? This would be optional and would
function similarly to FROZEN in that it would limit the functionality of
the array to ML usage.
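
To make the concern concrete, the check that a float-only index would have
to perform on a generic array type could look roughly like the sketch below
(Python, for brevity). ArrayType and validate_hnsw_index are hypothetical
names made up for this example and do not correspond to any existing
Cassandra class or method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayType:
    """Hypothetical fixed-length array type, e.g. float[3]."""
    element: str    # element type name, e.g. "float" or "int"
    dimension: int  # fixed number of elements


def validate_hnsw_index(column_type):
    """Reject column types that a float-only index could not consume."""
    if not isinstance(column_type, ArrayType):
        raise ValueError("index requires a fixed-length array column")
    if column_type.element != "float":
        raise ValueError(
            "index supports only float arrays, got "
            f"{column_type.element}[{column_type.dimension}]"
        )
    if column_type.dimension <= 0:
        raise ValueError("array dimension must be positive")
    return column_type
```

With a generic array type this validation runs at index creation time; a
dedicated keyword on the type would move the same restriction into the type
definition instead.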

On Thu, 4 May 2023 at 09:45, Benedict  wrote:

> Hurrah for initial agreement.
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
> think VECTOR should be used to simply imply non-null, as this would be very
> unintuitive. More logical would be NONNULL, if this is the only condition
> being applied. Alternatively for arrays we could default to NONNULL and
> later introduce NULLABLE if we want to permit nulls.
>
> If the word vector is to be used it makes more sense to make it look like
> a list, so VECTOR as here the word VECTOR is clearly not
> redundant.
>
> So, I vote:
>
> 1) (NON NULL) FLOAT[N]
> 2) FLOAT[N]   (Non null by default)
> 3) VECTOR
>
>
>
> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>
> 
>
>> Did we agree on a CQL syntax?
>>
>> I don’t believe there has been a poll on CQL syntax… my understanding
>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>> I believe we are waiting for majority rule on this?
>>
>
>
> Re-reading that thread, IIUC the valid choices remaining are…
>
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR
> 4. VECTOR[n]
> 5. ARRAY
> 6. NON-NULL FROZEN
>
>
> Yes I'm putting my preference (1) first ;) because (banging on) if the
> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
> keyword is: for general cql users; just meaning "non-null and frozen",
> these gel best together.
>
> Options (5) and (6) are for those that feel we can and should provide this
> type without introducing the vector keyword.
>
>
>
>



Re: [POLL] Vector type for ML

2023-05-04 Thread Mike Adamson
That's a fair comment. In this case I would be happy with any of your
suggestions although I would prefer that the datatype did not support
nulls.

On Thu, 4 May 2023 at 11:55, Benedict  wrote:

> I would expect that the type of index would be specified anyway?
>
> I don’t think it’s good API design to have the field define the index you
> create - only to shape what is permitted.
>
> A HNSW index is very specific and should be asked for specifically, not
> implicitly, IMO.
>

Re: [POLL] Vector type for ML

2023-05-05 Thread Mike Adamson
 just means “sequential layout”, which is what our frozen array/list already
>>>> are… but since the target user is coming from a ML background, this
>>>> shouldn’t offer much confusion.  DENSE just means FROZEN in Cassandra, with
>>>> NON NULL elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just
>>>> acts as syntax sugar for frozen
>>>>
>>>>
>>>> On May 4, 2023, at 4:13 AM, Brandon Williams  wrote:
>>>>
>>>> 1. VECTOR
>>>> 2. VECTOR FLOAT[n]
>>>> 3. FLOAT[N]   (Non null by default)
>>>>
>>>> Redundant or not, I think having the VECTOR keyword helps signify what
>>>> the app is generally about and helps get buy-in from ML stakeholders.
>>>>
>>>> On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:
>>>>
>>>>
>>>> Hurrah for initial agreement.
>>>>
>>>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>>>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>>>> think VECTOR should be used to simply imply non-null, as this would be very
>>>> unintuitive. More logical would be NONNULL, if this is the only condition
>>>> being applied. Alternatively for arrays we could default to NONNULL and
>>>> later introduce NULLABLE if we want to permit nulls.
>>>>
>>>> If the word vector is to be used it makes more sense to make it look
>>>> like a list, so VECTOR as here the word VECTOR is clearly not
>>>> redundant.
>>>>
>>>> So, I vote:
>>>>
>>>> 1) (NON NULL) FLOAT[N]
>>>> 2) FLOAT[N]   (Non null by default)
>>>> 3) VECTOR
>>>>
>>>>
>>>>
>>>> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>>>>
>>>> 
>>>>
>>>>
>>>> Did we agree on a CQL syntax?
>>>>
>>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>>>> believe we are waiting for majority rule on this?
>>>>
>>>>
>>>>
>>>>
>>>> Re-reading that thread, IIUC the valid choices remaining are…
>>>>
>>>> 1. VECTOR FLOAT[n]
>>>> 2. FLOAT VECTOR[n]
>>>> 3. VECTOR
>>>> 4. VECTOR[n]
>>>> 5. ARRAY
>>>> 6. NON-NULL FROZEN
>>>>
>>>>
>>>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>>>> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
>>>> keyword is: for general cql users; just meaning "non-null and frozen",
>>>> these gel best together.
>>>>
>>>> Options (5) and (6) are for those that feel we can and should provide
>>>> this type without introducing the vector keyword.
>>>>
>>>>
>>>>
>>>>
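
Whatever the keyword ends up being, the options above share the same three properties: a fixed dimension, non-null elements, and a value that is written as a whole (frozen). A minimal sketch of those checks follows, assuming a hypothetical helper named validate_dense_vector; it illustrates the semantics under discussion and is not Cassandra code.

```python
from typing import Optional, Sequence, Tuple


def validate_dense_vector(values: Sequence[Optional[float]],
                          dimension: int) -> Tuple[float, ...]:
    """Return an immutable copy of `values` if it is a valid dense vector.

    Hypothetical helper: it mirrors the "fixed dimension, non-null
    elements, frozen" reading of the proposed type.
    """
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    if any(v is None for v in values):
        raise ValueError("dense vector elements must be non-null")
    # A tuple stands in for "frozen": the value is replaced as a whole,
    # never updated element by element.
    return tuple(float(v) for v in values)
```

Under this reading, a "sparse" value in the ML sense would still pass the check as long as the missing entries are zeros rather than nulls.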

-- 
[image: DataStax Logo Square] <https://www.datastax.com/> *Mike Adamson*
Engineering

+1 650 389 6000 <16503896000> | datastax.com <https://www.datastax.com/>
Find DataStax Online: [image: LinkedIn Logo]
<https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_datastax&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=uHzE4WhPViSF0rsjSxKhfwGDU1Bo7USObSc_aIcgelo&s=akx0E6l2bnTjOvA-YxtonbW0M4b6bNg4nRwmcHNDo4Q&e=>
   [image: Facebook Logo]
<https://urldefense.proofpoint.com/v2/url?u=https-3A__www.facebook.com_datastax&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=uHzE4WhPViSF0rsjSxKhfwGDU1Bo7USObSc_aIcgelo&s=ncMlB41-6hHuqx-EhnM83-KVtjMegQ9c2l2zDzHAxiU&e=>
   [image: Twitter Logo] <https://twitter.com/DataStax>   [image: RSS Feed]
<https://www.datastax.com/blog/rss.xml>   [image: Github Logo]
<https://github.com/datastax>


Re: [POLL] Vector type for ML

2023-05-05 Thread Mike Adamson
>>>>> If getting into the "built-in syntactic sugar mapping for communities
>>>>> and specific use-cases" is something we're willing to consider.
>>>>>
>>>>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>>>>
>>>>> I think we are still discussing implementation here when I'm talking
>>>>> about developer experience. I want developers to adopt this quickly,
>>>>> easily and be successful. Vector search is already a thing. People use it
>>>>> every day. A successful outcome, in my view, is developers picking up this
>>>>> feature without reading a manual. (Because they don't anyway and get in
>>>>> trouble) I did some more extensive research about what other DBs are using
>>>>> for syntax. The consensus is some variety of 'VECTOR', 'DENSE' and
>>>>> 'SPARSE'
>>>>>
>>>>> Pinecone[1] - dense_vector, sparse_vector
>>>>> Elastic[2]: dense_vector
>>>>> Milvus[3]: float_vector, binary_vector
>>>>> pgvector[4]: vector
>>>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>>>>
>>>>> Based on that I'm advocating a similar syntax:
>>>>>
>>>>> - DENSE VECTOR
>>>>> or
>>>>> - VECTOR
>>>>>
>>>>> [1] https://docs.pinecone.io/docs/hybrid-search
>>>>> [2] https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>>>>> [3] https://milvus.io/docs/create_collection.md
>>>>> [4] https://github.com/pgvector/pgvector
>>>>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>>>>
>>>>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson 
>>>>> wrote:
>>>>>
>>>>> Then we can have the indexing apparatus only accept *frozen* for
>>>>> the HNSW case.
>>>>>
>>>>> I'm inclined to agree with Benedict that the index will need to be
>>>>> specifically selected by option rather than inferred based on type. As such
>>>>> there is no real reason for the *frozen* requirement on the type. The
>>>>> HNSW index can be built just as easily from a non-frozen array.
>>>>>
>>>>> I am in favour of enforcing non-null on the elements of an array by
>>>>> default. I would prefer that allowing nulls in the array would be a later
>>>>> addition if and when a use case arose for it.
>>>>>
>>>>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe 
>>>>> wrote:
>>>>>
>>>>> Even in the ML case, sparse can just mean zeros rather than nulls, and
>>>>> they should compress similarly anyway.
>>>>>
>>>>> If we really want null values, I'd rather leave that in collections
>>>>> space.
>>>>>
>>>>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe <
>>>>> calebrackli...@gmail.com> wrote:
>>>>>
>>>>> I actually still prefer *type[dimension]*, because I think I
>>>>> intuitively read this as a primitive (meaning no null elements) array.
>>>>> Then we can have the indexing apparatus only accept *frozen* for
>>>>> the HNSW case.
>>>>>
>>>>> If that isn't intuitive to anyone else, I don't really have a strong
>>>>> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
>>>>> should indicate single vs. mul

Re: [VOTE] CEP-29 CQL NOT Operator

2023-05-12 Thread Mike Adamson
+1 (nb)

On Fri, 12 May 2023 at 14:05, Doug Rohrer  wrote:

> +1 (nb)
>
> > On May 8, 2023, at 4:52 AM, Piotr Kołaczkowski 
> wrote:
> >
> > Let's vote.
> >
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-29%3A+CQL+NOT+operator
> >
> > Piotr Kołaczkowski
> > e. pkola...@datastax.com
> > w. www.datastax.com
>
>



Re: [DISCUSS] The future of CREATE INDEX

2023-05-15 Thread Mike Adamson
>
> [POLL] Centralize existing syntax or create new syntax?
>
> 1.) CREATE INDEX ... USING  WITH OPTIONS...
> 2.) CREATE LOCAL INDEX ... USING ... WITH OPTIONS...  (same as 1, but
> adds LOCAL keyword for clarity and separation from future GLOBAL indexes)
>

1.) CREATE INDEX ... USING  WITH OPTIONS...

[POLL] Should there be a default? (YES/NO)
>

Yes

[POLL] What to do with the default?
>
> 1.) Allow a default, and switch it to SAI (no configurables)
> 2.) Allow a default, and stay w/ the legacy 2i (no configurables)
> 3.) YAML config to override default index (legacy 2i remains the default)
> 4.) YAML config/guardrail to require index type selection (not required by
> default)
>

3.) YAML config to override default index (legacy 2i remains the default)



On Mon, 15 May 2023 at 08:54, Mick Semb Wever  wrote:

>
>
> [POLL] Centralize existing syntax or create new syntax?
>>
>> 1.) CREATE INDEX ... USING  WITH OPTIONS...
>> 2.) CREATE LOCAL INDEX ... USING ... WITH OPTIONS...  (same as 1, but
>> adds LOCAL keyword for clarity and separation from future GLOBAL indexes)
>>
>
>
> (1) CREATE INDEX …
>
>
>
>> [POLL] Should there be a default? (YES/NO)
>>
>
>
> Yes (but see below).
>
>
>
>> [POLL] What to do with the default?
>>
>> 1.) Allow a default, and switch it to SAI (no configurables)
>> 2.) Allow a default, and stay w/ the legacy 2i (no configurables)
>> 3.) YAML config to override default index (legacy 2i remains the default)
>> 4.) YAML config/guardrail to require index type selection (not required
>> by default)
>>
>
>
> (4) YAML config. Commented out default of 2i.
>
> I agree that the default cannot change in 5.0, but our existing default of
> 2i can be commented out.
>
> For the user this gives them the same feedback, and puts the same
> requirement to edit one line of yaml, as when we disabled MVs and SASI in
> 4.0. No one has complained about either of these, which is a clear signal
> folk understood how to get their existing DDLs to work from 3.x to 4.x
>
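
The difference between options (3) and (4) comes down to what happens when a CREATE INDEX statement has no USING clause. A minimal sketch, assuming a hypothetical resolve_index_type function and the short names "2i" and "sai" for the two implementations; none of these names come from the Cassandra codebase or its yaml.

```python
from typing import Optional


def resolve_index_type(using: Optional[str],
                       yaml_default: Optional[str]) -> str:
    """Pick the index implementation for a CREATE INDEX statement.

    using:        value of the USING clause, or None if it was omitted.
    yaml_default: configured default, or None if the yaml line is
                  commented out (the option (4) behaviour).
    Both parameter names are hypothetical.
    """
    if using is not None:
        # An explicit USING clause always wins.
        return using.lower()
    if yaml_default is None:
        # Option (4): no default configured, so the statement must say.
        raise ValueError("no default index type configured; add a USING clause")
    # Option (3): fall back to the configured default (legacy 2i unless changed).
    return yaml_default.lower()
```

With yaml_default set to "2i" this behaves like option (3); passing None models the commented-out default described for option (4), where existing DDL fails until the operator edits one line.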




Re: [VOTE] CEP-30 ANN Vector Search

2023-05-26 Thread Mike Adamson
+1 (nb)

On Fri, 26 May 2023 at 12:50, Stefania Alborghetti 
wrote:

> +1
>
> On Fri, May 26, 2023 at 7:31 AM Aleksey Yeshchenko 
> wrote:
>
>> +1
>>
>> On 26 May 2023, at 07:19, Berenguer Blasi 
>> wrote:
>>
>> +1
>> On 26/5/23 6:07, guo Maxwell wrote:
>>
>> +1
>>
>> On Fri, 26 May 2023 at 11:08 AM, Dinesh Joshi wrote:
>>
>>> +1
>>>
>>>
>>> On May 25, 2023, at 8:45 AM, Jonathan Ellis  wrote:
>>>
>>> 
>>>
>>> Let's make this official.
>>>
>>> CEP:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>>>
>>> POC that demonstrates all the big rocks, including distributed queries:
>>> https://github.com/datastax/cassandra/tree/cep-vsearch
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>>>
>>> --
>> you are the apple of my eye !
>>
>>
>>



Re: [DISCUSS] When to run CheckStyle and other verifications

2023-06-26 Thread Mike Adamson
While I like the idea of this because of the added time these checks take, I
was under the impression that checkstyle (at least) can be disabled with a
flag.

If we did do this, would it make sense to have a "release" or "commit"
target (or some other name) that ran a full build with all checks that can
be used prior to pushing changes?

On Mon, 26 Jun 2023 at 08:35, Berenguer Blasi 
wrote:

> I would prefer something that is totally transparent to me and does not add
> one more step I have to remember. Just to push/run CI to find out I missed it
> and rinse and repeat... With the recent fix to checkstyle I am happy as things
> stand atm. My 2cts
> On 26/6/23 8:43, Jacek Lewandowski wrote:
>
> Hi,
>
>
> The context is that we currently have 3 checks in the build:
>
> - Checkstyle,
>
> - Eclipse-Warnings,
>
> - RAT
>
>
> CheckStyle and RAT are executed with almost every target we run: build,
> jar, test, test-some, testclasslist, etc.; on the other hand,
> Eclipse-Warnings is executed automatically only with the artifacts target.
>
>
> Checkstyle currently uses some caching, so subsequent reruns without
> cleaning the project validate only the modified files.
>
>
> Both CI systems, Jenkins and CircleCI, force all checks to run.
>
>
> I want to discuss whether you are ok with extracting all checks to their
> distinct target and not running it automatically with the targets which
> devs usually run locally. In particular:
>
>
>
>- "build", "jar", and all "test" targets would not trigger CheckStyle,
>RAT or Eclipse-Warnings
>- A new target "check" would trigger all CheckStyle, RAT, and
>Eclipse-Warnings
>- The new "check" target would be run along with the "artifacts"
>target on Jenkins-CI, and as a separate build step in CircleCI
>
>
> The rationale for that change is:
>
>- Running all the checks together would be more consistent, but
>running all of them automatically with build and test targets could waste
>time when we develop something locally, frequently rebuilding and running
>tests.
>- On the other hand, it would be more consistent if the build did what
>we want - as a dev, when prototyping, I don't want to be forced to run
>analysis (and potentially fix issues) whenever I want to build a project or
>just run a single test.
>- There are ways to avoid running checks automatically by specifying
>some build properties. Though, the discussion is about the default behavior
>- on the flip side, if one wants to run the checks along with the specified
>target, they could add the "check" target to the command line.
>
>
> The rationale for keeping the checks running automatically with every
> target is to reduce the likelihood of not running the checks locally before
> pushing the branch and being surprised by failing CI soon after starting
> the build.
>
>
> That could be fixed by running checks in a pre-push Git hook. There are
> some benefits of this compared to the current behavior:
>
>- the checks would be run automatically only once
>- they would be triggered even for those devs who do everything in IDE
>and do not even touch Ant commands directly
>
>
> Checks can take time; to optimize that, they could be enforced locally to
> verify only the modified files in the same way as we currently determine
> the tests to be repeated for CircleCI.
>
> Thanks
> - - -- --- -  -
> Jacek Lewandowski
>
>
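
The idea of a pre-push hook that verifies only the modified files can be sketched as a small filter over the list of changed paths. This is an illustration under assumptions: the function name files_to_check and the choice of the .java suffix are hypothetical, and how the resulting list is handed to the checks is left open.

```python
from typing import Iterable, List, Sequence


def files_to_check(changed_paths: Iterable[str],
                   suffixes: Sequence[str] = (".java",)) -> List[str]:
    """Return the de-duplicated, sorted subset of paths worth checking.

    changed_paths: paths modified on the branch, one string per file.
    suffixes:      file extensions the checks apply to (assumed here).
    """
    wanted = tuple(suffixes)
    return sorted({path for path in changed_paths if path.endswith(wanted)})
```

In a hook, changed_paths could be the output lines of a command such as `git diff --name-only` against the base branch; an empty result would let the push proceed without running any checks.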



Re: Tokenization and SAI query syntax

2023-08-07 Thread Mike Adamson
ms quite clear and hard to
> > > >>>> misinterpret, but it's quite long to write and its implementation
> > > >>>> will be challenging since we would need a bunch of special casing
> > > >>>> around SelectStatement and functions.
> > > >>>>
> > > >>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem
> > > >>>> to evoke different behaviours to what they would have.
> > > >>>>
> > > >>>> `column LIKE :term:` seems a bit redundant compared to just using
> > > >>>> `column : term`, and we are still introducing a new symbol.
> > > >>>>
> > > >>>> I think I like `column : term` the most, because it's brief, it's
> > > >>>> similar to the equivalent Lucene's syntax, and it doesn't seem to
> > > >>>> clash with other different meanings that I can think of.
> > > >>>>
> > > >>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis wrote:
> > > >>>>
> > > >>>> Hi all,
> > > >>>>
> > > >>>> With phase 1 of SAI wrapping up, I’d like to start the ball rolling
> > > >>>> on aligning around phase 2 features.
> > > >>>>
> > > >>>> In particular, we need to nail down the syntax for doing non-exact
> > > >>>> string matches.  We have a proof of concept that includes full Lucene
> > > >>>> analyzer and filter functionality – just the text transformation
> > > >>>> pieces, none of the storage parts – which is the gold standard in
> > > >>>> this space.  For example, the StandardAnalyzer [1] lowercases all
> > > >>>> terms and removes stopwords (common words like “a”, “is”, “the” that
> > > >>>> are usually not useful to search against).  Lucene also has classes
> > > >>>> that offer stemming, special case handling for email, and many
> > > >>>> languages besides English [2].
> > > >>>>
> > > >>>> What syntax should we use to express “rows whose analyzed tokens
> > > >>>> match this search term?”
> > > >>>>
> > > >>>> The syntax must be clear that we want to look for this term within
> > > >>>> the column data using the configured index with corresponding
> > > >>>> query-time tokenization and analysis.  This means that the query term
> > > >>>> is not always a substring of the original string!  Besides obvious
> > > >>>> transformations like lowercasing, you have things like PhoneticFilter
> > > >>>> available as well.
> > > >>>>
> > > >>>> Here are my thoughts on some of the options:
> > > >>>>
> > > >>>> `column = term`.  This is what the POC does today and it’s super
> > > >>>> confusing to overload = to mean something other than exact equality.
> > > >>>> I am not a fan.
> > > >>>>
> > > >>>> `column LIKE term` or `column LIKE %term%`. The closest SQL operator,
> > > >>>> but neither the wildcarded nor unwildcarded syntax matches the
> > > >>>> semantics of term-based search.
> > > >>>>
> > > >>>> `column MATCHES term`. I rather like this one, although Mike points
> > > >>>> out that “match” has a meaning in the context of regular expressions
> > > >>>> that could cause confusion here.
> > > >>>>
> > > >>>> `column CONTAINS term`. Contains is used by both Java and Python for
> > > >>>> substring searches, so at least some users will be surprised by
> > > >>>> term-based behavior.
> > > >>>>
> > > >>>> `term_matches(column, term)`. Postgresql FTS makes you use functions
> > > >>>> like this for everything.  It’s pretty clunky, and we would need to
> > > >>>> make the amazingly hairy SelectStatement even hairier to handle “use
> > > >>>> a function result in a predicate” like this.
> > > >>>>
> > > >>>> `column : term`. Inspired by Lucene’s syntax.  I don’t actually hate
> > > >>>> it.
> > > >>>>
> > > >>>> `column LIKE :term:`. Stick with the LIKE operator but add a new
> > > >>>> symbol to indicate term matching.  Arguably more SQL-ish than a new
> > > >>>> bare symbol operator.
> > > >>>>
> > > >>>> [1] https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
> > > >>>> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html
> > > >>>>
> > > >>>> --
> > > >>>> Jonathan Ellis
> > > >>>> co-founder, http://www.datastax.com
> > > >>>> @spyced
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>
> > >
> >
>
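
The point that an analyzed match is not a substring match can be shown with a toy analyzer. This is a rough sketch only: it lowercases, splits on non-alphanumerics and drops the three stopwords named in the thread, which is far simpler than Lucene's StandardAnalyzer, and the names analyze and term_matches are hypothetical.

```python
import re
from typing import List

# Only the example stopwords quoted in the thread; real lists are longer.
STOPWORDS = {"a", "is", "the"}


def analyze(text: str) -> List[str]:
    """Lowercase, tokenize on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def term_matches(column_value: str, term: str) -> bool:
    """True if every analyzed token of `term` is a token of the column."""
    query_tokens = analyze(term)
    if not query_tokens:
        # The whole query was stopwords, so there is nothing to match on.
        return False
    column_tokens = set(analyze(column_value))
    return all(token in column_tokens for token in query_tokens)
```

Under these assumptions, "QUICK" matches a column holding "The Quick brown FOX" even though it is not a substring of it, while "qui" is a substring of the lowercased text but does not match because it is not a whole token. That mismatch with substring intuition is the concern raised about LIKE and CONTAINS.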




[DISCUSS] Addition of smile-nlp test dependency for CEP-30

2023-09-13 Thread Mike Adamson
CEP-30: [Approximate Nearest Neighbor(ANN) Vector Search via
Storage-Attached Indexes] uses the smile-nlp library
(com.github.haifengl.smile-nlp) in its testing to allow the creation of
word2vec embeddings for valid input into the HNSW graph index.

The reason for this library is that we found that using random vectors in
testing produced very inconsistent results. Using the smile-nlp word2vec
implementation with the glove.3k.50d library produces repeatable results.

Does anyone have any objections to the use of this library as a test only
dependency?


Re: [DISCUSS] Addition of smile-nlp test dependency for CEP-30

2023-09-14 Thread Mike Adamson
We can't use open-nlp because it is JDK 17 only. I'll pull the smile-nlp
dependency and write something to do the same thing. Our usage was trivial.

On Thu, 14 Sept 2023 at 00:10, J. D. Jordan 
wrote:

> Reading through smile license again, it is licensed pure GPL 3, not GPL
> with classpath exception. So I think that kills all debate here.
>
> -1 on inclusion
>
> On Sep 13, 2023, at 2:30 PM, Jeremiah Jordan 
> wrote:
>
> 
> I wonder if it can easily be replaced with Apache open-nlp?  It also
> provides an implementation of GloVe.
>
>
> https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/util/wordvector/Glove.html
>
>
> On Sep 13, 2023 at 1:17:46 PM, Benedict  wrote:
>
>> There’s a distinction for spotbugs and other build related tools where
>> they can be downloaded and used during the build so long as they’re not
>> critical to the build process.
>>
>> They have to be downloaded dynamically in binary form I believe though,
>> they cannot be included in the release.
>>
>> So it’s not really in conflict with what Jeff is saying, and my
>> recollection accords with Jeff’s
>>
>> On 13 Sep 2023, at 17:42, Brandon Williams  wrote:
>>
>> 
>>
>> On Wed, Sep 13, 2023 at 11:37 AM Jeff Jirsa  wrote:
>>
>>> You can open a legal JIRA to confirm, but based on my understanding (and
>>> re-confirming reading
>>> https://www.apache.org/legal/resolved.html#category-a ):
>>>
>>>
>> We should probably get clarification here regardless, iirc this came up
>> when we were considering SpotBugs too.
>>
>>



[DISCUSS] Add JVector as a dependency for CEP-30

2023-09-20 Thread Mike Adamson
The original patch for CEP-30 brought several modified Lucene classes
in-tree to implement the concurrent HNSW graph used by the vector index.

These classes are now being replaced with the io.github.jbellis.jvector
library, which contains an improved diskANN implementation for the on-disk
graph format.

The repo for this library is here: https://github.com/jbellis/jvector.

The library does not replace any code used by SAI or other parts of the
codebase and is used solely by the vector index.

I would welcome any feedback on this change.


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Mike Adamson
Just for my understanding on this. Is the issue that the code has a
copyright header on it or that it is copyright to a corporate entity?

On Fri, 22 Sept 2023 at 10:11, Mick Semb Wever  wrote:

> Especially for an optional feature with clear alternative implementations,
>> this doesn't bother me at all. It's well within ASF policy to include
>> permissively licensed code copyrighted by other people or entities.
>>
>
>
> We should be conscious of the problem if this was a crucial (and evolving)
> part of the code that the project was dependent on, even if only the
> optics of it are problematic.
>
> So long as we're asked the question, and this is just an add-on feature that
> the codebase is not dependent on, and no one has any objections then I'm
> ok with it.
>




Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Mike Adamson
> For my understanding, isn’t it gonna be an issue to be copyrighted also
> to a single person? For the same reasons?

This was partly why I asked. I did a random check of libraries that are
definite dependencies (netty, guava) and both contain author copyrights.

On Fri, 22 Sept 2023, 16:01 Ekaterina Dimitrova, 
wrote:

> For my understanding, isn’t it gonna be an issue to be copyrighted also to
> a single person? For the same reasons?
>
> On Fri, 22 Sep 2023 at 7:59, Mick Semb Wever  wrote:
>
>>
>>
>> Just for my understanding on this. Is the issue that the code has a
>>> copyright header on it or that it is copyright to a corporate entity?
>>>
>>
>>
>> The potential issue here is about dependence upon one vendor (or
>> commercial actor).
>> If the project is not usable without a specific piece of work (library)
>> that is controlled and maintained elsewhere, and exercising our freedom to
>> rewrite/fork is difficult, the project isn't really independent.  Being
>> independent is an important tenet for ASF projects.
>>
>> I don't see this being an issue with jamm or jvector.  But I do think
>> it's important to check.
>>
>>


Re: [VOTE] Release Apache Cassandra 5.0-beta1

2023-11-27 Thread Mike Adamson
> Furthermore, we don't even know if it's still an issue after 19034 was
> committed.

It's a difficult one to reproduce because we don't have access to the harry
script that generated the error in the first place. I am investigating it
but without the original reproduction it may take some time.

On Mon, 27 Nov 2023 at 16:19, Mick Semb Wever  wrote:

>
>
> On Mon, 27 Nov 2023 at 16:28, Brandon Williams  wrote:
>
>> On Mon, Nov 27, 2023 at 9:25 AM Mick Semb Wever  wrote:
>> >
>> > It was agreed to move them to 5.0-rc
>>
>> Where?
>>
>
>
> Typo, "it" not "them".
> I'm only talking about 19011.  The others were already 5.0-rc, or in fact
> forward from 5.0.x.
>
> Here:
> https://issues.apache.org/jira/browse/CASSANDRA-19011?focusedCommentId=17789202&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17789202
>
> Furthermore, we don't even know if it's still an issue after 19034 was
> committed.  We want to figure this out before the vote window closes.
>
>
>
>
>




[DISCUSS] CASSANDRA-18940 SAI post-filtering reads don't update local table latency metrics

2023-12-01 Thread Mike Adamson
Hi,

We are looking at adding SAI post-filtering reads to the local table
metrics and would like some feedback on the best approach.

We don't think that SAI reads are that special so they can be included in
the table latencies, but how do we handle the global counts and the SAI
counts? Do we need to maintain a separate count of SAI reads? We feel the
answer to this is yes so how do we do the counting? There are two options
(others welcome):

1. All reads go into the current global count and we have a separate count
for SAI specific reads. So non-SAI reads = global count - SAI count
2. We exclude the SAI reads from the current global count so
total reads = global count + SAI count

Our preference is for option 1 above. Does anyone have any strong views /
opinions on this?
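
To make the two options concrete, here is a minimal sketch of the bookkeeping each one implies. It is illustrative only: the class and method names (ReadCounters, recordSaiRead, and so on) are hypothetical and are not taken from Cassandra's TableMetrics.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration of the two counting options; not Cassandra code.
final class ReadCounters {
    private final LongAdder globalReads = new LongAdder();
    private final LongAdder saiReads = new LongAdder();
    // true  = option 1: SAI reads are also added to the global count
    // false = option 2: SAI reads are kept out of the global count
    private final boolean saiIncludedInGlobal;

    ReadCounters(boolean saiIncludedInGlobal) {
        this.saiIncludedInGlobal = saiIncludedInGlobal;
    }

    void recordRegularRead() {
        globalReads.increment();
    }

    void recordSaiRead() {
        saiReads.increment();
        if (saiIncludedInGlobal) {
            globalReads.increment();
        }
    }

    long globalCount() { return globalReads.sum(); }

    long saiCount() { return saiReads.sum(); }

    // Option 1: non-SAI reads = global count - SAI count
    // Option 2: non-SAI reads = global count
    long nonSaiReads() {
        return saiIncludedInGlobal ? globalCount() - saiCount() : globalCount();
    }

    // Option 1: total reads = global count
    // Option 2: total reads = global count + SAI count
    long totalReads() {
        return saiIncludedInGlobal ? globalCount() : globalCount() + saiCount();
    }
}
```

Either way the same three numbers can be derived. The difference is whether a dashboard has to subtract (option 1) or add (option 2) to get them.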





Re: [DISCUSS] CASSANDRA-18940 SAI post-filtering reads don't update local table latency metrics

2023-12-04 Thread Mike Adamson
Thanks for the feedback. To wrap this up, we will introduce a new SAI local
read metric (name to be decided later) to record read count and latency, and
these reads will be kept separate from the existing local range read metrics.

On Fri, 1 Dec 2023 at 19:10, Jeremiah Jordan 
wrote:

> Again I am coming at this from the operator/end user perspective.
> Creating a metrics dashboard, and then I am looking at those metrics to
> understand what my queries are doing.  We have coordinator query level
> metrics, and then we have lower level table metrics on the replicas.  I
> want to be able to draw a line from this set of coordinator query metrics,
> to that set of table metrics, and be able to understand how they are
> affecting each other for a given query.
>
> The best would be for SAI / Indexes to have their very own sets of all the
> metrics to understand how many rows are read by a given SAI query, and how
> that turns into the over all time for the query, and how long those
> individual reads were taking, etc.
>
> But at the very least I want all of that separate from the metrics for my
> regular point reads.
>
> And yes putting the individual point read metrics into the range metrics
> would be strange.  But rolling up the time to get all the rows and rolling
> that into the Range metrics could possibly make sense.  Still strange.  So
> again SAI specific metrics seem the best to me, rather than shoe horning
> them into the existing metrics.
>
> -Jeremiah
>
> On Dec 1, 2023 at 1:04:47 PM, Caleb Rackliffe 
> wrote:
>
>> Right. SAI queries are distributed range queries that produce local
>> single-partition reads. They should absolutely not be recorded in the local
>> range read latency metric. I'm fine ultimately with a new metric or the
>> existing local single-partition read metric.
>>
>> On Fri, Dec 1, 2023 at 1:02 PM J. D. Jordan 
>> wrote:
>>
>>> At the coordinator level SAI queries fall under Range metrics. I would
>>> either put them under the same at the lower level or in a new SAI metric.
>>>
>>> It would be confusing to have the top level coordinator query metrics in
>>> Range and the lower level in Read.
>>>
>>> On Dec 1, 2023, at 12:50 PM, Caleb Rackliffe 
>>> wrote:
>>>
>>> 
>>> So the plan would be to have local "Read" and "Range" remain unchanged
>>> in TableMetrics, but have a third "SAIRead" (?) just for SAI post-filtering
>>> read SinglePartitionReadCommands? I won't complain too much if that's what
>>> we settle on, but it just depends on how much this is a metric for
>>> ReadCommand subclasses operating at the node-local level versus something
>>> we think we should link conceptually to a user query. SAI queries will
>>> produce a SinglePartitionReadCommand per matching primary key, so that
>>> definitely won't work for the latter.
>>>
>>> @Mike On a related note, we now have "PartitionReads" and "RowsFiltered"
>>> in TableQueryMetrics. Should the former just be removed, given a.) it
>>> actually is rows now not partitions and b.) "RowsFiltered" seems like it'll
>>> be almost  the same thing now? (I guess if we ever try batching rows reads
>>> per partition, it would come in handy again...)
>>>
>>> On Fri, Dec 1, 2023 at 12:30 PM J. D. Jordan 
>>> wrote:
>>>
>>>> I prefer option 2. It is much easier to understand and roll up two
>>>> metrics than to do subtractive dashboards.
>>>>
>>>> SAI reads are already “range reads” for the client level metrics, not
>>>> regular reads. So grouping them into the regular read metrics at the lower
>>>> level seems confusing to me in that sense as well.
>>>>
>>>> As an operator I want to know how my SAI reads and normal reads are
>>>> performing latency wise separately.
>>>>
>>>> -Jeremiah
>>>>
>>>> On Dec 1, 2023, at 11:15 AM, Caleb Rackliffe 
>>>> wrote:
>>>>
>>>> 
>>>> Option 1 would be my preference. Seems both useful to have a single
>>>> metric for read load against the table and a way to break out SAI reads
>>>> specifically.
>>>>
>>>> On Fri, Dec 1, 2023 at 11:00 AM Mike Adamson 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are looking at adding SAI post-filtering reads to the local table
>>>>> metrics and would like some feedback on the best approach.
>>

Re: Welcome Maxim Muzafarov as Cassandra Committer

2024-01-09 Thread Mike Adamson
Congrats Maxim!!

On Tue, 9 Jan 2024, 10:41 Andrés de la Peña,  wrote:

> Congrats, Maxim!
>
> On Tue, 9 Jan 2024 at 03:45, guo Maxwell  wrote:
>
>> Congratulations, Maxim!
>>
>> On Tue, 9 Jan 2024 at 09:00, Francisco Guerrero wrote:
>>
>>> Congratulations, Maxim! Well deserved!
>>>
>>> On 2024/01/08 18:19:04 Josh McKenzie wrote:
>>> > The Apache Cassandra PMC is pleased to announce that Maxim Muzafarov
>>> has accepted
>>> > the invitation to become a committer.
>>> >
>>> > Thanks for all the hard work and collaboration on the project thus
>>> far, and we're all looking forward to working more with you in the future.
>>> Congratulations and welcome!
>>> >
>>> > The Apache Cassandra PMC members
>>> >
>>> >
>>>
>>


Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-30 Thread Mike Adamson
Definitely +1 on this. We saw in the early days of SAI development that
stream pipelines had a substantial impact on performance.
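
As a rough illustration of the kind of rewrite involved (this is not SAI code; the class and method names below are made up for the example), the same small computation can be written as a stream pipeline or as a plain loop. The loop version avoids setting up the pipeline and lambda objects on every call, which is the overhead that tends to show up when the collection usually holds zero or one items.

```java
import java.util.List;

// Hypothetical example; names are illustrative and not from the Cassandra codebase.
final class HotPathSum {
    // Stream pipeline: concise, but builds a pipeline of stages on each call.
    static long sumPositiveStream(List<Integer> values) {
        return values.stream()
                     .filter(v -> v > 0)
                     .mapToLong(Integer::longValue)
                     .sum();
    }

    // Plain loop: same result, no pipeline set-up per call.
    static long sumPositiveLoop(List<Integer> values) {
        long total = 0;
        for (int i = 0, n = values.size(); i < n; i++) {
            int v = values.get(i);
            if (v > 0) {
                total += v;
            }
        }
        return total;
    }
}
```

Whether the difference matters on a given path is something to measure rather than assume. The sketch only shows that the two forms are interchangeable in behaviour.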

On Thu, 30 May 2024 at 19:28, Caleb Rackliffe 
wrote:

> +1
>
> On Thu, May 30, 2024 at 11:29 AM Benedict  wrote:
>
>> Since it’s related to the logging discussion we’re already having, I have
>> seen stream pipelines showing up in a lot of traces recently. I am
>> surprised; I thought it was understood that they shouldn’t be used on hot
>> paths as they are not typically as efficient as old skool for-each
>> constructions done sensibly, especially for small collections that may
>> normally take zero or one items.
>>
>> I would like to propose forbidding the use of streams on hot paths
>> without good justification that the cost:benefit is justified.
>>
>> It looks like it was nominally agreed two years ago that we would include
>> words to this effect in the code style guide, but I forgot to include them
>> when I transferred the new contents from the Google Doc proposal. So we
>> could just include the “Performance” section that was meant to be included
>> at the time.
>>
>>
>>
>> On 30 May 2024, at 13:33, Štefan Miklošovič 
>> wrote:
>>
>> 
>> I see the feedback is overall positive. I will merge that and I will
>> improve the documentation on the website along with what Benedict suggested.
>>
>> On Thu, May 30, 2024 at 10:32 AM Mick Semb Wever  wrote:
>>
>>>
>>>
>>>
>>>> Based on these findings, I went through the code and I have
>>>> incorporated these rules and I rewrote it like this:
>>>>
>>>> 1) no wrapping in "if" if we are not logging more than 2 parameters.
>>>> 2) rewritten log messages to not contain any string concatenation but
>>>> moving it all to placeholders ({}).
>>>> 3) wrap it in "if" if we need to execute a method(s) on parameter(s)
>>>> which is resource-consuming.
>>>>
>>>
>>>
>>> +1
>>>
>>>
>>> It's a shame slf4j botched it with lambdas, their 2.0 fluent api doesn't
>>> impress me.
>>>
>>




Re: CQL and pygments

2020-06-01 Thread Mike Adamson
The correct code location is:

https://github.com/apache/cassandra/tree/trunk/doc/source/_util

On Mon, 1 Jun 2020 at 14:21, Lorina Poland  wrote:

> Some time back, someone (Sylvain?) wrote some code to use CQL with
> pygments. Can I interest anyone in picking up that work, perhaps doing some
> update and submitting it upstream to pygments.org? It would be
> exceedingly helpful to me personally (for C* documentation work), but also
> a wider audience, I'm sure. Here's a pointer to the existing code:
>
> github.com/apache/cassandra/doc/source/_util
>
> Thanks, Lorina
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>

-- 
Mike Adamson
e. madam...@datastax.com
w. www.datastax.com


Re: Error running tests: java.security.InvalidKeyException: Illegal key size

2016-05-18 Thread Mike Adamson
Do you have the JCE unlimited strength policy files installed in your JDK?

http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html
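
If it helps, one way to check which policy is active is the standard javax.crypto.Cipher.getMaxAllowedKeyLength call. As far as I know, a limited policy typically reports 128 for AES, while the unlimited policy reports Integer.MAX_VALUE. A small sketch (the class name JcePolicyCheck is just for this example):

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.Cipher;

// Hypothetical helper for checking the active JCE policy; not part of Cassandra.
final class JcePolicyCheck {
    // Returns the maximum AES key length (in bits) allowed by the installed policy files.
    static int maxAesKeyBits() {
        try {
            return Cipher.getMaxAllowedKeyLength("AES");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("AES is not available in this JRE", e);
        }
    }

    // True when keys larger than 128 bits (e.g. AES-256) are permitted.
    static boolean allowsLargeKeys() {
        return maxAesKeyBits() > 128;
    }

    public static void main(String[] args) {
        System.out.println("Max allowed AES key length: " + maxAesKeyBits());
        System.out.println("Keys larger than 128 bits allowed: " + allowsLargeKeys());
    }
}
```

If allowsLargeKeys() returns false, an "Illegal key size" error for a key longer than 128 bits would be the expected outcome.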

On Tue, 17 May 2016 at 22:58 Mahdi Mohammadi  wrote:

> I can successfully compile cassandra source using `ant` command but when
> running `ant test` for EncryptionUtilsTest I am getting this error:
>
> [junit] - Standard Output ---
> [junit] ERROR 21:55:32 SLF4J: stderr
> [junit] INFO  21:55:32 initializing CipherFactory
> [junit] INFO  21:55:32 initializing keystore from file test/conf/cassandra.keystore
> [junit] INFO  21:55:32 loading secret key for alias testing:1
> [junit] ERROR 21:55:32 could not build cipher
> [junit] java.security.InvalidKeyException: Illegal key size
> [junit] at javax.crypto.Cipher.checkCryptoPerm(Cipher.java:1039) ~[na:1.8.0_71]
> [junit] at javax.crypto.Cipher.implInit(Cipher.java:805) ~[na:1.8.0_71]
> [junit] at javax.crypto.Cipher.chooseProvider(Cipher.java:864) ~[na:1.8.0_71]
> [junit] at javax.crypto.Cipher.init(Cipher.java:1396) ~[na:1.8.0_71]
> [junit] at javax.crypto.Cipher.init(Cipher.java:1327) ~[na:1.8.0_71]
> [junit] at org.apache.cassandra.security.CipherFactory.buildCipher(CipherFactory.java:133) [main/:na]
> [junit] at org.apache.cassandra.security.CipherFactory.getEncryptor(CipherFactory.java:107) [main/:na]
> [junit] at org.apache.cassandra.security.EncryptionUtilsTest.fullRoundTrip(EncryptionUtilsTest.java:102) [classes/:na]
> [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_74]
> [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_74]
> [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_74]
> [junit] at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_74]
> [junit] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) [junit-4.6.jar:na]
>
>
> Any idea?
>
>
> Best Regards
>


Re: Cassandra did not start listening for CQL clients

2017-06-29 Thread Mike Adamson
Hi Tomas,

Try adding:

start_native_transport: true

to your config.

Cheers,
MikeA

On Thu, 29 Jun 2017 at 15:08 Tomas Repik  wrote:

> Hello,
>
> I've tried to create a minimal config file that is needed to start
> Cassandra server. Is it even possible?
> What is the minimal set of options that need to be set in the
> cassandra.yaml file in order for Cassandra to run flawlessly.
>
> These are the options I use:
> commitlog_sync: periodic
> commitlog_sync_period_in_ms: 1
> partitioner: org.apache.cassandra.dht.Murmur3Partitioner
> endpoint_snitch: SimpleSnitch
> seed_provider:
>   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
>     parameters:
>       - seeds: "127.0.0.1"
>
> I gotta be missing something because the server does not start listening
> for CQL clients and cqlsh can't be used therefore.
>
> Thanks in advance for your replies.
>
> Tomas
>
>
>


Re: Cassandra did not start listening for CQL clients

2017-06-29 Thread Mike Adamson
I've honestly no idea but it is still defaulting to false in Config.java.
I'm assuming it will change to defaulting to true when thrift is finally
removed.

On Thu, 29 Jun 2017 at 15:34 Tomas Repik  wrote:

> Thanks Mike,
>
> now I remember this option, but I thought it was set to true by default.
> Any reasons why false is the default?
>
> - Original Message -
> > Hi Tomas,
> >
> > Try adding:
> >
> > start_native_transport: true
> >
> > to your config.
> >
> > Cheers,
> > MikeA
> >
>
>
>


Re: [DISCUSS] CEP-7 Storage Attached Index

2021-09-16 Thread Mike Adamson
Hi,

Just to keep this thread up to date with development progress, we will be 
adding row-aware support to SAI in the next few weeks. This is currently going 
through the final stages of review and testing. 

This feature also adds on-disk versioning to SAI. This allows SAI to support 
multiple on-disk formats during upgrades. 
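
As a very rough sketch of the idea (this is not the actual SAI on-disk format; the magic number, header layout and version numbers below are all hypothetical), each index component can carry a version in its header so that the reader can pick the matching decoding path while old and new formats coexist during an upgrade:

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of a versioned on-disk header; not the real SAI format.
final class VersionedComponent {
    private static final int MAGIC = 0x53414931; // made-up marker for this example

    // Header layout assumed here: magic (4 bytes), version (4 bytes), payload length (4 bytes).
    static byte[] write(int version, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(12 + payload.length);
        buf.putInt(MAGIC).putInt(version).putInt(payload.length).put(payload);
        return buf.array();
    }

    // Reads the header and dispatches on the version found on disk.
    static String describe(byte[] component) {
        ByteBuffer buf = ByteBuffer.wrap(component);
        if (buf.getInt() != MAGIC) {
            throw new IllegalArgumentException("not an index component");
        }
        int version = buf.getInt();
        int length = buf.getInt();
        switch (version) {
            case 1:
                return "v1 payload of " + length + " bytes";
            case 2:
                return "v2 payload of " + length + " bytes";
            default:
                throw new IllegalArgumentException("unsupported on-disk version " + version);
        }
    }
}
```

The point is only that a node running the newer code can still read components written in the older format until they are rebuilt.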

I am mentioning this now because the CEP mentions “Partition Based Iteration” 
as an initial feature. We will change that to “Row Based Iteration” when the 
feature is merged.

MikeA

> On 15 Sep 2021, at 19:42, Caleb Rackliffe  wrote:
> 
> Hey there,
> 
> In the spirit of trying to get as many possible objections to a successful
> vote out of the way, I've added a "Challenges" section to the CEP:
> 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
>  
> 
> 
> Most of you will be familiar with these, but I think we need to be as
> open/candid as possible about the potential risk they pose to SAI's broader
> usability. I've described them from the point of view that they are not
> intractable, but if anyone thinks they are, let's hash that disagreement
> out.
> 
> Thanks!
> 
> On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin wrote:
> 
>> +1 on introducing this in an incremental manner and after reading through
>> CASSANDRA-16092 that seems like a perfect place to start. I see that work
>> on that Jira has stopped until direction for CEP-7 has been voted in.
>> 
>> I say start the vote and let's get this really valuable developer feature
>> underway.
>> 
>> Patrick
>> 
>> On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe 
>> wrote:
>> 
>>> So this thread stalled almost a year ago. (Wow, time flies when you're
>>> trying to release 4.0.) My synthesis of the conversation to this point is
>>> that while there are some open questions about testing
>>> methodology/"definition of done" and our choice of particular on-disk
>> data
>>> structures, neither of these should be a serious obstacle to moving
>> forward
>>> w/ a vote. Having said that, is there anything left around the CEP that
>> we
>>> feel should prevent it from moving to a vote?
>>> 
>>> In terms of how we would proceed from the point a vote passes, it seems
>>> like there have been enough concerns around the proposed/necessary
>> breaking
>>> changes to the 2i API, that we will start development by introducing
>>> components as incrementally as possible into a long-running feature
>> branch
>>> off trunk. (This work would likely start w/ *CASSANDRA-16092*,
>>> which we could
>>> resolve as a sub-task of the SAI epic without interfering with other
>> trunk
>>> development likely destined for a 4.x minor, etc.)
>>> 
>>> On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
>>> jasonstack.z...@gmail.com> wrote:
>>> 
>> Question is: is this planned as a next step?
>> If yes, how are we going to mark SAI as experimental until it gets
>> row offsets? Also, it is likely that index format is going to change
 when
>> row offsets are added, so my concern is that we may have to support
>>> two
>> versions of a format for a smooth migration.
 
 The goal is to support row-level index when merging SAI, I will update
>>> the
 CEP about it.
 
>> I think switching to row
>> offsets also has a huge impact on interaction with SPRC and has some
>> potential for optimisations.
 
 Can you share more details on the optimizations?
 
 
 
 On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <
>>> oleksandr.pet...@gmail.com
> 
 wrote:
 
>> But for improving overall index read performance, I think improving
 base
> table read perf  (because SAI/SASI executes LOTS of
> SinglePartitionReadCommand after searching on-disk index) is more
 effective
> than switching from Trie to Prefix BTree.
> 
> I haven't suggested switching to Prefix B-Tree or any other
>> structure,
 the
> question was about rationale and motivation of picking one over the
 other,
> which I am curious about for personal reasons/interests that lie
>>> outside
 of
> Cassandra. Having this listed in CEP could have been helpful for
>> future
> guidance. It's ok if this question is outside of the CEP scope.
> 
> I also agree that there are many areas that require improvement
>> around
 the
> read/write path and 2i, many of which (even outside of base table
>>> format
 or
> read perf) can yield positive performance results.
> 
>> FWIW, I personally look forward to receiving that contribution when
>>> the
> time is right.
> 
> I am very excited for this contribution, too, and it looks like very
 solid
> work.
> 
> I h

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-02 Thread Mike Adamson
Hi,

I’d like to restart this thread.

We merged the row-aware branch to the SAI codebase just before Christmas and 
have subsequently updated the CEP to reflect these changes.

I would like to move the discussion forward as to how we move this CEP towards 
a vote.

MikeA

> On 16 Sep 2021, at 19:49, DuyHai Doan  wrote:
> 
> Good news, Mike, that row-based indexing will be available; this was a major
> gap in SASI at that time!
> 
> On Thu, 16 Sept 2021 at 15:38, Mike Adamson wrote:
> 
>> Hi,
>> 
>> Just to keep this thread up to date with development progress, we will be
>> adding row-aware support to SAI in the next few weeks. This is currently
>> going through the final stages of review and testing.
>> 
>> This feature also adds on-disk versioning to SAI. This allows SAI to
>> support multiple on-disk formats during upgrades.
>> 
>> I am mentioning this now because the CEP mentions “Partition Based
>> Iteration” as an initial feature. We will change that to “Row Based
>> Iteration” when the feature is merged.
>> 
>> MikeA
>> 
>>> On 15 Sep 2021, at 19:42, Caleb Rackliffe 
>> wrote:
>>> 
>>> Hey there,
>>> 
>>> In the spirit of trying to get as many possible objections to a
>> successful
>>> vote out of the way, I've added a "Challenges" section to the CEP:
>>> 
>>> 
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
>>> 
>>> 
>>> Most of you will be familiar with these, but I think we need to be as
>>> open/candid as possible about the potential risk they pose to SAI's
>> broader
>>> usability. I've described them from the point of view that they are not
>>> intractable, but if anyone thinks they are, let's hash that disagreement
>>> out.
>>> 
>>> Thanks!
>>> 
>>> On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin wrote:
>>> 
>>>> +1 on introducing this in an incremental manner and after reading
>> through
>>>> CASSANDRA-16092 that seems like a perfect place to start. I see that
>> work
>>>> on that Jira has stopped until direction for CEP-7 has been voted in.
>>>> 
>>>> I say start the vote and let's get this really valuable developer
>> feature
>>>> underway.
>>>> 
>>>> Patrick
>>>> 
>>>> On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe wrote:
>>>> 
>>>>> So this thread stalled almost a year ago. (Wow, time flies when you're
>>>>> trying to release 4.0.) My synthesis of the conversation to this point
>> is
>>>>> that while there are some open questions about testing
>>>>> methodology/"definition of done" and our choice of particular on-disk
>>>> data
>>>>> structures, neither of these should be a serious obstacle to moving
>>>> forward
>>>>> w/ a vote. Having said that, is there anything left around the CEP that
>>>> we
>>>>> feel should prevent it from moving to a vote?
>>>>> 
>>>>> In terms of how we would proceed from the point a vote passes, it seems
>>>>> like there have been enough concerns around the proposed/necessary
>>>> breaking
>>>>> changes to the 2i API, that we will start development by introducing
>>>>> components as incrementally as possible into a long-running feature
>>>> branch
>>>>> off trunk. (This work would likely start w/ *CASSANDRA-16092*
>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-16092>, which we
>> could
>>>>> resolve as a sub-task of the SAI epic without interfering with other
>>>> trunk
>>>>> development likely destined for a 4.x minor, etc.)
>>>>> 
>>>>> On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
>>>>

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-03 Thread Mike Adamson
I can’t see why there would be any objection to adding a guardrail. I think
this is a good idea.

MikeA

"I see this as a task for a follow-up ticket so long as the CEP’s contributors 
would not oppose the addition of such a guardrail."

> On 3 Feb 2022, at 16:06, C. Scott Andreas  wrote:
> 
> I see this as a task for a follow-up ticket so long as the CEP’s contributors 
> would not oppose the addition of such a guardrail.



Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-10 Thread Mike Adamson
> I'd be interested to hear from Mike/Jason on the OR support topic, of course.

The support for OR within SAI is fairly minimal and will not work without the 
non-SAI changes needed. Since the non-SAI OR changes are extensive it would be 
better to bring those in under their own CEP. 

I’d leave the decision of whether to put the rest of SAI behind an experimental 
flag to others. My preference would be to not do so because the non-OR 
implementation has been tested and used in production for over a year now.

MikeA

> On 9 Feb 2022, at 13:06, bened...@apache.org wrote:
> 
> > Is there some mechanism such as experimental flags, which would allow the 
> > SAI-only OR support to be merged into trunk
>  
> FWIW, I’m OK with this merging to trunk, either hidden behind a CI-only flag 
> or exposed to the user via some experimental flag (and a suitable NEWS.txt). 
> We’ve discussed the need to periodically merge feature branches with trunk 
> before they are complete. If the work is logically complete for SAI, and 
> we’re only pending work to make OR consistent between SAI and non-SAI 
> queries, I think that more than meets this criterion.
>  
>  
> From: Henrik Ingo <henrik.i...@datastax.com>
> Date: Monday, 7 February 2022 at 12:03
> To: dev@cassandra.apache.org
> Subject: Re: [DISCUSS] CEP-7 Storage Attached Index
> 
> Thanks Benjamin for reviewing and raising this.
>  
> While I don't speak for the CEP authors, just some thoughts from me:
>  
> On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer wrote:
> I would like to raise 2 points regarding the current CEP proposal:
>  
> 1. There is mention of some target versions and of the removal of SASI 
>  
> At this point, we have not agreed on any version numbers and I do not feel 
> that removing SASI should be part of the proposal for now.
> It seems to me that we should see first the adoption surrounding SAI before 
> talking about deprecating other solutions.
>  
>  
> This seems rather uncontroversial. I think the CEP template and previous CEPs 
> invite  the discussion on whether the new feature will or may replace an 
> existing feature. But at the same time that's of course out of scope for the 
> work at hand. I have no opinion one way or the other myself.
>  
>  
> 2. OR queries
>  
> It is unclear to me if the proposal is about adding OR support only for SAI 
> index or for other types of queries too.
> In the past, we had the nasty habit for CQL to provide only partially 
> implemented features which resulted in a bad user experience.
> Some examples are:
> * LIKE restrictions which were introduced for the need of SASI and were 
> never supported for other types of queries
> * IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported 
> elsewhere
> * != operator only supported for conditional inserts or updates
> And there are unfortunately many more.
>  
> We are currently slowly trying to fix those issues and make CQL a more mature 
> language. By consequence, I would like that we change our way of doing 
> things. If we introduce support for OR it should also cover all the other 
> type of queries and be fully tested.
> I also believe that it is a feature that due to its complexity fully deserves 
> its own CEP.
>  
>  
> The current code that would be submitted for review after the CEP is adopted, 
> contains OR support beyond just SAI indexes. An initial implementation first 
> targeted only such queries where all columns in a WHERE clause using OR 
> needed to be backed by an SAI index. This was since extended to also support 
> ALLOW FILTERING mode as well as OR with clustering key columns. The current 
> implementation is by no means perfect as a general purpose OR support, the 
> focus all the time was on implementing OR support in SAI. I'll leave it to 
> others to enumerate exactly the limitations of the current implementation.
>  
> Seeing that also Benedict supports your point of view, I would steer the 
> conversation more into a project management perspective:
> * How can we advance CEP-7 so that the bulk of the SAI code can still be 
> added to Cassandra, so that  users can benefit from this new index type, 
> albeit without OR?
> * This is also an important question from the point of view that this is a 
> large block of code that will inevitably diverge if it's not in trunk. Also, 
> merging it to trunk will allow future enhancements, including the OR syntax 
> btw, to happen against trunk (aka upstream first).
> * Since OR support nevertheless is a feature of SAI, it needs to be at least 
> unit tested, but ideally even would be exposed so that it is possible to test 
> on the CQL level. Is there some mechanism such as experimental flags, which 
> would allow the SAI-only OR support to be merged into trunk, while a separate 
> CEP is focused on implementing "proper" general purpose OR support? I should 
> note

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-14 Thread Mike Adamson
> We don't need a whole "codec framework" for V1, but we're still embedding 
> some versioning information in the column index on-disk structures, right?

I’m not sure why we would want to pull the versioning code only to have to put 
it back in as soon as we need to change the on-disk format. We also need to 
consider whether the legacy format used by DSE is supported in OSS. I’m not 
sure of the policy on this although I strongly suspect that the answer is that 
it won’t be supported. Either way, it would seem to be a lot of work to pull 
the versioning code out at this point since it formed part of a major refactor 
of the SAI framework and plumbing.

MikeA

> On 11 Feb 2022, at 18:47, Caleb Rackliffe  wrote:
> 
> Just finished reading the latest version of the CEP. Here are my thoughts:
> 
> - We've already talked about OR queries, so I won't rehash that, but 
> tokenization support seems like it might be another one of those places where 
> we can cut scope if we want to get V1 out the door. It shouldn't be that hard 
> to detangle from the rest of the code.
> - We mention the JMX metric ecosystem in the CEP, but not the related virtual 
> tables. This isn't a big issue, and doesn't mean we need to change the CEP, 
> but it might be helpful for those not familiar with the existing prototype to 
> know they exist :)
> - It's probably below the line for CEP discussion, but the text and numeric 
> index formats will probably change over time. We don't need a whole "codec 
> framework" for V1, but we're still embedding some versioning information in 
> the column index on-disk structures, right?
> 
> To offset my obvious partiality around this CEP, I've already made an effort 
> to raise some of the issues that may come up to challenge us from a macro 
> perspective. It seems like the prevailing opinion here is that they are 
> either surmountable or simply basic conceptual difficulties w/ distributed 
> secondary indexing.
> 
> tl;dr I'm +1 on bringing this to a vote and starting to put together all the 
> pieces for CASSANDRA-16052 
> <https://issues.apache.org/jira/browse/CASSANDRA-16052> :)
> 
> On Thu, Feb 10, 2022 at 11:26 AM Mike Adamson wrote:
> > I'd be interested to hear from Mike/Jason on the OR support topic, of 
> > course.
> 
> The support for OR within SAI is fairly minimal and will not work without the 
> non-SAI changes needed. Since the non-SAI OR changes are extensive it would 
> be better to bring those in under their own CEP. 
> 
> I’d leave the decision of whether to put the rest of SAI behind an 
> experimental flag to others. My preference would be to not do so because the 
> non-OR implementation has been tested and used on production for over a year 
> now.
> 
> MikeA
> 
>> On 9 Feb 2022, at 13:06, bened...@apache.org wrote:
>> 
>> > Is there some mechanism such as experimental flags, which would allow the 
>> > SAI-only OR support to be merged into trunk
>>  
>> FWIW, I’m OK with this merging to trunk, either hidden behind a CI-only flag 
>> or exposed to the user via some experimental flag (and a suitable NEWS.txt). 
>> We’ve discussed the need to periodically merge feature branches with trunk 
>> before they are complete. If the work is logically complete for SAI, and 
>> we’re only pending work to make OR consistent between SAI and non-SAI 
>> queries, I think that more than meets this criterion.
>>  
>>  
>> From: Henrik Ingo <henrik.i...@datastax.com>
>> Date: Monday, 7 February 2022 at 12:03
>> To: dev@cassandra.apache.org
>> Subject: Re: [DISCUSS] CEP-7 Storage Attached Index
>> 
>> Thanks Benjamin for reviewing and raising this.
>>  
>> While I don't speak for the CEP authors, just some thoughts from me:
>>  
>> On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer <ble...@apache.org> wrote:
>> I would like to raise 2 points regarding the current CEP proposal:
>>  
>> 1. There is mention of some target versions and of the removal of SASI 
>>  
>> At this point, we have not agreed on any version numbers and I do not feel 
>> that removing SASI should be part of the proposal for now.
>> It seems to me that we should see first the adoption surrounding SAI before 
>> talking about deprecating other solutions.
>>  
>>  
>> This seems rather uncontroversial. I think the CEP template and previous 
>> CEPs invite  the

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-16 Thread Mike Adamson
I have updated the CEP to reflect the recent discussions.

OR support has moved out of version 1 support. Index versioning and virtual 
table support are now covered in the Addenda.

MikeA

> On 14 Feb 2022, at 15:35, Caleb Rackliffe  wrote:
> 
> Agreed there’s no reason to pull it out. I was just wondering what state it 
> was in, given I didn’t see it mentioned in the CEP.
> 
>> On Feb 14, 2022, at 8:12 AM, Mike Adamson  wrote:
>> 
>> > We don't need a whole "codec framework" for V1, but we're still embedding 
>> > some versioning information in the column index on-disk structures, right?
>> 
>> I’m not sure why we would want to pull the versioning code only to have to 
>> put it back in as soon as we need to change the on-disk format. We also need 
>> to consider whether the legacy format used by DSE is supported in OSS. I’m 
>> not sure of the policy on this although I strongly suspect that the answer 
>> is that it won’t be supported. Either way, it would seem to be a lot of work 
>> to pull the versioning code out at this point since it formed part of a 
>> major refactor of the SAI framework and plumbing.
>> 
>> MikeA
>> 
>>> On 11 Feb 2022, at 18:47, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>> 
>>> Just finished reading the latest version of the CEP. Here are my thoughts:
>>> 
>>> - We've already talked about OR queries, so I won't rehash that, but 
>>> tokenization support seems like it might be another one of those places 
>>> where we can cut scope if we want to get V1 out the door. It shouldn't be 
>>> that hard to detangle from the rest of the code.
>>> - We mention the JMX metric ecosystem in the CEP, but not the related 
>>> virtual tables. This isn't a big issue, and doesn't mean we need to change 
>>> the CEP, but it might be helpful for those not familiar with the existing 
>>> prototype to know they exist :)
>>> - It's probably below the line for CEP discussion, but the text and numeric 
>>> index formats will probably change over time. We don't need a whole "codec 
>>> framework" for V1, but we're still embedding some versioning information in 
>>> the column index on-disk structures, right?
>>> 
>>> To offset my obvious partiality around this CEP, I've already made an 
>>> effort to raise some of the issues that may come up to challenge us from a 
>>> macro perspective. It seems like the prevailing opinion here is that they 
>>> are either surmountable or simply basic conceptual difficulties w/ 
>>> distributed secondary indexing.
>>> 
>>> tl;dr I'm +1 on bringing this to a vote and starting to put together all 
>>> the pieces for CASSANDRA-16052 
>>> <https://issues.apache.org/jira/browse/CASSANDRA-16052> :)
>>> 
>>> On Thu, Feb 10, 2022 at 11:26 AM Mike Adamson <madam...@datastax.com> wrote:
>>> > I'd be interested to hear from Mike/Jason on the OR support topic, of 
>>> > course.
>>> 
>>> The support for OR within SAI is fairly minimal and will not work without 
>>> the non-SAI changes needed. Since the non-SAI OR changes are extensive it 
>>> would be better to bring those in under their own CEP. 
>>> 
>>> I’d leave the decision of whether to put the rest of SAI behind an 
>>> experimental flag to others. My preference would be to not do so because 
>>> the non-OR implementation has been tested and used in production for over a 
>>> year now.
>>> 
>>> MikeA
>>> 
>>>> On 9 Feb 2022, at 13:06, bened...@apache.org wrote:
>>>> 
>>>> > Is there some mechanism such as experimental flags, which would allow 
>>>> > the SAI-only OR support to be merged into trunk
>>>>  
>>>> FWIW, I’m OK with this merging to trunk, either hidden behind a CI-only 
>>>> flag or exposed to the user via some experimental flag (and a suitable 
>>>> NEWS.txt). We’ve discussed the need to periodically merge feature branches 
>>>> with trunk before they are complete. If the work is logically complete for 
>>>> SAI, and we’re only pending work to make OR consistent between SAI and 
>>>> non-SAI queries, I think that more than meets this criterion.
>>>>  
>>>>  
>>>> From: Henrik Ingo <henrik.i...@datastax.com>
>>>> Da

[DISCUSSION] New dependencies for SAI CEP-7

2022-12-08 Thread Mike Adamson
Hi,

I wanted to discuss the addition of the following dependencies for CEP-7.
The dependencies are:

org.apache.lucene.lucene-core 7.5.0
org.apache.lucene.lucene-analyzers-common 7.5.0
com.carrotsearch.randomizedtesting.randomizedtesting-runner 2.1.2 - test
dependency

Lucene is an Apache project, so it is licensed under the Apache License 2.0.
Carrotsearch is not an Apache project but is also licensed under the Apache
License 2.0.

We are also removing the dependency on com.github.rholder.snowball-stemmer.
This library is used by SASI stemming filters but a later version of the
same library is available in the Lucene libraries.

Does anyone have any concerns about these changes?

Mike Adamson


Re: Welcome Caleb Rackliffe to the PMC

2025-03-02 Thread Mike Adamson
Congratulations Caleb!

On Fri, 28 Feb 2025, 19:11 Doug Rohrer,  wrote:

> Congrats Caleb!
>
> On Feb 26, 2025, at 10:14 PM, Jordan West  wrote:
>
> Congrats Caleb!!
>
> Jordan
> On Wed, Feb 26, 2025 at 13:01 Mick Semb Wever  wrote:
>
>>   .
>>
>>>
>>> Please join us in welcoming Caleb to his new role!
>>>
>>
>>
>>
>> Congratulations Caleb !!
>>
>>
>