Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Rahul Xavier Singh
What is the hadoop code for? For interacting from Hadoop via CQL, or Thrift
if it's that old, or directly looking at SSTables? Been using C* since 2
and have never used it.

Agree to deprecate in the next possible 4.1.x version and remove in 5.0.

Rahul Singh

Chief Executive Officer | Business Platform Architect m: 202.905.2818 e:
rahul.si...@anant.us li: http://linkedin.com/in/xingh ca:
http://calendly.com/xingh

*We create, support, and manage real-time global data & analytics platforms
for the modern enterprise.*

*Anant | https://anant.us *

3 Washington Circle, Suite 301

Washington, D.C. 20037

*http://Cassandra.Link * : The best resources for
Apache Cassandra


On Thu, Mar 9, 2023 at 12:53 PM Brandon Williams  wrote:

> I think if we reach consensus here that decides it. I too vote to
> deprecate in 4.1.x.  This means we would remove it in 5.0.
>
> Kind Regards,
> Brandon
>
> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
>  wrote:
> >
> > Deprecation sounds good to me, but I am not completely sure in which
> > version we can do it. If it is possible to add a deprecation warning in the
> > 4.x series or at least 4.1.x - I vote for that.
> >
> > On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
> >>
> >> Is it possible to deprecate it in the 4.1.x patch release? :)
> >>
> >>
> >> - - -- --- -  -
> >> Jacek Lewandowski
> >>
> >>
> >> czw., 9 mar 2023 o 18:11 Brandon Williams 
> napisał(a):
> >>>
> >>> This is my feeling too, but I think we should accomplish this by
> >>> deprecating it first.  I don't expect anything will change after the
> >>> deprecation period.
> >>>
> >>> Kind Regards,
> >>> Brandon
> >>>
> >>> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
> >>>  wrote:
> >>> >
> >>> > I vote for removing it entirely.
> >>> >
> >>> > thanks
> >>> > - - -- --- -  -
> >>> > Jacek Lewandowski
> >>> >
> >>> >
> >>> > czw., 9 mar 2023 o 18:07 Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> napisał(a):
> >>> >>
> >>> >> Derek,
> >>> >>
> >>> >> I have a couple more points ... I do not think that extracting it to
> >>> >> a separate repository is a "win". That code is on Hadoop 1.0.3. We would be
> >>> >> spending a lot of work on extracting it just to extract 10-year-old code
> >>> >> with occasional updates (in my humble opinion just to make it compilable
> >>> >> again if the code around it changes). What good is that? We would have one
> >>> >> more place to take care of ... Now we at least have it all in one place.
> >>> >>
> >>> >> I believe we have four options:
> >>> >>
> >>> >> 1) leave it there as it is for the coming years, with questionable and
> >>> >> diminishing usage
> >>> >> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
> >>> >> 3) do 2) and extract it to a separate repository, but if we do 2) we
> >>> >> can just leave it there
> >>> >> 4) remove it
> >>> >>
> >>> >> 
> >>> >> From: Derek Chen-Becker 
> >>> >> Sent: Thursday, March 9, 2023 15:55
> >>> >> To: dev@cassandra.apache.org
> >>> >> Subject: Re: Role of Hadoop code in Cassandra 5.0
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> I think the question isn't "Who ... is still using that?" but more
> >>> >> "are we actually going to support it?" If we're on a version that old it
> >>> >> would appear that we've basically abandoned it, although there do appear to
> >>> >> have been refactoring commits (for other things) in the last couple of
> >>> >> years. I would be in favor of removal from 5.0, but at the very least,
> >>> >> could it be moved into a separate repo/package so that it's not pulling a
> >>> >> relatively large dependency subtree from Hadoop into our main codebase?
> >>> >>
> >>> >> Cheers,
> >>> >>
> >>> >> Derek
> >>> >>
> >>> >> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> >>> >> Hi list,
> >>> >>
> >>> >> I stumbled upon the Hadoop package again. I think there was some
> >>> >> discussion about the relevancy of the Hadoop code some time ago but I would
> >>> >> like to ask this again.
> >>> >>
> >>> >> Do you think the Hadoop code (1) is still relevant in 5.0? Who in the
> >>> >> industry is still using that?
> >>> >>
> >>> >> We might drop a lot of code and some Hadoop dependencies too (3)
> >>> >> (even though their scope is "provided"). The version of Hadoop we build upon is
> >>> >> 1.0.3, which was released 10 years ago. This code does not have any tests
> >>> >> nor documentation on the website.
> >>> >>
> >>> >> There seem to be issues like this (2) and it seems like the
> >>> >> solution is to, basically, use the Spark Cassandra connector instead, which I
> >>> >> would say is quite reasonable.
> >>> >>
> >>> >> Regards
> >>> >>
> >>> >> (1)
> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> >>> >> (2)

Re: [DISCUSS] Introduce DATABASE as an alternative to KEYSPACE

2023-04-04 Thread Rahul Xavier Singh
My 2 cents:

Keeping it KEYSPACE works for me; NAMESPACE could also be cool, since we
decide where that namespace exists in relation to datacenters, etc. In our
case, a keyspace is more similar to a namespace than it is to a database,
since we expect all the UDTs, UDFs, and indexes to refer only to the tables in
that keyspace/namespace.

Alternatively, to throw some fuel into the discussion, it is interesting to
look at Postgres (only because there are many distributed databases that are
now PG compliant).
From the interwebs:
*In PostgreSQL, a schema is a namespace that contains named database
objects such as tables, views, indexes, data types, functions, stored
procedures and operators. A database can contain one or multiple schemas
and each schema belongs to only one database.*
I used to gripe about this but as a platform gets more complex it is useful
to organize PG DBs into schemas. In C* world, I found myself doing similar
things by having a prefix : e.g. appprefix_system1 appprefix_system2 ...


Rahul Singh



On Tue, Apr 4, 2023 at 12:52 PM Jeff Jirsa  wrote:

> KEYSPACE at least makes sense in the context that it is the unit that
> defines how those partitions keys are going to be treated/replicated
>
> DATABASE may be ambiguous, but it's ambiguity shared across the industry.
>
> Creating a new name like TABLESPACE or TABLEGROUP sounds horrible because
> it'll be unique to us in the world, and therefore unintuitive for new users.
>
>
>
> On Tue, Apr 4, 2023 at 9:36 AM Josh McKenzie  wrote:
>
>> I think there's competing dynamics here.
>>
>> 1) KEYSPACE isn't that great of a name; it's not a space in which keys
>> are necessarily unique, and you can't address things just by key w/out
>> their respective tables
>> 2) DATABASE isn't that great of a name either due to the aforementioned
>> ambiguity.
>>
>> Something like "TABLESPACE" or "TABLEGROUP" would *theoretically* better
>> satisfy point 1 and 2 above but subjectively I kind of recoil at both
>> equally. So there's that.
>>
>> On Tue, Apr 4, 2023, at 12:30 PM, Abe Ratnofsky wrote:
>>
>> I agree with Bowen - I find Keyspace easier to communicate with. There
>> are plenty of situations where the use of "database" is ambiguous (like
>> "Could you help me connect to database x?"), but Keyspace refers to a
>> single thing. I think more software is moving towards calling these things
>> "namespaces" (like Kubernetes), and while "Keyspaces" is not a term used in
>> this way elsewhere, I still find it leads to clearer communication.
>>
>> --
>> Abe
>>
>>
>> On Apr 4, 2023, at 9:24 AM, Andrés de la Peña 
>> wrote:
>>
>> I think supporting DATABASE is a great idea.
>>
>> It's better aligned with SQL databases, and can save new users one of the
>> first troubles they find.
>>
>> Probably anyone starting to use Cassandra for the first time is going to
>> face the "what is a keyspace?" question in the first minutes. Saving users
>> that trouble with a more common name would be a victory for usability IMO.
>>
>> On Tue, 4 Apr 2023 at 16:48, Mike Adamson  wrote:
>>
>> Hi,
>>
>> I'd like to propose that we add DATABASE to the CQL grammar as an
>> alternative to KEYSPACE.
>>
>> Background: While TABLE was introduced as an alternative for COLUMNFAMILY
>> in the grammar we have kept KEYSPACE for the container name for a group of
>> tables. Nearly all traditional SQL databases use DATABASE as the container
>> name for a group of tables so it would make sense for Cassandra to adopt
>> this naming as well.
>>
>> KEYSPACE would be kept in the grammar but we would update some logging
>> and documentation to encourage use of the new name.
>>
>> Mike Adamson
>>

Re: Adding vector search to SAI with hierarchical navigable small world graph index

2023-04-25 Thread Rahul Xavier Singh
Very exciting. Love it.

"Retrieval augmented" or "Data augmented" LLMs are the easiest way to
"fine-tune" the output of an LLM without actually fine-tuning. Currently
Pinecone/Weaviate/Milvus are eating up the scene, with new players like
Chroma coming out soon.
We've been working with LangChain / LlamaIndex for data ingestion to power
this method, primarily with Pinecone, while using Cassandra as the
intermediary store, especially when trying out different chunking lengths
for the text retrieval.
Being first to have a solid distributed option will be useful not only for
people who are currently using vector databases, but will also help others start
leveraging LLMs with existing infrastructure on Cassandra/Astra etc.

This message was not written by ChatGPT
Rahul Singh



On Mon, Apr 24, 2023 at 5:17 PM David Capwell  wrote:

> This work sounds interesting, I would recommend decoupling the types from
> the ANN support as the types require client changes and can go in now
> (would give a lot of breathing room to get this ready for 5.0), whereas
> ANN depends on SAI which is still being worked on.
>
> On Apr 22, 2023, at 1:02 PM, Jonathan Ellis  wrote:
>
> If you're interested in playing with HNSW outside of a super alpha
> Cassandra branch, I put up a repo with some sample code here:
> https://github.com/jbellis/hnswdemo
>
> On Fri, Apr 21, 2023 at 4:19 PM Jonathan Ellis  wrote:
>
>>
>>
>> *Happy Friday, everyone!*
>>
>> *Rich text formatting ahead, I've attached a PDF for those who prefer that.*
>>
>> *I propose adding approximate nearest neighbor (ANN) vector search
>> capability to Apache Cassandra via storage-attached indexes (SAI). This is
>> a medium-sized effort that will significantly enhance Cassandra’s
>> functionality, particularly for AI use cases. This addition will not only
>> provide a new and important feature for existing Cassandra users, but also
>> attract new users to the community from the AI space, further expanding
>> Cassandra’s reach and relevance.*
>>
>>
>> *Introduction*
>>
>> Vector search is a powerful document search technique that enables
>> developers to quickly find relevant content within an extensive
>> collection of documents, which is useful as a standalone technique, but it
>> is particularly hot now because it significantly enhances the performance
>> of LLMs.
>>
>> Vector search uses ML models to match the semantics of a question
>> rather than just the words it contains, avoiding the classic false
>> positives and false negatives associated with term-based search.
>> Alessandro Benedetti gives some good examples in his excellent talk.
>>
>> You can search across any set of vectors, which are just ordered sets of
>> numbers. In the context of natural language queries and document search,
>> we are specifically concerned with a type of vector called an embedding.
>> An embedding is a high-dimensional vector that captures the underlying
>> semantic relationships and contextual information of words or phrases.
>> Embeddings are generated by ML models trained for this purpose; OpenAI
>> provides an API to do this, but open-source and self-hostable models like
>> BERT are also popular. Creating more accurate and smaller embeddings are
>> active research areas in ML.
>>
>> Large language models (LLMs) can be described as a mile wide and an inch
>> deep. They are not experts on any narrow domain (although they will
>> hallucinate that they are, sometimes convincingly). You can remedy this by
>> giving the LLM additional context for your query, but the context window
>> is small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you want to be
>> very selective about giving the LLM the most relevant possible
>> information.
>>
>> Vector search is red-hot now because it allows us to easily answer the
>> question “what are the most relevant documents to provide as context” by
>> performing a similarity search between the embeddings vector of the query,
>> and those of your document universe. Doing exact search is prohibitively
>> expensive, since you necessarily have to compare with each and every
>> document; this is intractable when you have billions or trillions of
>> docs.  However, there are well-understood algorithms for turning this into
>> a logarithmic problem if you are willing to accept approximately the most
>> similar documents.  This is the “approximate nearest neighbor” problem.
>> (You will see these referred to as
>>
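
The exact search that the proposal calls too expensive at scale can be sketched in a few lines. This is only an illustration under stated assumptions: cosine similarity is assumed as the similarity measure, the 3-dimensional "embeddings" are made up, and the names `cosine_similarity` and `exact_top_k` are hypothetical rather than taken from the proposal or any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, documents, k):
    """Brute force: score every document against the query, keep the k best.

    The cost grows linearly with the number of documents, which is the
    reason approximate nearest neighbor indexes exist.
    """
    scored = [(cosine_similarity(query, emb), doc_id)
              for doc_id, emb in documents.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Made-up 3-dimensional embeddings; real ones have hundreds of dimensions.
documents = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.3, 0.1],
    "taxes": [0.0, 0.2, 0.9],
}
nearest = exact_top_k([1.0, 0.2, 0.0], documents, k=2)
print(nearest)
```

An ANN index such as HNSW trades a small amount of recall for avoiding the full scan in `exact_top_k`.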

Re: [POLL] Vector type for ML

2023-05-03 Thread Rahul Xavier Singh
I like this approach. Thank you for those working on this vector search
initiative.

Here's the feedback with my "user" hat on, as someone who is looking at
databases / indexes for my next LLM app.

Can I take some Python code and go from using an in-memory vector store
like NumPy or FAISS to something else? How easy is it for me to take my
Python code and get it to work with this new external service, which is no
longer just a library?
There are also tons of services that I can run on Docker, e.g. Milvus,
RediSearch, Typesense, Elasticsearch, OpenSearch, and I may hit a hurdle
when trying to handle a lot more data, so I look at Cassandra Vector Search.
Because I am familiar with SQL, Cassandra looks appealing since I can
potentially use a "cql_agent" lib (to be created for LangChain; we're
looking into that now) or an existing CassandraVectorStore class?

In most of these scenarios, if people are using LangChain or LlamaIndex, the
underlying implementation is not as important, since we shield the user from
CQL data types except at schema creation, and most of these libs can be
opinionated and just suggest a generic schema.

The ideal world is if I can just dump text into a field and do a natural
language query on it and have my DB do the embeddings for the document, and
then for the query for me. For now the libs can manage all that and they do
that well. We just need the interface to stay consistent and be relatively
easy to query in CQL. The most popular index in LLM retrieval-augmented
patterns is Pinecone. You make an index, you upsert, and then you query.
It's not assumed that you are also giving it content, though you can send
it metadata to have the document there.
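
The "make an index, upsert, query" workflow can be sketched with an in-memory stand-in. Everything here is an assumption for illustration: `InMemoryVectorIndex` and its methods are hypothetical names, not the Pinecone client or any Cassandra API, and Euclidean distance is chosen only to keep the example short.

```python
import math


class InMemoryVectorIndex:
    """Hypothetical stand-in for a hosted vector index; not a real client API."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._rows = {}  # item id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata=None):
        """Insert a vector, or overwrite the one already stored under item_id."""
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} values, got {len(vector)}")
        self._rows[item_id] = (list(vector), metadata or {})

    def query(self, vector, top_k):
        """Return (id, metadata) for the top_k vectors closest to the query."""
        ranked = sorted(self._rows.items(),
                        key=lambda row: math.dist(vector, row[1][0]))
        return [(item_id, meta) for item_id, (_, meta) in ranked[:top_k]]


index = InMemoryVectorIndex(dimension=2)
index.upsert("a", [0.0, 0.0], {"text": "first chunk"})
index.upsert("b", [1.0, 1.0], {"text": "second chunk"})
index.upsert("a", [5.0, 5.0], {"text": "first chunk, re-embedded"})  # overwrite
hits = index.query([0.9, 1.1], top_k=1)
print(hits)
```

The index stores only vectors and optional metadata; the caller computes the embeddings, which matches the point above that content is not assumed to live in the index.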

If we can have a similar workflow, e.g. create a table with a vector type OR
create a table with an existing type and then add an index to it, no one is
going to lose sleep over it as long as it works. Having the ability to take a
table that has data, and then add a vector index, doesn't make it any
different from adding a new field, since I've got to calculate the
embeddings anyway.

Would love to see how the CQL ends up looking.
Rahul Singh



On Tue, May 2, 2023 at 6:39 PM Patrick McFadin  wrote:

> \o/
>
> Bring it in team. Group hug.
>
> Now if you'll excuse me, I'm going to go build my preso on how Cassandra
> is the only distributed database you can do vector search in an ACID
> transaction.
>
> Patrick
>
> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis  wrote:
>
>> I had a call with David.  We agreed that we want a "vector" data type
>> with these properties
>>
>> - Fixed length
>> - No nulls
>> - Random access not supported
>>
>> Where we disagreed was on my proposal to restrict vectors to only numeric
>> data.  David's points were that
>>
>> (1) He has a use case today for a data type with the other vector
>> properties,
>> (2) It doesn't seem reasonable to create two data types with the same
>> properties, one of which is restricted to numerics, and
>> (3) The restrictions that I want for numeric vectors make more sense at
>> the index and function level, than at the type level.
>>
>> I'm ready to concede that David has the better case here and move forward
>> with a vector implementation without that restriction.
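
The outcome of that call, with the general properties enforced at the type level and the numeric restriction pushed down to the index, can be sketched as two separate checks. This is a rough illustration, not Cassandra's implementation: both function names are hypothetical, and the "random access not supported" property is not modeled.

```python
def validate_vector(values, dimension):
    """Type-level checks: fixed length and no nulls; any element type allowed."""
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    if any(v is None for v in values):
        raise ValueError("vector elements must not be null")
    return tuple(values)


def validate_for_ann_index(values, dimension):
    """Index-level check: on top of the type rules, require float elements."""
    vector = validate_vector(values, dimension)
    if not all(isinstance(v, float) for v in vector):
        raise ValueError("an ANN index needs float elements")
    return vector


# A non-numeric vector is a valid value of the type...
labels = validate_vector(["red", "green"], dimension=2)
# ...while only a float vector is accepted by the index-level check.
embedding = validate_for_ann_index([0.1, 0.2], dimension=2)
```

Keeping the numeric rule out of `validate_vector` is what lets one data type serve both use cases.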
>>
>> On Tue, May 2, 2023 at 4:03 PM David Capwell  wrote:
>>
>>>  How about it, David? Did you already make this?
>>>
>>>
>>> I checked out the patch, fixed serialize/deserialize, added the
>>> constraints, then added a composeForFloat(ByteBuffer), with this the impact
>>> to the POC patch was the following
>>>
>>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>>> type.composeForFloat(bb), both return float[]
>>> 2) change the index validate logic to move away from checking VectorType
>>> and instead check for that plus the element type == FloatType.  I didn’t
>>> bother to do this as its trivial
>>>
>>> David. End this argument. SHOW THE CODE!
>>>
>>>
>>> If this argument ends and people are cool with vector supporting
>>> abstract type, more than glad to help get this in.
>>>
>>> On May 2, 2023, at 1:53 PM, Jeremy Hanna 
>>> wrote:
>>>
>>> I'm all for bringing more functionality to the masses sooner, but the
>>> original idea has a very very specific use case.  Do we have use cases for
>>> a general purpose Vector/Array data structure?  If so, awesome.  I just
>>> wondered if generalizing provides value, beyond being straightforward to
>>> implement.  I'm just trying to be sensitive to the database code
>>> maintenance and driver support for general types 

Re: [POLL] Vector type for ML

2023-05-05 Thread Rahul Xavier Singh
Love it. Thank you folks for coming to a decision on this. This is very
helpful for moving forward on planning for the current Python frameworks:

   - Langchain.CassandraVectorStore
   - Langchain.CassandraVectorRetriever
   - Langchain.CassandraVectorStoreAgent
   - LlamaIndex.CassandraVectorLoader
   - LlamaIndex.CassandraVectorIndex


Rahul Singh



On Fri, May 5, 2023 at 7:13 PM David Capwell  wrote:

> [CASSANDRA-18504] Added support for type VECTOR - ASF JIRA
> 
> issues.apache.org 
>
>
> On May 5, 2023, at 12:27 PM, David Capwell  wrote:
>
> Yep, fair point…. SPARSE VECTOR better maps to NON NULL MAP
>
> On May 5, 2023, at 11:58 AM, David Capwell  wrote:
>
> If we ever add sparse vectors, we can assume that DENSE is the default and
> allow to use either DENSE, SPARSE or nothing.
>
>
> I have been feeling that sparse is just a fixed size list with nulls… so
> array… if you insert {0: 42, 3: 17} then you get an array
> of [42, null, null, 17]?  One negative of doing this is that any operator/function
> that needs to reify large vectors (let's say 10k elements) uses a ton
> of memory due to us making it an array… so a new type could be used to lower
> this cost…
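
The sparse-to-array mapping described here can be shown directly. This is a small sketch under assumptions: `densify` is a hypothetical helper, and Python's `None` stands in for a CQL null.

```python
def densify(sparse, dimension):
    """Expand a {position: value} mapping into a fixed-size list with nulls."""
    dense = [None] * dimension
    for position, value in sparse.items():
        if not 0 <= position < dimension:
            raise IndexError(f"position {position} outside 0..{dimension - 1}")
        dense[position] = value
    return dense


dense = densify({0: 42, 3: 17}, dimension=4)
print(dense)

# The memory concern: one stored value still reifies every slot.
large = densify({0: 42}, dimension=10_000)
print(len(large))
```

The second call illustrates the cost raised above: a single non-null entry still allocates all 10k slots once the vector is materialized as an array.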
>
> With DENSE VECTOR we have the syntax in place that we “could” add SPARSE
> later… With VECTOR we will have complications adding a sparse vector after
> the fact due to this implying DENSE…
>
> Updated ranking
>
> Syntax                Score
> VECTOR                21
> DENSE VECTOR          12
> type[dimension]       10
> NON NULL [dimension]  8
> VECTOR type[n]        5
> DENSE_VECTOR          4
> NON-NULL FROZEN       3
> ARRAY                 1
>
> Syntax                Round 1  Round 2
> VECTOR                4        4
> DENSE VECTOR          2        3
> NON NULL [dimension]  2        1
> VECTOR type[n]        1        -
> type[dimension]       1        -
> DENSE_VECTOR          1        -
> NON-NULL FROZEN       1        -
> ARRAY                 0        -
>
>
> VECTOR is still in the lead…
>
> On May 5, 2023, at 11:40 AM, Andrés de la Peña 
> wrote:
>
> My vote is:
>
> 1. VECTOR
> 2. DENSE VECTOR
> 3. type[dimension]
>
> If we ever add sparse vectors, we can assume that DENSE is the default and
> allow to use either DENSE, SPARSE or nothing.
>
> Perhaps the dimension could be separated from the type, such as in
> VECTOR[dimension] or VECTOR(dimension).
>
> On Fri, 5 May 2023 at 19:05, David Capwell  wrote:
>
>> ...where, just to be clear, VECTOR means a frozen fixed
>>> size array w/ no null values?
>>>
>> Assuming this is the case
>>
>>
>> The current agreed requirements are:
>>
>> 1) non-null elements
>> 2) fixed length
>> 3) frozen
>>
>> You pointed out 3 isn’t actually required, but that would be a different
>> conversation to remove =)… maybe defer this to JIRA as long as all parties
>> agree in the ticket?
>>
>> With all votes in, this is what I see
>>
>> Syntax                Jonathan  David    Josh      Caleb      Patrick  Brandon   Mike     Benedict  Mick        Derek
>>                       Ellis     Capwell  McKenzie  Rackliffe  McFadin  Williams  Adamson            Semb Wever  Chen-Becker
>> VECTOR                1         2        2         -          2        1         1        3         2           -
>> DENSE VECTOR          2         1        -         -          1        -         2        -         -           -
>> type[dimension]       3         3        3         1          -        3         -        2         -           -
>> DENSE_VECTOR          -         -        1         -          -        -         -        -         -           3
>> NON NULL [dimension]  -         1        -         -          -        -         -        1         -           2
>> VECTOR type[n]        -         -        -         -          -        2         -        -         1           -
>> ARRAY                 -         -        -         -          3        -         -        -         -           -
>> NON-NULL FROZEN       -         -        -         -          -        -         -        -         -           1
>>
>> Rank  Weight
>> 1     3
>> 2     2
>> 3     1
>> ?     3
>>
>> Syntax                Score
>> VECTOR                18
>> DENSE VECTOR          10
>> type[dimension]       9
>> NON NULL [dimension]  8
>> VECTOR type[n]        5
>> DENSE_VECTOR          4
>> NON-NULL FROZEN       3
>> ARRAY                 1
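
The weighted scores can be reproduced from the ballots. This is a sketch under assumptions: the ballots are transcribed from the vote table in this message with every empty cell read as "no vote", the rank weights 3/2/1 come from the Rank/Weight table, the "?" weight is unused because no ballot contains it, and `weighted_scores` is a hypothetical helper name.

```python
from collections import Counter

# Rank -> weight, as given in the Rank/Weight table.
WEIGHTS = {1: 3, 2: 2, 3: 1}

# Voter -> {syntax: rank}, transcribed from the vote table (blank = no vote).
BALLOTS = {
    "Jonathan Ellis": {"VECTOR": 1, "DENSE VECTOR": 2, "type[dimension]": 3},
    "David Capwell": {"VECTOR": 2, "DENSE VECTOR": 1, "type[dimension]": 3,
                      "NON NULL [dimension]": 1},
    "Josh McKenzie": {"VECTOR": 2, "type[dimension]": 3, "DENSE_VECTOR": 1},
    "Caleb Rackliffe": {"type[dimension]": 1},
    "Patrick McFadin": {"VECTOR": 2, "DENSE VECTOR": 1, "ARRAY": 3},
    "Brandon Williams": {"VECTOR": 1, "type[dimension]": 3,
                         "VECTOR type[n]": 2},
    "Mike Adamson": {"VECTOR": 1, "DENSE VECTOR": 2},
    "Benedict": {"VECTOR": 3, "type[dimension]": 2,
                 "NON NULL [dimension]": 1},
    "Mick Semb Wever": {"VECTOR": 2, "VECTOR type[n]": 1},
    "Derek Chen-Becker": {"DENSE_VECTOR": 3, "NON NULL [dimension]": 2,
                          "NON-NULL FROZEN": 1},
}


def weighted_scores(ballots, weights):
    """Sum the weight of every rank each syntax received."""
    scores = Counter()
    for ranking in ballots.values():
        for syntax, rank in ranking.items():
            scores[syntax] += weights[rank]
    return dict(scores)


scores = weighted_scores(BALLOTS, WEIGHTS)
for syntax, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{syntax:22}{score}")
```

For example, VECTOR received three first-place, four second-place, and one third-place vote, so its score is 3*3 + 4*2 + 1*1 = 18.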
>>
>>
>> Syntax                Round 1  Round 2
>> VECTOR                3        4
>> DENSE VECTOR          2        2
>> NON NULL [dimension]  2        1
>> VECTOR type[n]        1        -
>> type[dimension]       1        -
>> DENSE_VECTOR          1        -
>> NON-NULL FROZEN       1        -
>> ARRAY                 0        -
>>
>>
>> Under 2 different voting systems vector is in the lead
>> and by a good amount… I have updated the patch locally to reflect this
>> change as well.
>>
>> On May 5, 2023, at 10:41 AM, Mike Adamson  wrote:
>>
>> ...where, just to be clear, VECTOR means a frozen fixed
>>> size array w/ no null values?
>>>
>> Assuming this is the case, my vote is:
>>
>> 1. VECTOR
>> 2. DENSE VECTOR
>>
>> I don't really have a 3rd vote because I 

Re: Voice of Apache (Feathercast) at summit?

2023-12-08 Thread Rahul Xavier Singh
I’m sure you have other people interested but would love to speak about the
community aspect, how we’ve seen customers use it and how it continues to
grow into what we need as technologists to help people build scalable
platforms.



On Tue, Dec 5, 2023 at 8:35 AM Rich Bowen  wrote:

> Hey, folks. I'll be at Cassandra Summit next week, and was wondering if
> any of you who might be there would be interested in doing a podcast
> interview with me for Voice Of Apache (the podcast formerly known as
> Feathercast - see https://feathercast.apache.org for context). Topics
> might include something about 5.0, retrospectives on the last 13 years, or
> whatever you think might be of interest.
>
> Let me know soon if anyone's interested/available, so I know to pack my
> gear.
>
> Thanks!
>
> --Rich
>


Re: Welcome Maxim Muzafarov as Cassandra Committer

2024-01-08 Thread Rahul Xavier Singh
Congratulations!

On Mon, Jan 8, 2024 at 1:21 PM Josh McKenzie  wrote:

> The Apache Cassandra PMC is pleased to announce that Maxim Muzafarov has
> accepted
> the invitation to become a committer.
>
> Thanks for all the hard work and collaboration on the project thus far,
> and we're all looking forward to working more with you in the future.
> Congratulations and welcome!
>
> The Apache Cassandra PMC members
>
>
>


Re: Welcome Brad Schoening as Cassandra Committer

2024-02-21 Thread Rahul Xavier Singh
Congrats Brad!

On Wed, Feb 21, 2024 at 3:47 PM Ekaterina Dimitrova 
wrote:

> Congrats and thank you for everything, Brad! 🥳
>
> On Wed, 21 Feb 2024 at 15:46, Josh McKenzie  wrote:
>
>> The Apache Cassandra PMC is pleased to announce that Brad Schoening has
>> accepted
>> the invitation to become a committer.
>>
>> Your work on the integrated python driver, launch script environment, and
>> tests
>> has been a big help to many. Congratulations and welcome!
>>
>> The Apache Cassandra PMC members
>>
>


Re: Cassandra PMC Chair Rotation, 2024 Edition

2024-06-20 Thread Rahul Xavier Singh
Congrats Dinesh!

On Thu, Jun 20, 2024 at 12:27 PM Francisco Guerrero 
wrote:

> Thanks Josh for your contributions to the project as PMC Chair.
> Congratulations Dinesh!
>
> On 2024/06/20 16:25:26 David Capwell wrote:
> > Congrats!
> >
> > > On Jun 20, 2024, at 9:10 AM, Melissa Logan 
> wrote:
> > >
> > > Josh, thank you for your time as chair + congrats Dinesh!
> > >
> > > On Thu, Jun 20, 2024 at 9:08 AM Abe Ratnofsky <a...@aber.io> wrote:
> > >> Congrats Dinesh! Thank you Josh!
> > >>
> > >>> On Jun 20, 2024, at 11:53 AM, Jeremiah Jordan <
> jeremiah.jor...@gmail.com > wrote:
> > >>>
> > >>> Welcome to the Chair role Dinesh!  Congrats!
> > >>>
> > >>> On Jun 20, 2024 at 10:50:37 AM, Josh McKenzie  > wrote:
> >  Another PMC Chair baton pass incoming! On behalf of the Apache
> Cassandra Project Management Committee (PMC) I would like to welcome and
> congratulate our next PMC Chair Dinesh Joshi (djoshi).
> > 
> >  Dinesh has been a member of the PMC for a few years now and many of
> you likely know him from his thoughtful, measured presence on many of our
> collective discussions as we've grown and evolved over the past few years.
> > 
> >  I appreciate the project trusting me as liaison with the board over
> the past year and look forward to supporting Dinesh in the role in the
> future.
> > 
> >  Repeating Mick (repeating Paulo's) words from last year: The chair
> is an administrative position that interfaces with the Apache Software
> Foundation Board, by submitting regular reports about project status and
> health. Read more about the PMC chair role on Apache projects:
> >  - https://www.apache.org/foundation/how-it-works.html#pmc
> >  - https://www.apache.org/foundation/how-it-works.html#pmc-chair
> >  -
> https://www.apache.org/foundation/faq.html#why-are-PMC-chairs-officers
> > 
> >  The PMC as a whole is the entity that oversees and leads the
> project and any PMC member can be approached as a representative of the
> committee. A list of Apache Cassandra PMC members can be found on:
> https://cassandra.apache.org/_/community.html
> > >>
> >
> >
>


Re: Apache Cassandra Virtual meetings

2019-08-11 Thread Rahul Xavier Singh
I think these meetings would be great.. if there is a specific structure.
We use a simple format that could help e.g.

1. Review long term vision/ roadmap.
2. Review next release / features that are in progress.
3. Discuss issues in general and make a game plan for the next quarter.

Nothing too complicated, but at least some structure so that we are
timeboxed on what we are doing.



On Wed, Aug 7, 2019 at 7:42 AM Joshua McKenzie  wrote:

> The one thing we need to keep in mind is the "If it didn't happen on a
> mailing list, it didn't happen "
> philosophy of apache projects. Shouldn't constrain us too much as the
> nuance is:
>
> *"Discussions and plan proposals often happen at events, in chats (Slack,
> IRC, IM, etc.) or other synchronous places. But all final decisions about
> executing on the plan, checking in the new code, or launching the website
> must be made by the community asynchronously on the mailing list."*
>
> So long as we keep that in mind (and maybe push it back to 8am PST since
> 9am can get pretty ugly for some of the more Eastern European / Asian
> countries), makes sense to me.
>
> On Tue, Aug 6, 2019 at 6:07 PM Dinesh Joshi  wrote:
>
> > Thanks for initiating this conversation Sankalp. On the ASF front, I
> think
> > we need to ensure that non-Pacific time participants can also participate
> > in the discussions. So posting the notes and opening up discussions after
> > the meet up to dev@ would be a great way of making sure everyone can
> > participate and gets visibility. Additionally, we should consider
> > scheduling this meetup in different timezones as far as logistics allow
> it.
> >
> > Dinesh
> >
> > > On Aug 6, 2019, at 2:58 PM, sankalp kohli 
> > wrote:
> > >
> > > Hi All,
> > > There are projects (like k8s[1]) which do regular meetings
> using
> > > video conferencing tools. We want to propose such a meeting for Apache
> > > Cassandra once a quarter. Here are some of the initial details.
> > >
> > > 1. A two hour meeting once a quarter starting at 9am Pacific. We can
> > later
> > > move this to other times to make it easier for other timezones.
> > > 2. Agenda of the meeting will be due 2 days prior to the meeting. A
> > sample
> > > agenda for next one could cover updates on 4.0 testing, any major bugs
> > > found and/or fixed, next steps for 4.0, etc.
> > > 3. Each agenda item will have a time duration and list of people to
> drive
> > > that item.
> > > 4. We will have a moderator for each meeting which will rotate around
> the
> > > community members.
> > > 5. We need to figure out which video conferencing tool to use for this.
> > > Suggestions and donation of tools are welcome.
> > > 6. We will have meeting notes for each item discussed in the meeting.
> > >
> > > Motivation for such a meeting
> > > 1. We currently have Slack, JIRA and emails however an agenda driven
> > video
> > > meeting can help facilitate alignment within the community.
> > > 2. This will give an opportunity to the community to summarize past
> > > progress and talk about future tasks.
> > > 3. Agenda notes can serve as newsletters for the community.
> > >
> > > Notes:
> > > 1. Does this violate any Apache rules? I could not find any rules but
> > > someone can double check
> > > 2. Are there any other Apache projects which do something similar?
> > >
> > > This is a proposal at this time and your feedback is greatly
> appreciated.
> > > If anyone thinks this will not help then please provide a reason.
> > >
> > > Thanks,
> > > Sankalp
> > > [1] https://github.com/kubernetes/community/tree/master/sig-storage
> >
> >
> >
> >
>


Re: [DISCUSS] Cassandra ecosystem site

2022-07-27 Thread Rahul Xavier Singh
Henrik

What a great find! Love the filterability. Our team and I have been
curating https://cassandra.tools/ for a few years as a Markdown-based,
statically generated site. Right now it uses Gatsby. We got the first UI
from a Jamstack headless CMS listing a while ago.

All the content is in Markdown files. We deliberately chose this so that in
the worst case it can be hosted for free on any static hosting site in the
world, including GitHub Pages.

There's another iteration where we can find inspiration. This is a
Hugo-generated site.

https://github.com/stackbit/jamstackthemes
This powers https://jamstackthemes.dev/

All of the content is in Markdown here as well:
https://github.com/stackbit/jamstackthemes/blob/master/content/services/formstack/_index.md

This is a story of precaution: the moment you have to depend on a
database for any site, it becomes a service that has to be hosted, managed,
and cared for. I strongly recommend a static generator even if you continue
to use a database.
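
To make the "static generator on top of a database" idea concrete, here is
a minimal sketch of a build-time export step: rows are read from the
database once and written out as Markdown files with front matter, which a
generator such as Hugo or Gatsby can then build from disk. Everything named
here is an assumption for illustration: the `tools` table, its columns, and
the use of an in-memory SQLite database as a stand-in for whatever database
actually backs the site.

```python
import sqlite3
import tempfile
from pathlib import Path


def export_to_markdown(conn, out_dir):
    """Write one Markdown file per row so a static generator can build from disk."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # 'tools' and its columns are hypothetical, not the schema of any real site.
    rows = conn.execute("SELECT slug, title, url, body FROM tools ORDER BY slug")
    for slug, title, url, body in rows:
        # Simplification: values are assumed not to contain double quotes.
        front_matter = f'---\ntitle: "{title}"\nurl: "{url}"\n---\n\n'
        path = out_dir / f"{slug}.md"
        path.write_text(front_matter + body + "\n", encoding="utf-8")
        written.append(path)
    return written


def demo():
    """Populate a throwaway database, export it, and return (name, text) pairs."""
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute("CREATE TABLE tools (slug TEXT, title TEXT, url TEXT, body TEXT)")
    conn.execute(
        "INSERT INTO tools VALUES (?, ?, ?, ?)",
        ("example-tool", "Example Tool", "https://example.com", "A placeholder entry."),
    )
    with tempfile.TemporaryDirectory() as tmp:
        paths = export_to_markdown(conn, tmp)
        return [(p.name, p.read_text(encoding="utf-8")) for p in paths]
```

With a step like this, the database is only needed when the site is
rebuilt; the published site stays plain static files.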

There's also a source of content in the DB that powers our
https://cassandra.link, which is also generated with Gatsby but is backed by
a MySQL-based app. We've backlogged moving it to a Cassandra variant like
Astra or otherwise, but as you know as well as most people, this
guy named Free Time is hard to get in touch with.



Rahul Singh


Re: [VOTE] Release Apache Cassandra 4.0.6

2022-08-26 Thread Rahul Xavier Singh
+1 nb
Rahul Singh


On Tue, Aug 23, 2022 at 8:56 AM Berenguer Blasi wrote:

> +1
> On 23/8/22 14:50, Ekaterina Dimitrova wrote:
>
>
> +1(nb)
> On Tue, 23 Aug 2022 at 8:49, Josh McKenzie  wrote:
>
>> +1
>>
>> On Tue, Aug 23, 2022, at 6:47 AM, Benjamin Lerer wrote:
>>
>> +1
>>
>> On Tue, Aug 23, 2022 at 11:30, Andrés de la Peña wrote:
>>
>> +1 (nb)
>>
>> On Tue, 23 Aug 2022 at 06:14, Tommy Stendahl via dev <dev@cassandra.apache.org> wrote:
>>
>> +1 nb
>>
>> -----Original Message-----
>> *From*: Brandon Williams
>> *Reply-To*: dev@cassandra.apache.org
>> *To*: dev
>> *Subject*: Re: [VOTE] Release Apache Cassandra 4.0.6
>> *Date*: Mon, 22 Aug 2022 17:47:59 -0500
>>
>> +1
>>
>> On Sun, Aug 21, 2022 at 7:44 AM Mick Semb Wever <m...@apache.org> wrote:
>>
>> Proposing the test build of Cassandra 4.0.6 for release.
>>
>> sha1: eb2375718483f4c360810127ae457f2a26ccce67
>>
>> Git:
>>
>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.0.6-tentative
>>
>> Maven Artifacts:
>>
>> https://repository.apache.org/content/repositories/orgapachecassandra-/org/apache/cassandra/cassandra-all/4.0.6/
>>
>> The Source and Build Artifacts, and the Debian and RPM packages and 
>> repositories, are available here:
>>
>> https://dist.apache.org/repos/dist/dev/cassandra/4.0.6/
>>
>> The vote will be open for 72 hours (longer if needed). Everyone who has 
>> tested the build is invited to vote. Votes by PMC members are considered 
>> binding. A vote passes if there are at least three binding +1s and no -1's.
>>
>> [1]: CHANGES.txt:
>>
>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.0.6-tentative
>>
>> [2]: NEWS.txt:
>>
>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.0.6-tentative
>>
>>


Re: [Marketing] For Review: Learn How CommitLog Works in Apache Cassandra

2022-08-26 Thread Rahul Xavier Singh
Added a comment about "ACID". I would recommend not saying ACID until it's
there. C* has strong consistency when needed, but it doesn't guarantee, for
example, that two competing mutations arriving at the same time will be
executed (or can be rolled back to the previous state) in exactly the
order intended, especially if they come from two different data centers.
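
To illustrate the point, here is a minimal sketch of last-write-wins
reconciliation, which is how regular (non-LWT) writes to the same cell are
resolved: the write with the higher timestamp survives, regardless of the
order in which the clients intended the writes to apply. The `Write` class,
the sample values, and the tie-break on the greater value are simplifications
assumed for this sketch, not Cassandra's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_micros: int  # write timestamp in microseconds
    datacenter: str  # label for illustration only; not used in reconciliation


def reconcile(a: Write, b: Write) -> Write:
    """Pick the surviving write for one cell: the higher timestamp wins.

    On equal timestamps this sketch keeps the greater value (a simplification).
    """
    if a.timestamp_micros != b.timestamp_micros:
        return a if a.timestamp_micros > b.timestamp_micros else b
    return a if a.value >= b.value else b


# Two data centers write the same cell with the same timestamp.
dc1 = Write("balance=100", 1_661_500_000_000_000, "dc1")
dc2 = Write("balance=250", 1_661_500_000_000_000, "dc2")
# A write one microsecond later overrides both, whatever the clients intended.
later = Write("balance=90", 1_661_500_000_000_001, "dc1")

winner_on_tie = reconcile(dc1, dc2)
winner_by_time = reconcile(dc2, later)
```

The result is deterministic and the same on every replica, but one of the
two competing writes is silently discarded rather than ordered or rolled
back, which is why calling this "ACID" would be misleading.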

Maybe it can be explained later that the commitlog mechanism provides
ACID-like features ... ?

From my understanding, the Accord white paper has not been implemented in
any working Cassandra code. I may be wrong.


Rahul Singh


On Tue, Aug 23, 2022 at 5:43 AM Sharan Foga  wrote:

> Hi Chris
>
> I've added a few comments and suggestions. Please feel free to use or
> ignore whichever ones you like :-)
>
> Thanks
> Sharan
>
> On 2022/08/23 00:08:52 Chris Thornett wrote:
> > Opening up Alex Sorokoumov's guide 'Learn How CommitLog Works in Apache
> > Cassandra' for a 72-hr community review by lazy consensus.
> >
> > Please add any amends and suggestions in the comments:
> >
> > https://docs.google.com/document/d/1cyOi-IeU_I9GBkpQbJS6IIrmemAesEqvzLb-eeFs_rM/edit#
> >
> > Thanks!
> >
> > --
> >
> > Chris Thornett
> > Senior Content Strategist, Constantia.io
> >
>