Re: [DISCUSS] [PATCH] Enable Direct I/O For CommitLog Files

2023-04-21 Thread Jon Haddad
This sounds awesome.  Could you share the fio configuration you used to 
benchmark and what hardware you used?  


On 2023/04/18 18:10:24 "Pawar, Amit" wrote:
> [Public]
> 
> Hi,
> 
> I shared my investigation of the CommitLog I/O issue on large-core-count 
> systems in my previous email, dated July 22; a link to the thread is given 
> below.
> https://lists.apache.org/thread/xc5ocog2qz2v2gnj4xlw5hbthfqytx2n
> 
> Basically, two solutions looked possible for improving CommitLog I/O.
> 
>   1.  Multi-threaded syncing
>   2.  Using Direct-IO through JNA
> 
> I worked on the second option, considering the following benefits compared 
> to the first one:
> 
>   1.  Direct I/O read/write throughput is much higher than non-direct I/O 
> (learned through fio benchmarking).
>   2.  It reduces kernel file-cache usage, which in turn reduces kernel I/O 
> activity, for CommitLog files only.
>   3.  Overall CPU usage is reduced for flush activity; JVisualVM shows CPU 
> usage < 30% for the CommitLog syncer thread with the Direct I/O feature.
>   4.  The Direct I/O implementation is simpler than the multi-threaded one.
> 
> As per the community's suggestion, less complexity in the code is good to 
> have. Direct I/O enablement looked promising, but there was one issue: Java 8 
> does not have native support for Direct I/O, so using the JNA library is a 
> must. The same implementation should also work across other Java versions 
> (11 and beyond).
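> 
> To make this concrete, here is a minimal Linux-only sketch of the approach: 
> opening and writing a file with O_DIRECT through JNA. The class and method 
> names are illustrative rather than the attached patch's actual code, and the 
> open(2) flag values assume Linux x86-64.
> 
>     import com.sun.jna.LastErrorException;
>     import com.sun.jna.Memory;
>     import com.sun.jna.Native;
>     import com.sun.jna.Pointer;
> 
>     public final class DirectIoSketch
>     {
>         private static final int O_WRONLY = 0x0001;
>         private static final int O_CREAT  = 0x0040;
>         private static final int O_DIRECT = 0x4000; // Linux x86-64 value
> 
>         static
>         {
>             Native.register("c"); // direct-map the natives below onto libc
>         }
> 
>         private static native int open(String path, int flags, int mode) throws LastErrorException;
>         private static native long write(int fd, Pointer buf, long count) throws LastErrorException;
>         private static native int close(int fd) throws LastErrorException;
> 
>         // With O_DIRECT, the buffer address, byte count, and file offset
>         // must all be multiples of the device's logical block size.
>         public static void writeOneBlock(String path, byte[] payload, int alignment)
>         {
>             int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
>             try
>             {
>                 int len = (payload.length + alignment - 1) / alignment * alignment;
>                 Memory buf = new Memory(len + alignment).align(alignment);
>                 buf.setMemory(0, len, (byte) 0);          // zero-pad the tail
>                 buf.write(0, payload, 0, payload.length);
>                 if (write(fd, buf, len) != len)
>                     throw new RuntimeException("short write to " + path);
>             }
>             finally
>             {
>                 close(fd);
>             }
>         }
>     }
> 
> Because this never touches FileChannel, it behaves the same on Java 8 and 11.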
> 
> I have completed the Direct I/O implementation; a summary of the changes in 
> the attached patch is given below.
> 
>   1.  This implementation does not use Java file channels; the file is 
> opened through JNA to use the Direct I/O feature.
>   2.  New segment types are defined: "DirectIOSegment" for Direct I/O and 
> "NonDirectIOSegment" for non-direct I/O (NonDirectIOSegment is for testing 
> purposes only).
>   3.  A JNA write call is used to flush the changes.
>   4.  New helper functions are defined in NativeLibrary.java and a 
> platform-specific file. Currently tested on Linux only.
>   5.  The patch allows the user to configure the optimal block size and 
> alignment if the default values are not suitable for the CommitLog disk.
>   6.  The following configuration options are provided in the cassandra.yaml 
> file (see the example after this list):
>  *   use_jna_for_commitlog_io: whether to use the JNA feature
>  *   use_direct_io_for_commitlog: whether to use the Direct I/O feature
>  *   direct_io_minimum_block_alignment: 512 (default)
>  *   nvme_disk_block_size: 32MiB (default; can be changed as per the 
> required size)
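> 
> As an illustration, these options might appear in cassandra.yaml roughly as 
> follows. The option names and the 512 / 32MiB defaults come from the list 
> above; the boolean values and the exact placement in the file are 
> assumptions.
> 
>     # Direct I/O settings for CommitLog files
>     use_jna_for_commitlog_io: true
>     use_direct_io_for_commitlog: true
>     direct_io_minimum_block_alignment: 512
>     nvme_disk_block_size: 32MiB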
> 
> The test matrix is complex, so the CommitLog-related test cases and the 
> TPCx-IoT benchmark were run. The patch works with both Java 8 and 11. 
> Compressed and encrypted segments are not supported yet; they can be enabled 
> later based on community feedback.
> 
> The following improvements are seen with Direct I/O enabled:
> 
>   1.  32 cores: >= ~15%
>   2.  64 cores: >= ~80%
> 
> Another observation worth sharing: reading CommitLog files with Direct I/O 
> might help reduce node bring-up time after a node crash.
> 
> Tested with commit ID: 91f6a9aca8d3c22a03e68aa901a0b154d960ab07
> 
> The attached patch enables the Direct I/O feature for CommitLog files. 
> Please check and share your feedback.
> 
> Thanks,
> Amit
> 


Re: Adding vector search to SAI with hierarchical navigable small world graph index

2023-04-21 Thread Dinesh Joshi
Interesting proposal Jonathan. Will grok it over the weekend and play around 
with the branch.

Do you intend to make this part of CEP-7, or to deliver it as an incremental 
update to SAI once that is committed?

> On Apr 21, 2023, at 2:19 PM, Jonathan Ellis  wrote:
> 
> Happy Friday, everyone!
> 
> Rich text formatting ahead; I've attached a PDF for those who prefer that.
> 
> I propose adding approximate nearest neighbor (ANN) vector search capability 
> to Apache Cassandra via storage-attached indexes (SAI). This is a 
> medium-sized effort that will significantly enhance Cassandra’s 
> functionality, particularly for AI use cases. This addition will not only 
> provide a new and important feature for existing Cassandra users, but also 
> attract new users to the community from the AI space, further expanding 
> Cassandra’s reach and relevance.
> 
> Introduction
> 
> Vector search is a powerful document search technique that enables 
> developers to quickly find relevant content within an extensive collection 
> of documents. It is useful as a standalone technique, but it is particularly 
> hot now because it significantly enhances the performance of LLMs.
> 
> Vector search uses ML models to match the semantics of a question rather than 
> just the words it contains, avoiding the classic false positives and false 
> negatives associated with term-based search.  Alessandro Benedetti gives some 
> good examples in his excellent talk.
> 
> You can search across any set of vectors, which are just ordered sets of 
> numbers.  In the context of natural language queries and document search, we 
> are specifically concerned with a type of vector called an embedding.  
> 
> An embedding is a high-dimensional vector that captures the underlying 
> semantic relationships and contextual information of words or phrases. 
> Embeddings are generated by ML models trained for this purpose; OpenAI 
> provides an API to do this, but open-source and self-hostable models like 
> BERT are also popular. Creating more accurate and smaller embeddings is an 
> active research area in ML.
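> 
> As a concrete illustration (not from this proposal), comparing two 
> embeddings typically reduces to a cosine or dot-product computation:
> 
>     // Cosine similarity between two embeddings: values near 1.0 mean the
>     // vectors point the same way (semantically similar), values near 0
>     // mean they are unrelated.
>     static float cosineSimilarity(float[] a, float[] b)
>     {
>         float dot = 0, normA = 0, normB = 0;
>         for (int i = 0; i < a.length; i++)
>         {
>             dot   += a[i] * b[i];
>             normA += a[i] * a[i];
>             normB += b[i] * b[i];
>         }
>         return dot / (float) (Math.sqrt(normA) * Math.sqrt(normB));
>     }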
> 
> Large language models (LLMs) can be described as a mile wide and an inch 
> deep. They are not experts on any narrow domain (although they will 
> hallucinate that they are, sometimes convincingly).  You can remedy this by 
> giving the LLM additional context for your query, but the context window is 
> small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you want to be very 
> selective about giving the LLM the most relevant possible information.
> 
> Vector search is red-hot now because it allows us to easily answer the 
> question “what are the most relevant documents to provide as context” by 
> performing a similarity search between the embeddings vector of the query, 
> and those of your document universe.  Doing exact search is prohibitively 
> expensive, since you necessarily have to compare the query with each and 
> every document; this is intractable when you have billions or trillions of 
> docs.  
> However, there are well-understood algorithms for turning this into a 
> logarithmic problem if you are willing to accept approximately the most 
> similar documents.  This is the “approximate nearest neighbor” problem.  (You 
> will see these referred to as kNN – k nearest neighbors – or ANN.)
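> 
> For contrast, here is an illustrative Java sketch (reusing cosineSimilarity 
> from above) of the exact search that ANN indexes avoid; note that it must 
> score every document on every query:
> 
>     import java.util.ArrayList;
>     import java.util.Comparator;
>     import java.util.List;
>     import java.util.PriorityQueue;
> 
>     // Exact k-nearest-neighbors: O(n) similarity computations per query.
>     // HNSW-style ANN indexes answer the same question in roughly
>     // logarithmic time, at the price of approximate results.
>     static List<Integer> exactKnn(float[] query, float[][] docs, int k)
>     {
>         // min-heap on similarity: the least similar candidate is evicted
>         PriorityQueue<Integer> top = new PriorityQueue<>(
>             Comparator.comparingDouble((Integer i) -> cosineSimilarity(query, docs[i])));
>         for (int i = 0; i < docs.length; i++)
>         {
>             top.add(i);
>             if (top.size() > k)
>                 top.poll();
>         }
>         return new ArrayList<>(top); // indices of the k most similar docs
>     }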
> 
> Pinecone DB has a good example of what this looks like in Python code.
> 
> Vector search is the foundation underlying effectively all of the AI 
> applications that are launching now.  This is particularly relevant to Apache 
> Cassandra users, who tend to manage the types of large datasets that benefit 
> the most from fast similarity search. Adding vector search to Cassandra’s 
> unique strengths of scale, reliability, and low latency will further enhance 
> its appeal and effectiveness for these users while also making it more 
> attractive to newcomers looking to harness AI’s potential.  The faster we 
> deliver vector search, the more valuable it will be for this expanding user 
> base.
> 
> Requirements
> 
>   1.  Perform vector search as outlined in the Pinecone example above.
>   2.  Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL type 
> (see the sketch after this list).
>  *   This is also useful for “classic” ML applications that derive and serve 
> their own feature vectors.
>   3.  Add ANN (approximate nearest neighbor) search.
>   4.  Work with the normal Cassandra data flow.
>  *   Inserting one row at a time is fine; batch ingest cannot be required.
>  *   Updating and deleting rows is also fine.
>   5.  Must compose with other SAI predicates as well as partition keys.
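> 
> A purely hypothetical CQL sketch of the above: only the DENSE FLOAT32 type 
> name comes from this proposal; the table, the dimension comment, and the 
> index statement are illustrative, and the ANN query syntax is deliberately 
> left open here.
> 
>     CREATE TABLE docs (
>         id int PRIMARY KEY,
>         body text,
>         embedding DENSE FLOAT32   -- declaration/dimension syntax TBD
>     );
> 
>     CREATE CUSTOM INDEX docs_embedding_idx ON docs (embedding)
>         USING 'StorageAttachedIndex';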
> 
> Not requirements
> 
>   1.  Other datatypes besides Float32. Pinecone supports only Float32, and 
> it’s hugely ahead in mindshare, so let’s make things easy on ourselves and 
> follow their precedent.
>   2.  Scope creep beyond ANN. In particular, I don’t want to wait for ORDER 
> BY to get exact search in as well.
> 
> How we can do this
> There is exactly one prod