Re: [DISCUSS] CEP-7 Storage Attached Index

Jeremiah D Jordan Wed, 23 Sep 2020 09:00:15 -0700

> Short question: looking forward, how are we going to maintain three 2i
> implementations: SASI, SAI, and 2i?


I think one of the goals stated in the CEP is for SAI to have parity with 2i 
such that it could eventually replace it.


> On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov <[email protected]> 
> wrote:
> 
> Short question: looking forward, how are we going to maintain three 2i
> implementations: SASI, SAI, and 2i?
> 
> Another thing I think this CEP is missing is rationale and motivation
> about why trie-based indexes were chosen over, say, B-Tree. We did have a
> short discussion about this on Slack, but both arguments that I've heard
> (space-saving and keeping a small subset of nodes in memory) work only for
> the most primitive implementation of a B-Tree. Fully-occupied prefix B-Tree
> can have similar properties. There's been a lot of research on B-Trees and
> optimisations in those. Unfortunately, I do not have an
> implementation sitting around for a direct comparison, but I can imagine
> situations when B-Trees may perform better because of simpler construction.
> Maybe we should even consider prototyping a prefix B-Tree to have a more
> fair comparison.
> 
> Thank you,
> -- Alex
> 
> 
> 
> On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang <
> [email protected]> wrote:
> 
>> Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7 SAI.
>> 
>> The recorded video is available here:
>> 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting
>> 
>> On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang <
>> [email protected]>
>> wrote:
>> 
>>> Thank you, Charles and Patrick
>>> 
>>> On Tue, 1 Sep 2020 at 04:56, Charles Cao <[email protected]> wrote:
>>> 
>>>> Thank you, Patrick!
>>>> 
>>>> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin <[email protected]>
>>>> wrote:
>>>>> 
>>>>> I just moved it to 8AM for this meeting to better accommodate APAC.
>>>>> Please see the update here:
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>>>>>
>>>>> Patrick
>>>>> 
>>>>> On Mon, Aug 31, 2020 at 10:04 AM Charles Cao <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Patrick,
>>>>>> 
>>>>>> 11AM PST is a bad time for the people in the APAC timezone. Can we
>>>>>> move it to 7 or 8AM PST in the morning to accommodate their needs ?
>>>>>> 
>>>>>> ~Charles
>>>>>> 
>>>>>> On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin <[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Meeting scheduled.
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>
>>>>>>> Tuesday September 1st, 11AM PST. I added a basic bullet for the agenda
>>>>>>> but if there is more, edit away.
>>>>>>> 
>>>>>>> Patrick
>>>>>>> 
>>>>>>> On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <
>>>>>>> [email protected]> wrote:
>>>>>>> 
>>>>>>>> +1
>>>>>>>> 
>>>>>>>> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <[email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> +1
>>>>>>>>> 
>>>>>>>>> On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> +1
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is related to the discussion Jordan and I had about the
>>>>>>>>>>> contributor Zoom call. Instead of open mic for any issue, call it
>>>>>>>>>>> based on a discussion thread or threads for higher bandwidth
>>>>>>>>>>> discussion.
>>>>>>>>>>>
>>>>>>>>>>> I would be happy to schedule one for next week to specifically
>>>>>>>>>>> discuss CEP-7. I can attach the recorded call to the CEP after.
>>>>>>>>>>>
>>>>>>>>>>> +1 or -1?
>>>>>>>>>>>
>>>>>>>>>>> Patrick
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> Does the community plan to open another discussion or CEP on
>>>>>>>>>>>>> modularization?
>>>>>>>>>>>>
>>>>>>>>>>>> We probably should have a discussion on the ML or monthly contrib
>>>>>>>>>>>> call about it first to see how aligned the interested contributors
>>>>>>>>>>>> are. Could do that through a CEP as well, but CEPs (at least thus
>>>>>>>>>>>> far, sans k8s operator) tend to start with a strong, deeply
>>>>>>>>>>>> thought out point of view being expressed.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>>>> SASI's performance, specifically the search in the B+ tree
>>>>>>>>>>>>>>>> component, depends a lot on the component file's header being
>>>>>>>>>>>>>>>> available in the pagecache. SASI benefits from (needs) nodes
>>>>>>>>>>>>>>>> with lots of RAM. Is SAI bound to this same or similar
>>>>>>>>>>>>>>>> limitation?
>>>>>>>>>>>>>
>>>>>>>>>>>>> SAI also benefits from larger memory, because SAI puts block info
>>>>>>>>>>>>> on heap for searching on-disk components, and having cross-index
>>>>>>>>>>>>> files in the page cache improves read performance of different
>>>>>>>>>>>>> indexes on the same table.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Flushing of SASI can be CPU+IO intensive, to the point of
>>>>>>>>>>>>>>>> saturation, pauses, and crashes on the node. SSDs are a must,
>>>>>>>>>>>>>>>> along with a bit of tuning, just to avoid bringing down your
>>>>>>>>>>>>>>>> cluster. Beyond reducing space requirements, does SAI improve
>>>>>>>>>>>>>>>> on these things? Like SASI how does SAI, in its own way,
>>>>>>>>>>>>>>>> change/narrow the recommendations on node hardware specs?
>>>>>>>>>>>>>
>>>>>>>>>>>>> SAI won't crash the node during compaction and requires less
>>>>>>>>>>>>> CPU/IO.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * SAI defines a global memory limit for compaction instead of the
>>>>>>>>>>>>>   per-index memory limit used by SASI. For example, if compactions
>>>>>>>>>>>>>   are running on 10 tables and each has 10 indexes, SAI will cap
>>>>>>>>>>>>>   the memory usage at the global limit, while SASI may use up to
>>>>>>>>>>>>>   100 * per-index limit.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * After flushing in-memory segments to disk, SAI won't merge
>>>>>>>>>>>>>   on-disk segments, while SASI attempts to merge them at the end.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   There are pros and cons of not merging segments:
>>>>>>>>>>>>>     ** Pros: compaction runs faster and requires fewer resources.
>>>>>>>>>>>>>     ** Cons: small segments reduce compression ratio.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * SAI on-disk format with row ids compresses better.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I understand the desire in keeping out of scope the longer term
>>>>>>>>>>>>>>>> deprecation and migration plan, but… if SASI provides
>>>>>>>>>>>>>>>> functionality that SAI doesn't, like tokenisation and
>>>>>>>>>>>>>>>> DelimiterAnalyzer, yet introduces a body of code ~somewhat
>>>>>>>>>>>>>>>> similar, shouldn't we be roughly sketching out how to reduce
>>>>>>>>>>>>>>>> the maintenance surface area?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Agreed that we should reduce maintenance area if possible, but
>>>>>>>>>>>>> only a very limited part of the code base (e.g. RangeIterator,
>>>>>>>>>>>>> QueryPlan) can be shared. The rest of the code base is quite
>>>>>>>>>>>>> different because of the on-disk format and cross-index files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The goal of this CEP is to get community buy-in on SAI's design.
>>>>>>>>>>>>> Tokenization and DelimiterAnalyzer should be straightforward to
>>>>>>>>>>>>> implement on top of SAI.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can we list what configurations of SASI will become deprecated
>>>>>>>>>>>>>>>> once SAI becomes non-experimental?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Except for "Like", "Tokenisation", and "DelimiterAnalyzer", the
>>>>>>>>>>>>> rest of SASI can be replaced by SAI.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Given a few bugs are open against 2i and SASI, can we provide
>>>>>>>>>>>>>>>> some overview, or rough indication, of how many of them we
>>>>>>>>>>>>>>>> could "triage away"?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I believe most of the known bugs in 2i/SASI either have been
>>>>>>>>>>>>> addressed in SAI or don't apply to SAI.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And, is it time for the project to start introducing new SPI
>>>>>>>>>>>>>>>> implementations as separate sub-modules and jar files that are
>>>>>>>>>>>>>>>> only loaded at runtime based on configuration settings? (sorry
>>>>>>>>>>>>>>>> for the conflation on this one, but maybe it's the right time
>>>>>>>>>>>>>>>> to raise it :shrug:)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Agreed that modularization is the way to go and will speed up
>>>>>>>>>>>>> module development.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does the community plan to open another discussion or CEP on
>>>>>>>>>>>>> modularization?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Adding to Duy's questions…
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * Hardware specs
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> SASI's performance, specifically the search in the B+ tree
>>>>>>>>>>>>>> component, depends a lot on the component file's header being
>>>>>>>>>>>>>> available in the pagecache. SASI benefits from (needs) nodes with
>>>>>>>>>>>>>> lots of RAM. Is SAI bound to this same or similar limitation?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Flushing of SASI can be CPU+IO intensive, to the point of
>>>>>>>>>>>>>> saturation, pauses, and crashes on the node. SSDs are a must,
>>>>>>>>>>>>>> along with a bit of tuning, just to avoid bringing down your
>>>>>>>>>>>>>> cluster. Beyond reducing space requirements, does SAI improve on
>>>>>>>>>>>>>> these things? Like SASI how does SAI, in its own way,
>>>>>>>>>>>>>> change/narrow the recommendations on node hardware specs?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * Code Maintenance
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I understand the desire in keeping out of scope the longer term
>>>>>>>>>>>>>> deprecation and migration plan, but… if SASI provides
>>>>>>>>>>>>>> functionality that SAI doesn't, like tokenisation and
>>>>>>>>>>>>>> DelimiterAnalyzer, yet introduces a body of code ~somewhat
>>>>>>>>>>>>>> similar, shouldn't we be roughly sketching out how to reduce the
>>>>>>>>>>>>>> maintenance surface area?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can we list what configurations of SASI will become deprecated
>>>>>>>>>>>>>> once SAI becomes non-experimental?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Given a few bugs are open against 2i and SASI, can we provide
>>>>>>>>>>>>>> some overview, or rough indication, of how many of them we could
>>>>>>>>>>>>>> "triage away"?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And, is it time for the project to start introducing new SPI
>>>>>>>>>>>>>> implementations as separate sub-modules and jar files that are
>>>>>>>>>>>>>> only loaded at runtime based on configuration settings? (sorry
>>>>>>>>>>>>>> for the conflation on this one, but maybe it's the right time to
>>>>>>>>>>>>>> raise it :shrug:)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>> Mick
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, 18 Aug 2020 at 13:05, DuyHai Doan <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you Zhao Yang for starting this topic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After reading the short design doc, I have a few questions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) SASI was pretty inefficient indexing wide partitions because
>>>>>>>>>>>>>>> the index structure only retains the partition token, not the
>>>>>>>>>>>>>>> clustering columns. As per the design doc, SAI has a row id
>>>>>>>>>>>>>>> mapping to partition offset; can we hope that indexing wide
>>>>>>>>>>>>>>> partitions will be more efficient with SAI? One detail that
>>>>>>>>>>>>>>> worries me is that in the beginning of the design doc, it is
>>>>>>>>>>>>>>> said that the matching rows are post filtered while scanning
>>>>>>>>>>>>>>> the partition. Can you confirm or refute that SAI is efficient
>>>>>>>>>>>>>>> with wide partitions and provides the partition offsets to the
>>>>>>>>>>>>>>> matching rows?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) About space efficiency, one of the biggest drawbacks of SASI
>>>>>>>>>>>>>>> was the huge space required for the index structure when using
>>>>>>>>>>>>>>> CONTAINS logic because of the decomposition of text columns
>>>>>>>>>>>>>>> into n-grams. Will SAI suffer from the same issue in future
>>>>>>>>>>>>>>> iterations? I'm anticipating a bit.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3) If I'm querying using SAI and providing the complete
>>>>>>>>>>>>>>> partition key, will it be more efficient than querying without
>>>>>>>>>>>>>>> the partition key? In other words, does SAI provide any
>>>>>>>>>>>>>>> optimisation when the partition key is specified?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Duy Hai DOAN
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We are looking forward to the community's feedback and
>>>>>>>>>>>>>>>>> suggestions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What comes immediately to mind is testing requirements. It has
>>>>>>>>>>>>>>>> been mentioned already that the project's testability and QA
>>>>>>>>>>>>>>>> guidelines are inadequate to successfully introduce new
>>>>>>>>>>>>>>>> features and refactorings to the codebase. During the 4.0 beta
>>>>>>>>>>>>>>>> phase this was intended to be addressed, i.e. defining more
>>>>>>>>>>>>>>>> specific QA guidelines for 4.0-rc. This would be an important
>>>>>>>>>>>>>>>> step towards QA guidelines for all changes and CEPs post-4.0.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Questions from me
>>>>>>>>>>>>>>>> - How will this be tested, how will its QA status and
>>>>>>>>>>>>>>>>   lifecycle be defined? (per above)
>>>>>>>>>>>>>>>> - With existing C* code needing to be changed, what is the
>>>>>>>>>>>>>>>>   proposed plan for making those changes ensuring maintained
>>>>>>>>>>>>>>>>   QA, e.g. are there separate QA cycles planned for altering
>>>>>>>>>>>>>>>>   the SPI before adding a new SPI implementation?
>>>>>>>>>>>>>>>> - Despite being out of scope, it would be nice to have some
>>>>>>>>>>>>>>>>   idea from the CEP author of when users might still choose
>>>>>>>>>>>>>>>>   afresh 2i or SASI over SAI.
>>>>>>>>>>>>>>>> - Who fills the roles involved? Who are the contributors in
>>>>>>>>>>>>>>>>   this DataStax team? Who is the shepherd? Are there other
>>>>>>>>>>>>>>>>   stakeholders willing to be involved?
>>>>>>>>>>>>>>>> - Is there a preference to use gdoc instead of the project's
>>>>>>>>>>>>>>>>   wiki, and why? (the CEP process suggests a wiki page, and
>>>>>>>>>>>>>>>>   feedback on why another approach is considered better helps
>>>>>>>>>>>>>>>>   evolve the CEP process itself)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> cheers,
>>>>>>>>>>>>>>>> Mick
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> 
>> 
> 
> 
> -- 
> alex p


