Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-21 Thread Jason Rutherglen
> About space efficiency, one of the biggest drawback of SASI was the huge
space required for index structure when using CONTAINS logic because of the
decomposition of text columns into n-grams. Will SAI suffer from the same
issue in future iterations ?

SAI does not have specific ngram support atm, though that may be added
with tokenizers.  Ngrams do indeed grow the index, that's a user
decision for faster queries or more disk space.

On Tue, Aug 18, 2020 at 6:05 AM DuyHai Doan  wrote:
>
> Thank you Zhao Yang for starting this topic
>
> After reading the short design doc, I have a few questions
>
> 1) SASI was pretty inefficient indexing wide partitions because the index
> structure only retains the partition token, not the clustering colums. As
> per design doc SAI has row id mapping to partition offset, can we hope that
> indexing wide partition will be more efficient with SAI ? One detail that
> worries me is that in the beggining of the design doc, it is said that the
> matching rows are post filtered while scanning the partition. Can you
> confirm or infirm that SAI is efficient with wide partitions and provides
> the partition offsets to the matching rows ?
>
> 2) About space efficiency, one of the biggest drawback of SASI was the huge
> space required for index structure when using CONTAINS logic because of the
> decomposition of text columns into n-grams. Will SAI suffer from the same
> issue in future iterations ? I'm anticipating a bit
>
> 3) If I'm querying using SAI and providing complete partition key, will it
> be more efficient than querying without partition key. In other words, does
> SAI provide any optimisation when partition key is specified ?
>
> Regards
>
> Duy Hai DOAN
>
> Le mar. 18 août 2020 à 11:39, Mick Semb Wever  a écrit :
>
> > >
> > > We are looking forward to the community's feedback and suggestions.
> > >
> >
> >
> > What comes immediately to mind is testing requirements. It has been
> > mentioned already that the project's testability and QA guidelines are
> > inadequate to successfully introduce new features and refactorings to the
> > codebase. During the 4.0 beta phase this was intended to be addressed, i.e.
> > defining more specific QA guidelines for 4.0-rc. This would be an important
> > step towards QA guidelines for all changes and CEPs post-4.0.
> >
> > Questions from me
> >  - How will this be tested, how will its QA status and lifecycle be
> > defined? (per above)
> >  - With existing C* code needing to be changed, what is the proposed plan
> > for making those changes ensuring maintained QA, e.g. is there separate QA
> > cycles planned for altering the SPI before adding a new SPI implementation?
> >  - Despite being out of scope, it would be nice to have some idea from the
> > CEP author of when users might still choose afresh 2i or SASI over SAI,
> >  - Who fills the roles involved? Who are the contributors in this DataStax
> > team? Who is the shepherd? Are there other stakeholders willing to be
> > involved?
> >  - Is there a preference to use gdoc instead of the project's wiki, and
> > why? (the CEP process suggest a wiki page, and feedback on why another
> > approach is considered better helps evolve the CEP process itself)
> >
> > cheers,
> > Mick
> >

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-28 Thread Jason Rutherglen
+1

On Thu, Aug 27, 2020 at 1:31 PM Jasonstack Zhao Yang
 wrote:
>
> +1
>
> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova 
> wrote:
>
> > +1
> >
> > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe 
> > wrote:
> >
> > > +1
> > >
> > >
> > >
> > > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin 
> > wrote:
> > >
> > >
> > >
> > > > This is related to the discussion Jordan and I had about the
> > contributor
> > >
> > > > Zoom call. Instead of open mic for any issue, call it based on a
> > > discussion
> > >
> > > > thread or threads for higher bandwidth discussion.
> > >
> > > >
> > >
> > > > I would be happy to schedule on for next week to specifically discuss
> > >
> > > > CEP-7. I can attach the recorded call to the CEP after.
> > >
> > > >
> > >
> > > > +1 or -1?
> > >
> > > >
> > >
> > > > Patrick
> > >
> > > >
> > >
> > > > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie 
> > >
> > > > wrote:
> > >
> > > >
> > >
> > > > > >
> > >
> > > > > > Does community plan to open another discussion or CEP on
> > >
> > > > modularization?
> > >
> > > > >
> > >
> > > > > We probably should have a discussion on the ML or monthly contrib
> > call
> > >
> > > > > about it first to see how aligned the interested contributors are.
> > > Could
> > >
> > > > do
> > >
> > > > > that through CEP as well but CEP's (at least thus far sans k8s
> > > operator)
> > >
> > > > > tend to start with a strong, deeply thought out point of view being
> > >
> > > > > expressed.
> > >
> > > > >
> > >
> > > > > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> > >
> > > > > jasonstack.z...@gmail.com> wrote:
> > >
> > > > >
> > >
> > > > > > >>> SASI's performance, specifically the search in the B+ tree
> > >
> > > > component,
> > >
> > > > > > >>> depends a lot on the component file's header being available in
> > > the
> > >
> > > > > > >>> pagecache. SASI benefits from (needs) nodes with lots of RAM.
> > Is
> > >
> > > > SAI
> > >
> > > > > > bound
> > >
> > > > > > >>> to this same or similar limitation?
> > >
> > > > > >
> > >
> > > > > > SAI also benefits from larger memory because SAI puts block info on
> > >
> > > > heap
> > >
> > > > > > for searching on-disk components and having cross-index files on
> > page
> > >
> > > > > cache
> > >
> > > > > > improves read performance of different indexes on the same table.
> > >
> > > > > >
> > >
> > > > > >
> > >
> > > > > > >>> Flushing of SASI can be CPU+IO intensive, to the point of
> > >
> > > > saturation,
> > >
> > > > > > >>> pauses, and crashes on the node. SSDs are a must, along with a
> > > bit
> > >
> > > > of
> > >
> > > > > > >>> tuning, just to avoid bringing down your cluster. Beyond
> > reducing
> > >
> > > > > space
> > >
> > > > > > >>> requirements, does SAI improve on these things? Like SASI how
> > > does
> > >
> > > > > SAI,
> > >
> > > > > > in
> > >
> > > > > > >>> its own way, change/narrow the recommendations on node hardware
> > >
> > > > > specs?
> > >
> > > > > >
> > >
> > > > > > SAI won't crash the node during compaction and requires less
> > CPU/IO.
> > >
> > > > > >
> > >
> > > > > > * SAI defines global memory limit for compaction instead of
> > per-index
> > >
> > > > > > memory limit used by SASI.
> > >
> > > > > >   For example, compactions are running on 10 tables and each has 10
> > >
> > > > > > indexes. SAI will cap the
> > >
> > > > > >   memory usage with global limit while SASI may use up to 100 *
> > >
> > > > per-index
> > >
> > > > > > limit.
> > >
> > > > > >
> > >
> > > > > > * After flushing in-memory segments to disk, SAI won't merge
> > on-disk
> > >
> > > > > > segments while SASI
> > >
> > > > > >   attempts to merge them at the end.
> > >
> > > > > >
> > >
> > > > > >   There are pros and cons of not merging segments:
> > >
> > > > > > ** Pros: compaction runs faster and requires fewer resources.
> > >
> > > > > > ** Cons: small segments reduce compression ratio.
> > >
> > > > > >
> > >
> > > > > > * SAI on-disk format with row ids compresses better.
> > >
> > > > > >
> > >
> > > > > >
> > >
> > > > > > >>> I understand the desire in keeping out of scope the longer term
> > >
> > > > > > deprecation
> > >
> > > > > > >>> and migration plan, but… if SASI provides functionality that
> > SAI
> > >
> > > > > > doesn't,
> > >
> > > > > > >>> like tokenisation and DelimiterAnalyzer, yet introduces a body
> > of
> > >
> > > > > code
> > >
> > > > > > >>> ~somewhat similar, shouldn't we be roughly sketching out how to
> > >
> > > > > reduce
> > >
> > > > > > the
> > >
> > > > > > >>> maintenance surface area?
> > >
> > > > > >
> > >
> > > > > > Agreed that we should reduce maintenance area if possible, but only
> > >
> > > > very
> > >
> > > > > > limited
> > >
> > > > > > code base (eg. RangeIterator, QueryPlan) can be shared. The rest of
> > > the
> > >
> > > > > > code base
> > >
> > > > > > is quite different because of on-disk format and cross-index files.
> > >
> > > > > >
> > >
> > > > > > The goal of this CEP is to get

Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-11 Thread Jason Rutherglen
What's the advantage over Lucene?


On Wed, Apr 10, 2013 at 10:43 PM, Matt Stump  wrote:

> Druid was our inspiration to layer bitmap indexes on top of Cassandra.
> Druid doesn't work for us because or data set is too large. We would need
> many hundreds of nodes just for the pre-processed data. What I envisioned
> was the ability to perform druid style queries (no aggregation) without the
> limitations imposed by having the entire dataset in memory. I primarily
> need to query whether a user performed some event, but I also intend to add
> trigram indexes for LIKE, ILIKE or possibly regex style matching.
>
> I wasn't aware of CONCISE, thanks for the pointer. We are currently
> evaluating fastbit, which is a very similar project:
> https://sdm.lbl.gov/fastbit/
>
>
> On Wed, Apr 10, 2013 at 5:49 PM, Brian O'Neill  >wrote:
>
> >
> > How does this compare with Druid?
> > https://github.com/metamx/druid
> >
> > We're currently evaluating Acunu, Vertica and Druid...
> >
> >
> http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html
> >
> > With its bitmapped indexes, Druid appears to have the most potential.
> > They boast some pretty impressive stats, especially WRT handling
> > "real-time" updates and adding new dimensions.
> >
> > They also use a compression algorithm, CONCISE, to cut down on the space
> > requirements.
> > http://ricerca.mat.uniroma3.it/users/colanton/concise.html
> >
> > I haven't looked too deep into the Druid code, but I've been meaning to
> > see if it could be backed by C*.
> >
> > We'd be game to join the hunt if you pursue such a beast. (with your
> code,
> > or with portions of Druid)
> >
> > -brian
> >
> >
> > On Apr 10, 2013, at 5:40 PM, mrevilgnome wrote:
> >
> > > What do you think about set manipulation via indexes in Cassandra? I'm
> > > interested in answering queries such as give me all users that
> performed
> > > event 1, 2, and 3, but not 4. If the answer is yes than I can make a
> case
> > > for spending my time on C*. The only downside for us would be our
> current
> > > prototype is in C++ so we would loose some performance and the ability
> to
> > > dedicate an entire machine to caching/performing queries.
> > >
> > >
> > > On Wed, Apr 10, 2013 at 11:57 AM, Jonathan Ellis 
> > wrote:
> > >
> > >> If you mean, "Can someone help me figure out how to get started
> updating
> > >> these old patches to trunk and cleaning out the Avro?" then yes, I've
> > been
> > >> knee-deep in indexing code recently.
> > >>
> > >>
> > >> On Wed, Apr 10, 2013 at 11:34 AM, mrevilgnome 
> > >> wrote:
> > >>
> > >>> I'm currently building a distributed cluster on top of cassandra to
> > >> perform
> > >>> fast set manipulation via bitmap indexes. This gives me the ability
> to
> > >>> perform unions, intersections, and set subtraction across
> sub-queries.
> > >>> Currently I'm storing index information for thousands of dimensions
> as
> > >>> cassandra rows, and my cluster keeps this information cached,
> > distributed
> > >>> and replicated in order to answer queries.
> > >>>
> > >>> Every couple of days I think to myself this should really exist in
> C*.
> > >>> Given all the benifits would there be any interest in
> > >>> reviving CASSANDRA-1472?
> > >>>
> > >>> Some downsides are that this is very memory intensive, even for
> sparse
> > >>> bitmaps.
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Jonathan Ellis
> > >> Project Chair, Apache Cassandra
> > >> co-founder, http://www.datastax.com
> > >> @spyced
> > >>
> >
> > --
> > Brian ONeill
> > Lead Architect, Health Market Science (http://healthmarketscience.com)
> > mobile:215.588.6024
> > blog: http://weblogs.java.net/blog/boneill42/
> > blog: http://brianoneill.blogspot.com/
> >
> >
>


Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-12 Thread Jason Rutherglen
Brian,

The Solr StatsComponent performs aggregations.

http://wiki.apache.org/solr/StatsComponent

I recommend using Datastax DSE Search...


On Fri, Apr 12, 2013 at 10:09 AM, Brian O'Neill wrote:

> @Jason,
>
> I have a lot of experience with SOLR + ES, but mainly for search.  (i.e.
> Finding the most relevant records given a query)
> That's been working well, but now we have requirements to support
> dashboards.  Those dashboards have aggregations in them (sum, average,
> count(s), etc).  I have limited experience using filter functions and
> facets to achieve similar things w/ Lucene, but they never seemed to
> perform well when the sets were large.
>
> If Lucene/SOLR/ES can support this kind of functionality, we'd gladly use
> it instead. (Let me know!)
>
> When we looked around, Druid seemed to fit the bill exactly: (and it was
> open source)
> http://metamarkets.com/2011/druid-part-i-real-time-analytics-at-a-billion-r
> ows-per-second/
>
> BTW, here is more information on the compression that Druid uses:
> http://metamarkets.com/2012/druid-bitmap-compression/
>
>
> To echo Matt's sentiment, we'd love to leverage a C* native capability for
> this.
> (Acunu provides most of the capability, but it isn't open source)
>
> I think once we have the "conditional write" semantics that are coming, we
> could layer this on top of C*. (extending the secondary indexes
> functionality)
>
> -brian
>
>
>
> ---
> Brian O'Neill
> Lead Architect, Software Development
> Health Market Science
> The Science of Better Results
> 2700 Horizon Drive € King of Prussia, PA € 19406
> M: 215.588.6024 € @boneill42 <http://www.twitter.com/boneill42>  €
> healthmarketscience.com
>
> This information transmitted in this email message is for the intended
> recipient only and may contain confidential and/or privileged material. If
> you received this email in error and are not the intended recipient, or
> the person responsible to deliver it to the intended recipient, please
> contact the sender at the email above and delete this email and any
> attachments and destroy any copies thereof. Any review, retransmission,
> dissemination, copying or other use of, or taking any action in reliance
> upon, this information by persons or entities other than the intended
> recipient is strictly prohibited.
>
>
>
>
>
>
>
> On 4/12/13 12:46 AM, "Matt Stump"  wrote:
>
> >You could embed Lucene, but then you pretty much have DSE search, and
> >there
> >are people on this list in a better position than I to describe
> >the difficulty in making that scale. By rolling your own you get
> >simplicity
> >and control. If you use a uniform index size you can just assign chunks of
> >it to the cassandra ring making it easy to distribute queries. I think
> >that
> >using Lucene in this way would cause most of the benefit of the library to
> >be lost, and add unnecessary complexity. If Lucene were easy, then I think
> >given the team's experience with both Lucene and C* it would have been
> >done
> >already.
> >
> >Sorry if it's a fuzzy answer, but I haven't run down every technical angle
> >on the integration with C* yet. The idea was still very much in the
> >wouldn't it be very cool if this thing lived in Cassandra. It would be the
> >nail in the coffin for impala, redshift, et al.
> >
> >
> >On Thu, Apr 11, 2013 at 3:15 PM, Jason Rutherglen <
> >jason.rutherg...@gmail.com> wrote:
> >
> >> What's the advantage over Lucene?
> >>
> >>
> >> On Wed, Apr 10, 2013 at 10:43 PM, Matt Stump 
> >> wrote:
> >>
> >> > Druid was our inspiration to layer bitmap indexes on top of Cassandra.
> >> > Druid doesn't work for us because or data set is too large. We would
> >>need
> >> > many hundreds of nodes just for the pre-processed data. What I
> >>envisioned
> >> > was the ability to perform druid style queries (no aggregation)
> >>without
> >> the
> >> > limitations imposed by having the entire dataset in memory. I
> >>primarily
> >> > need to query whether a user performed some event, but I also intend
> >>to
> >> add
> >> > trigram indexes for LIKE, ILIKE or possibly regex style matching.
> >> >
> >> > I wasn't aware of CONCISE, thanks for the pointer. We are currently
> >> > evaluating fastbit, which is a very similar project:
> >> > https://sdm.lbl.gov/fastbit/
> >> >
> >> >
>

Avoiding the system IO cache for compaction

2011-08-02 Thread Jason Rutherglen
Cassandra 'compacts' the way Lucene 'merges' segments.  One
interesting new feature built into Lucene is [1] which avoids loading
the source files into the system IO cache on compaction / merge.

Perhaps Cassandra already has this feature?

1. 
https://builds.apache.org/job/Lucene-trunk/javadoc/all/org/apache/lucene/store/DirectIOLinuxDirectory.html


Re: Avoiding the system IO cache for compaction

2011-08-02 Thread Jason Rutherglen
Looks like there were issues with the native parts.  Mike McCandless
notes this in [1].  If we reuse the Lucene code, I think it'll work
for Cassandra.

1. https://issues.apache.org/jira/browse/LUCENE-2500

On Tue, Aug 2, 2011 at 9:42 PM, Jonathan Ellis  wrote:
> Tried and failed. https://issues.apache.org/jira/browse/CASSANDRA-1902
>
> On Tue, Aug 2, 2011 at 11:24 PM, Jason Rutherglen
>  wrote:
>> Cassandra 'compacts' the way Lucene 'merges' segments.  One
>> interesting new feature built into Lucene is [1] which avoids loading
>> the source files into the system IO cache on compaction / merge.
>>
>> Perhaps Cassandra already has this feature?
>>
>> 1. 
>> https://builds.apache.org/job/Lucene-trunk/javadoc/all/org/apache/lucene/store/DirectIOLinuxDirectory.html
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


The Eclipse target doesn't seem to show up in 'ant -p'

2011-09-07 Thread Jason Rutherglen
I think it's 'ant generate-eclipse-files'?  Maybe we should make it show up?


Re: The Eclipse target doesn't seem to show up in 'ant -p'

2011-09-07 Thread Jason Rutherglen
Now something is broken, I stopped my laptop while Maven was
downloading something!

---

Buildfile: /home/j/src/CASSANDRA-3147/build.xml

maven-ant-tasks-localrepo:

maven-ant-tasks-download:

maven-ant-tasks-init:
Unable to obtain resource from
/home/j/src/CASSANDRA-3147/build/maven-ant-tasks-2.1.3.jar:
java.util.zip.ZipException: error in opening zip file
  [typedef] Unable to obtain resource from
/home/j/src/CASSANDRA-3147/build/maven-ant-tasks-2.1.3.jar:
  [typedef] java.util.zip.ZipException: error in opening zip file
  [typedef] at java.util.zip.ZipFile.open(Native Method)
  [typedef] at java.util.zip.ZipFile.(ZipFile.java:127)
  [typedef] at java.util.jar.JarFile.(JarFile.java:135)
  [typedef] at java.util.jar.JarFile.(JarFile.java:99)
  [typedef] at
org.apache.tools.ant.AntClassLoader.getResourceURL(AntClassLoader.java:1002)
  [typedef] at
org.apache.tools.ant.AntClassLoader$ResourceEnumeration.findNextResource(AntClassLoader.java:145)
  [typedef] at
org.apache.tools.ant.AntClassLoader$ResourceEnumeration.(AntClassLoader.java:109)
  [typedef] at
org.apache.tools.ant.AntClassLoader.findResources(AntClassLoader.java:949)
  [typedef] at
org.apache.tools.ant.AntClassLoader.getNamedResources(AntClassLoader.java:918)
  [typedef] at
org.apache.tools.ant.loader.AntClassLoader5.getResources(AntClassLoader5.java:54)
  [typedef] at
org.apache.tools.ant.taskdefs.Definer.resourceToURLs(Definer.java:375)
  [typedef] at 
org.apache.tools.ant.taskdefs.Definer.execute(Definer.java:267)
  [typedef] at
org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
  [typedef] at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
  [typedef] at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  [typedef] at java.lang.reflect.Method.invoke(Method.java:597)
  [typedef] at
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
  [typedef] at org.apache.tools.ant.Task.perform(Task.java:348)
  [typedef] at org.apache.tools.ant.Target.execute(Target.java:390)
  [typedef] at org.apache.tools.ant.Target.performTasks(Target.java:411)
  [typedef] at
org.apache.tools.ant.Project.executeSortedTargets(Project.java:1397)
  [typedef] at org.apache.tools.ant.Project.executeTarget(Project.java:1366)
  [typedef] at
org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
  [typedef] at 
org.apache.tools.ant.Project.executeTargets(Project.java:1249)
  [typedef] at org.apache.tools.ant.Main.runBuild(Main.java:801)
  [typedef] at org.apache.tools.ant.Main.startAnt(Main.java:218)
  [typedef] at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
  [typedef] at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
  [typedef] Could not load definitions from resource
org/apache/maven/artifact/ant/antlib.xml. It could not be found.

BUILD FAILED
/home/j/src/CASSANDRA-3147/build.xml:290: Problem: failed to create
task or type antlib:org.apache.maven.artifact.ant:remoteRepository
Cause: The name is undefined.
Action: Check the spelling.
Action: Check that any custom tasks/types have been declared.
Action: Check that any / declarations have taken place.
No types or tasks have been defined in this namespace yet

This appears to be an antlib declaration.
Action: Check that the implementing library exists in one of:
-/usr/share/ant/lib
-/home/j/.ant/lib
-a directory added on the command line with the -lib argument


Total time: 1 second


Faster byte[] comparisons

2011-10-31 Thread Jason Rutherglen
"...benchmarks show it as being 2x more CPU-efficient than the
equivalent pure-Java implementation..."

https://issues.apache.org/jira/browse/HADOOP-7761


Re: Cassandra in memory key index

2012-06-06 Thread Jason Rutherglen
Seems like a good place to try out Lucene's FST [1] data structure
which would enable more keys to be loaded into RAM (for more granular
seeks), along with their positions.  Lucene uses this for the terms
dictionary and it's use made for nice gains in the efficiency of the
terms dictionary.  The efficiency gains are potentially much better if
the FST were used in Cassandra?

This is an example of how it is used [2].  Not only is the memory
usage very efficient because it's a single byte[], the keys are
compressed as well as the position value.

If you are interested I can help, I used the FST on a Hadoop project
to implement a fast map side range join.

1. http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html
2. 
https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/codecs/VariableGapTermsIndexReader.java

On Wed, Jun 6, 2012 at 12:05 PM, Jonathan Ellis  wrote:
> Implementation is in IndexSummary.java; the core is
>
>    private final ArrayList positions;
>    private final ArrayList keys;
>
> So no, nothing fancy like prefix compression.
>
> On Wed, Jun 6, 2012 at 11:00 AM, Jason Rutherglen
>  wrote:
>> I am wondering how this is currently implemented?  Is there prefix 
>> compression?
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com


Re: Cassandra in memory key index

2012-06-08 Thread Jason Rutherglen
The Cassandra integration is probably beyond the time I have available.  If
the locations in the code that need to be rewritten to use the FST are
known, and a patch simply 'plugs-in' the FST, that would be much easier.
Eg, I don't know how Cassandra stores the current key index for example...

Basically I can write FST serializing, deserializing, and key lookup code
fairly easy by basing it on Lucene's terms dict.

On Fri, Jun 8, 2012 at 6:00 AM, Radim Kolar  wrote:

>
>  If you are interested I can help, I used the FST on a Hadoop project
>> to implement a fast map side range join.
>>
> create JIRA item with patch attached, i will test it.
>


Re: Cassandra in memory key index

2012-06-08 Thread Jason Rutherglen
Ok looks like the IndexSummary encapsulates everything, I can start with
hacking that.

On Fri, Jun 8, 2012 at 11:50 AM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> The Cassandra integration is probably beyond the time I have available.
> If the locations in the code that need to be rewritten to use the FST are
> known, and a patch simply 'plugs-in' the FST, that would be much easier.
> Eg, I don't know how Cassandra stores the current key index for example...
>
> Basically I can write FST serializing, deserializing, and key lookup code
> fairly easy by basing it on Lucene's terms dict.
>
>
> On Fri, Jun 8, 2012 at 6:00 AM, Radim Kolar  wrote:
>
>>
>>  If you are interested I can help, I used the FST on a Hadoop project
>>> to implement a fast map side range join.
>>>
>> create JIRA item with patch attached, i will test it.
>>
>
>


Re: Cassandra in memory key index

2012-06-08 Thread Jason Rutherglen
Yeah that's fine, however if there isn't a Java implementation that's a lot
of extra work.  The FST approach should be a clear quick and easy win.  The
current system of in heap keys and a binary search is what the FST replaced
in Lucene.  There are plenty of references to the improvement.

On Fri, Jun 8, 2012 at 3:50 PM, Pavel Yaskevich  wrote:

> I would vote,  if possible, to compare it with y-fast trie [1] (it doesn't
> seem to be available java implementation unfortunately) by means of memory
> efficiency and lookup performance. As we use big integer tokens the main
> benefit from that trie could be O(log log M) predecessor lookup and compact
> in-memory size.
>
> [1] https://en.wikipedia.org/wiki/Y-fast_trie
>
> Best Regards
> --
> Pavel Yaskevich
>
>
> On Friday 8 June 2012 at 22:19, Jason Rutherglen wrote:
>
> > Ok looks like the IndexSummary encapsulates everything, I can start with
> > hacking that.
> >
> > On Fri, Jun 8, 2012 at 11:50 AM, Jason Rutherglen <
> > jason.rutherg...@gmail.com (mailto:jason.rutherg...@gmail.com)> wrote:
> >
> > > The Cassandra integration is probably beyond the time I have available.
> > > If the locations in the code that need to be rewritten to use the FST
> are
> > > known, and a patch simply 'plugs-in' the FST, that would be much
> easier.
> > > Eg, I don't know how Cassandra stores the current key index for
> example...
> > >
> > > Basically I can write FST serializing, deserializing, and key lookup
> code
> > > fairly easy by basing it on Lucene's terms dict.
> > >
> > >
> > > On Fri, Jun 8, 2012 at 6:00 AM, Radim Kolar  h...@filez.com)> wrote:
> > >
> > > >
> > > > If you are interested I can help, I used the FST on a Hadoop project
> > > > > to implement a fast map side range join.
> > > >
> > > > create JIRA item with patch attached, i will test it.
> > > >
> > >
> > >
> >
> >
> >
>
>
>


Re: Cassandra in memory key index

2012-06-09 Thread Jason Rutherglen
> does FST provide a predecessor lookup function

Yes [1].  seekCeil, seekFloor, seekExact are the methods provided for
seeking.

1.
https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/util/fst/BytesRefFSTEnum.java

On Fri, Jun 8, 2012 at 4:03 PM, Pavel Yaskevich  wrote:

> Yeah, that is why I wrote "if possible" :) Also, does FST provide a
> predecessor lookup function, wasn't clear from the blog post?
>
> On Friday 8 June 2012 at 22:53, Jason Rutherglen wrote:
>
> > Yeah that's fine, however if there isn't a Java implementation that's a
> lot
> > of extra work. The FST approach should be a clear quick and easy win. The
> > current system of in heap keys and a binary search is what the FST
> replaced
> > in Lucene. There are plenty of references to the improvement.
> >
> > On Fri, Jun 8, 2012 at 3:50 PM, Pavel Yaskevich  pove...@gmail.com)> wrote:
> >
> > > I would vote, if possible, to compare it with y-fast trie [1] (it
> doesn't
> > > seem to be available java implementation unfortunately) by means of
> memory
> > > efficiency and lookup performance. As we use big integer tokens the
> main
> > > benefit from that trie could be O(log log M) predecessor lookup and
> compact
> > > in-memory size.
> > >
> > > [1] https://en.wikipedia.org/wiki/Y-fast_trie
> > >
> > > Best Regards
> > > --
> > > Pavel Yaskevich
> > >
> > >
> > > On Friday 8 June 2012 at 22:19, Jason Rutherglen wrote:
> > >
> > > > Ok looks like the IndexSummary encapsulates everything, I can start
> with
> > > > hacking that.
> > > >
> > > > On Fri, Jun 8, 2012 at 11:50 AM, Jason Rutherglen <
> > > > jason.rutherg...@gmail.com (mailto:jason.rutherg...@gmail.com)>
> wrote:
> > > >
> > > > > The Cassandra integration is probably beyond the time I have
> available.
> > > > > If the locations in the code that need to be rewritten to use the
> FST
> > > > >
> > > >
> > > >
> > >
> > > are
> > > > > known, and a patch simply 'plugs-in' the FST, that would be much
> > > >
> > >
> > > easier.
> > > > > Eg, I don't know how Cassandra stores the current key index for
> > > >
> > >
> > > example...
> > > > >
> > > > > Basically I can write FST serializing, deserializing, and key
> lookup
> > > code
> > > > > fairly easy by basing it on Lucene's terms dict.
> > > > >
> > > > >
> > > > > On Fri, Jun 8, 2012 at 6:00 AM, Radim Kolar  h...@filez.com) (mailto:
> > > h...@filez.com (mailto:h...@filez.com))> wrote:
> > > > >
> > > > > >
> > > > > > If you are interested I can help, I used the FST on a Hadoop
> project
> > > > > > > to implement a fast map side range join.
> > > > > >
> > > > > >
> > > > > > create JIRA item with patch attached, i will test it.
>
>