Re: Cassandra in memory key index

2012-06-08 Thread Radim Kolar



If you are interested I can help, I used the FST on a Hadoop project
to implement a fast map side range join.

create JIRA item with patch attached, i will test it.


Re: Cassandra in memory key index

2012-06-08 Thread Jason Rutherglen
The Cassandra integration is probably beyond the time I have available.  If
the locations in the code that need to be rewritten to use the FST are
known, and a patch simply 'plugs-in' the FST, that would be much easier.
Eg, I don't know how Cassandra stores the current key index for example...

Basically I can write FST serializing, deserializing, and key lookup code
fairly easy by basing it on Lucene's terms dict.

On Fri, Jun 8, 2012 at 6:00 AM, Radim Kolar  wrote:

>
>  If you are interested I can help, I used the FST on a Hadoop project
>> to implement a fast map side range join.
>>
> create JIRA item with patch attached, i will test it.
>


Re: Cassandra in memory key index

2012-06-08 Thread Jason Rutherglen
Ok looks like the IndexSummary encapsulates everything, I can start with
hacking that.

On Fri, Jun 8, 2012 at 11:50 AM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> The Cassandra integration is probably beyond the time I have available.
> If the locations in the code that need to be rewritten to use the FST are
> known, and a patch simply 'plugs-in' the FST, that would be much easier.
> Eg, I don't know how Cassandra stores the current key index for example...
>
> Basically I can write FST serializing, deserializing, and key lookup code
> fairly easy by basing it on Lucene's terms dict.
>
>
> On Fri, Jun 8, 2012 at 6:00 AM, Radim Kolar  wrote:
>
>>
>>  If you are interested I can help, I used the FST on a Hadoop project
>>> to implement a fast map side range join.
>>>
>> create JIRA item with patch attached, i will test it.
>>
>
>


Re: Cassandra in memory key index

2012-06-08 Thread Pavel Yaskevich
I would vote,  if possible, to compare it with y-fast trie [1] (it doesn't seem 
to be available java implementation unfortunately) by means of memory 
efficiency and lookup performance. As we use big integer tokens the main 
benefit from that trie could be O(log log M) predecessor lookup and compact 
in-memory size. 

[1] https://en.wikipedia.org/wiki/Y-fast_trie

Best Regards
-- 
Pavel Yaskevich


On Friday 8 June 2012 at 22:19, Jason Rutherglen wrote:

> Ok looks like the IndexSummary encapsulates everything, I can start with
> hacking that.
> 
> On Fri, Jun 8, 2012 at 11:50 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com (mailto:jason.rutherg...@gmail.com)> wrote:
> 
> > The Cassandra integration is probably beyond the time I have available.
> > If the locations in the code that need to be rewritten to use the FST are
> > known, and a patch simply 'plugs-in' the FST, that would be much easier.
> > Eg, I don't know how Cassandra stores the current key index for example...
> > 
> > Basically I can write FST serializing, deserializing, and key lookup code
> > fairly easy by basing it on Lucene's terms dict.
> > 
> > 
> > On Fri, Jun 8, 2012 at 6:00 AM, Radim Kolar  > (mailto:h...@filez.com)> wrote:
> > 
> > > 
> > > If you are interested I can help, I used the FST on a Hadoop project
> > > > to implement a fast map side range join.
> > > 
> > > create JIRA item with patch attached, i will test it.
> > > 
> > 
> > 
> 
> 
> 




Re: Cassandra in memory key index

2012-06-08 Thread Jason Rutherglen
Yeah that's fine, however if there isn't a Java implementation that's a lot
of extra work.  The FST approach should be a clear quick and easy win.  The
current system of in heap keys and a binary search is what the FST replaced
in Lucene.  There are plenty of references to the improvement.

On Fri, Jun 8, 2012 at 3:50 PM, Pavel Yaskevich  wrote:

> I would vote,  if possible, to compare it with y-fast trie [1] (it doesn't
> seem to be available java implementation unfortunately) by means of memory
> efficiency and lookup performance. As we use big integer tokens the main
> benefit from that trie could be O(log log M) predecessor lookup and compact
> in-memory size.
>
> [1] https://en.wikipedia.org/wiki/Y-fast_trie
>
> Best Regards
> --
> Pavel Yaskevich
>
>
> On Friday 8 June 2012 at 22:19, Jason Rutherglen wrote:
>
> > Ok looks like the IndexSummary encapsulates everything, I can start with
> > hacking that.
> >
> > On Fri, Jun 8, 2012 at 11:50 AM, Jason Rutherglen <
> > jason.rutherg...@gmail.com (mailto:jason.rutherg...@gmail.com)> wrote:
> >
> > > The Cassandra integration is probably beyond the time I have available.
> > > If the locations in the code that need to be rewritten to use the FST
> are
> > > known, and a patch simply 'plugs-in' the FST, that would be much
> easier.
> > > Eg, I don't know how Cassandra stores the current key index for
> example...
> > >
> > > Basically I can write FST serializing, deserializing, and key lookup
> code
> > > fairly easy by basing it on Lucene's terms dict.
> > >
> > >
> > > On Fri, Jun 8, 2012 at 6:00 AM, Radim Kolar  h...@filez.com)> wrote:
> > >
> > > >
> > > > If you are interested I can help, I used the FST on a Hadoop project
> > > > > to implement a fast map side range join.
> > > >
> > > > create JIRA item with patch attached, i will test it.
> > > >
> > >
> > >
> >
> >
> >
>
>
>


Re: Cassandra in memory key index

2012-06-08 Thread Pavel Yaskevich
Yeah, that is why I wrote "if possible" :) Also, does FST provide a predecessor 
lookup function, wasn't clear from the blog post?

On Friday 8 June 2012 at 22:53, Jason Rutherglen wrote:

> Yeah that's fine, however if there isn't a Java implementation that's a lot
> of extra work. The FST approach should be a clear quick and easy win. The
> current system of in heap keys and a binary search is what the FST replaced
> in Lucene. There are plenty of references to the improvement.
> 
> On Fri, Jun 8, 2012 at 3:50 PM, Pavel Yaskevich  (mailto:pove...@gmail.com)> wrote:
> 
> > I would vote, if possible, to compare it with y-fast trie [1] (it doesn't
> > seem to be available java implementation unfortunately) by means of memory
> > efficiency and lookup performance. As we use big integer tokens the main
> > benefit from that trie could be O(log log M) predecessor lookup and compact
> > in-memory size.
> > 
> > [1] https://en.wikipedia.org/wiki/Y-fast_trie
> > 
> > Best Regards
> > --
> > Pavel Yaskevich
> > 
> > 
> > On Friday 8 June 2012 at 22:19, Jason Rutherglen wrote:
> > 
> > > Ok looks like the IndexSummary encapsulates everything, I can start with
> > > hacking that.
> > > 
> > > On Fri, Jun 8, 2012 at 11:50 AM, Jason Rutherglen <
> > > jason.rutherg...@gmail.com (mailto:jason.rutherg...@gmail.com)> wrote:
> > > 
> > > > The Cassandra integration is probably beyond the time I have available.
> > > > If the locations in the code that need to be rewritten to use the FST
> > > > 
> > > 
> > > 
> > 
> > are
> > > > known, and a patch simply 'plugs-in' the FST, that would be much
> > > 
> > 
> > easier.
> > > > Eg, I don't know how Cassandra stores the current key index for
> > > 
> > 
> > example...
> > > > 
> > > > Basically I can write FST serializing, deserializing, and key lookup
> > code
> > > > fairly easy by basing it on Lucene's terms dict.
> > > > 
> > > > 
> > > > On Fri, Jun 8, 2012 at 6:00 AM, Radim Kolar  > > > (mailto:h...@filez.com) (mailto:
> > h...@filez.com (mailto:h...@filez.com))> wrote:
> > > > 
> > > > > 
> > > > > If you are interested I can help, I used the FST on a Hadoop project
> > > > > > to implement a fast map side range join.
> > > > > 
> > > > > 
> > > > > create JIRA item with patch attached, i will test it. 



Re: Cassandra in memory key index

2012-06-08 Thread Radim Kolar

Dne 8.6.2012 21:19, Jason Rutherglen napsal(a):

Ok looks like the IndexSummary encapsulates everything, I can start with
hacking that.

do memory part first. i want to test it on existing serialized index data.