Order preserving partitioning strategy

2010-08-22 Thread Hien. To Trong
Hi,
I am developing a system with some features like cassandra.
I want to add order preserving partitioning strategy, but I don't know how to 
implement it.

In cassandra paper - Cassandra - A Decentralized Structured Storage System
"Cassandra partitions data across the cluster using consistent hashing but uses 
an order pre-
serving hash function (OPHF) to do so"

I wonder:

1. Cassandra still use a hash function (the other strategy is random 
partitioner) for OPP? 
If so, what is the algorithm of OPHF? is it a type of minimal perfect hash 
function (MPHF)?

I already read some papers about algorithms for MPHF which preserve the order 
of hash value. However, 
the size of key space equals and hash value space are equal and much more 
smaller than the size of key space 
(may be userid or usertaskid) in our application. How can I deal with that or I 
went on the wrong track?

2. My system is simple. I have some servers and I use Berkeley DB to store 
Key/Value (our data model is simple). Is OPP strategy useful 
when I don't have data model like cassandra? (column family for example).

Thanks so much.

Re: Order preserving partitioning strategy

2010-08-23 Thread Hien. To Trong
Hi,
OrderPreservingPartitioner is efficient range queries but can cause unevently 
distributed data.
Does anyone has an idea of a HybridPartitioner which takes advantages of both 
RandomPartitioner and OPP, 
or at least a partitioner trade off between them.

Load balancing - Order preserving partition - New Keys prob.

2010-09-16 Thread Hien. To Trong
Hi all,
Thanks you for your comments.
I already read some papers about load balancing in P2P which some of them allow 
range query.
During this time, I find another problem: "NEW KEYS"

Karger - Simple efficient load balancing algorithms for p2p systems. (support 
OPP)
John Byers - Simple load balancing for distributed hash table.
Third: Ananth Rao - Load balancing in structured P2P systems.
I think the first paper is the best to get the background.

Prefix Hash Tree: An Indexing Data Structure over Distributed Hash Tables
and Range queries over DHTs
These paper use PHT to allow range query.

James Aspnes - Skip Graphs (most recent as I known)
and Distributed Balanced Tables, Not making a hash of it All.

All of these papers deal with load balancing and range query probs by creating 
schemes or strategies 
based on only "load - number of key on each nodes", not care about new keys and 
highly-accessed keys. 
HOWEVER, in fact, "the most recent keys are likely accessed more frequently".

I suppose, we have 400.000 "NEW KEYS" in 2 recent days (are likely accessed 
more frequently). --> A scheme: these new keys are uniformly partitioned and 
divided into some successive nodes. For example, there are 10 node (N1...N10), 
keys in day 1 will be put into node_1_2, keys in day 2 will be put into 
node_3_4 and so on... But, the prob is that number of keys and nodes 
increase/decrease day by day (not fixed) --> the number of node used to store 
keys in each day may increase/decrease (1 or 3 node for example).

Does any one have any ideas or know papers to deal with this prob (new keys and 
highly-accessed keys)?

Thanks a lot.