Re: accumulo backend for gora

Enis Söztutar Thu, 01 Dec 2011 14:17:15 -0800

Wow, this is great news. If you upload the patch, I am sure people will be
interested in reviewing it, and we can add it to the code base.

Coming to the array storage, one of the strengths of Gora is that it
delegates the mapping to the data store, since each one has its own data
model. In HBase, and I believe in Accumulo as well, you can store arrays in
at least three ways:
 (1) serialize the array and store it in one cell
  - Adding or deleting items requires reading and reserializing the whole
array. This is perfect for small, mostly read-only arrays.
 (2) serialize each item in its own cell, with all cells sharing the same
column family and having consecutive column numbers. Like family:0 ->
arr[0], family:1 -> arr[1], ...
 (3) serialize each item as a column name within the same column family,
with empty (dummy) cell values. Like family:arr[0] -> 'dummy',
family:arr[1] -> 'dummy', ...
  - The array elements will be stored in sorted order.
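
To make the three layouts concrete, here is a minimal sketch that models one
row as a sorted map from "family:qualifier" to a cell value. It uses neither
the HBase nor the Accumulo API; the class and method names are hypothetical,
and the comma-joined string in (1) and the zero-padded index in (2) are toy
encodings assumed only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: a "row" is a sorted map of column -> value.
final class ArrayLayouts {

    // (1) The whole array is serialized into a single cell.
    // Toy serialization: comma-joined, so items must not contain ','.
    static Map<String, String> oneCell(String family, List<String> arr) {
        Map<String, String> row = new TreeMap<>();
        row.put(family + ":", String.join(",", arr));
        return row;
    }

    // (2) One cell per item; the qualifier is the array index.
    // The index is zero-padded so that string ordering matches numeric ordering.
    static Map<String, String> cellPerIndex(String family, List<String> arr) {
        Map<String, String> row = new TreeMap<>();
        for (int i = 0; i < arr.size(); i++) {
            row.put(String.format("%s:%04d", family, i), arr.get(i));
        }
        return row;
    }

    // (3) One cell per item; the qualifier is the item itself and the value
    // is an empty placeholder. Duplicates collapse and items come back sorted.
    static Map<String, String> cellPerItem(String family, List<String> arr) {
        Map<String, String> row = new TreeMap<>();
        for (String item : arr) {
            row.put(family + ":" + item, "");
        }
        return row;
    }
}
```

Note that in this sketch layout (3) behaves like a sorted set rather than a
list: insertion order and duplicate items are not preserved.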

So, the question is which to choose? It turns out that depending on how you
want to access the data and on its characteristics (read-only, size, etc.),
you should be able to choose any of them for your fields. And depending on
how you lay out the data in your storage, the semantics and/or the
performance for the use case you mentioned can change. In gora-hbase, we
have only option (2), but ideally gora-hbase and gora-accumulo should be
able to work with all three. And if you think about the semantics of
deleting an item from an array, it gets a little more involved. For example,
in gora-hbase, your use case will probably print d4,d5,d3 (since d1 and d2
will be overridden, but d3 won't be deleted). I think the correct semantics
there would be to print only d4 and d5. If you go with (3), however, I think
the correct semantics is to print d1,d2,d3,d4,d5.
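
The effect of the two puts can be simulated with a small sketch. This is not
gora-hbase code: a TreeMap stands in for the columns of row 42, the class
and method names are hypothetical, and the clearFirst flag is an assumed
stand-in for deleting the old columns before writing the new array.

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical simulation of putting two arrays under the same row key.
final class OverwriteSemantics {

    // Layout (2): the column qualifier is the array index.
    // clearFirst stands in for deleting the existing columns before the put.
    static void putByIndex(TreeMap<Integer, String> row, List<String> arr,
                           boolean clearFirst) {
        if (clearFirst) {
            row.clear();
        }
        for (int i = 0; i < arr.size(); i++) {
            row.put(i, arr.get(i)); // overwrites only the indexes being written
        }
    }

    // Layout (3): the column qualifier is the item, the value is a placeholder.
    static void putByItem(TreeMap<String, String> row, List<String> arr) {
        for (String item : arr) {
            row.put(item, "");
        }
    }
}
```

Putting [d1, d2, d3] and then [d4, d5] with putByIndex and clearFirst set to
false leaves the values d4, d5, d3 in the row; with clearFirst set to true
only d4, d5 remain. With putByItem the row ends up holding the keys d1
through d5.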

So, as I said, the "correct" semantics depends on the data model, and gora
should be flexible enough so that we can utilize different models suitable
for the job.

Thanks,
Enis

On Thu, Dec 1, 2011 at 1:07 PM, Keith Turner <[email protected]> wrote:

> I have been writing an Accumulo [1] backend for gora.  I am pretty
> far along, but not finished.  When I am finished, I plan to post a
> patch on a jira ticket.  If anyone would like to review it let me
> know.
>
> I have a question about storing arrays.  I am wondering what the
> expected behavior is given the following?
>
>
>  {
>  "type": "record",
>  "name": "Foo",
>  "namespace": "test",
>  "fields" : [
>    {"name": "data", "type": {"type": "array", "items": "string"}}
>  ]
> }
>
>
> Foo foo1 = new test.Foo();
> foo1.addToData("d1");
> foo1.addToData("d2");
> foo1.addToData("d3");
> datastore.put(42l, foo1);
>
> datastore.flush();
>
> Foo foo2 = new test.Foo();
> foo2.addToData("d4");
> foo2.addToData("d5");
> datastore.put(42l, foo2);
>
> datastore.flush();
>
> Foo foo3 = datastore.get(42l);
> System.out.println(foo3);  // what would you expect this to print for
>                            // the data array?  d4,d5?
>
>
> [1]: http://incubator.apache.org/accumulo
>
