Wow, this is great news. If you upload the patch, I am sure there will be interest in reviewing it, and we can add it to the code base.
Coming to the array storage, one of the strengths of Gora is that it delegates the mapping to the data store, since every one has its own data model. In HBase, and I believe in Accumulo as well, you can store arrays in at least three ways:

(1) Serialize the whole array and store it in one cell. Adding or deleting an item means reading and reserializing the whole array. This is perfect for small, mostly read-only arrays.

(2) Serialize each item into its own cell, sharing the same column family and using consecutive column numbers, like family:0 -> arr[0], family:1 -> arr[1], ...

(3) Store each item as a column name within the same column family, with empty cells, like family:arr[0] -> 'dummy', family:arr[1] -> 'dummy', ... The array elements will be stored in sorted order.

So, the question is what to choose? It turns out that depending on how you want to access the data and on the characteristics of the data (read-only or not, size, etc.), you should be able to choose any of them for your fields. And depending on how you lay out the data in your storage, the semantics and/or the performance for the use case you mentioned can change. In gora-hbase, we have only option (2), but ideally gora-hbase and gora-accumulo should be able to work with all three.

And if you think about the semantics of deleting an item from an array, it gets a little bit more involved. For example, in gora-hbase your use case will probably print d4,d5,d3 (since d1 and d2 will be overwritten, but d3 won't be deleted). I think the correct semantics there should be to print only d4 and d5. If you go with (3), however, I think the correct semantics is to print d1,d2,d3,d4,d5.

So, as I said, the "correct" semantics depends on the data model, and Gora should be flexible enough that we can use different models suitable for the job.

Thanks,
Enis

On Thu, Dec 1, 2011 at 1:07 PM, Keith Turner <[email protected]> wrote:
> I have been writing an Accumulo [1] backend for gora. I am pretty
> far along, but not finished. When I am finished, I plan to post a
> patch on a jira ticket. If anyone would like to review it let me
> know.
>
> I have a question about storing arrays. I am wondering what the
> expected behavior is given the following?
>
> {
>   "type": "record",
>   "name": "Foo",
>   "namespace": "test",
>   "fields" : [
>     {"name": "data", "type": {"type": "array", "items": "string"}}
>   ]
> }
>
> Foo foo1 = new test.Foo();
> foo1.addToData("d1");
> foo1.addToData("d2");
> foo1.addToData("d3");
> datastore.put(42l, foo1);
>
> datastore.flush();
>
> Foo foo2 = new test.Foo();
> foo2.addToData("d4");
> foo2.addToData("d5");
> datastore.put(42l, foo2);
>
> datastore.flush();
>
> Foo foo3 = datastore.get(42l);
> System.out.println(foo3); // what would you expect this to print for
>                           // the data array? d4,d5?
>
> [1]: http://incubator.apache.org/accumulo
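
P.S. To make the three layouts concrete, here is a rough sketch in plain Java that replays the two puts from the quoted example against an in-memory sorted map standing in for one row of a column family. None of this is Gora, HBase, or Accumulo API: the class and method names are made up for illustration, the comma join is only a stand-in for real serialization, and the zero-padded index in layout (2) is my assumption so that sorting by column name matches numeric order. The "replacing" variant is one possible way to get the d4,d5 result with layout (2), by dropping stale cells before writing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap stands in for one row of one column family
// (column name -> cell value, kept sorted by column name).
class ArrayLayoutSketch {

    static SortedMap<String, String> newRow() {
        return new TreeMap<>();
    }

    // (1) Whole array in a single cell. The comma join is a stand-in for real
    // serialization and assumes items contain no commas.
    static void putWholeArray(SortedMap<String, String> row, List<String> items) {
        row.put("data", String.join(",", items));
    }

    static List<String> getWholeArray(SortedMap<String, String> row) {
        String cell = row.get("data");
        if (cell == null || cell.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(cell.split(",")));
    }

    // (2) One cell per item, column name = zero-padded index. Existing cells
    // beyond the new array's length are left untouched.
    static void putIndexed(SortedMap<String, String> row, List<String> items) {
        for (int i = 0; i < items.size(); i++) {
            row.put(String.format("%08d", i), items.get(i));
        }
    }

    // (2) again, but stale cells are dropped first. In a real store this
    // would be a delete of the old columns rather than a map clear.
    static void putIndexedReplacing(SortedMap<String, String> row, List<String> items) {
        row.clear();
        putIndexed(row, items);
    }

    static List<String> getIndexed(SortedMap<String, String> row) {
        return new ArrayList<>(row.values());
    }

    // (3) One column per item, column name = the item itself, empty cell value.
    static void putAsColumnNames(SortedMap<String, String> row, List<String> items) {
        for (String item : items) {
            row.put(item, "");
        }
    }

    static List<String> getAsColumnNames(SortedMap<String, String> row) {
        return new ArrayList<>(row.keySet());
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("d1", "d2", "d3");
        List<String> second = Arrays.asList("d4", "d5");

        SortedMap<String, String> r1 = newRow();
        putWholeArray(r1, first);
        putWholeArray(r1, second);
        System.out.println("(1) " + getWholeArray(r1));          // (1) [d4, d5]

        SortedMap<String, String> r2 = newRow();
        putIndexed(r2, first);
        putIndexed(r2, second);
        System.out.println("(2) " + getIndexed(r2));             // (2) [d4, d5, d3]

        SortedMap<String, String> r2b = newRow();
        putIndexedReplacing(r2b, first);
        putIndexedReplacing(r2b, second);
        System.out.println("(2, replacing) " + getIndexed(r2b)); // (2, replacing) [d4, d5]

        SortedMap<String, String> r3 = newRow();
        putAsColumnNames(r3, first);
        putAsColumnNames(r3, second);
        System.out.println("(3) " + getAsColumnNames(r3));       // (3) [d1, d2, d3, d4, d5]
    }
}
```

The point of the sketch is only that the same sequence of puts gives three different answers depending on the layout, which is why I would rather let each store (and ideally each field) pick its layout than fix one behavior in Gora itself.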
