Created a JIRA to spec out the design for the RDF layer: https://issues.apache.org/jira/browse/HBASE-2433. I'll post an initial design and some other ideas on it soon. Go ahead and put in whatever you have in mind.

-ak

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Mon, Apr 5, 2010 at 10:44 PM, Amandeep Khurana <[email protected]> wrote:

> Scaling up is not going to be an issue unless you demand performance (in
> terms of low latency of queries) too. Here's a paper that has some ideas on
> how we can get good performance on queries in a large scale triple store:
> *http://people.csail.mit.edu/tdanford/6830papers/weiss-hexastore.pdf*
>
> We can use indexing ideas from this paper, combined with coprocessors
> (which I'm still not sure how to leverage yet) for fast query performance.
>
> For storing a large number of RDF triples, we might not need to add much to
> HBase's data model. I'm still thinking of this idea: We could have a few
> column families (<10) and hash the predicate value to a column family.
>
> So, predicate1 can go to fam1, making fam1:predicate1, and so on.
> We could use ideas from the CRUSH paper [1] for this.
>
> Similarly, if a table is getting too big, we can have multiple tables as
> well and hash the subject value to decide the table it should be placed in.
>
> Thoughts?
>
> This gives us scale as well as the ability to do fast querying. Of course,
> as Andy mentioned, we'll have to find a subset of queries that we will
> support.
>
> [1] http://ceph.newdream.net/papers/weil-crush-sc06.pdf
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
> On Mon, Apr 5, 2010 at 7:55 PM, <[email protected]> wrote:
>
>> The priorities 1), 2) and 3) are pretty well stated. - Victor
>>
>> On 4/5/10 3:58 PM, "ext Andrew Purtell" <[email protected]> wrote:
>>
>> Just some ideas, possibly half-baked:
>>
>> > From: Amandeep Khurana
>> > Subject: Re: Using SPARQL against HBase
>> > To: [email protected]
>> > 1. We want to have a SPARQL query engine over it that can return
>> > results to queries in real time, comparable to other systems out
>> > there. And since we will have HBase as the storage layer, we want
>> > to scale well.
>>
>> Generally, I wonder if HBase may be able to trade disk space for query
>> processing time for expected common queries.
>>
>> So part of the story here could be using coprocessors (HBASE-2000) as a
>> mapping layer between the clients and the plain/simple BigTable store. For
>> example, an RDF and graph relation aware coprocessor could produce and cache
>> projections on the fly and use structure aware data placement strategies for
>> sharding -- so the table or tables exposed to the client for enabling
>> queries may be only a logical construct backed by one or more real tables
>> with far different structure, and there would be intelligence for managing
>> the construct running within the regionservers. Projections could be built
>> lazily (via interprocess BSP?), triggered by a new query or an admin action.
>> (And possibly the results could be cached with TTLs for automatic garbage
>> collection for managing the total size of the store.)
>>
>> This opens up a range of implementation options that the basic BigTable
>> architecture would not support. This is like installing a purpose-built RDF
>> store within an existing HBase+Hadoop deployment.
>>
>> > 2. We want to enable large scale processing as well,
>> > leveraging Hadoop (maybe? read about this on Cloudera's blog),
>> > and maybe something like Pregel.
>>
>> Edward, didn't you do some work implementing graph operations using BSP
>> message passing within the Hadoop framework? What were your findings?
>>
>> I think a coprocessor could implement a Pregel-like distributed graph
>> processing model internally to the region servers, using ZooKeeper
>> primitives for rendezvous.
>>
>> > These things are fluid and the first step would be to spec
>> > out features that we want to build in
>>
>> In my opinion as a potential user of such a service, the design priorities
>> should be something like:
>>
>> 1) Scale.
>>
>> 2) Real time queries.
>>
>> 3) Support a reasonable subset of possible queries over the data.
>>
>> Obviously both #1 and #2 are in tension with #3, so some expressiveness
>> could be sacrificed.
>>
>> #1 and #2 are in tension as well. It would not be desirable to provide for
>> all possible queries to be returned in real time, given that the cost of
>> that is an unsupportable space explosion.
>>
>> My rationale for the above is that a BigTable hosted RDF store could have
>> less expressiveness than alternatives but that would be acceptable if the
>> reason for considering the solution is the 'Big' in BigTable. But this is
>> not the only consideration. Also if it can be fast for the common cases even
>> with moderately sized data, it is a good alternative and may be already
>> installed as part of a larger strategy employing the Hadoop stack.
>>
>> We should consider a motivating use case, or a few of them.
>>
>> For me, I'd like a canonical source of provenance. We have a patchwork of
>> tracking systems. I'd like to be able to link the provenance for all of our
>> workflows and data, inputs and outputs at each stage. Should support fast
>> queries for weighting inputs to predictive models. Should support bulk
>> queries also, so as we assess or reassess the reliability and
>> trustworthiness of a source or service we would be able to trace all data
>> and all conclusions contributed by the entity and all that build upon it --
>> the whole cascade of it -- by following the linkage. We would be able to
>> invalidate any conclusions based on data or process we deem (at some
>> arbitrary time) flawed or untrustworthy. This "provenance store" would be a
>> new metaindex over several workflows and data islands.
>>
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.7575&rep=rep1&type=pdf
>>
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.3562&rep=rep1&type=pdf
>>
>> Deletions would be rare, if ever.
>>
>> - Andy
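
To make the placement idea from the quoted 10:44 PM message concrete (hash the predicate to one of a few column families, hash the subject to pick a table), here is a rough sketch in Python. It is only an illustration, not a proposed implementation: it uses plain modulo hashing rather than anything from the CRUSH paper, and the names `NUM_FAMILIES`, `NUM_TABLES`, `stable_bucket`, `place_triple`, and the `rdfN` / `famN` naming scheme are all made up for this example. It also does not talk to HBase at all; it just computes where a triple would land.

```python
import hashlib

# Hypothetical sizes: the thread only says "a few column families (<10)"
# and "multiple tables"; these numbers are assumptions for illustration.
NUM_FAMILIES = 8
NUM_TABLES = 4


def stable_bucket(value: str, buckets: int) -> int:
    """Map a string to a bucket index that is stable across processes.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used instead.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def place_triple(subject: str, predicate: str, obj: str) -> dict:
    """Compute the (table, row, column, value) cell for one RDF triple.

    The subject picks the table and is the row key; the predicate picks
    the column family and is the column qualifier, e.g. fam1:predicate1.
    """
    table = f"rdf{stable_bucket(subject, NUM_TABLES)}"
    family = f"fam{stable_bucket(predicate, NUM_FAMILIES)}"
    return {
        "table": table,
        "row": subject,
        "column": f"{family}:{predicate}",
        "value": obj,
    }


if __name__ == "__main__":
    print(place_triple("http://example.org/alice", "foaf:knows",
                       "http://example.org/bob"))
```

One caveat with simple modulo hashing like this: changing the number of tables or families remaps most keys, which is presumably part of why the CRUSH paper was suggested as a source of ideas.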
