Re: Using SPARQL against HBase

Amandeep Khurana Sat, 10 Apr 2010 16:38:17 -0700

Created a JIRA to spec out the design for the RDF layer:
https://issues.apache.org/jira/browse/HBASE-2433. I'll post an initial
design and some other ideas on it soon. Go ahead and put in whatever you
have in mind.


-ak


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Mon, Apr 5, 2010 at 10:44 PM, Amandeep Khurana <[email protected]> wrote:

> Scaling up is not going to be an issue unless you demand performance (in
> terms of low latency of queries) too. Here's a paper that has some ideas on
> how we can get good performance on queries in a large scale triple store:
> http://people.csail.mit.edu/tdanford/6830papers/weiss-hexastore.pdf
>
> We can use indexing ideas from this paper, combined with coprocessors
> (which I'm still not sure how to leverage yet) for fast query performance.
>
> For storing a large number of RDF triples, we might not need to add much to
> HBase's data model. I'm still thinking of this idea: We could have a few
> column families (<10) and hash the predicate value to a column family.
>
> So predicate1 can go to fam1, making fam1:predicate1, and so on.
> We could use ideas from the CRUSH paper [1] for this.
>
> Similarly, if a table is getting too big, we can have multiple tables as
> well and hash the subject value to decide the table it should be placed in.
>
> Thoughts?
>
> This gives us scale as well as the ability to do fast querying. Of course,
> as Andy mentioned, we'll have to find a subset of queries that we will
> support.
>
> [1] http://ceph.newdream.net/papers/weil-crush-sc06.pdf
>
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Mon, Apr 5, 2010 at 7:55 PM, <[email protected]> wrote:
>
>> The priorities 1), 2) and 3) are pretty well stated. - Victor
>>
>>
>> On 4/5/10 3:58 PM, "ext Andrew Purtell" <[email protected]> wrote:
>>
>> Just some ideas, possibly half-baked:
>>
>> > From: Amandeep Khurana
>> > Subject: Re: Using SPARQL against HBase
>> > To: [email protected]
>> > 1. We want to have a SPARQL query engine over it that can return
>> > results to queries in real time, comparable to other systems out
>> > there. And since we will have HBase as the storage layer, we want
>> > to scale well.
>>
>> Generally, I wonder if HBase may be able to trade disk space for query
>> processing time for expected common queries.
>>
>> So part of the story here could be using coprocessors (HBASE-2000) as a
>> mapping layer between the clients and the plain/simple BigTable store. For
>> example, an RDF and graph relation aware coprocessor could produce and cache
>> projections on the fly and use structure aware data placement strategies for
>> sharding -- so the table or tables exposed to the client for enabling
>> queries may be only a logical construct backed by one or more real tables
>> with far different structure, and there would be intelligence for managing
>> the construct running within the regionservers. Projections could be built
>> lazily (via interprocess BSP?), triggered by a new query or an admin action.
>> (And possibly the results could be cached with TTLs for automatic garbage
>> collection for managing the total size of the store.)
>>
>> This opens up a range of implementation options that the basic BigTable
>> architecture would not support. This is like installing a purpose-built RDF
>> store within an existing HBase+Hadoop deployment.
>>
>> > 2. We want to enable large scale processing as well,
>> > leveraging Hadoop (maybe? read about this on Cloudera's blog),
>> > and maybe something like Pregel.
>>
>> Edward, didn't you do some work implementing graph operations using BSP
>> message passing within the Hadoop framework? What were your findings?
>>
>> I think a coprocessor could implement a Pregel-like distributed graph
>> processing model internally to the region servers, using ZooKeeper
>> primitives for rendezvous.
>>
>> > These things are fluid and the first step would be to spec
>> > out features that we want to build in
>>
>> In my opinion as a potential user of such a service, the design priorities
>> should be something like:
>>
>> 1) Scale.
>>
>> 2) Real time queries.
>>
>> 3) Support a reasonable subset of possible queries over the data.
>>
>> Obviously both #1 and #2 are in tension with #3, so some expressiveness
>> could be sacrificed.
>>
>> #1 and #2 are in tension as well. It would not be desirable to make all
>> possible queries return in real time, given that the cost would be an
>> unsupportable space explosion.
>>
>> My rationale for the above is a BigTable hosted RDF store could have less
>> expressiveness than alternatives but that would be acceptable if the reason
>> for considering the solution is the 'Big' in BigTable. But this is not the
>> only consideration. Also if it can be fast for the common cases even with
>> moderately sized data, it is a good alternative and may be already installed
>> as part of a larger strategy employing the Hadoop stack.
>>
>> We should consider a motivating use case, or a few of them.
>>
>> For me, I'd like a canonical source of provenance. We have a patchwork of
>> tracking systems. I'd like to be able to link the provenance for all of our
>> workflows and data, inputs and outputs at each stage. Should support fast
>> queries for weighting inputs to predictive models. Should support bulk
>> queries also, so as we assess or reassess the reliability and
>> trustworthiness of a source or service we would be able to trace all data
>> and all conclusions contributed by the entity and all that build upon it --
>> the whole cascade of it -- by following the linkage. We would be able to
>> invalidate any conclusions based on data or process we deem (at some
>> arbitrary time) flawed or untrustworthy. This "provenance store" would be a
>> new metaindex over several workflows and data islands.
>>
>>
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.7575&rep=rep1&type=pdf
>>
>>
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.3562&rep=rep1&type=pdf
>>
>> Deletions would be rare, if ever.
>>
>>   - Andy
>>
>>
>>
>>
>>
>>
>>
>
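
To make the indexing idea from the Hexastore paper linked in the thread a bit
more concrete, here is a minimal sketch of how the six orderings of a triple
could be turned into row keys, and how a query pattern could pick an ordering
to prefix-scan. Nothing here is an HBase API: the separator byte, the function
names and the key layout are all assumptions made for illustration.

```python
from itertools import permutations

# Assumption: a separator byte that never occurs inside a term.
SEP = "\x00"


def index_keys(s, p, o):
    """Return {index_name: row_key} for all six orderings of one triple."""
    terms = {"s": s, "p": p, "o": o}
    return {
        "".join(order): SEP.join(terms[k] for k in order)
        for order in permutations("spo")
    }


def choose_index(s=None, p=None, o=None):
    """Pick an ordering whose leading terms are the bound ones.

    Returns (index_name, scan_prefix). Bound terms come first, so a
    prefix scan on that index touches only the matching rows.
    """
    terms = {"s": s, "p": p, "o": o}
    bound = [k for k in "spo" if terms[k] is not None]
    free = [k for k in "spo" if terms[k] is None]
    prefix = SEP.join(terms[k] for k in bound)
    # A trailing separator stops "knows" from also matching "knowsX".
    if bound and free:
        prefix += SEP
    return "".join(bound + free), prefix


if __name__ == "__main__":
    keys = index_keys("alice", "knows", "bob")
    name, prefix = choose_index(p="knows", o="bob")
    print(name, keys[name].startswith(prefix))
```

This is the disk-space-for-query-time trade discussed in the thread: every
triple is written six times so that any pattern with bound terms becomes a
prefix scan.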

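The placement scheme quoted above (hash the predicate to one of a few column
families, hash the subject to pick a table) can be sketched roughly as
follows. This uses plain modulo hashing, not CRUSH, and the family count,
table count and naming pattern are assumptions; the thread only says "a few
column families (<10)".

```python
import hashlib

NUM_FAMILIES = 8  # assumption: the thread only says fewer than 10
NUM_TABLES = 4    # assumption: illustrative number of tables


def stable_hash(value):
    """Hash a string the same way in every process (unlike built-in hash())."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def family_for(predicate):
    """Map a predicate to one of fam1..famN."""
    return "fam%d" % (stable_hash(predicate) % NUM_FAMILIES + 1)


def table_for(subject):
    """Map a subject to one of the triple tables (hypothetical names)."""
    return "triples_%d" % (stable_hash(subject) % NUM_TABLES)


def place(subject, predicate, obj):
    """Return (table, row_key, column, value) for one triple."""
    column = "%s:%s" % (family_for(predicate), predicate)
    return table_for(subject), subject, column, obj


if __name__ == "__main__":
    print(place("alice", "knows", "bob"))
```

One consequence of modulo hashing is that changing the number of families or
tables remaps most keys, which is presumably part of why the CRUSH paper is
suggested as a source of ideas.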