oakstream wrote
> Thanks guys!
> David,
> 
> In general and in your opinion would Lucene Spatial be the way to go to
> index hundreds of terabytes of spatial data that continually grows. 
> Mostly point data, mostly structured, though there could be polygons.
> The searches would be "within" or "contains" against a polygon.

That's a lot of data!  I don't know the upper bound on how much data a
sharded Lucene-based system can handle for spatial searches while keeping
response times in seconds.  As with most things, you should try it
yourself, because there are so many variables.

I suggest separating indexed point data from indexed polygon data, so you
can optimize for each.  For example, a point can never satisfy the
"contains" predicate, so you can skip that data set entirely for such
queries.  And for the "within" predicate, querying indexed points is
equivalent to using the "intersects" predicate, which is quite fast -- I
believe it will always be the fastest predicate.  Polygon "within" polygon
(or any non-point shape "within" any other non-point shape) is something
I'm currently working on -- give me a couple of weeks: LUCENE-4644.  A
sketch of the point-data path follows.
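
To make the point-data path concrete, here's a rough sketch against the
Lucene 4 spatial module (the field name "geo", the coordinates, and the
circle query shape are made-up placeholders, not anything prescribed):

  import com.spatial4j.core.context.SpatialContext;
  import com.spatial4j.core.shape.Shape;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
  import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;
  import org.apache.lucene.spatial.query.SpatialArgs;
  import org.apache.lucene.spatial.query.SpatialOperation;

  public class PointSpatialSketch {
    public static void main(String[] args) {
      SpatialContext ctx = SpatialContext.GEO;
      // 11 geohash levels is plenty of precision for point data.
      RecursivePrefixTreeStrategy strategy =
          new RecursivePrefixTreeStrategy(new GeohashPrefixTree(ctx, 11), "geo");

      // Indexing: a point becomes a handful of grid-cell terms.
      Document doc = new Document();
      for (Field f : strategy.createIndexableFields(ctx.makePoint(-71.06, 42.36))) {
        doc.add(f);
      }
      // ... add doc via an IndexWriter as usual ...

      // Searching: "intersects" with a query shape.  A circle stands in
      // here; real polygons need the JTS-backed SpatialContext.
      Shape queryShape = ctx.makeCircle(-71.0, 42.3, 1.0); // x=lon, y=lat, degrees
      Query query = strategy.makeQuery(
          new SpatialArgs(SpatialOperation.Intersects, queryShape));
    }
  }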

A shortcoming to be aware of when indexing non-point shapes is that the
shape is represented entirely by the gridded index.  If you want an indexed
polygon to be represented fairly precisely, it may take an unreasonable
number of indexed grid cells.  Eventually, I want to store the shape
exactly (in Lucene's DocValues structure) in addition to a coarse grid
index, so that the exact geometry can be consulted in the cases where the
index alone can't tell whether a document's shape satisfies the predicate.
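
Roughly, that second verification phase would look like this -- note this
is not existing API; storing the exact shape as WKT is just an assumption
for the sake of illustration:

  import com.spatial4j.core.context.jts.JtsSpatialContext;
  import com.spatial4j.core.shape.Shape;
  import com.spatial4j.core.shape.SpatialRelation;

  public class ExactShapeCheck {
    // Phase two for a candidate doc the coarse grid couldn't decide on:
    // parse the exact geometry (stored as WKT -- an assumption here) and
    // test the predicate precisely.
    static boolean satisfiesWithin(String storedWkt, Shape queryShape) {
      JtsSpatialContext ctx = JtsSpatialContext.GEO; // JTS-backed; handles polygons
      Shape docShape = ctx.readShape(storedWkt);
      // "within": the doc's shape must lie entirely inside the query shape.
      // (Some spatial4j versions pass the context as a 2nd arg to relate().)
      return docShape.relate(queryShape) == SpatialRelation.WITHIN;
    }
  }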


oakstream wrote
> Do you have any thoughts on using a NOSQL database (like Mongodb) or
> something else comparable.  I need response times in the seconds.  My
> thoughts are that I need some type of distributed system.  I was thinking
> about SolrCloud to solve this.  I'm fairly new to Lucene/Solr.  Most of
> the data is currently in HDFS/HBASE.  
> 
> I've investigated sharding Oracle and Postgres databases but this just
> doesn't seem like the ideal solution and since all the data already exists
> in HDFS, I'd like to build a solution that works on top of it but
> "real-time" or as "near" as I can get.  
> 
> Anyways, I've read some of your work in the past and appreciate your
> input.  
> I don't mind putting in some development work, just not sure the right
> approach. 
> 
> Thanks for your time. I appreciate it!

Hundreds of terabytes of data precludes relational databases.  I'm not
sure how MongoDB would fare.  And presumably HBase doesn't have spatial
support, or you wouldn't be looking elsewhere (here).  Based on other
conversations I'm having about how spatial searches could work in Accumulo,
I don't think it will be able to match Lucene's performance.  In Lucene I'm
able to edge n-gram geohashes (e.g., Boston's geohash: DRT2Y, DRT2, DRT,
DR, D) and the intersection algorithm does wonders with this (illustrated
below).  But with Accumulo (or HBase or Cassandra) I believe I'd have to
stick with the full-length geohash as the sorted key, which means query
shapes covering lots of indexed data will take longer because they have to
fully iterate over the underlying data.  If you're at least somewhat
familiar with why Lucene's trie-based numeric fields are so much faster
than what preceded them, it's for the same reason that Lucene 4's
PrefixTree-based fields are fast (PrefixTree is a synonym of trie).  I
don't think the underlying approach is even possible in the big-table
NoSQL systems without some kind of inverted index.  If anyone knows
otherwise, please enlighten me!
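
In case the edge n-gram trick isn't clear, it's nothing more than indexing
every prefix of a geohash as its own term, so a query region covering a
big cell like DR matches one short term instead of iterating over every
full-length geohash beneath it:

  import java.util.ArrayList;
  import java.util.List;

  public class GeohashEdgeNGrams {
    // Every prefix of a geohash, longest to shortest:
    // "DRT2Y" -> DRT2Y, DRT2, DRT, DR, D
    static List<String> edgeNGrams(String geohash) {
      List<String> terms = new ArrayList<String>();
      for (int len = geohash.length(); len >= 1; len--) {
        terms.add(geohash.substring(0, len));
      }
      return terms;
    }
  }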

~ David Smiley



-----
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book