oakstream wrote
> Thanks guys!
> David,
>
> In general, and in your opinion, would Lucene Spatial be the way to go to
> index hundreds of terabytes of spatial data that continually grows?
> Mostly point data, mostly structured; however, it could include polygons.
> The searches would be "within" or "contains" with a polygon.
That's a lot of data! I don't know what the upper bound is on how much data a sharded Lucene-based system can handle for spatial searches given "response times in seconds". As with most things, people should try it themselves because there are so many variables.

I suggest separating indexed point data from indexed polygon data, so you can optimize for both. For example, a point can never satisfy the "contains" predicate, so skip that data set for such queries. And for the "within" predicate, indexed points are equivalent to using the "intersects" predicate, which is quite fast, and I believe will always be the fastest predicate. Polygon "within" polygon (or any non-point shape "within" any other non-point shape) is something I'm currently working on -- give me a couple of weeks: LUCENE-4644.

A shortcoming to be aware of when indexing non-point shapes is that the shape is represented entirely by the gridded index. If you want an indexed polygon represented fairly precisely, it may take an unreasonable number of indexed grid cells. Eventually, I want to store the shape exactly (in Lucene's DocValues structure) in addition to a coarse grid index, so the exact geometry can be consulted in the particular cases where the index alone isn't sure whether a document's shape satisfies the predicate.

oakstream wrote
> Do you have any thoughts on using a NoSQL database (like MongoDB) or
> something else comparable? I need response times in seconds. My
> thoughts are that I need some type of distributed system. I was thinking
> about SolrCloud to solve this. I'm fairly new to Lucene/Solr. Most of
> the data is currently in HDFS/HBase.
>
> I've investigated sharding Oracle and Postgres databases, but this just
> doesn't seem like the ideal solution, and since all the data already exists
> in HDFS, I'd like to build a solution that works on top of it but
> "real-time" or as "near" as I can get.
>
> Anyways, I've read some of your work in the past and appreciate your
> input. I don't mind putting in some development work, I'm just not sure of
> the right approach.
>
> Thanks for your time. I appreciate it!

Hundreds of terabytes of data precludes relational databases. I'm not sure how MongoDB would fare. And presumably HBase doesn't have spatial support, or you wouldn't be looking elsewhere (here).

Based on other conversations I'm having about how spatial searches could work in Accumulo, I don't think it will be able to match Lucene's performance. In Lucene I'm able to edge n-gram geohashes (e.g., Boston's geohash DRT2Y is indexed as DRT2Y, DRT2, DRT, DR, D), and the intersection algorithm is able to do wonders with this. But with Accumulo (or HBase or Cassandra) I believe I'd have to stick with the full-length geohash as the sorted key, which means query shapes covering lots of indexed data will take longer, because they have to fully iterate over the underlying data. If anyone reading this is at least somewhat familiar with why Lucene's trie-based numeric fields are so much faster than how things worked before, it's for the same reason that Lucene 4's PrefixTree-based fields ("PrefixTree" being a synonym of trie) are fast. I don't think the underlying approach is even possible in the big-table NoSQL systems without some kind of inverted index. If anyone knows otherwise, please enlighten me!
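To make the edge n-gram idea concrete, here is a minimal hand-rolled sketch (not Lucene's actual PrefixTreeStrategy code; the class and method names are mine) showing the terms a single point contributes to the index. Each prefix represents a progressively coarser grid cell containing the point:

    import java.util.ArrayList;
    import java.util.List;

    public class GeohashEdgeNgrams {

        /** Returns the geohash plus all of its prefixes, longest first. */
        static List<String> edgeNgrams(String geohash) {
            List<String> terms = new ArrayList<>();
            for (int len = geohash.length(); len >= 1; len--) {
                terms.add(geohash.substring(0, len));
            }
            return terms;
        }

        public static void main(String[] args) {
            // A point in Boston indexes 5 terms instead of 1:
            // prints [drt2y, drt2, drt, dr, d]
            System.out.println(edgeNgrams("drt2y"));
        }
    }

The payoff is at query time: a query shape covering a large area can match one short prefix term (e.g. "dr") in the inverted index, instead of having to scan the entire range of full-length geohash keys it contains, which is what a sorted-key store like Accumulo or HBase is forced to do.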
~ David Smiley
-----
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book