Hi all,

I've just got my geographic clustering component working (somewhat). I've attached a sample resultset to this mail. It seems to work pretty well and it's pretty fast. I have one issue I need help with concerning the API though. At the moment my Hilbert field is a Sortable Integer, and I do the following call to get the count for a specific cluster:

Query rangeQ = new TermRangeQuery("geo_hilbert", lowI, highI, true, true);
searcher.numDocs(rangeQ, docs);

But I'd like to further reduce the DocSet by the longitude and latitude bounds given in the geocluster arguments (swlat, swlng, nelat and nelng), but only for the purposes of clustering. I don't want to simply add fq arguments to the query, because my non-geocluster results (like facet counts and numFound) should not be affected by the selected range. So how would I achieve the effect of filter queries (including the awesome caching) by manipulating either rangeQ or docs? Since the snippet above is called multiple times with a different rangeQ but the same (filtered) DocSet, I guess manipulating docs would be faster.

Regards,

gwk

gwk wrote:
Hi Joe,

Thanks for the link, I'll check it out. I'm not sure it'll help in my situation though, since the clustering has to happen at runtime due to faceted browsing (unless I'm mistaken about what the preprocessing does).

More on my progress though: I thought some more about using Hilbert curve mapping and it seems really suited for what I want. I've just added a Hilbert field to my schema (Trie Integer field) with latitude and longitude at 15 bits precision each (I didn't use 16 bits, to avoid the sign bit), so I have a 30-bit number in said field. Getting facet counts for 0 to (2^30 - 1) should get me the entire map, while getting counts for 0 to (2^28 - 1), 2^28 to (2^29 - 1), 2^29 to (2^29 + 2^28 - 1) and (2^29 + 2^28) to (2^30 - 1) should give me counts for four equal quadrants, all the way down to 0 to 3, 4 to 7, 8 to 11 ... (2^30 - 4) to (2^30 - 1), and of course faceting on every separate term. Of course, if you're zoomed in far enough to need such fine-grained clustering you'll be looking at a small portion of the map, so only a part of the whole range should be counted, but that should be doable by calculating the Hilbert numbers for the lower and upper bounds.
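To make the range arithmetic above concrete, here is a minimal sketch in plain Java (no Solr dependency). The class and method names (HilbertCells, quantize, hilbertIndex, cellRange) are made up for illustration, and the linear quantization of a coordinate to 15 bits is an assumption, since the mail doesn't say how the coordinates are scaled. hilbertIndex is the usual iterative (x, y)-to-distance conversion for a Hilbert curve. The "precision" here simply means the number of quadtree levels (1 = four quadrants), which is not necessarily what geocluster.precision means.

```java
final class HilbertCells {
    /** Assumed linear scaling of value in [min, max] to an integer in [0, 2^bits - 1]. */
    static int quantize(double value, double min, double max, int bits) {
        int n = 1 << bits;
        int q = (int) ((value - min) / (max - min) * n);
        return Math.max(0, Math.min(n - 1, q)); // clamp so value == max stays in range
    }

    /** Hilbert index of grid cell (x, y) on a 2^bits by 2^bits grid. */
    static int hilbertIndex(int bits, int x, int y) {
        int n = 1 << bits;
        int d = 0;
        for (int s = n >> 1; s > 0; s >>= 1) {
            int rx = (x & s) != 0 ? 1 : 0;
            int ry = (y & s) != 0 ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            // Rotate/flip the quadrant so the next level sees a canonical orientation.
            if (ry == 0) {
                if (rx == 1) {
                    x = n - 1 - x;
                    y = n - 1 - y;
                }
                int t = x; x = y; y = t;
            }
        }
        return d;
    }

    /** Inclusive [low, high] Hilbert range of cell number `cell` after `precision` subdivisions. */
    static int[] cellRange(int bits, int precision, int cell) {
        int size = 1 << (2 * (bits - precision)); // 4^(bits - precision) values per cell
        int low = cell * size;
        return new int[] { low, low + size - 1 };
    }
}
```

With bits = 15, cellRange(15, 1, k) for k = 0..3 gives the four quadrant ranges listed above, and cellRange(15, 14, k) gives 0 to 3, 4 to 7, and so on. Each pair could then be used as the lower and upper bound of a range query.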

The only problem is the location of the clusters: if I use this method I'll only have the Hilbert number and the number of items in that part of what is essentially a quadtree. But I suppose I can calculate the facet counts for one precision finer than the requested precision and use a weighted average of the four parts of the cluster. I'll have to see if that is accurate enough.
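The weighted average could look roughly like the sketch below. It assumes you already have the centre coordinates and the facet count of each of the four sub-cells; the class and method names are hypothetical. Note that plainly averaging longitudes like this would go wrong for a cell that straddles the 180° meridian.

```java
final class ClusterCentroid {
    /**
     * Count-weighted mean of the sub-cell centres.
     * Returns {lat, lng}, or null when every count is zero (empty cluster).
     */
    static double[] weightedCenter(double[] lats, double[] lngs, int[] counts) {
        long total = 0;
        double latSum = 0.0;
        double lngSum = 0.0;
        for (int i = 0; i < counts.length; i++) {
            total += counts[i];
            latSum += lats[i] * counts[i];
            lngSum += lngs[i] * counts[i];
        }
        if (total == 0) {
            return null;
        }
        return new double[] { latSum / total, lngSum / total };
    }
}
```

A sub-cell holding most of the documents pulls the marker towards its own centre, so the marker lands nearer to where the documents actually are than the geometric centre of the parent cell would.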

Hopefully I'll have the time to complete this today or tomorrow. I'll report back if it has worked.

Regards,

gwk

Joe Calderon wrote:
there are clustering libraries like
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/, that have
bindings to perl/python, you can preprocess your results and create
clusters for each zoom level

On Tue, Sep 8, 2009 at 8:08 AM, gwk<g...@eyefi.nl> wrote:
Hi,

I just completed a simple proof-of-concept clusterer component which
naively clusters with a specified bounding box around each position,
similar to what the JavaScript MarkerClusterer does. It's currently very
slow as I loop over the entire docset and request the longitude and
latitude of each document (not to mention that my unfamiliarity with
Lucene/Solr isn't helping the implementation's performance any; most code
is copied from grep-ing the Solr source). Clustering a set of about
80,000 documents takes about 5-6 seconds. I'm currently looking into
storing the Hilbert curve mapping in Solr and clustering using facet
counts on numerical ranges of that mapping, but I'm not sure it will pan out.
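For readers who haven't seen this style of clustering, a naive bounding-box clusterer in the spirit described above might look like the sketch below. This is not the actual component, only an illustration under assumptions: points are plain {lat, lng} arrays, a cluster's centre is fixed at the first point that created it, and NaiveBoxClusterer, Cluster and halfSize are invented names.

```java
import java.util.ArrayList;
import java.util.List;

final class NaiveBoxClusterer {
    static final class Cluster {
        final double lat;
        final double lng;
        int count = 1;

        Cluster(double lat, double lng) {
            this.lat = lat;
            this.lng = lng;
        }
    }

    /** Puts each point in the first cluster whose box (centre +/- halfSize degrees) contains it. */
    static List<Cluster> cluster(double[][] points, double halfSize) {
        List<Cluster> clusters = new ArrayList<>();
        for (double[] p : points) {
            Cluster hit = null;
            for (Cluster c : clusters) {
                if (Math.abs(p[0] - c.lat) <= halfSize && Math.abs(p[1] - c.lng) <= halfSize) {
                    hit = c;
                    break;
                }
            }
            if (hit != null) {
                hit.count++;
            } else {
                clusters.add(new Cluster(p[0], p[1])); // the point starts a new cluster
            }
        }
        return clusters;
    }
}
```

Every point is compared against every existing cluster, on top of having to load each document's coordinates, which is consistent with this approach being slow on large result sets.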

Regards,

gwk

Grant Ingersoll wrote:
Not directly related to geo clustering, but
http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable
interface to clustering implementations.  It currently has Carrot2
implemented, but the APIs are marked as experimental. I would definitely
be interested in hearing your experience with implementing your clustering
algorithm in it.

-Grant

On Sep 8, 2009, at 4:00 AM, gwk wrote:

Hi,

I'm working on a search-on-map interface for our website. I've created a
little proof of concept which uses the MarkerClusterer
(http://code.google.com/p/gmaps-utility-library-dev/), which clusters the
markers nicely. But because sending tens of thousands of markers over Ajax
is not quite as fast as I would like it to be, I'd prefer to do the
clustering on the server side. I've considered a few options, like storing
the Morton order and throwing away precision to cluster, which assigns all
locations to a grid position, or simply clustering based on
country/region/city depending on zoom level by adding latitude and
longitude fields for each zoom level (so that for smaller countries you
have to be zoomed in further to get the next level of clustering).
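As an illustration of the Morton-order option, a rough sketch follows. It assumes the coordinates have already been scaled to non-negative integers of `bits` bits each; MortonGrid, interleave and clusterKey are hypothetical names. Interleaving the bits of x and y gives the Morton (Z-order) code, and dropping the low-order bit pairs maps every location in the same grid cell to the same key.

```java
final class MortonGrid {
    /** Interleaves the low `bits` bits of x (even positions) and y (odd positions). */
    static int interleave(int x, int y, int bits) {
        int code = 0;
        for (int i = 0; i < bits; i++) {
            code |= ((x >> i) & 1) << (2 * i);
            code |= ((y >> i) & 1) << (2 * i + 1);
        }
        return code;
    }

    /** Keeps only the top `precision` bit pairs, i.e. the grid cell at that zoom level. */
    static int clusterKey(int code, int bits, int precision) {
        return code >>> (2 * (bits - precision));
    }
}
```

Grouping or faceting on clusterKey for a given precision then yields one count per grid cell; a higher precision gives smaller cells.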

I was wondering if anybody else has worked on something similar and if so
what their solutions are.

Regards,

gwk
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search




<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">3</int>
 <lst name="params">
  <str name="indent">on</str>
  <str name="geocluster.swlat">36.53</str>
  <str name="rows">0</str>
  <str name="version">2.2</str>
  <str name="omitHeader">false</str>
  <str name="geocluster.nelat">38.07</str>
  <str name="echoParams">EXPLICIT</str>
  <str name="start">0</str>
  <str name="q">*:*</str>
  <str name="geocluster.precision">9</str>
  <str name="geocluster.swlng">-9.8</str>
  <str name="geocluster.nelng">-6.72</str>
  <str name="geocluster">true</str>
  <str name="fq">geo_hilbert:[* TO *]</str>
 </lst>
</lst>
<result name="response" numFound="93307" start="0"/>
<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields"/>
 <lst name="facet_dates"/>
 <lst name="facet_numbers"/>
</lst>
<lst name="geoclusters">
 <lst name="cluster">
  <double name="latitude">37.769512009033484</double>
  <double name="longitude">-8.61354411450543</double>
  <int name="count">25</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.06696017358364</double>
  <double name="longitude">-8.07416425385468</double>
  <int name="count">3583</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.176887001393936</double>
  <double name="longitude">-7.549981735214701</double>
  <int name="count">1153</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.17886898403883</double>
  <double name="longitude">-6.855677968688013</double>
  <int name="count">54</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.35465559862057</double>
  <double name="longitude">-6.855677968688013</double>
  <int name="count">8</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.8820154423658</double>
  <double name="longitude">-6.855677968688013</double>
  <int name="count">1</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.823419904171885</double>
  <double name="longitude">-7.558824427014996</double>
  <int name="count">3</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.46247138889737</double>
  <double name="longitude">-7.427570421460621</double>
  <int name="count">75</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.44254890591144</double>
  <double name="longitude">-7.910397656178475</double>
  <int name="count">2</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.71721549119542</double>
  <double name="longitude">-7.976317636646627</double>
  <int name="count">16</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">38.05780205694754</double>
  <double name="longitude">-7.93969542527543</double>
  <int name="count">12</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">38.05780205694754</double>
  <double name="longitude">-8.61354411450543</double>
  <int name="count">114</int>
 </lst>
</lst>
</response>
