Re: Geographic clustering

gwk Fri, 11 Sep 2009 07:33:22 -0700

Hi all,

I've just got my geographic clustering component working (somewhat). I've attached a sample resultset to this mail. It seems to work pretty well and it's pretty fast. I have one issue I need help with concerning the API though. At the moment my Hilbert field is a Sortable Integer, and I do the following call to get the count for a specific cluster:


Query rangeQ = new TermRangeQuery("geo_hilbert", lowI, highI, true, true);
searcher.numDocs(rangeQ, docs);

But I'd like to further reduce the DocSet by the longitude and latitude bounds given in the geocluster arguments (swlat, swlng, nelat and nelng), but only for the purposes of clustering. I don't want to just add fq arguments to the query, as I want my non-geocluster results (like facet counts and numFound) to not be affected by the selected range. So how would I achieve the effect of filter queries (including the awesome caching) by manipulating either the rangeQ or docs? And since the snippet above is called multiple times with different rangeQ but the same (filtered) DocSet, I guess manipulating docs would be faster (I think).
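
As a rough illustration of the "filter docs once, then count many ranges" idea, here is a minimal sketch that uses java.util.BitSet as a stand-in for a DocSet. It is not Solr's API: the class, the method names and the per-document hilbertByDoc array are all hypothetical, and a real component would rely on the searcher's own cached sets instead.

```java
import java.util.BitSet;

// Illustration only: BitSet stands in for a DocSet; nothing here is Solr API.
final class FilteredCountSketch {

    /** Intersect the base documents with the bounding-box matches once. */
    static BitSet restrict(BitSet docs, BitSet inBounds) {
        BitSet filtered = (BitSet) docs.clone(); // leave the caller's set untouched
        filtered.and(inBounds);
        return filtered;
    }

    /** Count documents in the filtered set whose Hilbert value lies in [low, high]. */
    static int countInRange(BitSet filtered, int[] hilbertByDoc, int low, int high) {
        int count = 0;
        for (int doc = filtered.nextSetBit(0); doc >= 0; doc = filtered.nextSetBit(doc + 1)) {
            int h = hilbertByDoc[doc];
            if (h >= low && h <= high) {
                count++;
            }
        }
        return count;
    }
}
```

The intersection is computed once and reused for every range, and because restrict works on a copy, the original set (and therefore numFound and the facet counts) stays unaffected.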


Regards,

gwk

gwk wrote:

Hi Joe,
Thanks for the link, I'll check it out. I'm not sure it'll help in my situation though, since the clustering should happen at runtime due to faceted browsing (unless I'm mistaken about what the preprocessing does).
More on my progress though: I thought some more about using Hilbert curve mapping and it seems really suited for what I want. I've just added a Hilbert field to my schema (Trie Integer field) with latitude and longitude at 15 bits precision (didn't use 16 bits to avoid the sign bit), so I have a 30-bit number in said field. Getting facet counts for 0 to (2^30 - 1) should get me the entire map, while getting counts for 0 to (2^28 - 1), 2^28 to (2^29 - 1), 2^29 to (2^29 + 2^28 - 1) and (2^29 + 2^28) to (2^30 - 1) should give me counts for four equal quadrants, all the way down to 0 to 3, 4 to 7, 8 to 11 ... (2^30 - 4) to (2^30 - 1), and of course faceting on every separate term. Of course, if you're zoomed in far enough to need such fine-grained clustering you'll be looking at a small portion of the map and only a part of the whole range should be counted, but that should be doable by calculating the Hilbert number for the lower and upper bounds.
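
The mapping and the range arithmetic described above can be sketched roughly as follows. This assumes a simple linear scaling of latitude and longitude to 15-bit cell indices and uses the commonly published iterative xy-to-d Hilbert conversion; the class and method names are hypothetical and the actual component may well do this differently.

```java
// Sketch under stated assumptions; names are hypothetical.
final class HilbertSketch {
    static final int BITS = 15;        // bits per axis, as described in the mail
    static final int SIDE = 1 << BITS; // 32768 cells per axis

    /** Linearly scale a value in [min, max] to a cell index in [0, SIDE - 1]. */
    static int quantize(double value, double min, double max) {
        int cell = (int) ((value - min) / (max - min) * SIDE);
        return Math.max(0, Math.min(SIDE - 1, cell));
    }

    /** Iterative xy -> d Hilbert conversion on a SIDE x SIDE grid (30-bit result). */
    static int encode(int x, int y) {
        int d = 0;
        for (int s = SIDE / 2; s > 0; s /= 2) {
            int rx = (x & s) > 0 ? 1 : 0;
            int ry = (y & s) > 0 ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            // Rotate/flip the quadrant so the next level is oriented correctly.
            if (ry == 0) {
                if (rx == 1) {
                    x = SIDE - 1 - x;
                    y = SIDE - 1 - y;
                }
                int t = x;
                x = y;
                y = t;
            }
        }
        return d;
    }

    /**
     * Inclusive {low, high} range of Hilbert values for the cell containing d
     * at the given precision (bits per axis, 0..BITS). Precision 1 yields the
     * four quadrant ranges; precision 14 yields 0-3, 4-7, 8-11, and so on.
     */
    static int[] rangeForCell(int d, int precision) {
        int span = 1 << (2 * (BITS - precision));
        int low = d & ~(span - 1);
        return new int[] { low, low + span - 1 };
    }
}
```

For example, rangeForCell(encode(x, y), 1) gives the quadrant range containing a point, which is the pair of bounds that would be passed to a range query on the Hilbert field.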
The only problem is the location of the clusters: if I use this method I'll only have the Hilbert number and the number of items in that part of what is essentially a quadtree. But I suppose I can calculate the facet counts for one precision finer than the requested precision and use a weighted average of the four parts of the cluster. I'll have to see if that is accurate enough.
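
A minimal sketch of that weighted average, assuming the centre coordinates of the four finer sub-cells are already known (how they are derived from the Hilbert number is not shown here, and all names are hypothetical):

```java
// Sketch only: estimates a cluster position from its four sub-cell counts.
final class CentroidSketch {

    /**
     * Count-weighted mean of the sub-cell centres.
     * centres[i] is {latitude, longitude}; counts[i] is the facet count for
     * that sub-cell. Returns null when every count is zero.
     */
    static double[] weightedCentre(double[][] centres, int[] counts) {
        double lat = 0.0;
        double lng = 0.0;
        long total = 0;
        for (int i = 0; i < centres.length; i++) {
            lat += centres[i][0] * counts[i];
            lng += centres[i][1] * counts[i];
            total += counts[i];
        }
        if (total == 0) {
            return null; // empty cluster, nothing to place on the map
        }
        return new double[] { lat / total, lng / total };
    }
}
```

The estimate can be off by at most the size of one sub-cell, so its accuracy depends on how fine the extra precision level is relative to the map zoom.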
Hopefully I'll have the time to complete this today or tomorrow. I'll report back if it has worked.
Regards,

gwk

Joe Calderon wrote:
There are clustering libraries, like
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/, that have
bindings to Perl/Python; you can preprocess your results and create
clusters for each zoom level.

On Tue, Sep 8, 2009 at 8:08 AM, gwk<g...@eyefi.nl> wrote:
Hi,

I just completed a simple proof-of-concept clusterer component which
naively clusters with a specified bounding box around each position,
similar to what the JavaScript MarkerClusterer does. It's currently very
slow as I loop over the entire docset and request the longitude and
latitude of each document (not to mention that my unfamiliarity with
Lucene/Solr isn't helping the implementation's performance any; most code
is copied from grep-ing the Solr source). Clustering a set of about
80,000 documents takes about 5-6 seconds. I'm currently looking into
storing the Hilbert curve mapping in Solr and clustering using facet
counts on numerical ranges of that mapping, but I'm not sure it will pan out.
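
For reference, a naive bounding-box pass of the kind described above might look roughly like this. It is a sketch of the general technique, not the actual component: the names are hypothetical, and the first point of each cluster is assumed to fix the centre of its box.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of naive bounding-box clustering; names are hypothetical.
final class NaiveClusterSketch {

    static final class Cluster {
        final double lat; // box centre, fixed by the first point
        final double lng;
        int count = 1;

        Cluster(double lat, double lng) {
            this.lat = lat;
            this.lng = lng;
        }
    }

    /**
     * points[i] is {latitude, longitude}. A point joins the first cluster
     * whose box (centre +/- halfLat, +/- halfLng) contains it; otherwise it
     * starts a new cluster.
     */
    static List<Cluster> cluster(double[][] points, double halfLat, double halfLng) {
        List<Cluster> clusters = new ArrayList<>();
        for (double[] p : points) {
            Cluster hit = null;
            for (Cluster c : clusters) {
                if (Math.abs(p[0] - c.lat) <= halfLat && Math.abs(p[1] - c.lng) <= halfLng) {
                    hit = c;
                    break;
                }
            }
            if (hit == null) {
                clusters.add(new Cluster(p[0], p[1]));
            } else {
                hit.count++;
            }
        }
        return clusters;
    }
}
```

Every point is visited and compared against the existing clusters, so the cost grows with the size of the document set, which fits the motivation for moving to range counts instead.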
Regards,

gwk

Grant Ingersoll wrote:
Not directly related to geo clustering, but
http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable
interface to clustering implementations.  It currently has Carrot2
implemented, but the APIs are marked as experimental. I would definitely be
interested in hearing your experience with implementing your clustering
algorithm in it.

-Grant

On Sep 8, 2009, at 4:00 AM, gwk wrote:
Hi,
I'm working on a search-on-map interface for our website. I've created a
little proof of concept which uses the MarkerClusterer
(http://code.google.com/p/gmaps-utility-library-dev/), which clusters the
markers nicely. But because sending tens of thousands of markers over Ajax
is not quite as fast as I would like it to be, I'd prefer to do the
clustering on the server side. I've considered a few options, like storing
the Morton order and throwing away precision to cluster, assigning all
locations to a grid position. Or simply clustering based on
country/region/city depending on zoom level, by adding latitude and
longitude fields for each zoom level (so that for smaller countries you
have to be zoomed in further to get the next level of clustering).
I was wondering if anybody else has worked on something similar, and if so
what their solutions are.
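
The Morton-order option mentioned above can be sketched as bit interleaving followed by a right shift. The 15 bits per axis and all names below are assumptions made for illustration only:

```java
// Sketch of Morton (Z-order) codes with precision dropping; names are hypothetical.
final class MortonSketch {
    static final int BITS = 15; // assumed bits per axis, giving a 30-bit code

    /** Interleave the low BITS bits of x (even positions) and y (odd positions). */
    static int interleave(int x, int y) {
        int code = 0;
        for (int i = 0; i < BITS; i++) {
            code |= ((x >> i) & 1) << (2 * i);
            code |= ((y >> i) & 1) << (2 * i + 1);
        }
        return code;
    }

    /**
     * Throw away precision: dropping one level removes one bit per axis,
     * so neighbouring locations collapse onto the same coarser grid cell.
     */
    static int cellAtZoom(int code, int droppedLevels) {
        return code >> (2 * droppedLevels);
    }
}
```

Locations that share the same cellAtZoom value fall in the same grid cell at that zoom level, so grouping by that value gives one cluster per cell.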

Regards,

gwk
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">3</int>
 <lst name="params">
  <str name="indent">on</str>
  <str name="geocluster.swlat">36.53</str>
  <str name="rows">0</str>
  <str name="version">2.2</str>
  <str name="omitHeader">false</str>
  <str name="geocluster.nelat">38.07</str>
  <str name="echoParams">EXPLICIT</str>
  <str name="start">0</str>
  <str name="q">*:*</str>
  <str name="geocluster.precision">9</str>
  <str name="geocluster.swlng">-9.8</str>
  <str name="geocluster.nelng">-6.72</str>
  <str name="geocluster">true</str>
  <str name="fq">geo_hilbert:[* TO *]</str>
 </lst>
</lst>
<result name="response" numFound="93307" start="0"/>
<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields"/>
 <lst name="facet_dates"/>
 <lst name="facet_numbers"/>
</lst>
<lst name="geoclusters">
 <lst name="cluster">
  <double name="latitude">37.769512009033484</double>
  <double name="longitude">-8.61354411450543</double>
  <int name="count">25</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.06696017358364</double>
  <double name="longitude">-8.07416425385468</double>
  <int name="count">3583</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.176887001393936</double>
  <double name="longitude">-7.549981735214701</double>
  <int name="count">1153</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.17886898403883</double>
  <double name="longitude">-6.855677968688013</double>
  <int name="count">54</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.35465559862057</double>
  <double name="longitude">-6.855677968688013</double>
  <int name="count">8</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.8820154423658</double>
  <double name="longitude">-6.855677968688013</double>
  <int name="count">1</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.823419904171885</double>
  <double name="longitude">-7.558824427014996</double>
  <int name="count">3</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.46247138889737</double>
  <double name="longitude">-7.427570421460621</double>
  <int name="count">75</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.44254890591144</double>
  <double name="longitude">-7.910397656178475</double>
  <int name="count">2</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">37.71721549119542</double>
  <double name="longitude">-7.976317636646627</double>
  <int name="count">16</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">38.05780205694754</double>
  <double name="longitude">-7.93969542527543</double>
  <int name="count">12</int>
 </lst>
 <lst name="cluster">
  <double name="latitude">38.05780205694754</double>
  <double name="longitude">-8.61354411450543</double>
  <int name="count">114</int>
 </lst>
</lst>
</response>
