Hi all,
I've just got my geographic clustering component working (somewhat).
I've attached a sample resultset to this mail. It seems to work pretty
well and it's pretty fast. I have one issue I need help with concerning
the API though. At the moment my Hilbert field is a Sortable Integer,
and I do the following call to get the count for a specific cluster:
Query rangeQ = new TermRangeQuery("geo_hilbert", lowI, highI, true, true);
searcher.numDocs(rangeQ, docs);
But I'd like to further reduce the DocSet by the given longitude and
latitude bounds given in the geocluster arguments (swlat, swlng, nelat
and nelng) but only for the purposes of clustering, I don't just want to
have to add fq arguments for to the query as I want my non-geocluster
results (like facet counts and numFound) to not be affected by the
selected range. So how would I achieve the effect of filterqueries
(including the awesome caching) by manipulating either the rangeQ or
docs. And since the snippet above is called multiple times with
different rangeQ but the same (filtered) DocSet I guess manipulating
docs would be faster (I think).
Regards,
gwk
gwk wrote:
Hi Joe,
Thanks for the link, I'll check it out, I'm not sure it'll help in my
situation though since the clustering should happen at runtime due to
faceted browsing (unless I'm mistaken at what the preprocessing does).
More on my progress though, I thought some more about using Hilbert
curve mapping and it seems really suited for what I want. I've just
added a Hilbert field to my schema (Trie Integer field) with latitude
and longitude at 15bits precision (didn't use 16 bits to avoid the
sign bit) so I have a 30 bit number in said field. Getting facet
counts for 0 to (2^30 - 1) should get me the entire map while getting
counts for 0 to (2^28 - 1), 2^28 to (2^29 - 1), 2^29 to (2^29 + 2^28 -
1) and (2^29 + 2^28) to (2^30 - 1) should give me counts for four
equal quadrants, all the way down to 0 to 3, 4 to 7, 8 to 11 ....
(2^30 - 4 to 2^30 - 1) and of course faceting on every separate term.
Of course since if you're zoomed in far enough to need such fine
grained clustering you'll be looking at a small portion of the map and
only a part of the whole range should be counted, but that should be
doable by calculating the Hilbert number for the lower and upper bounds.
The only problem is the location of the clusters, if I use this method
I'll only have the Hilbert number and the number of items in that part
of the, what is essentially a quadtree. But I suppose I can calculate
the facet counts for one precision finer than the requested precision
and use a weighted average of the four parts of the cluster, I'll have
to see if that is accurate enough.
Hopefully I'll have the time to complete this today or tomorrow. I'll
report back if it has worked.
Regards,
gwk
Joe Calderon wrote:
there are clustering libraries like
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/, that have
bindings to perl/python, you can preprocess your results and create
clusters for each zoom level
On Tue, Sep 8, 2009 at 8:08 AM, gwk<g...@eyefi.nl> wrote:
Hi,
I just completed a simple proof-of-concept clusterer component which
naively clusters with a specified bounding box around each position,
similar to what the javascript MarkerClusterer does. It's currently
very
slow as I loop over the entire docset and request the longitude and
latitude of each document (Not to mention that my unfamiliarity with
Lucene/Solr isn't helping the implementations performance any, most
code
is copied from grep-ing the solr source). Clustering a set of about
80.000 documents takes about 5-6 seconds. I'm currently looking into
storing the hilber curve mapping in Solr and clustering using facet
counts on numerical ranges of that mapping but I'm not sure it will
pan out.
Regards,
gwk
Grant Ingersoll wrote:
Not directly related to geo clustering, but
http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable
interface to clustering implementations. It currently has Carrot2
implemented, but the APIs are marked as experimental. I would
definitely be
interested in hearing your experience with implementing your
clustering
algorithm in it.
-Grant
On Sep 8, 2009, at 4:00 AM, gwk wrote:
Hi,
I'm working on a search-on-map interface for our website. I've
created a
little proof of concept which uses the MarkerClusterer
(http://code.google.com/p/gmaps-utility-library-dev/) which
clusters the
markers nicely. But because sending tens of thousands of markers
over Ajax
is not quite as fast as I would like it to be, I'd prefer to do the
clustering on the server side. I've considered a few options like
storing
the morton-order and throwing away precision to cluster, assigning
all
locations to a grid position. Or simply cluster based on
country/region/city
depending on zoom level by adding latitude on longitude fields for
each zoom
level (so that for smaller countries you have to be zoomed in
further to get
the next level of clustering).
I was wondering if anybody else has worked on something similar
and if so
what their solutions are.
Regards,
gwk
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using
Solr/Lucene:
http://www.lucidimagination.com/search
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3</int>
<lst name="params">
<str name="indent">on</str>
<str name="geocluster.swlat">36.53</str>
<str name="rows">0</str>
<str name="version">2.2</str>
<str name="omitHeader">false</str>
<str name="geocluster.nelat">38.07</str>
<str name="echoParams">EXPLICIT</str>
<str name="start">0</str>
<str name="q">*:*</str>
<str name="geocluster.precision">9</str>
<str name="geocluster.swlng">-9.8</str>
<str name="geocluster.nelng">-6.72</str>
<str name="geocluster">true</str>
<str name="fq">geo_hilbert:[* TO *]</str>
</lst>
</lst>
<result name="response" numFound="93307" start="0"/>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields"/>
<lst name="facet_dates"/>
<lst name="facet_numbers"/>
</lst>
<lst name="geoclusters">
<lst name="cluster">
<double name="latitude">37.769512009033484</double>
<double name="longitude">-8.61354411450543</double>
<int name="count">25</int>
</lst>
<lst name="cluster">
<double name="latitude">37.06696017358364</double>
<double name="longitude">-8.07416425385468</double>
<int name="count">3583</int>
</lst>
<lst name="cluster">
<double name="latitude">37.176887001393936</double>
<double name="longitude">-7.549981735214701</double>
<int name="count">1153</int>
</lst>
<lst name="cluster">
<double name="latitude">37.17886898403883</double>
<double name="longitude">-6.855677968688013</double>
<int name="count">54</int>
</lst>
<lst name="cluster">
<double name="latitude">37.35465559862057</double>
<double name="longitude">-6.855677968688013</double>
<int name="count">8</int>
</lst>
<lst name="cluster">
<double name="latitude">37.8820154423658</double>
<double name="longitude">-6.855677968688013</double>
<int name="count">1</int>
</lst>
<lst name="cluster">
<double name="latitude">37.823419904171885</double>
<double name="longitude">-7.558824427014996</double>
<int name="count">3</int>
</lst>
<lst name="cluster">
<double name="latitude">37.46247138889737</double>
<double name="longitude">-7.427570421460621</double>
<int name="count">75</int>
</lst>
<lst name="cluster">
<double name="latitude">37.44254890591144</double>
<double name="longitude">-7.910397656178475</double>
<int name="count">2</int>
</lst>
<lst name="cluster">
<double name="latitude">37.71721549119542</double>
<double name="longitude">-7.976317636646627</double>
<int name="count">16</int>
</lst>
<lst name="cluster">
<double name="latitude">38.05780205694754</double>
<double name="longitude">-7.93969542527543</double>
<int name="count">12</int>
</lst>
<lst name="cluster">
<double name="latitude">38.05780205694754</double>
<double name="longitude">-8.61354411450543</double>
<int name="count">114</int>
</lst>
</lst>
</response>