On May 12, 2006, at 9:06 AM, Fabio Confalonieri wrote:
> I see our needs have already surfaced on the mailing list: it's the
> refine-search problem you have sometimes called faceted browsing, which is
> the basis of the CNET browsing architecture. We have ads in different
> categories which have different attributes ("fields" in Lucene parlance);
> say, the motors-car category has make, model, price, and color, while the
> real-estate-houses category has bathroom ranges, bedroom ranges, etc.
>
> I understand you developed Solr partly to have a filter cache storing the
> bitset of each search's results, giving a fast way to intersect those
> bitsets, count the resulting sub-queries, and present the counts for
> refinement searches (I have read the CNET announcement, the Nines-related
> thread, and some other related threads).
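The intersect-and-count idea described above can be sketched with plain `java.util.BitSet`s. This is a minimal, hypothetical illustration (the map standing in for a cached facet field, and the `countFacet` helper, are mine, not Solr's actual cache API):

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class FacetCounts {
    /** For each value of one facet field, count matches within the current result set. */
    public static Map<String, Integer> countFacet(Map<String, BitSet> fieldCache, BitSet results) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, BitSet> e : fieldCache.entrySet()) {
            BitSet intersection = (BitSet) e.getValue().clone(); // don't clobber the cached set
            intersection.and(results);                           // AND with the current results
            counts.put(e.getKey(), intersection.cardinality());  // cardinality = refinement count
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy "index" of 8 documents; cached bitsets for a "color" facet field.
        Map<String, BitSet> color = new LinkedHashMap<String, BitSet>();
        BitSet red = new BitSet(8);  red.set(0);  red.set(2);  red.set(5);
        BitSet blue = new BitSet(8); blue.set(1); blue.set(2); blue.set(7);
        color.put("red", red);
        color.put("blue", blue);

        BitSet results = new BitSet(8); // current query matched docs 0, 2, and 7
        results.set(0); results.set(2); results.set(7);

        System.out.println(countFacet(color, results)); // {red=2, blue=2}
    }
}
```

Each cached bitset is cloned before the AND so the cache itself is never mutated; the count for each facet value is just the cardinality of the intersection.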

As Yonik has pointed out, Solr provides some nice facilities to build upon, but the actual implementation is still custom for this sort of thing. For example, here's the (pseudo)code for how my intersecting BitSet (soon to become DocSet) processing works:

  private Query createConstraintMask(final Map facetCache, String[] constraints,
                                     BitSet constraintMask, IndexReader reader)
      throws ParseException, IOException {
    // BooleanQuery collects all full-text expression constraints, but not facet constraints
    BooleanQuery query = new BooleanQuery();

    constraintMask.set(0, constraintMask.size()); // light up all documents initially

    if (constraints != null) {
      // Loop over all constraints, ANDing each cached bit set with the constraint mask
      for (String constraint : constraints) {
        if (constraint == null || constraint.length() == 0) continue;

        // constraint looks like this: [-]field:value
        int colonPosition = constraint.indexOf(':');
        if (colonPosition <= 0) continue;

        String field = constraint.substring(0, colonPosition);
        boolean invert = false;
        if (field.startsWith("-")) {
          invert = true;
          field = field.substring(1);
        }

        String value = constraint.substring(colonPosition + 1);

        BitSet valueMask;
        if (!field.equals("?")) {
          Map fieldMap = (Map) facetCache.get(field); // facetCache is from a custom Solr cache currently
          if (fieldMap == null) continue; // field name doesn't correspond to predefined facets

          valueMask = (BitSet) fieldMap.get(value);
          if (valueMask == null) {
            valueMask = new BitSet(constraintMask.size());
            System.out.println("invalid value requested for field " + field + ": " + value);
          }
        } else {
          Query clause = /* some query from parsing "value" */ null;
          QueryFilter filter = new QueryFilter(clause); // this should change to get the DocSet from Solr's facilities :)
          valueMask = filter.bits(reader);
        }

        if (!invert) {
          constraintMask.and(valueMask);
        } else {
          constraintMask.andNot(valueMask); // this is what would be nice for DocSets to be capable of
        }
      }
    }

    if (query.getClauses().length == 0) {
      return new MatchAllDocsQuery();
    }

    return query;
  }


And then it basically gets called like this in my custom handler:

    BitSet constraintMask = new BitSet(reader.numDocs());
    Query query = createConstraintMask(facetCache, req.getParams("constraint"),
                                       constraintMask, reader);
    DocList results = req.getSearcher().getDocList(query, new BitDocSet(constraintMask),
                                                   sort, req.getStart(), req.getLimit());

[critique of this code more than welcome!]

My client (Ruby on Rails) is POSTing one or more parameters that look like this:

        constraint=#{invert}#{field}:#{constraint[:value]}

It works really well even before my refactoring to use Solr's DocSet and caching capabilities, and I'm sure it'll do even better leveraging what Solr provides. Really nice stuff!

> A more general question: is all the CNET logic for intersecting bitsets
> available through the servlet, or do I have to write some Java code to be
> plugged into Solr?

Currently you have to piece it together. The goal is to build these facilities into the core, but we should do so based on folks implementing it themselves and contributing, so that we can compare the needs different users have and come up with some great groundwork in the faceted browsing area, just as Solr itself was built above raw Lucene.

So let's all flesh this stuff out, compare/contrast real-world working implementations, and factor the common pieces on top.

As an example of another facility I've just added on top: the ability to return all terms that match a client-provided prefix. This enables a Google Suggest-like convenience: when someone types "Yo" and pauses, an Ajaxified UI hits my Rails app, which in turn pings Solr with the prefix, and a custom request handler responds with the matching terms ("Yonik", for example) for a specified field. Not only that, but my implementation returns the number of documents matching each term, constrained by the same types of constraints above, including full-text queries. This allows our users to pick people by typing a name rather than us having to populate a drop-down (we'll still have some kind of browse interface too, I'm sure), showing only the names of folks involved in the document set they are currently constraining their view to.
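That prefix-plus-count idea can be sketched with the standard library alone. Here a `TreeMap` stands in for a sorted term index, and the `suggest` helper is purely illustrative (it is not Solr's or Lucene's API; a real implementation would walk a Lucene TermEnum instead):

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class PrefixSuggest {
    /**
     * Return each term starting with the prefix, mapped to the number of documents
     * that contain it AND fall within the current constraint mask.
     */
    public static Map<String, Integer> suggest(TreeMap<String, BitSet> termIndex,
                                               String prefix, BitSet constraintMask) {
        Map<String, Integer> suggestions = new LinkedHashMap<String, Integer>();
        // subMap(prefix, prefix + '\uffff') spans exactly the terms sharing that prefix
        for (Map.Entry<String, BitSet> e :
                termIndex.subMap(prefix, prefix + '\uffff').entrySet()) {
            BitSet docs = (BitSet) e.getValue().clone();
            docs.and(constraintMask); // honor the user's current constraints
            if (docs.cardinality() > 0) {
                suggestions.put(e.getKey(), docs.cardinality());
            }
        }
        return suggestions;
    }

    public static void main(String[] args) {
        TreeMap<String, BitSet> names = new TreeMap<String, BitSet>();
        BitSet yonik = new BitSet(); yonik.set(1); yonik.set(4);
        BitSet yoda  = new BitSet(); yoda.set(2);
        BitSet erik  = new BitSet(); erik.set(1);
        names.put("Yonik", yonik);
        names.put("Yoda", yoda);
        names.put("Erik", erik);

        BitSet mask = new BitSet(); // current view: docs 1 and 2
        mask.set(1); mask.set(2);

        System.out.println(suggest(names, "Yo", mask)); // {Yoda=1, Yonik=1}
    }
}
```

Terms whose constrained count drops to zero are omitted, which matches the described behavior of only suggesting names present in the user's current document set.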

I've been thinking about this in a general sense: if Solr were driven by a slick servlet filter rather than servlets, these types of handlers could be plugged in much more easily, including automatic URL handling, rather than having to twiddle web.xml. I realize the handler configuration allows this via the qt parameter, and I'm leveraging that myself, but I think some HiveMind mojo could allow true "plugins" to drop right into the classpath and be immediately available (perhaps even hot-deployed with some containers, though I personally would rebuild a WAR and stop/deploy/restart).

> In this case, which is the correct level to do this? Perhaps a new
> RequestHandler understanding some new query syntax to exploit filters?

Back to your specific case: currently, yes, a new request handler is needed to go above and beyond what the built-in standard one provides. I expect a flood of cool handlers on top of Solr :) and that is why I'm thinking more along the lines of a true plugin architecture.

> We only need to sort on a single, precalculated rank field stored as a
> range field, so we don't need relevance and consequently don't need scores
> (which is a prerequisite for using BitSets, if I understand correctly).
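Once scores are out of the picture, ordering the matching documents is just a sort of the bitset's doc ids by the precalculated rank. A hypothetical stdlib sketch (the `int[]` rank array standing in for a stored rank field is my own illustration):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

public class RankSort {
    /** Order the documents in the mask by a precomputed rank, highest first. */
    public static List<Integer> sortByRank(BitSet mask, final int[] rank) {
        List<Integer> docs = new ArrayList<Integer>();
        // nextSetBit is the idiomatic way to iterate a BitSet's set bits
        for (int doc = mask.nextSetBit(0); doc >= 0; doc = mask.nextSetBit(doc + 1)) {
            docs.add(doc);
        }
        docs.sort(new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Integer.compare(rank[b], rank[a]); // descending rank; no scores involved
            }
        });
        return docs;
    }

    public static void main(String[] args) {
        int[] rank = {10, 50, 30, 70}; // one precalculated rank per document
        BitSet mask = new BitSet();
        mask.set(0); mask.set(2); mask.set(3); // current matches
        System.out.println(sortByRank(mask, rank)); // [3, 2, 0]
    }
}
```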

You're pretty much right on!

> PS: I think Solr and Lucene are really great work!
> I'll be happy to add our project (a major press group here in Italy) to the
> public websites page in the Solr wiki when we have finished.

I'm looking forward to your work on top of Solr! I'm personally quite thrilled with it and really believe it'll go far. If only I had more time to play with it myself rather than just contemplating it :)

        Erik
