On May 12, 2006, at 9:06 AM, Fabio Confalonieri wrote:
I see our needs have already surfaced on the mailing list: it's the refine-search problem you have sometimes called faceted browsing, and which is the basis of CNET's browsing architecture. We have ads in different categories, and each category has different attributes ("fields" in Lucene language): say, the motors-cars category has make, model, price, and color, while real-estate-houses has bathroom ranges, bedroom ranges, etc.
I understand you developed Solr in part to have a filter cache storing bitsets of search results, giving a fast way to intersect those bitsets, count the resulting sub-queries, and present the counts for refinement searches (I have read the CNET announcement, the NINES-related thread, and some other related threads).
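The counting idea you describe boils down to ANDing a cached per-value bitset with the current result set and taking the cardinality. Here's a minimal sketch using plain java.util.BitSet — illustrative only, not Solr's actual implementation, and the names are made up:

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of refinement counting: each facet value owns a cached
// bitset of the docs containing it; the count shown next to a refinement
// link is the cardinality of (facet bits AND current result bits).
public class FacetCountSketch {

    // For each facet value, count docs present in both its cached bitset
    // and the current search results.
    static Map<String, Integer> counts(Map<String, BitSet> facet, BitSet results) {
        Map<String, Integer> out = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, BitSet> e : facet.entrySet()) {
            BitSet intersection = (BitSet) e.getValue().clone();
            intersection.and(results);              // intersect, don't mutate the cache
            out.put(e.getKey(), intersection.cardinality());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, BitSet> color = new LinkedHashMap<String, BitSet>();
        BitSet red = new BitSet();  red.set(0); red.set(2); red.set(5);
        BitSet blue = new BitSet(); blue.set(1); blue.set(6);
        color.put("red", red);
        color.put("blue", blue);

        BitSet results = new BitSet();
        results.set(0, 4);                           // user's query matched docs 0-3

        System.out.println(counts(color, results));  // prints "{red=2, blue=1}"
    }
}
```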
As Yonik has pointed out, Solr provides some nice facilities to build upon, but the actual implementation is still custom for this sort of thing. For example, here's the (pseudo)code for how my intersecting BitSet (soon to become DocSet) processing works:
    private Query createConstraintMask(final Map facetCache, String[] constraints,
                                       BitSet constraintMask, IndexReader reader)
        throws ParseException, IOException {
      Query query = new BooleanQuery(); // used for full-text expression constraints, but not for facets
      constraintMask.set(0, constraintMask.size()); // light up all documents initially
      if (constraints != null) {
        // Loop over all constraints, ANDing each cached bit set with the constraint mask
        for (String constraint : constraints) {
          if (constraint == null || constraint.length() == 0) continue;

          // constraint looks like this: [-]field:value
          int colonPosition = constraint.indexOf(':');
          if (colonPosition <= 0) continue;
          String field = constraint.substring(0, colonPosition);
          boolean invert = false;
          if (field.startsWith("-")) {
            invert = true;
            field = field.substring(1);
          }
          String value = constraint.substring(colonPosition + 1);

          BitSet valueMask;
          if (!field.equals("?")) {
            Map fieldMap = (Map) facetCache.get(field); // facetCache is from a custom Solr cache currently
            if (fieldMap == null) continue; // field name doesn't correspond to predefined facets
            valueMask = (BitSet) fieldMap.get(value);
            if (valueMask == null) {
              valueMask = new BitSet(constraintMask.size());
              System.out.println("invalid value requested for field " + field + ": " + value);
            }
          } else {
            Query clause = null; // some query from parsing "value" (elided)
            QueryFilter filter = new QueryFilter(clause); // this should change to get the DocSet from Solr's facilities :)
            valueMask = filter.bits(reader);
          }

          if (!invert) {
            constraintMask.and(valueMask);
          } else {
            constraintMask.andNot(valueMask); // this is what would be nice for DocSets to be capable of
          }
        }
      }
      if (((BooleanQuery) query).getClauses().length == 0) {
        query = new MatchAllDocsQuery();
      }
      return query;
    }
And then basically it gets called like this in my custom handler:

    BitSet constraintMask = new BitSet(reader.numDocs());
    Query query = createConstraintMask(facetCache, req.getParams("constraint"),
                                       constraintMask, reader);
    DocList results = req.getSearcher().getDocList(query, new BitDocSet(constraintMask),
                                                   sort, req.getStart(), req.getLimit());
[critique of this code more than welcome!]
My client (Ruby on Rails) is POSTing parameters that look like this:

    constraint=#{invert}#{field}:#{constraint[:value]}

It works really well even before my refactoring to use Solr's DocSet and caching capabilities, and I'm sure it'll do even better leveraging those. Really nice stuff!
A more general question: is all the CNET logic of intersecting bitsets available through the servlet, or do I have to write some Java code to be plugged into Solr?
Currently you have to piece it together. The goal is to build these facilities more into the core, but we should do so based on folks implementing it themselves and contributing it, so that we can compare the needs that others have and come up with some great groundwork in the faceted browsing area, just as Solr itself has built above raw Lucene. So, let's all flesh this stuff out, compare/contrast real-world working implementations, and factor the common ground on top.
As an example of another facility I've just added on top: the ability to return all terms that match a client-provided prefix. This enables a Google Suggest-like convenience: when someone types "Yo" and pauses, an Ajaxified UI hits my Rails app, which in turn pings Solr with the prefix, and a custom request handler responds with the matching terms ("Yonik", for example) for a specified field. Not only that, but my implementation returns the number of documents matching each term, constrained by the same types of constraints above, including full-text queries. This allows our users to pick people by typing a name rather than us having to populate a drop-down (we'll still have some kind of browse interface too, I'm sure), but only with names of folks involved in the document set they are currently constraining their view to.
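The shape of that suggest logic can be sketched with a sorted term-to-bitset map standing in for a Lucene term walk over one field — everything below is illustrative (the names and the TreeMap stand-in are mine, not Solr's or the actual handler's):

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of prefix suggestions with constrained counts: for
// every indexed term starting with the prefix, report how many documents
// contain it AND satisfy the user's current constraint mask.
public class PrefixSuggestSketch {

    static SortedMap<String, Integer> suggest(TreeMap<String, BitSet> termDocs,
                                              String prefix, BitSet constraintMask) {
        SortedMap<String, Integer> counts = new TreeMap<String, Integer>();
        // subMap(prefix, prefix + '\uffff') selects exactly the keys that
        // start with the prefix, since the map keys are sorted.
        for (Map.Entry<String, BitSet> e
                 : termDocs.subMap(prefix, prefix + '\uffff').entrySet()) {
            BitSet docs = (BitSet) e.getValue().clone();
            docs.and(constraintMask);            // constrain to the current view
            if (docs.cardinality() > 0) {
                counts.put(e.getKey(), docs.cardinality());
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        TreeMap<String, BitSet> termDocs = new TreeMap<String, BitSet>();
        BitSet yonik = new BitSet(); yonik.set(0); yonik.set(3);
        BitSet yoda  = new BitSet(); yoda.set(7);
        BitSet erik  = new BitSet(); erik.set(1);
        termDocs.put("yonik", yonik);
        termDocs.put("yoda", yoda);
        termDocs.put("erik", erik);

        BitSet mask = new BitSet();
        mask.set(0, 4);                          // user's current result set: docs 0-3

        // "yoda" drops out: its only doc (7) is outside the constraint mask.
        System.out.println(suggest(termDocs, "yo", mask)); // prints "{yonik=2}"
    }
}
```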
I've been thinking about this in a general sense: if Solr were driven by a slick servlet filter rather than servlets, then these types of handlers could be plugged in a lot more easily, including automatic URL handling, rather than having to twiddle web.xml. I realize that the handler configuration allows this with the qt parameter, and I'm leveraging that myself, but I think some HiveMind mojo could allow true "plugins" to drop right into the classpath and be immediately available (perhaps even hot-deployed with some containers, though I personally would rebuild a WAR, then stop/deploy/restart).
In this case, which is the correct level at which to do this? Perhaps a new RequestHandler understanding some new query syntax to exploit filters?
Back to your specific case: currently, yes, a new request handler is needed to go above and beyond what the built-in standard one provides. I expect a flood of cool handlers on top of Solr :) and that is why I am thinking more along the lines of a true plugin architecture.
We only need to sort on a single, precalculated rank field stored as a range field, so we don't need relevance and consequently don't need scores (which is a prerequisite for using BitSets, if I understand well).
You're pretty much right on!
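For what it's worth, ordering a bitset of filtered docs by a precomputed rank is the easy part — here's an illustrative sketch (not Solr code; the rank array stands in for a cached rank field):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: order a filtered document set by a precalculated
// per-document rank value, with no relevance scoring involved at all.
public class RankSortSketch {

    static List<Integer> sortByRank(BitSet results, final int[] rank) {
        List<Integer> docs = new ArrayList<Integer>();
        // Walk the set bits: these are the matching document ids.
        for (int doc = results.nextSetBit(0); doc >= 0; doc = results.nextSetBit(doc + 1)) {
            docs.add(doc);
        }
        // Order by the precomputed rank, ascending; scores never enter into it.
        Collections.sort(docs, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) { return rank[a] - rank[b]; }
        });
        return docs;
    }

    public static void main(String[] args) {
        int[] rank = {40, 10, 30, 20};   // rank per document id 0..3
        BitSet results = new BitSet();
        results.set(0); results.set(2); results.set(3);
        System.out.println(sortByRank(results, rank)); // prints "[3, 2, 0]"
    }
}
```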
PS: I think Solr and Lucene are really great work! I'll be happy, when we have finished, to add our project (for a major press group here in Italy) to the public websites page on the Solr wiki.
I'm looking forward to your work on top of Solr! I'm personally
quite thrilled with it and really believe it'll go far. If only I
had more time to play with it myself rather than just contemplating
it :)
Erik