Leveraging filter cache in queries
Hello, I've just found Lucene and Solr and I'm thinking of using them in our current project, essentially an ads portal (something very similar to www.oodle.com).

I see our needs have already surfaced on the mailing list: it's the refine-search problem you have sometimes called faceted browsing, which is the basis of CNET's browsing architecture. We have ads in different categories, each with different attributes ("fields" in Lucene language): say the motors-car category has make, model, price, and color, while real-estates-houses has bathroom ranges, bedroom ranges, etc. I understand you have developed Solr to have a filter cache storing bitsets of search results, so that there is a fast way to intersect those bitsets, count the resulting sub-queries, and present the counts for refinement searches (I have read the CNET announcement, the Nines thread, and some other related threads).

We had thought of storing, for every category, the possible sub-query attributes with their possible values/ranges in a MySQL database (which we use for every other non-search-related task), in a way similar to how you and CNET store the possible sub-queries of a query in a Lucene document.

What I haven't understood is whether the Solr StandardRequestHandler automatically creates and caches filters from normal queries submitted to the Solr select servlet, possibly with some syntax clue. I tried a query like "+field:value^0" which returns a great number of hits (on a test index of 100,000 documents), but I see only the query cache growing and the filter cache always empty. Is this normal? I've checked all the cache configuration but I can't tell whether filters are auto-generated from normal queries.

A more general question: is all the CNET logic of intersecting bitsets available through the servlet, or do I have to write some Java code to be plugged into Solr? In that case, at which level should I do this, perhaps a new RequestHandler understanding some new query syntax to exploit filters?

We only need to sort on a single precalculated rank field stored as a range field, so we don't need relevance scores (and, if I understand correctly, not needing scores is a prerequisite for using BitSets).

Thank you, I hope I have explained my doubts well.

Fabio

PS: I think Solr and Lucene are really great work! When we have finished, I'll be happy to add our project (for a major press group here in Italy) to the public websites listed on the Solr wiki.
Re: Leveraging filter cache in queries
On 5/12/06, Fabio Confalonieri <[EMAIL PROTECTED]> wrote:
> I tried a query like "+field:value^0" which returns a great number of
> hits (on a test index of 100,000 documents), but I see only the query
> cache growing and the filter cache always empty. Is this normal? I've
> checked all the cache configuration but I can't tell whether filters
> are auto-generated from normal queries.

There is currently no syntax in the standard request handler that understands filters. Converting certain "heavy" term queries to filters when they have a zero boost was something Doug pointed me at, and I borrowed it directly from Nutch very early on, before Solr had its own caching. The optimization code is still sort-of in Solr, but
- it's not called by default anymore... people needing faceted browsing currently need their own plugin anyway, and they can then specify filters directly.
- its caching is not integrated into Solr's caching.

Filters *can* be generated and used to satisfy whole queries when the following optimization is turned on in solrconfig.xml:

  <useFilterForSortedQuery>true</useFilterForSortedQuery>

> A more general question: is all the CNET logic of intersecting bitsets
> available through the servlet, or do I have to write some Java code to
> be plugged into Solr?

The nitty-gritty of getting intersection counts is in Solr, but you still need to ask Solr for each facet count individually, and you still need to know which counts to ask for. That's the part you currently still need a custom request handler for.

> In that case, at which level should I do this, perhaps a new
> RequestHandler understanding some new query syntax to exploit filters?

Yes, a new RequestHandler... from there, the easiest way is to pass extra parameters (not changing the query syntax passed as "q").

> We only need to sort on a single precalculated rank field stored as a
> range field, so we don't need relevance scores (and, if I understand
> correctly, not needing scores is a prerequisite for using BitSets).

You can do relevancy scoring *and* facets at the same time... there is no incompatibility there.

-Yonik
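To make that concrete, here is a minimal sketch of the counting loop such a custom request handler might run. SolrIndexSearcher.getDocSet() and DocSet.intersectionSize() are real Solr facilities, but the handler shape, the "make" field, and its values are made up for illustration:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.search.DocSet;
    import org.apache.solr.search.SolrIndexSearcher;

    // Count refinements by intersecting the base result DocSet with a cached
    // DocSet per facet value ("make" and its values are hypothetical).
    public void countFacets(SolrQueryRequest req, Query baseQuery) throws IOException {
        SolrIndexSearcher searcher = req.getSearcher();
        DocSet baseDocs = searcher.getDocSet(baseQuery);        // docs matching the user's query

        String[] makes = { "ford", "fiat", "toyota" };          // hypothetical facet values
        for (int i = 0; i < makes.length; i++) {
            Query filterQ = new TermQuery(new Term("make", makes[i]));
            DocSet filterDocs = searcher.getDocSet(filterQ);    // cacheable filter DocSet
            int count = baseDocs.intersectionSize(filterDocs);  // refinement count for this value
            // write (makes[i], count) into the response...
        }
    }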
One big XML file vs. many HTTP requests
Greetings,

I'm evaluating using Solr under Tomcat to replace a number of text-searching projects that currently use UMass's INQUERY, an older search engine.

One nice feature of INQUERY is that you can create one large SGML file containing lots of records, each bracketed with <DOC> and </DOC> tags. Submitting that big SGML document for indexing goes very fast.

I believe that Solr indexes one document at a time; each document requires a separate HTTP POST. How efficient is making a separate HTTP request per document when there are millions of documents? Do people ever use Solr's or Lucene's API directly for indexing large numbers of documents, and if so, what are the considerations pro and con?

Thanks to Yonik, Chris, and everyone for all your work; Solr looks really great.
Re: One big XML file vs. many HTTP requests
On 5/12/06, Michael Levy <[EMAIL PROTECTED]> wrote:
> How efficient is making a separate HTTP request per document when
> there are millions of documents?

If you use persistent connections and make multiple requests in parallel, there won't be much difference from sending multiple docs per request.

-Yonik
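As an illustration of that approach (a sketch, not a recommended client): a few threads, each POSTing its share of update messages. The URL and document count are assumptions; java.net.HttpURLConnection reuses keep-alive connections automatically as long as each response is fully consumed.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ParallelIndexer {
        // POST one XML update message; draining the response lets the
        // underlying socket be reused (HTTP keep-alive).
        static void post(String xml) throws Exception {
            URL url = new URL("http://localhost:8983/solr/update");  // assumed Solr URL
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            OutputStream out = conn.getOutputStream();
            out.write(xml.getBytes("UTF-8"));
            out.close();
            InputStream in = conn.getInputStream();
            byte[] buf = new byte[1024];
            while (in.read(buf) >= 0) { /* drain so the connection can be reused */ }
            in.close();
        }

        public static void main(String[] args) throws Exception {
            final int THREADS = 4;               // a few parallel senders
            Thread[] workers = new Thread[THREADS];
            for (int i = 0; i < THREADS; i++) {
                final int offset = i;
                workers[i] = new Thread() {
                    public void run() {
                        try {
                            // each thread posts its share of the documents
                            for (int d = offset; d < 1000; d += THREADS) {
                                post("<add><doc><field name=\"id\">" + d + "</field></doc></add>");
                            }
                        } catch (Exception e) { e.printStackTrace(); }
                    }
                };
                workers[i].start();
            }
            for (int i = 0; i < THREADS; i++) workers[i].join();
        }
    }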
Re: Leveraging filter cache in queries
On May 12, 2006, at 9:06 AM, Fabio Confalonieri wrote:
> I see our needs have already surfaced on the mailing list: it's the
> refine-search problem you have sometimes called faceted browsing, which
> is the basis of CNET's browsing architecture. We have ads in different
> categories, each with different attributes ("fields" in Lucene
> language): say the motors-car category has make, model, price, and
> color, while real-estates-houses has bathroom ranges, bedroom ranges,
> etc. I understand you have developed Solr to have a filter cache
> storing bitsets of search results, so that there is a fast way to
> intersect those bitsets, count the resulting sub-queries, and present
> the counts for refinement searches (I have read the CNET announcement,
> the Nines thread, and some other related threads).

As Yonik has pointed out, Solr provides some nice facilities to build upon, but the actual implementation is still custom for this sort of thing. For example, here's the (pseudo)code for how my intersecting BitSet (soon to become DocSet) processing works:

    private Query createConstraintMask(final Map facetCache, String[] constraints,
                                       BitSet constraintMask, IndexReader reader)
        throws ParseException, IOException {
      // BooleanQuery used for all full-text expression constraints, but not for facets
      Query query = new BooleanQuery();

      constraintMask.set(0, constraintMask.size());  // light up all documents initially

      if (constraints != null) {
        // Loop over all constraints, ANDing all cached bit sets with the constraint mask
        for (String constraint : constraints) {
          if (constraint == null || constraint.length() == 0) continue;

          // constraint looks like this: [-]field:value
          int colonPosition = constraint.indexOf(':');
          if (colonPosition <= 0) continue;

          String field = constraint.substring(0, colonPosition);
          boolean invert = false;
          if (field.startsWith("-")) {
            invert = true;
            field = field.substring(1);
          }
          String value = constraint.substring(colonPosition + 1);

          BitSet valueMask;
          if (!field.equals("?")) {
            // facetCache is from a custom Solr cache currently
            Map fieldMap = (Map) facetCache.get(field);
            if (fieldMap == null) continue;  // field name doesn't correspond to predefined facets

            valueMask = (BitSet) fieldMap.get(value);
            if (valueMask == null) {
              valueMask = new BitSet(constraintMask.size());
              System.out.println("invalid value requested for field " + field + ": " + value);
            }
          } else {
            Query clause = null;  // some query from parsing "value" (elided here)
            // this should change to get the DocSet from Solr's facilities :)
            QueryFilter filter = new QueryFilter(clause);
            valueMask = filter.bits(reader);
          }

          if (!invert) {
            constraintMask.and(valueMask);
          } else {
            // This is what would be nice for DocSets to be capable of
            constraintMask.andNot(valueMask);
          }
        }
      }

      if (((BooleanQuery) query).getClauses().length == 0) {
        query = new MatchAllDocsQuery();
      }

      return query;
    }

And then basically it gets called like this in my custom handler:

    BitSet constraintMask = new BitSet(reader.numDocs());
    Query query = createConstraintMask(facetCache, req.getParams("constraint"),
                                       constraintMask, reader);
    DocList results = req.getSearcher().getDocList(query, new BitDocSet(constraintMask),
                                                   sort, req.getStart(), req.getLimit());

[critique of this code more than welcome!]

My client (Ruby on Rails) POSTs parameters that look like this:

    constraint=#{invert}#{field}:#{constraint[:value]}

It works really well even before my refactoring to use Solr's DocSet and caching capabilities, and I'm sure it'll do even better leveraging those provided facilities. Really nice stuff!
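As a guess at what that refactoring might look like (speculative, not Erik's actual code): with SolrIndexSearcher.getDocSet() consulting Solr's filterCache, the custom facetCache and raw BitSets could give way to DocSet operations:

    // Speculative sketch of the DocSet-based variant, not the actual refactoring.
    DocSet valueDocs = searcher.getDocSet(new TermQuery(new Term(field, value)));
    constraintDocs = constraintDocs.intersection(valueDocs);  // AND with the running mask
    int remaining = constraintDocs.size();                    // docs still matching
    // an andNot()-style operation on DocSet would cover the inverted ("-field:value") case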
> A more general question: is all the CNET logic of intersecting bitsets
> available through the servlet, or do I have to write some Java code to
> be plugged into Solr?

Currently you have to piece it together. The goal is to build these facilities more into the core, but we should do so based on folks implementing it themselves and contributing it, so that we can compare the needs that others have and come up with some great groundwork in the faceted-browsing area, just as Solr itself has built above raw Lucene. So let's all flesh this stuff out, compare/contrast real-world working implementations, and factor the common pieces on top.

As an example of another facility I've just added on top: the ability to return all terms that match a client-provided prefix. This is to enable Google Suggest-like convenience, so that when someone types "Yo" and pauses, an Ajaxified UI will hit my Rails app, which in turn will ping Solr with the prefix, and a custom request handler will respond with the matching terms.
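One plausible way to implement such a prefix lookup with stock Lucene (a sketch of the general technique, not Erik's actual handler; the method name and signature are made up):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    // Collect all indexed terms in one field that start with the given prefix.
    public List matchingTerms(IndexReader reader, String field, String prefix)
            throws IOException {
        List matches = new ArrayList();
        TermEnum terms = reader.terms(new Term(field, prefix));  // seek to first term >= prefix
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field) || !t.text().startsWith(prefix))
                    break;  // walked past the prefix range
                matches.add(t.text());
            } while (terms.next());
        } finally {
            terms.close();
        }
        return matches;
    }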
Re: One big XML file vs. many HTTP requests
On May 12, 2006, at 1:02 PM, Michael Levy wrote:
> One nice feature of INQUERY is that you can create one large SGML file
> containing lots of records, each bracketed with <DOC> and </DOC> tags.
> Submitting that big SGML document for indexing goes very fast. I
> believe that Solr indexes one document at a time; each document
> requires a separate HTTP POST.

Actually, adding multiple documents per POST is possible.

> How efficient is making a separate HTTP request per document when
> there are millions of documents? Do people ever use Solr's or Lucene's
> API directly for indexing large numbers of documents, and if so, what
> are the considerations pro and con?

Maybe Solr could evolve a facility for doing these types of bulk operations without HTTP, but still using Solr's engine somehow via the API directly. I guess this gets tricky when you have a live Solr system up and juggling write locks, though. But currently going through HTTP is the only way, and it's likely not to be that much of a bottleneck, especially given that you can post multiple documents at a time (the wiki has an example, but I can't get to the web at the moment to post the link).

Erik
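For reference until the wiki is handy, a multi-document add message looks like this (the field names here are just illustrative):

    <add>
      <doc>
        <field name="id">1</field>
        <field name="title">first document</field>
      </doc>
      <doc>
        <field name="id">2</field>
        <field name="title">second document</field>
      </doc>
    </add>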