Leveraging filter cache in queries
Hello, I've just found Lucene and Solr and I'm thinking of using them in our current project, essentially an ads portal (something very similar to www.oodle.com).

I see our needs have already surfaced on the mailing list: it's the refine-search problem you have sometimes called faceted browsing, which is the basis of CNET's browsing architecture. We have ads in different categories, each with different attributes ("fields" in Lucene language): say the motors-car category has make, model, price, and color, while real-estates-houses has bathroom ranges, bedroom ranges, etc. I understand you have developed Solr to have a filter cache storing bitsets of search results, so that there is a fast way to intersect those bitsets, count the resulting sub-queries, and present the counts for refinement searches (I have read the CNET announcement, the Nines thread, and some other related threads).

We had thought of storing, for every category, the possible sub-query attributes with their possible values/ranges in a MySQL database (which we use for every other non-search-related task), in a way similar to how you and CNET store the possible sub-queries of a query in a Lucene document.

What I haven't understood is whether the Solr StandardRequestHandler automatically creates and caches filters from normal queries submitted to the Solr select servlet, possibly with some syntax clue. I tried a query like "+field:value^0" which returns a great number of hits (on a test index of 100,000 documents), but I see only the query cache growing and the filter cache always empty. Is this normal? I've checked all the cache configuration but I can't tell whether filters are auto-generated from normal queries.

A more general question: is all the CNET logic of intersecting bitsets available through the servlet, or do I have to write some Java code to be plugged into Solr? In that case, at which level should I do this, perhaps a new RequestHandler understanding some new query syntax to exploit filters?

We only need to sort on a single precalculated rank field stored as a range field, so we don't need relevance scores (and, if I understand correctly, not needing scores is a prerequisite for using BitSets).

Thank you, I hope I have explained my doubts well.

Fabio

PS: I think Solr and Lucene are really great work! When we have finished, I'll be happy to add our project (for a major press group here in Italy) to the public websites listed on the Solr wiki.
Re: Leveraging filter cache in queries
On 5/12/06, Fabio Confalonieri <[EMAIL PROTECTED]> wrote:
> I tried a query like "+field:value^0" which returns a great number of
> hits (on a test index of 100,000 documents), but I see only the query
> cache growing and the filter cache always empty. Is this normal? I've
> checked all the cache configuration but I can't tell whether filters
> are auto-generated from normal queries.

There is currently no syntax in the standard request handler that understands filters. Converting certain "heavy" term queries to filters when they have a zero boost was something Doug pointed me at, and I borrowed it directly from Nutch very early on, before Solr had its own caching. The optimization code is still sort-of in Solr, but
- it's not called by default anymore... people needing faceted browsing currently need their own plugin anyway, and they can then specify filters directly.
- its caching is not integrated into Solr's caching.

Filters *can* be generated and used to satisfy whole queries when the following optimization is turned on in solrconfig.xml:

  <useFilterForSortedQuery>true</useFilterForSortedQuery>

> A more general question: is all the CNET logic of intersecting bitsets
> available through the servlet, or do I have to write some Java code to
> be plugged into Solr?

The nitty-gritty of getting intersection counts is in Solr, but you still need to ask Solr for each facet count individually, and you still need to know which counts to ask for. That's the part you currently still need a custom request handler for.

> In that case, at which level should I do this, perhaps a new
> RequestHandler understanding some new query syntax to exploit filters?

Yes, a new RequestHandler... from there, the easiest way is to pass extra parameters (not changing the query syntax passed as "q").

> We only need to sort on a single precalculated rank field stored as a
> range field, so we don't need relevance scores (and, if I understand
> correctly, not needing scores is a prerequisite for using BitSets).

You can do relevancy scoring *and* facets at the same time... there is no incompatibility there.

-Yonik
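To make that concrete, here is a minimal sketch of the counting loop such a custom request handler might run. SolrIndexSearcher.getDocSet() and DocSet.intersectionSize() are real Solr facilities, but the handler shape, the "make" field, and its values are made up for illustration:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.search.DocSet;
    import org.apache.solr.search.SolrIndexSearcher;

    // Count refinements by intersecting the base result DocSet with a cached
    // DocSet per facet value ("make" and its values are hypothetical).
    public void countFacets(SolrQueryRequest req, Query baseQuery) throws IOException {
        SolrIndexSearcher searcher = req.getSearcher();
        DocSet baseDocs = searcher.getDocSet(baseQuery);        // docs matching the user's query

        String[] makes = { "ford", "fiat", "toyota" };          // hypothetical facet values
        for (int i = 0; i < makes.length; i++) {
            Query filterQ = new TermQuery(new Term("make", makes[i]));
            DocSet filterDocs = searcher.getDocSet(filterQ);    // cacheable filter DocSet
            int count = baseDocs.intersectionSize(filterDocs);  // refinement count for this value
            // write (makes[i], count) into the response...
        }
    }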
One big XML file vs. many HTTP requests
Greetings,

I'm evaluating using Solr under Tomcat to replace a number of text-searching projects that currently use UMass's INQUERY, an older search engine.

One nice feature of INQUERY is that you can create one large SGML file containing lots of records, each bracketed with <DOC> and </DOC> tags. Submitting that big SGML document for indexing goes very fast.

I believe that Solr indexes one document at a time; each document requires a separate HTTP POST. How efficient is making a separate HTTP request per document when there are millions of documents? Do people ever use Solr's or Lucene's API directly for indexing large numbers of documents, and if so, what are the considerations pro and con?

Thanks to Yonik, Chris, and everyone for all your work; Solr looks really great.
Re: One big XML file vs. many HTTP requests
On 5/12/06, Michael Levy <[EMAIL PROTECTED]> wrote:
> How efficient is making a separate HTTP request per document when
> there are millions of documents?

If you use persistent connections and make multiple requests in parallel, there won't be much difference from sending multiple docs per request.

-Yonik
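As an illustration of that approach (a sketch, not a recommended client): a few threads, each POSTing its share of update messages. The URL and document count are assumptions; java.net.HttpURLConnection reuses keep-alive connections automatically as long as each response is fully consumed.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ParallelIndexer {
        // POST one XML update message; draining the response lets the
        // underlying socket be reused (HTTP keep-alive).
        static void post(String xml) throws Exception {
            URL url = new URL("http://localhost:8983/solr/update");  // assumed Solr URL
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            OutputStream out = conn.getOutputStream();
            out.write(xml.getBytes("UTF-8"));
            out.close();
            InputStream in = conn.getInputStream();
            byte[] buf = new byte[1024];
            while (in.read(buf) >= 0) { /* drain so the connection can be reused */ }
            in.close();
        }

        public static void main(String[] args) throws Exception {
            final int THREADS = 4;               // a few parallel senders
            Thread[] workers = new Thread[THREADS];
            for (int i = 0; i < THREADS; i++) {
                final int offset = i;
                workers[i] = new Thread() {
                    public void run() {
                        try {
                            // each thread posts its share of the documents
                            for (int d = offset; d < 1000; d += THREADS) {
                                post("<add><doc><field name=\"id\">" + d + "</field></doc></add>");
                            }
                        } catch (Exception e) { e.printStackTrace(); }
                    }
                };
                workers[i].start();
            }
            for (int i = 0; i < THREADS; i++) workers[i].join();
        }
    }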
Re: Leveraging filter cache in queries
On May 12, 2006, at 9:06 AM, Fabio Confalonieri wrote:
> I see our needs have already surfaced on the mailing list: it's the
> refine-search problem you have sometimes called faceted browsing, which
> is the basis of CNET's browsing architecture. We have ads in different
> categories, each with different attributes ("fields" in Lucene
> language): say the motors-car category has make, model, price, and
> color, while real-estates-houses has bathroom ranges, bedroom ranges,
> etc. I understand you have developed Solr to have a filter cache
> storing bitsets of search results, so that there is a fast way to
> intersect those bitsets, count the resulting sub-queries, and present
> the counts for refinement searches (I have read the CNET announcement,
> the Nines thread, and some other related threads).

As Yonik has pointed out, Solr provides some nice facilities to build upon, but the actual implementation is still custom for this sort of thing. For example, here's the (pseudo)code for how my intersecting BitSet (soon to become DocSet) processing works:

    private Query createConstraintMask(final Map facetCache, String[] constraints,
                                       BitSet constraintMask, IndexReader reader)
        throws ParseException, IOException {
      // BooleanQuery used for all full-text expression constraints, but not for facets
      Query query = new BooleanQuery();

      constraintMask.set(0, constraintMask.size());  // light up all documents initially

      if (constraints != null) {
        // Loop over all constraints, ANDing all cached bit sets with the constraint mask
        for (String constraint : constraints) {
          if (constraint == null || constraint.length() == 0) continue;

          // constraint looks like this: [-]field:value
          int colonPosition = constraint.indexOf(':');
          if (colonPosition <= 0) continue;

          String field = constraint.substring(0, colonPosition);
          boolean invert = false;
          if (field.startsWith("-")) {
            invert = true;
            field = field.substring(1);
          }
          String value = constraint.substring(colonPosition + 1);

          BitSet valueMask;
          if (!field.equals("?")) {
            // facetCache is from a custom Solr cache currently
            Map fieldMap = (Map) facetCache.get(field);
            if (fieldMap == null) continue;  // field name doesn't correspond to predefined facets

            valueMask = (BitSet) fieldMap.get(value);
            if (valueMask == null) {
              valueMask = new BitSet(constraintMask.size());
              System.out.println("invalid value requested for field " + field + ": " + value);
            }
          } else {
            Query clause = null;  // some query from parsing "value" (elided here)
            // this should change to get the DocSet from Solr's facilities :)
            QueryFilter filter = new QueryFilter(clause);
            valueMask = filter.bits(reader);
          }

          if (!invert) {
            constraintMask.and(valueMask);
          } else {
            // This is what would be nice for DocSets to be capable of
            constraintMask.andNot(valueMask);
          }
        }
      }

      if (((BooleanQuery) query).getClauses().length == 0) {
        query = new MatchAllDocsQuery();
      }

      return query;
    }

And then basically it gets called like this in my custom handler:

    BitSet constraintMask = new BitSet(reader.numDocs());
    Query query = createConstraintMask(facetCache, req.getParams("constraint"),
                                       constraintMask, reader);
    DocList results = req.getSearcher().getDocList(query, new BitDocSet(constraintMask),
                                                   sort, req.getStart(), req.getLimit());

[critique of this code more than welcome!]

My client (Ruby on Rails) POSTs parameters that look like this:

    constraint=#{invert}#{field}:#{constraint[:value]}

It works really well even before my refactoring to use Solr's DocSet and caching capabilities, and I'm sure it'll do even better leveraging those provided facilities. Really nice stuff!
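As a guess at what that refactoring might look like (speculative, not Erik's actual code): with SolrIndexSearcher.getDocSet() consulting Solr's filterCache, the custom facetCache and raw BitSets could give way to DocSet operations:

    // Speculative sketch of the DocSet-based variant, not the actual refactoring.
    DocSet valueDocs = searcher.getDocSet(new TermQuery(new Term(field, value)));
    constraintDocs = constraintDocs.intersection(valueDocs);  // AND with the running mask
    int remaining = constraintDocs.size();                    // docs still matching
    // an andNot()-style operation on DocSet would cover the inverted ("-field:value") case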
> A more general question: is all the CNET logic of intersecting bitsets
> available through the servlet, or do I have to write some Java code to
> be plugged into Solr?

Currently you have to piece it together. The goal is to build these facilities more into the core, but we should do so based on folks implementing it themselves and contributing it, so that we can compare the needs that others have and come up with some great groundwork in the faceted-browsing area, just as Solr itself has built above raw Lucene. So let's all flesh this stuff out, compare/contrast real-world working implementations, and factor the common pieces on top.

As an example of another facility I've just added on top: the ability to return all terms that match a client-provided prefix. This is to enable Google Suggest-like convenience, so that when someone types "Yo" and pauses, an Ajaxified UI will hit my Rails app, which in turn will ping Solr with the prefix, and a custom request handler will respond with the matching terms.
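One plausible way to implement such a prefix lookup with stock Lucene (a sketch of the general technique, not Erik's actual handler; the method name and signature are made up):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    // Collect all indexed terms in one field that start with the given prefix.
    public List matchingTerms(IndexReader reader, String field, String prefix)
            throws IOException {
        List matches = new ArrayList();
        TermEnum terms = reader.terms(new Term(field, prefix));  // seek to first term >= prefix
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field) || !t.text().startsWith(prefix))
                    break;  // walked past the prefix range
                matches.add(t.text());
            } while (terms.next());
        } finally {
            terms.close();
        }
        return matches;
    }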
Re: One big XML file vs. many HTTP requests
On May 12, 2006, at 1:02 PM, Michael Levy wrote:
> One nice feature of INQUERY is that you can create one large SGML file
> containing lots of records, each bracketed with <DOC> and </DOC> tags.
> Submitting that big SGML document for indexing goes very fast. I
> believe that Solr indexes one document at a time; each document
> requires a separate HTTP POST.

Actually, adding multiple documents per POST is possible.

> How efficient is making a separate HTTP request per document when
> there are millions of documents? Do people ever use Solr's or Lucene's
> API directly for indexing large numbers of documents, and if so, what
> are the considerations pro and con?

Maybe Solr could evolve a facility for doing these types of bulk operations without HTTP, but still using Solr's engine somehow via the API directly. I guess this gets tricky when you have a live Solr system up and juggling write locks, though. But currently going through HTTP is the only way, and it's likely not to be that much of a bottleneck, especially given that you can post multiple documents at a time (the wiki has an example, but I can't get to the web at the moment to post the link).

Erik
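For reference until the wiki is handy, a multi-document add message looks like this (the field names here are just illustrative):

    <add>
      <doc>
        <field name="id">1</field>
        <field name="title">first document</field>
      </doc>
      <doc>
        <field name="id">2</field>
        <field name="title">second document</field>
      </doc>
    </add>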