Leveraging filter cache in queries
Hello, I've just found Lucene and Solr and I'm thinking of using them in our current project, essentially an ads portal (something very similar to www.oodle.com). I see our needs have already surfaced on the mailing list: it's the refine-search problem, which you have sometimes called faceted browsing, and which is the basis of the CNet browsing architecture. We have ads in different categories, and each category has different attributes ("fields" in Lucene language): say, the motors-car category has make, model, price and color, while real-estates-houses has bathroom ranges, bedroom ranges, etc.

I understand you developed Solr with a filter cache storing BitSets of search results, so those BitSets can be intersected quickly to count the results of sub-queries and present the counts for refinement searches (I have read the CNet announcement, the Nines-related thread and some other related threads). We had actually thought of storing, for every category, the possible sub-query attributes with their possible values/ranges in a MySQL database (which we use for every other non-search-related task), similarly to how you and CNet store the possible sub-queries of a query in a Lucene document.

Now, what I haven't understood is whether the Solr StandardRequestHandler automatically creates and caches filters from normal queries submitted to the Solr select servlet, possibly with some syntax clue. I tried a query like "+field:value^0" which returns a great number of hits (out of a test total of 100,000 documents), but I see only the query cache growing while the filter cache stays empty. Is this normal? I've checked all the cache configuration but I can't tell whether filters are auto-generated from normal queries.

A more general question: is all the CNet logic of intersecting BitSets available through the servlet, or do I have to write some Java code to be plugged into Solr?
In that case, what is the correct level at which to do this? Perhaps a new RequestHandler understanding some new query syntax to exploit filters? We only need a sort on a single, precalculated rank field stored as a range field, so we don't need relevance and consequently don't need scores (which is a prerequisite for using BitSets, if I understand correctly). Thank you, I hope I have explained my doubts well. Fabio

PS: I think Solr and Lucene are really great work! I'll be happy to add our project (for a major press group here in Italy) to the public websites page on the Solr wiki when we have finished.

-- View this message in context: http://www.nabble.com/Leveraging-filter-chache-in-queries-t1607377.html#a4357730 Sent from the Solr - User forum at Nabble.com.
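The bitset intersection this question is about can be illustrated with plain java.util.BitSet. This is a sketch only: Solr's actual DocSet implementations and filter-cache wiring differ, and the document ids below are made up.

```java
import java.util.BitSet;

public class FacetCountSketch {
    // Count how many documents match both the base query and a facet filter
    // by intersecting their cached bitsets and taking the cardinality.
    static int facetCount(BitSet baseQueryDocs, BitSet facetFilterDocs) {
        BitSet intersection = (BitSet) baseQueryDocs.clone(); // don't mutate the cached set
        intersection.and(facetFilterDocs);
        return intersection.cardinality();
    }

    public static void main(String[] args) {
        // Hypothetical index of 10 documents: docs 0-5 match the base query,
        // docs 2, 3 and 7 have make:Fiat.
        BitSet baseQuery = new BitSet(10);
        baseQuery.set(0, 6);
        BitSet makeFiat = new BitSet(10);
        makeFiat.set(2);
        makeFiat.set(3);
        makeFiat.set(7);

        System.out.println(facetCount(baseQuery, makeFiat)); // prints 2
    }
}
```

Because the filter bitsets are cached, computing the count for every refinement link is just one `and` plus one `cardinality` per facet value, with no extra index access.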
Finding documents with undefined field
Hello, I would like to search for all documents where a field is not defined. Say you have some documents with title_s defined and some documents without title_s: I would like to obtain all documents without title_s. I haven't found a clue, so perhaps it's not possible; I would like to know if there is an alternative to setting a default value "undefined" for title_s on all documents. Thank you, Fabio Confalonieri
Re: Finding documents with undefined field
Thank you Hoss (you are all always very responsive). Actually, I've developed my own FacetRequestHandler, extending the query format and adding a showfacets parameter (it's a little customized to our needs, but I'd like to publish it when we have finished). What I do is a merge of some ideas from the forum. My query now has three parts, q=query;sort;filters, where filters is a list of query clauses separated by commas that I parse to get filterField and filterValue; then, for every filter:

filterList.add(QueryParsing.parseQuery(createQueryString("filterField:filterValue", defaultField, req.getSchema())));

Then I use filterList in the main query:

DocListAndSet results = req.getSearcher().getDocListAndSet(query, filterList, sort, ...);

Then, if requested with the showfacets parameter, I get the facets by extracting and parsing a facet XML descriptor from a facet-type document in the index, querying for the facet descriptor of the current category that I get from the filter list (similar to CNET, I think). To calculate the count for every facet, composed of a field and a value and based on the main query, I use:

facetCount = searcher.numDocs(QueryParsing.parseQuery("facetField:facetValue", "", req.getSchema()), results.docSet);

Now, how could I get a filter for the missing field? Can I use the unbounded range trick, simply adding a facet (and filter) like this:

facetCount = searcher.numDocs(QueryParsing.parseQuery("-fieldName:[* TO *]", "", req.getSchema()), results.docSet);

...since I use results.docSet of the base query (the same as for the filters, I think)? Or is there a better way? Thank you again, Fabio
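The three-part q=query;sort;filters format described above could be split along these lines. This is a hypothetical sketch: the handler's real parsing code is not shown in the post, and the class and field names here are made up.

```java
// Splits a "query;sort;filters" request string into its parts, where
// filters is a comma-separated list of field:value clauses.
public class QuerySplitSketch {
    final String query;
    final String sort;
    final String[] filters;

    QuerySplitSketch(String q) {
        String[] parts = q.split(";", 3);
        query = parts[0];
        sort = parts.length > 1 ? parts[1] : "";
        filters = (parts.length > 2 && !parts[2].isEmpty())
                ? parts[2].split(",")
                : new String[0];
    }

    public static void main(String[] args) {
        QuerySplitSketch s =
            new QuerySplitSketch("make:Fiat;rank desc;category:motors-car,color:red");
        System.out.println(s.query);          // prints make:Fiat
        System.out.println(s.sort);           // prints rank desc
        System.out.println(s.filters.length); // prints 2
    }
}
```

Each element of `filters` would then go through QueryParsing.parseQuery and into the filterList, as in the snippet above.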
Re: Finding documents with undefined field
Chris Hostetter wrote:
>
> There are a couple of things you can do here...
>
> 1) Use the same approach I described before: if you have a uniqueKey,
> search for all things with a key and then exclude things that have a value
> in your field. Since you are writing a request handler, you could also
> programmatically build up a BooleanQuery containing a MatchAllDocsQuery
> object and your prohibited clause, even if you don't have a uniqueKey.
>
> 2) You can fetch the DocSet of all documents that *do* have a value for
> that field, and then get the inverse, and use that for your facet counts.
> This is something that was discussed before in a thread Erik started...
> ..
>

Ok, in the end I tried the easy way: when I find a particular predefined "undefined-value" in a filter or facet, I convert the query to parse into:

"type:ad AND -" + field + ":[* TO *]"

"type:ad" matches all my documents; the only other type I have is "facets" (many thanks for the unbounded range trick). I cannot see any particular slowness (but I'm testing with 50,000 docs now), perhaps thanks to Solr's ConstantScoreRangeQuery conversion; should I worry with bigger numbers, say 300,000 docs?

My two cents on Solr development: surely a "DocSet.andNot(DocSet other)" capability would be precious for optimizing the undefined-field and other inverse-query problems. Thanks again, Fabio
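The andNot idea suggested above can be sketched with plain java.util.BitSet (an illustration only, not Solr's DocSet API): subtract the set of documents that do have the field from the set of all documents, and what remains is the "field undefined" set.

```java
import java.util.BitSet;

public class UndefinedFieldSketch {
    // Documents without a value for a field = all documents minus
    // the documents that do have a value for it.
    static BitSet docsMissingField(BitSet allDocs, BitSet docsWithField) {
        BitSet missing = (BitSet) allDocs.clone(); // keep the cached set intact
        missing.andNot(docsWithField);
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical 8-document index; docs 1, 4 and 6 have a title_s value.
        BitSet all = new BitSet(8);
        all.set(0, 8);
        BitSet withTitle = new BitSet(8);
        withTitle.set(1);
        withTitle.set(4);
        withTitle.set(6);

        System.out.println(docsMissingField(all, withTitle).cardinality()); // prints 5
    }
}
```

Compared with parsing a "type:ad AND -field:[* TO *]" query per facet, a set-level andNot skips the query-parsing and scoring machinery entirely, which is why it would help the inverse-query cases.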
International Charsets in embedded XML
Here I am again with a subtle problem: I need to store XML in a document field. I declare it as a string and surround it in CDATA when I post the add XML.
International Charsets in embedded XML
(Sorry, the last one got posted by mistake.) Here I am again, with charset encoding problems: I need to store XML in a document field. I declare it as a string and surround it in CDATA when I post the add XML. Now, the problem is that I have some international chars in the XML: say ì or à, and also € (I don't know if you can read these). When I get the XML field back from Solr, strange things happen:

- First: € gets converted to ? (I see it in the index looking with Luke).

- If there is an ì (accented i), I get malformed XML back with Firefox and IE. A sample of the output:

00 /relazioni/... - Autocaravan ìMansardato ^ HERE begins the problem: from now on no more escaping of "<" Semintegrale HERE the output continues, as it should have been escaped after the problem above: - Semintegrale

But if I fetch the same document in my request handler (as a Document structure), I have no problem parsing the XML and getting the correct char. I have traced XML.escape and the problem is not there, so it's somewhere between XMLWriter and Jetty (I've tried the latest one, 5.1.11).

- If I put some international chars in a normal string field, I see Solr stores the UTF-8 (I think) encoded chars, in a string field as in a text field type.

The question is: apart from the malformed XML issue, what is the better way to deal with international charsets? Thank you, Fabio
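The € turning into ? is consistent with a Latin-1 encoding step somewhere in the chain: ISO-8859-1 simply has no code point for the Euro sign, while UTF-8 encodes it in three bytes. A small stdlib check (illustrative only; it does not reproduce the Solr/Jetty pipeline itself):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EuroEncodingCheck {
    public static void main(String[] args) {
        String euro = "\u20AC"; // U+20AC, the Euro sign

        // UTF-8 needs three bytes for it.
        System.out.println(euro.getBytes(StandardCharsets.UTF_8).length); // prints 3

        // ISO-8859-1 cannot represent it, so any Latin-1 step is lossy...
        boolean canEncode = Charset.forName("ISO-8859-1").newEncoder().canEncode('\u20AC');
        System.out.println(canEncode); // prints false

        // ...and getBytes substitutes '?' (0x3F) for unmappable chars,
        // matching the '?' seen in the index with Luke.
        System.out.println((char) euro.getBytes(StandardCharsets.ISO_8859_1)[0]); // prints ?
    }
}
```

Characters like ì and à do exist in Latin-1 (single bytes E0-FF), which is why they survive a lossy step that destroys € and instead show up as mojibake or break the XML escaping downstream.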
Re: International Charsets in embedded XML
Klaas-2 wrote:
>
> Are you sending Content-Type headers with the appropriate charset
> indicated? Is your XML fully escaped in your update message?
>

...no, actually I simply do:

URLConnection conn = url.openConnection();
conn.setRequestProperty("ContentType", "text/xml");
conn.setDoOutput(true);
wr = new OutputStreamWriter(conn.getOutputStream());
wr.write(data);
wr.flush();

to post the add XML, and my XML is embedded in a CDATA section without further escaping... do I have to do something else? I'm getting data from a MySQL db, and I found some problems there too when retrieving data. I've made some steps forward by connecting to the db with "characterEncoding=utf8" in the JDBC URL, and then converting with:

new String(mysqlXMLField.getBytes("latin1"));

But I'm really not into charsets and encodings...
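The getBytes("latin1") conversion above works when UTF-8 bytes have been mis-decoded as Latin-1: since Latin-1 maps bytes to chars one-to-one, re-encoding the garbled string as Latin-1 recovers the original bytes, which can then be decoded as UTF-8. The snippet in the post leaves the second decode to the platform default charset; a fully explicit version, as a sketch, looks like this:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Undo a UTF-8-bytes-read-as-Latin-1 mis-decode: Latin-1 maps bytes to
    // chars one-to-one, so encoding back to Latin-1 restores the raw bytes.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "citt\u00E0"; // "città"
        // Simulate the mis-decode: UTF-8 bytes interpreted as Latin-1.
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(repair(garbled).equals(original)); // prints true
    }
}
```

Fixing the charset at the JDBC connection, as described next in the thread, is of course the cleaner solution; this repair is only for data that has already been read with the wrong charset.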
Re: International Charsets in embedded XML
Ok, thanks to your posts I've read some basics on encoding and made some changes to my code: now it's all much clearer... but I still have some problems. This is what I do (I don't know if this can help someone having the same problems I had):

- I get data from the DB, telling the JDBC connector to use UTF-8.

- Then I convert to Java's internal string encoding (UTF-16, I have learned) this way:

new String(rs.getBytes(rsField), "UTF-8")

This gets the UTF-8 byte array from my result set (from MySQL) and tells the String constructor that the array is to be interpreted as UTF-8.

- When I write the update XML document to Solr:

URLConnection conn = url.openConnection();
conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
conn.setDoOutput(true);
wr = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
wr.write(data);
wr.flush();

So I'm sure everything is converted back to UTF-8 when writing to the Solr update URL. This way everything is fine getting normal fields back from documents (we get back all our diacritical chars and the Euro sign)... but:

- I cannot search using diacriticals. If I have a doc with a field containing "città", I cannot get it back with q=field:città (in the URL the à gets percent-encoded as the single byte E0, like this: "citt%E0", which is Latin-1 rather than UTF-8). The strange thing is that with an old Solr under Jetty 6.0.beta the diacritical search worked, but responses came back from Solr doubly UTF-8 encoded (we had to decode twice). With the latest version of Solr under Jetty 5.1.x, responses are singly UTF-8 encoded (as you would expect) but diacritical search does not work. Is there a particular way to do this?

- I still have problems getting back fields stored as XML that contain diacriticals: I've followed your advice and escaped the < sign myself, but the result is the same as using CDATA (I don't use DOM here). By the way, why did you say not to use CDATA? I get the same malformed-XML problem I showed you in my first post.
Thank you again, Fabio
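The "citt%E0" in the URL points at the query string being percent-encoded as Latin-1 rather than UTF-8: %E0 is the single Latin-1 byte for à, while UTF-8 produces the two bytes %C3%A0. The difference can be seen with java.net.URLEncoder (a sketch of the encoding side only; whether the servlet container decodes the URL with the right charset is a separate issue, as the Jetty/Tomcat follow-up shows):

```java
import java.net.URLEncoder;

public class QueryEncodingSketch {
    public static void main(String[] args) throws Exception {
        String value = "citt\u00E0"; // "città"

        // UTF-8 percent-encoding: à becomes two bytes, %C3%A0.
        System.out.println(URLEncoder.encode(value, "UTF-8"));      // prints citt%C3%A0

        // Latin-1 percent-encoding: à is the single byte %E0,
        // matching the URL observed in the post.
        System.out.println(URLEncoder.encode(value, "ISO-8859-1")); // prints citt%E0
    }
}
```

Encoding the parameter as UTF-8 on the client only helps if the container decodes it as UTF-8 too, which is exactly where the two servlet containers behaved differently.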
Re: International Charsets in embedded XML
Ok, I found the clue: the problem is Jetty; using Tomcat everything works fine. I can search diacritics (I found Jetty required an extra UTF-8 encoding of query values in the URL), AND there are no more problems in responses with fields containing XML with diacritics and the Euro sign (and everything else, I suppose). It's a pity, because Jetty is much slimmer to deploy and install, and perhaps faster, but anyway I think these problems should be documented in some manner. Thanks to all, Fabio
Re: who uses Solr?
We (Zero Computing S.r.l. of Italy, www.zero.it) are now using Solr as the index of a classified-ads portal for our customer Gruppo Espresso: "one of the leading media groups in Italy with interests in publishing, radio, advertising, internet businesses and television" (from their site http://www.gruppoespresso.it/gruppoesp/eng/index.jsp). In particular, I've developed a custom request handler to achieve faceted browsing capability (like you can see on www.oodle.com). We plan to deploy the portal at the end of July; as I already said, as soon as we are online I'll update the wiki. Happy presentation! Fabio -- Ing. Fabio Confalonieri, Zero Computing S.r.l. (www.zero.it)
Editing wiki-page "Powered by Solr"
I have a problem posting an update to the Powered By Solr wiki page. I would like to add the line:

* [http://annunci.repubblica.it La Repubblica Newspaper Classifieds] (in Italian) uses Solr for faceted browsing/filtering through the classifieds of one of the main Italian newspapers

But I receive this error: Sorry, can not save page because "annunci.repubblica.it" is not allowed in this wiki.

I understand "annunci.repubblica.it" is somehow blacklisted, but I cannot work out why. Sorry for posting here; I could not find a reference on wiki posting/editing. Thank you, Fabio Confalonieri
Re: Editing wiki-page "Powered by Solr"
Chris Hostetter wrote:
>
> I've added a link to the main newspaper site instead, and clarified that
> the classifieds use Solr.
> ...

It seems it is the third-level domain "annunci" (or its presence in the URL) that is banned, while the domain repubblica.it is OK: curious... Anyway, thank you Hoss! Fabio Confalonieri