Leveraging filter cache in queries

2006-05-12 Thread Fabio Confalonieri

Hello,

I've just found Lucene and Solr and I'm thinking of using them in our current
project, essentially an ads portal (something very similar to
www.oodle.com).

I see our needs have already surfaced on the mailing list: it's the
refine-search problem you have sometimes called faceted browsing, which is the
basis of the CNET browsing architecture. We have ads in different categories,
each with different attributes ("fields" in Lucene parlance); say the
motors-car category has make, model, price, and color, while
real-estates-houses has bathroom ranges, bedroom ranges, etc.

I understand you developed Solr partly to have a filter cache storing the
bitsets of search results, giving a fast way to intersect those bitsets,
count the resulting sub-queries, and present the counts for refinement
searches (I have read the CNET announcement, the NINES thread, and some
other related threads).

Actually, we had thought of storing, for every category, the possible
sub-query attributes with their possible values/ranges in a MySQL database
(which we use for all other non-search tasks), in a similar way to how you at
CNET store the possible sub-queries of a query in a Lucene document.

Now, what I haven't understood is whether the Solr StandardRequestHandler
automatically creates and caches filters from normal queries submitted to the
Solr select servlet, possibly with some syntax hint.
I tried a query like "+field:value^0", which returns a great number of hits
(out of 100,000 test documents), but I see only the query cache growing while
the filter cache stays empty. Is this normal? I've checked all the cache
configuration but I still can't tell whether filters are auto-generated from
normal queries.

A more general question: is all the CNET logic of intersecting bitsets
available through the servlet, or do I have to write some Java code to be
plugged into Solr?
In that case, what is the correct level at which to do this; perhaps a new
RequestHandler understanding some new query syntax to exploit filters?

We only need to sort on a single, precalculated rank field stored as a range
field, so we don't need relevance and consequently don't need scores (not
needing them is, if I understand correctly, a prerequisite for using BitSets).
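The intersection-counting idea described above can be sketched with plain java.util.BitSet, standing in for Solr's cached filter bitsets (the class, document IDs, and query names here are hypothetical, not Solr API):

```java
import java.util.BitSet;

public class FacetCountSketch {
    // Facet count = size of the intersection of the base query's bitset
    // with a cached filter's bitset (bit i set = document i matches).
    static int facetCount(BitSet baseQueryDocs, BitSet filterDocs) {
        BitSet intersection = (BitSet) baseQueryDocs.clone();
        intersection.and(filterDocs);
        return intersection.cardinality();
    }

    public static void main(String[] args) {
        BitSet baseQuery = new BitSet(); // e.g. q=category:motors-car
        baseQuery.set(0); baseQuery.set(2); baseQuery.set(3); baseQuery.set(5);

        BitSet colorRed = new BitSet(); // e.g. a cached filter for color:red
        colorRed.set(2); colorRed.set(5); colorRed.set(7);

        System.out.println(facetCount(baseQuery, colorRed)); // prints 2 (docs 2 and 5)
    }
}
```

Since and() works a machine word at a time, these intersections stay cheap even for large indexes, which is what makes per-request refinement counts feasible.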

Thank you; I hope I have explained my questions clearly.

Fabio

PS: I think Solr and Lucene are really great work!
I'll be happy to add our project (for a major press group here in Italy) to
the public websites list in the Solr wiki once we have finished.

--
View this message in context: 
http://www.nabble.com/Leveraging-filter-chache-in-queries-t1607377.html#a4357730
Sent from the Solr - User forum at Nabble.com.



Finding documents with undefined field

2006-06-06 Thread Fabio Confalonieri

Hello,
I would like to search for all documents where a field is not defined.
Say you have some documents with title_s defined and some without:
I would like to retrieve all the documents without title_s.

I haven't found a way to do it; perhaps it's not possible. I would like to
know if there is an alternative to setting a default value of "undefined" for
title_s on all documents.

Thank you

Fabio Confalonieri




Re: Finding documents with undefined field

2006-06-07 Thread Fabio Confalonieri

Thank you, Hoss (you are all always very responsive...),

Actually, I've developed my own FacetRequestHandler, extending the query
format and adding a showfacet parameter (it's a bit customized to our needs,
but I'd like to publish it when we have finished).
What I do is merge some ideas from the forum; my query is now in
three parts
  q=query;sort;filters
where filters is a list of query clauses separated by commas. I parse it to
get filterField and filterValue, then for every filter:

   
filterList.add(QueryParsing.parseQuery(createQueryString("filterField:filterValue",
defaultField, req.getSchema())));

Then I use filterList in the main query:

DocListAndSet results =
req.getSearcher().getDocListAndSet(query,filterList,sort,...

Then, if requested via the showfacets parameter, I get the facets by
extracting and parsing a facetXML descriptor from a facet-type document in
the index, querying for the facet descriptor of the current category, which I
get from the filter list (similar to CNET, I think).

To calculate the count for every facet, composed of a field and a value,
based on the main query, I use

facetCount =
searcher.numDocs(QueryParsing.parseQuery("facetField:facetValue", "",
req.getSchema()), results.docSet);

Now, how could I get a filter for the missing field?
Can I use the unbounded-range trick, simply adding a facet (and filter) like
this:

facetCount = searcher.numDocs(QueryParsing.parseQuery("-fieldName:[* TO
*]", "", req.getSchema()), results.docSet);

...since I use results.docSet of the base query (the same as for filters, I
think)?
Or is there a better way?

Thank you again

   Fabio



Re: Finding documents with undefined field

2006-06-08 Thread Fabio Confalonieri


Chris Hostetter wrote:
> 
> 
> There are a couple of things you can do here...
> 
> 1) Use the same approach I described before: if you have a uniqueKey,
> search for all things with a key and then exclude things that have a value
> in your field.  Since you are writing a request handler, you could also
> programmatically build up a BooleanQuery containing a MatchAllDocsQuery
> object and your prohibited clause, even if you don't have a uniqueKey.
> 
> 2) You can fetch the DocSet of all documents that *do* have a value for
> that field, then get the inverse, and use that for your facet counts.
> This is something that was discussed before in a thread Erik started...
> ..
> 
> 

OK, in the end I tried the easy way: when I find a particular predefined
"undefined-value" in a filter or facet, I convert the query to parse to:

   "type:ad AND -" + field + ":[* TO *]"

"type:ad" matches all my documents; the only other type I have is "facets"
(many thanks for the unbounded-range trick).

I cannot see any particular slowness (but I'm testing with 50,000 docs for
now), perhaps thanks to Solr's ConstantScoreRangeQuery conversion;
should I worry at bigger numbers, say 300,000 docs?

My two cents on Solr development: a "DocSet.andNot(DocSet other)"
capability would surely be precious for optimizing the undefined-field and
other inverse-query problems.
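For what it's worth, the requested andNot operation already exists on plain java.util.BitSet, so the undefined-field count can be sketched like this (the class and document IDs are hypothetical stand-ins for Solr's DocSets):

```java
import java.util.BitSet;

public class UndefinedFieldSketch {
    // Count documents in the base result set that have NO value for a field:
    // the bitset counterpart of the "-field:[* TO *]" query clause.
    static int missingFieldCount(BitSet baseDocs, BitSet docsWithField) {
        BitSet missing = (BitSet) baseDocs.clone();
        missing.andNot(docsWithField);
        return missing.cardinality();
    }

    public static void main(String[] args) {
        BitSet base = new BitSet(); // docs matching the base query
        base.set(0); base.set(1); base.set(4);

        BitSet hasTitle = new BitSet(); // docs where title_s is defined
        hasTitle.set(1); hasTitle.set(2);

        System.out.println(missingFieldCount(base, hasTitle)); // prints 2 (docs 0 and 4)
    }
}
```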

Thanks again

Fabio



International Charsets in embedded XML

2006-06-13 Thread Fabio Confalonieri

Here I am again with a subtle problem:

I need to store XML in a document field. I declare it as a string and
surround it in CDATA when I post the add XML.



International Charsets in embedded XML

2006-06-13 Thread Fabio Confalonieri

(sorry, the last one got posted by mistake)

Here I am again, with charset-encoding problems:

I need to store XML in a document field. I declare it as a string and
surround it in CDATA when I post the add XML.
Now the problem is that I have some international characters in the XML: say
ì or à, and also € (I don't know if you can read these).

When I get the XML field back from Solr, strange things happen:

- first: € gets converted to ? (I see it in the index, looking with Luke)

- if there is an ì (accented i), I get malformed XML back when viewing with
Firefox or IE:



[Response excerpt; the XML tags were mangled in the archive. The field value
contains "Autocaravan ìMansardato"; immediately after the ì, "<" stops being
escaped in the output, and further down (around a second "Semintegrale") the
output continues where it should have been escaped all along.]


But if I get the same document in my request handler (as a Document
structure), I have no problem parsing the XML and getting the correct
characters. I have traced XML.escape and the problem is not there, so it's
somewhere between XMLWriter and Jetty (I've tried the latest one, 5.1.11).

- if I put some international characters in a normal string field, I see Solr
stores the UTF-8-encoded (I think) characters in a string field just as in a
text field type.

The question is: apart from the malformed-XML issue, what is the best way
to deal with international charsets?

Thank you

Fabio



Re: International Charsets in embedded XML

2006-06-13 Thread Fabio Confalonieri


Klaas-2 wrote:
> 
> Are you sending Content-Type headers with the appropriate charset
> indicated?  Is your XML fully escaped in your update message?
> 

...no, actually I simply do

URLConnection conn = url.openConnection();
conn.setRequestProperty("ContentType", "text/xml"); // (wrong header name, no charset)
conn.setDoOutput(true);
wr = new OutputStreamWriter(conn.getOutputStream()); // platform-default encoding
wr.write(data);
wr.flush();

to post the add XML, and my XML is embedded in a CDATA section without
further escaping... do I have to do something else?

I'm getting data from a MySQL DB, and I found some problems in retrieving
the data from there.

I've made some steps forward by connecting to the DB with
"characterEncoding=utf8" in the JDBC URL, and then converting with:

new String(mysqlXMLField.getBytes("latin1"));

But I'm really not into charsets and encodings...





Re: International Charsets in embedded XML

2006-06-15 Thread Fabio Confalonieri

OK, thanks to your posts I've read some basics on encoding and made some
changes to my code: now it's all much clearer... but I still have some
problems.

This is what I do (I don't know if this can help someone having the same
problems I had):

- I get data from the DB, telling the JDBC connector to use UTF-8.

- then I convert to the Java String internal encoding (UTF-16, I have
learned) this way:

new String(rs.getBytes(rsField), "UTF-8")

This gets the UTF-8 byte array from my result set (from MySQL), telling the
String constructor that the array is to be interpreted as UTF-8.

When I have to write the update XML document to Solr:

URLConnection conn = url.openConnection();
conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
conn.setDoOutput(true);
wr = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
wr.write(data);
wr.flush();

So I'm sure everything is converted back to UTF-8 when writing to the Solr
update URL.
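As a sanity check of the round trip described above (a literal string stands in for the bytes coming out of rs.getBytes, and the modern StandardCharsets constants stand in for the charset-name strings used in the mail):

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        String original = "città € à ì";

        // Bytes as the JDBC driver hands them back when told to use UTF-8:
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset is lossless:
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(original)); // true

        // Decoding the same bytes as Latin-1 mangles every multi-byte character:
        String wrong = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(original)); // false
    }
}
```

The key point is that the charset must be named explicitly at every boundary (JDBC, String construction, OutputStreamWriter); any step left to the platform default can silently corrupt the data.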

This way everything is fine when getting normal fields back from documents
(we get back all our diacritical characters and the Euro sign)... but:

- I cannot search using diacritics.
If I have a doc with a field containing "città", I cannot get it back with
q=field:città (in the URL the à gets percent-encoded as "citt%E0", which is
the Latin-1 byte, not UTF-8).
The strange thing is that with an old Solr on Jetty 6.0.beta the diacritic
search was OK, but the responses came back from Solr doubly UTF-8 encoded
(we had to decode twice). With the latest Solr on Jetty 5.1.x, responses are
singly UTF-8 encoded (as you would expect) but diacritic search does not
work. Is there a particular way to do this?

- I still have problems getting back fields stored as XML that contain
diacritics. I've followed your advice and escaped the < sign myself, but the
result is the same as using CDATA (I don't use DOM here); by the way, why
did you say not to use CDATA? I get the same malformed XML I showed you
in my first post.
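The "citt%E0" observation above can be reproduced with URLEncoder: %E0 is the Latin-1 percent-encoding of à, while the UTF-8 form is the two-byte %C3%A0:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryUrlEncoding {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The query value "città" percent-encoded with two different charsets:
        String utf8   = URLEncoder.encode("città", "UTF-8");      // multi-byte encoding
        String latin1 = URLEncoder.encode("città", "ISO-8859-1"); // single-byte encoding

        System.out.println(utf8);   // citt%C3%A0
        System.out.println(latin1); // citt%E0
    }
}
```

So when the servlet container decodes the URL assuming a different charset than the client used to encode it, the query term never matches the indexed one.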

Thank you again

   Fabio




Re: International Charsets in embedded XML

2006-06-16 Thread Fabio Confalonieri

OK, I found the clue:
the problem is Jetty; using Tomcat everything works fine.

I can search diacritics (I found Jetty required an extra UTF-8 encoding of
query values in the URL)
AND
there are no more problems in responses with fields containing XML with
diacritics and the Euro sign (and everything else, I suppose).

It's a pity, because Jetty is much slimmer to deploy and install, and perhaps
faster; but anyway, I think these problems should be documented in some
manner.

Thanks to all

Fabio




Re: who uses Solr?

2006-06-19 Thread Fabio Confalonieri

We (Zero Computing S.r.l. of Italy, www.zero.it) are now using Solr as the
index of a classified-ads portal for our customer Gruppo Espresso,
"one of the leading media groups in Italy with interests in publishing,
radio, advertising, internet businesses and television" (from their site,
http://www.gruppoespresso.it/gruppoesp/eng/index.jsp).

In particular, I've developed a custom request handler to achieve faceted
browsing (like you can see at www.oodle.com).

We plan to deploy the portal at the end of July; as I already said, as soon
as we are online I'll update the wiki.

Happy presentation!

Fabio

-- 
Ing. Fabio Confalonieri
Zero Computing S.r.l. (www.zero.it)



Editing wiki page "Powered by Solr"

2007-03-23 Thread Fabio Confalonieri

I have a problem posting an update to the "Powered by Solr" wiki page.

I would like to add the line:
 * [http://annunci.repubblica.it La Repubblica Newspaper Classifieds] (in
Italian) uses Solr for faceted browsing/filtering through classifieds of one
of the main Italian Newspapers

But I receive this error:
"Sorry, can not save page because 'annunci.repubblica.it' is not allowed in
this wiki."

I understand "annunci.repubblica.it" is somehow blacklisted, but I cannot
figure out why.

Sorry for posting here; I could not find a reference on wiki
posting/editing.

Thank you

Fabio Confalonieri







Re: Editing wiki page "Powered by Solr"

2007-03-26 Thread Fabio Confalonieri


Chris Hostetter wrote:
> 
> 
> I've added a link to the main newspaper site instead, and clarified that
> the classifieds use Solr.
> ...
> 

It seems it is the third-level domain "annunci" (or its presence in the URL)
that is banned, since the domain repubblica.it is OK: curious...

Anyway, thank you Hoss!

  Fabio Confalonieri