wildcards and German umlauts
Hi all, index searching works if I type a complete word (such as "übersicht"), but there are no hits if I use a wildcard (such as "über*"). Searching with wildcards but without umlauts works fine. Can someone help me? Thanks in advance! Here is my field definition:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" />
  <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt" language="German2" />
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" />
  <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt" language="German2" />
</analyzer>
FunctionQuery in a custom request handler
I'm trying to pull off a "time bias" / "article freshness" thing: boosting recent documents based on a "published_date" field. The reasonable way to do this seems to be a FunctionQuery, but all the examples I find express it through the query parser; I need to do this inside my custom, plugged-in request handler. How do I access the ValueSource for my DateField? I'd like to use a ReciprocalFloatFunction from inside the code, adding it alongside the others in the main BooleanQuery. Thanks for the replies. David
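For anyone who lands here later: whatever the wiring looks like inside a handler, the curve a ReciprocalFloatFunction evaluates is a / (m*x + b). A self-contained sketch of that arithmetic (constants are illustrative, not from the thread):

```java
public class FreshnessBoost {
    // Sketch of the reciprocal boost curve a / (m*x + b), where x would be
    // the published_date value, e.g. document age in days.
    static float recip(float x, float m, float a, float b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        // A brand-new document (age 0) gets the full boost of 1.0,
        // and the boost decays smoothly as the document ages.
        System.out.println(recip(0f, 1f, 1f, 1f)); // 1.0
        System.out.println(recip(9f, 1f, 1f, 1f)); // 0.1
    }
}
```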
highlighting marks wrong words
Hi all, I have a query like this: q=(auto) AND id:(100 OR 1 OR 2 OR 3 OR 5 OR 6)&fl=score&hl.fl=content&hl=true&hl.fragsize=200&hl.snippets=2&hl.simple.pre=%3Cb%3E&hl.simple.post=%3C%2Fb%3E&start=0&rows=10 The default field is content, so I expect that only occurrences of "auto" will be marked. BUT: the occurrences of the id values (100, 1, 2, ...), which happen to also be present in the content field, are marked as well... The result looks like: North American International <b>Auto</b> Show 2007 - Celebrating <b>100</b> years Any ideas? Thanks in advance!
RE: highlighting marks wrong words
I believe changing the "AND id: etc etc" part of the query to its own filter query will take care of your highlighting problem. In other words, try a query like this: q=(auto)&fq=id:(100 OR 1 OR 2 OR 3 OR 5 OR 6)&fl=score&hl.fl=content&hl=true&hl.fragsize=200&hl.snippets=2&hl.simple.pre=%3Cb%3E&hl.simple.post=%3C%2Fb%3E&start=0&rows=10 This could also get you a performance boost if you're querying against this set of ids often.

-----Original Message-----
From: Alexey Shakov [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 15, 2008 6:54 AM
To: solr-user@lucene.apache.org
Subject: highlighting marks wrong words

[...]
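Side note for anyone copying these URLs: hl.simple.pre=%3Cb%3E and hl.simple.post=%3C%2Fb%3E are simply the URL-encoded <b> and </b> tags. A quick sketch (the helper name is made up):

```java
import java.net.URLEncoder;
import java.io.UnsupportedEncodingException;

public class HlParams {
    // Hypothetical helper: URL-encode a highlight tag so it can be
    // passed as hl.simple.pre / hl.simple.post in a request URL.
    static String enc(String tag) {
        try {
            return URLEncoder.encode(tag, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(enc("<b>"));  // %3Cb%3E
        System.out.println(enc("</b>")); // %3C%2Fb%3E
    }
}
```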
Re: field:(-null) returns records where field was not specified
Thanks Chris, this is useful, we can use the query format you suggest, Karen

On Tuesday 15 January 2008 01:13:14 Chris Hostetter wrote:
> Several things in this thread should be clarified (note: order of
> quotations munged for clarity)...
>
> : I had read this page. But I'm not using the "NOT" operator, I'm using
> : the "-" operator. I'm assuming there is a subtle difference between them
> : in that NOT qualifies something else, hence needs 2 terms. Isn't the "-"
> : operator supposed to be a complement to the "+" operator, ie. excludes
> : something rather than requiring it?
>
> "The NOT operator" and "the - operator" are in fact the same thing ... the
> duplicate syntax comes from Lucene trying to appease people that
> want boolean-style operator syntax (AND/OR/NOT) even though the query
> parser is not a boolean syntax.
>
> : > Have you seen this page?
> : > http://lucene.apache.org/java/docs/queryparsersyntax.html
> : >
> : > From that page:
> : > Note: The NOT operator cannot be used with just one term. For example,
> : > the following search will return no results:
> : > NOT "jakarta apache"
>
> In Solr, the query parser can in fact support purely negative queries by
> internally transforming the query; this is noted on the Solr query syntax
> wiki...
>
> http://wiki.apache.org/solr/SolrQuerySyntax
>
> : > > field_name:(-null)
>
> "null" is not a special keyword; if you look at the debugging output when
> doing that query you'll see that it is the same as: -field_name:null
> ... which is a search for all docs containing the string "null" in the
> field "field_name".
>
> : The *:* (star colon star) means "all records". The trick is to use (*:*
> : AND -field:[* TO *]). It's silly, but there it is.
>
> as i mentioned, you can do purely negative queries now, so a simple search
> for -field_name:[* TO *] will find all docs that have no indexed values
> for that field at all.
> : A performance note: we switched from empty fields to fields with a
> : standard 'empty' value. This way we don't have to do a range check to
> : find records with empty fields.
>
> Your mileage may vary depending on how many docs you have with "no value"
> ... this also isn't practical when dealing with numeric, boolean, or date
> based fields. (and depending on how much churn there is in your index,
> the filterCache can probably make the difference negligible on average
> anyway).
>
> -Hoss
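Condensing the forms discussed above into one place (my summary; behavior as described in the thread):

```
field_name:(-null)             parses to -field_name:null, i.e. docs containing the term "null"
-field_name:[* TO *]           all docs with no indexed value in field_name (Solr's pure-negative support)
*:* AND -field_name:[* TO *]   same result, written so it works even without the pure-negative rewrite
```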
XSLT to preprocess XML documents into 'update xml documents' ?
Hi all, I noticed some recent discussion with regard to using XSLT to preprocess XML documents into 'update xml documents': http://www.mail-archive.com/[EMAIL PROTECTED]/msg05927.html I was wondering if there has been any update on this? It is something we would be interested in using. Thanks Karen
Re: XSLT to preprocess XML documents into 'update xml documents' ?
I have not tried it, but check: https://issues.apache.org/jira/browse/SOLR-285

Karen Loughran wrote: [...]
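For anyone curious what such a preprocessing stylesheet looks like, here is a minimal sketch that turns a hypothetical <items>/<item> source document into Solr update XML (the source element names are invented for illustration):

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Map each hypothetical <item> in the source feed to a Solr <doc> -->
  <xsl:template match="/items">
    <add>
      <xsl:for-each select="item">
        <doc>
          <field name="id"><xsl:value-of select="@id"/></field>
          <field name="title"><xsl:value-of select="title"/></field>
        </doc>
      </xsl:for-each>
    </add>
  </xsl:template>
</xsl:stylesheet>
```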
Re: LNS - or - "now i know we've succeeded"
I'm sure N stealth startups are doing this as we speak, and reading this, rubbing hands :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Lance Norskog <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, January 14, 2008 6:09:38 PM
Subject: RE: LNS - or - "now i know we've succeeded"

Now that Microsoft is buying FAST (!!) the open source world needs a matching technology :)

-----Original Message-----
From: Walter Underwood [mailto:[EMAIL PROTECTED]
Sent: Monday, January 14, 2008 7:42 AM
To: solr-user@lucene.apache.org
Subject: Re: LNS - or - "now i know we've succeeded"

Yes, they are reputable. They've been doing consulting with Verity, Ultraseek, and other platforms for many years. --wunder

On 1/12/08 1:22 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> It is pretty cool to see a reputable
> Search company (is ideaeng.com a reputable search consulting company?
Re: highlighting marks wrong words
Thank you! It works correctly with the filter query.

Charlie Jackson schrieb:
> I believe changing the "AND id: etc etc" part of the query to its own
> filter query will take care of your highlighting problem. [...]
Missing Content Stream
Hi Everyone, I am new to solr. I am trying to index xml using an http post as follows:

try {
    String xmlText = "<add>";
    xmlText += "<doc>";
    xmlText += "<fiehld nadme=\"id\">SOLR1000</fiehld>";
    xmlText += "<fiehld nadme=\"name\">Solr, the Enterprise Search Server</fiehld>";
    xmlText += "<fiehld nadme=\"manu\">Apache Software Foundation</fiehld>";
    xmlText += "<fiehld nadme=\"cat\">software</fiehld>";
    xmlText += "<fiehld nadme=\"cat\">search</fiehld>";
    xmlText += "<fiehld nadme=\"features\">Advanced Full-Text Search Capabilities using Lucene</fiehld>";
    xmlText += "<fiehld nadme=\"features\">Optimized for High Volume Web Traffic</fiehld>";
    xmlText += "<fiehld nadme=\"features\">Standards Based Open Interfaces - XML and HTTP</fiehld>";
    xmlText += "<fiehld nadme=\"features\">Comprehensive HTML Administration Interfaces</fiehld>";
    xmlText += "<fiehld nadme=\"features\">Scalability - Efficient Replication to other Solr Search Servers</fiehld>";
    xmlText += "<fiehld nadme=\"features\">Flexible and Adaptable with XML configuration and Schema</fiehld>";
    xmlText += "<fiehld nadme=\"features\">Good unicode support: héllo (hello with an accent over the e)</fiehld>";
    xmlText += "<fiehld nadme=\"price\">0</fiehld>";
    xmlText += "<fiehld nadme=\"popularity\">10</fiehld>";
    xmlText += "<fiehld nadme=\"inStock\">true</fiehld>";
    xmlText += "<fiehld nadme=\"incubationdate_dt\">2006-01-17T00:00:00.000Z </fiehld>";
    xmlText += "</doc>";
    xmlText += "</add>";
    URL url = new URL("http://localhost:8080/solr/update");
    HttpURLConnection c = (HttpURLConnection) url.openConnection();
    c.setRequestMethod("POST");
    c.setRequestProperty("Content-Type", "text/xml; charset=\"utf-8\"");
    c.setDoOutput(true);
    OutputStreamWriter out = new OutputStreamWriter(c.getOutputStream(), "UTF8");
    out.write(xmlText);
    out.close();
} catch (Exception e) {
    e.printStackTrace();
}

but I keep getting an error in the tomcat logs complaining "Missing content stream". Can anybody tell what's going on here? Here is the tomcat log:

INFO: /update SOLR1000... 0 0
Jan 15, 2008 2:11:11 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: missing content stream
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:114)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:902)
Re: Missing Content Stream
Ismail, use Solrj instead, you'll be much happier. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Ismail Siddiqui <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, January 15, 2008 1:50:25 PM
Subject: Missing Content Stream

[...]
Re: Missing Content Stream
On Jan 15, 2008, at 1:50 PM, Ismail Siddiqui wrote: Hi Everyone, I am new to solr. I am trying to index xml using http post as follows Ismail, you seem to have a few spelling mistakes in your xml string. "fiehld, nadme" etc. (a) try fixing them, (b) try solrj instead, I agree w/ otis.
best way to get number of documents in a Solr index
Hello, I am looking for the best way to get the number of documents in a Solr index. I'd like to do it from Java code using solrj. Any suggestions are welcome. Thank you in advance, Maria Mosolova
Re: best way to get number of documents in a Solr index
On Jan 15, 2008, at 3:47 PM, Maria Mosolova wrote: Hello, I am looking for the best way to get the number of documents in a Solr index. I'd like to do it from Java code using solrj.

public long resultCount() {
    try {
        SolrQuery q = new SolrQuery("*:*");
        QueryResponse rq = solr.query(q); // solr is an initialized SolrServer
        return rq.getResults().getNumFound(); // getNumFound() returns a long
    } catch (org.apache.solr.client.solrj.SolrServerException e) {
        System.err.println("Query problem");
    } catch (java.io.IOException e) {
        System.err.println("Other error");
    }
    return -1;
}
Re: best way to get number of documents in a Solr index
try a query with q=*:* -- the 'numFound' will be every document -- use &rows=0 to avoid returning docs (if you like) ryan

Maria Mosolova wrote: [...]
Re: best way to get number of documents in a Solr index
Thanks a lot Brian! Maria

Brian Whitman wrote: [...]
Re: Missing Content Stream
thanks brian and otis, i will definitely try solrj.. but actually the problem is now resolved: I was missing the Content-Length header, which I set like this:

c.setRequestProperty("Content-Length", xmlText.length() + "");

Now it's not throwing any error, but it's not indexing the document either.. do I have to set autoCommit on in solrconfig.xml? thanks

On 1/15/08, Brian Whitman <[EMAIL PROTECTED]> wrote:
> [...]
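A note on the autoCommit question: newly added documents only become searchable after a commit, either an explicit <commit/> POSTed to /update or an autoCommit block in solrconfig.xml. A sketch of the latter (the values are illustrative, not recommendations):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Illustrative: commit after 1000 pending docs or 60 seconds -->
  <autoCommit>
    <maxDocs>1000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```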
Re: wildcards and German umlauts
On Tuesday, 15 January 2008, Alexey Shakov wrote:
> Index-searching works, if i type complete word (such as "übersicht").
> But there are no hits, if i use wildcards (such as "über*")
> Searching with wildcards and without umlauts works as well.

Maybe this describes your problem on the Lucene level? http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a If that doesn't help, try Luke to see how your queries are parsed. Regards, Daniel -- http://www.danielnaber.de
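Building on Daniel's pointer: the FAQ entry he links explains that wildcard and prefix terms are not run through the analyzer, so a prefix that differs in case or stemming from the indexed terms never matches. One common client-side workaround (a sketch, not the only fix) is to fold the prefix yourself before building the wildcard query:

```java
import java.util.Locale;

public class WildcardPrefix {
    // Wildcard terms bypass the analyzer, so fold case on the client side
    // before appending '*'. Assumes the indexed terms are lowercased;
    // stemming mismatches would still need separate handling.
    static String lowercasePrefix(String prefix) {
        return prefix.toLowerCase(Locale.GERMAN) + "*";
    }

    public static void main(String[] args) {
        System.out.println(lowercasePrefix("Über")); // über*
    }
}
```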
Solr in a distributed multi-machine high-performance environment
Hi All, Our group has a requirement to index and search several million documents (TREC) in real time with millisecond responses. For the moment we prefer scale-out (throwing more commodity machines at the problem) over scale-up (faster disks, more RAM). This is in turn inspired by the "Scale-out vs. Scale-up" paper (mail me if you want a copy), which argues that this kind of distribution scales better and is more resilient. So, are there any resources available (wiki, tutorials, slides, READMEs etc.) that shed light and guide newbies on how to run Solr in a multi-machine scenario? I have gone through the mailing lists and the site but could not really find any answers or hands-on material. An ad-hoc guideline to get things working with 2 machines might be enough, but for the sake of thinking out loud and soliciting responses from the list, here are my questions:

1) Solr that has to handle a fairly large index which has to be split up over multiple disks (using Multicore?) - Space is not a problem since we can use NFS, but that is not recommended as we would only exploit 1 processor
2) Solr that has to handle a large collective index which has to be split up over multiple machines - The index is ever increasing (TB scale) and dynamic, and all of it has to be searchable at any point
3) Solr that has to exploit multiple machines because we have plenty of them in a tightly coupled P2P scenario - Machines are not a problem, but will they be if they are of varied configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE 1.1 to 1.6)?
4) Solr that has to distribute load over several machines - The index(es) could be common, though, like say using a distributed filesystem (Hadoop?)

In each of the above cases (we might use all of these strategies for various use cases) the application should use Solr as a strict backend and named service (IP or host:port) so that we can expose this application (and the service) to the web or intranet.
Machine failures should be tolerated too. Also, does Solr manage load balancing out of the box if it is indeed configured to work across multiple machines? Maybe it is superfluous, but are Solr and/or Nutch the only ways to use Lucene in a multi-machine environment? Or is there some hidden document/project somewhere that makes it possible by exposing a regular Lucene process over the network using RMI or something? It is my understanding (could be wrong) that Nutch and, to some extent, Solr do not perform well when there is a lot of indexing activity in parallel with search. Batch processing is also an option, and perhaps we can use Nutch/Solr there. Even so, we need multi-machine directions. I am sure multiple machines enable a lot of other approaches which might meet the goal better and which others have practical experience with. So, any advice and tips are also very welcome. We intend to document things and do some benchmarking along the way in the open spirit. Really sorry for the length, but I hope some answers are forthcoming. Cheers, Srikant
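Not an answer to the built-in question, but for comparison, the simplest DIY scale-out approach people describe is client-side sharding: hash each document's unique key onto one of N Solr instances at index time, query all N, and merge results. A self-contained sketch of the routing step (a generic technique, not a Solr feature of the time):

```java
public class ShardRouter {
    // Route a document to one of numShards Solr instances by hashing its
    // unique key. Deterministic: the same id always lands on the same shard.
    static int shardFor(String docId, int numShards) {
        // Normalize the hash into [0, numShards) even for negative hashCodes
        return ((docId.hashCode() % numShards) + numShards) % numShards;
    }

    public static void main(String[] args) {
        // The shard index is stable across runs for a given id
        System.out.println(shardFor("SOLR1000", 4));
    }
}
```

At query time the client would fan the same request out to every shard and merge by score; rebalancing when numShards changes is the hard part this sketch ignores.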