wildcards and German umlauts

2008-01-15 Thread Alexey Shakov

Hi all,

Index searching works if I type a complete word (such as "übersicht"),
but there are no hits if I use wildcards (such as "über*").
Searching with wildcards but without umlauts works fine.

Can someone help me? Thanx in advance!

Here is my field definition (the XML tags were stripped by the mail
archive; only the filter attributes survived):

   first analyzer:
      WordDelimiterFilter: generateWordParts="1" generateNumberParts="1"
         catenateWords="1" catenateNumbers="1" catenateAll="0"
      stemmer: protected="protwords.txt" language="German2"

   second analyzer:
      SynonymFilter: synonyms="synonyms.txt" ignoreCase="true" expand="true"
      WordDelimiterFilter: generateWordParts="1" generateNumberParts="1"
         catenateWords="0" catenateNumbers="0" catenateAll="0"
      stemmer: protected="protwords.txt" language="German2"



FunctionQuery in a custom request handler

2008-01-15 Thread evol__

I'm trying to pull off a "time bias" / "article freshness" thing - boosting
recent documents based on a "published_date" field. The reasonable way to do
this seems to be a FunctionQuery.
But all the examples I find express this through the query parser; I'd need
to do it inside my custom, plugged-in request handler.

How do I access the ValueSource for my DateField? I'd like to use a
ReciprocalFloatFunction from inside the code, adding it aside others in the
main BooleanQuery.
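For what it's worth, here is a rough sketch of the shape this could take inside a handler. Everything in it -- the method names, the ReciprocalFloatFunction constructor arguments, even whether FieldType exposes a getValueSource -- is an assumption to be checked against the Solr source of your version, not a tested recipe:

```
// Rough sketch, untested: boost recent docs by wrapping the date field's
// ValueSource in a reciprocal function and OR-ing it into the main query.
SchemaField sf = req.getSchema().getField("published_date");
ValueSource vs = sf.getType().getValueSource(sf);          // assumed API
Query freshness = new FunctionQuery(
    new ReciprocalFloatFunction(vs, m, a, b));             // roughly a / (m*x + b)
mainQuery.add(freshness, BooleanClause.Occur.SHOULD);
```

The SHOULD clause means the freshness score adds to, rather than filters, the relevance score of the rest of the BooleanQuery.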


Thanks for the replies.
David

-- 
View this message in context: 
http://www.nabble.com/FunctionQuery-in-a-custom-request-handler-tp14838957p14838957.html
Sent from the Solr - User mailing list archive at Nabble.com.



highlighting marks wrong words

2008-01-15 Thread Alexey Shakov

Hi all,

I have a query like this:

q=(auto) AND id:(100 OR 1 OR 2 OR 3 OR 5 OR 
6)&fl=score&hl.fl=content&hl=true&hl.fragsize=200&hl.snippets=2&hl.simple.pre=%3Cb%3E&hl.simple.post=%3C%2Fb%3E&start=0&rows=10


Default field is content.

So I expect that only occurrences of "auto" will be marked.

BUT: the occurrences of the id values (100, 1, 2, ...), which happen to
also be present in the content field, are marked as well...


The result looks like:

North American International <b>Auto</b> Show 2007 - Celebrating
<b>100</b> years



Any ideas?

Thanx in advance!




RE: highlighting marks wrong words

2008-01-15 Thread Charlie Jackson
I believe changing the "AND id: etc etc" part of the query to its own
filter query will take care of your highlighting problem.

In other words, try a query like this:

q=(auto)&fq=id:(100 OR 1 OR 2 OR 3 OR 5 OR
6)&fl=score&hl.fl=content&hl=true&hl.fragsize=200&hl.snippets=2&hl.simpl
e.pre=%3Cb%3E&hl.simple.post=%3C%2Fb%3E&start=0&rows=10

This could also get you a performance boost if you're querying against
this set of ids often.
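To build such a URL from Java, the hl.simple.pre/post values just need URL-encoding. A small self-contained sketch (the host, port, and /select path below are assumptions from the stock example setup):

```java
import java.net.URLEncoder;

public class HighlightUrl {
    public static void main(String[] args) throws Exception {
        String enc = "UTF-8";
        // Move the id restriction into fq so only the q terms get highlighted.
        String url = "http://localhost:8983/solr/select"
                + "?q=" + URLEncoder.encode("auto", enc)
                + "&fq=" + URLEncoder.encode("id:(100 OR 1 OR 2 OR 3 OR 5 OR 6)", enc)
                + "&fl=score&hl=true&hl.fl=content&hl.fragsize=200&hl.snippets=2"
                + "&hl.simple.pre=" + URLEncoder.encode("<b>", enc)    // %3Cb%3E
                + "&hl.simple.post=" + URLEncoder.encode("</b>", enc); // %3C%2Fb%3E
        System.out.println(url);
    }
}
```

Note that URLEncoder also turns the spaces in the fq clause into "+", which Solr accepts.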





Re: field:(-null) returns records where field was not specified

2008-01-15 Thread Karen Loughran

Thanks Chris, this is useful; we can use the query format you suggest.

Karen

On Tuesday 15 January 2008 01:13:14 Chris Hostetter wrote:
> Several things in this thread should be clarified (note: order of
> quotations munged for clarity)...
>
> : I had read this page.  But I'm not using the "NOT" operator,  I'm using
> : the "-" operator.  I'm assuming there is a subtle difference between them
> : in that NOT qualifies something else, hence needs 2 terms.  Isn't the "-"
> : operator supposed to be a complement to the "+" operator, ie. excludes
> : something rather than requiring it ?
>
> "The NOT operator" and "the - operator" are in fact the same thing ... the
> duplicate syntax comes from Lucene trying to appease people who want
> boolean-style operators (AND/OR/NOT) even though the query parser syntax
> is not boolean.
>
> : > Have you seen this page?
> : > http://lucene.apache.org/java/docs/queryparsersyntax.html
> : >
> : > From that page:
> : > Note: The NOT operator cannot be used with just one term. For example,
> : > the following search will return no results:
> : > NOT "jakarta apache"
>
> In Solr, the query parser can in fact support purely negative queries by
> internally transforming the query; this is noted on the Solr query syntax
> wiki...
>
> http://wiki.apache.org/solr/SolrQuerySyntax
>
> : > > field_name:(-null)
>
> "null" is not a special keyword, if you look at the debugging output when
> doing that query you'll see that it is the same as:   -field_name:null
> ... which is a search for all docs containing the string "null" in the
> field "field_name".
>
> : The *:* (star colon star) means "all records". The trick is to use (*:*
> : AND -field:[* TO *]). It's silly, but there it is.
>
> As I mentioned, you can do purely negative queries now, so a simple search
> for -field_name:[* TO *] will find all docs that have no indexed values
> for that field at all.
>
> : A performance note: we switched from empty fields to fields with a
> : standard 'empty' value. This way we don't have to do a range check to
> : find records with empty fields.
>
> Your mileage may vary depending on how many docs you have with "no value"
> ... this also isn't practical when dealing with numeric, boolean, or date
> based fields.  (And depending on how much churn there is in your index,
> the filterCache can probably make the difference negligible on average
> anyway.)
>
>
>
>
> -Hoss




XSLT to preprocess XML documents into 'update xml documents' ?

2008-01-15 Thread Karen Loughran

Hi all,

I noticed some recent discussion with regard to using XSLT to preprocess XML 
documents into 'update xml documents' :

http://www.mail-archive.com/[EMAIL PROTECTED]/msg05927.html

I was wondering if there has been any update on this? It is something we
would be interested in using.

Thanks
Karen


Re: XSLT to preprocess XML documents into 'update xml documents' ?

2008-01-15 Thread Ryan McKinley

I have not tried it, but check:
https://issues.apache.org/jira/browse/SOLR-285







Re: LNS - or - "now i know we've succeeded"

2008-01-15 Thread Otis Gospodnetic
I'm sure N stealth startups are doing this as we speak... and reading this,
rubbing hands :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Lance Norskog <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, January 14, 2008 6:09:38 PM
Subject: RE: LNS - or - "now i know we've succeeded"

Now that Microsoft is buying FAST (!!) the open source world needs a
matching technology :) 

-Original Message-
From: Walter Underwood [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 14, 2008 7:42 AM
To: solr-user@lucene.apache.org
Subject: Re: LNS - or - "now i know we've succeeded"

Yes, they are reputable. They've been doing consulting with Verity,
Ultraseek, and other platforms for many years.  --wunder

On 1/12/08 1:22 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> It is pretty cool to see a reputable
> Search company (is ideaeng.com a reputable search consulting company?








Re: highlighting marks wrong words

2008-01-15 Thread Alexey Shakov

Thank you! It works correctly with the filter query.









Missing Content Stream

2008-01-15 Thread Ismail Siddiqui
Hi Everyone,
I am new to solr. I am trying to index xml using http post as follows

try {
    // (the XML markup inside these string literals was stripped by the
    //  mail archive; the payload being built is Solr's stock SOLR1000
    //  example document -- add/doc/field elements carrying the id, name,
    //  features, etc. values)
    String xmlText = "...";

    URL url = new URL("http://localhost:8080/solr/update");
    HttpURLConnection c = (HttpURLConnection) url.openConnection();
    c.setRequestMethod("POST");
    c.setRequestProperty("Content-Type", "text/xml; charset=\"utf-8\"");
    c.setDoOutput(true);
    OutputStreamWriter out = new OutputStreamWriter(c.getOutputStream(), "UTF8");
    out.write(xmlText);
    out.close();
}


But I keep getting an error in the tomcat logs complaining "missing content
stream". Can anybody tell what's going on here?
Here is the tomcat log:

INFO: /update SOLR1000...
0 0
Jan 15, 2008 2:11:11 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: missing content stream
 at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(
XmlUpdateRequestHandler.java:114)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(
RequestHandlerBase.java:117)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:902)


Re: Missing Content Stream

2008-01-15 Thread Otis Gospodnetic
Ismail, use Solrj instead, you'll be much happier.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
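For reference, the solrj equivalent is roughly the following. This is a sketch, not runnable as-is: it needs the solrj client jars and a running Solr, the class names are from the solrj client of that era, and the URL is an assumption matching Ismail's setup:

```
// Sketch only -- requires the solrj jars and a running Solr instance.
CommonsHttpSolrServer server =
    new CommonsHttpSolrServer("http://localhost:8080/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "SOLR1000");
doc.addField("name", "Solr, the Enterprise Search Server");
server.add(doc);
server.commit();  // make the document searchable
```

solrj takes care of the content stream, headers, and escaping that the raw HttpURLConnection approach gets wrong.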






Re: Missing Content Stream

2008-01-15 Thread Brian Whitman


On Jan 15, 2008, at 1:50 PM, Ismail Siddiqui wrote:


Hi Everyone,
I am new to solr. I am trying to index xml using http post as follows



Ismail, you seem to have a few spelling mistakes in your xml string.  
"fiehld, nadme" etc. (a) try fixing them, (b) try solrj instead, I  
agree w/ otis.






best way to get number of documents in a Solr index

2008-01-15 Thread Maria Mosolova

Hello,

I am looking for the best way to get the number of documents in a Solr
index. I'd like to do it from Java code using solrj. Any suggestions
are welcome.


Thank you in advance,
Maria Mosolova


Re: best way to get number of documents in a Solr index

2008-01-15 Thread Brian Whitman


On Jan 15, 2008, at 3:47 PM, Maria Mosolova wrote:


Hello,

I am looking for the best way to get the number of documents in a  
Solr index. I'd like to do it from a java code using solrj.



  public int resultCount() {
    try {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);  // we only need the count, not the documents
      QueryResponse rq = solr.query(q);
      // getNumFound() returns a long; cast for this int-returning method
      return (int) rq.getResults().getNumFound();
    } catch (org.apache.solr.client.solrj.SolrServerException e) {
      System.err.println("Query problem");
    } catch (java.io.IOException e) {
      System.err.println("Other error");
    }
    return -1;
  }




Re: best way to get number of documents in a Solr index

2008-01-15 Thread Ryan McKinley

try a query with q=*:*

the 'numFound' will count every document -- use &rows=0 to avoid returning
docs (if you like)


ryan
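For example, against the stock example setup (port 8983 is an assumption from the shipped Jetty config):

```
http://localhost:8983/solr/select?q=*:*&rows=0
```

The numFound attribute on the result element of the response is the total document count, and with rows=0 no document data is transferred.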







Re: best way to get number of documents in a Solr index

2008-01-15 Thread Maria Mosolova

Thanks a lot Brian!
Maria








Re: Missing Content Stream

2008-01-15 Thread Ismail Siddiqui
thanks brian and otis,
I will definitely try solrj, but actually the problem is now resolved: I was
missing the Content-Length header:
c.setRequestProperty("Content-Length", xmlText.length() + "");
Now it's not throwing any error, but it's not indexing the document either.
Do I have to set autoCommit on in solrconfig.xml?
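New documents are not searchable until a commit. Either POST a commit message to /update after the add, or enable autoCommit in solrconfig.xml (the threshold values below are illustrative, not recommendations):

```xml
<!-- sent to /update after the add -->
<commit/>

<!-- or, in solrconfig.xml: -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>1000</maxDocs>   <!-- commit after 1000 pending docs -->
    <maxTime>60000</maxTime>  <!-- or after 60 seconds, whichever comes first -->
  </autoCommit>
</updateHandler>
```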


thanks


On 1/15/08, Brian Whitman <[EMAIL PROTECTED]> wrote:
>
>
> On Jan 15, 2008, at 1:50 PM, Ismail Siddiqui wrote:
>
> > Hi Everyone,
> > I am new to solr. I am trying to index xml using http post as follows
>
>
> Ismail, you seem to have a few spelling mistakes in your xml string.
> "fiehld, nadme" etc. (a) try fixing them, (b) try solrj instead, I
> agree w/ otis.
>
>
>
>


Re: wildcards and German umlauts

2008-01-15 Thread Daniel Naber
On Dienstag, 15. Januar 2008, Alexey Shakov wrote:

> Index searching works if I type a complete word (such as "übersicht"),
> but there are no hits if I use wildcards (such as "über*").
> Searching with wildcards but without umlauts works fine.

Maybe this describes your problem on the Lucene level?
http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a

If that doesn't help, try Luke to see how your queries are parsed.
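The usual client-side workaround for the case half of this: wildcard terms bypass the analyzer, so any lowercasing done at index time has to be mirrored on the client before the wildcard query is built. A minimal sketch (note this only covers case; with language="German2" the indexed term may also have had its umlaut folded to "ue", which no amount of client-side lowercasing will fix):

```java
import java.util.Locale;

public class WildcardNormalize {
    public static void main(String[] args) {
        // Wildcard terms are not passed through the index analyzer, so
        // lowercase the user's input yourself before appending the '*'.
        String userInput = "Über";
        String wildcardTerm = userInput.toLowerCase(Locale.GERMAN) + "*";
        System.out.println(wildcardTerm);
    }
}
```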

Regards
 Daniel

-- 
http://www.danielnaber.de


Solr in a distributed multi-machine high-performance environment

2008-01-15 Thread Srikant Jakilinki
Hi All,

There is a requirement in our group for indexing and searching several
million documents (TREC) in real time with millisecond responses.
For the moment we prefer scale-out (throw more commodity machines)
approaches rather than scale-up (faster disks, more RAM). This is in turn
inspired by the "Scale-out vs. Scale-up" paper (mail me if you want a
copy), in which it was shown that this kind of distribution scales better
and is more resilient.

So, are there any resources available (Wiki, Tutorials, Slides, README
etc.) that throw light and guide newbies on how to run Solr in a
multi-machine scenario? I have gone through the mailing lists and site
but could not really find any answers or hands-on stuff to do so. An
adhoc guideline to get things working with 2 machines might just be
enough but for the sake of thinking out loud and solicit responses
from the list, here are my questions:

1) Solr that has to handle a fairly large index which has to be split
up on multiple disks (using Multicore?)
- Space is not a problem since we can use NFS but that is not
recommended as we would only exploit 1 processor
2) Solr that has to handle a large collective index which has to be
split up on multi-machines
- The index is ever increasing (TB scale) and dynamic and all of it
has to be searched at any point
3) Solr that has to exploit multi-machines because we have plenty of
them in a tightly coupled P2P scenario
- Machines are not a problem but will they be if they are of varied
configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE
1.1 to 1.6)
4) Solr that has to distribute load on several machines
- The index(s) could be common though like say using a distributed
filesystem (Hadoop?)

In each the above cases (we might use all of these strategies at
various use cases) the application should use Solr as a strict backend
and named service (IP or host:port) so that we can expose this
application (and the service) to the web or intranet. Machine failures
should be tolerated too. Also, does Solr manage load balancing out of
the box if it was indeed configured to work with multi-machines?

Maybe it is superfluous to ask, but are Solr and/or Nutch the only ways to
use Lucene in a multi-machine environment? Or is there some hidden
document/project somewhere that makes it possible by exposing a regular
Lucene process over the network using RMI or something? It is my
understanding (could be wrong) that Nutch, and to some extent Solr, do not
perform well when there is a lot of indexing activity in parallel with
search. Batch processing is also an option, and perhaps we can use
Nutch/Solr there. Even so, we need multi-machine directions.

I am sure that multi-machines open up a lot of other approaches which might
solve the goal better, and that others have practical experience with them.
So, any advice and tips are also very welcome. We intend to document things
and do some benchmarking along the way, in the open spirit.

Really sorry for the length but I hope some answers are forthcoming.

Cheers,
Srikant