SolrJ and Unique Doc ID

2008-02-11 Thread Grant Ingersoll
What's the best way to retrieve the unique key field from SolrJ?  From  
what I can tell, it seems like I would need to retrieve the schema and  
then parse it and get it from there, or am I missing something?


Thanks,
Grant


Re: Search result not coming for normal special characters...

2008-02-11 Thread nithyavembu

Thanks Erick.

I have tried the WhitespaceAnalyzer as you said.

-> In my schema.xml I removed the filter class
"solr.WordDelimiterFilterFactory" for both indexing and querying.

-> With it removed, the special-character search works fine, but I can
no longer search inside a term.

   Example indexed data: sriHari, sweetHeart, mike Oliver

   If I search for "sri", "sweet", "mike" or "oliver" the results come
back correctly, but searching for "Hari" or "Heart" returns nothing;
I cannot match a word from the middle of a term.

-> I found that "solr.WordDelimiterFilterFactory" splits the words and
makes the middle of a term searchable, but then special characters are
ignored.

-> I need both scenarios to work. Is it possible? Any idea or
solution?

Thanks,
Nithya.




When in doubt, use WhitespaceAnalyzer and build up from there. It's the
simplest. Look at the Lucene docs for what the various analyzers do
under the covers.

Note: WhitespaceAnalyzer does NOT transform to lowercase, you have
to do that yourself or compose your own analyzer.

Erick
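
For reference, a schema sketch that tries to get both behaviours: keep
WordDelimiterFilterFactory so "Hari" matches inside "sriHari", but ask it
to preserve the original token so special characters survive as well.
Untested, and the preserveOriginal attribute is not available in every
Solr version:

  <fieldType name="text_split" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- splitOnCaseChange emits "sri" and "Hari" from "sriHari";
           preserveOriginal also keeps the unmodified token, special
           characters included -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" splitOnCaseChange="1"
              preserveOriginal="1"/>
      <!-- WhitespaceTokenizer does not lowercase (see the note above) -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>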





Re: SolrJ and Unique Doc ID

2008-02-11 Thread Ryan McKinley

Right now you need to know the unique key name to get it...
I don't think we have any easy way to get that besides parsing the
schema.
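
(For what it's worth, a sketch of that route: later Solr versions expose
the unique key through the Luke request handler, so client code can avoid
parsing schema.xml by hand.  The "uniqueKeyField" key below is an
assumption to verify against your version:)

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.LukeRequest;
  import org.apache.solr.common.util.NamedList;

  public class UniqueKeyLookup {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      // Ask the Luke handler for its schema view and walk the raw response.
      LukeRequest luke = new LukeRequest();
      luke.setShowSchema(true);
      NamedList<Object> rsp = server.request(luke);

      NamedList<Object> schema = (NamedList<Object>) rsp.get("schema");
      System.out.println("uniqueKey = " + schema.get("uniqueKeyField"));
    }
  }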


With debugQuery=true, the uniqueKey is added to the 'explain' info,
keyed by each document's uniqueKey value -- something like:

  <lst name="explain">
    <str name="SOLR1000">...</str>
  </lst>

this gets parsed into the QueryResponse's _explainMap and _docIdMap, but
I'm not sure that is useful in the general sense...


ryan


Grant Ingersoll wrote:
What's the best way to retrieve the unique key field from SolrJ?  From 
what I can tell, it seems like I would need to retrieve the schema and 
then parse it and get it from there, or am I missing something?


Thanks,
Grant





Re: SolrJ and Unique Doc ID

2008-02-11 Thread Yonik Seeley
Hmmm, I should have just mandated that the id field be called "id"
from the start :-)

On Feb 11, 2008 5:51 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> What's the best way to retrieve the unique key field from SolrJ?  From
> what I can tell, it seems like I would need to retrieve the schema and
> then parse it and get it from there, or am I missing something?
>
> Thanks,
> Grant
>


Re: solrj and multiple slaves

2008-02-11 Thread Walter Underwood
On 2/11/08 8:42 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> if you want to worry about smart load balancing, try to load balance based
> on the nature of the URL query string ... make your load balancer pick
> a slave by hashing on the "q" param, for example.

This is very effective. We used this at Infoseek ten years ago.

An easy way to do this is to have the client code do the hash and
add it as an extra parameter. Then have the load balancer switch
based on that param. Something like this:

   &preferred_server=2

wunder
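
A minimal client-side sketch of that (the parameter name is just the one
from the example above; the load balancer still has to be configured to
route on it):

  import org.apache.solr.client.solrj.SolrQuery;

  public class QueryAffinity {
    // Identical q values always land on the same slave, so that slave's
    // queryResultCache gets the repeat hits.
    static SolrQuery withAffinity(String q, int numSlaves) {
      SolrQuery query = new SolrQuery(q);
      int slave = Math.abs(q.hashCode() % numSlaves);
      query.set("preferred_server", String.valueOf(slave));
      return query;
    }
  }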



Re: solrj and multiple slaves

2008-02-11 Thread Chris Hostetter

: I have a quick question about using solrj to connect to multiple slaves.
: My application is deployed on multiple boxes that have to talk to
: multiple solr slaves.  In order to take advantage of the queryResult
: cache, each request from one of my app boxes should be redirected to the
: same solr slave.

I've never once worried about "session affinity" when dealing with Solr
... if a query is common/important enough that it's going to be a cache
hit, it will probably be a cache hit on all the servers.  Besides which:
just because two queries come from the same client doesn't mean they have
anything to do with each other -- I'm typically just as likely to get the
same query from two different clients as I am twice from the same client.

If you want to worry about smart load balancing, try to load balance based
on the nature of the URL query string ... make your load balancer pick
a slave by hashing on the "q" param, for example.

The one situation where I worry about sending certain traffic to some Solr
boxes and other traffic to other Solr boxes is when I know that the client
apps have very different query usage patterns ... then I have two separate
tiers of slaves -- identical indexes, but different solrconfigs.  The
clients that hit my custom faceting plugin use one tier with a big custom
cache and filterCache.  The clients that do more traditional searching
using dismax hit a second tier which has no custom cache, a smaller
filterCache and a bigger queryResultCache ... but even then I don't worry
about session IDs ... I just configure the two client applications with
different DNS aliases.




-Hoss



Re: Index will get change or refresh after restart the solr server?

2008-02-11 Thread Chris Hostetter

:    When I start it again, what will happen in the "data" folder?  Any data
: refreshing, adding, deleting, etc.?
:    On every restart of the Solr server, what happens to the indexed data, or
: does it remain unchanged?

If you start up Solr and there is already a "data" directory containing
an "index" directory, then Solr will use that index.  If there is no index
directory, then Solr will create it (it will not create the "data"
directory -- if that's missing you get an error).
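
In other words, a typical layout (segment file names vary by Lucene
version):

  solr/data/         <- must already exist, or startup fails with an error
  solr/data/index/   <- created by Solr if missing, reused if present
                        (segments.gen, segments_N, *.fdt, *.fdx, ...)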

: Morning data:
:
:   primaryAdmin
:   secondaryAdmin
:
: Evening data:
:
:   primaryAdmin
:
: These are the data I indexed. But when I search for "primaryAdmin" it
: returns only the data indexed at that time...

I do not understand your question.  There could be lots of things going on
here, but it's not at all clear that anything is actually going wrong.
Did the document you indexed in the evening have the same value for the
uniqueKey field as the document you indexed in the morning?

Your best bet for getting meaningful help with your problem is to be very
explicit about exactly what it is you are doing and what results you are
getting ... show us your schema.xml, show us the full XML of every doc
you index, list every action you take (including when you stop/start your
Tomcat server), etc...



-Hoss



Re: range vs. filter queries

2008-02-11 Thread Chris Hostetter

: essentially, this is:
:  +north:[* TO nnn] +south:[sss TO *] +east:[* TO eee] +west:[www TO *]

: Would this be better as four individual filters?

It depends on the granularity you expect clients to query with ... if
clients can get really granular, then the odds of reuse are lower, so the
advantages of individual filters are gone.

If, however, you know that the granularity of your input will always be
something coarse -- like a multiple of 15 degrees, or even 5 degrees --
then it's probably practical to break it out.

Something else to consider: use cached range filters for the coarse
aspects, but use uncached range filters for the precision stuff.  I know
you're searching for areas, but for simplicity assume the docs exist at
points and the query is a box... if the input box is "N+42.5 S+13.7 W-78.2
E-62.4" you can cache filters for lat:[15 TO 45] and lon:[-75 TO -60] and
then intersect and union those results with uncached queries for lat:[13.7
TO 15], lat:[42.5 TO 45], lon:[-78.2 TO -75], lon:[-62.4 TO -60]

...I've never tried this so I'm not sure if the cost/benefit trade-off
actually makes sense ... but the principle seems sound.
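
Expressed as request parameters, a slightly simpler variation of the same
idea (illustrative numbers from the example above): snap the cached
filters outward to the grid so they are a superset of the box -- fq
filters land in the filterCache -- and let the exact ranges in the main
query make the precise cut:

  q=+lat:[13.7 TO 42.5] +lon:[-78.2 TO -62.4]
  fq=lat:[0 TO 45]
  fq=lon:[-90 TO -60]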




-Hoss



RE: range vs. filter queries

2008-02-11 Thread Lance Norskog
Is it not possible to make a grid of your boxes? It seems like this would be
a more efficient query:

grid:N100_S50_E250_W412 

This is how GIS systems work, right?

Lance
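
A toy sketch of how such grid terms could be generated (the names and the
15-degree tile size are made up for illustration).  Index one token per
tile a document's box touches; a plain TermQuery on the query box's tiles
then throws out most documents before any range query runs:

  import java.util.ArrayList;
  import java.util.List;

  public class GridTiles {
    // Tokens like "lat30_lon-75", one per 15-degree tile the box touches.
    static List<String> tiles(float s, float n, float w, float e) {
      List<String> out = new ArrayList<String>();
      for (int lat = (int) Math.floor(s / 15) * 15; lat <= n; lat += 15) {
        for (int lon = (int) Math.floor(w / 15) * 15; lon <= e; lon += 15) {
          out.add("lat" + lat + "_lon" + lon);
        }
      }
      return out;
    }
  }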

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 11, 2008 6:13 PM
To: solr-user@lucene.apache.org
Subject: Re: range vs. filter queries

>>
>> Would this be better as four individual filters?
> 
> Only if they were likely to occur again in combination with different 
> constraints.
> My guess would be no.

this is because the filter could not be cached?

Since I know it should not be cached, is there any way to make sure it does
not purge useful stuff from the cache?

> 
> Perhaps you want 2 fields (lat and long) instead of 4?
> 

2 would be fine if I were dealing with points, but this is a region, so I
need to deal with the whole region (N, S, E, and W).


> One issue here is range queries that include many terms are currently
slow.
> That's something we need to address sometime (there has been some work
> on this in Lucene, but nothing yet committed AFAIK).
>

Do range queries operate on the whole index, or can they be limited 
first?  That is, if I can throw out half the docs with a simple 
TermQuery, does the range still have to go through everything?

thanks
ryan




Re: range vs. filter queries

2008-02-11 Thread Yonik Seeley
On Feb 11, 2008 9:13 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> >>
> >> Would this be better as four individual filters?
> >
> > Only if they were likely to occur again in combination with different
> > constraints.
> > My guess would be no.
>
> this is because the filter could not be cached?

Right.  It's probably minor though... the bigger cost will be the
generation of those range queries.

> Since I know it should not be cached, is there any way to make sure it does
> not purge useful stuff from the cache?
>
> >
> > Perhaps you want 2 fields (lat and long) instead of 4?
> >
>
> 2 would be fine if I were dealing with points, but this is a region, so I
> need to deal with the whole region (N, S, E, and W).

If it's a bounding box, it can be defined by 2 range queries, right?

> > One issue here is range queries that include many terms are currently slow.
> > That's something we need to address sometime (there has been some work
> > on this in Lucene, but nothing yet committed AFAIK).
> >
>
> Do range queries operate on the whole index, or can they be limited
> first?  That is, if I can throw out half the docs with a simple
> TermQuery, does the range still have to go through everything?

Needs to go through everything.  No easy way to avoid that right now.

-Yonik


Re: range vs. filter queries

2008-02-11 Thread Ryan McKinley


Would this be better as four individual filters?


Only if they were likely to occur again in combination with different
constraints.
My guess would be no.


this is because the filter could not be cached?

Since I know it should not be cached, is there any way to make sure it does
not purge useful stuff from the cache?




Perhaps you want 2 fields (lat and long) instead of 4?



2 would be fine if I were dealing with points, but this is a region, so I
need to deal with the whole region (N, S, E, and W).




One issue here is range queries that include many terms are currently slow.
That's something we need to address sometime (there has been some work
on this in Lucene, but nothing yet committed AFAIK).



Do range queries operate on the whole index, or can they be limited 
first?  That is, if I can throw out half the docs with a simple 
TermQuery, does the range still have to go through everything?


thanks
ryan



Re: range vs. filter queries

2008-02-11 Thread Yonik Seeley
On Feb 11, 2008 8:51 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> Hello-
>
> I'm working on a SearchComponent that should limit results to entries
> within a geographic range.  I would love some feedback to make sure I'm
> not building silly queries and/or can change them to be better.  I have
> four fields:
>
>   <field name="north" type="sfloat" indexed="true" stored="true"/>
>   <field name="south" type="sfloat" indexed="true" stored="true"/>
>   <field name="east"  type="sfloat" indexed="true" stored="true"/>
>   <field name="west"  type="sfloat" indexed="true" stored="true"/>
>
> The component looks for a "bounds" argument and parses out the NSEW
> corners.  Currently, I'm building a boolean query and adding that to the
> filter list:
>
>FieldType ft = req.getSchema().getFieldTypes().get( "sfloat" );
>
>BooleanQuery range = new BooleanQuery( true );
>range.add( new ConstantScoreRangeQuery( "north", null,
> ft.toInternal(n), true, true ), BooleanClause.Occur.MUST );
>range.add( new ConstantScoreRangeQuery( "south",
> ft.toInternal(s), null, true, true ), BooleanClause.Occur.MUST );
>range.add( new ConstantScoreRangeQuery( "east", null,
> ft.toInternal(e), true, true ), BooleanClause.Occur.MUST );
>range.add( new ConstantScoreRangeQuery( "west", ft.toInternal(w),
> null, true, true ), BooleanClause.Occur.MUST );
>
> essentially, this is:
>   +north:[* TO nnn] +south:[sss TO *] +east:[* TO eee] +west:[www TO *]
>
>
> Would this be better as four individual filters?

Only if they were likely to occur again in combination with different
constraints.
My guess would be no.

Perhaps you want 2 fields (lat and long) instead of 4?

One issue here is range queries that include many terms are currently slow.
That's something we need to address sometime (there has been some work
on this in Lucene, but nothing yet committed AFAIK).

-Yonik


range vs. filter queries

2008-02-11 Thread Ryan McKinley

Hello-

I'm working on a SearchComponent that should limit results to entries 
within a geographic range.  I would love some feedback to make sure I'm 
not building silly queries and/or can change them to be better.  I have 
four fields:


  <field name="north" type="sfloat" indexed="true" stored="true"/>
  <field name="south" type="sfloat" indexed="true" stored="true"/>
  <field name="east"  type="sfloat" indexed="true" stored="true"/>
  <field name="west"  type="sfloat" indexed="true" stored="true"/>

The component looks for a "bounds" argument and parses out the NSEW 
corners.  Currently, I'm building a boolean query and adding that to the 
filter list:


  FieldType ft = req.getSchema().getFieldTypes().get( "sfloat" );

  BooleanQuery range = new BooleanQuery( true );
  range.add( new ConstantScoreRangeQuery( "north", null, 
ft.toInternal(n), true, true ), BooleanClause.Occur.MUST );
  range.add( new ConstantScoreRangeQuery( "south", 
ft.toInternal(s), null, true, true ), BooleanClause.Occur.MUST );
  range.add( new ConstantScoreRangeQuery( "east", null, 
ft.toInternal(e), true, true ), BooleanClause.Occur.MUST );
  range.add( new ConstantScoreRangeQuery( "west", ft.toInternal(w), 
null, true, true ), BooleanClause.Occur.MUST );


essentially, this is:
 +north:[* TO nnn] +south:[sss TO *] +east:[* TO eee] +west:[www TO *]


Would this be better as four individual filters?

Additionally, I could chunk the world into a grid and index whether a 
point exists within a given square.  This could potentially cut out many 
results with a simple term query, but I don't know if it is worthwhile 
since I will need to run the points through a range query at the end anyway.


Any thoughts or feedback would be great.

thanks
ryan





Re: Highlight on non-text fields and/or field-match list

2008-02-11 Thread Chris Hostetter

: to. For example, if I have a field in a document such as "username" which is
: a string that I'll do wild-card searches on, Solr will return document
: matches but no highlight data for that field. The end-goal is to know which

FYI: this is a known bug that results from a "safety" net in the 
SolrQueryParser...

https://issues.apache.org/jira/browse/SOLR-195

...wildcards work in the trunk, and there is a workaround for 
prefix queries mentioned in the issue (you trick the query parser into 
doing a wildcard query).

In general "fields" don't match queries, "documents" match queries ... 
highlighting can show you places "terms" and "phrases" appear in 
documents, but that doesn't garuntee that the "terms" highlighted are the 
reason the document matched the query.  the explain info is the only thing 
that can do that.
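
For completeness, a sketch of reading that explain info from SolrJ
(assuming QueryResponse's parsed explain map, which is keyed by each
document's uniqueKey value):

  import java.util.Map;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class WhyDidItMatch {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrQuery q = new SolrQuery("username:wu*");
      q.set("debugQuery", "true");

      QueryResponse rsp = server.query(q);
      // uniqueKey value -> per-term score breakdown for that document
      for (Map.Entry<String, String> e : rsp.getExplainMap().entrySet()) {
        System.out.println(e.getKey() + " =>\n" + e.getValue());
      }
    }
  }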




-Hoss



Re: Commit strategies

2008-02-11 Thread Chris Hostetter

If you just want commits to happen at a regular frequency, take a look at 
the autoCommit options.
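
For reference, the relevant solrconfig.xml section looks roughly like this
(a sketch -- check which of maxDocs/maxTime your version supports):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>  <!-- commit after this many uncommitted docs -->
      <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
    </autoCommit>
  </updateHandler>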

As for the specific errors you are getting, I don't know enough Python to 
understand them, but it may just be that your commits are taking too long 
and your client is timing out waiting for the commit to finish.

have you tried increasing the timeout?

: How do people generally approach the deferred commit issue? Do I need to queue
: index and search requests myself or does Solr handle it? My app indexes about
: 100 times more than it searches, but searching is more time critical. Does
: that change anything?

Searches can go on happily while commits/adds are happening, and multiple 
adds can happen in parallel ... but all adds block while a commit is 
taking place.  I just give all of the clients that update the index a really 
large timeout value (i.e. 60 seconds or so) and don't worry about queuing up 
indexing requests.  The only intelligence you typically need to worry about 
is that there's very little reason to ever do a commit if you know you've 
got more adds ready to go.




-Hoss



Performance help for heavy indexing workload

2008-02-11 Thread James Brady

Hello,
I'm looking for some configuration guidance to help improve  
performance of my application, which tends to do a lot more indexing  
than searching.


At present, it needs to index around two documents / sec - a document  
being the stripped content of a webpage. However, performance was so  
poor that I've had to disable indexing of the webpage content as an  
emergency measure. In addition, some search queries take an  
inordinate length of time - regularly over 60 seconds.


This is running on a medium sized EC2 instance (2 x 2GHz Opterons and  
8GB RAM), and there's not too much else going on on the box. In  
total, there are about 1.5m documents in the index.


I'm using a fairly standard configuration - the things I've tried  
changing so far have been parameters like maxMergeDocs, mergeFactor  
and the autoCommit options. I'm only using the  
StandardRequestHandler, no faceting. I have a scheduled task causing  
a database commit every 15 seconds.


Obviously, every workload varies, but could anyone comment on whether  
this sort of hardware should, with proper configuration, be able to  
manage this sort of workload?


I can't see signs of Solr being IO-bound, CPU-bound or memory-bound,  
although my scheduled commit operation, or perhaps GC, does spike up  
the CPU utilisation at intervals.


Any help appreciated!
James

Re: SolrJ and Unique Doc ID

2008-02-11 Thread Chris Hostetter
: Another option is to add it to the responseHeader ... Or it could be a quick
: add to the LukeRH.  The former has the advantage that we wouldn't have to make

adding the info to LukeRequestHandler makes sense.

Honestly: I can't think of a single use case where client code would care 
about what the uniqueKey field is, unless it already *knew* what the 
uniqueKey field is.

: Of course, it probably would be useful to be able to request the schema from
: the server and build an IndexSchema object on the client side.  This could be
: added to the LukeRH as well.

Somebody was working on that at some point ... but I may be thinking of 
the Ruby client ... no, I'm pretty sure I remember it coming up in the 
context of Java, because I remember discussion that a full "IndexSchema" 
was too much because it required the client to have the class files for 
all of the analysis chain and fieldtype classes.



-Hoss



Re: range vs. filter queries

2008-02-11 Thread Ryan McKinley

Lance Norskog wrote:

Is it not possible to make a grid of your boxes? It seems like this would be
a more efficient query:

	grid:N100_S50_E250_W412 


This is how GIS systems work, right?



Something like that...  I was just checking if I could get away with 
range queries for now...  I'll also check if local lucene is possible:

http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene.htm

ryan



Re: SolrJ and Unique Doc ID

2008-02-11 Thread Grant Ingersoll
Another option is to add it to the responseHeader ... Or it could be
a quick add to the LukeRH.  The former has the advantage that we
wouldn't have to make extra calls, at the cost of sending an extra
string w/ every message.  The latter would work by asking for it up
front and then saving it aside.  Any preference?  Or, we could add it
to both, making the responseHeader one optional.


Of course, it probably would be useful to be able to request the  
schema from the server and build an IndexSchema object on the client  
side.  This could be added to the LukeRH as well.


Hindsight is 20/20...

On Feb 11, 2008, at 6:51 PM, Ryan McKinley wrote:

Thoughts on requiring that for solrj?  Perhaps in 2.0?  Not
suggesting it is a good idea (yet)... but we may want to consider it.



Yonik Seeley wrote:

Hmmm, I should have just mandated that the id field be called "id"
from the start :-)
On Feb 11, 2008 5:51 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
What's the best way to retrieve the unique key field from SolrJ?   
From
what I can tell, it seems like I would need to retrieve the schema  
and

then parse it and get it from there, or am I missing something?

Thanks,
Grant








Re: SolrJ and Unique Doc ID

2008-02-11 Thread Ryan McKinley
Thoughts on requiring that for solrj?  Perhaps in 2.0?  Not suggesting
it is a good idea (yet)... but we may want to consider it.



Yonik Seeley wrote:

Hmmm, I should have just mandated that the id field be called "id"
from the start :-)

On Feb 11, 2008 5:51 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

What's the best way to retrieve the unique key field from SolrJ?  From
what I can tell, it seems like I would need to retrieve the schema and
then parse it and get it from there, or am I missing something?

Thanks,
Grant







RE: Multiple Search in Solr

2008-02-11 Thread patrik
It's based on Solr 1.2; however, it's customized for our application to do
this. I'm only mentioning that it's possible by changing
DirectUpdateHandler2 to have multiple indexes.

pb

-Original Message-
From: Niveen Nagy [mailto:[EMAIL PROTECTED] 
Sent: Sunday, February 10, 2008 1:47 AM
To: solr-user@lucene.apache.org
Subject: RE: Multiple Search in Solr


Could you please clarify which version?


Best Regards,

Niveen Nagy

Software Engineer
-Original Message-
From: patrik [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 06, 2008 10:10 PM
To: solr-user@lucene.apache.org
Subject: RE: Multiple Search in Solr

We're using a version of Solr that we've customized to allow multiple
indexes with the same schema to be searched. So it is possible. The tricky
part we're noticing is managing updates to the same document. If you don't
need that, you can get by pretty easily.

patrik

-Original Message-
From: Peter Thygesen [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 05, 2008 2:08 AM
To: solr-user@lucene.apache.org
Subject: RE: Multiple Search in Solr

I'm also looking for a solution with multiple indices.

So... great, are you saying the patch doesn't work, or what? And could
you elaborate a little more on the "I have written the Lucene
application" part? What did you do?


-Peter Thygesen

-Original Message-
From: Jae Joo [mailto:[EMAIL PROTECTED] 
Sent: 4. februar 2008 14:59
To: solr-user@lucene.apache.org
Subject: RE: Multiple Search in Solr

I have downloaded version 1.3 and built multiple indices.

I could not find any way to search multiple indices at the Solr level, so
I have written a Lucene application. It is working well.

Jae Joo

-Original Message-
From: Niveen Nagy [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 04, 2008 8:55 AM
To: solr-user@lucene.apache.org
Subject: Multiple Search in Solr

Hello ,

 

I have a question concerning multiple Solr indices. We have 4 Solr
indices in our system and we want to use distributed search (multiple
search) that searches the four indices in parallel. We downloaded the
latest code from svn and applied the patch distributed.patch, but we
need a more detailed description of how to use this patch, what changes
should be applied to the Solr schema, and how these indices should be
located. Another question: can the steps be applied to indices that
were built using a version from before the distributed patch was applied?

 

 Thanks in advance.

   

Best Regards,

 

Niveen Nagy