SolrJ and Unique Doc ID
What's the best way to retrieve the unique key field from SolrJ? From what I can tell, it seems like I would need to retrieve the schema and then parse it and get it from there, or am I missing something? Thanks, Grant
Re: Search result not coming for normal special characters...
Thanks Erick. I have tried WhitespaceAnalyzer as you said.

-> In my schema.xml I have removed the filter class "solr.WordDelimiterFilterFactory" for both indexing and querying.
-> With it removed, the special-character search works fine, but then this scenario fails:
   example: indexed data: sriHari, sweetHeart, mike Oliver
   search data: if I search for sri, sweet, mike or oliver, the results come back correctly. But if I search for "Hari" or "Heart", nothing is returned. I cannot match a word that appears in the middle of a term.
-> I found that "solr.WordDelimiterFilterFactory" splits the word and makes the middle-of-word search possible, but special characters are ignored with it in place.
-> I need both scenarios to work. Is it possible? Any idea or solution?

Thanks, Nithya.

> When in doubt, use WhitespaceAnalyzer and build up from there. It's the
> simplest. Look at the Lucene docs for what the various analyzers do under
> the covers. Note: WhitespaceAnalyzer does NOT transform to lowercase; you
> have to do that yourself or compose your own analyzer.
> Erick
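A minimal sketch of the "compose your own analyzer" step Erick mentions, using the Lucene 2.x API that was current for this thread; the class name is illustrative, not an existing Lucene class:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Hypothetical helper: whitespace tokenization plus lowercasing, so
    // "Hari" and "hari" both match. Special characters inside tokens are
    // left untouched, unlike WordDelimiterFilterFactory.
    public class LowercaseWhitespaceAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
      }
    }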
Re: SolrJ and Unique Doc ID
Right now you need to know the unique key name to get it... I don't think we have any easy way to get that besides parsing the schema.

With debugQuery=true, the uniqueKey is added to the 'explain' info: ... this gets parsed into the QueryResults _explainMap and _docIdMap, but I'm not sure that is useful in the general sense...

ryan

Grant Ingersoll wrote:
> What's the best way to retrieve the unique key field from SolrJ? From
> what I can tell, it seems like I would need to retrieve the schema and
> then parse it and get it from there, or am I missing something?
> Thanks, Grant
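A minimal sketch of the parse-the-schema fallback ryan describes, assuming the client has a copy of (or HTTP access to) schema.xml; at the time of this thread there is no SolrJ call for it:

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class UniqueKeyReader {
      // Read the <uniqueKey> element straight out of schema.xml.
      // schemaPath is wherever your schema.xml lives (assumption).
      public static String uniqueKey(String schemaPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(schemaPath);
        return doc.getElementsByTagName("uniqueKey")
            .item(0).getTextContent().trim();
      }
    }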
Re: SolrJ and Unique Doc ID
Hmmm, I should have just mandated that the id field be called "id" from the start :-) On Feb 11, 2008 5:51 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > What's the best way to retrieve the unique key field from SolrJ? From > what I can tell, it seems like I would need to retrieve the schema and > then parse it and get it from there, or am I missing something? > > Thanks, > Grant >
Re: solrj and multiple slaves
On 2/11/08 8:42 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> if you want to worry about smart load balancing, try to load balance based
> on the nature of the URL query string ... make your load balancer pick
> a slave by hashing on the "q" param for example.

This is very effective. We used this at Infoseek ten years ago. An easy way to do it is to have the client code compute the hash and add it as an extra parameter, then have the load balancer switch on that param. Something like this: &preferred_server=2

wunder
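A minimal sketch of the client side of wunder's suggestion; the parameter name comes from his mail, but the host, slave count, and hashing scheme are assumptions:

    import java.net.URLEncoder;

    public class PreferredServer {
      // Derive a stable server hint from the query string so identical
      // queries always land on the same slave (and the same queryResultCache).
      public static String searchUrl(String q, int numSlaves) throws Exception {
        int preferred = Math.abs(q.hashCode() % numSlaves);
        return "http://search.example.com/solr/select?q="
            + URLEncoder.encode(q, "UTF-8")
            + "&preferred_server=" + preferred;
      }
    }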
Re: solrj and multiple slaves
: I have a quick question about using solrj to connect to multiple slaves.
: My application is deployed on multiple boxes that have to talk to
: multiple solr slaves. In order to take advantage of the queryResult
: cache, each request from one of my app boxes should be redirected to the
: same solr slave.

i've never once worried about "session affinity" when dealing with Solr ... if a query is common/important enough that it's going to be a cache hit, it will probably be a cache hit on all the servers. besides which: just because two queries come from the same client doesn't mean they have anything to do with each other -- i'm typically just as likely to get the same query from two different clients as i am twice from the same client.

if you want to worry about smart load balancing, try to load balance based on the nature of the URL query string ... make your load balancer pick a slave by hashing on the "q" param for example.

the one situation where i worry about sending certain traffic to some Solr boxes and other traffic to other Solr boxes is when i know that the client apps have very different query usage patterns ... then i have two separate tiers of slaves -- identical indexes, but different solrconfigs. the clients that hit my custom faceting plugin use one tier with a big custom cache and filterCache. the clients that do more traditional searching using dismax hit a second tier which has no custom cache, a smaller filterCache and a bigger queryResultCache ... but even then i don't worry about session IDs ... i just configure the two client applications with different DNS aliases.

-Hoss
Re: Index will get change or refresh after restart the solr server?
: When I again start, what will happen in the "data" folder? Any data
: refreshing, adding, deleting etc..
: On every restart of the solr server, what happens to the indexed data? Does it
: remain unchanged without any action?

if you start up Solr and there is already a "data" directory containing an "index" directory, then Solr will use that index. If there is no index directory, then Solr will create it (it will not create the "data" directory -- if that's missing you get an error).

: morning data:
: [add/doc XML stripped by the archive: two documents, "primaryAdmin" and "secondaryAdmin"]
:
: Evening data:
: [add/doc XML stripped by the archive: one document, "primaryAdmin"]
:
: These are the data I indexed. But when I search for "primaryAdmin" it
: returns only the data which was indexed at that time..

i do not understand your question. there could be lots of things going on here, but it's not at all clear that anything is actually going wrong. did the document you indexed in the evening have the same value for the uniqueKey field as the document you indexed in the morning?

Your best bet for getting meaningful help with your problem is to be very explicit about exactly what it is you are doing and what results you are getting ... show us your schema.xml, show us the full XML of every doc you index, list every action you take (including when you stop/start your tomcat port), etc...

-Hoss
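A minimal SolrJ sketch of the overwrite behavior Hoss is asking about, under the assumption that both documents share the uniqueKey value; the field names, key value, and server URL are illustrative:

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class OverwriteDemo {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Morning: index a doc with uniqueKey "1" (illustrative values).
        SolrInputDocument morning = new SolrInputDocument();
        morning.addField("id", "1");
        morning.addField("name", "primaryAdmin");
        server.add(morning);
        server.commit();

        // Evening: a doc with the SAME uniqueKey replaces the first, so
        // only the evening version is found afterwards -- which would
        // explain seeing only the most recently indexed data.
        SolrInputDocument evening = new SolrInputDocument();
        evening.addField("id", "1");
        evening.addField("name", "primaryAdmin");
        server.add(evening);
        server.commit();
      }
    }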
Re: range vs. filter queries
: essentially, this is:
: +north:[* TO nnn] +south:[sss TO *] +east:[* TO eee] +west:[www TO *]
: Would this be better as four individual filters?

it depends on the granularity you expect clients to query with ... if clients can get really granular, then the odds of reuse are lower, so the advantages of individual filters are gone. if however you know that the granularity of your input will always be something coarse -- like a multiple of 15 degrees, or even 5 degrees -- then it's probably practical to break it out.

Something else to consider: use cached range filters for the coarse aspects, but use uncached range filters for the precision stuff. i know you're searching for areas, but for simplicity assume the docs exist at a point and the query is a box... if the input box is "N+42.5 S+13.7 W-78.2 E-62.4" you can cache filters for lat:[15 TO 45] and lon:[-75 TO -60] and then intersect and union those results with uncached queries for lat:[13.7 TO 15], lat:[42.5 TO 45], lon:[-78.2 TO -75], lon:[-62.4 TO -60] ...

I've never tried this so i'm not sure if the cost/benefit trade off actually makes sense ... but the principle seems sound.

-Hoss
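A minimal SolrJ sketch of a simpler variant of Hoss's idea, using his point-in-box simplification: put coarse, 15-degree-aligned ranges in fq (where Solr's filterCache can reuse them across requests) and keep the exact box in the main query so it is never cached. The lat/lon field names and the alignment are assumptions:

    import org.apache.solr.client.solrj.SolrQuery;

    public class CoarseFineQuery {
      // Coarse grid-aligned ranges go into fq; the precise box stays in q.
      // The fq is a superset of the box, so intersecting the two is exact.
      public static SolrQuery boxQuery(double n, double s, double e, double w) {
        SolrQuery query = new SolrQuery();
        query.setQuery("+lat:[" + s + " TO " + n + "] +lon:[" + w + " TO " + e + "]");
        query.addFilterQuery("lat:[" + snapDown(s) + " TO " + snapUp(n) + "]");
        query.addFilterQuery("lon:[" + snapDown(w) + " TO " + snapUp(e) + "]");
        return query;
      }
      static int snapDown(double v) { return (int) (Math.floor(v / 15) * 15); }
      static int snapUp(double v)   { return (int) (Math.ceil(v / 15) * 15); }
    }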
RE: range vs. filter queries
Is it not possible to make a grid of your boxes? It seems like this would be a more efficient query: grid:N100_S50_E250_W412
This is how GIS systems work, right?

Lance

-----Original Message-----
From: Ryan McKinley [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 11, 2008 6:13 PM
To: solr-user@lucene.apache.org
Subject: Re: range vs. filter queries

>> Would this be better as four individual filters?
>
> Only if they were likely to occur again in combination with different
> constraints. My guess would be no.

this is because the filter could not be cached? Since I know it should not be cached, is there any way to make sure it does not purge useful stuff from the cache?

> Perhaps you want 2 fields (lat and long) instead of 4?

2 is fine if I was dealing with points, but this is a region, so I need to deal with a whole region (N, S, E, and W).

> One issue here is that range queries which include many terms are currently slow.
> That's something we need to address sometime (there has been some work
> on this in Lucene, but nothing yet committed AFAIK).

do range queries operate on the whole index, or can they be limited first? That is, if I can throw out half the docs with a simple TermQuery, does the range still have to go through everything?

thanks
ryan
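A minimal sketch of the index-time side of Lance's grid idea: snap each region to a coarse grid and emit one term per covered cell, so a plain TermQuery can rule out most documents before any range checks. The cell size, term format, and field handling are assumptions, not an existing Solr feature:

    import java.util.ArrayList;
    import java.util.List;

    public class GridCells {
      static final int CELL = 10; // degrees per cell (assumption)

      // Emit one term per grid cell that a bounding box overlaps. Index
      // these in a multi-valued string field; query with a TermQuery per
      // cell, then refine the survivors with the precise range queries.
      public static List<String> cells(double n, double s, double e, double w) {
        List<String> cells = new ArrayList<String>();
        for (int lat = floorCell(s); lat <= floorCell(n); lat++) {
          for (int lon = floorCell(w); lon <= floorCell(e); lon++) {
            cells.add("cell_" + lat + "_" + lon);
          }
        }
        return cells;
      }

      static int floorCell(double v) { return (int) Math.floor(v / CELL); }
    }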
Re: range vs. filter queries
On Feb 11, 2008 9:13 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> >> Would this be better as four individual filters?
> >
> > Only if they were likely to occur again in combination with different
> > constraints. My guess would be no.
>
> this is because the filter could not be cached?

right. It's probably minor though... the bigger cost will be the generation of those range queries.

> Since I know it should not be cached, is there any way to make sure it does
> not purge useful stuff from the cache?
>
> > Perhaps you want 2 fields (lat and long) instead of 4?
>
> 2 is fine if I was dealing with points, but this is a region, so I need
> to deal with a whole region (N, S, E, and W).

If it's a bounding box, it can be defined by 2 range queries, right?

> > One issue here is that range queries which include many terms are currently slow.
> > That's something we need to address sometime (there has been some work
> > on this in Lucene, but nothing yet committed AFAIK).
>
> do range queries operate on the whole index, or can they be limited
> first? That is, if I can throw out half the docs with a simple
> TermQuery, does the range still have to go through everything?

Needs to go through everything. No easy way to avoid that right now.

-Yonik
Re: range vs. filter queries
>> Would this be better as four individual filters?
>
> Only if they were likely to occur again in combination with different
> constraints. My guess would be no.

this is because the filter could not be cached? Since I know it should not be cached, is there any way to make sure it does not purge useful stuff from the cache?

> Perhaps you want 2 fields (lat and long) instead of 4?

2 is fine if I was dealing with points, but this is a region, so I need to deal with a whole region (N, S, E, and W).

> One issue here is that range queries which include many terms are currently slow.
> That's something we need to address sometime (there has been some work
> on this in Lucene, but nothing yet committed AFAIK).

do range queries operate on the whole index, or can they be limited first? That is, if I can throw out half the docs with a simple TermQuery, does the range still have to go through everything?

thanks
ryan
Re: range vs. filter queries
On Feb 11, 2008 8:51 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> Hello-
>
> I'm working on a SearchComponent that should limit results to entries
> within a geographic range. I would love some feedback to make sure I'm
> not building silly queries and/or can change them to be better. I have
> four fields:
>
> [four sfloat field declarations stripped by the archive: north, south, east, west]
>
> The component looks for a "bounds" argument and parses out the NSEW
> corners. Currently, I'm building a boolean query and adding that to the
> filter list:
>
>    FieldType ft = req.getSchema().getFieldTypes().get( "sfloat" );
>
>    BooleanQuery range = new BooleanQuery( true );
>    range.add( new ConstantScoreRangeQuery( "north", null,
>        ft.toInternal(n), true, true ), BooleanClause.Occur.MUST );
>    range.add( new ConstantScoreRangeQuery( "south",
>        ft.toInternal(s), null, true, true ), BooleanClause.Occur.MUST );
>    range.add( new ConstantScoreRangeQuery( "east", null,
>        ft.toInternal(e), true, true ), BooleanClause.Occur.MUST );
>    range.add( new ConstantScoreRangeQuery( "west", ft.toInternal(w),
>        null, true, true ), BooleanClause.Occur.MUST );
>
> essentially, this is:
> +north:[* TO nnn] +south:[sss TO *] +east:[* TO eee] +west:[www TO *]
>
> Would this be better as four individual filters?

Only if they were likely to occur again in combination with different constraints. My guess would be no.

Perhaps you want 2 fields (lat and long) instead of 4?

One issue here is that range queries which include many terms are currently slow. That's something we need to address sometime (there has been some work on this in Lucene, but nothing yet committed AFAIK).

-Yonik
range vs. filter queries
Hello-

I'm working on a SearchComponent that should limit results to entries within a geographic range. I would love some feedback to make sure I'm not building silly queries and/or can change them to be better. I have four fields:

[four sfloat field declarations stripped by the archive: north, south, east, west]

The component looks for a "bounds" argument and parses out the NSEW corners. Currently, I'm building a boolean query and adding that to the filter list:

   FieldType ft = req.getSchema().getFieldTypes().get( "sfloat" );

   // true = coord scoring disabled; every clause below is a MUST anyway
   BooleanQuery range = new BooleanQuery( true );
   range.add( new ConstantScoreRangeQuery( "north", null,
       ft.toInternal(n), true, true ), BooleanClause.Occur.MUST );
   range.add( new ConstantScoreRangeQuery( "south",
       ft.toInternal(s), null, true, true ), BooleanClause.Occur.MUST );
   range.add( new ConstantScoreRangeQuery( "east", null,
       ft.toInternal(e), true, true ), BooleanClause.Occur.MUST );
   range.add( new ConstantScoreRangeQuery( "west", ft.toInternal(w),
       null, true, true ), BooleanClause.Occur.MUST );

essentially, this is:
+north:[* TO nnn] +south:[sss TO *] +east:[* TO eee] +west:[www TO *]

Would this be better as four individual filters?

Additionally, I could chunk the world into a grid and index whether a point exists within a square. This could potentially cut out many results with a simple term query, but I don't know if it is worthwhile since I will need to run the points through a range query at the end anyway.

Any thoughts or feedback would be great.

thanks
ryan
Re: Highlight on non-text fields and/or field-match list
: to. For example, if I have a field in a document such as "username" which is
: a string that I'll do wild-card searches on, Solr will return document
: matches but no highlight data for that field. The end-goal is to know which

FYI: this is a known bug that results from a "safety" net in the SolrQueryParser...

https://issues.apache.org/jira/browse/SOLR-195

...wildcards work in the trunk, and there is a workaround for prefix queries mentioned in the issue (you trick the query parser into doing a wildcard query).

In general "fields" don't match queries, "documents" match queries ... highlighting can show you places where "terms" and "phrases" appear in documents, but that doesn't guarantee that the highlighted "terms" are the reason the document matched the query. the explain info is the only thing that can do that.

-Hoss
Re: Commit strategies
if you just want commits to happen at a regular frequency, take a look at the autoCommit options.

as for the specific errors you are getting, i don't know enough python to understand them, but it may just be that your commits are taking too long and your client is timing out waiting for the commit to finish. have you tried increasing the timeout?

: How do people generally approach the deferred commit issue? Do I need to queue
: index and search requests myself or does Solr handle it? My app indexes about
: 100 times more than it searches, but searching is more time critical. Does
: that change anything?

searches can go on happily while commits/adds are happening, and multiple adds can happen in parallel ... but all adds block while a commit is taking place. i just give all of the clients that update the index a really large timeout value (ie: 60 seconds or so) and don't worry about queuing up indexing requests. the only intelligence you typically need to worry about is that there's very little reason to ever do a commit if you know you've got more adds ready to go.

-Hoss
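A minimal SolrJ sketch of the large-timeout advice, assuming the CommonsHttpSolrServer timeout setters available in SolrJ at the time; the exact values and URL are illustrative:

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class IndexingClient {
      // Indexing clients get a generous read timeout so a slow commit does
      // not make them give up (and retry, and pile up) mid-request.
      public static CommonsHttpSolrServer newServer() throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.setConnectionTimeout(5000); // 5s to establish the connection
        server.setSoTimeout(60000);        // 60s to wait for a response
        return server;
      }
    }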
Performance help for heavy indexing workload
Hello,

I'm looking for some configuration guidance to help improve performance of my application, which does a lot more indexing than searching. At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds.

This is running on a medium-sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. In total, there are about 1.5m documents in the index. I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task that issues a commit every 15 seconds.

Obviously, every workload varies, but could anyone comment on whether this sort of hardware should, with proper configuration, be able to manage this sort of workload? I can't see signs of Solr being IO-bound, CPU-bound or memory-bound, although my scheduled commit operation, or perhaps GC, does spike up the CPU utilisation at intervals.

Any help appreciated!
James
Re: SolrJ and Unique Doc ID
: Another option is to add it to the responseHeader. Or it could be a quick
: add to the LukeRH. The former has the advantage that we wouldn't have to make

adding the info to LukeRequestHandler makes sense. Honestly: i can't think of a single use case where client code would care about what the uniqueKey field is, unless it already *knew* what the uniqueKey field is.

: Of course, it probably would be useful to be able to request the schema from
: the server and build an IndexSchema object on the client side. This could be
: added to the LukeRH as well.

somebody was working on that at some point ... but i may be thinking of the Ruby client ... no, i'm pretty sure i remember it coming up in the context of Java, because i remember discussion that a full "IndexSchema" was too much because it required the client to have the class files for all of the analysis chain and fieldtype classes.

-Hoss
Re: range vs. filter queries
Lance Norskog wrote:
> Is it not possible to make a grid of your boxes? It seems like this would
> be a more efficient query: grid:N100_S50_E250_W412
> This is how GIS systems work, right?

something like that... I was just checking if I could get away with range queries for now... I'll also check whether local lucene is possible:

http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene.htm

ryan
Re: SolrJ and Unique Doc ID
Another option is to add it to the responseHeader. Or it could be a quick add to the LukeRH. The former has the advantage that we wouldn't have to make extra calls, at the cost of sending an extra string w/ every message. The latter would work by asking for it up front and then saving it aside. Any preference? Or we could add it to both, making the responseHeader one optional.

Of course, it probably would be useful to be able to request the schema from the server and build an IndexSchema object on the client side. This could be added to the LukeRH as well.

Hindsight is 20/20...

On Feb 11, 2008, at 6:51 PM, Ryan McKinley wrote:
> thoughts on requiring that for solrj? perhaps in 2.0? Not suggesting it
> is a good idea (yet)... but we may want to consider it.
>
> Yonik Seeley wrote:
>> Hmmm, I should have just mandated that the id field be called "id"
>> from the start :-)
>>
>> On Feb 11, 2008 5:51 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>> What's the best way to retrieve the unique key field from SolrJ? From
>>> what I can tell, it seems like I would need to retrieve the schema and
>>> then parse it and get it from there, or am I missing something?
>>> Thanks, Grant
Re: SolrJ and Unique Doc ID
thoughts on requiring that for solrj? perhaps in 2.0? Not suggesting it is a good idea (yet)... but we may want to consider it.

Yonik Seeley wrote:
> Hmmm, I should have just mandated that the id field be called "id" from
> the start :-)
>
> On Feb 11, 2008 5:51 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>> What's the best way to retrieve the unique key field from SolrJ? From
>> what I can tell, it seems like I would need to retrieve the schema and
>> then parse it and get it from there, or am I missing something?
>> Thanks, Grant
RE: Multiple Search in Solr
It's based on SOLR 1.2; however, it's customized for our application to do this. I'm only mentioning that it's possible by changing DirectUpdateHandler2 to handle multiple indexes.

pb

-----Original Message-----
From: Niveen Nagy [mailto:[EMAIL PROTECTED]]
Sent: Sunday, February 10, 2008 1:47 AM
To: solr-user@lucene.apache.org
Subject: RE: Multiple Search in Solr

Could you please clarify which version?

Best Regards,
Niveen Nagy
Software Engineer

-----Original Message-----
From: patrik [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 06, 2008 10:10 PM
To: solr-user@lucene.apache.org
Subject: RE: Multiple Search in Solr

We're using a version of SOLR that we've customized to allow multiple indexes with the same schema to be searched. So, it is possible. The tricky part we're noticing is managing updates to the same document. If you don't need that, you can get by pretty easily.

patrik

-----Original Message-----
From: Peter Thygesen [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, February 05, 2008 2:08 AM
To: solr-user@lucene.apache.org
Subject: RE: Multiple Search in Solr

I'm also looking for a solution with multiple indices. So... great, are you saying the patch doesn't work, or what? And could you elaborate a little more on the "I have written the Lucene application.."? What did you do?

-Peter Thygesen

-----Original Message-----
From: Jae Joo [mailto:[EMAIL PROTECTED]]
Sent: 4. februar 2008 14:59
To: solr-user@lucene.apache.org
Subject: RE: Multiple Search in Solr

I have downloaded version 1.3 and built multiple indices. Since I could not find any way to search multiple indices at the Solr level, I have written a Lucene application. It is working well.

Jae Joo

-----Original Message-----
From: Niveen Nagy [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 04, 2008 8:55 AM
To: solr-user@lucene.apache.org
Subject: Multiple Search in Solr

Hello,

I have a question concerning solr multiple indices. We have 4 solr indices in our system and we want to use distributed search (multiple search) that searches the four indices in parallel. We downloaded the latest code from svn and applied the patch distributed.patch, but we need a more detailed description of how to use this patch: what changes should be applied to the solr schema, and how should these indices be located? Another question: can the same steps be applied to indices that were built with a version from before the distributed patch?

Thanks in advance.

Best Regards,
Niveen Nagy
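A minimal sketch of the kind of Lucene-level application Jae describes, using the MultiSearcher class from the Lucene 2.x API of the period; the index paths, field name, and query are assumptions:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.TermQuery;

    public class MultiIndexSearch {
      public static void main(String[] args) throws Exception {
        // Open each Solr index directly (paths are illustrative) and
        // search them all in one call.
        Searchable[] searchers = {
            new IndexSearcher("/data/solr1/data/index"),
            new IndexSearcher("/data/solr2/data/index"),
        };
        MultiSearcher multi = new MultiSearcher(searchers);
        Hits hits = multi.search(new TermQuery(new Term("title", "solr")));
        System.out.println("total hits: " + hits.length());
        multi.close();
      }
    }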