Re: Spatial Solr (JTeam)
I have also moved the jar into the global core's lib directory, and I still have this issue. I am running Mac OS X Snow Leopard, java version "1.6.0_17" Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025) Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode). I really don't know where the issue comes from. On Mon, Dec 28, 2009 at 2:54 PM, Mauricio Scheffer < mauricioschef...@gmail.com> wrote: > Seems to work for me... (I mean, I don't get a NoClassDefFoundError but I > have other issues). > I just put spatial-solr-1.0-RC3.jar in the core's lib directory and it > worked. > > On Wed, Dec 23, 2009 at 8:25 PM, Thomas Rabaix wrote: > > Hello, > > I would like to set up the spatial solr plugin from > > http://www.jteam.nl/news/spatialsolr on solr 1.4. However I am getting an > > error message when solr starts. > > > > SEVERE: java.lang.NoClassDefFoundError: > > org/apache/solr/search/QParserPlugin > > > > I guess nl.jteam.search.solrext.spatial.SpatialTierQueryParserPlugin > > extends QParserPlugin. I have checked the solr.war file (the one > > provided on the solr download page) and the class is present. > > > > Do you know if the current version "SSP version 1.0-RC3" is compatible with > > solr 1.4? > > > > Thanks > > > > -- > > Thomas Rabaix > > -- Thomas Rabaix http://rabaix.net
Re: Remove the deleted docs from the Solr Index
On Wed, Dec 30, 2009 at 12:10 AM, Mohamed Parvez wrote: > Ditto. There should have been a DIH command to re-sync the Index with the > DB. But there is such a command; it is called full-import. -- Regards, Shalin Shekhar Mangar.
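For reference, assuming the DataImportHandler is registered at /dataimport (as in the example solrconfig), a full re-sync can be triggered with a request like:

  http://localhost:8983/solr/dataimport?command=full-import

A delta-import command also exists for incremental syncs driven by a delta query.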
Re: Search algorithm used in Solr
On Mon, Jan 4, 2010 at 11:39 AM, wrote: > Hello everyone, > > Is there an article which explains (on a high level) the algorithm of > search in Solr? > > How does Solr search approach compare to the "inverted index" technique? > > Solr uses Lucene. It is the same inverted index technique at work. -- Regards, Shalin Shekhar Mangar.
Re: Optimize not having any effect on my index
Hey, I managed to run it correctly after a few restarts. Don't really know what happened. Can't really see what this would have had to do with compound file format though? But no, I'm not using compound file format. Cheers and thanks for your replies, Aleks On Mon, Dec 21, 2009 at 8:27 AM, gurudev wrote: > > Hi, > > Are you using the compound file format? If yes, then have you set it properly > in solrconfig.xml? If not, then change to: > > <useCompoundFile>true</useCompoundFile> (this is by default 'false') under > the tags <indexDefaults> and <mainIndex> > > > > Aleksander Stensby wrote: > > > > Hey guys, > > I'm getting some strange behavior here, and I'm wondering if I'm doing > > anything wrong.. > > > > I've got an unoptimized index, and I'm trying to run the following > > command: > > http://server:8983/solr/update?optimize=true&maxSegments=10&waitFlush=false > > Tried it first directly in the browser; it obviously took quite a bit of > > time, but once it was finished I see no difference in my index. Same > > number > > of files, same size etc. > > So I tried with curl: > > curl http://server:8983/solr/update --data-binary '<optimize/>' -H > > 'Content-type:text/xml; charset=utf-8' > > > > No difference here either... Am I doing anything wrong? Do I need to issue > > a > > commit after the optimize? > > > > Any pointers would be greatly appreciated. > > > > Cheers, > > Aleks > > > > > > -- > View this message in context: > http://old.nabble.com/Optimize-not-having-any-effect-on-my-index-tp26843094p26870653.html > Sent from the Solr - User mailing list archive at Nabble.com. > >
Facets and distributed search
Hi everyone! I've posted a similar question earlier, but in a thread related to facets in general, so I thought I'd repost it here as a separate thread. I have a faceted search that is very fast when I execute the query on a single solr server, but is significantly slower when executed in a distributed environment. The bottleneck seems to be in the sharding of our data.. And that puzzles me a little bit... I can't really see why SOLR is so slow at doing this. The scenario: Let's say we have two servers (s1 and s2). If I query the following: q=threadid:33&facet=true&facet.field=author&limit=-1&facet.mincount=0&rows=0 directly on either server, the response is lightning fast. (<10ms) So, in theory I could query them directly, concat the results myself and get that done pretty fast. But if I introduce the shards parameter, the response time booms to between 15000ms and 2ms! shards=s1:8983/solr,s2:8983/solr My initial thought is that I MUST be doing something wrong here? So I try the following: Run the query on server s1, with the shards param shards=s1:8983/solr response time goes from sub 10ms to between 5000ms and 1ms! Same results if I run the query on s2, and same if I use shards=s2:8983/solr Is there really that much overhead in running a distributed facet field query with Solr? Anyone else experienced this? On the other hand, running regular queries without facets distributed is lightning fast... (so I can't really see that this is a network problem or anything either). - I tried running a facet query on s1 with s1 as the shards param, and that is still as slow as if the shards param was pointed to a different server... Any insight into this would be greatly appreciated! (Would like to avoid having to hack together our own solution concatenating results...) Cheers, Aleks
Re: Implementing Autocomplete/Query Suggest using Solr
On Wed, Dec 30, 2009 at 3:07 AM, Prasanna R wrote: > I looked into the Solr/Lucene classes and found the required information. > Am summarizing the same for the benefit of those that might refer to this > thread in the future. > > The change I had to make was very simple - make a call to getPrefixQuery > instead of getWildcardQuery in my custom-modified Solr dismax query parser > class. However, this will make a fairly significant difference in terms of > efficiency. The key difference between the lucene WildcardQuery and > PrefixQuery lies in their respective term enumerators, specifically in the > term comparators. The termCompare method for PrefixQuery is more > light-weight than that of WildcardQuery and is essentially an optimization > given that a prefix query is nothing but a specialized case of Wildcard > query. Also, this is why the lucene query parser automatically creates a > PrefixQuery for query terms of the form 'foo*' instead of a WildcardQuery. > > I don't understand this. There is nothing that one should need to do in Solr's code to make this work. Prefix queries are supported out of the box in Solr. > And one final request for Comment to Shalin on this topic - I am guessing > you ensured there were no duplicate terms in the field(s) used for > autocompletion. For our first version, I am thinking of eliminating the > duplicates outside of the results handler that gives suggestions since > duplicate suggestions originate only from different document IDs in our > system and we do want the list of document IDs matched. Is there a > better/different way of doing the same? > > No, I guess not. -- Regards, Shalin Shekhar Mangar.
Re: Solr Cell - PDFs plus literal metadata - GET or POST ?
On Wed, Dec 30, 2009 at 7:49 AM, Ross wrote: > Hi all > > I'm experimenting with Solr. I've successfully indexed some PDFs and > all looks good but now I want to index some PDFs with metadata pulled > from another source. I see this example in the docs. > > curl " > http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah > " > -F "tutori...@tutorial.pdf" > > I can write code to generate a script with those commands substituting > my own literal.whatever. My metadata could be up to a couple of KB in > size. Is there a way of making the literal a POST variable rather than > a GET? With Curl? Yes, see the man page. > Will Solr Cell accept it as a POST? Yes, it will. -- Regards, Shalin Shekhar Mangar.
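For reference, a sketch of the POST form: with curl, each -F flag becomes a multipart form field, and Solr should read the literal.* values from those fields just as it would from the query string. The id value and the file field name here are made up:

  curl "http://localhost:8983/solr/update/extract?captureAttr=true&defaultField=text" \
       -F "literal.id=doc4" \
       -F "literal.blah_s=Bah" \
       -F "myfile=@tutorial.pdf"

This keeps large metadata values out of the URL entirely.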
Re: performance question
On Jan 4, 2010, at 12:04 AM, A. Steven Anderson wrote: dynamic fields don't make it worse ... the number of actual field names you sort on makes it worse. If you sort on 100 fields, the cost is the same regardless of whether all 100 of those fields exist because of a single declaration, or 100 distinct declarations. Ahh...thanks for the clarification. So, in general, there is no *significant* performance difference with using dynamic fields. Correct? Correct. There's not even really an "insignificant" performance difference. A dynamic field is the same as a regular field in practically every way on the search side of things. Erik
Re: Search both diacritics and non-diacritics
On Sun, Jan 3, 2010 at 6:01 AM, Lance Norskog wrote: > The ASCIIFoldingFilter is a superset of the ISOLatin1Filter - > ISOLatin1 is deprecated. Here's the Javadoc from ASCIIFoldingFIlter. > You did not mention which language you want to search. > > Unforch, the ASCIIFoldingFilter is not mentioned on the Solr wiki. > > Thanks Lance. I've added it to the wiki at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters -- Regards, Shalin Shekhar Mangar.
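For anyone searching the archives later, a minimal fieldType sketch using the filter (the type name is made up; adjust the tokenizer to taste):

  <fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
  </fieldType>

Because the same analyzer runs at index and query time, both 'café' and 'cafe' fold to the same term, so either query form matches documents containing either form.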
Re: Configuring Solr to use RAMDirectory
On Thu, Dec 31, 2009 at 3:36 PM, dipti khullar wrote: > Hi > > Can somebody let me know if it's possible to configure RAMDirectory from > solrconfig.xml? Although it's clearly mentioned in > https://issues.apache.org/jira/browse/SOLR-465 by Mark that he has worked > on it, I still couldn't find any such property in the config file in the Solr > 1.4 latest download. > Maybe I am overlooking some simple property. Any help would be > appreciated. > > Note that there are things like replication which will not work if you are using a RAMDirectory. -- Regards, Shalin Shekhar Mangar.
Re: Rules engine and Solr
On Mon, Jan 4, 2010 at 10:24 AM, Avlesh Singh wrote: > I have a Solr (version 1.3) powered search server running in production. > Search is keyword driven and is supported using custom fields and tokenizers. > > I am planning to build a rules engine on top of search. The rules are database > driven and can't be stored inside solr indexes. These rules would > ultimately do two things - > > 1. Change the order of Lucene hits. > A Lucene FieldComparator is what you'd need. The QueryElevationComponent uses this technique. > 2. Add/remove some results to/from the Lucene hits. > > This is a bit more tricky. If you will always have a very limited number of docs to add or remove, it may be best to change the query itself to include or exclude them (i.e. add fq). Otherwise you'd need to write a custom Collector (see DocSetCollector) and change SolrIndexSearcher to use it. We are planning to modify SolrIndexSearcher to allow custom collectors soon for field collapsing but for now you will have to modify it. > What should be my starting point? Custom search handler? > > A custom SearchComponent which extends/overrides QueryComponent will do the job. -- Regards, Shalin Shekhar Mangar.
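A rough sketch of such a component; the class name is made up and the actual rule lookup and re-ranking logic (the hard part) is only indicated in comments:

  import java.io.IOException;
  import org.apache.solr.handler.component.QueryComponent;
  import org.apache.solr.handler.component.ResponseBuilder;

  public class RulesQueryComponent extends QueryComponent {
      @Override
      public void process(ResponseBuilder rb) throws IOException {
          // run the normal Lucene query first
          super.process(rb);
          // then inspect rb.getResults().docList and re-order or drop
          // hits according to the database-driven rules (omitted here)
      }
  }

It would be registered in solrconfig.xml with <searchComponent name="rules" class="com.example.RulesQueryComponent"/> (class path hypothetical) and swapped in for the stock query component in the handler's component list.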
Re: Invalid CRLF - StreamingUpdateSolrServer ?
Thank you Yonik for your answer. The platform encoding is "fr_FR.UTF-8", so it's still UTF-8; should it be "en_US.UTF-8", I guess? I've also tested LBHttpSolrServer (we wanted to have it as a "backup" for HAProxy) and it appears not to be thread safe (what is also curious about it is that there's no way to manage the connection pool). If you're interested in the logs, I can send those to you. *Will there be a Solr 1.4.1 that'll fix those problems?* Because using a SNAPSHOT doesn't seem a good idea to me. I have another question but I don't know if I have to make a new post: can I use "-Dmaster=disabled" in JAVA_OPTS for a server that is a slave and repeater? Patrick. Yonik Seeley wrote: It could be this bug, fixed in trunk: * SOLR-1595: StreamingUpdateSolrServer used the platform default character set when streaming updates, rather than using UTF-8 as the HTTP headers indicated, leading to an encoding mismatch. (hossman, yonik) Could you try a recent nightly build (or build your own from trunk) and see if it fixes it? -Yonik http://www.lucidimagination.com On Thu, Dec 31, 2009 at 5:07 AM, Patrick Sauts wrote: I'm using solr 1.4 on tomcat 5.0.28, with client StreamingUpdateSolrServer with 10 threads and xml communication via the POST method. Is there a way to avoid this error (data loss)? And is StreamingUpdateSolrServer reliable?

GRAVE: org.apache.solr.common.SolrException: Invalid CRLF
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:72)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:619)
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid CRLF
Re: Invalid CRLF - StreamingUpdateSolrServer ?
On Mon, Jan 4, 2010 at 6:11 PM, Patrick Sauts wrote: > > I've also tested LBHttpSolrServer (We wanted to have it as a "backup" for > HAproxy) and it appears not to be thread safe ( what is also curious about > it, is that there's no way to manage the connections' pool ). If you're > interresting in the logs, I can send those to you. > What is the issue that you are facing? What is it exactly that you want to change? -- Regards, Shalin Shekhar Mangar.
Improvising solr queries
Hi We have tried out various configuration settings to improve the performance of the site, which relies heavily on Solr, but the throughput remains about 4-5 reqs/sec. We also did some performance tests on Solr 1.4 but there was only a very minute improvement in performance. Currently we are using Solr 1.3. So our last resort remains improving the queries. We are using SolrJ - CommonsHttpSolrServer. We are trying to tune up the Solr queries being used in our project. The following sample query takes about 6 secs to execute under normal traffic. At peak hours this often increases to 10-15 secs. (sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND ((assettype:Gallery)) AND (rbcategory:"ABC XYZ" ) AND (startdate:[* TO 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO *])&rows=9&start=63&sort=date desc&facet=true&facet.field=assettype&facet.mincount=1 Similar to this query we have several much more complex queries supporting all major landing pages of our application. Just want to confirm whether anyone can identify any major flaws or issues in the sample query? Thanks Dipti
Re: Invalid CRLF - StreamingUpdateSolrServer ?
The issue was sometimes a null result during facet navigation or simple search; results were back after a refresh. We tried to change the cache settings, but same behaviour. *My implementation was:* (maybe wrong?)

LBHttpSolrServer solrServer = new LBHttpSolrServer(new HttpClient(), new XMLResponseParser(), solrServerUrl.split(","));
solrServer.setConnectionManagerTimeout(CONNECTION_TIMEOUT);
solrServer.setConnectionTimeout(CONNECTION_TIMEOUT);
solrServer.setSoTimeout(READ_TIMEOUT);
solrServer.setAliveCheckInterval(CHECK_HEALTH_INTERVAL_MS);

*What I was suggesting:* As LBHttpSolrServer is a wrapper around CommonsHttpSolrServer:

CommonsHttpSolrServer search1 = new CommonsHttpSolrServer("http://mysearch1");
search1.setConnectionTimeout(CONNECTION_TIMEOUT);
search1.setSoTimeout(READ_TIMEOUT);
search1.setConnectionManagerTimeout(solr.CONNECTION_MANAGER_TIMEOUT);
search1.setDefaultMaxConnectionsPerHost(MAX_CONNECTIONS_PER_HOST1);
search1.setMaxTotalConnections(MAX_TOTAL_CONNECTIONS1);
search1.setParser(new XMLResponseParser());

CommonsHttpSolrServer search2 = new CommonsHttpSolrServer("http://mysearch2");
search2.setConnectionTimeout(CONNECTION_TIMEOUT);
search2.setSoTimeout(READ_TIMEOUT);
search2.setConnectionManagerTimeout(solr.CONNECTION_MANAGER_TIMEOUT);
search2.setDefaultMaxConnectionsPerHost(MAX_CONNECTIONS_PER_HOST2);
search2.setMaxTotalConnections(MAX_TOTAL_CONNECTIONS2);
search2.setParser(new XMLResponseParser());

*LBHttpSolrServer solrServers = new LBHttpSolrServer(search1, search2);*

So we can manage the parameters per server. Thank you for your time. Patrick. Shalin Shekhar Mangar wrote: On Mon, Jan 4, 2010 at 6:11 PM, Patrick Sauts wrote: I've also tested LBHttpSolrServer (we wanted to have it as a "backup" for HAproxy) and it appears not to be thread safe (what is also curious about it is that there's no way to manage the connection pool). If you're interested in the logs, I can send those to you. What is the issue that you are facing? What is it exactly that you want to change?
Re: Improvising solr queries
On Mon, Jan 4, 2010 at 6:39 PM, dipti khullar wrote: > We have tried out various configuration settings to improve the > performance of the site, which relies heavily on Solr, but the throughput > remains about 4-5 reqs/sec. We also did some performance tests on Solr > 1.4 but there was only a very minute improvement in performance. Currently > we are using Solr 1.3. > That is too slow. We need more information on your setup before we can help. What kind of hardware are you using? Which OS/JVM? How much memory have you allocated to the JVM? What does your solrconfig look like? How many documents are there in your index? What is the size of the index on disk? What are the field types of the fields you are searching on? Do you do highlighting on large fields? Can you paste the cache section on the statistics page of your Solr dashboard (preferably just after a peak load)? How frequently is your index changed (i.e. how frequently do you commit)? I'd recommend an upgrade to Solr 1.4 anyway since it has major performance improvements. > > So our last resort remains improving the queries. We are using SolrJ - > CommonsHttpSolrServer > > Actually that is one of the first things that you should look at. > We are trying to tune up the Solr queries being used in our project. > The following sample query takes about 6 secs to execute under normal traffic. > At peak hours this often increases to 10-15 secs. > > (sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND > ((assettype:Gallery)) AND (rbcategory:"ABC XYZ" ) AND (startdate:[* TO > 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO > *])&rows=9&start=63&sort=date > desc&facet=true&facet.field=assettype&facet.mincount=1 > > Similar to this query we have several much more complex queries supporting all > major landing pages of our application. > > Just want to confirm whether anyone can identify any major flaws or > issues in the sample query? > > Most of those AND conditions can be separate filter queries. Filter queries can be cached separately and can therefore be re-used. See http://wiki.apache.org/solr/FilterQueryGuidance -- Regards, Shalin Shekhar Mangar.
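A sketch of the same query restructured along those lines (everything except the free-text part moves into fq parameters, which hit the filterCache on repeat use; all other parameters unchanged):

  q=rbcategory:"ABC XYZ"
  &fq=sitename:XYZ OR sitename:"All Sites"
  &fq=localeid:1237400589415
  &fq=assettype:Gallery
  &fq=startdate:[* TO 2009-12-07T23:59:00Z]
  &fq=enddate:[2009-12-07T00:00:00Z TO *]
  &rows=9&start=63&sort=date desc
  &facet=true&facet.field=assettype&facet.mincount=1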
RE: Reverse sort facet query [SOLR-1672]
> Date: Sun, 3 Jan 2010 22:18:33 -0800 > From: hossman_luc...@fucit.org > To: solr-user@lucene.apache.org > Subject: RE: Reverse sort facet query [SOLR-1672] > > > : Yes, I thought about adding some 'new syntax', but I opted for a separate > 'facet.sortorder' parameter, > : > : mainly because I'm not familiar enough with the codebase to know what > effect this might have on > : > : backward compatibility. It would be easy enough to modify the patch I > created to do it this way. > > it shouldn't really affect anything -- it wouldn't really be new syntax, > just extending hte existing "sort" param syntax to apply to the > "facet.sort" param. The only back compat concern is making sure we > continue to support true/false as aliases, and having the default order > match the current bahvior if asc/desc aren't specified. > > > -Hoss > Yes, agreed. The current patch doesn't touch the b/w true/false aliasing, and any move to adding a new attr can keep all that intact. I've been using the current patch extensively in our testing, and that's working well. The only caveat to this is that the reverse sort results don't include 0-count facets (see notes in SOLR-1672), so reverse sort results start with the first count=1. This could be confusing as there could well be many facets whose count is 0, and it might be expected that these be returned in the first instance. >From my admittedly cursory look into the codebase regading this, I believe >patching to include 0 counts could open a can of worms in terms of b/w compat and performance, as 0 counts look to be skipped (by default). I could be wrong, and you may know better how changes to SimpleFacets/UnInvertedField would affect performance and compatibility. If there is indeed a performance optimization in facet counting iteration, it would, imo, be preferable to have the optimization, rather than the 0-counts. Would you like me to go ahead and amend the patch (w/o 0-counts) to define a new 'sort' parameter? For naming, I would propose an extension of FacetParams.FACET_SORT_COUNT ala: public static final String FACET_SORT_COUNT_REVERSE = "count.reverse"; I can then easily modify the patch to detect/use this value to invoke the new behaviour. Comments? Suggestions? Thanks, Peter _ Have more than one Hotmail account? Link them together to easily access both http://clk.atdmt.com/UKM/go/186394591/direct/01/
Re: Improvising solr queries
On Mon, Jan 4, 2010 at 7:25 PM, dipti khullar wrote: > Thanks Shalin. > > Following are the relevant details: > > There are 2 search servers in a virtualized VMware environment. Each has 2 > instances of Solr running on separate ports in tomcat. > Server 1: hosts 1 master (application 1), 1 slave (application 1) > Server 2: hosts 1 master (application 2), 1 slave (application 1) > > Have you tried a non-virtualized environment? Virtual instances are not that great for high I/O throughput environments. > Both servers have 4 CPUs and 4 GB RAM. > > Master > - 4GB RAM > - 1GB JVM heap memory is allocated to Solr > Slave1/Slave2: > - 4GB RAM > - 2GB JVM heap memory is allocated to Solr > > Solr Details: > apache-solr Version: 1.3.0 > Lucene - 2.4-dev > > - autocommit: 50 docs and 5 minutes > - optimize runs on master every 7 minutes > - using postOptimize, we execute snapshooter on master > - snappuller/snapinstaller on 2 slaves runs every 10 minutes > > You are committing every 5 minutes and optimizing every 7 minutes. Can you try committing less often? > Master and Slave1 (solr1) are on a single box and Slave2 (solr2) is on a different > box. We use HAProxy to load balance query requests between the > 2 slaves. Master is only used for indexing. > > The SolrJ client which queries Solr over HTTP gets timed out (10 sec is the timeout value), and there is high CPU usage/load avg., even though in the Solr tomcat access log we find all requests have 200 responses. The problem is reported on the slaves for application 1. > While requests time out, the load avg. of the server goes extremely high (10-20). > The issue gets resolved as soon as we optimize the slave index. In the solr > admin, it shows only 4 requests/sec handled with 400 ms response time. > > I am attaching solrconfig.xml for both master and slaves. > > There is no autowarming on the slaves, which is probably OK if you are committing so often. But do you really need to index new documents so often? -- Regards, Shalin Shekhar Mangar.
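For reference, autocommit is controlled in solrconfig.xml inside <updateHandler>; a sketch with placeholder values for a less aggressive schedule:

  <autoCommit>
    <maxDocs>10000</maxDocs>
    <!-- milliseconds -->
    <maxTime>600000</maxTime>
  </autoCommit>

Fewer commits mean fewer snapshots to pull and fewer searcher warm-ups on the slaves.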
High Availability
I'm kind of stuck and looking for suggestions for high availability options. I've figured out without much trouble how to get the master-slave replication working. This eliminates any single points of failure in the application in terms of the application's searching capability. I would set up a master which would create the index and several slaves to act as the search servers, and put them behind a load balancer to distribute the requests. This would ensure that if a slave node goes down, requests would continue to get serviced by the other nodes that are still up. The problem I have is that my particular application also has the capability to trigger index updates from the user interface. This means that the master now becomes a single point of failure for the user interface. The basic idea of the app is that there are multiple oracle instances contributing to a single document. The volume and organization of the data (database links, normalization, etc...) prevents any sort of fast querying via SQL to do querying of the documents. The solution is to build a lucene index (via solr), and use that for searching. When updates are made in the UI, we will also send the updates directly to the solr server as well (we don't want to wait some arbitrary interval for a delta query to run). So you can see the problem here is that if the master is down, the sending of the updates to the master solr server will fail, thus causing an application exception. I have tried configuring multiple solr servers which are both set up as masters and slaves to each other, but they keep clobbering each other's index updates and rolling back each other's delta updates. It seems that the replication doesn't take the generation # into account and check that the generation it's fetching is > the generation it already has before it applies it. I thought of maybe introducing a JMS queue to send my updates to and having the JMS message listener set to manually acknowledge the messages only after a successful application of the solrj api calls, but that seems kind of contrived, and is only a band-aid. Does anyone have any suggestions? mattin...@yahoo.com "Once you start down the dark path, forever will it dominate your destiny. Consume you it will " - Yoda
Re: High Availability
Have you looked into a basic floating IP setup? Have the master also replicate to another hot-spare master. Any downtime during an outage of the 'live' master would be minimal as the hot-spare takes up the floating IP. On Mon 04/01/10 16:13 , Matthew Inger wrote: > I'm kind of stuck and looking for suggestions for high availability > options. I've figured out without much trouble how to get the > master-slave replication working. This eliminates any single points > of failure in the application in terms of the application's searching > capability. > I would set up a master which would create the index and several > slaves to act as the search servers, and put them behind a load > balancer to distribute the requests. This would ensure that if a > slave node goes down, requests would continue to get serviced by the > other nodes that are still up. > The problem I have is that my particular application also has the > capability to trigger index updates from the user interface. This > means that the master now becomes a single point of failure for the > user interface. > The basic idea of the app is that there are multiple oracle > instances contributing to a single document. The volume and > organization of the data (database links, normalization, etc...) > prevents any sort of fast querying via SQL to do querying of the > documents. The solution is to build a lucene index (via solr), and > use that for searching. When updates are made in the UI, we will > also send the updates directly to the solr server as well (we don't > want to wait some arbitrary interval for a delta query to run). > So you can see the problem here is that if the master is down, the > sending of the updates to the master solr server will fail, thus > causing an application exception. > I have tried configuring multiple solr servers which are both setup > as masters and slaves to each other, but they keep clobbering each > other's index updates and rolling back each other's delta updates. > It seems that the replication doesn't take the generation # into > account and check that the generation it's fetching is > the > generation it already has before it applies it. > I thought of maybe introducing a JMS queue to send my updates to and > having the JMS message listener set to manually acknowledge the > messages only after a successful application of the solrj api calls, > but that seems kind of contrived, and is only a band-aid. > Does anyone have any suggestions?
Re: Facets and distributed search
Something looks wrong... that type of slowdown is certainly not expected. You should be able to see both the main query and a sub-query in the logs... could you post an actual example? -Yonik http://www.lucidimagination.com On Mon, Jan 4, 2010 at 4:15 AM, Aleksander Stensby wrote: > Hi everyone! I've posted a similar question earlier, but in a thread related > to facets in general, so I thought I'd repost it here as a separate thread. > > I have a faceted search that is very fast when I executed the query on a > single solr server, but is significantly slower when executed in a > distributed environment. > The set-back seem to be in the sharding of our data.. And that puzzles me a > little bit... I can't really see why SOLR is so slow at doing this. > > The scenario: > Let's say we have two servers (s1 and s2). > If i query > the following: > q=threadid:33&facet=true&facet.field=author&limit=-1&facet.mincount=0&rows=0 > directly on either server, the response is lightning fast. (<10ms) > > So, in theory I could query them directly, concat the result myself and get > that done pretty fast. > > But if I introduce the shards parameter, the response time booms to between > 15000ms and 2ms! > shards=s1:8983/solr,s2:8983/solr > > My initial thoughts is that I MUST be doing something wrong here? > > So I try the following: > Run the query on server s1, with the shards param shards=s1:8983/solr > response time goes from sub 10ms to between 5000ms and 1ms! > Same results if i run the query on s2, and same if i use shards=s2:8983/solr > > Is there really that much overhead in running a distributed facet field > query with Solr? Anyone else experienced this? > > On the other hand, running regular queries without facet distributed is > lightning fast... (so can't really see that this is a network problem or > anything either). - I tried running a facet query on s1 with s1 as the > shards param, and that is still as slow as if the shards param was pointed > to a different server... > > Any insight into this would be greatly appreciated! (Would like to avoid > having to hack together our own solution concatenating results...) > > Cheers, > Aleks >
Re: High Availability
So, when the masters switch back, does that mean we have to force a full delta update, correct? mattin...@yahoo.com "Once you start down the dark path, forever will it dominate your destiny. Consume you it will " - Yoda - Original Message From: "r...@intelcompute.com" To: solr-user@lucene.apache.org Sent: Mon, January 4, 2010 11:17:40 AM Subject: Re: High Availability Have you looked into a basic floating IP setup? Have the master also replicate to another hot-spare master. Any downtime during an outage of the 'live' master would be minimal as the hot-spare takes up the floating IP.
Re: High Availability
Even when Master 1 is alive again, it shouldn't get the floating IP until Master 2 actually fails. So you'd ideally want them replicating to each other, but since only one will be updated/live at a time, it shouldn't cause an issue with clobbering data (?). Just a suggestion tho, not done it myself on Solr, only with DB servers. On Mon 04/01/10 16:28 , Matthew Inger wrote: > So, when the masters switch back, does that mean we have to force a > full delta update, correct?
Re: High Availability
I'm also not sure what hooks you could put in upon the IP floating to the other machine, to start/stop replication - if it IS an issue anyway. On Mon 04/01/10 16:28 , Matthew Inger wrote: > So, when the masters switch back, does that mean we have to force a > full delta update, correct?
Re: Any way to modify result ranking using an integer field?
> Thanks Ahmet. > > Do I need to do anything to enable BoostQParserPlugin in > Solr, or is it already enabled? I just confirmed that it is already enabled. You can see the effect of it by appending &debugQuery=on to your search url.
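For example, assuming an indexed numeric field named 'popularity':

  http://localhost:8983/solr/select?q={!boost b=log(popularity)}ipod&debugQuery=on

The debug output shows the boost function's contribution to each document's score.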
Re: Implementing Autocomplete/Query Suggest using Solr
On Mon, Jan 4, 2010 at 1:20 AM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > On Wed, Dec 30, 2009 at 3:07 AM, Prasanna R wrote: > > > I looked into the Solr/Lucene classes and found the required > information. > > Am summarizing the same for the benefit of those that might refer to this > > thread in the future. > > > > The change I had to make was very simple - make a call to getPrefixQuery > > instead of getWildcardQuery in my custom-modified Solr dismax query > parser > > class. However, this will make a fairly significant difference in terms > of > > efficiency. The key difference between the lucene WildcardQuery and > > PrefixQuery lies in their respective term enumerators, specifically in > the > > term comparators. The termCompare method for PrefixQuery is more > > light-weight than that of WildcardQuery and is essentially an > optimization > > given that a prefix query is nothing but a specialized case of Wildcard > > query. Also, this is why the lucene query parser automatically creates a > > PrefixQuery for query terms of the form 'foo*' instead of a > WildcardQuery. > > > > > I don't understand this. There is nothing that one should need to do in > Solr's code to make this work. Prefix queries are supported out of the box > in Solr. > > I am using the dismax query parser and I match on multiple fields with different boosts. I run a prefix query on some fields in combination with a regular field query on other fields. I do not know of any way in which one could specify a prefix query on a particular field in your dismax query out of the box in Solr 1.4. I had to update Solr to support additional syntax in a dismax query that lets you choose to create a prefix query on a particular field. As part of parsing this custom syntax, I was making a call to the getWildcardQuery which I simply changed to getPrefixQuery. Prasanna.
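For anyone reading along, the Lucene calls being discussed look roughly like this (field and term are made up):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.PrefixQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.WildcardQuery;

  // both match title values starting with "foo", but PrefixQuery's
  // term enumeration is cheaper than WildcardQuery's pattern matching
  Query prefix = new PrefixQuery(new Term("title", "foo"));
  Query wild = new WildcardQuery(new Term("title", "foo*"));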
Phrase search issue with XMLPayload? Is it the better solution?
I have a project that involves words extracted by OCR; each page has words, and each word has its geometry, used to blink a highlight for the end user. I've been trying to represent this document structure in XML (the markup itself did not survive here; the example field holds the terms foo bar baz qux, each with a payload carrying its rectangle). Using the field 'fulltext_st', I can get all terms in my search result with their payloads. But if I search using a phrase query I can't fetch any result. Example: search?q=foo returns 1 result; search?q=foo+bar returns 1 result; search?q="foo bar" returns *nothing*. I was wondering if I could get your thoughts on whether XMLPayload supports this sort of thing (phrase search), or is there a better solution to index a doc with many pages and one rectangle (graphical word geometry) for each term? Thank you in advance -- View this message in context: http://old.nabble.com/Phrase-search-issue-with-XMLPayload--Is-it-the-better-solution--tp27018815p27018815.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Any way to modify result ranking using an integer field?
Thank you Ahmet. Is there any way I can configure Solr to always use {!boost b=log(popularity)} as the default for all queries? I'm using Solr through django-haystack, so all the Solr queries are actually generated by haystack. It'd be much cleaner if I could configure Solr to always use BoostQParserPlugin for all queries instead of manually modifying every single query generated by haystack. --- On Mon, 1/4/10, Ahmet Arslan wrote: From: Ahmet Arslan Subject: Re: Any way to modify result ranking using an integer field? To: solr-user@lucene.apache.org Date: Monday, January 4, 2010, 2:33 PM > Thanks Ahmet. > > Do I need to do anything to enable BoostQParserPlugin in > Solr, or is it already enabled? I just confirmed that it is already enabled. You can see the effect of it by appending &debugQuery=on to your search url.
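One approach that might work is to skip the boost parser and set a dismax boost function as a handler default in solrconfig.xml; note that bf is additive rather than multiplicative like {!boost}, so the effect is similar but not identical, and whether this plays well with haystack depends on whether its generated queries rely on full Lucene syntax, which dismax does not parse. A sketch (the qf value is a placeholder):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="qf">text</str>
      <str name="bf">log(popularity)</str>
    </lst>
  </requestHandler>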
Re: Improvising solr queries
Hi - Something doesn't make sense to me here: On Mon, Jan 4, 2010 at 5:55 AM, dipti khullar wrote: > - optimize runs on master every 7 minutes > - using postOptimize, we execute snapshooter on master > - snappuller/snapinstaller on 2 slaves runs every 10 minutes > > Why would you optimize every 7 minutes, and update the slaves every ten? After 70 minutes you'll be doing both at the same time. How about optimizing every ten minutes, at :00, :10, :20, :30, :40, :50 and then pulling every ten minutes at :01, :11, :21, :31, :41, :51 (assuming your optimize completes in one minute). Or did I misunderstand something? > The issue gets resolved as soon as we optimize the slave index. In the solr > admin, it shows only 4 requests/sec handled with 400 ms response time. > From your earlier description, it seems like you should only be distributing an optimized index, so optimizing the slave should be a no-op. Check to see what files you have on the slave after snappulling. Tom
Non-leading wildcard search
Hello, There are lots of questions and answers in the forum regarding varying wildcard behaviour, but I haven't been able to find any that address this particular behaviour. Perhaps someone could help? Problem: I have a fieldType that only goes through a KeywordTokenizer at index time, to ensure it stays 'verbatim' (i.e. it doesn't get split into any tokens, on whitespace or otherwise). Let's say there's some data stored in this field like this: 'Something', 'Something Else', 'Something Else Altogether'. When I query: "Something" or "Something Else" or "*thing" or "*omething*", I get back the expected results. If, however, I query: "Some*" or "S*" or "s*" etc, I get no results (although this type of non-leading wildcard works fine with other fieldType schema elements that don't use KeywordTokenizer). Is this something to do with KeywordTokenizer? Is there a better way to index data (preserving case) without splitting on whitespace or stemming etc. (i.e. no WhitespaceTokenizer or similar)? My fieldType schema looks like this: (I've tried a number of other combinations as well, including using class=solr.TextField.) I understand that wildcard queries don't go through analyzers, but why is it that 'tokenized' data matches on non-leading wildcard queries, whereas non-tokenized (or more specifically Keyword-Tokenized) data doesn't? The fieldType schema requires some tokenizer class, and it appears that KeywordTokenizer is the only one that tokenizes to a token count of 1 (i.e. the whole string). I'm sure I'm missing something that is probably reasonably obvious, but having tried myriad combinations, I thought it prudent to ask the experts in the forum. Many thanks for any insight you can provide on this. Peter
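For context, a fieldType of the kind described would look roughly like this sketch (the schema XML did not survive above; the type name is made up):

  <fieldType name="text_verbatim" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>

KeywordTokenizerFactory emits the whole input as a single token, so case and whitespace are preserved verbatim.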
Re: Non-leading wildcard search
On Mon, Jan 4, 2010 at 5:38 PM, Peter S wrote: > When I query: "Something" or "Something Else" or "*thing" or "*omething*", > I get back the expected results. > If, however, I query: "Some*" or "S*" or "s*" etc, I get no results (although > this type of non-leading wildcard works fine with other fieldType schema > elements that don't use KeywordTokenizer). Is your query string actually in quotes? Wildcards aren't currently supported in quotes. So text_verbatim:Some* should work. -Yonik http://www.lucidimagination.com
RE: Non-leading wildcard search
Hi Yonik, Thanks for your quick reply. No, the queries themselves aren't in quotes. Since I sent the initial email, I have managed to get non-leading wildcard queries to work with this, but by unexpected means (for me at least :-). If I add a LowerCaseFilterFactory to the fieldType, queries like s* (or S*) work as expected. So the fieldType schema element now looks like: positionIncrementGap="100"> ... ignoreCase="true" expand="true"/> ... I wasn't expecting this, as I would have thought this would change only the case behaviour, not the wildcard behaviour (or at least not just the non-leading wildcard behaviour). Perhaps I'm just not understanding how the term (just one in this case, as it's not tokenized) is indexed and subsequently matched. What I've noticed is that with the LowerCaseFilterFactory in place, document queries return results with case intact, but facet queries show the results in lower-case (e.g. document->appname=Something, facet.field.appname=something). (I kind of expected the document->appname field to be lower case as well.) Does this sound like correct behaviour to you? If it's correct, that's ok, I'll manage to work 'round it (maybe there's a way to map the facet field back to the document field?), but if it sounds wrong, perhaps it warrants further investigation. Many thanks, Peter > Date: Mon, 4 Jan 2010 17:42:30 -0500 > Subject: Re: Non-leading wildcard search > From: yo...@lucidimagination.com > To: solr-user@lucene.apache.org > > On Mon, Jan 4, 2010 at 5:38 PM, Peter S wrote: > > When I query: "Something" or "Something Else" or "*thing" or > > "*omething*", I get back the expected results. > > If, however, I query: "Some*" or "S*" or "s*" etc, I get no results > > (although this type of non-leading wildcard works fine with other fieldType > > schema elements that don't use KeywordTokenizer). > > Is your query string actually in quotes? Wildcards aren't currently > supported in quotes. > So text_verbatim:Some* should work. > > -Yonik > http://www.lucidimagination.com
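Judging from the fragments that survive above (positionIncrementGap="100", a synonym filter with ignoreCase="true" expand="true"), the working version probably looks something like this sketch, with the lower-case filter added (the synonyms file name is an assumption):

  <fieldType name="text_verbatim" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

The facet values come back lower-cased because faceting reads the indexed terms, while stored field values are returned untouched; that explains the document/facet mismatch described above.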
RE: Non-leading wildcard search
FYI: I have found the root of this behaviour. It has to do with a test patch I've been working on for working 'round pre-SOLR-219 behaviour (case insensitive wildcard searching). With the test patch switched out, it works as expected, although the case insensitive wildcard search reverts to pre-SOLR-219 behaviour. I believe I can work 'round this by using a copyField that holds the lower-case text for wildcarding. Many thanks, Yonik, for your help. Peter > From: pete...@hotmail.com > To: solr-user@lucene.apache.org > Subject: RE: Non-leading wildcard search > Date: Mon, 4 Jan 2010 23:29:04 + > > Hi Yonik, > > Thanks for your quick reply. > > No, the queries themselves aren't in quotes.
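A sketch of that copyField arrangement (all names made up): the original field keeps its case for display and faceting, while the copy gets a lower-casing analyzer for wildcard matching:

  <field name="appname"    type="text_verbatim"    indexed="true" stored="true"/>
  <field name="appname_lc" type="text_verbatim_lc" indexed="true" stored="false"/>
  <copyField source="appname" dest="appname_lc"/>

Queries like appname_lc:some* would then match case-insensitively, while facet.field=appname still shows the original casing.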
Re: Indexing the latests MS Office documents
You must have been searching old documentation - I think Tika 0.3+ has support for the new MS formats. But don't take my word for it - why don't you build Tika and try it? -Peter On Sun, Jan 3, 2010 at 7:00 PM, Roland Villemoes wrote: > Hi All, > > Anyone who knows how to index the latest MS office documents like .docx and > .xlsx ? > > From searching it seems like Tika only supports the earlier formats .doc and > .xls > > > > med venlig hilsen/best regards > > Roland Villemoes > Tel: (+45) 22 69 59 62 > E-Mail: mailto:r...@alpha-solutions.dk > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Improvising solr queries
On 1/5/10 12:46 AM, Shalin Shekhar Mangar wrote: > (sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND > ((assettype:Gallery)) AND (rbcategory:"ABC XYZ" ) AND (startdate:[* TO > 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO > *])&rows=9&start=63&sort=date desc&facet=true&facet.field=assettype&facet.mincount=1 > > Similar to this query we have several much more complex queries supporting all > major landing pages of our application. > > Just want to confirm whether anyone can identify any major flaws or > issues in the sample query? I'm not the expert Shalin is, but I seem to remember sorting by date was pretty rough on CPU (this could have been resolved since I last looked at it). The other thing I'd question is the facet: it looks like you're only retrieving a single assettype (Gallery), so you will only get a single facet value back. If that's the case, wouldn't the rows returned (which is part of the response) give you the same answer? > Most of those AND conditions can be separate filter queries. Filter queries > can be cached separately and can therefore be re-used. See > http://wiki.apache.org/solr/FilterQueryGuidance
Listing Terms by Ascending IDF value . . ?
Hello, I am trying to get a list of highly unusual terms or phrases (for example a TF of 1 or 2) within an entire index (essentially this would be the inverse of how Luke gives 'top terms' on the 'Overview' tab). I see how I can do this within a specific query using the Term Vector Component (qt=tvrh). But do I have to write my own analyzer to get a list for the complete index in ascending order? Most grateful for any thoughts or insights, Christopher
Re: Rules engine and Solr
Thanks for the response, Shalin. I am still in two minds over doing it "inside" Solr versus "outside". I'll get back with more questions, if any. Cheers Avlesh On Mon, Jan 4, 2010 at 5:11 PM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > On Mon, Jan 4, 2010 at 10:24 AM, Avlesh Singh wrote: > > > I have a Solr (version 1.3) powered search server running in production. > > Search is keyword driven is supported using custom fields and tokenizers. > > > > I am planning to build a rules engine on top search. The rules are > database > > driven and can't be stored inside solr indexes. These rules would > > ultimately > > two do things - > > > > 1. Change the order of Lucene hits. > > > > A Lucene FieldComparator is what you'd need. The QueryElevationComponent > uses this technique. > > > > 2. Add/remove some results to/from the Lucene hits. > > > > > This is a bit more tricky. If you will always have a very limited number of > docs to add or remove, it may be best to change the query itself to include > or exclude them (i.e. add fq). Otherwise you'd need to write a custom > Collector (see DocSetCollector) and change SolrIndexSearcher to use it. We > are planning to modify SolrIndexSearcher to allow custom collectors soon > for > field collapsing but for now you will have to modify it. > > > > What should be my starting point? Custom search handler? > > > > > A custom SearchComponent which extends/overrides QueryComponent will do the > job. > > -- > Regards, > Shalin Shekhar Mangar. >
Re: Improvising solr queries
Hey Ian This assettype is variable. It can have around 6 values at a time. But this is true that we apply facet mostly on just one field - assettype. Any idea if the use of date range queries is expensive? Also if Shalin can put in some comments on "sorting by date was pretty rough on CPU", I can start analyzing sort by date specific queries. Will look into suggestions/queries by Tom and Shalin and then post the findings. Thanks Dipti On Tue, Jan 5, 2010 at 9:17 AM, Ian Holsman wrote: > On 1/5/10 12:46 AM, Shalin Shekhar Mangar wrote: > >> sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND >>> > ((assettype:Gallery)) AND (rbcategory:"ABC XYZ" ) AND (startdate:[* >>> TO >>> > 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO >>> > *])&rows=9&start=63&sort=date >>> > desc&facet=true&facet.field=assettype&facet.mincount=1 >>> > >>> > Similar to this query we have several much complex queries supporting >>> all >>> > major landing pages of our application. >>> > >>> > Just want to confirm that whether anyone can identify any major flaws >>> or >>> > issues in the sample query? >>> > >>> > >>> >>> >> I'm not the expert Shalin is, but I seem to remember sorting by date was > pretty rough on CPU. (this could have been resolved since I last looked at > it) > > the other thing I'd question is the facet. it looks like your only > retrieving a single assetType (Gallery). > so you will only get a single field back. if thats the case, wouldn't the > rows returned (which is part of the response) > give you the same answer ? > > > Most of those AND conditions can be separate filter queries. Filter >> queries >> can be cached separately and can therefore be re-used. See >> http://wiki.apache.org/solr/FilterQueryGuidance >> >> >> > >
Re: Improvising solr queries
On Tue, Jan 5, 2010 at 11:16 AM, dipti khullar wrote: > > This assettype is variable. It can have around 6 values at a time. > But this is true that we apply facet mostly on just one field - assettype. > > Ian has a good point. You are faceting on assettype and you are also filtering on it so you will get only one facet value "Gallery" with a count equal to numFound. > Any idea if the use of date range queries is expensive? Also if Shalin can > put in some comments on > "sorting by date was pretty rough on CPU", I can start analyzing sort by > date specific queries. > > This is a range search and not a sort. I don't know if range search on dates is especially costly compared to a range search on any other type. But I do know that trie fields in Solr 1.4 are much faster for range searches at the cost of more tokens in the index. With a date field, instead of using NOW, you should always try to round it down to the coarsest interval you can use. So if it is possible to use NOW/DAY instead of NOW, you should do that. The problem with querying on NOW is that it is always unique and therefore the query can never be cached (actually, it is cached but can never be hit). If you use NOW/DAY, the query can be cached for a day. -- Regards, Shalin Shekhar Mangar.
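For example, the date clauses from the query earlier in this thread could be written with rounded date math so they stay cacheable for a whole day (a sketch, assuming day granularity is acceptable):

  fq=startdate:[* TO NOW/DAY+1DAY]&fq=enddate:[NOW/DAY TO *]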
Re: Listing Terms by Ascending IDF value . . ?
On Tue, Jan 5, 2010 at 9:15 AM, Christopher Ball < christopher.b...@metaheuristica.com> wrote: > Hello, > > I am trying to get a list of highly unusual terms or phrases (for example a > TF of 1 or 2) within an entire index (essentially this would be the inverse > of how Luke gives 'top terms' on the 'Overview' tab). > > I see how I can do this within a specific query using the Term Vector > Component (qt=tvrh). > > Did you mean TermsComponent (qt=terms)? > But do I have to write my own analyzer to get a list for the complete index > in ascending order? > > No, you don't need a custom analyzer. But TermsComponent can only sort by frequency in descending order or by index order (lexicographical order). Perhaps the patch in SOLR-1672 is more suitable for your task. -- Regards, Shalin Shekhar Mangar.
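For instance, assuming a field named 'text' and the /terms handler from the example solrconfig, terms that occur at most twice can be listed in index order with a request like:

  http://localhost:8983/solr/terms?terms.fl=text&terms.mincount=1&terms.maxcount=2&terms.sort=index&terms.limit=100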