Re: Trouble Setting Up Development Environment
Here is my method:

1. Check out the latest source code from trunk (or download the tarball):
   svn checkout http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk
2. Create a dynamic web project in Eclipse and close it. For example, I created a project named lucene-solr-trunk in my workspace.
3. Copy/move the source code into this project (this isn't strictly necessary). Here is my directory structure:
   lili@lili-desktop:~/workspace/lucene-solr-trunk$ ls
   bin.tests-framework  build  lucene_trunk  src  testindex  WebContent
   lucene_trunk is the top directory checked out from svn in step 1.
4. Remove the WebContent directory generated by Eclipse and replace it with a soft link to the Solr webapp directory:
   lili@lili-desktop:~/workspace/lucene-solr-trunk$ ll WebContent
   lrwxrwxrwx 1 lili lili 28 2011-08-18 18:50 WebContent -> lucene_trunk/solr/webapp/web/
5. Open lucene_trunk/dev-tools/eclipse/dot.classpath and copy all lines like kind="src" to a temp file.
6. Replace every string like path="xxx" with path="lucene_trunk/xxx" and copy the result into the .classpath file (see the sed sketch at the end of this message).
7. mkdir WebContent/WEB-INF/lib
8. Copy all jar files listed in dot.classpath to WebContent/WEB-INF/lib. I use this command:
   lili@lili-desktop:~/workspace/lucene-solr-trunk/lucene_trunk$ cat dev-tools/eclipse/dot.classpath | grep "kind=\"lib" | awk -F "path=\"" '{print $2}' | awk -F "\"/>" '{print $1}' | xargs -I{} cp {} ../WebContent/WEB-INF/lib/
9. Open the project and refresh it. If everything is OK, all java files will compile successfully. If something is wrong, we probably picked up the wrong jar, because there are many versions of the same libraries.
10. Right-click the project -> Debug As -> Debug on Server. It will fail, because no Solr home is specified.
11. Right-click the project -> Debug As -> Debug Configurations -> Arguments tab -> VM arguments, and add:
    -Dsolr.solr.home=/home/lili/workspace/lucene-solr-trunk/lucene_trunk/solr/example/solr
    You can also add other VM arguments like -Xmx1g here.
12. All fine. Add a breakpoint at SolrDispatchFilter.doFilter(); all Solr requests come through there.
13. Have fun~

On Fri, Mar 23, 2012 at 11:49 AM, Karthick Duraisamy Soundararaj <karthick.soundara...@gmail.com> wrote:
> Hi Solr Ppl,
>         I have been trying to set up the Solr dev env. I downloaded the
> tarball of eclipse and the solr 3.5 source. Here is the exact sequence of
> steps I followed:
>
> I extracted the solr 3.5 source and eclipse.
> I installed the run-jetty-run plugin for eclipse.
> I ran ant eclipse in the solr 3.5 source directory.
> I used eclipse's "Open existing project" option to open up the files in
> the solr 3.5 directory. I got a huge tree in the name of lucene_solr.
>
> I run it and there is a SEVERE error: System property not set exception:
> solr.test.sys.prop1 not set, and then jetty loads solr. I then try
> localhost:8080/solr/select/ and get a null pointer exception. I am only
> able to access the admin page.
>
> Is there anything else I need to do?
>
> I tried to follow
> http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse
> but I don't find the solr-3.5.war file. I tried ant dist to generate the
> dist folder, but that has many jars and wars.
>
> I am able to compile the source with ant compile and get the solr in the
> example directory up and running.
>
> Will be great if someone can help me with this.
>
> Thanks,
> Karthick
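P.S. For step 6, a sed one-liner along these lines should do the path rewrite. This is an untested sketch: /tmp/src-entries.txt stands in for whatever temp file you made in step 5, and the output is what you paste into .classpath:

   sed 's/path="/path="lucene_trunk\//' /tmp/src-entries.txt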
Re: RequestHandler versus SearchComponent
> I'm looking at the following. I want to (1) map some query fields to
> some other query fields and add some things to fl, and then (2)
> rescore.
>
> I can see how to do it as a RequestHandler that makes a parser to get
> the fields, or I could see making a SearchComponent that was stuck
> into the list just after the QueryComponent.
>
> Anyone care to advise on the choice?

I would choose SearchComponent. I read somewhere that customizations now fit better into a SearchComponent than into a RequestHandler.
Re: RequestHandler versus SearchComponent
Am 23.03.2012 10:29, schrieb Ahmet Arslan:
>> I'm looking at the following. I want to (1) map some query fields to
>> some other query fields and add some things to fl, and then (2)
>> rescore.
>>
>> I can see how to do it as a RequestHandler that makes a parser to get
>> the fields, or I could see making a SearchComponent that was stuck
>> into the list just after the QueryComponent.
>>
>> Anyone care to advise on the choice?
>
> I would choose SearchComponent. I read somewhere that customizations
> now fit better into a SearchComponent than into a RequestHandler.

I would override QueryComponent and modify the normal query instead. Adding your own SearchComponent after the regular QueryComponent (or better, as a "last-components" entry) is goof when you simply want to modify the existing result. But since you want to rescore, you're likely interested in documents that already fell out of the original result list.

Greetings,
Kuli
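For reference, if you do go the SearchComponent route, the skeleton looks roughly like this. This is a sketch against the 3.x API; the class name and comments are illustrative, and the component still has to be registered in solrconfig.xml and listed under last-components of a SearchHandler:

import java.io.IOException;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

// Hypothetical rescoring component, appended to a SearchHandler
// via <arr name="last-components"> in solrconfig.xml.
public class RescoreComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // Runs before QueryComponent: a place to remap query fields
    // and append extra fields to fl via rb.req.getParams().
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Runs after QueryComponent (when listed in last-components):
    // rb.getResults().docList holds the documents you could rework here.
  }

  @Override
  public String getDescription() { return "rescore component (sketch)"; }

  @Override
  public String getSource() { return ""; }

  @Override
  public String getSourceId() { return ""; }

  @Override
  public String getVersion() { return ""; }
}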
Re: RequestHandler versus SearchComponent
Am 23.03.2012 11:17, schrieb Michael Kuhlmann:
> Adding your own SearchComponent after the regular QueryComponent (or
> better, as a "last-components" entry) is goof ...

Of course, I meant "good", not "goof"! ;)

Greetings,
Kuli
Re: Grouping queries
On 22 March 2012 03:10, Jamie Johnson wrote:
> I need to apologize; I believe that in my example I too grossly
> over-simplified the problem and it's not clear what I am trying to do,
> so I'll try again.
>
> I have a situation where I have a set of access controls, say user,
> super user and ultra user. These controls are not necessarily
> hierarchical, in that user < super user < ultra user. Each of these
> controls should only be able to see documents with some combination of
> the access controls they have. In my actual case we have many access
> controls and they can be combined in a number of fashions, so I can't
> simply constrain what they are searching by a query alone (i.e. if
> it's a user, the query is auth:user AND (some query)). Now I have a
> case where a document contains information that a user can see but
> also contains information a super user can see. Our current system
> marks this document at the super user level and the user can't see it.
> We now have a requirement to make the pieces that are at the user
> level available to the user while still allowing the super user to see
> and search all the information. My original thought was to simply
> index the document twice; this would end up in a possible duplicate
> (say if a user had both user and super user), but since this situation
> is rare it may not matter. After coming across the grouping capability
> in Solr, I figured I could execute a group query where we grouped on
> some key which indicated that two documents were the same, just with
> different access controls (user and super user in this example). We
> could then filter out the documents in the group the user isn't
> allowed to see and only keep the document with the access controls
> they have.

Maybe I'm not understanding this right... but why can't you save the access controls as a multivalued field in your schema? In your example you can then, if the current user is a normal user, just query auth:user AND (query), and if the current user is a super user, auth:superuser AND (query). A document that is searchable for both superuser and user is then returned (if it matches the rest of the query).

> I hope this makes more sense. Unfortunately I don't believe the Join
> queries will work, because I don't think that if I created documents
> relevant to each access control I could search across these documents
> as if they were a single document (i.e. search for something in the
> user document and something in the super user document in a single
> query). This led me to believe that grouping was the way to go in this
> case, but again I am very interested in any suggestions that the
> community could offer.

I wouldn't use grouping. The Solr join is still an option. Let's say you have many access controls and the access controls change often on your documents. You can then choose to store the access controls, with an id pointing to your logical document, as a separate document in a different Solr core (index). In the core where your main documents are, you don't keep the access controls. You can then use the Solr join to filter out documents that the current user isn't supposed to search. Something like this:

q=(query)&fq={!join fromIndex=core1 from=doc_id to=id}auth:superuser

Core1 is the core containing the access control documents, and doc_id is the id that points to your regular documents.

The benefit of this approach is that if you fine-tune core1 for high updatability, you can change your access controls very frequently without paying a big performance penalty.
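In SolrJ, issuing that join as a filter query could look something like this. This is a sketch; the URL, core names and field names are just the placeholders from the example above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class JoinFilterExample {
  public static void main(String[] args) throws Exception {
    // Query the main core; the ACL documents live in core1.
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr/core0");

    SolrQuery q = new SolrQuery("some query");
    // Keep only documents whose ACL doc in core1 grants superuser access.
    q.addFilterQuery("{!join fromIndex=core1 from=doc_id to=id}auth:superuser");

    QueryResponse rsp = solr.query(q);
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}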
Re: Faceted range based on float within velocity not working properly
I went deeper into the problem and discovered that...

$math.toInteger("10.1") returns 101
$math.toInteger("10,1") returns 10

Although I'm using Strings in the previous examples, I have a Float variable from Solr. I'm not sure if it is a Solr problem, a Velocity problem, or somewhere in between. Could it be something related to my locale/regional settings? I ask because in BRL (Brazilian Real) the currency format we use is something like R$1.234,56.

Any idea?

Marcelo Carvalho Fernandes
+55 21 8272-7970
+55 21 2205-2786

On Thu, Mar 22, 2012 at 3:14 PM, Marcelo Carvalho Fernandes <mcf2...@gmail.com> wrote:
> Hi all!
>
> I'm using Apache Solr 3.5.0 with Tomcat 6.0.32.
>
> My schema.xml has a price field declared as a float with required="false".
>
> My solrconfig.xml has a velocity RequestHandler (/browse) with a range
> facet on the preco field: start 0, end 100, gap 10.
>
> ...and I'm using the default templates in \example\solr\conf\velocity.
>
> The problem is that each piece of the range being generated has a wrong
> upper bound. For example, instead of...
>
> 0 - 10
> 10 - 20
> 20 - 30
> 30 - 40
> ...
>
> ...what is being generated is...
>
> 0 - 10
> 10 - 110
> 20 - 210
> 30 - 310
> ...
>
> I've studied the #display_facet_range macro in VM_global_library.vm and
> it looks like $math.add is concatenating the two operands instead of
> producing a sum. I mean, instead of 10+10=20 it returns 110; instead of
> 20+10=30 it returns 210.
>
> Any idea what the problem is?
>
> Thanks in advance,
>
> Marcelo Carvalho Fernandes
> +55 21 8272-7970
> +55 21 2205-2786
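One way to test the locale theory outside of Solr and Velocity: the numbers you're seeing are exactly what Java's locale-sensitive number parsing produces under a pt_BR default locale, where '.' is the grouping separator and ',' is the decimal separator. A standalone check (plain JDK, nothing Solr-specific):

import java.text.NumberFormat;
import java.util.Locale;

public class LocaleParseCheck {
  public static void main(String[] args) throws Exception {
    NumberFormat br = NumberFormat.getInstance(new Locale("pt", "BR"));
    NumberFormat us = NumberFormat.getInstance(Locale.US);

    // Under pt_BR, '.' is a grouping separator, so "10.1" parses as 101.
    System.out.println(br.parse("10.1")); // 101
    // ',' is the decimal separator, so "10,1" parses as 10.1.
    System.out.println(br.parse("10,1")); // 10.1
    // Under en_US the same string parses the "expected" way.
    System.out.println(us.parse("10.1")); // 10.1
  }
}

If this reproduces it, the fix would be to parse/format with an explicit Locale rather than the JVM default (or run the JVM with -Duser.language=en -Duser.country=US to confirm the diagnosis).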
Re: Commit Strategy for SolrCloud when Talking about 200 million records.
What issues? It really shouldn't be a problem.

On Mar 22, 2012, at 11:44 PM, I-Chiang Chen wrote:

> At this time we are not leveraging the NRT functionality. This is the
> initial data load process, where the idea is to just add all 200 million
> records first, then do a single commit at the end to make them searchable.
> We actually disabled auto commit at this time.
>
> We have tried to leave auto commit enabled during the initial data load
> process and ran into multiple issues that led to a botched loading process.
>
> On Thu, Mar 22, 2012 at 2:15 PM, Mark Miller wrote:
>
>> On Mar 21, 2012, at 9:37 PM, I-Chiang Chen wrote:
>>
>>> We are currently experimenting with SolrCloud functionality in Solr 4.0.
>>> The goal is to see if Solr 4.0 trunk in its current state is able to
>>> handle roughly 200 million documents. The document size is not big:
>>> around 40 fields, no more than a KB, most of which are empty the
>>> majority of the time.
>>>
>>> The setup we have is 4 servers w/ 2 shards w/ 2 servers per shard. We
>>> are running in Tomcat.
>>>
>>> The question is: given the approximate data volume, is it realistic to
>>> expect the above setup to handle it?
>>
>> So 100 million docs per machine, essentially? Totally depends on the
>> hardware and what features you are using - but def in the realm of
>> possibility.
>>
>>> Given the number of documents, should we commit every x documents or
>>> rely on auto commits?
>>
>> The number of docs shouldn't really matter here. Do you need near real
>> time search?
>>
>> You should be able to commit about as frequently as you'd like with NRT
>> (eg every 1 second if you'd like) - either using soft auto commit or
>> commitWithin.
>>
>> Then you want to do a hard commit less frequently - every minute (or more
>> or less) with openSearcher=false.
>>
>> eg
>>
>> <autoCommit>
>>   <maxTime>15000</maxTime>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>>
>>> --
>>> -IC
>>
>> - Mark Miller
>> lucidimagination.com
>
> --
> -IC
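As an aside, the add-everything-then-one-commit pattern described above looks roughly like this in SolrJ. This is a sketch; the URL and document fields are placeholders:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoad {
  public static void main(String[] args) throws Exception {
    // Placeholder URL; point this at one of the SolrCloud nodes.
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    for (int i = 0; i < 1000; i++) { // stand-in for the real record feed
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i));
      solr.add(doc); // no commit per document
    }

    solr.commit(); // single hard commit once the load finishes
  }
}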
Re: Commit Strategy for SolrCloud when Talking about 200 million records.
We did some tests too, with many millions of documents and auto-commit enabled. It didn't take long for the indexer to stall, and in the meantime the number of open files exploded, to over 16k, then 32k.

On Friday 23 March 2012 12:20:15 Mark Miller wrote:
> What issues? It really shouldn't be a problem.
>
> [...]

--
Markus Jelsma - CTO - Openindex
Simple Slave Replication Question
Hello,

I'm looking at replication from a master to a number of slaves. I have configured it and it appears to be working. When updating 40K records on the master, is it standard to always copy over the full index, currently 5 GB in size? If this is standard, what do people do who have massive 200 GB indexes - does it not take a while to bring the slaves in line with the master?

Thanks
Ben
Have made site for comparison of Solr and other enterprise search engines
Hi all,

FYI, I am working on a website for doing side-by-side comparison of several common enterprise search engines, including some that are based on Solr. Currently I have Searchdaimon ES, Microsoft SSE 2010, SearchBlox, Google Mini, Thunderstone, Constellio, mnoGoSearch and IBM OmniFind Yahoo running. All have indexed the same data set to allow for easy comparison of the results.

So far I have focused on ready-made search packages that have an end-user GUI and crawlers integrated, but I will also add a pure Solr install later.

You can do side-by-side comparison in the native GUI:
http://www.opentestsearch.com/cgi-bin/search.cgi?query=enron+logo
and blind comparison:
http://www.opentestsearch.com/cgi-bin/search.cgi?query=enron+logo&ano=on

Site: http://www.opentestsearch.com/

Quite cool, don't you think? Any suggestions/feedback would be much appreciated.

Best regards
Runar Buvik
Re: Grouping queries
On Fri, Mar 23, 2012 at 6:37 AM, Martijn v Groningen wrote:
> Maybe I'm not understanding this right... but why can't you save the
> access controls as a multivalued field in your schema? [...]

I'd like to avoid having duplicates. Although my access controls are not strictly hierarchical, there are cases where a super user can see his docs and user docs. The idea was to have the super user doc be a superset of the user doc. So in my case I really have only one document, but that one document has pieces a user can see and pieces only a super user can see. The idea was to index the document twice: once with the entire doc and once with just the pieces the user could see.

> I wouldn't use grouping. The Solr join is still an option. [...]
>
> q=(query)&fq={!join fromIndex=core1 from=doc_id to=id}auth:superuser
>
> The benefit of this approach is that if you fine-tune core1 for high
> updatability, you can change your access controls very frequently
> without paying a big performance penalty.

Where is Join documented? I looked at http://wiki.apache.org/solr/Join and see no reference to "fromIndex". Also, does this work in a distributed environment?
Re: Grouping queries
> Where is Join documented? I looked at
> http://wiki.apache.org/solr/Join and see no reference to "fromIndex".
> Also, does this work in a distributed environment?

The "fromIndex" isn't documented in the wiki. It is mentioned in the issue, and you can find it in the Solr code:
https://issues.apache.org/jira/browse/SOLR-2272?focusedCommentId=13024918&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13024918

The Solr join only works in a distributed environment if you partition your documents properly. Documents that link to each other need to reside on the same shard, and this can be a problem in some cases.

Martijn
"Error 500 seek past EOF" : SOLR bug?
Hi list,

In my ~6M index, served from a slave that is replicating from a master, I'm trying to do this query:

localhost:8080/solr/core0/select?q=car&qf=document%5E1&defType=edismax

Can anybody explain the below error that I get as a result? It may (or may not) be related to another problem that we're seeing, which is that replication sometimes fails "for a while", leading to a lot of retried replication attempts, which seem to finally succeed.

== Query failure:

Error 500 seek past EOF: MMapIndexInput(path="/mnt/solr.data.0/index/_1sj.frq")

java.io.IOException: seek past EOF: MMapIndexInput(path="/mnt/solr.data.0/index/_1sj.frq")
        at org.apache.lucene.store.MMapDirectory$MMapIndexInput.seek(MMapDirectory.java:352)
        at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:92)
        at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:59)
        at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:1277)
        at org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:490)
        at org.apache.solr.search.SolrIndexReader.termDocs(SolrIndexReader.java:321)
        at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:102)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:577)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:364)
        at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1282)
        at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1162)
        at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:362)
        at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:378)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:486)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:520)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:973)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:417)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:907)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
        at org.eclipse.jetty.server.Server.handle(Server.java:346)
        at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:442)
        at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:924)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:582)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:218)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:51)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:586)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:44)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:598)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:533)
        at java.lang.Thread.run(Thread.java:722)

== Replication failure:

Mar 23, 2012 1:27:30 PM org.apache.solr.handler.ReplicationHandler doFetch
SEVERE: SnapPull failed
org.apache.solr.common.SolrException: Index fetch failed :
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:331)
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:268)
        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolEx
Re: Simple Slave Replication Question
I guess this would depend on network bandwidth, but we move around 150G/hour when hooking up a new slave to the master.

/Martin

On Fri, Mar 23, 2012 at 12:33 PM, Ben McCarthy <ben.mccar...@tradermedia.co.uk> wrote:
> Hello,
>
> I'm looking at replication from a master to a number of slaves. I have
> configured it and it appears to be working. When updating 40K records
> on the master, is it standard to always copy over the full index,
> currently 5 GB in size? If this is standard, what do people do who
> have massive 200 GB indexes - does it not take a while to bring the
> slaves in line with the master?
>
> Thanks
> Ben
RE: Simple Slave Replication Question
So do you just simply address this with big NICs and network pipes?

-----Original Message-----
From: Martin Koch [mailto:m...@issuu.com]
Sent: 23 March 2012 14:07
To: solr-user@lucene.apache.org
Subject: Re: Simple Slave Replication Question

I guess this would depend on network bandwidth, but we move around 150G/hour when hooking up a new slave to the master.

/Martin

> [...]
Slave index size growing fast
Hello,

We have a Solr index that averages 1.19 GB in size. After configuring replication, the slave machine's index size is growing exponentially. Currently we have a slave at 323.44 GB in size. Is there anything that could cause this behavior? The current replication config is below.

Master:

<lst name="master">
  <str name="replicateAfter">commit</str>
  <str name="replicateAfter">startup</str>
  <str name="replicateAfter">startup</str>
  <str name="confFiles">elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt</str>
</lst>

Slave:

<lst name="slave">
  <str name="masterUrl">http://master:8984/solr/Index/replication</str>
</lst>

Any pointers will be useful.

Thanks,
Alexandre
Re: Simple Slave Replication Question
Hi Ben, only new segments are replicated from master to slave. In a situation where all the segments are new, this will cause the index to be fully replicated, but this rarely happens with incremental updates. It can also happen if the slave Solr assumes it has an "invalid" index.

Are you committing or optimizing on the slaves? After replication, is the index directory on the slaves called "index" or "index.<timestamp>"?

Tomás

On Fri, Mar 23, 2012 at 11:18 AM, Ben McCarthy <ben.mccar...@tradermedia.co.uk> wrote:
> So do you just simply address this with big NICs and network pipes?
>
> [...]
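By the way, one way to watch what the slave is actually fetching is the ReplicationHandler's own HTTP API, for example (assuming the stock /replication handler path on your core):

http://slave:8983/solr/replication?command=details

The response reports the current replication status, including (while a fetch is running) which files are being downloaded, which makes it easy to tell a delta fetch from a full copy.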
Re: Slave index size growing fast
What version of Solr and what operating system?

But regardless, this shouldn't be happening. Indexes can temporarily double in size, but any extras should be cleaned up relatively soon.

On the master, what's the total size of the /data directory? I'm a little suspicious of the replicateAfter settings on your master, but I don't think that's the root of your problem.

Are you recreating the index on the master (by deleting the index directory and starting over)?

This is unusual, and I suspect it's something odd in your configuration, but I confess I'm at a loss as to what.

Best
Erick

On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco wrote:
> Hello,
>
> We have a Solr index that averages 1.19 GB in size. After configuring
> replication, the slave machine's index size is growing exponentially.
> Currently we have a slave at 323.44 GB in size. Is there anything that
> could cause this behavior? [...]
Re: Solr 4.0 replication problem
Hmmm, that is odd. But a trunk build from that long ago is going to be almost impossible to debug/fix. The problem with working from trunk is that this kind of problem won't get much attention.

I have three suggestions:

1> Update to current trunk. NOTE: you'll have to completely reindex your data; the format of the index has changed multiple times since then, and there's no back-compatibility maintained with non-released major versions.
2> Delete your entire index on the _master_ and re-index from scratch. If you do this, I'd also delete the entire /data directory on the slaves before replication as well.
3> Delete your entire index on the _slave_ and see if you get a clean replication.

<3> is the least painful, <1> the most, so I'd go in reverse order for the above.

Best
Erick

On Wed, Mar 21, 2012 at 8:49 AM, Hakan İlter wrote:
> Hi everyone,
>
> We are using a very early version of Solr 4.0 and we have some
> replication problems. Actually we used this build for more than one year
> without any problem, but when I made some changes on schema.xml, the
> following problem started.
>
> I've just changed schema.xml by adding a multiValued="true" attribute to
> two dynamic fields.
>
> Before:
>
> <dynamicField ... />
> <dynamicField ... />
>
> After:
>
> <dynamicField ... multiValued="true" />
> <dynamicField ... multiValued="true" />
>
> After starting the Tomcats with the new configuration, no problems
> occurred. But after a while, I'm seeing this error:
>
> Mar 20, 2012 2:00:05 PM org.apache.solr.handler.ReplicationHandler doFetch
> SEVERE: SnapPull failed
> org.apache.solr.common.SolrException: Index fetch failed :
>         at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:340)
>         at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
>         at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:166)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format
> version is not supported in file 'segments_1': -12 (needs to be between
> -9 and -11)
>         at org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:51)
>         at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:230)
>         at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:169)
>         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:770)
>         at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:83)
>         at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102)
>         at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:111)
>         at org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:297)
>         at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:484)
>         at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:330)
>         ... 11 more
> Mar 20, 2012 2:00:16 PM org.apache.solr.update.SolrIndexWriter finalize
> SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a
> bug -- POSSIBLE RESOURCE LEAK!!!
>
> After the above error, it works for a while and then the Tomcats go down
> because of out-of-memory problems.
>
> Could this be a bug? What do you suggest? Please don't suggest updating
> Solr to the current version in trunk, because we made a lot of changes
> related to our build of Solr. Updating to current Solr would take at
> least a couple of weeks, but we need an immediate solution.
>
> We are using DIH, our schema version is 1.3, and both master and slaves
> use the same binaries, libraries etc. Here are some details about our
> Solr build:
>
> Solr Specification Version: 4.0.0.2011.03.28.06.20.50
> Solr Implementation Version: 4.0-SNAPSHOT 1086110 - Administrator - 2011-03-28 06:20:50
> Lucene Specification Version: 4.0-SNAPSHOT
> Lucene Implementation Version: 4.0-SNAPSHOT 1086110 - 2011-03-28 06:22:18
>
> Thanks for any help.
RE: Simple Slave Replication Question
I just have an index directory.

I push the documents through with a change to a field. I'm using SolrJ to do this, and I'm following the guide from the wiki to set up the replication. When the feed of updates to the master finishes, I call a commit, again using SolrJ. I then have a poll period of 5 minutes on the slave. When it kicks in, I see a new version of the index, and then it copies the full 5 GB index.

Thanks
Ben

-----Original Message-----
From: Tomás Fernández Löbbe [mailto:tomasflo...@gmail.com]
Sent: 23 March 2012 14:29
To: solr-user@lucene.apache.org
Subject: Re: Simple Slave Replication Question

Hi Ben, only new segments are replicated from master to slave. In a situation where all the segments are new, this will cause the index to be fully replicated, but this rarely happens with incremental updates. It can also happen if the slave Solr assumes it has an "invalid" index.

Are you committing or optimizing on the slaves? After replication, is the index directory on the slaves called "index" or "index.<timestamp>"?

Tomás

> [...]
Re: Slave index size growing fast
Erick,

We're using Solr 3.3 on Linux (CentOS 5.6). The /data dir on the master is actually 1.2 GB.

I haven't tried to recreate the index yet. Since it's a production environment, I guess I can stop replication and indexing and then recreate the master index to see if it makes any difference.

I also just noticed another thread here named "Simple Slave Replication Question" which says it could be a problem if I'm seeing a /data/index directory with a timestamp on the slave node. Is that info relevant to this issue?

Thanks,
Alexandre

On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson wrote:
> What version of Solr and what operating system?
>
> But regardless, this shouldn't be happening. Indexes can temporarily
> double in size, but any extras should be cleaned up relatively soon.
>
> [...]
Re: Solr 4.0 replication problem
Hi Erick,

I've already tried steps 2 and 3, but they didn't help. It's almost impossible for us to do step 1 because of the project deadline. Do you have any other suggestions?

Thank you for the reply.

On Fri, Mar 23, 2012 at 4:56 PM, Erick Erickson wrote:
> Hmmm, that is odd. But a trunk build from that long ago is going to be
> almost impossible to debug/fix. The problem with working from trunk is
> that this kind of problem won't get much attention.
>
> I have three suggestions:
> 1> Update to current trunk. [...]
> 2> Delete your entire index on the _master_ and re-index from scratch. [...]
> 3> Delete your entire index on the _slave_ and see if you get a clean
> replication.
>
> <3> is the least painful, <1> the most, so I'd go in reverse order for
> the above.
>
> [...]
Re: Simple Slave Replication Question
Have you changed the mergeFactor, or are you using 10 as in the example solrconfig?

What do you see in the slave's log during replication? Do you see any line like "Skipping download for..."?

On Fri, Mar 23, 2012 at 11:57 AM, Ben McCarthy <ben.mccar...@tradermedia.co.uk> wrote:
> I just have an index directory.
>
> I push the documents through with a change to a field. I'm using SolrJ
> to do this, and I'm following the guide from the wiki to set up the
> replication. When the feed of updates to the master finishes, I call a
> commit, again using SolrJ. I then have a poll period of 5 minutes on
> the slave. When it kicks in, I see a new version of the index, and then
> it copies the full 5 GB index.
>
> [...]
Re: Simple Slave Replication Question
Also, what happens if, instead of adding the 40K docs, you add just one and commit? 2012/3/23 Tomás Fernández Löbbe > Have you changed the mergeFactor or are you using 10 as in the example > solrconfig? > > What do you see in the slave's log during replication? Do you see any line > like "Skipping download for..."? > > > On Fri, Mar 23, 2012 at 11:57 AM, Ben McCarthy < > ben.mccar...@tradermedia.co.uk> wrote: > >> I just have an index directory. >> >> I push the documents through with a change to a field. I'm using SolrJ to >> do this. I'm using the guide from the wiki to set up the replication. When >> the feed of updates to the master finishes I call a commit again using >> SolrJ. I then have a poll period of 5 minutes from the slave. When it >> kicks in I see a new version of the index and then it copies the full 5GB >> index. >> >> Thanks >> Ben >> >> -Original Message- >> From: Tomás Fernández Löbbe [mailto:tomasflo...@gmail.com] >> Sent: 23 March 2012 14:29 >> To: solr-user@lucene.apache.org >> Subject: Re: Simple Slave Replication Question >> >> Hi Ben, only new segments are replicated from master to slave. In a >> situation where all the segments are new, this will cause the index to be >> fully replicated, but this rarely happens with incremental updates. It can >> also happen if the slave Solr assumes it has an "invalid" index. >> Are you committing or optimizing on the slaves? After replication, is the >> index directory on the slaves called "index" or "index."? >> >> Tomás >> >> On Fri, Mar 23, 2012 at 11:18 AM, Ben McCarthy < >> ben.mccar...@tradermedia.co.uk> wrote: >> >> > So do you just simply address this with big NICs and network pipes? >> > >> > -Original Message- >> > From: Martin Koch [mailto:m...@issuu.com] >> > Sent: 23 March 2012 14:07 >> > To: solr-user@lucene.apache.org >> > Subject: Re: Simple Slave Replication Question >> > >> > I guess this would depend on network bandwidth, but we move around >> > 150G/hour when hooking up a new slave to the master. >> > >> > /Martin >> > >> > On Fri, Mar 23, 2012 at 12:33 PM, Ben McCarthy < >> > ben.mccar...@tradermedia.co.uk> wrote: >> > >> > > Hello, >> > > >> > > I'm looking at the replication from a master to a number of slaves. >> > > I have configured it and it appears to be working. When updating >> > > 40K records on the master, is it standard to always copy over the >> > > full index, currently 5GB in size? If this is standard, what do >> > > people do who have massive 200GB indexes? Does it not take a while to >> > > bring the slaves inline with the master? >> > > >> > > Thanks >> > > Ben
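If you want to run Tomás's one-document experiment from code, here is a minimal SolrJ sketch. The core URL and field names are placeholders, and it uses the same 3.x-era SolrJ client Ben mentions:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OneDocCommitProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder master URL -- point this at the core being replicated.
        CommonsHttpSolrServer master =
            new CommonsHttpSolrServer("http://master:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "replication-probe-1"); // hypothetical unique key
        doc.addField("title", "single-document replication test");
        master.add(doc);

        // A plain commit writes only a small new segment, so the next slave
        // poll should transfer just that segment, not the whole 5GB index.
        master.commit();

        // An optimize would rewrite every segment and force a full copy on
        // the next poll, so it is deliberately NOT called here.
    }
}

If even this one-document commit makes the slave pull the full 5GB, the slave is treating its own index as invalid, which points back at Tomás's questions about commits or optimizes happening on the slave.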
Re: Slave index size growing fast
not really, unless perhaps you're issuing commits or optimizes on the _slave_ (which you should NOT do). Replication happens based on the version of the index on the master. True, it starts out as a timestamp, but then successive versions just have that number incremented. The version number in the index on the slave is compared against the one on the master, but the actual time (on the slave or master) is irrelevant. This is explicitly to avoid problems with time synching across machines/timezones/whatever. It would be instructive to look at the admin/info page to see what the index version is on the master and slave. But, if you optimize or commit (I think) on the _slave_, you might change the timestamp and mess things up (although I'm reaching here, I don't know this for certain). What's the index look like on the slave as compared to the master? Are there just a bunch of files on the slave? Or a bunch of directories? Instead of re-indexing on the master, you could try to bring down the slave, blow away the entire index and start it back up. Since this is a production system, I'd only try this if I had more than one slave. Although you could bring up a new slave and attach it to the master and see what happens there. You wouldn't affect production if you didn't point incoming requests at it... Best Erick On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco wrote: > Erick, > > We're using Solr 3.3 on Linux (CentOS 5.6). > The /data dir on master is actually 1.2G. > > I haven't tried to recreate the index yet. Since it's a production > environment, > I guess that I can stop replication and indexing and then recreate the > master index to see if it makes any difference. > > Also just noticed another thread here named "Simple Slave Replication > Question" that says it could be a problem if I'm seeing a > /data/index with a timestamp on the slave node. > Is this info relevant to this issue? > > Thanks, > Alexandre > > On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson > wrote: > >> What version of Solr and what operating system? >> >> But regardless, this shouldn't be happening. Indexes can >> temporarily double in size, but any extras should be >> cleaned up relatively soon. >> >> On the master, what's the total size of the /data directory? >> I'm a little suspicious of the on your master, but I >> don't think that's the root of your problem >> >> Are you recreating the index on the master (by deleting the >> index directory and starting over)? >> >> This is unusual, and I suspect it's something odd in your configuration, >> but I confess I'm at a loss as to what. >> >> Best >> Erick >> >> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco >> wrote: >> > Hello, >> > >> > We have a Solr index that has an average of 1.19 GB in size. >> > After configuring the replication, the slave machine is growing the index >> > size exponentially. >> > Currently we have a slave with 323.44 GB in size. >> > Is there anything that could cause this behavior? >> > The current replication config is below. >> > >> > Master: >> > >> > >> > commit >> > startup >> > startup >> > >> > >> elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt >> > >> > >> > >> > >> > Slave: >> > >> > >> > http://master:8984/solr/Index/replication >> > >> > >> > >> > Any pointers will be useful. >> > >> > Thanks, >> > Alexandre >>
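For Erick's suggestion of comparing index versions, the ReplicationHandler can also be asked directly instead of eyeballing the admin/info page. A sketch, assuming both cores expose the handler at /replication as in the config above (the master URL comes from the thread, the slave URL is a placeholder):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class IndexVersionCheck {
    public static void main(String[] args) throws Exception {
        String[] cores = { "http://master:8984/solr/Index",
                           "http://slave:8984/solr/Index" };
        for (String url : cores) {
            CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("qt", "/replication");       // route to ReplicationHandler
            params.set("command", "indexversion");  // report version + generation
            NamedList<Object> rsp = server.query(params).getResponse();
            System.out.println(url + " -> version=" + rsp.get("indexversion")
                + ", generation=" + rsp.get("generation"));
        }
    }
}

If the slave reports a version ahead of the master's, that matches Erick's theory that something committed on the slave and pushed it out of sync.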
Re: Solr 4.0 replication problem
Hmmm, looking at your stack trace in a bit more detail, this is really suspicious: Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported in file 'segments_1': -12 (needs to be between -9 and -11) This *looks* like your Solr version on your slave is older than the version on your master. Is this possible at all? Best Erick On Fri, Mar 23, 2012 at 11:03 AM, Hakan İlter wrote: > Hi Erick, > > I've already tried step 2 and 3 but it didn't help. It's almost impossible > to do step 1 for us because of project dead-line. > > Do you have any other suggestion? > > Thank your reply. > > On Fri, Mar 23, 2012 at 4:56 PM, Erick Erickson > wrote: > >> Hmmm, that is odd. But a trunk build from that long ago is going >> to be almost impossible to debug/fix. The problem with working >> from trunk is that this kind of problem won't get much attention. >> >> I have three suggestions: >> 1> update to current trunk. NOTE: you'll have to completely >> reindex your data, the format of the index has changed >> multiple times since then and there's no back-compatibility >> maintained with non-released major versions. >> 2> delete your entire index on the _master_ and re-index from scratch. >> If you do this, I'd also delete the entire /data directory >> on the slaves before replication as well. >> 3> Delete your entire index on the _slave_ and see if you get a >> clean replication. >> >> <3> is the least painful, <1> the most so I'd go in reverse order >> for the above. >> >> Best >> Erick >> >> >> On Wed, Mar 21, 2012 at 8:49 AM, Hakan İlter wrote: >> > Hi everyone, >> > >> > We are using very early version of Solr 4.0 and we've some replication >> > problems. Actually we used this build more than one year without any >> > problem but when I made some changes on schema.xml, the following problem >> > started. >> > >> > I've just changed schema.xml with adding multiValued="true" attribute to >> > two dynamic fields. >> > >> > Before: >> > >> > > /> >> > >> > >> > >> > After: >> > >> > > > multiValued="true"* /> >> > > > multiValued="true"* /> >> > >> > >> > After starting tomcats with new configuration, there are no problems >> > occurred. 
But after a while, I'm seeing this error: >> > >> > *Mar 20, 2012 2:00:05 PM org.apache.solr.handler.ReplicationHandler >> doFetch >> > SEVERE: SnapPull failed >> > org.apache.solr.common.SolrException: Index fetch failed : >> > at >> > org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:340) >> > at >> > >> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265) >> > at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:166) >> > at >> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >> > at >> > >> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) >> > at >> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) >> > at >> > >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) >> > at >> > >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180) >> > at >> > >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204) >> > at >> > >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >> > at >> > >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >> > at java.lang.Thread.run(Thread.java:662) >> > Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format >> > version is not supported in file 'segments_1': -12 (needs to be between >> -9 >> > and -11) >> > at >> > >> org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:51) >> > at >> org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:230) >> > at >> > >> org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:169) >> > at >> org.apache.lucene.index.IndexWriter.(IndexWriter.java:770) >> > at >> > org.apache.solr.update.SolrIndexWriter.(SolrIndexWriter.java:83) >> > at >> > >> org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102) >> > at >> > >> org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:111) >> > at >> > >> org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:297) >> > at >> org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:484) >> > at >> > org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:330) >> > ... 11 more >> > Mar 20, 2012 2:00:16 PM org.apache.solr.update.SolrIndexWriter finalize >> > SEVERE: SolrIndexWriter was not closed prior to finalize(), indi
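One way to pin down which side wrote the offending segments file is to run Lucene's CheckIndex against the slave's index directory, using the exact lucene-core jar the slave runs with. A sketch (the path is a placeholder):

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class SegmentsFormatProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder path to the slave core's index directory.
        FSDirectory dir = FSDirectory.open(new File("/path/to/solr/data/index"));
        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out); // print per-segment diagnostics
        // checkIndex() reads segments_N first; if these jars are older than
        // the ones that wrote the index, it fails with the same
        // IndexFormatTooNewException seen in the replication log above.
        CheckIndex.Status status = checker.checkIndex();
        System.out.println("clean=" + status.clean
            + ", segments=" + status.numSegments);
        dir.close();
    }
}

If this passes on the machine that runs the slave but the slave itself still logs IndexFormatTooNewException, the servlet container is probably loading a different (older) Lucene jar than the one used on the command line.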
Re: Slave index size growing fast
Alexandre, additionally to what Erick said, you may want to check in the slave if what's 300+GB is the "data" directory or the "index." directory. On Fri, Mar 23, 2012 at 12:25 PM, Erick Erickson wrote: > not really, unless perhaps you're issuing commits or optimizes > on the _slave_ (which you should NOT do). > > Replication happens based on the version of the index on the master. > True, it starts out as a timestamp, but then successive versions > just have that number incremented. The version number > in the index on the slave is compared against the one on the master, > but the actual time (on the slave or master) is irrelevant. This is > explicitly to avoid problems with time synching across > machines/timezones/whataver > > It would be instructive to look at the admin/info page to see what > the index version is on the master and slave. > > But, if you optimize or commit (I think) on the _slave_, you might > change the timestamp and mess things up (although I'm reaching > here, I don't know this for certain). > > What's the index look like on the slave as compared to the master? > Are there just a bunch of files on the slave? Or a bunch of directories? > > Instead of re-indexing on the master, you could try to bring down the > slave, blow away the entire index and start it back up. Since this is a > production system, I'd only try this if I had more than one slave. Although > you could bring up a new slave and attach it to the master and see > what happens there. You wouldn't affect production if you didn't point > incoming requests at it... > > Best > Erick > > On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco > wrote: > > Erick, > > > > We're using Solr 3.3 on Linux (CentOS 5.6). > > The /data dir on master is actually 1.2G. > > > > I haven't tried to recreate the index yet. Since it's a production > > environment, > > I guess that I can stop replication and indexing and then recreate the > > master index to see if it makes any difference. > > > > Also just noticed another thread here named "Simple Slave Replication > > Question" that tells that it could be a problem if I'm seeing an > > /data/index with an timestamp on the slave node. > > Is this info relevant to this issue? > > > > Thanks, > > Alexandre > > > > On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson < > erickerick...@gmail.com>wrote: > > > >> What version of Solr and what operating system? > >> > >> But regardless, this shouldn't be happening. Indexes can > >> temporarily double in size, but any extras should be > >> cleaned up relatively soon. > >> > >> On the master, what's the total size of the /data directory? > >> I'm a little suspicious of the on your master, but I > >> don't think that's the root of your problem > >> > >> Are you recreating the index on the master (by deleting the > >> index directory and starting over)? > >> > >> This is unusual, and I suspect it's something odd in your configuration, > >> but I confess I'm at a loss as to what. > >> > >> Best > >> Erick > >> > >> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco > >> wrote: > >> > Hello, > >> > > >> > We have a Solr index that has an average of 1.19 GB in size. > >> > After configuring the replication, the slave machine is growing the > index > >> > size expoentially. > >> > Currently we have an slave with 323.44 GB in size. > >> > Is there anything that could cause this behavior? > >> > The current replication config is below. 
> >> > > >> > Master: > >> > > >> > > >> > commit > >> > startup > >> > startup > >> > > >> > > >> > elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt > >> > > >> > > >> > > >> > > >> > Slave: > >> > > >> > > >> > http://master:8984/solr/Index/replication > >> > > >> > > >> > > >> > Any pointers will be useful. > >> > > >> > Thanks, > >> > Alexandre > >> >
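To check what Tomás describes, namely whether the 300+GB sits in data/ generally or in one stale index.<timestamp> directory, a small directory-size walker is enough. A sketch (the data path is a placeholder):

import java.io.File;

public class DataDirSizes {
    // Recursively sum the bytes under a file or directory.
    static long sizeOf(File f) {
        if (f.isFile()) return f.length();
        long total = 0;
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) total += sizeOf(child);
        }
        return total;
    }

    public static void main(String[] args) {
        // Placeholder path to the slave core's data directory.
        File data = new File("/path/to/solr/data");
        File[] entries = data.listFiles();
        if (entries == null) {
            System.err.println("Not a directory: " + data);
            return;
        }
        // Prints one line per entry, e.g. "index" vs "index.20110926152410",
        // so a leftover snapshot directory stands out immediately.
        for (File entry : entries) {
            System.out.printf("%-30s %,15d bytes%n", entry.getName(), sizeOf(entry));
        }
    }
}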
Re: querying on shards
@Shawn Heisey-4: what does the requestHandler of your broker look like? I'm thinking about doing the same as your idea ;) --- System: One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 1 Core with 45 Million Documents, other Cores < 200,000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Requests - delta every Minute - 4GB Xmx
Re: Solr 4.0 replication problem
Hi Erick, It's not possible because both master and slaves using same binaries. Thanks... On Fri, Mar 23, 2012 at 5:30 PM, Erick Erickson wrote: > Hmmm, looking at your stack trace in a bit more detail, this is really > suspicious: > > Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format > version is not supported in file 'segments_1': -12 (needs to be between -9 > and -11) > > This *looks* like your Solr version on your slave is older than the version > on your master. Is this possible at all? > > Best > Erick > > On Fri, Mar 23, 2012 at 11:03 AM, Hakan İlter > wrote: > > Hi Erick, > > > > I've already tried step 2 and 3 but it didn't help. It's almost > impossible > > to do step 1 for us because of project dead-line. > > > > Do you have any other suggestion? > > > > Thank your reply. > > > > On Fri, Mar 23, 2012 at 4:56 PM, Erick Erickson >wrote: > > > >> Hmmm, that is odd. But a trunk build from that long ago is going > >> to be almost impossible to debug/fix. The problem with working > >> from trunk is that this kind of problem won't get much attention. > >> > >> I have three suggestions: > >> 1> update to current trunk. NOTE: you'll have to completely > >> reindex your data, the format of the index has changed > >> multiple times since then and there's no back-compatibility > >> maintained with non-released major versions. > >> 2> delete your entire index on the _master_ and re-index from scratch. > >> If you do this, I'd also delete the entire /data > directory > >> on the slaves before replication as well. > >> 3> Delete your entire index on the _slave_ and see if you get a > >> clean replication. > >> > >> <3> is the least painful, <1> the most so I'd go in reverse order > >> for the above. > >> > >> Best > >> Erick > >> > >> > >> On Wed, Mar 21, 2012 at 8:49 AM, Hakan İlter > wrote: > >> > Hi everyone, > >> > > >> > We are using very early version of Solr 4.0 and we've some replication > >> > problems. Actually we used this build more than one year without any > >> > problem but when I made some changes on schema.xml, the following > problem > >> > started. > >> > > >> > I've just changed schema.xml with adding multiValued="true" attribute > to > >> > two dynamic fields. > >> > > >> > Before: > >> > > >> > stored="false" > >> /> > >> > stored="false" /> > >> > > >> > > >> > After: > >> > > >> > stored="false" * > >> > multiValued="true"* /> > >> > stored="false" * > >> > multiValued="true"* /> > >> > > >> > > >> > After starting tomcats with new configuration, there are no problems > >> > occurred. 
But after a while, I'm seeing this error: > >> > > >> > *Mar 20, 2012 2:00:05 PM org.apache.solr.handler.ReplicationHandler > >> doFetch > >> > SEVERE: SnapPull failed > >> > org.apache.solr.common.SolrException: Index fetch failed : > >> >at > >> > > org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:340) > >> >at > >> > > >> > org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265) > >> >at > org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:166) > >> >at > >> > > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > >> >at > >> > > >> > java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) > >> >at > >> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) > >> >at > >> > > >> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) > >> >at > >> > > >> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180) > >> >at > >> > > >> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204) > >> >at > >> > > >> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > >> >at > >> > > >> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > >> >at java.lang.Thread.run(Thread.java:662) > >> > Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format > >> > version is not supported in file 'segments_1': -12 (needs to be > between > >> -9 > >> > and -11) > >> >at > >> > > >> > org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:51) > >> >at > >> org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:230) > >> >at > >> > > >> > org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:169) > >> >at > >> org.apache.lucene.index.IndexWriter.(IndexWriter.java:770) > >> >at > >> > org.apache.solr.update.SolrIndexWriter.(SolrIndexWriter.java:83) > >> >at > >> > > >> > org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102) > >> >at > >> > > >> > org.apache.solr.update.D
Unexpected Tika Exception extracting text from a PDF file.
Howdy Folks, I'm stumped and hope somebody can give me some clues on how to work around this occasional error I'm getting. I've got a .Net console program using SolrNet to scour certain folders at certain times and extract text from PDF files and index them. It succeeds on a majority of the files, but it fails on several test files. Though I'm new to this environment, I gather the SolrNet library calls on Solr (v. 3.5.0) to do this, which in turn calls on the Tika library (v. 0.10), which calls on the PDFBox library (v. 1.6.0). To try and isolate the problem I took SolrNet and .Net out of the equation and switched to a Linux console. I downloaded pdfbox-app-1.6.0.jar and executed: java -jar pdfbox-app-1.6.0.jar ExtractText -console a.pdf Everything worked fine. I moved up to Tika. Downloaded tika-app-0.10.jar and executed: java -jar tika-app-0.10.jar -t a.pdf And again everything worked fine. I then executed: curl 'http://localhost:8993/solr/MyCore/update/extract?map.content=text&commit=true' -F file=@a.pdf And it failed with the following output (note: the above command works fine with other PDF files, but fails on these few PDF files): Error 500 org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8 org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:213) ... 22 more Caused by: java.lang.NullPointerException at org.apache.pdfbox.pdmodel.font.PDFont.getEncodingFromFont(PDFont.java:832) at org.apache.pdfbox.pdmodel.font.PDFont.determineEncoding(PDFont.java:293) at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:178) at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:79) at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:139) at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:109) at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76) at org.apache.pdfbo
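To isolate failures like this outside of Solr, the same parse chain can be driven through Tika's Java API, ideally with the exact tika and pdfbox jars from Solr's lib directory on the classpath rather than the standalone app jars, since a version mismatch there would explain why tika-app-0.10.jar succeeds where Solr Cell fails. A sketch (file paths come from the command line):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class PdfTriage {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        for (String path : args) {
            InputStream in = new FileInputStream(new File(path));
            try {
                // -1 disables the default 100k character write limit.
                parser.parse(in, new BodyContentHandler(-1), new Metadata(),
                             new ParseContext());
                System.out.println("OK     " + path);
            } catch (Exception e) {
                // The same TikaException/NPE Solr Cell reports should surface here.
                System.out.println("FAILED " + path + " : " + e);
            } finally {
                in.close();
            }
        }
    }
}

If this reproduces the NullPointerException in PDFont.getEncodingFromFont for the bad files, the problem sits in PDFBox's font handling rather than in the extract handler, and trying a newer PDFBox may be the practical fix.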
Re: Commit Strategy for SolrCloud when Talking about 200 million records.
We saw couple distinct errors and all machines in a shard is identical: -On the leader of the shard Mar 21, 2012 1:58:34 AM org.apache.solr.common.SolrException log SEVERE: shard update error StdNode: http://blah.blah.net:8983/solr/master2-slave1/:org.apache.solr.common.SolrException: Map failed at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:488) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:251) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:319) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:300) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) followed by Mar 21, 2012 1:58:52 AM org.apache.solr.common.SolrException log SEVERE: shard update error StdNode: http://blah.blah.net:8983/solr/master2-slave1/:org.apache.solr.common.SolrException: java.io.IOException: Map failed at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:488) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:251) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:319) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:300) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) followed by Mar 21, 2012 1:58:55 AM org.apache.solr.update.processor.DistributedUpdateProcessor doFinish INFO: Could not tell a replica to recover org.apache.solr.client.solrj.SolrServerException: http://blah.blah.net:8983/solr at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:496) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:251) at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:347) at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:816) at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:176) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:433) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256) at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:662) Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78) at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106) at org.apache.commons.httpclient.HttpConnecti
Re: Commit Strategy for SolrCloud when Talking about 200 million records.
On Mar 23, 2012, at 12:49 PM, I-Chiang Chen wrote: > Caused by: java.lang.OutOfMemoryError: Map failed Hmm...looks like this is the key info here. - Mark Miller lucidimagination.com
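"Map failed" comes out of MMapDirectory: Lucene memory-maps index files, and when FileChannel.map cannot obtain more virtual address space (or the process hits the kernel's per-process mapping limit, vm.max_map_count on Linux) the JVM throws java.lang.OutOfMemoryError: Map failed. For illustration, a sketch of the two Lucene Directory implementations involved; in Solr the choice is normally made via the directoryFactory in solrconfig.xml, and the path below is a placeholder:

import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

public class DirectoryChoice {
    public static void main(String[] args) throws Exception {
        File path = new File("/path/to/solr/data/index"); // placeholder
        // MMapDirectory maps every index file into virtual address space;
        // a large sharded index can exhaust the per-process mmap budget,
        // which surfaces as "java.lang.OutOfMemoryError: Map failed".
        Directory mmap = new MMapDirectory(path);
        // NIOFSDirectory reads through positional file I/O instead of mmap,
        // trading some read speed for immunity to map-count limits.
        Directory nio = new NIOFSDirectory(path);
        System.out.println(mmap + " / " + nio);
        mmap.close();
        nio.close();
    }
}

On a 64-bit JVM, raising vm.max_map_count is usually less costly than giving up mmap entirely.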
Re: Slave index size growing fast
Erick, The master /data dir contains only an index dir with a bunch of files. In the slave, the /data dir contains an index.20110926152410 dir with a lot more files than the master. That is quite strange for me. I guess that the config is right, since we have another slave that is running fine with the same config. The best bet would be clean up this messed slave and try to sync it again and see what happens. Thanks On Fri, Mar 23, 2012 at 12:25 PM, Erick Erickson wrote: > not really, unless perhaps you're issuing commits or optimizes > on the _slave_ (which you should NOT do). > > Replication happens based on the version of the index on the master. > True, it starts out as a timestamp, but then successive versions > just have that number incremented. The version number > in the index on the slave is compared against the one on the master, > but the actual time (on the slave or master) is irrelevant. This is > explicitly to avoid problems with time synching across > machines/timezones/whataver > > It would be instructive to look at the admin/info page to see what > the index version is on the master and slave. > > But, if you optimize or commit (I think) on the _slave_, you might > change the timestamp and mess things up (although I'm reaching > here, I don't know this for certain). > > What's the index look like on the slave as compared to the master? > Are there just a bunch of files on the slave? Or a bunch of directories? > > Instead of re-indexing on the master, you could try to bring down the > slave, blow away the entire index and start it back up. Since this is a > production system, I'd only try this if I had more than one slave. Although > you could bring up a new slave and attach it to the master and see > what happens there. You wouldn't affect production if you didn't point > incoming requests at it... > > Best > Erick > > On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco > wrote: > > Erick, > > > > We're using Solr 3.3 on Linux (CentOS 5.6). > > The /data dir on master is actually 1.2G. > > > > I haven't tried to recreate the index yet. Since it's a production > > environment, > > I guess that I can stop replication and indexing and then recreate the > > master index to see if it makes any difference. > > > > Also just noticed another thread here named "Simple Slave Replication > > Question" that tells that it could be a problem if I'm seeing an > > /data/index with an timestamp on the slave node. > > Is this info relevant to this issue? > > > > Thanks, > > Alexandre > > > > On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson < > erickerick...@gmail.com>wrote: > > > >> What version of Solr and what operating system? > >> > >> But regardless, this shouldn't be happening. Indexes can > >> temporarily double in size, but any extras should be > >> cleaned up relatively soon. > >> > >> On the master, what's the total size of the /data directory? > >> I'm a little suspicious of the on your master, but I > >> don't think that's the root of your problem > >> > >> Are you recreating the index on the master (by deleting the > >> index directory and starting over)? > >> > >> This is unusual, and I suspect it's something odd in your configuration, > >> but I confess I'm at a loss as to what. > >> > >> Best > >> Erick > >> > >> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco > >> wrote: > >> > Hello, > >> > > >> > We have a Solr index that has an average of 1.19 GB in size. > >> > After configuring the replication, the slave machine is growing the > index > >> > size expoentially. 
> >> > Currently we have an slave with 323.44 GB in size. > >> > Is there anything that could cause this behavior? > >> > The current replication config is below. > >> > > >> > Master: > >> > > >> > > >> > commit > >> > startup > >> > startup > >> > > >> > > >> > elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt > >> > > >> > > >> > > >> > > >> > Slave: > >> > > >> > > >> > http://master:8984/solr/Index/replication > >> > > >> > > >> > > >> > Any pointers will be useful. > >> > > >> > Thanks, > >> > Alexandre > >> >
Re: Slave index size growing fast
Tomás, The 300+GB size is only inside the index.20110926152410 dir. Inside there are a lot of files. I am almost conviced that something is messed up like someone commited on this slave machine. Thanks 2012/3/23 Tomás Fernández Löbbe > Alexandre, additionally to what Erick said, you may want to check in the > slave if what's 300+GB is the "data" directory or the "index." > directory. > > On Fri, Mar 23, 2012 at 12:25 PM, Erick Erickson >wrote: > > > not really, unless perhaps you're issuing commits or optimizes > > on the _slave_ (which you should NOT do). > > > > Replication happens based on the version of the index on the master. > > True, it starts out as a timestamp, but then successive versions > > just have that number incremented. The version number > > in the index on the slave is compared against the one on the master, > > but the actual time (on the slave or master) is irrelevant. This is > > explicitly to avoid problems with time synching across > > machines/timezones/whataver > > > > It would be instructive to look at the admin/info page to see what > > the index version is on the master and slave. > > > > But, if you optimize or commit (I think) on the _slave_, you might > > change the timestamp and mess things up (although I'm reaching > > here, I don't know this for certain). > > > > What's the index look like on the slave as compared to the master? > > Are there just a bunch of files on the slave? Or a bunch of directories? > > > > Instead of re-indexing on the master, you could try to bring down the > > slave, blow away the entire index and start it back up. Since this is a > > production system, I'd only try this if I had more than one slave. > Although > > you could bring up a new slave and attach it to the master and see > > what happens there. You wouldn't affect production if you didn't point > > incoming requests at it... > > > > Best > > Erick > > > > On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco > > wrote: > > > Erick, > > > > > > We're using Solr 3.3 on Linux (CentOS 5.6). > > > The /data dir on master is actually 1.2G. > > > > > > I haven't tried to recreate the index yet. Since it's a production > > > environment, > > > I guess that I can stop replication and indexing and then recreate the > > > master index to see if it makes any difference. > > > > > > Also just noticed another thread here named "Simple Slave Replication > > > Question" that tells that it could be a problem if I'm seeing an > > > /data/index with an timestamp on the slave node. > > > Is this info relevant to this issue? > > > > > > Thanks, > > > Alexandre > > > > > > On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson < > > erickerick...@gmail.com>wrote: > > > > > >> What version of Solr and what operating system? > > >> > > >> But regardless, this shouldn't be happening. Indexes can > > >> temporarily double in size, but any extras should be > > >> cleaned up relatively soon. > > >> > > >> On the master, what's the total size of the /data > directory? > > >> I'm a little suspicious of the on your master, but I > > >> don't think that's the root of your problem > > >> > > >> Are you recreating the index on the master (by deleting the > > >> index directory and starting over)? > > >> > > >> This is unusual, and I suspect it's something odd in your > configuration, > > >> but I confess I'm at a loss as to what. 
> > >> > > >> Best > > >> Erick > > >> > > >> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco > > >> wrote: > > >> > Hello, > > >> > > > >> > We have a Solr index that has an average of 1.19 GB in size. > > >> > After configuring the replication, the slave machine is growing the > > index > > >> > size expoentially. > > >> > Currently we have an slave with 323.44 GB in size. > > >> > Is there anything that could cause this behavior? > > >> > The current replication config is below. > > >> > > > >> > Master: > > >> > > > >> > > > >> > commit > > >> > startup > > >> > startup > > >> > > > >> > > > >> > > > elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt > > >> > > > >> > > > >> > > > >> > > > >> > Slave: > > >> > > > >> > > > >> > http://master:8984/solr/Index/replication > > > >> > > > >> > > > >> > > > >> > Any pointers will be useful. > > >> > > > >> > Thanks, > > >> > Alexandre > > >> > > >
Tags and Folksonomies
Suppose I have content which has a title and a description. Users can tag content and search for content based on tag, title, and description; tags carry more weight. Any inputs on how indexing and retrieval would work with content and tags in Solr? Has anyone implemented search based on collaborative tagging? Thanks, Nishant
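One common way to give tags more weight is to index them in their own multiValued field alongside title and description, and then boost that field at query time with the dismax query parser. A minimal SolrJ sketch, assuming hypothetical field names tag, title, and description:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TagWeightedSearch {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL.
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("mountain bike");
        q.set("defType", "dismax");
        // qf boosts: a match on tag counts three times a description match.
        q.set("qf", "tag^3 title^2 description^1");
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResults().getNumFound() + " hits");
    }
}

For collaborative tagging, one refinement is indexing a tag once per user who applied it, so that term frequency reflects how many users agree; whether that is worth the index growth depends on the corpus.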
Re: Solr 4.0 replication problem
In that case, I'm kind of stuck. You've already rebuilt your index from scratch and removed it from your slaves. That should have cleared out most everything that could be an issue. I'd suggest you set up a pair of machines from scratch and try to set up an index/replication with your current schema. If a fresh install shows the same problem, you've probably found a bona-fide bug. But you'll probably have to fix it yourself if you can't upgrade. But I really doubt it's a bug, this just smells way too much like you have something going on you aren't aware of ("interesting" CLASSPATH issues? Multiple installations of Solr? Someone, sometime, indexed things to the master you didn't know about with a newer version?whataver?) Sorry I can't be more help Erick On Fri, Mar 23, 2012 at 12:01 PM, Hakan İlter wrote: > Hi Erick, > > It's not possible because both master and slaves using same binaries. > > Thanks... > > > On Fri, Mar 23, 2012 at 5:30 PM, Erick Erickson > wrote: > >> Hmmm, looking at your stack trace in a bit more detail, this is really >> suspicious: >> >> Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format >> version is not supported in file 'segments_1': -12 (needs to be between -9 >> and -11) >> >> This *looks* like your Solr version on your slave is older than the version >> on your master. Is this possible at all? >> >> Best >> Erick >> >> On Fri, Mar 23, 2012 at 11:03 AM, Hakan İlter >> wrote: >> > Hi Erick, >> > >> > I've already tried step 2 and 3 but it didn't help. It's almost >> impossible >> > to do step 1 for us because of project dead-line. >> > >> > Do you have any other suggestion? >> > >> > Thank your reply. >> > >> > On Fri, Mar 23, 2012 at 4:56 PM, Erick Erickson > >wrote: >> > >> >> Hmmm, that is odd. But a trunk build from that long ago is going >> >> to be almost impossible to debug/fix. The problem with working >> >> from trunk is that this kind of problem won't get much attention. >> >> >> >> I have three suggestions: >> >> 1> update to current trunk. NOTE: you'll have to completely >> >> reindex your data, the format of the index has changed >> >> multiple times since then and there's no back-compatibility >> >> maintained with non-released major versions. >> >> 2> delete your entire index on the _master_ and re-index from scratch. >> >> If you do this, I'd also delete the entire /data >> directory >> >> on the slaves before replication as well. >> >> 3> Delete your entire index on the _slave_ and see if you get a >> >> clean replication. >> >> >> >> <3> is the least painful, <1> the most so I'd go in reverse order >> >> for the above. >> >> >> >> Best >> >> Erick >> >> >> >> >> >> On Wed, Mar 21, 2012 at 8:49 AM, Hakan İlter >> wrote: >> >> > Hi everyone, >> >> > >> >> > We are using very early version of Solr 4.0 and we've some replication >> >> > problems. Actually we used this build more than one year without any >> >> > problem but when I made some changes on schema.xml, the following >> problem >> >> > started. >> >> > >> >> > I've just changed schema.xml with adding multiValued="true" attribute >> to >> >> > two dynamic fields. >> >> > >> >> > Before: >> >> > >> >> > > stored="false" >> >> /> >> >> > > stored="false" /> >> >> > >> >> > >> >> > After: >> >> > >> >> > > stored="false" * >> >> > multiValued="true"* /> >> >> > > stored="false" * >> >> > multiValued="true"* /> >> >> > >> >> > >> >> > After starting tomcats with new configuration, there are no problems >> >> > occurred. 
But after a while, I'm seeing this error: >> >> > >> >> > *Mar 20, 2012 2:00:05 PM org.apache.solr.handler.ReplicationHandler >> >> doFetch >> >> > SEVERE: SnapPull failed >> >> > org.apache.solr.common.SolrException: Index fetch failed : >> >> > at >> >> > >> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:340) >> >> > at >> >> > >> >> >> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265) >> >> > at >> org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:166) >> >> > at >> >> > >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >> >> > at >> >> > >> >> >> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) >> >> > at >> >> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) >> >> > at >> >> > >> >> >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) >> >> > at >> >> > >> >> >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180) >> >> > at >> >> > >> >> >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204) >> >> > at >> >> > >> >> >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >> >> > at >> >> > >> >> >> jav
Field length and scoring
Hello there, I have a quite basic question, but my Solr is behaving in a way I don't quite understand. The setup is simple: I have a field "suggestionText" in which single strings are indexed; the field is stored, and since I want it to serve for a suggestion search, the input string is analyzed by an EdgeNGramFilter. Let's look at two cases: case 1: the input string was 'il2'; case 2: the input string was 'il24'. As I can see from the Solr admin analysis page, case 1 is analysed as i il il2 and case 2 as i il il2 il24, as you would expect. The point now is: when I search for 'il2', I would expect case 1 to have a higher score than case 2. I thought this because I did not omit norms, and thus the shorter field should get a (slightly) higher score. However, the scores in both cases are identical, and so it happens that 'il24' is suggested prior to 'il2'. Perhaps I misunderstood the norms or the notion of "field length". I would be grateful if you could help me out here and give me advice on how to accomplish the desired behavior. Thanks and best regards, Erik
Re: Slave index size growing fast
Alexandre: Have you changed anything like on your slave? And do you have more than one slave? If you do, have you considered just blowing away the entire .../data directory on the slave and letting it re-start from scratch? I'd take the slave out of service for the duration of this operation, or do it when you are OK with some number of requests going to an empty index Because having an index. directory indicates that sometime someone forced the slave to get out of sync, possibly as you say by doing a commit. Or sending docs to it to be indexed or some such. Starting the slave over should fix that if it's the root of your problem. Note a curious thing about the . When you start indexing, the index version is a timestamp. However, from that point on when the index changes, the version number is just incremented (not made the current time). This is to avoid problems with masters and slaves having different times. But a consequence of that is if your slave somehow gets an index that's newer, the replication process does the best it can to not delete indexes that are out of sync with the master and saves them away. This might be what you're seeing. I'm grasping at straws a bit here, but this seems possible. Best Erick On Fri, Mar 23, 2012 at 1:16 PM, Alexandre Rocco wrote: > Tomás, > > The 300+GB size is only inside the index.20110926152410 dir. Inside there > are a lot of files. > I am almost conviced that something is messed up like someone commited on > this slave machine. > > Thanks > > 2012/3/23 Tomás Fernández Löbbe > >> Alexandre, additionally to what Erick said, you may want to check in the >> slave if what's 300+GB is the "data" directory or the "index." >> directory. >> >> On Fri, Mar 23, 2012 at 12:25 PM, Erick Erickson > >wrote: >> >> > not really, unless perhaps you're issuing commits or optimizes >> > on the _slave_ (which you should NOT do). >> > >> > Replication happens based on the version of the index on the master. >> > True, it starts out as a timestamp, but then successive versions >> > just have that number incremented. The version number >> > in the index on the slave is compared against the one on the master, >> > but the actual time (on the slave or master) is irrelevant. This is >> > explicitly to avoid problems with time synching across >> > machines/timezones/whataver >> > >> > It would be instructive to look at the admin/info page to see what >> > the index version is on the master and slave. >> > >> > But, if you optimize or commit (I think) on the _slave_, you might >> > change the timestamp and mess things up (although I'm reaching >> > here, I don't know this for certain). >> > >> > What's the index look like on the slave as compared to the master? >> > Are there just a bunch of files on the slave? Or a bunch of directories? >> > >> > Instead of re-indexing on the master, you could try to bring down the >> > slave, blow away the entire index and start it back up. Since this is a >> > production system, I'd only try this if I had more than one slave. >> Although >> > you could bring up a new slave and attach it to the master and see >> > what happens there. You wouldn't affect production if you didn't point >> > incoming requests at it... >> > >> > Best >> > Erick >> > >> > On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco >> > wrote: >> > > Erick, >> > > >> > > We're using Solr 3.3 on Linux (CentOS 5.6). >> > > The /data dir on master is actually 1.2G. >> > > >> > > I haven't tried to recreate the index yet. 
Since it's a production >> > > environment, >> > > I guess that I can stop replication and indexing and then recreate the >> > > master index to see if it makes any difference. >> > > >> > > Also just noticed another thread here named "Simple Slave Replication >> > > Question" that tells that it could be a problem if I'm seeing an >> > > /data/index with an timestamp on the slave node. >> > > Is this info relevant to this issue? >> > > >> > > Thanks, >> > > Alexandre >> > > >> > > On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson < >> > erickerick...@gmail.com>wrote: >> > > >> > >> What version of Solr and what operating system? >> > >> >> > >> But regardless, this shouldn't be happening. Indexes can >> > >> temporarily double in size, but any extras should be >> > >> cleaned up relatively soon. >> > >> >> > >> On the master, what's the total size of the /data >> directory? >> > >> I'm a little suspicious of the on your master, but I >> > >> don't think that's the root of your problem >> > >> >> > >> Are you recreating the index on the master (by deleting the >> > >> index directory and starting over)? >> > >> >> > >> This is unusual, and I suspect it's something odd in your >> configuration, >> > >> but I confess I'm at a loss as to what. >> > >> >> > >> Best >> > >> Erick >> > >> >> > >> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco >> > >> wrote: >> > >> > Hello, >> > >> > >> > >> > We have a Solr index that has an average of 1.19 GB in size.
Re: Field length and scoring
Erik: The field length is, I believe, based on _tokens_, not characters. Both of your examples are exactly one token long, so the scores are probably identical. Also, the field length is encoded in a byte (as I remember). So it's quite possible that, even if the lengths of these fields were 3 and 4 instead of both being 1, the value stored for the length norms would be the same number. HTH Erick On Fri, Mar 23, 2012 at 2:40 PM, Erik Fäßler wrote: > Hello there, > > I have a quite basic question, but my Solr is behaving in a way I don't > quite understand. > > The setup is simple: I have a field "suggestionText" in which single strings > are indexed; the field is stored, and since I want it to serve for a > suggestion search, the input string is analyzed by an EdgeNGramFilter. > > Let's look at two cases: > > case 1: the input string was 'il2' > case 2: the input string was 'il24' > > As I can see from the Solr admin analysis page, case 1 is analysed as > > i > il > il2 > > and case 2 as > > i > il > il2 > il24 > > as you would expect. The point now is: when I search for 'il2', I would > expect case 1 to have a higher score than case 2. I thought this because I > did not omit norms, and thus the shorter field should get a (slightly) > higher score. However, the scores in both cases are identical, and so it > happens that 'il24' is suggested prior to 'il2'. > > Perhaps I misunderstood the norms or the notion of "field length". I would > be grateful if you could help me out here and give me advice on how to > accomplish the desired behavior. > > Thanks and best regards, > > Erik
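Erick's two points can be checked in isolation. With DefaultSimilarity the length norm is 1/sqrt(numTokens), and it is squeezed into a single byte with a 3-bit mantissa via SmallFloat.floatToByte315, so nearby token counts can decode to the same stored value. A sketch that prints what various field lengths survive as after the round trip (assumes lucene-core on the classpath and the default similarity):

import org.apache.lucene.util.SmallFloat;

public class NormResolution {
    public static void main(String[] args) {
        // DefaultSimilarity stores lengthNorm = 1/sqrt(numTokens), encoded
        // into one byte by SmallFloat.floatToByte315. Print the encoded and
        // decoded value for a range of token counts to see where they collide.
        for (int tokens = 1; tokens <= 30; tokens++) {
            float norm = (float) (1.0 / Math.sqrt(tokens));
            byte b = SmallFloat.floatToByte315(norm);
            System.out.printf("%2d tokens: norm=%.4f -> byte=%d -> decoded=%.4f%n",
                              tokens, norm, b, SmallFloat.byte315ToFloat(b));
        }
    }
}

In the EdgeNGram case above, if the suggestion field is fed from a single keyword-style token, both 'il2' and 'il24' may well end up with identical norms regardless of the encoding, which would explain the identical scores.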
Practical Optimization
Hey All- we run a car search engine (http://carsabi.com) with Solr and did some benchmarking recently after we switched from a hosted service to self-hosting. In brief, complex range queries on a 1.5M document corpus went from 800ms to 43ms. The major shifts were switching from EC2 Large to EC2 CC8XL, which got us down to 282ms (a 2.82x speed gain, due to the 2.75x CPU speed increase, we think), and then down to 43ms when we sharded to 8 cores. We tried sharding to 12 and 16 but saw negligible gains beyond 8. Anyway, hope this might be useful to someone - we wrote up exact stats and a step-by-step sharding procedure on our tech blog (http://carsabi.com/car-news/2012/03/23/optimizing-solr-7x-your-search-speed/) if anyone's interested. best Dwight
Re: Field length and scoring
> Also, the field length is encoded in a byte (as I remember). > So it's > quite possible that, > even if the lengths of these fields were 3 and 4 instead of > both being > 1, the value > stored for the length norms would be the same number. Exactly. http://search-lucene.com/m/uGKRu1pvRjw
Re: querying on shards
On 3/23/2012 9:55 AM, stockii wrote: how look your requestHandler of your broker? i think about your idea to do the same ;) Here's what I have got for the default request handler in my broker core, which is called ncmain. The "rollingStatistics" section is applicable to the SOLR-1972 patch. all 70 name="shards">idxa2.REDACTED.com:8981/solr/inclive,idxa1.REDACTED.com:8981/solr/s0live,idxa1.REDACTED.com:8981/solr/s1live,idxa1.REDACTED.com:8981/solr/s2live,idxa2.REDACTED.com:8981/solr/s3live,idxa2.REDACTED.com:8981/solr/s4live,idxa2.REDACTED.com:8981/solr/s5live default false false 9 5 2 2 spellcheck 604800 16384 5 75 95 99 100
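The XML markup of Shawn's handler did not survive the list archive, but the values that do survive show a SearchHandler whose defaults bake the shards list into the broker core. The same fan-out can be reproduced per request from SolrJ, which is handy for testing a broker setup before committing it to solrconfig.xml. A sketch (the broker URL is a placeholder and the shards list is abbreviated from the one above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class BrokerQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder broker URL for the ncmain core.
        CommonsHttpSolrServer broker =
            new CommonsHttpSolrServer("http://localhost:8981/solr/ncmain");
        SolrQuery q = new SolrQuery("*:*");
        // Per-request equivalent of the handler's baked-in shards default.
        q.set("shards",
              "idxa2.REDACTED.com:8981/solr/inclive,"
            + "idxa1.REDACTED.com:8981/solr/s0live,"
            + "idxa1.REDACTED.com:8981/solr/s1live");
        System.out.println(broker.query(q).getResults().getNumFound() + " docs");
    }
}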
spellcheck file format - multiple words on a line?
Hello all, for business reasons we are sourcing the spellcheck file from another business group. The file we receive looks like the example data below. Can Solr support this type of format, or do I need to process this file into a format that has a single word on each line? Thanks for any help. Mark // snipped from spellcheck file sourced from business group
14-INCH CHAIN
14-INCH RIGHT TINE
1/4 open end ignition wrench
150 DEGREES CELSIUS
15 foot I wire
15 INCH
15 WATT
16 HORSEPOWER ENGINE
16 HORSEPOWER GASOLINE ENGINE
16-INCH BAR
16-INCH CHAIN
16l Cross
16p SIXTEEN PIECE FLAT FLEXIBLE CABLE
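As far as I know, the file-based spellcheck dictionary is happiest with one term per line; the lines are run through the fieldType configured for the spellchecker, so multi-word lines may or may not split the way the business group intends. If preprocessing turns out to be the safer route, a small normalizer is enough. A sketch (file names are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.Locale;
import java.util.TreeSet;

public class SpellingFileNormalizer {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("business_terms.txt"));
        TreeSet<String> words = new TreeSet<String>(); // sorted, deduplicated
        String line;
        while ((line = in.readLine()) != null) {
            // Split multi-word lines like "16 HORSEPOWER GASOLINE ENGINE"
            // into individual lowercase terms, one per output line.
            for (String token : line.trim().split("\\s+")) {
                if (token.length() > 1) { // skip single characters like "I"
                    words.add(token.toLowerCase(Locale.ENGLISH));
                }
            }
        }
        in.close();
        PrintWriter out = new PrintWriter("spellings.txt");
        for (String w : words) out.println(w);
        out.close();
    }
}

This lowercases and dedupes, which matches how a typical spellcheck field analyzes terms; adjust the split if entries like "1/4" or "16-INCH" should survive intact.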