Re: Trouble Setting Up Development Environment

2012-03-23 Thread Li Li
here is my method.
1. check out the latest source code from trunk or download the tarball
svn checkout http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk

2. create a dynamic web project in eclipse and close it.
   for example, I create a project named lucene-solr-trunk in my
workspace.

3. copy/mv the source code into this project (this step is not strictly necessary)
   here is my directory structure
   lili@lili-desktop:~/workspace/lucene-solr-trunk$ ls
bin.tests-framework  build  lucene_trunk  src  testindex  WebContent
  lucene_trunk is the top directory checked out from svn in step 1.
4. remove the WebContent generated by eclipse and replace it with a soft link to
   lucene_trunk/solr/webapp/web/:
  lili@lili-desktop:~/workspace/lucene-solr-trunk$ ll WebContent
lrwxrwxrwx 1 lili lili 28 2011-08-18 18:50 WebContent ->
lucene_trunk/solr/webapp/web/
5. open lucene_trunk/dev-tools/eclipse/dot.classpath and copy all lines like
kind="src" to a temp file
6. replace all strings like path="xxx" with path="lucene_trunk/xxx" and copy
them into the .classpath file
7. mkdir WebContent/WEB-INF/lib
8. extract all jar files referenced in dot.classpath to WebContent/WEB-INF/lib
I use this command:
lili@lili-desktop:~/workspace/lucene-solr-trunk/lucene_trunk$ cat
dev-tools/eclipse/dot.classpath |grep "kind=\"lib"|awk -F "path=\"" '{print
$2}' |awk -F "\"/>" '{print $1}' |xargs -I{} cp {} ../WebContent/WEB-INF/lib/
9. open this project and refresh it.
if everything is ok, it will compile all java files successfully. if
something is wrong, it's probably because we aren't using the correct jar;
there are many versions of the same library.
10. right click the project -> debug As -> debug on Server
it will fail because no solr home is specified.
11. right click the project -> debug As -> debug Configuration -> Arguments
Tab -> VM arguments
 add
-Dsolr.solr.home=/home/lili/workspace/lucene-solr-trunk/lucene_trunk/solr/example/solr
 you can also add other vm arguments like -Xmx1g here.
12. if all is fine, add a breakpoint at SolrDispatchFilter.doFilter(); all solr
requests come through here
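   (For reference, SolrDispatchFilter implements javax.servlet.Filter, so the
   method to set the breakpoint on is:

   public void doFilter(ServletRequest request, ServletResponse response,
       FilterChain chain)

   and it is hit once per incoming Solr request.)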
13. have fun~


On Fri, Mar 23, 2012 at 11:49 AM, Karthick Duraisamy Soundararaj <
karthick.soundara...@gmail.com> wrote:

> Hi Solr Ppl,
>I have been trying to set up solr dev env. I downloaded the
> tar ball of eclipse and the solr 3.5 source. Here are the exact sequence of
> steps I followed
>
> I extracted the solr 3.5 source and eclipse.
> I installed run-jetty-run plugin for eclipse.
> I ran ant eclipse in the solr 3.5 source directory
> I used eclipse's "Open existing project" option to open up the files in
> solr 3.5 directory. I got a huge tree in the name of lucene_solr.
>
> I run it and there is a SEVERE error: System property not set exception:
> solr.test.sys.prop1 not set, and then the jetty loads solr. I then try
> localhost:8080/solr/select/ and get a null pointer exception. I am only able
> to access the admin page.
>
> Is there anything else I need to do?
>
> I tried to follow
>
> http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse
> .
> But I don't find the solr-3.5.war file. I tried ant dist to generate the
> dist folder but that has many jars and wars..
>
> I am able to compile the source with ant compile, get the solr in example
> directory up and running.
>
> Will be great if someone can help me with this.
>
> Thanks,
> Karthick
>


Re: RequestHandler versus SearchComponent

2012-03-23 Thread Ahmet Arslan
> I'm looking at the following. I want
> to (1) map some query fields to
> some other query fields and add some things to FL, and then
> (2)
> rescore.
> 
> I can see how to do it as a RequestHandler that makes a
> parser to get
> the fields, or I could see making a SearchComponent that was
> stuck
> into the list just after the QueryComponent.
> 
> Anyone care to advise in the choice?

I would choose SearchComponent. I read somewhere that customizations are now
a better fit for a SearchComponent (SC) than a RequestHandler (RH).
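A minimal skeleton of such a SearchComponent, as a sketch (class name and
hook contents are illustrative; it assumes the Solr 3.x-era plugin API):

import java.io.IOException;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class FieldMappingComponent extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // prepare() runs for every component before any process():
    // rewrite the incoming query fields and extend fl here
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // if this component is registered after QueryComponent,
    // rb.getResults() is already populated here and can be rescored
  }

  @Override
  public String getDescription() { return "maps query fields, then rescores"; }
  @Override
  public String getSource() { return null; }
  @Override
  public String getSourceId() { return null; }
  @Override
  public String getVersion() { return "1.0"; }
}

It would then be declared in solrconfig.xml and listed in the handler's
"last-components" so that it runs after QueryComponent.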



Re: RequestHandler versus SearchComponent

2012-03-23 Thread Michael Kuhlmann

On 23.03.2012 10:29, Ahmet Arslan wrote:

I'm looking at the following. I want
to (1) map some query fields to
some other query fields and add some things to FL, and then
(2)
rescore.

I can see how to do it as a RequestHandler that makes a
parser to get
the fields, or I could see making a SearchComponent that was
stuck
into the list just after the QueryComponent.

Anyone care to advise in the choice?


I would choose SearchComponent. I read somewhere that customizations are now
a better fit for a SearchComponent (SC) than a RequestHandler (RH).



I would override QueryComponent and modify the normal query instead.

Adding an own SearchComponent after the regular QueryComponent (or 
better as a "last-element") is goof when you simply want to modify the 
existing result. But since you want to rescore, you're likely interested
in documents that have already fallen out of the original result list.
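A rough sketch of that alternative (class name is illustrative; it assumes
the Solr 3.x-era API and would be registered in solrconfig.xml in place of
the standard "query" component):

import java.io.IOException;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class RescoringQueryComponent extends QueryComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // map incoming query fields / extend fl before the query is built
    super.prepare(rb);
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    super.process(rb); // run the normal query
    // rescore here; since this *is* the query component, the query
    // above can also be relaxed so candidate documents aren't lost
  }
}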


Greetings,
Kuli


Re: RequestHandler versus SearchComponent

2012-03-23 Thread Michael Kuhlmann

On 23.03.2012 11:17, Michael Kuhlmann wrote:

Adding an own SearchComponent after the regular QueryComponent (or
better as a "last-element") is goof ...


Of course, I meant "good", not "goof"! ;)



Greetings,
Kuli




Re: Grouping queries

2012-03-23 Thread Martijn v Groningen
On 22 March 2012 03:10, Jamie Johnson  wrote:

> I need to apologize I believe that in my example I have too grossly
> over simplified the problem and it's not clear what I am trying to do,
> so I'll try again.
>
> I have a situation where I have a set of access controls say user,
> super user and ultra user.  These controls are not necessarily
> hierarchical in that user < super user < ultra user.  Each of these
> controls should only be able to see documents with some
> combination of access controls they have.  In my actual case we have
> many access controls and they can be combined in a number of fashions
> so I can't simply constrain what they are searching by a query alone
> (i.e. if it's a user the query is auth:user AND (some query)).  Now I
> have a case where a document contains information that a user can see
> but also contains information a super user can see.  Our current
> system marks this document at the super user level and the user can't
> see it.  We now have a requirement to make the pieces that are at the
> user level available to the user while still allowing the super user
> to see and search all the information.  My original thought was to
> simply index the document twice, this would end up in a possible
> duplicate (say if a user had both user and super user) but since this
> situation is rare it may not matter.  After coming across the grouping
> capability in solr I figured I could execute a group query where we
> grouped on some key which indicated that 2 documents were the same
> just with different access controls (user and super user in this
> example).  We could then filter out the documents in the group the
> user isn't allowed to see and only keep the document with the access
> controls they have.
>
Maybe I'm not understanding this right... But why can't you save the access
controls as a multivalued field in your schema? In your example, if the
current user is a normal user you can then just query auth:user AND (query),
and if the current user is a super user, auth:superuser AND (query). A
document that is searchable for both superuser and user is then returned (if
it matches the rest of the query).
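To make that concrete, a small SolrJ sketch of the multivalued-field
approach (the URL, field name and values are assumptions for illustration):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AuthFilterSketch {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("some query");
    // "auth" is the multivalued access-control field; a normal user
    // filters on auth:user, a super user on auth:superuser
    q.addFilterQuery("auth:user");
    QueryResponse rsp = solr.query(q);
    System.out.println(rsp.getResults().getNumFound());
  }
}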

>
> I hope this makes more sense, unfortunately the Join queries I don't
> believe will work because I don't think if I create documents which
> would be relevant to each access control I could search across these
> document as if it was a single document (i.e. search for something in
> the user document and something in the super user document in a single
> query).  This lead me to believe that grouping was the way to go in
> this case, but again I am very interested in any suggestions that the
> community could offer.
>
I wouldn't use grouping. The Solr join is still an option. Let's say you have
many access controls and the access controls change often on your documents.
You can then choose to store the access controls, with an id pointing to your
logical document, as separate documents in a different Solr core (index). In
the core where your main documents are you don't keep the access controls.
You can then use the Solr join to filter out documents that the current user
isn't supposed to search. Something like this:
q=(query)&fq={!join fromIndex=core1 from=doc_id to=id}auth:superuser
Core1 is the core containing the access control documents, and doc_id is the
id that points to your regular documents.

The benefit of this approach is that if you fine-tune core1 for high
updatability, you can change your access controls very frequently without
paying a big performance penalty.
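A sketch of how the two cores could be wired together with SolrJ (core and
field names follow the example above; the URLs and the 3.x-era SolrJ class
are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class JoinAclSketch {
  public static void main(String[] args) throws Exception {
    // core1 holds only the lightweight ACL documents
    CommonsHttpSolrServer acls =
        new CommonsHttpSolrServer("http://localhost:8983/solr/core1");
    SolrInputDocument acl = new SolrInputDocument();
    acl.addField("id", "acl-123");
    acl.addField("doc_id", "123");     // id of the main document
    acl.addField("auth", "superuser"); // who may search it
    acls.add(acl);
    acls.commit();

    // the main core holds the documents themselves, without ACLs
    CommonsHttpSolrServer docs =
        new CommonsHttpSolrServer("http://localhost:8983/solr/core0");
    SolrQuery q = new SolrQuery("(query)");
    q.addFilterQuery("{!join fromIndex=core1 from=doc_id to=id}auth:superuser");
    System.out.println(docs.query(q).getResults().getNumFound());
  }
}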


Re: Faceted range based on float within velocity not working properly

2012-03-23 Thread Marcelo Carvalho Fernandes
I went deeper into the problem and discovered that...

$math.toInteger("10.1") returns 101
$math.toInteger("10,1") returns 10

Although I'm using Strings in the previous examples, I have a Float
variable from Solr.

I'm not sure if it is a Solr problem, a Velocity problem, or something
in between.

Could it be something related to my locale/regional settings? I ask because
in BRL (Brazilian Real) the currency format we use is something like
R$1.234,56.
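One quick way to test that theory is a minimal Java check (it assumes the
JVM — and thus the tool doing the conversion — is running under the pt-BR
locale):

import java.text.NumberFormat;
import java.util.Locale;

public class LocaleParseCheck {
  public static void main(String[] args) throws Exception {
    // in pt-BR, '.' is the grouping separator and ',' the decimal
    // separator, so lenient parsing reproduces the behavior above
    NumberFormat br = NumberFormat.getInstance(new Locale("pt", "BR"));
    System.out.println(br.parse("10.1").intValue()); // prints 101
    System.out.println(br.parse("10,1").intValue()); // prints 10
  }
}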

Any idea?




Marcelo Carvalho Fernandes
+55 21 8272-7970
+55 21 2205-2786


On Thu, Mar 22, 2012 at 3:14 PM, Marcelo Carvalho Fernandes <
mcf2...@gmail.com> wrote:

> Hi all!
>
> I'm using Apache Solr 3.5.0 with Tomcat 6.0.32.
>
> My schema.xml has a price field declared as...
>
> required="false" />
>
> My solrconfig.xml has a velocity RequestHandler (/browse) that has
> the following facet...
>
>    <str name="facet.range">preco</str>
>    <int name="f.preco.facet.range.start">0</int>
>    <int name="f.preco.facet.range.end">100</int>
>    <int name="f.preco.facet.range.gap">10</int>
>
> ...and I'm using the default templates in
> \example\solr\conf\velocity .
>
> The problem is that each piece of the range being generated has a
> wrong upper bound. For example, instead of...
>
>  0 - 10
> 10 - 20
> 20 - 30
> 30 - 40
> ...
>
> ...what is being generated is...
>
>  0 - 10
> 10 - 110
> 20 - 210
> 30 - 310
> ...
>
> I've studied the #display_facet_range macro in VM_global_library.vm and
> it looks like the $math.add is concatenating the two operands instead of
> producing a sum. I mean, instead of 10+10=20 it returns 110; instead of
> 20+10=30 it returns 210.
>
> Any idea what is the problem?
>
> Thanks in advance,
>
> 
> Marcelo Carvalho Fernandes
> +55 21 8272-7970
> +55 21 2205-2786
>


Re: Commit Strategy for SolrCloud when Talking about 200 million records.

2012-03-23 Thread Mark Miller
What issues? It really shouldn't be a problem. 


On Mar 22, 2012, at 11:44 PM, I-Chiang Chen  wrote:

> At this time we are not leveraging the NRT functionality. This is the
> initial data load process where the idea is to just add all 200 million
> records first, then do a single commit at the end to make them searchable.
> We actually disabled auto commit at this time.
> 
> We have tried to leave auto commit enabled during the initial data load
> process and ran into multiple issues that lead to a botched loading process.
> 
> On Thu, Mar 22, 2012 at 2:15 PM, Mark Miller  wrote:
> 
>> 
>> On Mar 21, 2012, at 9:37 PM, I-Chiang Chen wrote:
>> 
>>> We are currently experimenting with SolrCloud functionality in Solr 4.0.
>>> The goal is to see if Solr 4.0 trunk in its current state is able to
>>> handle roughly 200 million documents. The document size is not big: around
>>> 40 fields, no more than a KB, most of which are empty the majority of the time.
>>> 
>>> The setup we have is 4 servers w/ 2 shards w/ 2 servers per shard. We are
>>> running in Tomcat.
>>> 
>>> The question is, given the approximate data volume, is it realistic to
>>> expect the above setup can handle it.
>> 
>> So 100 million docs per machine essentially? Totally depends on the
>> hardware and what features you are using - but def in the realm of
>> possibility.
>> 
>>> Given the number of documents, should we
>>> commit every x documents or rely on auto commits?
>> 
>> The number of docs shouldn't really matter here. Do you need near real
>> time search?
>> 
>> You should be able to commit about as frequently as you'd like with NRT
>> (eg every 1 second if you'd like) - either using soft auto commit or
>> commitWithin.
>> 
>> Then you want to do a hard commit less frequently - every minute (or more
>> or less) with openSearcher=false.
>> 
>> eg
>> 
>> <autoCommit>
>>   <maxTime>15000</maxTime>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>> 
>>> 
>>> --
>>> -IC
>> 
>> - Mark Miller
>> lucidimagination.com
>> 
> 
> 
> -- 
> -IC


Re: Commit Strategy for SolrCloud when Talking about 200 million records.

2012-03-23 Thread Markus Jelsma
We did some tests too with many millions of documents and auto-commit enabled.
It didn't take long for the indexer to stall, and in the meantime the number of
open files exploded to over 16k, then 32k.

On Friday 23 March 2012 12:20:15 Mark Miller wrote:
> What issues? It really shouldn't be a problem.
> 
> On Mar 22, 2012, at 11:44 PM, I-Chiang Chen  wrote:
> > At this time we are not leveraging the NRT functionality. This is the
> > initial data load process where the idea is to just add all 200 millions
> > records first. Than do a single commit at the end to make them
> > searchable. We actually disabled auto commit at this time.
> > 
> > We have tried to leave auto commit enabled during the initial data load
> > process and ran into multiple issues that leads to botched loading
> > process.
> > 
> > On Thu, Mar 22, 2012 at 2:15 PM, Mark Miller  
wrote:
> >> On Mar 21, 2012, at 9:37 PM, I-Chiang Chen wrote:
> >>> We are currently experimenting with SolrCloud functionality in Solr
> >>> 4.0. The goal is to see if Solr 4.0 trunk with is current state is
> >>> able to handle roughly 200million documents. The document size is not
> >>> big around
> >> 
> >> 40
> >> 
> >>> fields no more than a KB, most of which are empty majority of times.
> >>> 
> >>> The setup we have is 4 servers w/ 2 shards w/ 2 servers per shard. We
> >>> are running in Tomcat.
> >>> 
> >>> The questions are giving the approximate data volume, is it a realistic
> >> 
> >> to
> >> 
> >>> expect above setup can handle it.
> >> 
> >> So 100 million docs per machine essentially? Totally depends on the
> >> hardware and what features you are using - but def in the realm of
> >> possibility.
> >> 
> >>> Giving the number of documents should
> >>> commit every x documents or rely on auto commits?
> >> 
> >> The number of docs shouldn't really matter here. Do you need near real
> >> time search?
> >> 
> >> You should be able to commit about as frequently as you'd like with NRT
> >> (eg every 1 second if you'd like) - either using soft auto commit or
> >> commitWithin.
> >> 
> >> Then you want to do a hard commit less frequently - every minute (or
> >> more or less) with openSearcher=false.
> >> 
> >> eg
> >> 
> >> <autoCommit>
> >>   <maxTime>15000</maxTime>
> >>   <openSearcher>false</openSearcher>
> >> </autoCommit>
> >>> 
> >>> --
> >>> -IC
> >> 
> >> - Mark Miller
> >> lucidimagination.com

-- 
Markus Jelsma - CTO - Openindex


Simple Slave Replication Question

2012-03-23 Thread Ben McCarthy
Hello,

I'm looking at the replication from a master to a number of slaves.  I have 
configured it and it appears to be working.  When updating 40K records on the 
master, is it standard to always copy over the full index, currently 5GB in 
size?  If this is standard, what do people do who have massive 200GB indexes? 
Does it not take a while to bring the slaves in line with the master?

Thanks
Ben




This e-mail is sent on behalf of Trader Media Group Limited, Registered Office: 
Auto Trader House, Cutbush Park Industrial Estate, Danehill, Lower Earley, 
Reading, Berkshire, RG6 4UT(Registered in England No. 4768833). This email and 
any files transmitted with it are confidential and may be legally privileged, 
and intended solely for the use of the individual or entity to whom they are 
addressed. If you have received this email in error please notify the sender. 
This email message has been swept for the presence of computer viruses. 



Have made site for comparison of Solr and other enterprise search engines

2012-03-23 Thread Runar Buvik
Hi all

FYI, I am working on a website for doing side-by-side comparison of
several common enterprise search engines, including some that are based
on Solr. Currently I have Searchdaimon ES, Microsoft SSE 2010,
SearchBlox, Google Mini, Thunderstone, Constellio, mnoGoSearch and IBM
OmniFind Yahoo running. All have indexed the same data set to allow
for easy comparison of the results. Currently I have focused on ready-made
search packages that have an end-user GUI and crawlers
integrated, but will also add a pure Solr install later.

You can do both side-by-side comparison in the native GUI:
http://www.opentestsearch.com/cgi-bin/search.cgi?query=enron+logo
and blind comparison:
http://www.opentestsearch.com/cgi-bin/search.cgi?query=enron+logo&ano=on

Site: http://www.opentestsearch.com/

Quite cool, don't you think? Any suggestions/feedback would be much appreciated.


Best regards
Runar Buvik


Re: Grouping queries

2012-03-23 Thread Jamie Johnson
On Fri, Mar 23, 2012 at 6:37 AM, Martijn v Groningen
 wrote:
> On 22 March 2012 03:10, Jamie Johnson  wrote:
>
>> I need to apologize I believe that in my example I have too grossly
>> over simplified the problem and it's not clear what I am trying to do,
>> so I'll try again.
>>
>> I have a situation where I have a set of access controls say user,
>> super user and ultra user.  These controls are not necessarily
>> hierarchical in that user < super user < ultra user.  Each of these
>> controls should only be able to see documents from with some
>> combination of access controls they have.  In my actual case we have
>> many access controls and they can be combined in a number of fashions
>> so I can't simply constrain what they are searching by a query alone
>> (i.e. if it's a user the query is auth:user AND (some query)).  Now I
>> have a case where a document contains information that a user can see
>> but also contains information a super user can see.  Our current
>> system marks this document at the super user level and the user can't
>> see it.  We now have a requirement to make the pieces that are at the
>> user level available to the user while still allowing the super user
>> to see and search all the information.  My original thought was to
>> simply index the document twice, this would end up in a possible
>> duplicate (say if a user had both user and super user) but since this
>> situation is rare it may not matter.  After coming across the grouping
>> capability in solr I figured I could execute a group query where we
>> grouped on some key which indicated that 2 documents were the same
>> just with different access controls (user and super user in this
>> example).  We could then filter out the documents in the group the
>> user isn't allowed to see and only keep the document with the access
>> controls they have.
>>
> Maybe I'm not understanding this right... But why can't you save the access
> controls
> as a multivalued field in your schema? In your example your can then if the
> current user is a normal user just query auth:user AND (query) and if the
> current user
> is a super user auth:superuser AND (query). A document that is searchable
> for
> both superuser and user is then returned (if it matches the rest of the
> query).
>

I'd like to avoid having duplicates. Although my access controls are
not strictly hierarchical, there are cases where a super user can see his
docs and user docs.  The idea was to have the super user doc be a
superset of the user doc.  So in my case I really have only 1
document, but that 1 document has pieces a user can see and pieces only
a super user can see.  The idea was to index the document twice: once
with the entire doc and once with the pieces just the user could see.

>>
>> I hope this makes more sense, unfortunately the Join queries I don't
>> believe will work because I don't think if I create documents which
>> would be relevant to each access control I could search across these
>> document as if it was a single document (i.e. search for something in
>> the user document and something in the super user document in a single
>> query).  This lead me to believe that grouping was the way to go in
>> this case, but again I am very interested in any suggestions that the
>> community could offer.
>>
> I wouldn't use grouping. The Solr join is still a option. Lets say you have
> many access controls and the access controls change often on your documents.
> You can then choose to store the access controls with an id to your logic
> document
> as a separate document in a different Solr Core (index). In the core were
> your main
> documents are you don't keep the access controls. You can then use the solr
> join
> to filter out documents that the current user isn't supposed to search.
> Something like this:
> q=(query)&fq={!join fromIndex=core1 from=doc_id to=id}auth:superuser
> Core 1 is the core containing the access control documents and the doc_id
> is the id that
> points to your regular documents.
>
> The benefit of this approach is that if you fine tune the core1 for high
> updatability you can
> change you access controls very frequently without paying a big
> performance penalty.

Where is Join documented?  I looked at
http://wiki.apache.org/solr/Join and see no reference to "fromIndex".
Also does this work in a distributed environment?


Re: Grouping queries

2012-03-23 Thread Martijn v Groningen
>
> Where is Join documented?  I looked at
> http://wiki.apache.org/solr/Join and see no reference to "fromIndex".
> Also does this work in a distributed environment?
>
The "fromIndex" isn't documented in the wiki. It is mentioned in the
issue and you can find it in the Solr code:
https://issues.apache.org/jira/browse/SOLR-2272?focusedCommentId=13024918&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13024918


The Solr join only works in a distributed environment if you partition your
documents properly. Documents that link to each other need to reside on the
same shard and this can be a problem in some cases.

Martijn


"Error 500 seek past EOF" : SOLR bug?

2012-03-23 Thread Martin Koch
Hi list

In my ~6M document index, served from a slave that is replicating from a
master, I'm trying to do this query:

localhost:8080/solr/core0/select?q=car&qf=document%5E1&defType=edismax

Can anybody explain the below error that I get as a result? It may (or may
not) be related to another problem that we're seeing, which is that
replication sometimes fails "for a while", leading to a lot of retried
replication attempts, which seem to finally succeed.


==
Query failure:
Error 500 seek past EOF:
MMapIndexInput(path="/mnt/solr.data.0/index/_1sj.frq")

java.io.IOException: seek past EOF:
MMapIndexInput(path="/mnt/solr.data.0/index/_1sj.frq")
at
org.apache.lucene.store.MMapDirectory$MMapIndexInput.seek(MMapDirectory.java:352)
at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:92)
at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:59)
at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:1277)
at org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:490)
at org.apache.solr.search.SolrIndexReader.termDocs(SolrIndexReader.java:321)
at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:102)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:577)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:364)
at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1282)
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1162)
at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:362)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:378)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:486)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:520)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:973)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:417)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:907)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
at org.eclipse.jetty.server.Server.handle(Server.java:346)
at
org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:442)
at
org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:924)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:582)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:218)
at
org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:51)
at
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:586)
at
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:44)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:598)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:533)
at java.lang.Thread.run(Thread.java:722)


==
Replication failure:

Mar 23, 2012 1:27:30 PM org.apache.solr.handler.ReplicationHandler doFetch
SEVERE: SnapPull failed
org.apache.solr.common.SolrException: Index fetch failed :
at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:331)
at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:268)
at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolEx

Re: Simple Slave Replication Question

2012-03-23 Thread Martin Koch
I guess this would depend on network bandwidth, but we move around
150G/hour when hooking up a new slave to the master.

/Martin

On Fri, Mar 23, 2012 at 12:33 PM, Ben McCarthy <
ben.mccar...@tradermedia.co.uk> wrote:

> Hello,
>
> Im looking at the replication from a master to a number of slaves.  I have
> configured it and it appears to be working.  When updating 40K records on
> the master is it standard to always copy over the full index, currently 5gb
> in size.  If this is standard what do people do who have massive 200gb
> indexs, does it not take a while to bring the slaves inline with the master?
>
> Thanks
> Ben
>


RE: Simple Slave Replication Question

2012-03-23 Thread Ben McCarthy
So do you just simply address this with big NICs and network pipes?

-Original Message-
From: Martin Koch [mailto:m...@issuu.com]
Sent: 23 March 2012 14:07
To: solr-user@lucene.apache.org
Subject: Re: Simple Slave Replication Question

I guess this would depend on network bandwidth, but we move around 150G/hour 
when hooking up a new slave to the master.

/Martin

On Fri, Mar 23, 2012 at 12:33 PM, Ben McCarthy < 
ben.mccar...@tradermedia.co.uk> wrote:

> Hello,
>
> Im looking at the replication from a master to a number of slaves.  I
> have configured it and it appears to be working.  When updating 40K
> records on the master is it standard to always copy over the full
> index, currently 5gb in size.  If this is standard what do people do
> who have massive 200gb indexs, does it not take a while to bring the slaves 
> inline with the master?
>
> Thanks
> Ben
>



Slave index size growing fast

2012-03-23 Thread Alexandre Rocco
Hello,

We have a Solr index that averages 1.19 GB in size.
After configuring replication, the slave machine is growing the index
size exponentially.
Currently we have a slave of 323.44 GB in size.
Is there anything that could cause this behavior?
The current replication config is below.

Master:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt</str>
  </lst>
</requestHandler>

Slave:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master:8984/solr/Index/replication</str>
  </lst>
</requestHandler>

Any pointers will be useful.

Thanks,
Alexandre


Re: Simple Slave Replication Question

2012-03-23 Thread Tomás Fernández Löbbe
Hi Ben, only new segments are replicated from master to slave. In a
situation where all the segments are new, this will cause the index to be
fully replicated, but this rarely happens with incremental updates. It can
also happen if the slave Solr assumes it has an "invalid" index.
Are you committing or optimizing on the slaves? After replication, is the
index directory on the slaves called "index" or "index.<timestamp>"?

Tomás

On Fri, Mar 23, 2012 at 11:18 AM, Ben McCarthy <
ben.mccar...@tradermedia.co.uk> wrote:

> So do you just simpy address this with big nic and network pipes.
>
> -Original Message-
> From: Martin Koch [mailto:m...@issuu.com]
> Sent: 23 March 2012 14:07
> To: solr-user@lucene.apache.org
> Subject: Re: Simple Slave Replication Question
>
> I guess this would depend on network bandwidth, but we move around
> 150G/hour when hooking up a new slave to the master.
>
> /Martin
>
> On Fri, Mar 23, 2012 at 12:33 PM, Ben McCarthy <
> ben.mccar...@tradermedia.co.uk> wrote:
>
> > Hello,
> >
> > Im looking at the replication from a master to a number of slaves.  I
> > have configured it and it appears to be working.  When updating 40K
> > records on the master is it standard to always copy over the full
> > index, currently 5gb in size.  If this is standard what do people do
> > who have massive 200gb indexs, does it not take a while to bring the
> slaves inline with the master?
> >
> > Thanks
> > Ben
> >
>


Re: Slave index size growing fast

2012-03-23 Thread Erick Erickson
What version of Solr and what operating system?

But regardless, this shouldn't be happening. Indexes can
temporarily double in size, but any extras should be
cleaned up relatively soon.

On the master, what's the total size of the /data directory?
I'm a little suspicious of the replicateAfter settings on your master, but I
don't think that's the root of your problem

Are you recreating the index on the master (by deleting the
index directory and starting over)?

This is unusual, and I suspect it's something odd in your configuration,
but I confess I'm at a loss as to what.

Best
Erick

On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco  wrote:
> Hello,
>
> We have a Solr index that has an average of 1.19 GB in size.
> After configuring the replication, the slave machine is growing the index
> size expoentially.
> Currently we have an slave with 323.44 GB in size.
> Is there anything that could cause this behavior?
> The current replication config is below.
>
> Master:
> 
> 
> commit
> startup
> startup
> 
> elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt
> 
> 
> 
>
> Slave:
> 
> 
> http://master:8984/solr/Index/replication
> 
> 
>
> Any pointers will be useful.
>
> Thanks,
> Alexandre


Re: Solr 4.0 replication problem

2012-03-23 Thread Erick Erickson
Hmmm, that is odd. But a trunk build from that long ago is going
to be almost impossible to debug/fix. The problem with working
from trunk is that this kind of problem won't get much attention.

I have three suggestions:
1> update to current trunk. NOTE: you'll have to completely
 reindex your data, the format of the index has changed
 multiple times since then and there's no back-compatibility
 maintained with non-released major versions.
2> delete your entire index on the _master_ and re-index from scratch.
 If you do this, I'd also delete the entire /data directory
 on the slaves before replication as well.
3> Delete your entire index on the _slave_ and see if you get a
 clean replication.

<3> is the least painful, <1> the most, so I'd go in reverse order
for the above.

Best
Erick


On Wed, Mar 21, 2012 at 8:49 AM, Hakan İlter  wrote:
> Hi everyone,
>
> We are using a very early version of Solr 4.0 and we have some replication
> problems. Actually we used this build for more than one year without any
> problem, but when I made some changes to schema.xml, the following problem
> started.
>
> I've just changed schema.xml by adding a multiValued="true" attribute to
> two dynamic fields.
>
> Before:
> <dynamicField ... />
> <dynamicField ... />
>
> After:
> <dynamicField ... multiValued="true" />
> <dynamicField ... multiValued="true" />
>
> After starting the Tomcats with the new configuration, no problems
> occurred. But after a while, I'm seeing this error:
>
> *Mar 20, 2012 2:00:05 PM org.apache.solr.handler.ReplicationHandler doFetch
> SEVERE: SnapPull failed
> org.apache.solr.common.SolrException: Index fetch failed :
>        at
> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:340)
>        at
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
>        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:166)
>        at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>        at
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>        at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>        at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
>        at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format
> version is not supported in file 'segments_1': -12 (needs to be between -9
> and -11)
>        at
> org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:51)
>        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:230)
>        at
> org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:169)
>        at
> org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:770)
>        at
> org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:83)
> org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102)
>        at
> org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:111)
>        at
> org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:297)
>        at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:484)
>        at
> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:330)
>        ... 11 more
> Mar 20, 2012 2:00:16 PM org.apache.solr.update.SolrIndexWriter finalize
> SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a bug
> -- POSSIBLE RESOURCE LEAK!!!*
>
> After the above error, it works for a while and then the Tomcats go down
> because of out-of-memory problems.
>
> Could this be a bug? What do you suggest? Please don't suggest updating
> Solr to the current version in trunk, because we made a lot of changes
> related to our build of Solr. Updating to the current Solr would take at
> least a couple of weeks, but we need an immediate solution.
>
> We are using DIH, our schema version is 1.3, and both master and slaves are
> using the same binaries, libraries etc. Here are some details about our solr
> build:
>
> Solr Specification Version: 4.0.0.2011.03.28.06.20.50
> Solr Implementation Version: 4.0-SNAPSHOT 1086110 - Administrator -
> 2011-03-28 06:20:50
> Lucene Specification Version: 4.0-SNAPSHOT
> Lucene Implementation Version: 4.0-SNAPSHOT 1086110 - 2011-03-28 06:22:18
>
> Thanks for any help.


RE: Simple Slave Replication Question

2012-03-23 Thread Ben McCarthy
I just have an index directory.

I push the documents through with a change to a field.  I'm using SolrJ to do 
this.  I'm using the guide from the wiki to set up the replication.  When the 
feed of updates to the master finishes I call a commit, again using SolrJ.  I 
then have a poll period of 5 minutes on the slave.  When it kicks in I see a 
new version of the index and then it copies the full 5GB index.
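For reference, a minimal SolrJ sketch of that feed-then-commit flow (the
URL and field names are assumptions):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FeedThenCommit {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer master =
        new CommonsHttpSolrServer("http://master:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "some-id");
    doc.addField("some_field", "changed value");
    master.add(doc);  // repeated for each of the 40K records
    master.commit();  // single commit once the feed finishes
  }
}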

Thanks
Ben

-Original Message-
From: Tomás Fernández Löbbe [mailto:tomasflo...@gmail.com]
Sent: 23 March 2012 14:29
To: solr-user@lucene.apache.org
Subject: Re: Simple Slave Replication Question

Hi Ben, only new segments are replicated from master to slave. In a situation 
where all the segments are new, this will cause the index to be fully 
replicated, but this rarely happen with incremental updates. It can also happen 
if the slave Solr assumes it has an "invalid" index.
Are you committing or optimizing on the slaves? After replication, the index 
directory on the slaves is called "index" or "index."?

Tomás

On Fri, Mar 23, 2012 at 11:18 AM, Ben McCarthy < 
ben.mccar...@tradermedia.co.uk> wrote:

> So do you just simpy address this with big nic and network pipes.
>
> -Original Message-
> From: Martin Koch [mailto:m...@issuu.com]
> Sent: 23 March 2012 14:07
> To: solr-user@lucene.apache.org
> Subject: Re: Simple Slave Replication Question
>
> I guess this would depend on network bandwidth, but we move around
> 150G/hour when hooking up a new slave to the master.
>
> /Martin
>
> On Fri, Mar 23, 2012 at 12:33 PM, Ben McCarthy <
> ben.mccar...@tradermedia.co.uk> wrote:
>
> > Hello,
> >
> > Im looking at the replication from a master to a number of slaves.
> > I have configured it and it appears to be working.  When updating
> > 40K records on the master is it standard to always copy over the
> > full index, currently 5gb in size.  If this is standard what do
> > people do who have massive 200gb indexs, does it not take a while to
> > bring the
> slaves inline with the master?
> >
> > Thanks
> > Ben
> >
>



Re: Slave index size growing fast

2012-03-23 Thread Alexandre Rocco
Erick,

We're using Solr 3.3 on Linux (CentOS 5.6).
The /data dir on the master is actually 1.2G.

I haven't tried to recreate the index yet. Since it's a production
environment, I guess I can stop replication and indexing and then recreate
the master index to see if it makes any difference.

Also, I just noticed another thread here named "Simple Slave Replication
Question" saying that it could be a problem if I'm seeing a
/data/index directory with a timestamp on the slave node.
Is this info relevant to this issue?

Thanks,
Alexandre

On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson wrote:

> What version of Solr and what operating system?
>
> But regardless, this shouldn't be happening. Indexes can
> temporarily double in size, but any extras should be
> cleaned up relatively soon.
>
> On the master, what's the total size of the /data directory?
> I'm a little suspicious of the  on your master, but I
> don't think that's the root of your problem
>
> Are you recreating the index on the master (by deleting the
> index directory and starting over)?
>
> This is unusual, and I suspect it's something odd in your configuration,
> but I confess I'm at a loss as to what.
>
> Best
> Erick
>
> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco 
> wrote:
> > Hello,
> >
> > We have a Solr index that has an average of 1.19 GB in size.
> > After configuring the replication, the slave machine is growing the index
> > size expoentially.
> > Currently we have an slave with 323.44 GB in size.
> > Is there anything that could cause this behavior?
> > The current replication config is below.
> >
> > Master:
> > 
> > 
> > commit
> > startup
> > startup
> > 
> >
> elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt
> > 
> > 
> > 
> >
> > Slave:
> > 
> > 
> > http://master:8984/solr/Index/replication
> > 
> > 
> >
> > Any pointers will be useful.
> >
> > Thanks,
> > Alexandre
>


Re: Solr 4.0 replication problem

2012-03-23 Thread Hakan İlter
Hi Erick,

I've already tried steps 2 and 3 but they didn't help. It's almost impossible
for us to do step 1 because of the project deadline.

Do you have any other suggestions?

Thanks for your reply.

On Fri, Mar 23, 2012 at 4:56 PM, Erick Erickson wrote:

> Hmmm, that is odd. But a trunk build from that long ago is going
> to be almost impossible to debug/fix. The problem with working
> from trunk is that this kind of problem won't get much attention.
>
> I have three suggestions:
> 1> update to current trunk. NOTE: you'll have to completely
> reindex your data, the format of the index has changed
> multiple times since then and there's no back-compatibility
> maintained with non-released major versions.
> 2> delete your entire index on the _master_ and re-index from scratch.
> If you do this, I'd also delete the entire /data directory
> on the slaves before replication as well.
> 3> Delete your entire index on the _slave_ and see if you get a
> clean replication.
>
> <3> is the least painful, <1> the most so I'd go in reverse order
> for the above.
>
> Best
> Erick
>
>
> On Wed, Mar 21, 2012 at 8:49 AM, Hakan İlter  wrote:
> > Hi everyone,
> >
> > We are using very early version of Solr 4.0 and we've some replication
> > problems. Actually we used this build more than one year without any
> > problem but when I made some changes on schema.xml, the following problem
> > started.
> >
> > I've just changed schema.xml with adding multiValued="true" attribute to
> > two dynamic fields.
> >
> > Before:
> > 
> >  />
> > 
> > 
> >
> > After:
> > 
> >  > multiValued="true"* />
> >  > multiValued="true"* />
> > 
> >
> > After starting tomcats with new configuration, there are no problems
> > occurred. But after a while, I'm seeing this error:
> >
> > *Mar 20, 2012 2:00:05 PM org.apache.solr.handler.ReplicationHandler
> doFetch
> > SEVERE: SnapPull failed
> > org.apache.solr.common.SolrException: Index fetch failed :
> >at
> > org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:340)
> >at
> >
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
> >at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:166)
> >at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> >at
> >
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> >at
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> >at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> >at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> >at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> >at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >at java.lang.Thread.run(Thread.java:662)
> > Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format
> > version is not supported in file 'segments_1': -12 (needs to be between
> -9
> > and -11)
> >at
> >
> org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:51)
> >at
> org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:230)
> >at
> >
> org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:169)
> >at
> org.apache.lucene.index.IndexWriter.(IndexWriter.java:770)
> >at
> > org.apache.solr.update.SolrIndexWriter.(SolrIndexWriter.java:83)
> >at
> >
> org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102)
> >at
> >
> org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:111)
> >at
> >
> org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:297)
> >at
> org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:484)
> >at
> > org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:330)
> >... 11 more
> > Mar 20, 2012 2:00:16 PM org.apache.solr.update.SolrIndexWriter finalize
> > SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a
> bug
> > -- POSSIBLE RESOURCE LEAK!!!*
> >
> > After above error, it works for a while and then tomcats are going down
> > because of out of memory problems.
> >
> > Could this be a bug? What do you suggest? Please don't suggest me to
> update
> > Solr to current version in trunk because we did a lot of changes related
> > with our build of Solr. Updating to current Solr would take at least a
> > couple of weeks. But we need an immediate solution.
> >
> > We are using DIH, our schema version is 1.3, both master and slaves using
> > same binaries, libraries etc.

Re: Simple Slave Replication Question

2012-03-23 Thread Tomás Fernández Löbbe
Have you changed the mergeFactor or are you using 10 as in the example
solrconfig?

What do you see in the slave's log during replication? Do you see any line
like "Skipping download for..."?

On Fri, Mar 23, 2012 at 11:57 AM, Ben McCarthy <
ben.mccar...@tradermedia.co.uk> wrote:

> I just have a index directory.
>
> I push the documents through with a change to a field.  Im using SOLRJ to
> do this.  Im using the guide from the wiki to setup the replication.  When
> the feed of updates to the master finishes I call a commit again using
> SOLRJ.  I then have a poll period of 5 minutes from the slave.  When it
> kicks in I see a new version of the index and then it copys the full 5gb
> index.
>
> Thanks
> Ben
>
> -Original Message-
> From: Tomás Fernández Löbbe [mailto:tomasflo...@gmail.com]
> Sent: 23 March 2012 14:29
> To: solr-user@lucene.apache.org
> Subject: Re: Simple Slave Replication Question
>
> Hi Ben, only new segments are replicated from master to slave. In a
> situation where all the segments are new, this will cause the index to be
> fully replicated, but this rarely happen with incremental updates. It can
> also happen if the slave Solr assumes it has an "invalid" index.
> Are you committing or optimizing on the slaves? After replication, the
> index directory on the slaves is called "index" or "index."?
>
> Tomás
>
> On Fri, Mar 23, 2012 at 11:18 AM, Ben McCarthy <
> ben.mccar...@tradermedia.co.uk> wrote:
>
> > So do you just simpy address this with big nic and network pipes.
> >
> > -Original Message-
> > From: Martin Koch [mailto:m...@issuu.com]
> > Sent: 23 March 2012 14:07
> > To: solr-user@lucene.apache.org
> > Subject: Re: Simple Slave Replication Question
> >
> > I guess this would depend on network bandwidth, but we move around
> > 150G/hour when hooking up a new slave to the master.
> >
> > /Martin
> >
> > On Fri, Mar 23, 2012 at 12:33 PM, Ben McCarthy <
> > ben.mccar...@tradermedia.co.uk> wrote:
> >
> > > Hello,
> > >
> > > Im looking at the replication from a master to a number of slaves.
> > > I have configured it and it appears to be working.  When updating
> > > 40K records on the master is it standard to always copy over the
> > > full index, currently 5gb in size.  If this is standard what do
> > > people do who have massive 200gb indexs, does it not take a while to
> > > bring the
> > slaves inline with the master?
> > >
> > > Thanks
> > > Ben
> > >
> >
>


Re: Simple Slave Replication Question

2012-03-23 Thread Tomás Fernández Löbbe
Also, what happens if, instead of adding the 40K docs, you add just one and
commit?

2012/3/23 Tomás Fernández Löbbe 

> Have you changed the mergeFactor or are you using 10 as in the example
> solrconfig?
>
> What do you see in the slave's log during replication? Do you see any line
> like "Skipping download for..."?
>
>
> On Fri, Mar 23, 2012 at 11:57 AM, Ben McCarthy <
> ben.mccar...@tradermedia.co.uk> wrote:
>
>> I just have a index directory.
>>
>> I push the documents through with a change to a field.  Im using SOLRJ to
>> do this.  Im using the guide from the wiki to setup the replication.  When
>> the feed of updates to the master finishes I call a commit again using
>> SOLRJ.  I then have a poll period of 5 minutes from the slave.  When it
>> kicks in I see a new version of the index and then it copys the full 5gb
>> index.
>>
>> Thanks
>> Ben
>>
>> -Original Message-
>> From: Tomás Fernández Löbbe [mailto:tomasflo...@gmail.com]
>> Sent: 23 March 2012 14:29
>> To: solr-user@lucene.apache.org
>> Subject: Re: Simple Slave Replication Question
>>
>> Hi Ben, only new segments are replicated from master to slave. In a
>> situation where all the segments are new, this will cause the index to be
>> fully replicated, but this rarely happen with incremental updates. It can
>> also happen if the slave Solr assumes it has an "invalid" index.
>> Are you committing or optimizing on the slaves? After replication, the
>> index directory on the slaves is called "index" or "index."?
>>
>> Tomás
>>
>> On Fri, Mar 23, 2012 at 11:18 AM, Ben McCarthy <
>> ben.mccar...@tradermedia.co.uk> wrote:
>>
>> > So do you just simpy address this with big nic and network pipes.
>> >
>> > -Original Message-
>> > From: Martin Koch [mailto:m...@issuu.com]
>> > Sent: 23 March 2012 14:07
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Simple Slave Replication Question
>> >
>> > I guess this would depend on network bandwidth, but we move around
>> > 150G/hour when hooking up a new slave to the master.
>> >
>> > /Martin
>> >
>> > On Fri, Mar 23, 2012 at 12:33 PM, Ben McCarthy <
>> > ben.mccar...@tradermedia.co.uk> wrote:
>> >
>> > > Hello,
>> > >
>> > > I'm looking at the replication from a master to a number of slaves.
>> > > I have configured it and it appears to be working.  When updating
>> > > 40K records on the master, is it standard to always copy over the
>> > > full index, currently 5GB in size?  If this is standard, what do
>> > > people do who have massive 200GB indexes? Does it not take a while to
>> > > bring the
>> > slaves in line with the master?
>> > >
>> > > Thanks
>> > > Ben


Re: Slave index size growing fast

2012-03-23 Thread Erick Erickson
not really, unless perhaps you're issuing commits or optimizes
on the _slave_ (which you should NOT do).

Replication happens based on the version of the index on the master.
True, it starts out as a timestamp, but then successive versions
just have that number incremented. The version number
in the index on the slave is compared against the one on the master,
but the actual time (on the slave or master) is irrelevant. This is
explicitly to avoid problems with time synching across
machines/timezones/whatever.

It would be instructive to look at the admin/info page to see what
the index version is on the master and slave.
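
The replication handler reports this too, if the admin pages are hard to get
at. A sketch with placeholder hosts, assuming the handler is mounted at
/replication:

curl 'http://master:8983/solr/replication?command=indexversion'
curl 'http://slave:8983/solr/replication?command=indexversion'

Once the slave has caught up, the indexversion and generation values should
match.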

But, if you optimize or commit (I think) on the _slave_, you might
change the timestamp and mess things up (although I'm reaching
here, I don't know this for certain).

What does the index look like on the slave as compared to the master?
Are there just a bunch of files on the slave? Or a bunch of directories?

Instead of re-indexing on the master, you could try to bring down the
slave, blow away the entire index and start it back up. Since this is a
production system, I'd only try this if I had more than one slave. Although
you could bring up a new slave and attach it to the master and see
what happens there. You wouldn't affect production if you didn't point
incoming requests at it...
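
If you do try that, the sequence is roughly as follows. This is a sketch:
the data path and the way you stop/start the servlet container depend on
your install.

# stop the servlet container on the spare slave, then:
rm -rf /path/to/solr/data/index* /path/to/solr/data/replication.properties
# start it back up and force an immediate pull instead of waiting for the poll:
curl 'http://slave:8983/solr/replication?command=fetchindex'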

Best
Erick

On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco  wrote:
> Erick,
>
> We're using Solr 3.3 on Linux (CentOS 5.6).
> The /data dir on master is actually 1.2G.
>
> I haven't tried to recreate the index yet. Since it's a production
> environment,
> I guess that I can stop replication and indexing and then recreate the
> master index to see if it makes any difference.
>
> Also just noticed another thread here named "Simple Slave Replication
> Question" that tells that it could be a problem if I'm seeing an
> /data/index with an timestamp on the slave node.
> Is this info relevant to this issue?
>
> Thanks,
> Alexandre
>
> On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson 
> wrote:
>
>> What version of Solr and what operating system?
>>
>> But regardless, this shouldn't be happening. Indexes can
>> temporarily double in size, but any extras should be
>> cleaned up relatively soon.
>>
>> On the master, what's the total size of the /data directory?
>> I'm a little suspicious of the  on your master, but I
>> don't think that's the root of your problem
>>
>> Are you recreating the index on the master (by deleting the
>> index directory and starting over)?
>>
>> This is unusual, and I suspect it's something odd in your configuration,
>> but I confess I'm at a loss as to what.
>>
>> Best
>> Erick
>>
>> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco 
>> wrote:
>> > Hello,
>> >
>> > We have a Solr index that has an average of 1.19 GB in size.
>> > After configuring the replication, the slave machine is growing the index
>> > size exponentially.
>> > Currently we have a slave with 323.44 GB in size.
>> > Is there anything that could cause this behavior?
>> > The current replication config is below.
>> >
>> > Master:
>> > 
>> > 
>> > commit
>> > startup
>> > startup
>> > 
>> >
>> elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt
>> > 
>> > 
>> > 
>> >
>> > Slave:
>> > 
>> > 
>> > http://master:8984/solr/Index/replication
>> > 
>> > 
>> >
>> > Any pointers will be useful.
>> >
>> > Thanks,
>> > Alexandre
>>


Re: Solr 4.0 replication problem

2012-03-23 Thread Erick Erickson
Hmmm, looking at your stack trace in a bit more detail, this is really
suspicious:

Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format
version is not supported in file 'segments_1': -12 (needs to be between -9
and -11)

This *looks* like your Solr version on your slave is older than the version
on your master. Is this possible at all?
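
One quick way to rule it out, as a sketch for a Tomcat deployment (the path
is illustrative):

# run on both master and slave and compare
md5sum $CATALINA_HOME/webapps/solr/WEB-INF/lib/lucene-core-*.jar

Note that identical .war files can still run different code if a stale
exploded webapp directory was left behind on one box, so check what is
actually deployed, not just the war.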

Best
Erick

On Fri, Mar 23, 2012 at 11:03 AM, Hakan İlter  wrote:
> Hi Erick,
>
> I've already tried steps 2 and 3 but it didn't help. It's almost impossible
> to do step 1 for us because of the project deadline.
>
> Do you have any other suggestion?
>
> Thanks for your reply.
>
> On Fri, Mar 23, 2012 at 4:56 PM, Erick Erickson 
> wrote:
>
>> Hmmm, that is odd. But a trunk build from that long ago is going
>> to be almost impossible to debug/fix. The problem with working
>> from trunk is that this kind of problem won't get much attention.
>>
>> I have three suggestions:
>> 1> update to current trunk. NOTE: you'll have to completely
>>     reindex your data, the format of the index has changed
>>     multiple times since then and there's no back-compatibility
>>     maintained with non-released major versions.
>> 2> delete your entire index on the _master_ and re-index from scratch.
>>     If you do this, I'd also delete the entire /data directory
>>     on the slaves before replication as well.
>> 3> Delete your entire index on the _slave_ and see if you get a
>>     clean replication.
>>
>> <3> is the least painful, <1> the most so I'd go in reverse order
>> for the above.
>>
>> Best
>> Erick
>>
>>
>> On Wed, Mar 21, 2012 at 8:49 AM, Hakan İlter  wrote:
>> > Hi everyone,
>> >
>> > We are using a very early version of Solr 4.0 and we have some replication
>> > problems. Actually, we used this build for more than one year without any
>> > problem, but when I made some changes to schema.xml, the following problem
>> > started.
>> >
>> > I've just changed schema.xml by adding a multiValued="true" attribute to
>> > two dynamic fields.
>> >
>> > Before:
>> > 
>> > > />
>> > 
>> > 
>> >
>> > After:
>> > 
>> > > > multiValued="true"* />
>> > > > multiValued="true"* />
>> > 
>> >
>> > After starting the Tomcats with the new configuration, no problems
>> > occurred. But after a while, I'm seeing this error:
>> >
>> > *Mar 20, 2012 2:00:05 PM org.apache.solr.handler.ReplicationHandler
>> doFetch
>> > SEVERE: SnapPull failed
>> > org.apache.solr.common.SolrException: Index fetch failed :
>> >        at
>> > org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:340)
>> >        at
>> >
>> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
>> >        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:166)
>> >        at
>> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>> >        at
>> >
>> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>> >        at
>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>> >        at
>> >
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>> >        at
>> >
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
>> >        at
>> >
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
>> >        at
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >        at
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >        at java.lang.Thread.run(Thread.java:662)
>> > Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format
>> > version is not supported in file 'segments_1': -12 (needs to be between
>> -9
>> > and -11)
>> >        at
>> >
>> org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:51)
>> >        at
>> org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:230)
>> >        at
>> >
>> org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:169)
>> >        at
>> org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:770)
>> >        at
>> > org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:83)
>> >        at
>> >
>> org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102)
>> >        at
>> >
>> org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:111)
>> >        at
>> >
>> org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:297)
>> >        at
>> org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:484)
>> >        at
>> > org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:330)
>> >        ... 11 more
>> > Mar 20, 2012 2:00:16 PM org.apache.solr.update.SolrIndexWriter finalize
>> > SEVERE: SolrIndexWriter was not closed prior to finalize(), indi

Re: Slave index size growing fast

2012-03-23 Thread Tomás Fernández Löbbe
Alexandre, additionally to what Erick said, you may want to check in the
slave if what's 300+GB is the "data" directory or the "index.<timestamp>"
directory.

On Fri, Mar 23, 2012 at 12:25 PM, Erick Erickson wrote:

> not really, unless perhaps you're issuing commits or optimizes
> on the _slave_ (which you should NOT do).
>
> Replication happens based on the version of the index on the master.
> True, it starts out as a timestamp, but then successive versions
> just have that number incremented. The version number
> in the index on the slave is compared against the one on the master,
> but the actual time (on the slave or master) is irrelevant. This is
> explicitly to avoid problems with time synching across
> machines/timezones/whatever.
>
> It would be instructive to look at the admin/info page to see what
> the index version is on the master and slave.
>
> But, if you optimize or commit (I think) on the _slave_, you might
> change the timestamp and mess things up (although I'm reaching
> here, I don't know this for certain).
>
> What does the index look like on the slave as compared to the master?
> Are there just a bunch of files on the slave? Or a bunch of directories?
>
> Instead of re-indexing on the master, you could try to bring down the
> slave, blow away the entire index and start it back up. Since this is a
> production system, I'd only try this if I had more than one slave. Although
> you could bring up a new slave and attach it to the master and see
> what happens there. You wouldn't affect production if you didn't point
> incoming requests at it...
>
> Best
> Erick
>
> On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco 
> wrote:
> > Erick,
> >
> > We're using Solr 3.3 on Linux (CentOS 5.6).
> > The /data dir on master is actually 1.2G.
> >
> > I haven't tried to recreate the index yet. Since it's a production
> > environment,
> > I guess that I can stop replication and indexing and then recreate the
> > master index to see if it makes any difference.
> >
> > Also just noticed another thread here named "Simple Slave Replication
> > Question" that tells that it could be a problem if I'm seeing an
> > /data/index with an timestamp on the slave node.
> > Is this info relevant to this issue?
> >
> > Thanks,
> > Alexandre
> >
> > On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson <
> erickerick...@gmail.com>wrote:
> >
> >> What version of Solr and what operating system?
> >>
> >> But regardless, this shouldn't be happening. Indexes can
> >> temporarily double in size, but any extras should be
> >> cleaned up relatively soon.
> >>
> >> On the master, what's the total size of the /data directory?
> >> I'm a little suspicious of the  on your master, but I
> >> don't think that's the root of your problem
> >>
> >> Are you recreating the index on the master (by deleting the
> >> index directory and starting over)?
> >>
> >> This is unusual, and I suspect it's something odd in your configuration,
> >> but I confess I'm at a loss as to what.
> >>
> >> Best
> >> Erick
> >>
> >> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco 
> >> wrote:
> >> > Hello,
> >> >
> >> > We have a Solr index that has an average of 1.19 GB in size.
> >> > After configuring the replication, the slave machine is growing the
> index
> >> > size exponentially.
> >> > Currently we have a slave with 323.44 GB in size.
> >> > Is there anything that could cause this behavior?
> >> > The current replication config is below.
> >> >
> >> > Master:
> >> > 
> >> > 
> >> > commit
> >> > startup
> >> > startup
> >> > 
> >> >
> >>
> elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt
> >> > 
> >> > 
> >> > 
> >> >
> >> > Slave:
> >> > 
> >> > 
> >> > http://master:8984/solr/Index/replication
> >> > 
> >> > 
> >> >
> >> > Any pointers will be useful.
> >> >
> >> > Thanks,
> >> > Alexandre
> >>
>


Re: querying on shards

2012-03-23 Thread stockii
@Shawn Heisey-4

what does the requestHandler of your broker look like? i'm thinking about
your idea and might do the same ;)

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents other Cores < 200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx


Re: Solr 4.0 replication problem

2012-03-23 Thread Hakan İlter
Hi Erick,

It's not possible because both master and slaves are using the same binaries.

Thanks...


On Fri, Mar 23, 2012 at 5:30 PM, Erick Erickson wrote:

> Hmmm, looking at your stack trace in a bit more detail, this is really
> suspicious:
>
> Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format
> version is not supported in file 'segments_1': -12 (needs to be between -9
> and -11)
>
> This *looks* like your Solr version on your slave is older than the version
> on your master. Is this possible at all?
>
> Best
> Erick
>
> On Fri, Mar 23, 2012 at 11:03 AM, Hakan İlter 
> wrote:
> > Hi Erick,
> >
> > I've already tried steps 2 and 3 but it didn't help. It's almost
> impossible
> > to do step 1 for us because of the project deadline.
> >
> > Do you have any other suggestion?
> >
> > Thanks for your reply.
> >
> > On Fri, Mar 23, 2012 at 4:56 PM, Erick Erickson  >wrote:
> >
> >> Hmmm, that is odd. But a trunk build from that long ago is going
> >> to be almost impossible to debug/fix. The problem with working
> >> from trunk is that this kind of problem won't get much attention.
> >>
> >> I have three suggestions:
> >> 1> update to current trunk. NOTE: you'll have to completely
> >> reindex your data, the format of the index has changed
> >> multiple times since then and there's no back-compatibility
> >> maintained with non-released major versions.
> >> 2> delete your entire index on the _master_ and re-index from scratch.
> >> If you do this, I'd also delete the entire /data
> directory
> >> on the slaves before replication as well.
> >> 3> Delete your entire index on the _slave_ and see if you get a
> >> clean replication.
> >>
> >> <3> is the least painful, <1> the most so I'd go in reverse order
> >> for the above.
> >>
> >> Best
> >> Erick
> >>
> >>
> >> On Wed, Mar 21, 2012 at 8:49 AM, Hakan İlter 
> wrote:
> >> > Hi everyone,
> >> >
> >> > We are using a very early version of Solr 4.0 and we have some replication
> >> > problems. Actually, we used this build for more than one year without any
> >> > problem, but when I made some changes to schema.xml, the following
> problem
> >> > started.
> >> >
> >> > I've just changed schema.xml by adding a multiValued="true" attribute
> to
> >> > two dynamic fields.
> >> >
> >> > Before:
> >> > 
> >> >  stored="false"
> >> />
> >> >  stored="false" />
> >> > 
> >> >
> >> > After:
> >> > 
> >> >  stored="false" *
> >> > multiValued="true"* />
> >> >  stored="false" *
> >> > multiValued="true"* />
> >> > 
> >> >
> >> > After starting the Tomcats with the new configuration, no problems
> >> > occurred. But after a while, I'm seeing this error:
> >> >
> >> > *Mar 20, 2012 2:00:05 PM org.apache.solr.handler.ReplicationHandler
> >> doFetch
> >> > SEVERE: SnapPull failed
> >> > org.apache.solr.common.SolrException: Index fetch failed :
> >> >at
> >> >
> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:340)
> >> >at
> >> >
> >>
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
> >> >at
> org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:166)
> >> >at
> >> >
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> >> >at
> >> >
> >>
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> >> >at
> >> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> >> >at
> >> >
> >>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> >> >at
> >> >
> >>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> >> >at
> >> >
> >>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> >> >at
> >> >
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >> >at
> >> >
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >> >at java.lang.Thread.run(Thread.java:662)
> >> > Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format
> >> > version is not supported in file 'segments_1': -12 (needs to be
> between
> >> -9
> >> > and -11)
> >> >at
> >> >
> >>
> org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:51)
> >> >at
> >> org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:230)
> >> >at
> >> >
> >>
> org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:169)
> >> >at
>> org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:770)
> >> >at
> >> > org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:83)
> >> >at
> >> >
> >>
> org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102)
> >> >at
> >> >
> >>
> org.apache.solr.update.D

Unexpected Tika Exception extracting text from a PDF file.

2012-03-23 Thread Jon Dragt
Howdy Folks,

I'm stumped and hope somebody can give me some clues on how to work around
this occasional error I'm getting.

 

I've got a .Net console program using SolrNet to scour certain folders at
certain times and extract text from PDF files and index them. It succeeds on
a majority of the files, but it fails on several test files. Though I'm new
to this environment, I gather the SolrNet library calls on Solr (v. 3.5.0)
to do this, which in turn calls on the Tika library (v. 0.10), which calls
on the PDFBox library (v. 1.6.0).

 

To try and isolate the problem I took SolrNet and .Net out of the equation
and switched to a Linux console. I downloaded the pdfbox-app-1.6.0.jar and
executed:

java -jar pdfbox-app-1.6.0.jar ExtractText -console a.pdf

Everything worked fine.

 

I moved up to Tika. Downloaded tika-app-0.10.jar and executed:

java -jar tika-app-0.10.jar -t a.pdf

And again everything worked fine.

 

I then executed:

curl 'http://localhost:8993/solr/MyCore/update/extract?map.content=text&commit=true' -F file=@a.pdf

And it failed with the following output. (Note: the above command works fine
with other PDF files, but fails on these few PDF files.)

 
Error 500 org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:213)
        ... 22 more
Caused by: java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.font.PDFont.getEncodingFromFont(PDFont.java:832)
        at org.apache.pdfbox.pdmodel.font.PDFont.determineEncoding(PDFont.java:293)
        at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:178)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:79)
        at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:139)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:109)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
        at org.apache.pdfbo
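
Since the standalone tika-app and pdfbox-app jars handle this same file, I
plan to double-check which parser jars the Solr webapp actually loads, as a
sketch (the path below is the stock 3.5 distribution layout):

ls apache-solr-3.5.0/contrib/extraction/lib | egrep -i 'tika|pdfbox|fontbox|jempbox'

in case a pdfbox jar paired with a mismatched fontbox, or a duplicate copy
elsewhere on the classpath, behaves differently from the self-contained app
jars.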

Re: Commit Strategy for SolrCloud when Talking about 200 million records.

2012-03-23 Thread I-Chiang Chen
We saw a couple of distinct errors, and all machines in a shard are identical:

-On the leader of the shard
Mar 21, 2012 1:58:34 AM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode:
http://blah.blah.net:8983/solr/master2-slave1/:org.apache.solr.common.SolrException:
Map failed
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:488)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:251)
at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:319)
at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:300)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

followed by

Mar 21, 2012 1:58:52 AM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode:
http://blah.blah.net:8983/solr/master2-slave1/:org.apache.solr.common.SolrException:
java.io.IOException: Map failed
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:488)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:251)
at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:319)
at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:300)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

followed by

Mar 21, 2012 1:58:55 AM
org.apache.solr.update.processor.DistributedUpdateProcessor doFinish
INFO: Could not tell a replica to recover
org.apache.solr.client.solrj.SolrServerException:
http://blah.blah.net:8983/solr
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:496)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:251)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:347)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:816)
at
org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:176)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:433)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at
org.apache.commons.httpclient.HttpConnecti

Re: Commit Strategy for SolrCloud when Talking about 200 million records.

2012-03-23 Thread Mark Miller

On Mar 23, 2012, at 12:49 PM, I-Chiang Chen wrote:

> Caused by: java.lang.OutOfMemoryError: Map failed

Hmm...looks like this is the key info here. 
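
"Map failed" from MMapDirectory usually means exhausted virtual address space
or the kernel's per-process mmap limit rather than Java heap. A sketch of the
usual Linux checks (the sysctl value is illustrative):

java -version                      # confirm a 64-bit JVM
ulimit -v                          # should be unlimited for the Solr process
cat /proc/sys/vm/max_map_count     # per-process limit on mmap regions
sudo sysctl -w vm.max_map_count=262144   # raise it if many segments are mapped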

- Mark Miller
lucidimagination.com


Re: Slave index size growing fast

2012-03-23 Thread Alexandre Rocco
Erick,

The master /data dir contains only an index dir with a bunch of files.
In the slave, the /data dir contains an index.20110926152410 dir with a lot
more files than the master. That is quite strange to me.

I guess that the config is right, since we have another slave that is
running fine with the same config.
The best bet would be to clean up this messed-up slave and try to sync it again
and see what happens.

Thanks

On Fri, Mar 23, 2012 at 12:25 PM, Erick Erickson wrote:

> not really, unless perhaps you're issuing commits or optimizes
> on the _slave_ (which you should NOT do).
>
> Replication happens based on the version of the index on the master.
> True, it starts out as a timestamp, but then successive versions
> just have that number incremented. The version number
> in the index on the slave is compared against the one on the master,
> but the actual time (on the slave or master) is irrelevant. This is
> explicitly to avoid problems with time synching across
> machines/timezones/whatever.
>
> It would be instructive to look at the admin/info page to see what
> the index version is on the master and slave.
>
> But, if you optimize or commit (I think) on the _slave_, you might
> change the timestamp and mess things up (although I'm reaching
> here, I don't know this for certain).
>
> What does the index look like on the slave as compared to the master?
> Are there just a bunch of files on the slave? Or a bunch of directories?
>
> Instead of re-indexing on the master, you could try to bring down the
> slave, blow away the entire index and start it back up. Since this is a
> production system, I'd only try this if I had more than one slave. Although
> you could bring up a new slave and attach it to the master and see
> what happens there. You wouldn't affect production if you didn't point
> incoming requests at it...
>
> Best
> Erick
>
> On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco 
> wrote:
> > Erick,
> >
> > We're using Solr 3.3 on Linux (CentOS 5.6).
> > The /data dir on master is actually 1.2G.
> >
> > I haven't tried to recreate the index yet. Since it's a production
> > environment,
> > I guess that I can stop replication and indexing and then recreate the
> > master index to see if it makes any difference.
> >
> > Also just noticed another thread here named "Simple Slave Replication
> > Question" that tells that it could be a problem if I'm seeing an
> > /data/index with an timestamp on the slave node.
> > Is this info relevant to this issue?
> >
> > Thanks,
> > Alexandre
> >
> > On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson <
> erickerick...@gmail.com>wrote:
> >
> >> What version of Solr and what operating system?
> >>
> >> But regardless, this shouldn't be happening. Indexes can
> >> temporarily double in size, but any extras should be
> >> cleaned up relatively soon.
> >>
> >> On the master, what's the total size of the /data directory?
> >> I'm a little suspicious of the  on your master, but I
> >> don't think that's the root of your problem
> >>
> >> Are you recreating the index on the master (by deleting the
> >> index directory and starting over)?
> >>
> >> This is unusual, and I suspect it's something odd in your configuration,
> >> but I confess I'm at a loss as to what.
> >>
> >> Best
> >> Erick
> >>
> >> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco 
> >> wrote:
> >> > Hello,
> >> >
> >> > We have a Solr index that has an average of 1.19 GB in size.
> >> > After configuring the replication, the slave machine is growing the
> index
> >> > size exponentially.
> >> > Currently we have a slave with 323.44 GB in size.
> >> > Is there anything that could cause this behavior?
> >> > The current replication config is below.
> >> >
> >> > Master:
> >> > 
> >> > 
> >> > commit
> >> > startup
> >> > startup
> >> > 
> >> >
> >>
> elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt
> >> > 
> >> > 
> >> > 
> >> >
> >> > Slave:
> >> > 
> >> > 
> >> > http://master:8984/solr/Index/replication
> >> > 
> >> > 
> >> >
> >> > Any pointers will be useful.
> >> >
> >> > Thanks,
> >> > Alexandre
> >>
>


Re: Slave index size growing fast

2012-03-23 Thread Alexandre Rocco
Tomás,

The 300+GB size is only inside the index.20110926152410 dir. Inside there
are a lot of files.
I am almost convinced that something is messed up, like someone committed on
this slave machine.

Thanks

2012/3/23 Tomás Fernández Löbbe 

> Alexandre, additionally to what Erick said, you may want to check in the
> slave if what's 300+GB is the "data" directory or the "index.<timestamp>"
> directory.
>
> On Fri, Mar 23, 2012 at 12:25 PM, Erick Erickson  >wrote:
>
> > not really, unless perhaps you're issuing commits or optimizes
> > on the _slave_ (which you should NOT do).
> >
> > Replication happens based on the version of the index on the master.
> > True, it starts out as a timestamp, but then successive versions
> > just have that number incremented. The version number
> > in the index on the slave is compared against the one on the master,
> > but the actual time (on the slave or master) is irrelevant. This is
> > explicitly to avoid problems with time synching across
> > machines/timezones/whataver
> >
> > It would be instructive to look at the admin/info page to see what
> > the index version is on the master and slave.
> >
> > But, if you optimize or commit (I think) on the _slave_, you might
> > change the timestamp and mess things up (although I'm reaching
> > here, I don't know this for certain).
> >
> > What does the index look like on the slave as compared to the master?
> > Are there just a bunch of files on the slave? Or a bunch of directories?
> >
> > Instead of re-indexing on the master, you could try to bring down the
> > slave, blow away the entire index and start it back up. Since this is a
> > production system, I'd only try this if I had more than one slave.
> Although
> > you could bring up a new slave and attach it to the master and see
> > what happens there. You wouldn't affect production if you didn't point
> > incoming requests at it...
> >
> > Best
> > Erick
> >
> > On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco 
> > wrote:
> > > Erick,
> > >
> > > We're using Solr 3.3 on Linux (CentOS 5.6).
> > > The /data dir on master is actually 1.2G.
> > >
> > > I haven't tried to recreate the index yet. Since it's a production
> > > environment,
> > > I guess that I can stop replication and indexing and then recreate the
> > > master index to see if it makes any difference.
> > >
> > > Also just noticed another thread here named "Simple Slave Replication
> > > Question" that tells that it could be a problem if I'm seeing an
> > > /data/index with an timestamp on the slave node.
> > > Is this info relevant to this issue?
> > >
> > > Thanks,
> > > Alexandre
> > >
> > > On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson <
> > erickerick...@gmail.com>wrote:
> > >
> > >> What version of Solr and what operating system?
> > >>
> > >> But regardless, this shouldn't be happening. Indexes can
> > >> temporarily double in size, but any extras should be
> > >> cleaned up relatively soon.
> > >>
> > >> On the master, what's the total size of the /data
> directory?
> > >> I'm a little suspicious of the  on your master, but I
> > >> don't think that's the root of your problem
> > >>
> > >> Are you recreating the index on the master (by deleting the
> > >> index directory and starting over)?
> > >>
> > >> This is unusual, and I suspect it's something odd in your
> configuration,
> > >> but I confess I'm at a loss as to what.
> > >>
> > >> Best
> > >> Erick
> > >>
> > >> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco 
> > >> wrote:
> > >> > Hello,
> > >> >
> > >> > We have a Solr index that has an average of 1.19 GB in size.
> > >> > After configuring the replication, the slave machine is growing the
> > index
> > >> > size exponentially.
> > >> > Currently we have a slave with 323.44 GB in size.
> > >> > Is there anything that could cause this behavior?
> > >> > The current replication config is below.
> > >> >
> > >> > Master:
> > >> > 
> > >> > 
> > >> > commit
> > >> > startup
> > >> > startup
> > >> > 
> > >> >
> > >>
> >
> elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt
> > >> > 
> > >> > 
> > >> > 
> > >> >
> > >> > Slave:
> > >> > 
> > >> > 
> > >> > http://master:8984/solr/Index/replication
> 
> > >> > 
> > >> > 
> > >> >
> > >> > Any pointers will be useful.
> > >> >
> > >> > Thanks,
> > >> > Alexandre
> > >>
> >
>


Tags and Folksonomies

2012-03-23 Thread Nishant Chandra
Suppose I have content which has a title and a description. Users can tag content
and search content based on tag, title and description. Tags have more
weight.

Any inputs on how indexing and retrieval would work in Solr, given both
content and tags? Has anyone implemented search based on collaborative
tagging?
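
What I have in mind is per-field boosting with the dismax parser, as a sketch
(the field names and boost values here are placeholders I made up):

curl 'http://localhost:8983/solr/select' \
  --data-urlencode 'defType=dismax' \
  --data-urlencode 'q=mountain bike' \
  --data-urlencode 'qf=tags^3 title^1.5 description'

with user-supplied tags indexed into a multiValued "tags" field so the qf
boosts handle the weighting at query time. Is that the right approach?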

Thanks,
Nishant


Re: Solr 4.0 replication problem

2012-03-23 Thread Erick Erickson
In that case, I'm kind of stuck. You've already rebuilt your index
from scratch and removed it from your slaves. That should have
cleared out most everything that could be an issue. I'd suggest
you set up a pair of machines from scratch and try to set up an
index/replication with your current schema. If a fresh install shows
the same problem, you've probably found a bona-fide bug. But
you'll probably have to fix it yourself if you can't upgrade.

But I really doubt it's a bug; this just smells way too much like
you have something going on you aren't aware of ("interesting"
CLASSPATH issues? Multiple installations of Solr? Someone,
sometime, indexed things to the master you didn't know about
with a newer version? Whatever?)

Sorry I can't be more help
Erick

On Fri, Mar 23, 2012 at 12:01 PM, Hakan İlter  wrote:
> Hi Erick,
>
> It's not possible because both master and slaves are using the same binaries.
>
> Thanks...
>
>
> On Fri, Mar 23, 2012 at 5:30 PM, Erick Erickson 
> wrote:
>
>> Hmmm, looking at your stack trace in a bit more detail, this is really
>> suspicious:
>>
>> Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format
>> version is not supported in file 'segments_1': -12 (needs to be between -9
>> and -11)
>>
>> This *looks* like your Solr version on your slave is older than the version
>> on your master. Is this possible at all?
>>
>> Best
>> Erick
>>
>> On Fri, Mar 23, 2012 at 11:03 AM, Hakan İlter 
>> wrote:
>> > Hi Erick,
>> >
>> > I've already tried steps 2 and 3 but it didn't help. It's almost
>> impossible
>> > to do step 1 for us because of the project deadline.
>> >
>> > Do you have any other suggestion?
>> >
>> > Thanks for your reply.
>> >
>> > On Fri, Mar 23, 2012 at 4:56 PM, Erick Erickson > >wrote:
>> >
>> >> Hmmm, that is odd. But a trunk build from that long ago is going
>> >> to be almost impossible to debug/fix. The problem with working
>> >> from trunk is that this kind of problem won't get much attention.
>> >>
>> >> I have three suggestions:
>> >> 1> update to current trunk. NOTE: you'll have to completely
>> >>     reindex your data, the format of the index has changed
>> >>     multiple times since then and there's no back-compatibility
>> >>     maintained with non-released major versions.
>> >> 2> delete your entire index on the _master_ and re-index from scratch.
>> >>     If you do this, I'd also delete the entire /data
>> directory
>> >>     on the slaves before replication as well.
>> >> 3> Delete your entire index on the _slave_ and see if you get a
>> >>     clean replication.
>> >>
>> >> <3> is the least painful, <1> the most so I'd go in reverse order
>> >> for the above.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >>
>> >> On Wed, Mar 21, 2012 at 8:49 AM, Hakan İlter 
>> wrote:
>> >> > Hi everyone,
>> >> >
>> >> > We are using a very early version of Solr 4.0 and we have some replication
>> >> > problems. Actually, we used this build for more than one year without any
>> >> > problem, but when I made some changes to schema.xml, the following
>> problem
>> >> > started.
>> >> >
>> >> > I've just changed schema.xml by adding a multiValued="true" attribute
>> to
>> >> > two dynamic fields.
>> >> >
>> >> > Before:
>> >> > 
>> >> > > stored="false"
>> >> />
>> >> > > stored="false" />
>> >> > 
>> >> >
>> >> > After:
>> >> > 
>> >> > > stored="false" *
>> >> > multiValued="true"* />
>> >> > > stored="false" *
>> >> > multiValued="true"* />
>> >> > 
>> >> >
>> >> > After starting the Tomcats with the new configuration, no problems
>> >> > occurred. But after a while, I'm seeing this error:
>> >> >
>> >> > *Mar 20, 2012 2:00:05 PM org.apache.solr.handler.ReplicationHandler
>> >> doFetch
>> >> > SEVERE: SnapPull failed
>> >> > org.apache.solr.common.SolrException: Index fetch failed :
>> >> >        at
>> >> >
>> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:340)
>> >> >        at
>> >> >
>> >>
>> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
>> >> >        at
>> org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:166)
>> >> >        at
>> >> >
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>> >> >        at
>> >> >
>> >>
>> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>> >> >        at
>> >> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>> >> >        at
>> >> >
>> >>
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>> >> >        at
>> >> >
>> >>
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
>> >> >        at
>> >> >
>> >>
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
>> >> >        at
>> >> >
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >> >        at
>> >> >
>> >>
>> jav

Field length and scoring

2012-03-23 Thread Erik Fäßler
Hello there,

I have a quite basic question, but my Solr is behaving in a way I'm not quite
sure about.

The setup is simple: I have a field "suggestionText" in which single strings 
are indexed. Schema:

 

Since I want this field to serve for a suggestion-search, the input string is 
analyzed by an EdgeNGramFilter.

Let's have a look at two cases:

case1: Input string was 'il2'
case2: Input string was 'il24'

As I can see from the Solr-admin-analysis-page, case1 is analysed as

i
il
il2

and case2 as

i
il
il2
il24

As you would expect. The point now is: when I search for 'il2', I would expect
case1 to have a higher score than case2. I thought this way because I did not
omit norms, and thus the shorter field should get a (slightly) higher
score. However, the scores in both cases are identical, and so it happens that
'il24' is suggested prior to 'il2'.

Perhaps I misunderstood norms or the notion of "field length". I would be
grateful if you could help me out here and give me advice on how to
accomplish the desired behavior.

Thanks and best regards,

Erik

Re: Slave index size growing fast

2012-03-23 Thread Erick Erickson
Alexandre:

Have you changed anything like  on your slave?
And do you have more than one slave? If you do, have you considered
just blowing away the entire .../data directory on the slave and letting
it re-start from scratch? I'd take the slave out of service for the
duration of this operation, or do it when you are OK with some number of
requests going to an empty index

Because having an index.<timestamp> directory indicates that at some point
someone forced the slave to get out of sync, possibly as you say by
doing a commit, or sending docs to it to be indexed, or some such. Starting
the slave over should fix that if it's the root of your problem.

Note a curious thing about the index version. When you start indexing, the
index version is a timestamp. However, from that point on, when the index
changes, the version number is just incremented (not made the current
time). This is to avoid problems with masters and slaves having different
times. But a consequence of that is if your slave somehow gets an index
that's newer, the replication process does the best it can to not delete
indexes that are out of sync with the master and saves them away. This
might be what you're seeing.
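
A quick way to check whether that happened, as a sketch (the data path is
illustrative):

ls -ld /path/to/solr/data/index*
cat /path/to/solr/data/index.properties 2>/dev/null

Replication writes index.properties when it switches to an index.<timestamp>
directory, so more than one index* entry, or an index.properties pointing at
a timestamped directory, means the slave diverged at some point.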

I'm grasping at straws a bit here, but this seems possible.

Best
Erick

On Fri, Mar 23, 2012 at 1:16 PM, Alexandre Rocco  wrote:
> Tomás,
>
> The 300+GB size is only inside the index.20110926152410 dir. Inside there
> are a lot of files.
> I am almost convinced that something is messed up, like someone committed on
> this slave machine.
>
> Thanks
>
> 2012/3/23 Tomás Fernández Löbbe 
>
>> Alexandre, additionally to what Erick said, you may want to check in the
>> slave if what's 300+GB is the "data" directory or the "index.<timestamp>"
>> directory.
>>
>> On Fri, Mar 23, 2012 at 12:25 PM, Erick Erickson > >wrote:
>>
>> > not really, unless perhaps you're issuing commits or optimizes
>> > on the _slave_ (which you should NOT do).
>> >
>> > Replication happens based on the version of the index on the master.
>> > True, it starts out as a timestamp, but then successive versions
>> > just have that number incremented. The version number
>> > in the index on the slave is compared against the one on the master,
>> > but the actual time (on the slave or master) is irrelevant. This is
>> > explicitly to avoid problems with time synching across
>> > machines/timezones/whatever.
>> >
>> > It would be instructive to look at the admin/info page to see what
>> > the index version is on the master and slave.
>> >
>> > But, if you optimize or commit (I think) on the _slave_, you might
>> > change the timestamp and mess things up (although I'm reaching
>> > here, I don't know this for certain).
>> >
>> > What does the index look like on the slave as compared to the master?
>> > Are there just a bunch of files on the slave? Or a bunch of directories?
>> >
>> > Instead of re-indexing on the master, you could try to bring down the
>> > slave, blow away the entire index and start it back up. Since this is a
>> > production system, I'd only try this if I had more than one slave.
>> Although
>> > you could bring up a new slave and attach it to the master and see
>> > what happens there. You wouldn't affect production if you didn't point
>> > incoming requests at it...
>> >
>> > Best
>> > Erick
>> >
>> > On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco 
>> > wrote:
>> > > Erick,
>> > >
>> > > We're using Solr 3.3 on Linux (CentOS 5.6).
>> > > The /data dir on master is actually 1.2G.
>> > >
>> > > I haven't tried to recreate the index yet. Since it's a production
>> > > environment,
>> > > I guess that I can stop replication and indexing and then recreate the
>> > > master index to see if it makes any difference.
>> > >
>> > > Also just noticed another thread here named "Simple Slave Replication
>> > > Question" that tells that it could be a problem if I'm seeing an
>> > > /data/index with an timestamp on the slave node.
>> > > Is this info relevant to this issue?
>> > >
>> > > Thanks,
>> > > Alexandre
>> > >
>> > > On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson <
>> > erickerick...@gmail.com>wrote:
>> > >
>> > >> What version of Solr and what operating system?
>> > >>
>> > >> But regardless, this shouldn't be happening. Indexes can
>> > >> temporarily double in size, but any extras should be
>> > >> cleaned up relatively soon.
>> > >>
>> > >> On the master, what's the total size of the /data
>> directory?
>> > >> I'm a little suspicious of the  on your master, but I
>> > >> don't think that's the root of your problem
>> > >>
>> > >> Are you recreating the index on the master (by deleting the
>> > >> index directory and starting over)?
>> > >>
>> > >> This is unusual, and I suspect it's something odd in your
>> configuration,
>> > >> but I confess I'm at a loss as to what.
>> > >>
>> > >> Best
>> > >> Erick
>> > >>
>> > >> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco 
>> > >> wrote:
>> > >> > Hello,
>> > >> >
>> > >> > We have a Solr index that has an average of 1.19 GB in size.

Re: Field length and scoring

2012-03-23 Thread Erick Erickson
Erik:

The field length is, I believe, based on _tokens_, not characters.
Both of your examples
are exactly one token long, so the scores are probably identical.

Also, the field length is encoded in a byte (as I remember). So it's
quite possible that,
even if the lengths of these fields were 3 and 4 instead of both being
1, the value
stored for the length norms would be the same number.
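
For reference, the default length norm is 1/sqrt(numTerms) computed over
tokens, so two single-token values both get 1/sqrt(1) = 1.0 and score
identically; and because that float is then squeezed into a single byte,
even genuinely different lengths can end up with the same stored norm.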

HTH
Erick

On Fri, Mar 23, 2012 at 2:40 PM, Erik Fäßler  wrote:
> Hello there,
>
> I have a quite basic question, but my Solr is behaving in a way I'm not quite
> sure about.
>
> The setup is simple: I have a field "suggestionText" in which single strings 
> are indexed. Schema:
>
>   stored="true"/>
>
> Since I want this field to serve for a suggestion-search, the input string is 
> analyzed by an EdgeNGramFilter.
>
> Let's have a look at two cases:
>
> case1: Input string was 'il2'
> case2: Input string was 'il24'
>
> As I can see from the Solr-admin-analysis-page, case1 is analysed as
>
> i
> il
> il2
>
> and case2 as
>
> i
> il
> il2
> il24
>
> As you would expect. The point now is: When I search for 'il2' I would expect 
> case1 to have a higher score than case2. I thought this way because I did not
> omit norms, and thus the shorter field should get a (slightly)
> higher score. However, the scores in both cases are identical, and so it
> happens that 'il24' is suggested prior to 'il2'.
>
> Perhaps I misunderstood norms or the notion of "field length". I
> would be grateful if you could help me out here and give me advice on how to
> accomplish the desired behavior.
>
> Thanks and best regards,
>
>        Erik


Practical Optimization

2012-03-23 Thread dw5ight
Hey All-

we run a car search engine (http://carsabi.com) with Solr and did some
benchmarking recently after we switched from a hosted service to
self-hosting. In brief, we went from 800ms complex range queries on a 1.5M
document corpus to 43ms. The major shifts were switching from EC2 Large to
EC2 CC8XL, which got us down to 282ms (a 2.82x speed gain due to a 2.75x CPU
speed increase, we think), and then down to 43ms when we sharded to 8 cores.
We tried sharding to 12 and 16 but saw negligible gains after this point.

Anyway, hope this might be useful to someone - we wrote up exact stats and a
step-by-step sharding procedure on our tech blog
(http://carsabi.com/car-news/2012/03/23/optimizing-solr-7x-your-search-speed/)
if anyone's interested.

best
Dwight



Re: Field length and scoring

2012-03-23 Thread Ahmet Arslan
> Also, the field length is encoded in a byte (as I remember).
> So it's
> quite possible that,
> even if the lengths of these fields were 3 and 4 instead of
> both being
> 1, the value
> stored for the length norms would be the same number.

Exactly. http://search-lucene.com/m/uGKRu1pvRjw



Re: querying on shards

2012-03-23 Thread Shawn Heisey

On 3/23/2012 9:55 AM, stockii wrote:

how look your requestHandler of your broker? i think about your idea to do
the same ;)


Here's what I have got for the default request handler in my broker 
core, which is called ncmain. The "rollingStatistics" section is 
applicable to the SOLR-1972 patch.




all
70
name="shards">idxa2.REDACTED.com:8981/solr/inclive,idxa1.REDACTED.com:8981/solr/s0live,idxa1.REDACTED.com:8981/solr/s1live,idxa1.REDACTED.com:8981/solr/s2live,idxa2.REDACTED.com:8981/solr/s3live,idxa2.REDACTED.com:8981/solr/s4live,idxa2.REDACTED.com:8981/solr/s5live


default
false
false
9
5
2
2



spellcheck



604800
16384
5

75
95
99
100




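
The same fan-out also works per request, which is handy for experimenting
before baking it into a handler. A sketch, using a placeholder broker host
and two of the shards above:

curl 'http://broker:8981/solr/ncmain/select' \
  --data-urlencode 'q=test' \
  --data-urlencode 'shards=idxa2.REDACTED.com:8981/solr/inclive,idxa1.REDACTED.com:8981/solr/s0live'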


spellcheck file format - multiple words on a line?

2012-03-23 Thread geeky2
hello all,

for business reasons, we are sourcing the spellcheck file from another
business group.  

the file we receive looks like the example data below

can solr support this type of format - or do i need to process this file
into a format that has a single word on a single line?

thanks for any help
mark



// snipped from spellcheck file sourced from business group

14-INCH CHAIN
14-INCH RIGHT TINE
1/4 open end ignition wrench
150 DEGREES CELSIUS
15 foot I wire
15 INCH
15 WATT
16 HORSEPOWER ENGINE
16 HORSEPOWER GASOLINE ENGINE
16-INCH BAR
16-INCH CHAIN
16l Cross
16p SIXTEEN PIECE FLAT FLEXIBLE CABLE
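
if it turns out i need a single term per line, i am guessing something like
this would do the conversion, as a sketch (the input filename, lowercasing
and de-duping are my assumptions):

tr '[:upper:]' '[:lower:]' < sourced_phrases.txt | tr -s ' \t' '\n' | sort -u > spellings.txt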

