Solr on Several Machines Communication Fails

2013-06-30 Thread Ophir Michaeli
Hi,

I'm running Solr on 4 machines with the following configuration:
Solr version: 4.3
SolrCloud

Machine 1:
Running shard 1 with embedded ZooKeeper:
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

Machine 2:
Running shard 2
java -Djetty.port=7574 -DzkHost=shard1_Machine_IP:9983 -jar start.jar
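For what it's worth, a quick way to confirm that both nodes actually joined the same cluster is to query through ZooKeeper with SolrJ's CloudSolrServer. A minimal sketch, assuming the embedded ZooKeeper above and the default collection name:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class ClusterCheck {
    public static void main(String[] args) throws Exception {
        // zkHost points at the embedded ZooKeeper started on machine 1.
        CloudSolrServer server = new CloudSolrServer("shard1_Machine_IP:9983");
        server.setDefaultCollection("collection1");
        server.connect();  // fails fast if ZooKeeper is unreachable
        // A *:* query must fan out to both shards to succeed.
        System.out.println(server.query(new SolrQuery("*:*")).getResults().getNumFound());
        server.shutdown();
    }
}

If this fails from machine 2 but works on machine 1, the problem is reachability of the registered node addresses rather than the query path.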

Error:

ERROR - 2013-06-27 15:18:06.066; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: no servers hosting shard: 
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149)
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
 
If I run the same two shards on one machine it works; I get a correct response
for queries.

Thanks




No date.gap on pivoted facets

2013-06-30 Thread Dotan Cohen
Consider the following query:
select?q=*:*
&facet=true
&facet.date=added
&facet.date.start=2013-04-01T00:00:00Z
&facet.date.end=2013-06-30T00:00:00Z
&facet.date.gap=%2b7DAYS
&rows=0
&facet.pivot=added,provider

In this query, the facet.date.gap is ignored and each individual
second is faceted on. The issue remains the same even when reversing
the order of the pivot:
&facet.pivot=provider,added

Is this a Solr bug, or am I pivoting wrong? This is on Solr 4.1.0
running on OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode) on
Ubuntu Server 12.04. Thank you!


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Solr cloud shard goes down after many broken pipe exceptions

2013-06-30 Thread spcyrus
First a ClientAbortException occurs, which is expected since there is a timeout
on the client side. The stack trace is as follows:

Jun 30, 2013 2:24:30 PM org.apache.solr.common.SolrException log
SEVERE: null:ClientAbortException:  java.net.SocketException: Broken pipe
        at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:369)
        at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:339)
        at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:392)
        at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:381)
        at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:89)
        at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
        at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:263)
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:106)
        at java.io.OutputStreamWriter.write(OutputStreamWriter.java:190)
        at org.apache.solr.util.FastWriter.flush(FastWriter.java:141)
        at org.apache.solr.util.FastWriter.write(FastWriter.java:55)
        at org.apache.solr.response.JSONWriter.writeStr(JSONResponseWriter.java:449)
        at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:124)
        at org.apache.solr.response.JSONWriter.writeSolrDocument(JSONResponseWriter.java:355)
        at org.apache.solr.response.TextResponseWriter.writeSolrDocumentList(TextResponseWriter.java:222)
        at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:184)
        at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
        at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
        at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)
        at org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:60)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:404)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:282)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:662)

Caused by: java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:756)
        at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:448)
        at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:363)
        at org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite(InternalOutputBuffer.java:780)
        at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:126)
        at org.apache.coyote.http11.InternalOutputBuffer.doWrite(InternalOutputBuffer.java:593)
        at org.apache.coyote.Response.doWrite(Response.java:560)
        at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:364)
        ... 33 more




The above exception occurs a number of times, followed by a shard going down:

Jun 30, 2013 2:24:33 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:162)
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:135)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.Fu

Distributed search results in "SocketException: Connection reset"

2013-06-30 Thread Shahar Davidson
Hi all,

We're getting the exception below sporadically when using distributed search
(using Solr 4.2.1).
Note that 'core_3' is one of the cores mentioned in the 'shards' parameter.

Any ideas anyone?

Thanks,

Shahar.


Jun 03, 2013 5:27:38 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: 
org.apache.solr.client.solrj.SolrServerException: IOException occured when 
talking to server at: http://127.0.0.1:8210/solr/core_3
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:300)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1830)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:365)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
        at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException 
occured when talking to server at: http://127.0.0.1:8210/solr/core_3
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        ... 1 more
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:149)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffe

Re: No date.gap on pivoted facets

2013-06-30 Thread Jack Krupansky
Sorry, but Solr pivot faceting is based solely on "field" facets, not 
"range" (or "date") facets.


You can approximate date gaps by making a copy of your raw date field and 
then manually "gapping" (truncating) the date values so that their discrete 
values correspond to your date gap.


You can do that with an update processor, or do it before you send the data 
to Solr.
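
To make the truncation concrete, here is a minimal client-side sketch in plain Java (a sketch, not the book's script; the start date and 7-day gap are taken from the query above, everything else is illustrative):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateGapper {
    // Floor a raw date down to the start of its gap-sized bucket,
    // counting buckets from a fixed start date.
    static Date truncateToGap(Date raw, Date start, long gapMillis) {
        long offset = raw.getTime() - start.getTime();
        return new Date(start.getTime() + (offset / gapMillis) * gapMillis);
    }

    public static void main(String[] args) throws Exception {
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        iso.setTimeZone(TimeZone.getTimeZone("UTC"));
        Date start = iso.parse("2013-04-01T00:00:00Z");
        Date raw = iso.parse("2013-04-10T13:45:12Z");
        long sevenDays = 7L * 24 * 60 * 60 * 1000;
        // Prints 2013-04-08T00:00:00Z - the value to index in the copied field.
        System.out.println(iso.format(truncateToGap(raw, start, sevenDays)));
    }
}

Indexing the truncated value into a copy of "added" (say, a hypothetical added_week field) leaves that field with only one discrete value per 7-day bucket, so facet.pivot=added_week,provider approximates the date gap.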


In the next release of my book I have a script for a 
StatelessScriptUpdateProcessor (with examples) that supports truncating 
dates to a desired resolution, copying or modifying the input date as 
desired.


-- Jack Krupansky

-Original Message- 
From: Dotan Cohen

Sent: Sunday, June 30, 2013 5:51 AM
To: solr-user@lucene.apache.org
Subject: No date.gap on pivoted facets

Consider the following query:
select?q=*:*
&facet=true
&facet.date=added
&facet.date.start=2013-04-01T00:00:00Z
&facet.date.end=2013-06-30T00:00:00Z
&facet.date.gap=%2b7DAYS
&rows=0
&facet.pivot=added,provider

In this query, the facet.date.gap is ignored and each individual
second is faceted on. The issue remains the same even when reversing
the order of the pivot:
&facet.pivot=provider,added

Is this a Solr bug, or am I pivoting wrong? This is on Solr 4.1.0
running on OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode) on
Ubuntu Server 12.04. Thank you!


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com 



Re: Improving performance to return 2000+ documents

2013-06-30 Thread Utkarsh Sengar
Thanks Erick/Peter.

This is an offline process, used by a relevancy engine implemented around
solr. The engine computes boost scores for related keywords based on
clickstream data.
i.e.: say clickstream has: ipad=upc1,upc2,upc3
I query solr with keyword: "ipad" (to get 2000 documents) and then make 3
individual queries for upc1,upc2,upc3 (which are fast).
The data is then used to compute related keywords to "ipad" with their
boost values.

So, I cannot really replace that, since I need full text search over my
dataset to retrieve top 2000 documents.

I tried paging: I retrieve 500 solr documents 4 times (0-500, 500-1000...),
but don't see any improvements.
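
For reference, the paging attempt above amounts to something like this SolrJ sketch (the URL, query, and field list are placeholders, not the actual setup). Note that each page still forces every shard to collect and sort the top start+rows candidates, which is why plain start/rows paging rarely helps the total time:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class PagedFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint - replace with the real host/collection.
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/prodinfo");
        int pageSize = 500;
        for (int start = 0; start < 2000; start += pageSize) {
            SolrQuery q = new SolrQuery("allText:ipad");
            q.setStart(start);
            q.setRows(pageSize);
            // Fetch only the fields the relevancy engine needs; shipping
            // large stored fields tends to dominate the response time.
            q.setFields("id");
            QueryResponse rsp = server.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                // feed doc into the boost computation
            }
        }
        server.shutdown();
    }
}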


Some questions:
1. Maybe the JVM size might help?
This is what I see in the dashboard:
Physical Memory 76.2%
Swap Space NaN% (don't have any swap space, running on AWS EBS)
File Descriptor Count 4.7%
JVM-Memory 73.8%

Screenshot: http://i.imgur.com/aegKzP6.png

2. Will reducing the shards from 3 to 1 improve performance? (maybe
increase the RAM from 30 to 60GB) The problem I will face in that case will
be fitting 50M documents on 1 machine.

Thanks,
-Utkarsh


On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge wrote:

> Hello Utkarsh,
> This may or may not be relevant for your use-case, but the way we deal with
> this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time
> (user selectable). We can then page the results, changing the start
> parameter to return the next set. This allows us to 'retrieve' millions of
> documents - we just do it at the user's leisure, rather than make them wait
> for the whole lot in one go.
> This works well because users very rarely want to see ALL 2000 (or whatever
> number) documents at one time - it's simply too much to take in at one
> time.
> If your use-case involves an automated or offline procedure (e.g. running a
> report or some data-mining op), then presumably it doesn't matter so much
> if it takes a bit longer (as long as it returns in some reasonable time).
> Have you looked at doing paging on the client-side - this will hugely
> speed up your search time.
> HTH
> Peter
>
>
>
> > On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson wrote:
>
> > Well, depending on how many docs get served
> > from the cache the time will vary. But this is
> > just ugly, if you can avoid this use-case it would
> > be a Good Thing.
> >
> > Problem here is that each and every shard must
> > assemble the list of 2,000 documents (just ID and
> > sort criteria, usually score).
> >
> > Then the node serving the original request merges
> > the sub-lists to pick the top 2,000. Then the node
> > sends another request to each shard to get
> > the full document. Then the node merges this
> > into the full list to return to the user.
> >
> > Solr really isn't built for this use-case, is it actually
> > a compelling situation?
> >
> > And having your document cache set at 1M is kinda
> > high if you have very big documents.
> >
> > FWIW,
> > Erick
> >
> >
> > > On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar wrote:
> >
> > > Also, I don't see a consistent response time from solr, I ran ab again
> > and
> > > I get this:
> > >
> > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
> > >
> > >
> > > Benchmarking x.amazonaws.com (be patient)
> > > Completed 100 requests
> > > Completed 200 requests
> > > Completed 300 requests
> > > Completed 400 requests
> > > Completed 500 requests
> > > Finished 500 requests
> > >
> > >
> > > Server Software:
> > > Server Hostname:        x.amazonaws.com
> > > Server Port:            8983
> > >
> > > Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > Document Length:        1538537 bytes
> > >
> > > Concurrency Level:      10
> > > Time taken for tests:   10.858 seconds
> > > Complete requests:      500
> > > Failed requests:        8
> > >    (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
> > > Write errors:           0
> > > Total transferred:      769297992 bytes
> > > HTML transferred:       769268492 bytes
> > > Requests per second:    46.05 [#/sec] (mean)
> > > Time per request:       217.167 [ms] (mean)
> > > Time per request:       21.717 [ms] (mean, across all concurrent requests)
> > > Transfer rate:          69187.90 [Kbytes/sec] received
> > >
> > > Connection Times (ms)
> > >               min  mean[+/-sd] median   max
> > > Connect:        0    0   0.3      0       2
> > > Processing:   110  215  72.0    190     497
> > > Waiting:       91  180  70.5    152     473
> > > Total:        112  216  72.0    191     497
> > >
> > > Percentage of the requests served within a certain time (ms)
> > >   50%    191
> > >   66%    225
> > >   75%    252
> > >   80%    272
> > >   90%    319
> > >   95%    364
> > >   98%    420
> > >   99%    453
> > >  100%    497 (longest request)
> > >
> > >
> > > Som

Re: cores sharing an instance

2013-06-30 Thread Peyman Faratin
I see. If I wanted to try the second option ("find a place inside Solr before 
the core is created"), then where would that place be in the application's 
startup flow? Currently, each core loads its app caches via a request handler 
(in solrconfig.xml) that initializes the Java class that does the loading. For 
instance:


   
[solrconfig.xml snippet stripped by the list archive - it registered a request
handler whose defaults pointed at the AppCaches class]


So each core has its own core-specific cachedResources handler. Where in Solr 
would I need to place the AppCaches code to make it visible to all the other 
cores?

thank you Roman

On Jun 29, 2013, at 10:58 AM, Roman Chyla  wrote:

> Cores can be reloaded; they live inside the Solr core loader (I forget the
> exact name) and they will have different classloaders (that's a servlet
> thing), so if you want singletons you must load them outside of the core,
> using a parent classloader - in the case of Jetty, this means writing your
> own Jetty initialization or config to force shared classloaders, or finding
> a place inside Solr before the core is created. Google for montysolr to see
> an example of the first approach.
> 
> But, unless you really have no other choice, using singletons is IMHO a bad
> idea in this case
> 
> Roman
> 
> On 29 Jun 2013 10:18, "Peyman Faratin"  wrote:
>> 
>> It's the singleton pattern, where in my case I want an object (which is
> RAM expensive) to be a centralized coordinator of application logic.
>> 
>> thank you
>> 
>> On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar 
> wrote:
>> 
>>> There is very little shared between multiple cores (instanceDir paths,
>>> logging config maybe?). Why are you trying to do this?
>>> 
>>> On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin 
> wrote:
 Hi
 
 I have a multicore setup (in 4.3.0). Is it possible for one core to
> share an instance of its class with other cores at run time? i.e.
 
 At run time core 1 makes an instance of object O_i
 
 core 1 --> object O_i
 core 2
 ---
 core n
 
 then can core K access O_i? I know they can share properties but is it
> possible to share objects?
 
 thank you
 
>>> 
>>> 
>>> 
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>> 



Re: cores sharing an instance

2013-06-30 Thread Peyman Faratin
That is what I had assumed, but it appears not to be the case. A class (and its 
properties) loaded by one core is not visible to classes in another core - even 
in the same JVM. 
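
For reference, a minimal sketch of the singleton pattern under discussion; whether two cores see the same instance depends entirely on which classloader loads the class (the class name and types here are illustrative, not the poster's actual code):

// If this class sits in a per-core lib/ directory, each core's resource
// loader loads its own copy and the static field is NOT shared - which
// matches the behavior described above. Loaded once from a classpath
// shared by all cores (e.g. the webapp or container level), the instance
// becomes JVM-wide.
public final class AppCachesHolder {
    private static Object instance;  // stand-in for the RAM-expensive AppCaches type

    private AppCachesHolder() {}

    public static synchronized Object get() {
        if (instance == null) {
            instance = new Object();  // placeholder for the expensive AppCaches init
        }
        return instance;
    }
}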

Peyman

On Jun 29, 2013, at 1:23 PM, Erick Erickson  wrote:

> Well, the code is all in the same JVM, so there's no
> reason a singleton approach wouldn't work that I
> can think of. All the multithreaded caveats apply.
> 
> Best
> Erick
> 
> 
> On Fri, Jun 28, 2013 at 3:44 PM, Peyman Faratin wrote:
> 
>> Hi
>> 
>> I have a multicore setup (in 4.3.0). Is it possible for one core to share
>> an instance of its class with other cores at run time? i.e.
>> 
>> At run time core 1 makes an instance of object O_i
>> 
>> core 1 --> object O_i
>> core 2
>> ---
>> core n
>> 
>> then can core K access O_i? I know they can share properties but is it
>> possible to share objects?
>> 
>> thank you
>> 
>> 



Re: Distributed search results in "SocketException: Connection reset"

2013-06-30 Thread Lance Norskog

This usually means the end server timed out.

On 06/30/2013 06:31 AM, Shahar Davidson wrote:

Hi all,

We're getting the exception below sporadically when using distributed search
(using Solr 4.2.1).
Note that 'core_3' is one of the cores mentioned in the 'shards' parameter.

Any ideas anyone?

Thanks,

Shahar.


Jun 03, 2013 5:27:38 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: 
org.apache.solr.client.solrj.SolrServerException: IOException occured when 
talking to server at: http://127.0.0.1:8210/solr/core_3
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:300)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1830)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:365)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
        at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException 
occured when talking to server at: http://127.0.0.1:8210/solr/core_3
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        ... 1 more
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at org.apache.http.impl.io.Abst

Re: getting different search results for words with same meaning in Japanese language

2013-06-30 Thread Lance Norskog
The MappingCharFilter allows you to map both characters to one
character. If you do this during indexing and querying, searching with
one should find the other. This is sort of like synonyms, but on a
character-by-character basis.
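
A small self-contained illustration of the idea using the Lucene 4.x API directly (in a Solr schema the equivalent is a MappingCharFilterFactory with a mapping file carrying the same rule; the single mapping below is just an example, a real setup would cover all the variant kana pairs):

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class KatakanaNormalize {
    public static void main(String[] args) throws Exception {
        // Map the small katakana ェ to the full-size エ, so that
        // ソフトウェア and ソフトウエア produce identical character streams.
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("ェ", "エ");
        NormalizeCharMap map = builder.build();

        Reader in = new MappingCharFilter(map, new StringReader("ソフトウェア"));
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) out.append((char) c);
        System.out.println(out);  // prints ソフトウエア
    }
}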

Lance

On 06/18/2013 11:08 PM, Yash Sharma wrote:
> Hi,
>
> We have two Japanese words with the same meaning, ソフトウェア and ソフトウエア (notice
> the difference between the small ェ and the full-size エ - the word means
> 'software' in English). When ソフトウェア is searched, it gives around 8 search
> results, but when ソフトウエア is searched, it gives only 2 search results.
>
> The Japanese translator told us that this is something called yugari (which
> means that the above words can be seen as authorise and authorize, so they
> should yield the same search results, as they have the same meaning but are
> spelled differently).
>
> We have one solution to this issue - to use synonyms.txt and place all
> these similar words in that text file. This solved our problem to some
> extent but, in a real-time scenario, we do not have all the Japanese
> technical words like software, product, technology, and so on, and we cannot
> keep updating synonyms.txt on a daily basis.
>
> Is there any better solution, so that all the similar Japanese words give
> the same search results?
> Any help is greatly appreciated.
>



Re: Improving performance to return 2000+ documents

2013-06-30 Thread Erick Erickson
50M documents, depending on a bunch of things,
may not be unreasonable for a single node, only
testing will tell.

But the question I have is whether you should be
using standard Solr queries for this or building a custom
component that goes at the base Lucene index
and "does the right thing". Or even re-indexing your
entire corpus periodically to add this kind of data.

FWIW,
Erick


On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar wrote:

> Thanks Erick/Peter.
>
> This is an offline process, used by a relevancy engine implemented around
> solr. The engine computes boost scores for related keywords based on
> clickstream data.
> i.e.: say clickstream has: ipad=upc1,upc2,upc3
> I query solr with keyword: "ipad" (to get 2000 documents) and then make 3
> individual queries for upc1,upc2,upc3 (which are fast).
> The data is then used to compute related keywords to "ipad" with their
> boost values.
>
> So, I cannot really replace that, since I need full text search over my
> dataset to retrieve top 2000 documents.
>
> I tried paging: I retrieve 500 solr documents 4 times (0-500, 500-1000...),
> but don't see any improvements.
>
>
> Some questions:
> 1. Maybe the JVM size might help?
> This is what I see in the dashboard:
> Physical Memory 76.2%
> Swap Space NaN% (don't have any swap space, running on AWS EBS)
> File Descriptor Count 4.7%
> JVM-Memory 73.8%
>
> Screenshot: http://i.imgur.com/aegKzP6.png
>
> 2. Will reducing the shards from 3 to 1 improve performance? (maybe
> increase the RAM from 30 to 60GB) The problem I will face in that case will
> be fitting 50M documents on 1 machine.
>
> Thanks,
> -Utkarsh
>
>
> On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge wrote:
>
> > Hello Utkarsh,
> > This may or may not be relevant for your use-case, but the way we deal
> with
> > this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time
> > (user selectable). We can then page the results, changing the start
> > parameter to return the next set. This allows us to 'retrieve' millions
> of
> > documents - we just do it at the user's leisure, rather than make them
> wait
> > for the whole lot in one go.
> > This works well because users very rarely want to see ALL 2000 (or
> whatever
> > number) documents at one time - it's simply too much to take in at one
> > time.
> > If your use-case involves an automated or offline procedure (e.g.
> running a
> > > report or some data-mining op), then presumably it doesn't matter so much
> > > if it takes a bit longer (as long as it returns in some reasonable time).
> > > Have you looked at doing paging on the client-side - this will hugely
> > > speed up your search time.
> > HTH
> > Peter
> >
> >
> >
> > On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson wrote:
> >
> > > Well, depending on how many docs get served
> > > from the cache the time will vary. But this is
> > > just ugly, if you can avoid this use-case it would
> > > be a Good Thing.
> > >
> > > Problem here is that each and every shard must
> > > assemble the list of 2,000 documents (just ID and
> > > sort criteria, usually score).
> > >
> > > Then the node serving the original request merges
> > > the sub-lists to pick the top 2,000. Then the node
> > > sends another request to each shard to get
> > > the full document. Then the node merges this
> > > into the full list to return to the user.
> > >
> > > Solr really isn't built for this use-case, is it actually
> > > a compelling situation?
> > >
> > > And having your document cache set at 1M is kinda
> > > high if you have very big documents.
> > >
> > > FWIW,
> > > Erick
> > >
> > >
> > > On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar wrote:
> > >
> > > > Also, I don't see a consistent response time from solr, I ran ab
> again
> > > and
> > > > I get this:
> > > >
> > > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
> > > >
> > > >
> > > > Benchmarking x.amazonaws.com (be patient)
> > > > Completed 100 requests
> > > > Completed 200 requests
> > > > Completed 300 requests
> > > > Completed 400 requests
> > > > Completed 500 requests
> > > > Finished 500 requests
> > > >
> > > >
> > > > Server Software:
> > > > Server Hostname:        x.amazonaws.com
> > > > Server Port:            8983
> > > >
> > > > Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > Document Length:        1538537 bytes
> > > >
> > > > Concurrency Level:      10
> > > > Time taken for tests:   10.858 seconds
> > > > Complete requests:      500
> > > > Failed requests:        8
> > > >    (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
> > > > Write errors:           0
> > > > Total transferred:      769297992 bytes
> > > > HTML transferred:       769268492 bytes
> > > > Requests per second:    46.05 [#/sec] (mean)
> > > > Time per request:       217.167 [ms] (mean)

Re: Improving performance to return 2000+ documents

2013-06-30 Thread Jagdish Nomula
solrconfig.xml has entries you can tweak for your use case. One of them is
queryResultWindowSize. You can try a value of 2000 and see if it helps
performance. Please make sure you have enough memory allocated for the
queryResultCache.
A combination of sharding and distribution of the workload (requesting
2000/number-of-shards from each shard) with an aggregator would be a good way
to maximize performance.
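
For reference, the relevant solrconfig.xml entries look something like the following (the values are illustrative, not recommendations - tune them to your heap):

<queryResultWindowSize>2000</queryResultWindowSize>
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>

With a window of 2000, a request for the first rows of a result set caches the full 2000-entry ordering, so the follow-up pages hit the queryResultCache instead of re-sorting.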

Thanks,

Jagdish


On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson wrote:

> 50M documents, depending on a bunch of things,
> may not be unreasonable for a single node, only
> testing will tell.
>
> But the question I have is whether you should be
> using standard Solr queries for this or building a custom
> component that goes at the base Lucene index
> and "does the right thing". Or even re-indexing your
> entire corpus periodically to add this kind of data.
>
> FWIW,
> Erick
>
>
> On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar wrote:
>
> > Thanks Erick/Peter.
> >
> > This is an offline process, used by a relevancy engine implemented around
> > solr. The engine computes boost scores for related keywords based on
> > clickstream data.
> > i.e.: say clickstream has: ipad=upc1,upc2,upc3
> > I query solr with keyword: "ipad" (to get 2000 documents) and then make 3
> > individual queries for upc1,upc2,upc3 (which are fast).
> > The data is then used to compute related keywords to "ipad" with their
> > boost values.
> >
> > So, I cannot really replace that, since I need full text search over my
> > dataset to retrieve top 2000 documents.
> >
> > I tried paging: I retrieve 500 solr documents 4 times (0-500,
> 500-1000...),
> > but don't see any improvements.
> >
> >
> > Some questions:
> > 1. Maybe the JVM size might help?
> > This is what I see in the dashboard:
> > Physical Memory 76.2%
> > Swap Space NaN% (don't have any swap space, running on AWS EBS)
> > File Descriptor Count 4.7%
> > JVM-Memory 73.8%
> >
> > Screenshot: http://i.imgur.com/aegKzP6.png
> >
> > 2. Will reducing the shards from 3 to 1 improve performance? (maybe
> > increase the RAM from 30 to 60GB) The problem I will face in that case
> will
> > be fitting 50M documents on 1 machine.
> >
> > Thanks,
> > -Utkarsh
> >
> >
> > On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge wrote:
> >
> > > Hello Utkarsh,
> > > This may or may not be relevant for your use-case, but the way we deal
> > with
> > > this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time
> > > (user selectable). We can then page the results, changing the start
> > > parameter to return the next set. This allows us to 'retrieve' millions
> > of
> > > documents - we just do it at the user's leisure, rather than make them
> > wait
> > > for the whole lot in one go.
> > > This works well because users very rarely want to see ALL 2000 (or
> > whatever
> > > number) documents at one time - it's simply too much to take in at one
> > > time.
> > > If your use-case involves an automated or offline procedure (e.g.
> > running a
> > > report or some data-mining op), then presumably it doesn't matter so much
> > > if it takes a bit longer (as long as it returns in some reasonable time).
> > > Have you looked at doing paging on the client-side - this will hugely
> > > speed up your search time.
> > > HTH
> > > Peter
> > >
> > >
> > >
> > > On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > > > Well, depending on how many docs get served
> > > > from the cache the time will vary. But this is
> > > > just ugly, if you can avoid this use-case it would
> > > > be a Good Thing.
> > > >
> > > > Problem here is that each and every shard must
> > > > assemble the list of 2,000 documents (just ID and
> > > > sort criteria, usually score).
> > > >
> > > > Then the node serving the original request merges
> > > > the sub-lists to pick the top 2,000. Then the node
> > > > sends another request to each shard to get
> > > > the full document. Then the node merges this
> > > > into the full list to return to the user.
> > > >
> > > > Solr really isn't built for this use-case, is it actually
> > > > a compelling situation?
> > > >
> > > > And having your document cache set at 1M is kinda
> > > > high if you have very big documents.
> > > >
> > > > FWIW,
> > > > Erick
> > > >
> > > >
> > > > On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> > > >
> > > > > Also, I don't see a consistent response time from solr, I ran ab
> > again
> > > > and
> > > > > I get this:
> > > > >
> > > > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
> > > > >
> > > > > Benchmarking x.amazonaws.com (be patient)
> > > > > Completed 100 requests
> > > > > Completed 200 requests
> > > > > Completed 300 requests
> > > > > Completed 400 requests
> > > > > Complete

RE: Distributed search results in "SocketException: Connection reset"

2013-06-30 Thread Shahar Davidson
Thanks Lance.

If that is the case, are there any timeout mechanisms defined by Solr other 
than Jetty timeout definitions?

Thanks,

Shahar.
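
One Solr-side mechanism (distinct from the servlet container's connector timeouts) is the shard handler Solr uses when it fans a request out to other shards. A sketch of its configuration inside a search handler in solrconfig.xml, with illustrative values (placement details may vary by version):

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- Timeouts applied by Solr itself to inter-shard requests -->
  <shardHandlerFactory class="HttpShardHandlerFactory">
    <int name="socketTimeout">60000</int>  <!-- ms waiting for data -->
    <int name="connTimeout">15000</int>    <!-- ms establishing a connection -->
  </shardHandlerFactory>
</requestHandler>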

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: Monday, July 01, 2013 4:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Distributed search results in "SocketException: Connection reset"

This usually means the end server timed out.

On 06/30/2013 06:31 AM, Shahar Davidson wrote:
> Hi all,
>
> We're getting the exception below sporadically when using distributed
> search (using Solr 4.2.1). Note that 'core_3' is one of the cores mentioned
> in the 'shards' parameter.
>
> Any ideas anyone?
>
> Thanks,
>
> Shahar.
>
>
> Jun 03, 2013 5:27:38 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: 
> org.apache.solr.client.solrj.SolrServerException: IOException occured when 
> talking to server at: http://127.0.0.1:8210/solr/core_3
>         at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:300)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1830)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>         at org.eclipse.jetty.server.Server.handle(Server.java:365)
>         at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
>         at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>         at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
>         at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
>         at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
>         at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>         at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>         at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>         at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when 
> talking to server at: http://127.0.0.1:8210/solr/core_3
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>         at java.util.concurrent.FutureTask.run(Unknown Source)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>         at java.util.concurrent.FutureTask$Sync.