Nutch + Solr - Indexer causes java.lang.OutOfMemoryError: Java heap space

2014-09-07 Thread glumet
Hello everyone, 

I have configured my 2 servers to run in distributed mode (with Hadoop) and
my configuration for crawling process is Nutch 2.2.1 - HBase (as a storage)
and Solr. Solr is run by Tomcat. The problem is that every time I try to do
the last step - indexing data from HBase into Solr - this *[1]* error
occurs. I tried to add CATALINA_OPTS (or JAVA_OPTS) like this:

CATALINA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC -Xms1g -Xmx6000m
-XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m
-XX:+CMSClassUnloadingEnabled"

to Tomcat's catalina.sh script and ran the server with this script, but it
didn't help. I also added these *[2]* properties to the nutch-site.xml file,
but it ended up with OutOfMemory again. Can you help me please?

*[1]*
2014-09-06 22:52:50,683 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:587)
at java.lang.StringBuffer.append(StringBuffer.java:332)
at java.io.StringWriter.write(StringWriter.java:77)
at org.apache.solr.common.util.XML.escape(XML.java:204)
at org.apache.solr.common.util.XML.escapeCharData(XML.java:77)
at org.apache.solr.common.util.XML.writeXML(XML.java:147)
at org.apache.solr.client.solrj.util.ClientUtils.writeVal(ClientUtils.java:161)
at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:129)
at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:355)
at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:271)
at org.apache.solr.client.solrj.request.RequestWriter.getContentStream(RequestWriter.java:66)
at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getDelegate(RequestWriter.java:94)
at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getName(RequestWriter.java:104)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:247)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:96)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:650)
at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1793)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)

*[2]*

<property>
  <name>http.content.limit</name>
  <value>15000</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  For our purposes it is twice bigger than default - parsing big pages: 128
  * 1024
  </description>
</property>

<property>
  <name>indexer.max.tokens</name>
  <value>10</value>
</property>

<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>solr.commit.size</name>
  <value>100</value>
  <description>
  Defines the number of documents to send to Solr in a single update batch.
  Decrease when handling very large documents to prevent Nutch from running
  out of memory. NOTE: It does not explicitly trigger a server side commit.
  </description>
</property>


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-Solr-Indexer-causes-java-lang-OutOfMemoryError-Java-heap-space-tp4157308.html
Sent from the Solr - User mailing list archive at Nabble.com.
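For what it's worth, the *[1]* trace shows the OOM firing inside the Hadoop child task while SolrIndexWriter.close() serializes a buffered batch of documents into one XML update request. So it is the solr.commit.size batch size, times the average document size, that bounds the buffer being built there; raising Tomcat's heap cannot help, because the allocation happens in the map task, not in Solr. A rough sketch of the batching behaviour (illustrative Python only, not Nutch's actual code):

```python
def send_in_batches(docs, batch_size, send):
    """Buffer up to batch_size documents, then flush them as one
    update request -- roughly what solr.commit.size controls in
    Nutch's SolrIndexWriter (illustrative sketch, not the real code)."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            send(list(batch))   # one update request per full batch
            batch.clear()
    if batch:                   # close() flushes the final partial batch
        send(list(batch))
```

With very large documents, lowering the batch size is what shrinks the peak buffer that the failing close() call has to serialize.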


Re: SolrCloud : node recovery fails with "No registered leader was found"

2014-09-07 Thread heaven
Seeing the same thing after a crash of one ZK node (out of 5):
{code}
org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms, collection: crm-prod slice: shard1
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:545)
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:528)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:250)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:982)
at org.apache.solr.update.processor.LogUpdateProcessor.processDelete(LogUpdateProcessorFactory.java:121)
at org.apache.solr.handler.loader.XMLLoader.processDelete(XMLLoader.java:349)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:278)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1952)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:774)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:744)
{code}



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-node-recovery-fails-with-No-registered-leader-was-found-tp4137331p4157312.html
Sent from the Solr - User mailing list archive at Nabble.com.
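The 4000 ms in the message comes from ZkStateReader.getLeaderRetry, which keeps re-checking ZooKeeper for a registered shard leader until a timeout before raising. The shape of that wait loop, sketched in Python purely for illustration (the real implementation lives in ZkStateReader and the parameter names here are assumptions):

```python
import time

def wait_for_leader(get_leader, timeout_ms=4000, poll_ms=50):
    """Poll until a leader is registered or the timeout expires --
    the retry/timeout shape of getLeaderRetry (illustrative only)."""
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        leader = get_leader()          # e.g. read the leader node from ZK
        if leader is not None:
            return leader
        if time.monotonic() >= deadline:
            raise RuntimeError(
                "No registered leader was found after waiting for %dms"
                % timeout_ms)
        time.sleep(poll_ms / 1000.0)
```

In other words, the exception means no node re-registered itself as leader for shard1 within the whole window, not that a single check failed.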


Re: Query ReRanking question

2014-09-07 Thread Erick Erickson
Joel:

I find that whenever I say something totally wrong publicly, I
remember the correction really really well...

Thanks for straightening that out!
Erick

On Sat, Sep 6, 2014 at 12:58 PM, Joel Bernstein  wrote:
> This following query:
>
> http://localhost:8080/solr/select?q=malaysian airline crash&rq={!rerank
> reRankQuery=$rqq reRankDocs=1000}&rqq=*:*&sort=publish_date
> desc&fl=headline,publish_date,score
>
> Is doing the following:
>
> The main query is sorted by publish_date. Then the results are reranked by
> *:*, which in theory would have no effect at all.
>
> The reRankQuery only uses the reRankQuery to re-rank the results. The sort
> param will always apply to the main query.
>
>
>
>
>
>
>
>
>
>
>
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Sat, Sep 6, 2014 at 2:33 PM, Ravi Solr  wrote:
>
>> Erick,
>> Your idea about reversing Joel's suggestion seems to give the best
>> results of all the options I tried...but I can't seem to understand why. I
>> thought the query shown below should give irrelevant results, as sorting by
>> date would throw relevancy off...but somehow it's getting relevant results
>> with fair enough reverse chronology. It is as if the sort is applied after
>> the docs are collected and reranked (which is what I wanted). One more
>> thing that baffled me was: if I change reRankDocs from 1000 to 100, the
>> results become irrelevant, which doesn't make sense.
>>
>> So can you kindly explain what's going on in the following query.
>>
>> http://localhost:8080/solr/select?q=malaysian airline crash&rq={!rerank
>> reRankQuery=$rqq reRankDocs=1000}&rqq=*:*&sort=publish_date
>> desc&fl=headline,publish_date,score
>>
>> I love the solr community, so much to learn from so many knowledgeable
>> people.
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>>
>>
>> On Fri, Sep 5, 2014 at 1:23 PM, Erick Erickson 
>> wrote:
>>
>> > OK, why can't you switch the clauses from Joel's suggestion?
>> >
>> > Something like:
>> > q=Malaysia plane crash&rq={!rerank reRankDocs=1000
>> > reRankQuery=$myquery}&myquery=*:*&sort=date+desc
>> >
>> > (haven't tried this yet, but you get the idea).
>> >
>> > Best,
>> > Erick
>> >
>> > On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma
>> >  wrote:
>> > > Hi - You can already achieve this by boosting on the document's
>> recency.
>> > The result set won't be exactly ordered by date but you will get the most
>> > relevant and recent documents on top.
>> > >
>> > > Markus
>> > >
>> > > -Original message-
>> > >> From:Ravi Solr mailto:ravis...@gmail.com> >
>> > >> Sent: Friday 5th September 2014 18:06
>> > >> To: solr-user@lucene.apache.org 
>> > >> Subject: Re: Query ReRanking question
>> > >>
>> > >> Thank you very much for responding. I want to do exactly the opposite
>> of
>> > >> what you said. I want to sort the relevant docs in reverse chronology.
>> > If
>> > >> you sort by date before hand then the relevancy is lost. So I want to
>> > get
>> > >> Top N relevant results and then rerank those Top N to achieve relevant
>> > >> reverse chronological results.
>> > >>
>> > >> If you ask Why would I want to do that ??
>> > >>
>> > >> Lets take a example about Malaysian airline crash. several articles
>> > might
>> > >> have been published over a period of time. When I search for -
>> malaysia
>> > >> airline crash blackbox - I would want to see "relevant" results but
>> > would
>> > >> also like to see the the recent developments on the top i.e.
>> > effectively a
>> > >> reverse chronological order within the relevant results, like telling
>> a
>> > >> story over a period of time
>> > >>
>> > >> Hope i am clear. Thanks for your help.
>> > >>
>> > >> Thanks
>> > >>
>> > >> Ravi Kiran Bhaskar
>> > >>
>> > >>
>> > >> On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein > >  > wrote:
>> > >>
>> > >> > If you want the main query to be sorted by date then the top N docs
>> > >> > reranked by a query, that should work. Try something like this:
>> > >> >
>> > >> > q=foo&sort=date+desc&rq={!rerank reRandDocs=1000
>> > >> > reRankQuery=$myquery}&myquery=blah
>> > >> >
>> > >> >
>> > >> > Joel Bernstein
>> > >> > Search Engineer at Heliosearch
>> > >> >
>> > >> >
>> > >> > On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr > >  > wrote:
>> > >> >
>> > >> > > Can the ReRanking API be used to sort within docs retrieved by a
>> > date
>> > >> > field
>> > >> > > ? Can somebody help me understand how to write such a query ?
>> > >> > >
>> > >> > > Thanks
>> > >> > >
>> > >> > > Ravi Kiran Bhaskar
>> > >> > >
>> > >> >
>> > >>
>> > >
>> >
>>


New cloud - replica in recovering state?

2014-09-07 Thread Jakov Sosic

Hi guys,


I'm trying to set up a new solr cloud, with two cores, each with two
shards and two replicas.


This is my solr.xml:

<solr zkHost="10.200.1.104:2181,10.200.1.105:2181,10.200.1.106:2181">
  <cores defaultCoreName="mycore1"
         host="${host:}" hostPort="${jetty.port:}"
         hostContext="${hostContext:}"
         zkClientTimeout="${zkClientTimeout:15000}">
    <core ... />
    <core ... />
  </cores>
</solr>


But when I start everything, I can see 4 cores (each for 1 shard) are 
green in solr01:8080/solr/#/~cloud, but replicas are in yellow, 
RECOVERING state.


How can I fix them to go from Recovering to Active?


ANNOUNCE: Solr Reference Guide for Solr 4.10

2014-09-07 Thread Chris Hostetter


The Lucene PMC is pleased to announce that there is a new version of the 
Solr Reference Guide for Solr 4.10.


The 511-page PDF serves as the definitive user's manual for Solr 4.10. It 
can be downloaded from the Apache mirror network:


https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/


-Hoss


Re: Query ReRanking question

2014-09-07 Thread Joel Bernstein
Ok, just reviewed the code. The ReRankingQParserPlugin always tracks the
scores from the main query. So this explains things. Speaking of explaining
things, the ReRankingQParserPlugin also works with Lucene's explain. So if
you use debugQuery=true we should see that the score from the initial query
was combined with the score from the reRankQuery, which should be 1.

You have stumbled on an interesting usage pattern which I never considered.
But basically what's happening is:

1) Main query is sorted by score.
2) Reranker is reRanking docs based on the score from the main query.

No worries, Erick, you've taught me a lot over the past couple of years!








Joel Bernstein
Search Engineer at Heliosearch


On Sun, Sep 7, 2014 at 11:37 AM, Erick Erickson 
wrote:

> Joel:
>
> I find that whenever I say something totally wrong publicly, I
> remember the correction really really well...
>
> Thanks for straightening that out!
> Erick
>
> On Sat, Sep 6, 2014 at 12:58 PM, Joel Bernstein 
> wrote:
> > This following query:
> >
> > http://localhost:8080/solr/select?q=malaysian airline crash&rq={!rerank
> > reRankQuery=$rqq reRankDocs=1000}&rqq=*:*&sort=publish_date
> > desc&fl=headline,publish_date,score
> >
> > Is doing the following:
> >
> > The main query is sorted by publish_date. Then the results are reranked
> by
> > *:*, which in theory would have no effect at all.
> >
> > The reRankQuery only uses the reRankQuery to re-rank the results. The
> sort
> > param will always apply to the main query.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > Joel Bernstein
> > Search Engineer at Heliosearch
> >
> >
> > On Sat, Sep 6, 2014 at 2:33 PM, Ravi Solr  wrote:
> >
> >> Erick,
> >> Your idea about reversing Joel's suggestion seems to give the
> best
> >> results of all the options I tried...but I cant seem to understand why.
> I
> >> thought the query shown below should give irrelevant results as sorting
> by
> >> date would throw relevancy off...but somehow its getting relevant
> results
> >> with fair enough reverse chronology. It is as if the sort is applied
> after
> >> the docs are collected and reranked (which is what I wanted). One more
> >> thing that baffled me was, if I change reRankDocs from 1000 to100 the
> >> results become irrelevant, which doesnt make sense.
> >>
> >> So can you kindly explain whats going on in the following query.
> >>
> >> http://localhost:8080/solr/select?q=malaysian airline crash&rq={!rerank
> >> reRankQuery=$rqq reRankDocs=1000}&rqq=*:*&sort=publish_date
> >> desc&fl=headline,publish_date,score
> >>
> >> I love the solr community, so much to learn from so many knowledgeable
> >> people.
> >>
> >> Thanks
> >>
> >> Ravi Kiran Bhaskar
> >>
> >>
> >>
> >> On Fri, Sep 5, 2014 at 1:23 PM, Erick Erickson  >
> >> wrote:
> >>
> >> > OK, why can't you switch the clauses from Joel's suggestion?
> >> >
> >> > Something like:
> >> > q=Malaysia plane crash&rq={!rerank reRankDocs=1000
> >> > reRankQuery=$myquery}&myquery=*:*&sort=date+desc
> >> >
> >> > (haven't tried this yet, but you get the idea).
> >> >
> >> > Best,
> >> > Erick
> >> >
> >> > On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma
> >> >  wrote:
> >> > > Hi - You can already achieve this by boosting on the document's
> >> recency.
> >> > The result set won't be exactly ordered by date but you will get the
> most
> >> > relevant and recent documents on top.
> >> > >
> >> > > Markus
> >> > >
> >> > > -Original message-
> >> > >> From:Ravi Solr mailto:ravis...@gmail.com> >
> >> > >> Sent: Friday 5th September 2014 18:06
> >> > >> To: solr-user@lucene.apache.org  solr-user@lucene.apache.org>
> >> > >> Subject: Re: Query ReRanking question
> >> > >>
> >> > >> Thank you very much for responding. I want to do exactly the
> opposite
> >> of
> >> > >> what you said. I want to sort the relevant docs in reverse
> chronology.
> >> > If
> >> > >> you sort by date before hand then the relevancy is lost. So I want
> to
> >> > get
> >> > >> Top N relevant results and then rerank those Top N to achieve
> relevant
> >> > >> reverse chronological results.
> >> > >>
> >> > >> If you ask Why would I want to do that ??
> >> > >>
> >> > >> Lets take a example about Malaysian airline crash. several articles
> >> > might
> >> > >> have been published over a period of time. When I search for -
> >> malaysia
> >> > >> airline crash blackbox - I would want to see "relevant" results but
> >> > would
> >> > >> also like to see the the recent developments on the top i.e.
> >> > effectively a
> >> > >> reverse chronological order within the relevant results, like
> telling
> >> a
> >> > >> story over a period of time
> >> > >>
> >> > >> Hope i am clear. Thanks for your help.
> >> > >>
> >> > >> Thanks
> >> > >>
> >> > >> Ravi Kiran Bhaskar
> >> > >>
> >> > >>
> >> > >> On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein  >> >  > wrote:
> >> > >>
> >> > >> > If you want the main query to be sorted by date then the top N
> d

Re: Query ReRanking question

2014-09-07 Thread Joel Bernstein
Oops, wrong usage pattern. It should be:

1) Main query is sorted by a field (scores tracked silently in the
background).
2) Reranker is reRanking docs based on the score from the main query.
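Joel's corrected two-step pattern can be sketched in a few lines (illustrative Python only, not Solr's implementation): the main query sorts by a field while relevance scores are tracked silently, and only the top reRankDocs window is reordered by the combined score. With rqq=*:* the rerank score is a constant, so the tracked main-query score decides the final order of the window.

```python
def rerank(docs, sort_key, n, main_score, rerank_score):
    """Sketch of {!rerank ...}: sort the full result by the main sort,
    keep the top-n window, then reorder that window by combined
    score (illustrative, not Solr's code)."""
    window = sorted(docs, key=sort_key)[:n]        # main sort, e.g. date desc
    return sorted(window,
                  key=lambda d: main_score(d) + rerank_score(d),
                  reverse=True)                    # rerank within the window
```

This also matches Ravi's reRankDocs observation: with a window of 1000 the relevant documents are likely inside the date-sorted window and bubble back up, while a window of 100 recent-only documents may not contain them at all.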



Joel Bernstein
Search Engineer at Heliosearch


On Sun, Sep 7, 2014 at 4:43 PM, Joel Bernstein  wrote:

> Ok, just reviewed the code. The ReRankingQParserPlugin always tracks the
> scores from the main query. So this explains things. Speaking of explaining
> things, the ReRankingParserPlugin also works with Lucene's explain. So if
> you use debugQuery=true we should see that the score from the initial query
> was combined with the score from the reRankQuery, which should be 1.
>
> You have stumbled on an interesting usage pattern which I never considered.
> But basically what's happening is:
>
> 1) Main query is sorted by score.
> 2) Reranker is reRanking docs based on the score from the main query.
>
> No worries, Erick, you've taught me a lot over the past couple of years!
>
>
>
>
>
>
>
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Sun, Sep 7, 2014 at 11:37 AM, Erick Erickson 
> wrote:
>
>> Joel:
>>
>> I find that whenever I say something totally wrong publicly, I
>> remember the correction really really well...
>>
>> Thanks for straightening that out!
>> Erick
>>
>> On Sat, Sep 6, 2014 at 12:58 PM, Joel Bernstein 
>> wrote:
>> > This following query:
>> >
>> > http://localhost:8080/solr/select?q=malaysian airline crash&rq={!rerank
>> > reRankQuery=$rqq reRankDocs=1000}&rqq=*:*&sort=publish_date
>> > desc&fl=headline,publish_date,score
>> >
>> > Is doing the following:
>> >
>> > The main query is sorted by publish_date. Then the results are reranked
>> by
>> > *:*, which in theory would have no effect at all.
>> >
>> > The reRankQuery only uses the reRankQuery to re-rank the results. The
>> sort
>> > param will always apply to the main query.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > Joel Bernstein
>> > Search Engineer at Heliosearch
>> >
>> >
>> > On Sat, Sep 6, 2014 at 2:33 PM, Ravi Solr  wrote:
>> >
>> >> Erick,
>> >> Your idea about reversing Joel's suggestion seems to give the
>> best
>> >> results of all the options I tried...but I cant seem to understand
>> why. I
>> >> thought the query shown below should give irrelevant results as
>> sorting by
>> >> date would throw relevancy off...but somehow its getting relevant
>> results
>> >> with fair enough reverse chronology. It is as if the sort is applied
>> after
>> >> the docs are collected and reranked (which is what I wanted). One more
>> >> thing that baffled me was, if I change reRankDocs from 1000 to100 the
>> >> results become irrelevant, which doesnt make sense.
>> >>
>> >> So can you kindly explain whats going on in the following query.
>> >>
>> >> http://localhost:8080/solr/select?q=malaysian airline
>> crash&rq={!rerank
>> >> reRankQuery=$rqq reRankDocs=1000}&rqq=*:*&sort=publish_date
>> >> desc&fl=headline,publish_date,score
>> >>
>> >> I love the solr community, so much to learn from so many knowledgeable
>> >> people.
>> >>
>> >> Thanks
>> >>
>> >> Ravi Kiran Bhaskar
>> >>
>> >>
>> >>
>> >> On Fri, Sep 5, 2014 at 1:23 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> >> wrote:
>> >>
>> >> > OK, why can't you switch the clauses from Joel's suggestion?
>> >> >
>> >> > Something like:
>> >> > q=Malaysia plane crash&rq={!rerank reRankDocs=1000
>> >> > reRankQuery=$myquery}&myquery=*:*&sort=date+desc
>> >> >
>> >> > (haven't tried this yet, but you get the idea).
>> >> >
>> >> > Best,
>> >> > Erick
>> >> >
>> >> > On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma
>> >> >  wrote:
>> >> > > Hi - You can already achieve this by boosting on the document's
>> >> recency.
>> >> > The result set won't be exactly ordered by date but you will get the
>> most
>> >> > relevant and recent documents on top.
>> >> > >
>> >> > > Markus
>> >> > >
>> >> > > -Original message-
>> >> > >> From:Ravi Solr mailto:ravis...@gmail.com> >
>> >> > >> Sent: Friday 5th September 2014 18:06
>> >> > >> To: solr-user@lucene.apache.org > solr-user@lucene.apache.org>
>> >> > >> Subject: Re: Query ReRanking question
>> >> > >>
>> >> > >> Thank you very much for responding. I want to do exactly the
>> opposite
>> >> of
>> >> > >> what you said. I want to sort the relevant docs in reverse
>> chronology.
>> >> > If
>> >> > >> you sort by date before hand then the relevancy is lost. So I
>> want to
>> >> > get
>> >> > >> Top N relevant results and then rerank those Top N to achieve
>> relevant
>> >> > >> reverse chronological results.
>> >> > >>
>> >> > >> If you ask Why would I want to do that ??
>> >> > >>
>> >> > >> Lets take a example about Malaysian airline crash. several
>> articles
>> >> > might
>> >> > >> have been published over a period of time. When I search for -
>> >> malaysia
>> >> > >> airline crash blackbox - I would want to see "relevant" results
>> but
>> >> > would
>> >> > >> also like to see the the recent developm

Re: statuscode list

2014-09-07 Thread Koji Sekiguchi

Hi Jan,

(2014/09/05 21:01), Jan Verweij - Reeleez wrote:

Hi,

If I'm correct you will get a statuscode="0" in the response if you
use XML messages for updating the solr index.


I think what you mean by statuscode="0" is status=0 here.



<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">7</int>
  </lst>
</response>



Is there a list of possible other statuscodes you can receive in case
anything fails and what these errorcodes mean?


I don't think we have a list of other possible status codes, because Solr
doesn't return a status other than 0. Instead of the status code in the XML,
you should look at the HTTP status code, e.g. 200 OK, 404 Not Found, etc.,
because if something goes wrong on Solr while updating (or even querying)
the index, Solr may not return XML at all.

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
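Koji's advice boils down to: trust the HTTP layer first, and only then the status field inside the XML. A small sketch of such a check using Python's stdlib XML parser (the sample response header with status=0 and QTime=7 follows Solr's standard update-response format, but the helper itself is hypothetical):

```python
import xml.etree.ElementTree as ET

def update_ok(http_status, body):
    """True only if the HTTP layer reports 200 AND the responseHeader
    reports status=0 -- the only value Solr emits in the XML on success."""
    if http_status != 200:
        return False          # errors often arrive as non-XML bodies
    root = ET.fromstring(body)
    status = root.find(".//int[@name='status']")
    return status is not None and int(status.text) == 0
```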


Re: New cloud - replica in recovering state?

2014-09-07 Thread Erick Erickson
I really recommend you use the new-style core discovery, if for no
other reason than this style is deprecated in 5.0. See:
https://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond

FWIW,
Erick

On Sun, Sep 7, 2014 at 8:51 AM, Jakov Sosic  wrote:
> Hi guys,
>
>
> I'm trying to set up a new solr cloud, with two cores, each with two shards
> and two replicas.
>
> This is my solr.xml:
>
> 
>zkHost="10.200.1.104:2181,10.200.1.105:2181,10.200.1.106:2181">
> defaultCoreName="mycore1"
>host="${host:}" hostPort="${jetty.port:}"
>hostContext="${hostContext:}"
>zkClientTimeout="${zkClientTimeout:15000}">
>   
>   
> 
> 
>
> But when I start everything, I can see 4 cores (each for 1 shard) are green
> in solr01:8080/solr/#/~cloud, but replicas are in yellow, RECOVERING state.
>
> How can I fix them to go from Recovering to Active?


[ANN] Heliosearch 0.07 released

2014-09-07 Thread Yonik Seeley
http://heliosearch.org/download

Heliosearch v0.07 Features
  o  Heliosearch v0.07 is based on (and contains all features of)
Lucene/Solr 4.10.0
  o  An optimized Terms Query with native code performance
enhancements for efficiently matching multiple terms in a field.
  http://heliosearch.org/solr-terms-query/
  o  Native code to accelerate creation of off-heap filters.
  o  Added an off-heap buffer pool to speed allocation of temporary
memory buffers.
  o  Added ConstantScoreQuery support to lucene query syntax.
  Example: +color:blue^=1 text:shoes
  http://heliosearch.org/solr/query-syntax/#ConstantScoreQuery
  o  Added filter support to lucene query syntax. This retrieves an
off-heap filter from the filter cache, essentially like embedding “fq”
(filter queries) in a lucene query at any level. This also effectively
provides a way to “OR” various cached filters together.
  Example: description:HDTV OR filter(+promotion:tv
+promotion_date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY])
  http://heliosearch.org/solr/query-syntax/#FilterQuery
  o  Added C-style comments to lucene query syntax.
  Example: description:HDTV /* this is a comment */
  http://heliosearch.org/solr/query-syntax/#comments
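As a toy illustration of the new C-style comment support, stripping /* ... */ spans from a query string could look like the naive regex sketch below. This assumes comments never appear inside quoted phrases; the real query parser presumably handles them at the syntax level, so this is illustration only.

```python
import re

def strip_query_comments(q):
    """Remove C-style /* ... */ comments from a query string.
    Naive sketch: assumes no comment markers inside quoted phrases."""
    return re.sub(r"/\*.*?\*/", " ", q, flags=re.S).strip()
```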

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Performance of Boolean query with hundreds of OR clauses.

2014-09-07 Thread Yonik Seeley
Solr 4.10 has added a {!terms} query that should speed up these cases.

Benchmarks here:
http://heliosearch.org/solr-terms-query/

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data

On Tue, Aug 19, 2014 at 2:57 PM, SolrUser1543  wrote:
> I am using Solr to perform search for finding similar pictures.
>
> For this purpose, every image is indexed as a set of descriptors (a
> descriptor is a string of 6 chars).
> The number of descriptors for every image may vary (from a few to many
> thousands).
>
> When I want to search for a similar image, I extract the descriptors
> from it and create a query like:
> MyImage:( desc1 desc2 ...  desc n )
>
> The number of descriptors in a query may also vary. Usually it is about 1000.
>
> Of course the performance of this query is very bad and it may take a few
> minutes to return.
>
> Any ideas for performance improvement?
>
> P.S. I also tried to use LIRE, but it does not fit my use case.
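The reason a dedicated terms query beats a 1000-clause Boolean OR is that it can skip the per-clause scoring machinery and simply union the matching doc IDs. The core union step might be sketched like this (illustrative only; Lucene's actual implementation works on postings lists and bitsets in a far more optimized way):

```python
import heapq

def union_postings(postings_lists):
    """Merge many sorted doc-id lists into one sorted, de-duplicated
    list of matches -- the heart of a multi-term 'terms' query (sketch)."""
    out = []
    for doc_id in heapq.merge(*postings_lists):   # k-way merge of sorted lists
        if not out or out[-1] != doc_id:          # drop duplicate doc ids
            out.append(doc_id)
    return out
```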


Re: How to implement multilingual word components fields schema?

2014-09-07 Thread Ilia Sretenskii
Thank you for the replies, guys!

Using field-per-language approach for multilingual content is the last
thing I would try, since my actual task is to implement search
functionality that offers roughly the same capabilities for
every known world language.
The closest references are the popular web search engines; they seem to
serve worldwide users with their different languages and even
cross-language queries as well.
Thus, a field-per-language approach would be a sure waste of storage
resources due to the high number of duplicates, since there are over 200
known languages.
I really would like to keep a single field for cross-language searchable text
content, without splitting it into specific language fields or specific
language cores.

So my current choice will be to stay with just the ICUTokenizer and
ICUFoldingFilter as they are without any language specific
stemmers/lemmatizers yet at all.

Probably I will put the most popular languages' stop word filters and
stemmers into the same searchable text field to give it a try and see
if they work correctly in a stack.
Does stacking language-specific filters work correctly in one field?

Further development will most likely involve some advanced custom analyzers
like the "SimplePolyGlotStemmingTokenFilter" to utilize the ICU generated
ScriptAttribute.
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java

So I would like to know more about those "academic papers on this issue of
how best to deal with mixed language/mixed script queries and documents".
Tom, could you please share them?
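A tiny illustration of the script-attribute idea behind a polyglot filter: bucket each token by the dominant Unicode script of its characters, then branch to a per-script stemmer. This sketch abuses Python's unicodedata name prefixes as a crude stand-in for the proper ScriptAttribute that the ICUTokenizer exposes; the script list and coverage here are assumptions, purely for illustration.

```python
import unicodedata

SCRIPTS = ("CYRILLIC", "GREEK", "CJK", "ARABIC", "HEBREW", "LATIN")

def char_script(ch):
    """Crude script bucket derived from the Unicode character name prefix."""
    name = unicodedata.name(ch, "")
    for script in SCRIPTS:
        if name.startswith(script):
            return script
    return "OTHER"

def token_script(token):
    """Dominant script of a token -- the key a polyglot stemming filter
    would branch on (ICU exposes this properly as a ScriptAttribute)."""
    counts = {}
    for ch in token:
        counts[char_script(ch)] = counts.get(char_script(ch), 0) + 1
    return max(counts, key=counts.get)
```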


Deleted Collections not updated in Zookeeper

2014-09-07 Thread RadhaJayalakshmi
Hi,
Issue in brief:
I am facing a strange issue where the collections that are deleted in
SOLR still have references in Zookeeper, due to which, in the solr
cloud console, I still see references to the deleted collections in a
down state.

Issue in Detail:
I am using Solr 4.5.1 and zookeeper 3.4.5. 
Running a solr cloud of 3 nodes in one physical box. In that same box, I am
also running a zookeeper ensemble of 3 nodes.

Now as part of my application, every week I create new indexes (collections)
with the new data, and that index becomes the LIVE index for searches.
Old indexes (collections) are deleted periodically from the solr server
using the curl command and DELETE on /admin/collections (Collections API).

So far, it is working as expected.
But the problem I am facing is that the indexes (collections) that get
deleted from the SOLR server periodically still have references in
zookeeper.
Because they have references in zookeeper, I am still able to see those
deleted collections in the SOLR cloud console, highlighted in orange color
(which means in a down state).
Over time, I can see many such deleted collections in the solr cloud console.
I don't know how to get rid of this.

Even restarting SOLR and Zookeeper does not give any relief.

One more thing: the deleted collections don't have any reference in the
SOLR HOME folder.
They sit only in zookeeper.

Expecting a reply for this issue

Thanks
Radha




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Deleted-Collections-not-updated-in-Zookeeper-tp4157362.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Deleted Collections not updated in Zookeeper

2014-09-07 Thread Anshum Gupta
Hi Radha,

This is strange, as the collections API delete command is supposed to
clean up zk. Do you see any errors in your Solr logs? Does the response
from the call include any errors/exceptions?


On Sun, Sep 7, 2014 at 11:32 PM, RadhaJayalakshmi <
rlakshminaraya...@inautix.co.in> wrote:

> Hi,
> Issue in brief:
> I am facing a strange issue, where, the collections that are deleted in
> SOLR, are still having reference in Zookeeper and due to which, in the solr
> cloud console, i am still seeing the reference to the deleted collections
> in
> down state
>
> Issue in Detail:
> I am using Solr 4.5.1 and zookeeper 3.4.5.
> Running a solr cloud of 3 nodes in one physical box. In that same box, i am
> also running zookeeper ensemble of 3 nodes
>
> Now as part of my application, every week, i create new indexes(collection)
> with the new data and that index will become LIVE index for Searches.
> Old Index(collections), will be deleted periodically from the solr server,
> using the curl command and using DELETE on /admin/collections(Collections
> API).
>
> So far, it is working as expected.
> But the problem i am facing is that, the indexes(Collections) that gets
> deleted from the SOLR server periodically, are still having reference in
> the
> zookeeper.
> Becuase it has reference in zookeeper, i am still able to see those deleted
> collections in SOLR cloud console, highlighted in orange color(which means
> in down state).
> Over time, i can see many such deleted collections in solr cloud console. I
> dont know how to get rid of this.
>
> Even restart of SOLR and Zookeeper is not giving any relief.
>
> One more thing is, the deleted collections, dont have any reference in the
> SOLR HOME folder.
> But it sits only on zookeeper.
>
> Expecting a reply for this issue
>
> Thanks
> Radha
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Deleted-Collections-not-updated-in-Zookeeper-tp4157362.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 

Anshum Gupta
http://www.anshumgupta.net
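Until the root cause surfaces, the usual manual workaround is to compare what ZooKeeper lists under /collections against the collections Solr actually has, and remove the leftovers by hand with the ZooKeeper CLI (e.g. `rmr /collections/<name>` in zkCli.sh; paths assumed). The comparison itself is trivial to script (sketch; the inputs are plain name lists):

```python
def stale_collections(zk_collections, live_collections):
    """Collections still registered under /collections in ZooKeeper but
    no longer present in Solr -- candidates for manual cleanup (sketch)."""
    return sorted(set(zk_collections) - set(live_collections))
```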