long QTime for big index

2013-01-31 Thread Mou
I am running solr 3.4 on tomcat 7.

Our index is very big, two cores of 120G each. We are searching the slaves,
which are replicated every 30 min.
I am using the filterCache only, and we have more than 90% cache hits. We use
a lot of filter queries; queries are usually pretty big, with 10-20 fq
parameters. Not all filters are cached.

We are searching three shards, and the query looks like this:
shards=core1,core2,core3&q=*:*&fq=field1:some value&fq=-field2:some
value&sort=date
But some queries are taking more than 30 sec to return results, and the
behavior is intermittent. I cannot find any relation to replication. We are
using the Zing JVM, which reduced our GC pauses to milliseconds, so GC is not
the problem.

How can I improve the qtime? Is it at all possible to get a better qtime
given our index size?

Thank you for your suggestion.





Re: Fwd: advice about develop AbstractSolrEventListener.

2013-01-31 Thread Miguel

Hi

  After studying the Apache Solr documentation, I think the only way to know
which records were updated (modify, delete and insert actions) is to develop a
class extending org.apache.solr.servlet.SolrUpdateServlet.
In this class I can access the updated record information going into the
Apache Solr server.


Can somebody confirm that this is the best way, or are there other options?


thanks

On 30/01/2013 13:39, Miguel wrote:


Hi

I have to develop a function that must communicate with a webservice, and
this function must execute after each commit.
My doubt:
is it possible to get the records that have been updated in the Solr index?
My function must send information about added, updated and deleted records
from the Solr index to an external webservice, and this information must be
sent after the commit event.

I have read the Apache Solr wiki and it seems the best way is to create a
listener with event=postCommit, but I have looked at the
"solr.RunExecutableListener" example and I don't see how to know the records
associated with the commit event.

Example Solrconfig.xml:


 


Thanks.
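
For what it's worth, a minimal sketch of such a listener (assuming Solr 4.x; the
package and class names are hypothetical) would be registered in solrconfig.xml
inside <updateHandler> as
<listener event="postCommit" class="com.example.WebserviceNotifyListener"/>
and look like this:

package com.example;

import org.apache.solr.core.AbstractSolrEventListener;
import org.apache.solr.core.SolrCore;

public class WebserviceNotifyListener extends AbstractSolrEventListener {

    public WebserviceNotifyListener(SolrCore core) {
        super(core);
    }

    @Override
    public void postCommit() {
        // Fired after every hard commit. Note that the callback does not carry
        // the list of added/updated/deleted documents, so those ids have to be
        // collected elsewhere (for example while the update requests come in)
        // and only the notification to the external webservice is sent here.
    }
}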







How to use SolrCloud in multi-threaded indexing

2013-01-31 Thread andy
Hi, 

I am going to upgrade to Solr 4.1 from version 3.6, and I want to set up two
shards.
I use ConcurrentUpdateSolrServer to index the documents in Solr 3.6.
I saw the CloudSolrServer API in 4.1, but:
1: CloudSolrServer uses LBHttpSolrServer to issue requests, yet "LBHttpSolrServer
should NOT be used for indexing" is documented in the API at
http://lucene.apache.org/solr/4_1_0/solr-solrj/index.html

2: it seems CloudSolrServer does not support multi-threaded indexing.

So, how should multi-threaded indexing be done in Solr 4.1?

Thanks
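
One pattern that seems to work (a hedged sketch against SolrJ 4.1; the ZooKeeper
host string and collection name are assumptions) is to share a single
CloudSolrServer instance and feed it from several threads:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MultiThreadedIndexer {
    public static void main(String[] args) throws Exception {
        // One CloudSolrServer shared by all indexing threads; it watches the
        // cluster state in ZooKeeper and routes update requests itself.
        final CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
        server.setDefaultCollection("collection1");

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            final int threadId = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < 1000; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", threadId + "-" + i);
                            server.add(doc);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        server.commit();
        server.shutdown();
    }
}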





searching for an id

2013-01-31 Thread b.riez...@pixel-ink.de
Hi

I have an id which is a string like this:
tx-20130130-4599

I'm using a field without processing, which I got confirmed via the analysis tool.
But when I search for it, the value gets split up, so instead of finding the specific
entry with that unique id,
it finds all entries with "tx" in them.

Any idea how to get rid of that behavior?

Best
Ben



RE: Indexing problems

2013-01-31 Thread GASPARD Joel
Hello,

After more tests, we could identify our indexing problem (Solr 4.0.0).
Our problems are actually OutOfMemoryErrors. Thinking about ZooKeeper connection
problems was a mistake. We thought about that because OOMEs sometimes
appear in the logs after errors during ZooKeeper leader election.

Indexing fails when we define several Solr schemas in ZooKeeper.
When we define a single schema, indexing works well. This has been tested with
a single Solr node in the cluster, and with two Solr nodes.
We are facing problems when we upload several configurations to ZooKeeper: we
can create an index for a single collection, but OutOfMemoryErrors are thrown
when we try to create an index for a second collection with another schema.
Garbage collection logs show a rapid increase in memory consumption, then
OutOfMemory errors.

Can we define a distinct schema for each collection ?

Thanks !

Joel Gaspard



From: GASPARD Joel [mailto:joel.gasp...@cegedim.com]
Sent: Tuesday, 22 January 2013 16:30
To: solr-user@lucene.apache.org
Subject: Indexing problems

Hello,

We are facing some problems when indexing with Solr 4.0.0 with more than one 
server node and we can't find a way to solve them.
We have 2 nodes of Solr Cloud instances.
They are running in a Zookeeper ensemble (3.4.4 version) with 3 servers 
(another application is deployed on the third server).
We try to index a collection with 1 shard stored in the 2 nodes.
Two other collections with a single shard each have already been indexed. The logs for
that first indexing have been lost, but maybe there was only a single Solr node when
the indexing was done. Each collection contains about 3,000,000 documents
(16 GB).

When we start adding documents, failures occur very fast, after maybe 2000 
documents, and the solr servers cannot be accessed anymore.
I add to this mail an attachment containing a part of the logs.

When we use Solr Cloud with only one node in a single zookeeper ensemble, we 
don't encounter any problem.



Some details about our configuration:
We send about 400 documents per minute.
The documents are added in Solr by two threads on our application, using the 
CloudSolrServer class.
These threads don't call the commit method. We use only the solr config to 
commit. The solrconfig.xml defines for now :
<autoCommit><maxTime>15000</maxTime><openSearcher>false</openSearcher></autoCommit>
No soft commit.
We have also tried:
<autoCommit><maxTime>60</maxTime><openSearcher>false</openSearcher></autoCommit>
<autoSoftCommit><maxTime>1000</maxTime></autoSoftCommit>

The Solr servers are launched with these options :
-Xmx12G -Xms4G
-XX:MaxPermSize=256m -XX:MaxNewSize=356m
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseParNewGC
-XX:+CMSClassUnloadingEnabled
-XX:MinHeapFreeRatio=10
-XX:MaxHeapFreeRatio=25
-DzkHost=server1:2188,server2:2188,server3:2188

The solr.xml contains zkClientTimeout="6" and zoo.cfg defines a ticktime of 
3000 ms.

The Solr servers on which we are facing some problems contain old collections 
and old cores created for some tests.



Could you give me some pointers?
Is this a problem in our Solr or ZooKeeper config?
How could we detect network problems?
Is there a problem with the JVM parameters? Should we analyse some garbage
collection logs?

Thanks in advance.

Joel Gaspard


Re: long QTime for big index

2013-01-31 Thread Dmitry Kan
Does debugQuery=true tell anything useful for these? Like what is the
component taking most of the 30 seconds. Do you have evictions in your solr
caches?

Dmitry

On Thu, Jan 31, 2013 at 10:01 AM, Mou  wrote:

> I am running solr 3.4 on tomcat 7.
>
> Our index is very big , two cores each 120G. We are searching the slaves
> which are replicated every 30 min.
>  I am using filtercache only and We have more than 90% cache hits. We use
> lot of filter queries, queries are usually pretty big with 10-20 fq
> parameters. Not all filters are cached.
>
> we are searching three shards and query looks like this --
> shards=core1,core2,core3&q=*:* &fq=field1:some value&fq = -field2=some
> value&sort=date
> But some queries are taking more than 30 sec to return result and the
> behavior is intermittent. I can not find relation to replication. We are
> using Zing jvm which reduced our GC pause to milli secs, so GC is not a
> problem.
>
> How can I improve the qtime? Is it at all possible to get a better qtime
> given our index size?
>
> Thank you for your suggestion.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/long-QTime-for-big-index-tp4037635.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Question on Facet field constraints sort order

2013-01-31 Thread vijeshnair
It could be a foolish question or concern, but I have no option :-). We have an
e-commerce site where we consume the feed from our CSE partners and index it
into Solr for our search. Instead of the traditional auto-suggest, the
predictive search in the header search box recommends the categories (category
facet) in which it found matches for the given keyword. With this approach, a
search like "apple iphone" will yield more results for "cell phone accessories"
than for "cell phone", so in the drop-down "cell phone accessories" will come
first and then "cell phone". That is quite natural and works as expected, since
we use the default facet constraint sorting by "count".

Today my boss (the tech director) asked me to tweak this order: the business
team will prioritize all 1300 categories currently in my taxonomy in some
order, and my category facet constraints should then be ordered according to
the list they provide to us. He told me this is possible in Oracle Endeca,
where he showed me how to change the order of categories and do that kind of
customization, and asked me to check whether Solr supports it. Though my answer
was no, he proposed to handle it in code otherwise, i.e. change the order on
the client side. So the intention of writing this is to check whether any such
option is available in Solr. I understand the two types of sorting that are
available, i.e. count and index; is there something beyond that, where I can
alter the order using an external list or something like it? Any help will be
appreciated.





Solr4.1 changing result order FIFO to LIFO

2013-01-31 Thread Bernd Fehling
Hi list,

I noticed that the result order is FIFO if documents have the same score.
I think this is due to the fact that documents which are indexed later get a
higher internal document ID, and the output for documents with the same score
starts with the lowest internal document ID and goes up.
Is this right so far?

I would prefer LIFO output. Documents with the same score but indexed later
are "newer" (at least for my data) and should be displayed first.

Sure, I could use sorting, but sorting is always time consuming.
LIFO output, by contrast, would just start with the highest internal document
ID for documents with the same score.

Is there anything like this already available?

If not, any hint where to look at (Lucene or Solr)?
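
One thing that might be worth a try (hedged: I have not verified that the
_docid_ pseudo-field can be combined with score like this in 3.x/4.x) is adding
the internal document ID as a secondary sort:

sort=score desc,_docid_ desc

which would keep relevance ordering and only break ties by descending internal
ID, without sorting on a real field.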

Regards
Bernd


Re: searching for an id

2013-01-31 Thread Chandan Tamrakar
Which analyzer are you using to index that field? You can verify that
from the schema file.

thanks


On Thu, Jan 31, 2013 at 2:35 PM, b.riez...@pixel-ink.de <
b.riez...@pixel-ink.de> wrote:

> Hi
>
> I have an id wich is a string like this.
> tx-20130130-4599
>
> i'm using a field without processing, wich i got confirmed via the
> analyser tool
> But when i search for that it got split up, so instead of finding that
> specific entry with that unique id,
> it finds all entries with "tx" in it.
>
> Any idea how to get rid of that behavior?
>
> Best
> Ben
>
>


-- 
Chandan Tamrakar
*
*


Thoughts on production deployment?

2013-01-31 Thread Scott Stults
Part of this is a rant, part is a plea to others who've run successful 
production deployments.

Solr is a second-class citizen when it comes to production deployment. Every 
recipe I've seen (RPM, DEB, chef, or puppet) makes assumptions that in one way 
or another run afoul of best-practices when it comes to production use. And if 
you're not using one of these recipe formats to deploy Solr you're building a 
SnowflakeServer (Martin Fowler's term).

Granted, Solr _can_ be deployed into any vanilla JEE container, so the 
deployment spec responsibility may be erroneously assigned to whichever you 
choose. BUT, if you want to get the maximum out of Solr you'll want to put it 
on its own box, running in its own tuned container, and that container should 
be the one that Solr's been tested on repeatedly by an army of build bots. 
Right now that blessed container is Jetty version 8.1.2.v20120308.

So the first problem with the recipes is that they declare a generic dependency on 
Jetty or Tomcat. The assumption there is that either can be treated as a 
generic OS facility to be shared with other apps. That's not true because Solr 
is the driving force behind which version is deployed. The container can't be 
up- or downgraded without affecting Solr, and any other app running in there 
needs to be aware that Solr is taking first priority.

The next problem is that most recipes don't make a distinction between 
collections. "Solr" configuration goes in one folder, "Solr" data goes in 
another, and the logs and container stuff gets scattered likewise. In reality, 
every collection can be configured differently and there is no generic "Solr" 
data. 

Lastly, the package maintainers of all the major OS distributions have ignored 
Solr since around version 1.4. That means if you want a newer version you're 
going to download a tarball and make another snowflake. This might be 
attributable to thinking of Solr as just another web app that doesn't need 
special packaging. Regardless, the consequence is that the only people who are 
deploying Solr according to best-practices are those intimately familiar with 
Solr.

So what's the best way to fix this situation? Solr already ships with 
everything it needs except Java and a start-up script. Maybe the first step is 
to include a generic "install.sh" script that has a couple distro-specific 
support scripts. That would be fairly agnostic toward package management 
systems and it would be useful to sysadmins right away. It would also help 
package maintainers update their build specs.

What do _you_ think? 


-Scott

RE: Solr load balancer

2013-01-31 Thread Phil Hoy
Hi,

So am I correct in thinking that I add the JIRA issue myself, and if so, can I add it to 
the 4.2 release? Also, I have further questions about the scope of my patch; 
should those be left to the comments of the JIRA issue itself?

Phil

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: 22 January 2013 17:25
To: solr-user@lucene.apache.org
Subject: Re: Solr load balancer

Hi Phil,

Have a look at http://wiki.apache.org/solr/HowToContribute and thank you in 
advance! :)

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Fri, Jan 18, 2013 at 5:41 AM, Phil Hoy  wrote:

> Hi,
>
> I would like to experiment with some custom load balancers to help 
> with query latency in the face of long gc pauses and the odd 
> time-consuming query that we need to be able to support. At the moment 
> setting the socket timeout via the HttpShardHandlerFactory does help, 
> but of course it can only be set to a length of time as long as the 
> most time consuming query we are likely to receive.
>
> For example perhaps a load balancer that sends multiple queries 
> concurrently to all/some replicas and only keeps the first response 
> might be effective. Or maybe a load balancer which takes account of 
> the frequency of timeouts would be able to recognize zombies more effectively.
>
> To use alternative load balancer implementations cleanly and without 
> having to hack solr directly, I would need to be able to make the 
> existing LBHttpSolrServer and HttpShardHandlerFactory more amenable to 
> extension, I can then override the default load balancer using solr's plugin 
> mechanism.
>
> So my question is, if I made a patch to make the load balancer more 
> pluggable, is this something that would be acceptable and if so what 
> do I do next?
>
> Phil
>
> __
> "brightsolid" is used in this email to collectively mean brightsolid 
> online innovation limited and its subsidiary companies brightsolid 
> online publishing limited and brightsolid online technology limited.
> findmypast.co.uk is a brand of brightsolid online publishing limited.
> brightsolid online innovation limited, Gateway House, Luna Place, 
> Dundee Technology Park, Dundee DD2 1TP.  Registered in Scotland No. SC274983.
> brightsolid online publishing limited, The Glebe, 6 Chapel Place, 
> Rivington Street, London EC2A 3DQ. Registered in England No. 04369607.
> brightsolid online technology limited, Gateway House, Luna Place, 
> Dundee Technology Park, Dundee DD2 1TP.  Registered in Scotland No. SC161678.
>
> Email Disclaimer
>
> This message is confidential and may contain privileged information. 
> You should not disclose its contents to any other person. If you are 
> not the intended recipient, please notify the sender named above 
> immediately. It is expressly declared that this e-mail does not 
> constitute nor form part of a contract or unilateral obligation. 
> Opinions, conclusions and other information in this message that do 
> not relate to the official business of brightsolid shall be understood as 
> neither given nor endorsed by it.
> __
> This email has been scanned by the brightsolid Email Security System.
> Powered by MessageLabs
> __



solr atomic update

2013-01-31 Thread Marcos Mendez
Is there a way to do an atomic update (inc by 1) and retrieve the updated value 
in one operation?
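
For the increment itself, Solr 4.x atomic updates use the "inc" modifier; a
minimal sketch in XML update syntax (field names are hypothetical):

<add>
  <doc>
    <field name="id">doc1</field>
    <field name="views" update="inc">1</field>
  </doc>
</add>

Reading the new value back still takes a follow-up query (e.g. q=id:doc1&fl=views);
whether both can be collapsed into a single round trip is exactly the question here.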

Re: Can I start solr with replication activated but disabled between master and slave

2013-01-31 Thread Erick Erickson
You can also do all this via HTTP commands, see:
http://wiki.apache.org/solr/SolrReplication#HTTP_API

that allows you to control _all_ replication from the master (i.e. tell the
master "don't do any replication") or just tell a slave "don't replicate
any more", as well as a lot of other stuff.

Best
Erick


On Wed, Jan 30, 2013 at 11:58 AM, Arcadius Ahouansou
wrote:

> As stated by Robi, you can through the admin UI:
>
> -disable replication on the master through the admin or
>
> -disable polling on the slave through the admin UI. Disabling polling on
> the slaves is very handy if you are doing stuff on the master that requires
> a master restart.
>
> Thanks.
>
> Arcadius.
>
>
>
>
>
> On 30 January 2013 16:35, Petersen, Robert  wrote:
>
> > Hi Jamel,
> >
> > You can start solr slaves with them pointed at a master and then turn off
> > replication in the admin replication page.
> >
> > Hope that helps,
> > -Robi
> >
> > Robert (Robi) Petersen
> > Senior Software Engineer
> > Search Department
> >
> >
> >
> >
> > -Original Message-
> > From: Jamel ESSOUSSI [mailto:jamel.essou...@gmail.com]
> > Sent: Wednesday, January 30, 2013 2:45 AM
> > To: solr-user@lucene.apache.org
> > Subject: Can I start solr with replication activated but disabled between
> > master and slave
> >
> > Hello,
> >
> > I would like to start solr with the following configuration;
> >
> > Replication between master and slave activated but not enabled.
> >
> > Regards
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Can-I-start-solr-with-replication-activated-but-disabled-between-master-and-slave-tp4037333.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
> >
> >
>


Re: Indexing problems

2013-01-31 Thread Erick Erickson
I'm really surprised you're hitting OOM errors, I suspect you have
something else pathological in your system. So, I'd start checking things
like
- how many concurrent warming searchers you allow
- How big your indexing RAM is set to (we find very little gain over 128M
BTW).
- Other load on your Solr server. Are you, for instance, searching on it
too?
- what your autocommit characteristics are (think about autocommitting
fairly often with openSearcher=false).
- have you defined huge caches?
- .

How big are these documents anyway? With 12G of ram, they'd have to be
absolutely _huge_ to matter much.

Multiple collections should work fine in ZK. I really think you have some
innocent-looking configuration setting that's bollixing you up; this is not
expected behavior.

If at all possible, I'd also go with 4.1. I don't really think it's
relevant to your situation, but there have been a lot of improvements in
the code

Best
Erick


Indexing nouns only - UIMA vs. OpenNLP

2013-01-31 Thread Kai Gülzau
Hi,

I am stuck trying to index only the nouns of German and English texts.
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example)


First try was to use UIMA with the HMMTagger:


  

/org/apache/uima/desc/AggregateSentenceAE.xml
false

  false
  albody


  
org.apache.uima.SentenceAnnotation

  coveredText
  albody2

  
   
  


- But how do I set the ModelFile to use the German corpus?
- What about language identification?
-- How do I use the right corpus/tagger based on the language?
-- Should this be done in UIMA (how?) or via the solr contrib/langid field mapping?
- How do I remove non-nouns from the annotated field?


My second try is to use OpenNLP and to apply the patch 
https://issues.apache.org/jira/browse/LUCENE-2899
But the patch seems to be a bit out of date.
Currently I am trying to get it to work with Solr 4.1.


Any pointers appreciated :-)

Regards,

Kai Gülzau



Re: Possible issue in edismax?

2013-01-31 Thread Felipe Lahti
So, it depends on your business requirements, right? If a document has
matches in more searchable fields then, at least for me, that document is more
important than another document that has fewer matches.

Example:
Put this in your schema:


And create a class in your classpath of your Solr:

package com.your.namespace;

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class NoIDFSimilarity extends DefaultSimilarity {

    @Override
    public float idf(long docFreq, long numDocs) {
        return 1;
    }
}


It will "neutralize" the idf (which is the rarity of term).






On Thu, Jan 31, 2013 at 5:31 AM, Sandeep Mestry  wrote:

> Thanks Felipe..
> Can you point me an example please?
>
> Also forgive me but if a document has matches in more searchable fields
> then should it not rank higher?
>
> Thanks,
> Sandeep
> On 30 Jan 2013 19:30, "Felipe Lahti"  wrote:
>
> > If you compare the first and last document scores you will see that the
> > last one matches more fields than first one. So, you maybe thinking why?
> > The first doc only matches "contributions" field and the last matches a
> > bunch of fields so if you want to have it behave more like (<str
> > name="qf">series_title^500 title^100 description^15 contribution</str>) you
> > have to override the method of DefaultSimilarity.
> >
> >
> > On Wed, Jan 30, 2013 at 4:12 PM, Sandeep Mestry 
> > wrote:
> >
> > > I have pasted it below and it is slightly variant from the dismax
> > > configuration I have mentioned above as I was playing with all sorts of
> > > boost values, however it looks more lie below:
> > >
> > > 
> > > 2675.7844 = (MATCH) sum of: 2675.7844 = (MATCH) max plus 0.01 times
> > others
> > > of: 2675.7844 = (MATCH) weight(contributions:news in 63298)
> > > [DefaultSimilarity], result of: 2675.7844 = score(doc=63298,freq=1.0 =
> > > termFreq=1.0 ), product of: 0.004495774 = queryWeight, product of:
> > > 14.530705 = idf(docFreq=14, maxDocs=11282414) 3.093982E-4 = queryNorm
> > > 595177.7 = fieldWeight in 63298, product of: 1.0 = tf(freq=1.0), with
> > freq
> > > of: 1.0 = termFreq=1.0 14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > 40960.0 = fieldNorm(doc=63298)
> > > 
> > > 
> > > 2317.297 = (MATCH) sum of: 2317.297 = (MATCH) max plus 0.01 times
> others
> > > of: 2317.297 = (MATCH) weight(contributions:news in 9826415)
> > > [DefaultSimilarity], result of: 2317.297 = score(doc=9826415,freq=3.0 =
> > > termFreq=3.0 ), product of: 0.004495774 = queryWeight, product of:
> > > 14.530705 = idf(docFreq=14, maxDocs=11282414) 3.093982E-4 = queryNorm
> > > 515439.0 = fieldWeight in 9826415, product of: 1.7320508 =
> tf(freq=3.0),
> > > with freq of: 3.0 = termFreq=3.0 14.530705 = idf(docFreq=14,
> > > maxDocs=11282414) 20480.0 = fieldNorm(doc=9826415)
> > > 
> > > 
> > > 2140.6274 = (MATCH) sum of: 2140.6274 = (MATCH) max plus 0.01 times
> > others
> > > of: 2140.6274 = (MATCH) weight(contributions:news in 9882325)
> > > [DefaultSimilarity], result of: 2140.6274 = score(doc=9882325,freq=1.0
> =
> > > termFreq=1.0 ), product of: 0.004495774 = queryWeight, product of:
> > > 14.530705 = idf(docFreq=14, maxDocs=11282414) 3.093982E-4 = queryNorm
> > > 476142.16 = fieldWeight in 9882325, product of: 1.0 = tf(freq=1.0),
> with
> > > freq of: 1.0 = termFreq=1.0 14.530705 = idf(docFreq=14,
> maxDocs=11282414)
> > > 32768.0 = fieldNorm(doc=9882325)
> > > 
> > > 
> > > 1605.4707 = (MATCH) sum of: 1605.4707 = (MATCH) max plus 0.01 times
> > others
> > > of: 1605.4707 = (MATCH) weight(contributions:news in 220007)
> > > [DefaultSimilarity], result of: 1605.4707 = score(doc=220007,freq=1.0 =
> > > termFreq=1.0 ), product of: 0.004495774 = queryWeight, product of:
> > > 14.530705 = idf(docFreq=14, maxDocs=11282414) 3.093982E-4 = queryNorm
> > > 357106.62 = fieldWeight in 220007, product of: 1.0 = tf(freq=1.0), with
> > > freq of: 1.0 = termFreq=1.0 14.530705 = idf(docFreq=14,
> maxDocs=11282414)
> > > 24576.0 = fieldNorm(doc=220007)
> > > 
> > > 
> > > 1605.4707 = (MATCH) sum of: 1605.4707 = (MATCH) max plus 0.01 times
> > others
> > > of: 1605.4707 = (MATCH) weight(contributions:news in 241151)
> > > [DefaultSimilarity], result of: 1605.4707 = score(doc=241151,freq=1.0 =
> > > termFreq=1.0 ), product of: 0.004495774 = queryWeight, product of:
> > > 14.530705 = idf(docFreq=14, maxDocs=11282414) 3.093982E-4 = queryNorm
> > > 357106.62 = fieldWeight in 241151, product of: 1.0 = tf(freq=1.0), with
> > > freq of: 1.0 = termFreq=1.0 14.530705 = idf(docFreq=14,
> maxDocs=11282414)
> > > 24576.0 = fieldNorm(doc=241151)
> > > 
> > > 
> > > id:c208c2b4-1b3e-27b8-e040-a8c00409063a
> > > 
> > >  
> > > 6.5742764 = (MATCH) sum of: 6.5742764 = (MATCH) max plus 0.01 times
> > others
> > > of: 3.304414 = (MATCH) weight(description:news^25.0 in 967895)
> > > [DefaultSimilarity], result of: 3.304414 = score(doc=967895,freq=1.0 =
> > > termFreq=1.0 ), product of: 0.042727955 = queryWeight, product of:
> 25.0 =
> > > boost 5.5240083 = idf(docFreq=122362, maxDocs=112

Re: setting up master and slave in same machine with diff ip's and same port

2013-01-31 Thread epnRui
Hi,

I solved the issue by setting up two different virtual network adapters in
ubuntu server.

case closed ;)


thanks for the help!!





Stopping solr

2013-01-31 Thread epnRui
Hi people,


First of all, this forum is a godsend!!!

Second:

I have a master / slave configuration, using replication.

Currently in production I have only one server, there's no backup server
(really...).
The webapplication is a public webapplication, everyone can see it.

 - How often, in your experience, and why, would solr crash?
 - If I kill solr master and slave, usually do I need to also delete the
indexes? Or everything should be fine upon restarting?
 - If I want to upgrade solr master and slave, or patch them, is there a way
that the services feeding from them will not fail? Solr in my application is
being used for indexing social network feeds, like Facebook posts... what
I'm trying to achieve is that the user keeps seeing the webpage working
normally (of course, with old index data from Solr) in case Solr crashes.
Maybe I can set up a backup Solr slave as a backup system?

I know these are "innocent" questions, but I am learning sys admin;
apparently my IT department thinks I'm the "do it all" guy and IT people
need to both develop and do sys admin. If I told you where I work you would
fall off your chair.


Best regards,
Rui





Re: Thoughts on production deployment?

2013-01-31 Thread Michael Della Bitta
On Thu, Jan 31, 2013 at 5:13 AM, Scott Stults
 wrote:
> Right now that blessed container is Jetty version 8.1.2.v20120308.

I'd really like some confirmation from the devs that there really is a
blessed status for a given container that provides advantages over
others. From what I understand, Jetty's considered one option out of
many, and isn't considered to be head and shoulders above any other.

We have a Chef regime here, and I've written Tomcat and Solr recipes
to be played against Ubuntu 12.04 Server. I chose Tomcat mostly
because I have the most experience administrating and configuring it,
and I would assume familiarity with operations would be a pretty
important factor.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


RE: Indexing problems

2013-01-31 Thread GASPARD Joel
Hello Erick,

Thanks for your answer.

After reading previous subjects on the user list, we had already tried to 
change the parameters we mentioned.

- concurrent warming searchers: we have set the maxWarmingSearchers attribute
to 2:
<maxWarmingSearchers>2</maxWarmingSearchers>

- we have tried 32 and 64 for the ramBufferSizeMB attribute

- there is no other load on the Solr server, or search when we index

- the autocommit is defined with openSearcher=false, maxTime=60ms, 
maxDocs=6000 - the autoSoftCommit is defined with maxTime=1000
We have already tried to change the soft commit and the commit parameters in 
several ways. We have also tried to commit on the client side.
OK, I will try to commit more often.

- we have used cache sizes defined in the example : size=512

The document size is not too big, I think: 1 million documents produce a 6 GB 
index.

Thanks for your answer on multiple collections. I thought multiple collections 
had to share the same schema in ZK after reading a wiki page: 
http://wiki.apache.org/solr/NewSolrCloudDesign : "The entire cluster must have 
a single schema and solrconfig"
Maybe this page is deprecated?
I also assumed that because the OOM errors occur only when we index a second 
collection. There is no problem when indexing a single collection.

Going with 4.1 would not be easy for now... We'll think about it.

Thanks.

Joel


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, 31 January 2013 14:00
To: solr-user@lucene.apache.org
Subject: Re: Indexing problems

I'm really surprised you're hitting OOM errors, I suspect you have something 
else pathological in your system. So, I'd start checking things like
- how many concurrent warming searchers you allow
- How big your indexing RAM is set to (we find very little gain over 128M BTW).
- Other load on your Solr server. Are you, for instance, searching on it too?
- what your autocommit characterstics are (think about autocommitting fairly 
often with openSearcher=false).
- have you defined huge caches?
- .

How big are these documents anyway? With 12G of ram, they'd have to be 
absolutely _huge_ to matter much.

Multiple collections should work fine in ZK. I really think you have some 
innocent-looking configuration setting thats bollixing you up, this is not 
expected behavior.

If at all possible, I'd also go with 4.1. I don't really think it's relevant to 
your situation, but there have been a lot of improvements in the code

Best
Erick


Re: Stopping solr

2013-01-31 Thread Michael Della Bitta
>  - How often, in your experience, and why, would solr crash?

Not very often. Typically if your heap is too small, you'll end up going OOM.

>  - If I kill solr master and slave, usually do I need to also delete the
> indexes? Or everything should be fine upon restarting?

Restarts are fine. Order shouldn't matter.

>  - If I want to upgrade solr master and slave, or patch them, is there a way
> that the services feeding from them will not fail?

You'd need at least a load balanced pair of servers serving results to
your application. In theory, if you have enough RAM, you could run
them on the same machine, although you'd lose some redundancy that
way.

I guess another way is to borrow and  temporarily cut over to another
system and then cut back, but I'd really recommend having two full
time systems if you want to preserve uptime overall.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


Re: Possible issue in edismax?

2013-01-31 Thread Sandeep Mestry
Fantastic! Thanks very much.. I will do so accordingly and will let you
know the results.

Thanks again,
Sandeep


On 31 January 2013 13:54, Felipe Lahti  wrote:

> So, it depends of your business requirement, right? If a document has
> matches in more searchable fields, at least for me, this document is more
> important than other document that has less matches.
>
> Example:
> Put this in your schema:
> 
>
> And create a class in your classpath of your Solr:
>
> package com.your.namespace;
>
> import org.apache.lucene.search.similarities.DefaultSimilarity;
>
> public class NoIDFSimilarity extends DefaultSimilarity {
>
> @Override
>
> public float idf(long docFreq, long numDocs) {
>
> return 1;
>
> }
>
> }
>
>
> It will "neutralize" the idf (which is the rarity of term).
>
>
>
>
>
>
> On Thu, Jan 31, 2013 at 5:31 AM, Sandeep Mestry 
> wrote:
>
> > Thanks Felipe..
> > Can you point me an example please?
> >
> > Also forgive me but if a document has matches in more searchable fields
> > then should it not rank higher?
> >
> > Thanks,
> > Sandeep
> > On 30 Jan 2013 19:30, "Felipe Lahti"  wrote:
> >
> > > If you compare the first and last document scores you will see that the
> > > last one matches more fields than first one. So, you maybe thinking
> why?
> > > The first doc only matches "contributions" field and the last matches a
> > > bunch of fields so if you want to  have behave more like ( > > name="qf">series_title^500 title^100 description^15 contribution)
> > you
> > > have to override the method of DefaultSimilarity.
> > >
> > >
> > > On Wed, Jan 30, 2013 at 4:12 PM, Sandeep Mestry 
> > > wrote:
> > >
> > > > I have pasted it below and it is slightly variant from the dismax
> > > > configuration I have mentioned above as I was playing with all sorts
> of
> > > > boost values, however it looks more lie below:
> > > >
> > > > 
> > > > 2675.7844 = (MATCH) sum of: 2675.7844 = (MATCH) max plus 0.01 times
> > > others
> > > > of: 2675.7844 = (MATCH) weight(contributions:news in 63298)
> > > > [DefaultSimilarity], result of: 2675.7844 = score(doc=63298,freq=1.0
> =
> > > > termFreq=1.0 ), product of: 0.004495774 = queryWeight, product of:
> > > > 14.530705 = idf(docFreq=14, maxDocs=11282414) 3.093982E-4 = queryNorm
> > > > 595177.7 = fieldWeight in 63298, product of: 1.0 = tf(freq=1.0), with
> > > freq
> > > > of: 1.0 = termFreq=1.0 14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > > 40960.0 = fieldNorm(doc=63298)
> > > > 
> > > > 
> > > > 2317.297 = (MATCH) sum of: 2317.297 = (MATCH) max plus 0.01 times
> > others
> > > > of: 2317.297 = (MATCH) weight(contributions:news in 9826415)
> > > > [DefaultSimilarity], result of: 2317.297 =
> score(doc=9826415,freq=3.0 =
> > > > termFreq=3.0 ), product of: 0.004495774 = queryWeight, product of:
> > > > 14.530705 = idf(docFreq=14, maxDocs=11282414) 3.093982E-4 = queryNorm
> > > > 515439.0 = fieldWeight in 9826415, product of: 1.7320508 =
> > tf(freq=3.0),
> > > > with freq of: 3.0 = termFreq=3.0 14.530705 = idf(docFreq=14,
> > > > maxDocs=11282414) 20480.0 = fieldNorm(doc=9826415)
> > > > 
> > > > 
> > > > 2140.6274 = (MATCH) sum of: 2140.6274 = (MATCH) max plus 0.01 times
> > > others
> > > > of: 2140.6274 = (MATCH) weight(contributions:news in 9882325)
> > > > [DefaultSimilarity], result of: 2140.6274 =
> score(doc=9882325,freq=1.0
> > =
> > > > termFreq=1.0 ), product of: 0.004495774 = queryWeight, product of:
> > > > 14.530705 = idf(docFreq=14, maxDocs=11282414) 3.093982E-4 = queryNorm
> > > > 476142.16 = fieldWeight in 9882325, product of: 1.0 = tf(freq=1.0),
> > with
> > > > freq of: 1.0 = termFreq=1.0 14.530705 = idf(docFreq=14,
> > maxDocs=11282414)
> > > > 32768.0 = fieldNorm(doc=9882325)
> > > > 
> > > > 
> > > > 1605.4707 = (MATCH) sum of: 1605.4707 = (MATCH) max plus 0.01 times
> > > others
> > > > of: 1605.4707 = (MATCH) weight(contributions:news in 220007)
> > > > [DefaultSimilarity], result of: 1605.4707 =
> score(doc=220007,freq=1.0 =
> > > > termFreq=1.0 ), product of: 0.004495774 = queryWeight, product of:
> > > > 14.530705 = idf(docFreq=14, maxDocs=11282414) 3.093982E-4 = queryNorm
> > > > 357106.62 = fieldWeight in 220007, product of: 1.0 = tf(freq=1.0),
> with
> > > > freq of: 1.0 = termFreq=1.0 14.530705 = idf(docFreq=14,
> > maxDocs=11282414)
> > > > 24576.0 = fieldNorm(doc=220007)
> > > > 
> > > > 
> > > > 1605.4707 = (MATCH) sum of: 1605.4707 = (MATCH) max plus 0.01 times
> > > others
> > > > of: 1605.4707 = (MATCH) weight(contributions:news in 241151)
> > > > [DefaultSimilarity], result of: 1605.4707 =
> score(doc=241151,freq=1.0 =
> > > > termFreq=1.0 ), product of: 0.004495774 = queryWeight, product of:
> > > > 14.530705 = idf(docFreq=14, maxDocs=11282414) 3.093982E-4 = queryNorm
> > > > 357106.62 = fieldWeight in 241151, product of: 1.0 = tf(freq=1.0),
> with
> > > > freq of: 1.0 = termFreq=1.0 14.530705 = idf(docFreq=14,
> > maxDocs=11282414)
> > > > 24576.0 = fieldNorm(doc=241151)
> > > > 
> > > > 
> > > > id:c208

Re: long QTime for big index

2013-01-31 Thread Mou
Thanks for your reply.

No, there is no eviction, yet.

The time is spent mostly on org.apache.solr.handler.component.QueryComponent
to process the request.

Again, the time varies widely for same query.





Re: searching for an id

2013-01-31 Thread Alexandre Rafalovitch
Are you using eDismax? Maybe your ID field is not part of the search fields
or not a high priority. And, just maybe, you are doing a copyField * to
text and the text splits the ID into parts. Enable the debug on your query
and you should be able to figure it out.
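
A hedged sketch of the schema side when the id must stay a single token (field
and type names are assumptions):

<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true"/>

Then search the field explicitly, e.g. q=id:"tx-20130130-4599", rather than
letting a catch-all text field's analyzer split the value on the hyphens.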

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Thu, Jan 31, 2013 at 3:50 AM, b.riez...@pixel-ink.de <
b.riez...@pixel-ink.de> wrote:

> Hi
>
> I have an id wich is a string like this.
> tx-20130130-4599
>
> i'm using a field without processing, wich i got confirmed via the
> analyser tool
> But when i search for that it got split up, so instead of finding that
> specific entry with that unique id,
> it finds all entries with "tx" in it.
>
> Any idea how to get rid of that behavior?
>
> Best
> Ben
>
>


Re: help to build query

2013-01-31 Thread Abhishek tiwari
Jack, thanks for your response.

We have a deals web application with free text search in it. Here "free text"
means you can type anything into it.

We have deals in different categories, tagged at different
merchant locations.
As per the requirements, I have to make some tweaks to the search.

For example, a user can search for deals like:

a) cat1 in location1, location2 (e.g. spa in Malviya Nagar, Ashok Vihar --
here spa = cat1, location1 = Malviya Nagar, location2 = Ashok Vihar)
b) cat1 and cat2 in location1
c) cat1 in location1 and location2

I hope that explains it better (see the rough query sketch below).
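
As a rough illustration (the field names category and locality are assumptions
about the schema), case (a) could end up as something like:

q=category:"spa" AND (locality:"malviya nagar" OR locality:"ashok vihar")

The hard part, as listed in the original challenges, is recognising which words
of the free text are categories and which are localities before such a query
can be built.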




On Wed, Jan 30, 2013 at 9:06 PM, Jack Krupansky wrote:

> Start by expressing the specific semantics of those queries in strict
> boolean form. I mean, what exactly do you mean by "in", and "location1,
> location 2", and "location1, loc2 and loc3? Is the latter an AND or an OR?
>
> Or at least fully express those two queries, unambiguously in plain
> English. There is too much ambiguity present to give you any solid
> direction.
>
> -- Jack Krupansky
>
> -Original Message- From: Abhishek tiwari
> Sent: Wednesday, January 30, 2013 12:55 AM
> To: solr-user@lucene.apache.org
> Subject: help to build query
>
>
> want to execute queries like :
> a)  cat in location1 , location2
> b)  cat 1 and cat2 in location1 ,loc2 and  loc3
>
> in our search .
>
> our challenges :
>
> 1)  picking right keywords(category and locality) from query entered.
> 2)  its mapping to relevant entity
>
> How should i proceed for it .
>
> we have localities and categories data indexed .
>
> thanks in advance.
>
> ~abhishek
>


Re: Thoughts on production deployment?

2013-01-31 Thread Paul Jungwirth
>
> We have a Chef regime here, and I've written Tomcat and Solr recipes
> to be played against Ubuntu 12.04 Server.


We do mostly the same: chef to install Tomcat (with configuration
appropriate to Solr), but then instead of deploying Solr via chef, we use
an ant script to package and deploy a war that includes Solr + some custom
Lucene extensions, then also deploy our {schema,solrconfig}.xml files. This
is a little easier for us than doing everything in chef, since we can more
easily push updates to our custom extensions.

I'll also note that our current process mirrors what we do for the
front-end app server (written in Ruby). We use chef to set up the box, then
Capistrano to deploy the app. We push app updates several times a week, but
rarely need to run chef after the initial setup.

But I'd love to know if there is an easier way to do it.

Paul

-- 
_
Pulchritudo splendor veritatis.


RE: Indexing nouns only - UIMA vs. OpenNLP

2013-01-31 Thread Kai Gülzau
UIMA:

I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted 
token types :-)


  


  


Open issue -> How to set the ModelFile for the Tagger to 
"german/TuebaModel.dat" ???



OpenNLP:

And a modified patch for https://issues.apache.org/jira/browse/LUCENE-2899 is 
now working
with solr 4.1. :-)


  

  
  
  
  




Any hints on which lib is more accurate on noun tagging?
Any performance or memory issues (some OOM here while testing with 1GB via 
Analyzer Admin GUI)?


Regards,

Kai Gülzau




-Original Message-
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Thursday, January 31, 2013 2:19 PM
To: solr-user@lucene.apache.org
Subject: Indexing nouns only - UIMA vs. OpenNLP

Hi,

I am stuck trying to index only the nouns of german and english texts.
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example)


First try was to use UIMA with the HMMTagger:


  

/org/apache/uima/desc/AggregateSentenceAE.xml
false

  false
  albody


  
org.apache.uima.SentenceAnnotation

  coveredText
  albody2

  
   
  


- But how do I set the ModelFile to use the german corpus?
- What about language identification?
-- How do I use the right corpus/tagger based on the language?
-- Should this be done in UIMA (how?) or via solr contrib/langid field mapping?
- How to remove non nouns in the annotated field?


Second try is to use OpenNLP and to apply the patch 
https://issues.apache.org/jira/browse/LUCENE-2899
But the patch seems to be a bit out of date.
Currently I try to get it to work with solr 4.1.


Any pointers appreciated :-)

Regards,

Kai Gülzau



RE: field space consumption - stored vs not stored

2013-01-31 Thread Petersen, Robert
Thanks Shawn.  Actually now that I think about it,  Yonik also mentioned 
something about lucene number representation once in reply to one of my 
questions.  Here it is:
Could you also tell me what these `#8;#0;#0;#0;#1; strings represent in the 
debug output?

"That's internally how a number is encoded into a string (5 bytes, the first 
being binary 8, the next 0, etc.)  This is not representable in XML as &#0; is 
illegal, hence we leave off the '&' so it's not a true character entity.  
-Yonik"

Hey I followed your link, and it had a link to this talk.  Did you see this 
example?
http://lucene.sourceforge.net/talks/pisa/

VInt Encoding Example:

Value     First byte   Second byte   Third byte
0         00000000
1         00000001
2         00000010
...
127       01111111
128       10000000     00000001
129       10000001     00000001
130       10000010     00000001
...
16,383    11111111     01111111
16,384    10000000     10000000      00000001
16,385    10000001     10000000      00000001
...



-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Wednesday, January 30, 2013 5:28 PM
Cc: solr-user@lucene.apache.org
Subject: Re: field space consumption - stored vs not stored

On 1/30/2013 6:24 PM, Shawn Heisey wrote:
> If I had to guess about the extra space required for storing an int 
> field, I would say it's in the neighborhood of 20 bytes per document, 
> perhaps less.  I am also interested in a definitive answer.

The answer is very likely less than 20 bytes per doc.  I was assuming a larger 
size for VInt than it is likely to use.  See the answer for this
question:

http://stackoverflow.com/questions/2752612/what-is-the-vint-in-lucene

Thanks,
Shawn





Search match all tokens in Query Text

2013-01-31 Thread Bing Hua
Hello,

I have a field text with type text_general here.















When I query for text:a b, solr returns results that contain only a but not
b. That is, it uses OR operator between the two tokens.

Am I right here? What should I do to force an AND operator between the two
tokens?

Thanks
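
The usual knobs for AND behaviour (a hedged sketch; pick the level where you
want the default to live):

q.op=AND                                  (per request, lucene/dismax query parsers)
<solrQueryParser defaultOperator="AND"/>  (in schema.xml, index-wide default)
mm=100%                                   (dismax/edismax minimum-should-match: require all clauses)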





Re: Search match all tokens in Query Text

2013-01-31 Thread Jack Krupansky

+text:a +b

-- Jack Krupansky

-Original Message- 
From: Bing Hua

Sent: Thursday, January 31, 2013 12:59 PM
To: solr-user@lucene.apache.org
Subject: Search match all tokens in Query Text

Hello,

I have a field text with type text_general here.















When I query for text:a b, solr returns results that contain only a but not
b. That is, it uses OR operator between the two tokens.

Am I right here? What should I do to force an AND operator between the two
tokens?

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-match-all-tokens-in-Query-Text-tp4037758.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Search match all tokens in Query Text

2013-01-31 Thread Bing Hua
Thanks for the quick reply. It seems like you are suggesting explicitly adding the
AND operator. I don't think this solves my problem.

I found it  somewhere, and this
works.







Re: long QTime for big index

2013-01-31 Thread Shawn Heisey

On 1/31/2013 1:01 AM, Mou wrote:

I am running solr 3.4 on tomcat 7.

Our index is very big , two cores each 120G. We are searching the slaves
which are replicated every 30 min.
  I am using filtercache only and We have more than 90% cache hits. We use
lot of filter queries, queries are usually pretty big with 10-20 fq
parameters. Not all filters are cached.

we are searching three shards and query looks like this --
shards=core1,core2,core3&q=*:* &fq=field1:some value&fq = -field2=some
value&sort=date
But some queries are taking more than 30 sec to return result and the
behavior is intermittent. I can not find relation to replication. We are
using Zing jvm which reduced our GC pause to milli secs, so GC is not a
problem.


Complex queries, especially on a distributed search, can be very slow. 
In my experience, uncached filters make things particularly slow.


Your first paragraph says you have two cores 120GB each, but then later 
you say you are using three cores in a shards parameter.  What's the 
true core situation?


If you have a total index size for this JVM of 240GB, then you may not 
have enough RAM to let the OS disk cache work efficiently.  For that 
size of index, I would plan on a system with at least 128GB of RAM, 
256GB would be better.  You have to have enough free memory (after the 
OS and programs including tomcat/solr) in the system to cache the 
critical pieces of your index.


Tomcat would probably need a heap size between 8GB and 24GB - it's 
impossible to give you the right heap size here, you'd just have to 
test.  I could be way off on that estimate, too.


To give you an idea of how to size memory based on a production Solr 
3.5.0 system with good performance, one of my solr servers has 70GB of 
total index data, of which 24GB is stored fields and 22GB is 
termvectors.  The OS disk cache has 41.8GB of data in it at the moment. 
 The system has 64GB of memory and Solr (Jetty) has a max heap size of 
8GB.  Based on observations, I could run with a heap size of 4GB during 
normal operation, but when I am doing a full database import on my 
indexes, it requires the 8GB heap.


I looked through the mailing list history to see what else you've said 
about your setup and what other help you've gotten.  In one of your 
other messages, you said that you have a 70GB heap size.  That is 
extremely large, and probably not necessary.  If you have found that it 
is necessary, then your overall Solr architecture may need further 
adjustment.


One of your earlier messages indicated that you are using SSD storage. 
Main system memory is quite a lot faster than an SSD, so the OS cache is 
still important.


Thanks,
Shawn



Re: long QTime for big index

2013-01-31 Thread Mou
Thank you Shawn for reading all of my previous entries and for a detailed
answer.

To clarify, the third shard is used to store the recently added/updated
data. Two main big cores take very long to replicate ( when a full
replication is required) so the third one helps us to return the newly
indexed documents quickly. It gets deleted every hour after we replicate the
two other cores with last hour's of new/changed data. This third core is
very small.

As you said, with that big index and distributed queries, searches were too
slow. So we tried to use the filterCache to speed up the queries. The filterCache
was big, as we have thousands of different filters. Other caches were not
very helpful, as queries are not repetitive and there is heavy add/update activity on
the index. So we had to use a bigger heap size. Now, with that big heap size,
GC pauses were horrible, so we moved to the Zing JVM. Zing is now using 134 G
of heap and does not have those big pauses, but it also does not leave much
memory for the OS.

I am now testing with small heap, small filter cache ( just the basic
filters) and lot of memory available for OS disk cache. If that does not
work, I am thinking of breaking my index down into small pieces.






DIH and splitBy

2013-01-31 Thread Christopher Condit
I'm having an issue getting the splitBy construct from the regex
transformer to work in a very basic case (with either Solr 3.6 or
4.1).

I have a field defined like this:


The entity is defined like this:

  


Here's a POM:
http://pastie.org/5992725

A JUnit test case showing the problem:
http://pastie.org/5993437

And a stackoverflow question with the same information:
http://stackoverflow.com/questions/14512055/splitting-database-column-into-multivalued-solr-field

Anyone have any ideas?
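
For reference, a minimal DIH setup of this shape (a hedged sketch with
hypothetical table, column and field names) would be:

<entity name="test" transformer="RegexTransformer"
        query="select id, tags from test">
  <field column="tags" splitBy=","/>
</entity>

with the receiving field declared multiValued in schema.xml:

<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>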

Thanks,
Chris


RE: DIH and splitBy

2013-01-31 Thread Dyer, James
In your unit test, you have:

"" +

And also:

runner.update("INSERT INTO test VALUES 1, 'foo,bar,baz'");

So you need to decide if you want to delimit with a pipe or a comma.

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Christopher Condit [mailto:con...@sdsc.edu] 
Sent: Thursday, January 31, 2013 2:03 PM
To: solr-user@lucene.apache.org
Subject: DIH and splitBy

I'm having an issue getting the splitBy construct from the regex
transformer to work in a very basic case (with either Solr 3.6 or
4.1).

I have a field defined like this:


The entity is defined like this:

  


Here's a POM:
http://pastie.org/5992725

A JUnit test case showing the problem:
http://pastie.org/5993437

And a stackoverflow question with the same information:
http://stackoverflow.com/questions/14512055/splitting-database-column-into-multivalued-solr-field

Anyone have any ideas?

Thanks,
Chris



Re: long QTime for big index

2013-01-31 Thread Shawn Heisey

On 1/31/2013 12:47 PM, Mou wrote:

To clarify, the third shard is used to store the recently added/updated
data. Two main big cores take very long to replicate ( when a full
replication is required) so the third one helps us to return the newly
indexed documents quickly. It gets deleted every hour after we replicate the
two other cores with last hour's of new/changed data. This third core is
very small.


I use this approach.  My entire index is 74 million documents, but all 
new data is added to a shard that only contains about 400K documents. 
The other six shards have over 12 million documents each and take up 
about 22GB of disk space.  It takes two servers to house one complete 
copy of my index.


Index updates happen once a minute.  Because most delete/reinsert 
activity happens on recently added content and all new content gets 
added only to the small shard, the large shards can run for many minutes 
without seeing commits.



As you said, with that big index and distributed queries , searches were too
slow.So we tried to use the filtercache to speed up the queries. Filtercache
was big as we have thousands of different filters. other caches were not
very helpful as queries are not repetitive and there is heavy add/update to
the index. So we have to use bigger heap size. Now,with that big heap size
GC pauses was horrible, so we moved to Zing jvm. Zing jvm is now using 134 G
of heap and does not have those big pauses but it also does not leave much
memory for OS.

I am now testing with small heap, small filter cache ( just the basic
filters) and lot of memory available for OS disk cache. If that does not
work, I am thinking of breaking my index down into small pieces.


I hope it works for you!  With this approach, the first queries will 
probably still be pretty slow, but as the data gets cached, things 
should speed up.


You can pre-cache the important parts of your index with a command like 
the following in the index directory.


cat `ls | egrep -v "(\.fd|\.tv)"` > /dev/null

That command will read all the index files except for the stored fields 
(.fdx, .fdt) and termvectors (.tvx, .tvd, .tvf).  That puts them in the 
OS disk cache.  Before trying that command, you would want to find out 
how much disk space those files take to make sure they will all fit in 
RAM.  It is usually a bad idea to schedule this operation in cron.


Thanks,
Shawn



Re: DIH and splitBy

2013-01-31 Thread Christopher Condit
Sorry about that - even if I switch the splitBy to "," it still
doesn't work. Here's the corrected unit test:
http://pastie.org/5995399

On Thu, Jan 31, 2013 at 12:30 PM, Dyer, James
 wrote:
> In your unit test, you have:
>
> "" +
>
> And also:
>
> runner.update("INSERT INTO test VALUES 1, 'foo,bar,baz'");
>
> So you need to decide if you want to delimit with a pipe or a comma.
>
> James Dyer
> Ingram Content Group
> (615) 213-4311
>
>
> -Original Message-
> From: Christopher Condit [mailto:con...@sdsc.edu]
> Sent: Thursday, January 31, 2013 2:03 PM
> To: solr-user@lucene.apache.org
> Subject: DIH and splitBy
>
> I'm having an issue getting the splitBy construct from the regex
> transformer to work in a very basic case (with either Solr 3.6 or
> 4.1).
>
> I have a field defined like this:
> 
>
> The entity is defined like this:
> 
>   
> 
>
> Here's a POM:
> http://pastie.org/5992725
>
> A JUnit test case showing the problem:
> http://pastie.org/5993437
>
> And a stackoverflow question with the same information:
> http://stackoverflow.com/questions/14512055/splitting-database-column-into-multivalued-solr-field
>
> Anyone have any ideas?
>
> Thanks,
> Chris
>


RE: long QTime for big index

2013-01-31 Thread Toke Eskildsen
Shawn Heisey [s...@elyograg.org] wrote:

[...]

> If you have a total index size for this JVM of 240GB, then you may not
> have enough RAM to let the OS disk cache work efficiently.  For that
> size of index, I would plan on a system with at least 128GB of RAM,
> 256GB would be better.

[...]

> One of your earlier messages indicated that you are using SSD storage.
> Main system memory is quite a lot faster than an SSD, so the OS cache is
> still important.

While technically true, our internal tests showed little practical gain by 
using main memory over SSD. Our main search servers are equipped with paltry 
memory (16GB) and consumer grade SSDs for our 10M documents/70GB indexes. 

However, we did our comparison testing way back before MMapDirectory with 
Lucene 2.something, so our observations might not be valid anymore. Do you know 
of any recent experiments with RAM vs. SSD?

Regards,
Toke Eskildsen, State and University Library, Denmark

Re: Stopping solr

2013-01-31 Thread Michael Della Bitta
The ping handler is how we tell our load balancers that our Solr cores
are healthy. I guess if you're running more than one core behind the
same balancer, it would make sense to drop a webapp in there that ran
the ping queries for all your cores and only responded OK if they all
came back OK.
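As a rough sketch of that "check all cores at once" idea, assuming each core has an
/admin/ping handler configured, the SolrJ part of such a webapp could look like the
following (base URL and core names are made up; the servlet wrapping is left out):

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.SolrPingResponse;

public class MultiCoreHealthCheck {
    public static boolean allCoresHealthy(String baseUrl, String... cores) {
        for (String core : cores) {
            HttpSolrServer server = new HttpSolrServer(baseUrl + "/" + core);
            try {
                SolrPingResponse rsp = server.ping();   // hits the core's /admin/ping handler
                if (rsp.getStatus() != 0) {
                    return false;                       // one unhealthy core fails the whole check
                }
            } catch (SolrServerException e) {
                return false;
            } catch (java.io.IOException e) {
                return false;
            } finally {
                server.shutdown();
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allCoresHealthy("http://localhost:8983/solr", "core1", "core2"));
    }
}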

Or if you have one core that's the most important, you could only use
that ping handler.

Or you could invent your own check if you think that the criteria for
"up" should be different than what the ping handler offers.

Michael

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Thu, Jan 31, 2013 at 10:43 AM, epnRui  wrote:
> Hi Michael!
>
> Thank you for your response.
>
> Do you know how I could check the health of Solr Master?
>
> Is this the only way?
>
> http://wiki.apache.org/solr/SolrConfigXml#The_Admin.2BAC8-GUI_Section
>
> I guess checking the overall server health isn't the same as checking whether
> or not the index is responding correctly and with the correct data?
>
> Thanks!
> Rui
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Stopping-solr-tp4037715p4037728.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Thoughts on production deployment?

2013-01-31 Thread Mark Miller

On Jan 31, 2013, at 10:15 AM, Michael Della Bitta 
 wrote:

> I'd really like some confirmation from the devs that there really is a
> blessed status for a given container that provides advantages over
> others.

IMO: jetty is what all of our unit/integration tests are run in, jetty is what 
we configure to work well out of the box and add workarounds to, jetty is what 
the devs run, jetty is very likely what most of the users run simply because we 
ship with it, most of the bug reports we get around containers involve jetty 
(because of the previous most likely).

I'd say jetty doesn't get any more blessed than that. If you want to run another
container, fine, but I would pick jetty myself - specifically, the one we ship
with - unless there is a darn good reason not to.

- Mark

Re: Minimum word length for stemming

2013-01-31 Thread Jan Høydahl
Hi,

I believe each stemmer implementation decides that themselves. At least the 
MinimalNorwegianStemmer has a built-in logic which stems certain suffixes only 
if the token is >N chars.

If you want external control, you can look at 
http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming and the 
KeywordMarkerFilterFactory which lets you list a bunch of words you do not want 
the stemmers to touch. I guess you could easily implement your own 
TokenLengthMarkerFilterFactory which keeps words from being stemmed based on 
length.
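A minimal sketch of what such a filter could look like, assuming Lucene/Solr 4.x; the
class name and the minLength parameter are invented here, and you would still need a
thin TokenFilterFactory wrapper to reference it from schema.xml. The trick is simply
to set KeywordAttribute, which the keyword-aware stemmers already honour:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public final class TokenLengthMarkerFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);
    private final int minLength;

    public TokenLengthMarkerFilter(TokenStream input, int minLength) {
        super(input);
        this.minLength = minLength;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Tokens below the threshold are flagged so downstream stemmers skip them.
        if (termAtt.length() < minLength) {
            keywordAtt.setKeyword(true);
        }
        return true;
    }
}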

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 31 Jan 2013, at 17:35, Jamie Johnson wrote:

> Is there a capability to provide a minimum word threshold that must be met
> before a word is analyzed by a stemmer or other language analyzer?



Re: Thoughts on production deployment?

2013-01-31 Thread Michael Della Bitta
That's surprising to me, mostly because a number of the Solr wiki
pages don't really make that strong of a case for it:

http://wiki.apache.org/solr/SolrInstall
http://wiki.apache.org/solr/SolrTomcat
http://wiki.apache.org/solr/SolrJetty

Would it make sense to spell that out somewhere?

I do notice that it seems like the version of Jetty that ships with
Solr isn't the preferred one according to the wiki, so that would be
an extra dependency for a config management system like Chef.

Are there any other configuration choices that are blessed like this?
JDK versions or sources (oracle vs. open), for example?

Thanks,

Michael

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Thu, Jan 31, 2013 at 5:07 PM, Mark Miller  wrote:
>
> On Jan 31, 2013, at 10:15 AM, Michael Della Bitta 
>  wrote:
>
>> I'd really like some confirmation from the devs that there really is a
>> blessed status for a given container that provides advantages over
>> others.
>
> IMO: jetty is what all of our unit/integration tests are run in, jetty is 
> what we configure to work well out of the box and add workarounds to, jetty 
> is what the devs run, jetty is very likely what most of the users run simply 
> because we ship with it, most of the bug reports we get around containers 
> involve jetty (because of the previous most likely).
>
> I'd say jetty doesn't get any more blessed than that. If you want to run
> another container, fine, but I would pick jetty myself - specifically, the
> one we ship with - unless there is a darn good reason not to.
>
> - Mark


Re: Minimum word length for stemming

2013-01-31 Thread Jamie Johnson
Thanks for confirming my suspicions, the custom
TokenLengthMarkerFilterFactory sounds like the best approach for doing this.


On Thu, Jan 31, 2013 at 5:12 PM, Jan Høydahl  wrote:

> Hi,
>
> I believe each stemmer implementation decides that themselves. At least
> the MinimalNorwegianStemmer has a built-in logic which stems certain
> suffixes only if the token is >N chars.
>
> If you want external control, you can look at
> http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming and the
> KeywordMarkerFilterFactory which lets you list a bunch of words you do not
> want the stemmers to touch. I guess you could easily implement your own
> TokenLengthMarkerFilterFactory which keeps words from being stemmed based
> on length.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 31 Jan 2013, at 17:35, Jamie Johnson wrote:
>
> > Is there a capability to provide a minimum word threshold that must be
> met
> > before a word is analyzed by a stemmer or other language analyzer?
>
>


Re: Thoughts on production deployment?

2013-01-31 Thread Shawn Heisey

On 1/31/2013 3:21 PM, Michael Della Bitta wrote:

I do notice that it seems like the version of Jetty that ships with
Solr isn't the preferred one according to the wiki, so that would be
an extra dependency for a config management system like Chef.


Near as I can tell, the versions of jetty that shipped with 4.0 (8.1.2) 
and 4.1 (8.1.7) are unmodified.  The config is somewhat specialized, but 
Jetty itself is not changed.  I upgraded my 4.1-SNAPSHOT install to 
8.1.7 before the committers did without any problems.


The Jetty 6 versions included with 1.x and 3.x releases were patched for 
one or more bugs - the upstream package from mortbay wouldn't be the 
right thing to use.


I have a RHEL/CentOS-friendly jetty init script with config options that 
can be overridden by a file in /etc/sysconfig.  I could probably also 
come up with one for Debian (sysvinit).  The newest Fedora releases use 
systemd, but systemd is backward compatible with RHEL/CentOS init 
scripts.  Outside of these distributions, I know very little.  Recent 
Ubuntu releases use upstart, about which I am completely clueless.


If there's interest, I can make my init script more generic, make one 
for debian, and try to come up with an installation script to go with 
it.  If someone knows upstart, they can use my work as a base.


Thanks,
Shawn



RE: Solr load balancer

2013-01-31 Thread Jeff Wartes

For what it's worth, Google has done some pretty interesting research into 
coping with the idea that particular shards might very well be busy doing 
something else when your query comes in.

Check out this slide deck: http://research.google.com/people/jeff/latency.html
Lots of interesting ideas, but in particular, around slide 39 he talks about 
"backup requests" where you wait for something like your typical response time 
and then issue a second request to a different shard. You take whichever answer 
you get first, and cancel the other. The initial wait + cancellation means your 
extra cluster load is minimal, and you still get the benefit of reducing your 
p95+ response times if the first request was high-latency due to something 
unrelated to the query. (Say, GC.)

Of course, a central principle of this approach is being able to cancel a query 
and have it stop consuming resources. I'd love to be corrected, but I don't 
think Solr allows this. You can stop waiting for a response, but even the 
timeAllowed param doesn't seem to stop resource usage after the allotted time.  
Meaning, a few exceptionally long-running queries can take out your 
high-throughput cluster by tying up entire CPUs for long periods.
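For what it's worth, the client-side half of the idea is easy to sketch with SolrJ and
a CompletionService; the replica URLs and the latency budget below are placeholders,
and, as noted above, cancel() only stops the client from waiting - Solr keeps
executing the abandoned request on the server:

import java.util.concurrent.*;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BackupRequestDemo {
    public static void main(String[] args) throws Exception {
        final SolrQuery q = new SolrQuery("*:*");
        final HttpSolrServer primary = new HttpSolrServer("http://solr1:8983/solr/core1");
        final HttpSolrServer backup  = new HttpSolrServer("http://solr2:8983/solr/core1");
        long typicalLatencyMs = 200;   // e.g. your typical (p50) response time

        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<QueryResponse> cs =
                new ExecutorCompletionService<QueryResponse>(pool);

        Future<QueryResponse> first = cs.submit(new Callable<QueryResponse>() {
            public QueryResponse call() throws Exception { return primary.query(q); }
        });
        // Wait the typical latency; if nothing came back, hedge with the backup replica.
        Future<QueryResponse> done = cs.poll(typicalLatencyMs, TimeUnit.MILLISECONDS);
        if (done == null) {
            cs.submit(new Callable<QueryResponse>() {
                public QueryResponse call() throws Exception { return backup.query(q); }
            });
            done = cs.take();           // whichever replica answers first wins
        }
        QueryResponse rsp = done.get();
        first.cancel(true);             // harmless if 'first' is the one that completed
        System.out.println("numFound=" + rsp.getResults().getNumFound());
        pool.shutdownNow();             // interrupts whichever request is still in flight
    }
}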

Let me know the JIRA number, I'd love to see work in this area.


-Original Message-
From: Phil Hoy [mailto:p...@brightsolid.com] 
Sent: Tuesday, January 29, 2013 11:33 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr load balancer

Hi Erick,

Thanks, I have read the blogs you cited and found them very interesting, and
we have tuned the JVM accordingly, but we still get the odd longish GC pause.

That said, we perhaps have an unusual setup; we index a lot of small documents
using servers with SSDs and 128 GB RAM in a sharded setup with replicas, and
our queries rely heavily on query filters and faceting, with minimal free-text
style searching. For that reason we rely heavily on the filter cache to improve
query latency, and therefore assign a large percentage of the available RAM to the
JVM hosting Solr.

Anyhow we are happy with the current configuration and performance profile, 
aside from the odd gc pause that is, and as we have index replicas it seems to 
me that we should be able to cope, hence my willingness to tweak how the load 
balancer behaves.

Thanks,
Phil



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 20 January 2013 15:56
To: solr-user@lucene.apache.org
Subject: Re: Solr load balancer

Hmmm, the first thing I'd look at is why you are having long GC pauses. Here's 
a great place to start:

http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
and:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

I've wondered about a similar approach, but by firing off the same query to 
multiple nodes in your cluster, you'll be effectively doubling (at least) the 
load on your system. Leading to more memory issues perhaps in a "non-virtuous 
cycle".

FWIW,
Erick

On Fri, Jan 18, 2013 at 5:41 AM, Phil Hoy  wrote:
> Hi,
>
> I would like to experiment with some custom load balancers to help with query 
> latency in the face of long gc pauses and the odd time-consuming query that 
> we need to be able to support. At the moment setting the socket timeout via 
> the HttpShardHandlerFactory does help, but of course it can only be set to a 
> length of time as long as the most time consuming query we are likely to 
> receive.
>
> For example perhaps a load balancer that sends multiple queries concurrently 
> to all/some replicas and only keeps the first response might be effective. Or 
> maybe a load balancer which takes account of the frequency of timeouts would 
> be able to recognize zombies more effectively.
>
> To use alternative load balancer implementations cleanly and without having 
> to hack solr directly, I would need to be able to make the existing 
> LBHttpSolrServer and HttpShardHandlerFactory more amenable to extension, I 
> can then override the default load balancer using solr's plugin mechanism.
>
> So my question is, if I made a patch to make the load balancer more 
> pluggable, is this something that would be acceptable and if so what do I do 
> next?
>
> Phil
>
> __
> "brightsolid" is used in this email to collectively mean brightsolid online 
> innovation limited and its subsidiary companies brightsolid online publishing 
> limited and brightsolid online technology limited.
> findmypast.co.uk is a brand of brightsolid online publishing limited.
> brightsolid online innovation limited, Gateway House, Luna Place, Dundee 
> Technology Park, Dundee DD2 1TP.  Registered in Scotland No. SC274983.
> brightsolid online publishing limited, The Glebe, 6 Chapel Place, Rivington 
> Street, London EC2A 3DQ. Registered in England No. 04369607.
> brightsolid online technology limited, Gateway House, Luna Place, Dundee 
> Techn

Re: Solr load balancer

2013-01-31 Thread Lance Norskog
It is possible to do this with IP Multicast. The query goes out on the 
multicast and all query servers read it. The servers wait for a random 
amount of time, then transmit the answer. Here's the trick: it's 
multicast. All of the query servers listen to each other's responses, 
and drop out when another server answers the query. The server has to 
decide whether to do the query before responding; this would take some 
tuning.


Having all participants snoop on their peers is a really powerful 
design. I worked on a telecom system that used IP Multicast to do 
shortest-path-first allocation of T1 lines.  Worked really well. It's a 
shame Enron never used it.
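A toy sketch of such a snooping responder with java.net.MulticastSocket, just to make
the idea concrete - the group address, port and the QUERY:/ANSWER: message format are
invented for illustration, and a real responder would run the Solr query where the
comment indicates:

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.util.Random;

public class SnoopingResponder {
    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName("230.0.0.1");
        MulticastSocket socket = new MulticastSocket(4446);
        socket.joinGroup(group);
        Random random = new Random();
        byte[] buf = new byte[1024];

        while (true) {
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            socket.receive(packet);
            String msg = new String(packet.getData(), 0, packet.getLength(), "UTF-8");
            if (!msg.startsWith("QUERY:")) {
                continue;                                   // answers seen outside a backoff are ignored
            }
            String id = msg.substring("QUERY:".length());
            long deadline = System.currentTimeMillis() + random.nextInt(50) + 1;
            boolean peerAnswered = false;
            // Keep snooping on the group during the random backoff.
            while (!peerAnswered && System.currentTimeMillis() < deadline) {
                socket.setSoTimeout((int) Math.max(1, deadline - System.currentTimeMillis()));
                try {
                    DatagramPacket p = new DatagramPacket(buf, buf.length);
                    socket.receive(p);
                    String m = new String(p.getData(), 0, p.getLength(), "UTF-8");
                    peerAnswered = m.equals("ANSWER:" + id);
                } catch (SocketTimeoutException expired) {
                    // backoff elapsed without a peer answering
                }
            }
            socket.setSoTimeout(0);                         // back to blocking reads
            if (!peerAnswered) {
                // Run the query here, then announce the answer so peers drop out.
                byte[] answer = ("ANSWER:" + id).getBytes("UTF-8");
                socket.send(new DatagramPacket(answer, answer.length, group, 4446));
            }
        }
    }
}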


On 01/24/2013 04:17 PM, Chris Hostetter wrote:

: For example perhaps a load balancer that sends multiple queries
: concurrently to all/some replicas and only keeps the first response
: might be effective. Or maybe a load balancer which takes account of the

I know of other distributed query systems that use this approach when
query speed is more important to people than load, and people who use them
seem to think it works well.

given that it synthetically multiplies the load of each end user request,
it's probably not something we'd want to turn on by default, but a
configurable option certainly seems like it might be handy.


-Hoss




Re: Indexing nouns only - UIMA vs. OpenNLP

2013-01-31 Thread Lance Norskog

Thanks, Kai!

About removing non-nouns: the OpenNLP patch includes two simple 
TokenFilters for manipulating terms with payloads. The 
FilterPayloadFilter lets you keep or remove terms with given payloads. 
In the demo schema.xml, there is an example type that keeps only 
nouns&verbs.
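The actual classes live in the patch, but the core of a "keep only these payloads"
filter is small. A hand-rolled sketch against the Lucene 4.x attribute API (not the
patch's code; the class name and keep-set are invented) could look like this:

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

public final class KeepPayloadsFilter extends TokenFilter {
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private final Set<BytesRef> keep;   // e.g. the payload bytes the tagger uses for noun/verb tags

    public KeepPayloadsFilter(TokenStream input, Set<BytesRef> keep) {
        super(input);
        this.keep = keep;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Drop tokens whose payload is missing or not in the keep set.
        // (A production filter would also fix up position increments.)
        while (input.incrementToken()) {
            BytesRef payload = payloadAtt.getPayload();
            if (payload != null && keep.contains(payload)) {
                return true;
            }
        }
        return false;
    }
}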


There is a "universal" mapping for parts-of-speech systems for different 
languages. There is no Solr/Lucene support for it.

http://code.google.com/p/universal-pos-tags/

On 01/31/2013 09:47 AM, Kai Gülzau wrote:

UIMA:

I just found this issue: https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted
token types :-)


   
 
 
   


Open issue -> How to set the ModelFile for the Tagger to 
"german/TuebaModel.dat" ???



OpenNLP:

And a modified patch for https://issues.apache.org/jira/browse/LUCENE-2899 is 
now working
with solr 4.1. :-)


   
 
   
   
   
   




Any hints on which lib is more accurate on noun tagging?
Any performance or memory issues (some OOM here while testing with 1GB via 
Analyzer Admin GUI)?


Regards,

Kai Gülzau




-Original Message-
From: Kai Gülzau [mailto:kguel...@novomind.com]
Sent: Thursday, January 31, 2013 2:19 PM
To: solr-user@lucene.apache.org
Subject: Indexing nouns only - UIMA vs. OpenNLP

Hi,

I am stuck trying to index only the nouns of German and English texts.
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example)


First try was to use UIMA with the HMMTagger:


   
 
 /org/apache/uima/desc/AggregateSentenceAE.xml
 false
 
   false
   albody
 
 
   
 org.apache.uima.SentenceAnnotation
 
   coveredText
   albody2
 
   

   


- But how do I set the ModelFile to use the German corpus?
- What about language identification?
-- How do I use the right corpus/tagger based on the language?
-- Should this be done in UIMA (how?) or via solr contrib/langid field mapping?
- How to remove non nouns in the annotated field?


Second try is to use OpenNLP and to apply the patch 
https://issues.apache.org/jira/browse/LUCENE-2899
But the patch seems to be a bit out of date.
Currently I am trying to get it to work with Solr 4.1.


Any pointers appreciated :-)

Regards,

Kai Gülzau





Re: long QTime for big index

2013-01-31 Thread Mou
Thank you again.

Unfortunately the index files will not fit in RAM. I have to try using the
document cache. I am also moving my index to SSD again; we took our index
off when the Fusion-io cards failed twice during indexing and the index was
corrupted. Now, with the BIOS upgrade and a new driver, it is supposed to be
more reliable.

Also I am going to look into the client app to verify that it is making
proper query requests.

Surprisingly, when I used a much lower value than the default for
defaultconnectionperhost and maxconnectionperhost in solrmeter, it performs
very well; the same queries return in less than one sec. I am not sure yet;
I need to run solrmeter with different heap sizes, with cache and without
cache, etc.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/long-QTime-for-big-index-tp4037635p4037870.html
Sent from the Solr - User mailing list archive at Nabble.com.