Solr Hanging

2012-04-19 Thread Trym R. Møller

Hi

I am using Solr trunk and have 7 Solr instances running with 28 leaders 
and 28 replicas for a single collection.
After indexing for a while (a couple of days) the Solr instances start hanging, and 
taking a thread dump on the JVM I see blocked threads like the following:

Thread 2369: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; 
information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) 
@bci=14, line=158 (Compiled frame)
 - 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() 
@bci=42, line=1987 (Compiled frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, 
line=399 (Compiled frame)
 - java.util.concurrent.ExecutorCompletionService.take() @bci=4, 
line=164 (Compiled frame)
 - 
org.apache.solr.update.SolrCmdDistributor.checkResponses(boolean) 
@bci=27, line=350 (Compiled frame)
 - org.apache.solr.update.SolrCmdDistributor.finish() @bci=18, 
line=98 (Compiled frame)
 - 
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish() 
@bci=4, line=299 (Compiled frame)
 - 
org.apache.solr.update.processor.DistributedUpdateProcessor.finish() 
@bci=1, line=817 (Compiled frame)

...
 - org.mortbay.thread.QueuedThreadPool$PoolThread.run() @bci=25, 
line=582 (Interpreted frame)


I read the stack trace as my indexing client has indexed a document and 
this Solr is now waiting for the replica? to respond before returning an 
answer to the client.


The other Solrs have similar blocked threads.

Any ideas on how I can get closer to the problem? Am I reading the stack 
trace correctly? Is there any further information that would be relevant for 
commenting on this problem?


Thanks for any comments.

Best regards Trym


How to escape “<” character in regex in Solr schema.xml?

2012-04-19 Thread smooth almonds
Using Solr 3.5.0 and in my schema.xml I'm using the following to mark the end
of sentences and replace the end punctuation with a symbolic token:



I'm not sure if that will even work for what I want, but first I need to
solve the problem of escaping the '<' character in the first '?<='
lookbehind.

I get the following error:

org.xml.sax.SAXParseException: The value of attribute "pattern" associated
with an element type "null" must not contain the '<' character.

I've tried using a '\' as in:

pattern="(?\<=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"

But I get the same error.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-escape-character-in-regex-in-Solr-schema-xml-tp3921961p3921961.html
Sent from the Solr - User mailing list archive at Nabble.com.


Dismax request handler and Dismax query parser

2012-04-19 Thread mechravi25
Hi,

If I give the search string "type list", I want my search to match both
"type" & "list".
The following is the search query we are using:
 
/select/?qf=name%5e2.3+text+r_name%5e0.3+id%5e0.3+uid%5e0.3&fl=*&qf=name%5e2.3+text+r_name%5e0.3+id%5e0.3+uid%5e0.3&fl=*&qt=dismax&f.typeFacet.facet.mincount=1&facet.field=typeFacet&f.rFacet.facet.mincount=1&facet.field=rFacet&facet=true&hl.fl=*&hl=t
rue&rows=10&start=0&q=type+list&debugQuery=on
 
and this does not return any results.
 
But if we remove qt=dismax in the above case and replace it with
defType=dismax, then we get results for the same
search string. The request handlers used for standard and dismax are as
follows.
 
 

 
   explicit
 
  

 

 dismax
 explicit
 
id,score
 

 *:*
 0
 name
 regex

  

I'm hitting the above request query on a common core using the shards
concept (in this case I'm using 2 cores combined into the common core).
When I use debugQuery=on, I get the following response in the back end
(while hitting the different cores from the common core).
 
INFO: [corex] webapp=/solr path=/select
params={facet=true&qf=name^2.3+text+r_name^0.3+id^0.3+uid^0.3&q.alt=*:*&hl.fl=*&wt=javabin&hl=false&defType=dismax&rows=10&version=1&f.rFacet.facet.limit=160&fl=uid,score&start=0&f.typeFacet.facet.limit=160&q=type+list&f.text.hl.fragmenter=regex&f.name.hl.fragsize=0&facet.field=typeFacet&facet.field=rFacet&f.name.hl.alternateField=name&isShard=true&fsv=true}
hits=0 status=0 QTime=6

INFO: [corey] webapp=/solr path=/select
params={facet=true&qf=name^2.3+text+r_name^0.3+id^0.3+uid^0.3&q.alt=*:*&hl.fl=*&wt=javabin&hl=false&defType=dismax&rows=10&version=1&f.rFacet.facet.limit=160&fl=uid,score&start=0&f.typeFacet.facet.limit=160&q=type+list&f.text.hl.fragmenter=regex&f.name.hl.fragsize=0&facet.field=typeFacet&facet.field=rFacet&f.name.hl.alternateField=name&isShard=true&fsv=true}
hits=0 status=0 QTime=6

So, here I can see that defType=dismax is being used in the query string
while querying the individual cores even if we use qt=dismax on the common
core. If this is the case, why is it not returning any values?
Am I missing anything? Can you guide me on this?

I've used defType=dismax in the defaults section of the dismax handler
definition, but I'm still not getting the required results. In our scenario,
we would like to use the dismax request handler along with the dismax query
parser. Can you tell me how this can be done?
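
For reference, a minimal sketch of what such a handler definition can look
like in solrconfig.xml (the handler name and qf boosts are taken from the
query above; everything else is illustrative):

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- defType selects the query parser; qt only selects this handler -->
    <str name="defType">dismax</str>
    <str name="qf">name^2.3 text r_name^0.3 id^0.3 uid^0.3</str>
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>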

Regards,
Sivaganesh



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-request-handler-and-Dismax-query-parser-tp3922708p3922708.html
Sent from the Solr - User mailing list archive at Nabble.com.


# open files with SolrCloud

2012-04-19 Thread Sami Siren
I have a simple SolrCloud setup from trunk with default configs; 1
shard with one replica. As a few other people have reported, there seems
to be some kind of leak somewhere that causes the number of open files
to grow over time when doing indexing.

One thing that correlates with the open file count that the JVM
reports is the count of deleted files that Solr still keeps open (not
sure if the problem is this or something else). The deleted but not yet
closed files all end with "nrm.cfs", for example:

/solr/data/index/_jwk_nrm.cfs (deleted)

Any ideas about what could be the cause for this? I don't even know
where to start looking...

--
 Sami Siren


Re: Solr Hanging

2012-04-19 Thread Yonik Seeley
On Thu, Apr 19, 2012 at 4:25 AM, "Trym R. Møller"  wrote:
> Hi
>
> I am using Solr trunk and have 7 Solr instances running with 28 leaders and
> 28 replicas for a single collection.
> After indexing a while (a couple of days) the solrs start hanging and doing
> a thread dump on the jvm I see blocked threads like the following:
>    Thread 2369: (state = BLOCKED)
>     - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame;
> information may be imprecise)
>     - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
> line=158 (Compiled frame)
>     -
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await()
> @bci=42, line=1987 (Compiled frame)
>     - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=399
> (Compiled frame)
>     - java.util.concurrent.ExecutorCompletionService.take() @bci=4, line=164
> (Compiled frame)
>     - org.apache.solr.update.SolrCmdDistributor.checkResponses(boolean)
> @bci=27, line=350 (Compiled frame)
>     - org.apache.solr.update.SolrCmdDistributor.finish() @bci=18, line=98
> (Compiled frame)
>     - org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish()
> @bci=4, line=299 (Compiled frame)
>     - org.apache.solr.update.processor.DistributedUpdateProcessor.finish()
> @bci=1, line=817 (Compiled frame)
>    ...
>     - org.mortbay.thread.QueuedThreadPool$PoolThread.run() @bci=25, line=582
> (Interpreted frame)
>
> I read the stack trace as my indexing client has indexed a document and this
> Solr is now waiting for the replica? to respond before returning an answer
> to the client.

Correct.  What's the full stack trace like on both a leader and replica?
We need to know what the replica is blocking on.

What version of trunk are you using?

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10


Re: How to escape “<” character in regex in Solr schema.xml?

2012-04-19 Thread Jeevanandam

Try this one:

pattern="(?<=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"

I tested it locally; Solr starts perfectly. Now please test with data.
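
For reference, the usual way to get a literal '<' into an XML attribute value
is the entity &lt;. A sketch, assuming the pattern sits in something like a
PatternReplaceFilterFactory (the factory, replacement token and placement are
illustrative only):

<filter class="solr.PatternReplaceFilterFactory"
        pattern="(?&lt;=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"
        replacement="SENTENCE_END" replace="all"/>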

-Jeevanandam


On 19-04-2012 9:29 am, smooth almonds wrote:
Using Solr 3.5.0 and in my schema.xml I'm using the following to mark 
the end

of sentences and replace the end punctuation with a symbolic token:



I'm not sure if that will even work for what I want, but first I need 
to

solve the problem of escaping the '<' character in the first '?<='
lookbehind.

I get the following error:

org.xml.sax.SAXParseException: The value of attribute "pattern" 
associated

with an element type "null" must not contain the '<' character.

I've tried using a '\' as in:


pattern="(?\<=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"

But I get the same error.

--
View this message in context:

http://lucene.472066.n3.nabble.com/How-to-escape-character-in-regex-in-Solr-schema-xml-tp3921961p3921961.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Solr Hanging

2012-04-19 Thread Trym R. Møller

Thanks for your answer.

I am running an (older) revision of solr from around the 29/2-2012

I suspect that the thread I have included is the leader of the shard?
The Solr instance I have the dump from contains more than one leader, 
so I don't know which shard (slice) the thread is working on. How can I 
find the Solr instance containing the replica (I guess ZooKeeper can't 
help me)?
And when I have found the Solr instance containing the replica, how do I 
know which thread is handling the update request (all my Solr instances 
contain 8 cores)?


If this is not possible, I might be able to restart with a setup where 
each Solr instance only contains a single core (a leader or a replica).


Best regards Trym

On 19-04-2012 14:36, Yonik Seeley wrote:

On Thu, Apr 19, 2012 at 4:25 AM, "Trym R. Møller"  wrote:

Hi

I am using Solr trunk and have 7 Solr instances running with 28 leaders and
28 replicas for a single collection.
After indexing a while (a couple of days) the solrs start hanging and doing
a thread dump on the jvm I see blocked threads like the following:
Thread 2369: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame;
information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14,
line=158 (Compiled frame)
 -
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await()
@bci=42, line=1987 (Compiled frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=399
(Compiled frame)
 - java.util.concurrent.ExecutorCompletionService.take() @bci=4, line=164
(Compiled frame)
 - org.apache.solr.update.SolrCmdDistributor.checkResponses(boolean)
@bci=27, line=350 (Compiled frame)
 - org.apache.solr.update.SolrCmdDistributor.finish() @bci=18, line=98
(Compiled frame)
 - org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish()
@bci=4, line=299 (Compiled frame)
 - org.apache.solr.update.processor.DistributedUpdateProcessor.finish()
@bci=1, line=817 (Compiled frame)
...
 - org.mortbay.thread.QueuedThreadPool$PoolThread.run() @bci=25, line=582
(Interpreted frame)

I read the stack trace as my indexing client has indexed a document and this
Solr is now waiting for the replica? to respond before returning an answer
to the client.

Correct.  What's the full stack trace like on both a leader and replica?
We need to know what the replica is blocking on.

What version of trunk are you using?

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10


Re: How to escape “<” character in regex in Solr schema.xml?

2012-04-19 Thread Jeevanandam
The previously given pattern will solve the '<' character issue; however, you 
will get the following exception in the log:


Caused by: java.util.regex.PatternSyntaxException: Look-behind group 
does not have an obvious maximum length near index 48

(?<=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)
^
So revisit your regex pattern, particularly around position 48.
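
For reference, Java's regex engine rejects any look-behind group whose maximum
length it cannot compute, so the unbounded * repetitions inside (?<=...) are
what triggers this; bounded repetitions such as {0,100} are accepted. A
stripped-down illustration (the patterns and the bound of 100 are illustrative
only):

<!-- rejected: '*' makes the look-behind unbounded -->
pattern="(?&lt;=[^.!?]*)[.!?]+"
<!-- accepted: the repetition now has an explicit upper bound -->
pattern="(?&lt;=[^.!?]{0,100})[.!?]+"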

-Jeevanandam


On 19-04-2012 7:06 pm, Jeevanandam wrote:

try this one


pattern="(?<=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"

I tested locally, solr start perfectly. now please test with data.

-Jeevanandam


On 19-04-2012 9:29 am, smooth almonds wrote:
Using Solr 3.5.0 and in my schema.xml I'm using the following to 
mark the end

of sentences and replace the end punctuation with a symbolic token:



I'm not sure if that will even work for what I want, but first I 
need to

solve the problem of escaping the '<' character in the first '?<='
lookbehind.

I get the following error:

org.xml.sax.SAXParseException: The value of attribute "pattern" 
associated

with an element type "null" must not contain the '<' character.

I've tried using a '\' as in:


pattern="(?\<=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"

But I get the same error.

--
View this message in context:

http://lucene.472066.n3.nabble.com/How-to-escape-character-in-regex-in-Solr-schema-xml-tp3921961p3921961.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Solr file size limit?

2012-04-19 Thread Bram Rongen
Hello Shawn,

Thanks very much for your answer.

Yesterday I've started indexing again but this time on Solr 3.6.. Again
Solr is failing around the same time, but not exactly (now the largest fdt
file is 4.8G). It's right after the moment I receive memory errors on the
Drupal side, which makes me suspicious that it may have something to do with
a huge document. Is that possible? I was indexing 1500 documents at once
every minute. Drupal builds them all up in memory before submitting them to
Solr. At some point it runs out of memory and I have to switch to 10/20
documents per minute for a while.. then I can switch back to 1000 documents
per minute.

The disk is a software RAID1 over 2 disks. But I've also run into the same
problem at another server.. This was a VM-server with only 1GB ram and 40GB
of disk. With this server the merge-repeat happened at an earlier stage.

I've also let Solr continue with merging for about two days before  (in an
earlier attempt), without submitting new documents. The merging kept
repeating.

Somebody suggested it could be because I'm using Jetty, could that be right?

My schema.xml and solrconfig.xml can be found here:
http://pastebin.com/GeBrB903 http://pastebin.com/Su8q1WAh

Kind regards,
Bram Rongen


On Wed, Apr 18, 2012 at 10:54 PM, Shawn Heisey  wrote:

> On 4/18/2012 6:17 AM, Bram Rongen wrote:
>
>> I've been using Solr for a very short time now and I'm stuck. I'm trying
>> to
>> index a drupal website consisting of 1.2 million smaller nodes and 300k
>> larger nodes (~400kb avg)..
>>
>
> A followup to my previous reply: Your ramBufferSizeMB is only 32, the
> default in the example config.  I have seen recommendations indicating that
> going beyond 128MB is not usually helpful.  With such large input
> documents, that may not apply to you - try setting it to 512 or 1024.  That
> will result in far fewer index segments being created.  They will be
> larger, so merges will be much less frequent but take longer.
>
> Thanks,
> Shawn
>
>


Re: Solr file size limit?

2012-04-19 Thread Bram Rongen
I've discovered some documents are 100+MB in size.. Could this be the
problem?

On Thu, Apr 19, 2012 at 3:49 PM, Bram Rongen  wrote:

> Hello Shawn,
>
> Thanks very much for your answer.
>
> Yesterday I've started indexing again but this time on Solr 3.6.. Again
> Solr is failing around the same time, but not exactly (now the largest fdt
> file is 4.8G).. It's right after the moment I receive memory-errors at the
> Drupal side which make me suspicious that it maybe has something to do with
> a huge document.. Is that possible? I was indexing 1500 documents at once
> every minute. Drupal builds them all up in memory before submitting them to
> Solr. At some point it runs out of memory and I have to switch to 10/20
> documents per minute for a while.. then I can switch back to 1000 documents
> per minute.
>
> The disk is a software RAID1 over 2 disks. But I've also run into the same
> problem at another server.. This was a VM-server with only 1GB ram and 40GB
> of disk. With this server the merge-repeat happened at an earlier stage.
>
> I've also let Solr continue with merging for about two days before  (in an
> earlier attempt), without submitting new documents. The merging kept
> repeating.
>
> Somebody suggested it could be because I'm using Jetty, could that be
> right?
>
> My schema.xml and solrconfig.xml can be found here:
> http://pastebin.com/GeBrB903 http://pastebin.com/Su8q1WAh
>
> Kind regards,
> Bram Rongen
>
>
> On Wed, Apr 18, 2012 at 10:54 PM, Shawn Heisey  wrote:
>
>> On 4/18/2012 6:17 AM, Bram Rongen wrote:
>>
>>> I've been using Solr for a very short time now and I'm stuck. I'm trying
>>> to
>>> index a drupal website consisting of 1.2 million smaller nodes and 300k
>>> larger nodes (~400kb avg)..
>>>
>>
>> A followup to my previous reply: Your ramBufferSizeMB is only 32, the
>> default in the example config.  I have seen recommendations indicating that
>> going beyond 128MB is not usually helpful.  With such large input
>> documents, that may not apply to you - try setting it to 512 or 1024.  That
>> will result in far fewer index segments being created.  They will be
>> larger, so merges will be much less frequent but take longer.
>>
>> Thanks,
>> Shawn
>>
>>
>


AW: Wrong categorization with DIH

2012-04-19 Thread Ramo Karahasan
Does anyone have an idea what's going wrong here?

Thanks,
Ramo

-Original Message-
From: Gora Mohanty [mailto:g...@mimirtech.com] 
Sent: Tuesday, 17 April 2012 11:34
To: solr-user@lucene.apache.org
Subject: Re: Wrong categorization with DIH

On 17 April 2012 14:47, Ramo Karahasan 
wrote:
> Hi,
>
>
>
> i currently face the followin issue:
>
> Testing the following sql statement which is also used in SOLR (DIH) 
> leads to a wrong categorization in solr:
>
> select p.title as title, p.id, p.category_id, p.pic_thumb, c.name as 
> category, c.id as category_id from product p, category c WHERE 
> p.category_id = c.id AND p.id = 3091328
>
>
>
> This returns in my sql client:
>
> Apple MacBook Pro MD313D/A 33,8 cm (13,3 Zoll) Notebook (Intel Core 
> i5-2435M, 2,4GHz, 4GB RAM, 500GB HDD, Intel HD 3000, Mac OS), 3091328, 
> 1003, http://m-d.ww.cdn.com/images/I/41teWbp-uAL._SL75_.jpg, Computer, 
> 1003
>
>
>
> As you see, the categoryid 1003 points to "Computer"
>
>
>
> Via the solr searchadmin i get the following result when searchgin for
> id:3091328
>
> Sport
>
> 1003
[...]

Please share with us the rest of the DIH configuration file, i.e., the part
where these data are saved to the Solr index.

Regards,
Gora



Re: How sorlcloud distribute data among shards of the same cluster?

2012-04-19 Thread Boon Low
Hi,

Is there any mechanism in SolrCloud for controlling how the data is distributed 
among the shards? For example, I'd like to create logical (standalone) shards 
('A', 'B', 'C') to make up a collection ('A-C'), and be able to query either a 
particular shard (e.g. 'A') or the collection entirely. At the moment, my test 
suggests 'A' data is distributed evenly to all shards in SolrCloud.

Regards,

Boon

-
Boon Low
Search UX and Engine Developer
brightsolid Online Publishing

On 18 Apr 2012, at 12:41, Erick Erickson wrote:

> Try looking at DistributedUpdateProcessor, there's
> a "hash(cmd)" method in there.
> 
> Best
> Erick
> 
> On Tue, Apr 17, 2012 at 4:45 PM, emma1023  wrote:
>> Thanks for your reply. In sorl 3.x, we need to manually hash the doc Id to
>> the server.How does solrcloud do this instead? I am working on a project
>> using solrcloud.But we need to monitor how the solrcloud distribute the
>> data. I cannot find which part of the code it is from source code.Is it
>> from the cloud part? Thanks.
>> 
>> 
>> On Tue, Apr 17, 2012 at 3:16 PM, Mark Miller-3 [via Lucene] <
>> ml-node+s472066n3918192...@n3.nabble.com> wrote:
>> 
>>> 
>>> On Apr 17, 2012, at 9:56 AM, emma1023 wrote:
>>> 
>>> It hashes the id. The doc distribution is fairly even - but sizes may be
>>> fairly different.
>>> 
 How solrcloud manage distribute data among shards of the same cluster
>>> when
 you query? Is it distribute the data equally? What is the basis? Which
>>> part
 of the code that I can find about it?Thank you so much!
 
 
 --
 View this message in context:
>>> http://lucene.472066.n3.nabble.com/How-sorlcloud-distribute-data-among-shards-of-the-same-cluster-tp3917323p3917323.html
 Sent from the Solr - User mailing list archive at Nabble.com.
>>> 
>>> - Mark Miller
>>> lucidimagination.com
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>>  If you reply to this email, your message will be added to the discussion
>>> below:
>>> 
>>> http://lucene.472066.n3.nabble.com/How-sorlcloud-distribute-data-among-shards-of-the-same-cluster-tp3917323p3918192.html
>>>  To unsubscribe from How sorlcloud distribute data among shards of the
>>> same cluster?, click 
>>> here
>>> .
>>> NAML
>>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/How-sorlcloud-distribute-data-among-shards-of-the-same-cluster-tp3917323p3918348.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> __
> This email has been scanned by the brightsolid Email Security System. Powered 
> by MessageLabs
> __




Re: Large Index and OutOfMemoryError: Map failed

2012-04-19 Thread Boon Low
Hi,

Also came across this error recently, while indexing with > 10 DIH processes in 
parallel + default index setting. The JVM grinds to a halt and throws this 
error. Checking the index of a core reveals thousands of files! Tuning the 
default autocommit from 15000ms to 90ms solved the problem for us. (no 
'autosoftcommit').

Boon 

-
Boon Low
Search UX and Engine Developer
brightsolid Online Publishing

On 14 Apr 2012, at 17:40, Gopal Patwa wrote:

> I checked it was "MMapDirectory.UNMAP_SUPPORTED=true" and below are my
> system data. Is their any existing test case to reproduce this issue? I am
> trying understand how I can reproduce this issue with unit/integration test
> 
> I will try recent solr trunk build too,  if it is some bug in solr or
> lucene keeping old searcher open then how to reproduce it?
> 
> SYSTEM DATA
> ===
> PROCESSOR: Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
> SYSTEM ID: x86_64
> CURRENT CPU SPEED: 1600.000 MHz
> CPUS: 8 processor(s)
> MEMORY: 49449296 kB
> DISTRIBUTION: CentOS release 5.3 (Final)
> KERNEL NAME: 2.6.18-128.el5
> UPTIME: up 71 days
> LOAD AVERAGE: 1.42, 1.45, 1.53
> JBOSS Version: Implementation-Version: 4.2.2.GA (build:
> SVNTag=JBoss_4_2_2_GA date=20
> JAVA Version: java version "1.6.0_24"
> 
> 
> On Thu, Apr 12, 2012 at 3:07 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
> 
>> Your largest index has 66 segments (690 files) ... biggish but not
>> insane.  With 64K maps you should be able to have ~47 searchers open
>> on each core.
>> 
>> Enabling compound file format (not the opposite!) will mean fewer maps
>> ... ie should improve this situation.
>> 
>> I don't understand why Solr defaults to compound file off... that
>> seems dangerous.
>> 
>> Really we need a Solr dev here... to answer "how long is a stale
>> searcher kept open".  Is it somehow possible 46 old searchers are
>> being left open...?
>> 
>> I don't see any other reason why you'd run out of maps.  Hmm, unless
>> MMapDirectory didn't think it could safely invoke unmap in your JVM.
>> Which exact JVM are you using?  If you can print the
>> MMapDirectory.UNMAP_SUPPORTED constant, we'd know for sure.
>> 
>> Yes, switching away from MMapDir will sidestep the "too many maps"
>> issue, however, 1) MMapDir has better perf than NIOFSDir, and 2) if
>> there really is a leak here (Solr not closing the old searchers or a
>> Lucene bug or something...) then you'll eventually run out of file
>> descriptors (ie, same  problem, different manifestation).
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> 2012/4/11 Gopal Patwa :
>>> 
>>> I have not change the mergefactor, it was 10. Compound index file is
>> disable
>>> in my config but I read from below post, that some one had similar issue
>> and
>>> it was resolved by switching from compound index file format to
>> non-compound
>>> index file.
>>> 
>>> and some folks resolved by "changing lucene code to disable
>> MMapDirectory."
>>> Is this best practice to do, if so is this can be done in configuration?
>>> 
>>> 
>> http://lucene.472066.n3.nabble.com/MMapDirectory-failed-to-map-a-23G-compound-index-segment-td3317208.html
>>> 
>>> I have index document of core1 = 5 million, core2=8million and
>>> core3=3million and all index are hosted in single Solr instance
>>> 
>>> I am going to use Solr for our site StubHub.com, see attached "ls -l"
>> list
>>> of index files for all core
>>> 
>>> SolrConfig.xml:
>>> 
>>> 
>>>  
>>>  false
>>>  10
>>>  2147483647
>>>  1
>>>  4096
>>>  10
>>>  1000
>>>  1
>>>  single
>>> 
>>>  
>>>0.0
>>>10.0
>>>  
>>> 
>>>  
>>>false
>>>0
>>>  
>>> 
>>>  
>>> 
>>> 
>>>  
>>>  1000
>>>   
>>> 90
>>> false
>>>   
>>>   
>>> 
>> ${inventory.solr.softcommit.duration:1000}
>>>   
>>> 
>>>  
>>> 
>>> 
>>> Forwarded conversation
>>> Subject: Large Index and OutOfMemoryError: Map failed
>>> 
>>> 
>>> From: Gopal Patwa 
>>> Date: Fri, Mar 30, 2012 at 10:26 PM
>>> To: solr-user@lucene.apache.org
>>> 
>>> 
>>> I need help!!
>>> 
>>> 
>>> 
>>> 
>>> 
>>> I am using Solr 4.0 nightly build with NRT and I often get this error
>> during
>>> auto commit "java.lang.OutOfMemoryError: Map failed". I have search this
>>> forum and what I found it is related to OS ulimit setting, please se
>> below
>>> my ulimit settings. I am not sure what ulimit setting I should have? and
>> we
>>> also get "java.net.SocketException: Too many open files" NOT sure how
>> many
>>> open file we need to set?
>>> 
>>> 
>>> I have 3 core with index size : core1 - 70GB, Core2 - 50GB and Core3 -
>> 15GB,
>>> with Single shard
>>> 
>>> 
>>> We update the index every 5 seconds, soft commit every 1 second and hard
>>> commit every 15 minutes
>>> 
>>

Re: Solr file size limit?

2012-04-19 Thread Shawn Heisey

On 4/19/2012 7:49 AM, Bram Rongen wrote:

Yesterday I've started indexing again but this time on Solr 3.6.. Again
Solr is failing around the same time, but not exactly (now the largest fdt
file is 4.8G).. It's right after the moment I receive memory-errors at the
Drupal side which make me suspicious that it maybe has something to do with
a huge document.. Is that possible? I was indexing 1500 documents at once
every minute. Drupal builds them all up in memory before submitting them to
Solr. At some point it runs out of memory and I have to switch to 10/20
documents per minute for a while.. then I can switch back to 1000 documents
per minute.

The disk is a software RAID1 over 2 disks. But I've also run into the same
problem at another server.. This was a VM-server with only 1GB ram and 40GB
of disk. With this server the merge-repeat happened at an earlier stage.

I've also let Solr continue with merging for about two days before  (in an
earlier attempt), without submitting new documents. The merging kept
repeating.

Somebody suggested it could be because I'm using Jetty, could that be right?


I am using Jetty for my Solr installation and it handles very large 
indexes without a problem.  I have created a single index with all my 
data (nearly 70 million documents, total index size over 100GB).  Aside 
from how long it takes to build and the fact that I don't have enough 
RAM to cache it for good performance, Solr handled it just fine.  For 
production I use a distributed index on multiple servers.


I don't know why you are seeing a merge that continually restarts, 
that's truly odd.  I've never used drupal, don't know a lot about it.  
From my small amount of research just now, I assume that it uses Tika, 
also another tool that I have no experience with.  I am guessing that 
you store the entire text of your documents into Solr, and that they are 
indexed up to a maximum of 10,000 tokens (the default value of 
maxFieldLength in solrconfig.xml), based purely on speculation about the 
"body" field in your schema.


A document that's 100MB in size, if the whole thing gets stored, will 
completely overwhelm a 32MB buffer, and might even be enough to 
overwhelm a 256MB buffer as well, because it will basically have to 
build the entire index segment in RAM, with term vectors, indexed data, 
and stored data for all fields.


With such large documents, you may have to increase the maxFieldLength, 
or you won't be able to search on the entire document text.  Depending 
on the content of those documents, it may or may not be a problem that 
only the first 10,000 tokens will get indexed.  Large documents tend to 
be repetitive and there might not be any search value after the 
introduction and initial words.  Your documents may be different, so 
you'll have to make that decision.


To test whether my current thoughts are right, I recommend that you try 
with the following settings during the initial full import:  
ramBufferSizeMB: 1024 (or maybe higher), autoCommit maxTime: 0, 
autoCommit maxDocs: 0.  This will mean that unless the indexing process 
issues manual commits (either in the middle of indexing or at the end), 
you will have to do a manual one.  Once you have the initial index built 
and it is only doing updates, you will probably be able to go back to 
using autoCommit.
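
For reference, a sketch of how those settings might look in a 3.x-style
solrconfig.xml (placement follows the example config; treat this as an
illustration of the settings above, not a drop-in):

<indexDefaults>
  <!-- much larger RAM buffer for the initial bulk import -->
  <ramBufferSizeMB>1024</ramBufferSizeMB>
</indexDefaults>

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- autoCommit omitted so that, as described above, nothing commits
       automatically; issue one explicit commit when the import finishes -->
</updateHandler>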


It's possible that I have no understanding of the real problem here, and 
my recommendation above may result in no improvement.  General 
recommendations, no matter what the current problem might be:


1) Get a lot more RAM.  Ideally you want to have enough free memory to 
cache your entire index.  That may not be possible, but you want to get 
as close to that goal as you can.
2) If you can, see what you can do to increase your IOPS.  Using 
mirrored high RPM SAS is an easy solution, and might be slightly cheaper 
than SATA RAID10, which is my solution.  SSD is easy and very fast, but 
expensive and not redundant -- I am currently not aware of any SSD RAID 
solutions that have OS TRIM support.  RAID10 with high RPM SAS would be 
best, but very expensive.  On the extreme high end, you could go with a 
high performance SAN.


Thanks,
Shawn



Re: Wrong categorization with DIH

2012-04-19 Thread Jeevanandam Madanagopal
Ramo -

Please share DIH configuration with us.

-Jeevanandam

On Apr 19, 2012, at 7:46 PM, Ramo Karahasan wrote:

> Does anyone has an idea what's going wrong here?
> 
> Thanks,
> Ramo
> 
> -Ursprüngliche Nachricht-
> Von: Gora Mohanty [mailto:g...@mimirtech.com] 
> Gesendet: Dienstag, 17. April 2012 11:34
> An: solr-user@lucene.apache.org
> Betreff: Re: Wrong categorization with DIH
> 
> On 17 April 2012 14:47, Ramo Karahasan 
> wrote:
>> Hi,
>> 
>> 
>> 
>> i currently face the followin issue:
>> 
>> Testing the following sql statement which is also used in SOLR (DIH) 
>> leads to a wrong categorization in solr:
>> 
>> select p.title as title, p.id, p.category_id, p.pic_thumb, c.name as 
>> category, c.id as category_id from product p, category c WHERE 
>> p.category_id = c.id AND p.id = 3091328
>> 
>> 
>> 
>> This returns in my sql client:
>> 
>> Apple MacBook Pro MD313D/A 33,8 cm (13,3 Zoll) Notebook (Intel Core 
>> i5-2435M, 2,4GHz, 4GB RAM, 500GB HDD, Intel HD 3000, Mac OS), 3091328, 
>> 1003, http://m-d.ww.cdn.com/images/I/41teWbp-uAL._SL75_.jpg, Computer, 
>> 1003
>> 
>> 
>> 
>> As you see, the categoryid 1003 points to "Computer"
>> 
>> 
>> 
>> Via the solr searchadmin i get the following result when searchgin for
>> id:3091328
>> 
>> Sport
>> 
>> 1003
> [...]
> 
> Please share with us the rest of the DIH configuration file, i.e., the part
> where these data are saved to the Solr index.
> 
> Regards,
> Gora
> 



Re: PolySearcher in Solr

2012-04-19 Thread Jeevanandam Madanagopal
Please have a look 

http://wiki.apache.org/solr/DistributedSearch
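
For illustration, that page describes fanning a single request out over
several shards with the shards parameter, along these lines (host and core
names are placeholders):

http://localhost:8983/solr/core0/select?q=*:*&shards=localhost:8983/solr/core0,localhost:8983/solr/core1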

-Jeevanandam

On Apr 19, 2012, at 9:14 PM, Ramprakash Ramamoorthy wrote:

> Dear all,
> 
> 
> I came across this while browsing through lucy
> 
> http://lucy.apache.org/docs/perl/Lucy/Search/PolySearcher.html
> 
> Does solr have an equivalent of this? My usecase is exactly the same
> (reading through multiple indices in a single shard and perform a
> distribution across shards).
> 
> If not can someone give me a hint? I tried swapping readers for a single
> searcher, but didn't help.
> 
> -- 
> With Thanks and Regards,
> Ramprakash Ramamoorthy,
> Project Trainee,
> Zoho Corporation.
> +91 9626975420



maxMergeDocs in Solr 3.6

2012-04-19 Thread Burton-West, Tom
Hello all,

I'm getting ready to upgrade from Solr 3.4 to Solr 3.6 and I noticed that 
maxMergeDocs is no longer in the example solrconfig.xml.
Has maxMergeDocs been deprecated, or does the TieredMergePolicy ignore it?

Since our docs are about 800K or more, and the setting in the old example 
solrconfig was 2,147,483,647, we would never hit this limit, but I was wondering 
why it is no longer in the example.



Tom



Re: Solr with UIMA

2012-04-19 Thread dsy99
Hi Chris,
Have you been able to successfully integrate UIMA into Solr?

I too tried to integrate UIMA into Solr by following the instructions
provided in the README, i.e. the following four steps:

Step 1. I set the lib tags in solrconfig.xml appropriately to point to the jar
files.

   
   

Step 2. Modified my schema.xml, adding the fields I wanted to hold metadata,
specifying proper values for the type, indexed, stored and multiValued options,
as follows:


  
  

Step 3. Modified my solrconfig.xml, adding the following snippet:

  

  

  VALID_ALCHEMYAPI_KEY
  VALID_ALCHEMYAPI_KEY
  VALID_ALCHEMYAPI_KEY
  VALID_ALCHEMYAPI_KEY
  VALID_ALCHEMYAPI_KEY
  VALID_OPENCALAIS_KEY

/org/apache/uima/desc/OverridingParamsExtServicesAE.xml

true


  false
  
text
  


  
org.apache.uima.alchemy.ts.concept.ConceptFS

  text
  concept

  
  
org.apache.uima.alchemy.ts.language.LanguageFS

  language
  language

  
  
org.apache.uima.SentenceAnnotation

  coveredText
  sentence

  

  



  

Step 4. And finally created a new UpdateRequestHandler with the following:
  

  uima



Further, I indexed a Word file called text.docx using the following command:

curl
"http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true";
-F "myfile=@UIMA_sample_test.docx"

When I search the file, I am not able to see the additional UIMA fields.

Can you please help if you have been able to solve the problem?


With Regds & Thanks
Divakar

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-with-UIMA-tp3863324p3923443.html
Sent from the Solr - User mailing list archive at Nabble.com.


Looking at SOLR-3221

2012-04-19 Thread Shawn Heisey
Looking at the CHANGES.txt for 3.6 so I can plan my upgrade, I have some 
questions about SOLR-3221.  This might be a question more appropriate 
for the dev list, but I don't know, so I am starting here.


The wiki entry on this mentions the maxConnectionsPerHost setting, but 
then talks about the maximum number of connections per shard.  Is it 
shard, or host?  My distributed index has three shards per 
Host/Solr/Jetty container, so this is an important distinction.  If it 
actually refers to shards, I think the setting should be renamed.  
Alternatively, a new one could be added with the current functionality 
and the Host one could be renamed to Container and changed so that it 
tracks host/port combinations and adjusts accordingly.


http://wiki.apache.org/solr/SolrConfigXml#Configuration_of_Shard_Handlers_for_Distributed_searches
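
For context, that page shows the setting being configured per request handler,
roughly along these lines (only maxConnectionsPerHost shown; the value is
purely illustrative, not a recommendation):

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- other defaults omitted -->
  <shardHandlerFactory class="HttpShardHandlerFactory">
    <int name="maxConnectionsPerHost">20</int>
  </shardHandlerFactory>
</requestHandler>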

I have zero clue what good starting numbers might be for the 
configuration.  If it's not already there, what sort of debug logging 
statements might I add to get thread statistics so I can come up with 
some reasonable config numbers?  If I configure numbers that are too 
small and it ends up waiting an unreasonable amount of time for 
resources, is there any logging to let me know?


Thanks,
Shawn



AW: Wrong categorization with DIH

2012-04-19 Thread Ramo Karahasan
Hi,

my config is just the following:


  
  
   

  


I'm doing it as described on:

http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

Any ideas?

Best regards,
Ramo

-Original Message-
From: Jeevanandam Madanagopal [mailto:je...@myjeeva.com] 
Sent: Thursday, 19 April 2012 17:44
To: solr-user@lucene.apache.org
Subject: Re: Wrong categorization with DIH

Ramo -

Please share DIH configuration with us.

-Jeevanandam

On Apr 19, 2012, at 7:46 PM, Ramo Karahasan wrote:

> Does anyone has an idea what's going wrong here?
> 
> Thanks,
> Ramo
> 
> -Ursprüngliche Nachricht-
> Von: Gora Mohanty [mailto:g...@mimirtech.com]
> Gesendet: Dienstag, 17. April 2012 11:34
> An: solr-user@lucene.apache.org
> Betreff: Re: Wrong categorization with DIH
> 
> On 17 April 2012 14:47, Ramo Karahasan 
> wrote:
>> Hi,
>> 
>> 
>> 
>> i currently face the followin issue:
>> 
>> Testing the following sql statement which is also used in SOLR (DIH) 
>> leads to a wrong categorization in solr:
>> 
>> select p.title as title, p.id, p.category_id, p.pic_thumb, c.name as 
>> category, c.id as category_id from product p, category c WHERE 
>> p.category_id = c.id AND p.id = 3091328
>> 
>> 
>> 
>> This returns in my sql client:
>> 
>> Apple MacBook Pro MD313D/A 33,8 cm (13,3 Zoll) Notebook (Intel Core 
>> i5-2435M, 2,4GHz, 4GB RAM, 500GB HDD, Intel HD 3000, Mac OS), 
>> 3091328, 1003, http://m-d.ww.cdn.com/images/I/41teWbp-uAL._SL75_.jpg, 
>> Computer,
>> 1003
>> 
>> 
>> 
>> As you see, the categoryid 1003 points to "Computer"
>> 
>> 
>> 
>> Via the solr searchadmin i get the following result when searchgin 
>> for
>> id:3091328
>> 
>> Sport
>> 
>> 1003
> [...]
> 
> Please share with us the rest of the DIH configuration file, i.e., the 
> part where these data are saved to the Solr index.
> 
> Regards,
> Gora
> 




StandardTokenizer and domain names containing digits

2012-04-19 Thread Alex Willmer
TLDR; How should I make Solr treat "ns1.define.logica.com" as a single token in 
the same way "ns.define.logica.com" would be?

We are just starting to use Solr 3.5.0 in production and have run into a 
slightly surprising behaviour involving the query "ns1.define.logica.com", 
through an edismax handler with "q.op"=AND defined with


 
   explicit
   10
   
   edismax
   AND
   
body^0.5 comments^0.4 tags^1.2 title^2.0 involved^1.5 id^10.0
author^10.9 changed created oneline^0.7
   
   
body^0.2 tags^1.1 title^1.5
   
 


The schema is defined with fields of type text_general, as found in the example 
schema.xml, namely:


  



  
  




  


The search string is being tokenised to "ns1", "define.logica.com", and the 
resulting query becomes

+DisjunctionMaxQuerytags:ns1 tags:define.logica.com)^1.2) | 
id:ns1.define.logica.com^10.0 | ((body:ns1 body:define.logica.com)^0.5) | 
((author:ns1 author:define.logica.com)^10.9) | ((oneline:ns1 
oneline:define.logica.com)^0.7) | ((title:ns1 title:define.logica.com)^2.0) | 
((involved:ns1 involved:define.logica.com)^1.5) | ((comments:ns1 
comments:define.logica.com)^0.4))) DisjunctionMaxQuery((tags:"ns1 
define.logica.com"^1.1 | body:"ns1 define.logica.com"^0.2 | title:"ns1 
define.logica.com"^1.5))

meaning that documents containing "ns1" OR "define.logica.com" are returned. 
This is contrary to e.g. "ns.define.logica.com", which is treated as a single 
token. Is there a way I can make Solr treat both queries the same way?

Many thanks, Alex
-- 
Alex Willmer | Developer
2 Trinity Park,  Birmingham, B37 7ES | United Kingdom 
M: +44 7557 752744
al.will...@logica.com | www.logica.com
Logica UK Ltd, registered in UK (registered number 947968)
Registered Office: 250 Brook Drive, Green Park, Reading RG2 6UA, United Kingdom




Re: Wrong categorization with DIH

2012-04-19 Thread Jeevanandam Madanagopal
Ramo -

Are you using all the selected columns from the query?

select p.title as title, p.id, p.category_id,
p.pic_thumb, c.name as category, c.id as category_id from product p,
category c ...

I see that the following attributes don't have an alias defined: 'p.id',
'p.category_id' & 'p.pic_thumb'.

Pointers:

- Select only the required fields in the SQL query
- Ensure the SQL alias name and the field name in schema.xml match,
  or
- If you like, do an explicit mapping for every column in the DIH config,
  as in the sketch below

Detailed Info refer this: http://wiki.apache.org/solr/DataImportHandler
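
For illustration, a sketch of what that explicit mapping could look like
inside the DIH entity (the query is the one from this thread; the Solr field
names on the right are placeholders for whatever the schema actually uses):

<entity name="product"
        query="select p.title as title, p.id, p.category_id, p.pic_thumb,
               c.name as category, c.id as category_id
               from product p, category c where p.category_id = c.id">
  <field column="title" name="title"/>
  <field column="category" name="category"/>
  <field column="category_id" name="category_id"/>
  <field column="pic_thumb" name="pic_thumb"/>
</entity>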

-Jeevanandam


On Apr 19, 2012, at 9:37 PM, Ramo Karahasan wrote:

> Hi,
> 
> my config is just the following:
> 
> 
>driver="com.mysql.jdbc.Driver"
>  url="jdbc:mysql://xx/asdx"
>  user=""
>  password=""/>
>  
>   query="select p.title as title, p.id, p.category_id,
> p.pic_thumb, c.name as category, c.id as category_id from product p,
> category c WHERE p.category_id = c.id AND  '${dataimporter.request.clean}'
> != 'false' OR updated_at > '${dataimporter.last_index_time}' ">
>
>  
> 
> 
> I'm doing it as described on:
> 
> http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
> 
> Any ideas?
> 
> Best regars,
> Ramo
> 
> -Ursprüngliche Nachricht-
> Von: Jeevanandam Madanagopal [mailto:je...@myjeeva.com] 
> Gesendet: Donnerstag, 19. April 2012 17:44
> An: solr-user@lucene.apache.org
> Betreff: Re: Wrong categorization with DIH
> 
> Ramo -
> 
> Please share DIH configuration with us.
> 
> -Jeevanandam
> 
> On Apr 19, 2012, at 7:46 PM, Ramo Karahasan wrote:
> 
>> Does anyone has an idea what's going wrong here?
>> 
>> Thanks,
>> Ramo
>> 
>> -Ursprüngliche Nachricht-
>> Von: Gora Mohanty [mailto:g...@mimirtech.com]
>> Gesendet: Dienstag, 17. April 2012 11:34
>> An: solr-user@lucene.apache.org
>> Betreff: Re: Wrong categorization with DIH
>> 
>> On 17 April 2012 14:47, Ramo Karahasan 
>> wrote:
>>> Hi,
>>> 
>>> 
>>> 
>>> i currently face the followin issue:
>>> 
>>> Testing the following sql statement which is also used in SOLR (DIH) 
>>> leads to a wrong categorization in solr:
>>> 
>>> select p.title as title, p.id, p.category_id, p.pic_thumb, c.name as 
>>> category, c.id as category_id from product p, category c WHERE 
>>> p.category_id = c.id AND p.id = 3091328
>>> 
>>> 
>>> 
>>> This returns in my sql client:
>>> 
>>> Apple MacBook Pro MD313D/A 33,8 cm (13,3 Zoll) Notebook (Intel Core 
>>> i5-2435M, 2,4GHz, 4GB RAM, 500GB HDD, Intel HD 3000, Mac OS), 
>>> 3091328, 1003, http://m-d.ww.cdn.com/images/I/41teWbp-uAL._SL75_.jpg, 
>>> Computer,
>>> 1003
>>> 
>>> 
>>> 
>>> As you see, the categoryid 1003 points to "Computer"
>>> 
>>> 
>>> 
>>> Via the solr searchadmin i get the following result when searchgin 
>>> for
>>> id:3091328
>>> 
>>> Sport
>>> 
>>> 1003
>> [...]
>> 
>> Please share with us the rest of the DIH configuration file, i.e., the 
>> part where these data are saved to the Solr index.
>> 
>> Regards,
>> Gora
>> 
> 
> 



Re: Date granularity

2012-04-19 Thread vybe3142
Thanks

So, I tried out the suggestions. I used the main query though (not a
filter).
1. Using a DATE range and DAY does give me the desired results.
Specifically, the query that I used was:

2. Without a DATE range, the parser seems to reduce the date to the
beginning of the day, i.e. 00:00:00, and attempt to find docs matching that
date exactly (at least that's what it appears to do, from the debug output).

Although the range query appears to work, isn't there a more elegant way to
specify a granular query without having to use a range? Debug output
attached.
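
For reference, the day-granular range described in point 1 is usually written
with date math so that both endpoints fall on day boundaries (the field name
here is a placeholder):

timestamp:[2012-04-19T00:00:00Z TO 2012-04-19T00:00:00Z+1DAY]

or, relative to the current day:

timestamp:[NOW/DAY TO NOW/DAY+1DAY]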

Thanks


http://lucene.472066.n3.nabble.com/file/n3923651/datedebug.txt datedebug.txt 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Date-granularity-tp3920890p3923651.html
Sent from the Solr - User mailing list archive at Nabble.com.


AW: Wrong categorization with DIH

2012-04-19 Thread Ramo Karahasan
Hi,

Yes, I use every one of them.

Thanks for your hint... I'll have a look at this and try to configure it
correctly.

Thank you,
Ramo

-Original Message-
From: Jeevanandam Madanagopal [mailto:je...@myjeeva.com] 
Sent: Thursday, 19 April 2012 18:42
To: solr-user@lucene.apache.org
Subject: Re: Wrong categorization with DIH

Ramo -

Are you using all the selected columns from the query?

select p.title as title, p.id, p.category_id, p.pic_thumb, c.name as
category, c.id as category_id from product p, category c ...

I see following attributes 'p.id', 'p.category_id' & 'p.pic_thumb'  doesn't
have alias defined.

Pointers:

- Select only required field in the sql query
- Ensure sql alias name and attribute name in the schema.xml should match
  or
- If you like to do explicit mapping for every column in DIH config as
follow 

Detailed Info refer this: http://wiki.apache.org/solr/DataImportHandler

-Jeevanandam


On Apr 19, 2012, at 9:37 PM, Ramo Karahasan wrote:

> Hi,
> 
> my config is just the following:
> 
> 
>driver="com.mysql.jdbc.Driver"
>  url="jdbc:mysql://xx/asdx"
>  user=""
>  password=""/>
>  
>   query="select p.title as title, p.id, p.category_id, 
> p.pic_thumb, c.name as category, c.id as category_id from product p, 
> category c WHERE p.category_id = c.id AND  '${dataimporter.request.clean}'
> != 'false' OR updated_at > '${dataimporter.last_index_time}' ">
>
>  
> 
> 
> I'm doing it as described on:
> 
> http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
> 
> Any ideas?
> 
> Best regars,
> Ramo
> 
> -Ursprüngliche Nachricht-
> Von: Jeevanandam Madanagopal [mailto:je...@myjeeva.com]
> Gesendet: Donnerstag, 19. April 2012 17:44
> An: solr-user@lucene.apache.org
> Betreff: Re: Wrong categorization with DIH
> 
> Ramo -
> 
> Please share DIH configuration with us.
> 
> -Jeevanandam
> 
> On Apr 19, 2012, at 7:46 PM, Ramo Karahasan wrote:
> 
>> Does anyone has an idea what's going wrong here?
>> 
>> Thanks,
>> Ramo
>> 
>> -Ursprüngliche Nachricht-
>> Von: Gora Mohanty [mailto:g...@mimirtech.com]
>> Gesendet: Dienstag, 17. April 2012 11:34
>> An: solr-user@lucene.apache.org
>> Betreff: Re: Wrong categorization with DIH
>> 
>> On 17 April 2012 14:47, Ramo Karahasan 
>> 
>> wrote:
>>> Hi,
>>> 
>>> 
>>> 
>>> i currently face the followin issue:
>>> 
>>> Testing the following sql statement which is also used in SOLR (DIH) 
>>> leads to a wrong categorization in solr:
>>> 
>>> select p.title as title, p.id, p.category_id, p.pic_thumb, c.name as 
>>> category, c.id as category_id from product p, category c WHERE 
>>> p.category_id = c.id AND p.id = 3091328
>>> 
>>> 
>>> 
>>> This returns in my sql client:
>>> 
>>> Apple MacBook Pro MD313D/A 33,8 cm (13,3 Zoll) Notebook (Intel Core 
>>> i5-2435M, 2,4GHz, 4GB RAM, 500GB HDD, Intel HD 3000, Mac OS), 
>>> 3091328, 1003, 
>>> http://m-d.ww.cdn.com/images/I/41teWbp-uAL._SL75_.jpg,
>>> Computer,
>>> 1003
>>> 
>>> 
>>> 
>>> As you see, the categoryid 1003 points to "Computer"
>>> 
>>> 
>>> 
>>> Via the solr searchadmin i get the following result when searchgin 
>>> for
>>> id:3091328
>>> 
>>> Sport
>>> 
>>> 1003
>> [...]
>> 
>> Please share with us the rest of the DIH configuration file, i.e., 
>> the part where these data are saved to the Solr index.
>> 
>> Regards,
>> Gora
>> 
> 
> 




Re: Solr with UIMA

2012-04-19 Thread Rahul Warawdekar
Hi Divakar,

Try making your updateRequestProcessorChain the default. Simply add
default="true" as follows, and check if that works.
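
For illustration, assuming the chain from the earlier mail is named "uima",
that would look something like this (processor list abbreviated):

<updateRequestProcessorChain name="uima" default="true">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <!-- existing UIMA configuration goes here, unchanged -->
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>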




On Thu, Apr 19, 2012 at 12:01 PM, dsy99  wrote:

> Hi Chris,
> Are you been able to get success to integrate the UIMA in SOLR.
>
> I too  tried to integrate Uima in Solr by following the instructions
> provided in README i.e. the following four steps:
>
> Step1. I set  tags in solrconfig.xml appropriately to point the jar
> files.
>
>   
>
>
> Step2. modified my "schema.xml" adding the fields I wanted to  hold
> metadata
> specifying proper values for type, indexed, stored and multiValued options
> as follows:
>
> required="false"/>
>   multiValued="true" required="false"/>
>multiValued="true" required="false" />
>
> Step3. modified my solrconfig.xml adding the following snippet:
>
>  
> class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
>  
>
>   VALID_ALCHEMYAPI_KEY
>  VALID_ALCHEMYAPI_KEY
>  VALID_ALCHEMYAPI_KEY
>  VALID_ALCHEMYAPI_KEY
>  VALID_ALCHEMYAPI_KEY
>  VALID_OPENCALAIS_KEY
>
>
> name="analysisEngine">/org/apache/uima/desc/OverridingParamsExtServicesAE.xml
>
>true
>
> 
>  false
>  
> text
>   
>
>
>  
> name="name">org.apache.uima.alchemy.ts.concept.ConceptFS
>
>  text
>  concept
>
>  
>  
> name="name">org.apache.uima.alchemy.ts.language.LanguageFS
>
>  language
>  language
>
>  
>  
>org.apache.uima.SentenceAnnotation
>
>  coveredText
>  sentence
> 
>  
>
>  
>
>
>
>  
>
> Step 4: and finally created a new UpdateRequestHandler with the following:
>   
>
>  uima
>
>
>
> Further I  indexed a word file called text.docx using the following
> command:
>
> curl
> "
> http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true
> "
> -F "myfile=@UIMA_sample_test.docx"
>
> When I searched the file I am not able to see the additional UIMA fields.
>
> Can you please help if you been able to solve the problem.
>
>
> With Regds & Thanks
> Divakar
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-with-UIMA-tp3863324p3923443.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Thanks and Regards
Rahul A. Warawdekar


RE: StandardTokenizer and domain names containing digits

2012-04-19 Thread Steven A Rowe
Hi Alex,

TLDR; Try adding WordDelimiterFilter to your analyzer(s).

StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules from 
Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29: 
<http://unicode.org/reports/tr29/>.  These 
rules don't include recognition of URLs or domain names.  (The details: in 
UAX#29 Word Boundary rules terminology, the default rule - WB14 - says that 
boundaries will be made everywhere they are not prohibited, and since there is 
no rule to prohibit making a boundary in the character sequence /Numeric, 
MidNumLet, ALetter/ - "." FULL STOP belongs to MidNumLet - boundaries are made 
between Number and MidNumLet, and between MidNumLet and ALetter.  
StandardTokenizer emits as tokens the character sequences between UAX#29 word 
boundaries that contain alphanumeric characters, so the MidNumLet-only token is 
dropped.)

Lucene/Solr includes another tokenizer that does recognize URLs and domain 
names, in addition to the UAX#29 Word Boundary rules: UAX29URLEmailTokenizer. 
(Stand-alone domain names are recognized as URLs.)

I think Lucene/Solr should have a way to tokenize URL (and e-mail) components, 
so that e.g. if you have "http://www.example.com/page.html"; in your text, your 
index can contain "www.example.com" and "example.com", to enable e.g. queries 
containing just "example.com".  I'd like to have a URLFilter and an EmailFilter 
that would configurably tokenize components (e.g. for URLs: protocol; domain; 
base domain; domain elements; full path; path elements; 
URL-decoded-uax29-word-boundary-tokenized path elements).

This doesn't solve your problem, though.

My suggestion is that you add a filter (for both the indexing and querying) 
that splits tokens containing periods, something like (untested!):
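
One possible shape for such a filter entry (the parameter choices here are
illustrative, not a verbatim suggestion):

<!-- split on intra-token punctuation such as '.', but keep the original token too -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        splitOnCaseChange="0"
        splitOnNumerics="0"
        preserveOriginal="1"/>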



Note that this filter will be applied to *all* of your tokens, not just domain 
names.
 
Steve
 
-Original Message-
From: Alex Willmer [mailto:al.will...@logica.com] 
Sent: Thursday, April 19, 2012 12:04 PM
To: solr-user@lucene.apache.org
Subject: StandardTokenizer and domain names containing digits

TLDR; How should I make Solr treat "ns1.define.logica.com" as a single token in 
the same way "ns.define.logica.com" would be?

We are just starting to use Solr 3.5.0 in production and have run into a 
slightly surprising behaviour involving the query "ns1.define.logica.com", 
through an edismax handler with "q.op"=AND defined with

  
   explicit
   10
   
   edismax
   AND
   
body^0.5 comments^0.4 tags^1.2 title^2.0 involved^1.5 id^10.0
author^10.9 changed created oneline^0.7
   
   
body^0.2 tags^1.1 title^1.5
   
 


The schema is defined with fields of type text_general, as found in the example 
schema.xml, namely:


  



  
  




  


The search string is being tokenised to "ns2", "define.logica.com", and the 
resulting query becomes

+DisjunctionMaxQuerytags:ns1 tags:define.logica.com)^1.2) |
id:ns1.define.logica.com^10.0 | ((body:ns1 body:define.logica.com)^0.5) |
((author:ns1 author:define.logica.com)^10.9) | ((oneline:ns1
oneline:define.logica.com)^0.7) | ((title:ns1 title:define.logica.com)^2.0) |
((involved:ns1 involved:define.logica.com)^1.5) | ((comments:ns1
comments:define.logica.com)^0.4))) DisjunctionMaxQuery((tags:"ns1
define.logica.com"^1.1 | body:"ns1 define.logica.com"^0.2 | title:"ns1
define.logica.com"^1.5))

meaning that documents containing "ns1" OR "define.logica.com" are returned. 
This is contrary to e.g. "ns.logica.define.com" which is treated as a single 
token. Is there a way I can make Solr treat both queries the same way?

Many thanks, Alex
--
Alex Willmer | Developer
2 Trinity Park,  Birmingham, B37 7ES | United Kingdom
M: +44 7557 752744
al.will...@logica.com | www.logica.com
Logica UK Ltd, registered in UK (registered number 947968) Registered Office: 
250 Brook Drive, Green Park, Reading RG2 6UA, United Kingdom




Re: maxMergeDocs in Solr 3.6

2012-04-19 Thread Robert Muir
On Thu, Apr 19, 2012 at 11:54 AM, Burton-West, Tom  wrote:
> Hello all,
>
> I'm getting ready to upgrade from Solr 3.4 to Solr 3.6 and I noticed that 
> maxMergeDocs is no longer in the example solrconfig.xml.
> Has maxMergeDocs been deprecated? or doe the tieredMergePolicy ignore it?

It's not applicable to TieredMergePolicy.

when tieredmergepolicy was added, some previous "global" options were
'interpreted' for backwards compatibility:
useCompoundFile(X) -> setUseCompoundFile(X)
mergeFactor(X) -> setMaxMergeAtOnce(X) AND setSegmentsPerTier(X)

However, in my opinion there is an easier, less confusing, more
systematic approach you can use, and thats to not set these 'global'
params but just specify what you want directly to TieredMergePolicy:

For example for TieredMergePolicy, look at the javadocs of
TieredMergePolicy here:
http://lucene.staging.apache.org/core/3_6_0/api/core/org/apache/lucene/index/TieredMergePolicy.html

you would simply configure it like:


  19
  9
  1.0


this will invoke setMaxMergeAtOnceExplicit(19), setSegmentsPerTier(9),
and setNoCFSRatio(1.0). So you can do the same thing with any of those
TieredMergePolicy setters you see in the lucene javadocs.

-- 
lucidimagination.com


How can I use a function or fieldvalue as the default for query(subquery, default)?

2012-04-19 Thread jimtronic
Hi,

For the solr function query(subquery, default) I'd like to be able to
specify the value of another field or even a function as the default.

For example, I might have:

/solr/select?q=_val_:query("{!dismax qf=text v='solr rocks'}",
product(this_field, that_field))

Is this possible?

I see that Boolean functions are coming in Solr 4, but it is unclear whether
these would accept functions as defaults.

Thanks,
Jim

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-I-use-a-function-or-fieldvalue-as-the-default-for-query-subquery-default-tp3924172p3924172.html
Sent from the Solr - User mailing list archive at Nabble.com.


String ordering appears different with sort vs range query

2012-04-19 Thread Cat Bieber
I'm trying to use a Solr query to find the next title in alphabetical 
order after a given string. The issue I'm facing is that the sort param 
seems to sort non-alphanumeric characters in a different order from the 
ordering used by a range filter in the q or fq param. I can't filter the 
non-alphanumeric characters out because they're integral to the data and 
it would not be a useful ordering if it were based only on the 
alphanumeric portion of the strings.


I'm running Solr version 3.5.

In my current approach, I have a field that is a unique string for each 
document:


sortMissingLast="true" omitNorms="true">









stored="true"/>


I'm passing the value for the current document in a range to query 
everything after the current string, sorted ascending:


/select?fl=uniqueSortString&sort=uniqueSortString+asc&q=uniqueSortString:["$1+ZX+Spectrum+HOBETA+format+file"+TO+*]&wt=xml&rows=5&version=2.2

In theory, I expect the first result to be the current item and the 
second result to be the next one. However, I'm finding that the sort and 
the range filter seem to use different ordering:




$1 ZX Spectrum - Emulator
$1 ZX Spectrum HOBETA format file
$1 ZX Spectrum Hobetta Picture Format
$? TR-DOS ZX Spectrum file in HOBETA format
$A AutoCAD Autosave File ( Autodesk Inc.)



Based on the results ordering, sort believes '-' precedes 'H', but the range
filter should have excluded that first result if it ordered the same way.
Digging through the code, it looks like sorting uses String.compareTo() for
ordering on a text/string field. However, I haven't been able to track down
where the range filter code lives. If someone can point me in the right
direction to find that code I'd love to look through it. Or, if anyone has
suggestions for a different approach or changes I can make to this
query/field, that would be very helpful.


Thanks for your time.
-Cat Bieber


Re: How sorlcloud distribute data among shards of the same cluster?

2012-04-19 Thread Mark Miller
You can remove the distrib update processor and just distrib the data yourself.

Eventually the hash implementation will also be pluggable I think.
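
As a rough, untested sketch, a custom chain in solrconfig.xml with the
distributed processor left out could look something like:

<updateRequestProcessorChain name="nodistrib">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- solr.DistributedUpdateProcessorFactory deliberately omitted:
       the client decides which node/shard gets each document -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Documents sent through that chain are then indexed only on the node that
receives them.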

On Apr 19, 2012, at 10:30 AM, Boon Low wrote:

> Hi,
> 
> Is there any mechanism in SolrCloud for controlling how the data is 
> distributed among the shards? For example, I'd like to create logical 
> (standalone) shards ('A', 'B', 'C') to make up a collection ('A-C'), and be 
> able to query either a particular shard (e.g. 'A') or the collection entirely. At 
> the moment, my test suggests 'A' data is distributed evenly to all shards 
> in SolrCloud.
> 
> Regards,
> 
> Boon
> 
> -
> Boon Low
> Search UX and Engine Developer
> brightsolid Online Publishing
> 
> On 18 Apr 2012, at 12:41, Erick Erickson wrote:
> 
>> Try looking at DistributedUpdateProcessor, there's
>> a "hash(cmd)" method in there.
>> 
>> Best
>> Erick
>> 
>> On Tue, Apr 17, 2012 at 4:45 PM, emma1023  wrote:
>>> Thanks for your reply. In Solr 3.x, we need to manually hash the doc ID to
>>> the server. How does SolrCloud do this instead? I am working on a project
>>> using SolrCloud, but we need to monitor how SolrCloud distributes the
>>> data. I cannot find which part of the source code does this. Is it
>>> in the cloud part? Thanks.
>>> 
>>> 
>>> On Tue, Apr 17, 2012 at 3:16 PM, Mark Miller-3 [via Lucene] <
>>> ml-node+s472066n3918192...@n3.nabble.com> wrote:
>>> 
 
 On Apr 17, 2012, at 9:56 AM, emma1023 wrote:
 
 It hashes the id. The doc distribution is fairly even - but sizes may be
 fairly different.
 
> How does SolrCloud distribute data among shards of the same cluster when
> you query? Does it distribute the data equally? What is the basis? In which
> part of the code can I find this? Thank you so much!
> 
> 
> --
> View this message in context:
 http://lucene.472066.n3.nabble.com/How-sorlcloud-distribute-data-among-shards-of-the-same-cluster-tp3917323p3917323.html
> Sent from the Solr - User mailing list archive at Nabble.com.
 
 - Mark Miller
 lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 
 
 
 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://lucene.472066.n3.nabble.com/How-sorlcloud-distribute-data-among-shards-of-the-same-cluster-tp3917323p3918348.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>> 

Re: Date granularity

2012-04-19 Thread vybe3142
Also, what's the performance impact of range queries vs. querying for a
particular DAY (as described in my last post) when the index contains,
say, 10 million docs?

If the range queries result in a significant performance hit, one option for
us would be to define additional DAY fields when indexing TIME data,
e.g. when indexing METADATA.DATE_ABC=2009-07-31T15:25:45Z, also create and
index something like METADATA.DATE_DAY_ABC=2009-07-31T00:00:00Z.
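
Roughly, the two variants being compared would then look like this (standard
Solr date-math rounding, inclusive bounds; the field definitions themselves are
assumed):

Range over the full timestamp field, rounded to the day:
  METADATA.DATE_ABC:[2009-07-31T15:25:45Z/DAY TO 2009-07-31T15:25:45Z/DAY+1DAY]

Exact match on the pre-rounded DAY field:
  METADATA.DATE_DAY_ABC:"2009-07-31T00:00:00Z"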



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Date-granularity-tp3920890p3924290.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud: Programmatically create multiple collections?

2012-04-19 Thread ravi
Hi Mark, 

Thanks for your response. I did manage to get one example running with 2 Solr
instances, and I checked that shards are created and replicated properly.

The problem that I am now facing is ZooKeeper's clusterstate. If I kill one
Solr instance (which may hold one or more cores) by pressing CTRL+C,
ZooKeeper never shows that instance as *down* and keeps on showing that
instance as *active*.

The other instance does become the leader for some of the shards that were
present in the first instance, though. This suggests that ZooKeeper gets to
know that one instance went down, but for some strange reason it is not
updating clusterstate.json.

Has this already been reported? Or is there something that I am missing?

Thanks!
Ravi

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Programmatically-create-multiple-collections-tp3916927p3924698.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: EmbeddedSolrServer and StreamingUpdateSolrServer

2012-04-19 Thread pcrao
Hi,

Any update? 
Thanks,
PC Rao

--
View this message in context: 
http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-and-StreamingUpdateSolrServer-tp3889073p3925014.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr UI

2012-04-19 Thread dpt9876
Hi Erik,
Re this project, do you have any demos available to check it out?
https://github.com/lucidimagination/Prism

And will it work on standard Solr installs, or do you need a Lucid
Imagination subscription?
Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-UI-tp3182594p3925211.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: PolySearcher in Solr

2012-04-19 Thread Ramprakash Ramamoorthy
On Thu, Apr 19, 2012 at 9:21 PM, Jeevanandam Madanagopal
wrote:

> Please have a look
>
> http://wiki.apache.org/solr/DistributedSearch
>
> -Jeevanandam
>
> On Apr 19, 2012, at 9:14 PM, Ramprakash Ramamoorthy wrote:
>
> > Dear all,
> >
> >
> > I came across this while browsing through lucy
> >
> > http://lucy.apache.org/docs/perl/Lucy/Search/PolySearcher.html
> >
> > Does Solr have an equivalent of this? My use case is exactly the same
> > (reading through multiple indices in a single shard and performing a
> > distributed search across shards).
> >
> > If not, can someone give me a hint? I tried swapping readers for a single
> > searcher, but it didn't help.
> >
> > --
> > With Thanks and Regards,
> > Ramprakash Ramamoorthy,
> > Project Trainee,
> > Zoho Corporation.
> > +91 9626975420
>
>
Dear Jeevanandam,

 Thanks for the response, but come on, I am aware of it. Try
reading my mail again. I will have to read through multiple indices in a
single shard, and have a distributed search across all shards.

-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Project Trainee,
Zoho Corporation.
+91 9626975420