Parallel SQL / calcite adapter

2015-11-19 Thread Kai Gülzau

We are currently evaluating Calcite as a SQL facade for different data sources:

-  JDBC

-  REST

-  SOLR

-  ...

I didn't find a "native" Calcite adapter for Solr
(http://calcite.apache.org/docs/adapter.html).

Is it a good idea to use the parallel sql feature (over jdbc) to connect 
calcite (or apache drill) to solr?
Any suggestions?


Thanks,

Kai Gülzau


Re: Upgrading from 4.x to 5.x

2015-11-19 Thread Daniel Miller
Thank you - but I still don't understand where to install/copy/modify 
config files or schema to point at my current index. My 4.x schema.xml was 
fairly well optimized, and I believe I removed any deprecated usage, so I 
assume it would be compatible with the 5.x server.


Daniel



On November 18, 2015 4:55:40 AM Jan Høydahl  wrote:


Hi

You could try this

Instead of example/, use the server/ folder (it has Jetty in it)
Start Solr using bin/solr start script instead of java -jar start.jar …
Leave your solrconfig and schema as is to keep back-compat with 4.x.
You may need to remove use of 3.x classes that were deprecated in 4.x

https://cwiki.apache.org/confluence/display/solr/Major+Changes+from+Solr+4+to+Solr+5

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


18. nov. 2015 kl. 10.10 skrev Daniel Miller :

Hi!

I'm a very inexperienced user with Solr.  I've been using Solr to provide 
indexes for my Dovecot IMAP server.  Using version 3.x, and later 4.x, I 
have been able to do so without too much of a challenge.  However, version 
5.x has certainly changed quite a bit and I'm very uncertain how to proceed.


I currently have a working 4.10.3 installation, using the "example" server 
provided with the Solr distribution package, and a schema.xml optimized for 
Dovecot.  I haven't found anything on migrating from 4 to 5 - at least 
anything I actually understood.  Can you point me in the right direction?


--
Daniel







Re: Upgrading from 4.x to 5.x

2015-11-19 Thread Muhammad Zahid Iqbal
Hi daniel

You need to update your config/schema files under a path like
'...\solr-dir\server\solr'. When you are done, you can update your
index path in solrconfig.xml.

I hope you got it.

Best,
Zahid


On Thu, Nov 19, 2015 at 1:58 PM, Daniel Miller  wrote:

> Thank you - but I still don't understand where to install/copy/modify
> config files or schema to point at my current index. My 4.x schema.xml was
> fairly well optimized, and I believe I removed any deprecated usage, so I
> assume it would be compatible with the 5.x server.
>
> Daniel
>
>
>
>
> On November 18, 2015 4:55:40 AM Jan Høydahl  wrote:
>
> Hi
>>
>> You could try this
>>
>> Instead of example/, use the server/ folder (it has Jetty in it)
>> Start Solr using bin/solr start script instead of java -jar start.jar …
>> Leave your solrconfig and schema as is to keep back-compat with 4.x.
>> You may need to remove use of 3.x classes that were deprecated in 4.x
>>
>>
>> https://cwiki.apache.org/confluence/display/solr/Major+Changes+from+Solr+4+to+Solr+5
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>>
>> 18. nov. 2015 kl. 10.10 skrev Daniel Miller :
>>>
>>> Hi!
>>>
>>> I'm a very inexperienced user with Solr.  I've been using Solr to
>>> provide indexes for my Dovecot IMAP server.  Using version 3.x, and later
>>> 4.x, I have been able to do so without too much of a challenge.  However,
>>> version 5.x has certainly changed quite a bit and I'm very uncertain how to
>>> proceed.
>>>
>>> I currently have a working 4.10.3 installation, using the "example"
>>> server provided with the Solr distribution package, and a schema.xml
>>> optimized for Dovecot.  I haven't found anything on migrating from 4 to 5 -
>>> at least anything I actually understood.  Can you point me in the right
>>> direction?
>>>
>>> --
>>> Daniel
>>>
>>
>>
>
>


Re: adding document with nested document require to set id

2015-11-19 Thread CrazyDiamond
If I add a document without nesting, the id is generated automatically (I use
UUID), and this was working perfectly until I tried to add nesting. I want
the same behaviour for nested documents as for non-nested ones.
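
(For reference, the flat-document case is typically wired up with an update chain like the
sketch below; the chain name and field name here are assumptions, not my exact config:)

  <updateRequestProcessorChain name="uuid">
    <!-- fills in "id" with a random UUID when the incoming document has none -->
    <processor class="solr.UUIDUpdateProcessorFactory">
      <str name="fieldName">id</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>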



--
View this message in context: 
http://lucene.472066.n3.nabble.com/adding-document-with-nested-document-require-to-set-id-tp4240908p4240979.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Upgrading from 4.x to 5.x

2015-11-19 Thread Daniel Miller

Not quite but I'm improving. Or something...

Looking under solr5/server/solr I see configsets with the three default 
choices. What "feels" right is to make a new folder in there for my app 
(dovecot) and then copy my solr4/example/solr/collection1/conf folder. I'm 
hoping I'm on the right track - maybe working too hard.


If that was correct, then I tried "solr create -n dovecot -c dovecot" 
(after stopping my old server and starting a new one) and it did create an 
entry. I then stopped the server, copied my old data folder over to the new 
location, and started the server.


I then tried searching, which may have worked...I'm not certain if the 
search results came from solr or my imap server manually searching.


I'm sure I'm overcomplicating things - just not seeing the obvious.

Daniel



On November 19, 2015 1:09:07 AM Muhammad Zahid Iqbal 
 wrote:



Hi daniel

You need to update your config/schema files under a path like
'...\solr-dir\server\solr'. When you are done, you can update your
index path in solrconfig.xml.

I hope you got it.

Best,
Zahid


On Thu, Nov 19, 2015 at 1:58 PM, Daniel Miller  wrote:


Thank you - but I still don't understand where to install/copy/modify
config files or schema to point at my current index. My 4.x schema.xml was
fairly well optimized, and I believe I removed any deprecated usage, so I
assume it would be compatible with the 5.x server.

Daniel




On November 18, 2015 4:55:40 AM Jan Høydahl  wrote:

Hi


You could try this

Instead of example/, use the server/ folder (it has Jetty in it)
Start Solr using bin/solr start script instead of java -jar start.jar …
Leave your solrconfig and schema as is to keep back-compat with 4.x.
You may need to remove use of 3.x classes that were deprecated in 4.x


https://cwiki.apache.org/confluence/display/solr/Major+Changes+from+Solr+4+to+Solr+5

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

18. nov. 2015 kl. 10.10 skrev Daniel Miller :


Hi!

I'm a very inexperienced user with Solr.  I've been using Solr to
provide indexes for my Dovecot IMAP server.  Using version 3.x, and later
4.x, I have been able to do so without too much of a challenge.  However,
version 5.x has certainly changed quite a bit and I'm very uncertain how to
proceed.

I currently have a working 4.10.3 installation, using the "example"
server provided with the Solr distribution package, and a schema.xml
optimized for Dovecot.  I haven't found anything on migrating from 4 to 5 -
at least anything I actually understood.  Can you point me in the right
direction?

--
Daniel












Re: Upgrading from 4.x to 5.x

2015-11-19 Thread Jan Høydahl
> Looking under solr5/server/solr I see configsets with the three default 
> choices. What "feels" right is to make a new folder in there for my app 
> (dovecot) and then copy my solr4/example/solr/collection1/conf folder. I'm 
> hoping I'm on the right track - maybe working too hard.

If you have an existing conf you don’t need to worry about config sets. That is 
a new concept for kickstarting new cores from.

> If that was correct, then I tried "solr create -n dovecot -c dovecot" (after 
> stopping my old server and starting a new one) and it did create an entry. I 
> then stopped the server, copied my old data folder over to the new location, 
> and started the server.

Assuming you’re not using SolrCloud.
1. Copy the server/ folder to your new location
2. cd server/solr (or the location you defined as SOLR HOME)
3. cp -r /path/to/old/solr4/SOLR_HOME/dovecot .
4. Make sure there is a core.properties file in the dovecot folder
5. Start solr, and you should have your core up and running as before
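
A minimal shell sketch of steps 2-5 above, assuming the old core lives at
/path/to/old/solr4/SOLR_HOME/dovecot and the new install at /path/to/solr5:

  # 2. go to the new Solr home
  cd /path/to/solr5/server/solr
  # 3. copy the old core (conf/ and data/) in
  cp -r /path/to/old/solr4/SOLR_HOME/dovecot .
  # 4. core discovery needs a core.properties file in the core folder
  echo 'name=dovecot' > dovecot/core.properties
  # 5. start Solr the new way
  /path/to/solr5/bin/solr start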


If you’re using SolrCloud and ZooKeeper there are other steps to follow.

Jan

Re: Upgrading from 4.x to 5.x

2015-11-19 Thread Upayavira


On Thu, Nov 19, 2015, at 10:03 AM, Jan Høydahl wrote:
> > Looking under solr5/server/solr I see configsets with the three default 
> > choices. What "feels" right is to make a new folder in there for my app 
> > (dovecot) and then copy my solr4/example/solr/collection1/conf folder. I'm 
> > hoping I'm on the right track - maybe working too hard.
> 
> If you have an existing conf you don’t need to worry about config sets.
> That is a new concept for kickstarting new cores from.
> 
> > If that was correct, then I tried "solr create -n dovecot -c dovecot" 
> > (after stopping my old server and starting a new one) and it did create an 
> > entry. I then stopped the server, copied my old data folder over to the new 
> > location, and started the server.
> 
> Assuming you’re not using SolrCloud.
> 1. Copy the server/ folder to your new location
> 2. cd server/solr (or the location you defined as SOLR HOME)
> 3. cp -r /path/to/old/solr4/SOLR_HOME/dovecot .
> 4. Make sure there is a core.properties file in the dovecot folder
> 5. Start solr, and you should have your core up and running as before

You can point Solr 5 at an existing directory (SOLR_HOME) that contains
your index and configs with the -s parameter:

bin/solr start -s /path/to/old/solr_home

Upayavira


How to config security.json?

2015-11-19 Thread Byzen Ma
Hi, I'm not quite familiar with security.json. I want to achieve the following:
(1) Anyone who wants to perform a read/select/query action should be required to
supply a username and password (authentication), whether from the Admin UI or from
SolrJ, and especially from the Admin UI, because I need to keep strangers away from
my Solr. (2) Anyone who wants to update, including delete, create and so on, also
needs a username and password, same as (1). I configured security.json as follows:

 

{
  "authentication":{
    "class":"solr.BasicAuthPlugin",
    "credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}},
  "authorization":{
    "class":"solr.RuleBasedAuthorizationPlugin",
    "user-role":{"solr":"admin"},
    "permissions":[
      {"name":"security-edit", "role":"admin"},
      {"name":"read", "role":"admin"},
      {"name":"update", "role":"admin"}],
    "":{"v":3}}}

 

However, it doesn't work! Worse, an error happens again and again when node2
recovers from node1. I start my SolrCloud from the command line with
"solr start -e cloud" and keep all the configs at their defaults. How can I set up
security.json to achieve my goals? Here are the server logs:

 

ERROR (RecoveryThread-gettingstarted_shard2_replica2) [c:gettingstarted s:shard2 r:core_node2 x:gettingstarted_shard2_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://(myhost):7574/solr/gettingstarted_shard2_replica1: Expected mime type application/octet-stream but got text/html.

Error 401 Unauthorized request, Response code: 401
HTTP ERROR 401
Problem accessing /solr/gettingstarted_shard2_replica1/update. Reason:
Unauthorized request, Response code: 401
Powered by Jetty://

 at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:528)
 at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:234)
 at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:226)
 at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
 at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:152)
 at org.apache.solr.cloud.RecoveryStrategy.commitOnLeader(RecoveryStrategy.java:207)
 at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:147)
 at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
 at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)

Error while trying to recover: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://(myhost):7574/solr/gettingstarted_shard2_replica1: Expected mime type application/octet-stream but got text/html.
(same 401 response and stack trace as above)

1531614 ERROR (qtp5264648-21) [c:gettingstarted s:shard1 r:core_node1 x:gettingstarted_shard1_replica2] o.a.s.h.a.ShowFileRequestHandler Can not find: /configs/gettingstarted/admin-extra.menu-top.html

1531614 ERROR (qtp5264648-16) [c:gettingstarted s:shard1 r:core_node1 x:gettingstarted_shard1_replica2] o.a.s.h.a.ShowFileRequestHandler Can not find: /configs/gettingstarted/admin-extra.menu-bottom.html

1531661 ERROR (qtp5264648-14) [c:gettingstarted s:shard1 r:core_node1 x:gettingstarted_shard1_replica2] o.a.s.h.a.ShowFileRequestHandler Can not find: /configs/gettingstarted/admin-extra.html

..

 

Kind regards,

Byzen Ma



Re: Upgrading from 4.x to 5.x

2015-11-19 Thread Muhammad Zahid Iqbal
Daniel,

You are close. Delete that *configsets* folder, paste your *collection1* folder,
and run the server. It will do the trick.

On Thu, Nov 19, 2015 at 2:54 PM, Daniel Miller  wrote:

> Not quite but I'm improving. Or something...
>
> Looking under solr5/server/solr I see configsets with the three default
> choices. What "feels" right is to make a new folder in there for my app
> (dovecot) and then copy my solr4/example/solr/collection1/conf folder. I'm
> hoping I'm on the right track - maybe working too hard.
>
> If that was correct, then I tried "solr create -n dovecot -c dovecot"
> (after stopping my old server and starting a new one) and it did create an
> entry. I then stopped the server, copied my old data folder over to the new
> location, and started the server.
>
> I then tried searching, which may have worked...I'm not certain if the
> search results came from solr or my imap server manually searching.
>
> I'm sure I'm overcomplicating things - just not seeing the obvious.
>
> Daniel
>
>
>
>
> On November 19, 2015 1:09:07 AM Muhammad Zahid Iqbal <
> zahid.iq...@northbaysolutions.net> wrote:
>
> Hi daniel
>>
>> You need to update your config/scehma file on the path like
>> '...\solr-dir\server\solr' . When you are done then you can update your
>> index path in solrconfig.xml.
>>
>> I hope you got it.
>>
>> Best,
>> Zahid
>>
>>
>> On Thu, Nov 19, 2015 at 1:58 PM, Daniel Miller  wrote:
>>
>> Thank you - but I still don't understand where to install/copy/modify
>>> config files or schema to point at my current index. My 4.x schema.xml
>>> was
>>> fairly well optimized, and I believe I removed any deprecated usage, so I
>>> assume it would be compatible with the 5.x server.
>>>
>>> Daniel
>>>
>>>
>>>
>>>
>>> On November 18, 2015 4:55:40 AM Jan Høydahl 
>>> wrote:
>>>
>>> Hi
>>>

 You could try this

 Instead of example/, use the server/ folder (it has Jetty in it)
 Start Solr using bin/solr start script instead of java -jar start.jar …
 Leave your solrconfig and schema as is to keep back-compat with 4.x.
 You may need to remove use of 3.x classes that were deprecated in 4.x



 https://cwiki.apache.org/confluence/display/solr/Major+Changes+from+Solr+4+to+Solr+5

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 18. nov. 2015 kl. 10.10 skrev Daniel Miller :

>
> Hi!
>
> I'm a very inexperienced user with Solr.  I've been using Solr to
> provide indexes for my Dovecot IMAP server.  Using version 3.x, and
> later
> 4.x, I have been able to do so without too much of a challenge.
> However,
> version 5.x has certainly changed quite a bit and I'm very uncertain
> how to
> proceed.
>
> I currently have a working 4.10.3 installation, using the "example"
> server provided with the Solr distribution package, and a schema.xml
> optimized for Dovecot.  I haven't found anything on migrating from 4
> to 5 -
> at least anything I actually understood.  Can you point me in the right
> direction?
>
> --
> Daniel
>
>


>>>
>>>
>
>


solr indexing warning

2015-11-19 Thread Midas A
Getting the following log message on Solr:


PERFORMANCE WARNING: Overlapping onDeckSearchers=2`


Json Facet api on nested doc

2015-11-19 Thread xavi jmlucjav
Hi,

I am trying to get some faceting with the json facet api on nested doc, but
I am having issues. Solr 5.3.1.

This query gets the bucket numbers OK:

curl http://shost:8983/solr/collection1/query -d 'q=*:*&rows=0&
 json.facet={
   yearly-salaries : {
type: terms,
field: salary,
domain: { blockChildren : "parent:true" }
  }
 }
'
Salary is a field in child docs only. But if I add another facet outside
it, the inner one returns no data:

curl http://shost:8983/solr/collection1/query -d 'q=*:*&rows=0&
 json.facet={
department:{
   type: terms,
   field: department,
   facet:{
   yearly-salaries : {
type: terms,
field: salary,
domain: { blockChildren : "parent:true" }
  }
  }
  }
 }
'
Results in:

"facets":{
  "count":3144071,
  "department":{
    "buckets":[{
        "val":"Development",
        "count":85707,
        "yearly-salaries":{
          "buckets":[]}},

department is a field only in parent docs. Am I doing something wrong that I
am missing?
thanks
xavi


Re: Security Problems

2015-11-19 Thread Jan Høydahl
Would it not be less surprising if ALL requests to Solr required authentication 
once an AuthenticationPlugin was enabled?
Then, if no AuthorizationPlugin was active, all authenticated users could do 
anything.
But if AuthorizationPlugin was configured, you could only do what your role 
allows you to?

As it is now it is super easy to forget a path, say you protect /select but not 
/browse and /query, or someone creates a collection
with some new endpoints and forgets to update security.json - then that 
endpoint would be wide open! 

What is the smallest possible security.json required currently to protect all 
possible paths (except those served by Jetty)?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 18. nov. 2015 kl. 20.31 skrev Upayavira :
> 
> I'm very happy for the admin UI to be served another way - i.e. not
> direct from Jetty, if that makes the task of securing it easier.
> 
> Perhaps a request handler specifically for UI resources which would make
> it possible to secure it all in a more straight-forward way?
> 
> Upayavira
> 
> On Wed, Nov 18, 2015, at 01:54 PM, Noble Paul wrote:
>> As of now the admin-ui calls are not protected. The static calls are
>> served by jetty and it bypasses the authentication mechanism
>> completely. If the admin UI relies on some API call which is served by
>> Solr.
>> The other option is to revamp the framework to take care of admin UI
>> (static content) as well. This would be cleaner solution
>> 
>> 
>> On Wed, Nov 18, 2015 at 2:32 PM, Upayavira  wrote:
>>> Not sure I quite understand.
>>> 
>>> You're saying that the cost for the UI is not large, but then suggesting
>>> we protect just one resource (/admin/security-check)?
>>> 
>>> Why couldn't we create the permission called 'admin-ui' and protect
>>> everything under /admin/ui/ for example? Along with the root HTML link
>>> too.
>>> 
>>> Upayavira
>>> 
>>> On Wed, Nov 18, 2015, at 07:46 AM, Noble Paul wrote:
 The authentication plugin is not expensive if you are talking in the
 context of admin UI. After all it is used not like 100s of requests
 per second.
 
 The simplest solution would be
 
 provide a well known permission name called "admin-ui"
 
 ensure that every admin page load makes a call to some resource say
 "/admin/security-check"
 
 Then we can just protect that .
 
 The only concern thatI have is the false sense of security it would
 give to the user
 
 But, that is a different point altogether
 
 On Wed, Nov 11, 2015 at 1:52 AM, Upayavira  wrote:
> Is the authentication plugin that expensive?
> 
> I can help by minifying the UI down to a smaller number of CSS/JS/etc
> files :-)
> 
> It may be overkill, but it would also give better experience. And isn't
> that what most applications do? Check authentication tokens on every
> request?
> 
> Upayavira
> 
> On Tue, Nov 10, 2015, at 07:33 PM, Anshum Gupta wrote:
>> The reason why we bypass that is so that we don't hit the authentication
>> plugin for every request that comes in for static content. I think we
>> could
>> call the authentication plugin for that but that'd be an overkill. Better
>> experience ? yes
>> 
>> On Tue, Nov 10, 2015 at 11:24 AM, Upayavira  wrote:
>> 
>>> Noble,
>>> 
>>> I get that a UI which is open source does not benefit from ACL control -
>>> we're not giving away anything that isn't public (other than perhaps
>>> info that could be used to identify the version of Solr, or even the
>>> fact that it *is* solr).
>>> 
>>> However, from a user experience point of view, requiring credentials to
>>> see the UI would be more conventional, and therefore lead to less
>>> confusion. Is it possible for us to protect the UI static files, only
>>> for the sake of user experience, rather than security?
>>> 
>>> Upayavira
>>> 
>>> On Tue, Nov 10, 2015, at 12:01 PM, Noble Paul wrote:
 The admin UI is a bunch of static pages . We don't let the ACL control
 static content
 
 you must blacklist all the core/collection apis and it is pretty much
 useless for anyone to access the admin UI (w/o the credentials , of
 course)
 
 On Tue, Nov 10, 2015 at 7:08 AM, 马柏樟  wrote:
> Hi,
> 
> After I configure Authentication with Basic Authentication Plugin and
>>> Authorization with Rule-Based Authorization Plugin, How can I prevent 
>>> the
>>> strangers from visiting my solr by browser? For example, if the stranger
>>> visit the http://(my host):8983, the browser will pop up a window and
>>> says "the server http://(my host):8983 requires a username and
>>> password"
 
 
 
 --
 -
 Noble Paul
>>>

Re: adding document with nested document require to set id

2015-11-19 Thread Mikhail Khludnev
Hello,
On Thu, Nov 19, 2015 at 12:48 PM, CrazyDiamond 
wrote:

> id is generated automatically(i use
> uuid)
>

How exactly you are doing that?

i tryed to add nesting. i want
> the same behaviour for nested documents as it was for not nested.
>

How exactly you want it to work with them?

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: solr indexing warning

2015-11-19 Thread Emir Arnautovic
This means that one searcher is still warming when another searcher is
created due to a commit with openSearcher=true. This can be caused by
frequent commits or by searcher warmup taking too long.
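
The knobs involved live in solrconfig.xml. A sketch of the typical settings to review
(the time values here are placeholders, not recommendations); reducing autowarmCount on
the caches also shortens warmup:

  <!-- hard commits flush to disk but should not open a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- soft commits control visibility; keep them far enough apart for warmup to finish -->
  <autoSoftCommit>
    <maxTime>30000</maxTime>
  </autoSoftCommit>

  <!-- cap on concurrently warming searchers; exceeding it turns the warning into an error -->
  <maxWarmingSearchers>2</maxWarmingSearchers>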


Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 19.11.2015 12:16, Midas A wrote:

Getting following log on solr


PERFORMANCE WARNING: Overlapping onDeckSearchers=2`



Re: Boost non stemmed keywords (KStem filter)

2015-11-19 Thread Jan Høydahl
Do you have a concept code for this? Don’t you also have to hack your query 
parser, e.g. dismax, to use other Query objects supporting payloads?
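
(For reference, the non-payload half of Markus's suggestion below is just an index-time
analyzer chain along these lines; a sketch assuming KStem as in the original question,
with the tokenizer choice being an assumption:)

  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- emit each token twice: once keyword-flagged as-is, once for stemming -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <!-- drop the duplicate when stemming did not change the token -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>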

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 18. nov. 2015 kl. 22.24 skrev Markus Jelsma :
> 
> Hi - easiest approach is to use KeywordRepeatFilter and 
> RemoveDuplicatesTokenFilter. This creates a slightly higher IDF for unstemmed 
> words which might be just enough in your case. We found it not to be enough, 
> so we also attach payloads to signify stemmed words amongst others. This 
> allows you to decrease score for stemmed words at query time via your 
> similarity impl.
> 
> M.
> 
> 
> 
> -Original message-
>> From:bbarani 
>> Sent: Wednesday 18th November 2015 22:07
>> To: solr-user@lucene.apache.org
>> Subject: Boost non stemmed keywords (KStem filter)
>> 
>> Hi,
>> 
>> I am using KStem factory for stemming. This stemmer converts 'france to
>> french', 'chinese to china' etc.. I am good with this stemming but I am
>> trying to boost the results that contain the original term compared to the
>> stemmed terms. Is this possible?
>> 
>> Thanks,
>> Learner
>> 
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Boost-non-stemmed-keywords-KStem-filter-tp4240880.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>> 



Re: replica recovery

2015-11-19 Thread Brian Scholl
I have opted to modify the number and size of transaction logs that I keep to 
resolve the original issue I described.  In so doing I think I have created a 
new problem, feedback is appreciated.

Here are the new updateLog settings:


<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
  <int name="numRecordsToKeep">1000</int>
  <int name="maxNumLogsToKeep">5760</int>
</updateLog>


First I want to make sure I understand what these settings do:
numRecordsToKeep: per transaction log file keep this number of documents
maxNumLogsToKeep: retain this number of transaction log files total

During my testing I thought I observed that a new tlog is created every time 
auto-commit is triggered (every 15 seconds in my case) so I set 
maxNumLogsToKeep high enough to contain an entire days worth of updates.   
Knowing that I could potentially need to bulk load some data I set 
numRecordsToKeep higher than my max throughput per replica for 15 seconds.

The problem that I think this has created is I am now running out of file 
descriptors on the servers.  After indexing new documents for a couple of hours,
some servers (not all) will start logging this error rapidly:

73021439 WARN  
(qtp1476011703-18-acceptor-0@6d5514d9-ServerConnector@6392e703{HTTP/1.1}{0.0.0.0:8983})
 [   ] o.e.j.s.ServerConnector
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at 
org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
at 
org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)

The output of ulimit -n for the user running the solr process is 1024.  I am 
pretty sure I can prevent this error from occurring  by increasing the limit on 
each server but it isn't clear to me how high it should be or if raising the 
limit will cause new problems.
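
For reference, a common way to raise the limit persistently on Linux is
/etc/security/limits.conf; the user name and value below are assumptions, not a
recommendation tuned to this workload:

  # /etc/security/limits.conf -- assuming Solr runs as the "solr" user
  solr  soft  nofile  65536
  solr  hard  nofile  65536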

Any advice you could provide in this situation would be awesome!

Cheers,
Brian



> On Oct 27, 2015, at 20:50, Jeff Wartes  wrote:
> 
> 
> On the face of it, your scenario seems plausible. I can offer two pieces
> of info that may or may not help you:
> 
> 1. A write request to Solr will not be acknowledged until an attempt has
> been made to write to all relevant replicas. So, B won’t ever be missing
> updates that were applied to A, unless communication with B was disrupted
> somehow at the time of the update request. You can add a min_rf param to
> your write request, in which case the response will tell you how many
> replicas received the update, but it’s still up to your indexer client to
> decide what to do if that’s less than your replication factor.
> 
> See 
> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+
> Tolerance for more info.
> 
> 2. There are two forms of replication. The usual thing is for the leader
> for each shard to write an update to all replicas before acknowledging the
> write itself, as above. If a replica is less than N docs behind the
> leader, the leader can replay those docs to the replica from its
> transaction log. If a replica is more than N docs behind though, it falls
> back to the replication handler recovery mode you mention, and attempts to
> re-sync the whole shard from the leader.
> The default N for this is 100, which is pretty low for a high-update-rate
> index. It can be changed by increasing the size of the transaction log,
> (via numRecordsToKeep) but be aware that a large transaction log size can
> delay node restart.
> 
> See 
> https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConf
> ig#UpdateHandlersinSolrConfig-TransactionLog for more info.
> 
> 
> Hope some of that helps, I don’t know a way to say
> delete-first-on-recovery.
> 
> 
> 
> On 10/27/15, 5:21 PM, "Brian Scholl"  wrote:
> 
>> Whoops, in the description of my setup that should say 2 replicas per
>> shard.  Every server has a replica.
>> 
>> 
>>> On Oct 27, 2015, at 20:16, Brian Scholl  wrote:
>>> 
>>> Hello,
>>> 
>>> I am experiencing a failure mode where a replica is unable to recover
>>> and it will try to do so forever.  In writing this email I want to make
>>> sure that I haven't missed anything obvious or missed a configurable
>>> option that could help.  If something about this looks funny, I would
>>> really like to hear from you.
>>> 
>>> Relevant details:
>>> - solr 5.3.1
>>> - java 1.8
>>> - ubuntu linux 14.04 lts
>>> - the cluster is composed of 1 SolrCloud collection with 100 shards
>>> backed by a 3 node zookeeper ensemble
>>> - there are 200 solr servers in the cluster, 1 replica per shard
>>> - a shard replica is larger than 50

Re:Re: Implementing security.json is breaking ADDREPLICA

2015-11-19 Thread 马柏樟
Hi Anshum,
I encounter the same problem after I config my security.json like this:
{ "authentication":{
"class":"solr.BasicAuthPlugin",
"credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= 
Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}},
  "authorization":{
"class":"solr.RuleBasedAuthorizationPlugin",
"user-role":{"solr":"admin"},
"permissions":[
  { "name":"security-edit",
"role":"admin"},
  { "name":"read",
"role":"admin"},
  { "name":"update",
"role":"admin"}],
"":{"v":3}}}


I just want to stop strangers from doing select/create/update operations on my
collections and from touching configs like schema.xml or other things related to Solr
itself, from both the Admin UI and SolrJ. But it doesn't work, and an error occurs like this:
ERROR (RecoveryThread-gettingstarted_shard2_replica2) [c:gettingstarted s:shard2 r:core_node2 x:gettingstarted_shard2_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://(myhost):7574/solr/gettingstarted_shard2_replica1: Expected mime type application/octet-stream but got text/html.

Error 401 Unauthorized request, Response code: 401
HTTP ERROR 401
Problem accessing /solr/gettingstarted_shard2_replica1/update. Reason:
Unauthorized request, Response code: 401
Powered by Jetty://

 at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:528)
 at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:234)
 at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:226)
 at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
 at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:152)
 at org.apache.solr.cloud.RecoveryStrategy.commitOnLeader(RecoveryStrategy.java:207)
 at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:147)
 at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
 at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
Error while trying to recover: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://(myhost):7574/solr/gettingstarted_shard2_replica1: Expected mime type application/octet-stream but got text/html.


Kind regards
Byzen Ma





At 2015-11-19 13:46:08, "Anshum Gupta"  wrote:
>Hi Craig,
>
>Just to be sure that you're using the feature as it should be used, can you
>outline what is it that you're trying to do here? There are a few things
>that aren't clear to me here, e.g. I see permissions for the /admin handler
>for a particular collection.
>
>What are the kind of permissions you're trying to set up.
>
>Solr uses it's internal PKI based mechanism for inter-shard communication
>and so you shouldn't really be hitting this. Can you check your logs and
>tell me if there are any other exceptions you see while bringing the node
>up etc. ? Something from PKI itself.
>
>About restricting the UI, there's another thread in parallel that's been
>discussing exactly that. The thing with the current UI implementation is
>that it bypasses all of this, primarily because most of that content is
>static. I am not saying we should be able to put it behind the
>authentication layer, but just that it's not currently supported through
>this plugin.
>
>On Wed, Nov 18, 2015 at 11:20 AM, Oakley, Craig (NIH/NLM/NCBI) [C] <
>craig.oak...@nih.gov> wrote:
>
>> Implementing security.json is breaking ADDREPLICA
>>
>> I have been able to reproduce this issue with minimal changes from an
>> out-of-the-box Zookeeper (3.4.6) and Solr (5.3.1): loading
>> configsets/basic_configs/conf into Zookeeper, creating the security.json
>> listed below, creating two nodes (one with a core named xmpl and one
>> without any core)- I can provide details if helpful.
>>
>> The security.json is as follows:
>>
>> {
>>   "authentication":{
>> "class":"solr.BasicAuthPlugin",
>> "credentials":{
>>   "solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
>> Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c=",
>>   "solruser":"VgZX1TAMNHT2IJikoGdKtxQdXc+MbNwfqzf89YqcLEE=
>> 37pPWQ9v4gciIKHuTmFmN0Rv66rnlMOFEWfEy9qjJfY="},
>> "":{"v":9}},
>>   "authorization":{
>> "class":"solr.RuleBasedAuthorizationPlugin",
>> "user-role":{
>>   "solr":[
>> "admin",
>> "read",
>> "xmpladmin",
>> "xmplgen",
>> "xmplsel"],
>>   "solruser":[
>> "read",
>> "xmplgen",
>> "xmplsel"]},
>> "permissions":[
>>   {
>> "name":"security-edit",
>> "role":"admin"},
>>   {
>> "name":"xmpl_admin",
>> "collection":"xmpl",
>> "path":"/admin/*",
>> "role":"xmpladmin"},
>>   {
>> "name":"xmpl_sel",
>> "collection":

RealTimeGetHandler doesn't retrieve documents

2015-11-19 Thread Jérémie MONSINJON
Hello everyone !

I'm using SolR 5.3.1 with solrj.SolrClient.
My index is split into 3 shards, each on a different server. (No replicas on the
dev platform.)
It has been up to date for a few days...


I'm trying to use the RealTimeGetHandler to get documents by their Id.
In our use case, documents are updated very frequently, so we have to look
in the tlog before searching the index.

When I use the SolrClient.getById() (with a list of document Ids recently
extracted from the index)

SolR doesn't return *all* the documents corresponding to these Ids.
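
(For reference, the call in question looks roughly like the sketch below; the ZooKeeper
address and collection name are placeholders:)

  import java.util.Arrays;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrDocumentList;

  // Minimal sketch of a multi-id real-time get via SolrJ 5.x
  public class RtgExample {
      public static void main(String[] args) throws Exception {
          try (CloudSolrClient client = new CloudSolrClient("zkhost:2181")) {
              SolrDocumentList docs = client.getById("index", Arrays.asList("id1", "id2", "id3"));
              System.out.println("found " + docs.size() + " documents");
          }
      }
  }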
So I tried to use the Solr api directly:

http://server:port/solr/index/get?ids=id1,id2,id3
And it is the same: some ids don't work.

In my example, id1 doesn't return a document; id2 and id3 are OK.

If I try a filtered query with id1, it works fine: the document exists
in the index and is found by SolR.

Can anybody explain why a document, present in the index, with no
uncommitted update or delete, is not found by the Real Time Get Handler?

Regards,
Jeremie


Re: Solr Keyword query on a specific field.

2015-11-19 Thread Aaron Gibbons
I apologize for the long delay in response.  I was able to get it to work
tho! Thank you!!

The local parameters were confusing to me at first. I'm using SolrNet to
build the search which has LocalParams that I am already specifying, but
those are not applied to the title portion.  What I ended up doing is
applying the local params like the link you provided suggested, basically
wrapping just the Title portion of the query.

So in SolrNet I did:
solrQueries.Add(new SolrQuery("{!q.op=AND mm=100% df=current_position_title
v='" + Title Keywords + "'}"));

And the Solr query ends up being:
{!q.op=AND+mm=100%+df=current_position_title+v='Title Keywords'}

Now users can enter their keywords or boolean string and it's working just
as a Solr keyword search does.  Exactly what I wanted!

I just have to figure out how to do the same thing using Sunspot tho.


On Sun, Nov 8, 2015 at 7:01 PM, davidphilip cherian <
davidphilipcher...@gmail.com> wrote:

> Nested queries might help.
>
> http://www.slideshare.net/erikhatcher/solr-query-parsing-tips-and-tricks
>
> On Mon, Nov 2, 2015 at 10:20 AM, Aaron Gibbons <
> agibb...@synergydatasystems.com> wrote:
>
> > The input for the title field is user based so a wide range of things can
> > be entered there.  Quoting the title is not what I'm looking for.  I also
> > checked and q.op is AND and MM is 100%.  In addition to the Title field
> the
> > user can also use general keywords so setting local params (df) to
> > something else would not work either to my knowledge.
> >
> > To give you a better idea of what I'm trying to accomplish: I have a form
> > to allow users to search on Title, Keywords and add a location. The
> correct
> > operators are applied between each of these and also for the main
> keywords
> > themselves.  The only issue is with the default operator being applied
> > within the Title sections's keywords. My goal is to have the Title
> keywords
> > work the same as the general keywords but only be applied to the title
> > field vs the default text field.
> >
> > On Fri, Oct 30, 2015 at 6:35 PM, davidphilip cherian <
> > davidphilipcher...@gmail.com> wrote:
> >
> > > >> "Is there any way to have a single field search use the same keyword
> > > search logic as the default query?"
> > > Do a phrase search, with double quotes surrounding the multiple
> keywords,
> > > it should work.
> > >
> > > Try q=title:("Test Keywords")
> > >
> > > You could possibly try adding this q.op as local param to query as
> shown
> > > below.
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries
> > >
> > > If you are using edismax query parser, check for what is mm pram
> > > set. q.op=AND => mm=100%; q.op=OR => mm=0%)
> > >
> > >
> >
> https://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29
> > >
> > >
> > > On Fri, Oct 30, 2015 at 3:27 PM, Aaron Gibbons <
> > > agibb...@synergydatasystems.com> wrote:
> > >
> > > > Is there any way to have a single field search use the same keyword
> > > search
> > > > logic as the default query? I define q.op as AND in my query which
> gets
> > > > applied to any main keywords but any keywords I'm trying to use
> within
> > a
> > > > field do not get the same logic applied.
> > > > Example:
> > > > q=(title:(Test Keywords)) the space is treated as OR regardless of
> q.op
> > > > q=(Test Keywords) the space is defined by q.op which is AND
> > > >
> > > > Using the correct operators (AND OR * - +...) it works great as I
> have
> > it
> > > > defined. There's just this one little caveat when you use spaces
> > between
> > > > keywords expecting the q.op operator to be applied.
> > > > Thanks,
> > > > Aaron
> > > >
> > >
> >
>


RE: Re:Re: Implementing security.json is breaking ADDREPLICA

2015-11-19 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
Thank you for the reply.

What we are attempting is to require a password for practically everything, so 
that even were a hacker to get within the firewall, they would have limited 
access to the various services (the Security team even complained that, for 
Solr 4.5 servers, attempts to access host:port (without "/solr") resulted in an 
error message that gave the full pathname to solr.war)

I am sending the solr.log files directly to Anshum, so as not to clutter up the 
Email list.

The steps I used to recreate the problem are as follows:
cd zookeeper-3.4.6/conf/
sed 's/2181/4545/' zoo_sample.cfg | tee zoo_sample4545.cfg 
cd ../bin
./zkServer.sh start zoo_sample4545.cfg
cd ../../solr-5.3.1/server/solr
mkdir xmpl
echo 'name=xmpl' | tee xmpl/core.properties
mkdir xmpl/data
mkdir xmpl/data/index
mkdir xmpl/data/tlog
cd ../scripts/cloud-scripts/
./zkcli.sh --zkhost localhost:4545 -cmd makepath /solr
./zkcli.sh --zkhost localhost:4545 -cmd makepath /solr/xmpl
./zkcli.sh --zkhost localhost:4545/solr/xmpl  -cmd upconfig -confdir 
../../solr/configsets/basic_configs/conf -confname xmpl
mkdir ../../example/solr
cp solr.xml ../../example/solr
./zkcli.sh --zkhost localhost:4545/solr/xmpl  -cmd putfile /security.json 
~/solr/security151117a.json 
cd ../../../bin
mkdir  ../example/solr/pid
./solr -c -p 4575 -d ~dbman/solr/straight531outofbox/solr-5.3.1/server/ -z 
localhost:4545/solr/xmpl -s 
~dbman/solr/straight531outofbox/solr-5.3.1/example/solr
./solr -c -p 4565 -d ~dbman/solr/straight531outofbox/solr-5.3.1/server/ -z 
localhost:4545/solr/xmpl -s 
~dbman/solr/straight531outofbox/solr-5.3.1/server/solr
curl -u solr:SolrRocks 'http://{IP-address-redacted}:4575/solr/admin/collections?action=ADDREPLICA&collection=xmpl&shard=shard1&node={IP-address-redacted}:4575_solr&wt=json&indent=true'

The contents of security151117a.json is included in the original post

If there is a better way using the Well Known Permissions as described at 
lucidworks.com/blog/2015/08/17/securing-solr-basic-auth-permission-rules, I'm 
open to trying that.

I would like to acknowledge that there definitely seem to be some IMPROVEMENTS 
in the security.json implementation: particularly in terms of Core Admin (using 
jetty-implemented Authentication in webdefault.xml, anyone who could get into 
the GUI front page could rename cores, unless prevented by OS-level permissions 
on core.properties).


Thanks again


Re: replica recovery

2015-11-19 Thread Erick Erickson
First, every time you autocommit there _should_ be a new
tlog created. A hard commit truncates the tlog by design.

My guess (not based on knowing the code) is that
Real Time Get needs file handle open to the tlog files
and you'll have a bunch of them. Lots and lots and lots. Thus
the too many file handles is just waiting out there for you.

However, this entire approach is, IMO, not going to solve
anything for you. Or rather other problems will come out
of the woodwork.

To wit: At some point, you _will_ need to have at least as
much free space on your disk as your current index occupies,
even without recovery. Background merging of segments can
effectively do the same thing as an optimize step, which rewrites
the entire index to new segments before deleting the old
segments. So far you haven't hit that situation in steady-state,
but you will.

Simply put, I think you're wasting your time pursuing the tlog
option. You must have bigger disks or smaller indexes such
that there is at least as much free disk space at all times as
your index occupies. In fact if the tlogs are on the same
drive as your index, the tlog option you're pursuing is making
the situation _worse_ by making running out of disk space
during a merge even more likely.

So unless there's a compelling reason you can't use bigger
disks, IMO you'll waste lots and lots of valuable
engineering time before... buying bigger disks.

Best,
Erick

On Thu, Nov 19, 2015 at 6:21 AM, Brian Scholl  wrote:
> I have opted to modify the number and size of transaction logs that I keep to 
> resolve the original issue I described.  In so doing I think I have created a 
> new problem, feedback is appreciated.
>
> Here are the new updateLog settings:
>
> <updateLog>
>   <str name="dir">${solr.ulog.dir:}</str>
>   <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
>   <int name="numRecordsToKeep">1000</int>
>   <int name="maxNumLogsToKeep">5760</int>
> </updateLog>
>
> First I want to make sure I understand what these settings do:
> numRecordsToKeep: per transaction log file keep this number of 
> documents
> maxNumLogsToKeep: retain this number of transaction log files total
>
> During my testing I thought I observed that a new tlog is created every time 
> auto-commit is triggered (every 15 seconds in my case) so I set 
> maxNumLogsToKeep high enough to contain an entire days worth of updates.   
> Knowing that I could potentially need to bulk load some data I set 
> numRecordsToKeep higher than my max throughput per replica for 15 seconds.
>
> The problem that I think this has created is I am now running out of file 
> descriptors on the servers.  After indexing new documents for a couple hours 
> a some servers (not all) will start logging this error rapidly:
>
> 73021439 WARN  
> (qtp1476011703-18-acceptor-0@6d5514d9-ServerConnector@6392e703{HTTP/1.1}{0.0.0.0:8983})
>  [   ] o.e.j.s.ServerConnector
> java.io.IOException: Too many open files
> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> at 
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> at 
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> at 
> org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
> at 
> org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:745)
>
> The output of ulimit -n for the user running the solr process is 1024.  I am 
> pretty sure I can prevent this error from occurring  by increasing the limit 
> on each server but it isn't clear to me how high it should be or if raising 
> the limit will cause new problems.
>
> Any advice you could provide in this situation would be awesome!
>
> Cheers,
> Brian
>
>
>
>> On Oct 27, 2015, at 20:50, Jeff Wartes  wrote:
>>
>>
>> On the face of it, your scenario seems plausible. I can offer two pieces
>> of info that may or may not help you:
>>
>> 1. A write request to Solr will not be acknowledged until an attempt has
>> been made to write to all relevant replicas. So, B won’t ever be missing
>> updates that were applied to A, unless communication with B was disrupted
>> somehow at the time of the update request. You can add a min_rf param to
>> your write request, in which case the response will tell you how many
>> replicas received the update, but it’s still up to your indexer client to
>> decide what to do if that’s less than your replication factor.
>>
>> See
>> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+
>> Tolerance for more info.
>>
>> 2. There are two forms of replication. The usual thing is for the leader
>> for each shard to write an update to all replicas before acknowledging the
>> write itself, as above. If a replica is less than N docs behind the
>> leader, the leader can replay thos

Re: adding document with nested document require to set id

2015-11-19 Thread CrazyDiamond
How exactly you are doing that? 
Doing what? 

this is from schema.
   
   
id
   
...
 


this is from config
I want to store in a nested document multiple values that should be grouped
together, like page ids and page urls.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/adding-document-with-nested-document-require-to-set-id-tp4240908p4241058.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Stem Words Highlighted - Keyword Not Highlighted

2015-11-19 Thread Ann B
Thank you Jack.  The field I was passing to Solr actually uses the
following:

Tokenizer:  StandardTokenizerFactory

Filters:

StopFilterFactory
LengthFilterFactory
LowerCaseFilterFactory
RemoveDuplicatesTokenFilterFactory

Once I passed in the correct field that uses the white space tokenizer and
the
WordDelimiterFilterFactory, all is well.
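
For anyone hitting the same symptom, the analyzer that behaves well here is roughly the
sketch below; the type name and attribute values are assumptions, not a copy of the
Drupal schema:

  <fieldType name="text_hl" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- splits off the trailing period of "stocks." so the stemmer sees a clean token -->
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>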


On Thu, Oct 29, 2015 at 8:16 AM, Jack Krupansky 
wrote:

> Did you index the data before adding the word delimiter filter? The white
> space tokenizer preserves the period after "stocks.", but the WDF should
> remove it. The period is likely interfering with stemming.
>
> Are your filters the same for index time and query time?
>
> -- Jack Krupansky
>
> On Tue, Aug 18, 2015 at 3:31 PM, Ann B  wrote:
>
> > Question:
> >
> > Can I configure solr to highlight the keyword also?  The search results
> are
> > correct, but the highlighting is not complete.
> >
> > *
> >
> > Example:
> >
> > Keyword: stocks
> >
> > Request: (I only provided the url parameters below.)
> >
> > hl=true&
> > hl.fl=spell&
> > hl.simple.pre=%5BHIGHLIGHT%5D&
> > hl.simple.post=%5B%2FHIGHLIGHT%5D&
> > hl.snippets=3&
> > hl.fragsize=70&
> > hl.mergeContiguous=true&
> >
> > fl=item_id%2Cscore&
> >
> > qf=tm_body%3Avalue%5E1.0&
> > qf=tm_title%5E13.0&
> >
> > fq=im_field_webresource_category%3A%226013%22&
> > fq=index_id%3Atest&
> >
> >
> >
> start=0&rows=10&facet=true&facet.sort=count&facet.limit=10&facet.mincount=1&facet.missing=false&facet.field=im_field_webresource_category&f.im_field_webresource_category.facet.limit=50&
> >
> > wt=json&json.nl=map&
> >
> > q=%22stocks%22
> >
> > *
> >
> > Response:
> >
> > "highlighting":{
> > "test-49904":{"spell":[
> > "Includes free access to [HIGHLIGHT]stock[/HIGHLIGHT] charts and
> > instruction about using [HIGHLIGHT]stock[/HIGHLIGHT] charts in technical
> > analysis of stocks. Paid subscriptions provide access to more
> > information."]},...
> >
> > *
> >
> > Details:
> >
> > Tokenizer:  
> >
> > Filters:
> >
> >  > ignoreCase="true" expand="true"/>  > ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
> >  > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> > preserveOriginal="1"/>  > max="100"/>   > class="solr.*SnowballPorterFilterFactory*" language="English"
> > protected="protwords.txt"/>  > class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >
> > I think I'm using the Standard Highlighter.
> >
> > I’m using the Drupal 7 search api solr configuration files without
> > modification.
> >
> >
> > Thank you,
> >
> > Ann
> >
>


Re: Security Problems

2015-11-19 Thread Noble Paul
What is the smallest possible security.json required currently to
protect all possible paths (except those served by Jetty)?



You would need 2 rules:

1)
"name":"all-admin",
"collection": null,
"path":"/*",
"role":"somerole"

2) all core handlers
"name":"all-core-handlers",
"path":"/*",
"role":"somerole"


ideally we should have a simple permission name called "all" (which we
don't have)

so that one rule should be enough

"name":"all",
"role":"somerole"

Open a ticket and we should fix it for 5.4.0
It should also include  the admin paths as well
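
Until then, a sketch of a security.json carrying the two rules above; the credentials
are the stock solr/SolrRocks example hash and the role name is arbitrary:

  {
    "authentication": {
      "class": "solr.BasicAuthPlugin",
      "credentials": {
        "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
      }
    },
    "authorization": {
      "class": "solr.RuleBasedAuthorizationPlugin",
      "user-role": { "solr": "somerole" },
      "permissions": [
        { "name": "all-admin", "collection": null, "path": "/*", "role": "somerole" },
        { "name": "all-core-handlers", "path": "/*", "role": "somerole" }
      ]
    }
  }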


On Thu, Nov 19, 2015 at 6:02 PM, Jan Høydahl  wrote:
> Would it not be less surprising if ALL requests to Solr required 
> authentication once an AuthenticationPlugin was enabled?
> Then, if no AuthorizationPlugin was active, all authenticated users could do 
> anything.
> But if AuthorizationPlugin was configured, you could only do what your role 
> allows you to?
>
> As it is now it is super easy to forget a path, say you protect /select but 
> not /browse and /query, or someone creates a collection
> with some new endpoints and forgets to update security.json - then that 
> endpoint would be wide open!
>
> What is the smallest possible security.json required currently to protect all 
> possible paths (except those served by Jetty)?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
>> 18. nov. 2015 kl. 20.31 skrev Upayavira :
>>
>> I'm very happy for the admin UI to be served another way - i.e. not
>> direct from Jetty, if that makes the task of securing it easier.
>>
>> Perhaps a request handler specifically for UI resources which would make
>> it possible to secure it all in a more straight-forward way?
>>
>> Upayavira
>>
>> On Wed, Nov 18, 2015, at 01:54 PM, Noble Paul wrote:
>>> As of now the admin-ui calls are not protected. The static calls are
>>> served by jetty and it bypasses the authentication mechanism
>>> completely. If the admin UI relies on some API call which is served by
>>> Solr.
>>> The other option is to revamp the framework to take care of admin UI
>>> (static content) as well. This would be cleaner solution
>>>
>>>
>>> On Wed, Nov 18, 2015 at 2:32 PM, Upayavira  wrote:
 Not sure I quite understand.

 You're saying that the cost for the UI is not large, but then suggesting
 we protect just one resource (/admin/security-check)?

 Why couldn't we create the permission called 'admin-ui' and protect
 everything under /admin/ui/ for example? Along with the root HTML link
 too.

 Upayavira

 On Wed, Nov 18, 2015, at 07:46 AM, Noble Paul wrote:
> The authentication plugin is not expensive if you are talking in the
> context of admin UI. After all it is used not like 100s of requests
> per second.
>
> The simplest solution would be
>
> provide a well known permission name called "admin-ui"
>
> ensure that every admin page load makes a call to some resource say
> "/admin/security-check"
>
> Then we can just protect that .
>
> The only concern thatI have is the false sense of security it would
> give to the user
>
> But, that is a different point altogether
>
> On Wed, Nov 11, 2015 at 1:52 AM, Upayavira  wrote:
>> Is the authentication plugin that expensive?
>>
>> I can help by minifying the UI down to a smaller number of CSS/JS/etc
>> files :-)
>>
>> It may be overkill, but it would also give better experience. And isn't
>> that what most applications do? Check authentication tokens on every
>> request?
>>
>> Upayavira
>>
>> On Tue, Nov 10, 2015, at 07:33 PM, Anshum Gupta wrote:
>>> The reason why we bypass that is so that we don't hit the authentication
>>> plugin for every request that comes in for static content. I think we
>>> could
>>> call the authentication plugin for that but that'd be an overkill. 
>>> Better
>>> experience ? yes
>>>
>>> On Tue, Nov 10, 2015 at 11:24 AM, Upayavira  wrote:
>>>
 Noble,

 I get that a UI which is open source does not benefit from ACL control 
 -
 we're not giving away anything that isn't public (other than perhaps
 info that could be used to identify the version of Solr, or even the
 fact that it *is* solr).

 However, from a user experience point of view, requiring credentials to
 see the UI would be more conventional, and therefore lead to less
 confusion. Is it possible for us to protect the UI static files, only
 for the sake of user experience, rather than security?

 Upayavira

 On Tue, Nov 10, 2015, at 12:01 PM, Noble Paul wrote:
> The admin UI is a bunch of static pages . We don't let the ACL control
> static content
>
> you must blacklist all the core/collection apis and it is pretty much
> useless f

Re: adding document with nested document require to set id

2015-11-19 Thread Mikhail Khludnev
Hi,

Perhaps you want a UUIDUpdateProcessorFactory-like processor to loop through
SolrInputDocument.getChildDocuments() and assign a generated value. You need
to implement your own update processor (by extending one of the existing ones).
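
A rough sketch of such a processor, under the assumption that the uniqueKey field is
"id"; the class name is made up and error handling is omitted:

  import java.io.IOException;
  import java.util.UUID;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  // Assigns a UUID to the parent and to every child document that has no "id" yet.
  public class ChildUuidUpdateProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
      return new UpdateRequestProcessor(next) {
        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          assignId(cmd.getSolrInputDocument());
          super.processAdd(cmd);
        }

        private void assignId(SolrInputDocument doc) {
          if (!doc.containsKey("id")) {
            doc.setField("id", UUID.randomUUID().toString());
          }
          if (doc.getChildDocuments() != null) {
            for (SolrInputDocument child : doc.getChildDocuments()) {
              assignId(child);  // recurse in case children have their own children
            }
          }
        }
      };
    }
  }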

On Thu, Nov 19, 2015 at 7:41 PM, CrazyDiamond  wrote:

> How exactly you are doing that?
> Doing what?
>  required="true"
> />
> this is from schema.
>
>
> id
>
> ...
>  
> 
>
> this is from config
>  i want to store in nested document  multiple values that should be grouped
> together, like pages ids and pages urls
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/adding-document-with-nested-document-require-to-set-id-tp4240908p4241058.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Error in log after upgrading Solr

2015-11-19 Thread Shawn Heisey
On 11/18/2015 3:29 PM, Shawn Heisey wrote:
> I'll see if I can put together a minimal configuration to reproduce. 

The really obvious idea here was to start with the example server, built
from the same source I used for the real thing, and create a core based
on sample_techproducts_configs.  Because removing newSearcher and
firstSearcher fixed the problem for me, the next step was to configure
similar newSearcher and firstSearcher queries to what I used to have in
my config, and try indexing docs.

I did this, and the problem did not reproduce.

Next I copied my configuration over, removing the custom components and
my DIH config so I would not need to add those jars.  The problem still
did not reproduce.

I then copied my schema.xml and adjusted that so it would start.  For
this I needed to create server/solr/lib and add the ICU jars to it.  The
5.3.2 snapshot with SOLR-6188 allows this to work right.

When I make simple indexing requests using the admin UI that include all
required fields, the problem does not happen.  Figuring out a minimum
reproducible testcase is going to take more work, and I don't have a lot
of time to put into it.

Thanks,
Shawn



Re: replica recovery

2015-11-19 Thread Brian Scholl
Hey Erick,

Thanks for the reply.  

I plan on rebuilding my cluster soon with more nodes so that the index size 
(including tlogs) is under 50% of the available disk at a minimum; ideally we 
will shoot for under 33%, budget permitting.  I think I now understand the 
problem that managing this resource will solve and I appreciate your (and 
Shawn's) feedback. 

I would still like to increase the number of transaction logs retained so that 
shard recovery (outside of long term failures) is faster than replicating the 
entire shard from the leader.  I understand that this is an optimization and 
not a 
solution for replication.  If I'm being thick about this call me out :)

Cheers,
Brian




> On Nov 19, 2015, at 11:30, Erick Erickson  wrote:
> 
> First, every time you autocommit there _should_ be a new
> tlog created. A hard commit truncates the tlog by design.
> 
> My guess (not based on knowing the code) is that
> Real Time Get needs file handle open to the tlog files
> and you'll have a bunch of them. Lots and lots and lots. Thus
> the too many file handles is just waiting out there for you.
> 
> However, this entire approach is, IMO, not going to solve
> anything for you. Or rather other problems will come out
> of the woodwork.
> 
> To whit: At some point, you _will_ need to have at least as
> much free space on your disk as your current index occupies,
> even without recovery. Background merging of segments can
> effectively do the same thing as an optimize step, which rewrites
> the entire index to new segments before deleting the old
> segments. So far you haven't hit that situation in steady-state,
> but you will.
> 
> Simply put, I think you're wasting your time pursuing the tlog
> option. You must have bigger disks or smaller indexes such
> that there is at least as much free disk space at all times as
> your index occupies. In fact if the tlogs are on the same
> drive as your index, the tlog option you're pursuing is making
> the situation _worse_ by making running out of disk space
> during a merge even more likely.
> 
> So unless there's a compelling reason you can't use bigger
> disks, IMO you'll waste lots and lots of valuable
> engineering time before... buying bigger disks.
> 
> Best,
> Erick
> 
> On Thu, Nov 19, 2015 at 6:21 AM, Brian Scholl  wrote:
>> I have opted to modify the number and size of transaction logs that I keep 
>> to resolve the original issue I described.  In so doing I think I have 
>> created a new problem, feedback is appreciated.
>> 
>> Here are the new updateLog settings:
>> 
>>     <updateLog>
>>       <str name="dir">${solr.ulog.dir:}</str>
>>       <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
>>       <int name="numRecordsToKeep">1000</int>
>>       <int name="maxNumLogsToKeep">5760</int>
>>     </updateLog>
>> 
>> First I want to make sure I understand what these settings do:
>>numRecordsToKeep: per transaction log file keep this number of 
>> documents
>>maxNumLogsToKeep: retain this number of transaction log files total
>> 
>> During my testing I thought I observed that a new tlog is created every time 
>> auto-commit is triggered (every 15 seconds in my case) so I set 
>> maxNumLogsToKeep high enough to contain an entire days worth of updates.   
>> Knowing that I could potentially need to bulk load some data I set 
>> numRecordsToKeep higher than my max throughput per replica for 15 seconds.
>> 
>> The problem that I think this has created is I am now running out of file 
>> descriptors on the servers.  After indexing new documents for a couple of hours, 
>> some servers (not all) will start logging this error rapidly:
>> 
>> 73021439 WARN  
>> (qtp1476011703-18-acceptor-0@6d5514d9-ServerConnector@6392e703{HTTP/1.1}{0.0.0.0:8983})
>>  [   ] o.e.j.s.ServerConnector
>> java.io.IOException: Too many open files
>>at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>>at 
>> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>>at 
>> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>>at 
>> org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
>>at 
>> org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500)
>>at 
>> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>>at 
>> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>>at java.lang.Thread.run(Thread.java:745)
>> 
>> The output of ulimit -n for the user running the solr process is 1024.  I am 
>> pretty sure I can prevent this error from occurring  by increasing the limit 
>> on each server but it isn't clear to me how high it should be or if raising 
>> the limit will cause new problems.
>> 
>> Any advice you could provide in this situation would be awesome!
>> 
>> Cheers,
>> Brian
>> 
>> 
>> 
>>> On Oct 27, 2015, at 20:50, Jeff Wartes  wrote:
>>> 
>>> 
>>> On the face of it, your scenario seems plausible. I can offer two pieces
>>> of info that may or may not help you:
>>> 
>>> 1. A writ

Re: adding document with nested document require to set id

2015-11-19 Thread CrazyDiamond
How to do this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/adding-document-with-nested-document-require-to-set-id-tp4240908p4241091.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Re:Re: Implementing security.json is breaking ADDREPLICA

2015-11-19 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
I note that the thread called "Security Problems" (most recent post by Noble 
Paul) seems like it may help with much of what I'm trying to do. I will see to 
what extent that may help.


Generating Index offline and loading into solrcloud

2015-11-19 Thread KNitin
Hi,

 I was wondering if there are existing tools that will generate solr index
offline (in solrcloud mode)  that can be later on loaded into solrcloud,
before I decide to implement my own. I found some tools that do only solr
based index loading (non-zk mode). Is there one with zk mode enabled?


Thanks in advance!
Nitin


Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread Sameer Maggon
If you are trying to create a large index and want speedups there, you
could use the MapReduceTool -
https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr. At a
high level, it takes your files (csv, json, etc.) as input and can create either
a single or a sharded index that you can then copy to your Solr
servers. I've used this to create indexes that include hundreds of millions
of documents in a fairly decent amount of time.

Thanks,
-- 
*Sameer Maggon*
Measured Search
www.measuredsearch.com 

On Thu, Nov 19, 2015 at 11:17 AM, KNitin  wrote:

> Hi,
>
>  I was wondering if there are existing tools that will generate solr index
> offline (in solrcloud mode)  that can be later on loaded into solrcloud,
> before I decide to implement my own. I found some tools that do only solr
> based index loading (non-zk mode). Is there one with zk mode enabled?
>
>
> Thanks in advance!
> Nitin
>


Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread KNitin
Great. Thanks!

On Thu, Nov 19, 2015 at 11:24 AM, Sameer Maggon 
wrote:

> If you are trying to create a large index and want speedups there, you
> could use the MapReduceTool -
> https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr. At a
> high level, it takes your files (csv, json, etc) as input can create either
> a single or a sharded index that you can either copy it to your Solr
> Servers. I've used this to create indexes that include hundreds of millions
> of documents in fairly decent amount of time.
>
> Thanks,
> --
> *Sameer Maggon*
> Measured Search
> www.measuredsearch.com 
>
> On Thu, Nov 19, 2015 at 11:17 AM, KNitin  wrote:
>
> > Hi,
> >
> >  I was wondering if there are existing tools that will generate solr
> index
> > offline (in solrcloud mode)  that can be later on loaded into solrcloud,
> > before I decide to implement my own. I found some tools that do only solr
> > based index loading (non-zk mode). Is there one with zk mode enabled?
> >
> >
> > Thanks in advance!
> > Nitin
> >
>


Re: adding document with nested document require to set id

2015-11-19 Thread Mikhail Khludnev
It should be explained http://wiki.apache.org/solr/UpdateRequestProcessor
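
As a rough sketch (not from the wiki page; the custom factory class name is the
hypothetical one from earlier in this thread), wiring such a processor into
solrconfig.xml could look like:

<updateRequestProcessorChain name="assign-child-ids" default="true">
  <!-- custom factory that fills in missing child document ids -->
  <processor class="com.example.ChildDocUuidProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain should end with RunUpdateProcessorFactory so the documents are
actually indexed.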


On Thu, Nov 19, 2015 at 9:27 PM, CrazyDiamond  wrote:

> How to do this?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/adding-document-with-nested-document-require-to-set-id-tp4240908p4241091.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





RE: Expand Component Fields Response

2015-11-19 Thread Sanders, Marshall (AT - Atlanta)
Joel,

Thanks for the response.  I updated the JIRA with 2 new patches.  One for 
trunk, and one for branches/branch_5x.  It would be great if it could get 
reviewed and make it in before 5.4 if it meets approval.

https://issues.apache.org/jira/browse/SOLR-8306
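
For reference, the use case is a request along these lines (the collapse field
and query values below are only illustrative, not taken from the thread):

q=camry&fq={!collapse field=dealer_id}&expand=true&expand.rows=0

With the patch, expand.rows=0 would return just the per-group numFound (for a
"52 more like this" style display) without fetching any expanded documents.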


Thanks,

Marshall Sanders
Technical Lead – Software Engineer
Autotrader.com
404-568-7130

-Original Message-
From: Joel Bernstein [mailto:joels...@gmail.com] 
Sent: Tuesday, November 17, 2015 5:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Expand Component Fields Response

Hi Marshall,

This sounds pretty reasonable. I should have some time to review the patch later in 
the week.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Nov 17, 2015 at 3:42 PM, Sanders, Marshall (AT - Atlanta) < 
marshall.sand...@autotrader.com> wrote:

> Well I didn't receive any responses and couldn't find any resources so 
> I created a patch and a corresponding JIRA to allow the 
> ExpandComponent to use the TotalHitCountCollector which will only 
> return the total hit count when expand.rows=0 which more accurately 
> reflected my use case.  (We don't care about the expanded document 
> itself, just the number available so that we can show something like 
> "52 more like this")
>
> https://issues.apache.org/jira/browse/SOLR-8306
>
> I'm not sure if this is the right forum for something like this, or 
> how to get feedback on the patch.  If someone more knowledgeable could 
> help me out in that area it would be excellent.  Thanks!
>
> Marshall Sanders
> Technical Lead - Software Engineer
> Autotrader.com
> 404-568-7130
>
> -Original Message-
> From: Sanders, Marshall (AT - Atlanta) [mailto:
> marshall.sand...@autotrader.com]
> Sent: Monday, November 16, 2015 11:00 AM
> To: solr-user@lucene.apache.org
> Subject: Expand Component Fields Response
>
> Is it possible to specify a separate set of fields to return from the 
> expand component which is different from the standard fl parameter?
> Something like this:
>
> fl=fielda&expand.fl=fieldb
>
> Our current use case means we actually only care about the numFound 
> from the expand component and not any of the actual fields.  We could 
> also use a facet for the field we're collapsing on, but this means 
> mapping from the field we collapsed on to the different facets and 
> isn't very elegant, and we also have to ask for a large facet.limit to 
> make sure that we get the appropriate counts back.  This is pretty poor for 
> high cardinality fields.
> The alternative is the current where we ask for the expand component 
> and get TONS of information back that we don't care about.
>
> Thanks for any help!
>
> Marshall Sanders
> Technical Lead - Software Engineer
> Autotrader.com
> 404-568-7130
>
>


Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread Erick Erickson
Note two things:

1> this is running on Hadoop
2> it is part of the standard Solr release as MapReduceIndexerTool,
look in the contribs...

If you're trying to do this yourself, you must be very careful to index docs
to the correct shard then merge the correct shards. MRIT does this all
automatically.

Additionally, it has the cool feature that if (and only if) your Solr
index is running over
HDFS, the --go-live option will automatically merge the indexes into
the appropriate
running Solr instances.

One caveat. This tool doesn't handle _updating_ documents. So if you
run it twice
on the same data set, you'll have two copies of every doc. It's
designed as a bulk
initial-load tool.
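
An invocation looks roughly like the sketch below; the jar name, HDFS paths, and
morphline file are placeholders, so check the tool's --help on your version for
the exact options:

hadoop jar solr-map-reduce-*.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphline.conf \
  --output-dir hdfs://namenode:8020/tmp/outdir \
  --zk-host zk1:2181,zk2:2181,zk3:2181/solr \
  --collection collection1 \
  --go-live \
  hdfs://namenode:8020/indir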

Best,
Erick



On Thu, Nov 19, 2015 at 11:45 AM, KNitin  wrote:
> Great. Thanks!
>
> On Thu, Nov 19, 2015 at 11:24 AM, Sameer Maggon 
> wrote:
>
>> If you are trying to create a large index and want speedups there, you
>> could use the MapReduceTool -
>> https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr. At a
>> high level, it takes your files (csv, json, etc) as input can create either
>> a single or a sharded index that you can either copy it to your Solr
>> Servers. I've used this to create indexes that include hundreds of millions
>> of documents in fairly decent amount of time.
>>
>> Thanks,
>> --
>> *Sameer Maggon*
>> Measured Search
>> www.measuredsearch.com 
>>
>> On Thu, Nov 19, 2015 at 11:17 AM, KNitin  wrote:
>>
>> > Hi,
>> >
>> >  I was wondering if there are existing tools that will generate solr
>> index
>> > offline (in solrcloud mode)  that can be later on loaded into solrcloud,
>> > before I decide to implement my own. I found some tools that do only solr
>> > based index loading (non-zk mode). Is there one with zk mode enabled?
>> >
>> >
>> > Thanks in advance!
>> > Nitin
>> >
>>


Re: replica recovery

2015-11-19 Thread Erick Erickson
bq: I would still like to increase the number of transaction logs
retained so that shard recovery (outside of long term failures) is
faster than replicating the entire shard from the leader

That's legitimate, but (you knew that was coming!) nodes having to
recover _should_ be a rare event. Is this happening often or is it a
result of testing? If nodes are going into recovery for no good reason
(i.e. network being unplugged, whatever) I'd put some energy into
understanding that as well. Perhaps there are operational type things
that should be addressed (e.g. stop indexing, wait for commit, _then_
bounce Solr instances).


Best,
Erick



On Thu, Nov 19, 2015 at 10:17 AM, Brian Scholl  wrote:
> Hey Erick,
>
> Thanks for the reply.
>
> I plan on rebuilding my cluster soon with more nodes so that the index size 
> (including tlogs) is under 50% of the available disk at a minimum, ideally we 
> will shoot for under 33% budget permitting.  I think I now understand the 
> problem that managing this resource will solve and I appreciate your (and 
> Shawn's) feedback.
>
> I would still like to increase the number of transaction logs retained so 
> that shard recovery (outside of long term failures) is faster than 
> replicating the entire shard from the leader.  I understand that this is an 
> optimization and not a
> solution for replication.  If I'm being thick about this call me out :)
>
> Cheers,
> Brian
>
>
>
>
>> On Nov 19, 2015, at 11:30, Erick Erickson  wrote:
>>
>> First, every time you autocommit there _should_ be a new
>> tlog created. A hard commit truncates the tlog by design.
>>
>> My guess (not based on knowing the code) is that
>> Real Time Get needs file handle open to the tlog files
>> and you'll have a bunch of them. Lots and lots and lots. Thus
>> the too many file handles is just waiting out there for you.
>>
>> However, this entire approach is, IMO, not going to solve
>> anything for you. Or rather other problems will come out
>> of the woodwork.
>>
>> To whit: At some point, you _will_ need to have at least as
>> much free space on your disk as your current index occupies,
>> even without recovery. Background merging of segments can
>> effectively do the same thing as an optimize step, which rewrites
>> the entire index to new segments before deleting the old
>> segments. So far you haven't hit that situation in steady-state,
>> but you will.
>>
>> Simply put, I think you're wasting your time pursuing the tlog
>> option. You must have bigger disks or smaller indexes such
>> that there is at least as much free disk space at all times as
>> your index occupies. In fact if the tlogs are on the same
>> drive as your index, the tlog option you're pursuing is making
>> the situation _worse_ by making running out of disk space
>> during a merge even more likely.
>>
>> So unless there's a compelling reason you can't use bigger
>> disks, IMO you'll waste lots and lots of valuable
>> engineering time before... buying bigger disks.
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 19, 2015 at 6:21 AM, Brian Scholl  wrote:
>>> I have opted to modify the number and size of transaction logs that I keep 
>>> to resolve the original issue I described.  In so doing I think I have 
>>> created a new problem, feedback is appreciated.
>>>
>>> Here are the new updateLog settings:
>>>
>>>
>>>  ${solr.ulog.dir:}
>>>  >> name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}
>>>  1000
>>>  5760
>>>
>>>
>>> First I want to make sure I understand what these settings do:
>>>numRecordsToKeep: per transaction log file keep this number of 
>>> documents
>>>maxNumLogsToKeep: retain this number of transaction log files total
>>>
>>> During my testing I thought I observed that a new tlog is created every 
>>> time auto-commit is triggered (every 15 seconds in my case) so I set 
>>> maxNumLogsToKeep high enough to contain an entire days worth of updates.   
>>> Knowing that I could potentially need to bulk load some data I set 
>>> numRecordsToKeep higher than my max throughput per replica for 15 seconds.
>>>
>>> The problem that I think this has created is I am now running out of file 
>>> descriptors on the servers.  After indexing new documents for a couple 
>>> hours a some servers (not all) will start logging this error rapidly:
>>>
>>> 73021439 WARN  
>>> (qtp1476011703-18-acceptor-0@6d5514d9-ServerConnector@6392e703{HTTP/1.1}{0.0.0.0:8983})
>>>  [   ] o.e.j.s.ServerConnector
>>> java.io.IOException: Too many open files
>>>at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>>>at 
>>> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>>>at 
>>> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>>>at 
>>> org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
>>>at 
>>> org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500)
>>>

Re: replica recovery

2015-11-19 Thread Brian Scholl
Primarily our outages are caused by Java crashes or really long GC pauses; in 
short, not all of our developers have a good sense of which types of queries are 
unsafe if abused (for example, cursorMark or start=).  

Honestly, stability of the JVM is another task I have coming up.  I agree that 
recovery should be uncommon, we're just not where we need to be yet.

Cheers,
Brian




> On Nov 19, 2015, at 15:14, Erick Erickson  wrote:
> 
> bq: I would still like to increase the number of transaction logs
> retained so that shard recovery (outside of long term failures) is
> faster than replicating the entire shard from the leader
> 
> That's legitimate, but (you knew that was coming!) nodes having to
> recover _should_ be a rare event. Is this happening often or is it a
> result of testing? If nodes are going into recovery for no good reason
> (i.e. network being unplugged, whatever) I'd put some energy into
> understanding that as well. Perhaps there are operational type things
> that should be addressed (e.g. stop indexing, wait for commit, _then_
> bounce Solr instances).
> 
> 
> Best,
> Erick
> 
> 
> 
> On Thu, Nov 19, 2015 at 10:17 AM, Brian Scholl  wrote:
>> Hey Erick,
>> 
>> Thanks for the reply.
>> 
>> I plan on rebuilding my cluster soon with more nodes so that the index size 
>> (including tlogs) is under 50% of the available disk at a minimum, ideally 
>> we will shoot for under 33% budget permitting.  I think I now understand the 
>> problem that managing this resource will solve and I appreciate your (and 
>> Shawn's) feedback.
>> 
>> I would still like to increase the number of transaction logs retained so 
>> that shard recovery (outside of long term failures) is faster than 
>> replicating the entire shard from the leader.  I understand that this is an 
>> optimization and not a
>> solution for replication.  If I'm being thick about this call me out :)
>> 
>> Cheers,
>> Brian
>> 
>> 
>> 
>> 
>>> On Nov 19, 2015, at 11:30, Erick Erickson  wrote:
>>> 
>>> First, every time you autocommit there _should_ be a new
>>> tlog created. A hard commit truncates the tlog by design.
>>> 
>>> My guess (not based on knowing the code) is that
>>> Real Time Get needs file handle open to the tlog files
>>> and you'll have a bunch of them. Lots and lots and lots. Thus
>>> the too many file handles is just waiting out there for you.
>>> 
>>> However, this entire approach is, IMO, not going to solve
>>> anything for you. Or rather other problems will come out
>>> of the woodwork.
>>> 
>>> To whit: At some point, you _will_ need to have at least as
>>> much free space on your disk as your current index occupies,
>>> even without recovery. Background merging of segments can
>>> effectively do the same thing as an optimize step, which rewrites
>>> the entire index to new segments before deleting the old
>>> segments. So far you haven't hit that situation in steady-state,
>>> but you will.
>>> 
>>> Simply put, I think you're wasting your time pursuing the tlog
>>> option. You must have bigger disks or smaller indexes such
>>> that there is at least as much free disk space at all times as
>>> your index occupies. In fact if the tlogs are on the same
>>> drive as your index, the tlog option you're pursuing is making
>>> the situation _worse_ by making running out of disk space
>>> during a merge even more likely.
>>> 
>>> So unless there's a compelling reason you can't use bigger
>>> disks, IMO you'll waste lots and lots of valuable
>>> engineering time before... buying bigger disks.
>>> 
>>> Best,
>>> Erick
>>> 
>>> On Thu, Nov 19, 2015 at 6:21 AM, Brian Scholl  wrote:
 I have opted to modify the number and size of transaction logs that I keep 
 to resolve the original issue I described.  In so doing I think I have 
 created a new problem, feedback is appreciated.
 
 Here are the new updateLog settings:
 
   
 ${solr.ulog.dir:}
 >>> name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}
 1000
 5760
   
 
 First I want to make sure I understand what these settings do:
   numRecordsToKeep: per transaction log file keep this number of 
 documents
   maxNumLogsToKeep: retain this number of transaction log files total
 
 During my testing I thought I observed that a new tlog is created every 
 time auto-commit is triggered (every 15 seconds in my case) so I set 
 maxNumLogsToKeep high enough to contain an entire days worth of updates.   
 Knowing that I could potentially need to bulk load some data I set 
 numRecordsToKeep higher than my max throughput per replica for 15 seconds.
 
 The problem that I think this has created is I am now running out of file 
 descriptors on the servers.  After indexing new documents for a couple 
 hours a some servers (not all) will start logging this error rapidly:
 
 73021439 WARN  
 (qtp1476011703-18-acceptor-0@6d5514d9-ServerConnect

RE: Boost non stemmed keywords (KStem filter)

2015-11-19 Thread Markus Jelsma
Hello Jan - i have no code i can show but we are using it to power our search 
servers. You are correct, you need to deal with payloads at query time as well. 
This means you need a custom similarity but also customize your query parser to 
rewrite queries to payload supported types. This is also not very hard, some 
ancient examples can still be found on the web. But you also need to copy over 
existing TokenFilters to emit payloads whenever you want. Overriding 
TokenFilters is usually impossible due to crazy private members (i still cannot 
figure out why so many parts are private..)

It can be very powerful, especially if you do not use payloads to contain just 
a score. But instead to carry a WORD_TYPE, such as stemmed, unstemmed but also 
stopwords, acronyms, compound and subwords, headings or normal text but also 
NER types (which we don't have yet). For this to work you just need to treat 
the payload as a bitset for different types so you can have really tuneable 
scoring at query time via your similarity. Unfortunately, payloads can only 
carry a relatively small number of bits :)
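
To make that a bit more concrete, here is a minimal, hypothetical sketch (not
our production code) of a similarity that reads one payload byte as a word-type
bitset and down-weights stemmed tokens. The bit value is an assumption, and it
only takes effect for payload-aware query types:

import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.util.BytesRef;

public class WordTypePayloadSimilarity extends DefaultSimilarity {
  private static final int STEMMED_BIT = 0x01;  // assumed bit set by the indexing filter

  @Override
  public float scorePayload(int doc, int start, int end, BytesRef payload) {
    if (payload == null || payload.length == 0) {
      return 1.0f;                              // no payload: neutral multiplier
    }
    int flags = payload.bytes[payload.offset] & 0xFF;
    // stemmed variants contribute half the score of exact (unstemmed) matches
    return (flags & STEMMED_BIT) != 0 ? 0.5f : 1.0f;
  }
}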

M.

-Original message-
> From:Jan Høydahl 
> Sent: Thursday 19th November 2015 14:30
> To: solr-user@lucene.apache.org
> Subject: Re: Boost non stemmed keywords (KStem filter)
> 
> Do you have a concept code for this? Don’t you also have to hack your query 
> parser, e.g. dismax, to use other Query objects supporting payloads?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
> > 18. nov. 2015 kl. 22.24 skrev Markus Jelsma :
> > 
> > Hi - easiest approach is to use KeywordRepeatFilter and 
> > RemoveDuplicatesTokenFilter. This creates a slightly higher IDF for 
> > unstemmed words which might be just enough in your case. We found it not to 
> > be enough, so we also attach payloads to signify stemmed words amongst 
> > others. This allows you to decrease score for stemmed words at query time 
> > via your similarity impl.
> > 
> > M.
> > 
> > 
> > 
> > -Original message-
> >> From:bbarani 
> >> Sent: Wednesday 18th November 2015 22:07
> >> To: solr-user@lucene.apache.org
> >> Subject: Boost non stemmed keywords (KStem filter)
> >> 
> >> Hi,
> >> 
> >> I am using KStem factory for stemming. This stemmer converts 'france to
> >> french', 'chinese to china' etc.. I am good with this stemming but I am
> >> trying to boost the results that contain the original term compared to the
> >> stemmed terms. Is this possible?
> >> 
> >> Thanks,
> >> Learner
> >> 
> >> 
> >> 
> >> 
> >> --
> >> View this message in context: 
> >> http://lucene.472066.n3.nabble.com/Boost-non-stemmed-keywords-KStem-filter-tp4240880.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >> 
> 
> 


RE: Re:Re: Implementing security.json is breaking ADDREPLICA

2015-11-19 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
I tried again with the following security.json, but the results were the same:

{
  "authentication":{
"class":"solr.BasicAuthPlugin",
"credentials":{
  "solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= 
Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c=",
  "solruser":"VgZX1TAMNHT2IJikoGdKtxQdXc+MbNwfqzf89YqcLEE= 
37pPWQ9v4gciIKHuTmFmN0Rv66rnlMOFEWfEy9qjJfY="},
"":{"v":9}},
  "authorization":{
"class":"solr.RuleBasedAuthorizationPlugin",
"user-role":{
  "solr":[
"admin",
"read",
"xmpladmin",
"xmplgen",
"xmplsel"],
  "solruser":[
"read",
"xmplgen",
"xmplsel"]},
"permissions":[
  {
"name":"security-edit",
"role":"admin"},
  {
"name":"xmpl_admin",
"collection":"xmpl",
"path":"/admin/*",
"role":"xmpladmin"},
  {
"name":"xmpl_sel",
"collection":"xmpl",
"path":"/select/*",
"role":null},
  {
 "name":"all-admin",
 "collection":null,
 "path":"/*",
 "role":"xmplgen"},
  {
 "name":"all-core-handlers",
 "path":"/*",
 "role":"xmplgen"}],
"":{"v":42}}}

-Original Message-
From: Oakley, Craig (NIH/NLM/NCBI) [C] 
Sent: Thursday, November 19, 2015 1:46 PM
To: 'solr-user@lucene.apache.org' 
Subject: RE: Re:Re: Implementing security.json is breaking ADDREPLICA

I note that the thread called "Security Problems" (most recent post by Nobel 
Paul) seems like it may help with much of what I'm trying to do. I will see to 
what extent that may help.


Large multivalued field and overseer problem

2015-11-19 Thread Olivier
Hi,

We have a Solrcloud cluster with 3 nodes (4 processors, 24 Gb RAM per node).
We have 3 shards per node and the replication factor is 3. We host 3
collections; the biggest is only about 40K documents.
The most important thing is a multivalued field with about 200K to 300K
values per document (each value is a kind of reference product of type
String).
We have some very big issues with our SolrCloud cluster. It crashes
entirely very frequently at indexing time. It starts with an overseer
issue:

Overseer session expired: KeeperErrorCode = Session expired for
/overseer_elect/leader

Then another node is elected overseer. But the recovery phase seems to
fail indefinitely. It seems that communication between the overseer
and ZK is impossible.
And after a short period of time, the whole cluster becomes unavailable (JVM
out-of-memory error) and we have to restart it.

So I wanted to know if we can continue to use such a huge multivalued field
with SolrCloud.
We are on Solr 4.10.4 for now; do you think that upgrading to Solr 5,
with an overseer per collection, would fix our issues?
Or do we have to rethink the schema to avoid this very large multivalued
field ?

Thanks,
Best,

Olivier


Re: Re:Re: Implementing security.json is breaking ADDREPLICA

2015-11-19 Thread Anshum Gupta
I'll try out what you did later in the day, as soon as I get time, but why
exactly are you creating cores manually? It seems like you manually create a
core and then try to add a replica. Can you try using the Collections API to
create a collection?

Starting Solr 5.0, the only supported way to create a new collection is via
the Collections API. Creating a core would lead to a collection creation
but that's not really supported. It was just something that was done when
there was no Collections API.


On Thu, Nov 19, 2015 at 12:36 PM, Oakley, Craig (NIH/NLM/NCBI) [C] <
craig.oak...@nih.gov> wrote:

> I tried again with the following security.json, but the results were the
> same:
>
> {
>   "authentication":{
> "class":"solr.BasicAuthPlugin",
> "credentials":{
>   "solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
> Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c=",
>   "solruser":"VgZX1TAMNHT2IJikoGdKtxQdXc+MbNwfqzf89YqcLEE=
> 37pPWQ9v4gciIKHuTmFmN0Rv66rnlMOFEWfEy9qjJfY="},
> "":{"v":9}},
>   "authorization":{
> "class":"solr.RuleBasedAuthorizationPlugin",
> "user-role":{
>   "solr":[
> "admin",
> "read",
> "xmpladmin",
> "xmplgen",
> "xmplsel"],
>   "solruser":[
> "read",
> "xmplgen",
> "xmplsel"]},
> "permissions":[
>   {
> "name":"security-edit",
> "role":"admin"},
>   {
> "name":"xmpl_admin",
> "collection":"xmpl",
> "path":"/admin/*",
> "role":"xmpladmin"},
>   {
> "name":"xmpl_sel",
> "collection":"xmpl",
> "path":"/select/*",
> "role":null},
>   {
>  "name":"all-admin",
>  "collection":null,
>  "path":"/*",
>  "role":"xmplgen"},
>   {
>  "name":"all-core-handlers",
>  "path":"/*",
>  "role":"xmplgen"}],
> "":{"v":42}}}
>
> -Original Message-
> From: Oakley, Craig (NIH/NLM/NCBI) [C]
> Sent: Thursday, November 19, 2015 1:46 PM
> To: 'solr-user@lucene.apache.org' 
> Subject: RE: Re:Re: Implementing security.json is breaking ADDREPLICA
>
> I note that the thread called "Security Problems" (most recent post by
> Nobel Paul) seems like it may help with much of what I'm trying to do. I
> will see to what extent that may help.
>



-- 
Anshum Gupta


Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread KNitin
Thanks, Erick.  Looks like MRIT uses embedded Solr running per
mapper/reducer and uses that to index documents. Is that the recommended
model? Can we use raw Lucene libraries to generate the indexes and then load them
into SolrCloud (barring the complexities of indexing into the right shard and
merging)?

I am thinking of using this for regular offline indexing, which needs to be
idempotent.  When you say update, do you mean partial updates using _set?
If we add and delete every time for a document, that should work, right
(since all docs are indexed by a doc id which contains all operational
history)? Let me know if I am missing something.

On Thu, Nov 19, 2015 at 12:09 PM, Erick Erickson 
wrote:

> Note two things:
>
> 1> this is running on Hadoop
> 2> it is part of the standard Solr release as MapReduceIndexerTool,
> look in the contribs...
>
> If you're trying to do this yourself, you must be very careful to index
> docs
> to the correct shard then merge the correct shards. MRIT does this all
> automatically.
>
> Additionally, it has the cool feature that if (and only if) your Solr
> index is running over
> HDFS, the --go-live option will automatically merge the indexes into
> the appropriate
> running Solr instances.
>
> One caveat. This tool doesn't handle _updating_ documents. So if you
> run it twice
> on the same data set, you'll have two copies of every doc. It's
> designed as a bulk
> initial-load tool.
>
> Best,
> Erick
>
>
>
> On Thu, Nov 19, 2015 at 11:45 AM, KNitin  wrote:
> > Great. Thanks!
> >
> > On Thu, Nov 19, 2015 at 11:24 AM, Sameer Maggon <
> sam...@measuredsearch.com>
> > wrote:
> >
> >> If you are trying to create a large index and want speedups there, you
> >> could use the MapReduceTool -
> >> https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr. At
> a
> >> high level, it takes your files (csv, json, etc) as input can create
> either
> >> a single or a sharded index that you can either copy it to your Solr
> >> Servers. I've used this to create indexes that include hundreds of
> millions
> >> of documents in fairly decent amount of time.
> >>
> >> Thanks,
> >> --
> >> *Sameer Maggon*
> >> Measured Search
> >> www.measuredsearch.com 
> >>
> >> On Thu, Nov 19, 2015 at 11:17 AM, KNitin  wrote:
> >>
> >> > Hi,
> >> >
> >> >  I was wondering if there are existing tools that will generate solr
> >> index
> >> > offline (in solrcloud mode)  that can be later on loaded into
> solrcloud,
> >> > before I decide to implement my own. I found some tools that do only
> solr
> >> > based index loading (non-zk mode). Is there one with zk mode enabled?
> >> >
> >> >
> >> > Thanks in advance!
> >> > Nitin
> >> >
> >>
>


Re: Error in log after upgrading Solr

2015-11-19 Thread Chris Hostetter

: on sample_techproducts_configs.  Because removing newSearcher and
: firstSearcher fixed the problem for me, the next step was to configure
: similar newSearcher and firstSearcher queries to what I used to have in
: my config, and try indexing docs.
: 
: I did this, and the problem did not reproduce.

when you indexed docs into this test config, did you use waitSearcher=true 
like in your original logs?

I think that + the newSearcher QuerySenderListener is the key to triggering the 
error logging.


-Hoss
http://www.lucidworks.com/


Re: Large multivalued field and overseer problem

2015-11-19 Thread Anshum Gupta
Hi Olivier,

A few things that you should know:
1. The Overseer is at a per cluster level and not at a per-collection level.
2. Also, documents/fields/etc. should have zero impact on the Overseer
itself.

So, while the upgrade to a more recent Solr version comes with a lot of
good stuff, the cluster state and the Overseer are not what you should be
looking at. Failing recovery also has nothing to do with the Overseer.

Now, a few details about the problem might help people here to help you better.

Can you tell us something about your ZooKeeper? Version, number of nodes?

Also, is the network between the Solr nodes and zk fine ?

You mention that you're seeing this issue while indexing. How are you
indexing (CloudSolrClient ? ) and what are your indexing settings
(auto-commit etc.).

Most importantly, what is the heap size of the Solr processes?


On Thu, Nov 19, 2015 at 12:43 PM, Olivier  wrote:

> Hi,
>
> We have a Solrcloud cluster with 3 nodes (4 processors, 24 Gb RAM per
> node).
> We have 3 shards per node and the replication factor is 3. We host 3
> collections, the biggest is about 40K documents only.
> The most important thing is a multivalued field with about 200K to 300K
> values per document (each value is a kind of reference product of type
> String).
> We have some very big issues with our SolrCloud cluster. It crashes
> entirely very frequently at the indexation time. It starts with an overseer
> issue :
>
> Session expired de l’overseer : KeeperErrorCode = Session expired for
> /overseer_elect/leader
>
> Then an another node is elected overseer. But the recovery phase seems to
> failed indefinitely. It seems that the communication between the overseer
> and ZK is impossible.
> And after a short period of time, all the cluster is unavailable (out of
> memory JVM error). And we have to restart it.
>
> So I wanted to know if we can continue to use huge multivalued field with
> SolrCloud.
> We are on Solr 4.10.4 for now, do you think that if we upgrade to Solr 5,
> with an overseer per collection it can fix our issues ?
> Or do we have to rethink the schema to avoid this very large multivalued
> field ?
>
> Thanks,
> Best,
>
> Olivier
>



-- 
Anshum Gupta


Re: Boost non stemmed keywords (KStem filter)

2015-11-19 Thread Ahmet Arslan
Hi,

I wonder about using two fields (text_stem and text_no_stem) and applying query 
time boost
text_stem^0.3 text_no_stem^0.6

What is the advantage of keyword repeat/paylad approach compared with this one?

Ahmet


On Thursday, November 19, 2015 10:24 PM, Markus Jelsma 
 wrote:
Hello Jan - i have no code i can show but we are using it to power our search 
servers. You are correct, you need to deal with payloads at query time as well. 
This means you need a custom similarity but also customize your query parser to 
rewrite queries to payload supported types. This is also not very hard, some 
ancient examples can still be found on the web. But you also need to copy over 
existing TokenFilters to emit payloads whenever you want. Overriding 
TokenFilters is usually impossible due to crazy private members (i still cannot 
figure out why so many parts are private..)

It can be very powerful, especially if you do not use payloads to contain just 
a score. But instead to carry a WORD_TYPE, such as stemmed, unstemmed but also 
stopwords, acronyms, compound and subwords, headings or normal text but also 
NER types (which we don't have yet). For this to work you just need to treat 
the payload as a bitset for different types so you can have really tuneable 
scoring at query time via your similarity. Unfortunately, payloads can only 
carry a relative small amount of bits :)

M.

-Original message-
> From:Jan Høydahl 
> Sent: Thursday 19th November 2015 14:30
> To: solr-user@lucene.apache.org
> Subject: Re: Boost non stemmed keywords (KStem filter)
> 
> Do you have a concept code for this? Don’t you also have to hack your query 
> parser, e.g. dismax, to use other Query objects supporting payloads?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
> > 18. nov. 2015 kl. 22.24 skrev Markus Jelsma :
> > 
> > Hi - easiest approach is to use KeywordRepeatFilter and 
> > RemoveDuplicatesTokenFilter. This creates a slightly higher IDF for 
> > unstemmed words which might be just enough in your case. We found it not to 
> > be enough, so we also attach payloads to signify stemmed words amongst 
> > others. This allows you to decrease score for stemmed words at query time 
> > via your similarity impl.
> > 
> > M.
> > 
> > 
> > 
> > -Original message-
> >> From:bbarani 
> >> Sent: Wednesday 18th November 2015 22:07
> >> To: solr-user@lucene.apache.org
> >> Subject: Boost non stemmed keywords (KStem filter)
> >> 
> >> Hi,
> >> 
> >> I am using KStem factory for stemming. This stemmer converts 'france to
> >> french', 'chinese to china' etc.. I am good with this stemming but I am
> >> trying to boost the results that contain the original term compared to the
> >> stemmed terms. Is this possible?
> >> 
> >> Thanks,
> >> Learner
> >> 
> >> 
> >> 
> >> 
> >> --
> >> View this message in context: 
> >> http://lucene.472066.n3.nabble.com/Boost-non-stemmed-keywords-KStem-filter-tp4240880.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >> 
> 
>


Re: RealTimeGetHandler doesn't retrieve documents

2015-11-19 Thread Jack Krupansky
Do the failing IDs have any special characters that might need to be
escaped?

Can you find the documents using a normal query on the unique key field?

-- Jack Krupansky

On Thu, Nov 19, 2015 at 10:27 AM, Jérémie MONSINJON <
jeremie.monsin...@gmail.com> wrote:

> Hello everyone !
>
> I'm using SolR 5.3.1 with solrj.SolrClient.
> My index is sliced in 3 shards, each on different server. (No replica on
> dev platform)
> It has been up to date for a few days...
>
> I'm trying to use the RealTimeGetHandler to get documents by their Id.
> In our usecase, documents are updated very frequently,  so  we have to
> look in the tlog before searching the index.
>
> When I use the SolrClient.getById() (with a list of document Ids recently
> extracted from the index)
>
> SolR doesn't return *all* the documents corresponding to these Ids.
> So I tried to use the Solr api directly:
>
> http://server:port/solr/index/get/ids=id1,id2,id3
> And this is the same. Some ids don't works.
>
> In my example, id1 doesn't return a document, id2 and id3 are OK.
>
> If I try a filtered query with the id1, it works fine, the document exists
> in the index and is found by SolR
>
> Can anybody explain why a document, present in the index, with no
> uncommited update or delete, is not found by the Real Time Get Handler ?
>
> Regards,
> Jeremie
>
>


Re: Error in log after upgrading Solr

2015-11-19 Thread Shawn Heisey
On 11/19/2015 2:10 PM, Chris Hostetter wrote:
> when you indexed docs into this test config, did you use waitSearcher=true 
> like in your original logs?
>
> I think that + newSearcher QuerySendListener is the key to triggering the 
> error logging.

I don't recall ever configuring anything with an explicit
waitSearcher=true.  I think that's in the log because that's the default
value for that parameter.  Have I overlooked something?

I just went to the Documents tab of the admin UI and fixed up the sample
document it has there to have the right fields.

Thanks,
Shawn



Re: Error in log after upgrading Solr

2015-11-19 Thread Shawn Heisey
On 11/19/2015 3:02 PM, Shawn Heisey wrote:
> On 11/19/2015 2:10 PM, Chris Hostetter wrote:
>> when you indexed docs into this test config, did you use waitSearcher=true 
>> like in your original logs?
>>
>> I think that + newSearcher QuerySendListener is the key to triggering the 
>> error logging.
> I don't recall ever configuring anything with an explicit
> waitSearcher=true.  I think that's in the log because that's the default
> value for that parameter.  Have I overlooked something?

After I sent this, I realized that there is one place where that
parameter is explicitly set to true.  The SolrJ code sends an explicit
soft commit after each update cycle.  I'm using the following code with
HttpSolrClient:

  UpdateResponse ur = _client.commit(_name, true, true, true);
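  // i.e. SolrClient.commit(collection, waitFlush, waitSearcher, softCommit)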

The only method that does a soft commit also requires explicitly setting
the waitSearcher parameter, and I definitely do want my code to wait for
that searcher.

When I find some time to work on the problem, I'll do the update with
SolrJ and try different kinds of commits.

Thanks,
Shawn



Re: Large multivalued field and overseer problem

2015-11-19 Thread Erick Erickson
In addition to Anshum's excellent points:

bq: And after a short period of time, all the cluster is unavailable (out of
memory JVM error).

This is where I'd focus my efforts. I suspect you're memory-bound and are
actually seeing OOM errors about the time this problem manifests itself. Or
you're getting long GC pauses that make Zookeeper think the Solr instance is
gone.

I'd turn on GC logging and analyze that as a first step.
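
For example, with the standard HotSpot flags for the Java 7/8 JVMs typically
used with Solr 4.x (adjust the log path to your installation):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
-XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/solr/gc.log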

Best,
Erick

On Thu, Nov 19, 2015 at 1:19 PM, Anshum Gupta  wrote:
> Hi Olivier,
>
> A few things that you should know:
> 1. The Overseer is at a per cluster level and not at a per-collection level.
> 2. Also, documents/fields/etc. should have zero impact on the Overseer
> itself.
>
> So, while the upgrade to a more recent Solr version comes with a lot of
> good stuff, the cluster state or the Overseer are not what you should be
> looking at. Also, failing recovery also has nothing to do with the Overseer.
>
> Now, the problem that might help people here to help you better.
>
> Can you tell something about your zookeeper ? version, #nodes ?
>
> Also, is the network between the Solr nodes and zk fine ?
>
> You mention that you're seeing this issue while indexing. How are you
> indexing (CloudSolrClient ? ) and what are your indexing settings
> (auto-commit etc.).
>
> Most importantly, what is the heap size of the Solr processes?
>
>
> On Thu, Nov 19, 2015 at 12:43 PM, Olivier  wrote:
>
>> Hi,
>>
>> We have a Solrcloud cluster with 3 nodes (4 processors, 24 Gb RAM per
>> node).
>> We have 3 shards per node and the replication factor is 3. We host 3
>> collections, the biggest is about 40K documents only.
>> The most important thing is a multivalued field with about 200K to 300K
>> values per document (each value is a kind of reference product of type
>> String).
>> We have some very big issues with our SolrCloud cluster. It crashes
>> entirely very frequently at the indexation time. It starts with an overseer
>> issue :
>>
>> Session expired de l’overseer : KeeperErrorCode = Session expired for
>> /overseer_elect/leader
>>
>> Then an another node is elected overseer. But the recovery phase seems to
>> failed indefinitely. It seems that the communication between the overseer
>> and ZK is impossible.
>> And after a short period of time, all the cluster is unavailable (out of
>> memory JVM error). And we have to restart it.
>>
>> So I wanted to know if we can continue to use huge multivalued field with
>> SolrCloud.
>> We are on Solr 4.10.4 for now, do you think that if we upgrade to Solr 5,
>> with an overseer per collection it can fix our issues ?
>> Or do we have to rethink the schema to avoid this very large multivalued
>> field ?
>>
>> Thanks,
>> Best,
>>
>> Olivier
>>
>
>
>
> --
> Anshum Gupta


Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread Erick Erickson
Sure, you can use Lucene to create indexes for shards
if (and only if) you deal with the routing issues
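
A bare-bones sketch of that approach (class name and paths are illustrative;
the analysis and field types must match the target collection's schema, and
routing each document to the right shard is still your job):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class OfflineShardBuilder {
  public static void main(String[] args) throws Exception {
    // write a plain Lucene index into what will become one shard's data/index dir
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/tmp/shard1/data/index")),
        new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      doc.add(new StringField("id", "doc-1", Field.Store.YES));
      doc.add(new TextField("text", "hello offline indexing", Field.Store.NO));
      writer.addDocument(doc);
      writer.commit();
    }
  }
}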

About updates: I'm not talking about atomic updates at all.
The usual model for Solr is that if you have a unique key
defined, new versions of documents replace old versions
based on the uniqueKey. That process is simply
not guaranteed by MRIT.

Best,
Erick

On Thu, Nov 19, 2015 at 12:56 PM, KNitin  wrote:
> Thanks, Eric.  Looks like  MRIT uses Embedded solr running per
> mapper/reducer and uses that to index documents. Is that the recommended
> model? Can we use raw lucene libraries to generate index and then load them
> into solrcloud? (Barring the complexities for indexing into right shard and
> merging them).
>
> I am thinking of using this for regular offline indexing which needs to be
> idempotent.  When you mean update do you mean partial updates using _set?
> If we add and delete every time for a document that should work, right?
> (since all docs are indexed by doc id which contains all operational
> history)? Let me know if I am missing something.
>
> On Thu, Nov 19, 2015 at 12:09 PM, Erick Erickson 
> wrote:
>
>> Note two things:
>>
>> 1> this is running on Hadoop
>> 2> it is part of the standard Solr release as MapReduceIndexerTool,
>> look in the contribs...
>>
>> If you're trying to do this yourself, you must be very careful to index
>> docs
>> to the correct shard then merge the correct shards. MRIT does this all
>> automatically.
>>
>> Additionally, it has the cool feature that if (and only if) your Solr
>> index is running over
>> HDFS, the --go-live option will automatically merge the indexes into
>> the appropriate
>> running Solr instances.
>>
>> One caveat. This tool doesn't handle _updating_ documents. So if you
>> run it twice
>> on the same data set, you'll have two copies of every doc. It's
>> designed as a bulk
>> initial-load tool.
>>
>> Best,
>> Erick
>>
>>
>>
>> On Thu, Nov 19, 2015 at 11:45 AM, KNitin  wrote:
>> > Great. Thanks!
>> >
>> > On Thu, Nov 19, 2015 at 11:24 AM, Sameer Maggon <
>> sam...@measuredsearch.com>
>> > wrote:
>> >
>> >> If you are trying to create a large index and want speedups there, you
>> >> could use the MapReduceTool -
>> >> https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr. At
>> a
>> >> high level, it takes your files (csv, json, etc) as input can create
>> either
>> >> a single or a sharded index that you can either copy it to your Solr
>> >> Servers. I've used this to create indexes that include hundreds of
>> millions
>> >> of documents in fairly decent amount of time.
>> >>
>> >> Thanks,
>> >> --
>> >> *Sameer Maggon*
>> >> Measured Search
>> >> www.measuredsearch.com 
>> >>
>> >> On Thu, Nov 19, 2015 at 11:17 AM, KNitin  wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> >  I was wondering if there are existing tools that will generate solr
>> >> index
>> >> > offline (in solrcloud mode)  that can be later on loaded into
>> solrcloud,
>> >> > before I decide to implement my own. I found some tools that do only
>> solr
>> >> > based index loading (non-zk mode). Is there one with zk mode enabled?
>> >> >
>> >> >
>> >> > Thanks in advance!
>> >> > Nitin
>> >> >
>> >>
>>


Re: replica recovery

2015-11-19 Thread Erick Erickson
Right, I've managed to double the memory required by Solr
by varying the _query_. Siiigh.

There are some JIRAs out there (don't have them readily available, sorry)
that short-circuit queries that take "too long", and there are some others
to short-circuit "expensive" queries. I believe the latter has been
talked about but not committed.

Best,
Erick

On Thu, Nov 19, 2015 at 12:21 PM, Brian Scholl  wrote:
> Primarily our outages are caused by Java crashes or really long GC pauses, in 
> short not all of our developers have a good sense of what types of queries 
> are unsafe if abused (for example, cursorMark or start=).
>
> Honestly, stability of the JVM is another task I have coming up.  I agree 
> that recovery should be uncommon, we're just not where we need to be yet.
>
> Cheers,
> Brian
>
>
>
>
>> On Nov 19, 2015, at 15:14, Erick Erickson  wrote:
>>
>> bq: I would still like to increase the number of transaction logs
>> retained so that shard recovery (outside of long term failures) is
>> faster than replicating the entire shard from the leader
>>
>> That's legitimate, but (you knew that was coming!) nodes having to
>> recover _should_ be a rare event. Is this happening often or is it a
>> result of testing? If nodes are going into recovery for no good reason
>> (i.e. network being unplugged, whatever) I'd put some energy into
>> understanding that as well. Perhaps there are operational type things
>> that should be addressed (e.g. stop indexing, wait for commit, _then_
>> bounce Solr instances).
>>
>>
>> Best,
>> Erick
>>
>>
>>
>> On Thu, Nov 19, 2015 at 10:17 AM, Brian Scholl  wrote:
>>> Hey Erick,
>>>
>>> Thanks for the reply.
>>>
>>> I plan on rebuilding my cluster soon with more nodes so that the index size 
>>> (including tlogs) is under 50% of the available disk at a minimum, ideally 
>>> we will shoot for under 33% budget permitting.  I think I now understand 
>>> the problem that managing this resource will solve and I appreciate your 
>>> (and Shawn's) feedback.
>>>
>>> I would still like to increase the number of transaction logs retained so 
>>> that shard recovery (outside of long term failures) is faster than 
>>> replicating the entire shard from the leader.  I understand that this is an 
>>> optimization and not a
>>> solution for replication.  If I'm being thick about this call me out :)
>>>
>>> Cheers,
>>> Brian
>>>
>>>
>>>
>>>
 On Nov 19, 2015, at 11:30, Erick Erickson  wrote:

 First, every time you autocommit there _should_ be a new
 tlog created. A hard commit truncates the tlog by design.

 My guess (not based on knowing the code) is that
 Real Time Get needs file handle open to the tlog files
 and you'll have a bunch of them. Lots and lots and lots. Thus
 the too many file handles is just waiting out there for you.

 However, this entire approach is, IMO, not going to solve
 anything for you. Or rather other problems will come out
 of the woodwork.

 To whit: At some point, you _will_ need to have at least as
 much free space on your disk as your current index occupies,
 even without recovery. Background merging of segments can
 effectively do the same thing as an optimize step, which rewrites
 the entire index to new segments before deleting the old
 segments. So far you haven't hit that situation in steady-state,
 but you will.

 Simply put, I think you're wasting your time pursuing the tlog
 option. You must have bigger disks or smaller indexes such
 that there is at least as much free disk space at all times as
 your index occupies. In fact if the tlogs are on the same
 drive as your index, the tlog option you're pursuing is making
 the situation _worse_ by making running out of disk space
 during a merge even more likely.

 So unless there's a compelling reason you can't use bigger
 disks, IMO you'll waste lots and lots of valuable
 engineering time before... buying bigger disks.

 Best,
 Erick

 On Thu, Nov 19, 2015 at 6:21 AM, Brian Scholl  
 wrote:
> I have opted to modify the number and size of transaction logs that I 
> keep to resolve the original issue I described.  In so doing I think I 
> have created a new problem, feedback is appreciated.
>
> Here are the new updateLog settings:
>
>   
> ${solr.ulog.dir:}
>  name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}
> 1000
> 5760
>   
>
> First I want to make sure I understand what these settings do:
>   numRecordsToKeep: per transaction log file keep this number of 
> documents
>   maxNumLogsToKeep: retain this number of transaction log files total
>
> During my testing I thought I observed that a new tlog is created every 
> time auto-commit is triggered (every 15 seconds in my case) so I set 
> maxNumLogsToKeep high enough to conta

Re: Boost non stemmed keywords (KStem filter)

2015-11-19 Thread Walter Underwood
That is the approach I’ve been using for years. Simple and effective.

It probably makes the index bigger. Make sure that only one of the fields is 
stored, because the stored text will be exactly the same in both.
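
A schema.xml sketch of that setup (field and type names are illustrative; the
two field types would differ only in whether the stemming filter is present):

<field name="text_stem"    type="text_kstem"  indexed="true" stored="true"/>
<field name="text_no_stem" type="text_nostem" indexed="true" stored="false"/>
<copyField source="text_stem" dest="text_no_stem"/>

Then query with Ahmet's boosts, e.g. defType=edismax&qf=text_no_stem^0.6 text_stem^0.3.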

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 19, 2015, at 1:47 PM, Ahmet Arslan  wrote:
> 
> Hi,
> 
> I wonder about using two fields (text_stem and text_no_stem) and applying 
> query time boost
> text_stem^0.3 text_no_stem^0.6
> 
> What is the advantage of keyword repeat/paylad approach compared with this 
> one?
> 
> Ahmet
> 
> 
> On Thursday, November 19, 2015 10:24 PM, Markus Jelsma 
>  wrote:
> Hello Jan - i have no code i can show but we are using it to power our search 
> servers. You are correct, you need to deal with payloads at query time as 
> well. This means you need a custom similarity but also customize your query 
> parser to rewrite queries to payload supported types. This is also not very 
> hard, some ancient examples can still be found on the web. But you also need 
> to copy over existing TokenFilters to emit payloads whenever you want. 
> Overriding TokenFilters is usually impossible due to crazy private members (i 
> still cannot figure out why so many parts are private..)
> 
> It can be very powerful, especially if you do not use payloads to contain 
> just a score. But instead to carry a WORD_TYPE, such as stemmed, unstemmed 
> but also stopwords, acronyms, compound and subwords, headings or normal text 
> but also NER types (which we don't have yet). For this to work you just need 
> to treat the payload as a bitset for different types so you can have really 
> tuneable scoring at query time via your similarity. Unfortunately, payloads 
> can only carry a relative small amount of bits :)
> 
> M.
> 
> -Original message-
>> From:Jan Høydahl 
>> Sent: Thursday 19th November 2015 14:30
>> To: solr-user@lucene.apache.org
>> Subject: Re: Boost non stemmed keywords (KStem filter)
>> 
>> Do you have a concept code for this? Don’t you also have to hack your query 
>> parser, e.g. dismax, to use other Query objects supporting payloads?
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>>> 18. nov. 2015 kl. 22.24 skrev Markus Jelsma :
>>> 
>>> Hi - easiest approach is to use KeywordRepeatFilter and 
>>> RemoveDuplicatesTokenFilter. This creates a slightly higher IDF for 
>>> unstemmed words which might be just enough in your case. We found it not to 
>>> be enough, so we also attach payloads to signify stemmed words amongst 
>>> others. This allows you to decrease score for stemmed words at query time 
>>> via your similarity impl.
>>> 
>>> M.
>>> 
>>> 
>>> 
>>> -Original message-
 From:bbarani 
 Sent: Wednesday 18th November 2015 22:07
 To: solr-user@lucene.apache.org
 Subject: Boost non stemmed keywords (KStem filter)
 
 Hi,
 
 I am using KStem factory for stemming. This stemmer converts 'france to
 french', 'chinese to china' etc.. I am good with this stemming but I am
 trying to boost the results that contain the original term compared to the
 stemmed terms. Is this possible?
 
 Thanks,
 Learner
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Boost-non-stemmed-keywords-KStem-filter-tp4240880.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
>> 
>> 



Number of fields in qf & fq

2015-11-19 Thread Steven White
Hi everyone

What is considered too many fields for qf and fq?  On average I will have
1500 fields in qf and 100 in fq (all of which are OR'ed).  Assuming I can
(I have to check with the design) for qf, if I cut it down to 1 field, will
I see noticeable performance improvement?  It will take a lot of effort to
test this which is why I'm asking first.

As is, I'm seeing 2-5 sec response time for searches on an index of 1
million records with total index size (on disk) of 4 GB.  I gave Solr 2 GB
of RAM (also tested at 4 GB) in both cases Solr didn't use more then 1 GB.

Thanks in advance

Steve


Re: Number of fields in qf & fq

2015-11-19 Thread Walter Underwood
With one field in qf for a single-term query, Solr is fetching one posting 
list. With 1500 fields, it is fetching 1500 posting lists. It could easily be 
1500 times slower.

It might be even slower than that, because we can’t guarantee that: a) every 
algorithm in Solr is linear, b) that all those lists will fit in memory.
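
For illustration (handler and field names hypothetical, params shown unescaped
for readability), that is the difference between something like

  /select?defType=edismax&q=foo&qf=title

and

  /select?defType=edismax&q=foo&qf=field_0001 field_0002 ... field_1500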

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 19, 2015, at 3:46 PM, Steven White  wrote:
> 
> Hi everyone
> 
> What is considered too many fields for qf and fq?  On average I will have
> 1500 fields in qf and 100 in fq (all of which are OR'ed).  Assuming I can
> (I have to check with the design) for qf, if I cut it down to 1 field, will
> I see noticeable performance improvement?  It will take a lot of effort to
> test this which is why I'm asking first.
> 
> As is, I'm seeing 2-5 sec response time for searches on an index of 1
> million records with total index size (on disk) of 4 GB.  I gave Solr 2 GB
> of RAM (also tested at 4 GB) in both cases Solr didn't use more then 1 GB.
> 
> Thanks in advanced
> 
> Steve



Re: Number of fields in qf & fq

2015-11-19 Thread Steven White
Thanks Walter.  I see your point.  Does this apply to fq as well?

Also, how does one go about debugging performance issues in Solr to find
out where time is mostly spent?

Steve

On Thu, Nov 19, 2015 at 6:54 PM, Walter Underwood 
wrote:

> With one field in qf for a single-term query, Solr is fetching one posting
> list. With 1500 fields, it is fetching 1500 posting lists. It could easily
> be 1500 times slower.
>
> It might be even slower than that, because we can’t guarantee that: a)
> every algorithm in Solr is linear, b) that all those lists will fit in
> memory.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Nov 19, 2015, at 3:46 PM, Steven White  wrote:
> >
> > Hi everyone
> >
> > What is considered too many fields for qf and fq?  On average I will have
> > 1500 fields in qf and 100 in fq (all of which are OR'ed).  Assuming I can
> > (I have to check with the design) for qf, if I cut it down to 1 field,
> will
> > I see noticeable performance improvement?  It will take a lot of effort
> to
> > test this which is why I'm asking first.
> >
> > As is, I'm seeing 2-5 sec response time for searches on an index of 1
> > million records with total index size (on disk) of 4 GB.  I gave Solr 2
> GB
> > of RAM (also tested at 4 GB) in both cases Solr didn't use more then 1
> GB.
> >
> > Thanks in advanced
> >
> > Steve
>
>


Re: Number of fields in qf & fq

2015-11-19 Thread Walter Underwood
The implementation for fq has changed from 4.x to 5.x, so I’ll let someone else 
answer that in detail.

In 4.x, the result of each filter query can be cached. After that, they are 
quite fast.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 19, 2015, at 3:59 PM, Steven White  wrote:
> 
> Thanks Walter.  I see your point.  Does this apply to fq as will?
> 
> Also, how does one go about debugging performance issues in Solr to find
> out where time is mostly spent?
> 
> Steve
> 
> On Thu, Nov 19, 2015 at 6:54 PM, Walter Underwood 
> wrote:
> 
>> With one field in qf for a single-term query, Solr is fetching one posting
>> list. With 1500 fields, it is fetching 1500 posting lists. It could easily
>> be 1500 times slower.
>> 
>> It might be even slower than that, because we can’t guarantee that: a)
>> every algorithm in Solr is linear, b) that all those lists will fit in
>> memory.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Nov 19, 2015, at 3:46 PM, Steven White  wrote:
>>> 
>>> Hi everyone
>>> 
>>> What is considered too many fields for qf and fq?  On average I will have
>>> 1500 fields in qf and 100 in fq (all of which are OR'ed).  Assuming I can
>>> (I have to check with the design) for qf, if I cut it down to 1 field,
>> will
>>> I see noticeable performance improvement?  It will take a lot of effort
>> to
>>> test this which is why I'm asking first.
>>> 
>>> As is, I'm seeing 2-5 sec response time for searches on an index of 1
>>> million records with total index size (on disk) of 4 GB.  I gave Solr 2
>> GB
>>> of RAM (also tested at 4 GB) in both cases Solr didn't use more then 1
>> GB.
>>> 
>>> Thanks in advanced
>>> 
>>> Steve
>> 
>> 



Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread KNitin
Ah, got it. Another generic question: is there much of a difference
between generating index files in MapReduce and loading them into SolrCloud vs.
using the Solr NRT API? Has anyone run any tests of that sort?

Thanks a ton,
Nitin

On Thu, Nov 19, 2015 at 3:00 PM, Erick Erickson 
wrote:

> Sure, you can use Lucene to create indexes for shards
> if (and only if) you deal with the routing issues
>
> About updates: I'm not talking about atomic updates at all.
> The usual model for Solr is if you have a unique key
> defined, new versions of documents replace old versions
> of documents based on uniqueKey. That process is
> not guaranteed by MRIT is all.
>
> Best,
> Erick
>
> On Thu, Nov 19, 2015 at 12:56 PM, KNitin  wrote:
> > Thanks, Eric.  Looks like  MRIT uses Embedded solr running per
> > mapper/reducer and uses that to index documents. Is that the recommended
> > model? Can we use raw lucene libraries to generate index and then load
> them
> > into solrcloud? (Barring the complexities for indexing into right shard
> and
> > merging them).
> >
> > I am thinking of using this for regular offline indexing which needs to
> be
> > idempotent.  When you mean update do you mean partial updates using _set?
> > If we add and delete every time for a document that should work, right?
> > (since all docs are indexed by doc id which contains all operational
> > history)? Let me know if I am missing something.
> >
> > On Thu, Nov 19, 2015 at 12:09 PM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> Note two things:
> >>
> >> 1> this is running on Hadoop
> >> 2> it is part of the standard Solr release as MapReduceIndexerTool,
> >> look in the contribs...
> >>
> >> If you're trying to do this yourself, you must be very careful to index
> >> docs
> >> to the correct shard then merge the correct shards. MRIT does this all
> >> automatically.
> >>
> >> Additionally, it has the cool feature that if (and only if) your Solr
> >> index is running over
> >> HDFS, the --go-live option will automatically merge the indexes into
> >> the appropriate
> >> running Solr instances.
> >>
> >> One caveat. This tool doesn't handle _updating_ documents. So if you
> >> run it twice
> >> on the same data set, you'll have two copies of every doc. It's
> >> designed as a bulk
> >> initial-load tool.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >>
> >> On Thu, Nov 19, 2015 at 11:45 AM, KNitin  wrote:
> >> > Great. Thanks!
> >> >
> >> > On Thu, Nov 19, 2015 at 11:24 AM, Sameer Maggon <
> >> sam...@measuredsearch.com>
> >> > wrote:
> >> >
> >> >> If you are trying to create a large index and want speedups there,
> you
> >> >> could use the MapReduceTool -
> >> >> https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr.
> At
> >> a
> >> >> high level, it takes your files (csv, json, etc) as input can create
> >> either
> >> >> a single or a sharded index that you can either copy it to your Solr
> >> >> Servers. I've used this to create indexes that include hundreds of
> >> millions
> >> >> of documents in fairly decent amount of time.
> >> >>
> >> >> Thanks,
> >> >> --
> >> >> *Sameer Maggon*
> >> >> Measured Search
> >> >> www.measuredsearch.com 
> >> >>
> >> >> On Thu, Nov 19, 2015 at 11:17 AM, KNitin 
> wrote:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> >  I was wondering if there are existing tools that will generate
> solr
> >> >> index
> >> >> > offline (in solrcloud mode)  that can be later on loaded into
> >> solrcloud,
> >> >> > before I decide to implement my own. I found some tools that do
> only
> >> solr
> >> >> > based index loading (non-zk mode). Is there one with zk mode
> enabled?
> >> >> >
> >> >> >
> >> >> > Thanks in advance!
> >> >> > Nitin
> >> >> >
> >> >>
> >>
>


Re: Parallel SQL / calcite adapter

2015-11-19 Thread Joel Bernstein
It's an interesting question. The JDBC driver is still very basic. It would
depend on how much of the JDBC spec needs to be implemented to connect to
Calcite/Drill.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 19, 2015 at 3:28 AM, Kai Gülzau  wrote:

>
> We are currently evaluating calcite as a SQL facade for different Data
> Sources
>
> -  JDBC
>
> -  REST
>
> >SOLR
>
> -  ...
>
> I didn't found a "native" calcite adapter for solr (
> http://calcite.apache.org/docs/adapter.html).
>
> Is it a good idea to use the parallel sql feature (over jdbc) to connect
> calcite (or apache drill) to solr?
> Any suggestions?
>
>
> Thanks,
>
> Kai Gülzau
>


Re: replica recovery

2015-11-19 Thread Jeff Wartes

I completely agree with the other comments on this thread with regard to
needing more disk space asap, but I thought I’d add a few comments
regarding the specific questions here.

If your goal is to prevent full recovery requests, you only need to cover
the duration you expect a replica to be unavailable.

If your common issue is GC due to bad queries, you probably don’t need to
cover more than the number of docs you write in your typical full GC
pause. I suspect this is less than 10M.
If your common issue is the length of time it takes you to notice a server
crashed and restart it, you may need to cover something like 10 minutes
worth of docs. I suspect this is still less than 10M.

You certainly don’t need to keep an entire day’s transaction logs. If your
servers routinely go down for a whole day, solve that by fixing your
servers. :)
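
As a sketch (numbers purely illustrative, not tuned to your cluster), an
updateLog sized to cover roughly ten minutes of writes at a 15-second
autocommit interval would look something like:

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">100000</int>
    <int name="maxNumLogsToKeep">40</int>
  </updateLog>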

With respect to ulimit, if Solr is the only thing of significance on the
box, there’s no reason not to bump that up. I usually just set something
like 32k and stop thinking about it. I get hit by a low ulimit every now
and then, but I can’t recall ever having had an issue with it being too
high.
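
On most Linux systems that is a couple of lines in /etc/security/limits.conf
(assuming the Solr process runs as a user named "solr"):

  solr  soft  nofile  32768
  solr  hard  nofile  32768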



On 11/19/15, 6:21 AM, "Brian Scholl"  wrote:

>I have opted to modify the number and size of transaction logs that I
>keep to resolve the original issue I described.  In so doing I think I
>have created a new problem, feedback is appreciated.
>
>Here are the new updateLog settings:
>
><updateLog>
>  <str name="dir">${solr.ulog.dir:}</str>
>  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
>  <int name="numRecordsToKeep">1000</int>
>  <int name="maxNumLogsToKeep">5760</int>
></updateLog>
>
>First I want to make sure I understand what these settings do:
>   numRecordsToKeep: per transaction log file keep this number of documents
>   maxNumLogsToKeep: retain this number of transaction log files total
>
>During my testing I thought I observed that a new tlog is created every
>time auto-commit is triggered (every 15 seconds in my case) so I set
>maxNumLogsToKeep high enough to contain an entire days worth of updates.
> Knowing that I could potentially need to bulk load some data I set
>numRecordsToKeep higher than my max throughput per replica for 15 seconds.
>
>The problem that I think this has created is I am now running out of file
>descriptors on the servers.  After indexing new documents for a couple
>hours a some servers (not all) will start logging this error rapidly:
>
>73021439 WARN  (qtp1476011703-18-acceptor-0@6d5514d9-ServerConnector@6392e703{HTTP/1.1}{0.0.0.0:8983}) [   ] o.e.j.s.ServerConnector
>java.io.IOException: Too many open files
>   at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>   at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>   at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>   at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
>   at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500)
>   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
>
>The output of ulimit -n for the user running the solr process is 1024.  I
>am pretty sure I can prevent this error from occurring  by increasing the
>limit on each server but it isn't clear to me how high it should be or if
>raising the limit will cause new problems.
>
>Any advice you could provide in this situation would be awesome!
>
>Cheers,
>Brian
>
>
>
>> On Oct 27, 2015, at 20:50, Jeff Wartes  wrote:
>> 
>> 
>> On the face of it, your scenario seems plausible. I can offer two pieces
>> of info that may or may not help you:
>> 
>> 1. A write request to Solr will not be acknowledged until an attempt has
>> been made to write to all relevant replicas. So, B won’t ever be missing
>> updates that were applied to A, unless communication with B was
>>disrupted
>> somehow at the time of the update request. You can add a min_rf param to
>> your write request, in which case the response will tell you how many
>> replicas received the update, but it’s still up to your indexer client
>>to
>> decide what to do if that’s less than your replication factor.
>> 
>> See 
>> 
>>https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Faul
>>t+
>> Tolerance for more info.
>> 
>> 2. There are two forms of replication. The usual thing is for the leader
>> for each shard to write an update to all replicas before acknowledging
>>the
>> write itself, as above. If a replica is less than N docs behind the
>> leader, the leader can replay those docs to the replica from its
>> transaction log. If a replica is more than N docs behind though, it
>>falls
>> back to the replication handler recovery mode you mention, and attempts
>>to
>> re-sync the whole shard from the leader.
>> The default N for this is 100, which is pretty low for a
>>high-up

Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread Erick Erickson
Apples/Oranges question:

They're different beasts. The NRT stuff (spark-solr for example,
Cloudera's Flume sink as well, custom SolrJ clients, whatever) is
constrained by the number of Solr servers you have running, more
specifically the number of shards. When you're feeding docs fast
enough that you max out those CPUs, that's it; you're going flat out
and nothing you can do can drive indexing any faster.

With MRIT, you have the entire capacity of your Hadoop cluster at your
disposal. If you have 1,000 nodes you can be driving all of them as
fast as you can make them go, even if you only have 10 shards. Of
course in this case there'll be some copying time to deal with, but
you get the idea.

In terms of the end result, it's just a Lucene index; It doesn't
matter what process generates it.

Best,
Erick


On Thu, Nov 19, 2015 at 4:52 PM, KNitin  wrote:
> Ah got it. Another generic question, is there too much of a difference
> between generating files in map reduce and loading into solrcloud vs using
> solr NRT api? Has any one run any test of that sort?
>
> Thanks a ton,
> Nitin
>
> On Thu, Nov 19, 2015 at 3:00 PM, Erick Erickson 
> wrote:
>
>> Sure, you can use Lucene to create indexes for shards
>> if (and only if) you deal with the routing issues
>>
>> About updates: I'm not talking about atomic updates at all.
>> The usual model for Solr is if you have a unique key
>> defined, new versions of documents replace old versions
>> of documents based on uniqueKey. That process is
>> not guaranteed by MRIT is all.
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 19, 2015 at 12:56 PM, KNitin  wrote:
>> > Thanks, Eric.  Looks like  MRIT uses Embedded solr running per
>> > mapper/reducer and uses that to index documents. Is that the recommended
>> > model? Can we use raw lucene libraries to generate index and then load
>> them
>> > into solrcloud? (Barring the complexities for indexing into right shard
>> and
>> > merging them).
>> >
>> > I am thinking of using this for regular offline indexing which needs to
>> be
>> > idempotent.  When you mean update do you mean partial updates using _set?
>> > If we add and delete every time for a document that should work, right?
>> > (since all docs are indexed by doc id which contains all operational
>> > history)? Let me know if I am missing something.
>> >
>> > On Thu, Nov 19, 2015 at 12:09 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Note two things:
>> >>
>> >> 1> this is running on Hadoop
>> >> 2> it is part of the standard Solr release as MapReduceIndexerTool,
>> >> look in the contribs...
>> >>
>> >> If you're trying to do this yourself, you must be very careful to index
>> >> docs
>> >> to the correct shard then merge the correct shards. MRIT does this all
>> >> automatically.
>> >>
>> >> Additionally, it has the cool feature that if (and only if) your Solr
>> >> index is running over
>> >> HDFS, the --go-live option will automatically merge the indexes into
>> >> the appropriate
>> >> running Solr instances.
>> >>
>> >> One caveat. This tool doesn't handle _updating_ documents. So if you
>> >> run it twice
>> >> on the same data set, you'll have two copies of every doc. It's
>> >> designed as a bulk
>> >> initial-load tool.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >>
>> >>
>> >> On Thu, Nov 19, 2015 at 11:45 AM, KNitin  wrote:
>> >> > Great. Thanks!
>> >> >
>> >> > On Thu, Nov 19, 2015 at 11:24 AM, Sameer Maggon <
>> >> sam...@measuredsearch.com>
>> >> > wrote:
>> >> >
>> >> >> If you are trying to create a large index and want speedups there,
>> you
>> >> >> could use the MapReduceTool -
>> >> >> https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr.
>> At
>> >> a
>> >> >> high level, it takes your files (csv, json, etc) as input can create
>> >> either
>> >> >> a single or a sharded index that you can either copy it to your Solr
>> >> >> Servers. I've used this to create indexes that include hundreds of
>> >> millions
>> >> >> of documents in fairly decent amount of time.
>> >> >>
>> >> >> Thanks,
>> >> >> --
>> >> >> *Sameer Maggon*
>> >> >> Measured Search
>> >> >> www.measuredsearch.com 
>> >> >>
>> >> >> On Thu, Nov 19, 2015 at 11:17 AM, KNitin 
>> wrote:
>> >> >>
>> >> >> > Hi,
>> >> >> >
>> >> >> >  I was wondering if there are existing tools that will generate
>> solr
>> >> >> index
>> >> >> > offline (in solrcloud mode)  that can be later on loaded into
>> >> solrcloud,
>> >> >> > before I decide to implement my own. I found some tools that do
>> only
>> >> solr
>> >> >> > based index loading (non-zk mode). Is there one with zk mode
>> enabled?
>> >> >> >
>> >> >> >
>> >> >> > Thanks in advance!
>> >> >> > Nitin
>> >> >> >
>> >> >>
>> >>
>>


Re: Number of fields in qf & fq

2015-11-19 Thread Erick Erickson
An fq is still a single entry in your filterCache so from that
perspective it's the same.

And to create that entry, you're still using all the underlying fields
to search, so they have to be loaded just like they would be in a q
clause.
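
For illustration (field names hypothetical), a filter like

  fq=(field_0001:foo OR field_0002:foo OR ... OR field_0100:foo)

still walks the postings for every one of those fields the first time it is
evaluated; only the resulting document set is cached, so identical repeat
filters are cheap but the first evaluation is not.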

But really, the fundamental question here is why your design even has
1,500 fields and, more specifically, why you would want to search them
all at once. From a 10,000 ft. view, that's a very suspect design.

Best,
Erick

On Thu, Nov 19, 2015 at 4:06 PM, Walter Underwood  wrote:
> The implementation for fq has changed from 4.x to 5.x, so I’ll let someone 
> else answer that in detail.
>
> In 4.x, the result of each filter query can be cached. After that, they are 
> quite fast.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Nov 19, 2015, at 3:59 PM, Steven White  wrote:
>>
>> Thanks Walter.  I see your point.  Does this apply to fq as will?
>>
>> Also, how does one go about debugging performance issues in Solr to find
>> out where time is mostly spent?
>>
>> Steve
>>
>> On Thu, Nov 19, 2015 at 6:54 PM, Walter Underwood 
>> wrote:
>>
>>> With one field in qf for a single-term query, Solr is fetching one posting
>>> list. With 1500 fields, it is fetching 1500 posting lists. It could easily
>>> be 1500 times slower.
>>>
>>> It might be even slower than that, because we can’t guarantee that: a)
>>> every algorithm in Solr is linear, b) that all those lists will fit in
>>> memory.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>
 On Nov 19, 2015, at 3:46 PM, Steven White  wrote:

 Hi everyone

 What is considered too many fields for qf and fq?  On average I will have
 1500 fields in qf and 100 in fq (all of which are OR'ed).  Assuming I can
 (I have to check with the design) for qf, if I cut it down to 1 field,
>>> will
 I see noticeable performance improvement?  It will take a lot of effort
>>> to
 test this which is why I'm asking first.

 As is, I'm seeing 2-5 sec response time for searches on an index of 1
 million records with total index size (on disk) of 4 GB.  I gave Solr 2
>>> GB
 of RAM (also tested at 4 GB) in both cases Solr didn't use more then 1
>>> GB.

 Thanks in advanced

 Steve
>>>
>>>
>


Re: Security Problems

2015-11-19 Thread Byzen Ma
Thanks for the reply. The two smallest rules
1)
"name":"all-admin",
"collection": null,
"path":"/*",
"role":"somerole"

2) all core handlers
"name":"all-core-handlers",
"path":"/*",
"role":"somerole"

do work after I reset my security.json. But another odd thing happened.
After I accidentally set the rule as just:
"name":"admin-all",
"role":"admin"
it still works, and I even need permission to reach the Admin UI. Perfect for me, but
why? I am an inexperienced Solr user and can't figure it out, but I would like to know
why it works. And another error happens, as the thread "replica recovery" describes:
the RecoveryStrategy on the main server attempts to recover the replica again and
again, maybe forever. Now I can see red errors piling up in my Admin Logging UI!


-----Original Message-----
From: solr-user-return-118135-mabaizhang=126@lucene.apache.org
[mailto:solr-user-return-118135-mabaizhang=126@lucene.apache.org] On Behalf Of Noble Paul
Sent: November 20, 2015 1:40
To: solr-user@lucene.apache.org
Subject: Re: Security Problems

What is the smallest possible security.json required currently to protect all 
possible paths (except those served by Jetty)?



You would need 2 rules
1)
"name":"all-admin",
"collection": null,
"path":"/*"
"role":"somerole"

2) all core handlers
"name":"all-core-handlers",
"path":"/*"
"role":"somerole"


ideally we should have a simple permission name called "all" (which we don't 
have)

so that one rule should be enough

"name":"all",
"role":"somerole"

Open a ticket and we should fix it for 5.4.0. It should also include the admin
paths as well.
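
For reference, a sketch of how those two rules could sit in a complete
security.json (the BasicAuthPlugin section and the credentials value are
placeholders, not from this thread):

{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "credentials": { "solr": "<base64-hash> <base64-salt>" }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "permissions": [
      { "name": "all-admin", "collection": null, "path": "/*", "role": "somerole" },
      { "name": "all-core-handlers", "path": "/*", "role": "somerole" }
    ],
    "user-role": { "solr": "somerole" }
  }
}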


On Thu, Nov 19, 2015 at 6:02 PM, Jan Høydahl  wrote:
> Would it not be less surprising if ALL requests to Solr required 
> authentication once an AuthenticationPlugin was enabled?
> Then, if no AuthorizationPlugin was active, all authenticated users could do 
> anything.
> But if AuthorizationPlugin was configured, you could only do what your role 
> allows you to?
>
> As it is now it is super easy to forget a path, say you protect 
> /select but not /browse and /query, or someone creates a collection with some 
> new endpoints and forgets to update security.json - then that endpoint would 
> be wide open!
>
> What is the smallest possible security.json required currently to protect all 
> possible paths (except those served by Jetty)?
>
> --
> Jan Høydahl, search solution architect Cominvent AS - 
> www.cominvent.com
>
>> 18. nov. 2015 kl. 20.31 skrev Upayavira :
>>
>> I'm very happy for the admin UI to be served another way - i.e. not 
>> direct from Jetty, if that makes the task of securing it easier.
>>
>> Perhaps a request handler specifically for UI resources which would 
>> make it possible to secure it all in a more straight-forward way?
>>
>> Upayavira
>>
>> On Wed, Nov 18, 2015, at 01:54 PM, Noble Paul wrote:
>>> As of now the admin-ui calls are not protected. The static calls are 
>>> served by jetty and it bypasses the authentication mechanism 
>>> completely. If the admin UI relies on some API call which is served 
>>> by Solr.
>>> The other option is to revamp the framework to take care of admin UI 
>>> (static content) as well. This would be cleaner solution
>>>
>>>
>>> On Wed, Nov 18, 2015 at 2:32 PM, Upayavira  wrote:
 Not sure I quite understand.

 You're saying that the cost for the UI is not large, but then 
 suggesting we protect just one resource (/admin/security-check)?

 Why couldn't we create the permission called 'admin-ui' and protect 
 everything under /admin/ui/ for example? Along with the root HTML 
 link too.

 Upayavira

 On Wed, Nov 18, 2015, at 07:46 AM, Noble Paul wrote:
> The authentication plugin is not expensive if you are talking in 
> the context of admin UI. After all it is used not like 100s of 
> requests per second.
>
> The simplest solution would be
>
> provide a well known permission name called "admin-ui"
>
> ensure that every admin page load makes a call to some resource 
> say "/admin/security-check"
>
> Then we can just protect that .
>
> The only concern thatI have is the false sense of security it 
> would give to the user
>
> But, that is a different point altogether
>
> On Wed, Nov 11, 2015 at 1:52 AM, Upayavira  wrote:
>> Is the authentication plugin that expensive?
>>
>> I can help by minifying the UI down to a smaller number of 
>> CSS/JS/etc files :-)
>>
>> It may be overkill, but it would also give better experience. And 
>> isn't that what most applications do? Check authentication tokens 
>> on every request?
>>
>> Upayavira
>>
>> On Tue, Nov 10, 2015, at 07:33 PM, Anshum Gupta wrote:
>>> The reason why we bypass that is so that we don't hit the 
>>> authentication plugin for every request that comes in for static 
>>> content. I think we could call the authentication plugin for 
>>> that but that'd be an overkill. Better experience ? 

Re: solr indexing warning

2015-11-19 Thread Midas A
Thanks Emir,

So what do we need to do to resolve this issue?

This is my Solr configuration. What changes should I make to avoid the
warning?

~abhishek

On Thu, Nov 19, 2015 at 6:37 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> This means that one searcher is still warming when another searcher is created
> due to a commit with openSearcher=true. This can be due to frequent commits
> or searcher warmup taking too long.
>
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
> On 19.11.2015 12:16, Midas A wrote:
>
>> Getting following log on solr
>>
>>
>> PERFORMANCE WARNING: Overlapping onDeckSearchers=2`
>>
>>


Re: Security Problems

2015-11-19 Thread Byzen Ma
Apologies, I didn't read the thread "replica recovery" carefully. It may be a
different problem. But the thread "Implementing security.json is breaking
ADDREPLICA" describes the same issue as mine.

-----Original Message-----
From: solr-user-return-118173-mabaizhang=126@lucene.apache.org
[mailto:solr-user-return-118173-mabaizhang=126@lucene.apache.org] On Behalf Of Byzen Ma
Sent: November 20, 2015 13:26
To: solr-user@lucene.apache.org
Subject: Re: Security Problems

Thanks for the reply. The two smallest rules
1)
"name":"all-admin",
"collection": null,
"path":"/*",
"role":"somerole"

2) all core handlers
"name":"all-core-handlers",
"path":"/*",
"role":"somerole"

do work after I reset my security.json. But another odd thing happened.
After I accidentally set the rule as just:
"name":"admin-all",
"role":"admin"
it still works, and I even need permission to reach the Admin UI. Perfect for me, but
why? I am an inexperienced Solr user and can't figure it out, but I would like to know
why it works. And another error happens, as the thread "replica recovery" describes:
the RecoveryStrategy on the main server attempts to recover the replica again and
again, maybe forever. Now I can see red errors piling up in my Admin Logging UI!


-----Original Message-----
From: solr-user-return-118135-mabaizhang=126@lucene.apache.org
[mailto:solr-user-return-118135-mabaizhang=126@lucene.apache.org] On Behalf Of Noble Paul
Sent: November 20, 2015 1:40
To: solr-user@lucene.apache.org
Subject: Re: Security Problems

What is the smallest possible security.json required currently to protect all 
possible paths (except those served by Jetty)?



You would need 2 rules
1)
"name":"all-admin",
"collection": null,
"path":"/*"
"role":"somerole"

2) all core handlers
"name":"all-core-handlers",
"path":"/*"
"role":"somerole"


ideally we should have a simple permission name called "all" (which we don't 
have)

so that one rule should be enough

"name":"all",
"role":"somerole"

Open a ticket and we should fix it for 5.4.0. It should also include the admin
paths as well.


On Thu, Nov 19, 2015 at 6:02 PM, Jan Høydahl  wrote:
> Would it not be less surprising if ALL requests to Solr required 
> authentication once an AuthenticationPlugin was enabled?
> Then, if no AuthorizationPlugin was active, all authenticated users could do 
> anything.
> But if AuthorizationPlugin was configured, you could only do what your role 
> allows you to?
>
> As it is now it is super easy to forget a path, say you protect 
> /select but not /browse and /query, or someone creates a collection with some 
> new endpoints and forgets to update security.json - then that endpoint would 
> be wide open!
>
> What is the smallest possible security.json required currently to protect all 
> possible paths (except those served by Jetty)?
>
> --
> Jan Høydahl, search solution architect Cominvent AS - 
> www.cominvent.com
>
>> 18. nov. 2015 kl. 20.31 skrev Upayavira :
>>
>> I'm very happy for the admin UI to be served another way - i.e. not 
>> direct from Jetty, if that makes the task of securing it easier.
>>
>> Perhaps a request handler specifically for UI resources which would 
>> make it possible to secure it all in a more straight-forward way?
>>
>> Upayavira
>>
>> On Wed, Nov 18, 2015, at 01:54 PM, Noble Paul wrote:
>>> As of now the admin-ui calls are not protected. The static calls are 
>>> served by jetty and it bypasses the authentication mechanism 
>>> completely. If the admin UI relies on some API call which is served 
>>> by Solr.
>>> The other option is to revamp the framework to take care of admin UI 
>>> (static content) as well. This would be cleaner solution
>>>
>>>
>>> On Wed, Nov 18, 2015 at 2:32 PM, Upayavira  wrote:
 Not sure I quite understand.

 You're saying that the cost for the UI is not large, but then 
 suggesting we protect just one resource (/admin/security-check)?

 Why couldn't we create the permission called 'admin-ui' and protect 
 everything under /admin/ui/ for example? Along with the root HTML 
 link too.

 Upayavira

 On Wed, Nov 18, 2015, at 07:46 AM, Noble Paul wrote:
> The authentication plugin is not expensive if you are talking in 
> the context of admin UI. After all it is used not like 100s of 
> requests per second.
>
> The simplest solution would be
>
> provide a well known permission name called "admin-ui"
>
> ensure that every admin page load makes a call to some resource 
> say "/admin/security-check"
>
> Then we can just protect that .
>
> The only concern thatI have is the false sense of security it 
> would give to the user
>
> But, that is a different point altogether
>
> On Wed, Nov 11, 2015 at 1:52 AM, Upayavira  wrote:
>> Is the authentication plugin that expensive?
>>
>> I can help by minifying the UI down to a smaller number of 
>> CSS/JS/etc files :-)
>>
>> It may be overkill, but it would also give better experience. And 
>> isn't that what most applicati

Re: solr indexing warning

2015-11-19 Thread Shawn Heisey
On 11/19/2015 11:06 PM, Midas A wrote:
> <filterCache size="1000" initialSize="1000" autowarmCount="1000"/>
> <queryResultCache size="1000" initialSize="1000" autowarmCount="1000"/>
> <documentCache size="1000" initialSize="1000" autowarmCount="1000"/>

Your caches are quite large.  More importantly, your autowarmCount is
very large.  How many documents are in each of your cores?  If you check
the Plugins/Stats area in the admin UI for your core(s), how many
entries are actually in each of those three caches?  Also shown there is
the number of milliseconds that it took for each cache to warm.

The documentCache cannot be autowarmed, so that config is not doing
anything.

When a cache is autowarmed, what this does is look up the key for the
top N entries in the old cache, which contains the query used to
generate that cache entry, and executes each of those queries on the new
index to populate the new cache.

This means that up to 2000 queries are being executed every time you
commit and open a new searcher.  The actual number may be less, if the
filterCache and queryResultCache are not actually reaching 1000 entries
each.  Autowarming can take a significant amount of time when the
autowarmCount is high.  It should be lowered.
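
For example (sizes here are only illustrative, not a recommendation tuned to
your index), a more conservative configuration would be something like:

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="16"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="16"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512"/>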

Thanks,
Shawn



Re: solr indexing warning

2015-11-19 Thread Midas A
Thanks Shawn,

As we are using this server as a master server, there are no queries running on
it. In that case, should I remove these cache configurations from the config file?

Total Docs: 40 0

Stats
#

Document cache :
lookups:823
hits:4
hitratio:0.00
inserts:820
evictions:0
size:820
warmupTime:0
cumulative_lookups:24474
cumulative_hits:1746
cumulative_hitratio:0.07
cumulative_inserts:22728
cumulative_evictions:13345


fieldcache:
stats:
entries_count:2
entry#0:'SegmentCoreReader(​owner=_3bph(​4.2.1):C3918553)'=>'_version_',long,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_LONG_PARSER=>org.apache.lucene.search.FieldCacheImpl$LongsFromArray#1919958905
entry#1:'SegmentCoreReader(​owner=_3bph(​4.2.1):C3918553)'=>'_version_',class
org.apache.lucene.search.FieldCacheImpl$DocsWithFieldCache,null=>org.apache.lucene.util.Bits$MatchAllBits#660036513
insanity_count:0


fieldValuecache:
lookups:0
hits:0
hitratio:0.00
inserts:0
evictions:0
size:0
warmupTime:0
cumulative_lookups:0
cumulative_hits:0
cumulative_hitratio:0.00
cumulative_inserts:0
cumulative_evictions:0


filtercache:
lookups:0
hits:0
hitratio:0.00
inserts:0
evictions:0
size:0
warmupTime:0
cumulative_lookups:0
cumulative_hits:0
cumulative_hitratio:0.00
cumulative_inserts:0
cumulative_evictions:0


QueryResultCache:
lookups:3841
hits:0
hitratio:0.00
inserts:4841
evictions:3841
size:1000
warmupTime:213
cumulative_lookups:58438
cumulative_hits:153
cumulative_hitratio:0.00
cumulative_inserts:58285
cumulative_evictions:57285


Please suggest.



On Fri, Nov 20, 2015 at 12:15 PM, Shawn Heisey  wrote:

> On 11/19/2015 11:06 PM, Midas A wrote:
> > <filterCache size="1000" initialSize="1000" autowarmCount="1000"/>
> > <queryResultCache size="1000" initialSize="1000" autowarmCount="1000"/>
> > <documentCache size="1000" initialSize="1000" autowarmCount="1000"/>
>
> Your caches are quite large.  More importantly, your autowarmCount is
> very large.  How many documents are in each of your cores?  If you check
> the Plugins/Stats area in the admin UI for your core(s), how many
> entries are actually in each of those three caches?  Also shown there is
> the number of milliseconds that it took for each cache to warm.
>
> The documentCache cannot be autowarmed, so that config is not doing
> anything.
>
> When a cache is autowarmed, what this does is look up the key for the
> top N entries in the old cache, which contains the query used to
> generate that cache entry, and executes each of those queries on the new
> index to populate the new cache.
>
> This means that up to 2000 queries are being executed every time you
> commit and open a new searcher.  The actual number may be less, if the
> filterCache and queryResultCache are not actually reaching 1000 entries
> each.  Autowarming can take a significant amount of time when the
> autowarmCount is high.  It should be lowered.
>
> Thanks,
> Shawn
>
>